Transfer Learning for Node Regression Applied to Spreading Prediction

by   Sebastian Mežnar, et al.

Understanding how information propagates in real-life complex networks yields a better understanding of dynamic processes such as misinformation or epidemic spreading. The recently introduced branch of machine learning methods for learning node representations offers many novel applications, one of them being the task of spreading prediction addressed in this paper. We explore the utility of the state-of-the-art node representation learners when used to assess the effects of spreading from a given node, estimated via extensive simulations. Further, as many real-life networks are topologically similar, we systematically investigate whether the learned models generalize to previously unseen networks, showing that in some cases very good model transfer can be obtained. This work is one of the first to explore transferability of the learned representations for the task of node regression; we show there exist pairs of networks with similar structure between which the trained models can be transferred (zero-shot), and demonstrate their competitive performance. To our knowledge, this is one of the first attempts to evaluate the utility of zero-shot transfer for the task of node regression.




1 Introduction

The spreading of information and the spreading of disease are examples of common spreading phenomena. Modeling the spreading process and predicting its effects have many practical and potentially life-saving applications, including the creation of better strategies for stopping the spread of misinformation on social media or for stopping an epidemic. Further, companies can analyze spreading to create better strategies for marketing their products [Guille2013diffusion, nowzari2016analysis]. Spreading analysis is also suitable for studying, e.g., the spreading of fire, which has large practical value for insurance cost analysis [kacem2017small]. Spreading is commonly studied via extensive simulations [kesarev2018parallel], exploiting ideas from statistical mechanics to better understand both the extent of spreading and its speed [dong2018studies].

While offering high utility, reliable simulations of spreading processes can be expensive when performed on larger networks. This issue can be addressed by employing machine learning-based modeling techniques [wu2020comprehensive]. The contributions of this work are multi-fold and can be summarized as follows.

  1. We propose an efficient network node regression algorithm, named CaBoost, which achieves state-of-the-art performance for the task of spreading prediction against strong baselines such as graph neural networks.

  2. We demonstrate that machine learning-based spreading prediction can be utilized for fast screening of potentially problematic nodes, indicating that this branch of methods is complementary to the widely adopted simulation-based approaches.

  3. We investigate to what extent models learned on some network A are transferable to a network B, and which types of structural features preserve this property best. We demonstrate that transfer learning for node regression is possible, albeit only across topologically similar networks.

This work extends the paper ‘Prediction of the effects of epidemic spreading with graph neural networks’ [meznar2020spreading] from the Complex Networks 2020 conference by testing more approaches, using more simulation data, and using different centralities. Additionally, this work tests if centrality-based features can be used for zero-shot transfer learning.

The remainder of this work is structured as follows. Section 2 presents the related work which led to the proposed approach. In Section 3 we present the proposed methodology, where we re-formulate the task, present centrality data used to create features for our learners and show how we approached transfer learning. In Section 4 we present the datasets, the experimental setting, interpretation of the predictions, and the results of the empirical evaluation and transfer learning. We conclude the paper with the discussion in Section 5.

2 Related work

This section presents the relevant related work. It starts by discussing the notion of contagion processes, followed by an overview of graph neural networks and transfer learning.

2.1 Analysis of spreading processes

The study of spreading processes on networks is a lively research endeavor [nowzari2016analysis]. Broadly, spreading processes can be split into two main branches, namely, the simulation of epidemics and opinion dynamics. The epidemic spreading models can be classical or network-based. The classical models are, for example, systems of differential equations that do not account for a given network’s topology. Throughout this work, we are interested in extensions of such models to real-life network settings.

One of the most popular spreading models extended to networks is the Susceptible - Infected - Recovered (SIR) model [Kermack1927Epidemics]. The spread of the pandemic in the SIR model is dictated by the parameter $\beta$, known as the infection rate, and the parameter $\gamma$, known as the recovery rate. Nodes in this model can have one of three states (Susceptible, Infected, Recovered). SIR assumes that if a susceptible node comes into contact with an infected one during a generic iteration, it becomes infected with probability $\beta$. In each iteration after getting infected, a node can recover with probability $\gamma$ (only transitions from S to I and from I to R are allowed).
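The iteration rule described above can be sketched in plain Python. This is a minimal illustration and not the NDlib implementation used in this work; the function name `simulate_sir`, the adjacency-dictionary input, and the parameter values are assumptions made for the example. It follows one common reading of the rule, in which each infected node independently infects each susceptible neighbor with the infection probability in every iteration.

```python
import random

def simulate_sir(adj, patient_zero, beta, gamma, rng, max_iter=10_000):
    """Run one discrete-time SIR simulation; return the infected count per iteration."""
    status = {v: "S" for v in adj}
    status[patient_zero] = "I"
    infected_counts = [1]
    for _ in range(max_iter):
        infected = [v for v, s in status.items() if s == "I"]
        if not infected:
            break
        new_status = dict(status)  # synchronous update of all nodes
        for u in infected:
            for v in adj[u]:
                # S -> I with probability beta for each infected contact
                if status[v] == "S" and rng.random() < beta:
                    new_status[v] = "I"
            # I -> R with probability gamma
            if rng.random() < gamma:
                new_status[u] = "R"
        status = new_status
        infected_counts.append(sum(1 for s in status.values() if s == "I"))
    return infected_counts

# Hypothetical toy network and parameter values
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
counts = simulate_sir(adj, patient_zero=0, beta=0.3, gamma=0.1, rng=random.Random(0))
peak = max(counts)                 # maximum number of simultaneously infected nodes
time_to_peak = counts.index(peak)  # iteration at which the peak is first reached
```

The returned curve is enough to read off the peak size and the time at which it occurs for a single run.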

Other related models include, for example, SEIR, SEIS, and SWIR (where S stands for Susceptible, I for Infected, R for Recovered, E for Exposed, and W for Weakened). Further, one can also study the role of cascades [Kempe2003cascades] or the Threshold model [Granovetter1979threshold].

2.2 Machine learning on networks

Learning by propagating information throughout a given network has already been considered by approaches such as label propagation [zhu2002learning]. However, in recent years, approaches that jointly exploit both the adjacency structure of a given network and the features assigned to its nodes are becoming prevalent in the network learning community. The so-called graph neural networks have re-surfaced with the introduction of the Graph Convolutional Networks (GCN) [kipf2016semi]; an idea where the normalized adjacency matrix is directly multiplied with the feature space and effectively represents a neural network layer. Multiple such layers can be stacked to obtain better approximation power/performance. One of the most recent methods from this branch is the Graph Attention Network [velickovic2018graph], an extension of GCNs with the notion of neural attention. Here, part of the neural network focuses on particular parts of the adjacency space, offering more robust and potentially better performance.

Albeit being in widespread use, graph neural networks are not necessarily the most suitable choice when learning solely from the network adjacency structure. For such tasks, methods such as node2vec [grover2016node2vec], SGE [sge], SNoRe [meznar2020snore], and DeepWalk [Perozzi2014deepwalk] were developed. This branch of methods corresponds to what we refer to as structural representation learning. In our work, we focus mostly on learning this type of representations using network centrality information.

Note that although graph neural networks are becoming the prevailing methodology for learning from feature-rich complex networks, it is not clear whether they perform competitively to the more established structural methods if the feature space is derived solely from a given network’s structure.

2.3 Transfer learning

Expensive simulations are the main bottleneck of spreading effect prediction. While the number of simulations can be reduced by using machine learning, computation of a large fraction of them might still be infeasible on larger networks. One of the solutions for this problem is transfer learning. Transfer learning can be performed in two ways, by fine-tuning a pretrained model (few-shot learning) or by using a model trained on a related problem (zero-shot learning). In our work, we focus on zero-shot learning.

In recent years, new approaches were proposed for transfer learning on networks. These mainly fine-tune pretrained graph neural networks to solve the proposed problem. An example of this is the prediction of optoelectronic properties of conjugated oligomers [lee2021transfer], where the graph neural network is trained on short oligomers and then fine-tuned using 100 long ones. The results showed that the fine-tuned model performed much better and needed only a small sample of extra data to improve performance by 37%. Another approach tries to predict traffic congestion of a network with a small amount of historical data by training a recurrent neural network on a traffic network with a lot of historical data [Mallick2020TransferLW]. Zero-shot learning is less popular in the network setting, but the few approaches that exist closely follow the paradigms of zero-shot learning [wang2019zerosurvey]. One such example [Kampffmeyer_2019_CVPR] proposes a Dense Graph Propagation module that adds direct links between distant nodes in the knowledge networks to exploit their hierarchical structure.

When transferring knowledge between different networks, one must carefully craft features that are independent of a single network and represent global node characteristics. Because of this, embedding methods such as node2vec [grover2016node2vec] and SNoRe [meznar2020snore] cannot be used since they learn node representation through node indices encountered during random walks. On the other hand, transfer learning with graph neural networks is an active field of research with most approaches fine-tuning the pretrained models.

3 Proposed methodology

After introducing the task of spreading prediction, a brief methodology overview is presented. This is followed by presenting the creation of target variables, extraction of node features from the network, training of machine learning models, and the transfer of models between the networks.

3.1 Task formulation

In the case of a pandemic, the intensity of disease spreading can be summarized in three values: the maximum number of infected people (the peak), the time it takes to reach the maximum number of infected people, and the total number of infected people. Let us argue why these three values are important. Knowing the maximum number of infected people during a pandemic helps us to better prepare for the crisis, as it provides a good estimate of how many resources (e.g., hospital beds) will need to be allocated to patients. In another example domain, companies might want to create marketing campaigns on platforms such as Twitter and target specific users to reach the number of retweets needed to become trending. The time needed to reach the peak is important, e.g., for estimating how much time is left to develop a cure for the disease or to stop the spread of misinformation on social media. Finally, the total number of infected nodes during an epidemic is important for assessing the damage or, in another example, for estimating how many computers were infected by some malware.

In our work, we focus on predicting the maximum number of infected nodes and the time needed to reach this number. We create target data by simulating epidemics from each node with the SIR diffusion model and identifying the maximum number of infected nodes, as well as the time at which this maximum is reached. We aggregate the generated target data by taking the mean values for each node. Finally, we normalize the data in the following way. We divide the maximum number of infected nodes by the number of nodes in the observed network. This normalization is suitable since the maximum number of infected nodes cannot exceed the number of nodes in the network. No upper bound exists for the time at which the maximum number of nodes are infected. Because of this, we divide this time by the maximum from the observed data. In practice we might not have this maximum, so a suitable number must be chosen. Furthermore, in practice, we create the target data as described above and save the number by which we divide. After prediction, we multiply the prediction by this number to find the ‘real-world’ equivalent.
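The aggregation and normalization steps can be illustrated with a short sketch. The helper names `node_targets` and `normalize_targets`, the dictionary layout, and the toy infection curves are assumptions made for the example, not part of the original pipeline.

```python
from statistics import mean

def node_targets(curves):
    """Average peak size and time-to-peak over several simulations of one node."""
    peaks = [max(c) for c in curves]
    times = [c.index(max(c)) for c in curves]
    return mean(peaks), mean(times)

def normalize_targets(raw, n_nodes, time_scale=None):
    """Scale peaks by the network size and times by a stored divisor."""
    if time_scale is None:
        # Default: the largest observed time (falls back to 1 if all times are 0)
        time_scale = max(t for _, t in raw.values()) or 1
    scaled = {v: (p / n_nodes, t / time_scale) for v, (p, t) in raw.items()}
    return scaled, time_scale

# Hypothetical infected-count curves: two simulations per starting node
curves = {"a": [[1, 2, 3, 1, 0], [1, 1, 0]], "b": [[1, 0], [1, 2, 0]]}
raw = {v: node_targets(c) for v, c in curves.items()}
scaled, time_scale = normalize_targets(raw, n_nodes=10)

# A normalized time prediction is mapped back by multiplying with the saved divisor
predicted_time = scaled["b"][1] * time_scale
```

Returning the divisor together with the scaled targets keeps the value needed to convert predictions back to iteration counts.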

3.2 Methodology overview

The overview of the proposed methodology is presented in Figure 1. The methodology is composed of three main steps: input data creation, training the model, and transfer learning.

Figure 1: Overview of the proposed methodology.

In the first step, we create target variables and node features from the starting network. Target variables are created by running simulations on the starting network followed by their summarization and processing. While target creation can be performed in the same way on each network, feature creation depends on the learner. Embeddings and feature extraction methods generate features stored in the tabular format, which is then used for model training. Graph neural networks are on the other hand end-to-end, meaning they learn features and regression at the same time.

In the second step, we train a regression model using the extracted features and the target variables we created. We use this model to predict the target variables for the unknown nodes.

The model learned in the second step is used in the last step to predict target variables of nodes from a new (unknown) network. For such predictions, we first extract features from the new network and then use them on the pretrained model. These predictions transfer knowledge from the first network to the nodes of the second one.

3.3 Training data creation

The first part of the methodology addresses input data generation. Intuitively, the first step simulates epidemic spreading from individual nodes of a given network to assess both the time required to reach the maximum number of infected nodes, as well as the maximum number itself. In this work, we leverage the widely used SIR model [Kermack1927Epidemics] to simulate epidemics, formulated as follows.

$$\frac{dS}{dt} = -\beta S I, \qquad \frac{dI}{dt} = \beta S I - \gamma I, \qquad \frac{dR}{dt} = \gamma I,$$

where S represents the number of susceptible, R the number of recovered and I the number of infected individuals. Spreading is governed by the input parameters $\beta$ (infection rate) and $\gamma$ (recovery rate). The SIR model is selected due to many existing and optimized implementations that are adapted from systems of differential equations to networks [Guille2013diffusion]. We use NDlib [ndlib] to simulate epidemics based on the SIR diffusion model.
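For intuition, the classical (non-network) SIR dynamics can be integrated numerically. The sketch below applies forward Euler steps to the standard form dS/dt = -βSI, dI/dt = βSI - γI, dR/dt = γI, with S, I, and R expressed as population fractions. The function name `sir_euler`, the step size, and the parameter values are assumptions chosen for illustration only.

```python
def sir_euler(s0, i0, r0, beta, gamma, dt=0.01, steps=10_000):
    """Integrate the SIR equations with forward Euler; return the (S, I, R) trajectory."""
    s, i, r = s0, i0, r0
    trajectory = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt  # mass moving from S to I
        new_recoveries = gamma * i * dt     # mass moving from I to R
        s, i, r = s - new_infections, i + new_infections - new_recoveries, r + new_recoveries
        trajectory.append((s, i, r))
    return trajectory

# Hypothetical parameters: 1% initially infected
trajectory = sir_euler(0.99, 0.01, 0.0, beta=0.5, gamma=0.1)
peak_fraction = max(i for _, i, _ in trajectory)
```

Unlike the network-based simulation, this version ignores topology, so every starting individual produces the same curve.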

Target data creation results in two real values for each node, and we attempt to predict these two values. The rationale for constructing such predictive models is that they are potentially much faster than simulating multiple spreading processes for each node (the simulations are the bottleneck) and can give more insight into why some nodes are more ‘dangerous’.

The predictive task can be formulated as follows. Let $G = (V, E)$ represent the considered network. We are interested in finding the mapping $f: V \rightarrow \mathbb{R}$ from the set of nodes $V$ to the set of real values that represent, e.g., the maximum number of infected individuals if the spreading process is started from a given node $u \in V$. Thus, $f$ corresponds to node regression.

3.4 Learning on the same network: Prediction with simulation data

The models we considered can broadly be split into two main categories: graph neural networks and propositional learners. The main difference between the two is that the graph neural network learners, such as GAT [velickovic2018graph] and GIN [xu2018powerful] simultaneously exploit both the structure of a network, as well as node features, whilst the propositional learners take as input only the constructed feature space (and not the adjacency matrix). As an example, the feature space is passed throughout the GIN’s structure via the update rule that can be stated as:

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\Big( (1 + \epsilon) \cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \Big),$$

where MLP corresponds to a multilayer perceptron, $\epsilon$ a hyperparameter, $h_v^{(k)}$ the node $v$'s representation at layer $k$, and $\mathcal{N}(v)$ the $v$-th node's neighbors. We test both graph neural networks and propositional learners because, to our knowledge, it is not clear whether direct incorporation of the adjacency matrix offers superior performance, and the graph neural network models are computationally more expensive. The summary of considered learners is presented in Table 1.
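As an illustration of the GIN-style update, the sketch below scales each node's own representation by (1 + ε), adds the sum of its neighbors' representations, and passes the result through a small two-layer perceptron. It is written in plain Python rather than PyTorch Geometric; the function names, the identity weights, and the toy graph are assumptions made for the example.

```python
def relu(x):
    return [max(0.0, a) for a in x]

def linear(x, weights, bias):
    """Dense layer: weights holds one row per output unit."""
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, bias)]

def gin_layer(h, adj, eps, w1, b1, w2, b2):
    """One GIN update: MLP((1 + eps) * h_v + sum of neighbor representations)."""
    out = {}
    for v, h_v in h.items():
        agg = [(1.0 + eps) * a for a in h_v]
        for u in adj[v]:
            agg = [a + b for a, b in zip(agg, h[u])]
        out[v] = linear(relu(linear(agg, w1, b1)), w2, b2)
    return out

# Hypothetical toy input: three nodes with two features each
h = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [1.0, 1.0]}
adj = {0: [1, 2], 1: [0], 2: [0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
h_next = gin_layer(h, adj, eps=0.5, w1=identity, b1=zeros, w2=identity, b2=zeros)
```

With identity weights the output equals the aggregated vector, which makes the aggregation step easy to check by hand.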

| Learner | Method description |
| --- | --- |
| GAT | Graph Attention Networks |
| GIN | Graph Isomorphism Networks |
| node2vec + XGBoost | node2vec-based features + XGBoost |
| node2vec + features + XGBoost | node2vec-based features + centrality based features + XGBoost |
| SNoRe + XGBoost | SNoRe-based features + XGBoost |
| SNoRe + features + XGBoost | SNoRe-based features + centrality based features + XGBoost |
| CaBoost | XGBoost trained solely on centrality based features |

Table 1: Summary of the considered learners with descriptions.

As the considered complex networks do not possess node attributes, we next discuss which features, derived solely from the network structure, were used in the considered state-of-the-art implementations of GAT [velickovic2018graph] and GIN [xu2018powerful], or concatenated to an embedding generated using node2vec [grover2016node2vec] or SNoRe [meznar2020snore] for use in XGBoost. Further, we also test models where only the constructed structural features are considered, as well as a standalone method capable of learning node representations, combined with the XGBoost [xgboost] model. In this work, we explore whether centrality-based descriptions of nodes are suitable for the considered learning task. The rationale for selecting such features is that they are potentially fast to compute and capture the global relation of a given node w.r.t. the remaining part of the network. The centralities, computed for each node, are summarized in Table 2. These centralities are then normalized and concatenated to create features used with some learners. In Section 4.3 we refer to the XGBoost model trained with these features as CaBoost, which is one of the contributions of this work.

| Centrality | Description |
| --- | --- |
| Degree centrality [rodrigues2019network] | The number of edges of a given node. |
| Eigenvector centrality [rodrigues2019network] | Importance of the node, where nodes are more important if they are connected to other important nodes. This can be calculated using the eigenvectors of the adjacency matrix. |
| PageRank [Page1999ThePC] | Probability that a random walker is at a given node. |
| Average Out-degree | The average out-degree of nodes encountered during random walks of a given mean length. |
| Number of second neighbors | Number of nodes that are neighbors to neighbors of a given node. This number is bounded above by the number of nodes in the network. |

Table 2: Summary of the centralities considered in our work.
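Some of the centralities in Table 2 can be sketched in plain Python as follows. The function names, the min-max normalization, the PageRank damping factor of 0.85, and the choice to exclude the node itself from its second neighbors are assumptions made for the example; the exact definitions and normalization used in the experiments may differ.

```python
def degree(adj):
    return {v: len(adj[v]) for v in adj}

def second_neighbors(adj):
    """Count distinct neighbors of neighbors, excluding the node itself (an assumption)."""
    result = {}
    for v in adj:
        reach = set()
        for u in adj[v]:
            reach.update(adj[u])
        reach.discard(v)
        result[v] = len(reach)
    return result

def pagerank(adj, damping=0.85, iters=100):
    """Power iteration for PageRank on an undirected graph without isolated nodes."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        rank = {
            v: (1 - damping) / n + damping * sum(rank[u] / len(adj[u]) for u in adj[v])
            for v in adj
        }
    return rank

def min_max(values):
    """Rescale values to [0, 1]; constant inputs map to 0."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0
    return {v: (x - lo) / span for v, x in values.items()}

# Hypothetical path graph 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
columns = [min_max(degree(adj)), min_max(second_neighbors(adj)), min_max(pagerank(adj))]
features = {v: [col[v] for col in columns] for v in adj}  # one feature row per node
```

Because every column depends only on a node's structural role and not on its index, the same feature extraction can be applied unchanged to a different network.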

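During model training, we minimize the mean squared error (MSE) between the predictions $\hat{y}$ and the observed states $y$. With $\hat{y}_i$ and $y_i$ denoting the values for the $i$-th of $n$ nodes, it is defined as follows:

$$\mathrm{MSE}(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2.$$

In Section 4.3 we use the root mean squared error (RMSE), defined as follows:

$$\mathrm{RMSE}(\hat{y}, y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2},$$

to present the results instead of the mean squared error because it is easier to interpret: it corresponds to the average difference, in percentage of nodes, between the prediction and the real value.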
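Both error measures are straightforward to compute. The snippet below is a small sketch using hypothetical normalized predictions; the function names and values are assumptions made for illustration.

```python
import math

def mse(pred, obs):
    """Mean squared error between predictions and observed values."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)

def rmse(pred, obs):
    """Root mean squared error, expressed in the same units as the targets."""
    return math.sqrt(mse(pred, obs))

# Hypothetical normalized peak sizes for three nodes
pred = [0.10, 0.20, 0.30]
obs = [0.15, 0.20, 0.25]
error = rmse(pred, obs)
```

Since the targets are normalized by the network size, an RMSE of 0.05 can be read as an error of about 5% of the nodes.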

3.5 Transfer learning from other networks

In Section 3.4 we use centrality data to create the features used for model training. Since these features represent global characteristics of nodes and not the specific relations between them (as for example in node2vec or SNoRe), they have the advantage of being transferable between different networks. This gives us the ability to train a model on one network and use it for prediction on another network.

In this work, we use the approach outlined in the following paragraphs to train and test a regression model for transfer learning. We will use the term training network to highlight the network used for training the model, and test network as the network composed of nodes used in prediction of spreading effects.

First we create simulations with nodes from the training network as patients zero, and create target variables as shown in Sections 3.1 and 3.3. After this, we create centrality based features and use them to train the CaBoost model from Section 3.4. To predict target variables of the test network, we generate its centrality based features and use them with the previously trained model.

In Section 4.4 we use the following methodology to benchmark the performance of transfer learning models. First we create target variables and centrality based features of all networks. Then we normalize the features and use all instances to train one CaBoost model for each network. Transfer learning scores are then calculated for each model and each (different) network as the RMSE between the predicted values and target variables. We use five-fold cross-validation as the baseline score for each network.

We represent the performance of transfer learning as a heatmap. The columns of the heatmap represent the test networks, while the rows represent the training network used to create the model. The values on the diagonal represent the RMSE values of the five-fold cross-validation. The other values represent the transfer learning score divided by the baseline score. The values can be interpreted as the decrease in performance if we use a model trained on another network instead of five-fold cross-validation.
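The heatmap values can be derived from a matrix of raw RMSE scores, as in the sketch below. It assumes that each off-diagonal entry is divided by the cross-validation baseline of the test network (the column), which is one reading of the description above; the function name and the scores are hypothetical.

```python
def transfer_heatmap(scores):
    """scores[train][test] holds RMSE; scores[n][n] is the cross-validation baseline."""
    heatmap = {}
    for train, row in scores.items():
        heatmap[train] = {}
        for test, value in row.items():
            if train == test:
                heatmap[train][test] = value  # diagonal keeps the baseline RMSE
            else:
                # relative decrease in performance versus the test network's baseline
                heatmap[train][test] = value / scores[test][test]
    return heatmap

# Hypothetical RMSE scores for two networks
scores = {"A": {"A": 0.05, "B": 0.09}, "B": {"A": 0.06, "B": 0.06}}
heatmap = transfer_heatmap(scores)
```

A ratio close to 1 indicates that the transferred model performs about as well as a model trained on the test network itself.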

4 Empirical evaluation

In this section, we present the baselines and datasets used for evaluation and show the empirical results of the approaches outlined in Section 3. We also present how predictions from SNoRe+features model can be explained with SHAP [lundberg2017shap].

4.1 Baselines for regression

We compared the results of the proposed method to the following five baselines:

Random baseline

creates a fixed-size embedding of random numbers for each node. We use this embedding as the input data for the XGBoost model.

node2vec [grover2016node2vec]

learns a low dimensional representation of nodes that maximizes the likelihood of neighborhood preservation using random walks. During testing, we use the default parameters.

SNoRe [meznar2020snore]

learns an interpretable representation of nodes based on the similarity between their neighborhoods. These neighborhoods are created with short random walks. During testing, we use the default parameters.

GAT [velickovic2018graph]

includes the attention mechanism that helps learn the importance of neighboring nodes. In our tests, we use the implementation from PyTorch Geometric [Fey2019ptg].

GIN [xu2018powerful]

learns a representation that can provably achieve the maximum discriminative power. In our tests, we use the implementation from PyTorch Geometric [Fey2019ptg].

For comparison we also add the averaged simulation error. We calculate this error with the RMSE formula, where the error term of each node is the mean absolute difference between its simulation results and their mean value. This baseline corresponds to the situation where only a single simulation would be used to approximate the expected value of multiple ones (the goal of this work).
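One possible reading of this baseline is sketched below: for each node, the mean absolute deviation of its simulation outcomes from their mean is computed, and these per-node deviations are then combined with the RMSE formula. The function name, the data layout, and the values are assumptions made for the example.

```python
import math
from statistics import mean

def averaged_simulation_error(simulations):
    """simulations maps each node to its list of normalized simulation outcomes."""
    deviations = []
    for outcomes in simulations.values():
        center = mean(outcomes)
        # mean absolute difference between single simulations and their mean
        deviations.append(mean([abs(x - center) for x in outcomes]))
    # combine the per-node deviations as in the RMSE formula
    return math.sqrt(mean([d ** 2 for d in deviations]))

# Hypothetical normalized outcomes for two nodes
simulations = {"a": [0.1, 0.3], "b": [0.2, 0.2, 0.5]}
error = averaged_simulation_error(simulations)
```

A large value indicates that a single simulation is a poor stand-in for the mean over many simulations of the same starting node.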

4.2 Experimental setting

We used the following datasets for testing: Hamsterster [soc-hamsterster], Advogato [massa2009bowling], Wikipedia Vote [leskovec2010signed], FB Public Figures [rozemberczki2019gemsec], and HEP-PH [leskovec2007hepph], all taken from the Network Repository website [nr]. Some basic statistics of the networks we used can be seen in Table 3. Two of the networks used during testing are visualized in Figure 2. The network nodes in this figure are colored based on the values of the target variables.

| Name | Nodes | Edges | Components | Fraction of nodes in largest component |
| --- | --- | --- | --- | --- |
| Wikipedia Vote [leskovec2010signed] | 889 | 2914 | 1 | 1 |
| Hamsterster [soc-hamsterster] | 2426 | 16630 | 148 | 0.82 |
| Advogato [massa2009bowling] | 6551 | 43427 | 1441 | 0.77 |
| FB Public Figures [rozemberczki2019gemsec] | 11565 | 67114 | 1 | 1 |
| HEP-PH [leskovec2007hepph] | 12008 | 118521 | 278 | 0.93 |

Table 3: Basic statistics of the networks used for testing.

Figure 2: Visualization of Advogato (left) and Hamsterster (right) networks. The color represents the target value we get when spreading starts from a given node. Color on the Advogato dataset represents the maximum number of infected nodes, while on the Hamsterster dataset the time until the maximum number of infected nodes is reached is shown. Blue colors represent low values while red ones represent high ones. Since nodes with similar centrality values have similar characteristics, these nodes should be colored similarly.

We used the following approach to test the proposed method as well as the baselines mentioned in Section 4.1. We created the target data by simulating ten epidemics starting from each node of every dataset. We created each simulation using the SIR diffusion model from the NDlib [ndlib] Python library with fixed values of the infection rate and recovery rate parameters. We then created the target variables by identifying and aggregating the maximum number of infected nodes and the time when this happens. We used these variables to test the methods using five-fold cross-validation. We used XGBoost [xgboost] with default parameters as the regression model with the proposed features based on the mentioned centralities, the random baseline, SNoRe [meznar2020snore], SNoRe+centrality features, node2vec [grover2016node2vec], and node2vec+centrality features baselines. Baselines GIN and GAT were trained for 200 epochs using the Adam optimizer. Since GIN and GAT are primarily used for node classification, we changed the output layer to a ReLU [nair2010relu] layer, so they perform regression.

4.3 Results of models trained with simulation data

The results of the evaluation described in Section 4.2 are presented in Tables 4, 5, 6, and 7. Tables 4 and 5 show the results on all the nodes from the network, while Tables 6 and 7 show them only on the nodes from the network’s largest component. The results show that the learners significantly outperform the random baseline and the averaged simulation error, especially when predicting effects on networks with several components. Models CaBoost, node2vec+features, and SNoRe+features perform significantly better than others and all use centrality-based features to train the XGBoost model. These best approaches achieve RMSE scores around 0.05, which corresponds to an error of around 5% of nodes on average.

The results for the prediction of the maximum number of infected nodes on the whole network are shown in Table 4. The results show that the SNoRe+features model has the lowest RMSE on most networks, but that this is mostly because of the centrality based features, since all the learners that use them give similar results. We also see that graph neural networks perform poorly, on most networks only beating the random baseline. This might be because we use features extracted from the network and a small amount of training data. It is also worth mentioning that the three best performing models perform significantly better than the averaged simulation error and that the node embedding methods node2vec and SNoRe perform much worse when used without the centrality based features.

| Dataset | Advogato | Hamsterster | HEP-PH | FB Public Figures | Wikipedia Vote |
| --- | --- | --- | --- | --- | --- |
| CaBoost | 0.0519 (± 0.0045) | 0.0429 (± 0.0116) | 0.0481 (± 0.0017) | 0.0521 (± 0.0007) | 0.0600 (± 0.0020) |
| GAT | 0.1748 (± 0.0072) | 0.1534 (± 0.0024) | 0.1761 (± 0.0019) | 0.0594 (± 0.0010) | 0.0608 (± 0.0013) |
| GIN | 0.0646 (± 0.0238) | 0.0712 (± 0.0597) | 0.1753 (± 0.0806) | 0.0579 (± 0.0015) | 0.2076 (± 0.2531) |
| Random | 0.3156 (± 0.0024) | 0.2915 (± 0.0029) | 0.2107 (± 0.0012) | 0.0625 (± 0.0003) | 0.0732 (± 0.0039) |
| SNoRe | 0.1743 (± 0.0057) | 0.1591 (± 0.0053) | 0.1611 (± 0.0048) | 0.0600 (± 0.0003) | 0.0667 (± 0.0032) |
| SNoRe+features | 0.0515 (± 0.0044) | 0.0438 (± 0.0114) | 0.0467 (± 0.0018) | 0.0514 (± 0.0004) | 0.0597 (± 0.0009) |
| node2vec | 0.0673 (± 0.0054) | 0.0841 (± 0.0143) | 0.0835 (± 0.0031) | 0.0575 (± 0.0005) | 0.0690 (± 0.0021) |
| node2vec+features | 0.0574 (± 0.0037) | 0.0431 (± 0.0114) | 0.0494 (± 0.0031) | 0.0515 (± 0.0004) | 0.0590 (± 0.0010) |
| Simulation error | 0.0644 | 0.0576 | 0.0796 | 0.0982 | 0.1064 |

Table 4: Cross-validation results for maximum number of infected nodes on the whole network.

Table 5 shows the performance results for prediction of time needed to reach the maximum number of infected nodes on the whole network. We see that SNoRe+features model performs the best overall. This is probably due to features that represent both the similarity between neighborhoods of nodes and their global characteristics. The results also show that GIN and GAT are not suitable for such a task since they often perform much worse than some other learners (especially GAT) and in some cases worse than the simulation error.

| Dataset | Advogato | Hamsterster | HEP-PH | FB Public Figures | Wikipedia Vote |
| --- | --- | --- | --- | --- | --- |
| CaBoost | 0.0571 (± 0.0032) | 0.0540 (± 0.0037) | 0.0459 (± 0.0012) | 0.0448 (± 0.0005) | 0.0647 (± 0.0027) |
| GAT | 0.1411 (± 0.0021) | 0.1027 (± 0.0012) | 0.0971 (± 0.0006) | 0.0642 (± 0.0008) | 0.0760 (± 0.0021) |
| GIN | 0.0782 (± 0.0274) | 0.0766 (± 0.0162) | 0.0717 (± 0.0142) | 0.0497 (± 0.0035) | 0.0701 (± 0.0069) |
| Random | 0.2073 (± 0.0013) | 0.1209 (± 0.0014) | 0.1095 (± 0.0003) | 0.0817 (± 0.0004) | 0.0992 (± 0.0019) |
| SNoRe | 0.1463 (± 0.0033) | 0.1007 (± 0.0022) | 0.0904 (± 0.0006) | 0.0646 (± 0.0011) | 0.0753 (± 0.0073) |
| SNoRe+features | 0.0557 (± 0.0038) | 0.0545 (± 0.0014) | 0.0451 (± 0.0006) | 0.0434 (± 0.0004) | 0.0641 (± 0.0037) |
| node2vec | 0.0758 (± 0.0014) | 0.0824 (± 0.0039) | 0.0634 (± 0.0016) | 0.0590 (± 0.0009) | 0.0845 (± 0.0018) |
| node2vec+features | 0.0602 (± 0.0032) | 0.0600 (± 0.0035) | 0.0467 (± 0.0015) | 0.0440 (± 0.0004) | 0.0638 (± 0.0016) |
| Simulation error | 0.0840 | 0.0906 | 0.0839 | 0.0847 | 0.1178 |

Table 5: Cross-validation results for time when maximum number of infected nodes is reached on the whole network.

Similarly to Table 4, Table 6 shows the prediction scores for the maximum number of infected nodes on the largest component of the network. Results for the Wikipedia Vote and FB Public Figures networks are the same, since they have only one component. Contrary to scores on the whole network, scores on the biggest component show that node2vec+features performs the best overall. We also see that the random baseline performs much better on the single component than on the whole network. This is because the maximum number of infected nodes is usually smaller in smaller components, which makes the mean value of the target data on the whole network smaller and its variance higher. Because of the high variance of the target data, the random baseline predicts scores with higher error since the range of predictions is bigger.

| Dataset | Advogato | Hamsterster | HEP-PH | FB Public Figures | Wikipedia Vote |
| --- | --- | --- | --- | --- | --- |
| CaBoost | 0.0556 (± 0.0011) | 0.0437 (± 0.0017) | 0.0496 (± 0.0004) | 0.0521 (± 0.0007) | 0.0600 (± 0.0020) |
| GAT | 0.0668 (± 0.0088) | 0.0455 (± 0.0014) | 0.0536 (± 0.0014) | 0.0594 (± 0.0010) | 0.0608 (± 0.0013) |
| GIN | 0.0651 (± 0.0039) | 0.0566 (± 0.0057) | 0.1147 (± 0.0355) | 0.0579 (± 0.0015) | 0.2076 (± 0.2531) |
| Random | 0.0622 (± 0.0007) | 0.0513 (± 0.0014) | 0.0556 (± 0.0005) | 0.0625 (± 0.0003) | 0.0732 (± 0.0039) |
| SNoRe | 0.0588 (± 0.0005) | 0.0520 (± 0.0014) | 0.0551 (± 0.0003) | 0.0600 (± 0.0003) | 0.0667 (± 0.0032) |
| SNoRe+features | 0.0552 (± 0.0007) | 0.0447 (± 0.0008) | 0.0482 (± 0.0004) | 0.0514 (± 0.0004) | 0.0597 (± 0.0009) |
| node2vec | 0.0592 (± 0.0008) | 0.0504 (± 0.0012) | 0.0520 (± 0.0002) | 0.0575 (± 0.0005) | 0.0690 (± 0.0021) |
| node2vec+features | 0.0548 (± 0.0010) | 0.0437 (± 0.0016) | 0.0489 (± 0.0003) | 0.0515 (± 0.0004) | 0.0590 (± 0.0010) |
| Simulation error | 0.0975 | 0.0769 | 0.0883 | 0.0982 | 0.1064 |

Table 6: Cross-validation results for maximum number of infected nodes on the biggest component of the network.

Table 7 shows the prediction scores for the time needed to reach the maximum number of infected nodes on the biggest component of the network. As with the other results, CaBoost, node2vec+features, and SNoRe+features give the best performance on all datasets. Compared to the results in Table 5, the difference between the random baseline and the other learners is smaller, and in some cases the random baseline is only around 50% worse than the best-performing learner. Interestingly, the random baseline gives better results overall than the averaged simulation error. This is probably because spreading is highly stochastic and simulations can end before spreading begins. In such cases, the averaged simulation error increases significantly, while the random baseline is not affected much, since the model is trained on already processed target data. (Footnote 3: If we chose a random value as the prediction for the node, the result would be much worse.)

Dataset            Advogato           Hamsterster        HEP-PH             FB Public Figures  Wikipedia Vote
CaBoost            0.0529 (0.0018)    0.0442 (0.0020)    0.0436 (0.0006)    0.0448 (0.0005)    0.0647 (0.0027)
GAT                0.0790 (0.0063)    0.0883 (0.0031)    0.0643 (0.0008)    0.0642 (0.0008)    0.0760 (0.0021)
GIN                0.0614 (0.0016)    0.0536 (0.0058)    0.0538 (0.0118)    0.0497 (0.0035)    0.0701 (0.0069)
Random             0.0855 (0.0010)    0.0907 (0.0014)    0.0845 (0.0003)    0.0817 (0.0004)    0.0992 (0.0019)
SNoRe              0.0702 (0.0011)    0.0680 (0.0026)    0.0651 (0.0005)    0.0646 (0.0011)    0.0753 (0.0073)
SNoRe+features     0.0517 (0.0014)    0.0454 (0.0015)    0.0425 (0.0003)    0.0434 (0.0004)    0.0641 (0.0037)
node2vec           0.0708 (0.0010)    0.0702 (0.0018)    0.0557 (0.0006)    0.0590 (0.0009)    0.0845 (0.0018)
node2vec+features  0.0519 (0.0017)    0.0471 (0.0030)    0.0431 (0.0002)    0.0440 (0.0004)    0.0638 (0.0016)
Simulation error   0.0941             0.0789             0.0816             0.0847             0.1178
Table 7: Cross-validation results for time when maximum number of infected nodes is reached on the biggest component of the network.

We can see that, on all datasets, predictions made with the proposed learners give better results than a single simulation. This shows that such models are useful: they capture the aggregated outcome of spreading across multiple simulations, which is a better estimate than the result of a single random simulation run.
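The gap between a single simulation and an averaged target can be illustrated with a toy experiment. The sketch below is not the simulation setup used in the paper: the `simulate_once` function, the spreading probabilities, and the outbreak sizes are hypothetical stand-ins, chosen only so that some runs end before spreading begins. It compares the RMSE of one run per node against the per-node mean over many runs with the RMSE of a crude constant predictor.

```python
import math
import random


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def simulate_once(spread_prob, rng):
    """Hypothetical toy run: returns the fraction of infected nodes.

    With probability 1 - spread_prob the outbreak ends before it starts (0.0);
    otherwise it reaches a large fraction of the network.
    """
    if rng.random() > spread_prob:
        return 0.0
    return rng.uniform(0.6, 0.9)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
spread_probs = [rng.uniform(0.5, 0.95) for _ in range(200)]  # one value per node
runs = [[simulate_once(p, rng) for _ in range(100)] for p in spread_probs]

# Target value of a node: mean outcome over all of its simulation runs.
targets = [sum(node_runs) / len(node_runs) for node_runs in runs]

# Error of trusting a single run per node instead of the averaged target.
simulation_error = rmse([node_runs[0] for node_runs in runs], targets)

# Error of a crude predictor that outputs the global mean for every node.
global_mean = sum(targets) / len(targets)
constant_error = rmse([global_mean] * len(targets), targets)

print(f"single-run RMSE: {simulation_error:.4f}")
print(f"constant-predictor RMSE: {constant_error:.4f}")
```

In this toy setting the single-run error comes out larger than the error of the constant predictor, because runs that die out immediately land far from the averaged target.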

4.4 Results of transfer learning

In this section, we show the results of transfer learning between the presented networks. The results are presented as heatmaps. The values on the diagonal are the baseline RMSE of five-fold cross-validation, while each off-diagonal value is the RMSE on the dataset in the column obtained with the model trained on the dataset in the row. Each off-diagonal error is divided by the baseline score and thus shows how many times worse the transfer learning score is than the score obtained with five-fold cross-validation.
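The normalization described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `relative_rmse_heatmap`, the dataset labels, and the RMSE values are hypothetical, and the sketch assumes that each off-diagonal entry is divided by the cross-validation RMSE of the dataset in its column.

```python
def relative_rmse_heatmap(rmse, names):
    """Build heatmap values from raw RMSE scores.

    rmse[train][test] is the RMSE on dataset `test` of the model trained on
    dataset `train`; rmse[name][name] is the five-fold cross-validation RMSE.
    """
    heatmap = {}
    for train in names:
        heatmap[train] = {}
        for test in names:
            baseline = rmse[test][test]  # assumed: baseline of the column dataset
            if train == test:
                heatmap[train][test] = baseline  # diagonal keeps the raw baseline
            else:
                heatmap[train][test] = rmse[train][test] / baseline
    return heatmap


# Hypothetical scores for two networks, A and B (not taken from the paper).
names = ["A", "B"]
raw_rmse = {
    "A": {"A": 0.05, "B": 0.12},
    "B": {"A": 0.20, "B": 0.06},
}
heatmap = relative_rmse_heatmap(raw_rmse, names)
for train in names:
    print(train, [round(heatmap[train][test], 2) for test in names])
```

An off-diagonal value of 1 then means that the transferred model matches the cross-validation baseline, and a value below 1 means that it beats the baseline.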

The transfer learning results for the prediction of the maximum number of infected nodes are shown in the heatmap in Figure 3. Most errors are 1–3 times higher than the baseline. The two major exceptions are the result for the Advogato dataset with the FB Public Figures model and the result for the Wikipedia Vote network with the FB Public Figures model. The 5.4 times higher RMSE for the Advogato and FB Public Figures pair is probably caused by the big difference in the number of components between the two networks, since a large number of components lowers the maximum number of infected nodes. It is interesting to see that the FB Public Figures model works better than the baseline for the Wikipedia Vote network. This is probably because both networks have a similar structure, but Wikipedia Vote has fewer nodes and thus less training data. These results show that transfer learning between two topologically similar networks is possible without additional data and can yield good results.

Figure 3: Heatmap with transfer learning results for predictions of maximum number of infected nodes.

The heatmap in Figure 4 shows the transfer learning results for the prediction of the time needed to reach the maximum number of infected nodes. Overall, these results are better than those in Figure 3, although the Advogato dataset performs much worse with the other models. This is probably because Advogato has 1441 components, while the other networks have significantly fewer. We can also see that the FB Public Figures and Wikipedia Vote datasets give good predictions (less than 2 times worse than the baseline) with all the models, and in some cases the error is the same as the baseline.

Figure 4: Heatmap with transfer learning results for predictions of the time when the maximum number of infected nodes is reached.

The results of transfer learning can be better explained with the distribution plots of target values shown in Figure 5. The first row shows the distribution of the maximum number of infected nodes. The distributions for FB Public Figures and Wikipedia Vote are very similar. This is reflected in the results, where the Wikipedia Vote network performs better with the FB Public Figures model than with five-fold cross-validation. We also see that the Advogato and FB Public Figures networks have very different distributions. This matches the results, since the model transferred between them performs very poorly.

Figure 5: Distribution of target values for the maximum number of infected nodes and the time when this happens. The first row shows the target values for the number of infected nodes, while the second row shows the time when this happens. The x-axis represents the value of the target variable, while the y-axis represents the density at that value.

The second row shows the distribution of the time needed to reach the maximum number of infected nodes. As with the maximum number of infected nodes, these plots show that the target distribution is closely related to how well the model performs. The distribution of target values on the Advogato network differs vastly from the distributions on the other networks, which is reflected in the results, where the transferred models have a higher RMSE. Similarly, the distributions for the FB Public Figures, HEP-PH, and Wikipedia Vote datasets are similar, and their transfer learning results do not differ much from the five-fold cross-validation results.
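The visual comparison of target distributions can also be quantified. The paper compares the distributions by inspecting the plots in Figure 5; the sketch below swaps in the two-sample Kolmogorov-Smirnov statistic as one possible similarity measure, which is an assumption of this example and not something the paper reports. The function name `ks_statistic` and all sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute difference between the empirical cumulative
    distribution functions of the two samples: 0 for identical samples,
    1 for samples with non-overlapping ranges.
    """
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf_gap(x):
        return abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))

    # The ECDFs only change at sample points, so checking those is sufficient.
    return max(ecdf_gap(x) for x in a + b)


# Hypothetical target values (fraction of infected nodes) for three networks.
targets_net1 = [0.70, 0.72, 0.75, 0.78, 0.80, 0.81]
targets_net2 = [0.69, 0.73, 0.74, 0.79, 0.80, 0.82]
targets_net3 = [0.10, 0.15, 0.20, 0.22, 0.30, 0.35]

print("net1 vs net2:", ks_statistic(targets_net1, targets_net2))
print("net1 vs net3:", ks_statistic(targets_net1, targets_net3))
```

Under this assumption, a small statistic between two networks would suggest that a transferred model is more likely to work well; this is a heuristic reading of the figure, not a result established in the paper.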

4.5 Interpretation of a prediction

Figure 6: An example of a model explanation for an instance using the SNoRe+features model. Blue arrows indicate how much the prediction is lowered by some feature value, while the red ones indicate how much it is raised. The prediction starts at the model's expected value of 0.799 and finishes at 0.918. Features and their values are shown on the left. The visualization shows, for example, that the prediction rose from 0.788 to 0.918 because of the low value of eigenvector centrality.

We can explain predictions using model explanation approaches such as SHapley Additive exPlanations (SHAP) [lundberg2017shap, vstrumbelj2014explaining]. SHAP is a game-theoretic approach for explaining classification and regression models. The algorithm perturbs subsets of input features to take into account the interactions and redundancies between them. The explanation model can then be visualized, showing how the feature values of an instance impact a prediction.

An example of such an explanation is shown in Figure 6 using the SNoRe+features model. The prediction is impacted mostly by the eigenvector centrality, node 1696, the number of second neighbors, and the degree centrality. We can also see that a very small value of eigenvector centrality raises the prediction value and that low values of the number of second neighbors and the degree centrality lower it. This is expected because a low value of eigenvector centrality usually indicates that the node is not that ‘important’ and is in a neighborhood with many nodes. Similarly, it is expected that a low value of degree centrality and a low number of second neighbors lower the prediction, because having fewer nodes gives a smaller chance of infection. Lastly, the high similarity between the neighborhood of node 1696 and that of the instance we try to predict lowers the prediction.
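The attribution idea behind SHAP can be sketched by computing exact Shapley values for a toy model. This is not the SHAP library and not the SNoRe+features model: the linear `toy_model`, its weights, the feature order, and the single reference point used as the background are all hypothetical simplifications. Practical SHAP implementations typically average over a background dataset and rely on model-specific or sampling-based algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, background):
    """Exact Shapley values of `instance` relative to a `background` point.

    Features outside a coalition are set to their background value.
    The cost is exponential in the number of features, so this only
    works for very small models.
    """
    n = len(instance)

    def coalition_value(coalition):
        x = [instance[j] if j in coalition else background[j] for j in range(n)]
        return model(x)

    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi += weight * (coalition_value(coalition | {i}) - coalition_value(coalition))
        values.append(phi)
    return values


def toy_model(x):
    """Hypothetical regressor over three centrality-like features."""
    eigenvector, degree, second_neighbors = x
    return 0.8 - 0.5 * eigenvector + 0.2 * degree + 0.1 * second_neighbors


instance = [0.1, 0.3, 0.2]    # hypothetical feature values of one node
background = [0.5, 0.5, 0.5]  # hypothetical reference point
contributions = shapley_values(toy_model, instance, background)
for name, phi in zip(["eigenvector", "degree", "second_neighbors"], contributions):
    print(f"{name}: {phi:+.3f}")
```

The contributions sum to the difference between the prediction for the instance and the prediction for the reference point, which is the property that plots like Figure 6 visualize. In this toy case the low eigenvector-like feature raises the prediction, while the other two features lower it.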

5 Discussion and conclusions

In this paper, we show that machine learning methods can be used for fast estimation of the effect of epidemic spreading from a given node. We show that by reformulating the task as node regression, we can obtain realistic estimates much faster than by performing computationally expensive simulations, even though such simulations are initially used to fine-tune the machine learning models. Further, employing predictive modeling instead of relying on a single simulation also shows promising results. We also demonstrate that transfer learning can be used to predict spreading effects between networks with similar characteristics without a big loss in performance.

We show that while graph neural networks outperform the random baseline and sometimes give us great results, centrality scores and embedding feature representation methods coupled with XGBoost mostly outperform them. We also see that machine learning models might overall give a more accurate representation of an epidemic than data gathered from a small number of simulations. This makes the machine learning approach faster and more reliable, while also giving an interpretation of why a node was predicted as it was. Further, this paper demonstrates the complementarity between the accepted simulation-based spreading modeling and fast machine learning based screening in data-scarce regimes.

A crucial part of our work shows that transfer of knowledge between networks is possible. This implies that our features capture characteristics that are crucial and transferable between different networks. Since we derive features for models from centralities that are explainable, machine learning models can be used to study which characteristics of the networks play a crucial role in epidemic spreading and how they affect it.

An obvious limitation of the proposed task is that spreading is probabilistic, so even the best models might make significant errors. Similarly, when observing the prediction results for the maximum number of infected nodes, one must be careful, since we predict an average outcome from some node and not the true maximum. This gives us the ability to predict which nodes are most ‘dangerous’ as patient zero. When trying to predict the outcome of an epidemic that has already spread, one must adjust the data accordingly and discard the simulations in which the epidemic has not spread.

In further work, we plan to research different centralities and algorithms to better describe network structure and achieve more accurate results. The proposed approach lowers the number of simulations needed to create good approximations, but it might still not scale to larger graphs. In the future, we would like to develop methods to further reduce the number of simulations needed, making the solution more scalable. Another area of interest is the ability to solve such tasks with unsupervised algorithms. Finally, as the current work focused on node-level aspects, we believe that similar ideas could be adopted to model higher-order structures and their spreading potential, including convex skeletons and communities.


The work of the last author (BŠ) was funded by the Slovenian Research Agency (ARRS) through a grant for junior researchers. The work of the other authors was supported by the Slovenian Research Agency (ARRS) core research program P2-0103 and research project N2-0078 (financed under the ERC Complementary Scheme).