1 Introduction
The pervasive integration of mobile devices and the Internet of Things in everyday life has created an expanding interest in processing their collected data Wang et al. (2018); Li et al. (2018); Lim et al. (2020). However, traditional data mining techniques require communication, storage and processing resources proportional to the number of devices and have raised data control and privacy concerns. An emerging alternative is to mine data on the devices that gather them, using protocols that do not require costly or untrustworthy central infrastructure. One such protocol is gossip averaging, which averages local model parameters across pairs of devices during training (Subsection 2.2).
In this paper we tackle the problem of classifying points of a shared feature space when each one is stored at the device generating it, i.e. each device accesses only its own point but all devices collect the same features. For example, mobile devices of decentralized social media users could classify user interests based on locally stored content features, such as the bag-of-words of sent messages, given that some device users have disclosed their interests. We further consider devices that are nodes of peer-to-peer networks and communicate with each other based on underlying relations, such as friendship or proximity. In this setting, social network overlays coincide with communication networks. However, social behavior dynamics (e.g. users going online or offline) could prevent devices from communicating on demand or at regular intervals. Ultimately, learning algorithms cannot control with whom and when communication takes place.
When network nodes corresponding to data points are linked based on social relations, a lot of information is encapsulated in their link structure in addition to data features. For instance, social network nodes one hop away often assume similar classification labels, a concept known as homophily McPherson et al. (2001); Berry et al. (2020). Yet, existing decentralized learning algorithms do not naturally account for the information encapsulated in links, as they are designed for structured networks with artificially generated topologies. In particular, decentralized learning often focuses on creating custom communication topologies that optimize some aspect of learning Koloskova et al. (2019) and are thus independent from data. Our investigation differs from this assumption in that we classify data points stored on decentralized devices that form unstructured communication networks, where links capture real-world activity (e.g. social interactions) unknown at algorithm design time, as demonstrated in Figure 1.
If a centralized service performed classification, Graph Neural Networks (GNNs) could be used to improve the accuracy of base classifiers, such as ones trained with gossip averaging, by accounting for link structure (Subsection 2.1). But, if we tried to implement GNNs with the same decentralized protocols, connectivity constraints would prevent nodes from timely collecting the latent representations of their neighbors, which are needed to compute graph convolutions. To tackle this problem, we propose working with decoupled GNNs, where network convolutions are separated from base classifier training and organized into graph diffusion components. Given this architecture, we either use pretrained base classifiers or train them with gossip protocols, and we realize graph diffusion in peer-to-peer networks by developing an algorithm whose fragments run on each node and converge to the same predictions as graph diffusion while working under the uncontrolled, irregular communication initiated by device users. Critically, this algorithm allows online modification of the base predictions it diffuses, and hence base classifiers can be trained while their outcomes are being diffused. Thus, all components of implemented decoupled GNNs (both base classifier training and graph diffusion) run at the same time and eventually converge to the desired results.
Our contribution lies in introducing a decentralized setting for classifying peer-to-peer network devices and in porting to it centralized GNN components that induce accuracy improvements. Given existing methods for training or deploying base classifiers in peer-to-peer networks, our decentralized algorithm, called p2pGNN, classifies nodes of simulated networks under uncertain availability with higher accuracy than base classifiers. To our knowledge, our approach is the first that considers communication links themselves to be useful for the learning task, i.e. in networks where the communication topology is retrieved from the real world instead of being imposed on it. To support link mining operations, we introduce a novel organization of prediction primitives that facilitates decentralized graph diffusion and theoretically show fast convergence to similar prediction quality as centralized architectures. Furthermore, we experimentally verify that our approach successfully takes advantage of graph diffusion components and closely matches the classification accuracy of fully centralized computations.
2 Background
2.1 Graph Neural Networks
Graph Neural Networks (GNNs) are a machine learning paradigm in which links between data samples (in our setting, there is a one-to-one correspondence between samples and graph nodes) are used to improve the predictions of base neural network models Wu et al. (2020). In detail, samples are linked to form graphs based on real-world relations, and information diffusion schemes smooth (e.g. average) latent attributes across graph neighbors before transforming them with dense layers and non-linear activations into new representations to be smoothed again. This is repeated either ad infinitum or for a fixed number of steps to combine original representations with structural information. Although propagation is similar to decentralized learning in that nodes work independently, transformation parameters are shared and learned across all nodes. GNN architectures tend to suffer from over-smoothing if too many (e.g. more than two) smoothing layers are employed. However, using few layers limits architectures to propagating information only a few hops away from its original nodes. Mitigating this issue often involves recurrent links to the first latent representations, which lets GNNs achieve at least the same theoretical expressiveness as graph filters Klicpera et al. (2018); Chen et al. (2020). In fact, it has been argued that the success of GNNs can in large part be attributed to the use of recurrency rather than to end-to-end training of seamless architectures Huang et al. (2020). As a result, recent works have introduced decoupled architectures that achieve the same theoretical expressive power as end-to-end training by training base statistical models (such as two-layer perceptrons) to make predictions, and smoothing the latter through graph edges.
In this work, we build on the FDiff-scale prediction smoothing proposed by Huang et al. (2020), which diffuses the base predictions and respective errors of base classifiers to all graph nodes using a constrained personalized PageRank that retains training node labels. Then, a linear trade-off between errors and predictions is calculated for each node and the outcome is again diffused with personalized PageRank to make final predictions. This process generalizes to multiclass predictions by replacing the node values propagated by personalized PageRank with vectors holding prediction scores, where initial predictions are trained by the base classifier to minimize a cross-entropy loss. This architecture is discussed in more detail in Subsection 3.2.
2.2 Decentralized Learning
Decentralized learning refers to protocols that help pools of devices learn statistical models by accounting for each other’s data. Conceptually, each device holds its own autonomous version of the model, and training aims to collectively make those converge to being similar to each other and to a centralized training equivalent, i.e. to be able to replicate would-be centralized predictions locally.
Many decentralized learning practices have evolved from distributed learning, which aims to speed up the time needed to train statistical models by splitting calculations among many available devices, called workers. Typically, workers perform computationally heavy operations, such as gradient estimation for subsets of training data, and send these to a central infrastructure that orchestrates the learning process. A well-known variation of distributed learning occurs when data batches are split across workers a priori, for example because they are gathered by these workers, and are sensitive in the sense that they cannot be directly presented to the orchestrating service. This paradigm is called federated learning and is often realized with the popular federated averaging (FedAvg) algorithm
McMahan et al. (2017). FedAvg performs several learning epochs in each worker before sending parameter gradients to a server that uses the average across workers to update a model and send it back to all of them.
By definition, distributed and federated learning train one central model that is fed back to workers to make predictions with. However, gathering gradients and sending back the model requires a central service with significantly higher throughput than individual workers to simultaneously communicate with all of them and orchestrate learning. To reduce the related infrastructure costs and remove the need for a central authority, decentralized protocols have been introduced to let workers directly communicate with each other. These require either constant communication between workers or a rigid (e.g. ring-like) topology and many communication rounds to learn efficiently Lian et al. (2018); Luo et al. (2020); Zhou et al. (2020). Most decentralized learning practices have evolved to or are variations of gossip learning, where devices exchange and average (parts of) their learned parameters with random others Hu et al. (2019); Savazzi et al. (2020); Hegedűs et al. (2021); Danner and Jelasity (2018); Koloskova et al. (2019).
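To make the parameter exchange of gossip learning concrete, the following minimal sketch (our own illustration, not code from any of the cited works) shows how two communicating devices could average their model parameters; the device objects and their get_params/set_params interface are assumptions for demonstration purposes.

```python
# A minimal sketch of gossip averaging between two communicating devices.
# The device abstraction and its get_params/set_params methods are illustrative assumptions.
def gossip_average(device_a, device_b):
    """Set both devices' parameters to the element-wise average of their current parameters."""
    params_a, params_b = device_a.get_params(), device_b.get_params()
    averaged = [(pa + pb) / 2 for pa, pb in zip(params_a, params_b)]
    device_a.set_params(averaged)
    device_b.set_params(averaged)
```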
3 A Peer-to-Peer Graph Neural Network
3.1 Communication Protocol
In Section 1 we argued that, if communication links between peer-to-peer nodes correspond to real-world relations, GNNs can improve node classification accuracy. At the same time, peer-to-peer networks often suffer from node churn, power usage constraints, as well as virtual or physical device mobility, which cumulatively make communication channels between nodes irregularly available. In this work, we assume that linked nodes keep communicating over time without removing links or introducing new ones, though links can become temporarily inactive. We expect this assumption to hold true in social networks that evolve slowly, i.e. in which user interactions are many times more frequent than link changes, so that, from the perspective of link mining, they can be viewed as static relational graphs.
To provide a framework in which peer-to-peer nodes learn to classify themselves, we first specify a communication protocol of information exchanges. To do this, we consider static adjacency matrices $A$ with elements $A[u,v]\in\{0,1\}$, where links $A[u,v]=1$ indicate communication channels of uncertain availability. These matrices are not fully observable by decentralized nodes (the latter are at most aware of their corresponding rows and columns), but would be the input to centralized GNNs. Uncertainty is encoded with time-evolving communication matrices $A^{(t)}$ whose non-zero elements indicate exchanges through the corresponding links:
$$A^{(t)}[u,v] = \begin{cases} A[u,v] & \text{if nodes } u,v \text{ communicate at time } t \\ 0 & \text{otherwise.} \end{cases}$$
To simplify the rest of our analysis, and without loss of generality, we adopt a discrete notion of time that orders the sequence of communication events. We stress that real-world time intervals between consecutive timestamps could vary.
To exchange information through the channels represented by time-evolving communication matrices, we use the broadly popular send-receive-acknowledge communication protocol: devices (in our case, these are nodes of the peer-to-peer network) are equipped with identifiers and operations send, receive and acknowledge that respectively implement message generation, receiving message callbacks that generate a new message to send back, and acknowledging that sent messages have been received. Expected usage of these operations is demonstrated in Algorithm 1 in the form of a simulation.
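As a rough illustration (not the paper's Algorithm 1 itself), one simulated exchange under this protocol could look as follows, where node objects implementing the three operations are assumed (a sketch of such objects is given in Subsection 3.3):

```python
# A sketch of simulating one send-receive-acknowledge exchange; node objects implementing
# send, receive and acknowledge are assumed (see the p2pGNN sketch later on).
def communicate(nodes, u, v):
    message = nodes[u].send(v)            # u generates a message addressed to v
    reply = nodes[v].receive(u, message)  # v's callback consumes it and generates a reply
    nodes[u].acknowledge(v, reply)        # u learns that its message arrived and processes the reply
```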
3.2 GNN Architecture
Let base classifiers $R(\cdot;\theta)$ with parameters $\theta$ be trained on a set of labeled nodes $V_{train}$ to produce estimations $\hat{r}_v = R(x_v;\theta)$ of one-hot label encodings $y_v$ from feature vectors $x_v$, i.e. $\arg\max(\hat{r}_v)$ is the predicted label for each node $v$ with feature vector $x_v$. Regardless of how classifiers are trained, peer-to-peer network links can be leveraged to improve predictions. For neural networks, this is achieved by transforming them to GNNs that incorporate graph convolutions in multilayer parameter-based transformations, both during training and during predictions.
Unfortunately, graph convolutions smooth latent representations across neighboring nodes, but uncertain availability means that there are only two realistic options to implement smoothing in peer-to-peer networks: either a) the last retrieved representations are used, or b) node features and links from many hops away are stored locally for in-device computation of graph convolutions. In the first case, convergence to equivalent centralized model parameters could be slow, since learning would impact neighbor representations only during communication. In the second case, multilayer architectures aiming to broaden node receptive fields to many hops away would end up storing most network links and node features in each node.
To avoid these shortcomings, we build on the decoupled GNNs outlined in Subsection 2.1, which separate the challenge of training base classifiers from that of leveraging network links to improve predictions. To mathematically manipulate decoupled GNN primitives, we organize the base predictions of node features $X$, whose rows $X[v]=x_v$ hold the features of nodes $v$, into matrices $\hat{R}$ with rows $\hat{R}[v]=\hat{r}_v$ holding the predictions of the respective feature rows. Predicted classes are obtained by $\arg\max(\hat{R}[v])$. If classifiers are trained for the features and one-hot label encodings $Y[v]=y_v$ of node sets $V_{train}$, we build on the FDiff-scale decoupled GNN's description Huang et al. (2020), which we transcribe as:
$$\hat{R}^{diff} = (1-a)\left(I - a\,D^{-1/2}\hat{A}D^{-1/2}\right)^{-1}\left(\hat{R} + s\,(1-a)\left(I - a\,D^{-1/2}\hat{A}D^{-1/2}\right)^{-1}E\right) \qquad (1)$$
where $I$ is the unit matrix, masked adjacency matrices $\hat{A}$ with elements $\hat{A}[u,v]=A[u,v]$ if $u\notin V_{train}$ and $\hat{A}[u,v]=0$ otherwise (these are not symmetric) prevent diffusion from affecting training nodes, $E$ is the matrix of training errors with rows $E[v]=Y[v]-\hat{R}[v]$ for $v\in V_{train}$ and zero rows otherwise, $D$ is a diagonal matrix of node degrees, and $a\in[0,1)$, $s$ are hyperparameters. The above formula comprises two graph diffusion operations of the following form:
$$R^{diff} = (1-a)\left(I - a\,D^{-1/2}\hat{A}D^{-1/2}\right)^{-1}P \qquad (2)$$
The operation performed first (the one inside the parenthesis) sets $P=E$, so that only the personalization of training nodes is diffused through the graph. The second operation sets $P=\hat{R}+s\,E^{diff}$, where $E^{diff}$ is the outcome of the first operation, and becomes equivalent to constraining the personalized PageRank scheme Page et al. (1999); Tong et al. (2006) with the normalized communication matrix $D^{-1/2}\hat{A}D^{-1/2}$ so that it preserves the original node predictions assigned to training nodes $V_{train}$. Effectively, it is equivalent to restoring training node scores after each power method iteration $R^{(n)} = a\,D^{-1/2}\hat{A}D^{-1/2}R^{(n-1)} + (1-a)P$, where each iteration step is a specific type of graph convolution. The representations diffused by the two operations are training node errors and a trade-off between diffused errors and node predictions, respectively.
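For reference, a small NumPy sketch of these two equations in their centralized form follows. Variable names, the fixed iteration count, and the default hyperparameter values are our own choices, and zero-degree nodes are not handled; this is an illustration of the computation rather than the paper's code.

```python
# A minimal NumPy sketch of the centralized FDiff-scale computation of Equations (1)-(2).
import numpy as np

def diffuse(A_masked, D_inv_sqrt, P, a=0.9, iterations=100):
    """Constrained graph diffusion of Equation (2), computed via power iterations."""
    W = D_inv_sqrt @ A_masked @ D_inv_sqrt      # normalized (masked) adjacency
    R = P.copy()
    for _ in range(iterations):
        R = a * (W @ R) + (1 - a) * P
    return R

def fdiff_scale(A, R_hat, Y, train, a=0.9, s=1.0):
    """Equation (1): diffuse training errors, then diffuse the error-corrected predictions."""
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_masked = A.astype(float).copy()
    A_masked[train, :] = 0                      # prevent diffusion from affecting training nodes
    E = np.zeros_like(R_hat)
    E[train] = Y[train] - R_hat[train]          # errors are defined on training nodes only
    E_diff = diffuse(A_masked, D_inv_sqrt, E, a)
    return diffuse(A_masked, D_inv_sqrt, R_hat + s * E_diff, a)
```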
We stress that, although the above-described decoupled GNN architecture already exists in the literature, supporting its diffusion operation in peer-to-peer networks requires the analysis we present in the rest of this section.
3.3 Peer-to-Peer Personalized PageRank
If matrix row additions are atomic node operations, implementing the graph diffusion of Equation 1 in peer-to-peer networks with uncertain availability reduces to implementing two versions of Equation 2's constrained personalized PageRank: one with personalization $E$ and one with personalization $\hat{R}+s\,E^{diff}$. Thus, we focus on implementing personalized PageRank variations.
Previous works have computed personalized or non-personalized PageRank (for which personalization columns are normalized vectors of ones) in peer-to-peer networks by letting peers hold fragments of the network spanning multiple nodes and merging these when peers communicate Parreira et al. (2006); Zhang et al. (2020); Bahmani et al. (2010); Hu and Lau (2012). Unlike these, in our setting peers coincide with nodes, and merging network fragments could require untenable bandwidths that grow proportionally to the network size as merged subnetworks are continuously exchanged between peers. Instead, we devise a new computational scheme that is lightweight in terms of communication.
Iterative synchronized convolutions require node neighbor representations at intermediate steps. However, an early work by Lubachevsky and Mitra (1986) showed that, for non-personalized PageRank, decentralized schemes holding local estimations of earlier-computed node scores (or, in the case of graph diffusion, vectors) converge to the same point as centralized ones as long as communication intervals are bounded. This motivates us to similarly iterate personalized PageRank by using the last communicated neighbor representations to update local nodes. In this subsection we mathematically describe this scheme and show that it converges to the same point as its centralized equivalent at a linear rate (which corresponds to an exponentially decreasing error), even if personalization evolves over time, as long as the personalization itself converges at (near-)linear rates. Notably, keeping older representations is not a viable solution for calculating graph convolutions when these are entangled with representation transformations, but employing decoupled GNNs lets us separate learning from diffusion.
To set up a decentralized implementation of personalized PageRank, we introduce a theoretical construct we dub decentralized graph signals, which describes decentralized operations in peer-to-peer networks while accounting for personalization updates over time, in case personalization is trained while being diffused. Our structure is defined as matrices $S^{(t)}$ with multidimensional vector elements $S^{(t)}[u,v]\in\mathbb{R}^{C}$ (in our case $C$ is the number of classes) that hold, in each device $u$, that device's estimate of device $v$'s representation. Rows $S^{(t)}[u,\cdot]$ are stored on devices $u$ and only cross-column operations are impacted by communication constraints.
We now consider a scheme that updates decentralized graph signals at times $t$ per the rules:
$$S^{(t)}[u,v] = \begin{cases} S^{(t-1)}[v,v] & \text{if } A^{(t)}[u,v] \neq 0 \\ S^{(t-1)}[u,v] & \text{otherwise} \end{cases}
\qquad
S^{(t)}[u,u] = a \sum_{v} \frac{\hat{A}[u,v]}{\sqrt{D[u,u]\,D[v,v]}}\, S^{(t)}[u,v] + (1-a)\, p^{(t)}_u \qquad (3)$$
where $p^{(t)}_u$ are time-evolving representations (personalization vectors) of nodes $u$. The first of the above equations describes node representation exchanges between devices based on the communication matrix, whereas the second one performs a local update of the personalized PageRank estimation given the last updated neighbor estimates, and involves only data stored on device $u$. Then, Theorem 1 shows that the main diagonal of the decentralized graph signal deviates from the desired node representations with an error that converges to zero mean with linear rate. This weak convergence may not perfectly match centralized diffusion. However, it still guarantees that the outcomes of the two correlate to a large degree.
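The following toy NumPy simulation, written for this presentation with an illustrative 3-node graph, a fixed personalization, and arbitrary hyperparameter values, iterates the rules of Equation 3 under random link activations and compares the resulting diagonal against the closed-form fixed point of Equation 2 (with no training-node constraints):

```python
# A small numerical check of Equation (3): simulate random link activations and verify that the
# diagonal of the decentralized graph signal approaches the centralized solution of Equation (2).
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)   # toy 3-node graph
D_inv_sqrt = np.diag(1 / np.sqrt(A.sum(axis=1)))
W = D_inv_sqrt @ A @ D_inv_sqrt
a, q = 0.9, 0.3                                                # diffusion rate, link probability
p = rng.random((3, 2))                                         # fixed personalization (2 classes)

S = np.repeat(p[None, :, :], 3, axis=0)                        # S[u, v] = device u's estimate of v
for t in range(2000):
    active = (rng.random((3, 3)) < q) * A                      # random communication events A^(t)
    for u, v in zip(*np.nonzero(active)):
        S[u, v] = S[v, v]                                      # exchange diagonal estimates
    for u in range(3):
        S[u, u] = a * sum(W[u, v] * S[u, v] for v in range(3)) + (1 - a) * p[u]

centralized = (1 - a) * np.linalg.inv(np.eye(3) - a * W) @ p   # fixed point of Equation (2)
print(np.abs(np.stack([S[u, u] for u in range(3)]) - centralized).max())  # should be near zero
```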
Theorem 1.
If $p^{(t)}_u$ is bounded and converges in distribution with linear rate, and the elements of $A^{(t)}$ are independent discrete random variables with fixed means, then $S^{(t)}[u,u]$ converges in distribution to values satisfying Equation 2 with linear rate.
Proof.
Without loss of generality, we show this result for $V_{train}=\emptyset$, for which $\hat{A}=A$, as more training nodes only add constraints to the diffusion scheme that force it to converge faster.
Let us denote as $\bar{s}^{(t)}$ and $\bar{p}^{(t)}$ the vectors with elements $\bar{s}^{(t)}[u]=\mathbb{E}\{S^{(t)}[u,u]\}$ and $\bar{p}^{(t)}[u]=\mathbb{E}\{p^{(t)}_u\}$, where $\mathbb{E}\{\cdot\}$ is the expected value operation. Then, since the communication rate is fixed and an independent random variable for each edge, taking expectations of Equation 3 and letting $t\to\infty$ yields at the fixed point
$$\bar{s} = a\,D^{-1/2}AD^{-1/2}\,\bar{s} + (1-a)\,\bar{p}$$
which, for the normalized communication matrix $W=D^{-1/2}AD^{-1/2}$, yields the solution
$$\bar{s} = (1-a)\left(I - aW\right)^{-1}\bar{p}$$
that is unique, given that for eigenvalues $\lambda_i$ of $W$ it holds that $|\lambda_i|\leq 1$ (given the properties of doubly stochastic and Markovian matrices) and hence the corresponding eigenvalues of $I-aW$ become $1-a\lambda_i \geq 1-a > 0$, which make it invertible. Hence, the unique solution necessarily coincides with the solution of Equation 2.
For the same quantities, we can see that the convergence rate would be equal to or faster than the convergence rate if all communications took place with probability $q = \min\{\mathbb{E}\{A^{(t)}[u,v]\} : A[u,v]\neq 0\}$, i.e. the smallest link activation probability of the communication matrix. Thus, we consider a communication matrix whose non-zero elements arise independently with probability $q$ and analyse the latter to find the slowest possible convergence rate. In this setting, denoting the expected estimation errors $\epsilon_s^{(t)} = \bar{s}^{(t)}-\bar{s}$, $\epsilon_m^{(t)}[u,v] = \mathbb{E}\{S^{(t)}[u,v]\}-\bar{s}[v]$ and $\epsilon_p^{(t)} = \bar{p}^{(t)}-\bar{p}$, we obtain the recursive formula
$$\epsilon_m^{(t)}[u,v] = q\,\epsilon_s^{(t-1)}[v] + (1-q)\,\epsilon_m^{(t-1)}[u,v], \qquad \epsilon_s^{(t)}[u] = a\sum_v W[u,v]\,\epsilon_m^{(t)}[u,v] + (1-a)\,\epsilon_p^{(t)}[u]$$
where $W = D^{-1/2}AD^{-1/2}$. Thus, denoting as $\rho$ the contraction rate of this recursion for $\epsilon_p^{(t)}=0$, for which $\rho \leq 1-q(1-a) < 1$ given the same properties of $W$ as above, and as $\rho_p$ the linear convergence rate of $\bar{p}^{(t)}$, it holds that
$$\|\epsilon_s^{(t)}\| = O\left(\max\{\rho,\rho_p\}^{t}\right)$$
up to polynomial factors of $t$ when $\rho=\rho_p$. Thus, calculating the behavior as $t\to\infty$, we obtain the linear convergence rate $\max\{\rho,\rho_p\}$. ∎
Algorithm 2 realizes Equation 1 for peer-to-peer network nodes that communicate with their social neighbors under the send-receive-acknowledge protocol. We implement the protocol's operations, node initialization given prediction vectors and target labels, and the ability to update predictions. Nodes $u$ are initialized per initialize($\hat{r}_u$, $y_u$), where the last argument is a vector of zeros for non-training nodes. We implement graph diffusion with the decentralized graph signals predictions and errors, where the former uses the outcome of the latter. Predictions, that is, the main diagonal of the predictions decentralized graph signal, are stored in each node's prediction variable. There are two hyperparameters: $a$, which determines the diffusion rate, and $s$, which trades off errors and predictions. Importantly, given linear or faster convergence rates for base classifier predictions, Theorem 1 yields linear convergence in distribution for errors and hence for the in-code variable combined of each node. Therefore, from the same theorem, predictions also converge linearly in distribution.
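A compressed sketch of how node fragments could implement these operations is given below. It follows the structure just described but simplifies degree normalization to a uniform neighbor average and omits the training-node constraints, so it should be read as an illustration rather than as the paper's Algorithm 2; all class and variable names are our own.

```python
# A simplified per-node sketch of p2pGNN operations under the send-receive-acknowledge protocol.
import numpy as np

class DiffusedValue:
    """One row of a decentralized graph signal, held by a single node."""
    def __init__(self, personalization, a=0.9):
        self.a = a
        self.personalization = personalization        # p_u^(t); can be updated while diffusing
        self.neighbor_estimates = {}                  # last received neighbor diagonal values
        self.value = personalization.copy()           # this node's diagonal estimate S[u,u]

    def update(self):
        # Local personalized PageRank step over the last communicated neighbor estimates.
        # Simplification: uniform averaging instead of symmetric degree normalization.
        if self.neighbor_estimates:
            neighborhood = np.mean(list(self.neighbor_estimates.values()), axis=0)
            self.value = self.a * neighborhood + (1 - self.a) * self.personalization
        return self.value

class Node:
    def __init__(self, prediction, label, a=0.9, s=1.0):
        self.base = prediction                        # base classifier output (may keep training)
        error = label - prediction if label.any() else np.zeros_like(prediction)
        self.errors = DiffusedValue(error, a)         # diffused training errors
        self.predictions = DiffusedValue(prediction, a)
        self.s = s
        self.prediction = prediction                  # final (diffused) prediction vector

    def send(self, neighbor):
        return self.errors.value, self.predictions.value

    def acknowledge(self, neighbor, message):
        error, prediction = message
        self.errors.neighbor_estimates[neighbor] = error
        self.predictions.neighbor_estimates[neighbor] = prediction
        # combine base predictions with diffused errors, then diffuse the combination
        self.predictions.personalization = self.base + self.s * self.errors.update()
        self.prediction = self.predictions.update()

    def receive(self, neighbor, message):
        self.acknowledge(neighbor, message)           # same processing as for replies
        return self.send(neighbor)                    # ...but also generate a reply
```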
4 Experiments
4.1 Datasets and Simulation
To compare the ability of peer-to-peer learning algorithms to make accurate predictions, we experiment on three datasets that are often used to assess the quality of GNNs Shchur et al. (2018); the Citeseer Namata et al. (2012), Cora Sen et al. (2008) and Pubmed Namata et al. (2012) social graphs. Preprocessed versions of these are retrieved from the programming interface of the publicly available Deep Graph Library Wang et al. (2019) and comprise the node features, class labels and training-validation-test splits summarized in Table 1. In practice, the class labels of training and validation nodes would have been manually provided by respective devices (e.g. submitted by their users) and would form the ground truth used to train base models.

Table 1: Dataset characteristics and node splits.

Dataset   | Nodes  | Links  | Node Features | Class Labels | Training | Validation | Test
Citeseer  | 3,327  | 9,228  | 3,703         | 6            | 120      | 500        | 1,000
Cora      | 2,708  | 10,556 | 1,433         | 7            | 140      | 500        | 1,000
Pubmed    | 19,717 | 88,651 | 500           | 3            | 60       | 500        | 1,000
We use these datasets to simulate peer-to-peer networks with the same nodes and links as in the dataset graphs and fixed probabilities for communication through links at each time step, uniformly sampled from a fixed range. To speed up experiments, we further force nodes to participate in only one communication at each time step by randomly determining which edges to ignore when conflicts arise; this way, we are able to use threading to parallelize experiments by distributing time step computations between available CPUs (this is independent of our decentralized setting and its only purpose is to speed up simulations). We obtain the classification accuracy of test labels after 1,000 time steps (all algorithms converge well within that number) and report its average across five experiment repetitions. Similar results are obtained for communication rates sampled from different range intervals. Experiments are available online at https://github.com/MKLab-ITI/decentralized-gnn (Apache License, Version 2.0) and were conducted on a machine running Python 3.6 with 64GB RAM (they require at least 12GB available to run) and 32x1.80GHz CPUs.
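A rough sketch of the simulation loop described above follows; it assumes the Node class of the earlier sketch, a list of links, and a dictionary of per-link communication probabilities, all of which are illustrative names rather than the repository's actual interfaces.

```python
# A sketch of simulating probabilistic peer-to-peer communication over a fixed link structure.
import random

def simulate(nodes, links, probabilities, steps=1000):
    for _ in range(steps):
        busy = set()
        random.shuffle(links)
        for u, v in links:
            # each node participates in at most one communication per time step
            if u in busy or v in busy or random.random() > probabilities[(u, v)]:
                continue
            busy.update((u, v))
            message = nodes[u].send(v)
            reply = nodes[v].receive(u, message)
            nodes[u].acknowledge(v, reply)
    return {u: nodes[u].prediction.argmax() for u in nodes}  # predicted class per node
```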
4.2 Base Classifiers
Experiments are conducted on three base classifiers:
MLP – A multilayer perceptron often employed by GNNs Klicpera et al. (2018); Huang et al. (2020). This consists of a dense two-layer architecture, starting with a transformation of node features into hidden representations with ReLU activations, followed by an additional dense transformation of the latter whose softmax aims to predict one-hot encodings of classification labels.
LR – A simple multilabel logistic regression classifier whose softmax aims to predict one-hot encodings of classification labels.
Label – Classification that repeats training node labels. If no diffusion is performed, this provides random predictions for test nodes.
MLP and LR are trained towards minimizing the cross-entropy loss of known node labels with Adam optimizers Kingma and Ba (2014); Bock et al. (2018). We set learning rates to values often used for training on similarly-sized datasets and maintain the default momentum parameters proposed by the optimizer's original publication. For MLP, we use dropout for the dense layer to improve robustness, and for all classifiers we L2-regularize dense layer weights. We do not perform hyperparameter tuning, as further protocols would be needed to make peer-to-peer nodes learn a common architecture optimal for a set of validation nodes. For FDiff-scale hyperparameters, we select a personalized PageRank restart probability often used for graphs of several thousand nodes and an error scale parameter selected so that it theoretically satisfies a heuristic of perfectly reconstructing the class labels of training nodes.
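For concreteness, a hypothetical PyTorch sketch of such an MLP base classifier and its training step is shown below. The hidden size, dropout rate, learning rate, and weight decay are placeholders chosen for illustration, since the text above does not fix their values; y is assumed to hold integer class indices.

```python
# A hypothetical sketch of the MLP base classifier; all numeric values are placeholders.
import torch

class MLP(torch.nn.Module):
    def __init__(self, num_features, num_classes, hidden=64, dropout=0.5):
        super().__init__()
        self.layer1 = torch.nn.Linear(num_features, hidden)
        self.layer2 = torch.nn.Linear(hidden, num_classes)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x):
        h = self.dropout(torch.relu(self.layer1(x)))
        return self.layer2(h)        # softmax is folded into the cross-entropy loss below

def train_step(model, optimizer, x, y, train_mask):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x)[train_mask], y[train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)  # placeholder values
```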
4.3 Compared Approaches
We experiment with the following two versions of the MLP and LR classifiers, which differ with respect to whether they are pretrained and deployed to nodes or learned via gossip averaging. In total, experiments span 2 MLP + 2 LR + Label = 5 base classifiers.
Pretrained – Training classifiers on the training node set in a centralized architecture over multiple epochs. For faster training, we perform early stopping if the validation set loss has not decreased for several consecutive epochs. In practice, this version of classifiers could take the form of a service (e.g. a web service) that trains classifiers from submitted device labels and hosts the result for all devices to retrieve.
Gossip – Fully decentralized gossip averaging, where each node holds a copy of the base classifier and parameters are set to the average between pairs of communicating nodes. Since no stopping criterion can be enforced, both training and validation nodes contribute to training the (fragments of the) base classifier. The simulated devices corresponding to those nodes perform a gradient step on a local instance of the Adam optimizer every time they are involved in a communication (a sketch of this step is given after this list). If training data were independent and identically distributed and a structured topology was forced upon devices, this approach could be considered a state-of-the-art baseline in terms of accuracy, as indicated by the theoretical analysis of Koloskova et al. (2019) and the experiment results of Niwa et al. (2020). However, our setting involves an unstructured peer-to-peer topology, where nodes are connected based on homophily, and the efficacy of this practice is uncertain. We also consider the Label classifier as natively Gossip, as it does not require any centralized infrastructure.
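The following sketch illustrates the simulated gossip training step just described: when two labeled nodes communicate, each performs a local Adam step on its own data and the two then average their parameters. It assumes PyTorch models, per-node optimizers, and single-sample local data; none of these names come from the paper's code.

```python
# An illustrative gossip training step for two communicating nodes with local PyTorch models.
import torch

def gossip_train_step(node_u, node_v):
    for node in (node_u, node_v):
        if node.y is not None:                     # only nodes with known labels compute a loss
            node.optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(node.model(node.x), node.y)
            loss.backward()
            node.optimizer.step()
    # average the two parameter sets so that both local models move toward consensus
    with torch.no_grad():
        for p_u, p_v in zip(node_u.model.parameters(), node_v.model.parameters()):
            average = (p_u + p_v) / 2
            p_u.copy_(average)
            p_v.copy_(average)
```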
For all base classifiers, we report: a) their vanilla accuracy, b) the accuracy of passing base predictions through the FDiff-scale scheme of Equation 1, as implemented via the p2pGNN operations presented in Algorithm 2, and c) the accuracy of passing base predictions through a centralized implementation of FDiff-scale with the same hyperparameters. For the last scheme, we report diffusion improvements only for pretrained models, as it makes little sense to combine decentralized training with centralized diffusion.
Finally, given that training does not depend on diffusion, we perform the latter by considering both training and validation node labels as known information; that is, both types of nodes form the training node set of our analysis. Ideally, p2pGNN would leverage the homophilous node communications to improve base accuracy and tightly approximate fully centralized predictions. In this case, it would become a decentralized equivalent to centralized diffusion that works under uncertain communication availability and does not expose predictive information of devices to devices other than communicating graph neighbors.
4.4 Results
In Table 2 we compare the accuracy of base algorithms vs. their augmented predictions with the decentralized p2pGNN and a fully centralized implementation of FDiff-scale. We remind that the last two schemes implement the same architecture and differ only in whether they run on peer-to-peer networks or not. We can see that, in the case of pretrained base classifiers, p2pGNN diffusion successfully improves their accuracy scores by wide margins, i.e. 7%–47% relative increase. In fact, the improved scores closely resemble those of centralized diffusion, i.e. with less than 3% relative decrease, for the Citeseer and Cora datasets. In these cases, we consider our peer-to-peer diffusion algorithm to have successfully decentralized the diffusion components. On the Pubmed dataset, centralized schemes are replicated less tightly (this also holds true for simple Label propagation), but there is still substantial improvement compared to pretrained base classifiers.
On the other hand, results are mixed when we consider base classifiers trained via gossip averaging. Before further exploration, we remark that MLP and LR outperform their pretrained counterparts in large part due to a combination of training with larger sets of node labels (both training and validation nodes) and “leaking” the graph structure into local classifier fragment parameters due to non-identically distributed node class labels. However, after diffusion is performed, accuracy does not reach the same levels as for pretrained base classifiers; in fact, in the Citeseer and Cora datasets, homophilous parameter training reduces the diffusion of classifier fragment parameters to the diffusion of class labels. This indicates that classifier fragments tend to correlate node features with graph structure and hence additional diffusion operations provide little new information. Characteristically, the linear nature of LR makes its base gossip-trained and p2pGNN versions near-identical. Since this issue systemically arises from gossip training shortcomings, we leave its mitigation to future research.
Table 2: Classification accuracy of base classifiers and of their predictions augmented by the decentralized p2pGNN and by a fully centralized FDiff-scale implementation.

           |          Base            |         p2pGNN           |  Fully Centralized GNN
           | Citeseer  Cora   Pubmed  | Citeseer  Cora   Pubmed  | Citeseer  Cora   Pubmed
Pretrained |                          |                          |
  MLP      | 52.3%     54.9%  70.9%   | 67.8%     81.5%  76.0%   | 69.0%     84.0%  81.2%
  LR       | 59.4%     58.7%  72.2%   | 70.5%     82.0%  77.3%   | 70.3%     85.7%  81.5%
Gossip     |                          |                          |
  MLP      | 63.1%     66.3%  74.9%   | 61.3%     80.8%  78.0%   | –         –      –
  LR       | 61.8%     79.9%  78.7%   | 61.4%     80.8%  78.7%   | –         –      –
  Labels   | 15.9%     11.6%  22.0%   | 61.1%     80.8%  71.5%   | 61.5%     78.9%  78.6%
Overall, experiment results indicate that, in most cases, p2pGNN successfully applies GNN principles to improve base classifier accuracy. Importantly, although neighbor-based gossip training of base classifiers on both training and validation nodes outperforms models pretrained on only training nodes (in which case validation nodes are used for early stopping), decentralized graph diffusion of the latter exhibits the highest accuracy across most combinations of datasets and base classifiers.
5 Conclusions and Future Work
In this work, we investigated the problem of classifying the nodes of unstructured peer-to-peer networks under communication uncertainty and proposed that homophilous communication links can be mined with decoupled GNN diffusion to improve base classifier accuracy. We thus introduced a decentralized implementation of diffusion, called p2pGNN, whose fragments run on decentralized devices and mine network links as irregular peer-to-peer communication takes place. Theoretical analysis and experiments on three peer-to-peer networks simulated from labeled graph data showed that combining pretrained (and often gossip-trained) base classifiers with our approach successfully improves their accuracy to similar degrees as fully centralized decoupled graph neural networks.
For future work, we aim to improve gossip training to let it account for the non-identically distributed spread of data across graph nodes. We are also interested in addressing privacy concerns and societal biases in our approach and in exploring automated hyperparameter selection.
Acknowledgements
This work was partially funded by the European Commission under contract numbers H2020-951911 AI4Media and H2020-825585 HELIOS.
References
Fast incremental and personalized PageRank. arXiv preprint arXiv:1006.2880.
Going beyond accuracy: estimating homophily in social networks using predictions. arXiv preprint arXiv:2001.11171.
An improvement of the convergence proof of the Adam-optimizer. arXiv preprint arXiv:1804.10587.
Simple and deep graph convolutional networks. In International Conference on Machine Learning, pp. 1725–1735.
Token account algorithms: the best of the proactive and reactive worlds. In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pp. 885–895.
Decentralized learning works: an empirical comparison of gossip learning and federated learning. Journal of Parallel and Distributed Computing 148, pp. 109–124.
Decentralized federated learning: a segmented gossip approach. arXiv preprint arXiv:1908.07782.
Localized algorithm of community detection on large-scale decentralized social networks. arXiv preprint arXiv:1212.6323.
Combining label propagation and simple models outperforms graph neural networks. arXiv preprint arXiv:2010.13993.
Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Predict then propagate: graph neural networks meet personalized PageRank. arXiv preprint arXiv:1810.05997.
Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, pp. 3478–3487.
Learning IoT in edge: deep learning for the Internet of Things with edge computing. IEEE Network 32 (1), pp. 96–101.
Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, pp. 3043–3052.
Federated learning in mobile edge networks: a comprehensive survey. IEEE Communications Surveys & Tutorials 22 (3), pp. 2031–2063.
A chaotic asynchronous algorithm for computing the fixed point of a nonnegative matrix of unit spectral radius. Journal of the ACM (JACM) 33 (1), pp. 130–150.
Prague: high-performance heterogeneity-aware asynchronous decentralized training. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 401–416.
Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282.
Birds of a feather: homophily in social networks. Annual Review of Sociology 27 (1), pp. 415–444.
Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, Vol. 8.
Edge-consensus learning: deep learning on P2P networks with nonhomogeneous data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 668–678.
The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab.
Efficient and decentralized PageRank approximation in a peer-to-peer web search network. In Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 415–426.
Federated learning with cooperating devices: a consensus approach for massive IoT networks. IEEE Internet of Things Journal 7 (5), pp. 4641–4654.
Collective classification in network data. AI Magazine 29 (3), pp. 93–93.
Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868.
Fast random walk with restart and its applications. In Sixth International Conference on Data Mining (ICDM'06), pp. 613–622.
Deep learning towards mobile applications. In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pp. 1385–1393.
Deep Graph Library: a graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315.
A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.
A decentralized PageRank based content dissemination model at the edge of network. International Journal of Web Services Research (IJWSR) 17 (1), pp. 1–16.
Communication-efficient decentralized machine learning over heterogeneous networks. arXiv preprint arXiv:2009.05766.