1. Introduction
Graph representation learning has advanced greatly in recent years and has drawn attention in both academia and industry because GNNs are expressive, flexible, robust and scalable.
Many recent GNN models learn projections from node neighborhoods using different sampling, aggregation and transformation schemes (Grover and Leskovec, 2016; Perozzi et al., 2014; Kipf and Welling, 2017a; Hamilton et al., 2017; Xu et al., 2019; You et al., 2019; Rossi et al., 2020). GNNs have been adopted in various academic and industrial settings, such as link prediction (Ying et al., 2018b), protein-protein interaction (Shen et al., 2021), community detection (Wu et al., 2019), and recommender systems (Wang et al., 2018; Ying et al., 2018a). Furthermore, Zhu et al. (2019); Ying et al. (2018a); Lerer et al. (2019); Ma et al. (2018) develop fault-tolerant and distributed systems to apply graph neural networks (GNNs) to large graphs.
These GNN models focus on learning a single encoder projecting graph substructures to a representation embedding. However, low-dimensional node embeddings sometimes fail to capture all the high-dimensional information about node neighborhoods. For example, in a social network, users may connect to distinct neighbors who share different common interests, and thus the semantic meaning of edges varies (Yang et al., 2019). Most prior works aim to capture the diverse signals of node neighborhoods by increasing the embedding dimension. However, in many cases it is challenging to encode the local multimodal information into one embedding space. Multi-head attention bridges the gap to some extent by learning different attention heads for different neighbors (Veličković et al., 2018). One promising alternative is to learn multiple representations, where each embedding captures a specific aspect of the rich information about node neighborhoods. By projecting onto multiple low-dimensional node embedding spaces, we find it extremely promising when the inter-node affinity correlates with the inter-embedding similarity in one or more subspaces instead of the entire space.
Network machine learning applications usually consider exceptionally heterogeneous node features, topological signals, and various neighborhoods and communities, where an ensemble of latent semantics underlies the network features
(Yang et al., 2019). Here we propose boosting-based GNNs, which automatically learn projections onto multiple low-dimensional embedding spaces from the high-dimensional graph contents and determine the focus of each embedding space according to the node neighborhoods.

1.1. Preliminaries
Static Graph Although the concrete formulation of graph problems varies, we use a quadruplet

(1) $G = (V, E, X_V, X_E)$

to denote the holistic information about a graph, where $V$ and $E$ are the sets of all nodes and edges respectively, endowed with node features $X_V$ and edge features $X_E$.
Dynamic Graph Recent works investigated representation learning on dynamic graphs and corresponding variations of GNNs (Rossi et al., 2020; Kumar et al., 2019), which consider both the evolution of $V$ and $E$ in terms of cardinality and the updates of the features $X_V$ and $X_E$. For simplicity, we assume $V$ is the universe of all nodes that ever exist during the trajectory of the dynamic network, and features are timestamped.
Embedding Space An embedding space $Z \subseteq \mathbb{R}^d$ is a vector space onto which we project nodes $u \in V$. Let $\mathrm{sim}(z_u, z_v)$ denote a similarity measure between two embeddings.

Encoder-Decoder framework We let $\mathcal{N}(u)$ denote the neighborhood around node $u$ (a $k$-hop neighborhood, considering all nodes and edges within a radius of $k$ edges w.r.t. $u$, is a widely used definition), which comprises nodes, edges, node features and edge features. Conceptually an encoder projects nodes to an embedding space, whereas the actual model maps the neighborhood $\mathcal{N}(u)$ to an embedding $z_u$. The concrete choices of decoders vary in different applications. For example, in node-level supervised learning with labels $y_u$, our goal is to optimize the encoder and decoder

$\min_{\mathrm{ENC}, \mathrm{DEC}} \sum_{u \in V} \ell\big(\mathrm{DEC}(z_u), y_u\big),$

whereas for pairwise utility prediction, we are usually given inter-node utility labels $y_{uv}$ and the goal reduces to searching for the best encoder and decoder

$\min_{\mathrm{ENC}, \mathrm{DEC}} \sum_{(u,v)} \ell\big(\mathrm{DEC}(z_u, z_v), y_{uv}\big).$

Sometimes decoders do not contain trainable parameters (e.g. dot product, cosine similarity); then the optimization only searches the encoder parameter space.
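To make the framework concrete, here is a minimal sketch with our own illustrative names (not the paper's implementation): a toy "encoder" that averages neighbor features stands in for a GNN, and a parameter-free dot-product decoder scores inter-node affinity.

```python
# Minimal encoder-decoder sketch. encode() is a stand-in for a GNN that maps
# a node's neighborhood (here: a list of neighbor feature vectors) to one
# embedding; decode() is a parameter-free dot-product decoder.

def encode(neighborhood_feats):
    """Toy 'encoder': element-wise mean of neighbor feature vectors."""
    dim = len(neighborhood_feats[0])
    return [sum(f[i] for f in neighborhood_feats) / len(neighborhood_feats)
            for i in range(dim)]

def decode(z_u, z_v):
    """Parameter-free decoder: dot-product similarity of two embeddings."""
    return sum(a * b for a, b in zip(z_u, z_v))

z_u = encode([[1.0, 0.0], [1.0, 2.0]])   # -> [1.0, 1.0]
z_v = encode([[0.0, 2.0]])               # -> [0.0, 2.0]
score = decode(z_u, z_v)                 # -> 2.0
```

With a parameter-free decoder like this, as noted above, training only searches the encoder's parameter space.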
Although GNNs can also be applied to other tasks, in this paper, our discussion covers the above two categories of applications.
1.2. Multimodal embedding spaces
Figure 1(a) illustrates a toy social network example where game players have preferences over genres of games and the edges represent friendship. As shown in Figure 1(a), users 0 and 1 established a friendship connection through their common interest in racing games, while users 5 and 6 are friends because of strategy games.

Multiple embedding spaces Our goal is to learn multiple encoders on $G$ such that the similarity between nodes 0 and 1 is high in one embedding space, whereas the similarity between nodes 5 and 6 is high in another, because these pairs are close in the social network via different "semantics". As shown in Figure 1(b), we seek to learn two projections such that their embedding similarities are guaranteed in different spaces.

Node decoder A node decoder is used in node classification or recommendation tasks. The objective is to learn an encoder (and decoder) for node-level labels. In the multi-embedding setup, we consider node decoders of the form $\hat{y}_u = \mathrm{DEC}\big(z_u^{(1)}, \dots, z_u^{(K)}\big)$, which combine the embeddings of $u$ from all $K$ spaces.
Inter-node proximity is determined by multiple embedding spaces. In a supervised link prediction setup, the final prediction is based on decoders that "combine" the similarities between two nodes in multiple embedding spaces.
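As an illustration of how per-space similarities might be "combined", here is a hedged sketch; the two combination rules (best space vs. weighted sum) are our assumptions for illustration, not the paper's exact decoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def multi_space_similarity(zs_u, zs_v, weights=None):
    """zs_u[k], zs_v[k]: embeddings of the two nodes in space k."""
    sims = [cosine(zu, zv) for zu, zv in zip(zs_u, zs_v)]
    if weights is None:        # unweighted variant: score by the best space
        return max(sims)
    return sum(w * s for w, s in zip(weights, sims))

# u and v are aligned in space 0 but orthogonal in space 1:
zs_u = [[1.0, 0.0], [1.0, 0.0]]
zs_v = [[1.0, 0.0], [0.0, 1.0]]
best = multi_space_similarity(zs_u, zs_v)             # -> 1.0
avg = multi_space_similarity(zs_u, zs_v, [0.5, 0.5])  # -> 0.5
```

Under the "best space" rule, two nodes count as similar if they are close in at least one subspace, matching the toy example's intuition that only one edge semantic needs to hold.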
Unfortunately, as mentioned by Pal et al. (2020), the complex and noisy nature of graph data renders the assumption we made in the toy example, namely the existence of an explicit edge ontology, unrealistic.
2. Related Work
Graph Embedding Models, such as GCN (Kipf and Welling, 2017b), GraphSAGE (Hamilton et al., 2018) and GAT (Veličković et al., 2018), can encode a subgraph to a vector space. However, they map the subgraph to one single embedding space instead of multiple embedding spaces.
Multi-Embedding Models, such as PinnerSage (Pal et al., 2020) and others (Weston et al.), try to learn multimodal embeddings via clustering methods, which is expensive. Moreover, this requires additional empirical inputs regarding the number of clusters, the similarity measure and pretrained high-quality embeddings that presumably capture rich multimodal signals.
Temporal Network Embedding Models, such as Jodie (Kumar et al., 2019), TGAT (Xu et al., 2020) and TGN (Rossi et al., 2020), are designed for dynamic graphs, mapping from the time domain to a continuous differentiable functional domain. Our boosting method can also be applied to dynamic graphs, collaborating with these temporal network embedding models.
3. Present Work
In this work, we present an AdaBoost-based meta learner for GNNs (AdaGNN) that is both model- and task-agnostic. AdaGNN leverages a sequential boosting training paradigm that allows multiple different sub-learners to coexist. Each embedding space ideally preserves unique inter-node similarity information. In this section, we mainly discuss the theoretical rationale behind our intuition regarding the advantages over a single embedding space, using a node-level context as an example (in this work, we only experimented with node recommendation, link prediction and multi-task learning). Moreover, our approach works in a joint training fashion that removes the prerequisite of pretrained embeddings.
3.1. Problem Definition
Taking a static or dynamic graph $G$ as in Eq. 1 as input, we project the neighborhood of each node onto embedding spaces such that the embedding spaces fully encode the labels, as defined below.
Definition 3.1. Given a probability threshold $p$, embedding space $Z$ fully encodes labels $Y$ iff $\forall u \in V: P\big(\mathrm{DEC}(z_u) = y_u\big) > p$.
Inspired by the toy example in Figure 1, we assume that labels are affected by different aspects of neighborhoods. Then we can find an embedding space such that labels affected by one specific aspect are encoded in this embedding space, while labels not affected by this aspect are not, as defined below.
Definition 3.2. Diffusion-induced labels $Y_k$ are a subset of labels $Y$ such that $Y_k$ can be decoded from one embedding space $Z_k$ and $Y \setminus Y_k$ cannot be decoded from this embedding space.

Embedding space $Z_k$ is specifically related to the diffusion-induced labels $Y_k$. Consequently we use $Y_k$ instead of $Y$ as labels in the training of encoder $\mathrm{ENC}_k$ and decoder $\mathrm{DEC}_k$.
We further assume that the labels $Y$ are partitioned by the diffusion-induced labels $\{Y_k\}$. Then the graph learning problem is modified into multi-embedding learning as follows:

Problem Given a graph $G$, learn a set of embedding spaces $\{Z_k\}$ and corresponding encoders and decoders by reducing the discrepancy between the decoder outputs and the diffusion-induced labels $\{Y_k\}$.
3.2. Problem Context
In this section, node-level labels are used as an example to justify the intuition behind multimodal embedding spaces. Other cases are easy to replicate, such as link prediction, where the label $y_{uv}$ denotes edge existence. We assume that there exists an ideal embedding space such that node embeddings encode all the latent signals:
Definition 3.3. Embedding space $Z^*$ is an ideal embedding space on the whole graph iff it fully encodes all labels:

(2) $\forall u \in V: P\big(\mathrm{DEC}(z_u) = y_u\big) > p,$

where $y_u$ is the node-level vector label, $z_u$ is the node embedding of $u$, and $p$ is a probability threshold.
Assume the vector space $Z^*$ is spanned by a set of orthonormal basis vectors $\{e_1, \dots, e_D\}$; then the node embeddings are linear combinations of these basis vectors with coefficients $\alpha_{u,i}$:

(3) $z_u = \sum_{i=1}^{D} \alpha_{u,i} e_i.$

We further assume that only a small subset of basis vectors have correlation with the label $y_u$. The idea behind this assumption is that most of the volume of a high-dimensional cube is located in its corners: when $D$ is large, most of the coefficients are close to zero after normalization. Let $B_1$ be the set of basis vectors which have non-trivial correlation with the label; then for some nodes $u$:

(4) $z_u \approx \sum_{e_i \in B_1} \alpha_{u,i} e_i,$

and $B_1$ spans a vector space $Z_1$. The nodes satisfying Eq. 4 are encoded in $Z_1$, and we define them as $V_1$. For the nodes not satisfying Eq. 4, a new vector space $Z_2$ and set $V_2$ can be obtained using the above procedure. Note that the intersection between $V_1$ and $V_2$ is not necessarily empty. Repeating until all nodes are encoded, a set of vector spaces is obtained:

(5) $\{Z_1, Z_2, \dots, Z_K\},$

(6) $V = V_1 \cup V_2 \cup \dots \cup V_K,$

and

(7) $\forall k, \forall u \in V_k: P\big(\mathrm{DEC}(z_u^{(k)}) = y_u\big) > p,$

where $z_u^{(k)}$ is the node embedding in embedding space $Z_k$.
In existing works, encoders and decoders parameterize the single space $Z^*$ using neural networks. In this paper we introduce a new approach: modeling the collection of spaces $\{Z_k\}$ and encoders $\{\mathrm{ENC}_k\}$. When $D$ is large, the new approach obtains a large advantage, and a new embedding space can always be found to increase the overall accuracy before the ideal embedding space is achieved. Moreover, increasing the number of embedding spaces can bound the generalization gap, which is discussed in Section 3.5.
3.3. Multiple Embeddings Generation
Inspired by the above procedure for finding new embedding spaces and the weight-updating ability of boosting, we adopt AdaBoost as the meta learner with homogeneous GNNs as weak learners to encode the underlying relations from weighted labels, which circumvents the difficulties of clustering methods (Pal et al., 2020). Each learner has a GNN encoder that projects neighborhoods onto one embedding space $Z_k$. From a high-level perspective, we learn such embedding projections in an iterative fashion: each training data point is associated with a weight based on the previous weak learner's (GNN) error, and this weight encourages the next weak learner to focus on the data points with labels misclassified by the previous learner.
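The iterative re-weighting paradigm can be sketched as follows. `fit_weak_learner` is a hypothetical stand-in for training one GNN on weighted examples, and the fixed up-weighting factor is illustrative only; the actual coefficient in AdaBoost/SAMME is derived from the weighted error.

```python
def boost(examples, labels, fit_weak_learner, n_learners):
    """fit_weak_learner(examples, labels, w) must return a predict function."""
    n = len(examples)
    w = [1.0 / n] * n                    # uniform initial weights
    learners = []
    for _ in range(n_learners):
        f = fit_weak_learner(examples, labels, w)
        miss = [f(x) != y for x, y in zip(examples, labels)]
        if not any(miss):                # nothing left to re-weight
            learners.append(f)
            break
        # Up-weight misclassified points so the next learner focuses on them
        # (factor 2 is illustrative, not the SAMME coefficient), renormalize.
        w = [wi * (2.0 if m else 1.0) for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]
        learners.append(f)
    return learners, w
```

Each pass corresponds to fitting one encoder/embedding space; the re-weighted labels play the role of the "artificial" diffusion-induced labels discussed below.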
Lemma 3.4. Given a learner with corresponding embedding space $Z_k$, there exists a new embedding space capturing information different from the existing embedding space, until the number of misclassified labels is zero.
Proof. For this learner and embedding space $Z_k$, if the number of misclassified labels is non-zero, there exists at least one misclassified label. The relation in this label is not preserved, so there exists a new embedding space preserving this relation, by the procedure in Section 3.2. Note that the new embedding space may not be able to encode all the labels classified correctly by the previous learner. ∎
The "artificial" diffusion-induced labels are created by the boosting weights. For link prediction, suppose the number of misclassified edges is non-zero. We could generate new diffusion-induced labels including only the misclassified edges by setting the weights of correctly classified edges to zero. In experiments, we instead increase the weights of misclassified edges until at least one previously misclassified edge is classified correctly by the next weak learner. If this cannot be achieved, boosting stops.
Lemma 3.5. The embedding spaces of the current and next learner constructed as above capture different information.
Proof. For the current learner, if its training error is non-zero (otherwise we stop boosting), there exists at least one misclassified label. We then increase the weight of this data point for the next learner so that the label will be classified correctly. The next embedding space preserves the latent relation that affects this label, while the current one does not. ∎
Therefore, there will always exist a new embedding space capturing information different from the original embedding space until the number of misclassified labels is zero, and such embedding spaces can be found by increasing the weights of misclassified labels.
What the optimal weight-updating rule is remains open, and it depends on the definition of optimality. The most common definition is to achieve the lowest misclassification error rate. Usually it is assumed that the training data are i.i.d. samples from an unknown probability distribution; we can then derive boosting algorithms for different misclassification error rates. Zhu et al. (2006) proposed two algorithms, SAMME and SAMME.R, and proved that they minimize the misclassification error rate for discrete predictions and real-valued confidence-rated predictions respectively. More generally, one can use gradient boosting, which is left for future work.
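For reference, SAMME (Zhu et al., 2006) assigns each weak learner a coefficient computed from its weighted error $\mathrm{err}$ on a $K$-class problem, $\alpha = \log\frac{1 - \mathrm{err}}{\mathrm{err}} + \log(K - 1)$; a minimal sketch:

```python
import math

def samme_alpha(err, n_classes):
    """SAMME learner weight: positive iff the weak learner beats random
    guessing, i.e. err < 1 - 1/K."""
    return math.log((1.0 - err) / err) + math.log(n_classes - 1)

samme_alpha(0.4, 2)   # > 0: better than random for K = 2
samme_alpha(0.5, 2)   # = 0: exactly random for K = 2, contributes nothing
samme_alpha(0.5, 3)   # > 0: for K = 3, random guessing has error 2/3
```

The $\log(K-1)$ term is what lets SAMME accept weak learners with error up to $1 - 1/K$ rather than $1/2$; SAMME.R replaces the hard predictions with class-probability estimates.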
Lemma 3.6. For AdaBoost, adding a new learner with weights as in Algorithm 1 will minimize the misclassification error.
Proof. Zhu et al. (2006) proved this for the SAMME and SAMME.R algorithms using a multi-class exponential loss function and forward stagewise additive modeling. ∎
Theorem 3.7. The set of embedding spaces in the problem definition can be found using boosting, and boosting minimizes the misclassification error.
3.4. Graph Neural Network
GNNs use the graph structure, node features and edge features to learn a node embedding $h_u$ for a node $u$. Modern GNNs follow a neighborhood aggregation strategy, where we iteratively update the representation of a node by aggregating representations of its neighbors. After $k$ iterations of aggregation, a node's representation captures the structural information within its $k$-hop network neighborhood. Formally, the $k$-th layer of a GNN is

$m_u^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{h_v^{(k-1)} : v \in \mathcal{N}(u)\}\big), \quad h_u^{(k)} = \mathrm{COMBINE}^{(k)}\big(h_u^{(k-1)}, m_u^{(k)}\big),$

where $h_u^{(k)}$ is the node embedding of $u$ with the messages of its $k$-hop network neighbors, $m_u^{(k)}$ is the message aggregated from the $k$-hop neighbors, and we initialize $h_u^{(0)} = x_u$. In this work, an attention mechanism or mean pooling is used to aggregate the neighbor representations. Aggregated messages are further combined with self embeddings using fully connected layers. We drop superscripts and use $h_u$ to represent node embeddings in the following discussion.
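The aggregate-then-combine recursion can be sketched with a mean-pool layer; this is pure Python for clarity, with the trainable weight matrices and non-linearities of a real GNN layer omitted.

```python
def gnn_layer(h, adj):
    """One mean-pool message-passing layer.
    h: dict node -> embedding (list of floats) from the previous layer.
    adj: dict node -> list of neighbor ids."""
    new_h = {}
    for u, nbrs in adj.items():
        dim = len(h[u])
        # AGGREGATE: mean of neighbor embeddings from the previous layer
        msg = [sum(h[v][i] for v in nbrs) / len(nbrs) for i in range(dim)]
        # COMBINE: here a simple average of the self embedding and the message
        # (a real layer would apply learned linear maps here)
        new_h[u] = [(h[u][i] + msg[i]) / 2.0 for i in range(dim)]
    return new_h

h0 = {0: [1.0], 1: [3.0], 2: [5.0]}          # h^(0) = raw node features
adj = {0: [1, 2], 1: [0], 2: [0]}
h1 = gnn_layer(h0, adj)                      # h1[0] = (1 + (3+5)/2) / 2 = 2.5
```

Stacking $k$ such layers gives each node a receptive field of its $k$-hop neighborhood, as stated above.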
Following the encoder-decoder framework, the node neighborhood is projected to a vector space which is then decoded for different tasks: a) Link Prediction, where edge existence is predicted using the similarity between two node embeddings; b) Node Recommendation, where the node-level labels are predicted using only the information of neighbors, excluding the nodes themselves; c) Multi-Task Learning, where link prediction and node recommendation are trained together using the same encoder.
For the link prediction task, we have

$\mathcal{L}_{lp} = \mathrm{BCE}_w\big(\mathrm{sim}(h_u, h_v), y_{uv}\big),$

where $\mathrm{sim}(h_u, h_v)$ is the similarity between two node embeddings, the label $y_{uv}$ represents edge existence, and $\mathrm{BCE}_w$ is the binary cross-entropy loss function with corresponding edge weights $w_{uv}$. Negative sampling is included.
For the node recommendation task, we have

$\mathcal{L}_{rec} = \mathrm{BCE}_w\big(\mathrm{DEC}(h_u), y_u\big),$

where $y_u$ is the vector label and $w_u$ are node weights. Note that $w_{uv}$ and $w_u$ refer to weights in different tasks and thus are not related.
We next introduce AdaGNN, using the link prediction task as an example. AdaGNN includes a series of GNNs, called weak learners. Each weak learner is trained on weighted sample points, and the weights used by the present learner depend on the training errors of the previous learner.
Mathematically, the $m$-th weak learner produces a similarity score

$s_m(u, v) = \mathrm{sim}\big(h_u^{(m)}, h_v^{(m)}\big),$

where $h_u^{(m)}$ is the embedding of $u$ in space $Z_m$. Given $M$ weak learners, the similarity between two nodes for the meta-learner is

$s(u, v) = \sum_{m=1}^{M} \beta_m s_m(u, v),$

where $\beta_m \ge 0$ is the boosting coefficient of learner $m$.
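The meta-learner's combination can be sketched as a normalized weighted sum of per-learner similarities (notation ours; a SAMME.R-style combiner would additionally transform the per-learner scores).

```python
def meta_similarity(sims, betas):
    """sims[m]: similarity of (u, v) from weak learner m.
    betas[m]: non-negative boosting coefficient of learner m."""
    total = sum(betas)
    return sum(b * s for b, s in zip(betas, sims)) / total

# Learner 0 (weight 2.0) says the pair is similar, learner 1 (weight 1.0)
# says it is not; the combined score leans toward learner 0:
combined = meta_similarity([0.9, 0.2], [2.0, 1.0])   # (1.8 + 0.2) / 3
```

A pair of nodes can thus score highly overall even if only the high-weight spaces consider them similar.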
3.5. Generalization Error of AdaGNN
The calculation of the generalization error of AdaGNN is mainly based on the generalization error of boosting from Schapire et al. (Bartlett et al., 1998) and the VC dimension of GNNs from Scarselli et al. (2018). We start with definitions of the functional space of boosted GNNs.
Definition 3.8. Let $\mathcal{H}$ be the functional space of GNN encoders from the graph to embedding spaces, $f_1, \dots, f_M \in \mathcal{H}$ be the encoders and $\mathrm{DEC}$ be the uniform decoder shared by all encoders. We define $\mathcal{C}$ as the set of weighted averages of weak learners from $\mathcal{H}$:

$\mathcal{C} = \Big\{ f = \sum_{m=1}^{M} a_m f_m : a_m \ge 0,\ \sum_{m=1}^{M} a_m = 1 \Big\},$

and $\mathcal{C}_N$ as the set of unweighted averages over $N$ weak learners from $\mathcal{H}$:

$\mathcal{C}_N = \Big\{ g = \frac{1}{N} \sum_{j=1}^{N} g_j : g_j \in \mathcal{H} \Big\}.$
Any projection $f \in \mathcal{C}$ can be associated with a distribution over $\mathcal{H}$ defined by the coefficients $\{a_m\}$: we draw function $f_m$ with probability $a_m$. By choosing $N$ elements of $\mathcal{H}$ independently at random according to this distribution and taking an unweighted average, we can generate an element of $\mathcal{C}_N$. Under this construction, we map each $f \in \mathcal{C}$ to a distribution $Q(f)$ over $\mathcal{C}_N$: a function $g$ distributed according to $Q(f)$ can be sampled by choosing $g_1, \dots, g_N$ independently at random according to the coefficients and then defining $g = \frac{1}{N} \sum_{j=1}^{N} g_j$.

The key property of the relationship between $f$ and $Q(f)$ is that each completely determines the other. Obviously $Q(f)$ is determined by $f$ because we defined it that way, but $f$ is also completely determined by $Q(f)$:

(8) $f(x) = \mathbb{E}_{g \sim Q(f)}\big[g(x)\big].$
Theorem 3.9. Let $\mathcal{D}$ be a distribution over labeled node pairs, and let $S$ be a sample of $n$ node pairs chosen independently at random according to $\mathcal{D}$. Suppose the base-classifier space $\mathcal{H}$ has VC-dimension $d$, and let $\delta > 0$. Assume that $n \ge d \ge 1$. Then with probability at least $1 - \delta$ over the random choice of the training set $S$, every weighted average projection $f \in \mathcal{C}$ satisfies the following bound for all $\theta > 0$:

(9) $P_{\mathcal{D}}\big[y f(x) \le 0\big] \le P_{S}\big[y f(x) \le \theta\big] + O\left( \frac{1}{\sqrt{n}} \left( \frac{d \log^2(n/d)}{\theta^2} + \log(1/\delta) \right)^{1/2} \right).$
Proof. Using the function $g$ constructed as above, the proof follows the same idea as in Schapire et al. (Bartlett et al., 1998). ∎
We have shown that the generalization error bound of AdaGNN depends on the number of training data points and the VC-dimension of the GNN. For one-layer GNNs such as GCN and GraphSage, the VC-dimension has been calculated by Scarselli et al. (2018). The generalization error is plotted and discussed in Section 4.3.2.
3.6. Training
A single iteration of training in AdaGNN involves two parts: a) training a weak learner; b) boosting the label weights. For weak learners, we use different types of GNNs, including GraphSage and GAT for static graphs and TGN for dynamic graphs. For boosting, we use two different algorithms: a) the SAMME.R (R for Real) algorithm (Zhu et al., 2006), which uses weighted class probability estimates rather than hard classifications in the weight updating and prediction combination, leading to better generalization and faster convergence; b) the AdaBoost.R2 algorithm (Drucker, 1997), which uses bootstrapping, making it less prone to overfitting. The boosting algorithm using SAMME.R for link prediction is shown in Algorithm 1; AdaBoost.R2 and node-level tasks are similar. We focus on the following variations of AdaGNN, based on different combinations of underlying model, boosting algorithm and decoder type:

AdaGNN AdaBoost-based GNN, such as AdaSage and AdaGAT, using the SAMME.R algorithm.

AdaGNNb AdaBoost-based GNN with bootstrapped training data, using the AdaBoost.R2 algorithm.

AdaGNNnn AdaBoost-based GNN with a uniform non-linear decoder. We concatenate the node embeddings from all embedding spaces to form the new node embeddings and feed these new embeddings to the uniform decoder.
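The AdaGNNnn variant's concatenate-then-decode scheme can be sketched as follows; the linear scorer below is a toy stand-in for the non-linear uniform decoder, and all names are illustrative.

```python
def concat_embeddings(spaces):
    """spaces[k]: embedding of one node in space k -> one long vector."""
    out = []
    for z in spaces:
        out.extend(z)
    return out

def uniform_decoder(z, weights, bias=0.0):
    """Toy shared decoder over the concatenated embedding (a real AdaGNNnn
    decoder would be a non-linear network, e.g. an MLP)."""
    return bias + sum(w * x for w, x in zip(weights, z))

z = concat_embeddings([[1.0, 0.0], [0.5, 0.5]])   # -> [1.0, 0.0, 0.5, 0.5]
score = uniform_decoder(z, [1.0, 1.0, 1.0, 1.0])  # -> 2.0
```

Because the decoder sees all subspaces at once, it can learn its own combination of the per-space signals rather than relying on the fixed boosting coefficients.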
4. Experiments
We test AdaGNN on three tasks and four real-world social networks from diverse application backgrounds.

Twitch social networks (Rozemberczki et al., 2019) User-user networks where nodes correspond to Twitch users and edges to mutual friendships. Node features are games liked. The associated tasks are link prediction of whether two users have mutual friendships, node recommendation of games each user likes, and multi-task learning.

Wikipedia (Kumar et al., 2019) User-page networks where nodes correspond to Wikipedia users and Wikipedia pages, and edges to one user editing one page. The associated task is future link prediction of whether one user will edit one page in the future.

Movielens (Harper and Konstan, 2015) User-movie networks where nodes correspond to Movielens users and movies, and edges to one user rating one movie. The associated task is future link prediction of whether one user will rate one movie in the future.

Linkedin User-user networks where nodes correspond to LinkedIn users and edges to mutual friendships. The associated task is link prediction of whether two users have mutual friendships. The future link prediction task is left for future work.
We also consider both transductive and inductive tasks w.r.t. whether nodes are observed in the training dataset; however, it is worth noting that both our baseline models and the AdaGNN variations are inductive in nature. For baseline models, we consider a few popular and representative state-of-the-art models for static graphs, GraphSage (Hamilton et al., 2018) and GAT (Veličković et al., 2018), as well as the state-of-the-art model for dynamic graphs, TGN (Rossi et al., 2020). Comparing performance with the baselines in Section 4.2, AdaGNN experimentally shows the following advantages:

AdaGNN outperforms the baselines on all datasets for all tasks, especially when the information of neighborhoods is rich.

Multiple embedding spaces can capture different information, outperforming a single embedding space with higher dimension.

AdaGNN is robust to the amount of training data.
4.1. Experimental Settings
In addition to testing link prediction and node classification separately, we also experiment on multi-task learning (as shown in Figure 2). Each training data point's weight is updated based on the final combined prediction error, which makes the errors of the two tasks comparable and gives high-degree nodes, which are susceptible to high errors, relatively higher weights in node recommendation tasks. The scalability and time complexity of AdaGNN are highly coupled with the underlying model; the time complexity of AdaGNN grows linearly with the number of weak learners. We also find that a small number of weak learners suffices for all datasets.
Hyperparameters For the training of GNNs, including GraphSage, GAT and TGN, we use the Adam optimizer with a fixed batch size. The learning rate, the number of attention heads and the number of sampled neighbors are selected from small grids, and the number of layers is fixed; the boosting learning rate is selected by grid search as well. The number of negative samples is equal to the number of positive samples, except for the Linkedin dataset, where we have more positive samples. We do a simple grid hyperparameter tuning for both AdaGNN and the baseline models; the best performance metrics are reported in the next section.
4.2. Performance Comparison with Baselines
Table 2. Future link prediction on the Wikipedia and Movielens datasets: average precision for static graph methods (GraphSage, GAT, AdaSage, AdaGATb, AdaGAT, AdaGATn) and dynamic graph methods (TGN, AdaTGN).
Table 3. Multi-task learning (link prediction and recommendation) on the Twitch dataset: GraphSage, GAT200, GAT1024, GAT3170, AdaSage, AdaGAT200, AdaGAT1024, AdaGAT3170, AdaGATb, AdaGATn200, AdaGATn1024, AdaGATn3170.
Table 2 presents the results for future link prediction on dynamic graphs. AdaGNN outperforms the baselines in both transductive (2nd, 4th, 6th columns) and inductive settings (3rd, 5th columns). One interesting model is AdaSage: for this model, we only aggregate information from one-hop neighbors. Its time complexity is very low compared to expensive dynamic graph methods such as TGN, which needs an RNN to update the memory at each training step. This simple model with boosting still achieves comparable accuracy, which suggests that one can use boosting-based shallow models to reach competitive accuracy with much shorter training time and much smaller GPU memory.
Noticeably, AdaGNN performs better on the Movielens dataset than on the Wikipedia dataset. The reason is that AdaGNN can utilize users' non-repetitive interaction behavior. In the Wikipedia dataset, 69% of users keep editing the same page over the whole time domain, and 84% of users consecutively edit the same page. With 72% probability, there is only one unique neighbor in a node neighborhood for a fixed sample size. In this case, the information of node neighborhoods is not very rich, and projecting this single neighbor onto multiple embedding spaces does not bring a dramatic improvement. The Movielens dataset, on the other hand, has 0% of users rating the same movie over the whole time domain, and there is always more than one neighbor in node neighborhoods. In this case, the information of neighborhoods is rich and high-dimensional, so projecting neighborhoods onto multiple embedding spaces gives a large improvement. The difference between the Wikipedia and Movielens datasets tells us that the idea of multiple embedding spaces in general performs better when the information of node neighborhoods is rich.
Table 3 presents the results on multi-task learning, including link prediction and node recommendation, in transductive (2nd, 4th columns) and inductive settings (3rd, 5th columns). GAT200, GAT1024 and GAT3170 represent GAT with embedding dimensions 200, 1024 and 3170 respectively. AdaGNN clearly outperforms the baselines by a large margin in both transductive and inductive settings for all dimensions, especially on the recommendation task.
4.3. Model Analysis
In this section, we further verify the multiple embedding spaces experimentally by addressing the following questions:

Do multiple low-dimensional embedding spaces perform better than one single high-dimensional embedding space, even when the dimension of the single space is equal to the sum of the weak learners' dimensions?

Does increasing the number of embedding spaces bound the generalization error as we predict theoretically?

Do multiple embedding spaces capture different information in real experiments?

Is AdaGNN robust to limited training data?
4.3.1. HighDimensional Embedding Space
As mentioned before, the high-dimensional information of node neighborhoods can be projected onto either one single high-dimensional embedding space or multiple low-dimensional embedding spaces. In Figure 3, we compare the two by varying the embedding dimension from around 200 to 1800 and computing the average precision of the static baselines and AdaGAT on the future link prediction task of the Movielens dataset. The effect on other tasks and datasets is similar. We observe that multiple embedding spaces outperform one single embedding space for all embedding dimensions. Furthermore, AdaGAT even outperforms the baselines when the embedding dimension of the baselines and the sum of all subspace dimensions are almost equal. In these cases, it is difficult to find one single embedding space such that all pairs of linked nodes are close to each other, and it is better to use multiple embedding spaces.
4.3.2. Generalization error
We use margin theory to analyze the generalization errors of AdaGNN. Margins can be considered a measure of confidence. Here we define the margin as $(2y - 1)(s - t)$, where $y \in \{0, 1\}$ represents whether the edge exists, $s$ is the predicted similarity between two nodes and $t$ is the decision threshold. The margin is negative when the prediction is incorrect and positive otherwise. Similar to (Scarselli et al., 2018), the margin reaches its minimum when the model predicts incorrectly with high confidence, its maximum when the model predicts correctly with high confidence, and is close to zero when the prediction has low confidence. Therefore, the margin is a measure of both correctness and confidence. When the prediction is based on a clear and substantial majority of the base classifiers, the margin will be large and positive, corresponding to greater confidence in the predicted labels.
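A sketch of such a signed margin (symbol choices are ours): with label y in {0, 1}, predicted similarity s, and threshold t, the quantity (2y − 1)(s − t) is negative for wrong predictions, positive for correct ones, and near zero at low confidence.

```python
def margin(y, s, t=0.5):
    """Signed margin of one prediction: y in {0, 1} is the true edge label,
    s the predicted similarity, t the decision threshold."""
    return (2 * y - 1) * (s - t)

margin(1, 0.95)   # confident and correct -> about +0.45
margin(0, 0.95)   # confident and wrong   -> about -0.45
margin(1, 0.52)   # low confidence        -> about +0.02
```

Plotting the distribution of these per-example margins over the training set gives the margin histograms analyzed below.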
In Section 3.5, we proved a generalization error bound when the training and test errors are measured using the margin metric. Therefore, under the margin metric, we can bound the test error using the training error and the generalization error bound.
We visualize the effect of boosting on the margins by plotting their distribution, as in Figure 4. The first two rows show the average precision scores and errors for different numbers of weak learners on the Twitch, Wikipedia and Movielens datasets respectively. The last row contains the margin distribution plots. We observe that boosting aggressively pushes up the margins of training data with small or negative margins, so the number of training points with small or negative margins decreases, which leads to smaller training and testing errors. The generalization error bound in Eq. 9, in terms of the number of samples and the VC-dimension, has no explicit dependence on the number of weak learners. In the second row of Figure 4, the observed generalization error indeed stays roughly constant with respect to the number of weak learners.
4.3.3. Embedding Visualization
We want each weak learner to project the node neighborhood onto a different embedding space, capturing different similarities. To visualize this, we plot node embeddings in different spaces, as in Figure 5, using t-SNE (van der Maaten and Hinton, 2008) to embed the high-dimensional vectors in 2D. In Figure 5, nodes 1-7 are the neighbors of node 0, and we plot their embeddings in four different embedding spaces. In embedding space 1, node 7 is closest to node 0 and node 1 is farthest from node 0, while in embedding space 4, node 1 is closest to node 0 and node 7 is farthest. Similarly, node 6 is closest to node 0 in embedding space 2 while it is far from node 0 in embedding space 3. From these observations, we can see that for the same pair of nodes, their similarity differs from space to space. This experimentally verifies that each embedding space captures different information and preserves different similarities about node neighborhoods.
Combining the results in Figures 3 and 5, we can experimentally validate the necessity of multiple embedding spaces. In some cases, it is hard to find one single embedding space in which two linked nodes are always close to each other. Instead, we can use multiple embedding spaces and only require that part of the linked node pairs are close to each other in each embedding space, which is easier to achieve. We can then combine the similarities across all embedding spaces to predict the existence of edges.
4.3.4. Robustness to Limited Training Data
One challenge of learning on social networks is the lack of high-quality data. Therefore, an ideal model should be efficient in leveraging limited training data. In this experiment, we validate the robustness of AdaGAT to limited training data in Figure 6, varying the ratio of training data on the Movielens dataset while keeping the numbers of validation and testing data points fixed. AdaGAT consistently outperforms the baselines.
5. Conclusion
In this work, we introduce a novel approach that automatically projects node neighborhoods onto multiple low-dimensional embedding spaces using boosting, and we develop AdaGNN based on this approach. We theoretically and experimentally analyze the effectiveness and robustness of AdaGNN, especially the advantages of multiple embedding spaces over a single embedding space. We demonstrate that AdaGNN achieves strong performance when the information of node neighborhoods is rich and the dimension of the ideal embedding space is large. Our work envisions novel applications of multiple embedding spaces and boosting in graph neural networks, and opens up a direction based on preserving the similarities between nodes in different spaces in the field of social networks. It also leaves several questions for future work: how to further reduce the information leakage between embedding spaces, how to further narrow down the focus of each embedding space, and how to measure the richness of node neighborhoods.
References
 R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics 26 (5), pp. 1651–1686.
 H. Drucker (1997). Improving regressors using boosting techniques. In Proc. of ICML.
 A. Grover and J. Leskovec (2016). node2vec: scalable feature learning for networks. In Proc. of KDD, pp. 855–864.
 W. L. Hamilton, R. Ying, and J. Leskovec (2017). Inductive representation learning on large graphs. In Proc. of NIPS, pp. 1024–1034.
 F. M. Harper and J. A. Konstan (2015). The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5 (4).
 T. N. Kipf and M. Welling (2017). Semi-supervised classification with graph convolutional networks. In Proc. of ICLR.
 S. Kumar, X. Zhang, and J. Leskovec (2019). Predicting dynamic embedding trajectory in temporal interaction networks. In Proc. of KDD, pp. 1269–1278.
 A. Lerer et al. (2019). PyTorch-BigGraph: a large-scale graph embedding system. arXiv:1903.12287.
 L. Ma et al. (2018). Towards efficient large-scale graph neural network computing. arXiv:1810.08403.
 A. Pal et al. (2020). PinnerSage. In Proc. of KDD.
 B. Perozzi, R. Al-Rfou, and S. Skiena (2014). DeepWalk: online learning of social representations. In Proc. of KDD, pp. 701–710.
 E. Rossi et al. (2020). Temporal graph networks for deep learning on dynamic graphs. arXiv:2006.10637.
 B. Rozemberczki, C. Allen, and R. Sarkar (2019). Multi-scale attributed node embedding. arXiv:1909.13021.
 F. Scarselli, A. C. Tsoi, and M. Hagenbuchner (2018). The Vapnik–Chervonenkis dimension of graph and recursive neural networks. Neural Networks 108, pp. 248–259.
 Shen et al. (2021). NPI-GNN: predicting ncRNA–protein interactions with deep graph neural networks. Briefings in Bioinformatics.
 L. van der Maaten and G. Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9 (86), pp. 2579–2605.
 P. Veličković et al. (2018). Graph attention networks. In Proc. of ICLR.
 J. Wang et al. (2018). Billion-scale commodity embedding for e-commerce recommendation in Alibaba. In Proc. of KDD, pp. 839–848.
 J. Weston, R. J. Weiss, and H. Yee (2013). Nonlinear latent factorization by embedding multiple user interests [extended abstract]. In Proc. of RecSys.
 Z. Wu et al. (2019). A comprehensive survey on graph neural networks. arXiv:1901.00596.
 D. Xu et al. (2020). Inductive representation learning on temporal graphs. arXiv:2002.07962.
 K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019). How powerful are graph neural networks? In Proc. of ICLR.
 Yang et al. (2019). Relation learning on social networks with multi-modal graph edge variational autoencoders. arXiv:1911.05465.
 R. Ying et al. (2018). Graph convolutional neural networks for web-scale recommender systems. In Proc. of KDD, pp. 974–983.
 J. You, R. Ying, and J. Leskovec (2019). Position-aware graph neural networks. In Proc. of ICML, pp. 7134–7143.
 J. Zhu, H. Zou, S. Rosset, and T. Hastie (2009). Multi-class AdaBoost. Statistics and Its Interface 2.
 Zhu et al. (2019). AliGraph: a comprehensive graph neural network platform. arXiv:1902.08730.