# AdaGNN: A multi-modal latent representation meta-learner for GNNs based on AdaBoosting

As a subfield of deep learning, Graph Neural Networks (GNNs) focus on extracting intrinsic network features and have gained great popularity in both academia and industry. Most state-of-the-art GNN models offer expressive, robust, scalable and inductive solutions, empowering social network recommender systems with rich network features that are computationally difficult to leverage with graph-traversal-based methods. Most recent GNNs follow an encoder-decoder paradigm to encode high-dimensional heterogeneous information from a subgraph onto one low-dimensional embedding space. However, a single embedding space usually fails to capture all aspects of graph signals. In this work, we propose a boosting-based meta learner for GNNs, which automatically learns multiple projections and the corresponding embedding spaces that capture different aspects of the graph signals. As a result, similarities between subgraphs are quantified by embedding proximity in multiple embedding spaces. AdaGNN performs exceptionally well for applications with rich and diverse node neighborhood information. Moreover, AdaGNN is compatible with any inductive GNN for both node-level and edge-level tasks.


## 1. Introduction

Graph representation learning has advanced greatly in recent years and has drawn attention in both academia and industry because GNNs are expressive, flexible, robust and scalable.

Many recent GNN models learn the projections from node neighborhoods using different sampling, aggregation and transformations (Grover and Leskovec, 2016; Perozzi et al., 2014; Kipf and Welling, 2017a; Hamilton et al., 2017; Xu et al., 2019; You et al., 2019; Rossi et al., 2020). GNNs have been adopted in various academic and industrial settings, such as link prediction (Ying et al., 2018b), protein-protein interaction (Shen et al., 2021), community detection (Wu et al., 2019), and recommender systems (Wang et al., 2018; Ying et al., 2018a). Furthermore, Zhu et al. (2019); Ying et al. (2018a); Lerer et al. (2019); Ma et al. (2018) develop fault-tolerant and distributed systems to apply graph neural networks (GNNs) to large graphs.

These GNN models focus on learning a single encoder that projects graph substructures to a representation embedding. However, low-dimensional node embeddings sometimes fail to capture all the high-dimensional information about node neighborhoods. For example, in a social network, users may connect to distinct neighbors who share different common interests, and thus the semantic meaning of edges varies (Yang et al., 2019). Most prior works aim to capture the diverse signals of node neighborhoods by increasing the embedding dimension. However, in many cases it is challenging to encode the local multi-modal information in one embedding space. Multi-head attention bridges the gap to some extent by learning different attention heads for different neighbors (Veličković et al., 2018). One promising alternative is to learn multiple representations, where each embedding captures a specific aspect of the rich information about node neighborhoods. Projecting onto multiple low-dimensional embedding spaces is especially promising when the inter-node affinity correlates with the inter-embedding similarity in one or more subspaces rather than in the entire space.

Network machine learning applications usually consider exceptionally heterogeneous features from nodes, topological signals, and various neighborhoods and communities, where an ensemble of latent semantics underlies the network features (Yang et al., 2019). Here we propose boosting-based GNNs, which automatically learn projections to multiple low-dimensional embedding spaces from the high-dimensional graph contents and determine the focus of each embedding space according to the node neighborhoods.

### 1.1. Preliminaries

Static Graph. Although the concrete formulation of graph problems varies, we use a quadruplet

$$G = \left(V, E, \{\mathbf{v}_i \in \mathbb{R}^{d_V}\}, \{\mathbf{e}_{i,j} \in \mathbb{R}^{d_E}\}\right) \tag{1}$$

to denote the holistic information about a graph, where $V$ and $E$ are the sets of all nodes and edges respectively. They are endowed with node features $\mathbf{v}_i$ and edge features $\mathbf{e}_{i,j}$.

Dynamic Graph. Recent works have investigated representation learning on dynamic graphs and the corresponding variations of GNNs (Rossi et al., 2020; Kumar et al., 2019), which consider both the evolution of $V$ and $E$ in terms of cardinality and the updates of the features $\mathbf{v}_i$ and $\mathbf{e}_{i,j}$. For simplicity, we assume $V$ is the universe of all nodes that ever exist during the trajectory of the dynamic network, and that features are timestamped, $\mathbf{v}_i(t)$ and $\mathbf{e}_{i,j}(t)$.

Embedding Space. An embedding space $Z$ is a vector space onto which we project nodes $v_i \in V$. Let $s(z_i, z_j)$ denote a similarity measure between two embeddings $z_i, z_j \in Z$.

Encoder-Decoder framework. We let $G_i$ denote the neighborhood around node $v_i$, which comprises nodes, edges, node features and edge features. (A widely used definition is the $k$-hop neighborhood, which considers all the nodes and edges within a radius of $k$ edges w.r.t. $v_i$.) Conceptually, an encoder projects nodes to an embedding space, whereas the actual model maps the neighborhood $G_i$ to an embedding $z_i = \mathrm{ENC}(G_i)$. The concrete choices of decoders vary across applications. For example, in node-level supervised learning with labels $y_i$, our goal is to optimize the encoder and decoder

$$\arg\min_{\mathrm{ENC},\,\mathrm{DEC}} \; \mathbb{E}_{v_i \in V}\left[\mathcal{L}\left(y_i, \mathrm{DEC}(\mathrm{ENC}(G_i))\right)\right].$$

For pairwise utility prediction, we are usually given inter-node utility labels $y_{i,j}$, and the goal reduces to searching for the best encoder and decoder

$$\arg\min_{\mathrm{ENC},\,\mathrm{DEC}} \; \mathbb{E}_{(v_i, v_j) \in E}\left[\mathcal{L}\left(y_{i,j}, \mathrm{DEC}(\mathrm{ENC}(G_i), \mathrm{ENC}(G_j))\right)\right].$$

(Sometimes decoders do not contain trainable parameters, e.g. dot product or cosine similarity; the optimization then only searches the encoder parameter space.)

Although GNNs can also be applied to other tasks, in this paper, our discussion covers the above two categories of applications.

### 1.2. Multi-modal embedding spaces

Figure 1(a) illustrates a toy social network in which game players have preferences over game genres and the edges represent friendships. As shown in Figure 1(a), users 0 and 1 established a friendship through their common interest in racing games, while users 5 and 6 are friends because of strategy games.

• Multiple embedding spaces Our goal is to learn multiple encoders on $G$ such that users 0 and 1 are close in one embedding space whereas users 5 and 6 are close in another, because they are close in the social network via different "semantics". As shown in Figure 1(b), we seek to learn two projections such that their embedding similarities are guaranteed in different spaces.

• Node decoder is used in node classification or recommendation tasks. The objective is to learn an encoder (and decoder) for node-level labels. In the multi-embedding setup, we consider node decoders of the following form:

$$\mathrm{DEC}_{node}: \left\{\bigcup_{k=1}^{K} Z_k\right\} \to \mathbb{R}^{d_Y}$$

• Inter-node proximity is determined by multiple embedding spaces. In a supervised link prediction setup, the final prediction is based on decoders that "combine" the similarities between two nodes in multiple embedding spaces:

$$\mathrm{DEC}_{pairwise}: \left\{\bigcup_{k=1}^{K} Z_k\right\} \times \left\{\bigcup_{k=1}^{K} Z_k\right\} \to \mathbb{R}$$

Unfortunately, as noted by Pal et al. (2020), the complex and noisy nature of graph data makes the assumption behind the toy example, namely the existence of an explicit edge ontology, unrealistic.

## 2. Related Work

Graph Embedding Models, such as GCN (Kipf and Welling, 2017b), GraphSAGE (Hamilton et al., 2018) and GAT (Veličković et al., 2018), can encode a subgraph into a vector space. However, they map the subgraph to one single embedding space instead of multiple embedding spaces.

Multi-Embedding Models, such as PinnerSage (Pal et al., 2020) and others (Weston et al., ), try to learn multi-modal embeddings via clustering, which is expensive. Moreover, clustering requires additional empirical inputs: the number of clusters, a similarity measure, and pre-trained high-quality embeddings that presumably capture rich multi-modal signals.

Temporal Network Embedding Models, such as Jodie (Kumar et al., 2019), TGAT (Xu et al., 2020) and TGN (Rossi et al., 2020), are designed for dynamic graphs, mapping from the time domain to a continuous differentiable functional domain. Our boosting method can also be applied to dynamic graphs in combination with these temporal network embedding models.

## 3. Present Work

In this work, we present an AdaBoost-based meta learner for GNNs (AdaGNN) that is both model and task agnostic. AdaGNN leverages a sequential boosting training paradigm that allows multiple different sub-learners to co-exist. Each embedding space ideally preserves unique inter-node similarity information. In this section, we mainly discuss the theoretical rationale behind the advantages over a single embedding space, using a node-level context as an example. (In this work, we only experimented with node recommendation, link prediction and multi-task learning.) Moreover, our approach works in a joint training fashion that reduces the need for pre-trained embeddings.

### 3.1. Problem Definition

Taking a static or dynamic graph $G$ in Eq. 1 as input, we project the neighborhood $G_i$ of each node onto an embedding space $Z$ such that the embedding space fully encodes the labels $Y$, as defined below.

###### Definition 3.1 ().

Given a probability threshold $\tau$, an embedding space $Z$ fully encodes labels $Y$ iff

$$P(Y \mid Z) \geq \tau.$$

Inspired by the toy example in Figure 1, we assume that labels are affected by different aspects of neighborhoods. We can then find an embedding space such that labels affected by one specific aspect are encoded in this embedding space while labels not affected by this aspect are not, as defined below.

###### Definition 3.2 ().

Diffusion induced labels $Y_k$ are a subset of labels $Y$ such that $Y_k$ can be decoded from one embedding space $Z_k$ and $Y \setminus Y_k$ cannot be decoded from this embedding space.

Embedding space $Z_k$ is specifically related to the diffusion induced labels $Y_k$. Consequently, we use $Y_k$ instead of $Y$ as labels when training the encoder $\mathrm{ENC}_k$ and decoder $\mathrm{DEC}_k$.

We further assume that the labels $Y$ are partitioned by the diffusion induced labels $Y_k$. The graph learning problem then becomes a multi-embedding learning problem:

Problem Given a graph $G$, learn a set of embedding spaces $\{Z_k\}_{k=1}^{K}$ and the corresponding encoders and decoders $\{\mathrm{ENC}_k, \mathrm{DEC}_k\}_{k=1}^{K}$ by reducing the discrepancy between decoder outputs and diffusion induced labels $Y_k$.

### 3.2. Problem Context

In this section, node-level labels $y_i$ are used as an example to justify the intuition behind multi-modal embedding spaces. Other cases are easy to replicate, such as using $y_{i,j}$ in link prediction, where $y_{i,j}$ denotes edge existence. We assume that there exists an ideal embedding space $Z_{ideal}$ such that node embeddings encode all the latent signals:

###### Definition 3.3 ().

An embedding space $Z_{ideal}$ is defined as the ideal embedding space on the whole graph iff it fully encodes all labels:

$$\forall v_i \in V, \quad P(y_i \mid z_i) \geq \tau, \tag{2}$$

where $y_i$ represents the node-level vector label, $z_i$ is the node embedding of $v_i$ and $\tau$ is a probability threshold.

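Assume the vector space $Z_{ideal}$ is spanned by a set of orthonormal basis vectors $\{b_n\}_{n=1}^{D}$; then the node embeddings are linear combinations of these basis vectors with coefficients $z_{i,n}$:

$$z_i = \sum_{n=1}^{D} z_{i,n} b_n.$$

In Eq. 2, given node $v_i$, write the node embedding in the orthonormal basis and use Bayes' theorem:

$$P(y_i \mid z_i) = \frac{P\left(y_i, \sum_{n=1}^{D} z_{i,n} b_n\right)}{P\left(\sum_{n=1}^{D} z_{i,n} b_n\right)}. \tag{3}$$

We further assume that only a small subset of basis vectors correlate with label $y_i$. The idea behind this assumption is that most of the volume of a high-dimensional cube is located in its corners. When the dimension $D$ is large, most of the coefficients are close to zero after normalization. Let $\{b_{f(n)}\}_{n=1}^{d}$ be the set of basis vectors which have nontrivial correlation with label $y_i$; then

$$P(y_i \mid z_i) \sim \frac{P\left(y_i, \sum_{n=1}^{d} z_{i,f(n)} b_{f(n)}\right)}{P\left(\sum_{n=1}^{d} z_{i,f(n)} b_{f(n)}\right)} = P\left(y_i \,\Big|\, \sum_{n=1}^{d} z_{i,f(n)} b_{f(n)}\right), \tag{4}$$

and $\{b_{f(n)}\}_{n=1}^{d}$ spans a vector space $Z_1$. Nodes satisfying Eq. 4 are encoded in $Z_1$, and we define their labels as $Y_1$. For nodes not satisfying Eq. 4, a new vector space $Z_2$ and set $Y_2$ can be obtained using the above procedure. Note that the intersection between the sets obtained in different rounds is not necessarily empty. Repeating until all nodes are encoded, we obtain a set of vector spaces $\{Z_k\}_{k=1}^{K}$:

$$Y = \bigcup_{k=1}^{K} Y_k, \tag{5}$$

$$Z_{ideal} = Z_1 \oplus Z_2 \oplus Z_3 \oplus \cdots \oplus Z_K, \tag{6}$$

and

$$\forall v_i \in V \;\; \exists k \;\; \text{s.t.} \;\; y_i \in Y_k, \quad P(y_i \mid z_i^k) \geq \tau, \tag{7}$$

where $z_i^k$ is the node embedding in embedding space $Z_k$.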

In existing works, encoders and decoders parameterize and using neural networks. In this paper we introduce a new approach: modeling and . When , the new approach obtains a large advantage and a new embedding space can always be found to increase the overall accuracy before the ideal embedding space is achieved. Moreover, increasing the number of embedding spaces can bound the generalization gap, which is discussed in Section 3.5.

### 3.3. Multiple Embeddings Generation

Inspired by the above procedure for finding new embedding spaces and by the weight-updating ability of boosting, we adopt AdaBoost as the meta learner with homogeneous GNNs as weak learners to encode underlying relations from weighted labels, which circumvents the difficulties of the clustering method (Pal et al., 2020). Each learner has a GNN encoder $\mathrm{ENC}_k$ that projects $G_i$ onto one embedding space $Z_k$. From a high-level perspective, we learn such embedding projections iteratively: each training data point is associated with a weight based on the error of the previous weak learner (GNN), and this weight encourages the next weak learner to focus on the data points that the previous learner misclassified.

###### Lemma 3.4 ().

Given a learner with corresponding embedding space $Z_k$, there exists a new embedding space $Z_{k+1}$ capturing different information from the existing embedding space, until the number of misclassified labels is zero.

###### Proof.

For this learner and embedding space $Z_k$, if the number of misclassified labels is nonzero, there exists at least one misclassified label. The relation behind this label is not preserved in $Z_k$, so there exists a new embedding space $Z_{k+1}$ preserving this relation, obtained with the procedure in Section 3.2. Note that the new embedding space may not be able to encode all the labels correctly classified by the previous learner. ∎

The "artificial" diffusion induced labels are created by the boosting weights. For link prediction, if the number of misclassified edges is nonzero, we can generate new diffusion induced labels that include only the misclassified edges by setting the weights of correctly classified edges to zero. In experiments, we instead increase the weights of misclassified edges until at least one previously misclassified edge is classified correctly by the next weak learner. If this cannot be achieved, boosting stops.

###### Lemma 3.5 ().

The embedding spaces $Z_k$ and $Z_{k+1}$ of the current and next learners, constructed as above, capture different information.

###### Proof.

For the current learner, if its training error is nonzero (otherwise we stop boosting), there exists at least one misclassified label. We then increase the weight of this data point for the next learner so that the label will be classified correctly. $Z_{k+1}$ preserves the latent relation that affects this label, while $Z_k$ does not. ∎

Therefore, there will always exist a new embedding space capturing different information from the original embedding space until the number of misclassified labels is zero and the embedding spaces can be found by increasing the weights of misclassified labels.

The optimal weight-updating rule remains an open question, and it depends on the definition of optimality. The most common definition is achieving the lowest misclassification error rate. Usually it is assumed that the training data are i.i.d. samples from an unknown probability distribution; the boosting algorithm can then be derived for different misclassification error rates. Zhu et al. (2006) proposed two algorithms, SAMME and SAMME.R, and proved that they minimize the misclassification error rate for discrete predictions and real-valued confidence-rated predictions respectively. More generally, one can use gradient boosting, which is left for future work.

###### Lemma 3.6 ().

For AdaBoost, adding a new learner with weights as in Algorithm 1 will minimize the misclassification error.

###### Proof.

Zhu et al. (2006) proved this for the SAMME and SAMME.R algorithms using a novel multi-class exponential loss function and forward stage-wise additive modeling. ∎

###### Theorem 3.7 ().

The embedding spaces $\{Z_k\}_{k=1}^{K}$ in the problem definition can be found using boosting, and the resulting learner minimizes the misclassification error.

###### Proof.

This follows directly from Lemmas 3.4, 3.5 and 3.6. ∎

### 3.4. Graph Neural Network

GNNs use the graph structure, node features and edge features to learn a node embedding $z_i$ for a node $v_i$. Modern GNNs follow a neighborhood aggregation strategy, where we iteratively update the representation of a node by aggregating the representations of its neighbors. After $l$ iterations of aggregation, a node's representation captures the structural information within its $l$-hop network neighborhood. Formally, the $l$-th layer of a GNN is

$$a_i^{(l)} = \mathrm{AGGREGATE}\left(\left\{z_j^{(l-1)},\; v_j \in N(v_i)\right\}\right), \qquad z_i^{(l)} = \mathrm{COMBINE}\left(z_i^{(l-1)},\; a_i^{(l)}\right),$$

where $z_i^{(l)}$ is the node embedding of $v_i$ carrying the messages of its $l$-hop network neighbors, $a_i^{(l)}$ is the message aggregated from the $l$-hop neighbors, and we initialize $z_i^{(0)}$ with the node features. In this work, an attention mechanism or mean pooling is used to aggregate the neighbor representations. Aggregated messages are further combined with self embeddings using fully connected layers. We will drop superscripts and use $z_i$ to represent node embeddings in the following discussion.
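As a minimal sketch of the layer equation above, the following NumPy code implements one round of mean-pool aggregation followed by a linear COMBINE step with a ReLU. The function name `mean_aggregate_layer`, the weight matrices `w_self` and `w_neigh`, and the toy path graph are illustrative assumptions, not part of the original model.

```python
import numpy as np

def mean_aggregate_layer(z, neighbors, w_self, w_neigh):
    """One message-passing layer: mean AGGREGATE, then linear COMBINE with ReLU.

    z         : (num_nodes, d_in) embeddings from the previous layer
    neighbors : list of neighbour index lists, one per node (hypothetical format)
    w_self    : (d_in, d_out) weights applied to the node's own embedding
    w_neigh   : (d_in, d_out) weights applied to the aggregated message
    """
    out = np.zeros((z.shape[0], w_self.shape[1]))
    for i, nbrs in enumerate(neighbors):
        # AGGREGATE: mean of the neighbours' previous-layer embeddings
        a_i = z[nbrs].mean(axis=0) if len(nbrs) > 0 else np.zeros(z.shape[1])
        # COMBINE: fully connected map of self embedding and message, then ReLU
        out[i] = np.maximum(z[i] @ w_self + a_i @ w_neigh, 0.0)
    return out

# Toy path graph 0 - 1 - 2 with one-hot node features as z^(0)
neighbors = [[1], [0, 2], [1]]
z0 = np.eye(3)
w_self = np.eye(3)
w_neigh = np.eye(3)
z1 = mean_aggregate_layer(z0, neighbors, w_self, w_neigh)
```

With identity weights, node 1 ends up with its own one-hot feature plus the average of its two neighbours' features, which makes the 1-hop information flow easy to inspect. Stacking the function $l$ times would cover the $l$-hop neighborhood.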

Following the encoder-decoder framework, the node neighborhood is projected to a vector space and then decoded for different tasks: a) Link Prediction, where edge existence is predicted using the similarity between two node embeddings; b) Node Recommendation, where the node-level labels are predicted using only the information of neighbors, excluding the nodes themselves; c) Multi-Task Learning, where link prediction and node recommendation are trained together using the same encoder. For the link prediction task, we have

$$s_{i,j} = \mathrm{DEC}_{pairwise}(z_i, z_j), \qquad \mathcal{L} = -\sum_{(v_i, v_j) \in E} \left[ w_{i,j}\, y_{i,j} \log s_{i,j} + w_{i,j}\,(1 - y_{i,j}) \log(1 - s_{i,j}) \right],$$

where $s_{i,j}$ is the similarity between two node embeddings $z_i$ and $z_j$, the label $y_{i,j}$ represents edge existence, and $\mathcal{L}$ is the binary cross-entropy loss with corresponding edge weights $w_{i,j}$. Negative sampling is included.

For the node recommendation task, we have

$$r_i = \mathrm{DEC}_{node}(z_i), \qquad \mathcal{L} = -\sum_{v_i \in V} w_i\, y_i^{T} \log r_i,$$

where $y_i$ is the vector label and $w_i$ are node weights. Note that $w_{i,j}$ and $w_i$ refer to weights in different tasks and thus are not related.
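The two weighted losses can be sketched in NumPy as follows. The function names `weighted_bce` and `weighted_ce`, the clipping constant `eps`, and the toy inputs are assumptions made for illustration; the per-sample weights play the role of the boosting weights described in the text.

```python
import numpy as np

def weighted_bce(s, y, w, eps=1e-12):
    """Weighted binary cross entropy over node pairs (link prediction)."""
    s = np.clip(np.asarray(s, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(-np.sum(w * (y * np.log(s) + (1.0 - y) * np.log(1.0 - s))))

def weighted_ce(r, y, w, eps=1e-12):
    """Weighted cross entropy over nodes with vector labels (node recommendation)."""
    r = np.clip(np.asarray(r, dtype=float), eps, 1.0)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(-np.sum(w * np.sum(y * np.log(r), axis=1)))

# Two node pairs: one true edge scored 0.9, one non-edge scored 0.2
link_loss = weighted_bce(s=[0.9, 0.2], y=[1, 0], w=[1.0, 1.0])
# Doubling the weight of the second pair doubles its contribution
link_loss_boosted = weighted_bce(s=[0.9, 0.2], y=[1, 0], w=[1.0, 2.0])
# One node with a two-class one-hot label and weight 2
node_loss = weighted_ce(r=[[0.5, 0.5]], y=[[1, 0]], w=[2.0])
```

Raising a sample's weight scales only that sample's term, which is how a boosting round can steer the next weak learner toward previously misclassified pairs or nodes.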

We next introduce AdaGNN, using link prediction task as an example. AdaGNN includes a series of GNNs, called weak learners. Each weak learner is trained on weighted sample points and the weights used in present learner depend on the training errors from the previous learner.

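Mathematically, the $k$-th weak learner can be written as

$$z_i^k = \mathrm{ENC}_k(G_i): G \to Z_k, \qquad s_{i,j}^k = \mathrm{DEC}_k(z_i^k, z_j^k): Z_k \times Z_k \to [0, 1].$$

Given $K$ weak learners, the similarity between two nodes for the meta learner is

$$s_{i,j} = \sum_{k=1}^{K} \alpha_k s_{i,j}^k = \sum_{k=1}^{K} \alpha_k\, \mathrm{DEC}_k\left(\mathrm{ENC}_k(G_i), \mathrm{ENC}_k(G_j)\right),$$

where $\alpha_k \geq 0$, $\sum_{k=1}^{K} \alpha_k = 1$.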
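The meta-learner similarity can be illustrated with a short sketch. Here each weak learner's decoder is assumed to be a rescaled cosine similarity (so that its output lies in $[0, 1]$); this decoder choice, the function names and the toy embeddings are assumptions for illustration only.

```python
import numpy as np

def cosine_decoder(z_i, z_j):
    """Hypothetical parameter-free decoder: cosine similarity mapped to [0, 1]."""
    cos = z_i @ z_j / (np.linalg.norm(z_i) * np.linalg.norm(z_j))
    return 0.5 * (cos + 1.0)

def meta_similarity(embeddings_i, embeddings_j, alphas):
    """Convex combination of per-space similarities.

    embeddings_i, embeddings_j : lists with one embedding per space for each node
    alphas                     : nonnegative learner weights, normalised to sum to 1
    """
    alphas = np.asarray(alphas, dtype=float)
    alphas = alphas / alphas.sum()
    scores = np.array([cosine_decoder(zi, zj)
                       for zi, zj in zip(embeddings_i, embeddings_j)])
    return float(alphas @ scores)

# Two embedding spaces: the nodes coincide in space 1 and are opposite in space 2
node_i = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
node_j = [np.array([2.0, 0.0]), np.array([0.0, -1.0])]
s_ij = meta_similarity(node_i, node_j, alphas=[3.0, 1.0])
```

In this toy case the pair is highly similar in one space and dissimilar in the other, and the combined score reflects the learner weights rather than requiring closeness in every space.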

### 3.5. Generalization Error of AdaGNN

The analysis of the generalization error of AdaGNN is mainly based on the generalization error of boosting from Schapire et al. (Bartlett et al., 1998) and the VC dimension of GNNs from Scarselli et al. (2018). We start with the definitions of the functional space of boosted GNNs.

###### Definition 3.8 ().

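Let $H$ be the functional space of GNN encoders from graph $G$ to embedding space $Z$, let $h_k \in H$ be the encoders and let $\mathrm{DEC}$ be the uniform decoder for all encoders. We define $C$ to be the set of weighted averages of weak learners from $H$:

$$C \doteq \left\{ f: G \to \sum_{h_k \in H} \alpha_k s_{i,j}^k \;\middle|\; \alpha_k \geq 0;\; \sum_{k=1}^{K} \alpha_k = 1 \right\},$$

and $C_K$ to be the set of unweighted averages over $K$ weak learners from $H$:

$$C_K \doteq \left\{ f: G \to \frac{1}{K} \sum_{k=1}^{K} s_{i,j}^k \;\middle|\; h_k \in H \right\},$$

where $s_{i,j}^k = \mathrm{DEC}(h_k(G_i), h_k(G_j))$.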

Any projection $f \in C$ can be associated with a distribution over $H$ defined by the coefficients $\alpha_k$. In other words, any such $f$ defines a natural distribution over $H$ in which we draw function $h_k$ with probability $\alpha_k$. By choosing $K$ elements of $H$ independently at random according to this distribution and taking an unweighted average, we can generate an element of $C_K$. Under this construction, we map each $f \in C$ to a distribution $Q$ over $C_K$. That is, a function $g \in C_K$ distributed according to $Q$ can be sampled by choosing $h_1, \dots, h_K$ independently at random according to the coefficients $\alpha_k$ and then defining $g$ as their unweighted average.

The key property of the relationship between $f$ and $Q$ is that each is completely determined by the other. Obviously $Q$ is determined by $f$ because we defined it that way, but $f$ is also completely determined by $Q$ as follows:

$$f(G_i, G_j) = \mathbb{E}_{g \sim Q}\left[g(G_i, G_j)\right]. \tag{8}$$

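The identity in Eq. 8 can be checked numerically with a small Monte Carlo sketch. The coefficients, the per-learner scores for one fixed node pair, the sample sizes and the random seed below are all made-up values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

alphas = np.array([0.5, 0.3, 0.2])   # learner weights, sum to 1 (assumed values)
scores = np.array([0.9, 0.4, 0.1])   # per-learner similarity for one fixed node pair
f_value = float(alphas @ scores)     # weighted average, an element of C

K = 5                 # number of learners averaged in each sampled g
n_draws = 200_000     # number of sampled functions g ~ Q

# Each row draws K learner indices i.i.d. with probabilities alphas
idx = rng.choice(len(alphas), size=(n_draws, K), p=alphas)
# g is the unweighted average of the K sampled learners' scores
g_values = scores[idx].mean(axis=1)
estimate = float(g_values.mean())    # Monte Carlo estimate of E_{g ~ Q}[g]
```

The sample mean of $g$ approaches the weighted average $f$ as the number of draws grows, regardless of the choice of $K$, because each sampled learner contributes its score with probability equal to its coefficient.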
###### Theorem 3.9 ().

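Let $D$ be a distribution over node pairs and their labels, and let $S$ be a sample of $m$ node pairs chosen independently at random according to $D$. Suppose the base-classifier space $H$ has VC-dimension $d$, and let $\delta > 0$. Assume that $m \geq d \geq 1$. Then with probability at least $1 - \delta$ over the random choice of the training set $S$, every weighted average projection $f \in C$ satisfies the following bound for all $\theta > 0$:

$$P_D\left[(y_{i,j} - \tau)(f(G_i, G_j) - \tau) \leq 0\right] \leq P_S\left[(y_{i,j} - \tau)(f(G_i, G_j) - \tau) \leq \theta\right] + O\left(\frac{1}{\sqrt{m}} \left(\frac{d \log^2(m/d)}{\theta^2} + \log\frac{1}{\delta}\right)^{1/2}\right). \tag{9}$$
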
###### Proof.

Using the construction above, the proof follows the same idea as in Schapire et al. (Bartlett et al., 1998). ∎

The generalization error bound of AdaGNN thus depends on the number of training data points and the VC-dimension of the GNN. For one-layer GNNs such as GCN and GraphSage, the VC-dimension has been calculated by Scarselli et al. (2018). The generalization error is plotted and discussed in Section 4.3.2.

### 3.6. Training

A single iteration of training in AdaGNN involves two parts: a) training a weak learner; b) boosting the label weights. For weak learners, we use different types of GNNs, including GraphSage and GAT for static graphs and TGN for dynamic graphs. For boosting, we use two different algorithms: a) the SAMME.R (R for Real) algorithm (Zhu et al., 2006), which uses weighted class probability estimates rather than hard classifications in the weight update and prediction combination, leading to better generalization and faster convergence; b) the AdaBoost.R2 algorithm (Drucker, 1997), which uses bootstrapping, making it less prone to overfitting. The boosting algorithm using SAMME.R for link prediction is shown in Algorithm 1. The procedures for AdaBoost.R2 and for node-level tasks are similar.
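To make the weight-boosting step concrete, here is a minimal sketch of one round of the classic discrete AdaBoost update for binary labels. This is deliberately the simpler discrete variant, not SAMME.R or AdaBoost.R2 as used in the paper, and it is not a reproduction of Algorithm 1. The helper name `adaboost_round`, the `learning_rate` parameter and the toy predictions are assumptions.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred, learning_rate=1.0, eps=1e-12):
    """One discrete AdaBoost round for binary labels (illustrative helper).

    Returns the renormalised sample weights and the learner coefficient alpha.
    alpha is None when the weak learner makes no errors, i.e. boosting can stop.
    """
    weights = np.asarray(weights, dtype=float)
    miss = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    # Weighted error rate of the current weak learner
    err = float(np.sum(weights * miss) / np.sum(weights))
    if err == 0.0:
        return weights / weights.sum(), None
    err = min(err, 1.0 - eps)  # guard against division by zero when err == 1
    alpha = learning_rate * np.log((1.0 - err) / err)
    # Increase the weights of misclassified samples only, then renormalise
    new_weights = weights * np.exp(alpha * miss)
    return new_weights / new_weights.sum(), float(alpha)

# Four edges with uniform weights; the weak learner misclassifies the third one
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0])
w0 = np.full(4, 0.25)
w1, alpha = adaboost_round(w0, y_true, y_pred)
```

After the round, the misclassified edge carries half of the total weight, so the next weak learner is pushed toward the relation that the current embedding space failed to preserve. In a full loop, the weak learner would be a GNN retrained with these weights in its loss, and a learner with weighted error at or above one half (non-positive `alpha`) would normally be discarded.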

We focus on the following variations of AdaGNN that are based on different combinations of underlying model, boosting algorithm and decoder types:

• AdaGNN-nn AdaBoost-based GNN with a uniform non-linear decoder. We concatenate the node embeddings from all embedding spaces into new node embeddings and feed them to the uniform decoder.

## 4. Experiments

We test AdaGNN on three tasks and four real world social networks from diverse background applications.

• Twitch social networks (Rozemberczki et al., 2019) User-user networks where nodes correspond to Twitch users and edges to mutual friendships. Node features are the games liked. The associated tasks are link prediction of whether two users have a mutual friendship, node recommendation of the games each user likes, and multi-task learning.

• Wikipedia(Kumar et al., 2019) User-page networks where nodes correspond to Wikipedia users, Wikipedia pages and edges to one user editing one page. The associated tasks are future link prediction of whether one user will edit one page in the future.

• Movielens(Harper and Konstan, 2015) User-movie networks where nodes correspond to users, movies and edges to one user rating or tagging one movie. The associated tasks are future link prediction of whether one user will rate or tag one movie in the future.

• Linkedin User-user networks where nodes correspond to Linkedin users and edges to mutual friendships. The associated task is link prediction of whether two users have mutual friendships. Future link prediction task is left for future work.

We also consider both transductive and inductive tasks, depending on whether nodes are observed in the training dataset. However, it is worth noting that both our baseline models and the AdaGNN variations are inductive in nature. For baseline models, we consider popular and representative state-of-the-art models for static graphs, GraphSage (Hamilton et al., 2018) and GAT (Veličković et al., 2018), as well as a state-of-the-art model for dynamic graphs, TGN (Rossi et al., 2020). Based on the comparison with baselines in Section 4.2 and the experimental analysis in Section 4.3, AdaGNN shows the following advantages:

• AdaGNN outperforms baselines on all datasets for all tasks, especially when the information of neighborhoods is rich.

• Multiple embedding spaces capture different information and outperform a single embedding space of higher dimension.

• AdaGNN is robust to the number of training data.

### 4.1. Experimental Settings

In addition to testing link prediction and node classification separately, we also experiment with multi-task learning (as shown in Figure 2). Each training data point's weight is updated based on the final combined prediction error, which makes the errors of the two tasks comparable and gives high-degree nodes, which are susceptible to high errors, relatively higher weights in node recommendation tasks. The scalability and time complexity of AdaGNN are tightly coupled with the underlying model. The time complexity of AdaGNN grows linearly with the number of weak learners. We also find that weak learners suffice for all datasets.

Hyperparameters For training of GNNs, including GraphSage, GAT, TGN, we use the Adam optimizer with a batch size of . The learning rate is selected in , the number of heads is selected in , the number of neighbors is selected in and the number of layers is fixed as . For boosting, the learning rate is selected in . The number of negative samples is equal to positive samples except Linkedin datasets, where we have more positive samples. We do a simple grid hyper-parameter tuning for both AdaGNN  and baseline models, the best performance metrics are reported in the next section.

### 4.2. Performance Comparison with Baselines

Table 2 presents the results for future link prediction on dynamic graphs. AdaGNN outperforms the baselines in both transductive (2nd, 4th, 6th column) and inductive settings (3rd, 5th column). One interesting model is AdaSage. For this model, we only aggregate the information from neighbors. The time complexity of this model is very low compared to expensive dynamic graph methods such as TGN, which needs to use an RNN to update the memory at each training step. This simple model with boosting still achieves the accuracy around . This suggests that boosting-based shallow models can achieve comparable accuracy with much shorter training time and much less GPU memory.

Notably, AdaGNN performs better on the Movielens dataset than on the Wikipedia dataset. The reason is that AdaGNN can exploit users' non-repetitive interaction behavior. In the Wikipedia dataset, 69% of users keep editing the same page over the whole time domain, and 84% of users consecutively edit the same page. With 72% probability, there is only one unique neighbor in a node neighborhood for a fixed sample size of . In this case, the information of node neighborhoods is not very rich, and projecting this single neighbor onto multiple embedding spaces does not bring a dramatic improvement. In the Movielens dataset, on the other hand, 0% of users rate the same movie over the whole time domain and there is always more than one neighbor in each node neighborhood. In this case, the information of neighborhoods is rich and high-dimensional, so projecting neighborhoods onto multiple embedding spaces gives a large improvement. The difference between the Wikipedia and Movielens datasets indicates that multiple embedding spaces generally perform better when the information of node neighborhoods is rich.

Table 2 presents the results on multi-task learning, including link prediction and node recommendation, in transductive (2nd, 4th column) and inductive settings (3rd, 5th column) . GAT-200, GAT-1024, GAT-3170 represent GAT with embedding dimension 200, 1024, 3170 respectively. AdaGNN clearly outperforms the baselines by a large margin in both transductive and inductive settings for all dimensions, especially for the recommendation task.

### 4.3. Model Analysis

In this section, we further verify the multiple embedding spaces experimentally, including

• Do multiple low-dimensional embedding spaces perform better than one single high-dimensional embedding space, even when the dimension of single space is equal to the sum of weak learners’ dimensions?

• Does increasing the number of embedding spaces bound the generalization error as we predict theoretically?

• Do multiple embedding spaces capture different information in real experiment?

• Is AdaGNN robust towards limited training data?

#### 4.3.1. High-Dimensional Embedding Space

For high-dimensional information of node neighborhoods, we mentioned before that it can be projected onto either one single high-dimensional embedding space or multiple low-dimensional embedding spaces. In Figure 3, we discuss the performance of single high-dimensional embedding space and multiple low-dimensional embedding spaces by varying the embedding dimension from around 200 to 1800, and computing the average precision for static baselines and AdaGAT for different embedding dimensions on the future link prediction task of Movielens dataset. The effect on other tasks and datasets is similar. We observe that multiple embedding spaces outperform one single embedding space for all different embedding dimensions. Furthermore, AdaGAT even outperforms baselines when the embedding dimension of baselines and the sum of all subspace dimensions are almost equal. In these cases, it is difficult to find one single embedding space such that all pairs of linked nodes are close to each other and it’s better to use multiple embedding spaces.

#### 4.3.2. Generalization error

We use margin theory to analyze the generalization errors of AdaGNN. Margins can be considered a measure of confidence. Here we define the margin as $(y_{i,j} - \tau)(s_{i,j} - \tau)$, where $y_{i,j}$ represents whether the edge exists, $s_{i,j}$ represents the predicted similarity between two nodes and $\tau$ is the threshold. The margin is negative when the prediction is incorrect and positive otherwise. Similar to (Scarselli et al., 2018), the margin reaches its minimum when the model predicts incorrectly with high confidence, its maximum when the model predicts correctly with high confidence, and is close to zero when the prediction has low confidence. For example, given $y_{i,j} = 1$, the margin reaches its minimum if $s_{i,j} = 0$, its maximum if $s_{i,j} = 1$, and is close to zero if $s_{i,j} \approx \tau$. Therefore, the margin is a measure of both correctness and confidence. When the prediction is based on a clear and substantial majority of the base classifiers, the margin will be large and positive, corresponding to greater confidence in the predicted labels.
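The margin and the empirical margin distribution can be computed in a few lines. This is a sketch under the assumption of a threshold $\tau = 0.5$; the function names `margin` and `fraction_below`, and the example labels and scores, are hypothetical.

```python
import numpy as np

def margin(y, s, tau=0.5):
    """Margin (y - tau) * (s - tau): negative when wrong, positive when right."""
    return (np.asarray(y, dtype=float) - tau) * (np.asarray(s, dtype=float) - tau)

def fraction_below(margins, theta):
    """Empirical P_S[margin <= theta], the training term of the margin bound."""
    return float(np.mean(np.asarray(margins) <= theta))

# Existing edges (y = 1) scored with low, borderline and high similarity,
# plus one non-edge (y = 0) confidently scored as dissimilar
y = np.array([1, 1, 1, 0])
s = np.array([0.0, 0.5, 1.0, 0.0])
m = margin(y, s)
share_small = fraction_below(m, theta=0.1)
```

Plotting the sorted margins of the training pairs after each boosting round gives the kind of margin distribution curve discussed in this section: a shrinking share of small or negative margins corresponds to fewer low-confidence or wrong predictions.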

In Section 3.5, we stated a generalization error theorem that holds when the training and test errors are measured using the margin metric. Therefore, under the margin metric, we can bound the test error in terms of the training error and the generalization error bound.

We visualize the effect of boosting on the margins by plotting their distribution in Figure 4. The first two rows show, respectively, the average precision scores and the errors for different numbers of weak learners on the Twitch, Wikipedia, and Movielens datasets. The last row shows the margin distributions. Boosting aggressively pushes up the margins of training samples whose margin is small or negative, so the number of such samples decreases, which leads to smaller training and test errors. The generalization error bound in Eq. 9, expressed in terms of the number of samples and the VC-dimension, has no explicit dependence on the number of weak learners. Consistent with this, the observed generalization errors in the second row of Figure 4 also stay roughly constant as the number of weak learners varies.
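A margin distribution of this kind can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the label encoding (+1 for an existing edge, -1 otherwise), the margin form `y * (s - threshold)`, the helper names, and the toy scores are all assumptions made for the example.

```python
from bisect import bisect_right


def margins(labels, scores, threshold):
    """Signed margin per node pair (assumed form: y * (s - threshold)).

    labels: +1 if the edge exists, -1 otherwise (assumed encoding).
    scores: predicted similarity for each pair.
    """
    return [y * (s - threshold) for y, s in zip(labels, scores)]


def fraction_at_most(values, cutoff):
    """Empirical CDF: fraction of values that are <= cutoff."""
    ordered = sorted(values)
    return bisect_right(ordered, cutoff) / len(ordered)


# Hypothetical similarity scores for the same five pairs, produced by an
# ensemble with few weak learners and by one with many weak learners.
labels = [1, 1, -1, -1, 1]
few_learners = [0.2, -0.4, 0.1, -0.3, 0.6]
many_learners = [0.5, 0.2, -0.4, -0.6, 0.8]

for name, scores in (("few", few_learners), ("many", many_learners)):
    m = margins(labels, scores, threshold=0.0)
    print(name, "fraction with margin <= 0:", fraction_at_most(m, 0.0))
```

Evaluating `fraction_at_most` over a grid of cutoffs gives the cumulative margin curve; a curve that shifts to the right as weak learners are added corresponds to the behavior described above.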

#### 4.3.3. Embedding Visualization

We want different weak learners to project the node neighborhood onto different embedding spaces, each capturing a different notion of similarity. To visualize this, we plot node embeddings in different spaces in Figure 5, using t-SNE (van der Maaten and Hinton, 2008) to map the high-dimensional embeddings to 2D. In Figure 5, nodes 1-7 are the neighbors of node 0, and we plot their embeddings in four different embedding spaces. In embedding space 1, node 7 is closest to node 0 and node 1 is farthest from it, whereas in embedding space 4, node 1 is closest to node 0 and node 7 is farthest. Similarly, node 6 is closest to node 0 in embedding space 2 but far away from node 0 in embedding space 3. These observations show that the similarity of the same pair of nodes differs from space to space. In this case, they experimentally verify that each embedding space captures different information and preserves different similarities about node neighborhoods.
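The comparison made in this paragraph amounts to ranking the neighbors of node 0 by distance in each space and checking that the rankings disagree. The sketch below illustrates that check on made-up 2D coordinates; the coordinates, the use of Euclidean distance, and the helper name `rank_neighbors` are assumptions for illustration and do not come from Figure 5.

```python
import math


def rank_neighbors(anchor, neighbors):
    """Return neighbor ids sorted from closest to farthest from the anchor."""
    return sorted(neighbors, key=lambda n: math.dist(anchor, neighbors[n]))


# Hypothetical 2D coordinates of node 0 and three of its neighbors
# in two different embedding spaces.
space_a = {0: (0.0, 0.0), 1: (3.0, 4.0), 6: (1.0, 1.0), 7: (0.5, 0.0)}
space_b = {0: (0.0, 0.0), 1: (0.1, 0.2), 6: (2.0, 2.0), 7: (5.0, 5.0)}

for name, space in (("A", space_a), ("B", space_b)):
    anchor = space[0]
    neighbors = {n: p for n, p in space.items() if n != 0}
    print(name, rank_neighbors(anchor, neighbors))
```

In this toy data node 7 is the nearest neighbor in space A and the farthest in space B, which mirrors the kind of reversal reported between embedding spaces 1 and 4.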

Combining the results in Figures 3 and 5, we experimentally validate the necessity of multiple embedding spaces. In some cases, it is hard to find a single embedding space in which every pair of linked nodes is close together. Instead, we can use multiple embedding spaces and only require that some of the linked pairs are close to each other in each space, which is easier to achieve. We then combine the similarities from all embedding spaces to predict the existence of edges.
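The last step, combining per-space similarities into one edge prediction, can be illustrated with a weighted vote. This is a sketch under stated assumptions: the normalized weighted sum, the function name `predict_edge`, the weights, and the threshold are hypothetical choices in the spirit of boosting, not the exact combination rule used by AdaGNN.

```python
def predict_edge(similarities, weights, threshold=0.0):
    """Combine per-space similarities with a normalized weighted sum.

    similarities: one similarity score per embedding space for a node pair.
    weights: one non-negative weight per weak learner (assumed, e.g. from boosting).
    Returns (edge_predicted, combined_score).
    """
    combined = sum(w * s for w, s in zip(weights, similarities)) / sum(weights)
    return combined > threshold, combined


# Hypothetical pair that is close in spaces 1 and 3 but not in spaces 2 and 4.
pair_similarities = [0.9, -0.2, 0.7, -0.1]
learner_weights = [1.0, 0.5, 0.8, 0.4]

predicted, score = predict_edge(pair_similarities, learner_weights)
print(predicted, round(score, 3))
```

The pair does not need to be close in every space: agreement in the more heavily weighted spaces is enough for the combined score to clear the threshold.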

#### 4.3.4. Robustness to Limited Training Data

One challenge of learning on social networks is the lack of high-quality data, so an ideal model should make efficient use of limited training data. In this experiment, we validate the robustness of AdaGAT to limited training data (Figure 6) by varying the ratio of training data on the Movielens dataset. The numbers of validation and test samples are always equal. AdaGAT outperforms the baselines at every training ratio.

## 5. Conclusion

In this work, we introduce a novel approach that uses boosting to automatically project node neighborhoods onto multiple low-dimensional embedding spaces, and we develop AdaGNN based on this approach. We theoretically and experimentally analyze the effectiveness and robustness of AdaGNN, in particular the advantages of multiple embedding spaces over a single embedding space. We demonstrate that AdaGNN achieves strong performance when the information in node neighborhoods is rich and the dimension of the ideal embedding space is large. Our work presents a novel application of multiple embedding spaces and boosting in graph neural networks, and it opens up a direction for social networks in which similarities between nodes are preserved in different spaces. It also leaves several questions for future work: how to further reduce the information leakage between embedding spaces, how to further narrow the focus of each embedding space, and how to measure the richness of node neighborhoods.

## References

• P. Bartlett, Y. Freund, W. S. Lee, and R. E. Schapire (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics 26 (5), pp. 1651–1686.
• H. Drucker (1997) Improving regressors using boosting techniques.
• A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proc. of KDD, pp. 855–864.
• W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Proc. of NIPS, pp. 1024–1034.
• W. L. Hamilton, R. Ying, and J. Leskovec (2018) Inductive representation learning on large graphs. arXiv:1706.02216.
• F. M. Harper and J. A. Konstan (2015) The movielens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5 (4). ISSN 2160-6455.
• T. N. Kipf and M. Welling (2017a) Semi-supervised classification with graph convolutional networks. In Proc. of ICLR.
• T. N. Kipf and M. Welling (2017b) Semi-supervised classification with graph convolutional networks. arXiv:1609.02907.
• S. Kumar, X. Zhang, and J. Leskovec (2019) Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1269–1278. ISBN 9781450362016.
• A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, and A. Peysakhovich (2019) PyTorch-biggraph: a large-scale graph embedding system. CoRR abs/1903.12287.
• L. Ma, Z. Yang, Y. Miao, J. Xue, M. Wu, L. Zhou, and Y. Dai (2018) Towards efficient large-scale graph neural network computing. CoRR abs/1810.08403.
• A. Pal, C. Eksombatchai, Y. Zhou, B. Zhao, C. Rosenberg, and J. Leskovec (2020) PinnerSage. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ISBN 9781450379984.
• B. Perozzi, R. Al-Rfou, and S. Skiena (2014) DeepWalk: online learning of social representations. In Proc. of KDD, pp. 701–710.
• E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, and M. Bronstein (2020) Temporal graph networks for deep learning on dynamic graphs. arXiv:2006.10637.
• B. Rozemberczki, C. Allen, and R. Sarkar (2019) Multi-scale attributed node embedding. arXiv:1909.13021.
• F. Scarselli, A. C. Tsoi, and M. Hagenbuchner (2018) The Vapnik–Chervonenkis dimension of graph and recursive neural networks. Neural Networks 108, pp. 248–259. ISSN 0893-6080.
• Z. Shen, T. Luo, Y. Zhou, H. Yu, and P. Du (2021) NPI-GNN: predicting ncRNA–protein interactions with deep graph neural networks. Briefings in Bioinformatics.
• L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (86), pp. 2579–2605.
• P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. arXiv:1710.10903.
• J. Wang, P. Huang, H. Zhao, Z. Zhang, B. Zhao, and D. L. Lee (2018) Billion-scale commodity embedding for e-commerce recommendation in Alibaba. In Proc. of KDD, pp. 839–848.
• [22] J. Weston, R. J. Weiss, and H. Yee. Nonlinear latent factorization by embedding multiple user interests [extended abstract].
• Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. CoRR abs/1901.00596.
• D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan (2020) Inductive representation learning on temporal graphs. arXiv:2002.07962.
• K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks? In Proc. of ICLR.
• C. Yang, J. Zhang, H. Wang, S. Li, M. Kim, M. Walker, Y. Xiao, and J. Han (2019) Relation learning on social networks with multi-modal graph edge variational autoencoders. arXiv:1911.05465.
• R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018a) Graph convolutional neural networks for web-scale recommender systems.