Complex networks, including attributed and heterogeneous networks, are ubiquitous — from recommender systems to citation networks and biological systems (Hu et al., 2020). These networks present a multitude of machine learning problem statements, including node classification, link prediction, and community detection. A fundamental aspect of any such machine learning (ML) task, transductive or inductive, is the availability of featurized data. Traditionally, researchers have identified several network characteristics suited to specific ML tasks and used them for the learning algorithm. This practice is arduous, as it often entails customization for each specific ML task, and it is also limited to characteristics that can be computed.
This has led to a surge in (unsupervised) algorithms and methods that learn embeddings from the networks, such that these embeddings form the featurized representation of the network for the ML tasks (Zhang et al., 2018; Wu et al., 2019; Li and Pi, 2020; Chami et al., 2020; Bahrami et al., 2021). This area of research is generally referred to as representation learning in networks. The embeddings generated by representation learning methods are typically agnostic to the end use-case, as they are generated in an unsupervised fashion. Traditionally, the focus was on representation learning on homogeneous networks, i.e., networks that have a single type of nodes and edges and no attributes attached to the nodes and edges (Li and Pi, 2020).
Existing representation learning models mainly focus on transductive learning, where a model can only be trained on the entire input graph. This means that the model requires all the nodes and a fixed structure of the network in the training phase, e.g., Node2vec (Grover and Leskovec, 2016a), DeepWalk (Perozzi et al., 2014) and, to some extent, GCN (Kipf and Welling, 2017). Besides, there have been methods focused on heterogeneous networks that incorporate differently typed nodes and edges in a network, as well as content at each node (Dong et al., 2017; Wang et al., 2021).
On the other hand, a less explored and exploited approach is the inductive setting, in which only a part of the network is used to train the model to infer embeddings for new nodes. Several attempts have been made in the inductive setting, including EP-B (García-Durán and Niepert, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), SDNE (Wang et al., 2016b), TADW (Yang et al., 2015), AHNG (Liu et al., 2019) and PVECB (Lan et al., 2020). There has also been recent progress on heterogeneous graph embedding, e.g., MIFHNE (Li et al., 2020) or models based on graph neural networks (Zhang et al., 2019).
State-of-the-art network embedding techniques are mostly unsupervised, i.e., they aim at learning low-dimensional representations that preserve the structure of an input graph, e.g., GraphSAGE (Hamilton et al., 2017), DANE (Gao and Huang, 2018), line2vec (Bandyopadhyay et al., 2019), RCAN (Chen and Qian, 2020). Semi-supervised or supervised methods, in turn, learn vector representations tailored to a specific downstream prediction task, e.g., TADW (Yang et al., 2015) or FSCNMF (Bandyopadhyay et al., 2018). The literature thus shows that little supervision is required to learn embeddings.
Models proposed in recent years mainly focus on graphs that do not contain attributes related to nodes and edges (Li and Pi, 2020). This is especially noticeable for edge attributes: the majority of proposed approaches consider node attributes only, omitting the richness of the edge feature space while learning the representation. Nevertheless, models such as DANE (Gao and Huang, 2018), GraphSAGE (Hamilton et al., 2017), SDNE (Wang et al., 2016b) or CAGE (Nozza et al., 2020), which make use of node features, and EGNN (Kim et al., 2019), NEWEE (Li et al., 2019) and EGAT (Gong and Cheng, 2019), which consume edge attributes, have been successfully introduced.
|ECN (Aggarwal et al., 2016)||✓||✓||neigh. aggr.|
|GCN (Kipf and Welling, 2017)||✓||✓||✓||✓||GCN/GNN|
|ECC (Simonovsky and Komodakis, 2017)||✓||✓||✓||GCN, DL|
|FSCNMF (Bandyopadhyay et al., 2018)||✓||✓||✓||GCN|
|GAT (Veličković et al., 2018)||✓||✓||✓||✓||AE, DL|
|Planetoid (Bui et al., 2018)||✓||✓||✓||✓||GNN|
|EGNN (Kim et al., 2019)||✓||✓||✓||✓||✓||✓||GNN|
|EdgeConv (Wang et al., 2019)||✓||✓||GNN|
|EGAT (Gong and Cheng, 2019)||✓||✓||✓||✓||✓||✓||GNN|
|Attribute2vec (Wanyan et al., 2020)||✓||✓||✓||GCN|
|DeepWalk (Perozzi et al., 2014)||✓||✓||RW, skip-gram|
|TADW (Yang et al., 2015)||✓||✓||✓||RW, MF|
|LINE (Tang et al., 2015)||✓||✓||RW, skip-gram|
|Node2vec (Grover and Leskovec, 2016a)||✓||✓||RW, skip-gram|
|SDNE (Wang et al., 2016b)||✓||✓||✓||✓||AE|
|GraphSAGE (Hamilton et al., 2017)||✓||✓||✓||✓||RW|
|EP-B (García-Durán and Niepert, 2017)||✓||✓||✓||✓||AE|
|Struc2vec (Ribeiro et al., 2017)||✓||✓||RW, skip-gram|
|DANE (Gao and Huang, 2018)||✓||✓||✓||✓||AE|
|Line2vec (Bandyopadhyay et al., 2019)||✓||✓||RW, skip-gram|
|NEWEE (Li et al., 2019)||✓||✓||✓||✓||RW, skip-gram|
|AttrE2vec (2020)||✓||✓||✓||✓||✓||RW, AE, DL|
Neither node-based embedding methods nor graph-neural-network-inspired methods generalize effectively to both transductive and inductive settings, especially when there are attributes associated with edges. This work is motivated by the idea of unsupervised learning on networks with attributed edges, such that the embeddings are generalizable across tasks and are inductive.
To that end, we develop AttrE2vec, a novel unsupervised learning model that combines an auto-encoder and a self-attention network with feature reconstruction and a graph structural loss. To learn an edge representation, AttrE2vec splits the edge neighborhood into two parts, one for each endpoint of the edge, and then generates random edge walks in both neighborhoods. All walks are then aggregated over the node and edge attributes using one of the proposed strategies (Avg, Exp, GRU, ConcatGRU). These summaries are combined with the original node and edge features and fed into an attention and a dense layer to encode the edge. The embeddings are subsequently learned via a two-part loss function, covering both feature reconstruction and graph structure. As a consequence, AttrE2vec can explicitly incorporate feature information from nodes and edges many hops away to effectively produce plausible edge embeddings in the inductive setting.
In summary, our main contributions are as follows:
we propose AttrE2vec, a novel unsupervised method that learns a low-dimensional vector representation for attributed edges;
we exploit the concept of graph-topology-driven edge feature aggregation, from simple aggregators to learnable GRU-based ones, which captures the topological proximity of edges and the similarity of edge features;
the proposed method is inductive and allows obtaining representations for edges not present in the training phase;
we conduct various experiments and show that our AttrE2vec method has superior performance over all of the baseline methods on edge classification and clustering tasks.
2 Related work and Research Gap
Embedding information networks has received significant interest from the research community. We refer the readers to the survey articles for a comprehensive overview of network embedding (Li and Pi, 2020; Chami et al., 2020; Wu et al., 2019; Zhang et al., 2018) and cite only some of the most prominent works that are relevant.
Unsupervised network embedding methods use only the network structure or original attributes of nodes and edges to construct embeddings. The most common method is DeepWalk (Perozzi et al., 2014), which, in two phases, constructs node neighborhoods by performing fixed-length random walks and employs the skip-gram model to preserve the co-occurrences between nodes and their neighbors. This two-phase framework later inspired further network embedding methods proposing different strategies for constructing node neighborhoods or modeling co-occurrences between nodes, e.g., node2vec (Grover and Leskovec, 2016a), Struc2vec (Ribeiro et al., 2017), GraphSAGE (Hamilton et al., 2017), line2vec (Bandyopadhyay et al., 2019) or NEWEE (Li et al., 2019). Another group of unsupervised methods utilizes auto-encoders or graph neural networks to obtain embeddings. SDNE (Wang et al., 2016b) uses an auto-encoder architecture to preserve first- and second-order proximities by jointly optimizing the loss in neighborhood reconstruction. Other auto-encoder-based representatives are EP-B (García-Durán and Niepert, 2017) and DANE (Gao and Huang, 2018).
Supervised network embedding methods are constructed as end-to-end methods for particular tasks like node classification or link prediction. These methods require the network structure, attributes of nodes and edges (if the method can use them), and an annotated target, such as node classes. Representatives include ECN (Aggarwal et al., 2016), ECC (Simonovsky and Komodakis, 2017), FSCNMF (Bandyopadhyay et al., 2018), GAT (Veličković et al., 2018), Planetoid (Bui et al., 2018), EGNN (Kim et al., 2019), GCN (Kipf and Welling, 2017), EdgeConv (Wang et al., 2019), EGAT (Gong and Cheng, 2019) and Attribute2vec (Wanyan et al., 2020).
Edge representation learning has already been tackled by several methods, i.e., ECN (Aggarwal et al., 2016), EGNN (Kim et al., 2019), line2vec (Bandyopadhyay et al., 2019), EdgeConv (Wang et al., 2019), EGAT (Gong and Cheng, 2019). However, none of these methods can both directly take edge attributes into account and perform the learning in an unsupervised manner.
All the characteristics of the representative node and edge representation learning methods are grouped in Table 1.
In the following paragraphs, we explain our three-fold motivation for proposing AttrE2vec.
For a decade, network processing approaches have gathered more and more attention, as graph data is produced in an increasing number of systems. Network embedding has traditionally provided a way of vectorizing nodes for node classification or clustering. Edge representation learning, however, did not gather enough attention and was accomplished by transforming node embeddings (Grover and Leskovec, 2016b). Such an approach is problematic: for instance, inferring an edge type from neighboring nodes' embeddings may not be the best choice for edge type classification in heterogeneous social networks. We claim that efficient edge clustering, edge attribute regression, or link prediction tasks require dedicated and specific edge representations. We expect a representation learning approach devoted strictly to edges to provide more powerful vector representations than traditional methods, which require node embeddings trained upfront and transform them to represent edges.
Inductive embedding methods
The vast majority of contemporary network representation learning methods are transductive (see Table 1). This means that any change to the graph requires retraining the whole method to provide predictions for unseen cases. Such a property limits the applicability of these methods due to high computational costs. In contrast, the inductive approach builds a predictive ability that can be applied to unseen cases and does not need retraining; in general, inductive methods have a lower computational cost. Considering these advantages, we expect modern edge embedding methods to be inductive.
Encoding graph attributes in embeddings
Much of the real-world data exhibits rich attribute sets or meta-data that contain crucial information, e.g., about the similarity of nodes or edges. Traditionally, graph representation learning has focused on exploiting the network structure, omitting the related content. Thus, we expect embedding methods to consume attributes, which act as a regularizer over the structure. This overcomes the limitation that arises when the only edge-discriminating information is encoded in the edges' attributes rather than in the graph's structure; relying on the network alone would then produce inconclusive embeddings.
3.2 Attributed graph edge embedding
We denote an attributed graph as $G = (V, E)$, where $V$ is a set of nodes and $E$ a set of edges. Every node $u \in V$ and every edge $e \in E$ has associated features, collected in $M$ and $F$, the node and edge feature matrices, respectively. By $d_M$ we denote the dimensionality of the node feature space and by $d_F$ the dimensionality of the edge feature space. The edge embedding task is defined as learning a function $f: E \to \mathbb{R}^{d}$, which takes an edge and outputs its low-dimensional vector representation $\mathbf{h}_e = f(e)$. Note that the embedding dimension $d$ should be much less than the original edge feature dimensionality $d_F$, i.e.: $d \ll d_F$. More specifically, we aim at using the topological structure of the graph as well as the node and edge attributes: $\mathbf{h}_e = f(e; G, M, F)$.
In contrast to traditional node embedding methods, we shift the focus from nodes to edges and consider a graph from an edge perspective. Given any edge $e = (u, v)$, we can observe three natural sources of knowledge: the edge attributes themselves and the two neighborhoods $\mathcal{N}(u)$ and $\mathcal{N}(v)$, located behind nodes $u$ and $v$, respectively. In AttrE2vec, we exploit all three sources jointly.
First, we obtain aggregations (summaries) of both neighborhoods $\mathcal{N}(u)$ and $\mathcal{N}(v)$. We want to capture the topological structure of the neighborhood, so we perform edge random walks of length $L$, which start from node $u$ (or $v$, respectively) and use a uniformly distributed neighbor sampling approach (DeepWalk-like) to obtain the next edge. Each $i$-th walk started from node $u$ is hence a sequence of edges.
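The uniform edge random walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the `incident` structure (a map from each node to the list of its incident edges) and the function name are assumptions:

```python
import random

def edge_random_walk(incident, start_node, length, rng=random):
    """Uniform (DeepWalk-like) random walk over edges.

    At each step we pick one of the current node's incident edges
    uniformly at random, record it, and hop to its other endpoint.
    Returns the sequence of traversed edges.
    """
    node, walk = start_node, []
    for _ in range(length):
        u, v = rng.choice(incident[node])
        walk.append((u, v))
        node = v if node == u else u  # move to the other endpoint
    return walk
```

For an edge $(u, v)$, calling this once from $u$ and once from $v$ yields walks for the two neighborhood summaries.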
Next, we take the attributes of the edges (and nodes, if applicable) in each random walk and aggregate them into a single vector using the walk aggregation model.
Later, the aggregated walks are combined using the neighborhood aggregation model, which summarizes the neighborhood $\mathcal{N}(u)$ (and $\mathcal{N}(v)$, respectively). The proposed implementations of these aggregations are given in Section 3.4.
Finally, we obtain the low-dimensional edge embedding $\mathbf{h}_e$ using an encoder module, which combines the edge attributes with the summarized neighborhood information. We employ a simple Multilayer Perceptron (MLP) with 3 inputs (each of size equal to the edge feature dimensionality) and an attention mechanism over these inputs, to check how much of the information of each input is used to create the embedding vector (see Figure 3).
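As a rough sketch of this encoder step, the following pure-Python toy scores the three equally-sized inputs with a shared attention vector, softmax-normalizes the scores, and passes the weighted concatenation through one dense (linear) layer. All names (`attn_w`, `enc`) and the exact scoring function are illustrative assumptions, not the paper's architecture:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def encode_edge(f_e, s_u, s_v, attn_w, enc):
    """Toy encoder: attention over (edge features, two neighborhood
    summaries), then a dense layer `enc` mapping the weighted
    concatenation to the embedding."""
    inputs = [f_e, s_u, s_v]
    # score each input by a dot product with the attention vector
    scores = [sum(w * x for w, x in zip(attn_w, inp)) for inp in inputs]
    alphas = softmax(scores)
    # weight each input by its attention coefficient and concatenate
    weighted = [a * x for a, inp in zip(alphas, inputs) for x in inp]
    # dense layer: matrix-vector product
    return [sum(w * x for w, x in zip(row, weighted)) for row in enc]
```

The attention coefficients sum to one, so they can be read as the share of information each input contributes to the embedding.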
3.4 Aggregation models
For the neighborhood aggregation model, we use an average over the walk vectors, as there is no particular ordering of these vectors (each one was generated by an equally important random walk). In the case of walk aggregation, we propose the following:
average – computes a simple average of the edge attribute vectors in the random walk;
exponential – computes a weighted average, where the weights are exponentials of the negated position in the random walk, so that further away edges are less important than near ones;
GRU – uses a Gated Recurrent Unit (Chung et al., 2014) architecture, where the hidden and input dimensions equal the edge attribute dimension; the aggregated representation is the output of the last hidden vector; the aggregation process starts at the end of the random walk and proceeds to the beginning;
ConcatGRU – similar to the GRU-based aggregator, but here we also use the node feature information by concatenating the node attributes with the edge attributes; hence the GRU input size equals the sum of the edge and node dimensions; in case there are no node features available, one could use network-specific features, like degree or betweenness, or more advanced techniques like Node2vec; the hidden dimension size and the aggregation direction are unchanged.
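The two non-learnable aggregators can be written down directly; a minimal sketch follows (the GRU variants would require a recurrent network and are omitted; function names are illustrative):

```python
import math

def avg_aggregate(walk_feats):
    """Average aggregator: plain mean of the edge attribute vectors
    in one random walk (a list of equal-length vectors)."""
    n = len(walk_feats)
    return [sum(col) / n for col in zip(*walk_feats)]

def exp_aggregate(walk_feats):
    """Exponential aggregator: weighted mean with weight e^{-p} for the
    edge at position p, so edges further from the start count less."""
    weights = [math.exp(-p) for p in range(len(walk_feats))]
    z = sum(weights)
    return [sum(w * x for w, x in zip(weights, col)) / z
            for col in zip(*walk_feats)]

def neighborhood_aggregate(walk_summaries):
    """Neighborhood aggregator: walks are unordered, so a simple mean."""
    return avg_aggregate(walk_summaries)
```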
3.5 Learning AttrE2vec’s parameters
AttrE2vec is designed to make the most of the edge attributes and of the information about the structure of the network. Therefore we propose a loss function that consists of two main parts:
structural loss – computes a cosine embedding loss; this function tries to minimize the cosine distance between a given embedding and the embeddings of edges sampled from the random walks (positive), and simultaneously to maximize the cosine distance between an embedding and the embeddings of edges sampled from the set of all edges in the graph (negative), except for those in the random walks:

$$\mathcal{L}_{struct} = \frac{1}{|B|} \sum_{e \in B} \Big[ \sum_{e^{+}} \big(1 - \cos(\mathbf{h}_e, \mathbf{h}_{e^{+}})\big) + \sum_{e^{-}} \max\big(0, \cos(\mathbf{h}_e, \mathbf{h}_{e^{-}})\big) \Big],$$

where $B$ denotes a minibatch of edges and $|B|$ the minibatch size,
feature reconstruction loss – computes the mean squared error of the actual edge features and the outputs of a decoder (implemented as a 3-layer MLP – see Figure 4) that reconstructs the edge features from the edge embeddings:

$$\mathcal{L}_{feat} = \frac{1}{|B|} \sum_{e \in B} \lVert \mathbf{f}_e - \hat{\mathbf{f}}_e \rVert_2^2,$$

where $B$ denotes a minibatch of edges and $|B|$ the minibatch size.
We combine the values of the above loss functions using a mixing parameter $\lambda$. The higher its value, the more structural information is preserved and the less focus is put on feature reconstruction. The total loss of AttrE2vec is given as follows:

$$\mathcal{L} = \lambda \, \mathcal{L}_{struct} + (1 - \lambda) \, \mathcal{L}_{feat}.$$
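A minimal pure-Python sketch of this two-part objective, written for single vectors rather than minibatches; the function names and the exact hinge used for negatives are illustrative assumptions (the paper's implementation uses PyTorch):

```python
import math

def _cos(a, b):
    """Cosine similarity of two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a))
           * math.sqrt(sum(y * y for y in b)))
    return num / den

def structural_loss(emb, positives, negatives):
    """Pull the embedding towards edges seen on its random walks,
    push it away from randomly sampled edges."""
    pull = sum(1.0 - _cos(emb, p) for p in positives) / len(positives)
    push = sum(max(0.0, _cos(emb, n)) for n in negatives) / len(negatives)
    return pull + push

def feature_loss(features, reconstruction):
    """Mean squared error between edge features and decoder output."""
    return sum((f - r) ** 2
               for f, r in zip(features, reconstruction)) / len(features)

def total_loss(lmbda, emb, pos, neg, feats, recon):
    """Mixing parameter lmbda: higher means more structural influence."""
    return (lmbda * structural_loss(emb, pos, neg)
            + (1 - lmbda) * feature_loss(feats, recon))
```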
To evaluate the proposed model's performance, we perform three tasks: edge classification, edge clustering, and embedding visualization on three real-world datasets. We first train our model on a small subset of edges (inductive setting). Then we use the model to infer embeddings for edges from the test set. Finally, we evaluate them in all downstream tasks: by predicting the class of edges in citation graphs (edge classification), by applying the K-means++ algorithm (edge clustering, as defined in (Bandyopadhyay et al., 2019)), and by the dimensionality reduction method T-SNE (embedding visualization). We compare our model to several baselines and contemporary methods in all experiments (see Table 1). Eventually, we check the influence of AttrE2vec's hyperparameters and perform an ablation study on artificially generated datasets. We implement our model in the popular deep learning framework PyTorch. All experiments were performed on an NVIDIA GTX1080Ti. Upon acceptance in the journal, we will make our code available at https://github.com/attre2vec/attre2vec and include our DVC (Kuprieiev et al., 2020) pipeline so that all experiments can be easily reproduced.
| Name | Node feat. (raw) | Edge feat. (raw) | Node feat. (prep.) | Edge feat. (prep.) | Nodes | Edges | Classes | Train edges | All edges |
|---|---|---|---|---|---|---|---|---|---|
| Cora | 1 433 | 0 | 32 | 260 | 2 485 | 5 069 | 7+1 | 160 | 5 069 |
| Citeseer | 3 703 | 0 | 32 | 260 | 2 110 | 3 668 | 6+1 | 140 | 3 668 |
| Pubmed | 500 | 0 | 32 | 260 | 19 717 | 44 324 | 3+1 | 80 | 44 324 |
To allow comparison with evaluation evidence gathered in the literature, we focus on well-known datasets, namely Cora (Sen et al., 2008), Citeseer (Sen et al., 2008) and Pubmed (Namata et al., 2012). These are citation networks of scientific papers in several research areas, where nodes are the papers and edges denote citations between papers. We summarize basic statistics about the datasets before and after pre-processing in Table 2. The raw datasets contain node features only, in the form of high-dimensional sparse bags of words. For Cora and Citeseer, these are binary vectors showing which of the most popular words were used in a given paper; for Pubmed, the features are TF-IDF vectors. To adjust the datasets to our problem setting, we apply the following pre-processing steps to obtain edge-level features, which are used to train and evaluate our AttrE2vec model:
we create dense vector representations of the nodes’ features by applying Doc2vec (Le and Mikolov, 2014) in the PV-DBOW variant with a target dimension size of 128;
for each edge $(u, v)$ and its symmetrical version $(v, u)$ (necessary to perform uniform, undirected random walks) we extract the following features:
1 feature – cosine similarity of the raw node features of $u$ and $v$ (binary BoW; for Pubmed transformed from TF-IDF to binary BoW),
2 features – the ratios of the number of used words (number of ones in the BoW) to all possible words in the document (length of the BoW vector) for each of the papers $u$ and $v$,
256 features – concatenation of the Doc2vec features of nodes $u$ and $v$,
1 feature – a binary indicator, which denotes whether this is an original edge (1) or its symmetrical counterpart (0),
we apply standardization (StandardScaler in Scikit-Learn (Pedregosa et al., 2011)) of the edge feature matrix.
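The feature assembly steps above (before standardization) can be sketched as follows; the function names are illustrative, and the Doc2vec vectors are assumed to be precomputed 128-dimensional lists:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two (binary) bag-of-words vectors."""
    num = sum(x * y for x, y in zip(a, b))
    da = math.sqrt(sum(x * x for x in a))
    db = math.sqrt(sum(y * y for y in b))
    return num / (da * db) if da and db else 0.0

def edge_features(bow_u, bow_v, d2v_u, d2v_v, original):
    """Assemble the 260-dim edge feature vector described above:
    1 BoW cosine similarity + 2 word-usage ratios
    + 2 x 128 Doc2vec vectors + 1 original/symmetric indicator."""
    ratio_u = sum(bow_u) / len(bow_u)
    ratio_v = sum(bow_v) / len(bow_v)
    return ([cosine_sim(bow_u, bow_v), ratio_u, ratio_v]
            + list(d2v_u) + list(d2v_v)
            + [1.0 if original else 0.0])
```

With 128-dimensional Doc2vec vectors, this yields exactly 3 + 128 + 128 + 1 = 260 features per edge.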
Moreover, we extracted new node features as 32-dimensional Node2vec embeddings to enable the evaluation of one of our model versions (AttrE2vec with the ConcatGRU aggregator), which generalizes upon both edge and node attributes.
In the raw datasets, each node is labeled with the research area the paper comes from. To apply this knowledge in the edge classification setting, we use the following rule: if an edge connects two nodes of the same class (research area), the edge receives this class; if the two nodes have different classes, the edge between them is assigned a cross-domain citation class.
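The labeling rule is a one-liner; the cross-domain label string is an illustrative placeholder:

```python
def edge_label(class_u, class_v, cross_label="cross-domain"):
    """An edge inherits its endpoints' class when they agree;
    otherwise it becomes a cross-domain citation."""
    return class_u if class_u == class_v else cross_label
```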
To ensure a fair comparison, we follow the dataset preparation scheme from EP-B (García-Durán and Niepert, 2017), i.e., for each dataset (Cora, Citeseer, Pubmed) we sample 10 train/validation/test sets, where the train set consists of 20 edges per class and the validation and test sets contain 1 000 randomly chosen edges each. When reporting the resulting metrics, we show the mean values over these ten sampled sets (together with the standard deviation).
We compare our method against several baseline methods. In the simplest case, we use the edge features obtained during the pre-processing phase for all datasets (further referred to as Doc2vec).
Many standard approaches employ simple node embedding transformations to obtain edge embeddings. The authors of Node2vec (Grover and Leskovec, 2016b) proposed binary operators like averaging, the Hadamard product, or L1 and L2 norms of vector differences. Here, we use the following methods to obtain node embeddings: DeepWalk (Perozzi et al., 2014), Node2vec (Grover and Leskovec, 2016b), SDNE (Wang et al., 2016a) and Struc2vec (Ribeiro et al., 2017). In preliminary experiments, we evaluated these methods and checked that the Average operator and an embedding size of 64 give the best results. We use these models in 2 setups: (a) Avg(u,v) – using only the averaged node embeddings, (b) Avg(u,v,e) – like previously, but concatenated with the edge features from the dataset (in total 324-dim vectors).
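The Node2vec-style binary operators and the concatenated variant can be sketched as follows (function names are illustrative; dimensions follow the setups above):

```python
def avg_op(zu, zv):
    """Average operator over two node embedding vectors."""
    return [(a + b) / 2 for a, b in zip(zu, zv)]

def hadamard_op(zu, zv):
    """Element-wise (Hadamard) product."""
    return [a * b for a, b in zip(zu, zv)]

def l1_op(zu, zv):
    """Element-wise L1 distance (absolute difference)."""
    return [abs(a - b) for a, b in zip(zu, zv)]

def l2_op(zu, zv):
    """Element-wise squared difference."""
    return [(a - b) ** 2 for a, b in zip(zu, zv)]

def avg_with_edge_features(zu, zv, f_e):
    """64-dim averaged node embeddings concatenated with the 260-dim
    edge features give the 324-dim baseline vectors."""
    return avg_op(zu, zv) + list(f_e)
```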
We also checked a scheme that computes a 64-dim PCA reduction of the concatenated features, to have vector sizes comparable with the 64-dimensional embeddings of our model, but it turned out to perform poorly. Note that SDNE is capable of inductive reasoning, but due to the unavailability of such an implementation, we decided to evaluate this method in the transductive scheme (which works in favor of the method).
We also extend our body of baselines with more sophisticated approaches – two dense autoencoder architectures. In the first setting, MLP(u,v), we train a model (see Figure 5) that reconstructs the concatenated embeddings of connected nodes. In the second baseline, MLP(u,v,e), the autoencoder (see Figure 6) is extended with edge attributes. In both settings, we employ the mean squared error as the model loss function. The output of the encoders (embeddings) is used in the downstream tasks. The input node embeddings are obtained using the methods mentioned above, i.e., DeepWalk, Node2vec, SDNE, and Struc2vec.
The last baseline is Line2vec (Bandyopadhyay et al., 2019), which is directly dedicated to edges; we use an embedding size of 64.
| Method | Variant | Dim | Citeseer | Cora | Pubmed |
|---|---|---|---|---|---|
| Edge features only (Doc2vec) | – | 260 | 86.13 ± 0.95 | 88.67 ± 0.51 | 79.15 ± 1.41 |
| Line2vec | – | 64 | 86.19 ± 0.28 | 91.75 ± 1.07 | 84.88 ± 1.19 |
| Avg(u,v) | DeepWalk | 64 | 58.40 ± 1.08 | 59.98 ± 1.32 | 51.04 ± 1.23 |
| Avg(u,v) | Node2vec | 64 | 58.26 ± 0.89 | 59.59 ± 1.11 | 51.03 ± 1.01 |
| Avg(u,v) | SDNE | 64 | 54.28 ± 1.57 | 55.91 ± 1.11 | 50.00 ± 0.00 |
| Avg(u,v) | Struc2vec | 64 | 61.29 ± 0.86 | 61.30 ± 1.58 | 54.67 ± 1.46 |
| MLP(u,v) | DeepWalk | 64 | 55.88 ± 1.68 | 57.87 ± 1.53 | 51.23 ± 0.77 |
| MLP(u,v) | Node2vec | 64 | 55.35 ± 2.26 | 57.44 ± 0.87 | 51.48 ± 1.55 |
| MLP(u,v) | SDNE | 64 | 55.56 ± 0.93 | 56.02 ± 1.22 | 50.00 ± 0.00 |
| MLP(u,v) | Struc2vec | 64 | 59.93 ± 1.43 | 59.76 ± 1.80 | 53.27 ± 1.32 |
| Avg(u,v,e) | DeepWalk | 324 | 86.13 ± 0.95 | 88.67 ± 0.51 | 79.15 ± 1.41 |
| Avg(u,v,e) | Node2vec | 324 | 86.13 ± 0.95 | 88.67 ± 0.51 | 79.15 ± 1.41 |
| Avg(u,v,e) | SDNE | 324 | 86.14 ± 1.03 | 88.70 ± 0.51 | 79.15 ± 1.41 |
| Avg(u,v,e) | Struc2vec | 324 | 86.21 ± 0.97 | 88.73 ± 0.48 | 79.24 ± 1.36 |
| MLP(u,v,e) | DeepWalk | 64 | 84.58 ± 1.11 | 86.47 ± 0.87 | 78.60 ± 1.84 |
| MLP(u,v,e) | Node2vec | 64 | 84.65 ± 1.05 | 86.71 ± 0.68 | 78.84 ± 1.71 |
| MLP(u,v,e) | SDNE | 64 | 84.32 ± 1.13 | 85.99 ± 0.77 | 78.34 ± 1.07 |
| MLP(u,v,e) | Struc2vec | 64 | 83.95 ± 1.16 | 85.54 ± 0.96 | 77.19 ± 1.42 |
| Avg(u,v) | GraphSage | 64 | 54.84 ± 1.90 | 55.16 ± 1.36 | 51.14 ± 1.64 |
| MLP(u,v) | GraphSage | 64 | 55.19 ± 1.04 | 55.47 ± 1.66 | 50.36 ± 1.54 |
| Avg(u,v,e) | GraphSage | 324 | 86.14 ± 0.95 | 88.68 ± 0.51 | 79.16 ± 1.41 |
| MLP(u,v,e) | GraphSage | 64 | 84.63 ± 1.11 | 86.14 ± 0.45 | 78.00 ± 1.85 |
| AttrE2vec (our) | Exp | 64 | 88.91 ± 1.10 | 92.80 ± 0.38 | 86.18 ± 1.41 |
| AttrE2vec (our) | ConcatGRU | 64 | 88.56 ± 1.34 | 92.93 ± 0.61 | 86.34 ± 1.18 |
4.3 Edge classification
To evaluate our model in an inductive setting, we need to make sure that the test edges are unseen during the model training procedure, so we remove them from the graph. Note that all baselines (except for GraphSage, see Table 1) require all edges during the training phase (i.e., they are transductive methods).
After each training epoch of AttrE2vec, we evaluate the embeddings using an L2-regularized Logistic Regression (LR) classifier and compute the AUC. The regression model is trained on edge embeddings from the train set and evaluated on edge embeddings from the validation set. We select the model with the highest AUC value on the validation set. Moreover, an early stopping strategy is implemented – if the validation AUC metric does not improve for more than 15 epochs, the learning is terminated. Our approach to model selection is aligned with the scheme proposed in (Nguyen et al., 2020), as it is more natural than relying on the loss function. This is repeated for all 10 data splits (see Section 4.1 for details). We report the mean and standard deviation of AUC over the 10 test sets (see Table 3).
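The AUC-based model selection with early stopping can be sketched as a small training driver; `fit_epoch` and `validate_auc` are hypothetical callbacks (one training epoch returning the model state, and a validation scorer), not functions from the paper's code:

```python
def train_with_early_stopping(fit_epoch, validate_auc,
                              patience=15, max_epochs=200):
    """Keep the model state with the best validation AUC; stop when the
    metric has not improved for more than `patience` epochs."""
    best_auc, best_state, since_best = -1.0, None, 0
    for epoch in range(max_epochs):
        state = fit_epoch(epoch)
        auc = validate_auc(state)
        if auc > best_auc:
            best_auc, best_state, since_best = auc, state, 0
        else:
            since_best += 1
            if since_best > patience:
                break  # validation AUC stalled
    return best_state, best_auc
```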
We choose AdamW (Loshchilov and Hutter, 2017) with a fixed learning rate to optimize our model's parameters, and we fix the numbers of positive and negative samples in the cosine embedding loss. The mixing coefficient is set so as to include the influence of the features and of the topological graph structure equally. We choose an embedding size of 64 as a reasonable value when dealing with edge features of size 260.
In Table 3, we summarize the AUC values for the baseline methods and for our model. Even though the original dimensionality of the vectors is relatively high (260), good results are already obtained using the edge features alone (Doc2vec). However, adding structural information about the graph can further improve the results.
Representations from node embedding methods, transformed to edge embeddings using the average operator Avg(u,v), achieve poor results of about 50-60% AUC. However, when combined with the edge features from the datasets, Avg(u,v,e), the AUC values increase significantly, to about 86%, 88% and 79% for Citeseer, Cora, and Pubmed, respectively. Unfortunately, this results in an even higher vector dimensionality (324).
The MLP-based approaches lead to similar conclusions. Using only node embeddings, MLP(u,v), we achieve rather poor results of about 50% (on Pubmed) up to 60% (on Cora). With the MLP(u,v,e) approach, we observe that edge features improve the classification results. The AUC values are still slightly worse than for the concatenation operator Avg(u,v,e), but the edge embedding size is reduced to 64.
The Line2vec (Bandyopadhyay et al., 2019) algorithm achieves very good results without considering edge feature information – about 86%, 92% and 85% AUC for Citeseer, Cora, and Pubmed, respectively. These values are higher than for any other baseline approach.
Our model performs best among all evaluated methods. For Citeseer, we gain about 3 percentage points compared to the best baselines: Line2vec, Struc2vec (Avg(u,v,e)) and GraphSage (Avg(u,v,e)). Note that the algorithm is trained on only 140 edges in the inductive setting, whereas all transductive baselines require the whole graph for training. The gain on Cora is 2pp, and on Pubmed we achieve up to 4pp (and up to 8pp compared to GraphSage (Avg(u,v,e))). Our model with the Average (Avg) aggregator works best, whereas the Gated Recurrent Unit (GRU) aggregator achieves the second-best results.
4.4 Edge clustering
Similarly to Line2vec (Bandyopadhyay et al., 2019), we apply the K-Means++ algorithm to the resulting embedding vectors and compute the unsupervised clustering accuracy (Xie et al., 2016). We summarize the results in Table 4. Our model performs best in all but one case and achieves significantly better results than the other baseline methods. The only exception is the Pubmed dataset, where Line2vec achieves the best clustering accuracy. The other baseline methods perform similarly as in the edge classification task; hence we do not discuss the details and encourage the reader to go through the results.
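The unsupervised clustering accuracy used above matches predicted cluster ids to true labels by the best one-to-one mapping. A minimal sketch follows; it brute-forces the mapping (fine for the handful of classes here, where a Hungarian matching would be used at scale), and assumes there are no more clusters than labels:

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one mappings from predicted
    cluster ids to true labels (brute force over permutations)."""
    labels = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    best = 0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for c, t in zip(y_pred, y_true))
        best = max(best, hits)
    return best / len(y_true)
```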
| Method | Variant | Dim | Citeseer | Cora | Pubmed |
|---|---|---|---|---|---|
| Edge features only (Doc2vec) | – | 260 | 54.13 ± 2.73 | 54.64 ± 5.86 | 46.33 ± 1.53 |
| Line2vec | – | 64 | 54.73 ± 2.56 | | 63.50 ± 1.92 |
| Avg(u,v) | DeepWalk | 64 | 28.89 ± 1.06 | 21.93 ± 0.86 | 27.24 ± 0.50 |
| Avg(u,v) | Node2vec | 64 | 26.82 ± 0.67 | 21.32 ± 0.62 | 27.17 ± 0.74 |
| Avg(u,v) | SDNE | 64 | 21.01 ± 0.50 | 17.97 ± 0.47 | 31.38 ± 0.69 |
| Avg(u,v) | Struc2vec | 64 | 25.21 ± 1.33 | 20.15 ± 0.64 | 32.02 ± 1.49 |
| MLP(u,v) | DeepWalk | 64 | 26.36 ± 1.37 | 21.06 ± 0.57 | 27.40 ± 0.93 |
| MLP(u,v) | Node2vec | 64 | 26.37 ± 1.64 | 21.31 ± 0.98 | 27.67 ± 0.78 |
| MLP(u,v) | SDNE | 64 | 22.27 ± 0.76 | 17.15 ± 0.36 | 28.44 ± 1.21 |
| MLP(u,v) | Struc2vec | 64 | 24.22 ± 0.83 | 19.56 ± 0.49 | 31.31 ± 1.70 |
| Avg(u,v,e) | DeepWalk | 324 | 54.13 ± 2.73 | 54.70 ± 5.85 | 46.33 ± 1.53 |
| Avg(u,v,e) | Node2vec | 324 | 54.13 ± 2.73 | 54.70 ± 5.85 | 46.33 ± 1.53 |
| Avg(u,v,e) | SDNE | 324 | 55.29 ± 2.06 | 55.43 ± 4.63 | 46.33 ± 1.53 |
| Avg(u,v,e) | Struc2vec | 324 | 55.59 ± 1.51 | 52.47 ± 6.52 | 46.32 ± 1.29 |
| MLP(u,v,e) | DeepWalk | 64 | 48.74 ± 4.03 | 47.38 ± 4.72 | 46.49 ± 1.20 |
| MLP(u,v,e) | Node2vec | 64 | 50.80 ± 2.30 | 48.48 ± 3.38 | 46.15 ± 1.43 |
| MLP(u,v,e) | SDNE | 64 | 46.17 ± 3.15 | 44.87 ± 3.54 | 45.74 ± 1.89 |
| MLP(u,v,e) | Struc2vec | 64 | 47.35 ± 3.73 | 44.38 ± 3.04 | 45.40 ± 1.72 |
| Avg(u,v) | GraphSage | 64 | 18.79 ± 0.62 | 17.70 ± 1.05 | 27.04 ± 0.71 |
| MLP(u,v) | GraphSage | 64 | 18.92 ± 0.98 | 17.89 ± 0.85 | 27.09 ± 0.81 |
| Avg(u,v,e) | GraphSage | 324 | 54.06 ± 2.54 | 54.82 ± 6.86 | 46.49 ± 1.64 |
| MLP(u,v,e) | GraphSage | 64 | 48.79 ± 4.04 | 47.49 ± 5.41 | 45.15 ± 1.54 |
| AttrE2vec (our) | Avg | 64 | 59.82 ± 3.30 | 65.42 ± 1.71 | 48.86 ± 2.46 |
| AttrE2vec (our) | Exp | 64 | 59.07 ± 4.65 | | 48.02 ± 2.55 |
4.5 Embedding visualization
For all tested baseline methods and our proposed AttrE2vec method, we compute 2-dimensional projections of the produced embeddings using the T-SNE method (van der Maaten and Hinton, 2008) and visualize them in Figure 7. In our subjective opinion, these plots correspond to the AUC scores reported in Table 3 – the higher the AUC, the better the group separation. In detail, the raw edge features (Doc2vec) seem to form groups, but these unfortunately overlap to some degree. We cannot observe any pattern in the node-embedding-based settings (Avg(u,v) and MLP(u,v)); they appear quasi-random. When concatenated with the edge attributes (Avg(u,v,e) and MLP(u,v,e)), we observe a slightly better grouping, which is still not satisfying. The AttrE2vec model produces much better-formed groups, with only a little overlap. To summarize, based on the observed group separability and the AUC metrics, our approach works best among all methods.
5 Hyperparameter Sensitivity of AttrE2vec
We investigate the effect of the hyperparameters by considering each of them independently, i.e., setting a given parameter and preserving the default values of all other parameters. The evaluation is applied to two inductive variants of our model: with the Average aggregator and with the GRU aggregator. We use all three datasets (Cora, Citeseer, Pubmed) and report the AUC values. We choose the following hyperparameter value sets (values with an asterisk denote the default value for that parameter):
length of random walk: ,
number of random walks: ,
embedding size: ,
mixing parameter: .
The results of all experiments are summarized in Figure 8. We observe that for both aggregation variants, Avg and GRU, the trends are similar, so we discuss them based only on the Average aggregator.
In general, the higher the number of random walks and the length of a single random walk, the better the results. Even higher values of these parameters might help further, but they significantly increase both the random-walk computation time and the model training time.
Unsurprisingly, the embedding size (embedding dimension) follows the same trend: with more dimensions, we can fit more information into the created representations. However, since the goal of embedding is to find low-dimensional vector representations, the dimensionality should be kept reasonable. Our chosen values (16, 32, 64) seem plausible when working with 260-dimensional edge features.
As for the loss mixing parameter, we observe that too high values negatively influence model performance. The greater the value, the more important the structural loss becomes, while the feature loss simultaneously becomes less relevant; at the opposite extreme, the loss function considers feature reconstruction only and completely ignores the embedding loss. This yields significantly worse results and confirms that our approach of combining feature reconstruction with the structural embedding loss is justified. In general, the best values are achieved when both loss factors have equal influence.
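A minimal sketch of such a mixed loss, assuming a convex combination in which the mixing parameter weights the structural term (the direction of the weighting is our assumption for illustration; the exact formulation is given in the paper):

```python
def mixed_loss(structural_loss: float, feature_loss: float, lam: float) -> float:
    """Convex combination of the two loss terms.

    lam = 1.0 -> purely structural loss,
    lam = 0.0 -> purely feature (reconstruction) loss,
    lam = 0.5 -> equal influence of both factors.
    NOTE: which extreme maps to which term is an assumption here.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("mixing parameter must lie in [0, 1]")
    return lam * structural_loss + (1.0 - lam) * feature_loss
```

Under this convention, either extreme discards one signal entirely, which matches the observation that intermediate values perform best.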
6 Ablation study
We performed an ablation study to check whether our method AttrE2vec is robust to noise introduced into an artificially generated network. We use a barbell graph, which consists of two fully connected graphs joined by a path (see Figure 1). The graph has seven nodes in each fully connected part and seven nodes in the path, for a total of 50 edges. Next, we generate features from 3 clusters in a 200-dimensional space using isotropic Gaussian blobs. We assign the features to the 3 parts of the graph: the first cluster to the edges of one fully connected part, the second to the edges of the path, and the third to the edges of the other fully connected part. The edge classes match the feature clusters (i.e., three classes). The structure is therefore aligned with the features, so any good structure-based embedding method can fit this data very well (see Figure 1). A problem occurs when the features (and hence the classes) are shuffled within the graph structure: methods that employ only a structural loss function will fail. We want to check how our model AttrE2vec, which includes both structural and feature-based losses, performs under different amounts of such noise.
We use the graph described above and introduce noise by shuffling edge pairs drawn from different classes, i.e., an edge with class 2 (originally located in the path) may be swapped with one from the fully connected parts (classes 1 or 3). We use our AttrE2vec model with an Average aggregator in the transductive setting (due to the graph size) and report the edge classification AUC for different values of the shuffling probability and the mixing parameter. The values of the mixing parameter allow us to check how the model behaves when working only with a feature-based loss, only with a structural loss, and with both losses at equal importance. We train our model for five epochs and, due to the randomness of the shuffling procedure, repeat the computations ten times for every parameter pair. We report the mean and standard deviation of the AUC values in Figure 9.
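The construction of the noisy barbell graph can be sketched with the standard library alone. The shuffling routine below is a simplified stand-in for the paper's procedure (it swaps class labels of cross-class edge pairs with probability `p`), and feature vectors are omitted for brevity:

```python
import random
from itertools import combinations

def barbell_edges(clique_size=7, path_len=7):
    """Two fully connected graphs of `clique_size` nodes joined by a
    path of `path_len` nodes; returns (edge, class) pairs."""
    left = list(range(clique_size))
    path = list(range(clique_size, clique_size + path_len))
    right = list(range(clique_size + path_len, 2 * clique_size + path_len))
    edges = [(e, 1) for e in combinations(left, 2)]           # class 1
    chain = [left[-1]] + path + [right[0]]
    edges += [((a, b), 2) for a, b in zip(chain, chain[1:])]  # class 2
    edges += [(e, 3) for e in combinations(right, 2)]         # class 3
    return edges

def shuffle_features(edges, p, seed=0):
    """Simplified noise model: with probability p, swap the classes
    (and, in the full setting, features) of each cross-class edge pair."""
    rng = random.Random(seed)
    labels = [c for _, c in edges]
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] != labels[j] and rng.random() < p:
            labels[i], labels[j] = labels[j], labels[i]
    return [(e, c) for (e, _), c in zip(edges, labels)]

edges = barbell_edges()
assert len(edges) == 50  # 2 * C(7,2) clique edges + 8 path edges
```

Shuffling permutes labels among edges while leaving the graph structure untouched, which is exactly the mismatch a purely structural loss cannot detect.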
Using only the feature loss or a combination of both losses allows us to achieve nearly 100% AUC in the classification task; the small fluctuations appear due to the low number of training epochs and local optima. The performance of the model that uses only the structural loss decreases with higher shuffling probabilities and, from a certain point, starts improving slightly, because shuffling then amounts to a complete swap of two classes, i.e., all features and classes from one part of the graph are exchanged with those from another part.
We also demonstrate how our method reacts to noisy data for various values of the mixing parameter. We consider two graphs: one where the features are aligned with the substructures of the graph, and a second with shuffled features (ca. 50%); see Figure 10. With a balanced mixing parameter, AttrE2vec represents even the noisy graph faithfully.
7 Conclusions and future work
We introduce AttrE2vec – a novel unsupervised and inductive embedding model that learns attributed edge embeddings by leveraging a self-attention network with an auto-encoder over the attribute space and a structural loss on aggregated random walks. AttrE2vec can directly aggregate feature information from edges and nodes many hops away to infer embeddings not only for present nodes but also for new ones. Extensive experimental results show that AttrE2vec achieves state-of-the-art results in edge classification and clustering on Cora, Pubmed and Citeseer.
Acknowledgments
The work was partially supported by the National Science Centre, Poland, grants No. 2016/21/D/ST6/02948 and 2016/23/B/ST6/01735, as well as by the Department of Computational Intelligence, Wrocław University of Science and Technology statutory funds.
References
- Edge classification in networks. In 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016, pp. 1038–1049. Cited by: Table 1, §2.
- Joint auto-weighted graph fusion and scalable semi-supervised learning. Information Fusion 66, pp. 213–228. Cited by: §1.
- Beyond node embedding: a direct unsupervised edge representation framework for homogeneous networks. Cited by: Table 1, §1, §2, §4.2, §4.3, §4.4, §4.
- FSCNMF: fusing structure and content via non-negative matrix factorization for embedding information networks. Cited by: Table 1, §1, §2.
- Neural graph learning: training neural networks using graphs. pp. 64–71. Cited by: Table 1, §2.
- Machine learning on graphs: a model and comprehensive taxonomy. Cited by: §1, §2.
- Relation constrained attributed network embedding. Information Sciences 515, pp. 341–351. Cited by: §1.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. Cited by: item 3.
- Metapath2vec: scalable representation learning for heterogeneous networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144. Cited by: §1.
- Deep attributed network embedding. In IJCAI International Joint Conference on Artificial Intelligence, pp. 3364–3370. Cited by: Table 1, §1, §2.
- Learning graph representations with embedding propagation. In Advances in Neural Information Processing Systems, pp. 5120–5131. Cited by: Table 1, §1, §2, §4.1.
- Exploiting edge features for graph neural networks. pp. 9203–9211. Cited by: Table 1, §1, §2.
- Node2vec: scalable feature learning for networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. Cited by: Table 1, §1, §2.
- Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. Cited by: §3.1, §4.2.
- Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035. Cited by: Table 1, §1, §2.
- Open graph benchmark: datasets for machine learning on graphs. Cited by: §1.
- Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 11–20. Cited by: Table 1, §1, §2.
- Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR 2017), pp. 1–14. Cited by: Table 1, §1, §2.
- DVC: data version control – git for data & models. Zenodo. Cited by: §4.
- Improving network embedding with partially available vertex and edge content. Information Sciences 512, pp. 935–951. Cited by: §1.
- Distributed representations of sentences and documents. In 31st International Conference on Machine Learning (ICML 2014), pp. 2931–2939. Cited by: item 1.
- Multi-source information fusion based heterogeneous network embedding. Information Sciences 534, pp. 53–71. Cited by: §1.
- Network representation learning: a systematic literature review. Neural Computing and Applications 32 (21), pp. 16647–16679. Cited by: §1, §2.
- Graph representation learning with encoding edges. Neurocomputing 361, pp. 29–39. Cited by: Table 1, §1, §2.
- AHNG: representation learning on attributed heterogeneous network. Information Fusion 50, pp. 221–230. Cited by: §1.
- Decoupled weight decay regularization. Cited by: §4.3.
- Query-driven active surveying for collective classification. In Proceedings of the Workshop on Mining and Learning with Graphs, Edinburgh, Scotland, UK, pp. 1–8. Cited by: §4.1.
- A self-attention network based node embedding model. Cited by: §4.3.
- CAGE: constrained deep attributed graph embedding. Information Sciences 518, pp. 56–70. Cited by: §1.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: item 3.
- DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pp. 701–710. Cited by: Table 1, §1, §2, §4.2.
- Struc2vec: learning node representations from structural identity. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394. Cited by: Table 1, §2, §4.2.
- Collective classification in network data. AI Magazine 29 (3), pp. 93. Cited by: §4.1.
- Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 29–38. Cited by: Table 1, §2.
- LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (WWW 2015), pp. 1067–1077. Cited by: Table 1.
- Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: §4.5.
- Graph attention networks. In 6th International Conference on Learning Representations (ICLR 2018), pp. 1–12. Cited by: Table 1, §1, §2.
- Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pp. 1225–1234. Cited by: Table 1, §1, §2, §4.2.
- COVID-19 classification by FGCNet with deep feature fusion from graph convolutional network and convolutional neural network. Information Fusion 67, pp. 208–229. Cited by: §1.
- Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics 38 (5), pp. 146. Cited by: Table 1, §2.
- Attribute2vec: deep network embedding through multi-filtering GCN. Cited by: Table 1, §2.
- A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21. Cited by: §1, §2.
- Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning, PMLR 48, pp. 478–487. Cited by: §4.4.
- Network representation learning with rich text information. In IJCAI International Joint Conference on Artificial Intelligence, pp. 2111–2117. Cited by: Table 1, §1.
- Heterogeneous graph neural network. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 793–803. Cited by: §1.
- Network representation learning: a survey. IEEE Transactions on Big Data 6 (1), pp. 3–28. Cited by: §1, §2.