1 Introduction
Complex networks, including attributed and heterogeneous networks, are ubiquitous — from recommender systems to citation networks and biological systems (Hu et al., 2020). These networks give rise to a multitude of machine learning problem statements, including node classification, link prediction, and community detection. A fundamental aspect of any such machine learning (ML) task, transductive or inductive, is the availability of featurized data. Traditionally, researchers have identified several network characteristics suited to specific ML tasks and used them as inputs to the learning algorithm. This practice is arduous, as it often entails customization to each specific ML task, and it is limited to characteristics that can actually be computed.
This has led to a surge in (unsupervised) algorithms and methods that learn embeddings from networks, such that these embeddings form the featurized representation of the network for ML tasks (Zhang et al., 2018; Wu et al., 2019; Li and Pi, 2020; Chami et al., 2020; Bahrami et al., 2021). This area of research is generally referred to as representation learning on networks. The embeddings generated by representation learning methods are typically agnostic to the end use case, as they are produced in an unsupervised fashion. Traditionally, the focus was on representation learning on homogeneous networks, i.e., networks that have a single type of nodes and edges and do not have attributes attached to nodes or edges (Li and Pi, 2020).
Existing representation learning models mainly focus on transductive learning, where a model can only be trained using the entire input graph. This means that the model requires all the nodes and a fixed network structure in the training phase; examples include Node2vec (Grover and Leskovec, 2016a), DeepWalk (Perozzi et al., 2014) and, to some extent, GCN (Kipf and Welling, 2017). Besides, there have been methods focused on heterogeneous networks that incorporate differently typed nodes and edges, as well as content at each node (Dong et al., 2017; Wang et al., 2021).
On the other hand, a less explored and exploited approach is the inductive setting. In this approach, only a part of the network is used to train the model to infer embeddings for new nodes. Several attempts have been made in the inductive setting, including EPB (García-Durán and Niepert, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), SDNE (Wang et al., 2016b), TADW (Yang et al., 2015), AHNG (Liu et al., 2019) and PVECB (Lan et al., 2020). There is also recent progress on heterogeneous graph embedding, e.g., MIFHNE (Li et al., 2020) or models based on graph neural networks (Zhang et al., 2019).
State-of-the-art network embedding techniques are mostly unsupervised, i.e., they aim at learning low-dimensional representations that preserve the structure of an input graph, e.g., GraphSAGE (Hamilton et al., 2017), DANE (Gao and Huang, 2018), line2vec (Bandyopadhyay et al., 2019), RCAN (Chen and Qian, 2020). Nevertheless, semi-supervised or supervised methods can also learn vector representations, but only for a specific downstream prediction task, e.g., TADW (Yang et al., 2015) or FSCNMF (Bandyopadhyay et al., 2018). Hence, the literature shows that not much supervision is required to learn useful embeddings.
In recent years, proposed models have mainly focused on graphs that do not contain attributes related to nodes and edges (Li and Pi, 2020). This is especially noticeable for edge attributes. The majority of proposed approaches consider node attributes only, omitting the richness of the edge feature space while learning the representation. Nevertheless, models such as DANE (Gao and Huang, 2018), GraphSAGE (Hamilton et al., 2017), SDNE (Wang et al., 2016b) or CAGE (Nozza et al., 2020), which make use of node features, and EGNN (Kim et al., 2019), NEWEE (Li et al., 2019), EGAT (Gong and Cheng, 2019), which consume edge attributes, have been successfully introduced.
Method  Representation (Nodes / Edges)  Attributed (Nodes / Edges)  Reasoning (Transduct. / Induct.)  Family
Supervised 
ECN (Aggarwal et al., 2016) (2016)  ✓  ✓  neigh. aggr.  
GCN (Kipf and Welling, 2017) (2017)  ✓  ✓  ✓  ✓  GCN/GNN  
ECC (Simonovsky and Komodakis, 2017) (2017)  ✓  ✓  ✓  GCN, DL  
FSCNMF (Bandyopadhyay et al., 2018) (2018)  ✓  ✓  ✓  GCN  
GAT (Veličković et al., 2018) (2018)  ✓  ✓  ✓  ✓  AE, DL  
Planetoid (Bui et al., 2018) (2018)  ✓  ✓  ✓  ✓  GNN  
EGNN (Kim et al., 2019) (2019)  ✓  ✓  ✓  ✓  ✓  ✓  GNN  
EdgeConv (Wang et al., 2019) (2019)  ✓  ✓  GNN  
EGAT (Gong and Cheng, 2019) (2019)  ✓  ✓  ✓  ✓  ✓  ✓  GNN  
Attribute2vec (Wanyan et al., 2020) (2020)  ✓  ✓  ✓  GCN  
Unsupervised 
DeepWalk (Perozzi et al., 2014) (2014)  ✓  ✓  RW, skip-gram
TADW (Yang et al., 2015) (2015)  ✓  ✓  ✓  RW, MF
LINE (Tang et al., 2015) (2015)  ✓  ✓  RW, skip-gram
Node2vec (Grover and Leskovec, 2016a) (2016)  ✓  ✓  RW, skip-gram
SDNE (Wang et al., 2016b) (2016)  ✓  ✓  ✓  ✓  AE
GraphSAGE (Hamilton et al., 2017) (2017)  ✓  ✓  ✓  ✓  RW
EPB (García-Durán and Niepert, 2017) (2017)  ✓  ✓  ✓  ✓  AE
Struc2vec (Ribeiro et al., 2017) (2017)  ✓  ✓  RW, skip-gram
DANE (Gao and Huang, 2018) (2018)  ✓  ✓  ✓  ✓  AE
Line2vec (Bandyopadhyay et al., 2019) (2019)  ✓  ✓  RW, skip-gram
NEWEE (Li et al., 2019) (2019)  ✓  ✓  ✓  ✓  RW, skip-gram
AttrE2vec (2020)  ✓  ✓  ✓  ✓  ✓  RW, AE, DL
Neither node-based embedding methods nor graph-neural-network-inspired methods generalize effectively to both transductive and inductive settings, especially when attributes are associated with edges. This work is motivated by the idea of unsupervised learning on networks with attributed edges, such that the embeddings are generalizable across tasks and inductive.
To that end, we develop AttrE2vec, a novel unsupervised learning model that combines an autoencoder and a self-attention network with a feature reconstruction loss and a graph structural loss. To learn an edge representation, AttrE2vec splits the edge neighborhood into two parts, one for each endpoint of the edge, and then generates random edge walks in both neighborhoods. The walks are then aggregated over the node and edge attributes using one of the proposed strategies (Avg, Exp, GRU, ConcatGRU). These aggregates are combined with the original node and edge features and fed to an attention and a dense layer to encode the edge. The embeddings are subsequently inferred via a two-part loss function, covering both feature reconstruction and graph structure. As a consequence, AttrE2vec can explicitly incorporate feature information from nodes and edges many hops away to effectively produce plausible edge embeddings in the inductive setting.

In summary, our main contributions are as follows:

we propose AttrE2vec, a novel unsupervised method that learns low-dimensional vector representations for attributed edges;

we exploit the concept of graph-topology-driven edge feature aggregation, ranging from simple strategies to learnable GRU-based ones, which captures edge topological proximity and the similarity of edge features;

the proposed method is inductive and allows obtaining representations for edges not present in the training phase;

we conduct various experiments and show that AttrE2vec outperforms all baseline methods on edge classification and clustering tasks.
2 Related work and Research Gap
Embedding information networks has received significant interest from the research community. We refer the readers to the survey articles for a comprehensive overview of network embedding (Li and Pi, 2020; Chami et al., 2020; Wu et al., 2019; Zhang et al., 2018) and cite only some of the most prominent works that are relevant.
Unsupervised network embedding methods use only the network structure or the original attributes of nodes and edges to construct embeddings. The most common method is DeepWalk (Perozzi et al., 2014), which in two phases constructs node neighborhoods by performing fixed-length random walks and employs the skip-gram (Grover and Leskovec, 2016a) model to preserve the co-occurrences between nodes and their neighbors. This two-phase framework later inspired other methods for learning network embeddings, which propose different strategies for constructing node neighborhoods or modeling co-occurrences between nodes, e.g., node2vec (Grover and Leskovec, 2016a), Struc2vec (Ribeiro et al., 2017), GraphSAGE (Hamilton et al., 2017), line2vec (Bandyopadhyay et al., 2019) or NEWEE (Li et al., 2019). Another group of unsupervised methods utilizes autoencoders or graph neural networks to obtain embeddings. SDNE (Wang et al., 2016b) uses an autoencoder architecture to preserve first- and second-order proximities by jointly optimizing the neighborhood reconstruction loss. Other autoencoder-based representatives are EPB (García-Durán and Niepert, 2017) and DANE (Gao and Huang, 2018).
Supervised network embedding methods are constructed as end-to-end methods for particular tasks like node classification or link prediction. These methods require the network structure, the attributes of nodes and edges (if the method can use them), and some annotated target, like a node class. The representatives are ECN (Aggarwal et al., 2016), ECC (Simonovsky and Komodakis, 2017), FSCNMF (Bandyopadhyay et al., 2018), GAT (Veličković et al., 2018), Planetoid (Bui et al., 2018), EGNN (Kim et al., 2019), GCN (Kipf and Welling, 2017), EdgeConv (Wang et al., 2019), EGAT (Gong and Cheng, 2019), and Attribute2vec (Wanyan et al., 2020).
Edge representation learning has already been tackled by several methods, e.g., ECN (Aggarwal et al., 2016), EGNN (Kim et al., 2019), line2vec (Bandyopadhyay et al., 2019), EdgeConv (Wang et al., 2019), EGAT (Gong and Cheng, 2019). However, none of these methods is able to directly take edge attributes into account while also performing the learning in an unsupervised manner.
All the characteristics of the representative node and edge representation learning methods are grouped in Table 1.
3 Method
3.1 Motivation
In the following paragraphs, we explain our threefold motivation for proposing AttrE2vec.
Edge embeddings
For a decade, network processing approaches have gathered increasing attention, as graph data is produced by a growing number of systems. Network embedding has traditionally provided vector representations of nodes, used for node classification or clustering. Edge representation learning, however, has not received comparable attention and has typically been accomplished by transforming node embeddings (Grover and Leskovec, 2016b). Such an approach is problematic: for instance, inferring an edge type from the neighboring nodes' embeddings may not be the best choice for edge type classification in heterogeneous social networks. We claim that efficient edge clustering, edge attribute regression, or link prediction requires dedicated, edge-specific representations. We expect that a representation learning approach devoted strictly to edges provides more powerful vector representations than traditional methods, which require node embeddings trained upfront and transform them to represent edges.
Inductive embedding methods
A vast majority of contemporary network representation learning methods are transductive (see Table 1). This means that any change to the graph requires retraining the whole method before it can provide predictions for unseen cases, which limits applicability due to high computational costs. In contrast, the inductive approach builds a predictive ability that can be applied to unseen cases without retraining; in general, inductive methods have a lower computational cost. Considering these advantages, we expect modern edge embedding methods to be inductive.
Encoding graph attributes in embeddings
Much real-world data exhibits rich attribute sets or metadata that contain crucial information, e.g., about the similarity of nodes or edges. Traditionally, graph representation learning has focused on exploiting the network structure while omitting the related content. We may instead treat attributes as a regularizer over the structure. This overcomes the limitation that arises when the only edge-discriminating information is encoded in the edges' attributes rather than in the graph's structure; relying on the network alone would produce inconclusive embeddings.
3.2 Attributed graph edge embedding
We denote an attributed graph as G = (V, E), where V is a set of nodes and E a set of edges. Every node u ∈ V and every edge e ∈ E has associated features: x_u ∈ X^V and x_e ∈ X^E, where X^V and X^E are the node and edge feature matrices, respectively. By d^V we denote the dimensionality of the node feature space and by d^E the dimensionality of the edge feature space. The edge embedding task is defined as learning a function f: E → R^d, which takes an edge and outputs its low-dimensional vector representation. Note that the embedding dimension d should be much smaller than the original edge feature dimensionality d^E, i.e., d ≪ d^E. More specifically, we aim at using the topological structure of the graph together with the node and edge attributes: f(e | G, X^V, X^E).
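The setup above can be made concrete with a minimal container for an attributed graph. This is an illustrative sketch; the names (`Graph`, `node_feats`, `edge_feats`, `incident`) are ours and not part of AttrE2vec's published code.

```python
import numpy as np

# Minimal attributed-graph container (illustrative; not AttrE2vec's API).
class Graph:
    def __init__(self, num_nodes, edges, node_feats, edge_feats):
        self.edges = list(edges)          # list of (u, v) node pairs
        self.node_feats = node_feats      # X^V, shape (|V|, d^V)
        self.edge_feats = edge_feats      # X^E, shape (|E|, d^E)
        # incident-edge lists per node, used later for edge random walks
        self.incident = {u: [] for u in range(num_nodes)}
        for idx, (u, v) in enumerate(self.edges):
            self.incident[u].append(idx)
            self.incident[v].append(idx)

# A toy path graph with d^V = 4 and d^E = 8; the embedding task is to
# learn f: E -> R^d with d much smaller than d^E.
g = Graph(3, [(0, 1), (1, 2)], np.zeros((3, 4)), np.zeros((2, 8)))
```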
3.3 AttrE2vec
In contrast to traditional node embedding methods, we shift the focus from nodes to edges and consider the graph from an edge perspective. Given any edge e = (u, v), we can observe three natural sources of knowledge: the edge attributes themselves and the two neighborhoods N(u) and N(v), located behind nodes u and v, respectively. In AttrE2vec, we exploit all three sources jointly.
First, we obtain aggregations (summaries) of both neighborhoods N(u) and N(v). We want to capture the topological structure of the neighborhood, so we perform edge random walks of fixed length, which start from node u (or v, respectively) and use a uniformly distributed neighbor sampling approach (DeepWalk-like) to obtain the next edge. Each walk started from a node is hence a sequence of edges. Next, we take the attributes of the edges (and nodes, if applicable) in each random walk and aggregate them into a single vector using the walk aggregation model. Later, the aggregated walks are combined using the neighborhood aggregation model, which summarizes the neighborhoods N(u) and N(v), respectively. The proposed implementations of these aggregations are given in Section 3.4.
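A DeepWalk-like uniform edge walk can be sketched as follows. This is a simplification under our own naming; the paper's exact sampling details may differ.

```python
import random

def edge_random_walk(incident, edges, start_node, length, rng=None):
    """Uniform edge walk: from the current node, pick one of its incident
    edges uniformly at random, record it, and cross to the other endpoint.
    Returns the walk as a sequence of edge indices."""
    rng = rng or random.Random(0)
    node, walk = start_node, []
    for _ in range(length):
        eidx = rng.choice(incident[node])
        u, v = edges[eidx]
        node = v if node == u else u  # move across the chosen edge
        walk.append(eidx)
    return walk

# Toy triangle: every node has two incident edges.
edges = [(0, 1), (1, 2), (2, 0)]
incident = {0: [0, 2], 1: [0, 1], 2: [1, 2]}
walk = edge_random_walk(incident, edges, start_node=0, length=4)
```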
Finally, we obtain the low-dimensional edge embedding using an encoder module. It combines the edge attributes with the summarized neighborhood information. We employ a simple Multi-layer Perceptron (MLP) with 3 inputs (each of size equal to the edge feature dimensionality) and an attention mechanism over these inputs, to check how much information from each input is used to create the embedding vector (see Figure 3).

3.4 Aggregation models
For the neighborhood aggregation model, we use an average over the walk summary vectors, as there is no particular ordering of these vectors (each one was generated by an equally important random walk). For walk aggregation, we propose the following:

average – that computes a simple average of the edge attribute vectors in the random walk;

exponential – that computes a weighted average, where the weights are exponentials of the negative position in the random walk, so that edges further away are less important than the nearby ones;

GRU – that uses a Gated Recurrent Unit (Chung et al., 2014) architecture, where the hidden and input dimensions are equal to the edge attribute dimension; the aggregated representation is the output of the last hidden vector; the aggregation process starts at the end of the random walk and proceeds to the beginning;
ConcatGRU – that is similar to the GRU-based aggregator, but also uses node feature information by concatenating the node attributes with the edge attributes; hence the GRU input size is equal to the sum of the edge and node dimensions; if no node features are available, one can use network-specific features, like degree or betweenness, or more advanced techniques like Node2vec; the hidden dimension size and the aggregation direction are unchanged.
3.5 Learning AttrE2vec’s parameters
AttrE2vec is designed to make the most of edge attributes and of information about the network structure. Therefore, we propose a loss function that consists of two main parts:

structural loss – computes a cosine embedding loss over a mini-batch of edges; this function minimizes the cosine distance between a given edge embedding and the embeddings of edges sampled from its random walks (positives), while simultaneously maximizing the cosine distance between that embedding and the embeddings of edges sampled from the set of all edges in the graph, excluding those in the random walks (negatives);

feature reconstruction loss – computes, over a mini-batch of edges, the mean squared error between the actual edge features and the outputs of a decoder (implemented as a 3-layer MLP; see Figure 4) that reconstructs the edge features from the edge embeddings.
We combine the values of the above loss functions using a mixing parameter λ. The higher its value, the more structural information is preserved and the less focus is put on feature reconstruction. The total loss of AttrE2vec is hence L = λ · L_structural + (1 − λ) · L_features.
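A minimal numpy sketch of this objective, assuming the convention above (λ weights the structural term); the paper's exact margin and sampling details may differ:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def structural_loss(h, positives, negatives):
    """Cosine-embedding-style loss: pull h toward embeddings of edges from
    its walks, push it away from randomly sampled negative edges."""
    pull = np.mean([1.0 - cos(h, p) for p in positives])
    push = np.mean([max(0.0, cos(h, n)) for n in negatives])
    return pull + push

def feature_loss(decoded, target):
    """MSE between the decoder's reconstruction and the true edge features."""
    return float(np.mean((decoded - target) ** 2))

def total_loss(h, pos, neg, decoded, target, lam=0.5):
    # L = lam * L_structural + (1 - lam) * L_features
    return lam * structural_loss(h, pos, neg) + (1.0 - lam) * feature_loss(decoded, target)
```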
4 Experiments
To evaluate the proposed model's performance, we perform three tasks on three real-world datasets: edge classification, edge clustering, and embedding visualization. We first train our model on a small subset of edges (inductive setting). Then we use the model to infer embeddings for edges from the test set. Finally, we evaluate them in all downstream tasks: by predicting the class of edges in citation graphs (edge classification), by applying the K-means++ algorithm (edge clustering, as defined in (Bandyopadhyay et al., 2019)), and by the dimensionality reduction method t-SNE (embedding visualization). We compare our model to several baselines and contemporary methods in all experiments; see Table 1. Eventually, we check the influence of AttrE2vec's hyperparameters and perform an ablation study on artificially generated datasets. We implement our model in the popular deep learning framework PyTorch. All experiments were performed on an NVIDIA GTX 1080 Ti. Upon acceptance in the journal, we will make our code available at https://github.com/attre2vec/attre2vec and include our DVC (Kuprieiev et al., 2020) pipeline so that all experiments can be easily reproduced.

4.1 Datasets
Name  Initial features (node / edge)  Preprocessed features (node / edge)  Nodes  Edges  Classes  Training instances (inductive / transductive)
Cora  1 433 / 0  32 / 260  2 485  5 069  7+1  160 / 5 069
Citeseer  3 703 / 0  32 / 260  2 110  3 668  6+1  140 / 3 668
Pubmed  500 / 0  32 / 260  19 717  44 324  3+1  80 / 44 324
In order to compare the gathered evaluation evidence, we focused on well-known datasets that appear in the literature, namely Cora (Sen et al., 2008), Citeseer (Sen et al., 2008) and Pubmed (Namata et al., 2012). These are citation networks of scientific papers in several research areas, where nodes are papers and edges denote citations between them. We summarize basic statistics about the datasets before and after preprocessing in Table 2. The raw datasets contain node features only, in the form of high-dimensional sparse bags of words. For Cora and Citeseer, these are binary vectors showing which of the most popular words were used in a given paper; for Pubmed, the features are TF-IDF vectors. To adjust the datasets to our problem setting, we apply the following preprocessing steps to obtain edge-level features, which are used to train and evaluate our AttrE2vec model:

we create dense vector representations of the nodes' features by applying Doc2vec (Le and Mikolov, 2014) in the PV-DBOW variant with a target dimension of 128;

for each edge (u, v) and its symmetrical version (necessary to perform uniform, undirected random walks), we extract the following features:

1 feature – cosine similarity of the raw node features of nodes u and v (binary BoW; for Pubmed transformed from TF-IDF to binary BoW),
2 features – the ratios of the number of used words (number of ones in the BoW) to all possible words in the document (length of the BoW vector) for each of the papers u and v,

256 features – concatenation of the Doc2vec features of nodes u and v,

1 feature – a binary indicator, which denotes whether this is an original edge (1) or its symmetrical counterpart (0),


we apply standardization (StandardScaler in Scikit-Learn (Pedregosa et al., 2011)) to the edge feature matrix.
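The per-edge feature construction above (1 + 2 + 256 + 1 = 260 features before standardization) can be sketched as follows; the function name and argument names are illustrative:

```python
import numpy as np

def edge_features(bow_u, bow_v, d2v_u, d2v_v, is_original):
    """260-dim edge feature vector: cosine similarity of binary BoWs (1),
    used-word ratios of both papers (2), concatenated Doc2vec vectors
    (2 x 128), and an original-vs-symmetric edge indicator (1)."""
    denom = np.linalg.norm(bow_u) * np.linalg.norm(bow_v) + 1e-12
    cos_sim = float(bow_u @ bow_v) / denom
    ratio_u = bow_u.sum() / len(bow_u)   # fraction of vocabulary used in paper u
    ratio_v = bow_v.sum() / len(bow_v)   # fraction of vocabulary used in paper v
    return np.concatenate([[cos_sim, ratio_u, ratio_v],
                           d2v_u, d2v_v, [float(is_original)]])
```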
Moreover, we extracted new node features as 32-dimensional Node2vec embeddings to enable the evaluation of one of our model versions (AttrE2vec with the ConcatGRU aggregator), which generalizes upon both edge and node attributes.
In the raw datasets, each node is labeled with the research area of its paper. To apply this knowledge in the edge classification setting, we use the following rule: if an edge connects two nodes from the same class (research area), the edge receives this class; if the two nodes have different classes, the edge between them is assigned a cross-domain citation class.
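The labeling rule can be sketched in a few lines (the extra cross-domain class is the "+1" in the class counts of Table 2):

```python
def edge_label(class_u, class_v, cross="cross-domain citation"):
    """An edge between same-class papers inherits that class; otherwise it
    becomes a cross-domain citation."""
    return class_u if class_u == class_v else cross
```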
To ensure a fair comparison, we follow the dataset preparation scheme from EPB (García-Durán and Niepert, 2017), i.e., for each dataset (Cora, Citeseer, Pubmed) we sample 10 train/validation/test sets, where the train set consists of 20 edges per class and the validation and test sets contain 1 000 randomly chosen edges each. When reporting the resulting metrics, we show mean values over these ten sampled sets (together with the standard deviation).
4.2 Baselines
We compare our method against several baseline methods. In the simplest case, we use the edge features obtained during the preprocessing phase for all datasets (further referred to as Doc2vec).
Many standard approaches employ simple node embedding transformations to obtain edge embeddings. The authors of Node2vec (Grover and Leskovec, 2016b) proposed binary operators like averaging, the Hadamard product, or the L1 and L2 norms of vector differences. Here, we use the following methods to obtain node embeddings: DeepWalk (Perozzi et al., 2014), Node2vec (Grover and Leskovec, 2016b), SDNE (Wang et al., 2016a) and Struc2vec (Ribeiro et al., 2017). In preliminary experiments, we evaluated these methods and found that the Average operator and an embedding size of 64 give the best results. We use these models in 2 setups: (a) Avg(,) – using only the averaged node features, (b) Avg(,) – as previously, but concatenated with the edge features from the dataset (in total, 324-dim vectors).
We also tried computing a 64-dim PCA reduction of the concatenated features to obtain vector sizes comparable with the 64-dimensional embedding of our model, but this turned out to perform poorly. Note that SDNE is capable of inductive reasoning, but due to the unavailability of such an implementation, we evaluated this method in the transductive scheme (which works in favor of the method).
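The Node2vec-style binary operators used to derive an edge vector from the two endpoint embeddings can be sketched as (function and operator names are ours):

```python
import numpy as np

def edge_from_nodes(z_u, z_v, op="average"):
    """Binary operators from Node2vec for turning two node embeddings into
    one edge vector; 'average' is the variant used in our baselines."""
    if op == "average":
        return (z_u + z_v) / 2.0
    if op == "hadamard":
        return z_u * z_v
    if op == "l1":
        return np.abs(z_u - z_v)
    if op == "l2":
        return (z_u - z_v) ** 2
    raise ValueError(f"unknown operator: {op}")
```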
We also extend our body of baselines with more sophisticated approaches: two dense autoencoder architectures. In the first setting, MLP(,), we train a model (see Figure 5) that reconstructs the concatenated embeddings of the connected nodes. In the second baseline, MLP(,,), the autoencoder (see Figure 6) is extended with edge attributes. In both settings, we employ the mean squared error as the model loss function. The output of the encoders (embeddings) is used in the downstream tasks. The input node embeddings are obtained using the methods mentioned above, i.e., DeepWalk, Node2vec, SDNE, and Struc2vec. The last baseline is Line2vec (Bandyopadhyay et al., 2019), which is directly dedicated to edges; we use an embedding size of 64.
Method group/name  Vector size  AUC (Citeseer / Cora / Pubmed)
Transductive 
Edge features only; (Doc2vec)  260  86.13 ± 0.95  88.67 ± 0.51  79.15 ± 1.41
Line2vec  64  86.19 ± 0.28  91.75 ± 1.07  84.88 ± 1.19
Avg(,)  DeepWalk  64  58.40 ± 1.08  59.98 ± 1.32  51.04 ± 1.23
Node2vec  64  58.26 ± 0.89  59.59 ± 1.11  51.03 ± 1.01
SDNE  64  54.28 ± 1.57  55.91 ± 1.11  50.00 ± 0.00
Struc2vec  64  61.29 ± 0.86  61.30 ± 1.58  54.67 ± 1.46
MLP(,)  DeepWalk  64  55.88 ± 1.68  57.87 ± 1.53  51.23 ± 0.77
Node2vec  64  55.35 ± 2.26  57.44 ± 0.87  51.48 ± 1.55
SDNE  64  55.56 ± 0.93  56.02 ± 1.22  50.00 ± 0.00
Struc2vec  64  59.93 ± 1.43  59.76 ± 1.80  53.27 ± 1.32
Avg(,)  DeepWalk  324  86.13 ± 0.95  88.67 ± 0.51  79.15 ± 1.41
Node2vec  324  86.13 ± 0.95  88.67 ± 0.51  79.15 ± 1.41
SDNE  324  86.14 ± 1.03  88.70 ± 0.51  79.15 ± 1.41
Struc2vec  324  86.21 ± 0.97  88.73 ± 0.48  79.24 ± 1.36
MLP(,,)  DeepWalk  64  84.58 ± 1.11  86.47 ± 0.87  78.60 ± 1.84
Node2vec  64  84.65 ± 1.05  86.71 ± 0.68  78.84 ± 1.71
SDNE  64  84.32 ± 1.13  85.99 ± 0.77  78.34 ± 1.07
Struc2vec  64  83.95 ± 1.16  85.54 ± 0.96  77.19 ± 1.42
Inductive 
Avg(,)  GraphSage  64  54.84 ± 1.90  55.16 ± 1.36  51.14 ± 1.64
MLP(,)  GraphSage  64  55.19 ± 1.04  55.47 ± 1.66  50.36 ± 1.54
Avg(,)  GraphSage  324  86.14 ± 0.95  88.68 ± 0.51  79.16 ± 1.41
MLP(,,)  GraphSage  64  84.63 ± 1.11  86.14 ± 0.45  78.00 ± 1.85
AttrE2vec (our)  Avg  64
Exp  64  88.91 ± 1.10  92.80 ± 0.38  86.18 ± 1.41
GRU  64
ConcatGRU  64  88.56 ± 1.34  92.93 ± 0.61  86.34 ± 1.18
4.3 Edge classification
To evaluate our model in an inductive setting, we need to make sure that test edges are unseen during the model training procedure, so we remove them from the graph. Note that all baselines (except for GraphSage, see Table 1) require all edges during the training phase (i.e., they are transductive methods).
After each training epoch of AttrE2vec, we evaluate the embeddings using an L2-regularized Logistic Regression (LR) classifier and compute the AUC. The regression model is trained on edge embeddings from the train set and evaluated on edge embeddings from the validation set. We take the model with the highest AUC value on the validation set. Moreover, an early stopping strategy is implemented: if the validation AUC does not improve for more than 15 epochs, learning is terminated. Our approach to model selection is aligned with the schema proposed in (Nguyen et al., 2020), as it is more natural than relying on the loss function. This is repeated for all 10 data splits (see Section 4.1 for details). We report the mean and standard deviation of the AUC over the 10 test sets (see Table 3). We choose AdamW (Loshchilov and Hutter, 2017) to optimize our model's parameters. We also fix the sizes of the positive and negative sample sets in the cosine embedding loss. The mixing coefficient is set to 0.5, equally including the influence of features and the topological graph structure. We choose an embedding size of 64 as a reasonable value when dealing with edge features of size 260.
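The AUC-driven model selection with 15-epoch patience can be sketched as follows (a simplification of the training loop; names are ours):

```python
def select_by_val_auc(val_aucs, patience=15):
    """Track the best validation AUC per epoch; stop once it has not improved
    for more than `patience` consecutive epochs. Returns (best_epoch, best_auc)."""
    best_epoch, best_auc, stale = 0, float("-inf"), 0
    for epoch, auc in enumerate(val_aucs):
        if auc > best_auc:
            best_epoch, best_auc, stale = epoch, auc, 0
        else:
            stale += 1
            if stale > patience:   # "does not improve for more than 15 epochs"
                break
    return best_epoch, best_auc
```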
In Table 3, we summarize the AUC values for the baseline methods and for our model. Even though the vectors' original dimensionality is relatively high (260), good results are already obtained using only the edge features (Doc2vec). However, adding structural information about the graph can further improve the results.
Representations from node embedding methods transformed to edge embeddings using the average operator Avg(,) achieve poor results of about 50–60% AUC. However, when combined with the edge features from the datasets (Avg(,)), the AUC values increase significantly, to about 86%, 88% and 79% for Citeseer, Cora, and Pubmed, respectively. Unfortunately, this results in an even higher vector dimensionality (324).
The MLP-based approaches lead to similar conclusions. Using only node embeddings, MLP(,), we achieve quite poor results of about 50% (on Pubmed) up to 60% (on Cora). With the MLP(,,) approach, we observe that edge features improve the classification results. The AUC values are still slightly worse than with the concatenation operator (Avg(,)), but the edge embedding size is reduced to 64.
The Line2vec (Bandyopadhyay et al., 2019) algorithm achieves very good results without considering edge feature information: about 86%, 92% and 85% AUC for Citeseer, Cora, and Pubmed, respectively. These values are higher than for any other baseline approach.
Our model performs best among all evaluated methods. For Citeseer, we gain about 3 percentage points compared to the best baselines: Line2vec, Struc2vec (Avg(,)) or GraphSage (Avg(,)). Note that the algorithm is trained on only 140 edges in the inductive setting, whereas all transductive baselines require the whole graph for training. The gain on Cora is 2 pp, and on Pubmed we achieve up to 4 pp (and up to 8 pp compared only to GraphSage (Avg(,))). Our model with the Average (Avg) aggregator works best, whereas the Gated Recurrent Unit (GRU) aggregator achieves the second-best results.
4.4 Edge clustering
Similarly to Line2vec (Bandyopadhyay et al., 2019), we apply the K-means++ algorithm to the resulting embedding vectors and compute the unsupervised clustering accuracy (Xie et al., 2016). We summarize the results in Table 4. Our model performs best in all but one case and achieves significantly better results than the other baseline methods. The only exception is the Pubmed dataset, where Line2vec achieves the best clustering accuracy. The other baseline methods perform similarly as in the edge classification task, so we do not discuss them in detail and encourage the reader to go through the results.
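Unsupervised clustering accuracy searches for the best cluster-to-class relabeling. A brute-force sketch (adequate for the small class counts here; Hungarian matching would scale better):

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one relabelings of the predicted clusters
    (in the spirit of Xie et al., 2016). Assumes cluster and class labels are
    drawn from one shared label set."""
    labels = sorted(set(y_true) | set(y_pred))
    best = 0.0
    for perm in permutations(labels):
        relabel = dict(zip(labels, perm))
        hits = sum(relabel[p] == t for t, p in zip(y_true, y_pred))
        best = max(best, hits / len(y_true))
    return best
```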
Method group/name  Vector size  Accuracy (Citeseer / Cora / Pubmed)
Transductive 
Edge features only; (Doc2vec)  260  54.13 ± 2.73  54.64 ± 5.86  46.33 ± 1.53
Line2vec  64  54.73 ± 2.56  63.50 ± 1.92
Avg(,)  DeepWalk  64  28.89 ± 1.06  21.93 ± 0.86  27.24 ± 0.50
Node2vec  64  26.82 ± 0.67  21.32 ± 0.62  27.17 ± 0.74
SDNE  64  21.01 ± 0.50  17.97 ± 0.47  31.38 ± 0.69
Struc2vec  64  25.21 ± 1.33  20.15 ± 0.64  32.02 ± 1.49
MLP(,)  DeepWalk  64  26.36 ± 1.37  21.06 ± 0.57  27.40 ± 0.93
Node2vec  64  26.37 ± 1.64  21.31 ± 0.98  27.67 ± 0.78
SDNE  64  22.27 ± 0.76  17.15 ± 0.36  28.44 ± 1.21
Struc2vec  64  24.22 ± 0.83  19.56 ± 0.49  31.31 ± 1.70
Avg(,)  DeepWalk  324  54.13 ± 2.73  54.70 ± 5.85  46.33 ± 1.53
Node2vec  324  54.13 ± 2.73  54.70 ± 5.85  46.33 ± 1.53
SDNE  324  55.29 ± 2.06  55.43 ± 4.63  46.33 ± 1.53
Struc2vec  324  55.59 ± 1.51  52.47 ± 6.52  46.32 ± 1.29
MLP(,,)  DeepWalk  64  48.74 ± 4.03  47.38 ± 4.72  46.49 ± 1.20
Node2vec  64  50.80 ± 2.30  48.48 ± 3.38  46.15 ± 1.43
SDNE  64  46.17 ± 3.15  44.87 ± 3.54  45.74 ± 1.89
Struc2vec  64  47.35 ± 3.73  44.38 ± 3.04  45.40 ± 1.72
Inductive 
Avg(,)  GraphSage  64  18.79 ± 0.62  17.70 ± 1.05  27.04 ± 0.71
MLP(,)  GraphSage  64  18.92 ± 0.98  17.89 ± 0.85  27.09 ± 0.81
Avg(,)  GraphSage  324  54.06 ± 2.54  54.82 ± 6.86  46.49 ± 1.64
MLP(,,)  GraphSage  64  48.79 ± 4.04  47.49 ± 5.41  45.15 ± 1.54
AttrE2vec (our)  Avg  64  59.82 ± 3.30  65.42 ± 1.71  48.86 ± 2.46
Exp  64  59.07 ± 4.65  48.02 ± 2.55
GRU  64  49.41 ± 1.49
ConcatGRU  64  66.00 ± 2.21
4.5 Embedding visualization
For all tested baseline methods and our proposed AttrE2vec, we compute 2-dimensional projections of the produced embeddings using the t-SNE (van der Maaten and Hinton, 2008) method and visualize them in Figure 7. In our subjective opinion, these plots correspond to the AUC scores reported in Table 3: the higher the AUC, the better the group separation. In detail, the raw Doc2vec edge features seem to form groups, but these overlap to some degree. We cannot observe any pattern in the node-embedding-based settings (Avg(,) and MLP(,)); they appear quasi-random. When concatenated with the edge attributes (Avg(,) and MLP(,,)), we observe slightly better grouping, but still not a satisfying one. The AttrE2vec model produces much better-formed groups, with only a little overlap. To summarize, based on the observed group separability and the AUC metrics, our approach works best among all methods.
5 Hyperparameter Sensitivity of AttrE2vec
We investigate the effect of the hyperparameters by considering each of them independently, i.e., setting a given parameter while preserving default values for all others. The evaluation is performed for two inductive variants of our model: with the Average aggregator and with the GRU aggregator. We use all three datasets (Cora, Citeseer, Pubmed) and report the AUC values. We choose the following hyperparameter value sets (an asterisk denotes the default value of a parameter):

length of random walk: ,

number of random walks: ,

embedding size: ,

mixing parameter: .
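The one-at-a-time procedure described above can be sketched as follows. Note that the concrete default values, candidate grids, and the `dummy_auc` evaluation function are placeholders (the actual value sets were lost in extraction); only the sweep structure reflects the text.

```python
def sensitivity_analysis(evaluate, defaults, grids):
    """Vary one hyperparameter at a time, keeping all others at their defaults."""
    results = {}
    for name, values in grids.items():
        for value in values:
            params = {**defaults, name: value}  # override a single parameter
            results[(name, value)] = evaluate(**params)
    return results

# Hypothetical defaults and candidate grids (illustrative values only).
defaults = {"walk_length": 8, "num_walks": 16, "emb_dim": 64, "mixing": 0.5}
grids = {
    "walk_length": [4, 8, 16],
    "num_walks": [8, 16, 32],
    "emb_dim": [16, 32, 64],
    "mixing": [0.0, 0.25, 0.5, 0.75, 1.0],
}

# Dummy evaluation standing in for training AttrE2vec and measuring AUC.
dummy_auc = lambda **p: 0.5 + 0.001 * p["emb_dim"]
results = sensitivity_analysis(dummy_auc, defaults, grids)
print(len(results))  # 14 configurations evaluated
```

Each result is keyed by the (parameter, value) pair, which maps directly onto the per-parameter curves shown in Figure 8.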
The results of all experiments are summarized in Figure 8. We observe that the trends are similar for both aggregation variants, Avg and GRU, so we discuss them based on the Average aggregator only.
In general, the higher the number of random walks and the length of a single random walk, the better the results. One may be tempted to use even higher values of these parameters, but this significantly increases both the random walk computation time and the model training time.
Unsurprisingly, the embedding size (embedding dimension) follows the same trend. With more dimensions, we can fit more information into the created representations. However, as the goal of embedding is to find low-dimensional vector representations, we should keep the dimensionality reasonable. Our chosen values (16, 32, 64) seem plausible when working with 260-dimensional edge features.
As for the loss mixing parameter, we observe that values that are too high negatively influence model performance. The greater the value, the more important the feature reconstruction loss becomes, while the structural embedding loss simultaneously becomes less relevant. Choosing the extreme value causes the loss function to consider feature reconstruction only and to ignore the embedding loss completely. This yields significantly worse results and confirms that our approach of combining the feature reconstruction and structural embedding losses is justified. In general, the best values are achieved when both loss factors are given equal influence.
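As a minimal sketch, the mixing described above can be written as a convex combination of the two loss terms. The exact weighting scheme inside AttrE2vec is not reproduced here; the function below only illustrates the behavior at the extreme and balanced settings discussed in the text.

```python
def total_loss(feature_loss: float, structural_loss: float, lam: float) -> float:
    """Convex combination of the two loss terms: lam weights feature
    reconstruction, (1 - lam) weights the structural embedding loss."""
    assert 0.0 <= lam <= 1.0
    return lam * feature_loss + (1.0 - lam) * structural_loss

print(total_loss(2.0, 4.0, 1.0))  # 2.0 -> feature reconstruction only
print(total_loss(2.0, 4.0, 0.0))  # 4.0 -> structural loss only
print(total_loss(2.0, 4.0, 0.5))  # 3.0 -> equal influence of both terms
```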
6 Ablation study
We performed an ablation study to check whether our method, AttrE2vec, is robust to noise introduced into an artificially generated network. We use a barbell graph, which consists of two fully connected graphs joined by a path (see Figure 1). The graph has seven nodes in each fully connected part and seven nodes in the path, giving a total of 50 edges. Next, we generate features from 3 clusters in a 200-dimensional space using isotropic Gaussian blobs. We assign the features to the 3 parts of the graph: the first cluster to the edges of one fully connected part, the second to the edges of the path, and the third to the edges of the other fully connected part. The edge classes match the feature clusters (i.e., there are three classes). The structure is therefore aligned with the features, so any good structure-based embedding method can fit this data very well (see Figure 1). A problem occurs when the features (and hence the classes) are shuffled within the graph structure: methods that employ only a structural loss function will fail. We want to check how our model, AttrE2vec, which includes both a structural and a feature-based loss, performs under different amounts of such noise.
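The synthetic dataset described above can be rebuilt as follows, assuming networkx's `barbell_graph` and scikit-learn's `make_blobs` (which draws from isotropic Gaussian blobs). The blob parameters (cluster spread, random seed) are assumptions, since the paper does not state them.

```python
import numpy as np
import networkx as nx
from sklearn.datasets import make_blobs

# Two K7 cliques (nodes 0-6 and 14-20) joined by a 7-node path (nodes 7-13).
G = nx.barbell_graph(7, 7)

def edge_class(u, v):
    # 0 / 2: edge inside one of the cliques; 1: edge on the connecting path.
    if u < 7 and v < 7:
        return 0
    if u >= 14 and v >= 14:
        return 2
    return 1

classes = np.array([edge_class(u, v) for u, v in G.edges()])

# 200-dimensional features: one isotropic Gaussian blob per edge class
# (21 edges per clique, 8 edges on the path, 50 edges in total).
X, blob_labels = make_blobs(n_samples=[21, 8, 21], n_features=200, random_state=0)

features = np.empty_like(X)
for c in range(3):
    features[classes == c] = X[blob_labels == c]

print(G.number_of_edges(), features.shape)  # 50 (50, 200)
```

With this assignment the edge classes are perfectly aligned with the graph's substructures, which is the clean starting point before any shuffling is applied.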
We use the graph described above and introduce noise by shuffling a fraction of all edge pairs that belong to different classes, i.e., an edge with class 2 (originally located in the path) may be swapped with one from the fully connected parts (classes 1 or 3). We use our AttrE2vec model with the Average aggregator in the transductive setting (due to the graph size) and report the edge classification AUC for different values of the shuffling fraction and the mixing parameter. The values of the mixing parameter allow us to check how the model behaves when working only with the feature-based loss, only with the structural loss, and with both losses at equal importance. We train our model for five epochs and, due to the randomness of the shuffling procedure, repeat the computations ten times for every parameter pair. We report the mean and standard deviation of the AUC in Figure 9.
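A possible implementation of the cross-class shuffling is sketched below. It assumes that roughly a fraction `p` of edges takes part in a swap and that each selected edge is paired with a not-yet-swapped edge of a different class; the paper's exact pairing scheme is not specified, so this is an illustrative variant.

```python
import numpy as np

def shuffle_cross_class(features, classes, p, seed=0):
    """Swap feature vectors between pairs of edges from different classes.

    Roughly a fraction p of edges participates in a swap; each selected edge
    is paired with a remaining edge of a different class (if one exists).
    """
    rng = np.random.default_rng(seed)
    noisy = features.copy()
    available = list(rng.permutation(len(classes)))
    while available:
        i = available.pop()
        if rng.random() >= p:
            continue  # this edge is not selected for shuffling
        partners = [j for j in available if classes[j] != classes[i]]
        if not partners:
            continue
        j = partners[0]
        available.remove(j)
        noisy[[i, j]] = noisy[[j, i]]  # swap the two feature vectors
    return noisy

feats = np.arange(12, dtype=float).reshape(6, 2)
labels = np.array([0, 0, 1, 1, 2, 2])
assert np.array_equal(shuffle_cross_class(feats, labels, p=0.0), feats)
```

Because only features (and hence classes) move while the graph structure stays fixed, increasing `p` breaks the alignment between structure and features that a purely structural loss relies on.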
Using only the feature loss or a combination of both losses allows us to achieve nearly 100% AUC in the classification task. The fluctuations appear due to the low number of training epochs and the local optima problem. The performance of the model that uses only the structural loss decreases with higher shuffling probabilities, and from a certain point it starts improving slightly, because shuffling then results in a complete swap of two classes, i.e., all features and classes from one part of the graph are exchanged with those from another part.
We also demonstrate how our method reacts to noisy data for various values of the mixing parameter. There are two graphs: one where the features are aligned with the substructures of the graph, and a second with shuffled features (ca. 50%); see Figure 10. A balanced setting of the mixing parameter allows AttrE2vec to represent noisy graphs fairly well.
7 Conclusions and future work
We introduce AttrE2vec, a novel unsupervised and inductive embedding model that learns attributed edge embeddings by leveraging a self-attention network with an autoencoder over the attribute space and a structural loss over aggregated random walks. AttrE2vec can directly aggregate feature information from edges and nodes many hops away to infer embeddings not only for existing nodes but also for new ones. Extensive experimental results show that AttrE2vec achieves state-of-the-art results in edge classification and clustering on Cora, Pubmed, and Citeseer.
Acknowledgments
The work was partially supported by the National Science Centre, Poland, grants No. 2016/21/D/ST6/02948 and 2016/23/B/ST6/01735, as well as by the statutory funds of the Department of Computational Intelligence, Wrocław University of Science and Technology.
References
Edge classification in networks. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE 2016), pp. 1038–1049.
Joint auto-weighted graph fusion and scalable semi-supervised learning. Information Fusion 66, pp. 213–228.
Beyond node embedding: a direct unsupervised edge representation framework for homogeneous networks. arXiv:1912.05140.
FSCNMF: fusing structure and content via non-negative matrix factorization for embedding information networks. arXiv:1804.05313.
Neural graph learning: training neural networks using graphs. 2018, pp. 64–71.
Machine learning on graphs: a model and comprehensive taxonomy.
Relation constrained attributed network embedding. Information Sciences 515, pp. 341–351.
Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555.
Metapath2vec: scalable representation learning for heterogeneous networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144.
Deep attributed network embedding. In IJCAI International Joint Conference on Artificial Intelligence, pp. 3364–3370.
Learning graph representations with embedding propagation. In Advances in Neural Information Processing Systems, pp. 5120–5131.
Exploiting edge features for graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9203–9211.
Node2vec: scalable feature learning for networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035.
Open graph benchmark: datasets for machine learning on graphs. arXiv:2005.00687.
Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–20.
Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR 2017), pp. 1–14.
DVC: Data Version Control – git for data & models. Zenodo.
Improving network embedding with partially available vertex and edge content. Information Sciences 512, pp. 935–951.
Distributed representations of sentences and documents. In 31st International Conference on Machine Learning (ICML 2014), pp. 2931–2939.
Multi-source information fusion based heterogeneous network embedding. Information Sciences 534, pp. 53–71.
Network representation learning: a systematic literature review. Neural Computing and Applications 32 (21), pp. 16647–16679.
Graph representation learning with encoding edges. Neurocomputing 361, pp. 29–39.
AHNG: representation learning on attributed heterogeneous network. Information Fusion 50, pp. 221–230.
Decoupled weight decay regularization. arXiv:1711.05101.
Query-driven active surveying for collective classification. In Proceedings of the Workshop on Mining and Learning with Graphs, Edinburgh, Scotland, UK, pp. 1–8.
A self-attention network based node embedding model. arXiv:2006.12100.
CAGE: constrained deep attributed graph embedding. Information Sciences 518, pp. 56–70.
Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pp. 701–710.
Struc2vec: learning node representations from structural identity. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394.
Collective classification in network data. AI Magazine 29 (3), p. 93.
Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 29–38.
LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (WWW 2015), pp. 1067–1077.
Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
Graph attention networks. In 6th International Conference on Learning Representations (ICLR 2018), pp. 1–12.
Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pp. 1225–1234.
Structural deep network embedding. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234.
COVID-19 classification by FGCNet with deep feature fusion from graph convolutional network and convolutional neural network. Information Fusion 67, pp. 208–229.
Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics 38 (5), p. 146.
Attribute2vec: deep network embedding through multi-filtering GCN. arXiv:2004.01375.
A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21.
Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 478–487.
Network representation learning with rich text information. In IJCAI International Joint Conference on Artificial Intelligence, pp. 2111–2117.
Heterogeneous graph neural network. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 793–803.
Network representation learning: a survey. IEEE Transactions on Big Data 6 (1), pp. 3–28.