1 Introduction
After several attempts and modifications ([Bruna et al.2014]; [Niepert et al.2016]; [Kipf and Welling2016]; [Tao et al.2017]; [Jie et al.2018]), Graph Convolutional Networks (GCNs) are now rapidly gaining popularity due to their excellent performance in aggregating neighborhood information for individual graph nodes. (Although some papers, such as [Tao et al.2017] and [Hamilton et al.2017], do not explicitly use the term convolution, a similar idea of aggregating features in a node's neighborhood can be seen in their models; we therefore categorize all such work as graph convolutional networks.) However, although low-rank proximities and neighbor node features are successfully leveraged through the convolutional layers, the attributes that graph links may carry are generally ignored in GCNs. In almost all existing GCN models, graph links are regarded as indicators of proximity between nodes, and link weights, accordingly, as proximity strengths. These proximities are only used to identify neighborships and their influences in local neighborhoods.
However, in real-world scenarios, a link between a pair of nodes carries far more information than a simple indicator of neighborship: it represents a hypostatic relationship of various forms between two entities, with concrete attributes. For example, two connected people in a social network may have different types of relationships, such as family members, colleagues and alumni, and may accordingly have different communication patterns and contents; a link in a business network typically represents a transaction between two companies, and the properties of such transactions are obviously informative. Since it is impossible to represent these complex relationships with simple binary or weighted links, restoring graph links to the hypostatic relationships they stand for in the real world allows us to recover the exact relationships between nodes.
While one can leverage link attributes in current GCN models with tricks such as concatenating them to neighbor node attributes, such implementations cannot adequately capture the interactions between the attributes (as detailed in Section 3). Since, to the best of our knowledge, there is no previous work focusing on incorporating link attributes into GCNs, we propose GCN-LASE (Graph Convolutional Network with Link Attributes and Sampling Estimation) as an attempt. LASE is an extension of GCN that learns a function mapping a target node to its hidden representation, considering the features and structures in its local neighborhood, including both link and neighbor node attributes. The aggregated features are then used to conduct various downstream tasks.
To leverage the link and node attributes, as well as the interactions between them, we adopt their tensor products as the fully associated neighbor features, based on which a neighbor kernel is designed using the inner product of these tensors. We further derive corresponding graph kernels, and finally the neural architectures, following a route similar to that introduced in [Tao et al.2017]. We then provide intuitive understandings of LASE by modularizing it into a gate, an amplifier, and an aggregator. Meanwhile, to accelerate the training of LASE, we adopt the Monte Carlo method to quickly estimate the sum of features over entire neighborhoods. We also introduce a novel sampling setup for LASE to reduce the estimation variance: neighbors are sampled according to a calculated distribution. As it can be rather time-consuming to calculate the optimal sampling probabilities batch-wise, we make a tradeoff between variance and efficiency by controlling the interval between two calculation rounds.
Recovering more information in graphs is not the only benefit that incorporating link attributes brings about: it also enlarges the class of graph structures that GCNs can handle. There are at least two other types of graph-structured data on which LASE can be implemented: i) graphs with heterogeneous links, where link weights from different perspectives can be arranged into vectors and used as input link attributes; ii) dynamic or temporal graphs, where link weights from different timestamps can be stacked together and used as input link attributes. See Figure 1 for detailed examples. We validate our approach on four datasets across different domains, and design additional experiments to demonstrate the informativeness of link attributes and the effect of the proposed sampling setup.

2 Related Work
Our method builds upon previous work of machine learning over graphs, including graph convolutional networks, graph kernels and node representation learning.
Graph convolutional networks.
The past few years have seen a large body of work on implementing deep architectures over graphs ([Bruna et al.2014]; [Henaff et al.2015]; [Kipf and Welling2016]; [Wang et al.2016]; [Niepert et al.2016]; [Hamilton et al.2017]; [Tao et al.2017]), among which the convolutional networks appear the most appealing. There are mainly two types of graph convolutional networks: one learns features for entire graphs ([Tao et al.2017]; [Niepert et al.2016]), the other for individual nodes ([Kipf and Welling2016]; [Hamilton et al.2017]). Both adopt the concept of convolution by merging features from local neighborhoods. Meanwhile, as the original GCN introduced by [Kipf and Welling2016] does not support mini-batch training, modifications towards better efficiency have emerged ([Hamilton et al.2017]; [Jie et al.2018]; [Huang et al.2018]), in which the Monte Carlo method is generally used to estimate the features of entire neighborhoods from a controllable number of sampled nodes. In this paper, a different sampling implementation is adopted, which trades off variance against efficiency by controlling the interval at which the optimal sampling probabilities are recalculated.
Graph kernels.
Kernel methods [Schölkopf and Smola2002] have long been an important class of machine learning techniques, but it remains challenging to define effective and convenient kernels for graphs. Existing graph kernels ([Gärtner et al.2003]; [Vishwanathan et al.2008]; [Shervashidze et al.2011]) are typically defined over graph substructures such as subtrees and walk sequences. However, to the best of our knowledge, no existing work aims at incorporating link attributes into graph kernels. [Tao et al.2017] introduces an innovative route to develop neural architectures grounded in graph kernels; these architectures therefore enjoy better explainability. We adopt a similar route to design the architectures of LASE, using the novel graph kernels we derive under a tensor-product-based feature setup.
Node representation learning (NRL).
NRL aims to learn low-dimensional embeddings for graph nodes ([Perozzi et al.2014]; [Tang et al.2015]; [Grover and Leskovec2016]; [Ribeiro et al.2017]; [Hamilton et al.2017]; [Li et al.2018]; [Du et al.2018]), which are later used in downstream prediction tasks such as node classification and link prediction. GCNs can, in a broader sense, also be classified as NRL methods when the hidden embeddings of nodes are regarded as node representations. However, few existing NRL models incorporate link attributes. In Section 5, we compare the performance of LASE with several other NRL approaches.
Table 1: Notations.

G             The input graph.
u, v          Nodes in G.
l_{uv}        A link from node u to v.
(v, l)        A pair of node and link, i.e. a neighbor of node u.
x             The feature of a node, a link or a pair.
N(u)          The set of neighbor nodes of u.
⊗, ⊙, ⟨·,·⟩   The operations of tensor product, element-wise product and inner product.
∥             The operation of concatenating input vectors.
W             The parameters in the neural network.
h_u^{(k)}     The hidden representation of node u in layer k.
3 Model
In this section, we introduce the architecture of LASE (see Figure 2) and the motivation behind it. We first focus on exploiting the interactions between node and link attributes, and define the form of the associated features of a neighbor. (The term neighbor here refers to an ordered pair containing a neighbor node and the link connecting it to the central node, similarly hereinafter.) As directly using the defined tensor features in GCNs would be clumsy, we design new graph kernels under this setup, based on which we further derive possible architectures of LASE. Finally, we modularize the architecture of LASE and provide intuitive understandings of its modules. The notations in this paper are illustrated in Table 1.

3.1 Neighbor Feature Tensors
The simplest idea for incorporating link attributes into GCN models is to concatenate them to the node's attributes, i.e.

$$x_{(v,l)} = x_v \,\|\, x_l,$$

where ∥ denotes concatenation. However, as the node and link attributes are then summed independently over the neighborhood, such implementations cannot capture the interactions between attributes at all. As the key idea behind LASE is to adequately incorporate link attributes into node hidden representations, these interactions should be highly informative. Moreover, this setup also leads to the confusion demonstrated in Figure 3, indicating that the graph structure is not appropriately captured.
To address this problem, instead of simply adding or concatenating node and link attributes, we define their tensor product as the associated neighbor feature: for a central node u connected to a neighbor node v by a link l, the feature of the corresponding neighbor is defined as

$$x_{(v,l)} = x_v \otimes x_l,$$

where ⊗ calculates the multiplication of every pair of entries in the two vectors. The tensor product serves as a set of fully associated features. However, directly using the tensor as GCN input leads to unacceptably high dimensions and heavy redundancies (for the tensor, being the outer product of two vectors, is of rank 1). Instead, we adapt existing graph kernels to the so-defined neighbor features, and derive the architectures of LASE following a route similar to [Tao et al.2017].
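As a small NumPy illustration (dimensions are hypothetical, chosen only for this sketch), the concatenated and tensor-product feature setups differ exactly in whether cross-terms between node and link attributes appear:

```python
import numpy as np

# Hypothetical dimensions, for illustration only (not from the paper).
d_node, d_link = 4, 3

rng = np.random.default_rng(0)
x_v = rng.normal(size=d_node)   # neighbor node attributes
x_l = rng.normal(size=d_link)   # link attributes

# Concatenation setup: node and link parts are later summed independently
# over the neighborhood, so no cross-terms between the two sets survive.
concat_feature = np.concatenate([x_v, x_l])   # shape (d_node + d_link,)

# LASE setup: tensor (outer) product, entry (i, j) = x_v[i] * x_l[j],
# i.e. every pairwise interaction between node and link attributes.
tensor_feature = np.outer(x_v, x_l)           # shape (d_node, d_link)

# As the outer product of two vectors, the tensor has matrix rank 1,
# which is the redundancy argument against using it directly as input.
assert np.linalg.matrix_rank(tensor_feature) == 1
```

The rank-1 structure is what makes directly feeding the tensor into a network wasteful, and what the kernel construction below exploits instead.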
3.2 Graph Kernels with Link Attributes
To adapt existing kernels to our setup of neighbor features, we first define the kernel between two neighbors, (v, l) and (v′, l′). The neighbor kernel is defined as the inner product of the neighbor tensors, i.e.

$$K_n\big((v,l),(v',l')\big) = \langle x_v \otimes x_l,\; x_{v'} \otimes x_{l'} \rangle = \langle x_v, x_{v'} \rangle \cdot \langle x_l, x_{l'} \rangle.$$
Based on the neighbor kernel, a kernel between two k-hop neighborhoods with central nodes u and u′ can be defined by regarding the lower-hop kernel as the inner product of the (k−1)-th hidden representations of u and u′. Furthermore, by recursively applying the neighborhood kernel, we can derive the k-hop Random Walk kernel for graphs with link attributes, in which the neighbor kernels are multiplied along walk sequences of length k and summed over all such sequences in the two graphs. (A Weisfeiler-Lehman kernel can also be defined by adopting the graph relabeling process, introduced in detail in [Tao et al.2017]; we skip this part due to space limitations.)
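The identity underlying the neighbor kernel, that the inner product of two outer products factorizes into the product of the node-side and link-side inner products, can be checked numerically; it also shows the full tensors never need to be materialized:

```python
import numpy as np

rng = np.random.default_rng(1)
# Attributes of two neighbors (v, l) and (v', l'); dimensions are hypothetical.
x_v, x_vp = rng.normal(size=4), rng.normal(size=4)
x_l, x_lp = rng.normal(size=3), rng.normal(size=3)

# Inner product of the two neighbor tensors, computed naively.
lhs = np.sum(np.outer(x_v, x_l) * np.outer(x_vp, x_lp))

# Factorized form: product of node-side and link-side inner products.
rhs = np.dot(x_v, x_vp) * np.dot(x_l, x_lp)

assert np.isclose(lhs, rhs)   # the kernel never needs the full tensor
```

This factorization is what keeps the kernel, and hence the derived architectures, tractable despite the high-dimensional tensor features.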
3.3 From Kernels to Neural Architectures
Following the route introduced in [Tao et al.2017] with the above Random Walk kernel, a corresponding architecture, LASE-RW, can be immediately derived. This architecture further enjoys a property similar to that of the architecture described in [Tao et al.2017]: constructing, for each coordinate i, auxiliary parameters from the i-th row vectors of the parameter matrices, we have the following theorem:
Theorem 1.
For any i, the sum of the i-th coordinates of the hidden representations, taken over all nodes, lies in the RKHS of the Random Walk kernel defined above.
Similarly, an architecture LASE-WL can be derived from the Weisfeiler-Lehman kernel. The Weisfeiler-Lehman architecture is originally designed to convolve nodes through both depth and breadth; however, this makes the calculation of LASE-WL overly complex. We therefore unite the depth and breadth convolutions to reduce the model size and, referring to the neighborhood aggregation concept in GraphSAGE [Hamilton et al.2017], propose LASE-SAGE.
3.4 Discussion
Although the framework of [Tao et al.2017] is originally introduced for aggregating features of entire graphs, its output graph features are an activated sum of all node features; we reckon these node features to be informative in node-wise prediction tasks as well. We also provide some intuitive understandings of LASE. Note that the calculations in all three architectures can be divided into three common modules, namely a gate, an amplifier and an aggregator, as shown in Figure 2. Intuitively, the gate evaluates the significance of a neighbor in u's neighborhood. The amplifier element-wise amplifies the node attributes with link information; we also observe a slight elevation in performance when applying a sigmoid activation to it, which makes the module function more analogously to an amplifier. The aggregator sums up neighbor embeddings and combines them with the central node embedding using different strategies in the different architectures. We would like to point out that the aggregators defined in [Hamilton et al.2017] may also be used as aggregators in LASE.
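The three modules can be sketched in a minimal forward pass. The parameterization below (how the gate, amplifier and combination are computed) is an illustrative assumption, not the paper's exact formulas:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lase_layer(h_u, neighbors, W_gate, W_amp, W_out):
    """Hedged sketch of a LASE-style layer: gate, amplifier, aggregator.

    h_u       -- central node embedding, shape (d,)
    neighbors -- list of (h_v, x_l): neighbor embedding and link attributes
    The gate/amplifier parameterization here is illustrative only.
    """
    agg = np.zeros_like(h_u)
    for h_v, x_l in neighbors:
        gate = sigmoid(W_gate @ x_l).mean()   # scalar significance of neighbor
        amp = sigmoid(W_amp @ x_l)            # element-wise amplification vector
        agg += gate * (amp * h_v)             # amplified, gated contribution
    # Aggregator: combine the neighborhood sum with the central embedding.
    return np.tanh(W_out @ np.concatenate([h_u, agg]))

rng = np.random.default_rng(2)
d, d_l = 5, 3
h_u = rng.normal(size=d)
neighbors = [(rng.normal(size=d), rng.normal(size=d_l)) for _ in range(4)]
W_gate, W_amp = rng.normal(size=(d, d_l)), rng.normal(size=(d, d_l))
W_out = rng.normal(size=(d, 2 * d))
h_next = lase_layer(h_u, neighbors, W_gate, W_amp, W_out)
```

The sigmoid on the amplifier output mirrors the observation above that bounding the amplification slightly improves performance.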
4 Sampling Estimation
Similar to GCN [Kipf and Welling2016], scalability is an obvious challenge for LASE: calculating the convolutions demands a recursively expanded neighborhood, which, for nodes with high degrees, quickly covers a large portion of the graph. To control batch scales, we leverage the Monte Carlo method to estimate the summed neighborhood information from a fixed number of sampled neighbors. Despite the different architectures, the output hidden embeddings of LASE all involve a sum over the neighborhood, which can be rewritten as an expectation,

$$\sum_{(v,l) \in N(u)} f(v,l) \;=\; \mathbb{E}_{(v,l) \sim q}\!\left[\frac{f(v,l)}{q(v,l)}\right],$$

where q denotes the sampling probabilities over N(u) and f(v,l) the (architecture-specific) contribution of neighbor (v,l). We then approximate the sum through a Monte Carlo estimate of this expectation. As such estimates are always unbiased, we look for the optimal probabilities that minimize the estimation variance.
Although there are sampling strategies proposed for GCNs ([Jie et al.2018]; [Huang et al.2018]), these methods cannot be directly transferred to LASE because of the absence of explicit, constant link weights; besides, the optimal distribution varies through the training process. However, a similar idea of importance sampling, coined gate sampling, can be used in LASE by regarding the gate values as the sampling weights, i.e. sampling each neighbor with probability proportional to its gate. While sampling with gates may reduce the estimation variance, it is not an optimal solution, because the norms of the amplified neighbor features typically differ. Following the derivations of importance sampling in [Owen2013], we derive min_var sampling, whose optimal probabilities are proportional to the norms of the neighbors' contributions to the neighborhood sum.
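The estimator and the variance comparison can be sketched numerically. This is an illustrative sketch, not the paper's implementation: the neighbor contributions are random vectors, and the min_var-style proposal is taken, as in the standard importance-sampling derivation, to be proportional to the contribution norms:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical neighbor contributions f(v) whose neighborhood sum we estimate.
contributions = rng.normal(size=(8, 5)) * rng.uniform(0.1, 5.0, size=(8, 1))
exact_sum = contributions.sum(axis=0)

def estimate(q, n_samples):
    """Unbiased importance-sampling estimate of sum_v f(v) under proposal q."""
    idx = rng.choice(len(contributions), size=n_samples, p=q)
    return (contributions[idx] / q[idx, None]).mean(axis=0)

uniform = np.full(8, 1 / 8)
# min_var-style proposal: sample neighbors proportionally to contribution norms.
norms = np.linalg.norm(contributions, axis=1)
min_var = norms / norms.sum()

def total_variance(q, reps=400, n=200):
    ests = np.stack([estimate(q, n) for _ in range(reps)])
    return ests.var(axis=0).sum()

# Unbiasedness: a large-sample estimate approaches the exact sum.
assert np.allclose(estimate(min_var, 100_000), exact_sum, atol=0.3)
# Norm-proportional sampling gives a lower-variance estimator than uniform.
assert total_variance(min_var) < total_variance(uniform)
```

The gap between the two variances grows with the spread of the contribution norms, matching the observation in Section 5.3 that gate sampling alone does not close it.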
Evaluating the sampling probabilities batch-wise can be rather inefficient. Under the hypothesis that the network parameters do not vary dramatically from batch to batch, we make a tradeoff between variance and efficiency by controlling the interval of calculating the optimal distribution: the sampling probabilities for all training nodes are recalculated every T batches. Although each calculation may be time-consuming, the batch-averaged time cost is thereby reduced to 1/T of the batch-wise cost.
5 Experiments
5.1 Experiment Setups
Datasets.
We validate our method on the four datasets introduced below, including a graph with link attributes (reddit), a graph with heterogeneous links (dblp), and two temporal networks (email, fmobile). The statistics of the datasets are shown in Table 2.

Reddit is a Reddit post network in which each node represents a post, and each link indicates that the two connected posts are commonly commented on by at least three users. We adopt the same setup as [Hamilton et al.2017] for the node attributes, and use the user-averaged distributions of comments across different communities as link attributes.

Dblp is a co-author network constructed from papers published from 2013 to 2017 in eight artificial intelligence conferences. We use the tf-idf vectors of paper titles as node attributes. The links are categorized from the author perspective, i.e. the one-hot embeddings of the common authors are used as link attributes. The node and link attributes are reduced to 200 dimensions using PCA.
Email and fmobile are two temporal networks constructed from user contacts in email and mobile-phone services. The contacts with exact timestamps are discretized into time slices and used as link attributes. As there are no available node features in these datasets, we use 128-dimensional node embeddings obtained from transductive LINE [Tang et al.2015] as the pretrained node features in all convolution-based models.
Table 2: Statistics of datasets.

Dataset  | Nodes  | Links     | |E| / |V| | Classes
reddit   | 61,836 | 1,222,411 | 19.77     | 8
dblp     | 14,389 | 111,858   | 7.77      | 8
email    | 986    | 16,064    | 16.29     | 42
fmobile  | 21,102 | 55,009    | 2.61      | 33
Baselines.
We compare the performance of LASE with baselines including raw features, LINE [Tang et al.2015], DeepWalk [Perozzi et al.2014], GCN [Kipf and Welling2016] and GraphSAGE [Hamilton et al.2017]. For LINE and DeepWalk, we adopt the online-style training strategy for the test / validation sets introduced in [Hamilton et al.2017], and a one-layer softmax-activated neural classifier is trained for all models. (As there is no implementation of online LINE, and [Qiu et al.2018] proves that LINE is theoretically equivalent to DeepWalk with walk_length=1, we use the implementation of online DeepWalk in [Hamilton et al.2017] instead; n_walks is increased accordingly to compensate for the reduction in node contexts.) To demonstrate the ability of LASE to leverage link attributes through the amplifiers, we also test the performance of a LASE variant, LASE-concat, implemented by naïvely concatenating link attributes to node attributes.
5.2 Nodewise Classification
We conduct node-wise classification on the four datasets mentioned above by predicting the community (reddit, email and fmobile) or the conference (dblp) that a node belongs to. In each dataset, part of the nodes are used as the training set, part as the validation set, and the rest as the test set; training nodes with no neighbors are discarded. The micro-averaged F1 scores on the test set are shown in Table 3. (We do not present the macro-averaged F1 scores, for which an analogous trend holds.)
As one of the most distinguished strengths of GCNs is aggregating neighborhood features, convolution-based models, including GCN, GraphSAGE and LASE, show significant advantages over proximity-based models on datasets with node attributes. By leveraging link attributes, LASE outperforms the other GCNs. Moreover, since LASE-RW and LASE-SAGE outperform the naïve implementation LASE-concat, the effect of the amplifier module is corroborated. Although there are no original node features in the two temporal networks, LASE still outperforms the pretrained features by exploiting edge attributes, while GCN and GraphSAGE do not capture this additional information and struggle with overfitting the proximity-based features.
Table 3: Micro-averaged F1 scores.

Model             | reddit | dblp   | email   | fmobile
LINE (online)     | 0.1802 | 0.2989 | 0.3604* | 0.3047*
DeepWalk (online) | 0.1714 | 0.3306 | 0.3249* | 0.4071*
GCN               | 0.8172 | 0.5033 | 0.6396  | 0.3908
GraphSAGE         | 0.8468 | 0.5798 | 0.6548  | 0.5334
LASE-concat       | 0.8438 | 0.5805 | 0.7005  | 0.5380
LASE-RW           | 0.8460 | 0.5433 | 0.7208  | 0.5441
LASE-SAGE         | 0.8633 | 0.5881 | 0.7310  | 0.5649
Raw Features      | 0.7923 | 0.4532 | –       | –
LINE (transd.)    | –      | –      | 0.6904  | 0.4749
Figure 4 a) demonstrates the accuracies of LASE-SAGE using contaminated link attributes with different signal-to-noise ratios (SNRs). That is, we add normally distributed noise of different standard deviations to the original link attributes according to the given SNRs, and separately train LASE-SAGE models under identical model settings. The SNR is defined as

$$\mathrm{SNR} = \sigma_{\mathrm{input}} / \sigma_{\mathrm{noise}},$$

where σ_input denotes the standard deviation of the inputs and σ_noise that of the added noise. As the noise level increases (i.e., as the SNR decreases), a significant downward trend in accuracy can be observed. This corroborates the informativeness of link attributes in the LASE architecture.
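A sketch of such a contamination procedure, assuming the conventional definition SNR = std(signal) / std(noise) (the exact noise model used in the experiments is an assumption here):

```python
import numpy as np

def add_noise_at_snr(features, snr, rng):
    """Contaminate features with Gaussian noise at a given signal-to-noise
    ratio, defined here (an assumption) as std(signal) / std(noise)."""
    sigma_noise = features.std() / snr
    return features + rng.normal(scale=sigma_noise, size=features.shape)

rng = np.random.default_rng(4)
link_attrs = rng.normal(size=(1000, 16))   # hypothetical link attributes
noisy = add_noise_at_snr(link_attrs, snr=2.0, rng=rng)
# At snr=2.0 the added noise has roughly half the std of the inputs.
```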
5.3 Comparison of Sampling Strategies
We look into the training processes under the different neighborhood sampling strategies introduced in Section 4, namely uniform sampling, gate sampling and minimal-variance (min_var) sampling. We separately train models with each sampling strategy on email, and present the variation of validation accuracy against training epochs in Figure 4 b). While the convergence speeds appear analogous, min_var sampling consistently attains better convergence performance than uniform and gate sampling. The reason that gate sampling does not show a significant advantage over uniform sampling may be that the norms of the transformed neighbor features vary greatly within a neighborhood.

Figure 4 c) shows the tradeoff between performance and efficiency made through different calculation intervals of the sampling distribution (under the min_var setup). As the interval increases, the performance slightly drops. Calculating the probabilities batch-wise attains a significant elevation in performance, but the computation cost can be unacceptably high on larger datasets. Additionally, once the interval becomes large enough, increasing it further does not significantly influence the training performance.
6 Conclusions and Future Work
In this paper, we propose LASE as an extension of graph convolutional networks, which leverages more information from graph links than existing GCNs by incorporating link attributes. The contribution of LASE is threefold: i) LASE provides a unified solution to a wider class of graph data by incorporating link attributes; ii) LASE outperforms strong baselines and naïve concatenating implementations by adequately leveraging the information in link attributes; iii) LASE adopts a more principled approach to determining the neural architecture and thus enjoys better explainability.
For future work, we are looking for better sampling solutions for LASE: although mitigated by calculation intervals, the current sampling setup remains rather clumsy when the graph becomes massively large. We are also looking for other possible approaches, hopefully with better performance, to incorporating link attributes. Besides, as LASE is a universal solution to graph-structured data, an intriguing direction is to design domain- or task-specific architectures based on LASE to attain better performance, such as more elegant adaptations to dynamic networks.
References
 [Bruna et al.2014] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations (ICLR’14), 2014.
 [Du et al.2018] Lun Du, Yun Wang, Guojie Song, Zhicong Lu, and Junshan Wang. Dynamic network embedding: An extended approach for skip-gram based network embedding. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18), 2018.

 [Gärtner et al.2003] Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In Proceedings of the Annual Conference on Computational Learning Theory (COLT’03), 2003.
 [Grover and Leskovec2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), pages 855–864, 2016.
 [Hamilton et al.2017] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS’17), 2017.
 [Henaff et al.2015] Mikael Henaff, Joan Bruna, and Yann Lecun. Deep convolutional networks on graphstructured data. Computer Science, 2015.
 [Huang et al.2018] Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. Adaptive sampling towards fast graph representation learning. CoRR, abs/1809.05343, 2018.
 [Jie et al.2018] Chen Jie, Tengfei Ma, and Xiao Cao. Fastgcn: Fast learning with graph convolutional networks via importance sampling. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18), 2018.
 [Kipf and Welling2016] Thomas N. Kipf and Max Welling. Semisupervised classification with graph convolutional networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR’16), 2016.
 [Li et al.2018] Ziyao Li, Liang Zhang, and Guojie Song. Sepne: Bringing separability to network embedding. CoRR, abs/1811.05614, 2018.

 [Niepert et al.2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning (ICML’16), 2016.
 [Owen2013] Art B. Owen. Monte Carlo theory, methods and examples. 2013.
 [Perozzi et al.2014] Bryan Perozzi, Rami AlRfou, and Steven Skiena. Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’14), 2014.
 [Qiu et al.2018] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM’18), 2018.
 [Ribeiro et al.2017] Leonardo F.R. Ribeiro, Pedro H.P. Saverese, and Daniel R. Figueiredo. Struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’17), 2017.
 [Schölkopf and Smola2002] B. Schölkopf and Alexander Johannes Smola. Learning With Kernels. The MIT Press, 2002.
 [Shervashidze et al.2011] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(3):2539–2561, 2011.
 [Tang et al.2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (WWW’15), 2015.
 [Tao et al.2017] Lei Tao, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. In Proceedings of the 34th International Conference on Machine Learning (ICML’17), 2017.
 [Vishwanathan et al.2008] S. V. N. Vishwanathan, Karsten M. Borgwardt, Imre Risi Kondor, and Nicol N. Schraudolph. Graph kernels. Journal of Machine Learning Research, 11(2):1201–1242, 2008.
 [Wang et al.2016] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), 2016.