1 Introduction
In recent years, deep learning has achieved great success in fields such as image classification, video processing and speech recognition. The data in these tasks is usually represented in Euclidean space. However, in many applications data is generated from a non-Euclidean domain and is represented as a graph. A graph consists of a finite set of vertices (also called nodes), together with a set of unordered pairs of these vertices for an undirected graph, or a set of ordered pairs for a directed graph. These pairs are known as edges. Graph data lets us capture the interdependence among instances (nodes), such as citations in a paper network, friendships in a social network and interactions in a molecular network. For instance, in a citation network, papers are linked to each other by citations and can be classified into different research areas. Graph data is complex because of its irregularity: the number of neighbors varies from node to node and the nodes have no natural ordering. As a result, some important deep learning operations do not carry over to the non-Euclidean domain. For example, convolutional neural networks (CNNs) cannot slide a convolution kernel of fixed size over a graph whose neighborhoods vary in size.
To handle the complexity of graph data, many studies have designed new models for graphs inspired by convolutional networks, recurrent networks and deep autoencoders. These models, which incorporate neural architectures, are known as graph neural networks. Following Wu et al. [22], graph neural networks are categorized into graph convolutional networks, graph attention networks [19, 25], graph autoencoders [13, 20], graph generative networks [6, 23] and graph spatial-temporal networks [24]. Among these, graph convolutional networks (GCNs) are the most important, because they are the foundation of the other graph neural network models. One of the earliest works on GCNs is Bruna et al. [3], which develops a variant of graph convolution. Since then, many works have improved graph convolutional networks [12, 7, 10, 16, 15]. GCN approaches fall into two categories. Spatial-based approaches perform the convolution directly in the graph domain by aggregating information from neighboring nodes. Spectral-based approaches define graph convolution through spectral graph theory, from the perspective of graph signal processing. Although spectral-based methods have a higher computational cost than spatial-based ones, they are more powerful at extracting features from graph data.
As the earliest convolutional networks for graph data, spectral-based models have achieved impressive results in many graph analytics tasks. However, spectral-based models are limited to undirected graphs [12]. The only way to apply them to directed graphs is to relax the directed graph to an undirected one, which cannot represent the actual structure of the directed graph. Some researchers combine a recurrent model with a spectral-based GCN to process temporal directed graphs [18], but they do not focus on the structure of the GCN itself.
To the best of our knowledge, we are the first to modify the propagation model of the spectral-based GCN layer so that it applies to directed graphs.
In this paper, we use a definition of the Laplacian matrix on directed graphs [4] to derive the mathematical form of the propagation model. We use eigendecomposition and Chebyshev polynomials to approximate the filter built on the directed Laplacian matrix and obtain our propagation model. We then use this propagation model to design spectral-based GCNs for directed graphs. In semi-supervised node classification on several directed graph datasets, our approach achieves better performance than state-of-the-art spectral-based and spatial-based GCN methods.
The remainder of this paper is organized as follows. Section 2 introduces the theoretical motivation of the classic spectral-based GCNs. Section 3 presents the Laplacian for directed graphs and the models we construct. Section 4 gives the details of our experiments on semi-supervised classification tasks. Discussion and concluding remarks are provided in Section 5 and Section 6.
2 Preliminaries
Spectral-based GCNs are built on the Laplacian matrix. For an undirected graph with $n$ nodes, let $A$ be the adjacency matrix and $D$ the diagonal matrix of node degrees, $D_{ii} = \sum_j A_{ij}$. The graph Laplacian matrix is defined as $L = D - A$. The normalized Laplacian matrix is defined as $L = I_n - D^{-1/2} A D^{-1/2}$, which is a matrix representation of a graph in graph theory and can be used to find many useful properties of a graph. In the rest of this section, $L$ denotes this normalized form. $L$ is symmetric and positive semidefinite. With these properties, the normalized Laplacian matrix can be factored as $L = U \Lambda U^T$, where $U$ is the matrix of eigenvectors ordered by eigenvalues and $\Lambda$ is the diagonal matrix of eigenvalues.
Spectral Graph Convolutions
The spectral graph convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian. Let $x \in \mathbb{R}^n$ be the feature vector of the graph's nodes, with one scalar per node. The graph Fourier transform of $x$ is defined as $\mathcal{F}(x) = U^T x$. The Fourier transform projects the input graph signal onto the orthonormal basis formed by the eigenvectors of the Laplacian matrix, which is equivalent to representing an arbitrary feature vector defined on the graph as a linear combination of those eigenvectors. The inverse graph Fourier transform is defined as $\mathcal{F}^{-1}(\hat{x}) = U \hat{x}$, where $\hat{x}$ is the output obtained from $x$ by the graph Fourier transform. Applying the convolution theorem [21] to the graph Fourier transform, the spectral convolution on graphs is defined as the multiplication of $x$ with a filter $g \in \mathbb{R}^n$ in the Fourier domain:
$$x * g = \mathcal{F}^{-1}(\mathcal{F}(x) \odot \mathcal{F}(g)) = U (U^T x \odot U^T g) \qquad (1)$$
where $*$ represents the convolution operation and $\odot$ represents the Hadamard product. For two matrices $A$ and $B$ of the same dimension $m \times n$, the Hadamard product $A \odot B$ is a matrix of the same dimension as the operands, with elements given by $(A \odot B)_{ij} = A_{ij} B_{ij}$. By defining the filter as $g_\theta = \mathrm{diag}(U^T g)$, Equation 1 can be simplified as
$$x * g_\theta = U g_\theta U^T x \qquad (2)$$
Here we can understand $g_\theta$ as a function of the eigenvalues of $L$, i.e. $g_\theta(\Lambda)$.
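As a minimal illustration of Equation 2, the Python sketch below applies a spectral filter to a signal on a two-node graph with a single undirected edge. The eigenvectors are written out by hand for this toy graph, and the function name `spectral_filter` and the low-pass filter values are illustrative assumptions rather than part of the method.

```python
import math


def spectral_filter(u, g, x):
    """Compute U diag(g) U^T x for an orthonormal eigenvector matrix U.

    u: n x n nested list whose columns are eigenvectors of the Laplacian
    g: list of n filter responses, one per eigenvalue (the diagonal of g_theta)
    x: list of n node values (the graph signal)
    """
    n = len(x)
    # Graph Fourier transform: x_hat = U^T x
    x_hat = [sum(u[i][k] * x[i] for i in range(n)) for k in range(n)]
    # Filtering in the spectral domain: scale each frequency component
    y_hat = [g[k] * x_hat[k] for k in range(n)]
    # Inverse graph Fourier transform: y = U y_hat
    return [sum(u[i][k] * y_hat[k] for k in range(n)) for i in range(n)]


# Two nodes joined by one undirected edge: the normalized Laplacian is
# [[1, -1], [-1, 1]], with eigenvalues (0, 2) and the eigenvectors below
# stored as the columns of U.
s = 1.0 / math.sqrt(2.0)
U = [[s, s],
     [s, -s]]

# A low-pass filter keeps the eigenvalue-0 component and removes the other,
# so both nodes receive the average of the input signal.
smoothed = spectral_filter(U, [1.0, 0.0], [1.0, 3.0])
```

With the all-pass filter `[1.0, 1.0]` the same call returns the input signal unchanged, which is a quick way to check that `U` is orthonormal.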
Chebyshev Spectral GCN
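As we can see, multiplication with the eigenvector matrix $U$ in Equation 2 is computationally expensive. To solve this problem, Defferrard et al. [7] propose ChebNet, which uses Chebyshev polynomials of the diagonal matrix of eigenvalues to approximate $g_\theta$. ChebNet parametrizes $g_\theta$ as a $K$-th order polynomial of $\Lambda$:
$$g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}) \qquad (3)$$
where $\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_n$ and $\lambda_{max}$ denotes the largest eigenvalue of $L$. The Chebyshev polynomials are defined recursively by $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$. Because $U T_k(\tilde{\Lambda}) U^T = T_k(\tilde{L})$, the eigenvector matrix no longer has to be computed, and the convolution of $x$ with a filter $g_\theta$ becomes:
$$x * g_\theta \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L}) x \qquad (4)$$
where $\tilde{L} = 2L/\lambda_{max} - I_n$. $\tilde{L}$ is a rescaling of the graph Laplacian that maps the eigenvalues from $[0, \lambda_{max}]$ to $[-1, 1]$, the interval on which the Chebyshev polynomials form an orthogonal basis.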
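The recurrence and the rescaling above can be sketched in a few lines of Python. The helper names (`chebyshev`, `rescale`, `filter_response`) are hypothetical, and the sketch evaluates the filter on a single scalar eigenvalue rather than on a full matrix.

```python
def chebyshev(k, x):
    """Evaluate T_k(x) with the recurrence T_k = 2 x T_{k-1} - T_{k-2}."""
    if k == 0:
        return 1.0
    t_prev, t_curr = 1.0, x  # T_0 and T_1
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr


def rescale(lam, lam_max):
    """Map an eigenvalue from [0, lam_max] to [-1, 1]."""
    return 2.0 * lam / lam_max - 1.0


def filter_response(theta, lam, lam_max):
    """Evaluate sum_k theta[k] * T_k(rescaled eigenvalue) for one eigenvalue."""
    x = rescale(lam, lam_max)
    return sum(t * chebyshev(k, x) for k, t in enumerate(theta))
```

For example, with `theta = [1.0, -1.0]` and `lam_max = 2.0` the response is `2 - lam`, so low frequencies (small eigenvalues) are amplified relative to high ones.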
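First-order ChebNet (1stChebNet)
Kipf et al. [12] propose a first-order approximation of ChebNet which assumes $K = 1$ and $\lambda_{max} = 2$, so that the filter is linear in $L$. Equation 4 simplifies to:
$$x * g_\theta \approx \theta_0 x + \theta_1 (L - I_n) x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x \qquad (5)$$
Further assuming $\theta = \theta_0 = -\theta_1$, the definition of graph convolution becomes
$$x * g_\theta \approx \theta (I_n + D^{-1/2} A D^{-1/2}) x \qquad (6)$$
Because $I_n + D^{-1/2} A D^{-1/2}$ has eigenvalues in the range $[0, 2]$, repeated application of this operator may lead to exploding or vanishing gradients in a deep neural network model. To alleviate this problem, Kipf et al. [12] use a renormalization trick $I_n + D^{-1/2} A D^{-1/2} \rightarrow \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, with $\tilde{A} = A + I_n$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. This is a further simplification, and in practice it amounts to adding a self-loop to each node. Finally, we can generalize this definition to the graph convolution layer:
$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta \qquad (7)$$
where $X \in \mathbb{R}^{n \times C}$ holds a $C$-dimensional feature vector for every node, $\Theta \in \mathbb{R}^{C \times F}$ is a matrix of filter parameters and $Z \in \mathbb{R}^{n \times F}$ is the convolved result. The graph convolution defined in this form is localized in space and connects the spectral-based methods with the spatial-based ones.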
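A minimal pure-Python sketch of the layer in Equation 7 follows, assuming a small dense, symmetric adjacency matrix stored as nested lists. The function names are hypothetical; a practical implementation would use sparse tensors in a framework such as PyTorch.

```python
import math


def matmul(a, b):
    """Dense matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def renormalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for an undirected graph."""
    n = len(adj)
    # Renormalization trick: add a self-loop to every node.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj, x, theta):
    """One propagation step Z = A_hat X Theta (no nonlinearity)."""
    return matmul(matmul(renormalized_adjacency(adj), x), theta)


# Two nodes joined by one edge, one input feature, one output feature.
Z = gcn_layer([[0.0, 1.0], [1.0, 0.0]], [[1.0], [3.0]], [[2.0]])
```

In this toy case every entry of the renormalized adjacency matrix is 0.5, so each node receives the average of the two input features scaled by the single filter weight.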
However, the above derivation rests on the premise that the Laplacian matrix represents an undirected graph. As a result, these spectral-based models are limited to undirected graphs [12]. The only way to handle directed edges is to relax the directed graph to an undirected one, which cannot represent the actual structure of the directed graph. To address this problem, we propose a spectral-based GCN method for directed graphs in the following section.
3 Method
Existing spectral-based GCN methods cannot work directly on directed graphs, but their ability to extract features from graphs is impressive. We expect that bringing this ability to directed graphs will improve performance, and that a spectral-based GCN fills a gap in the processing of directed graphs. Motivated by this, we design a spectral-based GCN method for directed graphs.
In this section, we first give the definition of the Laplacian for directed graphs [4], which is fundamental to spectral-based GCNs. We then approximate localized spectral filters on directed graphs using Chebyshev polynomials of the diagonal matrix of the Laplacian's eigenvalues. Finally, we describe the models we use in our experiments.
3.1 Laplacians for directed graphs
Eigenvalues and eigenvectors are closely related to almost all major invariants of a graph, linking one extremal property to another. They play a central role in the fundamental understanding of graphs in spectral graph theory [5]. The eigenvalues and eigenvectors of the Laplacian matrix provide very useful information about a graph. If two vertices are connected by an edge with a large weight, the values of the eigenvectors associated with small eigenvalues are likely to be similar at those vertices. The eigenvectors associated with larger eigenvalues oscillate more rapidly and are more likely to have dissimilar values on vertices connected by an edge with a high weight. In addition, the Laplacian matrix is symmetric and positive semidefinite, so its eigenvectors form an orthogonal basis of $n$-dimensional space, which makes the graph Fourier transform and its inverse convenient to perform in practice, as described in Section 2. For these reasons, the Laplacian matrix represents the properties of a graph well and its eigenvectors can be used as the filtering bases of a GCN. In order to deduce the principal properties and structure of a graph from its spectrum, we choose the Laplacian matrix for directed graphs as the foundation of our method.
Suppose $G = (V, E)$ is a directed graph with vertex set $V$ and edge set $E$. For a directed edge $(u, v)$ in $E$, we say that there is an edge from $u$ to $v$, or that $u$ has an out-neighbor $v$. The number of out-neighbors of $u$ is the out-degree of $u$, denoted by $d_u^{out}$. Using the notation of Section 2, we define the out-degree matrix of a directed graph as the diagonal matrix $D_{out}$ with $D_{out}(i, i) = \sum_j A_{ij}$, where $A$ is the adjacency matrix (or the weight matrix for a weighted directed graph). If there is a path in each direction between each pair of vertices of $G$, the directed graph is called strongly connected.
Transition Probability Matrix
Let $P$ be a transition probability matrix, where $P(u, v)$ denotes the probability of moving from vertex $u$ to vertex $v$. For a given directed graph $G$, a transition probability matrix is defined as
$$P(u, v) = \begin{cases} 1/d_u^{out} & \text{if } (u, v) \in E \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$
For a weighted directed graph with edge weights $w_{uv} \geq 0$, the transition probabilities are proportional to the corresponding weights:
$$P(u, v) = \frac{w_{uv}}{\sum_z w_{uz}} \qquad (9)$$
An unweighted directed graph is the special case in which every weight is 1 or 0. In practice, the transition probability matrix can be computed as
$$P = D_{out}^{-1} A \qquad (10)$$
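Computing the transition probability matrix amounts to dividing each row of the weight matrix by its sum. A minimal sketch, assuming a dense non-negative weight matrix given as nested lists and a hypothetical function name `transition_matrix`:

```python
def transition_matrix(weights):
    """Row-normalize a weight matrix: P[u][v] = w_uv / sum_z w_uz."""
    p = []
    for row in weights:
        out_weight = sum(row)  # out-degree (or total outgoing weight) of the vertex
        if out_weight == 0:
            # A vertex without out-edges has no well-defined transition row.
            raise ValueError("vertex without outgoing edges")
        p.append([w / out_weight for w in row])
    return p


# Weighted directed graph on three vertices.
P = transition_matrix([[0.0, 2.0, 2.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 3.0, 1.0]])
```

Every row of the result sums to 1. The explicit error for a vertex without out-edges is a design choice of this sketch; on a strongly connected graph that case cannot occur.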
Perron Vector
The Perron-Frobenius Theorem [11] states that an irreducible matrix with non-negative entries has a unique left eigenvector with all entries positive. This can be translated into the language of directed graphs. Let $\rho$ denote the eigenvalue associated with the all-positive eigenvector of the transition probability matrix $P$. The transition probability matrix $P$ of a strongly connected directed graph has a unique left eigenvector $\phi$ with $\phi(v) > 0$ for all $v$ and
$$\phi P = \rho \phi \qquad (11)$$
where $\phi$ is a row vector. According to the Perron-Frobenius Theorem, we have $\rho = 1$ and all other eigenvalues of $P$ have absolute value at most 1. We then normalize $\phi$ so that
$$\sum_v \phi(v) = 1 \qquad (12)$$
We call $\phi$ the Perron vector of $P$. For a strongly connected graph, $\phi$ is a stationary distribution of the random walk defined by $P$. Define $\Phi = \mathrm{diag}(\phi(v))$. Using $\Phi$, we establish the Laplacian for directed graphs in the following paragraph.
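One common way to approximate the Perron vector is power iteration on the left eigenvector equation. The sketch below is an assumption about how this could be done, not the computation used in the paper: it iterates the lazy walk $(I + P)/2$, which has the same stationary distribution as $P$ but avoids oscillation on periodic graphs. The function name and tolerances are hypothetical.

```python
def perron_vector(p, tol=1e-12, max_iter=10000):
    """Approximate the left eigenvector phi with phi P = phi and sum(phi) = 1."""
    n = len(p)
    phi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the random walk: (phi P)[v] = sum_u phi[u] P[u][v]
        step = [sum(phi[u] * p[u][v] for u in range(n)) for v in range(n)]
        # Lazy walk: stay put with probability 1/2 (same stationary distribution)
        new = [0.5 * (phi[v] + step[v]) for v in range(n)]
        total = sum(new)
        new = [value / total for value in new]
        if max(abs(a - b) for a, b in zip(new, phi)) < tol:
            return new
        phi = new
    raise RuntimeError("power iteration did not converge")


# Strongly connected example: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0]]
phi = perron_vector(P)
```

For this example the stationary distribution can be checked by hand: vertex 0 is visited half as often as vertices 1 and 2, giving $\phi = (0.2, 0.4, 0.4)$.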
Definition of Directed Laplacian
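As described in Section 2, for undirected graphs we have the definition $L = I_n - D^{-1/2} A D^{-1/2}$, and with $P = D^{-1} A$ we can rewrite this definition as
$$L = I_n - D^{1/2} P D^{-1/2} \qquad (13)$$
We now generalize this definition from undirected graphs to directed graphs. The main obstacle is that $D^{1/2} P D^{-1/2}$ is no longer symmetric for a directed graph. We therefore use the following definition [4], which replaces the degree matrix by $\Phi$ and symmetrizes the result, to guarantee that the normalized Laplacian is symmetric.
$$L = I_n - \frac{1}{2} \left( \Phi^{1/2} P \Phi^{-1/2} + \Phi^{-1/2} P^T \Phi^{1/2} \right) \qquad (14)$$
For an undirected graph the Perron vector is proportional to the node degrees, so Equation 14 reduces to Equation 13.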
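Given a transition matrix and its Perron vector, the directed Laplacian can be assembled entry by entry. In this sketch the function name `directed_laplacian` and the nested-list representation are illustrative assumptions; the formula is the symmetrized form of Chung [4], restated in the docstring.

```python
import math


def directed_laplacian(p, phi):
    """L = I - 1/2 (Phi^{1/2} P Phi^{-1/2} + Phi^{-1/2} P^T Phi^{1/2}).

    p:   transition probability matrix as nested lists
    phi: Perron vector of p (all entries positive, summing to 1)
    """
    n = len(p)
    root = [math.sqrt(value) for value in phi]
    lap = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            forward = root[u] * p[u][v] / root[v]   # (Phi^{1/2} P Phi^{-1/2})[u][v]
            backward = root[v] * p[v][u] / root[u]  # (Phi^{-1/2} P^T Phi^{1/2})[u][v]
            identity = 1.0 if u == v else 0.0
            lap[u][v] = identity - 0.5 * (forward + backward)
    return lap


# Directed example with Perron vector (0.2, 0.4, 0.4).
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0]]
L_dir = directed_laplacian(P, [0.2, 0.4, 0.4])

# Undirected edge between two nodes: recovers the usual normalized Laplacian.
L_und = directed_laplacian([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5])
```

The result is symmetric by construction, and the vector of square roots of the Perron vector lies in its null space, which mirrors the role of the square-root degree vector for the undirected normalized Laplacian.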
3.2 Spectral GCN for Directed Graph
As the Laplacian defined in Equation 14 is symmetric, we can compute its eigendecomposition and define spectral filters on it. We then approximate the filter using Chebyshev polynomials and truncate it to first order, as demonstrated in Section 2. Finally, we derive the definition of the directed graph convolution layer:
$$Z = \frac{1}{2} \left( \tilde{\Phi}^{1/2} \tilde{P} \tilde{\Phi}^{-1/2} + \tilde{\Phi}^{-1/2} \tilde{P}^T \tilde{\Phi}^{1/2} \right) X \Theta \qquad (15)$$
where the adjacency matrix (weight matrix) used to derive $\tilde{P}$ and $\tilde{\Phi}$ has a self-loop added for each node. That is, $\tilde{A} = A + I_n$, $\tilde{P} = \tilde{D}_{out}^{-1} \tilde{A}$, and $\tilde{\Phi}$ is calculated based on $\tilde{P}$. $X$ is the matrix of feature vectors for every node, $\Theta$ is a matrix of filter parameters and $Z$ is the convolved result.
This gives the propagation model for directed graph convolution of our method, DGCN (Directed Graph Convolutional Network). The details of the DGCN propagation model are shown in Figure 1. The symbols in the figure have the same meaning as in Equation 15. Edge information and node information are obtained from the input. The edge index and edge weight represent the edges and their weights in the graph after processing by the DGCN propagation model.
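The propagation step of Equation 15 can be sketched end to end in pure Python. This is a minimal dense sketch under stated assumptions, not the released implementation: the graph is small and strongly connected, the Perron vector is approximated with a fixed number of power iterations, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Dense matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dgcn_propagation_matrix(adj, iters=1000):
    """Build 1/2 (Phi^{1/2} P Phi^{-1/2} + Phi^{-1/2} P^T Phi^{1/2}) with self-loops."""
    n = len(adj)
    # Add a self-loop to every node: A~ = A + I
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Row-normalize by out-degree: P~ = D~_out^{-1} A~
    p = [[a[i][j] / sum(a[i]) for j in range(n)] for i in range(n)]
    # Perron vector by power iteration; the self-loops make the walk aperiodic.
    phi = [1.0 / n] * n
    for _ in range(iters):
        phi = [sum(phi[u] * p[u][v] for u in range(n)) for v in range(n)]
        total = sum(phi)
        phi = [value / total for value in phi]
    root = [math.sqrt(value) for value in phi]
    return [[0.5 * (root[i] * p[i][j] / root[j] + root[j] * p[j][i] / root[i])
             for j in range(n)] for i in range(n)]


def dgcn_layer(adj, x, theta):
    """One directed propagation step Z = A_hat X Theta (no nonlinearity)."""
    return matmul(matmul(dgcn_propagation_matrix(adj), x), theta)


# Directed graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
A_hat = dgcn_propagation_matrix([[0.0, 1.0, 0.0],
                                 [0.0, 0.0, 1.0],
                                 [1.0, 1.0, 0.0]])

# On a symmetric adjacency matrix the layer agrees with the undirected GCN layer.
Z = dgcn_layer([[0.0, 1.0], [1.0, 0.0]], [[1.0], [3.0]], [[2.0]])
```

Because the propagation matrix depends only on the graph, it can be computed once in preprocessing and reused by every layer and every training step.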
3.3 Models
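After introducing the propagation model, we design training models for semi-supervised node classification on directed graphs. In a preprocessing step, we calculate $\hat{A} = \frac{1}{2} ( \tilde{\Phi}^{1/2} \tilde{P} \tilde{\Phi}^{-1/2} + \tilde{\Phi}^{-1/2} \tilde{P}^T \tilde{\Phi}^{1/2} )$. Based on the conclusions of Section 3.2, we can naturally design models with multiple layers. Here we give a two-layer DGCN as an example.
$$Z = \mathrm{softmax}\left( \hat{A} \, \mathrm{ReLU}\left( \hat{A} X W^{(0)} \right) W^{(1)} \right) \qquad (16)$$
where $X$ is the matrix of node feature vectors. Note that $X$ does not contain the information presented in $A$, such as links between pages in a Wikipedia network. The neural network weights $W^{(0)}$ and $W^{(1)}$ are trained using gradient descent. In Equation 16, $W^{(0)}$ is an input-to-hidden weight matrix and $W^{(1)}$ is a hidden-to-output weight matrix. The softmax activation function, applied row-wise, is $\mathrm{softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$. We evaluate the cross-entropy loss over all labeled examples:
$$\mathcal{L} = - \sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf} \qquad (17)$$
where $Y$ denotes the labels and $\mathcal{Y}_L$ is the set of node indices that have labels. We also use dropout to reduce overfitting in our graph convolutional network.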
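The forward pass and the masked loss can be sketched as follows. This is a minimal illustration assuming a precomputed propagation matrix `a_hat`, a ReLU hidden activation, integer class labels instead of one-hot rows, and no dropout; all names are hypothetical.

```python
import math


def matmul(a, b):
    """Dense matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def relu(m):
    return [[max(0.0, value) for value in row] for row in m]


def softmax_rows(m):
    out = []
    for row in m:
        shift = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(value - shift) for value in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def dgcn_forward(a_hat, x, w0, w1):
    """Two-layer model: softmax(A_hat ReLU(A_hat X W0) W1)."""
    hidden = relu(matmul(a_hat, matmul(x, w0)))
    return softmax_rows(matmul(a_hat, matmul(hidden, w1)))


def masked_cross_entropy(z, labels, labeled_nodes):
    """Sum of -ln Z[l][label of l] over the labeled nodes only."""
    return -sum(math.log(z[l][labels[l]]) for l in labeled_nodes)


# Toy check with identity matrices everywhere: the logits equal the features.
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z = dgcn_forward(I2, I2, I2, I2)
loss = masked_cross_entropy(Z, labels=[0, 1], labeled_nodes=[0])
```

Only node 0 contributes to the loss here, which mirrors the semi-supervised setting: unlabeled nodes still take part in the propagation but are excluded from the sum.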
Considering semi-supervised classification tasks of different difficulty, we design two models in our experiments: a two-layer model and a three-layer model. We limit the depth to two and three layers to avoid the overfitting that comes with the growing number of parameters in deeper models, as described in [12]. Figure 2 shows the architectures of our models. Each hidden layer in the graph convolutional network is a DGCN propagation model.
4 Experiments
We test our models on semi-supervised node classification tasks on four datasets. All datasets in our experiments can be obtained from open sources. The datasets have different graph structures and come from different kinds of networks (citation networks, hyperlink networks and email networks), which makes the assessment more comprehensive.
4.1 Datasets
Dataset statistics are summarized in Table 1. For each dataset we report the total number of nodes and edges and the number of node classes. The numbers of nodes and edges of the largest strongly connected component (LSCC) are also shown in the table. For all datasets, we compute the strongly connected components of the graph and convert the graph into edge-list format. The details of each dataset are given as follows.
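Extracting the largest strongly connected component from an edge list can be sketched with Kosaraju's two-pass algorithm. The choice of algorithm and the function names are assumptions of this sketch; the text above does not say which method or library was used.

```python
from collections import defaultdict


def largest_scc(edges):
    """Return the node set of the largest strongly connected component."""
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # Pass 1: iterative DFS on the graph, recording nodes by finish time.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All out-neighbors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish time.
    best, assigned = set(), set()
    for start in reversed(order):
        if start in assigned:
            continue
        component, stack = {start}, [start]
        assigned.add(start)
        while stack:
            node = stack.pop()
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        if len(component) > len(best):
            best = component
    return best


def induced_edges(edges, keep):
    """Keep only the edges whose endpoints both lie in the node set."""
    return [(u, v) for u, v in edges if u in keep and v in keep]


edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)]
lscc_nodes = largest_scc(edges)
lscc_edges = induced_edges(edges, lscc_nodes)
```

Both passes use explicit stacks instead of recursion, so the sketch does not hit Python's recursion limit on graphs with long paths.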
Blogs
A directed network of hyperlinks among a large set of U.S. political weblogs from before the 2004 election [2]. It includes blog political affiliation as metadata. Links between blogs were automatically extracted from a crawl of the front page of each blog. In addition, the authors drew on various sources (blog directories, and incoming and outgoing links and posts around the time of the 2004 presidential election) and classified 758 blogs as left-leaning and the remaining 732 as right-leaning.
Wikipedia
The hyperlink network of Wikipedia pages on editorial norms [9], in 2015. Nodes are Wikipedia entries, and two entries are linked by a directed edge if one hyperlinks to the other. Editorial norms cover content creation, interactions between users, and formal administrative structure among users and admins. Metadata includes page information such as creation date, number of edits and page views. The number of norm categories is also given.
Email
The network was generated using email data from a large European research institution [14]. The data contains anonymized information about all incoming and outgoing email between members of the research institution. There is an edge $(u, v)$ in the network if person $u$ sent person $v$ at least one email. The emails only represent communication between institution members. The dataset also contains ground-truth community memberships of the nodes. Each individual belongs to exactly one of 42 departments at the research institute.
Coracite
Citations among papers indexed by CORA, an early computer science research paper search engine, from 1998 [1]. Nodes in the CORA citation network represent scientific papers. If a paper $i$ cites a paper $j$ that is also in this dataset, a directed edge connects $i$ to $j$. Papers not in the dataset are excluded. The papers are manually divided into 10 computer science areas according to each paper's description.
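Table 1: Dataset statistics.
| Dataset | Nodes | Edges | Nodes of LSCC | Edges of LSCC | Classes |
|---|---|---|---|---|---|
| Blogs | 1490 | 19090 | 793 | 15783 | 2 |
| Wikipedia | 1976 | 17235 | 1345 | 14601 | 10 |
| Email | 1005 | 25571 | 803 | 24729 | 42 |
| Coracite | 23166 | 91500 | 3991 | 18007 | 10 |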
4.2 Setup
We follow the experimental setup in [12]. In preprocessing, we compute the largest strongly connected component of each dataset. For simple tasks (e.g. datasets with at most 10 classes), we design a two-layer model. For complicated tasks (e.g. the Email dataset, which has more than 40 node classes and fewer than 1000 nodes), we design a three-layer model to better extract graph features. We use these two models for our four datasets. Following the settings of existing work [12], we train the models using about 10% of the nodes of the graph in each dataset. We then use the remaining 90% of the nodes as the test set to evaluate prediction accuracy. For the node features, we concatenate a one-hot encoding of each node in the graph with the original features from the dataset. We implement our method using PyTorch and PyTorch Geometric (a geometric deep learning extension library for PyTorch) [8]. The code to reproduce our experiments will be published if our paper is accepted.
4.3 Baselines
We compare with several state-of-the-art baseline methods, including a spatial-based method [17], spectral-based methods [12, 7] and a method that incorporates the attention mechanism [19]. The first is the classic spectral-based 1stChebNet (GCN) [12], one of the best spectral-based GCNs according to [12]. The second is the Chebyshev spectral graph convolution (ChebConv) [7]. The third is the graph attention network (GAT) [19], which leverages masked self-attentional layers to address the shortcomings of classic GCN methods. The fourth is the higher-order graph neural network (GraphConv) [17], which can take higher-order graph structures at multiple scales into account. For this method, we choose the mean function to aggregate node features, as described in their paper.
4.4 Results
Classification accuracies on the test sets are summarized in Table 2. We trained and tested our models on the datasets with different splits into training and test sets. We report the mean accuracy and confidence interval over 20 runs with random weight initializations. For the Blogs, Wikipedia and Coracite datasets, we use the two-layer model. For the Email dataset, we use the three-layer model. On each dataset, all methods use the same architecture and hyperparameters; the only difference is the propagation model of the convolution layer.
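The evaluation protocol can be sketched as follows. The confidence level is not stated above, so the sketch assumes a 95% interval based on Student's t distribution with 19 degrees of freedom (critical value 2.093 for 20 runs); the function names and the seed handling are likewise assumptions.

```python
import math
import random
import statistics


def split_nodes(num_nodes, train_fraction=0.1, seed=0):
    """Randomly split node indices into a training set and a test set."""
    rng = random.Random(seed)
    order = list(range(num_nodes))
    rng.shuffle(order)
    cut = int(round(train_fraction * num_nodes))
    return sorted(order[:cut]), sorted(order[cut:])


def mean_and_half_width(accuracies, t_critical=2.093):
    """Mean accuracy and half-width of the confidence interval.

    t_critical defaults to the two-sided 95% value for 19 degrees of
    freedom, i.e. for 20 runs; pass another value for a different setup.
    """
    mean = statistics.mean(accuracies)
    standard_error = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, t_critical * standard_error


# 803 nodes is the size of the Email LSCC in Table 1.
train_nodes, test_nodes = split_nodes(803, train_fraction=0.1, seed=1)

# Placeholder accuracies for 20 runs, for illustration only.
runs = [0.8] * 10 + [0.9] * 10
mean_acc, half_width = mean_and_half_width(runs)
```

Using a different seed for each run gives a different split, which matches the statement that the models are evaluated on several train and test splits.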
As Table 2 shows, our method outperforms the four baselines on all four datasets. A possible explanation is that our method uses a Laplacian designed for directed graphs, which is better able to capture the connections between nodes of the network and to extract features from directed graphs.
None of the methods performs well on the Coracite dataset, and we believe there are three reasons. First, the LSCC of the Coracite dataset has 3991 nodes and only 18007 edges, so the graph is sparse and the classification task is hard. Second, the dataset has no node features, so we have to use a one-hot encoding of each node as its features. Third, the classes of this dataset were manually assigned to 10 areas according to each paper's description, which may deviate from the ground truth.
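Table 2: Classification accuracy on the test sets.
| Method | Blogs | Wikipedia | Email | Coracite |
|---|---|---|---|---|
| GCN | | | | |
| GraphConv | | | | |
| GAT | | | | |
| ChebConv | | | | |
| DGCN (Ours) | | | | |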
5 Discussion
As demonstrated in the previous sections, our method for semi-supervised node classification on directed graphs outperforms several state-of-the-art methods. However, it has some limitations. First, the computational cost of our model increases with the graph size, because our method needs to compute an eigenvector of the transition probability matrix. Implementing the matrix products in coordinate format (COO) is a practical way to reduce the cost, but the computational cost of our spectral-based method remains a problem when parallelizing or scaling to large graphs. Second, our method has to process the whole graph at once, so the memory requirement is high, as for other spectral-based GCN methods. Approximations for large and densely connected graphs, as described in [12], can be very helpful. Third, our method rests on the premise that the input directed graph is strongly connected. We therefore have to extract the largest strongly connected component of each dataset, which removes some nodes from the original graph.
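To illustrate the coordinate-format product mentioned above, the following sketch multiplies a sparse matrix stored as parallel row, column and value lists with a dense feature matrix. It is a pure-Python illustration with hypothetical names; in practice a sparse tensor library would perform this operation.

```python
def coo_matmul(rows, cols, vals, dense, num_rows):
    """Multiply a COO sparse matrix by a dense nested-list matrix.

    rows, cols, vals: parallel lists, one entry per stored non-zero
    dense:            dense matrix with one row per sparse column index
    num_rows:         number of rows of the sparse matrix
    """
    width = len(dense[0])
    out = [[0.0] * width for _ in range(num_rows)]
    # The cost is proportional to (number of non-zeros) x (feature width),
    # not to the full n x n size of the matrix.
    for r, c, v in zip(rows, cols, vals):
        for j in range(width):
            out[r][j] += v * dense[c][j]
    return out


# Sparse matrix [[0, 2], [3, 0]] times dense matrix [[1, 1], [10, 10]].
product = coo_matmul(rows=[0, 1], cols=[1, 0], vals=[2.0, 3.0],
                     dense=[[1.0, 1.0], [10.0, 10.0]], num_rows=2)
```

The row and column lists play the role of an edge index and the value list the role of edge weights, so a propagation matrix with one entry per edge never has to be stored densely.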
6 Conclusion and Future Work
In this paper, we propose a novel propagation model for the spectral-based GCN layer that adapts it to directed graphs. Experiments on a number of directed network datasets suggest that our method can work directly on directed graphs in semi-supervised node classification tasks. Our method outperforms several state-of-the-art baseline methods, including spatial-based methods, spectral-based methods and a method that incorporates the attention mechanism.
There are several potential improvements and extensions to our work. For example, overcoming the practical problems described in Section 5, in order to reduce the computational cost and to process graphs in batches, remains a challenge for future work. We also believe it is feasible to combine other techniques, such as the attention mechanism, with our method to improve performance on more datasets. In addition, combining GCNs for directed graphs with reinforcement learning in multi-agent systems may be an attractive idea.
References
 [1] Cora citation network dataset – KONECT, April 2017.
 [2] Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election: Divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, LinkKDD ’05, pages 36–43, New York, NY, USA, 2005. ACM.
 [3] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.
 [4] Fan Chung. Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics, 9(1):1–19, 2005.
 [5] Fan R. K. Chung. Spectral graph theory. Number 92. American Mathematical Society, 1997.
 [6] Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.
 [7] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
 [8] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428, 2019.
 [9] Bradi Heaberlin and Simon DeDeo. The evolution of Wikipedia’s norm network. Future Internet, 8(2):14, 2016.
 [10] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graphstructured data. arXiv preprint arXiv:1506.05163, 2015.
 [11] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.
 [12] T. N. Kipf and M. Welling. Semisupervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations, 2017.
 [13] Thomas N Kipf and Max Welling. Variational graph autoencoders. arXiv preprint arXiv:1611.07308, 2016.
 [14] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
 [15] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing, 67(1):97–109, 2017.

 [16] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [17] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. arXiv preprint arXiv:1810.02244, 2018.
 [18] Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, and Charles E Leiserson. EvolveGCN: Evolving graph convolutional networks for dynamic graphs. arXiv preprint arXiv:1902.10191, 2019.
 [19] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In Proceedings of the International Conference on Learning Representations, 2017.
 [20] Chun Wang, Shirui Pan, Guodong Long, Xingquan Zhu, and Jing Jiang. Mgae: Marginalized graph autoencoder for graph clustering. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 889–898. ACM, 2017.
 [21] Wikipedia contributors. Convolution theorem — Wikipedia, the free encyclopedia, 2019. [Online; accessed 14May2019].
 [22] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.

 [23] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia. Learning deep generative models of graphs. In Proceedings of the International Conference on Machine Learning, 2018.
 [24] Y. Li, R. Yu, C. Shahabi, and Y. Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In Proceedings of International Conference on Learning Representations, 2018.
 [25] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. In Proceedings of the Uncertainty in Artificial Intelligence, 2018.