Binarized Graph Neural Network

04/19/2020 ∙ by Hanchen Wang, et al. ∙ University of Technology Sydney ∙ UNSW ∙ USTC

Recently, there have been some breakthroughs in graph analysis by applying graph neural networks (GNNs) following a neighborhood aggregation scheme, which demonstrate outstanding performance in many tasks. However, we observe that in existing GNN-based graph embedding approaches the parameters of the network and the embeddings of nodes are represented as real-valued matrices, which may limit the efficiency and scalability of these models. It is well known that a binary vector is usually much more space- and time-efficient than a real-valued vector. This motivates us to develop a binarized graph neural network that learns binary representations of the nodes with binary network parameters following the GNN-based paradigm. Our proposed method can be seamlessly integrated into existing GNN-based embedding approaches to binarize the model parameters and learn compact embeddings. Extensive experiments indicate that the proposed binarized graph neural network, namely BGN, is orders of magnitude more efficient in terms of both time and space while matching the state-of-the-art performance.




1 Introduction

Graph analysis provides powerful insights into how to unlock the value graphs hold. Due to this power, techniques for analyzing graphs are becoming an increasingly popular topic of study in both academia and industry. To effectively and efficiently support important analytic tasks on graph data, such as node/graph classification, node clustering, community detection, node recommendation, link prediction and graph visualization, a variety of graph embedding techniques (see Hamilton et al. (2017b); Cui et al. (2019) for a comprehensive survey) have been developed. Graph data is mapped into low-dimensional data such that the proximity relationship among graph nodes (i.e., objects) is preserved and off-the-shelf machine learning methods, which are designed to handle vector representations, can be immediately applied.

The existing graph embedding techniques can be roughly classified into three broad categories: (1) random walk based embedding (e.g., DeepWalk Perozzi et al. (2014) and Node2vec Grover and Leskovec (2016)); (2) node similarity based embedding (e.g., LINE Tang et al. (2015) and NetMF Qiu et al. (2018)); and (3) graph neural network (GNN) based embedding (e.g., GCN Kipf and Welling (2017), GraphSage Hamilton et al. (2017a), GAT Veličković et al. (2017) and AS-GCN Huang et al. (2018)). As reported by Leskovec et al. in their tutorial on graph embedding at WWW 2018, the first two categories of embedding techniques are only able to learn a “shallow” representation of the graph nodes due to the simplicity of the models. It is shown in Kipf and Welling (2017); Hamilton et al. (2017a) that the neural network based embedding methods significantly outperform the state-of-the-art techniques in the first two categories for the node classification task. Therefore, exploring how to use neural networks to create a “deep” representation more efficiently is a promising direction in graph representation learning. However, most of the existing graph neural network models suffer from scalability issues due to the high time and space cost of the real-valued model.

Recently, there have been some studies on learning binary graph embeddings (e.g., Lian et al. (2018); Shen et al. (2018); Yang et al. (2018)), in which each node is represented by a binary vector (code) instead of a real-valued vector. It has been shown that the binarized graph embedding can achieve much better time and space efficiency.

Time efficiency. It is well known that the distance computation of binary vectors (i.e., Hamming distance) is much more efficient than that of real-valued vectors (e.g., Euclidean distance). In addition to the specifically tailored search algorithms (e.g., Qin et al. (2018)), the dot product between binary vectors can also enjoy hardware support (e.g., xnor and the built-in CPU instruction popcount).

As stressed in a recent work Li et al. (2019) from DeepMind, the pairwise dot product of the vectors is used intensively by the model for some specific tasks (e.g., graph similarity computation in Bai et al. (2019)). Thus, the binary vector has been used in their graph matching network (GMN) to speed up the computation.

Space efficiency. The binary embedding can represent a node compactly while preserving the structure information well. As shown in Lian et al. (2018), INH-MF can achieve competitive graph node classification performance with 128 bits for each node, compared to the conventional embedding approaches (e.g., DeepWalk) that use 128 real-valued dimensions (each stored in many bits) per node. This is a great advantage when we face a large-scale graph because the binarized embedding of a graph is more likely to fit in main memory.
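To make the space argument concrete, the following back-of-the-envelope sketch compares the storage of a 128-dimensional real-valued embedding with a 128-bit binary code. The graph size and the use of 32-bit floats per real-valued dimension are assumptions chosen for illustration, and `embedding_bits` is a hypothetical helper, not part of any cited method.

```python
def embedding_bits(num_nodes, dims, bits_per_dim):
    """Total number of bits needed to store one embedding per node."""
    return num_nodes * dims * bits_per_dim


num_nodes = 1_000_000  # hypothetical graph size

# Assumption: each real-valued dimension is stored as a 32-bit float.
real_bits = embedding_bits(num_nodes, dims=128, bits_per_dim=32)
# A binary code needs a single bit per dimension.
binary_bits = embedding_bits(num_nodes, dims=128, bits_per_dim=1)

print("real-valued:", real_bits // 8, "bytes")
print("binary:     ", binary_bits // 8, "bytes")
print("ratio:      ", real_bits // binary_bits)
```

Under these assumptions the binary codes are 32 times smaller; with 64-bit floats the ratio would double.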

Motivation and Challenges. The existing GNN-based methods have demonstrated outstanding performance in various tasks such as classification Hamilton et al. (2017a); Kipf and Welling (2017); Veličković et al. (2017); Huang et al. (2018), link prediction Zhang and Chen (2018); Kazemi and Poole (2018), graph similarity match Bai et al. (2019); Li et al. (2019) and graph clustering Wang et al. (2019); Zhang et al. (2019). However, they may suffer from the limitation of the memory and speed due to the use of real-valued vectors for node and graph representations and model parameters.

Given the outstanding embedding quality, various applications of the GNN-based approaches and the space and time efficiency of the binarized representation, one may wonder if we can design a binarized GNN-based graph embedding approach such that we can achieve a good trade-off between embedding quality and time/space efficiency in the GNN-based methods.

We notice that the existing binarized graph embedding methods Lian et al. (2018); Shen et al. (2018) rely on the discretization of the matrix factorization following the node-similarity based approaches. They cannot be extended to binarize the GNN-based embedding due to the inherently different natures of the two categories of approaches.

To the best of our knowledge, the only attempt at the binarization of GNNs is from DeepMind in their recent work Li et al. (2019). Their binarization method converts each learned real-valued vector into a “nearly” binary vector by applying the tanh function, and uses an approximate Hamming distance for the binarization and optimization. However, the output of tanh is not an exact binary value and cannot be accelerated by binary logic operations (e.g., xnor and popcount). As an alternative, one may consider the Binarized Neural Network (BNN) (e.g., Hubara et al. (2016)) for graph embedding so that the representation is naturally binarized. However, BNN is not designed for graph data, and to the best of our knowledge, there is no existing graph embedding work based on BNN.

These issues motivate us to develop a new binarized graph embedding technique which can be integrated into existing GNN-based models to binarize the parameters and produce high-quality binarized graph embeddings. The key challenge is how to generate effective compact embedding vectors with binary network parameters in an efficient way. To address the challenge, we design a binarized graph neural network framework to learn the binary parameters and representations efficiently and effectively.

Contributions. Our principal contributions are summarized as follows:

  • To the best of our knowledge, this is the first study on binarized graph neural network (GNN) with binary parameters to generate binary graph representations. The proposed method, namely BGN, can be seamlessly integrated into the existing GNNs.

  • An end-to-end binarized graph neural network framework is proposed with binary weights and activations. This binarized framework immediately reduces the memory consumption of the network; the bit-wise operations between the binary vectors substantially speed up the inference of the model; and the gradient estimator enables our model to back-propagate effectively through discrete parameters and activations.

  • Extensive experiments on multiple benchmark networks are conducted for the node classification task. The results demonstrate that our proposed method outperforms existing binarized embedding methods by a large margin. Compared to the real-valued GNNs, our BGN model can achieve nearly state-of-the-art performance while consuming far fewer computational resources (in terms of parameter and embedding memory space and inference time).

  • Binarization approaches are applied to the GNN-based application GMN to show that, with our BGN techniques, the GMN model can dramatically reduce its time and space cost while keeping competitive performance.

  • Experiments further show that our proposed BGN technique allows users to trade off space/time against embedding quality in a flexible way by tuning the level and setting of binarization on the parameters and activations.

2 Related Works

Graph Embedding.

A key problem in machine learning on graphs is finding a way to incorporate information about the structure of the graph into the desired machine learning model. Graph embedding is one of the most promising approaches because it maps nodes into a low-dimensional space such that the structure of the graph is well preserved. Once accomplished, an existing machine learning approach (e.g., k-means clustering) can be used to assimilate and analyse the graph in the embedded low-dimensional space. Loosely following the seminal graph embedding approach, DeepWalk, three broad categories of embedding methods have appeared in the literature: (1) node similarity based embedding methods (e.g., LINE, NetMF), which rely on the proximity of the nodes w.r.t. various similarity metrics and use matrix factorization techniques to learn the embedding of the nodes; (2) random walk based embedding methods (e.g., DeepWalk and node2vec), which encode the nodes by applying the Skip-Gram technique Mikolov et al. (2013) on the random walks; and (3) graph neural network (GNN) based embedding methods (e.g., GCN, GraphSage and GIN), which apply neural network techniques on graphs to learn the representations of the nodes.

Most of the existing graph embedding studies use the real-valued vector to encode the graph nodes following the above three computing paradigms. Recently, three unsupervised approaches Lian et al. (2018); Shen et al. (2018); Yang et al. (2018) have been proposed to learn the binary embedding of the graphs following the node-similarity based embedding methods. Particularly, INH-MF Lian et al. (2018) and DNE Shen et al. (2018) are independently developed for binarized graph embedding based on the discretization of the matrix factorization on proximity graphs. BANE proposed in Yang et al. (2018) is a natural extension of DNE by considering both structure and attribute similarities on the attributed graphs.

Binary Hashing. Binary hashing has been widely used to learn binary vectors (codes) of objects in many applications. The most popular application is approximate nearest neighbor search in high-dimensional space, where binary hashing methods encode high-dimensional objects (e.g., documents and images) into binary codes while preserving the similarity distance in the original space. Many learning-to-hash approaches have been proposed, including unsupervised methods (e.g., Salakhutdinov and Hinton (2009); Liu et al. (2014)), supervised methods (e.g., Shen et al. (2015)), and deep learning based methods (e.g., Liu et al. (2016)). Please refer to Wang et al. (2018) for a comprehensive survey. Recently, three approaches Lian et al. (2018); Shen et al. (2018); Yang et al. (2018) have been proposed to learn the binary embedding of graphs following the node-similarity based embedding methods. To the best of our knowledge, there is no existing work on binarized graph embedding based on GNNs.

Binarized Neural Networks. Binarized neural networks were first proposed in BNN Courbariaux et al. (2016). The binarization technique proposed in Courbariaux et al. (2016) is used by most network binarization models. Among them, XNOR-Net Rastegari et al. (2016) and DoReFa-Net Zhou et al. (2016) are the most popular ones because of their strong performance on the image classification task.

XNOR-Net was proposed to achieve high classification accuracy on the ImageNet dataset while offering faster convolutional operations and memory savings. DoReFa-Net replaces the binarization by quantization, which allows the model to change the bit size for weights, activations and even gradient calculations during backpropagation.

However, these methods are all designed for computer vision tasks. Though they perform well on the image dataset, they cannot be adapted to the graph representation learning and graph analysis task directly.

Graph Neural Network Applications. There are several applications that are based on GNNs, such as Graph Matching Network Li et al. (2019) and SimGNN Bai et al. (2019). These models utilize GNNs and use the similarity (distance) of graph embeddings to approximate the graph edit distance and graph similarity.

The Graph Matching Network (i.e., GMN) is a novel GNN-based framework proposed by DeepMind to compute the similarity score between input pairs of graphs. Separate MLPs first map the input nodes of the graphs into a vector space. Then the propagation layer aggregates the messages of the edges and the cross-graph matching vector by an MLP or GRU whose input is the concatenation of node representations and edge vectors. A matching function is applied to compute the attention coefficients based on the node information between the input pair of graphs. The matching function is based on the softmax function over node vectors, which requires computing a vector space similarity such as Euclidean similarity, cosine similarity or dot product between all pairs of node representations. This attention coefficient calculation across two graphs requires a computation cost of $O(|V_1||V_2|d)$, where $|V_1|$ and $|V_2|$ indicate the number of vertices of input graph 1 and 2 respectively, and $d$ is the dimension of the node representation. The match vector is concatenated with the message vector and the node representation, then the concatenation is fed into an MLP or a recurrent neural network core to produce the new node representations. Given the learned node representations of a graph, the aggregation module proposed in Li et al. (2015) is used to obtain the graph representations. A similarity score in vector space, such as Euclidean similarity, cosine similarity or approximate Hamming similarity, is then computed between graph representations to approximate the similarity between the input graphs.

3 Background and Preliminaries

Recent studies have revealed that graph neural network can perform excellently on label classification tasks. The existing GNN-based graph embedding approaches share the same computing paradigm. GNNs take graph nodes’ feature and neighborhood information as the input. During the training, the representations of nodes (real-valued vectors) at each layer will be updated by the aggregators and non-linear activation functions. The output representations will be fed into the task-specific layer to calculate the loss of the model. Based on that, the model will be optimized by the optimizer through backpropagation. The main differences among these GNN-based graph embedding approaches are the design of the aggregator which combines the context representations and the loss function designed for different graph analytic tasks.

These models have real-valued parameters and learn a real-valued representation for each node in an end-to-end manner for graph node classification. However, the real-valued parameters and representations are space-consuming for storage and time-consuming for multiplication computation, especially for large-scale graphs. To address these issues, in this paper we devise a novel binarized graph neural network, namely BGN, with binary parameters in the neural network to learn binary embedding representations for node classification task.

The important notations used throughout the paper are summarized in Table 1.

Notation Definition
the graph dataset
the set for nodes and edges in the graph.
the feature information for node .
the neighborhood nodes of node .
denotes that the vector or matrix is binary-valued.
the hidden representation of node
the weight matrix in the neural network.
the binarization function which is used to transform the real-valued vector or matrix into binary-valued vector or matrix.
the attention coefficient between node and node .
Table 1: Summary of Notations

4 Binarized Graph Neural Network

Figure 1: The overall framework of the proposed model BGN. (a) All input node features are projected into a unified representation space by binary-valued weights.(b) Masked summation between binary matrix and real-valued matrix is employed to speed up the dot product. (c) Binary attention coefficients are produced based on the hidden representations. (d) Output of the layer is calculated via multi-head attention mechanism. (e) xnor and popcount are employed to calculate the dot product between binary-valued matrix. (f) Loss calculation and end-to-end optimization for the node classification task.

As illustrated in Figure 1, we introduce a new graph neural network with binarized weights and activations. Our model BGN (Binarized Graph Neural Network) is based on the attention mechanism and can be easily adapted to other graph neural network frameworks. For a given graph, BGN takes the nodes and their contexts, including feature and neighborhood structure information, as input. The binarization function transforms the weights, activations and even attention coefficients into binarized vectors to reduce the time and space complexity, while the attention mechanism enables the nodes to attend over their neighbors' features. We also apply the balance function to ensure that the numbers of $+1$ and $-1$ entries in the binarized vectors are almost equal. Furthermore, the gradient estimator is used for backpropagation of gradients through discretization.

The following subsections present the listed key components of our model:

  • Section 4.1 introduces the framework of our work.

  • Section 4.2 introduces the binarization of our model in detail, including the forward propagation and backpropagation.

  • Section 4.3 describes the optimization objective of our model.

  • Section 4.4 introduces the techniques we used to reduce the time and space complexity and improve the performance.

  • Section 4.5 introduces the adaptation of our model to other GNN frameworks.

4.1 Framework

Algorithm 1 illustrates the framework of our model. We follow the attention mechanism introduced in Vaswani et al. (2017); Veličković et al. (2017) to incorporate the importance of a node's neighbors into the graph representation learning process. Given a graph $G = (V, E)$, where $V$ and $E$ denote the sets of graph nodes and edges respectively, we use the node features and the neighborhood information of the nodes as inputs. Our model first produces the binarized node representation for each node within the input graph. After that, the binarized node embeddings are fed into the output layer to compute the loss for a specific task such as node classification.

Input: graph, node features, number of layers, binarization function, number of attention heads, neighbors of each node

Output: Classification result

1:Let ;
2:for  do
4:     for  do
5:         for  do
6:              ;          
9:for  do
11:return , the classification result for node
Algorithm 1 Binarized Graph Neural Network

Attention Mechanism. Our proposed framework is based on the graph attention mechanism. The attention layer is utilized in our model to learn the importance of every node to other nodes. The key is to obtain the importance of one node's features to other nodes, that is, the attention coefficients of the input graph; afterwards, each node's features can attend to other nodes. Inspired by Veličković et al. (2017), we apply masked attention in the model to keep the structural information of the input graph. Only the attention coefficients between a node and its neighbors are computed.

In order to obtain the attention coefficients, we use a shared binarized weight matrix to apply the linear transformation to each node. The softmax function is used to normalize the coefficients, but unlike the model proposed in Veličković et al. (2017), the LeakyReLU activation is not employed in our model; instead, the sign function is used to binarize the attention coefficients. With Equation (1), we obtain a binarized attention coefficient matrix. The matrix contains zeros and is sparse because we only compute the attention coefficients between neighbors. The binarization function used for the attention coefficients maps 0 to 0, positive values to $+1$ and negative values to $-1$.

Once the attention coefficient matrix is obtained, it is used to compute the output of the attention layer. The attention coefficients are multiplied with the linearly transformed node features. We employ the multi-head attention mechanism to stabilize the learning process. The binarization function, which serves as the activation function, is applied to every attention head to binarize the pre-activations. The concatenation of the outputs of the K independent attention heads is the output of the attention layer. Therefore, the output node representation is as follows:


where the outputs of the attention heads are concatenated to form the output binarized node representation.

After several attention layers, the node representation is fed into the last layer to calculate the loss for the specific task, which is classification in this paper. We introduce the learning objective in Section 4.3.

4.2 Binarization

In this section, we introduce how to obtain a graph neural network with binary parameters that can learn binary representations. Section 4.2.1 introduces the binarization function used to transform the real-valued parameters and pre-activations into binary space. Section 4.2.2 introduces the gradient estimators that enable the binarized model to be optimized by the off-the-shelf optimizers such as Adam and SGD.

4.2.1 Forward Propagation

The binarization function is an important component of our model. A specific binarization function is chosen in the forward propagation to binarize the weights and the activations, so that the low-bit parameters and activations reduce the time and space complexity. In our case, various binarization functions will work, and the most straightforward example is the sign function. As mentioned in Courbariaux et al. (2016) and Rastegari et al. (2016), deterministic and stochastic binarization functions based on the sign function are widely applied to the continuous pre-activations as well as the real-valued weights to obtain binarized activations and weights.


$$x^b = \mathrm{sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise.} \end{cases} \tag{3}$$

The above equation is the deterministic binarization function, where $x$ is the real-valued variable. The stochastic binarization is the sign function with probability:

$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p, \end{cases} \tag{4}$$

where $\sigma$ denotes the sigmoid function. The stochastic binarization is more appealing but requires the hardware to generate random bits, while the deterministic binarization is easier to calculate. The deterministic binarization function (i.e., Equation (3)) is applied for the binarization of weights and activations because the deterministic sign function provides more stable and reproducible results. Please note that we use a variant of the deterministic sign function which maps 0 to 0 to binarize the attention coefficients.
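The three binarization variants discussed in this subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: the function names are hypothetical, the logistic function is assumed for the sigmoid, and the deterministic variant is assumed to map 0 to $+1$.

```python
import math
import random


def sign_deterministic(x):
    """Deterministic binarization: +1 for x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def sigmoid(x):
    """Numerically stable logistic sigmoid (assumed form of the sigmoid)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sign_stochastic(x, rng=random):
    """Stochastic binarization: +1 with probability sigmoid(x), else -1."""
    return 1 if rng.random() < sigmoid(x) else -1


def sign_ternary(x):
    """Variant for attention coefficients: 0 stays 0, otherwise the sign."""
    if x == 0:
        return 0
    return 1 if x > 0 else -1


pre_activations = [-1.5, -0.2, 0.0, 0.7]
print([sign_deterministic(x) for x in pre_activations])  # [-1, -1, 1, 1]
print([sign_ternary(x) for x in pre_activations])        # [-1, -1, 0, 1]
```

The stochastic variant needs a random draw per element, which is the extra cost mentioned above; passing a seeded `random.Random` instance as `rng` makes its output reproducible.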

4.2.2 Backpropagation

In this part, we describe how to backpropagate the gradients through the binarization function. We adapt the gradient estimator into our model for better optimization.

Propagating gradients through the binarization function. The binarization function has zero derivative almost everywhere, which leads to zero gradients of the loss function w.r.t. the pre-activations and weights. The trainable variables cannot be updated with zero gradients. Therefore, the model cannot be trained by simple backpropagation, and an estimate of the gradients must be obtained for optimization. Previous studies have investigated how to propagate gradients through stochastic discrete functions. Below we investigate two popular gradient estimators for the binarization function: the straight-through estimator and the REINFORCE estimator Williams (1992).

Straight-through estimator. The straight-through estimator is a simple gradient estimator. It estimates the derivative of the binarization function with respect to the pre-activation or weight as $\mathbf{1}$ (a vector or matrix whose elements are all 1). Let $h^b$ denote the binarized representation and $a$ denote the pre-activation before binarization. The straight-through estimate of the gradient of the loss $L$ w.r.t. the pre-activation is thus:

$$\frac{\partial L}{\partial a} = \frac{\partial L}{\partial h^b} \odot \mathbf{1} = \frac{\partial L}{\partial h^b} \tag{5}$$

This gradient will then be back-propagated to obtain the gradient of the quantities (i.e., pre-activations or weights) that influence $a$.
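As a toy illustration of the straight-through idea, the sketch below trains a single linear unit whose weights are binarized in the forward pass while real-valued latent weights are updated in the backward pass. All names and numbers (`train_step`, the initial weights, the learning rate, the squared loss) are hypothetical choices made for this example and are not taken from the model described in this paper.

```python
def sign(x):
    """Deterministic binarization: +1.0 for x >= 0, otherwise -1.0."""
    return 1.0 if x >= 0 else -1.0


def train_step(w, x, target, lr=0.1):
    """One SGD step on latent weights w using the straight-through estimator."""
    wb = [sign(wi) for wi in w]                   # forward: binarize latent weights
    y = sum(wbi * xi for wbi, xi in zip(wb, x))   # forward: binary-weight dot product
    loss = 0.5 * (y - target) ** 2
    grad_wb = [(y - target) * xi for xi in x]     # exact gradient w.r.t. binary weights
    grad_w = grad_wb                              # straight-through: d sign(w)/dw taken as 1
    w_new = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    return w_new, loss


w = [0.3, 0.2, -0.3]   # latent real-valued weights (hypothetical values)
x = [1.0, -1.0, 1.0]   # one input example
target = 3.0           # reachable only with binary weights [+1, -1, +1]

losses = []
for _ in range(5):
    w, loss = train_step(w, x, target)
    losses.append(loss)

print(losses)
print([sign(wi) for wi in w])
```

Although the sign function itself has zero derivative almost everywhere, the latent weights still receive a usable update signal, and in this example the binarized weights flip to the pattern that fits the target after the first step.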

REINFORCE estimator. The REINFORCE estimator is described in Bengio et al. (2013) as a way to estimate the expectation of the gradient of the loss with regard to the pre-activation vector or weight. When the binarization function is stochastic with the probability given by the sigmoid function, it has been proven that:


where $\sigma$ is the sigmoid function and $c$ is a constant vector. To minimize the variance of the estimation, $c$ can be chosen as:


The REINFORCE estimator can work directly on the weights and pre-activations without actual computation of the gradient. The estimation is obtained by monitoring the numerator and denominator during the training process.

Compared with the straight-through estimator, the REINFORCE estimator is more advanced and performs better in many applications. However, we observe that in our setting its performance is not superior to that of the straight-through estimator. On the other hand, the straight-through estimator helps the model obtain the gradient faster than the REINFORCE estimator due to its simplicity. The comparison between these two gradient estimators with regard to performance is included in Section 5. In practice, we choose the straight-through estimator for our model in the experiments.

4.3 Optimization Objectives

Existing GNN-based graph embedding approaches provide an end-to-end model, which focuses on the node classification task. Therefore, our model is also learned for the node classification task. Below, we introduce the objective of BGN and the learning process that optimizes the parameters.

For node classification learning, we feed the binarized embedding into the output layer to predict the class label of the node. The predicted probability of each label is written as:


where denotes the number of labels for each node. After obtaining the classification result in Equation (8), we calculate the cross-entropy as the loss for the node classification task.


Here the loss is accumulated over the set of nodes that have label information and are used in the training process, and the ground-truth classification labels are given as a multi-hot encoding.

The gradients will be back propagated via estimator and be applied on the optimization of parameters by the off-the-shelf optimizer during the training process.

4.4 Techniques to Improve the Model

Several techniques are used in the binarized graph neural network model to reduce the time and space complexity and improve the performance. The logic operation xnor between binary values, the built-in CPU instruction popcount and the masked summation are used to replace the traditional arithmetic dot product to reduce the time cost. Figure 2 is a toy example that illustrates the differences between these operations. The balance function is used to balance the numbers of $+1$ and $-1$ entries in the embedding vectors, which can raise the performance of the GMN. Also, the binary parameters of the neural network and the binary node representations directly reduce the space complexity.

Figure 2: The toy examples of (a) dot product (b) Masked summation and (c) xnor and popcount instruction

4.4.1 xnor and popcount

The logic operation xnor and the built-in CPU instruction popcount are used to replace the dot product between binary matrices.

Input A Input B Output
+1 +1 +1
+1 -1 -1
-1 +1 -1
-1 -1 +1
Table 2: xnor calculation

As shown in Table 2, xnor produces a binary value from inputs of $+1$ and $-1$. The instruction popcount is then employed to count the number of bits that are set to $+1$. The xnor can be more than one order of magnitude faster than the dot product, which can dramatically reduce the time cost. As mentioned in Courbariaux et al. (2016), a 32-bit floating point multiplier costs about 200 Xilinx FPGA slices, whereas a 1-bit xnor gate costs only 1 slice.
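The identity behind this replacement can be sketched as follows: if two $+1$/$-1$ vectors of dimension d agree in m positions, their dot product is m minus (d - m), i.e. 2m - d, and m is exactly the popcount of their xnor. The helper names `pack` and `xnor_popcount_dot`, and the encoding of $+1$ as bit 1 and $-1$ as bit 0, are assumptions made for illustration; a real implementation would rely on hardware popcount over machine words rather than Python integers.

```python
def pack(vec):
    """Pack a +1/-1 vector into an int, encoding +1 as bit 1 and -1 as bit 0."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_popcount_dot(a_bits, b_bits, d):
    """Dot product of two packed +1/-1 vectors of dimension d."""
    mask = (1 << d) - 1
    agree = ~(a_bits ^ b_bits) & mask   # xnor: bit is 1 where the entries match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - d              # matches minus mismatches


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))
fast = xnor_popcount_dot(pack(a), pack(b), len(a))
print(reference, fast)  # 2 2
```

The packing step is paid once per vector, after which every pairwise product needs only one xor, one negation and one bit count per machine word.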

4.4.2 Masked Summation

Masked summation is used to replace the dot product between a binary matrix and a real-valued matrix. The binary matrix is first transformed into a mask matrix of "True" and "False" values. During the multiplication, the real-valued vector is masked by the corresponding mask vector, which produces a positive masked vector containing only the elements at the positions marked "True" and a negative masked vector containing only the elements at the positions marked "False". The model calculates the sums of the positive and negative masked vectors separately. The difference between these two sums is the result of the dot product between the given matrices.

The masked summation can reduce the time cost of the dot product of two matrices. The naive dot product between two real-valued matrices needs one multiplication and one addition for every pair of multiplied elements, while the masked summation between a binary matrix and a real-valued matrix needs only additions and a single subtraction per output element. Theoretically and also in practice, the masked summation can significantly reduce the time cost of our proposed binarized graph neural network.
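The masked summation described above can be sketched in plain Python as follows. The function names `masked_sum_dot` and `masked_matmul` and the example matrices are hypothetical; the sketch only demonstrates that the multiplication-free formulation gives the same result as the ordinary product, not the speed of an optimized implementation.

```python
def masked_sum_dot(binary_row, real_col):
    """Dot product of a +1/-1 vector and a real vector without multiplications."""
    pos = sum(x for b, x in zip(binary_row, real_col) if b == 1)   # mask is "True"
    neg = sum(x for b, x in zip(binary_row, real_col) if b == -1)  # mask is "False"
    return pos - neg


def masked_matmul(B, X):
    """B: n x m matrix with +1/-1 entries; X: m x p real-valued matrix."""
    cols = list(zip(*X))
    return [[masked_sum_dot(row, col) for col in cols] for row in B]


def naive_matmul(B, X):
    """Reference product using ordinary multiply-and-add."""
    cols = list(zip(*X))
    return [[sum(b * x for b, x in zip(row, col)) for col in cols] for row in B]


B = [[1, -1, 1],
     [-1, -1, 1]]
X = [[0.5, 2.0],
     [1.5, -1.0],
     [-2.0, 4.0]]

print(masked_matmul(B, X))  # [[-3.0, 7.0], [-4.0, 3.0]]
print(naive_matmul(B, X))   # [[-3.0, 7.0], [-4.0, 3.0]]
```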

4.4.3 Balance Function

The distribution of $+1$ and $-1$ is sometimes unbalanced in the representation vectors. For example, if most pre-activations have positive elements, the output graph representation vector of the binarization function will be formed mainly by $+1$. Then the dot product of two such vectors will be close to the dimension of the vectors. This unwanted situation should be avoided because it dramatically lowers the effectiveness of the proposed model, especially when BGN is applied to GMN, which requires a great number of dot products between representations. As a result, we apply the following balance function to the pre-activations before binarization in order to balance the distribution of positive and negative elements of the pre-activations:

$$\mathrm{balance}(a) = a - \bar{a} \tag{10}$$

where $\bar{a}$ is the vector whose elements are all equal to the mean value of the pre-activation vector $a$. The balance function ensures that the pre-activation vectors contain almost half positive and half negative elements, which leads to a balanced distribution of $+1$ and $-1$ after binarization.
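A minimal sketch of the mean-centering step described above is given below, assuming the balance function simply subtracts the mean of the pre-activation vector before the sign function is applied. The function names and the example values are hypothetical.

```python
def balance(pre_activation):
    """Subtract the mean so that positive and negative entries are roughly balanced."""
    mean = sum(pre_activation) / len(pre_activation)
    return [a - mean for a in pre_activation]


def sign(x):
    """Deterministic binarization: +1 for x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


a = [0.9, 1.4, 0.2, 2.1, 0.4, 1.6]   # all-positive pre-activations (hypothetical)

plain = [sign(v) for v in a]
balanced = [sign(v) for v in balance(a)]

print(plain)     # [1, 1, 1, 1, 1, 1]
print(balanced)  # [-1, 1, -1, 1, -1, 1]
```

Without the balance step every entry binarizes to $+1$ and the code carries no information; after centering, half of the entries become $-1$. Note that subtracting the mean only gives an approximately even split in general, since the mean and the median of a vector can differ.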

4.5 Adapted to Other GNN Based Models

The proposed binarized graph neural network is a very general framework that can be adapted to other graph neural network based models to project the real-valued parameters and activations into the binary space and thereby reduce the space and time cost. We introduce how we binarize the state-of-the-art GNN-based model AS-GCN Huang et al. (2018) and the graph matching network.

4.5.1 Binarization of AS-GCN

AS-GCN is a general framework that is designed for fast representation learning based on graph neural networks such as GCN. Therefore, the binarization of AS-GCN is similar to our proposed BGN. We use the deterministic binarization function to binarize the parameters and pre-activations of AS-GCN, and the straight-through estimator is employed for backpropagation. The binarized model is denoted as BGN-ASGCN in our experiments.

4.5.2 Binarization of GMN

As mentioned above, the time cost of GMN comes mainly from the pair-wise node similarity computation. We apply the deterministic binarization function (Equation (3)) to the pre-activations and transform the node and graph representations into binary codes such that xnor can replace the dot product. The straight-through estimator (Equation (5)) is used for back propagation. Furthermore, we noticed that the two binary values are usually not symmetrically distributed, which dramatically lowers the performance; hence, the balance function (Equation (10)) is employed on the graph representations.
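The replacement of the dot product by xnor and popcount can be sketched as below. The sketch assumes vectors with entries +1 and -1 that are packed into an integer, one bit per dimension; `pack_bits` and `xnor_popcount_dot` are hypothetical helper names used only for this illustration.

```python
def pack_bits(vector):
    """Pack a +1/-1 vector into an int: bit i is set when vector[i] == +1."""
    word = 0
    for i, v in enumerate(vector):
        if v > 0:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, b_bits, dim):
    """Dot product of two packed +1/-1 vectors of length `dim`."""
    mask = (1 << dim) - 1
    agree = ~(a_bits ^ b_bits) & mask    # xnor: 1 where the signs match
    matches = bin(agree).count("1")      # popcount
    return 2 * matches - dim             # matches minus mismatches


# Toy vectors (illustrative values only).
a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

expected = sum(x * y for x, y in zip(a, b))
fast = xnor_popcount_dot(pack_bits(a), pack_bits(b), len(a))
```

Because each matching position contributes +1 and each mismatch contributes -1, the dot product equals twice the number of matches minus the dimension, so a single xnor and a popcount per machine word suffice.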

5 Experiment

We conduct extensive experiments to evaluate the performance of our model on the node classification task using real-world network datasets. We thoroughly compare the time and space efficiency of the proposed model and the baseline models. The case study shows the effectiveness and efficiency that our framework brings to GNN-based applications such as GMN.

5.1 Dataset

To facilitate the comparison between our model and the relevant baselines, we conduct the classification experiments on three well-known citation network datasets: Cora, Citeseer and Pubmed Sen et al. (2008). Each dataset contains bag-of-words representations of documents and citation links between the documents. A graph is constructed based on the citation links. In the classification task, we only use 20 labeled instances per class for training. The test data contains 1000 nodes as in GCN, GAT and AS-GCN.

The details of the datasets are summarized in Table 3.

Dataset #Nodes #Edges #Classes #Labeled Nodes
Cora 2708 5429 7 140
Citeseer 3327 4732 6 120
Pubmed 19717 44338 3 60
Table 3: Citation Datasets

5.2 Baseline Methods

The following GNN-based and binary embedding methods are compared as baselines:

GCN (Graph Convolutional Network) Kipf and Welling (2017) is a semi-supervised neural network method for node classification.

GAT (Graph Attention Network) Veličković et al. (2017) is a graph neural network model which first exploits the attention mechanism to solve the node classification task.

AS-GCN (Adaptive Sampling over GCN) Huang et al. (2018) is a state-of-the-art method for the node classification task. AS-GCN aims to increase the scalability of GCN using adaptive sampling. The experiments demonstrate that the application of BGN can further reduce the time and space complexity of AS-GCN.

GAT-binary and ASGCN-binary are the models that directly apply the sign function to the node representations learned by the original versions of GAT and AS-GCN. The naively binarized representations are then fed into the task-specific layer to produce the classification result.

GAT-tanh and ASGCN-tanh are the models that employ the tanh binarization function used in DeepMind's work. The tanh function is used to binarize the parameters and embedding vectors of GAT and AS-GCN. We clip the values of the parameters and activations in both models to make sure that tanh can produce "exact" binary codes.

INH-MF Lian et al. (2018) is a MF-based information network hashing algorithm that learns binary codes as node embedding which can preserve high-order proximity.

BANE (Binarized Attributed Network Embedding) Yang et al. (2018) is an extension of DNE Shen et al. (2018) which is based on the Weisfeiler-Lehman proximity matrix factorization learning function to produce binary node representations.

5.3 Experiment Setup

For the performance experiment, we evaluate the models with representations of the same bit width. For the inference efficiency experiment, the embedding dimensions of our method and the baseline methods are all set to 64. During the training process, the whole graph can be seen, but only a few nodes are labeled while most nodes have no label information. We put the information of all nodes into one training phase because it is needed to calculate the graph attention coefficients.

For this classification task, we report the average accuracy of the evaluated GNN-based embedding approaches over ten independent runs using the accuracy metric introduced in Kipf and Welling (2017); Veličković et al. (2017). Because INH-MF and BANE only produce binary embedding vectors and have no built-in classifier, we employ the one-vs-rest logistic regression implemented by Liblinear Fan et al. (2008) to obtain the classification result of the networks, in which 90% of the nodes are labeled.

All the experiments were conducted on a server running RHEL 7.5 with 2x 2.4GHz Intel Xeon E5-2680 v4 (14 Cores) CPU, 256GB 2400MHz ECC DDR4-RAM and 2x NVIDIA Quadro P5000 16GB Graphics Card (GPUs) (2560 Cores).

5.4 Classification Results

Because our model produces the compact representations for vertices, we compare the performance between our model and other baselines with the same bit width.

5.4.1 Comparison Among Binary Embedding Methods

We compare the classification results between our model and other binary-valued embedding methods.

Figure 3: Classification results on the three citation network datasets among the binary-valued embedding methods with different embedding dimensions

As shown in Figure 3, under different embedding dimensions, BGN significantly outperforms all the other binary-valued embedding methods on all three datasets. With the help of the graph neural network, our model can make better use of the graph-structured data and feature information, and it is trained specifically for the node classification task. Therefore, our model outperforms the MF-based binarized graph embedding models by a large margin. In comparison with the naively binarized GAT-binary and ASGCN-binary, our model considers the binary property of parameters and vectors during the training process, and hence achieves better accuracy. As for GAT-tanh and ASGCN-tanh, the tanh function has zero gradient when its output saturates at either binary value, and it produces real-valued output whenever the gradient is not zero. This property makes the tanh function unsuitable for binarizing the neural network. When the input values are clipped to produce exact binary parameters and embeddings via the tanh function, the gradient will be zero, which results in insufficient optimization and worse performance than BGN.
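The saturation argument can be checked numerically with the short sketch below. It compares the derivative of tanh with a straight-through estimator for the sign function. The clipped form of the estimator used here (gradient 1 when the input lies in [-1, 1], otherwise 0) is an assumption borrowed from common binarized-network practice and may differ from Equation (5); the function names are hypothetical.

```python
import math


def tanh_grad(x):
    """Derivative of tanh at x."""
    return 1.0 - math.tanh(x) ** 2


def sign(x):
    """Deterministic binarization onto +1 / -1."""
    return 1.0 if x >= 0 else -1.0


def ste_grad(x):
    """Assumed straight-through estimator: pass the gradient when |x| <= 1."""
    return 1.0 if abs(x) <= 1.0 else 0.0


for x in (0.5, 3.0, 10.0):
    # Columns: input, tanh output, tanh gradient, sign output, STE gradient.
    print(x, math.tanh(x), tanh_grad(x), sign(x), ste_grad(x))
```

At a small input, tanh has a useful gradient but its output is far from a binary value; at a large input its output is nearly binary but the gradient has vanished. The sign function with a straight-through estimator gives an exact binary output together with a non-zero gradient for small inputs.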

5.4.2 Comparison among the GNN-based methods

We compare our model with other GNN-based methods (GCN, GAT and AS-GCN). All baseline methods produce real-valued embedding vectors, each dimension of which is encoded by at least 32 bits. In contrast, each dimension of the embedding vectors learned by our model is encoded by only 1 bit. As a result, a real-valued 16-dimensional vector requires at least 512 bits while a binary vector of the same dimension requires only 16 bits. Figure 4 shows the performance of the models as the bit width of a single embedding vector varies.

Figure 4: Classification results on the three citation network datasets among the GNN-based methods with varied bit width for the embedding vector
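The storage comparison can be made concrete with a small sketch. It assumes 32-bit floats for the real-valued embedding and one bit per dimension for the binary embedding, and uses a 64-dimensional vector as in the inference experiments; the variable names and toy values are illustrative only.

```python
import struct

dim = 64

# Toy embeddings (illustrative values only).
real_embedding = [0.1 * i for i in range(dim)]
binary_embedding = [1 if i % 2 == 0 else -1 for i in range(dim)]

# Real-valued vector stored as 32-bit floats.
real_bytes = struct.pack(f"{dim}f", *real_embedding)

# Binary vector packed one bit per dimension (bit set when the entry is +1).
packed = 0
for i, v in enumerate(binary_embedding):
    if v > 0:
        packed |= 1 << i
binary_bytes = packed.to_bytes(dim // 8, "little")

real_bits = len(real_bytes) * 8
binary_bits = len(binary_bytes) * 8
```

Under these assumptions the binary code is 32 times smaller than the real-valued vector of the same dimension.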

Our model significantly outperforms all the baseline methods with low bit width. When getting more space for the learned representations, our model can still achieve competitive classification results compared with the state-of-the-art graph neural network-based methods. In conclusion, the performance gap between our model and baselines with large bit-width representations is acceptably small while our model’s performance is notably better with the low bit-width representations.

5.5 Comparison of Time and Space Efficiency

Cora Time(s)
Accuracy 84.0% 87.3% 77.7% 84.1%
Citeseer Time(s)
Accuracy 72.1% 78.9% 63.7% 77.2%
Pubmed Time(s)
Accuracy 78.2% 89.0% 75.7% 82.0%
Table 4: Comparison of performance, inference time and memory space required for the parameters between the real-valued and BGN-based models.

In this section, we report the inference time and space efficiency of our model. Inference is the process that produces the classification result once the model has been trained. The acceleration is brought by the xnor and popcount operations with only a small sacrifice in classification performance. In this experiment, we train the binary parameters and activations of our model, then during inference replace the dot product between binarized matrices by xnor and popcount, and replace the dot product between a binary matrix and a real-valued matrix by masked summation.

Table 4 reports the experiment results. Our model under the binarized framework is more than one order of magnitude faster than the baseline methods GAT and AS-GCN with regard to the inference time. The proposed model can be up to faster and save up to space compared with the baseline methods.

5.6 Analysis of Binarization

In this section, we examine the effect of the estimator and the binarization level on space, time and performance. We compare the space, inference time and performance of BGN-GAT and GAT on the Cora dataset. We fix the dimension of the embedding vectors to 64 for both methods and change the setting of BGN to show the space and time savings compared with the baseline GAT.

Method Estimator Param space Vec space Speed up Accuracy
GAT N/A 84.0%
STE 80.5%
Reinforce 80.3%
STE 81.2%
Reinforce 81.3%
STE 77.2%
Reinforce 77.5%
STE 77.7%
Reinforce 76.9%
Table 5: Trade-off between time/space efficiency and classification accuracy of proposed BGN w.r.t the level and setting of binarization.

The results are shown in Table 5, where , , and mean that the BGN has weights binarized; embedding vectors binarized; weights and embedding vectors binarized; and weights, embedding vectors and attention coefficients binarized based on the graph attention mechanism, respectively. We can conclude from Table 5 that (1) when the weights, activations and attention coefficients are all binarized, BGN-GAT saves the most space for parameters and output vectors while maintaining acceptable classification accuracy; (2) the straight-through estimator and the REINFORCE estimator have similar accuracy on the node classification task, so we choose the STE for our model in the above experiments because of its simplicity and determinism; and (3) compared with the original GAT, BGN-GAT can save space for model parameters, space for activations and achieve speed up.

5.7 Case Study

In this section, we investigate how the binarized graph neural network improves the time efficiency of GNN-based applications such as GMN. Because GMN needs to compute the pair-wise dot product between node and graph embedding vectors, the time consumption is extremely high when the number of nodes in each graph goes up. With binary representations, however, we can apply xnor between binary vectors to replace the dot product, which alleviates the time complexity problem significantly. The following experiment results present the performance and time complexity of GMN with binary node and graph representations compared with the original version. The graph similarity is then used for the graph matching task.

Experiment Setup

We follow the experiment setting of Li et al. (2019) to test the performance of the binarized GMN. The training data is generated by sampling binomial graphs with nodes and edge probability Erdős and Rényi (1960). Then the positive example is generated by randomly substituting edges from with new edges, and the negative example is generated by substituting edges from , where . In the experiment, we set , and . We also use the Hamming similarity between vectors as the loss function to train the model, since it is more suitable for binary-valued vectors. The model needs to predict a higher similarity score for the positive pair than for the negative pair. The evaluation metrics remain the same: (1) pair AUC, the area under the ROC curve for classifying pairs of graphs as similar or not on a fixed set of 1000 pairs, and (2) triplet accuracy, the accuracy of correctly assigning a higher similarity to the positive pair in a triplet than to the negative pair on a fixed set of 1000 triplets.
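The two evaluation metrics can be sketched on toy data as follows. This is an illustrative sketch only: it assumes binary codes with entries +1 and -1, defines Hamming similarity as the fraction of agreeing positions, and computes AUC by direct pairwise comparison of scores. All function names and data are hypothetical and are not taken from the GMN code.

```python
def hamming_similarity(a, b):
    """Fraction of positions where two +1/-1 vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def triplet_accuracy(triplets):
    """Share of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1 for anchor, pos, neg in triplets
        if hamming_similarity(anchor, pos) > hamming_similarity(anchor, neg)
    )
    return correct / len(triplets)


def pair_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy graph-level codes (illustrative values only).
triplets = [
    ([1, 1, -1, -1], [1, 1, -1, 1], [-1, 1, 1, 1]),   # ranked correctly
    ([1, -1, 1, -1], [1, -1, -1, 1], [1, -1, 1, 1]),  # ranked incorrectly
]
acc = triplet_accuracy(triplets)

# Similarity scores for four pairs; label 1 marks a similar pair.
auc = pair_auc([0.75, 0.5, 0.25, 0.75], [1, 1, 0, 0])
```

In the experiments both metrics are evaluated on fixed sets of 1000 pairs and 1000 triplets; the toy sets here are kept tiny so the values can be verified by hand.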

Inference time and Graph Matching Performance

We report the graph matching accuracy and inference time of the binarized and original GMN with regards to the number of nodes in each graph. The default setting in GMN is 20 nodes per graph, which is quite small for real-world networks. We set the number of nodes in one graph from 20 to 160 and keep other settings the same as described above to evaluate the performance and inference time. The dimensions of node and graph representations are set to 32 and 64 respectively.

Figure 5: The performance of graph matching and inference time for GMN and BGN-GMN w.r.t the number of nodes per graph

As shown in Figure 5, the inference of BGN-GMN is significantly faster. This is because the similarity computation (pair-wise dot product) between the node representations of two graphs accounts for most of the time complexity of GMN. Under the same dimension of node and graph embedding vectors, BGN-GMN is up to faster than the baseline model in terms of inference time, thanks to the replacement of the dot product by fast operations such as xnor and popcount between binary vectors.

In terms of graph matching task, the original version of GMN has better performance when the number of nodes in each graph is small. However, when the number of nodes gets larger, the pair AUC and triplet accuracy will both decay. When the number of nodes is more than , the real-valued representations cannot tell the similarity difference between the graphs. Hence, the model is not able to learn the different similarity scores for positive and negative pairs of graphs with the hamming similarity metric. However, with the help of binarization and balance function, the binary representations still hold an acceptable and more robust performance for the graph matching task. This is due to the fact that the binarized model produces true binary representations for the calculation of hamming loss and is designed for the graph matching task specifically on hamming space.

Parameter Sensitivity Analysis

We compare the performance of the binarized and original versions of GMN to show the effect of the dimension of the node and graph embedding vectors. We fix the number of nodes in each graph for this comparison. We change the dimension of the graph embeddings produced by the two models so that they produce embedding vectors of the same bit width, and keep the other settings the same to compare the performance of the two models.

(a) Performance w.r.t bit width of Graph Representation
(b) Performance w.r.t bit width of Node Representation
Figure 6: The performance comparison of graph matching task between original version of GMN and the BGN-GMN with (a) graph representations binarized and (b) node representations binarized

The result is shown in Figure 6(a). We find that the binary graph representations tend to perform better at low bit widths and reach similar accuracy as the bit width of the representations grows. The binary representations are also more robust than the baseline model as the embedding dimension varies.

The node representations’ binarization is more important than the graph representations’ because the dot product operation is mainly conducted between the node representations which costs plenty of time. The performances of GMN and BGN-GMN are compared under different bit-width for the node embedding vectors by varying the dimensions.

As shown in Figure 6(b), the result for the pair-wise AUC is similar between the binary and the real-valued node embedding vectors, but BGN-GMN holds a better performance with low bit-width representations. As for the triplet graph accuracy, the binary embedding vector achieves better performance with short code length and similar accuracy as real-valued node embedding with long code length. These results indicate that the binary representations are much better for the comparison between two graphs under low bit-width circumstances. In line with the result of the binary graph embedding vectors, the binary node embedding vectors also have more robust performance compared with the real-valued node representations.

6 Conclusion

We present a model focused on the challenging problem of learning binary representations for network embedding using a compact neural network structure. We propose a novel binarized graph embedding method, namely BGN, that has binarized parameters and enables GNNs to learn discrete embeddings. The binarized neural network reduces the memory and time cost of GNNs and thereby increases their scalability. BGN can be naturally integrated into other GNN models, such as the graph matching network, to improve their inference time and space consumption. Extensive experiments also illustrate that BGN can increase time efficiency while maintaining competitive accuracy.


  • Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang (2019) SimGNN: A neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, pp. 384–392. Cited by: §1, §1, §2.
  • Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §4.2.2.
  • M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. NIPS. Cited by: §2, §4.2.1, §4.4.1.
  • P. Cui, X. Wang, J. Pei, and W. Zhu (2019) A survey on network embedding. IEEE Trans. Knowl. Data Eng. 31 (5), pp. 833–852. Cited by: §1.
  • P. Erdős and A. Rényi (1960) On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5 (1), pp. 17–60. Cited by: §5.7.
  • R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin (2008) LIBLINEAR: A library for large linear classification. JMLR 9, pp. 1871–1874. Cited by: §5.3.
  • A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In ACM SIGKDD, pp. 855–864. Cited by: §1.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017a) Inductive representation learning on large graphs. In NIPS, pp. 1024–1034. Cited by: §1, §1.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017b) Representation learning on graphs: methods and applications. IEEE Data Eng. Bull. 40 (3), pp. 52–74. Cited by: §1.
  • W. Huang, T. Zhang, Y. Rong, and J. Huang (2018) Adaptive sampling towards fast graph representation learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pp. 4563–4572. Cited by: §1, §1, §4.5, §5.2.
  • I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In NIPS, pp. 4107–4115. Cited by: §1.
  • S. M. Kazemi and D. Poole (2018) SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pp. 4289–4300. Cited by: §1.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1, §1, §5.3.
  • Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli (2019) Graph matching networks for learning the similarity of graph structured objects. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 3835–3845. Cited by: §1, §1, §1, §2, §5.7.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.
  • D. Lian, K. Zheng, V. W. Zheng, Y. Ge, L. Cao, I. W. Tsang, and X. Xie (2018) High-order proximity preserving information network hashing. In ACM SIGKDD, pp. 1744–1753. Cited by: §1, §1, §1, §2, §2, §5.2.
  • H. Liu, R. Wang, S. Shan, and X. Chen (2016) Deep supervised hashing for fast image retrieval. In CVPR, pp. 2064–2072. Cited by: §2.
  • W. Liu, C. Mu, S. Kumar, and S. Chang (2014) Discrete graph hashing. In NIPS, pp. 3419–3427. Cited by: §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119. Cited by: §2.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In ACM SIGKDD, pp. 701–710. Cited by: §1.
  • J. Qin, Y. Wang, C. Xiao, W. Wang, X. Lin, and Y. Ishikawa (2018) GPH: similarity search in hamming space. In IEEE ICDE, pp. 29–40. Cited by: §1.
  • J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang (2018) Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In ACM WSDM, pp. 459–467. Cited by: §1.
  • M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §2, §4.2.1.
  • R. Salakhutdinov and G. E. Hinton (2009) Semantic hashing. Int. J. Approx. Reasoning 50 (7), pp. 969–978. Cited by: §2.
  • P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §5.1.
  • F. Shen, C. Shen, W. Liu, and H. T. Shen (2015) Supervised discrete hashing. In CVPR, pp. 37–45. Cited by: §2.
  • X. Shen, S. Pan, W. Liu, Y. Ong, and Q. Sun (2018) Discrete network embedding. In IJCAI, pp. 3549–3555. Cited by: §1, §1, §2, §2, §5.2.
  • J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) LINE: large-scale information network embedding. In WWW, pp. 1067–1077. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 5998–6008. Cited by: §4.1.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §1, §4.1, §4.1, §4.1, §5.2, §5.2, §5.3.
  • C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, and C. Zhang (2019) Attributed graph clustering: A deep attentional embedding approach. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp. 3670–3676. Cited by: §1.
  • J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen (2018) A survey on learning to hash. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 769–790. Cited by: §2.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, pp. 229–256. Cited by: §4.2.2.
  • H. Yang, S. Pan, P. Zhang, L. Chen, D. Lian, and C. Zhang (2018) Binarized attributed network embedding. In IEEE ICDM, pp. 1476–1481. Cited by: §1, §2, §2, §5.2.
  • M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pp. 5171–5181. Cited by: §1.
  • X. Zhang, H. Liu, Q. Li, and X. Wu (2019) Attributed graph clustering via adaptive graph convolution. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp. 4327–4333. Cited by: §1.
  • S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou (2016) DoReFa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160. Cited by: §2.