1 Introduction
Graph analysis provides powerful insights into how to unlock the value graphs hold. As a result, techniques for analyzing graphs are an increasingly popular topic of study in both academia and industry. To effectively and efficiently support important analytic tasks on graph data, such as node/graph classification, node clustering, community detection, node recommendation, link prediction and graph visualization, a variety of graph embedding techniques have been developed (see Hamilton et al. (2017b); Cui et al. (2019) for comprehensive surveys). Graph data is mapped into low-dimensional vectors such that the proximity relationship among graph nodes (i.e., objects) is preserved and off-the-shelf machine learning methods, which are designed to handle vector representations, can be applied immediately.
The existing graph embedding techniques can be roughly classified into three broad categories: (1) random walk based embedding (e.g., DeepWalk Perozzi et al. (2014) and Node2vec Grover and Leskovec (2016)); (2) node similarity based embedding (e.g., LINE Tang et al. (2015) and NetMF Qiu et al. (2018)); and (3) graph neural network (GNN) based embedding (e.g., GCN Kipf and Welling (2017), GraphSage Hamilton et al. (2017a), GAT Veličković et al. (2017) and ASGCN Huang et al. (2018)). As reported by Leskovec et al. in their tutorial on graph embedding at WWW 2018 (http://snap.stanford.edu/proj/embeddingswww), the first two categories of embedding techniques are only able to learn a “shallow” representation of the graph nodes due to the simplicity of the models. It is shown in Kipf and Welling (2017); Hamilton et al. (2017a) that the neural network based embedding methods significantly outperform the state-of-the-art techniques in the first two categories for the node classification task. Therefore, exploring how to use neural networks to create a “deep” representation more efficiently is a promising direction in graph representation learning. However, most of the existing graph neural network models suffer from scalability issues due to the high time and space cost of the real-valued model.
Recently, there have been several studies on learning binary graph embeddings (e.g., Lian et al. (2018); Shen et al. (2018); Yang et al. (2018)), in which each node is represented by a binary vector (code) instead of a real-valued vector. It has been shown that binarized graph embedding can achieve much better time and space efficiency.
Time efficiency. It is well known that distance computation between binary vectors (i.e., Hamming distance) is much more efficient than between real-valued vectors (e.g., Euclidean distance). In addition to specifically tailored search algorithms (e.g., Qin et al. (2018)), the dot product between binary vectors can also benefit from hardware support (e.g., xnor and the built-in CPU instruction popcount).
As stressed in recent work from DeepMind Li et al. (2019), the pairwise dot product of vectors is used intensively by models for some specific tasks (e.g., graph similarity computation in Bai et al. (2019)). Thus, binary vectors have been used in their graph matching network (GMN) to speed up the computation.
Space efficiency. The binary embedding can represent a node in a compact way while preserving the structure information well. As shown in Lian et al. (2018), INHMF can achieve competitive graph node classification performance with 128 bits for each node compared to conventional embedding approaches (e.g., DeepWalk) with 128 real-valued dimensions per node. This is a great advantage when we face a large-scale graph, because the binarized embedding of a graph is more likely to fit in main memory.
Motivation and Challenges. The existing GNN-based methods have demonstrated outstanding performance in various tasks such as classification Hamilton et al. (2017a); Kipf and Welling (2017); Veličković et al. (2017); Huang et al. (2018), link prediction Zhang and Chen (2018); Kazemi and Poole (2018), graph similarity matching Bai et al. (2019); Li et al. (2019) and graph clustering Wang et al. (2019); Zhang et al. (2019). However, they may suffer from memory and speed limitations due to the use of real-valued vectors for node and graph representations and model parameters.
Given the outstanding embedding quality and various applications of the GNN-based approaches, and the space and time efficiency of binarized representations, one may wonder whether we can design a binarized GNN-based graph embedding approach that achieves a good trade-off between embedding quality and time/space efficiency.
We notice that the existing binarized graph embedding methods Lian et al. (2018); Shen et al. (2018) rely on the discretization of the matrix factorization following the node-similarity based approaches. They cannot be extended to binarize the GNN-based embedding due to the inherently different natures of the two categories of approaches.
To the best of our knowledge, the only attempt at binarizing a GNN is from DeepMind in their recent work Li et al. (2019). Their binarization method converts each learned real-valued vector into a “nearly” binary vector by applying the well-known tanh function, so that the Hamming distance can be approximated during binarization and optimization. However, the output of tanh is not an exact binary value and thus cannot be accelerated by binary logic operations (e.g., xnor and popcount). As an alternative, one may consider a Binarized Neural Network (BNN) (e.g., Hubara et al. (2016)) for graph embedding so that the representation is naturally binarized. However, BNNs are not designed for graph data and, to the best of our knowledge, there is no existing graph embedding work based on BNNs.
These issues motivate us to develop a new binarized graph embedding technique which can be integrated into existing GNN-based models to binarize the parameters and produce high-quality binarized graph embeddings. The key challenge is how to generate effective compact embedding vectors with binary network parameters in an efficient way. To address the challenge, we design a binarized graph neural network framework to learn the binary parameters and representations efficiently and effectively.
Contributions. Our principal contributions are summarized as follows:

To the best of our knowledge, this is the first study on binarized graph neural network (GNN) with binary parameters to generate binary graph representations. The proposed method, namely BGN, can be seamlessly integrated into the existing GNNs.

An end-to-end binarized graph neural network framework is proposed with binary weights and activations. This binarized framework immediately reduces the memory consumption of the network; the bitwise operations between binary vectors can substantially reduce the inference time of the model; and the gradient estimator enables our model to backpropagate effectively through discrete parameters and activations.

Extensive experiments on multiple benchmark networks are conducted for the node classification task. The results demonstrate that our proposed method outperforms existing binarized embedding methods by a large margin. Compared to the real-valued GNNs, our BGN model can achieve nearly state-of-the-art performance while consuming far fewer computation resources (in terms of parameter and embedding memory space and inference time).

Binarization approaches are applied to the GNN-based application GMN to show that, with our BGN techniques, the GMN model can dramatically reduce its time and space cost while keeping competitive performance.

Experiments further show that our proposed BGN technique allows users to achieve a flexible trade-off between space/time cost and embedding quality by tuning the level and setting of binarization on the parameters and activations.
2 Related Works
Graph Embedding.
A key problem in machine learning on graphs is finding a way to incorporate information about the structure of the graph into the desired machine learning model. Graph embedding is one of the most promising approaches because it maps nodes into a low-dimensional space such that the structure of the graph is well preserved. Once this is accomplished, an existing machine learning approach (e.g., k-means clustering) can be used to analyze the graph in the embedded low-dimensional space. Loosely following the seminal graph embedding approach, DeepWalk, three broad categories of embedding methods have appeared in the literature: (1) node similarity based embedding methods (e.g., LINE, NetMF), which rely on the proximity of the nodes w.r.t. various similarity metrics and use matrix factorization techniques to learn the embedding of the nodes; (2) random walk based embedding methods (e.g., DeepWalk and node2vec), which encode the nodes by applying the Skip-Gram technique Mikolov et al. (2013) on the random walks; and (3) graph neural network (GNN) based embedding methods (e.g., GCN, GraphSage and GIN), which apply neural network techniques on graphs to learn the representations of the nodes.
Most of the existing graph embedding studies use real-valued vectors to encode the graph nodes following the above three computing paradigms. Recently, three unsupervised approaches Lian et al. (2018); Shen et al. (2018); Yang et al. (2018) have been proposed to learn binary embeddings of graphs following the node-similarity based embedding methods. In particular, INHMF Lian et al. (2018) and DNE Shen et al. (2018) were independently developed for binarized graph embedding based on the discretization of the matrix factorization on proximity graphs. BANE, proposed in Yang et al. (2018), is a natural extension of DNE that considers both structure and attribute similarities on attributed graphs.
Binary Hashing. Binary hashing has been widely used to learn binary vectors (codes) of objects in many applications. The most popular application is approximate nearest neighbor search in high-dimensional space, where binary hashing methods encode high-dimensional objects (e.g., documents and images) into binary codes while preserving the similarity distance in the original space. Many learning-to-hash approaches have been proposed, including unsupervised methods (e.g., Salakhutdinov and Hinton (2009); Liu et al. (2014)), supervised methods (e.g., Shen et al. (2015)), and deep learning based methods (e.g., Liu et al. (2016)). Please refer to Wang et al. (2018) for a comprehensive survey. Recently, three approaches Lian et al. (2018); Shen et al. (2018); Yang et al. (2018) have been proposed to learn binary embeddings of graphs following the node-similarity based embedding methods. To the best of our knowledge, there is no existing work on binarized graph embedding based on GNNs.
Binarized Neural Networks. Binarized neural networks were first proposed in BNN Courbariaux et al. (2016). The binarization technique proposed in Courbariaux et al. (2016) is used by most network binarization models. Among them, XNOR-Net Rastegari et al. (2016) and DoReFa-Net Zhou et al. (2016) are the most popular because of their strong performance on the image classification task.
XNOR-Net achieves high classification accuracy on the ImageNet dataset while offering faster convolutional operations and memory savings. DoReFa-Net replaces binarization by quantization, which allows the model to change the bit width for weights, activations and even gradient calculations during backpropagation.
However, these methods are all designed for computer vision tasks. Though they perform well on the image dataset, they cannot be adapted to the graph representation learning and graph analysis task directly.
Graph Neural Network Applications. There are several applications based on GNNs, such as the Graph Matching Network Li et al. (2019) and SimGNN Bai et al. (2019). These models utilize GNNs and use the similarity (distance) between graph embeddings to approximate the graph edit distance and graph similarity.
The Graph Matching Network (GMN) is a novel GNN-based framework proposed by DeepMind to compute the similarity score between input pairs of graphs. Separate MLPs first map the input nodes of the graphs into a vector space. Then the propagation layer aggregates the edge messages and the cross-graph matching vector using an MLP or GRU whose input is the concatenation of node representations and edge vectors. A matching function is applied to compute the attention coefficients based on the node information of the input pair of graphs. The matching function is based on a softmax over node vectors, which requires a vector space similarity such as Euclidean similarity, cosine similarity or dot product to be computed between all pairs of node representations. Computing these attention coefficients across two graphs has a cost of $\mathcal{O}(|V_1||V_2|d)$, where $|V_1|$ and $|V_2|$ indicate the number of vertices of input graphs 1 and 2 respectively, and $d$ is the dimension of the node representation. The match vector is concatenated with the message vector and the node representation, and the concatenation is fed into an MLP or a recurrent neural network core to produce the new node representations. Given the learned node representations of a graph, the aggregation module proposed in Li et al. (2015) is used to obtain the graph representation. A similarity score in vector space, such as Euclidean similarity, cosine similarity or approximate Hamming similarity, is then computed between graph representations to approximate the similarity between the input graphs.
3 Background and Preliminaries
Recent studies have revealed that graph neural networks can perform excellently on label classification tasks. The existing GNN-based graph embedding approaches share the same computing paradigm. GNNs take the graph nodes’ features and neighborhood information as the input. During training, the representations of nodes (real-valued vectors) at each layer are updated by the aggregators and nonlinear activation functions. The output representations are fed into the task-specific layer to calculate the loss of the model. Based on that, the model is optimized by the optimizer through backpropagation. The main differences among these GNN-based graph embedding approaches are the design of the aggregator, which combines the context representations, and the loss function designed for different graph analytic tasks.
These models have real-valued parameters and learn a real-valued representation for each node in an end-to-end manner for graph node classification. However, the real-valued parameters and representations are space-consuming to store and time-consuming to multiply, especially for large-scale graphs. To address these issues, in this paper we devise a novel binarized graph neural network, namely BGN, with binary parameters in the neural network to learn binary embedding representations for the node classification task.
The important notations used throughout the paper are summarized in Table 1.
Notation  Definition 

the graph dataset  
the set for nodes and edges in the graph.  
the feature information for node .  
the neighborhood nodes of node .  
denotes that the vector or matrix is binaryvalued.  
the hidden representation of node . 

the weight matrix in the neural network.  
the binarization function which is used to transform the realvalued vector or matrix into binaryvalued vector or matrix.  
the attention coefficient between node and node . 
4 Binarized Graph Neural Network
As illustrated in Figure 1, we introduce a new graph neural network with binarized weights and activations. Our model, BGN (Binarized Graph Neural Network), is based on the attention mechanism and can be easily adapted to other graph neural network frameworks. For a given graph, BGN takes the nodes and their contexts, including feature and neighborhood structure information, as input. The binarization function transforms the weights, activations and even attention coefficients into binarized vectors to reduce the time and space complexity, while the attention mechanism enables the nodes to attend over their neighborhoods’ features. We also apply a balance function to ensure that the numbers of $+1$ and $-1$ entries in the binarized vectors are almost equal. Furthermore, a gradient estimator is used to backpropagate gradients through the discretization.
The following subsections present the listed key components of our model:

Section 4.1 introduces the framework of our work.

Section 4.2 introduces the binarization of our model in detail, including the forward propagation and backpropagation.

Section 4.3 describes the optimization objective of our model.

Section 4.4 introduces the techniques we used to reduce the time and space complexity and improve the performance.

Section 4.5 introduces the adaptation of our model to other GNN frameworks.
4.1 Framework
Algorithm 1 illustrates the framework of our model. We follow the attention mechanism introduced in Vaswani et al. (2017); Veličković et al. (2017) to incorporate the importance of each node’s neighbors into the graph representation learning process. Given a graph $G = (V, E)$, where $V$ and $E$ denote the sets of graph nodes and edges respectively, we use the node features and the neighborhood information of the nodes as inputs. Our model first produces the binarized node representation for each node within the input graph. After that, the binarized node embeddings are fed into the output layer to compute the loss for a specific task such as node classification.
Attention Mechanism. Our proposed framework is based on the graph attention mechanism. The attention layer is utilized in our model to learn the importance of every node to other nodes. The key is to obtain the importance of one node’s features to other nodes, that is, the attention coefficients of the input graph; afterwards, the node’s features can attend to other nodes. Inspired by Veličković et al. (2017), we apply masked attention in the model to keep the structural information of the input graph: only the attention coefficients between a node and its neighborhood nodes are computed.
In order to obtain the attention coefficients, we use a shared binarized weight matrix to apply the linear transformation to each node. The softmax function is used to normalize the coefficients but, unlike the model proposed in Veličković et al. (2017), the LeakyReLU activation is not employed in our model; instead, the sign function is used to binarize the attention coefficients. With the following Equation (1), we obtain a binarized attention coefficient matrix whose elements are the attention coefficients between pairs of nodes (the matrix contains 0 entries because we only compute the attention coefficients between neighbors, so the matrix is sparse).
(1) 
where the binarization function for attention coefficients maps 0 to 0, positive values to $+1$ and negative values to $-1$.
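As a small illustration of the binarization described above, the following sketch applies a ternary sign function to a masked score matrix. It does not reproduce Equation (1): the function name `ternary_sign`, the toy scores and the adjacency mask are assumptions made only for this example.

```python
def ternary_sign(x):
    """Map 0 to 0, positive values to +1 and negative values to -1."""
    return (x > 0) - (x < 0)

# Hypothetical raw scores for a 3-node graph and its adjacency mask
# (1 = neighbors, 0 = not connected). Both are made up for illustration.
scores = [[0.7, -0.2, 0.4],
          [-0.2, 0.9, -0.5],
          [0.4, -0.5, 0.3]]
adjacency = [[1, 1, 0],
             [1, 1, 1],
             [0, 1, 1]]

# Masked attention: non-neighbor entries are zeroed out before binarization,
# so they stay 0 and the binarized matrix keeps the graph's sparsity pattern.
binary_attention = [
    [ternary_sign(s * m) for s, m in zip(score_row, mask_row)]
    for score_row, mask_row in zip(scores, adjacency)
]
```

Because 0 is mapped to 0 rather than to one of the two binary values, entries for non-neighbors remain distinguishable from genuine coefficients after binarization.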
Once the attention coefficient matrix is obtained, it is used to compute the output of the attention layer: the attention coefficients are multiplied with the linearly transformed node features. We employ the multi-head attention mechanism to stabilize the learning process. The binarization function, which serves as the activation function, is applied to every attention head to binarize the pre-activations, and the concatenation of the outputs of the K independent attention heads is the output of the attention layer. Therefore, the output node representation is as follows:
(2) 
where the concatenation operator joins the outputs of the K attention heads, and the result is the output binarized node representation.
After several attention layers, the node representation is fed into the last layer to calculate the loss for the specific task, which is classification in this paper. We introduce the learning objective in Section 4.3.
4.2 Binarization
In this section, we introduce how to obtain a graph neural network with binary parameters that can learn binary representations. Section 4.2.1 introduces the binarization function used to transform the real-valued parameters and pre-activations into binary space. Section 4.2.2 introduces the gradient estimators that enable the binarized model to be optimized by off-the-shelf optimizers such as Adam and SGD.
4.2.1 Forward Propagation
The binarization function is central to our model. A specific binarization function is chosen in the forward propagation to binarize the weights and the activations, so that the low-bit parameters and activations reduce the time and space complexity. In our case, various binarization functions would work, and the most straightforward example is the sign function. As mentioned in Courbariaux et al. (2016) and Rastegari et al. (2016), deterministic and stochastic binarization based on the sign function are widely applied to the continuous pre-activations as well as the real-valued weights to obtain binarized activations and weights.
$$x^{b} = \mathrm{sign}(x) = \begin{cases} +1 & \text{if } x \geq 0, \\ -1 & \text{otherwise.} \end{cases}$$ (3)
The above equation is the deterministic binarization function, where $x$ is the real-valued variable. The stochastic binarization is the sign function with probability:
$$x^{b} = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p, \end{cases}$$ (4)
where $\sigma(\cdot)$ denotes the sigmoid function. The stochastic binarization is more appealing but requires the computer to generate random bits, while the deterministic binarization is easier to compute. The deterministic binarization function (i.e., Equation (3)) is applied for the binarization of weights and activations because the deterministic sign function provides more stable and reproducible results. Please note that we use a variant of the deterministic sign function, which maps 0 to 0, to binarize the attention coefficients.
4.2.2 Backpropagation
In this part, we describe how to backpropagate the gradients through the binarization function. We adapt the gradient estimator into our model for better optimization.
Propagating gradients through the binarization function. The binarization function has zero derivative almost everywhere, which leads to zero gradients of the loss function w.r.t. the pre-activations and weights. The trainable variables cannot be updated with zero gradients. Therefore, the model cannot be trained by simple backpropagation, and an estimate of the gradients has to be obtained for optimization. Previous studies have investigated how to propagate gradients through stochastic discrete functions. Below we investigate two popular gradient estimators for the binarization function: the straight-through estimator and the REINFORCE estimator Williams (1992).
Straight-through estimator. The straight-through estimator is a simple gradient estimator. It estimates the derivative of the binarization function w.r.t. the pre-activation or weight as $\mathbf{1}$ (a vector or matrix whose elements are all 1). Let $h^{b}$ denote the binarized representation and $a$ denote the pre-activation before binarization. The straight-through estimate of the gradient of the loss $\mathcal{L}$ w.r.t. the pre-activation is thus:
$$\frac{\partial \mathcal{L}}{\partial a} = \frac{\partial \mathcal{L}}{\partial h^{b}} \odot \mathbf{1} = \frac{\partial \mathcal{L}}{\partial h^{b}}$$ (5)
This gradient is then backpropagated to obtain the gradients of the quantities (i.e., pre-activations or weights) that influence $a$.
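The forward binarization and the straight-through backward pass can be sketched in plain Python on a single vector. This is a minimal illustration rather than the training code of the model: the helper names, the squared-error toy loss, the targets and the learning rate are all assumptions introduced for the example.

```python
def sign(x):
    """Deterministic binarization: +1 if x >= 0, otherwise -1."""
    return 1.0 if x >= 0 else -1.0

def forward(pre_activations):
    """Binarize a vector of real-valued pre-activations."""
    return [sign(a) for a in pre_activations]

def straight_through_backward(grad_wrt_binary):
    """Treat d sign(a) / da as 1, so the incoming gradient passes through unchanged."""
    return list(grad_wrt_binary)

# Toy loss L = sum_i (h_i - t_i)^2 with hypothetical targets t.
pre_activations = [0.3, -1.2, 0.0, 2.5]
targets = [1.0, 1.0, -1.0, 1.0]

binary = forward(pre_activations)
grad_wrt_binary = [2.0 * (h - t) for h, t in zip(binary, targets)]
grad_wrt_pre = straight_through_backward(grad_wrt_binary)

# One plain SGD step on the real-valued pre-activations using the estimate.
learning_rate = 0.1
updated = [a - learning_rate * g for a, g in zip(pre_activations, grad_wrt_pre)]
```

The real-valued quantities are the ones that get updated; the binary values are recomputed from them in the next forward pass, which is why an off-the-shelf optimizer can be used unchanged.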
REINFORCE estimator. The REINFORCE estimator is used in Bengio et al. (2013) to estimate the expectation of the gradient of the loss with regard to the pre-activation vector or weight. When the binarization function is stochastic with the probability given by the sigmoid, it has been proven that:
(6) 
where the sigmoid function gives the binarization probability and the equation involves a constant vector. To minimize the variance of the estimate, this constant vector can be chosen as:
(7) 
The REINFORCE estimator can work directly on the weights and pre-activations without actual computation of the gradient. The estimate is obtained by monitoring the numerator and denominator during the training process.
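To make the idea concrete, the sketch below runs a score-function (REINFORCE-style) estimate for a single stochastic binary unit and compares it with the exact gradient. It is an illustration under stated assumptions, not the estimator exactly as written in Equations (6) and (7): the toy loss, the pre-activation value, the helper names, and the use of the indicator $(h+1)/2$ for a unit with values in $\{-1,+1\}$ are all choices made for this example.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss(h):
    """Hypothetical loss of a single binary unit h in {-1, +1}."""
    return (h - 2.0) ** 2

def sample_binary(a, rng):
    """Stochastic binarization: +1 with probability sigmoid(a), else -1."""
    return 1.0 if rng.random() < sigmoid(a) else -1.0

def reinforce_sample(a, h, baseline=0.0):
    """One-sample score-function estimate of d E[loss] / d a."""
    indicator = (h + 1.0) / 2.0          # 1 if h == +1, 0 if h == -1
    return (indicator - sigmoid(a)) * (loss(h) - baseline)

a = 0.3
p = sigmoid(a)
# Exact gradient of E[loss] = p * loss(+1) + (1 - p) * loss(-1) w.r.t. a.
exact_gradient = p * (1.0 - p) * (loss(1.0) - loss(-1.0))

rng = random.Random(0)
samples = [sample_binary(a, rng) for _ in range(100_000)]
estimate = sum(reinforce_sample(a, h) for h in samples) / len(samples)

# Constant baseline of the form E[(ind - p)^2 * loss] / E[(ind - p)^2],
# computed exactly here because the toy unit has only two outcomes.
numerator = p * (1 - p) ** 2 * loss(1.0) + (1 - p) * p ** 2 * loss(-1.0)
denominator = p * (1 - p) ** 2 + (1 - p) * p ** 2
baseline = numerator / denominator
with_baseline = [reinforce_sample(a, h, baseline) for h in samples[:5]]
```

In this one-unit toy case the baseline removes the variance entirely, so every entry of `with_baseline` equals `exact_gradient`; in a real network the numerator and denominator would instead be tracked as running averages during training.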
Compared with the straight-through estimator, the REINFORCE estimator is more advanced and performs better in many applications. However, we observe that in our setting its performance is not superior to that of the straight-through estimator. In addition, the straight-through estimator obtains the gradient faster than the REINFORCE estimator due to its simplicity. The performance comparison between these two gradient estimators is included in Section 5. In practice, we choose the straight-through estimator for our model in the experiments.
4.3 Optimization Objectives
Existing GNN-based graph embedding approaches provide an end-to-end model that focuses on the node classification task. Therefore, our model is also learned for the node classification task. Below, we introduce the objective of BGN and the learning process that optimizes the parameters.
For node classification, we feed the binarized embedding into the output layer to predict the class label of the node. The predicted probability of a label is written as:
(8) 
where denotes the number of labels for each node. After obtaining the classification result in Equation (8), we calculate the cross-entropy as the loss for the node classification task.
(9) 
where is the set of nodes that have label information and are used in the training process, and is the multi-hot encoding of the ground-truth classification labels.
The gradients are backpropagated via the estimator and used by the off-the-shelf optimizer to update the parameters during the training process.
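The objective can be illustrated with a generic softmax output followed by a cross-entropy summed over the labeled nodes only. This is a hedged sketch and not a transcription of Equations (8) and (9): the output-layer scores, the single-label (one-hot) encoding and the helper names are assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into class probabilities."""
    top = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, label_encoding):
    """Cross-entropy between predicted probabilities and a 0/1 label encoding."""
    return -sum(y * math.log(p) for p, y in zip(probabilities, label_encoding))

# Hypothetical output-layer scores for three nodes and three classes.
scores = {0: [2.0, 0.5, -1.0], 1: [0.1, 0.2, 0.3], 2: [-1.0, 3.0, 0.0]}
# Only nodes 0 and 2 carry label information; node 1 is unlabeled.
labels = {0: [1, 0, 0], 2: [0, 1, 0]}

# The loss is accumulated over the labeled nodes only.
loss = sum(cross_entropy(softmax(scores[v]), labels[v]) for v in labels)
```

Unlabeled nodes still take part in the forward pass through their neighbors, but they contribute no term to the loss.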
4.4 Techniques to Improve the Model
Several techniques are applied to the binarized graph neural network model to reduce the time and space complexity and improve the performance. The logic operation xnor between binary values, the built-in CPU instruction popcount and the masked summation are used to replace the traditional arithmetic dot product in order to reduce the time complexity. Figure 2 is a toy example that illustrates the differences between these operations. The balance function is used to balance the numbers of $+1$ and $-1$ entries in the embedding vectors, which can raise the performance of the GMN. In addition, the binary parameters of the neural network and the binary node representations directly reduce the space complexity.
4.4.1 xnor and popcount
The logic operation xnor and the built-in CPU instruction popcount are applied to binary matrices to replace the dot product between them.
Input A  Input B  Output

+1  +1  +1
+1  -1  -1
-1  +1  -1
-1  -1  +1
As shown in Table 2, xnor produces a binary value from inputs of $+1$ and $-1$. The popcount instruction is then employed to count the number of bits that are set to $+1$. The xnor can be more than one order of magnitude faster than the dot product, which can dramatically reduce the time cost. As mentioned in Courbariaux et al. (2016), a 32-bit floating point multiplier costs about 200 Xilinx FPGA slices, whereas a 1-bit xnor gate costs only 1 slice.
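The replacement of the dot product by xnor and popcount can be sketched as follows. The sketch assumes a bit encoding in which $+1$ is stored as bit 1 and $-1$ as bit 0, and it uses Python integers in place of machine words; the function names `pack` and `xnor_popcount_dot` are hypothetical.

```python
def pack(vector):
    """Pack a {-1, +1} vector into an integer, one bit per element (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, value in enumerate(vector):
        if value == 1:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(bits_a, bits_b, n):
    """Dot product of two packed {-1, +1} vectors of length n."""
    mask = (1 << n) - 1
    xnor = ~(bits_a ^ bits_b) & mask      # bit is 1 where the two elements agree
    matches = bin(xnor).count("1")        # popcount of the xnor result
    return 2 * matches - n                # agreements minus disagreements

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))            # ordinary dot product
fast = xnor_popcount_dot(pack(a), pack(b), len(a))      # bitwise version
```

Each agreeing position contributes $+1$ to the dot product and each disagreeing position contributes $-1$, which is why the result is twice the popcount minus the vector length.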
4.4.2 Masked Summation
Masked summation is used to replace the dot product between a binary matrix and a real-valued matrix. The binary matrix is transformed into a mask matrix of “True” and “False” values. During the multiplication, the real-valued vector is masked by the corresponding mask vector, producing a positive masked vector that keeps only the elements at positions where the mask is “True” and a negative masked vector that keeps only the elements at positions where the mask is “False”. The model sums the positive and negative masked vectors separately. The difference between these two sums is the result of the dot product between the given matrices.
The masked summation can reduce the time complexity of the dot product of two matrices. Usually, the time complexity of naive dot product between two real-valued matrices and is , while the time complexity of masked summation between binary matrix and real-valued matrix is . Theoretically and also in practice, the masked summation can significantly reduce the time cost of our proposed binarized graph neural network.
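The masked summation described above can be sketched in a few lines. The example is illustrative only: the function names and the toy matrices are assumptions, and plain Python lists stand in for whatever tensor library an implementation would use.

```python
def masked_dot(binary_vector, real_vector):
    """Dot product of a {-1, +1} vector with a real-valued vector without multiplications."""
    mask = [b == 1 for b in binary_vector]                       # True where the entry is +1
    positive_sum = sum(x for x, m in zip(real_vector, mask) if m)
    negative_sum = sum(x for x, m in zip(real_vector, mask) if not m)
    return positive_sum - negative_sum

def masked_matmul(binary_matrix, real_matrix):
    """Multiply a binary matrix (rows of {-1, +1}) by a real-valued matrix."""
    columns = list(zip(*real_matrix))
    return [[masked_dot(row, col) for col in columns] for row in binary_matrix]

binary_matrix = [[1, -1, 1],
                 [-1, -1, 1]]
real_matrix = [[0.5, 2.0],
               [1.5, -1.0],
               [-2.0, 4.0]]

result = masked_matmul(binary_matrix, real_matrix)
```

Only additions and one subtraction per output element are needed, because multiplying by $+1$ or $-1$ just selects which of the two sums an element belongs to.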
4.4.3 Balance Function
The distribution of $+1$ and $-1$ is sometimes unbalanced in the representation vectors. For example, if most elements of the pre-activations are positive, the output graph representation vector of the binarization function will consist mainly of $+1$. Then the dot product of two such vectors will be approximately the dimension of the vectors. This unwanted situation should be avoided because it dramatically lowers the effectiveness of the proposed model, especially when BGN is applied to GMN, which requires a large number of dot products between representations. As a result, we apply the following balance function to the pre-activations before binarization in order to balance the distribution of positive and negative elements of the pre-activations:
$$a' = a - \bar{a}$$ (10)
where $\bar{a}$ is the vector whose elements are all equal to the mean value of the pre-activation vector $a$. The balance function encourages the pre-activation vectors to contain roughly half positive and half negative elements, which leads to a balanced distribution of $+1$ and $-1$ after binarization.
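A minimal sketch of mean-centering before binarization, assuming the balance function simply subtracts the mean of the pre-activation vector; the toy values and helper names are made up for the illustration.

```python
def sign(x):
    """Deterministic binarization: +1 if x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1

def balance(pre_activations):
    """Subtract the mean so that the values are centered around zero."""
    mean = sum(pre_activations) / len(pre_activations)
    return [a - mean for a in pre_activations]

# Hypothetical pre-activations that are all positive.
pre_activations = [0.9, 1.4, 0.2, 2.1, 0.7, 1.1]

unbalanced = [sign(a) for a in pre_activations]            # every entry becomes +1
balanced = [sign(a) for a in balance(pre_activations)]     # mixed +1 / -1 entries
```

In this toy case the centered vector binarizes to three $+1$ and three $-1$ entries. Mean-centering does not guarantee an exact half-and-half split in general, since the mean and the median of a vector can differ.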
4.5 Adapted to Other GNN Based Models
The proposed binarized graph neural network is a general framework that can be adapted to other graph neural network based models to project the real-valued parameters and activations into binary space and thereby reduce the space and time cost. We introduce how we binarize the state-of-the-art GNN-based model ASGCN Huang et al. (2018) and the graph matching network.
4.5.1 Binarization of ASGCN
ASGCN is a general framework designed for fast representation learning based on graph neural networks such as GCN. Therefore, the binarization of ASGCN is similar to our proposed BGN. We use the deterministic binarization function to binarize the parameters and pre-activations of ASGCN, and the straight-through estimator is employed for backpropagation. The binarized model is denoted as BGNASGCN in our experiments.
4.5.2 Binarization of GMN
As mentioned above, the time cost of GMN comes mainly from the pairwise node similarity computation. We apply the deterministic binarization function (Equation (3)) to the pre-activations and transform the node and graph representations into binary codes such that xnor can be applied to replace the dot product. The straight-through estimator (Equation (5)) is used for backpropagation. Furthermore, we noticed that the distribution of $+1$ and $-1$ is usually not symmetric, which dramatically lowers the performance; hence, the balance function (Equation (10)) is applied to the graph representations.
5 Experiment
We conduct extensive experiments to evaluate the performance of our model for the node classification task on real-world network datasets. We thoroughly compare the time and space efficiency of the proposed model and the baseline models. A case study shows the effectiveness and efficiency that our framework brings to a GNN-based application, GMN.
5.1 Dataset
To facilitate the comparison between our model and the relevant baselines, we conduct the classification experiments on three well-known citation network datasets: Cora, Citeseer and Pubmed Sen et al. (2008). Each dataset contains bag-of-words representations of documents and citation links between the documents. The graph is constructed based on the citation links. In the classification task, we only use 20 labeled instances per class for training. The test data contains 1000 nodes as in GCN, GAT and ASGCN.
The details of the datasets are summarized in Table 3.
Dataset  #Nodes  #Edges  #Classes  #Labeled Nodes 

Cora  2708  5429  7  140 
Citeseer  3327  4732  6  120 
Pubmed  19717  44338  3  60 
5.2 Baseline Methods
The following GNNbased and binary embedding methods are compared as baselines:
GCN (Graph Convolutional Network) Kipf and Welling (2017) is a semi-supervised neural network method for node classification.
GAT (Graph Attention Network) Veličković et al. (2017) is a graph neural network model which first exploits the attention mechanism to solve the node classification task.
ASGCN (Adaptive Sampling over GCN) Huang et al. (2018) is a state-of-the-art method for the node classification task. ASGCN aims to increase the scalability of GCN using adaptive sampling. The experiments demonstrate that the application of BGN can further reduce the time and space cost of ASGCN.
GATbinary and ASGCNbinary are the models that directly apply sign function on the node representations learned by the original version of GAT and ASGCN. The naively binarized representations will be fed into the taskspecific layer to learn the classification result.
GATtanh and ASGCNtanh are the models that employ the binarization function tanh used by DeepMind’s work. tanh function is used to binarize the parameters and embedding vectors of GAT and ASGCN. We clip the value of the parameters and activations in both models to make sure that tanh can produce “exact” binary codes.
INHMF Lian et al. (2018) is a MFbased information network hashing algorithm that learns binary codes as node embedding which can preserve highorder proximity.
5.3 Experiment Setup
For the performance experiment, we evaluate the models with representations of the same bit width. For the inference efficiency experiment, the embedding dimension of our method and of all baseline methods is set to 64. During training, the whole graph is visible, but only a few nodes are labeled while most nodes have no label information. We put the information of all nodes into one training phase because it is needed to compute the graph attention coefficients.
For this classification task, we report the average accuracy of the evaluated GNN-based embedding approaches over ten independent runs, using the accuracy metric introduced in Kipf and Welling (2017); Veličković et al. (2017). Because INH-MF and BANE only produce binary embedding vectors and have no built-in classifier, we employ the one-vs-rest logistic regression implemented by LIBLINEAR Fan et al. (2008) to obtain their classification results, with 90% of the nodes labeled.
All experiments were conducted on a server running RHEL 7.5 with 2x 2.4GHz Intel Xeon E5-2680 v4 (14 cores) CPUs, 256GB of 2400MHz ECC DDR4 RAM and 2x NVIDIA Quadro P5000 16GB graphics cards (GPUs, 2560 cores).
5.4 Classification Results
Because our model produces the compact representations for vertices, we compare the performance between our model and other baselines with the same bit width.
5.4.1 Comparison Among Binary Embedding Methods
We compare the classification results of our model with those of the other binary-valued embedding methods.
As shown in Figure 3, BGN significantly outperforms all the other binary-valued embedding methods on all three datasets under different embedding dimensions. With the help of the graph neural network, our model makes better use of the graph-structured data and feature information and is trained specifically for the node classification task. Therefore, our model outperforms the MF-based binarized graph embedding models by a large margin. In comparison with the naively binarized GAT-binary and ASGCN-binary, our model takes the binary property of the parameters and vectors into account during training, and hence achieves better accuracy. As for GAT-tanh and ASGCN-tanh, the tanh function has zero gradient when its output is nearly -1 or +1, and has real-valued output when the gradient is not zero. This property makes tanh unsuitable for binarizing a neural network. When the input values are clipped to produce exact binary parameters and embeddings via tanh, the gradient becomes zero, which results in insufficient optimization and worse performance than BGN.
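The saturation argument can be checked numerically with a small sketch. It contrasts tanh, whose derivative is 1 - tanh(x)^2, with a sign function trained through a straight-through estimator. The clipped estimator used here (gradient passed where |x| <= 1) is the common form from the binarized neural network literature and is an assumption; it may differ in detail from Equation (5). The function names are illustrative only.

```python
import math


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def sign_forward(x):
    """Deterministic binarization to {-1, +1} (0 is mapped to +1 here)."""
    return 1.0 if x >= 0 else -1.0


def ste_grad(x, upstream=1.0):
    """Assumed straight-through estimator: pass the gradient where |x| <= 1, else 0."""
    return upstream if abs(x) <= 1.0 else 0.0


# tanh is either far from binary (small |x|) or has a vanishing gradient (large |x|),
# whereas sign gives an exact binary output while the STE still passes a gradient.
for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  tanh={math.tanh(x):.6f}  dtanh={tanh_grad(x):.2e}  "
          f"sign={sign_forward(x):+.0f}  ste={ste_grad(x):.0f}")
```

For small inputs tanh keeps a useful gradient but its output is far from ±1; for large inputs the output is nearly binary but the gradient vanishes, which is the trade-off described above.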
5.4.2 Comparison Among the GNN-based Methods
We compare our model with the other GNN-based methods (GCN, GAT and ASGCN). All baseline methods produce real-valued embedding vectors, each dimension of which is encoded by at least 32 bits. In contrast, each dimension of the embedding vectors learned by our model is encoded by only 1 bit. As a result, a real-valued 16-dimensional vector requires at least 512 bits (16 × 32), while a binary vector of the same dimension requires only 16 bits. Figure 4 shows the performance of the models as the bit width of a single embedding vector varies.
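The bit-width accounting behind this comparison can be written out as a short sketch. It assumes 32 bits per real-valued dimension and 1 bit per binary dimension, as stated above; the dimension 64 is the one used in the inference efficiency experiment, while the 256-bit budget is only an illustrative value.

```python
REAL_BITS = 32   # bits per dimension of a real-valued (single-precision) embedding
BINARY_BITS = 1  # bits per dimension of a binary embedding


def storage_bits(dim, bits_per_dim):
    """Number of bits needed to store one embedding vector."""
    return dim * bits_per_dim


def dims_within_budget(budget_bits, bits_per_dim):
    """Largest embedding dimension that fits into a fixed bit budget."""
    return budget_bits // bits_per_dim


dim = 64
real = storage_bits(dim, REAL_BITS)
binary = storage_bits(dim, BINARY_BITS)
print(real, binary, real // binary)

budget = 256  # illustrative bit budget for a single vector
print(dims_within_budget(budget, REAL_BITS), dims_within_budget(budget, BINARY_BITS))
```

Under the same bit budget, a binary code can therefore use 32 times as many dimensions as a real-valued vector, which is why the comparison is made at equal bit width rather than equal dimension.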
Our model significantly outperforms all the baseline methods at low bit width. When more space is available for the learned representations, our model still achieves competitive classification results compared with the state-of-the-art graph neural network-based methods. In conclusion, the performance gap between our model and the baselines is acceptably small for large bit-width representations, while our model performs notably better with low bit-width representations.
5.5 Comparison of Time and Space Efficiency
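| Dataset | | GAT | ASGCN | BGN-GAT | BGN-ASGCN |
|---|---|---|---|---|---|
| Cora | Time (s) | | | | |
| | Space (bit) | | | | |
| | Accuracy | 84.0% | 87.3% | 77.7% | 84.1% |
| Citeseer | Time (s) | | | | |
| | Space (bit) | | | | |
| | Accuracy | 72.1% | 78.9% | 63.7% | 77.2% |
| Pubmed | Time (s) | | | | |
| | Space (bit) | | | | |
| | Accuracy | 78.2% | 89.0% | 75.7% | 82.0% |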
In this section, we report the inference time and space efficiency of our model. Inference is the process that produces the classification result once the model has been trained. The acceleration comes from the xnor and popcount operations, at the cost of only a small sacrifice in classification performance. In this experiment, we train the binary parameters and activations of our model; during inference, we replace the dot product between binarized matrices with xnor and popcount, and the dot product between a binary matrix and a real-valued matrix with masked summation.
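One way to read the masked summation is sketched below, under the assumption that the binary operand is a {-1, +1} vector stored as a boolean mask (True for +1). The dot product with a real-valued vector then needs only additions: the sum over the +1 positions minus the sum over the -1 positions, which equals twice the masked sum minus the total sum. The function name `masked_sum_dot` is hypothetical, and the paper's actual kernel may be organized differently.

```python
def masked_sum_dot(mask, x):
    """Dot product of a {-1, +1} vector (mask[i] is True for +1) with a real vector x.

    sum_{+1} x_i - sum_{-1} x_i = 2 * sum_{+1} x_i - sum_i x_i, so no
    element-wise multiplication is required.
    """
    total = sum(x)
    plus = sum(v for keep, v in zip(mask, x) if keep)
    return 2.0 * plus - total


signs = [1, -1, -1, 1]
x = [0.5, -1.25, 2.0, 0.75]

reference = sum(s * v for s, v in zip(signs, x))   # ordinary dot product
masked = masked_sum_dot([s == 1 for s in signs], x)
print(reference, masked)
```

The total sum of `x` can be shared across all binary vectors that multiply the same real-valued vector, so each additional dot product costs only one masked sum.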
Table 4 reports the experiment results. Under the binarized framework, our model is more than one order of magnitude faster than the baseline methods GAT and ASGCN with regard to inference time. The proposed model can be up to faster and save up to space compared with the baseline methods.
5.6 Analysis of Binarization
In this section, we analyze the effect of the estimator and the binarization level on space, time and performance. We compare the space, inference time and performance of BGN-GAT and GAT on the Cora dataset. We fix the dimension of the embedding vector to 64 for both methods and change the setting of BGN to show the space and time savings compared with the baseline GAT.
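| Method | Estimator | Param space | Vec space | Speed up | Accuracy |
|---|---|---|---|---|---|
| GAT | N/A | | | | 84.0% |
| | STE | | | | 80.5% |
| | Reinforce | | | | 80.3% |
| | STE | | | | 81.2% |
| | Reinforce | | | | 81.3% |
| | STE | | | | 77.2% |
| | Reinforce | | | | 77.5% |
| | STE | | | | 77.7% |
| | Reinforce | | | | 76.9% |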
The result is shown in Table 5, where the four BGN variants denote, in order, BGN with the weights binarized, with the embedding vectors binarized, with both the weights and the embedding vectors binarized, and with the weights, embedding vectors and attention coefficients all binarized based on the graph attention mechanism. We can conclude from Table 5 that (1) when the weights, activations and attention coefficients are all binarized, BGN-GAT saves the most space for the parameters and the output vectors while maintaining acceptable classification accuracy. (2) The straight-through estimator and the REINFORCE estimator achieve similar accuracy on the node classification task. Therefore, we choose the STE for our model in the above experiments because of its simplicity and determinism. (3) Compared with the original GAT, BGN-GAT can save space for the model parameters and for the activations and achieve a speed-up.
5.7 Case Study
In this section, we investigate how the binarized graph neural network improves the time efficiency of GNN-based applications such as GMN. Because GMN needs to compute the pairwise dot products between node and graph embedding vectors, the time consumption becomes extremely high as the number of nodes in each graph grows. With binary representations, however, we can apply xnor between binary vectors to replace the dot product, which significantly alleviates the time complexity problem. The following experiments report the performance and time cost of GMN with binary node and graph representations compared with the original version. The graph similarity is then used for the graph matching task.
Experiment Setup
We follow the experiment setting of Li et al. (2019) to test the performance of the binarized GMN. The training data is generated by sampling binomial graphs with nodes and edge probability Erdős and Rényi (1960). Then the positive example is generated by randomly substituting edges from with new edges, and the negative example is generated by substituting edges from , where . In the experiment, we set , and . We use the Hamming similarity between vectors as the loss function to train the model, since it is better suited to binary-valued vectors. The model needs to predict a higher similarity score for the positive pair than for the negative pair. The evaluation metrics remain the same: (1) pair AUC: the area under the ROC curve for classifying pairs of graphs as similar or not on a fixed set of 1000 pairs, and (2) triplet accuracy: the accuracy of correctly assigning a higher similarity to the positive pair in a triplet than to the negative pair on a fixed set of 1000 triplets.
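The data generation described above can be sketched as follows. The sketch assumes that a positive example replaces fewer edges of the sampled graph than a negative example; the concrete values of `p`, `k_pos` and `k_neg` and all function names are hypothetical and chosen only for illustration, while `n = 20` mirrors the default GMN graph size mentioned in this case study.

```python
import itertools
import random


def sample_binomial_graph(n, p, rng):
    """Erdős–Rényi G(n, p): keep each of the n*(n-1)/2 possible edges with probability p."""
    return {e for e in itertools.combinations(range(n), 2) if rng.random() < p}


def substitute_edges(edges, n, k, rng):
    """Replace k randomly chosen edges with k edges that were absent from the graph."""
    edges = set(edges)
    present = sorted(edges)
    absent = sorted(set(itertools.combinations(range(n), 2)) - edges)
    k = min(k, len(present), len(absent))  # guard against very sparse or dense graphs
    removed = rng.sample(present, k)
    added = rng.sample(absent, k)
    return (edges - set(removed)) | set(added)


rng = random.Random(0)
n, p = 20, 0.2       # p is a hypothetical value, not taken from the text
k_pos, k_neg = 1, 2  # hypothetical: fewer substitutions for the positive example

g1 = sample_binomial_graph(n, p, rng)
g_pos = substitute_edges(g1, n, k_pos, rng)  # similar graph
g_neg = substitute_edges(g1, n, k_neg, rng)  # less similar graph
print(len(g1), len(g1 ^ g_pos), len(g1 ^ g_neg))
```

Each substitution removes one edge and adds one, so the edge count is preserved and the symmetric difference to the original graph grows by two per substituted edge.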
Inference time and Graph Matching Performance
We report the graph matching accuracy and inference time of the binarized and original GMN with regard to the number of nodes in each graph. The default setting in GMN is 20 nodes per graph, which is quite small for real-world networks. We vary the number of nodes per graph from 20 to 160 and keep the other settings as described above to evaluate the performance and inference time. The dimensions of the node and graph representations are set to 32 and 64, respectively.
As shown in Figure 5, the inference of BGN-GMN is significantly faster. This is because the similarity computation (pairwise dot product) between the node representations of two graphs accounts for most of the time cost of GMN. Under the same dimension of node and graph embedding vectors, BGN-GMN is up to faster than the baseline model in terms of inference time, thanks to the replacement of the dot product by fast operations such as xnor and popcount between binary vectors.
In terms of the graph matching task, the original version of GMN performs better when the number of nodes in each graph is small. However, as the number of nodes gets larger, its pair AUC and triplet accuracy both decay. When the number of nodes is more than , the real-valued representations cannot tell the similarity difference between the graphs. Hence, the model is not able to learn different similarity scores for positive and negative pairs of graphs with the Hamming similarity metric. With the help of binarization and the balance function, however, the binary representations still achieve an acceptable and more robust performance on the graph matching task. This is because the binarized model produces true binary representations for the calculation of the Hamming loss and is designed specifically for graph matching in Hamming space.
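The two evaluation metrics can be made concrete with a minimal sketch. It assumes that the similarity score of a pair is the Hamming similarity between binary graph codes, taken here as the fraction of agreeing positions, and it computes the AUC through its ranking interpretation (the probability that a similar pair outscores a dissimilar one). The toy codes and function names are illustrative and not taken from the experiments.

```python
def hamming_similarity(a, b):
    """Fraction of positions where two equal-length binary codes agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def triplet_accuracy(triplets):
    """Share of (anchor, positive, negative) triplets with the positive scored higher."""
    hits = sum(hamming_similarity(a, p) > hamming_similarity(a, n)
               for a, p, n in triplets)
    return hits / len(triplets)


def pair_auc(scores, labels):
    """AUC as the probability that a similar pair (label 1) outscores a dissimilar pair
    (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


triplets = [
    ([1, 0, 1, 1], [1, 0, 1, 0], [0, 1, 1, 0]),  # positive is closer: correct
    ([0, 0, 1, 1], [1, 1, 1, 1], [0, 0, 1, 0]),  # negative is closer: incorrect
]
acc = triplet_accuracy(triplets)
auc = pair_auc([0.75, 0.25, 0.5, 0.75], [1, 0, 1, 0])
print(acc, auc)
```

Because Hamming similarity over a code of length d takes only d + 1 distinct values, ties between pairs are common at short code lengths, which is why the AUC sketch counts them explicitly.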
Parameter Sensitivity Analysis
We compare the performance of the binarized and original versions of GMN to show the effect of the dimension of the node and graph embedding vectors. We set the number of nodes in each graph for this comparison. We change the dimension of the graph embeddings produced by the two models so that they have the same bit width, and keep the other settings unchanged to compare the performance of the two models.
The result is shown in Figure 6(a). The binary graph representations tend to perform better at low bit width and reach similar accuracy as the bit width of the representations gets larger. The binary representations are also more robust than the baseline model when the embedding dimension varies.
Binarizing the node representations is more important than binarizing the graph representations, because the dot product is mainly computed between node representations, which costs plenty of time. We compare the performance of GMN and BGN-GMN under different bit widths for the node embedding vectors by varying their dimensions.
As shown in Figure 6(b), the pair AUC is similar for the binary and the real-valued node embedding vectors, but BGN-GMN performs better with low bit-width representations. As for the triplet accuracy, the binary embedding vectors achieve better performance with short code lengths and accuracy similar to that of the real-valued node embeddings with long code lengths. These results indicate that binary representations are much better suited to comparing two graphs at low bit width. In line with the result for the binary graph embedding vectors, the binary node embedding vectors are also more robust than the real-valued node representations.
6 Conclusion
We present a model focused on the challenging problem of learning binary representations for network embedding using a compact neural network structure. We propose a novel binarized graph embedding method, namely BGN, that has binarized parameters and enables GNNs to learn discrete embeddings. The binarized neural network reduces the memory and time cost of GNNs and thereby increases their scalability. BGN can be naturally integrated into other GNN models, such as the graph matching network, to improve their inference time and space consumption. External experiments also illustrate that BGN can increase time efficiency while maintaining competitive accuracy.
References
SimGNN: A neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11–15, 2019, pp. 384–392.
Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. NIPS.
A survey on network embedding. IEEE Trans. Knowl. Data Eng. 31 (5), pp. 833–852.
On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5 (1), pp. 17–60.
LIBLINEAR: A library for large linear classification. JMLR 9, pp. 1871–1874.
Node2vec: scalable feature learning for networks. In ACM SIGKDD, pp. 855–864.
Inductive representation learning on large graphs. In NIPS, pp. 1024–1034.
Representation learning on graphs: methods and applications. IEEE Data Eng. Bull. 40 (3), pp. 52–74.
Adaptive sampling towards fast graph representation learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pp. 4563–4572.
Binarized neural networks. In NIPS, pp. 4107–4115.
SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pp. 4289–4300.
Semi-supervised classification with graph convolutional networks. In ICLR.
Graph matching networks for learning the similarity of graph structured objects. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, pp. 3835–3845.
Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
High-order proximity preserving information network hashing. In ACM SIGKDD, pp. 1744–1753.
Deep supervised hashing for fast image retrieval. In CVPR, pp. 2064–2072.
Discrete graph hashing. In NIPS, pp. 3419–3427.
Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119.
Deepwalk: online learning of social representations. In ACM SIGKDD, pp. 701–710.
GPH: similarity search in hamming space. In IEEE ICDE, pp. 29–40.
Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In ACM WSDM, pp. 459–467.
XNOR-Net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
Semantic hashing. Int. J. Approx. Reasoning 50 (7), pp. 969–978.
Collective classification in network data. AI magazine 29 (3), pp. 93–93.
Supervised discrete hashing. In CVPR, pp. 37–45.
Discrete network embedding. In IJCAI, pp. 3549–3555.
LINE: large-scale information network embedding. In WWW, pp. 1067–1077.
Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pp. 5998–6008.
Graph attention networks. arXiv preprint arXiv:1710.10903.
Attributed graph clustering: A deep attentional embedding approach. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10–16, 2019, pp. 3670–3676.
A survey on learning to hash. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 769–790.
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, pp. 229–256.
Binarized attributed network embedding. In IEEE ICDM, pp. 1476–1481.
Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pp. 5171–5181.
Attributed graph clustering via adaptive graph convolution. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10–16, 2019, pp. 4327–4333.
DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160.