Interest in learning from graph-structured data has risen rapidly in recent years because of its wide applicability in bioinformatics, chemoinformatics, social network analysis, and data mining. To learn from graph-structured data, we need an algorithm that can effectively represent the graph structure and the relations between the graph nodes. In recent years, numerous approaches to learning graph structure have been developed, including graph kernel methods [26, 7, 15, 14, 2] and "neural message passing" based graph neural network (GNN) methods [23, 5, 25, 13, 10, 3, 22, 20, 12].
Most GNN algorithms recursively aggregate the feature information of connected nodes, and thereby create a new feature vector for each node in the graph. Attention methods are also used to obtain a better representation of graphs when aggregating node features [18, 11, 1, 24]. By repeating this process $k$ times, an algorithm obtains information about the $k$-hop neighborhood of each node, and a representation of the whole graph by combining those feature vectors. Graph isomorphism network (GIN) [21] formalized this process mathematically using the concept of multiset functions.
A major limitation of the aggregators in GNN architectures is that each node uses only the information of its neighboring nodes, which does not include the relationships among those neighbors. This limitation causes GNN architectures to map different graphs to the same representation, which makes the graph structure difficult to learn. To illustrate this limitation, we propose a family of artificial graphs that cannot be classified by the 1-dimensional Weisfeiler-Lehman (WL) test or by traditional GNN algorithms with a 1-hop neighborhood aggregator; this shows that reflecting the relationships between the nodes in the neighborhood is necessary.
To overcome this limitation, we propose Neighborhood Edge AggregatoR (NEAR), a framework that enables the graph representation to reflect the relationships between the nodes in each neighborhood. Our idea was inspired by certain graph structures that cannot be classified by previous GNN frameworks. In recent years, some graph classification architectures have used features related to the local structure of graphs, such as the local clustering coefficient or return probability [23, 26], or a method such as hierarchical pooling, to handle this problem. However, to date no approach solves this problem within the recursive learning process of the GNN itself. Our proposed algorithm aggregates the relationships between the nodes in the neighborhood and combines them with the existing feature vectors by incorporating the NEAR and GIN [21] frameworks. The proposed algorithm outperformed current state-of-the-art algorithms on several graph classification tasks and overcame the limitation mentioned above.
Our main contributions can be summarized as follows.
We constructed a family of graphs that cannot be classified by existing GNN models based on a 1-hop neighborhood aggregator, which shows that reflecting the relationships between the nodes in neighborhoods is required to represent their local structure.
We proposed NEAR, a new GNN framework that aggregates the local structure of neighborhoods. We verified that our framework can reflect the local structures of graphs well enough to classify the family of graphs that we have proposed.
We proposed simple variants of NEAR with stronger discriminative power for classifying local structures. Our variants of NEAR showed notable performance on various graph classification tasks.
2 Proposed method
GNN’s neighborhood aggregator and graph-level readout function operate on a set of feature vectors of nodes, potentially admitting the same feature vectors. Therefore, we first introduce a generalized concept of sets that allows repetition of elements.
(Multiset) A multiset is a generalized concept of sets that allows repetition of elements. A multiset $X$ can be represented as a pair of a set $S$ and a function $m : S \to \mathbb{N}_{\geq 1}$, namely $X = (S, m)$. $S$ represents the set of unique elements in $X$ and $m$ represents the multiplicity of each element in $X$.
For example, the two multisets $\{a, a, b\}$ and $\{a, b, a\}$ can both be represented as $(S, m)$ with $S = \{a, b\}$, $m(a) = 2$, and $m(b) = 1$. We can easily observe that a multiset is invariant under permutation, because its underlying set and multiplicities remain the same under permutation. Therefore, functions that operate on multisets must be at least permutation invariant to be well-defined. Typical examples of multiset functions are count (for the finite case), summation (sum), average (mean), and min/max.
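As a small illustration of the definition above, a multiset can be held in Python's `collections.Counter`, whose keys are the underlying set and whose values are the multiplicities. The element names and numbers below are made up for illustration. The sketch only checks that sum, mean, and max do not depend on element order, and that sum, unlike mean, is sensitive to multiplicities.

```python
from collections import Counter

# Two listings of the same elements in a different order (illustrative values).
X = Counter(["a", "a", "b"])
Y = Counter(["b", "a", "a"])

same_multiset = (X == Y)      # order does not matter
underlying_set = set(X)       # unique elements
multiplicity = dict(X)        # element -> number of repetitions


def mean(values):
    values = list(values)
    return sum(values) / len(values)


features = [1.0, 2.0, 2.0, 5.0]
permuted = [5.0, 2.0, 1.0, 2.0]

# Permutation invariance of typical multiset functions.
invariant = (
    sum(features) == sum(permuted)
    and mean(features) == mean(permuted)
    and max(features) == max(permuted)
)

# Mean cannot tell {2} from {2, 2}, whereas sum can.
sum_separates = sum([2.0]) != sum([2.0, 2.0])
mean_separates = mean([2.0]) != mean([2.0, 2.0])
```

Here `sum_separates` is true and `mean_separates` is false, which is one way to see why a sum aggregator retains information about neighborhood size that a mean aggregator discards.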
According to [21], the main structure of a message-passing based GNN layer can be formulated using three core functions: AGGREGATOR, COMBINE, and READOUT. Given a graph $G = (V, E)$, suppose that there exist feature vectors on the set of nodes $V$. Let $h_v$ be the feature vector of node $v$, $h_{N(v)}$ be the aggregated feature vector of the neighborhood $N(v)$ of node $v$, and $h_G$ be the representation vector of graph $G$. In this case, $h_{N(v)}$, $h_G$, and the new feature vector $h'_v$ of each node can be defined as below.
$$h_{N(v)} = \mathrm{AGGREGATOR}\left(\{h_u : u \in N(v)\}\right), \qquad h'_v = \mathrm{COMBINE}\left(h_v, h_{N(v)}\right), \qquad h_G = \mathrm{READOUT}\left(\{h'_v : v \in V\}\right)$$
First, AGGREGATOR operates on the set of feature vectors of the neighborhood $N(v)$ of node $v$. AGGREGATOR integrates the information of $N(v)$ and returns a feature vector $h_{N(v)}$ that represents the neighborhood of $v$. Second, COMBINE operates on $h_v$ and $h_{N(v)}$ and creates a new feature vector $h'_v$ of node $v$ for the next GNN layer. While repeating this for every GNN layer, the READOUT function operates on the set of nodes in $G$ and returns a vector $h_G$ that represents the whole graph $G$. In [21], summation was used as AGGREGATOR and summation/mean were used as READOUT. A 2-layer MLP with learnable parameters was used as COMBINE to approximate an injective universal multiset function.
Let $X^{(k)}$ be the multiset of feature vectors of nodes in GNN layer $k$, where $X^{(0)}$ is the multiset of the given initial feature vectors of nodes. Because GNN layers are stacked in a row, every layer computes its new feature vectors recursively from the output of the previous layer: $X^{(k)}$ is computed from $X^{(k-1)}$ for $k = 1, \dots, K$. By stacking $K$ GNN layers in a row, we can expect the model to learn representations of up to $K$-hop neighborhoods.
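The layer structure described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: AGGREGATOR and READOUT are sums, and COMBINE is plain vector addition standing in for the learned 2-layer MLP. The function names, the path graph, and the feature values are all hypothetical.

```python
def vec_add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def vec_sum(vectors, dim):
    total = (0.0,) * dim
    for vec in vectors:
        total = vec_add(total, vec)
    return total


def gnn_layer(adj, h, combine):
    """h'_v = COMBINE(h_v, AGGREGATOR({h_u : u in N(v)})), with sum as AGGREGATOR."""
    dim = len(next(iter(h.values())))
    return {v: combine(h[v], vec_sum((h[u] for u in adj[v]), dim)) for v in adj}


def readout(h):
    """Sum READOUT over all node feature vectors."""
    dim = len(next(iter(h.values())))
    return vec_sum(h.values(), dim)


# Path graph 0 - 1 - 2 with made-up 2-dimensional node features.
adj = {0: [1], 1: [0, 2], 2: [1]}
h0 = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (1.0, 0.0)}

h1 = gnn_layer(adj, h0, combine=vec_add)  # each node has seen its 1-hop neighborhood
h2 = gnn_layer(adj, h1, combine=vec_add)  # each node has seen its 2-hop neighborhood
graph_vector = readout(h2)
```

After the second layer, node 0 has received information that originated at node 2, two hops away, which is the sense in which stacking layers widens the receptive field.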
2.2 Toy example
The sum AGGREGATOR in GIN captures information on the size and distribution of a neighborhood [21]. However, current GNNs, GIN included, that rely only on a simple 1-hop AGGREGATOR (whether sum or mean) may fail to distinguish graphs with different local structures. Here, we introduce a family of graphs that cannot be distinguished by traditional GNNs. Figure 1 shows an example of graphs that have the same neighborhood sets but different local structures.
There exists a graph with a multiset of feature vectors of nodes with , satisfying the following conditions for every .
where and .
There are black nodes and white nodes.
Every white node is connected with two black nodes and has the same feature vector .
Every black node is connected with two white nodes and one black node, and has the same feature vector .
Let be a set of black nodes and be a set of white nodes, where and are connected for . We define a multiset of black nodes with multiplicities 2, where .
If , we randomly pick 2 different elements in and remove them from multiset . Next, we connect them with given white nodes for . Repeating this procedure, we get 2 remaining white nodes and 4 elements in .
If , then we have three possible cases.
Without loss of generality, if , then we pick and connect them with the white node . For the white node , we connect it to black nodes .
Without loss of generality, if , then we pick and connect them with the white node . For the white node , we connect it to black nodes .
If all elements in are distinct, then we choose 2 elements randomly and connect them to the white node . The remaining elements will be connected with the white node .
Finishing the procedures above, every white node is connected with two black nodes and every black node is connected with two white nodes and one black node, which generates the desired graph. This completes the proof. ∎
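The construction in the proof can be sketched as a small random generator. The parameterization is an assumption made for this sketch: counting black-white edges from both sides shows that the degree conditions force equal numbers of black and white nodes, and the black-black edges form a perfect matching, so the code uses 2m black and 2m white nodes for a hypothetical parameter `m`. The node numbering and the function names are also illustrative. For the last four elements of the multiset, the sketch uses one fixed pairing that covers all three cases of the proof rather than a random choice.

```python
import random


def make_family_graph(m, seed=0):
    """Return (adj, n_black): black nodes are 0..2m-1, white nodes are 2m..4m-1."""
    rng = random.Random(seed)
    n_black = 2 * m
    adj = {v: set() for v in range(4 * m)}

    def connect(u, v):
        adj[u].add(v)
        adj[v].add(u)

    # Black-black edges: a perfect matching (0-1, 2-3, ...).
    for b in range(0, n_black, 2):
        connect(b, b + 1)

    # Multiset of black nodes in which every black node appears twice.
    stubs = [b for b in range(n_black) for _ in range(2)]
    whites = list(range(n_black, 4 * m))

    # While more than 4 elements remain, attach the next white node to two
    # different black nodes drawn from the multiset.
    while len(stubs) > 4:
        w = whites.pop()
        a = rng.choice(stubs)
        b = rng.choice([s for s in stubs if s != a])
        stubs.remove(a)
        stubs.remove(b)
        connect(w, a)
        connect(w, b)

    # Last 4 elements: after sorting, equal elements are adjacent, so pairing
    # positions (0, 2) and (1, 3) always yields two pairs of distinct nodes.
    s = sorted(stubs)
    w1, w2 = whites
    connect(w1, s[0])
    connect(w1, s[2])
    connect(w2, s[1])
    connect(w2, s[3])
    return adj, n_black


def satisfies_conditions(adj, n_black):
    """White: two black neighbors. Black: one black and two white neighbors."""
    for v, nbrs in adj.items():
        blacks = sum(1 for u in nbrs if u < n_black)
        whites = len(nbrs) - blacks
        expected = (1, 2) if v < n_black else (2, 0)
        if (blacks, whites) != expected:
            return False
    return True
```

Calling `make_family_graph(m, seed)` with different seeds yields graphs of the same size that satisfy the same degree conditions but can differ in how many triangles they contain.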
Suppose all black nodes share one feature vector $x_b$ and all white nodes share one feature vector $x_w$. Previous message-passing-based GNNs aggregate the neighborhood $N(v)$ of a target node $v$ and combine it with the feature vector of $v$. These only take into account the combination of feature vectors of the nodes in $N(v)$, not the relationships between the nodes in $N(v)$. Therefore, although the marked black nodes in Figure 1 have different local structures, they will be mapped to the same feature vector in the next GNN layer. The following lemma is a general statement for graphs in Figure 1.
1-hop neighborhood AGGREGATOR and COMBINE do not modify the graph structure. Therefore, it is sufficient to prove that all nodes with degree 2 and degree 3 have the same feature vector mapped by the GNN layer respectively. Let be a feature vector of the white node and be a feature vector of the black node. Then, a new feature vector for the white node can be represented as below.
Because every white node has the neighborhood with the same feature vectors , this can be applied to all white nodes in , regardless of function or function. Therefore, all white nodes are mapped to the same feature vector . For black nodes, this can be similarly proved from the fact
Every black node in such a graph has the same feature vector, degree, and multiset of feature vectors of neighboring nodes. Therefore, every black node will be mapped to the same feature vector in the next GNN layer. Likewise, every white node will be mapped to the same feature vector. Repeating this procedure for every GNN layer on our proposed family of graphs, we obtain the following theorem, which states that previous GNNs may fail to capture differences in local structure.
For a family of graphs introduced in Proposition 1, suppose a GNN model contains GNN layers, which contain 1-hop neighborhood AGGREGATOR and COMBINE. Let a vector be a representation vector of graph in GNN layer. Define a representation vector of graph as , then
If mean is used as a READOUT, then GNN model maps every graph to the same vector, regardless of AGGREGATOR and COMBINE, i.e. is constant for every graph .
In general, for any READOUT function, is only dependent on , where is the number of nodes in graph .
Let be an original graph with its node feature vectors and be a graph that passed GNN layers. From the Lemma 1, we can deduce that also satisfies the conditions in Proposition 1 inductively. Therefore, there are black nodes with degree 3 and white nodes with degree 2 which have the same feature vector respectively in graph . Let , be the feature vectors of black/white nodes in respectively. Then,
Let , then can be represented as below.
Note that are fixed for all graphs . If READOUT function is independent of the size of set , then and is also independent with . This immediately proves the first statement, because
is independent of . In general, is only dependent on , which directly proves the second statement. ∎
Theorem 1 states that previous GNN models may miss valuable information on local structures in such cases. Therefore, the problem is to find graph features or new algorithms that can distinguish such differences.
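The statement of Theorem 1 can be checked numerically on a tiny pair of graphs. The sketch below is an illustration, not the experimental model: it uses two hand-built 8-node graphs that satisfy the degree conditions (one with triangles, one without), a GIN-0 style update with a sum AGGREGATOR, randomly initialized bias-free ReLU MLPs as COMBINE, and a sum READOUT. All names, sizes, and the random seed are assumptions.

```python
import math
import random


def build(edges, n):
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


# Nodes 0-3 are black (matched 0-1 and 2-3); nodes 4-7 are white.
MATCHING = [(0, 1), (2, 3)]
GRAPH_A = build(MATCHING + [(4, 0), (4, 1), (5, 0), (5, 1),
                            (6, 2), (6, 3), (7, 2), (7, 3)], 8)  # has triangles
GRAPH_B = build(MATCHING + [(4, 0), (4, 2), (5, 0), (5, 2),
                            (6, 1), (6, 3), (7, 1), (7, 3)], 8)  # triangle-free


def make_mlp(rng, d_in, d_hidden, d_out):
    w1 = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_out)]

    def mlp(x):
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return tuple(sum(w * hi for w, hi in zip(row, hidden)) for row in w2)

    return mlp


def graph_vector(adj, layers):
    # One-hot colors as initial features: black = (1, 0), white = (0, 1).
    h = {v: (1.0, 0.0) if v < 4 else (0.0, 1.0) for v in adj}
    for mlp in layers:
        h = {v: mlp([h[v][i] + sum(h[u][i] for u in sorted(adj[v]))
                     for i in range(2)])
             for v in adj}
    return tuple(sum(h[v][i] for v in sorted(adj)) for i in range(2))


def count_triangles(adj):
    return sum(1 for u in adj for v in adj[u] for w in adj[v]
               if u < v < w and w in adj[u])


rng = random.Random(0)
layers = [make_mlp(rng, 2, 4, 2) for _ in range(3)]
vec_a = graph_vector(GRAPH_A, layers)
vec_b = graph_vector(GRAPH_B, layers)
same_representation = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(vec_a, vec_b)
)
```

The two graphs receive the same graph-level vector for any choice of weights, although `count_triangles` separates them immediately.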
2.3 Proposed method : NEAR
One simple and straightforward solution to handle graphs in the toy example above is inserting node features that integrate the graph’s local structure, such as local clustering coefficient  or return probability vectors . However, we aim to obtain structural properties while mapping the feature vectors to the hidden vectors on every GNN layer. Here, we propose NEAR, a new GNN framework that aggregates information of neighborhood via edges and encodes the local structures to the hidden vectors.
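Let $G = (V, E)$ be a graph with a multiset of node feature vectors $X$. Let $N(v)$ be the set of nodes in the neighborhood of $v$ and $E_{N(v)}$ be the set of edges that connect nodes in $N(v)$. Suppose that $f : \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d'}$ is a real-valued function, where $d$ is the dimension of the feature vectors of nodes and $d'$ is the dimension of the embedded vectors. Let $\phi$ be a fixed multiset function. $\mathrm{NEAR}$, which operates on graph $G$, node $v$, and feature vector multiset $X$, is defined as below.
$$h_{E_{N(v)}} = \mathrm{NEAR}(G, v, X) = \phi\left(\{ f(h_u, h_{u'}) : (u, u') \in E_{N(v)} \}\right)$$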
If we set to be summation, can be simply written as below, where is an adjacency matrix’s element. To simplify the notation, we write .
For a given node $v$, we add the feature vector $h_v$ and the aggregated neighborhood feature vector $h_{N(v)}$, where $h_{N(v)}$ is calculated by the 1-hop neighborhood AGGREGATOR. In NEAR, the edges in $E_{N(v)}$ are additionally aggregated and mapped to $h_{E_{N(v)}}$, which is shown with bold edges. After that, the two vectors $h_v + h_{N(v)}$ and $h_{E_{N(v)}}$ are concatenated and mapped to a new feature vector of $v$ in the next GNN layer by COMBINE (MLP in Figure 2). These procedures are carried out for every node in graph $G$.
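A minimal sketch of one such layer is given below, assuming that the edge aggregation is a sum of a pairwise function over the edges whose two endpoints both lie in the neighborhood of the target node, each undirected edge counted once. The function names, the small example graph, and the use of the Hadamard product as the pairwise function are illustrative; `combine` is left as a plain callable where the model would use an MLP.

```python
from itertools import combinations


def near_aggregate(adj, h, v, f, dim_out):
    """Sum f(h_u, h_w) over edges (u, w) with both endpoints in N(v)."""
    total = [0.0] * dim_out
    for u, w in combinations(sorted(adj[v]), 2):
        if w in adj[u]:  # (u, w) is an edge inside the neighborhood of v
            for i, x in enumerate(f(h[u], h[w])):
                total[i] += x
    return total


def near_layer(adj, h, f, dim_out, combine):
    new_h = {}
    for v in adj:
        # Node feature plus sum-aggregated neighbor features.
        summed = [h[v][i] + sum(h[u][i] for u in adj[v])
                  for i in range(len(h[v]))]
        # Concatenate with the aggregated neighborhood-edge vector.
        new_h[v] = combine(summed + near_aggregate(adj, h, v, f, dim_out))
    return new_h


def hadamard(a, b):
    return [x * y for x, y in zip(a, b)]


# Triangle 0-1-2 with a pendant node 3 attached to node 0 (made-up features).
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
h = {0: (1.0,), 1: (2.0,), 2: (3.0,), 3: (4.0,)}
out = near_layer(adj, h, hadamard, dim_out=1, combine=tuple)
```

Node 3 has a single neighbor and therefore no edges inside its neighborhood, so its edge component is zero, while node 0 picks up the edge between nodes 1 and 2.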
Note that is sufficient to encode the connection between two nodes whose feature vectors are . Once we define NEAR that maps the connection between the nodes in neighborhoods onto some hidden vector with size , we can re-design the previous GIN architecture using our proposed method. NEAR can encode the local structure into every GNN layer.
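We propose four simple variants of NEAR: NEAR-c, NEAR-e, NEAR-m, and NEAR-h. NEAR-c uses the simplest constant function $f(h_u, h_{u'}) = 1$ and NEAR-e uses a simple addition function $f(h_u, h_{u'}) = h_u + h_{u'}$. NEAR-m uses an element-wise max function $f(h_u, h_{u'}) = \max(h_u, h_{u'})$ and NEAR-h uses the Hadamard product $f(h_u, h_{u'}) = h_u \odot h_{u'}$. In particular, for NEAR-c and NEAR-e with summation over the edges in $E_{N(v)}$, we can reduce our computation as below by using graph invariants.
$$h^{c}_{E_{N(v)}} = \frac{1}{2} \sum_{u \in N(v)} t_{vu}, \qquad h^{e}_{E_{N(v)}} = \sum_{u \in N(v)} t_{vu} \, h_u$$
Here $t_{vu}$ is the number of nodes in $N(v)$ that are connected with node $u$, which is equal to the number of triangles that contain node $v$ and node $u$.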
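The reduction for NEAR-c and NEAR-e can be checked against a brute-force sum over neighborhood edges. The sketch assumes sum aggregation with each undirected edge counted once, and uses scalar features to keep the arithmetic readable. The names `t`, `near_c_fast`, and `near_e_fast` are hypothetical, and the example graph (a complete graph on four nodes with one edge removed) is made up.

```python
from itertools import combinations


def common_neighbor_counts(adj):
    """t[v][u] = number of triangles containing the edge (v, u)."""
    return {v: {u: len(adj[v] & adj[u]) for u in adj[v]} for v in adj}


def near_c_fast(adj, t, v):
    # Every edge inside N(v) is seen from both of its endpoints.
    return sum(t[v].values()) // 2


def near_e_fast(adj, h, t, v):
    # Neighbor u appears in exactly t[v][u] edges inside N(v).
    return sum(t[v][u] * h[u] for u in adj[v])


def neighborhood_edges(adj, v):
    return [(u, w) for u, w in combinations(sorted(adj[v]), 2) if w in adj[u]]


def near_c_brute(adj, v):
    return len(neighborhood_edges(adj, v))


def near_e_brute(adj, h, v):
    return sum(h[u] + h[w] for u, w in neighborhood_edges(adj, v))


adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}}
h = {0: 1, 1: 10, 2: 100, 3: 1000}  # scalar features for readability
t = common_neighbor_counts(adj)

agree = all(
    near_c_fast(adj, t, v) == near_c_brute(adj, v)
    and near_e_fast(adj, h, t, v) == near_e_brute(adj, h, v)
    for v in adj
)
```

The reduced forms need only the per-edge triangle counts, which can be computed once per graph and reused in every layer.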
We conduct two experiments to show the importance of the relations between the nodes in neighborhoods and to evaluate performance on several graph classification benchmarks. In the experiments, we built GIN-0 based models with our variants of NEAR. The detailed structure is illustrated in Figure 3. First, we performed binary graph classification tasks with the toy examples from Section 2.2. Each task requires classifying a graph property: the existence of a simple cycle of length at least 6 in the cycle basis, or the global clustering coefficient. Second, we performed graph classification on 9 benchmark datasets. 10-fold cross-validation was applied, and the mean and standard deviation of the validation accuracies were reported. Overall, we achieved state-of-the-art results on several graph classification benchmarks.
3.1 Toy example
In this experiment, we aim to show that NEAR can encode relations between the nodes in the neighborhoods. By comparing a plain GIN model with its NEAR variants, we show that incorporating local structures into GNN layers is necessary. We perform two basic binary graph classification tasks with the family of graphs introduced in Section 2.2, which was proven to be indistinguishable by previous GNN models. First, we randomly generate 1000 artificial graphs that satisfy Proposition 1, where with . We put binary labels on the graphs and perform two basic graph classification tasks on them. The tasks concern the clustering coefficient and a property of the cycle basis, respectively. Details of the graph classification tasks for the toy example are given below.
ARTFCC : Clustering coefficient of a graph is larger than or equal to 0.2
ARTFCYCLE6 : The cycle basis of a given graph contains a simple cycle whose length is longer than or equal to 6
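The ARTFCC label can be sketched as follows, assuming that "clustering coefficient of a graph" means the global clustering coefficient (three times the number of triangles divided by the number of connected triples). The threshold 0.2 comes from the task definition; the function names and the two 8-node example graphs are illustrative.

```python
def build(edges, n):
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def global_clustering(adj):
    # Summing common neighbors over ordered adjacent pairs counts every
    # triangle 6 times, so half of that sum equals 3 * (number of triangles).
    closed = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u]) / 2
    # Connected triples: paths of length 2 centered at each node.
    triples = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    return closed / triples if triples else 0.0


def artfcc_label(adj, threshold=0.2):
    return int(global_clustering(adj) >= threshold)


# Two 8-node graphs with identical degree sequences (nodes 0-3 have degree 3,
# nodes 4-7 have degree 2) but different numbers of triangles.
MATCHING = [(0, 1), (2, 3)]
GRAPH_A = build(MATCHING + [(4, 0), (4, 1), (5, 0), (5, 1),
                            (6, 2), (6, 3), (7, 2), (7, 3)], 8)
GRAPH_B = build(MATCHING + [(4, 0), (4, 2), (5, 0), (5, 2),
                            (6, 1), (6, 3), (7, 1), (7, 3)], 8)

labels = (artfcc_label(GRAPH_A), artfcc_label(GRAPH_B))
```

Because every graph of a given size in the family has the same number of connected triples, the label depends only on the number of triangles, which is exactly the local information a 1-hop aggregator does not see.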
To emphasize the importance of information on local structures, we performed the graph classification tasks with a plain GIN model (GIN-0 in [21] with sum AGGREGATOR) and 4 simple variants of GIN: GIN with NEAR-c/e/m/h. Details are given in Table 1. We report the training/validation loss curves and the training/validation accuracy of ARTFCC for fixed train/validation sets.
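| Model | $f(h_u, h_{u'})$ | Description |
| GIN-0 | - | Plain GIN model |
| NEAR-c | $1$ | GIN-0 with NEAR-c |
| NEAR-e | $h_u + h_{u'}$ | GIN-0 with NEAR-e |
| NEAR-m | $\max(h_u, h_{u'})$ | GIN-0 with NEAR-m |
| NEAR-h | $h_u \odot h_{u'}$ | GIN-0 with NEAR-h |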
The GIN model in the toy example experiments has 5 GNN layers; each of them uses sum AGGREGATOR, sum READOUT, and 2 fully-connected layers as COMBINE. After the graph representation is generated, it is fed into a 2-layer MLP with a ReLU activation function and a softmax function to obtain a probability vector. The batch size and the hidden-layer dimension are both 32. Batch normalization is applied after every hidden layer and the dropout [16] ratio for the final prediction layer is given by . We used the Adam optimizer [9] with its learning rate with exponential decay . Cross entropy is used as the loss function for the toy example tasks.
From the results on the toy examples with our GIN variants in Figure 4, we can observe that the training loss of GIN decreases slowly compared with the proposed NEAR variants. GIN could fit these toy example graphs only through the size of the graphs, while the proposed NEAR variants were trained well. These results empirically support Theorem 1 with the sum READOUT function. Moreover, the validation accuracy and validation loss of plain GIN did not improve, whereas all the NEAR variants improved steadily. The graphs in these tasks have the same neighborhood sets with different local structures. Therefore, our proposed models are able to capture differences between such graph structures. Moreover, our proposed toy dataset can be used to examine the basic abilities of a graph classification model.
3.2 Graph classification tasks
Next, we perform graph classification tasks on 9 benchmark graph datasets. Four variants of NEAR (NEAR-c/e/m/h) are used as our proposed models in this experiment. Results of the recent algorithms based on GNNs and graph kernels are compared with ours.
3.2.1 Datasets and features
We use 6 bioinformatics datasets (COX2, DHFR, MUTAG, PTC-MR, NCI1, PROTEINS) with node labels/attributes and 3 social-network datasets (COLLAB, IMDB-BINARY/MULTI) with no node labels/attributes. Discrete labels on nodes were encoded into one-hot vectors, and continuous attributes were used without preprocessing. If there are no node labels or attributes, we generated constant dummy labels or attributes for all nodes. Degree was one-hot encoded and inserted as an additional node label.
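The feature preparation described above can be sketched as follows. The sketch assumes that the one-hot degree is appended for every node, that the one-hot width is the maximum degree plus one, and that the constant dummy feature is a single 1.0; none of these details are stated in the text, and the function names and the example graph are hypothetical.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def node_features(adj, labels=None, num_labels=0, max_degree=None):
    """Concatenate one-hot label (or a constant dummy) with one-hot degree."""
    if max_degree is None:
        max_degree = max(len(nbrs) for nbrs in adj.values())
    feats = {}
    for v, nbrs in adj.items():
        if labels is not None:
            base = one_hot(labels[v], num_labels)
        else:
            base = [1.0]  # constant dummy feature for unlabeled datasets
        feats[v] = base + one_hot(len(nbrs), max_degree + 1)
    return feats


# Path graph 0 - 1 - 2 with made-up discrete labels.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
labeled = node_features(adj, labels={0: 0, 1: 1, 2: 0}, num_labels=2)
unlabeled = node_features(adj)
```

In practice `max_degree` would be fixed across the whole dataset so that every graph yields feature vectors of the same length.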
3.2.2 Model settings
As in Section 3.1, we used 5 GNN layers, each containing a COMBINE with 2 fully-connected layers. AGGREGATOR in every GNN layer is fixed to summation in GIN and in our proposed NEAR variants. Batch normalization is applied after every hidden layer. We trained for 300 epochs on each fold. Hyperparameters such as the hidden-layer size, batch size, READOUT, learning rate, and dropout rate are chosen as follows: batch size, hidden layer’s dimension , dropout ratio for prediction layers , READOUT , learning rate with exponential decay . Cross entropy is used as the loss function.
|Node Features||Baseline algorithms||Proposed algorithm NEAR|
We performed 10-fold cross-validation and report the average of the 10 validation accuracies and their standard deviation for our proposed algorithms in Table 2. The existence of discrete node labels or continuous node attributes is also indicated. If node features exist in the dataset, we use '+', otherwise '-'. If continuous node attributes exist in the dataset, their dimension is noted in parentheses. Among the hyperparameter settings, the best epoch’s average validation accuracy was reported as the result. Graph Isomorphism Network (GIN) [21], Return probability-based Graph Kernel (RGK) [26], Weisfeiler-Lehman subtree kernel (WL) [14], Deep Graph Convolutional Neural Networks (DGCNN) [25], and PATCHY-SAN (PSCN) [13] were used as baseline algorithms. Given the model settings above, GIN-0 is used as the baseline GIN model for fair comparison. For the COX2 and DHFR datasets, we performed 10-fold cross-validation and report the average of the 10 validation accuracies for GIN/WL/DGCNN/PSCN and our proposed algorithms. For the other datasets, their accuracies were reported directly. NEAR-m and NEAR-h failed to produce results in time for the COLLAB dataset, which contains a large number of edges.
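One reading of "the best epoch's average validation accuracy" is sketched below: average the validation accuracy over folds for each epoch, pick the epoch with the highest average, and report the mean and standard deviation over folds at that epoch. This reading, the use of the population standard deviation, the function name, and the accuracy values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def best_epoch_report(val_acc):
    """val_acc[fold][epoch] -> (best_epoch, mean_accuracy, std_accuracy)."""
    n_epochs = len(val_acc[0])
    # Average over folds separately for every epoch.
    means = [mean([fold[e] for fold in val_acc]) for e in range(n_epochs)]
    # A single epoch is selected for all folds jointly.
    best = max(range(n_epochs), key=means.__getitem__)
    std = pstdev([fold[best] for fold in val_acc])
    return best, means[best], std


# Two folds and three epochs of made-up validation accuracies.
val_acc = [
    [0.5, 0.8, 0.7],
    [0.6, 0.6, 0.9],
]
best, avg, std = best_epoch_report(val_acc)
```

Selecting one shared epoch differs from taking each fold's own best epoch, which would give a more optimistic average.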
Our proposed algorithms achieved new state-of-the-art results on 5 of the 9 benchmark datasets. We first highlighted the best accuracy among the 5 baseline algorithms. Our results with NEAR are also highlighted where they outperform the previous best results. Our combined model of GIN-0 and NEAR improved the results on comparable datasets, especially on datasets with node labels/attributes. A direct comparison between GIN-0 and our NEAR variants in Table 3 shows that our neighborhood-relation encoding improved the model capacity of GIN-0. GIN-0 combined with NEAR shows notable improvement on the bioinformatics datasets, as highlighted in Table 3. This improvement on datasets with node labels/attributes suggests that NEAR can be combined orthogonally with existing GNN models and can handle graph datasets together with their node labels/attributes.
We proposed NEAR, a new GNN framework that aggregates edges in the neighborhood and thereby encodes local structures into hidden vectors. We constructed a family of graphs with the same neighborhoods and distribution of labels but with different local structures. By using the proposed edge-aggregating framework with GIN models, we showed that NEAR can encode local structures, and we obtained strong results on several graph classification tasks. NEAR shows a better model capacity for dealing with both the local structures of graphs and node labels/attributes. Possible future work includes finding a more powerful and computationally efficient function to represent connections between two nodes. Additionally, encoding edge labels/attributes with NEAR and GNN layers would be fruitful for more complex graph classification tasks and graph embedding.
Abu-El-Haija et al. 
Abu-El-Haija, S., B. Perozzi, R. Al-Rfou, and A. A.
2018. Watch your step: Learning node embeddings via graph attention. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., Pp. 9198–9208.
Borgwardt, K. M. and H. Kriegel
2005. Shortest-path kernels on graphs. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), 27-30 November 2005, Houston, Texas, USA, Pp. 74–81. IEEE Computer Society.
Defferrard et al. 
Defferrard, M., X. Bresson, and P. Vandergheynst
2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, Pp. 3837–3845.
Gilmer et al. 
Gilmer, J., S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E.
2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, Pp. 1263–1272.
Hamilton et al. 
Hamilton, W. L., Z. Ying, and J. Leskovec
2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, Pp. 1025–1035.
Ioffe, S. and C. Szegedy
2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, Pp. 448–456.
Ivanov and Burnaev 
Ivanov, S. and E. Burnaev
2018. Anonymous walk embeddings. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Pp. 2191–2200.
Kersting et al. 
Kersting, K., N. M. Kriege, C. Morris, P. Mutzel, and
2016. Benchmark data sets for graph kernels.
Kingma, D. P. and J. Ba
2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Kipf and Welling 
Kipf, T. N. and M. Welling
2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
Lee et al. 
Lee, J. B., R. A. Rossi, and X. Kong
2018. Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, Pp. 1666–1674.
Morris et al. 
Morris, C., M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and
2019. Weisfeiler and leman go neural: Higher-order graph neural networks. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019., Pp. 4602–4609.
Niepert et al. 
Niepert, M., M. Ahmed, and K. Kutzkov
2016. Learning convolutional neural networks for graphs. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, Pp. 2014–2023.
Shervashidze et al. 
Shervashidze, N., P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M.
2011. Weisfeiler-lehman graph kernels. J. Mach. Learn. Res., 12:2539–2561.
Shervashidze et al. 
Shervashidze, N., S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. M.
2009. Efficient graphlet kernels for large graph comparison. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, Pp. 488–495.
Srivastava et al. 
Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and
2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.
Vaswani et al. 
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin
2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, Pp. 5998–6008.
Velickovic et al. 
Velickovic, P., G. Cucurull, A. Casanova, A. Romero, P. Liò, and
2018. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
Weisfeiler, B. and A. A. Lehman
1968. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16.
Xie and Grossman 
Xie, T. and J. C. Grossman
2018. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical Review Letters, 120(14):145301.
Xu et al. 
Xu, K., W. Hu, J. Leskovec, and S. Jegelka
2019. How powerful are graph neural networks? In International Conference on Learning Representations.
Xu et al. 
Xu, K., C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and
2018. Representation learning on graphs with jumping knowledge networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Pp. 5449–5458.
Ying et al. 
Ying, Z., J. You, C. Morris, X. Ren, W. L. Hamilton, and
2018. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., Pp. 4805–4815.
Zhang et al. [2018a]
Zhang, J., X. Shi, J. Xie, H. Ma, I. King, and
2018a. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. Pp. 339–349.
Zhang et al. [2018b]
Zhang, M., Z. Cui, M. Neumann, and Y. Chen
2018b. An end-to-end deep learning architecture for graph classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, Pp. 4438–4445.
Zhang et al. [2018c]
Zhang, Z., M. Wang, Y. Xiang, Y. Huang, and
2018c. Retgk: Graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., Pp. 3968–3978.