1 Introduction
Over the last decade, Artificial Intelligence in general, and deep learning in particular, have been the focus of intensive research, gathered media attention, and led to debates on their impact in both academia and industry [marcus2020, raghavan19]. The recent AI Debate in Montreal between Yoshua Bengio and Gary Marcus [marcus2020], and the AAAI 2020 fireside conversation with Nobel Laureate Daniel Kahneman and the 2018 Turing Award winners and deep learning pioneers Geoff Hinton, Yoshua Bengio and Yann LeCun, have led to new perspectives on the future of AI. It has now been argued that if one aims to build richer AI systems, i.e. semantically sound, explainable and reliable ones, one has to add a sound reasoning layer to deep learning. Kahneman made this point clear when he stated at AAAI 2020 that “…so far as I’m concerned, System 1 certainly knows language… System 2… does involve certain manipulation of symbols.” [fireside2020]. Kahneman’s comments address recent parallels made by AI researchers between “Thinking, Fast and Slow” and the so-called “AI’s systems 1 and 2”, which could, in principle, be modelled by deep learning and symbolic reasoning, respectively.[1]

[1] “Thinking, Fast and Slow”, by Daniel Kahneman (New York: FSG, 2011), describes the author’s “…current understanding of judgment and decision making, which has been shaped by psychological discoveries of recent decades.”
In this paper, we present a survey and relate recent research results on: (1) Neural-Symbolic Computing, by summarizing the main approaches to rich knowledge representation and reasoning within deep learning, and (2) the approach pioneered by the authors and others of Graph Neural Networks (GNNs) for learning and reasoning about problems that require relational structures or symbolic learning. Although recent papers have surveyed GNNs, including [battaglia2018relational] and [wu2019comprehensive], they have not focused on the relationship between GNNs and neural-symbolic computing (NSC). [bengio2018tourdhorizon] also touches on topics related to some we discuss here, in particular meta-transfer learning. Recent surveys in neural-symbolic computing [JAL19, Joe19] have not explored the recent, highly relevant applications of GNNs in symbolic and relational learning, or the relationship between neural-symbolic computing and graph neural networks.

Our Contribution: As mentioned above, recent work has surveyed graph neural networks and neural-symbolic computing, but to the best of our knowledge no survey has reviewed and analysed the recent results on the specific relationship between GNNs and NSC. We also outline promising directions for research and applications combining GNNs and NSC from the perspective of symbolic reasoning tasks. The above-referenced surveys on GNNs, although comprehensive, all describe other application domains.
The remainder of the paper is organized as follows. In Section 2, we present an overview and taxonomy of approaches to neural-symbolic computing. In Section 3, we discuss the main GNN models and their relationship to neural-symbolic computing. We then outline the main GNN architectures and their use in several relational and symbolic reasoning tasks. Finally, we conclude and point out directions for further research.
2 Neural-Symbolic Computing Taxonomy
At this year’s Robert S. Engelmore Memorial Lecture, at the AAAI Conference on Artificial Intelligence, New York, February 10th, 2020, Henry Kautz introduced a taxonomy for neural-symbolic computing as part of a talk entitled The Third AI Summer. Six types of neural-symbolic integration are outlined: 1. symbolic Neuro symbolic, 2. Symbolic[Neuro], 3. Neuro;Symbolic, 4. Neuro:Symbolic→Neuro, 5. Neuro_{Symbolic}, and 6. Neuro[Symbolic].
The origin of Graph Neural Networks [scarselli2008graph] can be traced back to neural-symbolic computing in that both sought to enrich the vector representations in the inputs of neural networks, first by accepting tree structures and then graphs more generally. In this sense, according to Kautz’s taxonomy, GNNs are a type 1 neural-symbolic system. GNNs [battaglia2018relational] were recently combined with convolutional networks in novel ways which have produced impressive results on data efficiency. In parallel, neural-symbolic computing has focused much effort on the learning of adequate embeddings for the purpose of symbolic computation. This branch of neural-symbolic computing, which includes Logic Tensor Networks [LTN] and Tensor Product Representations [Smolensky], has been called in [JAL19] tensorization methods, and will be discussed in more detail in the next section. These have been classified by Henry Kautz as type 5 neural-symbolic systems, as also discussed in what follows. A natural point of contact between GNNs and neural-symbolic computing is therefore the provision of rich embeddings and attention mechanisms towards structured reasoning and efficient learning.
Type 1 neural-symbolic integration is standard deep learning, which some may argue is a stretch to call neural-symbolic, but which is included here to note that the input and output of a neural network can be made of symbols, e.g. in the case of language translation or question answering applications. Type 2 are hybrid systems such as DeepMind’s AlphaGo and other systems where the core neural network is loosely coupled with a symbolic problem solver such as Monte Carlo tree search. Type 3 is also a hybrid system whereby a neural network focusing on one task (e.g. object detection) interacts via input/output with a symbolic system specialising in a complementary task (e.g. query answering). Examples include the neuro-symbolic concept learner [Mao_2019] and DeepProbLog [Robin_2018]. In a type 4 neural-symbolic system, symbolic knowledge is compiled into the training set of a neural network. Kautz offers [Lample2020Deep] as an example. Here, we would also include other tightly-coupled neural-symbolic systems where various forms of symbolic knowledge, not restricted to if-then rules only, can be translated into the initial architecture and set of weights of a neural network [garcez_book2], in some cases with guarantees of correctness. Type 5 are those tightly-coupled neural-symbolic systems where a symbolic logic rule is mapped onto a distributed representation (an embedding) and acts as a soft constraint (a regularizer) on the network’s loss function. Examples of these are [Smolensky] and [LTN]. Such systems are referred to as tensorization in the NSC survey [JAL19]. Finally, a type 6 system should be capable, according to Kautz, of true symbolic reasoning inside a neural engine. It is what one could refer to as a fully-integrated system. Early work in neural-symbolic computing has achieved this (see [garcez_book2] for a historical overview), and some type 4 systems are also capable of it [garcez_book2], but in a localist rather than a distributed architecture and using much simpler forms of embedding than type 5 systems. Kautz adds that a type 6 system should be capable of combinatorial reasoning and of using an attention schema to achieve it effectively, of which there are currently no concrete examples. This resonates with the recent proposal outlined by Yoshua Bengio during the AI debate of December 2019.
As concerns the theory of neural-symbolic computing, the study of type 6 systems is highly relevant. In practical terms, a tension exists between effective learning and sound reasoning, which may prescribe the use of a more hybrid approach (types 3 to 5) or variations thereof, such as the use of attention with tensorization. Orthogonal to the above taxonomy, but mostly associated so far with type 4, is the study of the limits of reasoning within neural networks w.r.t. full first-order, higher-order and non-classical logic theorem proving. In this paper, as we revisit the use of rich logic embeddings in type 5 systems, notably Logic Tensor Networks [LTN], alongside the use of attention mechanisms and convolutions in Graph Neural Networks, we seek to propose a research agenda and specific applications of symbolic reasoning and statistical learning towards the sound development of type 6 systems.
3 Graph Neural Networks Meet Neural-Symbolic Computing
One of the key concepts in machine learning is that of priors or inductive biases: the set of assumptions that a learner uses to compute predictions on test data. In the context of deep learning, the design of neural building blocks which enforce strong priors has been a major source of breakthroughs. For instance, the priors obtained through feedforward layers encourage the learner to combine features additively, while the ones obtained through dropout discourage it from overfitting and the ones obtained through multi-task learning encourage it to prefer sets of parameters that explain more than one task. One of the most influential neural building blocks, having helped pave the way for the deep learning revolution, is the convolutional layer. Convolutional architectures are successful for tasks defined over Euclidean signals because they enforce equivariance to spatial translation. This is a useful property to have when learning representations for objects regardless of their position in a scene. Analogously, recurrent layers enforce equivariance in time, which is useful for learning over sequential data. Recently, attention mechanisms, through the advent of Transformer networks, have advanced the state of the art in many sequential tasks, notably in natural language processing [devlin2018bert] and symbolic reasoning tasks such as solving math equations and integrals [Lample2020Deep].[2] Attention encourages the learner to combine representations additively while also enforcing permutation invariance. All three architectures take advantage of sparse connectivity, another important design concept in deep learning which is key to enabling the training of larger models. Sparse connectivity and neural building blocks with strong priors usually go hand in hand, as the latter leverage symmetries in the input space to cut down parameters through invariance to different types of transformations.

[2] It is advisable to read [Lample2020Deep] alongside a critique of its limitations [ernie].
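To make these properties concrete, the following minimal NumPy sketch (a hypothetical single-head self-attention with identity projections; the function names are ours, not from any cited system) shows that attention combines representations additively and respects permutations of the input set:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head dot-product self-attention with identity projections,
    applied to a set of n vectors stacked as the rows of X (n x d)."""
    scores = X @ X.T / np.sqrt(X.shape[1])  # pairwise dot-product weights
    return softmax(scores, axis=-1) @ X     # weighted (additive) combination

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
perm = rng.permutation(5)

# Permuting the input set permutes the outputs in exactly the same way:
# self-attention is permutation-equivariant, so a permutation-invariant
# readout (e.g. summing the rows) yields a permutation-invariant model.
assert np.allclose(self_attention(X)[perm], self_attention(X[perm]))
```

The same equivariance argument underlies the graph convolutions discussed below, where the attention weights are replaced by an adjacency mask.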
Neural-symbolic architectures often combine the key design concepts from convolutional networks and attention-based architectures to enforce permutation invariance over the elements of a set or the nodes of a graph (see Figure 1). Some neural-symbolic architectures, such as Pointer Networks [vinyals2015pointer], implement attention directly over a set of inputs, coupled with a decoder which outputs a sequence of “pointers” to the input elements (hence the name). Note that both formalisations are defined over set inputs rather than sequential ones.
3.1 Logic Tensor Networks
Tensorisation is a class of approaches that embeds first-order logic symbols such as constants, facts and rules into real-valued tensors. Normally, constants are represented as one-hot vectors (first-order tensors). Predicates and functions are matrices (second-order tensors) or higher-order tensors.
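As a toy illustration of these representations (a deliberately simplified sketch with made-up constants and a made-up predicate, not LTN’s actual formulation):

```python
import numpy as np

# Hypothetical toy domain: three constants and one binary predicate.
constants = ["alice", "bob", "carol"]
onehot = {c: np.eye(len(constants))[i] for i, c in enumerate(constants)}

# A binary predicate is a second-order tensor (matrix): entry [i, j]
# scores the truth of predicate(constants[i], constants[j]).
parent_of = np.zeros((3, 3))
parent_of[0, 1] = 1.0  # assert the fact parent_of(alice, bob)

def score(pred, a, b):
    """Truth score of a ground fact as a bilinear form over embeddings."""
    return float(onehot[a] @ pred @ onehot[b])

assert score(parent_of, "alice", "bob") == 1.0  # known fact
assert score(parent_of, "bob", "alice") == 0.0  # not asserted
```

In actual tensorization systems the predicate tensors are learned rather than hand-set, and the one-hot constants are replaced by dense learned embeddings.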
In early work, embedding techniques were proposed to transform symbolic representations into vector spaces where reasoning can be done through matrix computation [Bordes_2011, LTN, Santoro_2017]. Training embedding systems can be carried out as distance learning using backpropagation. Most research in this direction focuses on representing relational predicates in a neural network. This is known as “relational embedding” [Bordes_2011, Santoro_2017, Ilya_2008]. For the representation of more complex logical structures, i.e. first-order logic formulas, a system named Logic Tensor Network (LTN) [LTN] was proposed by extending Neural Tensor Networks (NTN), a state-of-the-art relational embedding method. Related ideas are discussed formally in the context of constraint-based learning and reasoning [JAL19]. Recent research in first-order logic programs has successfully exploited the advantages of distributed representations of logic symbols for efficient reasoning, inductive programming [Evans_18] and differentiable theorem proving [Rocktaschel_2016].

3.2 Pointer Networks
The Pointer Network (PN) formalisation [vinyals2015pointer] is a neural architecture meant for computing a sequence of pointers over the elements of an input set. PNs implement a simple modification over the traditional seq2seq model, augmenting it with a simplified variant of the attention mechanism whose outputs are interpreted as “pointers” to the input elements.
Traditional seq2seq models implement an encoder-decoder architecture in which the elements of the input sequence are consumed in order and used to update the encoder’s hidden state at each step. Finally, a decoder consumes the encoder’s hidden state and is used to yield a sequence of outputs, one at a time. It is known that seq2seq models tend to exhibit improved performance when augmented with an attention mechanism, a phenomenon notable in Natural Language Processing (NLP) [devlin2018bert]. Traditional models, however, yield sequences of outputs over a fixed-length dictionary (for instance a dictionary of tokens for language models), which is not useful for tasks whose output is defined over the input set and which hence require a variable-length dictionary.
PNs tackle this problem by encoding the input set P = {P_1, …, P_n} with a traditional encoding architecture and, at each decoding step i, computing a probability distribution over the set of input indices via a softmax over an attention layer parameterized by matrices W_1, W_2 and a vector v, feeding on the decoder state d_i and the encoder states e_1, …, e_n:

u^i_j = v^T tanh(W_1 e_j + W_2 d_i),    j ∈ {1, …, n}
p(C_i | C_1, …, C_{i−1}, P) = softmax(u^i)    (2)
The output pointers can then be used to compute loss functions over combinatorial optimization problems. In the original paper, the authors define a PN to solve the Traveling Salesperson Problem (TSP), in which a beam search procedure is used to select cities given the probability distributions computed at each step, and finally a loss function can be computed for the output tour by adding the corresponding city distances.
Given their discrete nature, PNs are naturally suitable for many combinatorial problems (the original paper evaluates PNs on the Traveling Salesperson, Delaunay Triangulation and Convex Hull problems). Unfortunately, even though PNs can solve problems over sets, they cannot be directly applied to general (non-complete) graphs.
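A single decoding step of this attention scheme can be sketched as follows (an assumed minimal NumPy rendering of the pointer attention described above; the dimensions and variable names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_step(encoder_states, decoder_state, W1, W2, v):
    """One decoding step of pointer-network-style attention: scores each
    of the n input elements and returns a distribution over input indices."""
    # u_j = v^T tanh(W1 e_j + W2 d) for every encoder state e_j
    u = np.tanh(encoder_states @ W1.T + decoder_state @ W2.T) @ v
    return softmax(u)  # a "pointer": a distribution over the n inputs

rng = np.random.default_rng(1)
n, d = 6, 4                   # hypothetical set size and hidden size
E = rng.normal(size=(n, d))   # encoder states, one per input element
d_state = rng.normal(size=d)  # current decoder state
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)

p = pointer_step(E, d_state, W1, W2, v)
assert p.shape == (n,) and np.isclose(p.sum(), 1.0)
```

At inference time one would select (or beam-search over) the highest-probability index at each step and feed the chosen element back into the decoder.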
3.3 Convolutions as Self-attention
The core building block of models in the graph neural network family is the graph convolution operation, a neural building block that enables one to perform learning over graph inputs. Empowering deep learning architectures with the capacity of feeding on graph-based data is particularly suitable for neural-symbolic reasoning, as symbolic expressions can be easily represented with graphs (see Figure 2). Furthermore, graph representations have useful properties such as permutation invariance and flexibility for generalization over the input size (models in the graph neural network family can be fed with graphs regardless of their number of vertices). We argue that graph convolutions can be seen as a variation of the more well-known attention mechanism. A graph convolution is essentially an attention layer with two key differences:

There is no dot-product for computing weights: encodings are simply added together with unit weights (the Graph Attention Network (GAT), however, generalizes graph convolutions with dot-product attention [velivckovic2017graph]).

The sum is masked with an adjacency mask; in other words, the graph convolution generalizes attention to non-complete graphs.
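These two differences can be sketched as follows (a hypothetical minimal NumPy example; real graph convolutions also apply learned weight matrices, normalization and non-linearities, omitted here for clarity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_aggregate(H):
    """Dot-product attention: every node attends to every node, with
    weights computed from the encodings (implicitly a complete graph)."""
    return softmax(H @ H.T / np.sqrt(H.shape[1])) @ H

def graph_conv_aggregate(H, A):
    """Graph-convolution-style aggregation: unit weights, masked by the
    adjacency matrix A so that nodes only see their neighbors."""
    return A @ H  # plain masked sum; no dot-product weighting

H = np.arange(12, dtype=float).reshape(4, 3)  # toy node encodings
A = np.array([[0, 1, 0, 0],                   # a path graph 0-1-2-3
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

agg = graph_conv_aggregate(H, A)
# Node 0's aggregate is just node 1's encoding (its only neighbor):
assert np.allclose(agg[0], H[1])
```

Setting A to the all-ones matrix and reinstating the dot-product weights recovers ordinary attention, which is the sense in which graph convolutions are masked attention.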
All models in the graph neural network family learn continuous representations for graphs by embedding nodes into hyper-dimensional spaces, an insight motivated by graph embedding algorithms. A graph embedding corresponds to a function mapping from the set of vertices of a graph to n-dimensional vectors. In the context of graph neural networks, we are interested in learning the parameters of such a function: that is, a parameterized function over the set of graphs whose outputs are mappings from vertices to n-dimensional vectors. In other words, graph neural networks learn functions to encode vertices in a generalized way. Note that since the output from a GNN is itself a function, there are no limitations on the number of vertices in the input graph. This useful property stems from the modular architecture of GNNs, which will be discussed at length in the sequel. We argue that this should be interesting to explore in the context of neural-symbolic computing for the representation and manipulation of variables within neural networks.
Generally, instead of synthesizing a vertex embedding function from the ground up, GNNs choose an initial, simpler vertex embedding, such as mapping each vertex to the same (learned) vector representation or sampling vectors from a multivariate normal distribution, and then learn to refine this representation by iteratively updating the representations of all vertices. The refinement process, in which each vertex aggregates information from its direct neighbors to update its own embedding, is at the core of how GNNs learn properties over graphs. Over many refinement steps, vertices can aggregate structural information about progressively larger reachable subsets of the input graph. However, we rely on a well-suited transformation at each step to enable vertices to make use of this structural information to solve problems over graphs. The graph convolution layer, described next in Section 3.4, implements such a transformation.
3.4 Graph Convolutional Networks
Graph convolutions are defined in analogy to convolutional layers over Euclidean data. Both architectures compute weighted sums over a neighborhood. For CNNs, this neighborhood is the well-known 9-connected or 25-connected neighborhood defined over pixels. One can think of the set of pixels of an image as a graph with a grid topology in which each vertex is associated with a vector representation corresponding to the Red/Green/Blue channels. The internal activations of a CNN can also be thought of as graphs with grid topologies, but the vector representations for each pixel are generally embedded in spaces of higher dimensionality (corresponding to the number of convolutional kernels learned at each layer).
In this context, Graph Convolutional Networks (GCNs) can be thought of as a generalization of CNNs to non-grid topologies. Generalizing CNNs this way is tricky because one cannot rely any more on learning 3×3 or 5×5 kernels, for two reasons:

In grid topologies, pixels are embedded in 2-dimensional Euclidean space, which enables one to learn a specific weight for each neighbor on the basis of its relative position (left, right, central, top-right, etc.). This is not true for general graphs, and hence weights tied to relative positions do not always have a clear interpretation.

In grid topologies each vertex has a fixed number of neighbors, which enables weight sharing, but there is no such constraint for general graphs. Thus we cannot hope to learn a specific weight for each neighbor, as the required number of such weights varies with the input graph.
GCNs tackle this problem in the following way: instead of learning kernels corresponding to matrices of weights, they learn transformations for vector representations (embeddings) of graph vertices. Concretely, given a graph G = (V, E) and a matrix H^{(k)} of vertex representations (i.e. h_v^{(k)} is the vector representation of vertex v at the k-th layer), a GCN computes the representation of vertex v in the next layer as:

h_v^{(k+1)} = σ( ∑_{u ∈ N(v) ∪ {v}} (1 / √(d̂_u d̂_v)) W^{(k)} h_u^{(k)} )    (4)

In other words, we linearly transform the vector representation of each neighbor u by multiplying it with a learned matrix of weights W^{(k)}, normalize it by the square roots of the degrees d̂_u, d̂_v of both vertices, aggregate all results additively, and finally apply a non-linearity σ. Note that W^{(k)} denotes the learned weight matrix for GCN layer k; in general one will stack several different GCN layers together and hence learn one such matrix per layer. Also note that one iterates over an extended neighborhood N(v) ∪ {v}, which includes v itself. This is done to prevent “forgetting” the representation of the vertex being updated. Equation 4 can be summarized in matrix form as H^{(k+1)} = σ(D̂^{-1/2} Â D̂^{-1/2} H^{(k)} W^{(k)}), where Â = A + I is the adjacency matrix plus self-loops (I is the identity matrix) and D̂ is the degree matrix of Â.

3.5 Graph Neural Network Model
Although GCNs are conceptually simpler, the graph neural network model predates them by almost a decade, having been originally proposed by [scarselli2008graph]. The model is similar to GCNs, with two key differences:

One does not stack multiple independent layers as with GCNs. A single parameterized function is iterated many times, in analogy to recurrent neural networks, until convergence.

The transformations applied to neighbor vertex representations are not necessarily linear, and can be implemented by deep neural networks (e.g. by a multilayer perceptron).
Concretely, the graph neural network model defines two parameterized functions: the transition function f_w and the output function g_w. In analogy to a graph convolution layer, the transition function defines a rule for updating vertex representations by aggregating transformations over the representations of neighbor vertices. The representation h_v^{(t)} for vertex v at time t is computed as:

h_v^{(t)} = ∑_{u ∈ N(v)} f_w(ℓ_v, ℓ_{(v,u)}, ℓ_u, h_u^{(t−1)})    (5)

where ℓ_v and ℓ_u are the labels of nodes v and u, ℓ_{(v,u)} is the label of edge (v,u), and f_w maps into the space of vertex representations while g_w maps into the output space. The model is defined over labelled graphs, but can still be implemented for unlabelled ones by suppressing the labels from the transition function. After a certain number of iterations, one should expect that the vertex embeddings are enriched with structural information about the input graph. At this point, the output function g_w can be used to compute an output for each vertex, given its final representation: o_v = g_w(h_v).
In other words, the output at the end of the process is a set of vectors {o_v | v ∈ V}. This is useful for node classification tasks, in which one can make the output dimension equal to the number of node classes and enforce o_v to encode a probability distribution by incorporating a softmax layer into the output function g_w. If one would like to learn a function over the entire graph instead of over individual vertices, there are many possibilities, one of which is to compute the output on an aggregation over all final vertex representations, for instance their sum.

3.6 Message-passing Neural Network
Message-passing neural networks implement a slight modification of the original GNN model: they define a specialized update function U which updates the representation of vertex v given its current representation and an aggregation over transformations of its neighbors’ embeddings (which are referred to as “messages”, hence the name message-passing neural networks), for example:

h_v^{(t)} = U( h_v^{(t−1)}, ∑_{u ∈ N(v)} M(h_u^{(t−1)}) )    (6)

Also, the update procedure is carried out over a fixed number of steps, and it is usual to implement the update function U using some type of recurrent unit, such as Long Short-Term Memory (LSTM) cells [selsam2019neurosat] or Gated Recurrent Units (GRU).
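A minimal NumPy sketch of this message-passing scheme (with a plain tanh update standing in for the learned GRU/LSTM update, and illustrative weight matrices):

```python
import numpy as np

def mpnn(H, A, W_msg, W_upd, steps=3):
    """Sketch of message passing: at each step, every vertex aggregates
    transformed neighbor embeddings ("messages") and feeds them, together
    with its current embedding, to an update function. A learned recurrent
    cell (GRU/LSTM) is often used as the update; a simple tanh update
    stands in for it here."""
    for _ in range(steps):            # fixed number of message-passing steps
        M = A @ (H @ W_msg.T)         # sum of messages from neighbors
        H = np.tanh(H @ W_upd.T + M)  # update(h_v, aggregated messages)
    return H

rng = np.random.default_rng(2)
n, d = 5, 8
H0 = rng.normal(size=(n, d))          # initial vertex embeddings
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)                # undirected graph
np.fill_diagonal(A, 0)                # no self-loops
W_msg = rng.normal(size=(d, d))
W_upd = rng.normal(size=(d, d))

H = mpnn(H0, A, W_msg, W_upd)
assert H.shape == (n, d)
```

After T steps, each vertex embedding has aggregated information from its T-hop neighborhood, which is the structural enrichment described above.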
3.7 Graph Attention Networks
Graph Attention Networks (GAT) [velivckovic2017graph] augment models in the graph neural network family with an attention mechanism enabling vertices to weigh neighbor representations during their aggregation. As with other types of attention, a parameterized function is used to compute the weights dynamically, which enables the model to learn to weigh representations wisely.
The goal of the GAT is to compute a coefficient α_{vu} for each neighbor u of a given vertex v, so that the aggregation in Equation 5 becomes:

h_v^{(t)} = ∑_{u ∈ N(v)} α_{vu} f_w(ℓ_v, ℓ_{(v,u)}, ℓ_u, h_u^{(t−1)})    (7)

To compute α_{vu}, the GAT introduces a weight matrix W, which is used to multiply the vertex embeddings of v and u; the transformed embeddings are concatenated and multiplied by a parameterized weight vector a. Finally, a LeakyReLU non-linearity is applied to this computation and a softmax over the set of neighbors N(v) is applied over the exponential of the result, yielding:

α_{vu} = exp(LeakyReLU(a^T [W h_v ∥ W h_u])) / ∑_{u′ ∈ N(v)} exp(LeakyReLU(a^T [W h_v ∥ W h_{u′}]))
The GAT is known to outperform typical GCN architectures on node classification tasks, as demonstrated in the original paper [velivckovic2017graph].
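The coefficient computation can be sketched as follows (a single-head, loop-based NumPy rendering for clarity; practical implementations are vectorized and multi-head):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_coefficients(H, A, W, a):
    """Attention coefficients of a single-head GAT layer: concatenate the
    transformed embeddings of v and each neighbor u, score the pair with
    vector a, apply LeakyReLU, then softmax over v's neighborhood only."""
    Z = H @ W.T                          # transformed vertex embeddings
    n = len(H)
    alpha = np.zeros((n, n))
    for v in range(n):
        nbrs = np.flatnonzero(A[v])      # the neighborhood N(v)
        e = leaky_relu(np.array([a @ np.concatenate([Z[v], Z[u]])
                                 for u in nbrs]))
        e = np.exp(e - e.max())          # numerically stable softmax
        alpha[v, nbrs] = e / e.sum()
    return alpha

rng = np.random.default_rng(3)
n, d, dh = 4, 6, 5
H = rng.normal(size=(n, d))
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = rng.normal(size=(dh, d))
a = rng.normal(size=2 * dh)

alpha = gat_coefficients(H, A, W, a)
# Coefficients over each vertex's neighborhood sum to one:
assert np.allclose(alpha.sum(axis=1), 1.0)
```

Restricting the softmax to the neighborhood is precisely what makes the GAT a masked, dot-product-weighted generalization of the graph convolution discussed in Section 3.3.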
4 Perspectives and Applications of GNNs to Neural-Symbolic Computing
In this paper, we have seen that GNNs endowed with attention mechanisms are a promising direction of research towards the provision of rich reasoning and learning in type 6 neural-symbolic systems. Future work includes, of course, application to and systematic evaluation on relevant specific tasks and data sets. These include what John McCarthy described as drosophila tasks for Computer Science: basic problems that can illustrate the value of a computational model.
Examples in the case of GNNs and NSC could be: (1) extrapolation of a learned classification of graphs as Hamiltonian to graphs of arbitrary size; (2) reasoning about a learned graph structure to generalise beyond the distribution of the training data; (3) reasoning about relations to make sense of handwritten MNIST digits and non-digits; (4) using an adequate self-attention mechanism to make combinatorial reasoning computationally efficient. This last task relates to satisfiability, including work on using GNNs to solve the Travelling Salesperson Problem. The other tasks are related to meta-transfer learning across domains, extrapolation and causality. In terms of domains of application, the following are relevant.
4.1 Relational Learning and Reasoning
Models in the GNN family have been successfully applied to a number of relational reasoning tasks. Despite the success of convolutional neural networks, visual scene understanding is still out of reach for pure CNN models, and is hence fertile ground for GNN-based models. Hybrid CNN + GNN models in particular have been very successful in these tasks, having been applied to understanding human-object interactions, localising objects, and challenging visual question answering problems [Santoro_2017]. Relational reasoning has also been applied to physics, with models for extracting objects and relations in an unsupervised fashion [steenkiste2018relational], as well as graph neural networks coupled with differentiable ODE solvers, which have been used to learn the Hamiltonian dynamics of physical systems given their interactions modelled as a dynamic graph [greydanus2019hamiltonian]. The application of neural-symbolic models to the life sciences is very promising, as graphs are natural representations for molecules, including proteins. In this context, [STOKES2020688] generated the first machine-learning-discovered antibiotic (“halicin”) by training a GNN to predict the probability that a given input molecule has a growth-inhibition effect on the bacterium E. coli and using it to rank randomly-generated molecules. Protein Structure Prediction (PSP), which is concerned with predicting the three-dimensional structure of a protein given its molecular description, is another promising problem for graph-based and neural-symbolic models such as DeepMind’s AlphaFold and its variations [wei2019protein].
In natural language processing, tasks are usually defined over sequential data, but modeling textual data with graphs offers a series of advantages. Several approaches have defined graph neural networks over graphs of text co-occurrences, showing that these architectures improve upon the state of the art for seq2seq models [yao2019graph]. As previously mentioned, attention mechanisms, which can be seen as a variation of models in the GNN family, have enabled substantial improvements over the state of the art in several NLP tasks through transfer learning over pre-trained transformer language models [devlin2018bert]. The extent to which language models pre-trained over huge amounts of data can perform language understanding is, however, substantially debated, as pointed out by both Marcus [marcus2020] and Kahneman [fireside2020]. Graph-based neural network models have also found a fertile field of application in software engineering: due to the structured and unambiguous nature of code, it can be represented naturally with graphs which are derived unambiguously via parsing. Several works have utilised GNNs to perform analysis over graph representations of programs and obtained significant results. More specifically, Microsoft’s “Deep Program Understanding” research programme has used a GNN variant called Gated Graph Sequence Neural Networks [li2016gated] in a large number of applications, of which some examples are spotting errors, suggesting variable names, code completion [grockschmidt2019gencode], as well as edit representation and automatically applying edits to programs [yin2019edits].
4.2 Combinatorial Optimization and Constraint Satisfaction Problems
Many combinatorial optimization problems are relational in structure and are thus prime application targets for GNN-based models [bengio2018tourdhorizon]. For instance, [khalil2017combinatorial] uses a GNN-like model to embed graphs and uses these embeddings in a heuristic search for the Minimum Vertex Cover (MVC), Maximum Cut and Traveling Salesperson (TSP) problems. Regarding end-to-end models, [kool2019attention] trained a transformer-based Graph Neural Network model to embed TSP answers and extract solutions with an attention-based decoder, obtaining better performance than previous work. [li2018combinatorial] used a GCN as a heuristic for a search algorithm, applying this method to four canonical NP-complete problems, namely Maximal Independent Set, MVC, Maximal Clique, and the Boolean Satisfiability Problem (SAT). Also interesting to note are the end-to-end models in [selsam2019neurosat, prates2019tsp], which use message-passing neural networks to train solvers for the decision variants of the boolean satisfiability, travelling salesperson and graph colouring problems, respectively. This allowed their models to be trained with a single bit of supervision per instance, with [selsam2019neurosat] being able to extract assignments from the trained model and [prates2019tsp] performing a binary search on the prediction probability to estimate the optimal route cost. More recently, [toenshoff2019runcsp] built an end-to-end framework for dealing with (boolean) constraint satisfaction problems in general, extending the previous works and providing comparisons and performance increases, and [abboud2020] proposed a GNN-based architecture which learns to perform approximate DNF counting.

5 Conclusions
We presented a review of the relationship between Graph Neural Network (GNN) models and similar architectures and Neural-Symbolic Computing (NSC). In order to do so, we presented the main recent research results that highlight the potential applications of these related fields, both in foundational and in applied AI and Computer Science problems. The interplay between the two fields is beneficial to several areas, ranging from combinatorial optimization/constraint satisfaction to relational reasoning, which has been of increasing industrial relevance in natural language processing, the life sciences, computer vision and image understanding [raghavan19, marcus2020]. This is largely due to the fact that many learning tasks can be easily and naturally captured using graph representations, which can be seen as a generalization over the traditional sequential (RNN) and grid-based (CNN) representations in the family of deep learning building blocks. Finally, it is worth mentioning that the principled integration of both methodologies (GNNs and Neural-Symbolic Computing) offers a richer alternative for the construction of trustworthy, explainable and robust AI systems, which is clearly an invaluable research endeavor.