At its core, Neuro-Symbolic AI (NeSy) is “the combination of deep learning and symbolic reasoning" [Garcez_Lamb_2020]. The goal of NeSy is to address the weaknesses of both symbolic and sub-symbolic approaches while preserving their strengths (see Figure 1). Thus NeSy promises to deliver a best-of-both-worlds approach which embodies the “two most fundamental aspects of intelligent cognitive behavior: the ability to learn from experience, and the ability to reason from what has been learned" [Garcez_Lamb_2020, Valiant_2003].
Remarkable progress has been made on the learning side, especially in the area of Natural Language Processing (NLP) and in particular with deep learning architectures such as the transformer [vaswani2017attention]. However, these systems display certain intrinsic weaknesses which, some researchers argue, cannot be addressed by deep learning alone; on this view, even the most basic reasoning requires rich representations which enable precise, human-interpretable inference via mathematical logic.
Historically, rivalry between symbolic and connectionist (or sub-symbolic) AI research has stymied collaboration across these fields, although it should be acknowledged that not everyone thought the two were incompatible. As early as 1991, Marvin Minsky presciently asked: “Why is there so much excitement about Neural Networks today, and how is this related to research on Artificial Intelligence? Much has been said, in the popular press, as though these were conflicting activities. This seems exceedingly strange to me, because both are parts of the very same enterprise" [minsky1991logical].
More recently, a discussion between Gary Marcus and Yoshua Bengio at the 2019 Montreal AI Debate prompted some passionate exchanges in AI circles, with Marcus arguing that “expecting a monolithic architecture to handle abstraction and reasoning is unrealistic", while Bengio defended the stance that “sequential reasoning can be performed while staying in a deep learning framework"[AIDebate2019].
Spurred by this discussion and, almost ironically, by the success of deep learning (and, with it, the clarity into its limitations), research into hybrid solutions has seen a dramatic increase (see Figure 2). At the same time, discussion in the AI community has culminated in “violent agreement" [Kautz] that the next phase of AI research will be about “combining neural and symbolic approaches in the sense of NeSy AI [which] is at least a path forward to much stronger AI systems" [Sarker_Zhou_Eberhart_Hitzler_2021]. Much of this discussion centers around the ability (or inability) of deep learning to reason, and in particular, to reason outside of the training distribution. Indeed, at IJCAI 2021, Yoshua Bengio affirmed that “we need a new learning theory to deal with Out-of-Distribution generalization" [IJCAI2021]. Bengio’s talk was titled “System 2 Deep Learning: Higher-Level Cognition, Agency, Out-of-Distribution Generalization and Causality". Here, System 2 refers to the System 1/System 2 (S1/S2) dual-process theory of human reasoning developed by psychologist and Nobel laureate Daniel Kahneman in his 2011 book “Thinking, Fast and Slow" [kahneman2011thinking]. AI researchers have drawn many parallels between the characteristics of sub-symbolic and symbolic AI systems and human reasoning with S1/S2. Broadly speaking, sub-symbolic (neural, deep learning) architectures are said to be akin to the fast, intuitive, often biased and/or logically flawed S1, while the more deliberative, slow, sequential S2 can be thought of as symbolic or logical. But this is not the only theory of human reasoning, as we will discuss later in this paper.
1.2 Reasoning & Language
“Language understanding in the broadest sense of the term, including question answering that requires commonsense reasoning, offers probably the most complete application area of neurosymbolic AI" [Garcez_Lamb_2020]. This makes a lot of intuitive sense from a linguistic perspective. If we accept that language is compositional, with rules and structure, then it should be possible to obtain its meaning via logical processing. Compositionality in language was formalized by Richard Montague in the 1970s, in what is now referred to as Montague grammar. “The key idea is that compositionality requires the existence of a homomorphism between the expressions of a language and the meanings of those expressions" (https://plato.stanford.edu/entries/compositionality/); in other words, there is a direct relationship between syntax and semantics. This is in line with Noam Chomsky’s Universal Grammar, which holds that there is a structure to natural language which is innate and universal to all humans regardless of cultural differences. The challenge lies in representing this structure in a way that both captures the semantics and is computationally efficient.
On the one hand, distributed representations are desirable because they can be efficiently processed by gradient descent (the backbone of deep learning). On the other, the meaning embedded in a distributed representation is difficult if not impossible to decompose. So while a large language model (LLM) may be very good at making certain types of predictions, it is not able to provide an explanation of how it got there. We have also seen that the larger the model (more parameters as well as more training data), the better the predictions. But even as these models grow infeasibly large, they still fail on tasks requiring basic commonsense. The example in Figure 3, given by Marcus and Davis in [Marcus_Davis], is a case in point.
On the other hand, traditional symbolic approaches have also failed to capture the essence of human reasoning. We do not need a scholar to confirm that everyday commonsense reasoning is nothing like rigorous mathematical logic, whose goal is validity. But even when the objective is not commonsense but rather tasks which require precise, deterministic answers, such as expert reasoning or planning, traditional symbolic reasoners are slow, cumbersome, and computationally intractable at scale. Description Logics (DLs), such as those underlying OWL, for example, are used to reason over ontologies and knowledge graphs (KGs) on the Web. However, one must accept a harsh trade-off between expressivity and complexity when choosing a DL flavor. Improving the performance of reasoning over the ontologies and knowledge graphs that power search and information retrieval across the Web is particularly relevant to the Semantic Web community; Hitzler et al. report on current research in this area [Hitzler_Bianchi_Ebrahimi_Sarker_2020].
Several surveys have already been conducted which cover the overall NeSy landscape, going as far back as 2005 and as recently as 2021, so we will not attempt to replicate that here [Garcez_Lamb_2020, Garcez_Gori_Lamb_Serafini_Spranger_Tran_2019, Sarker_Zhou_Eberhart_Hitzler_2021, Bader_Hitzler_2005, Hammer_Hitzler_2007, Garcez_Lamb_Gabbay_2009, Besold_Kuhnberger_2015, Gabrilovich_Guha_McCallum_Murphy_2015, garcez2015neural, besold2017neural, Belle_2020, Lamb_Garcez_Gori_Prates_Avelar_Vardi_2020, von_Rueden_Mayer_Beckh_Georgiev_Giesselbach_Heese_Kirsch_Walczak_Pfrommer_Pick_etal_2021, Yu_Yang_Liu_Wang_2021, Zhang_Chen_Zhang_Ke_Ding_2021, Tsamoura_Hospedales_Michael_2021]. In fact, our understanding of the field is guided by the works of these scholars. For a succinct overview we refer the reader to [Sarker_Zhou_Eberhart_Hitzler_2021], and for a more in-depth analysis we recommend [Garcez_Gori_Lamb_Serafini_Spranger_Tran_2019]. Our aim is to synthesize recent work implementing NeSy in the language domain, to verify whether the goals of NeSy are being realized, and to identify the challenges and future directions. To our knowledge this is the first attempt at this specific task. In the following subsections, we briefly describe each of the following goals, which are similar to the benefits described in [Sarker_Zhou_Eberhart_Hitzler_2021]:
Out-of-distribution (OOD) generalization
Interpretability
Reduced size of training data
Transferability
Reasoning
1.3.1 Out-of-distribution (OOD) Generalization
OOD generalization refers to the ability of a model to extrapolate to phenomena not previously seen in the training data. The lack of OOD generalization in LLMs is often demonstrated by their inability to perform commonsense reasoning, as in the example in Figure 3.
1.3.2 Interpretability
As Machine Learning (ML) and AI become increasingly embedded in daily life, the need to hold ML/AI systems accountable is also growing. This is particularly true in sensitive domains such as healthcare and law, and in business applications such as lending, where bias mitigation and fairness are critical. These concerns, among others, are the province of Explainable AI (XAI). XAI can be broken down into three main categories:
Explainability - the facility for an expert human to see how a model arrived at a prediction or inference - often represented as a set of rules derived from the model. It seeks to answer the question: How does the model work?
Interpretability - the facility for a non-expert human to see how the data used by the model led to a prediction or inference, usually in the form of a cause and effect articulation. It seeks to answer the question: Why did the model come to this conclusion?
Interactivity - the facility for a human to interrogate the model about counterfactuals. It seeks to answer the question: What would happen if the data was different?
In the literature, the term interpretability is sometimes used in place of explainability and vice versa; for our purposes, however, all three categories are subsumed under the general notion of interpretability. XAI is made possible either by invoking an explicit reasoning module post hoc, or by building interpretability into the system to begin with.
1.3.3 Reduced size of training data
SOTA language models utilize massive amounts of data for training. This can cost thousands or even millions of dollars, takes a very long time, and is neither environmentally friendly nor accessible to most researchers or businesses. The ability to learn from less data brings obvious benefits. But apart from the practical implications, there is something innately disappointing in LLMs’ ‘bigger hammer’ approach. Science rewards parsimony and elegance, and NeSy promises to deliver results without the need for such massive scale.
1.3.4 Transferability
Transferability is the ability of a model trained on one domain to perform similarly well in a different domain. This can be particularly valuable when the new domain has very few examples available for training. In such cases we might rely on knowledge transfer, similar to the way a human might rely on abstract reasoning when faced with an unfamiliar situation.
1.3.5 Reasoning
According to Encyclopedia Britannica, “To reason is to draw inferences appropriate to the situation" [Britanica]. Reasoning is not only a goal in its own right, but also the means by which the other goals mentioned above can be achieved. Not only is it one of the most difficult problems in AI, it is one of the most contested. In Section 4.1 we examine the uses of the term reasoning in more depth.
The remainder of this manuscript is structured as follows. Section 2 describes the research methods employed for searching and analysing relevant studies. In Section 3 we analyze the results of the data extraction, how the studies reviewed fit into Henry Kautz’s NeSy taxonomy, and propose a simplified nomenclature for describing Kautz’s NeSy categories. Section 4 discusses various existing implementation challenges. Section 5 presents limitations of the work and future directions for NeSy in NLP, followed by the conclusion in Section 6.
Our review methodology is guided by the principles described in [Kitchenham07guidelinesfor, Pare_Trudel_Jaana_Kitsiou_2015, Page_McKenzie_Bossuyt_Boutron_Hoffmann_Mulrow_Shamseer_Tetzlaff_Akl_Brennan_et_al._2021].
2.1 Research Questions
What are the existing studies on neurosymbolic AI (NeSy) in natural language processing (NLP)?
What are the current applications of NeSy in NLP?
How are symbolic and sub-symbolic techniques integrated and what are the advantages/disadvantages?
What are the challenges for NeSy and how might they be addressed (including existing proposals for future work)?
What areas of NLP might be likely to benefit from the NeSy approach in the future?
2.2 Search Process
We chose Scopus to perform our initial search, as Scopus indexes all the top journals and conferences we were interested in. This obviously excludes some niche publications, and it is possible we missed some relevant studies; however, as our aim is to shed light on the field generally, our assumption is that the top journals are a good representation of the research area as a whole. Since we were looking for studies which combine neural and symbolic approaches, our query consists of combinations of neural and symbolic terms as well as variations of neuro-symbolic terms, listed in Table 1:
|Neural Terms||Symbolic Terms||Neuro-Symbolic Terms|
|deep learning||logic||neuro symbolic|
The initial query was restricted to peer-reviewed English-language journal articles from the last 10 years and conference papers from the last 3 years, which produced 2,412 results. The query and additional details can be found in our GitHub repository (https://github.com/kyleiwaniec/neuro-symbolic-ai-systematic-review).
2.3 Study selection process
We further limited the journal articles to those published by the top 20 publishers as ranked by Scopus’s CiteScore, which is based on the number of citations normalized by the document count over a 4-year window (https://service.elsevier.com/app/answers/detail/a_id/14880/kw/citescore/supporthub/scopus/), and by SJR (SCImago Journal Rank), a measure of prestige inspired by the PageRank algorithm over the citation network (https://service.elsevier.com/app/answers/detail/a_id/14883/supporthub/scopus/related/1/). The union of the two rankings comprised 29 publishers and eliminated 669 articles, for a total of 1,510 journal articles and 232 conference papers for screening. Two researchers independently screened each of the 1,742 studies (articles and conference papers), based on the inclusion/exclusion criteria in Table 2. An overview of the selection process can be seen in Figure 4.
|Inclusion criteria||Exclusion criteria|
|Input format: unstructured or semi-structured text||Input format: not text data, namely: images, speech, tabular data, categorical data, etc.|
|Output format: Any||Application: Theoretical Papers, Position Papers, Surveys (in other words, not implementations)|
|Application: Implementation||The search keywords match, but the actual content does not|
|Language: English||Full text not available (Authors were contacted in these cases)|
The first round of inclusion/exclusion was performed on the titles and abstracts of the 1,742 studies (1,510 articles and 232 papers) from the identification step above. The inclusion criteria at this stage were intentionally broad, as the process itself was meant to be exploratory and to inform the researchers of relevant topics within NeSy.
This unsurprisingly led to some significant researcher disagreement on inclusion, especially since studies need not have been explicitly labeled as neuro-symbolic to be classified as such. Agreement between researchers can be measured using Cohen’s kappa statistic, with values in the range [-1, 1], where 0 represents the expected score had the labels been assigned randomly, -1 indicates complete disagreement, and 1 indicates perfect agreement. Our score at this stage came to a rather low 0.33. Since this measure is not particularly intuitive, we include a Venn diagram of the number of studies included by each researcher - see Figure 5.
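Cohen’s kappa corrects raw agreement for the agreement expected by chance from each rater’s label frequencies. A minimal sketch of the computation (the function and the example labels below are illustrative, not our actual screening data):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / n ** 2
    return (po - pe) / (1 - pe)

# Perfect agreement -> 1.0; chance-level agreement -> 0.0.
print(cohen_kappa([1, 0, 1, 1], [1, 0, 1, 1]))  # 1.0
```

A kappa of 0.33, as in our first screening round, thus indicates agreement only modestly above what random labeling would produce.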
To better understand the disagreement and researcher biases, we performed an analysis of the term frequency-inverse document frequency (TF-IDF) of uni-, bi-, and tri-grams in each of the three areas of the Venn diagram: studies included by researcher 1 only, studies included by researcher 2 only, and studies included by both researchers. We calculated the TF-IDF from the abstracts of each of the three groupings and generated word clouds. At a glance, it appears that researcher 1 chose to include documents where terms related to the symbolic dimension, such as “symbolic" and “rules", and general terms such as “framework" and “data", appeared more frequently, whereas researcher 2 leaned towards terms along the neural dimension such as “deep learning", “neural", and “networks". Given the academic background of each researcher, we reasoned that the discrepancy was due to bias towards each individual’s area of research. In the third grouping, where both researchers agreed on inclusion, a more balanced distribution can be seen, with the terms “symbolic", “artificial", and “neural" carrying similar weight (see Figure 6).
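The n-gram TF-IDF weighting can be sketched in a few lines of standard Python (the toy abstracts below are illustrative; in practice we computed weights over the screened abstracts):

```python
import math
import re
from collections import Counter

def ngrams(text, n_max=3):
    """Uni-, bi-, and tri-grams from lowercased word tokens."""
    toks = re.findall(r"[a-z]+", text.lower())
    return [" ".join(toks[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(toks) - n + 1)]

def tfidf(docs, n_max=3):
    """Per-document TF-IDF weights over n-grams (one dict per document)."""
    counts = [Counter(ngrams(d, n_max)) for d in docs]
    df = Counter(g for c in counts for g in set(c))  # document frequency
    n = len(docs)
    return [{g: (tf / sum(c.values())) * math.log(n / df[g])
             for g, tf in c.items()}
            for c in counts]

weights = tfidf(["symbolic rules framework", "deep neural networks"])
# "symbolic" occurs in only one abstract, so it receives a positive weight;
# a term occurring in every abstract would receive weight 0.
```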
We observed that it was not always clear from the abstract alone whether the sub-symbolic and symbolic methods were integrated in a way that meets the inclusion criteria, which may also have led to disagreement.
To facilitate the next round of review, we kept a shared glossary of symbolic and sub-symbolic concepts as they presented themselves in the literature. We each reviewed all 337 studies again, this time skimming the studies themselves. Any disagreement at this stage was discussed in person with respect to the shared glossary. This process led to the elimination of many studies for a final count of 75 studies marked for the next round of review.
2.4 Quality Assessment
The quality of each study was determined through the use of a nine-item questionnaire. Each question was answered with a binary value, and a study’s quality score was computed as the ratio of positive answers. Studies with a quality score below 50% were excluded.
Is there a clear and measurable research question?
Is the study put into context of other studies and research, and design decisions justified accordingly? (Number of references in the literature review/ introduction)
Is it clearly stated in the study which other algorithms the study’s algorithm(s) have been compared with?
Are the performance metrics used in the study explained and justified?
Is the analysis of the results relevant to the research question?
Does the test evidence support the findings presented?
Is the study algorithm sufficiently documented to be reproducible? (independent researchers arriving at the same results using their own data and methods)
Is code provided?
Are performance metrics provided? (hardware, training time, inference time)
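Scoring against the questionnaire reduces to a simple ratio. A minimal sketch (the function names are ours, and the example answer vector is hypothetical):

```python
def quality_score(answers):
    """Fraction of positive (1) answers on the nine-item questionnaire."""
    return sum(answers) / len(answers)

def include(answers, threshold=0.5):
    """A study is retained only if its quality score is at least 50%."""
    return quality_score(answers) >= threshold

# Hypothetical study: yes to Q1-Q6, no to Q7-Q9.
answers = [1, 1, 1, 1, 1, 1, 0, 0, 0]
print(round(quality_score(answers), 3))  # 0.667
print(include(answers))                  # True
```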
More than 88% of the studies satisfy the requirements listed in Q1 to Q6. However, nearly 82% of the studies fail to provide source code or details about the computing environment, which makes their systems difficult to reproduce. This reduces the overall average quality score to 76.3% - see Figure 7.
Finally, each of the 34 studies selected for inclusion was evaluated and classified, and data extraction was performed for each of the features outlined in Table 3. For acceptable values of individual features see Appendix B. The lists of neural and symbolic terms referenced in the table constitute the glossary items learned from conducting the selection process.
Figure 8(a) shows the breakdown of conference papers vs journal articles, and Figure 8(b) shows the number of studies published each year. As evidenced by the graph, interest in NeSy for NLP has increased significantly since 2019, even more dramatically than the steadier incline of interest in NeSy overall.
|Business application||Real world NLP task of the proposed study|
|Technical application||Type of model output - illustrated in Figure 9|
|Type of learning||Indicates learning method (supervised, unsupervised, etc.)|
|Knowledge representation||One of four categories described in Section 3.2|
|Type of reasoning||Indicates whether knowledge is represented implicitly (embedded) or explicitly (symbolic)|
|Language structure||Indicates whether linguistic structure is leveraged to facilitate reasoning|
|Relational structure||Indicates whether relational structure is leveraged to facilitate reasoning|
|Symbolic terms||List of symbolic techniques used by the models|
|Neural terms||List of neural architectures used by the models|
|Datasets||List of all datasets considered|
|Model description||Describes model architecture schematically|
|Evaluation Metrics||Evaluation metrics reported by the authors|
|Reported score||Model performance reported by the authors|
|Contribution||Novel contribution reported by the authors|
|Key-intake||Short description of the study|
|isNeSy||Indicates whether the authors label their study as Neuro-Symbolic|
|NeSy goals||For each of the goals listed in Section 1, indicates whether the goal is met as reported by the authors|
|Kautz category||List of categories from Kautz’s taxonomy|
|NeSy category||List of categories from the proposed nomenclature|
|Study quality||Percentage of positive answers in the quality assessment questionnaire|
3 Results, Data analysis, Taxonomies
In this section, we perform quantitative data analysis based on the extracted features in Table 3. Each study was labeled with terms from the aforementioned glossary, and each term in the glossary was classified as either symbolic or neural. A by-product of this process is two taxonomies, built bottom-up, of concepts relevant to the set of studies under review. The two taxonomies reflect the definition of NeSy provided earlier: “the combination of deep learning and symbolic reasoning". Thus on the learning side we have neural architectures (described in Section 3.2.1), and on the symbolic reasoning side we have knowledge representation (described in Section 3.2.2). These results are rendered in Table 4, with the addition of color representing a simple metric, or promise score, for each study. The promise score is simply the number of goals reported to have been satisfied by the model(s) in the study.
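The promise score reduces to a count over the goals from Section 1. A minimal sketch (the goal names below are our shorthand for the goals listed in Section 1, and the example study is hypothetical):

```python
# Our shorthand for the NeSy goals described in Section 1.
GOALS = ("ood_generalization", "interpretability",
         "reduced_training_data", "transferability", "reasoning")

def promise_score(goals_met):
    """Number of NeSy goals a study reports meeting."""
    return sum(1 for g in GOALS if goals_met.get(g, False))

def group_rate(studies):
    """Goals met as a share of all possible goals (#goals * #studies)."""
    return sum(promise_score(s) for s in studies) / (len(GOALS) * len(studies))

# Hypothetical study meeting two goals.
study = {"reasoning": True, "reduced_training_data": True}
print(promise_score(study))  # 2
```

The same `group_rate` calculation underlies the aggregate percentages reported for the implicit, explicit, and Level 1 groupings later in this section.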
|Linear Models||SVM||[Diligenti2017143]|
3.1 Exploratory Data Analysis
We plot the relationships between the features extracted from the studies and the goals from Section 1.3, in an effort to identify any correlations between them and, ultimately, to identify patterns leading to higher promise scores.
3.1.1 Business and Technical Applications
A collection of common NLP tasks is shown in Figure 9. The subset of tasks belonging to Natural Language Understanding (NLU) and Natural Language Generation (NLG) are often regarded as more difficult, and presumed to require reasoning. Given that reasoning was one of the keywords used for search, it is not surprising that many studies report reasoning as a characteristic of their model(s). Also unsurprising is the fact that nearly half of the text classification studies (which do not belong to this subset) are not associated with any NeSy goals. The relationship between all tasks, or business applications, and NeSy goals is shown in Figure 10.
The business application largely determines the type of model output, or technical application, as can be seen in the almost one-to-one mapping in Figure 11. The exception is question answering, which has been tackled as both an inference and a classification problem. Question answering is the most frequently occurring task, and is associated mainly with reduced data and, to a much lesser degree, interpretability. On a philosophical level this seems somewhat disappointing, as one would hope that in receiving an answer, one could expect to understand why that answer was given.
For completeness, the number of studies representing the technical applications and most frequently occurring business application is given in Figure 13, while Figure 12 combines business applications, technical applications, and goals.
3.1.2 Type of learning
Machine learning algorithms are classified as supervised, unsupervised, semi-supervised, curriculum, or reinforcement learning, depending on the amount and type of supervision required during training [kang2018machine, bonaccorso2017machine, bengio2009curriculum]. Figure 14 demonstrates that supervised learning outweighs all other approaches.
3.1.3 Implicit vs Explicit Reasoning
How reasoning is performed often depends on the underlying representation and what it facilitates. Sometimes the representations are obtained via explicit rules or logic, but are subsequently transformed into non-decomposable embeddings for learning; any reasoning during the learning process is therefore done implicitly. Studies utilizing Graph Neural Networks (GNNs) [Lemos2020647, Zhou20212015, Huo2019159] are also considered to be reasoning implicitly. The majority of the studies doing implicit reasoning leverage linguistic and/or relational structure to generate those internal representations. These studies meet 21 out of a possible 102 NeSy goals (where 102 = #goals * #studies), or 20.6%. For reasoning to be considered explicit, rules or logic must be applied during or after training. Studies which implement explicit reasoning perform slightly better, meeting 21 out of 72 goals, or 29.2%, and generally require less training data. Additionally, 4 studies implement both implicit and explicit reasoning, at a NeSy promise rate of 40%. Of particular interest in this grouping is Bianchi et al.’s [Bianchi2019161] implementation of Logic Tensor Networks (LTNs), originally proposed by Serafini and Garcez in [Serafini_Garcez_2016]: “LTNs can be used to do after-training reasoning over combinations of axioms which it was not trained on. Since LTNs are based on Neural Networks, they reach similar results while also achieving high explainability due to the fact that they ground first-order logic" [Bianchi2019161]. Also in this grouping, Jiang et al. [Jiang2020] propose a model where embeddings are learned by following logic expressions encoded in Huffman trees that represent deep first-order logic knowledge. Each node of the tree is a logic expression, so the hidden layers are interpretable.
3.1.4 Linguistic and Relational Structure
In the previous section we described how linguistic and relational structures can be leveraged to generate internal representations for the purpose of reasoning. Here we plot the relationships between these structures and other extracted features and their interactions - see Figure 17. Perhaps the most telling chart is the mapping between structures and goals, where nearly half the studies leveraging linguistic structure don’t meet any of the goals. This runs counter to the intuition that language is a natural fit for NeSy. However, to properly test this intuition, future work on more targeted experiments would need to be performed.
3.1.5 Datasets and Benchmarks
Each study in our survey is based on a unique dataset and a variety of metrics. Given that there are nearly as many business applications, or tasks, as there are studies, this is not surprising. As such, it is not possible to compare the performance of the models reviewed. However, this raises an interesting question: how might one design a benchmark for NeSy in the first place? A discussion about benchmarks at the IBM Neuro-Symbolic AI Workshop 2022 (https://video.ibm.com/recorded/131288165) resulted in general agreement that the most important characteristic of a good NeSy benchmark is diversity in the tasks it tackles. Gary Marcus pointed out that current benchmarks can be solved extensionally, meaning they can be “gamed": with enough attempts, a model can become very good at a specific task without solving the fundamental reasoning challenge. Ostensibly, this leads to models which are not able to generalize out of distribution. In contrast, to solve a task intensionally is to demonstrate “understanding" which is transferable to different tasks. This view is controversial, with advocates of purely connectionist approaches arguing that “understanding" is not only ill-defined but also a moving target. So instead of worrying about the semantics of “understanding", the panelists agreed that the way to make benchmarks robust to gaming is to build in enormous variance in the types of tasks they tackle. Taking this a step further, Luis Lamb proposed that instead of designing benchmarks for testing models, we should be designing challenges which encourage people to work on important real-world problems.
3.2 Taxonomies: Neural, Symbolic, & NeSy
We now describe the taxonomies introduced in Section 3 as well as NeSy categories.
3.2.1 Neural Architectures
In the main, the extracted neural terms refer to the neural architecture implemented in a given study. We group these into higher-level categories such as Linear models, Early generation (which includes CNNs), Graphical models, and Sequence-to-Sequence. We also include here Neuro-Symbolic architectures such as Logic Tensor Networks (LTNs) and Recursive Neural Knowledge Networks (RNKNs), because they are amenable to optimization via gradient descent.
We include one study [Skrlj2021989] which does not strictly meet the neural criteria, in the sense that it does not implement gradient descent but rather Neuroevolution (NE). Neuroevolution uses genetic algorithms to modify neural network weights, topologies, or ensembles of networks, taking inspiration from biological nervous systems [miikkulainen:encyclopedia10-ne, Lehman_Miikkulainen_2013]. Neuroevolution is often employed in the service of Reinforcement Learning (RL).
Studies which do not mention any specific models are categorized as Multilayer Perceptron (MLP).
3.2.2 Symbolic Knowledge Representation
The definition we adopted states that NeSy is the combination of deep learning and symbolic reasoning. Our neural taxonomy described above reflects the deep learning component. Since no reasoning can be done without knowledge, we use Knowledge Representation (KR) as a means of categorizing the symbolic reasoning component. Common KR categories include: production rules, logical representation, frames, and semantic networks [davis1993knowledge, bench2014knowledge, levesque1986knowledge, travis1990knowledge].
Production rules - Production rules are condition-action pairs that represent knowledge [shortliffe2012computer].
Logical representation - Logical representation is the study of entailment relations—languages, truth conditions, and rules of inference [levesque1986knowledge].
Frames - Frames are objects which hold entities along with their properties and methods.
Semantic networks - Semantic networks (frame networks) are graphs in which nodes are frames and edges represent the relationships between them.
3.2.3 NeSy Categories
NeSy systems can be categorized according to the nature of the combination of neural and symbolic techniques. At AAAI-20, Henry Kautz presented a six-level taxonomy of Neuro-Symbolic architectures with a brief example of each [Kautz]. While Kautz has not provided any additional information beyond his talk at AAAI-20, several researchers have formed their own interpretations [Sarker_Zhou_Eberhart_Hitzler_2021, Garcez_Lamb_2020, Lamb_Garcez_Gori_Prates_Avelar_Vardi_2020]. We categorized all the reviewed studies according to Kautz’s taxonomy, only to discover that nearly half (N=15) of the studies belong to Level 1, which arguably is not neuro-symbolic but rather “standard-operating-procedure", as Kautz himself put it.
Level 1 symbolic Neuro symbolic is a special case where symbolic knowledge (such as words) is transformed into continuous vector space and thus encoded in the feature embeddings of an otherwise “standard" ML model. We opted to include these studies if the derived input features belong to the set of symbolic knowledge representations described in Section 3.2. One could still argue that this is simply a case of good old-fashioned feature engineering, and not particularly special, but we want to explore the idea that deep learning can perform reasoning, albeit implicitly, if provided with a rich knowledge representation in the pre-processing phase. We classify these studies as Sequential. Evaluating these studies as a group was particularly challenging, as they have very little in common, including different datasets, benchmarks, and business applications. Half of the studies do not mention reasoning at all, and the ones that do are mainly executing rules on candidate solutions output by the neural models post hoc. In aggregate, only 16 out of a total of 75 possible NeSy goals (15 studies * 5 goals), or 21.3%, were met.
Level 2 Symbolic[Neuro] is what we describe as a Nested
architecture, where a symbolic reasoning system is the primary system, with neural components driving certain internal decisions. AlphaGo is the example provided by Kautz: the symbolic system is a Monte Carlo Tree Search, with neural state estimators nominating next states. We found three studies that fit this architecture [Belkebir201643, Chaturvedi2019264, Chen2021328].
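The Nested pattern can be sketched as follows: a deterministic symbolic search loop stays in charge while consulting a stand-in "neural" evaluator for its internal decisions. This is an AlphaGo-style arrangement reduced to greedy one-step lookahead; the value function and state space are invented for illustration.

```python
def neural_value(state):
    # Placeholder for a learned state evaluator; here it simply
    # pretends that state 7 is the best reachable state.
    return -abs(state - 7)

def symbolic_search(state, moves, depth=3):
    """Symbolic control loop: at each step, the search enumerates
    candidate successor states and lets the neural evaluator pick."""
    for _ in range(depth):
        state = max((state + m for m in moves), key=neural_value)
    return state
```

The point of the sketch is the division of labor: the search owns the control flow, and the neural component only scores candidates.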
Level 3 Neuro | Symbolic appears to hold more promise in terms of NeSy goals met. Three of the five studies with a Promise score of 3 or more belong to this category (the remaining two belong to Levels 4 and 5, which we discuss in the next subsection). There are six studies in this category, all but one of which leverage relational structure in some manner. A common theme is the use of graph representations and/or GNNs, which aligns with recent research directions proposed by Garcez et al. in [Lamb_Garcez_Gori_Prates_Avelar_Vardi_2020], as well as Zhang et al. in [Zhang_Chen_Zhang_Ke_Ding_2021]. We call Level 3 Cooperative, as it is conceptually similar to Reinforcement Learning (RL). Here, a neural network focuses on one task (e.g. object detection) and interacts via input/output with a symbolic reasoner specializing in a complementary task (e.g. query answering). Unstructured input is converted into symbolic representations to be solved by the symbolic reasoner, whose errors in turn provide a learning signal for the neural component. The Neuro-symbolic Concept Learner [Mao2019] is an example of Level 3, meeting 4 out of the 5 NeSy goals, where reasoning is performed explicitly in a "symbolic and deterministic manner." Its ability to perform well with reduced data is particularly impressive: "Using only 10% of the training images, our model is able to achieve comparable results with the baselines trained on the full dataset." Similarly, [Yao201842] report perfect performance on small datasets, which they also attribute to the use of explicit and precise reasoning. Both studies display similar limitations: the use of synthetic datasets, and the need for handcrafted logic, a DSL (Domain Specific Language) in the case of [Mao2019], and Image Schemas in [Yao201842].
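The Cooperative loop can be sketched in miniature: a stand-in "perception" module emits scored symbolic facts, a symbolic reasoner answers a query over them, and disagreement with the ground-truth label becomes the training signal fed back to the neural side. All names, facts, and thresholds here are illustrative, not taken from [Mao2019].

```python
def perceive(image):
    # Placeholder neural module: image -> scored symbolic facts.
    return {("obj1", "color", "red"): 0.9, ("obj1", "shape", "cube"): 0.6}

def answer(facts, query, threshold=0.5):
    """Symbolic reasoner: does any object satisfy every query attribute?"""
    kept = {f for f, s in facts.items() if s >= threshold}
    objs = {f[0] for f in kept}
    return any(all((o, a, v) in kept for a, v in query) for o in objs)

def training_signal(facts, query, label):
    # The symbolic component's error is what the neural component learns from.
    return 0.0 if answer(facts, query) == label else 1.0
```

Note that the reasoner itself is deterministic; only the fact scores come from the (here, stubbed) neural component.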
Levels 4 and 5, Neuro: Symbolic → Neuro and Neuro_{Symbolic}
respectively, were originally presented by Kautz under one heading. After his presentation, Kautz modified the slide deck to separate the two: in Level 4, knowledge is compiled into the network weights; in Level 5, it is compiled into the loss function.
Deep Learning For Mathematics [lample2019deep] is an example of Level 4, where both the input and the output of the model are mathematical expressions: given an expression as input, the model performs symbolic differentiation or integration and outputs the resulting expression. The model exploits the tree structure of mathematical expressions, which are serialized and fed into a sequence-to-sequence architecture. This seems like a particularly fitting paradigm for natural language applications, on the basis that structures such as parse trees can be similarly leveraged to output other meaningful structures, for example cause and effect relationships, argument schemes, or rhetorical devices, to name a few.
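The tree-to-sequence step in [lample2019deep] can be sketched by serializing an expression tree into prefix (Polish) notation, which is unambiguous without parentheses and directly consumable by a sequence model; the tuple encoding below is our own simplification.

```python
def to_prefix(tree):
    """Serialize an expression tree to prefix notation.
    A node is either a leaf token (str) or a tuple ('op', left, right)."""
    if isinstance(tree, str):
        return [tree]
    op, left, right = tree
    return [op] + to_prefix(left) + to_prefix(right)

# (x + 2) * x  ->  ['*', '+', 'x', '2', 'x']
expr = ("*", ("+", "x", "2"), "x")
tokens = to_prefix(expr)
```

Because prefix notation fixes operator arity up front, the inverse mapping from token sequence back to tree is also deterministic, which is what makes the seq2seq formulation workable.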
Level 5 comprises Tensor Product Representations (TPRs) [Smolensky_1990], Logic Tensor Networks (LTNs) [Serafini_dAvila_Garcez_2016], and Neural Tensor Networks (NTNs) [Socher_Chen_Manning_Ng_2013], and is more broadly referred to as tensorization, where logic acts as a constraint. In Levels 4 and 5, reasoning can be performed both implicitly, in that it is calculated via gradient descent, and explicitly, post hoc. We have grouped studies belonging to these two categories under the moniker of Compiled systems, of which there are ten.
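The idea of logic acting as a constraint can be sketched with real-valued (fuzzy) semantics for the connectives; LTNs operationalize FOL along these lines, though actual implementations differ in many details. The choices below (product t-norm, probabilistic sum, Goguen implication) are one standard combination among several.

```python
def t_and(a, b):
    """Conjunction via the product t-norm."""
    return a * b

def t_or(a, b):
    """Disjunction via the probabilistic sum (the dual t-conorm)."""
    return a + b - a * b

def t_implies(a, b):
    """Implication via the residuum of the product t-norm (Goguen)."""
    return 1.0 if a <= b else b / a
```

Because truth values live in [0, 1] and the connectives are differentiable almost everywhere, the truth degree of a rule becomes an objective that gradient descent can maximize.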
An example of a Compiled system in our set of studies is proposed in [Jiang2020], which we mentioned in Section 3.1.3. Here, knowledge is encoded in the form of Huffman trees built from triples and logic expressions, in order to jointly learn embeddings and model weights. The first layer of the tree consists of entities, and the second layer consists of relations; higher layers compute logic rules. The root node is the final embedding representing a document (in this case a single health record). Backpropagation is used for optimization, with softmax for calculating class probabilities. The model is intended for medical diagnosis decision support, where interpretability is a highly desirable characteristic, and this model meets that goal.
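A very rough sketch of this kind of tree-structured composition, in the spirit of [Jiang2020] but not reproducing it: entity and relation vectors at the leaves are combined pairwise up the tree, and the root embedding feeds a softmax classifier. The dimensions, initialization, and combine rule are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, CLASSES = 4, 3
W = rng.normal(size=(D, 2 * D))     # shared composition weights
V = rng.normal(size=(CLASSES, D))   # classifier weights at the root

def compose(left, right):
    """Combine two child embeddings into a parent embedding."""
    return np.tanh(W @ np.concatenate([left, right]))

def classify(root):
    """Softmax over class logits computed from the root embedding."""
    logits = V @ root
    e = np.exp(logits - logits.max())
    return e / e.sum()

entity, relation = rng.normal(size=D), rng.normal(size=D)
root = compose(entity, relation)    # one composition step of the tree
probs = classify(root)
```

In the full model the same `compose` step would be applied recursively up the Huffman tree, with gradients flowing back to the leaf embeddings.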
Figure 18 illustrates our mapping from Kautz's levels to our proposed nomenclature. There are no studies in Level 6, nor in the ensemble major category, which we nevertheless include for completeness. Figure 19 shows the number of studies per category, and Figure 20 illustrates the relationship between categories and goals. Table 5 shows the number of studies in each category per goal.
All studies report performance either on par with or above benchmarks, but we cannot compare studies based on performance, as nearly every study uses a different dataset and benchmark, as discussed in Section 3.1.5. Our focus is instead on whether the goals of NeSy are being met. Our Promise Score metric is not necessarily what the studies' authors were optimizing for, or even reporting, especially in studies which have not labeled themselves as NeSy per se. So we want to make it very clear that our analysis is not a judgement of the success of any particular study; rather, we seek to understand whether the hypotheses about NeSy are materializing (namely, that the combination of symbolic and sub-symbolic techniques will fulfill the goals described in Section 1.3). The short answer is that we are not there yet, as can be seen in Figure 21. For a detailed breakdown of each goal and study see Table 6.
In Section 1.3.5 we put forward the hypothesis that reasoning is the means by which the other goals can be achieved. This is not evidenced in the studies we reviewed. Some possible explanations for this finding are: 1) The kind of reasoning required to fulfill the other goals is not the kind being implemented; 2) The approaches are theoretically promising, but the technical solutions need further development. Next we look at each of these possibilities.
4.1 Reasoning Challenges
Twenty-two of the thirty-four studies included for review mention reasoning as a characteristic of their solution, but there is wide variation in how reasoning is described and implemented. To understand language, and given the overwhelming evidence of the fallibility of human reasoning, AI researchers have sought guidance from disciplines such as psychology, cognitive linguistics, neuroscience, and philosophy. The challenge is that there are multiple competing theories of human reasoning and logic, both across and within these disciplines. What we discovered in our review is a blurring of the lines between various types of logic, human reasoning, and mathematical reasoning, as well as counter-productive assumptions about which theory to adopt. For example, drawing inspiration from "how people think", accepting that how people think is flawed, and then attempting to build a model with a logical component, which by definition is rooted in validity, seems counterproductive to us. This logical component need not be binary or even monotonic: the sheer fact that it is implemented in code necessitates a 'valid' outcome, however that may be defined by the particular logic theory. Additionally, the justification of "because that's how people think" is applied inconsistently. Some examples from the studies we reviewed include:
One study describes human reasoning in terms of a dual process of "subsymbolic commonsense" (strongly correlated with associative learning) and "axiomatic" knowledge (predicates and logic formulas) for structured inference.
In [Hussain20181662], humans reason by way of analogy, and commonsense knowledge is represented in ConceptNet, a graphical representation of common concepts and their relationships.
For [Yao201842], human reasoning can be modeled by Image Schemas (IS): schemas made up of logical rules, such as transitivity or inversion, over (Entity1, Relation, Entity2) tuples.
[Es-Sabery202117943] explain their choice of fuzzy logic for "its resemblance to human reasoning and natural language." This is a many-valued approach which attempts to deal with uncertainty.
[Ayyanar2019] propose that human thought constructs can be modelled as cause-effect pairs. Commonsense is often described as the ability to draw causal conclusions from basic knowledge, for example: If I drop the glass, it will break.
And [Chen20201544] state that “when people perform explicit reasoning, they can typically describe the way to the conclusion step by step via relational descriptions."
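One of the recurring formalizations above, rules such as transitivity over (Entity1, Relation, Entity2) tuples, can be sketched as a single forward-chaining pass; the facts below are invented for illustration.

```python
def apply_transitivity(facts, relation):
    """One forward-chaining pass: (a, R, b) and (b, R, c) entail (a, R, c)."""
    derived = set(facts)
    for (a, r1, b) in facts:
        for (b2, r2, c) in facts:
            if r1 == r2 == relation and b == b2:
                derived.add((a, relation, c))
    return derived

facts = {("paris", "in", "france"), ("france", "in", "europe")}
closure = apply_transitivity(facts, "in")
```

Iterating the pass until no new tuples appear would yield the full transitive closure; a single pass suffices for this two-hop example.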
But the most plausible hypothesis in our view is that of Schon et al. [Schon2019293]: in order to emulate human reasoning, systems need to be flexible, able to deal with contradicting and evolving evidence, have access to enormous amounts of background knowledge, and include a combination of different techniques and logics. Most notably, no particular theory of reasoning is given. The argument put forward by Leslie Kaelbling at the IBM Neuro-Symbolic AI Workshop 2022 is similarly appealing. She points to the over-reliance on the System 1/System 2 analogy, and advocates for a much more diverse and dynamic approach. We posit that the type of reasoning employed shouldn't be based solely on how we think people think, but on the attendant objective. This is in line with the "goal oriented" theory from neuroscience, in that reasoning involves many sub-systems: perception, information retrieval, decision making, planning, controlling, and executing, utilizing working memory, calculation, and pragmatics. Here the irony is not lost on us, and we acknowledge that by resorting to neuroscience for inspiration we have just committed the same mischief for which we have been decrying our peers! But if we must resort to analogies with human reasoning, then it is imperative to be as rigorous as possible. In their recent book, A Formal Theory of Commonsense Psychology: How People Think People Think [gordon_hobbs_2017], Gordon and Hobbs present a "large-scale logical formalization of commonsense psychology in support of humanlike artificial intelligence" to act as a baseline for researchers building intelligent AI systems. Santos et al. [santos_kejriwal_mulvehill_forbush_mcguinness_2021] take a step in the direction we are advocating, by testing whether human annotators agree when categorizing texts into Gordon and Hobbs' theories.
"Our end-goal is to advocate for better design of commonsense benchmarks [and to] support the development of a formal logic for commonsense reasoning". It is difficult to imagine a single formal logic which would afford all 48 of Gordon and Hobbs' categories of reasoning tasks. Besold et al. [besold2017neural] dedicate several pages to this topic under the heading of Neural-Symbolic Integration in and for Cognitive Science: Building Mental Models. In short, computational modelling of cognitive tasks, and especially of language processing, is still considered a hard challenge.
Others contend that to understand language, one should approach it through the lens of argumentation; that language is a communication tool, and as such should be understood as the interaction of two or more interlocutors engaged in the exercise of persuasion. At a minimum, there is a messenger and a receiver, each responsible for encoding or decoding the exchanged information. While deduction and induction have traditionally been the realm of symbolic and sub-symbolic systems respectively, abduction and non-monotonic logic are the tools of computational argumentation systems. Some of Gordon and Hobbs' theories lend themselves well to argumentation, such as Causality, Envisioning, or Explanation. None of the studies we reviewed undertook this approach. We believe this is a gap worth exploring in NeSy for NLP.
4.2 Technical challenges
There is strong agreement that a successful NeSy system will be characterized by compositionality. Compositionality allows for the construction of new meaning from learned building blocks, thus enabling extrapolation beyond the training distribution. To paraphrase Garcez et al., one should be able to query the trained network using a rich description language at an adequate level of abstraction [Garcez_Lamb_2020]. The challenge is to come up with dense, compact, differentiable representations while preserving the ability to decompose, or unbind, the learned representations for downstream reasoning tasks.
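The bind/unbind idea can be illustrated with Smolensky's Tensor Product Representations: role-filler pairs are bound by outer product, superposed by addition, and unbound by contracting with the role vector. With orthonormal roles, as in this minimal sketch, unbinding is exact; the vectors are illustrative.

```python
import numpy as np

roles = np.eye(3)                    # three orthonormal role vectors
fillers = np.array([[1.0, 2.0],      # filler bound to role 0
                    [3.0, 4.0],      # filler bound to role 1
                    [5.0, 6.0]])     # filler bound to role 2

# Bind each role-filler pair by outer product and superpose them
# into a single dense tensor.
tpr = sum(np.outer(roles[i], fillers[i]) for i in range(3))

def unbind(tpr, role):
    """Recover the filler bound to `role` (exact for orthonormal roles)."""
    return role @ tpr
```

This is exactly the tension the paragraph describes: the superposed tensor is a dense, differentiable object, yet its symbolic constituents remain recoverable for downstream reasoning.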
Bianchi et al. [Bianchi2019161] propose an extension of Logic Tensor Networks (LTNs) in which pre-trained embeddings are fed into the LTN. They show promising results on small datasets, with the important characteristic that logical inferences can be drawn after training. However, the approach is limited by heavy computational requirements as the logic becomes more expressive, for example through the use of quantifiers.
Several other studies [Mao2019, Yao201842] introduce logical inference within their solutions, but all require manually designed rules, and are limited by the domain expertise of the designer. Learning rules from data, or structure learning [Embar_Sridhar_Farnadi_Getoor_2018], is an ongoing research topic, as pointed out by [von_Rueden_Mayer_Beckh_Georgiev_Giesselbach_Heese_Kirsch_Walczak_Pfrommer_Pick_etal_2021]. In [Chaturvedi2019264], Chaturvedi et al. use fuzzy logic for emotion classification, where explicit membership functions are learned. However, as the authors note, the classifier becomes very slow as the number of functions grows.
Other (Compiled) approaches involve translating logic into differentiable functions, which are either directly included as network nodes, as in [Jiang2020], or added as a constraint to the loss function, as in [Diligenti2017143]. To achieve this, First Order Logic (FOL) can be operationalized using t-norms, for example. However, even if we cannot agree on how people reason, we can probably agree on how they do not, and that is with FOL. To address the many types of reasoning discussed in the previous section, we need to be able to incorporate other types of logic, such as temporal, modal, epistemic, and non-monotonic logics, none of which has an obvious differentiable form.
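Compiling a rule into the loss can be sketched as follows: a penalty term pushes the model toward satisfying "p(x) implies q(x)" alongside an ordinary supervised term. The rule, the one-parameter model, and the data are all invented; gradients are taken by finite differences purely for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y_q, p):
    """Data loss plus a logic penalty incurred whenever p(x) > q(x)."""
    q = sigmoid(w * x)
    data = (q - y_q) ** 2
    rule = np.maximum(0.0, p - q) ** 2
    return np.mean(data + rule)

# Toy training loop: the rule antecedent p holds everywhere, so the
# constraint pushes q toward 1 along with the supervised targets.
w, eps, lr = 0.0, 1e-5, 1.0
x, y_q, p = np.ones(4), np.ones(4), np.ones(4)
for _ in range(200):
    g = (loss(w + eps, x, y_q, p) - loss(w - eps, x, y_q, p)) / (2 * eps)
    w -= lr * g
```

After training, `sigmoid(w)` is close to 1, i.e. the model both fits the data and (softly) satisfies the rule; in a real system the penalty would be built from t-norm connectives over learned predicates.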
In summary, formulating logic, or more broadly reasoning, in a differentiable fashion remains challenging.
5 Limitations & Future Work
We organized our analysis according to the characteristics extracted from the studies to test whether there were any patterns leading to NeSy goals. Another approach would be to reverse this perspective and look at each goal separately to understand the characteristics leading to its fulfillment. However, each goal is really an entire field of study in and of itself, and we do not think we could have done justice to any of them by taking this approach. We spent a lot of time looking for signal in a very noisy environment, where the studies we reviewed had very little in common. More can be said about what we did not find than about what we did. Another approach might be to narrow the criteria for the type of NLP task while expanding the technical domain. In particular, a subset of tasks from the NLU domain could be a good starting point, as these tasks are often said to require reasoning.
We tried to be comprehensive with respect to the selected studies, which led to the trade-off of less space dedicated to technical details or additional context from the neuro-symbolic discussion. There are many ideas and concepts which we did not cover, such as, in no particular order, Relational Statistical Learning (RSL), Inductive Logic Programming (ILP), DeepProbLog [manhaeve2018deepproblog], Connectionist Modal Logics (CML), Extreme Learning Machines (ELM), Genetic Programming, grounding and propositionalization, Case Based Reasoning (CBR), and Abstract Meaning Representation (AMR), to name but a few, some of which are covered in detail in other surveys [Garcez_Gori_Lamb_Serafini_Spranger_Tran_2019, besold2017neural].
Furthermore, we argued that we need differentiable forms of different types of logic, but we did not discuss how they might be implemented. A comprehensive point of reference of this kind would be a very valuable contribution to the NeSy community, especially if the implementations were anchored in cognitive science and linguistics as discussed in Section 4.1.
Finally, the need for common datasets and benchmarks cannot be overstated.
We analyzed recent studies implementing NeSy for NLP in order to test whether the promises of NeSy are materializing in NLP. We attempted to find a pattern in a small and widely variable set of studies, and ultimately we do not believe there are enough results to draw definitive conclusions. Only 34 studies met the criteria for our review, and many of them (those in the Sequential category) we would not consider truly integrated NeSy systems. The one thing that the studies showing promise in meeting the specified goals [Bianchi2019161, Skrlj2021989, Yao201842, Mao2019, Huo2019159] have in common is that they all belong to the tightly integrated set of NeSy categories, Cooperative and Compiled, which is good news for NeSy. Two of these five report lower computational cost than baselines, and performance on par with or slightly above baselines, though we must reiterate that performance comparisons are not possible, as discussed in Section 3.1.5. We have seen that studies implementing both implicit and explicit reasoning meet the most NeSy goals, but suffer from high computational cost. Furthermore, explicit reasoning still often requires hand-crafted, domain-specific rules and logic, which makes these systems difficult to scale or generalize to other applications. Indeed, of the five goals, transferability to new domains was the least frequently satisfied.
Our view is that the lack of consensus around theories of reasoning and appropriate benchmarks is hindering our ability to evaluate progress. Hence we advocate for the development of robust reasoning theories and formal logics, as well as challenging benchmarks which not only measure the performance of specific implementations, but have the potential to address real-world problems. Systems capable of capturing the nuances of natural language (i.e., ones that "understand" human reasoning) while returning sound conclusions (i.e., performing logical reasoning) could help combat some of the most consequential issues of our time, such as mis- and dis-information, corporate propaganda such as climate change denialism, divisive political speech, and other harmful rhetoric in the social discourse.
This publication has emanated from research supported in part by a grant from Science Foundation Ireland under Grant number 18/CRT/6183. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
Appendix A Acronyms
CNN: Convolutional Neural Network
DBN: Deep Belief Network
GNN: Graph Neural Network
LLM: Large Language Model
LTN: Logic Tensor Network
NLP: Natural Language Processing
NLU: Natural Language Understanding
RcNN: Recursive Neural Network
RNKN: Recursive Neural Knowledge Network
RNN: Recurrent Neural Network
SVM: Support Vector Machine
TF-IDF: Term Frequency - Inverse Document Frequency
TPR: Tensor Product Representation
Appendix B Allowed Values
Type of learning:
Type of reasoning: Implicit, Explicit, Both
Language structure: Yes, No
Relational structure: Yes, No
NeSy category: Sequential, Cooperative, Compiled, Nested