1 Introduction
Integrating robust connectionist learning and sound symbolic reasoning is a key challenge in modern Artificial Intelligence. Deep neural networks (lecun2015deep; lecun1995convolutional; hochreiter1997long) provide powerful and flexible representation learning that has achieved state-of-the-art performance across a variety of AI tasks, such as image classification (krizhevsky2012imagenet; szegedy2015going; he2016deep), machine translation (sutskever2014sequence), and speech recognition (graves2013speech). However, many aspects of human cognition, such as systematic compositionality and generalization (fodor1988connectionism; marcus1998rethinking; fodor2002compositionality; calvo2014architecture; marcus2018algebraic; Lake2018Generalization), cannot be captured by neural networks. Symbolic reasoning, on the other hand, supports strong abstraction and generalization but is fragile and inflexible. Consequently, many methods have focused on building neural-symbolic models that combine the best of deep representation learning and symbolic reasoning (sun1994integrating; garcez2008neural; bader2009extracting; besold2017neural; yi2018neural).
Recently, this neural-symbolic paradigm has been extensively explored in visual question answering (VQA) (yi2018neural; vedantam2019probabilistic; mao2019neuro), vision-language navigation (anderson2018vision; fried2018speaker), embodied question answering (das2018embodied; das2018neural), and semantic parsing (liang2016neural; yin2018structvae), often with weak supervision. Concretely, in these tasks, neural networks map raw signals (images/questions/instructions) to symbolic representations (scenes/programs/actions), which are then used to perform symbolic reasoning or execution to generate the final outputs. Weak supervision in these tasks usually provides pairs of raw inputs and final outputs, with the intermediate symbolic representations unobserved. Since symbolic reasoning is non-differentiable, previous methods usually learn such neural-symbolic models with policy gradient methods like REINFORCE. These methods generate samples and update the policy based on the generated samples that happen to hit high cumulative rewards; no effort is made to improve an individual sample so as to increase its cumulative reward. The learning is therefore time-consuming: it requires generating a large number of samples over a large latent space of symbolic representations with sparse rewards, in the hope that some samples are lucky enough to hit high rewards and can be used to update the policy. As a result, policy gradient methods converge slowly or even fail to converge without pre-training the neural networks on fully-supervised data.
To model the recursive compositionality in a sequence of symbols, we introduce a grammar model to bridge neural perception and symbolic reasoning. The structured symbolic representation often exhibits compositional and recursive properties over its individual symbols. Correspondingly, the grammar model encodes a symbolic prior over composition rules and can thus dramatically reduce the solution space by parsing the sequence of symbols only into valid sentences. For example, in the handwritten formula recognition problem, the grammar model ensures that the predicted formula is always valid, as shown in Figure 1.
To make the neural-symbolic learning more efficient, we propose a novel back-search strategy which mimics humans' ability to learn from failures via abductive reasoning (magnani2009abductive; zhou2019abductive). Specifically, the back-search algorithm propagates the error from the root node to the leaf nodes of the reasoning tree and finds the most probable correction that can generate the desired output. The correction is further used as a pseudo label for training the neural network. Figure 1 shows an exemplar backward pass of the back-search algorithm. We argue that the back-search algorithm makes a first step towards closing the learning loop by propagating the error through the non-differentiable grammar parsing and symbolic reasoning modules. We also show that the proposed multi-step back-search algorithm can serve as a Metropolis-Hastings sampler of the posterior distribution over symbolic representations in maximum likelihood estimation (Section 3.2.3).
We conduct experiments on two weakly-supervised neural-symbolic tasks: (1) handwritten formula recognition on the newly introduced HWF dataset (HandWritten Formula), where the input image and the formula result are given during training while the formula itself is hidden; and (2) visual question answering on the CLEVR dataset, where the question, image, and answer are given while the functional program generated from the question is hidden. The evaluation results show that the proposed Neural-Grammar-Symbolic (NGS) model with back-search significantly outperforms the baselines in terms of performance, convergence speed, and data efficiency. The ablative experiments also demonstrate the efficacy of the multi-step back-search algorithm and of incorporating grammar into the neural-symbolic model.
2 Related Work
Neural-symbolic Integration. Researchers have proposed to combine statistical learning and symbolic reasoning in the AI community, with pioneering works devoted to different aspects including representation learning and reasoning (sun1994integrating; garcez2008neural; manhaeve2018deepproblog), abductive learning (dai2017combining; dai2019bridging; zhou2019abductive), knowledge abstraction (hinton2006fast; bader2009extracting), and knowledge transfer (falkenhainer1989structure; yang2009heterogeneous). Recent research shifts the focus to applications of neural-symbolic integration, where large amounts of heterogeneous data and knowledge descriptions are needed, such as neural-symbolic VQA (yi2018neural; vedantam2019probabilistic; mao2019neuro), semantic parsing in Natural Language Processing (NLP) (liang2016neural; yin2018structvae), math word problems (lample2019deep; lee2019mathematical), and program synthesis (evans2018learning; kalyan2018neural; manhaeve2018deepproblog). Different from previous methods, the proposed NGS model considers the compositionality and recursivity in natural sequences of symbols and brings together the neural perception and symbolic reasoning modules with a grammar model.
Grammar Model. Grammar models have been adopted in various tasks for their advantage in modeling compositional and recursive structures, such as image parsing (zhao2011image), video parsing (gupta2009understanding; qi2018generalized), scene understanding (huang2018holistic; jiang2018configurable), and task planning (xu2018unsupervised). By integrating grammar into a neural-symbolic task as a symbolic prior for the first time, the grammar model enforces the desired dependencies and structures on the symbol sequence and generates valid sentences for symbolic reasoning. Furthermore, it greatly shrinks the search space during the back-search algorithm, thus improving learning efficiency significantly.
Policy Gradient. Policy gradient methods like REINFORCE (williams1992simple) are the most commonly used algorithms for neural-symbolic tasks to bridge the learning gap between neural networks and symbolic reasoning (mascharka2018transparency; mao2019neuro; andreas2017modular; das2018neural; bunel2018leveraging; guu2017language). However, the original REINFORCE algorithm suffers from high variance in its sample-based gradient estimates, sparse rewards from a cold start, and the exploration-exploitation dilemma, which lead to unstable learning dynamics and poor data efficiency. Many papers propose to tackle these problems (liang2016neural; guu2017language; Liang2018MemoryAP; wang2018mathdqn; agarwal2019learning). Specifically, liang2016neural uses iterative maximum likelihood to find pseudo-gold symbolic representations and then adds these representations to the REINFORCE training set. guu2017language combines the systematic beam search employed in maximum marginal likelihood with the greedy randomized exploration of REINFORCE. Liang2018MemoryAP proposes Memory Augmented Policy Optimization (MAPO), which expresses the expected-return objective as a weighted sum of an expectation over high-reward history trajectories and a separate expectation over new trajectories. Although they utilize positive representations from either beam search or the past training process, these methods still cannot learn from negative samples and thus fail to explore the solution space efficiently. On the contrary, we propose to diagnose and correct negative samples through the back-search algorithm under the constraints of the grammar and the symbolic reasoning rules. Intuitively speaking, the proposed back-search algorithm traverses around a negative sample and finds a nearby positive sample to help the training.
3 Neural-Grammar-Symbolic Model (NGS)
In this section, we first describe the inference and learning algorithms of the proposed neural-grammar-symbolic (NGS) model. Then we provide an interpretation of our model based on maximum likelihood estimation (MLE) and draw the connection between the proposed back-search algorithm and the Metropolis-Hastings sampler. We further introduce the task-specific designs in Section 4.
3.1 Inference
In a neural-symbolic system, let $x$ be the input (e.g., an image or question), $z$ be the hidden symbolic representation, and $y$ be the desired output inferred by $z$. The proposed NGS model combines neural perception, grammar parsing, and symbolic reasoning modules to perform the inference efficiently.
Neural Perception. The neural network is used as a perception module which maps the high-dimensional input $x$ to a normalized probability distribution over the hidden symbolic representation $z$:

$p_\theta(z|x) = \mathrm{softmax}(\phi_\theta(z, x))$  (1)

$= \frac{\exp(\phi_\theta(z, x))}{\sum_{z'} \exp(\phi_\theta(z', x))}$  (2)

where $\phi_\theta(z, x)$ is a scoring function or a negative energy function represented by a neural network with parameters $\theta$.
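As a concrete illustration, below is a minimal sketch of such a perception module in PyTorch, assuming (as in Section 4.1) that the input formula is pre-segmented into 28×28 single-symbol images and scored by a LeNet-style network; the vocabulary and architecture details here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

SYMBOLS = list("0123456789") + ["+", "-", "*", "/"]  # hypothetical vocabulary Sigma

class PerceptionNet(nn.Module):
    """Maps each pre-segmented symbol image to a distribution over symbols."""
    def __init__(self, num_classes=len(SYMBOLS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 12x12
            nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),  # 12x12 -> 4x4
        )
        self.classifier = nn.Linear(16 * 4 * 4, num_classes)

    def forward(self, imgs):                          # imgs: (seq_len, 1, 28, 28)
        h = self.features(imgs).flatten(1)            # (seq_len, 256)
        return torch.log_softmax(self.classifier(h), dim=-1)  # log p_theta(z_i | x_i)
```

Under this per-symbol factorization, $\log p_\theta(z|x)$ for a sequence is the sum of the per-symbol log-probabilities.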
Grammar Parsing. Take $z$ as a sequence of individual symbols: $z = (z_1, z_2, \dots, z_\ell)$, $z_i \in \Sigma$, where $\Sigma$ denotes the vocabulary of possible symbols. The neural network is powerful at modeling the mapping between $x$ and $z$, but the recursive compositionality among the individual symbols is not well captured. Grammar is a natural choice to tackle this problem by modeling the compositional properties in sequence data.
Take the context-free grammar (CFG) as an example. In formal language theory, a CFG is a type of formal grammar containing a set of production rules that describe all possible sentences in a given formal language. Specifically, a context-free grammar $G$ in Chomsky Normal Form is defined by a 4-tuple $G = (V, \Sigma, R, S)$, where

$V$ is a finite set of non-terminal symbols that can be replaced by/expanded to a sequence of symbols.

$\Sigma$ is a finite set of terminal symbols that represent actual words in a language, which cannot be further expanded. Here $\Sigma$ is the vocabulary of possible symbols.

$R$ is a finite set of production rules describing the replacement of symbols, typically of the form $A \to BC$ or $A \to \alpha$, where $A, B, C \in V$ and $\alpha \in \Sigma$. A production rule replaces the left-hand-side non-terminal symbol by the right-hand-side expression; for example, $A \to BC \mid \alpha$ means that $A$ can be replaced by either $BC$ or $\alpha$.

$S \in V$ is the start symbol.
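As a concrete (hypothetical) example in the spirit of the HWF task in Section 4.1, a toy grammar for single-digit arithmetic can be written down directly; it is shown in general CFG form rather than Chomsky Normal Form for readability:

```python
# Toy context-free grammar for single-digit arithmetic formulas.
# Non-terminals: Expr (the start symbol), Op, Digit; everything else is terminal.
GRAMMAR = {
    "Expr":  [["Digit"], ["Expr", "Op", "Expr"]],  # an expression is a digit, or expr op expr
    "Op":    [[op] for op in "+-*/"],
    "Digit": [[d] for d in "0123456789"],
}
START = "Expr"
```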
Given a formal grammar, parsing is the process of determining whether a string of symbols can be accepted according to the production rules of the grammar. If the string is accepted, the parsing process generates a parse tree, which represents the syntactic structure of the string according to the given CFG. The root node of the tree is the grammar's start symbol. Other non-leaf nodes correspond to non-terminals in the grammar, expanded according to the grammar production rules. The leaf nodes are terminal nodes, and together they form a sentence.
In neural-symbolic tasks, the objective of parsing is to find the most probable $z$ that can be accepted by the grammar:

$\hat{z} = \arg\max_{z \in L(G)} p_\theta(z|x)$  (3)

where $L(G)$ denotes the language of $G$, i.e., the set of all valid $z$ accepted by $G$.
Traditional grammar parsers only work on symbolic sentences. qi2018generalized propose a generalized version of the Earley parser, which takes a probability sequence as input and outputs the most probable parse. We use this method to compute the best parse in Equation 3.
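The generalized Earley parser is beyond the scope of a short sketch, but the semantics of Equation 3 can be illustrated by brute force for short sequences: enumerate all strings of the observed length, keep those accepted by the grammar, and score each by its product of per-symbol probabilities. The accepts callback (e.g., a CYK recognizer) is a hypothetical placeholder:

```python
import itertools
import math

def best_parse_brute_force(log_probs, vocab, accepts):
    """log_probs: per-position dicts {symbol: log p(z_i|x)};
    accepts: callable deciding membership in L(G)."""
    best, best_score = None, -math.inf
    for cand in itertools.product(vocab, repeat=len(log_probs)):
        if not accepts(cand):
            continue                       # skip strings outside L(G)
        score = sum(lp[s] for lp, s in zip(log_probs, cand))
        if score > best_score:
            best, best_score = list(cand), score
    return best                            # argmax of p_theta(z|x) over z in L(G)
```

This enumeration is exponential in the sequence length and serves only to make the argmax in Equation 3 concrete; the generalized Earley parser computes it efficiently.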
Symbolic Reasoning. Given the parsed symbolic representation $\hat{z}$, the symbolic reasoning module performs deterministic inference with $\hat{z}$ and the domain-specific knowledge $\Delta$. Formally, we want to find the entailed sentence $\hat{y}$ given $\hat{z}$ and $\Delta$:

$\hat{y} : \hat{z} \wedge \Delta \models \hat{y}$  (4)

Since the inference process is deterministic, we rewrite the above equation as:

$\hat{y} = f(\hat{z}; \Delta)$  (5)

where $f$ denotes the complete inference rules under the domain $\Delta$. The inference rules generate a reasoning path that leads to the predicted output $\hat{y}$ from $\hat{z}$ and $\Delta$. The reasoning path has a tree structure with the root node $\hat{y}$ and leaf nodes from $\hat{z}$ or $\Delta$.
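For the arithmetic domain of Section 4.1, $f(\hat{z}; \Delta)$ is simply formula evaluation over the parse tree; a minimal sketch, assuming the tree is represented as nested tuples with digit strings at the leaves:

```python
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def execute(node):
    """Deterministic bottom-up evaluation of a parse tree: f(z; Delta).
    A leaf is a digit string; an internal node is (op, left, right)."""
    if isinstance(node, str):
        return int(node)
    op, left, right = node
    return OPS[op](execute(left), execute(right))

# e.g., execute(("+", "3", ("*", "4", "2"))) == 11
```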
3.2 Learning
It is challenging to obtain the ground truth of the symbolic representation $z$, and the rules (i.e., the grammar rules and the symbolic inference rules) are usually designed explicitly from human knowledge. We formulate the learning process as weakly-supervised learning of the neural network parameters $\theta$, where the symbolic representation $z$ is missing, while the grammar model $G$, the domain-specific language $\Delta$, and the symbolic inference rules are given.
3.2.1 1-step back-search (1-BS)
As shown in Figure 1, previous methods using policy gradients to learn the model discard all samples with zero reward and learn nothing from them, which makes the learning process inefficient and unstable. Humans, however, can learn from wrong predictions by diagnosing and correcting the wrong answers according to the desired outputs with top-down reasoning. Based on this observation, we propose a 1-step back-search (1-BS) algorithm which corrects wrong samples and uses the corrections as pseudo labels for training. The 1-BS algorithm closes the learning loop since the error can also be propagated through the non-differentiable grammar parsing and symbolic reasoning modules. Specifically, we find the most probable correction of the wrong prediction by back-tracking the symbolic reasoning tree and propagating the error from the root node to the leaf nodes in a top-down manner.
The 1-BS algorithm is implemented with a priority queue, as shown in Algorithm 1. The 1-BS gradually searches down the reasoning tree, starting from the root node and moving toward the leaf nodes. Specifically, each element in the priority queue represents a valid change, defined as a 3-tuple $(A, \alpha_A, p)$:

$A$ is the currently visited node.

$\alpha_A$ is the expected value of this node: if the value of $A$ is changed to $\alpha_A$, the tree will execute to the ground-truth answer $y$, i.e., $f(\hat{z}(A \to \alpha_A); \Delta) = y$.

$p$ is the visiting priority, which reflects the potential of changing the value of $A$.
Formally, the priority for this change is defined as the probability ratio:

$p(A \to \alpha_A) = \begin{cases} \frac{1 - p(A)}{p(A)}, & \text{if } A \notin \Sigma \\ \frac{p(\alpha_A)}{p(A)}, & \text{if } A \in \Sigma \text{ and } \alpha_A \in \Sigma \end{cases}$  (6)

where $p(A)$ is calculated from Equation 1 if $A \in \Sigma$; otherwise, it is defined as the product of the probabilities of all leaf nodes in $A$. If $A \in \Sigma$ and $\alpha_A \notin \Sigma$, the change would correct the terminal node to a value that is not in the vocabulary; this change is impossible and thus discarded.
The error propagation through the reasoning tree is achieved by a solve$(B, A, \alpha_A)$ function, which computes the expected value $\alpha_B$ of a child node $B$ from the expected value $\alpha_A$ of its parent node $A$, i.e., it finds $\alpha_B$ satisfying $f(\hat{z}(B \to \alpha_B); \Delta) = y$. Please refer to the supplementary material for illustrative examples of the 1-BS process.
In the 1-BS, we make the greedy assumption that only one symbol can be replaced at a time. This assumption implies searching only the neighborhood of $\hat{z}$ at one-step distance. In Section 3.2.3, the 1-BS is extended to the multi-step back-search algorithm, which allows searching beyond the one-step distance.
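To make Algorithm 1 concrete, the following is a minimal sketch of the 1-BS priority-queue search in Python. The helpers priority, solve, and is_terminal, and the .children attribute, are hypothetical stand-ins for the paper's exact definitions (Equation 6 and the solve function above):

```python
import heapq
import itertools

def one_step_back_search(root, y, priority, solve, is_terminal):
    """1-BS sketch: search down the reasoning tree, proposing single-node edits.
    Each queue entry asks: change this node's value to alpha so that the tree
    executes to the ground-truth answer y. priority implements Equation 6;
    solve(child, parent, alpha) propagates the expected value one level down."""
    counter = itertools.count()          # tie-breaker so nodes are never compared
    queue = [(-priority(root, y), next(counter), root, y)]
    while queue:
        _, _, node, alpha = heapq.heappop(queue)
        if is_terminal(node):
            return node, alpha           # pseudo-label: set this leaf to alpha
        for child in node.children:      # assumes tree nodes expose .children
            beta = solve(child, node, alpha)
            if beta is not None:         # None marks an impossible correction
                heapq.heappush(queue, (-priority(child, beta), next(counter), child, beta))
    return None                          # no one-step correction exists
```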
3.2.2 Maximum Likelihood Estimation
Since $z$ is conditioned on $x$ and $y$ is conditioned on $z$, the likelihood of the observation $(x, y)$ marginalized over $z$ is:

$p(y|x) = \sum_z p(y, z|x) = \sum_z p(y|z)\, p_\theta(z|x)$  (7)

The learning goal is to maximize the observed-data log-likelihood $L(x, y) = \log p(y|x)$.
Taking the derivative, the gradient with respect to the parameters $\theta$ is given by

$\nabla_\theta L(x, y) = \mathbb{E}_{z \sim p(z|x,y)}\left[\nabla_\theta \log p_\theta(z|x)\right]$  (8)

where $p(z|x, y)$ is the posterior distribution of $z$ given $x$ and $y$. Since $p(y|z)$ is computed by the symbolic reasoning module and can only be 0 or 1, $p(z|x, y)$ can be written as:

$p(z|x, y) = \frac{p(y|z)\, p_\theta(z|x)}{\sum_{z'} p(y|z')\, p_\theta(z'|x)} = \frac{p_\theta(z|x)\, \mathbb{1}[z \in Q]}{\sum_{z' \in Q} p_\theta(z'|x)}$  (9)

where $Q = \{z : f(z; \Delta) = y\}$ is the set of $z$ that generates $y$. Usually $Q$ is a very small subset of the whole space of $z$.
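For completeness, the expectation form of Equation 8 follows from Equation 7 by the log-derivative identity, which makes the role of the posterior explicit:

$\nabla_\theta \log p(y|x) = \frac{\sum_z p(y|z)\, \nabla_\theta p_\theta(z|x)}{p(y|x)} = \sum_z \frac{p(y|z)\, p_\theta(z|x)}{p(y|x)}\, \nabla_\theta \log p_\theta(z|x) = \mathbb{E}_{z \sim p(z|x,y)}\left[\nabla_\theta \log p_\theta(z|x)\right]$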
Equation 9 indicates that $z$ is sampled from the posterior distribution $p(z|x, y)$, which has non-zero probabilities only on $Q$ rather than on the whole space of $z$. Unfortunately, computing the posterior distribution is not efficient: evaluating its normalizing constant requires summing over all possible $z$, and the computational complexity of this summation grows exponentially.
Nonetheless, it is feasible to design algorithms that sample from this distribution using Markov chain Monte Carlo (MCMC). Since the sampled $z$ is always trapped in the modes where $p(y|z) = 1$, the remaining question is how to sample the posterior distribution efficiently while avoiding redundant random walks at states with zero probability.
3.2.3 m-BS as Metropolis-Hastings Sampler
In order to perform efficient sampling, we extend the 1-step back-search to a multi-step back-search (m-BS), which serves as a Metropolis-Hastings sampler.
A Metropolis-Hastings sampler for a probability distribution $\pi(s)$ is an MCMC algorithm that draws samples from a proposal distribution $q(s'|s)$ and uses an acceptance/rejection scheme to define a transition kernel with the desired stationary distribution $\pi(s)$. Specifically, given the current state $s$, a sample $s' \neq s$ drawn from $q(s'|s)$ is accepted as the next state with probability

$A(s, s') = \min\left\{1, \frac{\pi(s')\, q(s|s')}{\pi(s)\, q(s'|s)}\right\}$  (10)
Since it is impossible to jump between states with zero probability, we define $p'(z)$ as a smoothing of $p(z|x, y)$, obtained by adding a small constant $\epsilon$ to $p(y|z)$:

$p'(z) = \frac{[p(y|z) + \epsilon]\, p_\theta(z|x)}{\sum_{z'} [p(y|z') + \epsilon]\, p_\theta(z'|x)}$  (11)
As shown in Algorithm 2, in each step the m-BS proposes a 1-BS move with probability $1 - q$ and a random walk with probability $q$, for a small constant $q$. The combination of 1-BS and random walk helps the sampler traverse all states with non-zero probabilities and ensures that the Markov chain is ergodic.
Random Walk: We define a Poisson distribution for the random walk:

$q_{rw}(z'|z) = \mathrm{Poisson}(d(z, z'); \lambda)$  (12)

where $d(z, z')$ denotes the edit distance between $z$ and $z'$, and $\lambda$ is equal to the expected value of $d$ and also to its variance. $\lambda$ is set to 1 in most cases due to the preference for short-distance random walks. The acceptance ratio for sampling a $z'$ from $q_{rw}$ is $\min(1, r(z, z'))$, where

$r(z, z') = \frac{p'(z')\, q_{rw}(z|z')}{p'(z)\, q_{rw}(z'|z)}$  (13)
1-BS: When proposing $z'$ with 1-BS, we search for a $z'$ that satisfies $f(z'; \Delta) = y$. If such a $z'$ is proposed, the acceptance ratio is $\min(1, r'(z, z'))$, where

$r'(z, z') = \frac{g(z')\, q(z|z')}{g(z)\, q(z'|z)}$  (14)

and $g(z) = [p(y|z) + \epsilon]\, p_\theta(z|x)$ is denoted as the numerator of $p'(z)$. With a sufficiently small $\epsilon$, $g(z') \gg g(z)$, since $p(y|z') = 1$ while $p(y|z) = 0$, and we will always accept $z'$.
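Putting the two proposals together, below is a compact sketch of one m-BS transition (Algorithm 2). The helpers one_bs, random_walk, the unnormalized target g, and the proposal ratio q_ratio are hypothetical stand-ins under the definitions above, not the paper's exact implementation:

```python
import random

def m_bs_step(z, y, g, one_bs, random_walk, q_ratio, q=0.1):
    """One m-BS transition (a sketch of Algorithm 2). g(z) is the unnormalized
    smoothed target [p(y|z) + eps] * p_theta(z|x) from Equation 11;
    q_ratio(z, z_new) returns the proposal ratio q(z|z_new) / q(z_new|z)."""
    if random.random() < q:
        z_new = random_walk(z)     # Poisson edit-distance proposal (Eq. 12)
    else:
        z_new = one_bs(z, y)       # directed proposal: a z' with f(z'; Delta) = y
    if z_new is None:
        return z                   # no valid proposal found; keep the current state
    accept = min(1.0, g(z_new) / g(z) * q_ratio(z, z_new))  # Eq. 10 with smoothed target
    return z_new if random.random() < accept else z
```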
Notably, the 1-BS algorithm tries to transit the current state $z$ into a state $z'$ satisfying $f(z'; \Delta) = y$, making movements in directions that increase the posterior probability. Similar to gradient-based MCMC methods like Langevin dynamics (duane1986theory; welling2011bayesian), this is the main reason why the proposed method can sample the posterior efficiently.
3.2.4 Comparison with Policy Gradient
Since grammar parsing and symbolic reasoning are non-differentiable, most previous approaches to neural-symbolic learning use policy gradient methods like REINFORCE to learn the neural network. Treating $p_\theta(z|x)$ as the policy function, the reward given $z$ and $y$ can be written as:

$r(z, y) = \begin{cases} 0, & \text{if } f(z; \Delta) \neq y \\ 1, & \text{if } f(z; \Delta) = y \end{cases}$  (15)
The learning objective is to maximize the expected reward under the current policy $p_\theta$:

$R(\theta) = \mathbb{E}_{z \sim p_\theta(z|x)}[r(z, y)]$  (16)

The gradient with respect to $\theta$ is then:

$\nabla_\theta R(\theta) = \mathbb{E}_{z \sim p_\theta(z|x)}\left[r(z, y)\, \nabla_\theta \log p_\theta(z|x)\right]$  (17)

Approximating the expectation with a single sample at each step yields the REINFORCE estimator:

$\nabla_\theta R(\theta) \approx r(z, y)\, \nabla_\theta \log p_\theta(z|x), \quad z \sim p_\theta(z|x)$  (18)
Equation 18 reveals that the gradient is non-zero only when the sampled $z$ satisfies $f(z; \Delta) = y$. However, only a very small portion of the whole space of $z$ can generate the desired $y$, which implies that REINFORCE obtains zero gradients from most of the samples. This is why the REINFORCE method converges slowly or even fails to converge, as also shown by the experiments in Section 4.
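To make the contrast concrete, here is a minimal sketch of the two updates on a single example; sample_z, log_prob, execute, and back_search are hypothetical placeholders for the components defined above:

```python
def reinforce_loss(model, x, y, sample_z, execute):
    """REINFORCE (Eq. 18): with a 0/1 reward, most samples contribute nothing."""
    z = sample_z(model, x)                      # z ~ p_theta(z | x)
    reward = 1.0 if execute(z) == y else 0.0    # Eq. 15
    return -reward * model.log_prob(z, x)       # zero gradient whenever reward == 0

def back_search_loss(model, x, y, back_search, execute):
    """Back-search: a corrected pseudo-label always yields a learning signal."""
    z_star = back_search(model, x, y)           # correction with f(z*; Delta) = y
    if z_star is None:
        return None                             # no correction found within the step budget
    return -model.log_prob(z_star, x)           # ordinary supervised log-likelihood
```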
4 Experiments and Results
4.1 Handwritten Formula Recognition
4.1.1 Experimental Setup
Task definition. The handwritten formula recognition task tries to recognize each mathematical symbol given a raw image of a handwritten formula. We learn this task in a weakly-supervised manner: the raw image of the handwritten formula is given as input $x$, and the computed result of the formula is treated as the output $y$. The symbolic representation $z$, i.e., the ground-truth formula composed of individual symbols, is hidden. The task is to predict the formula, which can then be executed to calculate the final result.
HWF Dataset. We generate the HWF dataset based on the CROHME 2019 Offline Handwritten Formula Recognition Task (https://www.cs.rit.edu/~crohme2019/task.html). First, we extract all symbols from CROHME and keep only the ten digits (0-9) and the four basic operators (+, -, ×, ÷). Then we generate formulas by sampling from a pre-defined grammar that only considers arithmetic operations over single-digit numbers. For each formula, we randomly select symbol images from CROHME. Overall, the dataset contains 10K training formulas and 2K test formulas.
Evaluation Metrics. We report both the calculation accuracy (i.e., whether the calculation of the predicted formula yields the correct result) and the symbol recognition accuracy (i.e., whether each symbol is recognized correctly from the image) on the synthetic dataset.
Models. In this task, we use LeNet (lecun2015lenet) as the neural perception module to process the handwritten formula. Before being fed into LeNet, the original image of a formula is pre-segmented into a sequence of sub-images, each containing a single symbol. The symbolic reasoning module works like a calculator: each inference step computes the parent value given the values of the two child nodes (left/right) and the operator. The solve function in the 1-step back-search algorithm handles mathematical formulas with the following two rules (a code sketch follows the list):

If $B$ is $A$'s left or right child, we directly solve the equation $\alpha_B \circ v = \alpha_A$ or $v \circ \alpha_B = \alpha_A$ to get $\alpha_B$, where $\circ$ denotes the operator and $v$ the value of the sibling node.

If $B$ is an operator node, we try all other operators and check whether the new formula can generate the correct result.
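A minimal sketch of these two rules for integer arithmetic; the function names and guard conditions are illustrative, and in the full algorithm any proposed value outside the vocabulary (e.g., a non-single-digit number) is discarded, as described in Section 3.2.1:

```python
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def solve_value(op, sibling, alpha, child_is_left):
    """Value the child must take so that (child op sibling) = alpha
    if child_is_left, else (sibling op child) = alpha."""
    if op == "+":
        return alpha - sibling
    if op == "*":
        return alpha / sibling if sibling != 0 else None
    if op == "-":
        return alpha + sibling if child_is_left else sibling - alpha
    if op == "/":
        if child_is_left:
            return alpha * sibling
        return sibling / alpha if alpha != 0 else None
    return None

def solve_operator(left, right, alpha):
    """If the visited node is the operator itself, try every other operator."""
    for op, fn in OPS.items():
        if op == "/" and right == 0:
            continue                    # skip division by zero
        if fn(left, right) == alpha:
            return op
    return None
```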
We conduct experiments by comparing the following variants of the proposed model:

NGS-RL: learning the NGS model with REINFORCE.

NGS-MAPO: learning the NGS model with Memory Augmented Policy Optimization (MAPO) (Liang2018MemoryAP), which leverages a memory buffer of rewarding samples to reduce the variance of policy gradient estimates.

NGS-RL-Pretrain: NGS-RL with LeNet pre-trained on a small set of fully-supervised data.

NGS-MAPO-Pretrain: NGS-MAPO with the pre-trained LeNet.

NGS-m-BS: learning the NGS model with the proposed m-step back-search algorithm.
4.1.2 Results and Analyses
Learning Curve. Figure 2 shows the learning curves of different models. The proposed NGS-m-BS converges much faster and achieves higher accuracy than the other models. NGS-RL fails without pre-training and barely improves during the entire training process. NGS-MAPO can learn the model without pre-training, but it takes a long time to start efficient learning, which indicates that MAPO suffers from the cold-start problem and needs time to accumulate rewarding samples. Pre-training LeNet solves the cold-start problem for NGS-RL and NGS-MAPO. However, the training curves of these two models are quite noisy and hard to converge even after 100K iterations. Our NGS-m-BS model learns from scratch and avoids the cold-start problem; it converges quickly to nearly perfect accuracy, with a much smoother training curve than the RL baselines.
Back-Search Step. Figure 3 compares various numbers of steps in the multi-step back-search algorithm. Generally, increasing the number of steps increases the chance of correcting wrong samples, thus making the model converge faster. However, it also increases the time consumed by each iteration.
Data Efficiency. Table 1 and Table 2 show the calculation and symbol recognition accuracies, respectively, on the test set using varying percentages of training data. All models are trained for 15K iterations. NGS-m-BS is much more data-efficient than the RL methods: with only 25% of the training data, NGS-m-BS achieves a calculation accuracy of 93.3%, while NGS-MAPO only reaches 5.1%.
Table 1. Calculation accuracy on the test set using various percentages of training data.

Model               25%    50%    75%    100%
NGS-RL              0.035  0.036  0.034  0.034
NGS-MAPO            0.051  0.095  0.305  0.717
NGS-RL-Pretrain     0.534  0.621  0.663  0.685
NGS-MAPO-Pretrain   0.687  0.773  0.893  0.956
NGS-m-BS            0.933  0.957  0.975  0.985

Table 2. Symbol recognition accuracy on the test set using various percentages of training data.

Model               25%    50%    75%    100%
NGS-RL              0.170  0.170  0.170  0.170
NGS-MAPO            0.316  0.481  0.785  0.967
NGS-RL-Pretrain     0.916  0.945  0.959  0.964
NGS-MAPO-Pretrain   0.962  0.983  0.985  0.991
NGS-m-BS            0.988  0.992  0.995  0.997
Qualitative Results. Figure 4 shows four examples of correcting wrong predictions with 1-BS. In the first two examples, the back-search algorithm successfully corrects the wrong predictions by changing a digit and an operator, respectively. In the third example, the back-search fails to correct the wrong sample; however, with more search steps, the model could find a correction. In the fourth example, the back-search finds a spurious correction, which differs from the ground-truth formula but generates the same result. Such spurious corrections bring noisy gradients to the neural network update. How to avoid such spurious corrections remains an open problem.
4.2 Neural-Symbolic Visual Question Answering
4.2.1 Experimental Setup
Task. Following yi2018neural, the neural-symbolic visual question answering task parses the question into a functional program and then uses a program executor that runs the program on the structural scene representation to obtain the answer. The functional program is hidden.
Dataset. We evaluate the proposed method on the CLEVR dataset (johnson2017clevr). CLEVR is a popular benchmark for testing the compositional reasoning capability of VQA models (johnson2017inferring; vedantam2019probabilistic). It consists of a training set of 70K images and 700K questions, and a validation set of 15K images and 150K questions. We use VQA accuracy as the evaluation metric.
Models. We adopt the NS-VQA model of yi2018neural and replace the attention-based seq2seq question parser with a Pointer Network (Vinyals2015PointerN). We maintain a dictionary that maps keywords in each question to the corresponding functional modules, for example, "red" → "filter color [red]", "how many" → "count", and "what size" → "query size". The Pointer Network can therefore point to the functional modules related to the input question. The grammar model ensures that the generated sequence of functional modules forms a valid program, meaning that the inputs and outputs of these modules can be strictly matched with their forms. We conduct experiments comparing the following models: NS-RL, NGS-RL, NGS-1-BS, and NGS-m-BS. A fragment of such a keyword dictionary is sketched below.
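A sketch of the keyword-to-module dictionary described above; the exact module identifiers used in the implementation may differ:

```python
# Keywords detected in the question point the parser to candidate functional modules.
KEYWORD_TO_MODULE = {
    "red":       "filter_color[red]",
    "how many":  "count",
    "what size": "query_size",
    # ... one entry per attribute value and question pattern
}
```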
4.2.2 Results and Analyses
Learning Curve. Figure 5 shows the learning curves of different model variants. NGS-BS converges much faster and achieves higher VQA accuracy on the test set than the RL baselines. Though it takes a long time, NGS-RL does converge, while NS-RL fails. This fact indicates that the grammar model plays a critical role in this task. Conceivably, the latent functional program space is combinatorial, but the grammar model rules out all invalid programs that cannot be executed by the symbolic reasoning module, which largely reduces the solution space.
Back-Search Step. As shown in Figure 5, NGS-10-BS performs only slightly better than NGS-1-BS, which indicates that searching multiple steps does not help greatly in this task. One possible reason is that there are more ambiguities and more spurious examples than in the handwritten formula recognition task, making the back-search less efficient. For example, for the answer "yes", there might be many possible programs for the question that generate the same answer given the image.
Data Efficiency. Table 3 shows the accuracies on the CLEVR validation set using different portions of the training data. With less training data, performance decreases for both NGS-RL and NGS-m-BS, but NGS-m-BS consistently obtains higher accuracy.
Table 3. VQA accuracy on the CLEVR validation set using various percentages of training data.

Model      25%    50%    75%    100%
NS-RL      0.090  0.091  0.099  0.125
NGS-RL     0.678  0.839  0.905  0.969
NGS-m-BS   0.873  0.936  1.000  1.000
5 Conclusions
In this work, we propose a neural-grammar-symbolic model and a back-search algorithm to close the loop of neural-symbolic learning. We demonstrate that the grammar model can dramatically reduce the solution space by eliminating invalid possibilities in the latent representation space. The back-search algorithm endows the NGS model with the capability of learning from wrong samples, making the learning more stable and efficient. One future direction is to learn the symbolic prior (i.e., the grammar rules and symbolic inference rules) automatically from data.
Acknowledgements.
We thank Baoxiong Jia for helpful discussions on the generalized Earley parser. The work reported herein is supported by ARO W911NF-18-1-0296, DARPA XAI N66001-17-2-4029, and ONR MURI N00014-16-1-2007.