Integrating robust connectionist learning and sound symbolic reasoning is a key challenge in modern Artificial Intelligence. Deep neural networks (lecun2015deep; lecun1995convolutional; hochreiter1997long) provide powerful and flexible representation learning that has achieved state-of-the-art performance across a variety of AI tasks such as image classification (krizhevsky2012imagenet; szegedy2015going; he2016deep), machine translation (sutskever2014sequence), and speech recognition (graves2013speech). However, many aspects of human cognition, such as systematic compositionality and generalization (fodor1988connectionism; marcus1998rethinking; fodor2002compositionality; calvo2014architecture; marcus2018algebraic; Lake2018Generalization), cannot be captured by neural networks. On the other hand, symbolic reasoning supports strong abstraction and generalization but is fragile and inflexible. Consequently, many methods have focused on building neural-symbolic models to combine the best of deep representation learning and symbolic reasoning (sun1994integrating; garcez2008neural; bader2009extracting; besold2017neural; yi2018neural).
Recently, this neural-symbolic paradigm has been extensively explored in tasks such as visual question answering (VQA) (yi2018neural; vedantam2019probabilistic; mao2019neuro), vision-language navigation (anderson2018vision; fried2018speaker), embodied question answering (das2018embodied; das2018neural), and semantic parsing (liang2016neural; yin2018structvae), often with weak supervision. Concretely, for these tasks, neural networks are used to map raw signals (images/questions/instructions) to symbolic representations (scenes/programs/actions), which are then used to perform symbolic reasoning/execution to generate final outputs. Weak supervision in these tasks usually provides pairs of raw inputs and final outputs, with intermediate symbolic representations unobserved. Since symbolic reasoning is non-differentiable, previous methods usually learn the neural-symbolic models by policy gradient methods like REINFORCE. Policy gradient methods generate samples and update the policy based on the generated samples that happen to hit high cumulative rewards. No effort is made to improve each generated sample to increase its cumulative reward. Learning is therefore time-consuming, because it requires generating a large number of samples over a large latent space of symbolic representations with sparse rewards, in the hope that some samples are lucky enough to hit high rewards and can then be used to update the policy. As a result, policy gradient methods converge slowly or even fail to converge without pre-training the neural networks on fully-supervised data.
To model the recursive compositionality in a sequence of symbols, we introduce a grammar model to bridge neural perception and symbolic reasoning. The structured symbolic representation often exhibits compositional and recursive properties over its individual symbols. Correspondingly, the grammar model encodes a symbolic prior about composition rules and can thus dramatically reduce the solution space by parsing the sequence of symbols into valid sentences. For example, in the handwritten formula recognition problem, the grammar model ensures that the predicted formula is always valid, as shown in Figure 1.
To make the neural-symbolic learning more efficient, we propose a novel back-search strategy which mimics the human ability to learn from failures via abductive reasoning (magnani2009abductive; zhou2019abductive). Specifically, the back-search algorithm propagates the error from the root node to the leaf nodes in the reasoning tree and finds the most probable correction that can generate the desired output. The correction is further used as a pseudo label for training the neural network. Figure 1 shows an exemplar backward pass of the back-search algorithm. We argue that the back-search algorithm makes a first step towards closing the learning loop by propagating the error through the non-differentiable grammar parsing and symbolic reasoning modules. We also show in Subsubsection 3.2.3 that the proposed multi-step back-search algorithm can serve as a Metropolis-Hastings sampler which samples the posterior distribution of the symbolic representations in the maximum likelihood estimation.
We conduct experiments on two weakly-supervised neural-symbolic tasks: (1) handwritten formula recognition on the newly introduced HWF dataset (Hand-Written Formula), where the input image and the formula result are given during training, while the formula is hidden; (2) visual question answering on the CLEVR dataset. The question, image, and answer are given, while the functional program generated by the question is hidden. The evaluation results show that the proposed Neural-Grammar-Symbolic (NGS) model with back-search significantly outperforms the baselines in terms of performance, convergence speed, and data efficiency. The ablative experiments also demonstrate the efficacy of the multi-step back-search algorithm and the incorporation of grammar in the neural-symbolic model.
2 Related Work
Neural-symbolic Integration. Researchers in the AI community have proposed to combine statistical learning and symbolic reasoning, with pioneering works devoted to different aspects including representation learning and reasoning (sun1994integrating; garcez2008neural; manhaeve2018deepproblog), abductive learning (dai2017combining; dai2019bridging; zhou2019abductive), knowledge abstraction (hinton2006fast; bader2009extracting), knowledge transfer (falkenhainer1989structure; yang2009heterogeneous), etc. Recent research shifts the focus to the application of neural-symbolic integration, where a large amount of heterogeneous data and knowledge descriptions are needed, such as neural-symbolic VQA (yi2018neural; vedantam2019probabilistic; mao2019neuro), semantic parsing in Natural Language Processing (NLP) (liang2016neural; yin2018structvae), math word problems (lample2019deep; lee2019mathematical) and program synthesis (evans2018learning; kalyan2018neural; manhaeve2018deepproblog). Different from previous methods, the proposed NGS model considers the compositionality and recursivity in natural sequences of symbols and brings together the neural perception and symbolic reasoning modules with a grammar model.
Grammar Model. Grammar models have been adopted in various tasks for their advantage in modeling compositional and recursive structures, like image parsing (zhao2011image), video parsing (gupta2009understanding; qi2018generalized), scene understanding (huang2018holistic; jiang2018configurable), and task planning (xu2018unsupervised). By integrating the grammar into the neural-symbolic task as a symbolic prior for the first time, the grammar model ensures the desired dependencies and structures for the symbol sequence and generates valid sentences for symbolic reasoning. Furthermore, it greatly shrinks the search space of the back-search algorithm, thus improving the learning efficiency significantly.
Policy Gradient. Policy gradient methods like REINFORCE (williams1992simple) are the most commonly used algorithms for neural-symbolic tasks to bridge the learning gap between neural networks and symbolic reasoning (mascharka2018transparency; mao2019neuro; andreas2017modular; das2018neural; bunel2018leveraging; guu2017language). However, the original REINFORCE algorithm suffers from large sample estimate variance, sparse rewards from cold start, and the exploitation-exploration dilemma, which lead to unstable learning dynamics and poor data efficiency. Many papers propose to tackle this problem (liang2016neural; guu2017language; Liang2018MemoryAP; wang2018mathdqn; agarwal2019learning). Specifically, liang2016neural uses iterative maximum likelihood to find pseudo-gold symbolic representations, and then adds these representations to the REINFORCE training set. guu2017language combines the systematic beam search employed in maximum marginal likelihood with the greedy randomized exploration of REINFORCE. Liang2018MemoryAP proposes Memory Augmented Policy Optimization (MAPO) to express the expected return objective as a weighted sum of an expectation over the high-reward history trajectories, and a separate expectation over new trajectories. Although they utilize positive representations from either beam search or the past training process, these methods still cannot learn from negative samples and thus fail to explore the solution space efficiently. In contrast, we propose to diagnose and correct the negative samples through the back-search algorithm under the constraint of grammar and symbolic reasoning rules. Intuitively speaking, the proposed back-search algorithm traverses the neighborhood of a negative sample and finds a nearby positive sample to help the training.
3 Neural-Grammar-Symbolic Model (NGS)
In this section, we will first describe the inference and learning algorithms of the proposed neural-grammar-symbolic (NGS) model. Then we provide an interpretation of our model based on maximum likelihood estimation (MLE) and draw the connection between the proposed back-search algorithm and Metropolis-Hastings sampler. We further introduce the task-specific designs in Section 4.
In a neural-symbolic system, let $x$ be the input (e.g., an image or question), $z$ be the hidden symbolic representation, and $y$ be the desired output inferred by $z$. The proposed NGS model combines neural perception, grammar parsing, and symbolic reasoning modules efficiently to perform the inference.
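Neural Perception. The neural network is used as a perception module which maps the high-dimensional input $x$ to a normalized probability distribution of the hidden symbolic representation $z$:
$$p_\theta(z|x) = \mathrm{softmax}(\phi_\theta(z, x)) = \frac{\exp(\phi_\theta(z, x))}{\sum_{z'} \exp(\phi_\theta(z', x))}, \tag{1}$$
where $\phi_\theta(z, x)$ is a scoring function or a negative energy function represented by a neural network with parameters $\theta$.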
Grammar Parsing. Take $z$ as a sequence of individual symbols: $z = (z_1, z_2, \dots, z_l)$, $z_i \in \Sigma$, where $\Sigma$ denotes the vocabulary of possible symbols. The neural network is powerful at modeling the mapping between $x$ and $z$, but the recursive compositionality among the individual symbols $z_i$ is not well captured. Grammar is a natural choice to tackle this problem by modeling the compositional properties in sequence data.
Take the context-free grammar (CFG) as an example. In formal language theory, a CFG is a type of formal grammar containing a set of production rules that describe all possible sentences in a given formal language. Specifically, a context-free grammar $G$ in Chomsky Normal Form is defined by a 4-tuple $G = (V, \Sigma, R, S)$, where
$V$ is a finite set of non-terminal symbols that can be replaced by/expanded to a sequence of symbols.
$\Sigma$ is a finite set of terminal symbols that represent actual words in a language, which cannot be further expanded. Here $\Sigma$ is the vocabulary of possible symbols.
$R$ is a finite set of production rules describing the replacement of symbols, typically of the form $A \to BC$ or $A \to \alpha$, where $A, B, C \in V$ and $\alpha \in \Sigma$. A production rule replaces the left-hand side non-terminal symbol by the right-hand side expression. For example, $A \to BC \mid \alpha$ means that $A$ can be replaced by either $BC$ or $\alpha$.
$S \in V$ is the start symbol.
Given a formal grammar, parsing is the process of determining whether a string of symbolic nodes can be accepted according to the production rules in the grammar. If the string is accepted by the grammar, the parsing process generates a parse tree. A parse tree represents the syntactic structure of a string according to a given CFG. The root node of the tree is the grammar root. Other non-leaf nodes correspond to non-terminals in the grammar, expanded according to the grammar production rules. The leaf nodes are terminal nodes. All the leaf nodes together form a sentence.
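As an illustration of the acceptance test described above, the following is a minimal sketch of a CYK recognizer for a grammar in Chomsky Normal Form. The toy grammar (non-terminals `E`, `T`, `O` over single digits and four operators) is a hypothetical example chosen for this sketch, not the grammar used in the experiments, and CYK is a textbook stand-in rather than the parser adopted in this work.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky Normal Form for single-digit arithmetic:
#   E -> E T | digit      T -> O E      O -> + | - | * | /
BINARY = {("E", "T"): {"E"}, ("O", "E"): {"T"}}
TERMINAL = {**{d: {"E"} for d in "0123456789"},
            **{o: {"O"} for o in "+-*/"}}

def cyk_accepts(symbols, start="E"):
    """Return True if the symbol sequence is a sentence of the toy grammar."""
    n = len(symbols)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive symbols[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, s in enumerate(symbols):
        table[i][0] = set(TERMINAL.get(s, set()))
    for span in range(2, n + 1):              # length of the substring
        for i in range(n - span + 1):         # start of the substring
            for split in range(1, span):      # length of the left part
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for b, c in product(left, right):
                    table[i][span - 1] |= BINARY.get((b, c), set())
    return start in table[0][n - 1]
```

For instance, `cyk_accepts(list("1+2*3"))` accepts the sequence, while `cyk_accepts(list("1+"))` rejects it because no derivation from the start symbol covers the whole string.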
In neural-symbolic tasks, the objective of parsing is to find the most probable $z$ that can be accepted by the grammar:
$$\hat{z} = \arg\max_{z \in L(G)} p_\theta(z|x), \tag{3}$$
where $L(G)$ denotes the language of $G$, i.e., the set of all valid $z$ that are accepted by $G$.
Traditional grammar parsers can only work on symbolic sentences. qi2018generalized proposes a generalized version of the Earley parser, which takes a probability sequence as input and outputs the most probable parse. We use this method to compute the best parse $\hat{z}$ in Equation 3.
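The effect of the grammar constraint on the prediction can be sketched with a brute-force search. This is only an illustration under simplifying assumptions: the distribution is assumed to factorize over positions, the validity check `is_valid` is a hypothetical stand-in for membership in the language (digits and operators must alternate), and exhaustive enumeration replaces the generalized Earley parser, which is far more efficient.

```python
from itertools import product
import math

VOCAB = list("0123456789+-*/")

def is_valid(seq):
    """Hypothetical validity check: odd length, digits and operators alternate."""
    if len(seq) % 2 == 0:
        return False
    return all(s.isdigit() if i % 2 == 0 else s in "+-*/"
               for i, s in enumerate(seq))

def constrained_argmax(probs):
    """probs: one dict per position mapping symbol -> probability.
    Returns the most probable valid sequence and its probability."""
    best, best_p = None, 0.0
    for seq in product(VOCAB, repeat=len(probs)):
        if not is_valid(seq):
            continue
        p = math.prod(dist.get(s, 0.0) for dist, s in zip(probs, seq))
        if p > best_p:
            best, best_p = seq, p
    return best, best_p

# The unconstrained per-position argmax would be "1", "1", "2", which is not a formula.
probs = [{"1": 0.9, "7": 0.1}, {"1": 0.6, "+": 0.4}, {"2": 1.0}]
best, best_p = constrained_argmax(probs)
```

Here the middle position prefers the digit `1`, but the only valid readings place an operator there, so the search returns `("1", "+", "2")`.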
Symbolic Reasoning. Given the parsed symbolic representation $\hat{z}$, the symbolic reasoning module performs deterministic inference with $\hat{z}$ and the domain-specific knowledge $\Delta$. Formally, we want to find the entailed sentence $\hat{y}$ given $\hat{z}$ and $\Delta$:
$$\hat{y}: \hat{z} \wedge \Delta \models \hat{y}.$$
Since the inference process is deterministic, we re-write the above equation as:
$$\hat{y} = f(\hat{z}; \Delta),$$
where $f$ denotes complete inference rules under the domain $\Delta$. The inference rules generate a reasoning path $\hat{\tau}$ that leads to the predicted output $\hat{y}$ from $\hat{z}$ and $\Delta$. The reasoning path $\hat{\tau}$ has a tree structure with the root node $\hat{y}$ and the leaf nodes from $\hat{z}$ or $\Delta$.
It is challenging to obtain the ground truth of the symbolic representation $z$, and the rules (i.e., grammar rules and the symbolic inference rules) are usually designed explicitly by human knowledge. We formulate the learning process as weakly-supervised learning of the neural network model $p_\theta(z|x)$, where the symbolic representation $z$ is missing, and the grammar model $G$, the domain-specific language $\Delta$, and the symbolic inference rules $f$ are given.
3.2.1 1-step back-search (1-BS)
As shown in Figure 1, previous methods using policy gradient to learn the model discard all the samples with zero reward and learn nothing from them. This makes the learning process inefficient and unstable. However, humans can learn from wrong predictions by diagnosing and correcting the wrong answers according to the desired outputs with top-down reasoning. Based on this observation, we propose a 1-step back-search (1-BS) algorithm which can correct wrong samples and use the corrections as pseudo labels for training. The 1-BS algorithm closes the learning loop since the error can also be propagated through the non-differentiable grammar parsing and symbolic reasoning modules. Specifically, we find the most probable correction for the wrong prediction by back-tracking the symbolic reasoning tree and propagating the error from the root node into the leaf nodes in a top-down manner.
The 1-BS algorithm is implemented with a priority queue as shown in Algorithm 1. The 1-BS gradually searches down the reasoning tree $\hat{\tau}$ starting from the root node to the leaf nodes. Specifically, each element in the priority queue represents a valid change, defined as a 3-tuple $(A, \alpha_A, p)$:
$A \in V \cup \Sigma$ is the current visiting node.
$\alpha_A$ is the expected value on this node, which means if the value of $A$ is changed to $\alpha_A$, $\hat{z}$ will execute to the ground-truth answer $y$, i.e., $y = f(\hat{z}(A \to \alpha_A); \Delta)$.
$p$ is the visiting priority, which reflects the potential of changing the value of $A$.
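Formally, the priority for this change is defined as the probability ratio:
$$p(A \to \alpha_A) = \begin{cases} \dfrac{1 - p(A)}{p(A)}, & \text{if } A \notin \Sigma, \\[2ex] \dfrac{p(\alpha_A)}{p(A)}, & \text{if } A \in \Sigma \text{ and } \alpha_A \in \Sigma, \end{cases}$$
where $p(A)$ is calculated as in Equation 1 if $A \in \Sigma$; otherwise, it is defined as the product of the probabilities of all leaf nodes in $A$. If $A \in \Sigma$ and $\alpha_A \notin \Sigma$, it means we need to correct the terminal node to a value that is not in the vocabulary. Therefore, this change is not possible and should be discarded.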
The error propagation through the reasoning tree is achieved by a $solve(B, A, \alpha_A \mid \Delta, G)$ function, which aims at computing the expected value $\alpha_B$ of the child node $B$ from the expected value $\alpha_A$ of its parent node $A$, i.e., finding $\alpha_B$ satisfying $f(\dots, \alpha_B, \dots; \Delta) = \alpha_A$. Please refer to the supplementary material for some illustrative examples of the 1-BS process.
In the 1-BS, we make a greedy assumption that only one symbol can be replaced at a time. This assumption implies only searching the neighborhood of $\hat{z}$ at one-step distance. In Subsubsection 3.2.3, the 1-BS is extended to the multi-step back-search algorithm, which allows searching beyond one-step distance.
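The search described in this subsection can be sketched for arithmetic formulas as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the tuple-based tree, the helper names (`parse`, `leaves`, `value`, `solve`, `one_step_back_search`), the use of Python's `heapq` as the priority queue, and the final re-evaluation of each candidate are all choices made for this sketch. Leaf probabilities are assumed to be given per position as dictionaries.

```python
import heapq
from fractions import Fraction
from itertools import count
from math import prod

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b,
       "/": lambda a, b: a / b if b != 0 else None}

def parse(tokens):
    """Build a binary tree from (index, symbol) pairs; '*' and '/' bind tighter."""
    for level in ("+-", "*/"):
        for i in range(len(tokens) - 2, 0, -2):   # rightmost first => left-associative
            if tokens[i][1] in level:
                return ("op", tokens[i][0], parse(tokens[:i]), parse(tokens[i + 1:]))
    return ("leaf", tokens[0][0])

def leaves(node):
    return [node[1]] if node[0] == "leaf" else leaves(node[2]) + [node[1]] + leaves(node[3])

def value(node, z):
    if node[0] == "leaf":
        return Fraction(int(z[node[1]]))
    l, r = value(node[2], z), value(node[3], z)
    return None if l is None or r is None else OPS[z[node[1]]](l, r)

def solve(op, sibling, target, child_is_left):
    """Expected child value that makes the parent equal `target` (None if impossible)."""
    if op == "+":
        return target - sibling
    if op == "-":
        return target + sibling if child_is_left else sibling - target
    if sibling == 0:
        return None
    if op == "*":
        return target / sibling
    if child_is_left:                                    # x / sibling = target
        return target * sibling
    return sibling / target if target != 0 else None    # sibling / x = target

def one_step_back_search(z, probs, y):
    """z: list of predicted symbols; probs: per-position dict symbol -> probability."""
    p_sym = lambda i, s: max(probs[i].get(s, 0.0), 1e-12)
    p_node = lambda n: prod(p_sym(i, z[i]) for i in leaves(n))
    tie, heap = count(), []

    def push(priority, item):                            # heapq is a min-heap
        heapq.heappush(heap, (-priority, next(tie), item))

    def propose(node, target):
        if target is None:
            return
        if node[0] == "leaf":                            # terminal: target must be a digit
            if target.denominator == 1 and 0 <= target <= 9:
                s = str(target.numerator)
                if s != z[node[1]]:
                    push(p_sym(node[1], s) / p_sym(node[1], z[node[1]]), ("fix", node[1], s))
        else:
            p = p_node(node)
            push((1 - p) / p, ("visit", node, target))

    propose(parse(list(enumerate(z))), Fraction(y))
    while heap:
        _, _, item = heapq.heappop(heap)
        if item[0] == "fix":
            fixed = z[:item[1]] + [item[2]] + z[item[1] + 1:]
            if value(parse(list(enumerate(fixed))), fixed) == y:   # re-check the candidate
                return fixed
            continue
        _, (_, i, left, right), target = item
        l, r = value(left, z), value(right, z)
        if l is None or r is None:
            continue
        for op in OPS:                                   # try the other operators
            if op != z[i] and OPS[op](l, r) == target:
                push(p_sym(i, op) / p_sym(i, z[i]), ("fix", i, op))
        propose(left, solve(z[i], r, target, True))
        propose(right, solve(z[i], l, target, False))
    return None
```

With the prediction `["1", "+", "2", "*", "3"]`, a desired result of 9, and a last position that assigns probability 0.5 to `3` and 0.4 to `4`, the search descends into the product sub-tree and returns `["1", "+", "2", "*", "4"]`; the alternative correction of the first digit to `3` has a much lower priority when that position is confident about `1`.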
3.2.2 Maximum Likelihood Estimation
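Since $z$ is conditioned on $x$ and $y$ is conditioned on $z$, the likelihood for the observation $(x, y)$ marginalized over $z$ is:
$$p(y|x) = \sum_z p(y, z|x) = \sum_z p(y|z)\, p_\theta(z|x).$$
The learning goal is to maximize the observed-data log likelihood $L(x, y) = \log p(y|x)$.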
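By taking the derivative, the gradient for the parameter $\theta$ is given by
$$\nabla_\theta L(x, y) = \nabla_\theta \log p(y|x) = \frac{1}{p(y|x)} \nabla_\theta p(y|x) = \sum_z \frac{p(y|z)\, p_\theta(z|x)}{\sum_{z'} p(y|z')\, p_\theta(z'|x)} \nabla_\theta \log p_\theta(z|x) = \mathbb{E}_{z \sim p(z|x,y)}\left[\nabla_\theta \log p_\theta(z|x)\right], \tag{9}$$
where $p(z|x, y)$ is the posterior distribution of $z$ given $x, y$. Since $p(y|z)$ is computed by the symbolic reasoning module and can only be 0 or 1, $p(z|x, y)$ can be written as:
$$p(z|x, y) = \frac{p(y|z)\, p_\theta(z|x)}{\sum_{z'} p(y|z')\, p_\theta(z'|x)} = \begin{cases} 0, & z \notin Q, \\[1ex] \dfrac{p_\theta(z|x)}{\sum_{z' \in Q} p_\theta(z'|x)}, & z \in Q, \end{cases}$$
where $Q = \{z : p(y|z) = 1\} = \{z : f(z; \Delta) = y\}$ is the set of $z$ that generates $y$. Usually $Q$ is a very small subset of the whole space of $z$.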
Equation 9 indicates that $z$ is sampled from the posterior distribution $p(z|x, y)$, which only has non-zero probabilities on $Q$, instead of the whole space of $z$. Unfortunately, computing the posterior distribution is not efficient, as evaluating the normalizing constant for this distribution requires summing over all possible $z$, and the computational complexity of the summation grows exponentially.
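For a space small enough to enumerate, the posterior can be computed exactly, which makes its sparsity concrete. The sketch below assumes a hypothetical toy space of three-symbol formulas (digit, operator, digit) and a uniform prior in place of a trained network; the names `execute` and `posterior` are introduced only for this illustration.

```python
from itertools import product
from fractions import Fraction

DIGITS, OPERATORS = "0123456789", "+-*/"

def execute(z):
    """Toy symbolic executor for a 'digit op digit' formula (None on division by zero)."""
    a, op, b = int(z[0]), z[1], int(z[2])
    if op == "+":
        return Fraction(a + b)
    if op == "-":
        return Fraction(a - b)
    if op == "*":
        return Fraction(a * b)
    return Fraction(a, b) if b != 0 else None

def posterior(prior, y):
    """prior: callable z -> prior probability of z.
    Returns the posterior over the consistent set and the share of the space it covers."""
    space = list(product(DIGITS, OPERATORS, DIGITS))
    consistent = {z: prior(z) for z in space if execute(z) == y}
    total = sum(consistent.values())
    post = {z: p / total for z, p in consistent.items()}
    return post, len(consistent) / len(space)

# Uniform prior over the 400 valid formulas, desired output 7.
post, share = posterior(lambda z: 1 / 400, 7)
```

In this toy case only 14 of the 400 formulas evaluate to 7, so the posterior puts all of its mass on 3.5% of the space. The enumeration is affordable here only because the formulas have three symbols; the number of candidates grows exponentially with the sequence length.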
Nonetheless, it is feasible to design algorithms that sample from this distribution using Markov chain Monte Carlo (MCMC). Since $z$ is always trapped in the modes where $p(z|x, y) = 0$, the remaining question is how we can sample the posterior distribution $p(z|x, y)$ efficiently to avoid redundant random walks at states with zero probabilities.
3.2.3 m-BS as Metropolis-Hastings Sampler
In order to perform efficient sampling, we extend the 1-step back-search to a multi-step back-search (m-BS), which serves as a Metropolis-Hastings sampler.
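A Metropolis-Hastings sampler for a probability distribution $p(s)$ is an MCMC algorithm that makes use of a proposal distribution $Q(s'|s)$ from which it draws samples, and uses an acceptance/rejection scheme to define a transition kernel with the desired distribution $p(s)$. Specifically, given the current state $s$, a sample $s' \neq s$ drawn from $Q(s'|s)$ is accepted as the next state with probability
$$A(s, s') = \min\left\{1, \frac{p(s')\, Q(s|s')}{p(s)\, Q(s'|s)}\right\}.$$
Since it is impossible to jump between the states with zero probability, we define $p'(z|x, y)$ as a smoothing of $p(z|x, y)$ by adding a small constant $\epsilon$ to $p(y|z)$:
$$p'(z|x, y) = \frac{[p(y|z) + \epsilon]\, p_\theta(z|x)}{\sum_{z'} [p(y|z') + \epsilon]\, p_\theta(z'|x)}.$$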
As shown in Algorithm 2, in each step, the m-BS proposes a 1-BS search with probability $\lambda$ ($\lambda < 1$) and a random walk with probability $1 - \lambda$. The combination of 1-BS and random walk helps the sampler to traverse all the states with non-zero probabilities and ensures that the Markov chain is ergodic.
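The acceptance/rejection scheme of a generic Metropolis-Hastings sampler can be sketched on a small discrete state space. This is a simplified illustration, not Algorithm 2: it assumes a symmetric proposal that moves to a neighbouring state on a ring, so the proposal terms cancel in the acceptance ratio, and the function name and the weights are hypothetical. All weights must be strictly positive, which is the role played by the smoothing constant in the text.

```python
import random

def metropolis_hastings(weights, n_steps, seed=0):
    """Sample states 0..K-1 with probability proportional to `weights`.
    Returns the empirical visit frequency of each state."""
    rng = random.Random(seed)
    k, state = len(weights), 0
    counts = [0] * k
    for _ in range(n_steps):
        # Symmetric proposal: step to the left or right neighbour on a ring.
        proposal = (state + rng.choice((-1, 1))) % k
        # Acceptance probability min(1, p(s') / p(s)); Q cancels by symmetry.
        accept = min(1.0, weights[proposal] / weights[state])
        if rng.random() < accept:
            state = proposal
        counts[state] += 1
    return [c / n_steps for c in counts]

freqs = metropolis_hastings([1, 2, 4, 2, 1], n_steps=100_000)
```

The visit frequencies approach the normalized weights (0.1, 0.2, 0.4, 0.2, 0.1) as the number of steps grows. The sampler only needs ratios of unnormalized weights, which is why the intractable normalizing constant of the posterior never has to be computed.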
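Random walk: We define a Poisson distribution for the random walk as
$$Q(z_2|z_1) = \mathrm{Poisson}(d(z_1, z_2); \beta),$$
where $d(z_1, z_2)$ denotes the edit distance between $z_1$ and $z_2$, and $\beta$ is equal to the expected value of $d$ and also to its variance. $\beta$ is set to 1 in most cases due to the preference for a short-distance random walk. The acceptance ratio for sampling a $z_2$ from $Q(z_2|z_1)$ is $\alpha = \min(1, r(z_1, z_2))$, where
$$r(z_1, z_2) = \frac{q(z_2)\,(1 - \lambda)\, Q(z_1|z_2)}{q(z_1)\,(1 - \lambda)\, Q(z_2|z_1)}.$$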
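1-BS: While proposing $z_2$ with 1-BS, we search for a $z_2$ that satisfies $p(y|z_2) = 1$. If $z_2$ is proposed, the acceptance ratio for $z_2$ is $\alpha = \min(1, r(z_1, z_2))$, where
$$r(z_1, z_2) = \frac{q(z_2) \cdot [0 + (1 - \lambda)\, Q(z_1|z_2)]}{q(z_1) \cdot [\lambda \cdot 1 + (1 - \lambda)\, Q(z_2|z_1)]}.$$
Here $q(z) = [p(y|z) + \epsilon]\, p_\theta(z|x)$ denotes the numerator of $p'(z|x, y)$. With a small enough $\epsilon$, $q(z_2) \gg q(z_1)$ and $r(z_1, z_2) > 1$, so we will always accept $z_2$.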
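To see why a small smoothing constant makes the back-search proposal almost always accepted, the ratio can be expanded as follows. This is a sketch under stated assumptions: $q(z) = [p(y|z) + \epsilon]\, p_\theta(z|x)$, the current state $z_1$ is wrong ($p(y|z_1) = 0$), the proposed state $z_2$ is correct ($p(y|z_2) = 1$), and the reverse move from $z_2$ to $z_1$ can only be produced by the random walk.

$$r(z_1, z_2) = \frac{q(z_2)}{q(z_1)} \cdot \frac{(1 - \lambda)\, Q(z_1|z_2)}{\lambda + (1 - \lambda)\, Q(z_2|z_1)} = \frac{1 + \epsilon}{\epsilon} \cdot \frac{p_\theta(z_2|x)}{p_\theta(z_1|x)} \cdot \frac{(1 - \lambda)\, Q(z_1|z_2)}{\lambda + (1 - \lambda)\, Q(z_2|z_1)}$$

Every factor except $(1 + \epsilon)/\epsilon$ is a fixed positive number for a given pair of states, and $(1 + \epsilon)/\epsilon$ grows without bound as $\epsilon \to 0$, so $r(z_1, z_2) > 1$ once $\epsilon$ is small enough.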
Notably, the 1-BS algorithm tries to move the current state $z_1$ into a state $z_2 = \text{1-BS}(z_1, y)$, making movements in directions of increasing posterior probability. Similar to gradient-based MCMC methods like Langevin dynamics (duane1986theory; welling2011bayesian), this is the main reason that the proposed method can sample the posterior efficiently.
3.2.4 Comparison with Policy Gradient
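Since grammar parsing and symbolic reasoning are non-differentiable, most of the previous approaches for neural-symbolic learning use policy gradient methods like REINFORCE to learn the neural network. Treating $p_\theta(z|x)$ as the policy function, the reward given $z, y$ can be written as:
$$r(z, y) = \begin{cases} 0, & \text{if } f(z; \Delta) \neq y, \\ 1, & \text{if } f(z; \Delta) = y. \end{cases}$$
The learning objective is to maximize the expected reward under the current policy $p_\theta$:
$$R(x, y) = \mathbb{E}_{z \sim p_\theta(z|x)}\, r(z, y) = \sum_z p_\theta(z|x)\, r(z, y).$$
Then the gradient for $\theta$ is:
$$\nabla_\theta R(x, y) = \sum_z r(z, y)\, p_\theta(z|x)\, \nabla_\theta \log p_\theta(z|x) = \mathbb{E}_{z \sim p_\theta(z|x)}\left[r(z, y)\, \nabla_\theta \log p_\theta(z|x)\right].$$
We can approximate the expectation using one sample at a time, which gives the REINFORCE algorithm:
$$\nabla_\theta R(x, y) \approx r(z, y)\, \nabla_\theta \log p_\theta(z|x) = \begin{cases} 0, & \text{if } f(z; \Delta) \neq y, \\ \nabla_\theta \log p_\theta(z|x), & \text{if } f(z; \Delta) = y, \end{cases} \qquad z \sim p_\theta(z|x).$$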
The REINFORCE estimator reveals that the gradient is non-zero only when the sampled $z$ satisfies $f(z; \Delta) = y$. However, among the whole space of $z$, only a very small portion can generate the desired $y$, which implies that REINFORCE will get zero gradients from most of the samples. This is why the REINFORCE method converges slowly or even fails to converge, as also shown by the experiments in Section 4.
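The sparsity of the reward can be quantified exactly on a toy problem. The sketch below assumes a hypothetical space of three-symbol sequences over a 14-symbol vocabulary and a uniform policy (as an untrained network might approximate); it computes the probability that one sample receives a non-zero reward, with and without restricting samples to well-formed formulas. The function names are introduced only for this illustration.

```python
from itertools import product

VOCAB = "0123456789+-*/"

def well_formed(z):
    """A three-symbol sequence is a formula if it reads digit, operator, digit."""
    return z[0].isdigit() and z[1] in "+-*/" and z[2].isdigit()

def reward(z, y):
    """1 if z is a well-formed formula that evaluates to the integer y, else 0."""
    if not well_formed(z):
        return 0
    a, b = int(z[0]), int(z[2])
    if z[1] == "/":
        return int(b != 0 and a == y * b)
    return int({"+": a + b, "-": a - b, "*": a * b}[z[1]] == y)

def hit_rate(y, grammar=False):
    """Exact probability that a uniformly sampled z earns a non-zero reward."""
    space = [z for z in product(VOCAB, repeat=3) if not grammar or well_formed(z)]
    return sum(reward(z, y) for z in space) / len(space)

unconstrained = hit_rate(7)            # 14 rewarding samples out of 14**3 = 2744
constrained = hit_rate(7, grammar=True)  # 14 rewarding samples out of 400
```

Under these assumptions roughly 0.5% of unconstrained samples yield any gradient signal, and the grammar restriction raises this to 3.5%. Both rates shrink quickly for longer formulas, which is consistent with the slow convergence described above.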
4 Experiments and Results
4.1 Handwritten Formula Recognition
4.1.1 Experimental Setup
Task definition. The handwritten formula recognition task is to recognize each mathematical symbol given a raw image of the handwritten formula. We learn this task in a weakly-supervised manner, where the raw image of the handwritten formula is given as input data $x$, and the computed result of the formula is treated as the output $y$. The symbolic representation $z$, which represents the ground-truth formula composed of individual symbols, is hidden. Our task is to predict the formula, which can then be executed to calculate the final result.
HWF Dataset. We generate the HWF dataset based on the CROHME 2019 Offline Handwritten Formula Recognition Task (https://www.cs.rit.edu/~crohme2019/task.html). First, we extract all symbols from CROHME and only keep the ten digits (0-9) and four basic operators (+, −, ×, ÷). Then we generate formulas by sampling from a pre-defined grammar that only considers arithmetic operations over single-digit numbers. For each formula, we randomly select symbol images from CROHME. Overall, our dataset contains 10K training formulas and 2K test formulas.
Evaluation Metrics. We report both the calculation accuracy (i.e., whether the calculation of the predicted formula yields the correct result) and the symbol recognition accuracy (i.e., whether each symbol is recognized correctly from the image) on the synthetic dataset.
Models. In this task, we use LeNet (lecun2015lenet) as the neural perception module to process the handwritten formula. Before being fed into LeNet, the original image of a formula is pre-segmented into a sequence of sub-images, each containing only one symbol. The symbolic reasoning module works like a calculator, and each inference step computes the parent value given the values of two child nodes (left/right) and the operator. The $solve$ function in the 1-step back-search algorithm works in the following way for mathematical formulas:
If $B$ is $A$'s left or right child, we directly solve the equation $\alpha_B \oplus childR(A) = \alpha_A$ or $childL(A) \oplus \alpha_B = \alpha_A$ to get $\alpha_B$, where $\oplus$ denotes the operator.
If $B$ is an operator node, we try all other operators and check whether the new formula can generate the correct result.
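As a worked illustration with hypothetical numbers, suppose the predicted formula is $8 - 3 \times 2$ (which evaluates to 2) while the desired result is $y = 4$. Writing $\alpha$ for the expected value of the node under consideration, the error is propagated from the root to the leaves roughly as follows:

$$\begin{aligned}
&\text{root } 8 - (3 \times 2): && \alpha = 4 \\
&\text{left child } 8: && \alpha - 6 = 4 \;\Rightarrow\; \alpha = 10 \notin \{0, \dots, 9\} \quad \text{(discarded)} \\
&\text{right child } 3 \times 2: && 8 - \alpha = 4 \;\Rightarrow\; \alpha = 4 \\
&\quad \text{left child } 3: && \alpha \times 2 = 4 \;\Rightarrow\; \alpha = 2 \quad \text{(valid, since } 8 - 2 \times 2 = 4\text{)} \\
&\quad \text{right child } 2: && 3 \times \alpha = 4 \;\Rightarrow\; \alpha = 4/3 \quad \text{(discarded)}
\end{aligned}$$

No single operator replacement yields 4 either, so in this example the only one-symbol correction is changing the digit 3 to 2.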
We conduct experiments by comparing the following variants of the proposed model:
NGS-RL: learning the NGS model with REINFORCE.
NGS-MAPO: learning the NGS model by Memory Augmented Policy Optimization (MAPO) (Liang2018MemoryAP), which leverages a memory buffer of rewarding samples to reduce the variance of policy gradient estimates.
NGS-RL-Pretrain: NGS-RL with LeNet pre-trained on a small set of fully-supervised data.
NGS-MAPO-Pretrain: NGS-MAPO with pre-trained LeNet.
NGS-m-BS: learning the NGS model with the proposed m-step back-search algorithm.
4.1.2 Results and Analyses
Learning Curve. Figure 2 shows the learning curves of different models. The proposed NGS-m-BS converges much faster and achieves higher accuracy compared with other models. NGS-RL fails without pre-training and rarely improves during the entire training process. NGS-MAPO can learn the model without pre-training, but it takes a long time to start efficient learning, which indicates that MAPO suffers from the cold-start problem and needs time to accumulate rewarding samples. Pre-training the LeNet solves the cold start problem for NGS-RL and NGS-MAPO. However, the training curves for these two models are quite noisy and are hard to converge even after 100k iterations. Our NGS-m-BS model learns from scratch and avoids the cold-start problem. It converges quickly with nearly perfect accuracy, with a much smoother training curve than the RL baselines.
Back-Search Step. Figure 3 illustrates the comparison of the various number of steps in the multi-step back-search algorithm. Generally, increasing the number of steps will increase the chances of correcting wrong samples, thus making the model converge faster. However, increasing the number of steps will also increase the time consumption of each iteration.
Data Efficiency. Table 1 and Table 2 show the accuracies on the test set when using various percentages of the training data. All models are trained for 15K iterations. NGS-m-BS turns out to be much more data-efficient than the RL methods. Specifically, when only using 25% of the training data, NGS-m-BS reaches a calculation accuracy of 93.3%, while NGS-MAPO only reaches 5.1%.
| Model | 25% | 50% | 75% | 100% |
| Model | 25% | 50% | 75% | 100% |
Qualitative Results. Figure 4 illustrates four examples of correcting wrong predictions with 1-BS. In the first two examples, the back-search algorithm successfully corrects the wrong predictions by changing a digit and an operator, respectively. In the third example, the back-search fails to correct the wrong sample. However, if we increase the number of search steps, the model can find a correction for the example. In the fourth example, the back-search finds a spurious correction, which is not the same as the ground-truth formula but generates the same result. Such a spurious correction brings a noisy gradient to the neural network update. How to avoid such spurious corrections remains an open problem.
4.2 Neural-Symbolic Visual Question Answering
4.2.1 Experimental Setup
Task. Following (yi2018neural), the neural-symbolic visual question answering task is to parse the question into a functional program and then use a program executor that runs the program on the structural scene representation to obtain the answer. The functional program is hidden.
Dataset. We evaluate the proposed method on the CLEVR dataset (johnson2017clevr). The CLEVR dataset is a popular benchmark for testing compositional reasoning capability of VQA models in previous works (johnson2017inferring; vedantam2019probabilistic). CLEVR consists of a training set of 70K images and 700K questions, and a validation set of 15K images and 150K questions. We use the VQA accuracy as the evaluation metric.
Models. We adopt the NS-VQA model in (yi2018neural) and replace the attention-based seq2seq question parser with a Pointer Network (Vinyals2015PointerN). We store a dictionary to map the keywords in each question to the corresponding functional modules, for example, “red” → “filter color [red]”, “how many” → “count”, and “what size” → “query size”. The Pointer Network can therefore point to the functional modules that are related to the input question. The grammar model ensures that the generated sequence of function modules forms a valid program, meaning that the inputs and outputs of these modules strictly match in form. We conduct experiments by comparing the following models: NS-RL, NGS-RL, NGS-1-BS, NGS-m-BS.
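The keyword dictionary can be pictured with a short sketch. Only the three mappings quoted above come from the text; the module-name spelling, the function `candidate_modules`, and the plain substring matching are assumptions made for this illustration and are cruder than what a real question parser would need.

```python
# Three keyword-to-module mappings taken from the examples in the text;
# the identifier spelling of the modules is hypothetical.
KEYWORD_TO_MODULE = {
    "red": "filter_color[red]",
    "how many": "count",
    "what size": "query_size",
}

def candidate_modules(question):
    """Return the modules whose keywords occur in the question, in order of appearance."""
    q = question.lower()
    hits = [(q.find(k), m) for k, m in KEYWORD_TO_MODULE.items() if k in q]
    return [m for _, m in sorted(hits)]

modules = candidate_modules("How many red cubes are there?")
```

The resulting list is the set of candidates a pointer-style parser could select from for this question; ordering them into an executable program is left to the parser and the grammar model.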
4.2.2 Results and Analyses
Learning Curve. Figure 5 shows the learning curves of different model variants. NGS-BS converges much faster and achieves higher VQA accuracy on the test set compared with the RL baselines. Although it takes a long time, NGS-RL does converge, while NS-RL fails. This indicates that the grammar model plays a critical role in this task. The latent functional program space is combinatorial, but the grammar model rules out all invalid programs that cannot be executed by the symbolic reasoning module, which greatly reduces the solution space in this task.
Back-Search Step. As shown in Figure 5, NGS-10-BS performs only slightly better than NGS-1-BS, which indicates that searching multiple steps does not help greatly in this task. One possible reason is that there are more ambiguities and more spurious examples compared with the handwritten formula recognition task, making the m-BS less efficient. For example, for the answer “yes”, there might be many possible programs for the question that generate the same answer given the image.
Data Efficiency. Table 3 shows the accuracies on the CLEVR validation set when different portions of training data are used. With less training data, the performance decreases for both NGS-RL and NGS-m-BS, but NGS-m-BS still consistently obtains higher accuracies.
| Model | 25% | 50% | 75% | 100% |
In this work, we propose a neural-grammar-symbolic model and a back-search algorithm to close the loop of neural-symbolic learning. We demonstrate that the grammar model can dramatically reduce the solution space by eliminating invalid possibilities in the latent representation space. The back-search algorithm endows the NGS model with the capability of learning from wrong samples, making the learning more stable and efficient. One future direction is to learn the symbolic prior (i.e., the grammar rules and symbolic inference rules) automatically from the data.
We thank Baoxiong Jia for helpful discussion on the generalized Earley Parser. The work reported herein is supported by ARO W911NF1810296, DARPA XAI N66001-17-2-4029, and ONR MURI N00014-16-1-2007.