Learning to Organize Knowledge with N-Gram Machines

11/17/2017 ∙ by Fan Yang, et al. ∙ Google, Mosaix, Carnegie Mellon University

Deep neural networks (DNNs) have had great success on NLP tasks such as language modeling, machine translation and certain question answering (QA) tasks. However, this success is limited on more knowledge-intensive tasks such as QA from a big corpus. Existing end-to-end deep QA models (Miller et al., 2016; Weston et al., 2014) need to read the entire text after observing the question, and therefore their complexity in responding to a question is linear in the text size. This is prohibitive for practical tasks such as QA from Wikipedia, a novel, or the Web. We propose to solve this scalability issue by using symbolic meaning representations, which can be indexed and retrieved efficiently with complexity that is independent of the text size. More specifically, we use sequence-to-sequence models to encode knowledge symbolically and generate programs to answer questions from the encoded knowledge. We apply our approach, called the N-Gram Machine (NGM), to the bAbI tasks (Weston et al., 2015) and a special version of them ("life-long bAbI") which has stories of up to 10 million sentences. Our experiments show that NGM can solve both of these tasks accurately and efficiently. Unlike fully differentiable memory models, NGM's time complexity and answering quality are not affected by the story length. The whole system of NGM is trained end-to-end with REINFORCE (Williams, 1992). To avoid high variance in gradient estimation, which is typical in discrete latent variable models, we use beam search instead of sampling. To tackle the exponentially large search space, we use a stabilized auto-encoding objective and a structure tweak procedure to iteratively reduce and refine the search space.




1 Introduction

Knowledge management and reasoning is an important task in Artificial Intelligence. It involves organizing information in the environment into a structured object (e.g. a knowledge storage). Moreover, the structured object is designed to enable complex querying by agents. In this paper, we focus on the case where information is represented in text. An exemplar task is question answering from a large corpus. Traditionally, the study of knowledge management and reasoning is divided into independent subtasks, such as Information Extraction dong2014vault ; mitchell2015nell and Semantic Parsing dong2016language ; jia2016data ; liang2017nsm. Though great progress has been made on each individual task, dividing the tasks upfront (i.e. designing the structure or schema) is costly, as it heavily relies on human experts, and sub-optimal, as it cannot adapt to the query statistics. To remove the bottleneck of dividing the task, end-to-end models have been proposed for question answering, such as Memory Networks miller2016key ; weston2014memory. However, these networks lack scalability: the complexity of reasoning with the learned memory is linear in the corpus size, which prohibits applying them to large web-scale corpora.

We present a new QA system that treats both the schema and the content of a structured storage as discrete hidden variables, and infers these structures automatically from weak supervision (such as QA pair examples). The structured storage we consider is simply a set of “n-grams”, which we show can represent a wide range of semantics, and can be indexed for efficient computations at scale. We present an end-to-end trainable system which combines a text auto-encoding component for encoding knowledge, and a memory enhanced sequence-to-sequence component for answering questions from the encoded knowledge. The system we present illustrates how end-to-end learning and scalability can be made possible through a symbolic knowledge storage.

1.1 Question Answering: Definition and Challenges

We first define question answering as producing the answer $a$ given a corpus $s = s_1, \ldots, s_T$, which is a sequence of text pieces, and a question $q$. Both $s_i$ and $q$ are sequences of words. We focus on extractive question answering, where the answer $a$ is always a word in one of the sentences. In Section 4 we illustrate how this assumption can be relaxed by named entity annotations. Despite its simple form, question answering can be incredibly challenging. We identify three main challenges in this process, which our new framework is designed to meet.


A typical QA system, such as Watson ferrucci2010watson or any of the commercial search engines brin1998search, processes millions or even billions of documents for answering a question. Yet the response time is restricted to a few seconds or even a fraction of a second. Answering any possible question that is answerable from a large corpus within limited time means that the information needs to be organized and indexed for fast access and reasoning.


A fundamental building block for text understanding is paraphrasing. Consider answering the question “Who was Adam Smith’s wife?” from the Web. There exists the following snippet from a reputable website: “Smith was born in Kirkcaldy, … (skipped 35 words) … In 1720, he married Margaret Douglas”. An ideal system needs to identify that “Smith” in this text is equivalent to “Adam Smith” in the question; that “he” is referencing “Smith”; and that text expressions of the form “X married Y” answer questions of the form “Who was X’s wife?”.

By observing users’ interactions, a system may capture certain equivalence relationships among expressions in questions baker2010understand. However, given these observations, there is still a wide range of choices for how the meaning of expressions can be represented. Open information extraction approaches angeli2015oie represent expressions by themselves, and rely on corpus statistics to calculate their similarities. This approach leads to data sparsity, and brittleness on out-of-domain text. Vector space approaches mikolov2013composition ; weston2014memory ; neelakantan2015neural ; miller2016key embed text expressions into latent continuous spaces. They allow flexible matching of semantics for arbitrary expressions, but are hard to scale to knowledge-intensive tasks, which require inference with large amounts of data.


The essence of reasoning is to combine pieces of information together. For example, from co-reference(“He”, “Adam Smith”) and has_spouse(“He”, “Margaret Douglas”) we can infer has_spouse(“Adam Smith”, “Margaret Douglas”). As the number of relevant pieces grows, the search space grows exponentially, making it a hard search problem lao2011random. Since reasoning is closely coupled with how the text meaning is stored, an optimal representation should be learned end-to-end (i.e. jointly) in the process of knowledge storing and reasoning.

1.2 N-Gram Machines: A Scalable End-to-End Approach

Figure 1: End-to-end QA system with symbolic representation. Both the knowledge store and the program are non-differentiable and hidden.

We propose to solve the scalability issue of neural network text understanding models by learning to represent the meaning of text as a symbolic knowledge storage. Because the storage can be indexed before being used for question answering, the inference step can be done very efficiently with complexity that is independent of the original text size. More specifically the structured storage we consider is simply a set of “n-grams”, which we show can represent complex semantics presented in bAbI tasks weston2015towards and can be indexed for efficient computations at scale. Each n-gram consists of a sequence of tokens, and each token can be a word, or any predefined special symbol. Different from conventional n-grams, which are contiguous chunks of text, the “n-grams” considered here can be any combination of arbitrary words and symbols. The whole system (Figure 1) consists of learnable components which convert text into symbolic knowledge storage and questions into programs (details in Section 2.1). A deterministic executor executes the programs against the knowledge storage and produces answers. The whole system is trained end-to-end with no human annotation other than the expected answers to a set of question-text pairs.

2 N-Gram Machines

In this section we first describe the N-Gram Machine (NGM) model structure, which contains three sequence-to-sequence modules, and an executor that executes programs against the knowledge storage. Then we describe how this model can be trained end-to-end with reinforcement learning. We use the bAbI dataset weston2015towards as a running example.

2.1 Model Structure

Knowledge storage

Given a corpus $s = s_1, \ldots, s_T$, NGM produces a knowledge storage $\Gamma = \Gamma_1, \ldots, \Gamma_T$, which is a list of n-grams. An n-gram $\Gamma_i = v_1, \ldots, v_N$ is a sequence of symbols, where each symbol $v_j$ is either a word from text piece $s_i$ or a symbol from the model vocabulary. The knowledge storage is probabilistic: each text piece $s_i$ produces a distribution over n-grams, and the probability of a knowledge storage can be factorized as the product of n-gram probabilities (Equation 1). Example knowledge storages are shown in Table 2 and Table 5. For certain tasks the model needs to reason over time, so we associate each n-gram $\Gamma_i$ with a time stamp, which is simply its index $i$ in the corpus $s$.

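Table 1: Functions in NGMs. $\Gamma$ is the knowledge storage, and the input n-gram is $v = v_1, \ldots, v_L$. We use $\Gamma_i^{p}$ to denote the first $L$ symbols in a stored n-gram $\Gamma_i$, and $\Gamma_i^{s}$ to denote the last $L$ symbols of $\Gamma_i$ in reverse order.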

The programs in NGM are similar to those introduced in Neural Symbolic Machines liang2017nsm , except that NGM functions operate on n-grams instead of Freebase triples. NGM functions specify how symbols can be retrieved from a knowledge storage as in Table 1. Pref and Suff return symbols from all the matched n-grams, while PrefMax and SuffMax return from the latest matches.

More formally, a program $C = p_1, \ldots, p_l$ is a list of statements, where each $p_i$ is either the special expression Return, indicating the end of the program, or is of the form $f\, v_1 \ldots v_L$, where $f$ is a function in Table 1 and $v_1 \ldots v_L$ are the input arguments of $f$. When an expression is executed, it returns a set of symbols by matching its arguments in the knowledge storage $\Gamma$, and stores the result in a new variable symbol (e.g., V1) to reference the result (see Table 2 for an example). Though executing a program on a knowledge storage as described above is deterministic, probabilities are assigned to the execution results, which are the products of the probabilities of the corresponding program and knowledge storage. Since the knowledge storage can be indexed using data structures such as hash tables, the program execution time is independent of the size of the knowledge storage.
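The execution model described above can be sketched with an ordinary dictionary index. This is a minimal illustration rather than the authors' implementation: the class name `KnowledgeStorage`, its method names, and the choice to index every proper prefix and reversed suffix are assumptions made here, and variable handling (e.g., V1) is left out. The semantics of the four functions follow the prose description: Pref and Suff return the next symbol from all matching n-grams, while PrefMax and SuffMax return it only from the latest match.

```python
from collections import defaultdict


class KnowledgeStorage:
    """Toy n-gram store; the list position doubles as the time stamp (assumption)."""

    def __init__(self, ngrams):
        # Map each proper prefix (and reversed suffix) to (time, next symbol) pairs.
        self.by_prefix = defaultdict(list)
        self.by_suffix = defaultdict(list)
        for time, ngram in enumerate(ngrams):
            forward = tuple(ngram)
            backward = forward[::-1]
            for length in range(1, len(forward)):
                self.by_prefix[forward[:length]].append((time, forward[length]))
                self.by_suffix[backward[:length]].append((time, backward[length]))

    @staticmethod
    def _all(matches):
        return {symbol for _, symbol in matches}

    @staticmethod
    def _latest(matches):
        if not matches:
            return set()
        return {max(matches, key=lambda match: match[0])[1]}

    def pref(self, *args):
        return self._all(self.by_prefix.get(args, []))

    def suff(self, *args):
        return self._all(self.by_suffix.get(args, []))

    def pref_max(self, *args):
        return self._latest(self.by_prefix.get(args, []))

    def suff_max(self, *args):
        return self._latest(self.by_suffix.get(args, []))


# The n-grams of Table 2, in story order.
storage = KnowledgeStorage([
    ("mary", "to", "kitchen"),
    ("mary", "the", "milk"),
    ("john", "to", "bedroom"),
    ("mary", "to", "garden"),
])
print(sorted(storage.pref("mary", "to")))   # all places mary went to
print(storage.pref_max("mary", "to"))       # only the latest one
```

Each query is a single dictionary lookup on the argument tuple instead of a scan over the stored n-grams, which is the property the text relies on; building the index is a one-off cost paid before any question arrives.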

Seq2Seq components

NGM uses three sequence-to-sequence sutskever2014sequence neural network models to define probability distributions over n-grams and programs:

  • A knowledge encoder that converts text pieces $s_i$ to n-grams $\Gamma_i$ and defines a distribution $P(\Gamma_i \mid s_i, c_i; \theta_{enc})$. It is also conditioned on context $c_i$, which helps to capture long range dependencies such as document title or co-references. (Ideally it should condition on the partially constructed $\Gamma$ at time $i$, but that makes it hard to do LSTM batch training and is beyond the scope of this work.) The probability of a knowledge storage $\Gamma = \Gamma_1, \ldots, \Gamma_T$ is defined as the product of its n-grams’ probabilities:
    $$P(\Gamma \mid s; \theta_{enc}) = \prod_{\Gamma_i \in \Gamma} P(\Gamma_i \mid s_i, c_i; \theta_{enc}) \tag{1}$$

  • A knowledge decoder that converts n-grams $\Gamma_i$ back to text pieces $s_i$ and defines a distribution $P(s_i \mid \Gamma_i, c_i; \theta_{dec})$. It enables auto-encoding training, which is crucial for efficiently finding good knowledge representations (see Section 2.2).

  • A programmer that converts questions $q$ to programs $C$ and defines a distribution $P(C \mid q, \Gamma; \theta_{prog})$. It is conditioned on the knowledge storage $\Gamma$ for code assistance liang2017nsm: before generating each token the programmer can query $\Gamma$ for valid next tokens given an n-gram prefix, and therefore avoid writing invalid programs.

We use the CopyNet gu2016incorporating architecture, which has copy vinyals2015pointer and attention bahdanau2014neural mechanisms. The programmer is also enhanced with a key-variable memory liang2017nsm for compositing semantics.

2.2 Optimization

Given an example $(s, q, a)$ from the training set, NGM maximizes the expected reward

$$O^{QA}(\theta_{enc}, \theta_{prog}) = \sum_{\Gamma} \sum_{C} P(\Gamma \mid s; \theta_{enc})\, P(C \mid q, \Gamma; \theta_{prog})\, R(\Gamma, C, a) \tag{2}$$

where the reward function $R(\Gamma, C, a)$ returns 1 if executing the program $C$ on the knowledge storage $\Gamma$ produces the answer $a$, and 0 otherwise. Since the training explores an exponentially large latent space, it is very challenging to optimize $O^{QA}$. To reduce the variance of inference we approximate the expectations with beam searches instead of sampling. The summation over all programs is approximated by summing over programs found by a beam search according to $P(C \mid q, \Gamma; \theta_{prog})$. For the summation over knowledge storages $\Gamma$, we first run beam search for each text piece based on $P(\Gamma_i \mid s_i, c_i; \theta_{enc})$, and then sample a set of knowledge storages by independently sampling from the n-grams of each text piece. We further introduce two techniques to iteratively reduce and improve the search space:
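The truncated double sum described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' training code: the function name `approx_expected_reward`, the representation of storages and programs as tuples, and the toy executor and beam are all hypothetical, and weighting each sampled storage by its probability is one possible reading of how the sampled set enters the sum.

```python
def approx_expected_reward(storage_samples, program_beam, execute, answer):
    """Sum P(storage) * P(program) * reward over a finite set of candidates.

    storage_samples: list of (storage, probability) pairs
    program_beam:    function storage -> list of (program, probability) pairs
    execute:         function (program, storage) -> set of symbols
    answer:          the expected set of symbols
    """
    total = 0.0
    for storage, p_storage in storage_samples:
        for program, p_program in program_beam(storage):
            reward = 1.0 if execute(program, storage) == answer else 0.0
            total += p_storage * p_program * reward
    return total


# Toy stand-ins (hypothetical): a program is a two-symbol prefix query.
def toy_execute(program, storage):
    return {ngram[2] for ngram in storage if ngram[:2] == program}


def toy_program_beam(storage):
    return [(("mary", "to"), 0.5), (("mary", "went"), 0.5)]


storage_samples = [
    ((("mary", "to", "kitchen"),), 0.75),
    ((("mary", "went", "kitchen"),), 0.25),
]
value = approx_expected_reward(
    storage_samples, toy_program_beam, toy_execute, {"kitchen"}
)
print(value)
```

In this toy setting reward is only collected when the storage and the program happen to use the same middle symbol, which is the agreement problem the two techniques below are meant to ease.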

Stabilized Auto-Encoding (AE)

We add an auto-encoding objective to NGM, similar to the text summarization model proposed by Miao et al. miao2016lang. The auto-encoding objective can be optimized by variational inference kingma2014vae ; mnih2014nvi:

$$O^{VAE}(\theta) = \mathbb{E}_{p(z \mid x; \theta_{enc})}\left[\log p(x \mid z; \theta_{dec}) + \log p(z) - \log p(z \mid x; \theta_{enc})\right] \tag{3}$$

where $x$ is text, and $z$ is the hidden discrete structure. However, it suffers from instability due to the strong coupling between encoder and decoder: the training of the decoder $\theta_{dec}$ relies solely on a distribution parameterized by the encoder $\theta_{enc}$, which changes throughout the course of training. To improve the training stability, we propose to augment the decoder training with a more stable objective: predicting the data $x$ back from noisy partial observations of $x$, which are independent of $\theta_{enc}$. More specifically, for NGM we force the knowledge decoder to decode from a fixed set of hidden sequences $Z^N(x)$, which includes all n-grams of length $N$ that consist only of words from text $x$:

$$O^{AE}(\theta_{enc}, \theta_{dec}) = \mathbb{E}_{p(z \mid x; \theta_{enc})}\left[\log p(x \mid z; \theta_{dec})\right] + \sum_{z \in Z^N(x)} \log p(x \mid z; \theta_{dec}) \tag{4}$$

The knowledge decoder converts knowledge tuples back to sentences, and the reconstruction log-likelihoods approximate how informative the tuples are, which can be used as reward for the knowledge encoder. We also drop the KL divergence (last two terms in Equation 3) between the language model $p(z)$ and the encoder, since the $z$’s are produced for NGM computations instead of human reading, and do not need to be in fluent natural language.

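  Input: knowledge storage $\Gamma$; statement $f\, v_1 \ldots v_L$ from an uninformed programmer.
  if $f$ = Pref or $f$ = PrefMax then
    Let $v_1 \ldots v_m$ be the longest n-gram prefix matched in $\Gamma$, and $\Gamma_m$ be the result n-grams.
    for $\gamma = s_1 \ldots s_N \in \Gamma_m$ do
      Add $v_1 \ldots v_m\, v_{m+1}\, s_{m+2} \ldots s_N$ to $\Gamma'$.
  Output: tweaked n-grams $\Gamma'$.
Algorithm 1 Structure tweak.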
Structure Tweaking (ST)

NGM contains two discrete hidden variables: the knowledge storage $\Gamma$ and the program $C$. The training procedure only gets rewarded if these two representations agree on the symbols used to represent a certain concept (e.g., "X is the producer of a movie Y"). To help explore the huge search space more efficiently, we apply structure tweak, a procedure which is similar to code assist liang2017nsm, but works in the opposite direction: while code assist uses the knowledge storage to inform the programmer, structure tweak adjusts the knowledge encoder to cooperate with an uninformed programmer. Together they allow the decisions in one part of the model to be influenced by the decisions from other parts, similar to Gibbs sampling.

More specifically, during training the programmer always performs an extra beam search with code assist turned off. If the resulting programs lead to execution failure, the programs can be used to propose tweaked n-grams (Algorithm 1). For example, when executing Pref john journeyed on the knowledge storage in Table 2, matching the prefix john journeyed fails at symbol journeyed and returns an empty result. At this point, journeyed can be used to replace the inconsistent symbol in the partially matched n-grams (i.e. john to bedroom), which produces john journeyed bedroom. These tweaked tuples are then added into the replay buffer for the knowledge encoder (Appendix A.2.2), which helps it to adopt a vocabulary that is consistent with the programmer.
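The tweak step can be sketched in a few lines. This is a hedged reading of the procedure described in the text, not the authors' code: the function name `structure_tweak`, the tuple encoding of statements, and the decision to propose nothing when no argument matches at all are assumptions made for illustration. Applied to the Table 2 storage, replacing `to` in `john to bedroom` with the programmer's symbol yields `john journeyed bedroom`.

```python
def structure_tweak(storage, statement):
    """Propose tweaked n-grams for a failed prefix statement.

    storage:   list of n-grams (tuples of symbols)
    statement: tuple such as ("Pref", "john", "journeyed")
    """
    func, *args = statement
    if func not in ("Pref", "PrefMax"):
        return []
    args = tuple(args)

    # Find the longest prefix of the arguments matched by some stored n-gram.
    matched_len, matched = 0, []
    for length in range(1, len(args) + 1):
        hits = [g for g in storage if tuple(g[:length]) == args[:length]]
        if not hits:
            break
        matched_len, matched = length, hits

    # Assumption: nothing to propose if no argument matched, or if all did.
    if matched_len == 0 or matched_len == len(args):
        return []

    tweaked = []
    for ngram in matched:
        if matched_len < len(ngram):
            # Overwrite the first inconsistent symbol with the programmer's one.
            tweaked.append(
                tuple(ngram[:matched_len])
                + (args[matched_len],)
                + tuple(ngram[matched_len + 1:])
            )
    return tweaked


table2 = [
    ("mary", "to", "kitchen"),
    ("mary", "the", "milk"),
    ("john", "to", "bedroom"),
    ("mary", "to", "garden"),
]
proposals = structure_tweak(table2, ("Pref", "john", "journeyed"))
print(proposals)
```

The proposals are candidates only; in the procedure described above they are placed in the encoder's replay buffer rather than written into the storage directly.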

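Now the whole model has parameters $\theta = [\theta_{enc}, \theta_{dec}, \theta_{prog}]$, and the training objective function is the sum of the auto-encoding and question answering objectives:

$$O(\theta) = O^{AE}(\theta_{enc}, \theta_{dec}) + O^{QA}(\theta_{enc}, \theta_{prog}) \tag{5}$$

Because the knowledge storage and the program are non-differentiable discrete structures, we optimize our objective by a coordinate ascent approach, optimizing the three components in alternation with REINFORCE williams1992simple. See Appendix A.2.2 for detailed training update rules.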

Sentences | Knowledge Tuples
Mary went to the kitchen | mary to kitchen
She picked up the milk | mary the milk
John went to the bedroom | john to bedroom
Mary journeyed to the garden | mary to garden
Table 2: Example knowledge storage for bAbI tasks. To deal with coreference resolution we always append the previous sentence as extra context to the left of the current sentence during encoding. Assuming that variable V1 stores {mary} from previous executions, executing the expression Pref V1 to returns a set of two symbols {kitchen, garden}. Similarly, executing PrefMax V1 to would instead produce {garden}.

3 bAbI Reasoning Tasks

We apply the N-Gram Machine (NGM) to solve a set of text reasoning tasks in the Facebook bAbI dataset weston2015towards . We first demonstrate that the model can learn to build knowledge storage and generate programs that accurately answer the questions. Then we show the scalability advantage of NGMs by applying it to longer stories up to 10 million sentences.

The Seq2Seq components are implemented as one-layer recurrent neural networks with Gated Recurrent Units. The hidden dimension and the vocabulary embedding dimension are both 8. We use beam size 2 for the knowledge encoder, sample size 5 for the knowledge store, and beam size 30 for the programmer. We take a staged training procedure by first training with only the auto-encoding objective for 1k epochs, then adding the question answering objective for 1k epochs, and finally adding the structure tweak for 1k epochs. For all tasks, we set the n-gram length to 3.

Model | T1 | T2 | T11 | T15 | T16
MemN2N | 0.0 | 83.0 | 84.0 | 0.0 | 44.0
QA | 0.7 | 2.7 | 0.0 | 0.0 | 9.8
QA + AE | 70.9 | 55.1 | 100.0 | 24.6 | 100.0
QA + AE + ST | 100.0 | 85.3 | 100.0 | 100.0 | 100.0
Table 3: Test accuracy on bAbI tasks with auto-encoding (AE) and structure tweak (ST).

3.1 Extractive bAbI Tasks

The bAbI dataset contains 20 tasks in total. We consider the subset of them that are extractive question answering tasks (as defined in Section 1.1). Each task is learned separately. In Table 3, we report results on the test sets. NGM outperforms MemN2N sukhbaatar2015end on all tasks listed. The results show that auto-encoding is essential to bootstrap learning: without auto-encoding the expected rewards are near zero, but auto-encoding alone is not sufficient to achieve high rewards (see Section 2.2). Since multiple discrete latent structures (i.e. knowledge tuples and programs) need to agree with each other over the choice of their representations for QA to succeed, the search becomes combinatorially hard. Structure tweaking is an effective way to refine the search space, improving the performance on more than half of the tasks. Appendix A.3 gives a detailed analysis of auto-encoding and structure tweaks.

3.2 Life-long bAbI

To demonstrate the scalability advantage of NGM we conduct experiments on question answering from a large synthetic corpus. More specifically, we generated longer bAbI stories using the open-source script from Facebook (https://github.com/facebook/bAbI-tasks). We measure the answering time and answer quality of MemN2N sukhbaatar2015end (https://github.com/domluna/memn2n) and NGM at different scales. The answering time is measured by the amount of time used to produce an answer when a question is given. For MemN2N, this is the neural network inference time. For NGM, because the knowledge storage can be built and indexed in advance, the response time is dominated by LSTM decoding.

Figure 2: Scalability comparison. Story length is the number of sentences in each QA pair.

Figure 2 compares the query response time of MemN2N and NGM. We can see that MemN2N scales poorly: the inference time increases linearly as the story length increases. In comparison, the answering time of NGM is not affected by story length. (The crossover of the two lines is when the story length is around 1000, which is due to the difference in neural network architectures: NGM uses recurrent networks while MemN2N uses feed-forward networks.)

To compare the answer quality at scale, we apply MemN2N and NGM to solve three life-long bAbI tasks (Task 1, 2, and 11). For each life-long task, MemN2N is run for 10 trials and the test accuracy of the trial with the best validation accuracy is used. For NGM, we use the same models trained on regular bAbI tasks. We compute the average and standard deviation of test accuracy from these three tasks. MemN2N performance is competitive with NGM when story length is no greater than 400, but decreases drastically when story length further increases. On the other hand, NGM answering quality is the same for all story lengths. These scalability advantages of NGM are due to its “machine” nature – the symbolic knowledge storage can be computed and indexed in advance, and the program execution is robust on stories of various lengths.

4 Schema Induction from Wikipedia

We conduct experiments on the WikiMovies dataset to test NGM’s ability to induce a relatively simple schema from natural language text (Wikipedia) with only weak supervision (question-answer pairs), and correctly answer questions from the constructed schema.

The WikiMovies benchmark miller2016key consists of question-answer pairs in the domain of movies. It is designed to compare the performance of various knowledge sources. For this study we focus on the document QA setup, for which no predefined schema is given, and the learning algorithm is required to form an internal representation of the knowledge expressed in Wikipedia text in order to answer questions correctly. It consists of 17k Wikipedia articles about movies. These questions are created ensuring that they are answerable from the Wikipedia pages. In total there are more than 100,000 questions which fall into 13 general classes. See Table 4 for an example document and related questions. Following previous work miller2016key we split the questions into disjoint training, development and test sets with 96k, 10k and 10k examples, respectively.

4.1 Text Representation

The WikiMovies dataset comes with a list of entities (movie titles, dates, locations, persons, etc.), and we use Stanford CoreNLP manning-EtAl:2014:P14-5 to annotate these entities in text. Following previous practices in semantic parsing dong2016language ; liang2017nsm we leverage the annotations and replace named entity tokens with their tags for LSTM encoders, which significantly reduces the vocabulary size of LSTM models and improves their generalization ability. Unlike those of the bAbI tasks, the sentences in Wikipedia are generally very long, and their semantics cannot fit into a single tuple. Therefore, following the practices in miller2016key, instead of treating a full sentence as a text piece, we treat each annotated entity (we call it the anchor entity) plus a small window of 3 words in front of it as a text piece. We expect this small window to encode the relationship between the central entity (i.e. the movie title of the Wikipedia page) and the anchor entity, so we skip annotated entities in front of the anchor entity when creating the text windows. We also append the movie title as the context during encoding. Table 5 gives examples of the final document and query representation after annotation and window creation.
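The window construction described above can be sketched as follows. This is an illustrative reading rather than the authors' preprocessing code: the function names, the convention that entity tokens are written in square brackets, the assumption that punctuation has already been dropped, and the choice to emit no window for the title entity itself are all hypothetical. On the opening sentence of the Blade Runner example it produces windows matching the first rows of Table 5.

```python
def is_entity(token):
    """Assumed convention: annotated entities are single tokens like "[1982]"."""
    return token.startswith("[") and token.endswith("]")


def text_windows(tokens, title, window=3):
    """Return (context, text piece) pairs, one per anchor entity.

    Each text piece is the anchor entity preceded by up to `window`
    non-entity words; entities in front of the anchor are skipped.
    """
    words_so_far = []
    pieces = []
    for token in tokens:
        if is_entity(token):
            if token != title:  # the central entity serves as context only
                pieces.append((title, words_so_far[-window:] + [token]))
        else:
            words_so_far.append(token)
    return pieces


tokens = [
    "[Blade Runner]", "is", "a", "[1982]", "American", "neo-noir",
    "dystopian", "science", "fiction", "film", "directed", "by",
    "[Ridley Scott]", "and", "starring", "[Harrison Ford]",
]
pieces = text_windows(tokens, title="[Blade Runner]")
for context, piece in pieces:
    print(context, "|", " ".join(piece))
```

Because earlier entities are skipped rather than counted, the window for `[Harrison Ford]` reaches back across `[Ridley Scott]` to pick up `by`, which keeps some of the relation wording in view even in long lists of names.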

Example document: Blade Runner
Blade Runner is a 1982 American neo-noir dystopian science fiction film directed by Ridley Scott and starring Harrison Ford, Rutger Hauer, Sean Young, and Edward James Olmos. The screenplay, written by Hampton Fancher and David Peoples, is a modified film adaptation of the 1968 …
Example questions and answers
Ridley Scott directed which films? Gladiator, Alien, Prometheus, Blade Runner, … (19 films in total)
What year was the movie Blade Runner released? 1982
What films can be described by android? Blade Runner, A.I. Artificial Intelligence
Table 4: Example document and question-answer pairs from the WikiMovies task.
Annotated Text Windows (size=4): Blade Runner | Knowledge Tuples
is a [1982] | [Blade Runner] when [1982]
film directed by [Ridley Scott] | [Blade Runner] director [Ridley Scott]
by and starring [Harrison Ford] | [Blade Runner] acted [Harrison Ford]
and starring and [Edward James Olmos] | [Blade Runner] acted [Edward James Olmos]
screenplay written by [Hampton Fancher] | [Blade Runner] writer [Hampton Fancher]
written by and [David Peoples] | [Blade Runner] writer [David Peoples]

Annotated Questions | Programs
[Ridley Scott] directed which films | Suff [Ridley Scott] director
What year was the movie [Blade Runner] released | Pref [Blade Runner] when
What is movie written by [Hampton Fancher] | Suff [Hampton Fancher] writer
Table 5: Annotated document and questions (left), and corresponding knowledge tuples and programs generated by NGM (right). We always append the tagged movie title ([Blade Runner]) to the left of text windows as context during LSTM encoding.
Question Type | KB | IE | DOC | NGM | U.B.
Director To Movie | 0.90 | 0.78 | 0.91 | 0.90 | 0.91
Writer To Movie | 0.97 | 0.72 | 0.91 | 0.85 | 0.89
Actor To Movie | 0.93 | 0.66 | 0.83 | 0.77 | 0.86
Movie To Director | 0.93 | 0.76 | 0.79 | 0.82 | 0.91
Movie To Actors | 0.91 | 0.64 | 0.64 | 0.63 | 0.74
Movie To Writer | 0.95 | 0.61 | 0.64 | 0.53 | 0.86
Movie To Year | 0.95 | 0.75 | 0.89 | 0.84 | 0.92
Avg (extractive) | 0.93 | 0.80 | 0.70 | 0.76 | 0.87
Movie To Genre | 0.97 | 0.84 | 0.86 | 0.72 | 0.75
Movie To Language | 0.96 | 0.62 | 0.84 | 0.68 | 0.74
Movie To Tags | 0.94 | 0.47 | 0.48 | 0.43 | 0.59
Tag To Movie | 0.85 | 0.35 | 0.49 | 0.30 | 0.59
Movie To Ratings | 0.94 | 0.75 | 0.92 | - | -
Movie To Votes | 0.92 | 0.92 | 0.92 | - | -
Avg (non-extractive) | 0.93 | 0.75 | 0.66 | 0.36 | 0.42
No schema design
No data curation
Scalable inference
Table 6: Scalability and test accuracy on WikiMovies tasks. We calculate the macro average, which is not weighted by the number of queries per type. U.B. is the recall upper bound for NGM, which assumes that the answer appears in the relevant text and has been identified by a named entity annotator.

4.2 Experiment Setup

Each training example consists of the first paragraph of a movie Wikipedia page and a question from WikiMovies whose answer appears in the paragraph. (To speed up training we only consider the first answer of a question if there are multiple answers to this question, e.g., only “Gladiator” for the question “Ridley Scott directed which films?”.) We applied the same staged training procedure as for the bAbI tasks, but use LSTMs with larger capacity (200 dimensions). 100-dimensional GloVe embeddings Pennington2014GloveGV are used as the input layer. After training we apply the knowledge encoder to all Wikipedia text with greedy decoding, and then aggregate all the n-grams into a single knowledge storage $\Gamma$. Then we apply the programmer with $\Gamma$ to every test question, greedily decode a program to execute, and calculate the F1 measure using the expected answers.
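The scoring step can be illustrated with a small sketch. The text does not spell out how F1 is computed, so this assumes a set-based F1 between the symbols returned by the executed program and the expected answers for one question, followed by an unweighted (macro) average over question types; the function names and the example numbers are hypothetical.

```python
def f1_score(predicted, expected):
    """Set-based F1 between predicted and expected answers (assumed definition)."""
    predicted, expected = set(predicted), set(expected)
    overlap = len(predicted & expected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


def macro_average(scores_by_type):
    """Average the per-type means without weighting by the number of queries."""
    per_type = [sum(scores) / len(scores) for scores in scores_by_type.values()]
    return sum(per_type) / len(per_type)


# Hypothetical example: the program returns two of four expected films.
score = f1_score(
    {"Blade Runner", "Alien"},
    {"Blade Runner", "Alien", "Gladiator", "Prometheus"},
)
average = macro_average({"Director To Movie": [score, 1.0], "Movie To Year": [1.0]})
print(round(score, 3), round(average, 3))
```

With a macro average, a question type with few queries counts as much as one with many, so a weak type lowers the reported figure more than it would under a per-question average.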

4.3 Results

We compare NGM with other approaches that use different knowledge representations miller2016key. KB is the least scalable approach of all, needing humans to provide both the schema and the contents of the structured knowledge. IE is more scalable, automatically populating the contents of the structured knowledge using information extractors pre-trained on human-annotated examples. DOC represents end-to-end deep models, which do not require any supervision other than question-answer pairs, but are not scalable at answering time because of their differentiable knowledge representations.

Table 6 shows the performance of different approaches. We separate the questions into two categories. The first category consists of questions which are extractive, assuming that the answer appears in the relevant Wikipedia text and has been identified by the Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/) named entity annotator. NGM performance is comparable to the IE and DOC approaches, but there is still a gap from the KB approach. This is because the schema learned by NGM might not be perfect, e.g., mixing writer and director as the same relation. The second category consists of questions which are not extractive (e.g., for IMDB rating or vote predictions, the answers never appear in the Wikipedia page), or for which we don’t have a good entity annotator to identify potential answers. Here we implemented simple annotators for genre and language which have 59% and 74% coverage respectively. We don’t have a tag annotator, but the tags can be partially covered by other annotators such as language or genre. It remains a challenging open question how to expand NGM’s capability to deal with non-extractive questions, and how to define text pieces without good-coverage entity annotators.

5 Related Work

Training highly expressive discrete latent variable models on large datasets is a challenging problem due to the difficulties posed by inference hinton2006fla ; mnih2014nvi, specifically the huge variance in the gradient estimation. Mnih et al. mnih2014nvi apply REINFORCE williams1992simple to optimize a variational lower bound of the data log-likelihood, but rely on complex schemes to reduce variance in the gradient estimation. We use a different set of techniques to learn N-Gram Machines, which are simpler and make fewer model assumptions. Instead of Monte Carlo integration, which is known for high variance and low data efficiency, we apply beam search. Beam search is very effective for deterministic environments with sparse reward liang2017nsm ; guu2017bridging, but it leads to a search problem. At inference time, since only a few top hypotheses are kept in the beam, search could get stuck and not receive any reward, preventing learning. We solve this hard search problem by having 1) a stabilized auto-encoding objective to bias the knowledge encoder towards more interesting hypotheses; and 2) a structure tweak procedure which retrospectively corrects the inconsistency among multiple hypotheses so that reward can be achieved.

The question answering part of our model (Figure 3) is similar to the Neural Symbolic Machine (NSM) liang2017nsm, which is a memory enhanced sequence-to-sequence model that translates questions into programs in $\lambda$-calculus liang11dcs. The programs, when executed on a knowledge graph, can produce answers to the questions. Our work extends NSM by removing the assumption of a given knowledge base or schema, and instead learns to generate the storage by end-to-end training to answer questions.

6 Conclusion

We present an end-to-end trainable system for efficiently answering questions from a large corpus of text. The system combines a text auto-encoding component for encoding the meaning of text into symbolic representations, and a memory enhanced sequence-to-sequence component that translates questions into programs. We show that the method achieves good scaling properties and robust inference on synthetic and natural language text. The system we present here illustrates how a bottleneck in knowledge management and reasoning can be alleviated by end-to-end learning of a symbolic knowledge storage.


  • [1] Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. Leveraging linguistic structure for open domain information extraction. In ACL (1), pages 344–354. The Association for Computer Linguistics, 2015.
  • [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [3] Steven Baker. Helping computers understand language. Official Google Blog, 2010.
  • [4] Thomas M Bartol, Cailey Bromer, Justin Kinney, Michael A Chirillo, Jennifer N Bourne, Kristen M Harris, Terrence J Sejnowski, and Sacha B Nelson. Nanoconnectomic upper bound on the variability of synaptic plasticity. In eLife, 2015.
  • [5] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In EMNLP, volume 2, page 6, 2013.
  • [6] Jack Block. Assimilation, accommodation, and the dynamics of personality development. Child Development, 53(2):281–295, 1982.
  • [7] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM, 2008.
  • [8] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, April 1998.
  • [9] G.A. Carpenter and S. Grossberg. Adaptive resonance theory. In The Handbook of Brain Theory and Neural Networks, pages 87–90. MIT Press, 2003.
  • [10] Gail A. Carpenter. Distributed learning, recognition, and prediction by art and artmap neural networks. Neural Netw., 10(9):1473–1494, November 1997.
  • [11] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL), 2017.
  • [12] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • [13] Li Dong and Mirella Lapata. Language to logical form with neural attention. In Association for Computational Linguistics (ACL), 2016.
  • [14] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 601–610, New York, NY, USA, 2014. ACM.
  • [15] David A. Ferrucci, Eric W. Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John M. Prager, Nico Schlaefer, and Christopher A. Welty. Building watson: An overview of the deepqa project. AI Magazine, 31(3):59–79, 2010.
  • [16] John Garcia and Robert A. Koelling. Relation of cue to consequence in avoidance learning. Psychonomic Science, 4:123–124, 1966.
  • [17] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 2016.
  • [18] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393, 2016.
  • [19] Kelvin Guu, Panupong Pasupat, Evan Zheran Liu, and Percy Liang. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In ACL (1), pages 1051–1062. Association for Computational Linguistics, 2017.
  • [20] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, July 2006.
  • [21] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
  • [22] Robin Jia and Percy Liang. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL), 2016.
  • [23] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014.
  • [24] Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What Learning Systems do Intelligent Agents Need? Complementary Learning Systems Theory Updated. Trends in Cognitive Sciences, 20(7):512–534, July 2016.
  • [25] Ni Lao, Tom Mitchell, and William W Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011.
  • [26] Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23–33, Vancouver, Canada, July 2017. Association for Computational Linguistics.
  • [27] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599, 2011.
  • [28] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.
  • [29] J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102:419–457, 1995.
  • [30] Yishu Miao and Phil Blunsom. Language as a latent variable: Discrete generative models for sentence compression. the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
  • [31] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
  • [32] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126, 2016.
  • [33] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, ACL ’09, pages 1003–1011, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
  • [34] Tom M. Mitchell, William W. Cohen, Estevam R. Hruschka Jr., Partha Pratim Talukdar, Justin Betteridge, Andrew Carlson, Bhavana Dalvi Mishra, Matthew Gardner, Bryan Kisiel, Jayant Krishnamurthy, Ni Lao, Kathryn Mazaitis, Thahir Mohamed, Ndapandula Nakashole, Emmanouil Antonios Platanios, Alan Ritter, Mehdi Samadi, Burr Settles, Richard C. Wang, Derry Tanti Wijaya, Abhinav Gupta, Xinlei Chen, Abulhair Saparov, Malcolm Greaves, and Joel Welling. Never-ending learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA., pages 2302–2310, 2015.
  • [35] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II–1791–II–1799. JMLR.org, 2014.
  • [36] Arvind Neelakantan, Quoc V Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.
  • [37] Randall C. O’Reilly, Rajan Bhattacharyya, Michael D. Howard, and Nicholas Ketz. Complementary learning systems. Cognitive Science, 38(6):1229–1248, 2014.
  • [38] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
  • [39] Jean Piaget. The Development of Thought: Equilibration of Cognitive Structures. Viking Press, 1st edition, November 1977.
  • [40] Richard Qian. Understand your world with bing. Bing Blog, 2013.
  • [41] Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. Relation extraction with matrix factorization and universal schemas. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pages 74–84, 2013.
  • [42] Edmund T Rolls. Cerebral cortex: principles of operation. Oxford University Press, 2016.
  • [43] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016.
  • [44] Amit Singhal. Introducing the knowledge graph: things, not strings. Official Google Blog, 2012.
  • [45] Parag Singla and Pedro Domingos. Entity resolution with markov logic. In Data Mining, 2006. ICDM’06. Sixth International Conference on, pages 572–582. IEEE, 2006.
  • [46] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 697–706, New York, NY, USA, 2007. ACM.
  • [47] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448, 2015.
  • [48] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • [49] E. Tulving. Elements of Episodic Memory. Oxford psychology series. Clarendon Press, 1983.
  • [50] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015.
  • [51] W. Wang, B. Subagdja, A. H. Tan, and J. A. Starzyk. Neural modeling of episodic memory: Encoding, retrieval, and forgetting. IEEE Transactions on Neural Networks and Learning Systems, 23(10):1574–1586, Oct 2012.
  • [52] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
  • [53] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
  • [54] Forrest Wickman. Your brain’s technical specs. AI Magazine, 2012.
  • [55] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

Appendix A Supplementary Material

a.1 More Related Work

This section analyzes why existing text understanding technologies are fundamentally limited and draws connections between our proposed solution and psychology and neuroscience.

a.1.1 Current Practice in Text Understanding

In recent years, several large-scale knowledge bases (KBs) have been constructed, such as YAGO [46], Freebase [7], NELL [34], Google Knowledge Graph [44], Microsoft Satori [40], and others. However, all of them suffer from a completeness problem [14] relative to large text corpora, which stems from an inability to convert arbitrary text to graph structures and to accurately answer questions from these structures. There are two core issues in this difficulty:

The schema (or representation) problem

Traditional text extraction approaches need some fixed and finite target schema [41]. For example, the definition of marriage (https://en.wikipedia.org/wiki/Marriage) is very complex and includes a whole spectrum of concepts, such as common-law marriage, civil union, putative marriage, handfasting, nikah mut’ah, and so on. It is prohibitively expensive to have experts clearly define all these differences in advance and annotate enough utterances in which they are expressed. However, given a particular context, many of these distinctions might be irrelevant. For example, when a person says “I just married Lisa.” and then asks “Who is my wife?”, a system should be able to respond correctly without first learning about all types of marriage. A desirable solution should induce the schema automatically from the corpus, such that knowledge encoded with this schema can help downstream tasks such as QA.

The end-to-end training (or inference) problem

The state-of-the-art approaches break down QA solutions into independent components and manually annotate training examples for each of them. Typical components include schema definition [34, 7], relation extraction [33], entity resolution [45], and semantic parsing [5, 26]. However, these components need to work together and depend on each other. For example, the meaning of text expressions (e.g., “father of IBM”, “father of Adam Smith”, “father of electricity”) depends on entity types. The type information, however, depends on the entity resolution decisions and on the content of the knowledge base (KB). The content of the KB, if populated from text, will in turn depend on the meaning of text expressions. Existing systems [14, 5] leverage human annotation and supervised training for each component separately, which creates consistency issues and precludes directly optimizing the performance of the QA task. A desirable solution should use few human annotations of intermediate steps and rely on end-to-end training to directly optimize the QA quality.

a.1.2 Text Understanding with Deep Neural Nets

More recently, there has been a lot of progress in applying deep neural networks (DNNs) to text understanding [31, 53, 36, 17, 32]. The key ingredient of these solutions is embedding text expressions into a latent continuous space. This removes the need to manually define a schema, while the vectors still allow complex inference and reasoning. Continuous representations greatly simplify the design of QA systems and enable end-to-end training, which directly optimizes the QA quality. However, a key issue that prevents these models from being applied to many applications is scalability. After receiving a question, all text in a corpus needs to be analyzed by the model, which leads to at least O(n) complexity, where n is the text size. Approaches that rely on a search subroutine (e.g., DrQA [11]) lose the benefit of end-to-end training and are limited by the quality of the retrieval stage, which is itself difficult.
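The scalability argument can be made concrete with a small sketch that contrasts reading every stored item at question time with looking up a pre-built index. This is an illustration under stated assumptions rather than the storage used by NGM: the tuple store, `scan_lookup`, and `build_prefix_index` are hypothetical names introduced here.

```python
from collections import defaultdict

# Hypothetical symbolic store of (subject, relation, object) tuples.
store = [
    ("daniel", "went", "office"),
    ("mary", "went", "garden"),
    ("john", "went", "kitchen"),
    ("mary", "got", "football"),
]


def scan_lookup(tuples, prefix):
    """Read every tuple: the cost grows linearly with the store size."""
    return [t for t in tuples if t[: len(prefix)] == prefix]


def build_prefix_index(tuples):
    """One-time pass that maps every proper prefix to its tuples."""
    index = defaultdict(list)
    for t in tuples:
        for k in range(1, len(t)):
            index[t[:k]].append(t)
    return index


index = build_prefix_index(store)
# After indexing, a query is one hash lookup; its expected cost does not
# depend on how many unrelated tuples the store holds.
print(index.get(("mary", "went"), []))
print(scan_lookup(store, ("mary", "went")))
```

Both calls return the same tuples. The difference is that the scan touches the whole store for every question, while the index is built once when the text is read.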

a.1.3 Relationship to Neuroscience

Previous studies [54, 4] of the hippocampus suggest that human memory capacity may be somewhere between 10 terabytes and 100 terabytes. This huge capacity is very advantageous for human survival, yet it puts the human brain in the same situation as a commercial search engine [8]: facing the task of organizing vast amounts of information in a form that supports fast retrieval and inference. Good representations can only be learned gradually, with a lot of computation and the aggregated statistics of many experiences.

Our model (Figure 3) resembles the complementary learning theory [29, 37, 24], which hypothesizes that intelligent agents possess two learning systems, instantiated in mammals in the hippocampus and neocortex. The first quickly learns the specifics of individual experiences, which is essential for rapid adaptation [16], while the second gradually acquires good knowledge representations for scalable and effective inference. The stories in training examples (episodic memories) and the knowledge store (keys to episodic memories) represent the fast learning neurons, while the sequence-to-sequence (seq2seq) DNN models represent the slow learning neurons (semantic memory). The DNN training procedure, which goes over past experience (stories) over and over again, resembles the “replay” of hippocampal memories that allows goal-dependent weighting of experience statistics.


The auto-encoder (Section 2.2) is related to autoassociative memory, such as the CA3 recurrent collateral system in the hippocampus [42]. Similar to autoassociative networks, such as the Hopfield network [21], which can recall a memory when provided with a fragment of it, the knowledge decoder in our framework learns to recover the full sentence given noisy partial observations of it, i.e., all knowledge tuples of length N that consist only of words from the text. It also embodies Tulving’s hypothesis [49] that episodic memory is not just stored as engrams (n-grams), but also in the procedure (seq2seq) that reconstructs experience from the engrams.

The structure tweak procedure (Section 2.2) in our model is critical for its success, and resembles the reconstructive (or generative) memory theory [39, 6], which hypothesizes that by employing reconstructive processes, individuals supplement other aspects of available personal knowledge into the gaps found in episodic memory in order to provide a fuller and more coherent version. Our analysis of the QA task showed that at the knowledge encoding stage, information is often encoded without understanding how it is going to be used, so it might be encoded inconsistently. At the later QA stage, an inference procedure tries to derive the expected answer by putting together several pieces of information, and fails because of the inconsistency. Only then can a hypothesis be formed to retrospectively correct the inconsistency in memory. These “tweaks” in memory later participate in training the knowledge encoder in the form of experience replay.

Our choice of using n-grams as the unit of knowledge storage is partially motivated by the Adaptive Resonance Theory (ART) [10, 9, 51], which models each piece of human episodic memory as a set of symbols, each representing one aspect of the experience such as location, color, object etc.

a.2 N-Gram Machines Details

a.2.1 Model Structure Details

Figure 3 shows the overall model structure of an n-gram machine.

Figure 3: N-Gram Machine. The model contains two discrete hidden structures, the knowledge storage and the program, which are generated from the story and the question respectively. The executor executes programs against the knowledge storage to produce answers. The three learnable components, knowledge encoder, knowledge decoder, and programmer, are trained to maximize the answer accuracy as well as minimize the reconstruction loss of the story. Code assist and structure tweak help the knowledge encoder and programmer to communicate and cooperate with each other.

a.2.2 Optimization Details

The training objective function is


where is 1 if only contains tokens from and 0 otherwise.

For training stability and to overcome search failures, we augment this objective with experience replay [43], and the gradients with respect to each set of parameters are:


where is the total expected reward for a set of valid knowledge stores , is the set of knowledge stores which contains the tuple , and is the set of knowledge stores which contains the tuple through tweaking.


where is the experience replay buffer for . is a constant. During training, the program with the highest weighted reward (i.e. ) is added to the replay buffer.
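The replay-buffer update described above can be sketched as follows. The exact weighting is elided in the text, so this sketch assumes, purely for illustration, that a program is scored by its model probability times its reward. The function name, the dictionary-based buffer, and the rule of skipping beams with no rewarded program are likewise hypothetical.

```python
def update_replay_buffer(buffer, question, beam):
    """Add the best-scoring program in the beam to the buffer for `question`.

    `beam` holds (program, probability, reward) triples. The score used here,
    probability * reward, is an assumption made for illustration only.
    """
    if not beam:
        return buffer
    program, prob, reward = max(beam, key=lambda item: item[1] * item[2])
    if prob * reward > 0:  # skip beams in which nothing earned reward
        buffer.setdefault(question, set()).add(program)
    return buffer


buffer = {}
beam = [
    ("PrefMax Daniel went", 0.6, 1.0),
    ("PrefMax Daniel office", 0.3, 0.0),
    ("SuffMax Daniel went", 0.1, 0.0),
]
update_replay_buffer(buffer, "Where is Daniel?", beam)
print(buffer)
```

Keeping a previously successful program available in this way means that a later search failure on the same question does not erase the training signal.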

a.3 Details of bAbI Tasks

a.3.1 Details of auto-encoding and structure tweak

To illustrate the effect of auto-encoding, we show in Figure 4 how informative the knowledge tuples are by computing the reconstruction log-likelihood using the knowledge decoder for the sentence “john went back to the garden”. As expected, the tuple is the most informative. Other informative tuples include and . Therefore, with auto-encoding training, useful hypotheses have a large chance of being found even with a small knowledge encoder beam size (2 in our case).

Figure 4: Visualization of the knowledge decoder’s assessment of how informative the knowledge tuples are. Yellow means high and red means low.
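To give a sense of the search space that the knowledge encoder faces for the example sentence, the sketch below enumerates every candidate tuple built only from words of that sentence. The tuple length of 3 is an assumption taken from the sampled tuples in Table 7, and `only_sentence_tokens` is a hypothetical helper that mirrors the indicator described in the optimization details (1 if a tuple contains only tokens from the sentence, 0 otherwise).

```python
from itertools import product

sentence = "john went back to the garden"
vocab = sorted(set(sentence.split()))  # distinct words of the sentence
TUPLE_LEN = 3                          # assumed tuple length

# Every tuple whose tokens all come from the sentence (repeats allowed).
candidates = list(product(vocab, repeat=TUPLE_LEN))
print(len(candidates))


def only_sentence_tokens(tup, sent):
    """Return 1 if the tuple uses only tokens from the sentence, else 0."""
    words = set(sent.split())
    return int(all(tok in words for tok in tup))


print(only_sentence_tokens(("john", "went", "garden"), sentence))
print(only_sentence_tokens(("john", "went", "office"), sentence))
```

Even this short sentence yields a few hundred candidates, while the encoder beam keeps only 2, which is why a decoder signal that ranks informative tuples highly matters.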

Table 7 lists sampled knowledge storages learned with different objectives and procedures. Knowledge storages learned with auto-encoding are much more informative than those learned without it. After structure tweaking, the knowledge tuples converge to more consistent symbols, e.g., using went instead of back or travelled. Our experimental results show that the tweaking procedure helps NGM deal with various linguistic phenomena such as singular/plural (“cats” vs “cat”) and synonyms (“grabbed” vs “got”). More examples are included in the supplementary material A.3.2.

QA | QA + AE | QA + AE + ST
went went went | daniel went office | daniel went office
mary mary mary | mary back garden | mary went garden
john john john | john back kitchen | john went kitchen
mary mary mary | mary grabbed football | mary got football
there there there | sandra got apple | sandra got apple
cats cats cats | cats afraid wolves | cat afraid wolves
mice mice mice | mice afraid wolves | mouse afraid wolves
is is cat | gertrude is cat | gertrude is cat
Table 7: Sampled knowledge storage with question answering (QA) objective, auto-encoding (AE) objective, and structure tweak (ST) procedure. Using AE alone produces similar tuples to QA+AE. The differences between the second and the third column are underlined.

a.3.2 Model generated knowledge storages and programs for bAbI tasks

The following tables show one example solution for each type of task. Only the tuple with the highest probability is shown for each sentence.

Story | Knowledge Storage
Daniel travelled to the office. | Daniel went office
John moved to the bedroom. | John went bedroom
Sandra journeyed to the hallway. | Sandra went hallway
Mary travelled to the garden. | Mary went garden
John went back to the kitchen. | John went kitchen
Daniel went back to the hallway. | Daniel went hallway
Question | Program
Where is Daniel? | PrefMax Daniel went
Table 8: Task 1 Single Supporting Fact
Story | Knowledge Storage
Sandra journeyed to the hallway. | Sandra journeyed hallway
John journeyed to the bathroom. | John journeyed bathroom
Sandra grabbed the football. | Sandra got football
Daniel travelled to the bedroom. | Daniel journeyed bedroom
John got the milk. | John got milk
John dropped the milk. | John got milk
Question | Program
Where is the milk? | SuffMax milk got
 | PrefMax V1 journeyed
Table 9: Task 2 Two Supporting Facts
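The two-step program in Table 9 can be traced with a small executor sketch. The semantics used here are assumptions inferred from the tables, not a statement of the actual implementation: `pref_max` returns the last element of the most recent tuple that starts with the given prefix, `suff_max` returns the first element of the most recent tuple that ends with the given tokens (written in reverse order, as in "SuffMax milk got"), and the position of a tuple in the list stands in for its time stamp.

```python
# Knowledge storage from Table 9; list position doubles as the time stamp.
storage = [
    ("Sandra", "journeyed", "hallway"),
    ("John", "journeyed", "bathroom"),
    ("Sandra", "got", "football"),
    ("Daniel", "journeyed", "bedroom"),
    ("John", "got", "milk"),
    ("John", "got", "milk"),
]


def pref_max(store, *prefix):
    """Last element of the most recent tuple that starts with `prefix`."""
    matches = [t for t in store if t[: len(prefix)] == prefix]
    return matches[-1][-1] if matches else None


def suff_max(store, *rev_suffix):
    """First element of the most recent tuple that ends with the arguments,
    which are given in reverse order (an assumption based on the tables)."""
    suffix = tuple(reversed(rev_suffix))
    matches = [t for t in store if t[-len(suffix):] == suffix]
    return matches[-1][0] if matches else None


v1 = suff_max(storage, "milk", "got")        # who most recently "got milk"
answer = pref_max(storage, v1, "journeyed")  # where that person journeyed
print(v1, answer)
```

Under these assumptions, the first step binds V1 to John and the second step returns bathroom as the location of the milk.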
Story | Knowledge Storage
John went to the bathroom. | John went bathroom
After that he went back to the hallway. | John he hallway
Sandra journeyed to the bedroom | Sandra Sandra bedroom
After that she moved to the garden | Sandra she garden
Question | Program
Where is Sandra? | PrefMax Sandra she
Table 10: Task 11 Basic Coreference
Story | Knowledge Storage
Sheep are afraid of cats. | Sheep afraid cats
Cats are afraid of wolves. | Cat afraid wolves
Jessica is a sheep. | Jessica is sheep
Mice are afraid of sheep. | Mouse afraid sheep
Wolves are afraid of mice. | Wolf afraid mice
Emily is a sheep. | Emily is sheep
Winona is a wolf. | Winona is wolf
Gertrude is a mouse. | Gertrude is mouse
Question | Program
What is Emily afraid of? | Pref Emily is
 | Pref V1 afraid
Table 11: Task 15 Basic Deduction
Story | Knowledge Storage
Bernhard is a rhino. | Bernhard a rhino
Lily is a swan. | Lily a swan
Julius is a swan. | Julius a swan
Lily is white. | Lily is white
Greg is a rhino. | Greg a rhino
Julius is white. | Julius is white
Brian is a lion. | Brian a lion
Bernhard is gray. | Bernhard is gray
Brian is yellow. | Brian is yellow
Question | Program
What color is Greg? | Pref Greg a
 | Suff V1 a
 | Pref V2 is
Table 12: Task 16 Basic Induction
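The three-step program in Table 12 chains set-valued results through the variables V1 and V2. The sketch below traces it under assumed semantics inferred from the table: `pref` returns the set of last elements of tuples whose first element is in a given set and whose middle token matches, and `suff` returns the set of first elements of tuples whose last element is in a given set and whose middle token matches. Both helper names and the set-based treatment of variables are hypothetical.

```python
# Knowledge storage from Table 12.
storage = [
    ("Bernhard", "a", "rhino"),
    ("Lily", "a", "swan"),
    ("Julius", "a", "swan"),
    ("Lily", "is", "white"),
    ("Greg", "a", "rhino"),
    ("Julius", "is", "white"),
    ("Brian", "a", "lion"),
    ("Bernhard", "is", "gray"),
    ("Brian", "is", "yellow"),
]


def pref(store, heads, relation):
    """Last elements of tuples that start with (head, relation)."""
    return {t[2] for t in store if t[0] in heads and t[1] == relation}


def suff(store, tails, relation):
    """First elements of tuples that end with (relation, tail)."""
    return {t[0] for t in store if t[2] in tails and t[1] == relation}


v1 = pref(storage, {"Greg"}, "a")  # Pref Greg a : what kind of animal Greg is
v2 = suff(storage, v1, "a")        # Suff V1 a   : all animals of that kind
v3 = pref(storage, v2, "is")       # Pref V2 is  : colors known for them
print(v1, v2, v3)
```

No color is stored for Greg directly, so the answer comes from Bernhard, the other rhino, which is the induction step the task is designed to test.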