Neural Compositional Denotational Semantics for Question Answering

08/29/2018 ∙ by Nitish Gupta, et al. ∙ Facebook, University of Pennsylvania

Answering compositional questions requiring multi-step reasoning is challenging. We introduce an end-to-end differentiable model for interpreting questions about a knowledge graph (KG), which is inspired by formal approaches to semantics. Each span of text is represented by a denotation in a KG and a vector that captures ungrounded aspects of meaning. Learned composition modules recursively combine constituent spans, culminating in a grounding for the complete sentence which answers the question. For example, to interpret "not green", the model represents "green" as a set of KG entities and "not" as a trainable ungrounded vector, and then uses this vector to parameterize a composition function that performs a complement operation. For each sentence, we build a parse chart subsuming all possible parses, allowing the model to jointly learn both the composition operators and output structure by gradient descent from end-task supervision. The model learns a variety of challenging semantic operators, such as quantifiers, disjunctions and composed relations, and infers latent syntactic structure. It also generalizes well to longer questions than seen in its training data, in contrast to RNNs, their tree-based variants, and semantic parsing baselines.




1 Introduction

Compositionality is a mechanism by which the meanings of complex expressions are systematically determined from the meanings of their parts, and has been widely assumed in the study of both artificial and natural languages (Montague, 1973) as a means for allowing speakers to generalize to understanding an infinite number of sentences. Popular neural network approaches to question answering use a restricted form of compositionality, typically encoding a sentence word-by-word, and then executing the complete sentence encoding against a knowledge source (Perez et al., 2017). Such models can fail to generalize from training data in surprising ways. Inspired by linguistic theories of compositional semantics, we instead build a latent tree of interpretable expressions over a sentence, recursively combining constituents using a small set of neural modules. Our model outperforms RNN encoders, particularly when test questions are longer than training questions.

Our approach resembles Montague semantics, in which a tree of interpretable expressions is built over the sentence, with nodes combined by a small set of composition functions. However, both the structure of the sentence and the composition functions are learned by end-to-end gradient descent. To achieve this, we define the parametric form of a small set of composition modules, and then build a parse chart over each sentence subsuming all possible trees. Each node in the chart represents a span of text with a distribution over groundings (in terms of booleans and knowledge base nodes and edges), as well as a vector representing aspects of the meaning that have not yet been grounded. The representation for a node is built by taking a weighted sum over different ways of building the node (similar to Maillard et al., 2017). The trees induced by our model are linguistically plausible, in contrast to prior work on structure learning from semantic objectives (Williams et al., 2018).

Typical neural approaches to grounded question answering first encode a question with a recurrent neural network (RNN), and then evaluate the encoding against an encoding of the knowledge source (for example, a knowledge graph or image) (Santoro et al., 2017). In contrast to classical approaches to compositionality, constituents of complex expressions are not given explicit interpretations in isolation. For example, in "Which cubes are large or green?", an RNN encoder will not explicitly build an interpretation for the phrase "large or green". We show that such approaches can generalize poorly when tested on more complex sentences than they were trained on. Our approach instead imposes independence assumptions that give a linguistically motivated inductive bias. In particular, it enforces that phrases are interpreted independently of surrounding words, allowing the model to generalize naturally to interpreting phrases in different contexts. In our model, "large or green" will be represented as a particular set of entities in a knowledge graph, and be intersected with the set of entities represented by the "cubes" node.

Another perspective on our work is as a method for learning layouts of Neural Module Networks (NMNs) (Andreas et al., 2016b). Work on NMNs has focused on construction of the structure of the network, variously using rules, parsers and reinforcement learning (Andreas et al., 2016a; Hu et al., 2017). Our end-to-end differentiable model jointly learns structures and modules by gradient descent.

Our model is a new combination of classical and neural methods, which maintains the interpretability and generalization behaviour of semantic parsing, while being end-to-end differentiable.

2 Model Overview

Our task is to answer a question with respect to a Knowledge Graph (KG) consisting of nodes (representing entities) and labelled directed edges (representing relationships between entities). In our task, answers are either booleans, or specific subsets of nodes from the KG.

Our model builds a parse for the sentence, in which phrases are grounded in the KG, and a small set of composition modules are used to combine phrases, resulting in a grounding for the complete question sentence that answers it. For example, in Figure 1, the phrases "not" and "cylindrical" are interpreted as a function word and an entity set, respectively, and then "not cylindrical" is interpreted by computing the complement of the entity set. The node at the root of the parse tree is the answer to the question. Our model answers questions by:


  • Grounding individual tokens in the KG. Each token can be grounded as a particular set of entities or relations in the KG, represented as an ungrounded vector, or marked as semantically vacuous. For each word, we learn parameters that are used to compute a distribution over semantic types and corresponding denotations in a KG (§ 3.1).

  • Combining representations for adjacent phrases into representations for larger phrases, using trainable neural composition modules (§ 3.2). This produces a denotation for the phrase.

  • Assigning a binary-tree structure to the question sentence, which determines how words are grounded, and which phrases are combined using which modules. We build a parse chart subsuming all possible structures, and train a parsing model to increase the likelihood of structures leading to the correct answer to questions. Different parses leading to a denotation for a phrase of type t are merged into an expected denotation, allowing dynamic programming (§ 4).

  • Answering the question, with the most likely grounding of the phrase spanning the sentence.

3 Compositional Semantics

3.1 Semantic Types

Our model classifies spans of text into different semantic types to represent their meaning as explicit denotations, or ungrounded vectors. All phrases are assigned a distribution over semantic types. The semantic type determines how a phrase is grounded, and which composition modules can be used to combine it with other phrases. A phrase has a denotation for each semantic type t. For example, in Figure 1, "red" corresponds to a set of entities, "left" corresponds to a set of relations, and "not" is treated as an ungrounded vector.

The semantic types we define can be classified into three broad categories.

Grounded Semantic Types:

Spans of text that can be fully grounded in the KG.

  1. Entity (E): Spans of text that can be grounded to a set of entities in the KG, for example, "red sphere" or "large cube". E-type span grounding is represented as an attention value in [0, 1] for each entity. This can be viewed as a soft version of a logical set-valued denotation, which we refer to as a soft entity set.

  2. Relation (R): Spans of text that can be grounded to a set of relations in the KG, for example, "left of" or "not right of or above". R-type span grounding is represented by a soft adjacency matrix $A$, where $A_{ij} = 1$ denotes a directed edge from entity $e_i$ to entity $e_j$.

  3. Truth (T): Spans of text that can be grounded with a Boolean denotation, for example, "Is anything red?" or "Is one ball green and are no cubes red?". T-type span grounding is represented using a real value in [0, 1] that denotes the probability of the span being true.

Ungrounded Semantic Types:

Spans of text whose meaning cannot be grounded in the KG.

  1. Vector (V): This type is used for spans representing functions that cannot yet be grounded in the KG (e.g. words such as "and" or "every"). These spans are represented using four different real-valued vectors, which are used to parameterize the composition modules described in § 3.2.

  2. Vacuous: Spans that are considered semantically vacuous, but are necessary syntactically, e.g. "of" in "left of a cube". During composition, these nodes act as identity functions.

Partially-Grounded Semantic Types:

Spans of text that can only be partially grounded in the knowledge graph, such as "and red" or "are four spheres". Here, we represent the span by a combination of a grounding and vectors, representing grounded and ungrounded aspects of meaning respectively. The grounded component of the representation will typically combine with another fully grounded representation, and the ungrounded vectors will parameterize the composition module. We define 3 semantic types of this kind: EV, RV and TV, corresponding to the combination of entities, relations and boolean groundings respectively with an ungrounded vector. Here, the word represented by the vectors can be viewed as a binary function, one of whose arguments has been supplied.
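To make the semantic types above concrete, the following is a minimal sketch of how each denotation could be stored for a toy KG. The KG contents, the helper names (`entity_set`, `relation_matrix`, `EntityVector`) and the numeric values are hypothetical and chosen only for illustration; they are not taken from our implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical toy KG: four entities, each with a colour and a shape.
ENTITIES = [
    {"color": "red", "shape": "sphere"},
    {"color": "red", "shape": "cube"},
    {"color": "green", "shape": "cube"},
    {"color": "blue", "shape": "cylinder"},
]
# Labelled directed edges as (head index, label, tail index).
EDGES = [(0, "left", 1), (1, "left", 2), (2, "above", 3)]


def entity_set(prop: str, value: str) -> List[float]:
    """E type: one attention value in [0, 1] per entity (hard 0/1 here)."""
    return [1.0 if e[prop] == value else 0.0 for e in ENTITIES]


def relation_matrix(label: str) -> List[List[float]]:
    """R type: adjacency matrix with A[i][j] = 1 for an edge from i to j."""
    n = len(ENTITIES)
    A = [[0.0] * n for _ in range(n)]
    for head, edge_label, tail in EDGES:
        if edge_label == label:
            A[head][tail] = 1.0
    return A


@dataclass
class EntityVector:
    """EV type: a grounded soft entity set plus an ungrounded word vector."""
    entities: List[float]
    vector: Tuple[float, ...]


red = entity_set("color", "red")    # E type
left = relation_matrix("left")      # R type
truth = 1.0                         # T type: probability that a span is true
and_red = EntityVector(red, (10.0, 10.0, -15.0))  # EV type for "and red"
```

In a trained model the entity attentions and adjacency entries are soft values rather than hard 0/1 indicators, and the vector stored in the EV type is learned rather than hand-set.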

3.2 Composition Modules

Next, we describe how we compose phrase representations (from § 3.1) to represent larger phrases. We define a small set of composition modules that take as input two constituents of text with their corresponding semantic representations (grounded representations and ungrounded vectors), and output the semantic type and corresponding representation of the larger constituent. The composition modules are parameterized by the trainable word vectors. These can be divided into several categories:

Composition modules resulting in fully grounded denotations:

Described in Figure 2.

(a) E + E → E: This module performs a function on a pair of soft entity sets, parameterized by the model's global parameter vector, to produce a new soft entity set. The composition function for a single entity's resulting attention value is shown in the figure. Such a composition module can be used to interpret compound nouns and entity appositions. For example, the composition module learns to output the intersection of two entity sets.
(b) V + E → E: This module performs a function on a soft entity set, parameterized by a word vector, to produce a new soft entity set. For example, the word "not" learns to take the complement of a set of entities. The entity attention representation of the resulting span is computed by a function that takes the vector of the V constituent as a parameter argument and the entity attention vector of the E constituent as a function argument.
(c) EV + E → E: This module combines two soft entity sets into a third set, parameterized by the word vector. This composition function is similar to a linear threshold unit and is capable of modeling various mathematical operations such as logical conjunctions, disjunctions, differences etc. for different values of the word vector. For example, the word "or" learns to model set union.
(d) R + E → E: This module composes a set of relations (represented as a single soft adjacency matrix) and a soft entity set to produce an output soft entity set. The composition function uses the adjacency matrix representation of the R-span and the soft entity set representation of the E-span.
(e) V + E → T: This module maps a soft entity set onto a soft boolean, parameterized by a word vector. The module counts whether a sufficient number of elements are in (or out of) the set. For example, the word "any" should test if a set is non-empty.
(f) EV + E → T: This module combines two soft entity sets into a soft boolean, which is useful for modelling generalized quantifiers. For example, in "is every cylinder blue", the module can use the inner sigmoid to test if an element is in the set of cylinders but not in the set of blue things, and then use the outer sigmoid to return a value close to 1 if the sum of elements matching this property is close to 0.
(g) TV + T → T: This module maps a pair of soft booleans into a soft boolean using the word vector to parameterize the composition function. Similar to EV + E → E, this module facilitates modeling a range of boolean set operations. Using the same functional form for different composition functions allows our model to use the same ungrounded word vector for compositions that are semantically analogous.
(h) RV + R → R: This module composes a pair of soft sets of relations to produce an output soft set of relations. For example, the relations "left" and "above" are composed by the word "or" to produce a set of relations such that two entities are related if either of the two relations exists between them. The functional form for this composition is similar to the EV + E → E and TV + T → T modules.
Figure 2: Composition Modules that compose two constituent span representations into the representation for the combined larger span, using the indicated equations.
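The behaviour described for several of these modules can be illustrated with a small sketch. The functional forms below (a sigmoid of an affine function of the input attentions, and a max over neighbours for relations) are assumptions made for illustration, guided by the descriptions "similar to a linear threshold unit" and "inner sigmoid ... outer sigmoid"; the exact equations are those of Figure 2. The weight tuples `NOT`, `OR`, `AND` and `EVERY` are hand-set stand-ins for learned word vectors, and the edge direction convention in `r_e_to_e` is likewise assumed.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def v_e_to_e(v: Sequence[float], p: List[float]) -> List[float]:
    """V + E -> E: per-entity sigmoid of an affine function of the attention."""
    w, b = v
    return [sigmoid(w * p_i + b) for p_i in p]


def ev_e_to_e(v: Sequence[float], p_left: List[float], p_right: List[float]) -> List[float]:
    """EV + E -> E: a linear-threshold-style unit over two attentions."""
    w_l, w_r, b = v
    return [sigmoid(w_l * l + w_r * r + b) for l, r in zip(p_left, p_right)]


def r_e_to_e(A: List[List[float]], p: List[float]) -> List[float]:
    """R + E -> E: entity i is selected if it has an edge to a selected entity j."""
    n = len(p)
    return [max(A[i][j] * p[j] for j in range(n)) for i in range(n)]


def ev_e_to_t(v: Sequence[float], p_left: List[float], p_right: List[float]) -> float:
    """EV + E -> T: inner sigmoid per entity, outer sigmoid over their sum."""
    w_l, w_r, b_in, w_out, b_out = v
    inner = sum(sigmoid(w_l * l + w_r * r + b_in) for l, r in zip(p_left, p_right))
    return sigmoid(w_out * inner + b_out)


# Hand-set weights standing in for learned word vectors.
NOT = (-10.0, 5.0)                       # complement
OR = (10.0, 10.0, -5.0)                  # union
AND = (10.0, 10.0, -15.0)                # intersection
EVERY = (10.0, -10.0, -5.0, -10.0, 5.0)  # "no element is in left but not in right"

red = [1.0, 1.0, 0.0, 0.0]
cube = [0.0, 1.0, 1.0, 0.0]
cylinder = [0.0, 0.0, 0.0, 1.0]
blue = [0.0, 0.0, 0.0, 1.0]
LEFT = [[0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]

not_red = v_e_to_e(NOT, red)
red_or_cube = ev_e_to_e(OR, red, cube)
red_and_cube = ev_e_to_e(AND, red, cube)
left_of_cube = r_e_to_e(LEFT, cube)
every_cube_red = ev_e_to_t(EVERY, cube, red)
every_cylinder_blue = ev_e_to_t(EVERY, cylinder, blue)
```

With these weights, `not_red` is close to the complement of `red`, `red_or_cube` and `red_and_cube` are close to the union and intersection, `every_cube_red` is close to 0 (entity 2 is a cube that is not red), and `every_cylinder_blue` is close to 1. The point of the sketch is that one functional form covers several logical operators, depending only on the word vector.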

Composition with vacuous-typed nodes:

Phrases with the vacuous type are treated as being semantically transparent identity functions. Phrases of any other type can be combined with these nodes, with no change to their type or representation.

Composition modules resulting in partially grounded denotations:

We define several modules that combine fully grounded phrases with ungrounded phrases, by deterministically taking the union of the representations, giving phrases with partially grounded representations (§ 3.1). These modules are useful when words act as binary functions; here they combine with their first argument. For example, in Fig. 1, "or" and "not cylindrical" combine to make a phrase containing both the vectors for "or" and the entity set for "not cylindrical".
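The two-step behaviour described above (first store the argument, later apply the function) can be sketched as follows. This is an illustrative sketch only: the names `EV`, `v_e_to_ev` and `apply_ev`, the affine-plus-sigmoid form, and the hand-set `OR` weights are assumptions rather than our implementation.

```python
import math
from typing import List, NamedTuple, Sequence, Tuple


class EV(NamedTuple):
    """Partially grounded phrase: entity set plus ungrounded word vector."""
    entities: List[float]
    vector: Tuple[float, ...]


def v_e_to_ev(vector: Sequence[float], entities: Sequence[float]) -> EV:
    """V + E -> EV: keep both parts side by side; nothing is computed yet."""
    return EV(entities=list(entities), vector=tuple(vector))


def apply_ev(ev: EV, other: Sequence[float]) -> List[float]:
    """EV + E -> E: the stored vector now parameterizes the composition."""
    w_stored, w_other, bias = ev.vector
    return [
        1.0 / (1.0 + math.exp(-(w_stored * s + w_other * o + bias)))
        for s, o in zip(ev.entities, other)
    ]


OR = (10.0, 10.0, -5.0)                 # hand-set stand-in for a learned vector
not_cylindrical = [1.0, 1.0, 1.0, 0.0]  # entity set for "not cylindrical"
blue = [0.0, 0.0, 0.0, 1.0]

or_not_cylindrical = v_e_to_ev(OR, not_cylindrical)    # EV-typed phrase
blue_or_not_cylindrical = apply_ev(or_not_cylindrical, blue)
```

The first step is deterministic and has no parameters of its own; all of the learned behaviour sits in the vector that is carried along until the second argument arrives.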

4 Parsing Model

Here, we describe how our model classifies question tokens into semantic type spans and computes their representations (§ 4.1), and recursively uses the composition modules defined above to parse the question into a soft latent tree that provides the answer (§ 4.2). The model is trained end-to-end using only question-answer supervision (§ 4.3).

4.1 Lexical Representation Assignment

Each token in the question sentence is assigned a distribution over the semantic types, and a grounded representation for each type. Tokens can only be assigned the E, R, V, and vacuous types. For example, the token "cylindrical" in the question in Fig. 1 is assigned a distribution over the semantic types (one shown) and for the E type, its representation is the set of cylindrical entities.

Semantic Type Distribution for Tokens:

To compute the semantic type distribution, our model represents each word and each semantic type using an embedding vector. The semantic type distribution for a token is computed with a softmax over the semantic types.
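A minimal sketch of such a softmax is given below. It assumes that the score of a type for a word is the dot product of the two embedding vectors, which is a common choice but is an assumption here; the embeddings and the function name `type_distribution` are hypothetical.

```python
import math
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def type_distribution(word_vec: List[float],
                      type_vecs: Dict[str, List[float]]) -> Dict[str, float]:
    """Softmax over semantic types, scored by word/type embedding dot products."""
    scores = {t: dot(word_vec, v) for t, v in type_vecs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


# Hypothetical two-dimensional embeddings for the four token-level types.
TYPE_VECS = {
    "E": [1.0, 0.0],
    "R": [0.0, 1.0],
    "V": [-1.0, -1.0],
    "VACUOUS": [0.0, 0.0],
}
dist = type_distribution([3.0, 0.5], TYPE_VECS)
```

For this toy word vector most of the probability mass falls on the E type, while every other type keeps a non-zero probability, so all type assignments remain available to the parse chart during training.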

Grounding for Tokens:

For each semantic type, we compute a representation of the token:

  1. E-Type Representation: Each entity is represented using an embedding vector based on the concatenation of vectors for its properties. For each token, we use its word vector to find the probability of each entity being part of the E-Type grounding. For example, in Fig. 1, the word "red" will be grounded as all the red entities.

  2. R-Type Representation: Each relation is represented using an embedding vector. For each token, we compute a distribution over relations, and then use this to compute the expected adjacency matrix that forms the R-type representation for this token. For example, the word "left" in Fig. 1 is grounded as the subset of edges with label "left".

  3. V-Type Representation: For each word, we learn four vectors, and use these as the representation for words with the V-Type.

  4. Vacuous-Type Representation: Semantically vacuous words do not require a representation.
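The E-type and R-type groundings described in this list can be sketched as below. The sketch assumes a sigmoid of a dot product for entity membership and a softmax over dot products for the relation distribution; both scoring functions, the toy embeddings and the function names are assumptions for illustration.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def ground_entities(word_vec: List[float], entity_vecs: List[List[float]]) -> List[float]:
    """E type: probability of each entity belonging to the word's denotation."""
    return [1.0 / (1.0 + math.exp(-dot(word_vec, e))) for e in entity_vecs]


def ground_relations(word_vec: List[float],
                     relation_vecs: Dict[str, List[float]],
                     adjacency: Dict[str, Matrix]) -> Matrix:
    """R type: expected adjacency matrix under a distribution over relations."""
    labels = list(relation_vecs)
    scores = [dot(word_vec, relation_vecs[r]) for r in labels]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    probs = [e / sum(exps) for e in exps]
    n = len(adjacency[labels[0]])
    expected = [[0.0] * n for _ in range(n)]
    for p, r in zip(probs, labels):
        for i in range(n):
            for j in range(n):
                expected[i][j] += p * adjacency[r][i][j]
    return expected


# Toy entity embeddings with features (is_red, is_cube) encoded as +1 / -1.
ENTITY_VECS = [[1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]]
RELATION_VECS = {"left": [1.0, 0.0], "above": [0.0, 1.0]}
ADJACENCY = {
    "left": [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]],
    "above": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
}

red_grounding = ground_entities([5.0, 0.0], ENTITY_VECS)
left_grounding = ground_relations([8.0, 0.0], RELATION_VECS, ADJACENCY)
```

Because the relation grounding is an expectation rather than a hard choice, a word whose relation is still uncertain early in training is grounded as a blend of the candidate adjacency matrices.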

4.2 Parsing Questions

To learn the correct structure for applying composition modules, we use a simple parsing model. We build a parse-chart over the question encompassing all possible trees by applying all composition modules, similar to a standard CRF-based PCFG parser using the CKY algorithm. Each node in the parse-chart, for each span of the question, is represented as a distribution over different semantic types with their corresponding representations.

Phrase Semantic Type Potential:

The model assigns a score (a potential) to each span, for each semantic type t. This score is computed from all possible ways of forming the span with type t. For a particular composition of a left constituent and a right constituent, each with its own semantic type, using the module that applies to that pair of types, the composition score is computed from a trainable vector and a simple feature function. Features consist of a conjunction of the composition module type and: the words before and after the span, the first and last word in the left constituent, and the first and last word in the right constituent.

The final t-type potential of a span is computed by summing scores over all possible compositions.

Combining Phrase Representations:

To compute a span's t-type denotation, we compute an expected output representation from all possible compositions that result in type t. Each candidate composition contributes the representation produced by applying its module to the two constituents, weighted by the score of that composition.
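The chart computation described above can be sketched as follows. This is a simplified illustration under several assumptions: only two modules are included, the product is used as a soft intersection, the score of a composition is taken to be a composition weight multiplied by the potentials of the two constituents (as in the standard inside algorithm), and the feature-based score is replaced by a user-supplied `score_fn`. All names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Cell = Dict[str, Tuple[float, list]]  # type -> (potential, representation)


def v_e_to_e(v, p):
    """V + E -> E: sigmoid of an affine function of each attention value."""
    w, b = v
    return [1.0 / (1.0 + math.exp(-(w * x + b))) for x in p]


def e_e_to_e(p, q):
    """E + E -> E: product as a simple soft intersection (an assumption)."""
    return [x * y for x, y in zip(p, q)]


MODULES = {("V", "E"): ("E", v_e_to_e), ("E", "E"): ("E", e_e_to_e)}


def parse(tokens: List[Cell],
          score_fn: Callable[..., float] = lambda i, k, j, t1, t2: 1.0) -> Cell:
    """Fill a CKY-style chart; return the cell spanning the whole question."""
    n = len(tokens)
    chart = {(i, i): tokens[i] for i in range(n)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            potential: Dict[str, float] = {}
            weighted: Dict[str, List[float]] = {}
            for k in range(i, j):  # split into spans (i..k) and (k+1..j)
                left, right = chart[(i, k)], chart[(k + 1, j)]
                for (t1, t2), (t, module) in MODULES.items():
                    if t1 not in left or t2 not in right:
                        continue
                    score = score_fn(i, k, j, t1, t2) * left[t1][0] * right[t2][0]
                    out = module(left[t1][1], right[t2][1])
                    potential[t] = potential.get(t, 0.0) + score
                    acc = weighted.get(t, [0.0] * len(out))
                    weighted[t] = [a + score * o for a, o in zip(acc, out)]
            # Expected representation: score-weighted average over compositions.
            chart[(i, j)] = {
                t: (potential[t], [w / potential[t] for w in weighted[t]])
                for t in potential
            }
    return chart[(0, n - 1)]


# "not red cube" with hand-set lexical groundings and unit potentials.
tokens = [
    {"V": (1.0, (-10.0, 5.0))},
    {"E": (1.0, [1.0, 1.0, 0.0, 0.0])},
    {"E": (1.0, [0.0, 1.0, 1.0, 0.0])},
]
uniform = parse(tokens)
prefer_not_first = parse(
    tokens, lambda i, k, j, t1, t2: 9.0 if (i, k, j) == (0, 0, 2) else 1.0
)
```

With uniform scores the root denotation is an even mixture of the two bracketings, "not (red cube)" and "(not red) cube". When the score function favours splitting after "not", the mixture moves towards the first reading. Because each cell stores one expected representation per type rather than one per tree, the number of stored representations grows polynomially with sentence length even though the number of trees grows exponentially.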

Answer Grounding:

By recursively computing the phrase semantic-type potentials and representations, we can infer the semantic type distribution of the complete question and the resulting grounding for each semantic type t.

The answer type (boolean or subset of entities) for the question is the most likely semantic type of the span covering the complete question, and the corresponding grounding answers the question.
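As a minimal sketch of this last step, the function below normalizes the root potentials of the two answer types (T and E) and returns the more likely one together with its grounding. The normalization over exactly these two types and the name `answer` are assumptions made for illustration.

```python
from typing import Dict, Tuple


def answer(root: Dict[str, Tuple[float, object]]):
    """Pick the most likely answer type at the root and return its grounding."""
    candidates = {t: root[t] for t in ("T", "E") if t in root}
    total = sum(potential for potential, _ in candidates.values())
    type_probs = {t: potential / total for t, (potential, _) in candidates.items()}
    best = max(type_probs, key=type_probs.get)
    return best, type_probs[best], candidates[best][1]


# Hypothetical root cell: type -> (potential, grounding).
root_cell = {
    "T": (3.0, 0.9),
    "E": (1.0, [0.1, 0.8, 0.1]),
    "R": (5.0, None),  # not an answer type, so it is ignored
}
answer_type, answer_prob, grounding = answer(root_cell)
```

Here the question is treated as boolean with probability 0.75, and the returned grounding 0.9 is the probability that the answer is true.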

4.3 Training Objective

Given a dataset of (question, answer, knowledge-graph) tuples, we train our model to maximize the log-likelihood of the correct answers.
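One way to write this objective is shown below. The symbols $\mathcal{D}$, $q^{i}$, $a^{i}$ and $KG^{i}$ are notation assumed here for the dataset and its tuples, so this should be read as a sketch of the maximum-likelihood objective described in the text rather than as its exact typeset form.

```latex
\mathcal{D} = \left\{ \left(q^{i}, a^{i}, KG^{i}\right) \right\}_{i=1}^{|\mathcal{D}|},
\qquad
\mathcal{L} = \sum_{i=1}^{|\mathcal{D}|} \log p\left(a^{i} \mid q^{i}, KG^{i}\right)
```

Since every step from token grounding to the root of the chart is differentiable, the gradient of this sum with respect to the word vectors and parsing parameters can be computed by backpropagation through the chart.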


Further details regarding the training objective are given in Appendix A.

Model | Boolean Questions | Entity Set Questions | Relation Questions | Overall
LSTM (No KG) | 50.7 | 14.4 | 17.5 | 27.2
LSTM | 88.5 | 99.9 | 15.7 | 84.9
Bi-LSTM | 85.3 | 99.6 | 14.9 | 83.6
Tree-LSTM | 82.2 | 97.0 | 15.7 | 81.2
Tree-LSTM (Unsup.) | 85.4 | 99.4 | 16.1 | 83.6
Relation Network | 85.6 | 89.7 | 97.6 | 89.4
Our Model (Pre-parsed) | 94.8 | 93.4 | 70.5 | 90.8
Our Model | 99.9 | 100 | 100 | 99.9
Table 1: Results for Short Questions (CLEVRGEN): Performance of our model compared to baseline models on the Short Questions test set. The LSTM (No KG) has accuracy close to chance, showing that the questions lack trivial biases. Our model almost perfectly solves all questions, showing its ability to learn challenging semantic operators and to parse questions using only weak end-to-end supervision.

5 Dataset

We experiment with two datasets, 1) Questions generated based on the CLEVR (Johnson et al., 2017) dataset, and 2) Referring Expression Generation (GenX) dataset (FitzGerald et al., 2013), both of which feature complex compositional queries.


CLEVRGEN: We generate a dataset of questions and answers based on the CLEVR dataset (Johnson et al., 2017), which contains knowledge graphs containing attribute information of objects and relations between them.

We generate a new set of questions as existing questions contain some biases that can be exploited by models. (Johnson et al. (2017) found that many spatial relation questions can be answered only using absolute spatial information, and many long questions can be answered correctly without performing all steps of reasoning.) We employ some simple tests to remove trivial biases from our dataset. We generate 75K questions for training and 37.5K for validation. Our questions test various challenging semantic operators. These include conjunctions (e.g. Is anything red and large?), negations (e.g. What is not spherical?), counts (e.g. Are five spheres green?), quantifiers (e.g. Is every red thing cylindrical?), and relations (e.g. What is left of and above a cube?). We create two test sets:

  1. Short Questions: Drawn from the same distribution as the training data (37.5K).

  2. Complex Questions: Longer questions than the training data (22.5K). This test set contains the same words and constructions, but chained into longer questions. For example, it contains questions such as "What is a cube that is right of a metallic thing that is beneath a blue sphere?" and "Are two red cylinders that are above a sphere metallic?" Solving these questions requires more multi-step reasoning.

Referring Expressions (GenX) (FitzGerald et al., 2013): This dataset contains human-generated queries, which identify a subset of objects from a larger set (e.g. "all of the red items except for the rectangle"). It tests the ability of models to precisely understand human-generated language, which contains a far greater diversity of syntactic and semantic structures. This dataset does not contain relations between entities, and instead only focuses on entity-set operations. The dataset contains 3920 questions for training, 600 for development and 940 for testing. Our modules and parsing model were designed independently of this dataset, and we re-use hyperparameters from


6 Experiments

Our experiments investigate the ability of our model to understand complex synthetic and natural language queries, learn interpretable structure, and generalize compositionally. We also isolate the effect of learning the syntactic structure and representing sub-phrases using explicit denotations.

6.1 Experimentation Setting

We describe training details, and the baselines.

Training Details:

Training the model is challenging since it needs to learn both good syntactic structures and the complex semantics of neural modules—so we use Curriculum Learning (Bengio et al., 2009) to pre-train the model on an easier subset of questions. Appendix B contains the details of curriculum learning and other training details.

Baseline Models:

We compare to the following baselines. (a) Models that assume linear structure of language, and encode the question using linear RNNs: LSTM (No KG), LSTM, Bi-LSTM, and a Relation-Network (Santoro et al., 2017) augmented model (we use this baseline only for CLEVRGEN, since GenX does not contain relations). (b) Models that assume tree-like structure of language. We compare two variants of Tree-structured LSTMs (Zhu et al., 2015; Tai et al., 2015): Tree-LSTM, which operates on pre-parsed questions, and Tree-LSTM (Unsup.), an unsupervised Tree-LSTM model (Maillard et al., 2017) that learns to jointly parse and represent the sentence. For GenX, we also use an end-to-end semantic parsing model from Pasupat and Liang (2015). Finally, to isolate the contribution of the proposed denotational-semantics model, we train our model on pre-parsed questions. Note that all LSTM-based models only have access to the entities of the KG but not the relationship information between them. See Appendix C for details.

LSTM (No KG) | 46.0 | 39.6 | 41.4
LSTM | 62.2 | 49.2 | 52.2
Bi-LSTM | 55.3 | 47.5 | 49.2
Tree-LSTM | 53.5 | 46.1 | 47.8
Tree-LSTM (Unsup.) | 64.5 | 42.6 | 53.6
Relation Network | 51.1 | 38.9 | 41.5
Our Model (Pre-parsed) | 94.7 | 74.2 | 78.8
Our Model | 81.8 | 85.4 | 84.6
Table 2: Results for Complex Questions (CLEVRGEN): All baseline models fail to generalize well to questions requiring longer chains of reasoning than those seen during training. Our model substantially outperforms the baselines, showing its ability to perform complex multi-hop reasoning, and generalize from its training data. Analysis suggests that most errors from our model are due to assigning incorrect structures, rather than mistakes by the composition modules.

6.2 Experiments

Short Questions Performance:

Table 1 shows that our model answers almost all test questions correctly (99.9% overall), demonstrating that it can learn challenging semantic operators and induce parse trees from end-task supervision. Performance drops when using an external parser, showing that our model learns an effective syntactic model for this domain. The Relation Network also achieves good performance, particularly on questions involving relations. LSTM baselines work well on questions not involving relations (relation questions are out of scope for these models).

Complex Questions Performance:

Table 2 shows results on complex questions, which are constructed by combining components of shorter questions. These require complex multi-hop reasoning, and the ability to generalize robustly to new types of questions. We use the same models as in Table 1, which were trained on short questions. All baselines achieve close to random performance, despite high accuracy for shorter questions. This shows the challenges in generalizing RNN encoders beyond their training data. In contrast, the strong inductive bias from our model structure allows it to generalize well to complex questions. Our model outperforms Tree-LSTM (Unsup.) and the version of our model that uses pre-parsed questions, showing the effectiveness of explicit denotations and learning the syntax, respectively.

Model | Accuracy
LSTM (No KG) | 0.0
LSTM | 64.9
Bi-LSTM | 64.6
Tree-LSTM | 43.5
Tree-LSTM (Unsup.) | 67.7
Sempre | 48.1
Our Model (Pre-parsed) | 67.1
Our Model | 73.7
Table 3: Results for Human Queries (GenX): Our model outperforms LSTM and semantic parsing models on complex human-generated queries, showing that it is robust on natural language. Better performance than Tree-LSTM (Unsup.) shows the efficacy of representing sub-phrases using explicit denotations. Our model also performs better without an external parser, showing the advantages of latent syntax.

Performance on Human-generated Language:

Table 3 shows the performance of our model on complex human-generated queries in GenX. Our approach outperforms strong LSTM and semantic parsing baselines, despite the semantic parser's use of hard-coded operators. These results suggest that our method represents an attractive middle ground between minimally structured and highly structured approaches to interpretation. Our model learns to interpret operators such as "except" that were not considered during development. This shows that our model can learn to parse human language, which contains greater lexical and structural diversity than synthetic questions. Trees induced by the model are linguistically plausible (see Appendix D).

Error Analysis:

We find that most model errors are due to incorrect assignments of structure, rather than semantic errors from the modules. For example, in the question "Are four red spheres beneath a metallic thing small?", our model's parse composes "metallic thing small" into a constituent instead of composing "red spheres beneath a metallic thing" into a single node. Future work should explore more sophisticated parsing models.


While our model shows promising results, there is significant potential for future work. Performing exact inference over large KGs is likely to be intractable, so approximations such as KNN search, beam search, feature hashing or parallelization may be necessary. To model the large number of entities in KGs such as Freebase, techniques proposed by recent work (Verga et al., 2017; Gupta et al., 2017), which explore representing entities as a composition of their properties, such as types and descriptions, could be used. The modules in this work were designed to provide a good inductive bias for the kind of composition we expected them to model. For example, EV + E → E is modeled as a linear composition function, making it easy to represent words such as "and" and "or". These modules can be exchanged with any other function with the same "type signature", with different trade-offs. For example, more general feed-forward networks with greater representation capacity would be needed to represent a linguistic expression equivalent to xor. Similarly, more module types would be required to handle certain constructions: for example, a multiword relation such as "much larger than" needs a V + V → V module. This is an exciting direction for future research.

7 Related Work

Many approaches have been proposed to perform question-answering against structured knowledge sources. Semantic parsing models have learned structures over pre-defined discrete operators, to produce logical forms that can be executed to answer the question. Early work trained using gold-standard logical forms (Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2010), whereas later efforts have only used answers to questions (Clarke et al., 2010; Liang et al., 2011; Krishnamurthy and Kollar, 2013). A key difference is that our model must learn semantic operators from data, which may be necessary to model the fuzzy meanings of function words like many or few.

Another similar line of work is neural program induction models, such as Neural Programmer (Neelakantan et al., 2017) and Neural Symbolic Machine (Liang et al., 2017). These models learn to produce programs composed of predefined operators using weak supervision to answer questions against semi-structured tables.

Neural module networks have been proposed for learning semantic operators (Andreas et al., 2016b) for question answering. This model assumes that the structure of the semantic parse is given, and must only learn a set of operators. Dynamic Neural Module Networks (D-NMN) extend this approach by selecting from a small set of candidate module structures (Andreas et al., 2016a). We instead learn a model over all possible structures.

Our work is most similar to the N2NMN model (Hu et al., 2017), which learns both semantic operators and the layout in which to compose them. However, optimizing the layouts requires reinforcement learning, which is challenging due to the high variance of policy gradients, whereas our chart-based approach is end-to-end differentiable.

8 Conclusion

We have introduced a model for answering questions requiring compositional reasoning that combines ideas from compositional semantics with end-to-end learning of composition operators and structure. We demonstrated that the model is able to learn a number of complex composition operators from end task supervision, and showed that the linguistically motivated inductive bias imposed by the structure of the model allows it to generalize well beyond its training data. Future work should explore scaling the model to other question answering tasks, using more general composition modules, and introducing additional module types.


We would like to thank Shyam Upadhyay and the anonymous reviewers for their helpful suggestions.


  • Andreas et al. (2016a) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016a. Learning to compose neural networks for question answering. In NAACL-HLT.
  • Andreas et al. (2016b) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016b. Neural module networks. In CVPR, pages 39–48.
  • Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML.
  • Clarke et al. (2010) James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. 2010. Driving semantic parsing from the world’s response. In CoNLL.
  • FitzGerald et al. (2013) Nicholas FitzGerald, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Learning distributions over logical forms for referring expression generation. In EMNLP.
  • Gupta et al. (2017) Nitish Gupta, Sameer Singh, and Dan Roth. 2017. Entity linking via joint encoding of types, descriptions, and context. In EMNLP.
  • Hu et al. (2017) Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In ICCV.
  • Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR.
  • Klein and Manning (2003) Dan Klein and Christopher D Manning. 2003. Accurate unlexicalized parsing. In ACL.
  • Krishnamurthy and Kollar (2013) Jayant Krishnamurthy and Thomas Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. TACL, 1:193–206.
  • Kwiatkowski et al. (2010) Tom Kwiatkowski, Luke S. Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In EMNLP.
  • Liang et al. (2017) Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In ACL.
  • Liang et al. (2011) Percy Liang, Michael I Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In ACL.
  • Maillard et al. (2017) Jean Maillard, Stephen Clark, and Dani Yogatama. 2017. Jointly learning sentence embeddings and syntax with unsupervised Tree-LSTMs. CoRR, abs/1705.09189.
  • Montague (1973) Richard Montague. 1973. The proper treatment of quantification in ordinary English. In K. J. J. Hintikka, J. Moravcsic, and P. Suppes, editors, Approaches to Natural Language, pages 221–242. Reidel, Dordrecht.
  • Neelakantan et al. (2017) Arvind Neelakantan, Quoc V. Le, Martín Abadi, Andrew McCallum, and Dario Amodei. 2017. Learning a natural language interface with neural programmer. In ICLR.
  • Pasupat and Liang (2015) Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In ACL.
  • Perez et al. (2017) Ethan Perez, Harm de Vries, Florian Strub, Vincent Dumoulin, and Aaron C. Courville. 2017. Learning visual reasoning without strong priors. CoRR, abs/1707.03017.
  • Santoro et al. (2017) Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Timothy P. Lillicrap. 2017. A simple neural network module for relational reasoning. In NIPS.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL.
  • Verga et al. (2017) Patrick Verga, Arvind Neelakantan, and Andrew McCallum. 2017. Generalizing to unseen entities and entity pairs with row-less universal schema. In EACL.
  • Williams et al. (2018) Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do latent tree learning models identify meaningful structure in sentences? TACL, 6:253–267.
  • Zettlemoyer and Collins (2005) Luke S Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.
  • Zhu et al. (2015) Xiaodan Zhu, Parinaz Sobihani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In ICML.

Appendix A Training Objective

Given a dataset $\mathcal{D}$ of (question, answer, knowledge-graph) tuples, $\{(q^i, a^i, KG^i)\}_{i=1}^{|\mathcal{D}|}$, we train our model to maximize the log-likelihood of the correct answers. Answers are either booleans ($a \in \{0, 1\}$), or specific subsets of entities ($a = \{e_j\}$) from the KG. We denote the semantic type of the answer as $a_t$. The model's answer is found by taking the complete question representation, containing a distribution over types and the representation for each type. We maximize the following objective:

$$\mathcal{L} = \sum_{i} \log p(a^i \mid q^i, KG^i) = \sum_{i} \left( \mathcal{L}_b^i + \mathcal{L}_e^i \right)$$

where $\mathcal{L}_b^i$ and $\mathcal{L}_e^i$ are respectively the objective functions for questions with boolean answers and entity set answers.


We also add a regularization term for the scalar parsing features introduced in § 4.2.
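To make the shape of this objective concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that a boolean answer is scored by a single probability of being true, that an entity-set answer is scored by an independent membership probability per entity, and that the predicted probability of the answer's semantic type enters as an additive log term. The type keys `"T"` and `"E"`, the clamp constant `EPS`, and all function names are illustrative assumptions.

```python
import math

EPS = 1e-12  # assumed clamp that keeps log() finite; not taken from the paper


def _safe_log(p):
    """Logarithm of a probability clamped into [EPS, 1 - EPS]."""
    return math.log(min(max(p, EPS), 1.0 - EPS))


def boolean_log_likelihood(p_true, answer):
    """Log-likelihood of a boolean answer under a Bernoulli with P(True) = p_true."""
    return _safe_log(p_true) if answer else _safe_log(1.0 - p_true)


def entity_set_log_likelihood(entity_probs, answer_set):
    """Log-likelihood of an entity-set answer.

    Assumes each entity is an independent Bernoulli variable whose
    parameter is the model's membership probability for that entity.
    """
    total = 0.0
    for entity, p in entity_probs.items():
        total += _safe_log(p) if entity in answer_set else _safe_log(1.0 - p)
    return total


def example_objective(type_probs, p_true, entity_probs, answer):
    """Log-likelihood of one example: log P(answer type) plus the type-specific term."""
    if isinstance(answer, bool):
        return _safe_log(type_probs["T"]) + boolean_log_likelihood(p_true, answer)
    return _safe_log(type_probs["E"]) + entity_set_log_likelihood(entity_probs, set(answer))
```

Summing `example_objective` over the dataset gives the quantity to maximize; the regularization term would be subtracted separately.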

Appendix B Training Details

Representing Entities:

Each entity in CLEVRGEN and GenX datasets consists of attributes. For each attribute-value, we learn an embedding vector and concatenate these vectors to form the representation for the entity.
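As an illustration of this representation, the sketch below builds one vector per attribute-value and concatenates them per entity. It is a minimal sketch under stated assumptions: the attribute names, the embedding dimension, and the random initialization are hypothetical stand-ins, and in the model these vectors would be trainable parameters rather than fixed random values.

```python
import random


def make_embedding_table(attribute_values, dim, seed=0):
    """Create one vector per (attribute, value) pair.

    Random initialization stands in for embeddings that would be learned.
    """
    rng = random.Random(seed)
    return {
        (attr, val): [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        for attr, vals in attribute_values.items()
        for val in vals
    }


def entity_representation(entity, table, attribute_order):
    """Concatenate the attribute-value vectors of an entity in a fixed attribute order."""
    vec = []
    for attr in attribute_order:
        vec.extend(table[(attr, entity[attr])])
    return vec


# Hypothetical attribute inventory, used only for illustration.
ATTRIBUTE_VALUES = {
    "color": ["red", "blue"],
    "shape": ["sphere", "cube"],
    "size": ["small", "large"],
    "material": ["metal", "rubber"],
}
ATTRIBUTE_ORDER = ["color", "shape", "size", "material"]
```

Because the concatenation order is fixed, two entities that share an attribute value share the corresponding slice of their representation.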

Training details:

For curriculum learning on the CLEVRGEN dataset, we use a 2-step schedule: we first train our model on simple attribute match (What is a red sphere?), attribute existence (Is anything blue?), and boolean composition (Is anything green and is anything purple?) questions, and then, in the second step, on all questions. For GenX, we use a 5-step, question-length-based schedule, in which we first train on shorter questions and eventually on all questions.
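A length-based schedule of this kind can be sketched as follows. This is a hedged illustration only: the source does not state the length thresholds, so the linearly growing cap, the whitespace tokenization, and the function name `length_curriculum` are assumptions made for the example.

```python
def length_curriculum(questions, num_steps):
    """Return one training subset per step; later steps admit longer questions.

    The length cap grows linearly from the shortest to the longest question
    (an assumed rule), so the final step contains every question.
    """
    lengths = [len(q.split()) for q in questions]
    min_len, max_len = min(lengths), max(lengths)
    stages = []
    for step in range(1, num_steps + 1):
        cap = min_len + (max_len - min_len) * step / num_steps
        stages.append([q for q, n in zip(questions, lengths) if n <= cap])
    return stages
```

Training would then iterate over the returned stages in order, running some number of epochs on each before moving to the next.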

We tune hyper-parameters using validation accuracy on the CLEVRGEN dataset, and use the same hyper-parameters for both datasets. We train using SGD with a learning rate of , a mini-batch size of 4, and regularization constant of . When assigning the semantic type distribution to the words at the leaves, we add a small positive bias of for -type and a small negative bias of for the E-type score before the softmax. Our trainable parameters are: question word embeddings (-dimensional), relation embeddings (-dimensional), entity attribute-value embeddings (-dimensional), four vectors per word for V-type representations, parameter vector for the parsing model that contains six scalar feature scores per module per word, and the global parameter vector for the E+EE module.

Appendix C Baseline Models

C.1 LSTM (No KG)

We use an LSTM network to encode the question as a vector . We also define three other parameter vectors, , and that are used to predict the answer-type , entity attention value , and the probability of the answer being True .

C.2 LSTM

Similar to LSTM (No Relation), the question is encoded using an LSTM network as a vector . Similar to our model, we learn entity attribute-value embeddings and represent each entity as the concatenation of the attribute-value embeddings, . Similar to LSTM (No Relation), we also define the parameter vector to predict the answer-type. The entity-attention values are predicted as . To predict the probability of the boolean-type answer being true, we first add the entity representations to form , then make the prediction as .

C.3 Tree-LSTM

Training the Tree-LSTM model requires pre-parsed sentences, for which we use a PCFG parser that generates binary constituency trees (Klein and Manning, 2003). We input the pre-parsed question to the Tree-LSTM to get the question embedding . The rest of the model is the same as the LSTM model above.

C.4 Relation Network Augmented Model

The original formulation of the relation network module is as follows:

$$\mathrm{RN}(q, KG) = f_{\phi}\Big( \sum_{i,j} g_{\theta}(e_i, e_j, q) \Big)$$

where $e_i$, $e_j$ are the representations of the entities and $q$ is the question representation from an LSTM network. The output of the Relation Network module is a scalar score value for the elements in the answer vocabulary. Since our dataset contains entity-set valued answers, we modified the module in the following manner.

We concatenate the entity-pair representations with the representations of the pair of relations between them. (In the CLEVR dataset, only 2 directed relations are present between any pair of entities: left or right, and above or beneath.) We use the RN-module to produce an output representation for each entity as:


Similar to the LSTM baselines, we define a parameter vector to predict the answer-type, and compute the vector to compute the probability of the boolean type answer being true.

To predict the entity-attention values, we use a separate attribute-embedding matrix to first generate the output representation for each entity, , then predict the output attention values as follows:


We tried other architectures as well, but this modification provided the best performance on the validation set. We also tuned the hyper-parameters and found the setting from Santoro et al. (2017) to work best based on validation accuracy. We used a different 2-step curriculum to train the Relation Network module: in the first step we replace the boolean questions with the relation questions, and in the second step we train jointly on all questions.
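As a reference point for the module this baseline builds on, the pairwise aggregation in the original relation network of Santoro et al. (2017) can be sketched as below. This is a minimal illustration, not the modified per-entity variant described above: `g` and `f` are passed in as plain Python callables, whereas in practice both would be learned multi-layer perceptrons, and the toy functions are hypothetical.

```python
def relation_network(entities, question, g, f):
    """Sum g over all ordered entity pairs, conditioned on the question, then apply f."""
    pooled = None
    for e_i in entities:
        for e_j in entities:
            pair_out = g(e_i, e_j, question)
            if pooled is None:
                pooled = list(pair_out)
            else:
                pooled = [a + b for a, b in zip(pooled, pair_out)]
    return f(pooled)


# Toy stand-ins for the learned functions, for illustration only.
def toy_g(e_i, e_j, question):
    return [sum(e_i) * sum(e_j), sum(question)]


def toy_f(pooled):
    return sum(pooled)
```

Because the pair outputs are summed before `f` is applied, the result does not depend on the order in which entities are listed.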

C.5 Sempre

The semantic parsing model from Pasupat and Liang (2015) answers natural language queries for semi-structured tables. The answer is a denotation as a list of cells in the table. To use the Sempre framework, we convert the KGs in GenX to tables as follows:

  1. Each table has the first row (header) as:

  2. Each row contains an object id, and the 4 property-attribute values in cells.

  3. The answer denotation, i.e., the objects selected by the human annotators, is now represented as a list of object ids.
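The conversion steps above can be sketched as follows. This is an illustrative sketch only: the header column names, the attribute names, and the function names are hypothetical, and the actual input format expected by the Sempre framework is not reproduced here.

```python
def kg_to_table(entities, attribute_names):
    """Turn a mapping {object_id: {attribute: value}} into a header row plus one row per object."""
    header = ["id"] + list(attribute_names)  # hypothetical column names
    rows = [
        [str(obj_id)] + [attrs[name] for name in attribute_names]
        for obj_id, attrs in sorted(entities.items())
    ]
    return [header] + rows


def answer_to_ids(selected_objects):
    """Represent the answer denotation as a sorted list of object ids."""
    return sorted(str(obj_id) for obj_id in selected_objects)
```

For a KG with two objects, `kg_to_table` returns three rows: the header followed by one row per object, each holding the object id and its four attribute values.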

After converting the KGs to tables, the Sempre framework can be trivially used to train and test on the GenX dataset. We tune the number of epochs to train for based on the validation accuracy and find  epochs over the training data to work the best. We use the default setting of the other hyper-parameters.

Appendix D Example Output Parses

In Figure 3, we show example queries and their highest scoring output structures from our learned model for the GenX dataset.

(a) An example output from our learned model showing that our model learns to correctly parse the question sentence and to model the relevant semantic operator (or as a set union operation) to generate the correct answer denotation. It also learns to cope with lexical variability in human language, such as a triangle being referred to as a ramp.
(b) An example output from our learned model that shows that our model can learn to correctly parse human-generated language into relatively complex structures and model semantic operators, such as except, that were not encountered during model development.
Figure 3: Example output structures from our learned model: Examples of queries from the GenX dataset and the corresponding highest scoring tree structures from our learned model. The examples show that our model is able to correctly parse human-generated language and jointly learn to model semantic operators, such as set unions and negations.