Grammar-Based Grounded Lexicon Learning

by   Jiayuan Mao, et al.

We present Grammar-Based Grounded Lexicon Learning (G2L2), a lexicalist approach toward learning a compositional and grounded meaning representation of language from grounded data, such as paired images and texts. At the core of G2L2 is a collection of lexicon entries, which map each word to a tuple of a syntactic type and a neuro-symbolic semantic program. For example, the word shiny has a syntactic type of adjective; its neuro-symbolic semantic program has the symbolic form λx. filter(x, SHINY), where the concept SHINY is associated with a neural network embedding, which will be used to classify shiny objects. Given an input sentence, G2L2 first looks up the lexicon entries associated with each token. It then derives the meaning of the sentence as an executable neuro-symbolic program by composing lexical meanings based on syntax. The recovered meaning programs can be executed on grounded inputs. To facilitate learning in an exponentially-growing compositional space, we introduce a joint parsing and expected execution algorithm, which does local marginalization over derivations to reduce the training time. We evaluate G2L2 on two domains: visual reasoning and language-driven navigation. Results show that G2L2 can generalize from small amounts of data to novel compositions of words.





1 Introduction

Human language learning suggests several desiderata for machines learning from language. Humans can learn grounded and compositional representations for novel words from few examples. These representations are grounded on contexts, such as visual perception. Humans also know how these words relate to each other in composing the meaning of a sentence.

Syntax, the structured, order-sensitive relations among words in a sentence, is crucial to humans' learning and compositional abilities for language. According to lexicalist linguistic theories (Pollard and Sag, 1994; Steedman, 2000; Bresnan et al., 2016), syntactic knowledge involves a small number of highly abstract and potentially universal combinatory rules, together with a large amount of learned information in the lexicon: a rich syntactic type and meaning representation for each word.

Fig. 1 illustrates this idea in a visually grounded language acquisition setup. The language learner looks at a few examples containing the novel word shiny (Fig. 1a). They also have a built-in, compact but universal set of combinatory grammar rules (Fig. 1b) that describe how the semantic programs of words can be combined based on their syntactic types. The learner can recover the syntactic type of the novel word and its semantic meaning. For example, shiny is an adjective and its meaning can be grounded on visually shiny objects in images (Fig. 1c). This representation supports the interpretation of novel sentences in a novel visual context (Fig. 1d).

In this paper, we present Grammar-Based Grounded Lexicon Learning (G2L2), a neuro-symbolic framework for grounded language acquisition. At the core of G2L2 is a collection of grounded lexicon entries. Each lexicon entry maps a word to (i) a syntactic type, and (ii) a neuro-symbolic semantic program. For example, the lexicon entry for the English word shiny has a syntactic type of objset/objset: it composes with another constituent of type objset on its right, and produces a new constituent of syntactic type objset. In Fig. 1d, for instance, the word shiny composes with the word cube and yields a new constituent of type objset. The neuro-symbolic semantic program for shiny has the form λx. filter(x, SHINY), where SHINY is a concept automatically discovered by G2L2 and associated with a learned vector embedding for classifying shiny objects. G2L2 parses sentences based on these grounded lexicon entries and a small set of combinatory categorial grammar (CCG; Steedman, 1996) rules. Given an input question, G2L2 looks up the lexicon entries associated with each token, and composes these lexical semantic programs based on their syntactic types.

G2L2 takes a lexicalist approach toward grounded language learning and focuses on data efficiency and compositional generalization to novel contexts. Inspired by lexicalist linguistic theories, but in contrast to neural network-based end-to-end learning, G2L2 uses a compact symbolic grammar to constrain how the semantic programs of individual words can be composed, and focuses on learning the lexical representation. This approach yields strong data efficiency in learning new words, and strong generalization to new word compositions and sentences with more complex structures.

We are interested in jointly learning these neuro-symbolic grounded lexicon entries and the grounding of individual concepts from grounded language data, such as by simultaneously looking at images and reading parallel question–answer pairs. This is particularly challenging because the number of candidate lexicon entry combinations for a sentence grows exponentially with the sentence length. For this reason, previous approaches to lexicon learning have either assumed an expert-annotated set of lexical entries (Zettlemoyer and Collins, 2009) or only attempted to learn at very small scales (Gauthier et al., 2018). We address this combinatorial explosion with a novel joint parsing and expected execution mechanism, namely CKY-E2, which extends the classic CKY chart parsing algorithm. It performs local marginalization of distributions over sub-programs to make the search process tractable.

Figure 1: G2L2 learns from grounded language data, for example, by looking at images and reading parallel question–answer pairs. It learns a collection of grounded lexicon entries comprised of weights, syntax types, semantics forms, and optionally, grounded embeddings associated with semantic concepts. These lexicon entries can be used to parse questions into programs.

In sum, our paper makes three specific contributions. First, we present the neuro-symbolic G2L2 model that learns grounded lexical representations without requiring annotations for the concepts to be learned or partial word meanings; it automatically recovers underlying concepts in the target domain from language and experience with their groundings. Second, we introduce a novel expected execution mechanism for parsing in model training, to facilitate search in the compositional grammar-based space of meanings. Third, through systematic evaluation on two benchmarks, visual reasoning in CLEVR (Johnson et al., 2017) and language-driven navigation in SCAN (Lake and Baroni, 2018), we show that the lexicalist design of G2L2 enables learning with strong data efficiency and compositional generalization to novel linguistic constructions and deeper linguistic structures.

2 Grammar-Based Grounded Lexicon Learning

Our framework, Grammar-Based Grounded Lexicon Learning (G2L2), learns grounded lexicons from cross-modal data, such as paired images and texts. Throughout this section, we use the visual reasoning task, specifically visual question answering (VQA), as the running example, but the idea itself can be applied to other tasks and domains, such as image captioning and language-driven navigation.

G2L2 learns from a collection of VQA data tuples, each containing an image, a question, and an answer to the question. In G2L2, each word type is associated with one or multiple lexical entries, each comprised of a syntactic type and a semantic program. Given the input question, G2L2 first looks up the lexicon entries associated with each individual token in the sentence (Fig. 2I). G2L2 then uses a chart parsing algorithm to derive the programmatic meaning representation of the entire sentence by recursively composing meanings based on syntax (Fig. 2II). To answer the question, we execute the program on the image representation (Fig. 2III). During training, we compare the answer derived from the model with the groundtruth answer to form the supervision for the entire system. No additional supervision, such as lexicon entries for certain words or concept labels, is needed.

Figure 2: G2L2 parses the input sentence into an executable neuro-symbolic program by (I) looking up the lexicon entries associated with each word, and then (II) computing the most probable parse tree and the corresponding program with a chart parsing algorithm. The derived program can be grounded and executed on an image with a neuro-symbolic reasoning process (Mao et al., 2019) (III).

2.1 Grounded Lexicon

Figure 3: Each word is associated with a grounded lexicon, comprised of its syntactic type and a neuro-symbolic semantic program.

At a high level, G2L2 follows the combinatory categorial grammar (CCG; Steedman, 1996) formalism to maintain lexicon entries and parse sentences. As illustrated in Fig. 3, each word (e.g., shiny) is associated with one or multiple entries. Each entry is a tuple comprised of a syntactic type (e.g., objset/objset) and a semantic meaning form (e.g., λx. filter(x, SHINY)). The meaning form is a symbolic program represented in a typed domain-specific language (DSL) and can be executed on the input image. Some programs contain concepts (in this case, SHINY) that can be visually grounded.

Typed domain-specific language. G2L2 uses a DSL to represent word meanings. For the visual reasoning domain, we use the CLEVR DSL (Johnson et al., 2017). It contains object-level operations such as selecting all objects having a particular attribute (e.g., the shiny objects) or selecting all objects having a specific relationship with a certain object (e.g., the objects left of the cube). It also supports functions that respond to user queries, such as counting the number of objects or querying a specific attribute (e.g., shape) of an object. The language is typed: most functions take a set of objects or a single object as their input, and produce another set of objects. For example, the operation filter has the signature filter(objset, concept) → objset and returns all objects in the input set that have the given concept (e.g., all shiny objects).

Syntactic types. There are two kinds of syntactic types in G2L2: primitive and complex. (In some domains we also use conjunctions (CONJ) in the coordination rule.) The primitive types are defined in the typed domain-specific language (e.g., objset, int). A complex type, denoted as X/Y or X\Y, is a functor type that takes an argument of type Y and returns an object of type X. The direction of the slash indicates word order: for X/Y, the argument Y must appear on the right, whereas for X\Y, it must appear on the left. Note that X and Y can themselves be complex types, which allows us to define functor types with multiple arguments, such as (X\Y)/Z, or even functors with functor arguments (e.g., (X\Y)/(Z/Z)).
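The slash notation can be made concrete with a small sketch. The class names `Prim` and `Functor` and the helper `combine` below are hypothetical and only illustrate forward and backward function application on types; they are not taken from the G2L2 implementation, and other CCG rules (such as coordination) are omitted.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Prim:
    """A primitive syntactic type such as objset or int."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Functor:
    """A complex type X/Y or X\\Y: takes an argument Y and returns X."""
    result: "SynType"
    slash: str  # "/" means the argument is on the right, "\\" on the left
    arg: "SynType"

    def __str__(self) -> str:
        return f"({self.result}{self.slash}{self.arg})"


SynType = Union[Prim, Functor]


def combine(left: SynType, right: SynType) -> Optional[SynType]:
    """Return the type of two adjacent constituents, or None if they do not combine."""
    # Forward application: X/Y followed by Y yields X.
    if isinstance(left, Functor) and left.slash == "/" and left.arg == right:
        return left.result
    # Backward application: Y followed by X\Y yields X.
    if isinstance(right, Functor) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


objset = Prim("objset")
shiny = Functor(objset, "/", objset)  # objset/objset
print(combine(shiny, objset))         # shiny + cube: prints objset
print(combine(objset, shiny))         # wrong word order: prints None
```

Because the types are frozen dataclasses, equality is structural, so nested functor types such as (X\Y)/Z compare correctly without extra code.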

In G2L2, the semantic type of a word meaning (in the DSL), together with a set of directional and ordering settings for its arguments (reflecting how the word and its arguments should be linearized in text), uniquely determines the word's syntactic type. For example, the syntactic type for the word shiny is objset/objset. First, it states that shiny acts as a function in meaning composition: it takes a subprogram that outputs a set of objects (e.g., the program for cube) as its argument, and produces a new program whose output is also a set of objects (in this case, the shiny cubes). Second, it states the direction of the argument, which should come from its right.

Neuro-symbolic programs. Some functions in the DSL involve concepts that will be grounded in other modalities, such as the visual appearance of objects and their spatial relationships. Taking the function filter as an example, its second argument, concept, should be associated with the visual representation of objects. In G2L2, the meaning of each lexicon entry may involve one or more constants (called "concepts") that are grounded on other modalities, possibly via deep neural embeddings. In the case of shiny, the program is λx. filter(x, SHINY). The concept SHINY is associated with a vector embedding in a joint visual-semantic embedding space, following Kiros et al. (2014). During program execution, we compare the embedding of the concept SHINY with the object embeddings extracted from the input image, in order to select all shiny objects.

Lexicon learning. G2L2 learns lexicon entries in the following three steps. (i) First, we enumerate all possible semantic meaning programs derived from the DSL. For example, in the visual reasoning domain, a candidate program is λx. filter(x, ?), where ? denotes a concept argument. When we try to associate this lexicon entry with the word shiny, the program is instantiated as λx. filter(x, SHINY), where SHINY is a new concept associated with a vector embedding. Typically, we set a maximum number of arguments for each program and constrain its depth. We explain how we set these hyperparameters for different domains in the supplementary material. (ii) Next, for programs that have a primitive type, we use the semantic type as the syntactic type (e.g., objset). For programs that are functions with arguments, we enumerate the possible orderings of the arguments. For example, the program λx. filter(x, SHINY) has two candidate syntactic types: objset/objset (the argument is on its right in language) and objset\objset (the argument is on its left). (iii) Finally, we associate each candidate lexicon entry $e$ with a learnable scalar weight $\tau(e)$. It is typical for a single word to have tens or hundreds of candidate entries, and we optimize these lexicon entry weights in the training process. In practice, we assume no lexical ambiguity, i.e., each word type has only one lexical entry. Thus, the ambiguity of parsing only comes from different syntactic derivation orders for the same lexical entries. This also allows us to prune lexicon entries that do not lead to successful derivations during training.
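Step (ii) can be sketched as a small enumeration. The function `candidate_syntactic_types` is a hypothetical helper, and the enumeration scheme (every argument order combined with every slash direction) is an assumption made for illustration; the constraints actually used by G2L2 are described in its supplementary material and may be stricter.

```python
from itertools import permutations, product


def candidate_syntactic_types(result, args):
    """Enumerate syntactic types for a program with the given semantic signature.

    result: name of the primitive output type, e.g. "objset".
    args:   list of argument type names that must be supplied by other words.
    Returns a set of type strings written in slash notation.
    """
    if not args:
        # A program without arguments uses its semantic type directly.
        return {result}
    types = set()
    for order in permutations(args):
        for slashes in product("/\\", repeat=len(args)):
            current = result
            for arg, slash in zip(order, slashes):
                # Parenthesize the partial type once it is itself a functor.
                if "/" in current or "\\" in current:
                    current = f"({current})"
                current = f"{current}{slash}{arg}"
            types.add(current)
    return types


# The one-argument program for "shiny": two candidates, differing in direction.
print(sorted(candidate_syntactic_types("objset", ["objset"])))
```

With two distinct argument types the same call yields eight candidates (two orders times four slash assignments), which hints at why a single word can end up with many candidate entries.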

2.2 Program Execution

Any fully grounded program (i.e., a program without unbound arguments) can be executed based on the image representation. We implement the Neuro-Symbolic Concept Learner (NS-CL; Mao et al., 2019) as our differentiable program executor, which consists of a collection of deterministic functional modules that realize the operations in the DSL. NS-CL represents execution results in a "soft" manner: in the visual reasoning domain, a set of objects is represented as a vector mask $m$ of length $N$, where $N$ is the number of objects in the scene. Each element $m_i$ can be interpreted as the probability that object $i$ is in the set. For example, the operation λx. filter(x, SHINY) receives an input mask $m$ and produces a mask $m'$ that selects all shiny objects in the input set. The computation has two steps: (i) compare the vector embedding of the concept SHINY with all objects in the scene to obtain a mask $m^{\text{SHINY}}$, denoting the probability of each object being shiny; (ii) compute the element-wise product $m' = m \odot m^{\text{SHINY}}$, which can be further used as the input to other functions. In NS-CL, the execution result of any program is fully differentiable w.r.t. the input image representation and the concept embeddings (e.g., SHINY).
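The two-step filter computation can be illustrated with a minimal NumPy sketch. The scoring function (a sigmoid over cosine similarity), the `temperature` parameter, and the toy embeddings are assumptions made for illustration; they are not the NS-CL implementation.

```python
import numpy as np


def concept_mask(object_embs, concept_emb, temperature=1.0):
    """Step (i): probability that each object has the concept.

    Assumption: cosine similarity squashed by a sigmoid stands in for the
    learned comparison in the joint visual-semantic embedding space.
    """
    objs = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    concept = concept_emb / np.linalg.norm(concept_emb)
    similarity = objs @ concept
    return 1.0 / (1.0 + np.exp(-similarity / temperature))


def soft_filter(input_mask, object_embs, concept_emb):
    """Step (ii): element-wise product of the input mask and the concept mask."""
    return input_mask * concept_mask(object_embs, concept_emb)


# Toy scene with three objects and a hypothetical SHINY embedding.
object_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
shiny_emb = np.array([1.0, 0.0])
all_objects = np.ones(3)  # the mask that contains every object

shiny_objects = soft_filter(all_objects, object_embs, shiny_emb)
print(shiny_objects)  # one probability per object
```

The output is again a length-N mask, so it can be passed directly to another `soft_filter` call, which mirrors how nested programs are executed.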

Algorithm 1: The CKY-E2 algorithm.
Input: $w_1, \dots, w_L$: the input sentence; $L$: sentence length; $e_i^j$: the $j$-th lexicon entry associated with word $w_i$; $\tau(e_i^j)$: lexicon weights.
Output: the execution results of all possible derivations and their weights.
for i ← 1 to L do
     Initialize dp[i, i+1] with the lexicon entries e_i^j and weights τ(e_i^j)
end for
for ℓ ← 2 to L do
     for i ← 1 to L − ℓ + 1 do
          j ← i + ℓ
          dp[i, j] ← empty list
          for k ← i + 1 to j − 1 do
               Try to combine nodes in dp[i, k] and dp[k, j]
               Append successful combinations to dp[i, j]
          end for
          ExpectedExecution(dp[i, j])
     end for
end for
procedure ExpectedExecution(a: a list of derivations)
     while there exist x, y in a that are identical except for subtrees of the same type do
          Create z from x and y by computing the expected execution results for the non-identical subtrees
          Set the weight of z to log(exp τ(x) + exp τ(y))
          Replace x and y in a with z
     end while
end procedure

Figure 4: An illustrative example of two semantic programs that can be merged by computing the expected execution results of two subtrees (highlighted in gray). Both subtrees output a vector of scores indicating the objects being selected.

2.3 Joint Chart Parsing and Expected Execution (CKY-E2)

G2L2 extends a standard dynamic programming algorithm for chart parsing (i.e., the CKY algorithm (Kasami, 1966; Younger, 1967; Cocke, 1969)) to compose sentence meaning from lexical meaning forms, based on syntax. Denote $w = \{w_1, \dots, w_L\}$ as the input word sequence, $e_i^j$ the $j$-th lexicon entry associated with word $w_i$, and $\tau(e_i^j)$ the corresponding weight. Consider all possible derivations of the question, $\{\mathit{derivation}_k\}$, $k = 1, 2, \dots$. We define the following context-free probability distribution over derivations:

$$p(\mathit{derivation}_k) \propto \exp\Big(\sum_{e \in \mathit{derivation}_k} \tau(e)\Big).$$

That is, the probability is exponentially proportional to the total weight of all lexicon entries used by the specific derivation.
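As a small illustration of this distribution, the sketch below normalizes exponentiated weight sums over an explicit list of derivations. The function name `derivation_probs` and the toy weights are hypothetical; enumerating derivations explicitly like this is only feasible for tiny examples.

```python
import math


def derivation_probs(derivations):
    """Softmax over derivations, each given as the list of its lexicon-entry weights."""
    scores = [sum(weights) for weights in derivations]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Three toy derivations of the same sentence; the first and third use entries
# with the same total weight and therefore receive the same probability.
probs = derivation_probs([[1.0, 0.5], [0.2, 0.3], [0.5, 1.0]])
print(probs)
```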

A straightforward way to support joint learning of lexicon weights and neural modules (e.g., filter) is to simply execute all possible derivations on the input image and compare the answers with the groundtruth. However, the number of possible derivations grows exponentially with the question length, making such computation intractable. For example, in SCAN (Lake and Baroni, 2018), each word has 178 candidate lexicon entries, so the number of lexicon entry combinations for a sentence with 5 words is $178^5 \approx 1.8 \times 10^{11}$. To address this issue, we introduce the idea of expected execution, which essentially computes the "expected" execution result of all possible derivations. We further accelerate this process by local marginalization.

Our CKY-E2 algorithm is shown in Algorithm 1. It processes all spans sequentially, ordered by their length. The composition of derivations for a span $[i, j)$ has two stages. First, it enumerates each possible split point $k$ and tries to combine the derivations of $[i, k)$ and $[k, j)$. This step is identical to the standard CKY parsing algorithm. Next, if there are two derivations $x$ and $y$ of span $[i, j)$ whose program structures are identical except for subtrees that can be partially evaluated (i.e., subtrees that do not contain any unbound arguments), we compress these two derivations into one by marginalizing the execution result for that subtree.
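The first stage, which matches standard CKY, can be sketched as follows. This is a simplified illustration that tracks only syntactic types and uses only function application; it omits semantic programs, weights, and the expected-execution step, and the tuple encoding of functor types and the toy lexicon are assumptions.

```python
def combine(left, right):
    """Function application on types.

    A primitive type is a string; a functor type is a tuple (result, slash, arg).
    """
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]   # forward application: X/Y Y => X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]  # backward application: Y X\Y => X
    return None


def cky(lexicon, words):
    """Return the types of all derivations spanning the whole sentence."""
    n = len(words)
    chart = {}
    for i, word in enumerate(words):
        chart[i, i + 1] = list(lexicon[word])  # spans of length 1
    for length in range(2, n + 1):             # process spans by length
        for i in range(0, n - length + 1):
            j = i + length
            cell = []
            for k in range(i + 1, j):          # enumerate split points
                for left in chart[i, k]:
                    for right in chart[k, j]:
                        result = combine(left, right)
                        if result is not None:
                            cell.append(result)
            chart[i, j] = cell
    return chart[0, n]


# Toy lexicon: "shiny" has two candidate types, only one of which parses here.
lexicon = {
    "shiny": [("objset", "/", "objset"), ("objset", "\\", "objset")],
    "red": [("objset", "/", "objset")],
    "cube": ["objset"],
}
print(cky(lexicon, ["shiny", "red", "cube"]))  # ['objset']
```

In the full algorithm each chart cell would additionally be compressed after it is filled, so that derivations differing only in an evaluable subtree are merged into one.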

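See the example in Fig. 4. The two programs have identical structure, except for the second argument to the outermost relate operation. However, these subtrees, highlighted in gray, can be partially evaluated on the input image, and both of them output a vector of scores indicating the objects being selected. Denote $\tau_1$ and $\tau_2$ as the weights associated with the two derivations, and $v_1$ and $v_2$ as the partial evaluation results (vectors) for the two subtrees. We replace these two candidate meaning forms with a single form $z$, in which the differing subtree is replaced by its expected execution result $\bar{v}$ and whose weight is $\tau_z$:

$$\bar{v} = \frac{\exp(\tau_1)\, v_1 + \exp(\tau_2)\, v_2}{\exp(\tau_1) + \exp(\tau_2)}, \qquad \tau_z = \log\big(\exp(\tau_1) + \exp(\tau_2)\big).$$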

We provide additional running examples of the algorithm in the supplementary material.
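The merge of two derivations can be sketched numerically. The sketch assumes, consistent with the distribution over derivations defined above, that a derivation's unnormalized probability is the exponential of its weight, so that the merged result is a probability-weighted average and the merged weight is a log-sum-exp. The function name `merge_derivations` and the toy masks are hypothetical.

```python
import numpy as np


def merge_derivations(tau1, v1, tau2, v2):
    """Merge two derivations that differ only in an already-evaluated subtree.

    tau1, tau2: scalar derivation weights (log-scale scores).
    v1, v2:     execution results of the differing subtrees (object masks).
    Returns the merged weight and the expected execution result.
    """
    v1, v2 = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    top = max(tau1, tau2)  # subtract the maximum for numerical stability
    w1, w2 = np.exp(tau1 - top), np.exp(tau2 - top)
    expected = (w1 * v1 + w2 * v2) / (w1 + w2)
    merged_tau = top + np.log(w1 + w2)
    return merged_tau, expected


# Two equally weighted candidates that select different objects.
tau, mask = merge_derivations(0.0, [1.0, 0.0], 0.0, [0.0, 1.0])
print(tau, mask)
```

After the merge only one derivation remains for this program layout, which is what keeps the number of chart entries from growing with the number of lexical alternatives.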

Complexity. Intuitively, once we have determined the semantics of a constituent in the question, the concrete meaning form of the derivation does not matter for future program execution, provided that the meaning form can already be partially evaluated on the input image. This joint parsing and expected execution procedure reduces the exponential space of possible parses to a space that is polynomial in the number of possible program layouts that cannot be partially evaluated, which, in practice, is small. The complexity of CKY-E2 is polynomial with respect to the length of the sentence, $L$, and the number of candidate lexicon entries, $M$. More specifically, it is $O(L^3 M)$, where $O(L^3)$ comes from the chart parsing algorithm, and the number of derivations after the expected execution procedure is $O(M)$. This result is obtained by treating the maximum arity of functor types as a constant (e.g., 2). Intuitively, for each span, all possible derivations associated with this span can be grouped into 4 categories: derivations of a primitive type, derivations of a 1-ary functor type, derivations of a 2-ary functor type, and derivations of a 2-ary functor type with one argument bound. All these numbers grow linearly w.r.t. $M$. For a detailed analysis, please refer to our supplementary material.

Correctness. One can prove that, if all operations $f$ in the program layout commute with the expectation operator, i.e., if $\mathbb{E}[f(x)] = f(\mathbb{E}[x])$, our CKY-E2 computes the expected execution result exactly. These operations include tensor addition, multiplication (if the tensors are independent), and concatenation, which cover most of the computation we do in neuro-symbolic program execution. For example, for filter, taking the expectation over different inputs before doing the filtering is the same as taking the expectation over the filter results of different inputs. However, there are operations, such as quantifiers, whose semantics do not commute with the expectation operator. In practice, it is still possible to use the expected execution framework as an approximation. We leave the application of other approximate inference techniques as future work. We provide proofs and connections with other formalisms in the supplementary material.
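The commutativity condition can be checked numerically on a toy example. The concept mask, the input masks, and the use of `max` as a stand-in for an existential quantifier are illustrative assumptions, not the operations defined in the G2L2 executor.

```python
import numpy as np

# Probabilities of two alternative sub-derivations and their output masks.
p = np.array([0.7, 0.3])
inputs = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
shiny_mask = np.array([0.9, 0.2, 0.5])  # hypothetical per-object concept scores


def soft_filter(mask):
    """Linear in the input mask, so it commutes with the expectation."""
    return mask * shiny_mask


def exists(mask):
    """A max-based quantifier (an assumption); max is not linear."""
    return mask.max()


# Filter: expectation before or after the operation gives the same result.
filter_of_expectation = soft_filter(p @ inputs)
expectation_of_filter = p @ np.array([soft_filter(m) for m in inputs])
print(np.allclose(filter_of_expectation, expectation_of_filter))

# Quantifier: the two orders disagree, so expected execution is approximate.
exists_of_expectation = exists(p @ inputs)
expectation_of_exists = p @ np.array([exists(m) for m in inputs])
print(exists_of_expectation, expectation_of_exists)
```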

2.4 Learning

Our model, G2L2, can be trained in an end-to-end manner, by looking at images and reading paired questions and answers. We denote $\ell$ as a loss function that compares the output of a program execution (e.g., a probability distribution over possible answers) with the groundtruth. More precisely, given all possible derivations $\{\mathit{derivation}_k\}$, the image representation $I$, the answer $A$, and the executor $\mathcal{E}(\cdot, I)$, we optimize all parameters by minimizing the loss $\mathcal{L}$:

$$\mathcal{L} = \sum_k p(\mathit{derivation}_k) \cdot \ell\big(\mathcal{E}(\mathit{derivation}_k, I), A\big).$$

In practice, we use gradient-based optimization for both the neural network weights in the concept grounding modules and the lexicon weights $\tau$.
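A minimal sketch of such an objective is given below, assuming that the training loss is the expectation, under the derivation distribution, of a cross-entropy loss on each derivation's answer distribution. Both that exact form and the function name `expected_loss` are assumptions made for illustration, and the explicit enumeration of derivations is only practical for toy inputs.

```python
import numpy as np


def expected_loss(derivation_weights, answer_dists, answer_index):
    """Expected negative log-likelihood of the groundtruth answer.

    derivation_weights: total lexicon weight of each derivation.
    answer_dists:       one predicted answer distribution per derivation.
    answer_index:       index of the groundtruth answer.
    """
    weights = np.asarray(derivation_weights, dtype=float)
    probs = np.exp(weights - weights.max())
    probs /= probs.sum()  # softmax over derivations
    dists = np.asarray(answer_dists, dtype=float)
    per_derivation = -np.log(dists[:, answer_index])
    return float(probs @ per_derivation)


# Two derivations; the first one predicts the correct answer (index 0) better.
answer_dists = [[0.9, 0.1], [0.5, 0.5]]
uniform = expected_loss([0.0, 0.0], answer_dists, answer_index=0)
favoring_first = expected_loss([2.0, 0.0], answer_dists, answer_index=0)
print(uniform, favoring_first)
```

Raising the weight of the derivation that answers correctly lowers the loss, which is the signal that gradient-based optimization of the lexicon weights would follow.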

3 Experiment

We evaluate G2L2 on two domains: visual reasoning in CLEVR Johnson et al. (2017) and language-driven navigation in SCAN Lake and Baroni (2018). Beyond the grounding accuracy, we also evaluate the compositional generalizability and data efficiency, comparing G2L2 with end-to-end neural models and modular neural networks.

3.1 Visual Reasoning

| Model | Prog? | Concept? | Standard (10%) | Standard (100%) | Comp. Gen. (purple) | Comp. Gen. (right of) | Comp. Gen. (count) | Depth |
|---|---|---|---|---|---|---|---|---|
| MAC (Hudson and Manning, 2018) | N/N | N/N | 85.39 | 98.61 | 97.14 | 90.85 | 54.87 | 77.40 |
| TbD-Net (Mascharka et al., 2018) | Y/Y | N/N | 44.52 | 98.04 | 89.57 | 49.92 | 63.37 | 53.13 |
| NS-VQA (Yi et al., 2018) | Y/Y | Y/Y | **98.57** | 98.57 | 95.52 | **99.80** | 81.81 | 50.45 |
| NS-CL (Mao et al., 2019) | Y/N | Y/N | 98.51 | **98.91** | **98.02** | 99.01 | 18.88 | 81.60 |
| G2L2 (ours) | Y/N | Y/N | 98.11 | 98.25 | 97.82 | 98.59 | **96.76** | **98.49** |

Table 1: Accuracy on the CLEVR dataset. Our model achieves results comparable with state-of-the-art approaches on the standard training-testing split. It significantly outperforms all baselines on generalization to novel word compositions and to sentences with deeper structures. The best number in each column is bolded. The second column indicates whether the model uses a program-based representation of question meaning and whether it needs program annotations for training questions. The third column indicates whether the model explicitly models individual concepts and whether it needs concept annotations for objects during training.

We first evaluate G2L2 on the visual reasoning tasks in the CLEVR domain (Johnson et al., 2017), where the task is to reason about images and answer questions about them. In our study, we use a subset of the CLEVR dataset that excludes sentences involving coreference resolution and words with multiple meanings in different contexts. We describe how we filter the dataset in the supplementary material.


Instead of using manually defined heuristics for curriculum learning or self-paced learning as in previous works (Mao et al., 2019; Li et al., 2020), we employ a curriculum learning setup that is simply based on sentence length: we gradually add longer sentences into the training set. This helps the model learn basic words from very short sentences (6 words), and use the acquired lexicon to facilitate learning from longer sentences (20 words). Since CLEVR does not provide test set annotations, for all models, we hold out 10% of the training data for model development and test the models on the CLEVR validation split.
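A length-based curriculum of this kind can be sketched in a few lines. The stage thresholds, the whitespace tokenization, and the dictionary format of the examples are assumptions chosen for illustration; the schedule actually used is not specified here.

```python
def length_curriculum(examples, stages):
    """Yield (max_length, subset) pairs with a growing sentence-length limit.

    examples: list of dicts with a "question" string.
    stages:   increasing maximum question lengths, measured in words.
    """
    for max_length in stages:
        subset = [ex for ex in examples
                  if len(ex["question"].split()) <= max_length]
        yield max_length, subset


examples = [
    {"question": "how many cubes are there"},
    {"question": "what is the shape of the shiny object"},
    {"question": "how many red objects are right of the shiny cube"},
]
for max_length, subset in length_curriculum(examples, stages=[6, 8, 20]):
    print(max_length, len(subset))  # train on `subset` at this stage
```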

Baselines. We compare G2L2 with 4 baselines. (1) MAC (Hudson and Manning, 2018) is an end-to-end approach based on attention. (2) TbD-Net (Mascharka et al., 2018) uses a pre-trained semantic parser to parse the question into a symbolic program, and executes the program with a neural module network (Andreas et al., 2016b). (3) Similarly, NS-VQA (Yi et al., 2018) also parses the question into a symbolic program. It also extracts an abstract scene representation with pre-trained neural recognition models (He et al., 2017), and executes the program on that abstract scene representation. Both TbD-Net and NS-VQA require additional supervision for training the semantic parser, and NS-VQA requires additional annotations for training the visual recognition model. (4) NS-CL (Mao et al., 2019) jointly learns a neural semantic parser and concept embeddings by looking at images and reading paired questions and answers. It requires annotations for all concepts in the domain (e.g., colors and shapes). In contrast, G2L2 can automatically discover visual concepts from texts.

Results. Table 1 summarizes the results. We consider any model that performs in the 95–100 range to have more or less solved the task. Small differences in numeric scores in this range, such as the fact that NS-CL outperforms our model on the “purple” generalization task by 0.2%, are less important than the fact that our model far outperforms all competitors on “count” compositional generalization and the “depth” generalization task, both of which all competitor models are far from solving.

We first compare different models on the standard training-testing split. We train the models with either 10% or 100% of the training data and evaluate them on the validation set. Our model achieves comparable performance in terms of accuracy and data efficiency.

Next, we systematically build three compositional generalization test splits: purple, right of, and count. The detailed setups and examples for these splits are provided in the supplementary material. Essentially, we remove 90% of the sentences containing the word purple, the phrase right of, and counting operations, such as how many …? and what number of …?. We only keep sentences up to a certain length (6 for purple, 11 for right of, and 8 for count). We make sure that each use case of these words appears in the training questions. After training, we test the models on the validation set with questions containing these words. Overall, G2L2 matches or outperforms the baselines on all three generalization splits. In particular, it significantly outperforms the other methods on the count split. The count split is hard for the other methods because it requires models to generalize to sentences with deeper structures, for example, from "how many red objects are there?" to "how many red objects are right of the cube?" Note that, during training, all models have seen example uses of similar structures, such as "what's the shape of the red object?" and "what's the shape of the red object right of the cube?"

Finally, we test generalization to sentences with deeper structures (depth). Specifically, we define the “hop number” of a question as the number of intermediate objects being referred to in order to locate the target object. For example, the “hop number” of the question “how many red objects are right of the cube?” is 1. We train different models on 0-hop and 1-hop questions and test them on 2-hop questions. Our model strongly outperforms all baselines.

The results on the compositional generalization and depth splits yield two conclusions. First, disentangling grounded concept learning (associating words with visual appearances) and reasoning (e.g., filtering or counting subsets of objects in a given scene) improves data efficiency and generalization. On CLEVR, neuro-symbolic approaches that separately identify concepts and perform explicit reasoning (NS-VQA, NS-CL and G2L2) consistently generalize better than approaches that do not (MAC, TbD). The comparison between TbD and NS-VQA is informative: TbD fails on the "right of" task even when the semantic parser provides correct programs, while NS-VQA, which uses the same parser but explicitly represents compositional symbolic concepts for reasoning, succeeds in this task. Crucially, of the three neuro-symbolic methods, G2L2 achieves strong performance with less domain-specific knowledge than the other methods: NS-VQA needs groundtruth programs; NS-CL needs the concept vocabulary; G2L2 requires neither. Second, our model is the only one to perform well on the hardest "out-of-sample" generalization tests: holding out "count" and generalizing to deeper embeddings. The other, easier generalization tests all have close neighbors in the training set, differing by just one word. In contrast, the length, depth and "count" tests require generalizing to sentences that differ in multiple words from any training example. They appear to require, or at least benefit especially well from, G2L2's lexical-grammatical approach to capturing the meaning of complex utterances, with explicit constituent-level (as opposed to simply word-level) composition. We also provide an in-depth analysis of the behavior of different semantic parsing models in the supplementary material.

3.2 Language-driven Navigation

| Model | Simple (10%) | Simple (100%) | Comp. Gen. (jump) | Comp. Gen. (around right) | Length |
|---|---|---|---|---|---|
| seq2seq (Sutskever et al., 2014) | 0.93 | 0.99 | 0.00 | 0.00 | 0.15 |
| Transformer (Vaswani et al., 2017) | 0.71 | 0.78 | 0.00 | 0.10 | 0.02 |
| GECA (Andreas, 2020) | 0.99 | 0.98 | 0.87 | 0.82 | 0.15 |
| WordDrop (Guo et al., 2020) | 0.56 | 0.62 | 0.52 | 0.70 | 0.18 |
| SwitchOut (Wang et al., 2018) | 0.99 | 0.99 | 0.98 | 0.97 | 0.17 |
| SeqMix (Guo et al., 2020) | - | - | 0.98 | 0.89 | - |
| recomb-2 (Akyürek et al., 2021) | - | - | 0.88 | 0.82 | - |
| G2L2 (ours) | **1.00** | **1.00** | **1.00** | **1.00** | **1.00** |

Table 2: Accuracy on the SCAN dataset, averaged across 10 valid runs when applicable; ± denotes standard deviation. The best number in each column is bolded. The recomb-2 results are taken from Akyürek et al. (2021) and the SeqMix results from Guo et al. (2020); both papers present results only on the compositional generalization split. Where marked, the augmentation method is applied after GECA. The results for GECA are based on the implementation released by the authors. All models are selected with respect to the accuracy on the training set.

The second domain we consider is language-driven navigation. We evaluate models on the SCAN dataset Lake and Baroni (2018): a collection of sentence and navigational action sequence pairs. There are 6 primitive actions: jump, look, walk, run, lturn, and rturn, where an instruction turn left twice and run will be translated to lturn lturn run. All instructions are generated from a finite context-free grammar, so that we can systematically construct train-test splits for different types of compositional generalizations.

Setup. We use a string-editing domain-specific language (DSL) for modeling the meaning of words in the SCAN dataset, of which the details can be found in the supplementary material. At a high level, the model supports three primitive operations: constructing a new constant string (consisting of primitive operations), concatenating two strings, and repeating the input string for a number of times.
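The three primitive operations can be sketched as plain Python functions over lists of actions. The function names and the toy lexical meanings below are hypothetical (the actual DSL is defined in the supplementary material), and treating "turn left" as a single unit is a simplification made for brevity.

```python
def const(*actions):
    """Construct a constant action string."""
    return list(actions)


def concat(first, second):
    """Concatenate two action strings."""
    return first + second


def repeat(actions, times):
    """Repeat an action string a number of times."""
    return actions * times


# Hypothetical lexical meanings, written as Python values and functions.
meanings = {
    "run": const("run"),
    "turn left": const("lturn"),
    "twice": lambda x: repeat(x, 2),
    "and": lambda x, y: concat(x, y),
}

# "turn left twice and run"
program = meanings["and"](meanings["twice"](meanings["turn left"]),
                          meanings["run"])
print(" ".join(program))  # lturn lturn run
```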

For G2L2, we generate candidate lexicon entries by enumerating functions in the string-editing DSL that have up to 2 arguments and a function body of maximum depth 3. We also allow at most one of the arguments to be functor-typed, for example, V\V/(V\V). To handle parsing ambiguities, we use two primitive syntax types, S and V, both of which are associated with the semantic type of string. In total, we have 178 candidate lexicon entries for each word.

Baselines. We compare G2L2 to seven baselines. (1) Seq2seq (Sutskever et al., 2014) trains an LSTM-based encoder-decoder model. We follow the hyperparameter setups of Lake and Baroni (2018). (2) Transformer (Vaswani et al., 2017) is a 4-head Transformer-based autoregressive seq2seq model. We tuned the hidden size (i.e., the dimension of intermediate token representations) within {100, 200, 400}, as well as the number of layers (for both the encoder and the decoder) within {2, 4, 8}. The other methods are based on different data augmentation schemes for training an LSTM seq2seq model. Specifically, (3) GECA augments the original training splits using heuristic span recombination rules; (4) WordDrop (Guo et al., 2020) performs random dropout on the input sequence (while keeping the same label); (5) similarly, SwitchOut (Wang et al., 2018) randomly replaces an input token with a random token from the vocabulary; (6) SeqMix (Guo et al., 2020) uses a soft augmentation technique following Zhang et al. (2017), which composes a “weighted average” of different input sequences; (7) recomb-2 (Akyürek et al., 2021) learns recombination and resampling rules for augmentation.

Results. We compare different models on three train-test splits. In Simple, the training and test instructions are drawn from the same distribution. We compare the data efficiency of various models by using either 10% or 100% of the training data, and test them on the same test split. While most models achieve nearly perfect accuracy with 100% of the training data, our model G2L2 shows an advantage when only a small amount of data is available. Next, in Compositional, we hold out the sentences containing certain phrases, such as “jump” and “around right”. For these held-out phrases, only valid non-contextual examples containing them (i.e., jump in isolation and no example for around right) are available during training. At test time, algorithms need to generalize systematically to these phrases in novel contexts. Finally, in Length, all training examples have an action length less than or equal to 22, while that of a test example is up to 48. Our model consistently reaches perfect performance in all considered settings, even on the cross-length generalization task where GECA does not help improve performance. These results are consistent with the conclusions we derived on the CLEVR dataset. Specifically, data-augmentation techniques for SCAN can solve simple generalization tests (e.g., jump, where test examples all have close neighbors in the training set, differing by just one word) but not the hard ones (e.g., length, where test sentences can differ in multiple words from any training example).

Case study.

G2L2 is expressive enough to achieve perfect accuracy on the SCAN dataset: there exists a set of lexicon entries that matches the groundtruth in SCAN. However, our learning algorithm does not always converge on the correct lexicon; when it fails, the failure can be identified based on training-set accuracy. We therefore perform model selection based on the training accuracy for G2L2: after a sufficient number of epochs, if the model has not reached perfect accuracy (100%), we re-initialize the weights and train the model again. Our results show that, across 100 training runs, the model reaches 100% accuracy 74% of the time. For runs that do not reach 100% accuracy, the average accuracy is 0.94.
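As a rough back-of-the-envelope reading of these numbers, suppose (an assumption not made in the text) that each restart succeeds independently with probability p = 0.74. The number of training runs until the first success is then geometrically distributed:

$$
\mathbb{E}[\text{runs}] = \frac{1}{p} = \frac{1}{0.74} \approx 1.35,
\qquad
\Pr[\text{the first } k \text{ runs all fail}] = (1 - p)^k, \quad \text{e.g. } 0.26^3 \approx 0.018 .
$$

Under this assumption, the retraining procedure would typically need only one or two runs to obtain a valid model.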

Since G2L2 directly learns human-interpretable lexicon entries associated with each individual word, we can further inspect its failure cases when the training accuracy does not converge to 100%. We find that the most significant failure mode involves the words and (e.g., jump and run) and after (e.g., jump after run). Both of them are treated as connectives in SCAN. Sometimes G2L2 fails to pick the syntax type S\V/V over the type V\V/V. The entry V\V/V succeeds in parsing most cases (e.g., jump and run), except that it introduces ambiguous parses for sentences such as “jump and run twice”: jump and (run twice) vs. (jump and run) twice. Based on the definition of SCAN, only the first derivation is valid. In contrast, using S\V/V resolves this ambiguity. Depending on the weight initialization and the example presentation order, G2L2 sometimes gets stuck at the local optimum of V\V/V. However, we can easily identify this from the training accuracy: G2L2 is able to reach perfect performance on all considered splits by simply retraining with another random seed. Therefore, we only select runs with 100% training accuracy as valid models.
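The ambiguity can be reproduced with a small chart parser that counts derivations. The sketch below is illustrative only: it uses just forward and backward application, and the lexicon (jump and run as V, twice as V\V) is a hypothetical assumption rather than the learned lexicon. The helper names `fwd`, `bwd`, `combine`, and `root_derivations` are introduced here.

```python
# A category is a primitive type name ("S", "V") or a triple (result, slash, argument).

def fwd(result, arg):
    # result/arg: expects `arg` on its right.
    return (result, "/", arg)

def bwd(result, arg):
    # result\arg: expects `arg` on its left.
    return (result, "\\", arg)

def combine(left, right):
    """Categories derivable from two adjacent constituents by function application."""
    out = []
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        out.append(left[0])    # forward application:  X/Y  Y   =>  X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        out.append(right[0])   # backward application: Y   X\Y  =>  X
    return out

def root_derivations(words, lexicon):
    """CKY-style chart; returns {root category: number of derivations}."""
    n = len(words)
    chart = {(i, i + 1): {lexicon[w]: 1} for i, w in enumerate(words)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = {}
            for k in range(i + 1, j):
                for lc, ln in chart[i, k].items():
                    for rc, rn in chart[k, j].items():
                        for cat in combine(lc, rc):
                            cell[cat] = cell.get(cat, 0) + ln * rn
            chart[i, j] = cell
    return chart[0, n]

# Hypothetical lexicon: "twice" modifies a V on its left.
base = {"jump": "V", "run": "V", "twice": bwd("V", "V")}
and_vvv = fwd(bwd("V", "V"), "V")   # V\V/V, i.e. (V\V)/V
and_svv = fwd(bwd("S", "V"), "V")   # S\V/V, i.e. (S\V)/V

words = "jump and run twice".split()
ambiguous = root_derivations(words, {**base, "and": and_vvv})
resolved = root_derivations(words, {**base, "and": and_svv})
print(ambiguous)  # {'V': 2}
print(resolved)   # {'S': 1}
```

With V\V/V, both bracketings produce a V, so the sentence has two derivations. With S\V/V, combining jump and run yields an S that twice (V\V) cannot modify, which leaves only jump and (run twice).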

4 Related Work

Lexicalist theories. The lexicalist theories of syntax (Pollard and Sag, 1994; Steedman, 2000; Bresnan et al., 2016) propose that 1) the key syntactic principles by which words and phrases combine are extremely simple and general, and 2) nearly all of the complexity in syntax can be attributed to rich and detailed lexical entries for the words in the language. For example, whereas the relationship between the active and passive voice, e.g., “Kim saw a balloon” versus “A balloon was seen by Kim”, was treated in pre-lexicalist theories as a special syntactic rule converting between the sentences, in lexicalist theories this relationship derives simply from the knowledge that the passive participle for the verb “see” is “seen,” which interacts with knowledge of other words to make both the active and passive forms of the sentence possible. In lexicalist theories, the problem for the language learner thus becomes a problem of learning the words in the language, not a problem of learning numerous abstract rule schemas. The combinatory categorial grammar (CCG; Steedman, 1996) framework we use is a well-established example of a lexicalist theory: there is a universal inventory of just three combinatory rules (Fig. 1a), but those rules can only be applied once richly specified lexical entries are learned for the words in a sentence. We believe that this lexicalist-theory approach is a particularly good fit to the problem of grounded language learning: the visual context provides clues to the word’s meaning, and the word’s grammatical behavior is tied closely to this meaning, making learning efficient.

Compositional generalization in NLP. Improving the compositional generalization of natural language processing (NLP) systems has drawn great attention in recent years (Baroni, 2020). Most recent approaches towards this goal are built on deep learning-based models. There are two representative approaches: building structured neural networks with explicit phrase-based structures or segments (Socher et al., 2013; Zhu et al., 2015; Tai et al., 2015; Saqur and Narasimhan, 2020); and using data augmentation techniques (Andreas, 2020; Guo et al., 2020; Akyürek et al., 2021). However, these approaches either rely on additional annotation or pretrained models for phrase structure inference, or require domain-specific heuristics in data augmentation. In contrast to both approaches, we propose to use combinatory grammar rules to constrain the learning of word meanings and how they compose.

Neural latent trees. CKY-E2 is related in spirit to recent work using CKY-style modules for inducing latent trees. However, our model is fundamentally different from work on unsupervised constituency parsing (Kim et al., 2019; Shi et al., 2021), which uses the CKY algorithm for inference over scalar span scores, and from work that computes span representation vectors with CKY-style algorithms (Maillard and Clark, 2018; Drozdov et al., 2019, inter alia). Our key contribution is to introduce the expected execution mechanism, where each span is associated with weighted, compressed programs. Beyond enumerating all possible parse trees as in Maillard and Clark (2018), G2L2 considers all possible programs associated with each span. Our expected execution procedure works for different types (object set, integer, etc.) and even functor types. Our approximation is exact for linear cases and has polynomial complexity.

Grammar-based grounded language learning. There have also been approaches for learning grammatical structures from grounded texts (Shi et al., 2019; Zhao and Titov, 2020; Jin and Schuler, 2020; Artzi and Zettlemoyer, 2013; Ross et al., 2018; Tellex et al., 2011). However, these approaches either rely on pre-defined lexicon entries (Artzi and Zettlemoyer, 2013) or focus only on inducing syntactic structures such as phrase-structure grammar (Shi et al., 2019). In contrast, G2L2 jointly learns the syntactic types, semantic programs, and concept grounding, based only on a small set of combinatory grammar rules.

Grammar-based and grounded language learning have also been studied in linguistics, with work related to ours studying how humans use grammar as a constraint in learning meaning (Steedman, 2000) and how learning syntactic rules and learning semantic meanings bootstrap each other (Abend et al., 2017; Taylor and Gelman, 1988). However, most previous computational models have focused only on explaining small-scale lab experiments and do not address grounding in visual perception (Fazly et al., 2010; Gauthier et al., 2018). In contrast, G2L2 is a neuro-symbolic model that integrates the combinatory categorial grammar formalism (Steedman, 1996) with joint perceptual learning and concept learning, to directly learn meanings from images and texts.

Neuro-symbolic models for language grounding. Integrating symbolic structures such as programs with neural networks has shown success in modeling compositional queries in various domains, including image and video reasoning (Hu et al., 2017; Mascharka et al., 2018), knowledge base query (Andreas et al., 2016a), and robotic planning (Andreas et al., 2017). In this paper, we use symbolic domain-specific languages with neural network embeddings for visual reasoning in images and navigation sequence generation, following NS-CL (Mao et al., 2019). However, in contrast to the neural network-based semantic parsers used in the aforementioned papers, our model G2L2 focuses on learning a grammar-based lexicon for compositional generalization in linguistic structures, such as novel word composition.

5 Conclusion and Discussion

In this paper, we have presented G2L2, a lexicalist approach towards learning the compositional and grounded meaning of words. G2L2 builds in a compact but potentially universal set of combinatory grammar rules and learns grounded lexicon entries from a collection of sentences and their grounded meanings, without any human-annotated lexicon entries. The lexicon entries represent the semantic type of the word, the ordering settings for its arguments, and the grounding of concepts in its semantic program. To facilitate lexicon entry induction in an exponentially-growing space, we introduced CKY-E2 for joint chart parsing and expected execution.

Through systematic evaluation on both visual reasoning and language-driven navigation domains, we demonstrate the data efficiency and compositional generalization capability of G2L2, and its general applicability in different domains. The design of G2L2 suggests several research directions. First, in G2L2 we have made strong assumptions on the context-independence of the lexicon entries as well as the application of grammar rules; the handling of linguistic ambiguities and pragmatics needs further exploration (Frank and Goodman, 2012). Second, meta-learning models that can leverage learned words to bootstrap the learning of novel words, such as syntactic bootstrapping (Gauthier et al., 2018), are a meaningful direction. Finally, future work may consider integrating G2L2 with program-synthesis algorithms (Ellis et al., 2020) for learning more generic and complex semantic programs.

Broader impact. The ideas and techniques in this paper can potentially be used for building machine systems that better understand the queries and instructions made by humans. We hope researchers and developers can build systems for social good based on our paper. Meanwhile, we are aware of the ethical issues and concerns that may arise in the actual deployment of such systems, particularly biases in language and their grounding. The strong interpretability of the syntactic types and semantic programs learned by our model can be used in efforts to reduce such biases.

Acknowledgements. This work is in part supported by ONR MURI N00014-16-1-2007, the Center for Brains, Minds, and Machines (CBMM, funded by NSF STC award CCF-1231216), the MIT Quest for Intelligence, Stanford Institute for Human-Centered AI (HAI), Google, MIT–IBM AI Lab, Samsung GRO, and ADI. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.


  • Abend et al. [2017] Omri Abend, Tom Kwiatkowski, Nathaniel J Smith, Sharon Goldwater, and Mark Steedman. Bootstrapping language acquisition. Cognition, 164:116–143, 2017.
  • Akyürek et al. [2021] Ekin Akyürek, Afra Feyza Akyürek, and Jacob Andreas. Learning to recombine and resample data for compositional generalization. In ICLR, 2021.
  • Andreas [2020] Jacob Andreas. Good-enough compositional data augmentation. In ACL, 2020.
  • Andreas et al. [2016a] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. In NAACL-HLT, 2016a.
  • Andreas et al. [2016b] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In CVPR, 2016b.
  • Andreas et al. [2017] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In ICML, 2017.
  • Artzi and Zettlemoyer [2013] Yoav Artzi and Luke Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. TACL, 2013.
  • Bahdanau et al. [2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • Baroni [2020] Marco Baroni. Linguistic generalization and compositionality in modern artificial neural networks. Philosophical Transactions of the Royal Society B, 375(1791):20190307, 2020.
  • Bresnan et al. [2016] Joan Bresnan, Ash Asudeh, Ida Toivonen, and Stephen Wechsler. Lexical-functional syntax. John Wiley & Sons, 2016.
  • Cocke [1969] John Cocke. Programming languages and their compilers: Preliminary notes. New York University, 1969.
  • Drozdov et al. [2019] Andrew Drozdov, Pat Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. Unsupervised latent tree induction with deep inside-outside recursive autoencoders. In NAACL-HLT, 2019.
  • Ellis et al. [2020] Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: Growing generalizable, interpretable knowledge with wake-sleep bayesian program learning. arXiv:2006.08381, 2020.
  • Fazly et al. [2010] Afsaneh Fazly, Afra Alishahi, and Suzanne Stevenson. A probabilistic computational model of cross-situational word learning. Cognitive Science, 34(6):1017–1063, 2010.
  • Frank and Goodman [2012] Michael C Frank and Noah D Goodman. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998, 2012.
  • Gauthier et al. [2018] Jon Gauthier, Roger Levy, and Joshua B Tenenbaum. Word learning and the acquisition of syntactic–semantic overhypotheses. In CogSci, 2018.
  • Guo et al. [2020] Demi Guo, Yoon Kim, and Alexander Rush. Sequence-level mixed sample data augmentation. In EMNLP, 2020.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2015.
  • He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • Hu et al. [2017] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In CVPR, 2017.
  • Hudson and Manning [2018] Drew A Hudson and Christopher D Manning. Compositional attention networks for machine reasoning. In ICLR, 2018.
  • Jin and Schuler [2020] Lifeng Jin and William Schuler. Grounded PCFG induction with images. In AACL, 2020.
  • Johnson et al. [2017] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
  • Kasami [1966] Tadao Kasami. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257, 1966.
  • Kim et al. [2019] Yoon Kim, Chris Dyer, and Alexander M Rush. Compound probabilistic context-free grammars for grammar induction. In ACL, 2019.
  • Kiros et al. [2014] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
  • Lake and Baroni [2018] Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In ICML, 2018.
  • Li et al. [2020] Qing Li, Siyuan Huang, Yining Hong, and Song-Chun Zhu. A competence-aware curriculum for visual concepts learning via question answering. In ECCV, 2020.
  • Maillard and Clark [2018] Jean Maillard and Stephen Clark. Latent tree learning with differentiable parsers: Shift-reduce parsing and chart parsing. In ACL Workshop, 2018.
  • Mao et al. [2019] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision. In ICLR, 2019.
  • Mascharka et al. [2018] David Mascharka, Philip Tran, Ryan Soklaski, and Arjun Majumdar. Transparency by design: Closing the gap between performance and interpretability in visual reasoning. In CVPR, 2018.
  • Pollard and Sag [1994] Carl Pollard and Ivan Sag. Head-Driven Phrase Structure Grammar. Chicago: University of Chicago Press and Stanford: CSLI Publications., 1994.
  • Ross et al. [2018] Candace Ross, Andrei Barbu, Yevgeni Berzak, Battushig Myanganbayar, and Boris Katz. Grounding language acquisition by training semantic parsers using captioned videos. In EMNLP, 2018.
  • Saqur and Narasimhan [2020] Raeid Saqur and Karthik Narasimhan. Multimodal graph networks for compositional generalization in visual question answering. In NeurIPS, 2020.
  • Shi et al. [2019] Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. Visually grounded neural syntax acquisition. In ACL, 2019.
  • Shi et al. [2021] Tianze Shi, Ozan Irsoy, Igor Malioutov, and Lillian Lee. Learning syntax from naturally-occurring bracketings. In NAACL-HLT, 2021.
  • Socher et al. [2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
  • Steedman [1996] Mark Steedman. Surface structure and interpretation. MIT press, 1996.
  • Steedman [2000] Mark Steedman. The Syntactic Process. Cambridge, MA: MIT Press, 2000.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NeurIPS, 2014.
  • Tai et al. [2015] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. In ACL-IJCNLP, 2015.
  • Taylor and Gelman [1988] Marjorie Taylor and Susan A Gelman. Adjectives and nouns: Children’s strategies for learning new words. Child Development, pages 411–419, 1988.
  • Tellex et al. [2011] Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew Walter, Ashis Banerjee, Seth Teller, and Nicholas Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI, 2011.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Wang et al. [2018] Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. Switchout: an efficient data augmentation algorithm for neural machine translation. In EMNLP, 2018.
  • Yi et al. [2018] Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Joshua B Tenenbaum. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. In NeurIPS, 2018.
  • Younger [1967] Daniel H Younger. Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2):189–208, 1967.
  • Zettlemoyer and Collins [2009] Luke Zettlemoyer and Michael Collins. Learning context-dependent mappings from sentences to logical form. In ACL, 2009.
  • Zhang et al. [2017] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2017.
  • Zhao and Titov [2020] Yanpeng Zhao and Ivan Titov. Visually grounded compound PCFGs. In EMNLP, 2020.
  • Zhu et al. [2015] Xiaodan Zhu, Parinaz Sobihani, and Hongyu Guo. Long short-term memory over recursive structures. In ICML, 2015.

Appendix A Domain Specific Languages and Neuro-Symbolic Reasoning

In this section, we will present and discuss the domain-specific languages (DSLs) we use for two domains: visual reasoning and language-guided navigation. We will further introduce the neuro-symbolic module we have designed for executing programs in these two domains. Overall, each DSL contains a set of types and a set of deterministic modules that have been manually designed for realizing necessary operations in these domains. However, in contrast to realizing them as we do in standard programming languages (with for-loops and if-conditions), we will be using tensor operations (e.g., tensor additions and multiplications) to realize them so that the output of each program is differentiable with respect to all of its inputs.

A.1 Visual Reasoning DSL

Our domain-specific language (DSL) for the visual reasoning domain is based on the CLEVR DSL introduced in Johnson et al. (2017), and the neuro-symbolic realization of each functional module is slightly modified from the Neuro-Symbolic Concept Learner (NS-CL; Mao et al., 2019). We refer readers to the original papers for a detailed introduction to the DSL and neuro-symbolic program execution. Here we only highlight the key aspects of our language and its neuro-symbolic realization, and discuss the difference between our implementation and the ones in original papers.

Our visual reasoning DSL is a subset of CLEVR, containing 6 types and 8 primitive operations. Table 3 illustrates all 6 types and how they are internally represented in neuro-symbolic execution.

| Type | Note | Representation |
|---|---|---|
| ObjConcept | Object-level concepts. | An embedding vector. |
| Attribute | Object-level attributes. | A vector whose length is the number of … |
| RelConcept | Relational concepts. | An embedding vector. |
| ObjectSet | A set of objects in the scene. | A vector whose length is the number of objects in the scene. Each entry is a real value in [0, 1], which can be interpreted as the probability that the corresponding object is in this set. |
| Integer | An integer. | A single non-negative real value, which can be interpreted as the “expected” value of this integer. |
| Bool | A Boolean value. | A single real value in [0, 1], which can be interpreted as the probability that this Boolean value is true. |

Table 3: The type system of the domain-specific language for visual reasoning.
| Signature | Note |
|---|---|
| scene() → ObjectSet | Return all objects in the scene. |
| filter(a: ObjectSet, c: ObjConcept) → ObjectSet | Filter out a set of objects having the object-level concept c (e.g., red) from the input object set a. |
| relate(a: ObjectSet, b: ObjectSet, c: RelConcept) → ObjectSet | Filter out a set of objects in set a that have the relational concept c (e.g., left) with the input object b. |
| intersection(a: ObjectSet, b: ObjectSet) → ObjectSet | Return the intersection of set a and set b. |
| union(a: ObjectSet, b: ObjectSet) → ObjectSet | Return the union of set a and set b. |
| query(a: ObjectSet, c: Attribute) → ObjConcept | Query the attribute c (e.g., color) of the input object a. |
| exist(a: ObjectSet) → Bool | Check if the set is empty. |
| count(a: ObjectSet) → Integer | Count the number of objects in the input set. |

Table 4: All operations in the domain-specific language for visual reasoning.

Table 4 further shows all operations in the DSL. There are two main differences between the DSL used by G2L2 and the original CLEVR DSL. First, we have removed the unique operation, whose semantic meaning was to return the single object in a set of objects. For example, it can be used to represent the meaning of the word “the” in “the red object”, in which the semantic program of “red object” yields a set of red objects and the semantic program of “the” selects the unique object in that set. However, the meaning of “the” may have a slightly different semantic type in different contexts, for example, “what is the color of …”. Since this would violate our assumption that each word has only one lexicon entry, we choose to remove this operation to simplify the learning problem. Meanwhile, to handle the “uniqueness” of the object being referred to, in our realization of related operations, such as relate and query, we implicitly choose the unique object being referred to, which we will detail in the following paragraphs.

Object-centric scene representation.

In our visual reasoning domain, we assume access to a pre-trained object detector that generates a list of bounding boxes of objects in the scene. In our implementation, following Mao et al. (2019), we use a pre-trained Mask R-CNN (He et al., 2017) to generate bounding boxes for each object proposal. These bounding boxes, paired with the original image, are then sent to a ResNet-34 (He et al., 2015) to extract a region-based representation (by RoI Align) and an image-based representation, respectively. We concatenate them to form a vector embedding for each object in the image.

Neuro-symbolic realization.

The high-level idea for the program execution is to build a collection of functions that realize the semantics of each operation based on the vector embeddings of objects and concepts. Taking the filter operation as an example, denote $a$ as a vector representation of the input set, $o_i$ the object embeddings, and $c$ the concept embedding. We compute the vector representation $b$ of the output set as:

$$
b_i = \sigma\left(\langle o_i, c \rangle\right) \cdot a_i ,
$$

where $\sigma$ is the sigmoid function, and $\langle \cdot, \cdot \rangle$ is the inner product of two vectors. Intuitively, we first compute the inner product between the concept embedding and each object embedding, which gives a vector of scores of whether object $i$ has concept $c$. Next, we compute the element-wise multiplication between the two vectors.

A key difference between our realization of these operations and the one in Mao et al. (2019) is that we use element-wise multiplication to simulate the intersection between two sets, and the corresponding product-logic operation for union. In contrast, Mao et al. (2019) use the element-wise min operation for intersection and max for union. Both realizations can be motivated by real-valued logic: product logic vs. Gödel logic. The main purpose of using products instead of min and max is to make our realization compatible with our expected execution mechanism, which we will detail in Appendix B.
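The two realizations can be compared in a few lines of Python. This is a sketch under stated assumptions: the union formula `x + y - x * y` (the probabilistic sum, which is the standard disjunction of product logic) is our assumption, and the "compatibility" illustrated here, namely that the product commutes with averaging while min does not, is one plausible reading rather than the definition given in Appendix B. All function names are hypothetical.

```python
from typing import List

Vec = List[float]  # soft set membership: one probability per object

def intersect_product(a: Vec, b: Vec) -> Vec:
    return [x * y for x, y in zip(a, b)]

def union_product(a: Vec, b: Vec) -> Vec:
    # ASSUMPTION: probabilistic sum as the product-logic union.
    return [x + y - x * y for x, y in zip(a, b)]

def intersect_goedel(a: Vec, b: Vec) -> Vec:
    return [min(x, y) for x, y in zip(a, b)]

def union_goedel(a: Vec, b: Vec) -> Vec:
    return [max(x, y) for x, y in zip(a, b)]

def mean(u: Vec, v: Vec) -> Vec:
    return [(x + y) / 2 for x, y in zip(u, v)]

# Two candidate sets a1, a2 (equally weighted) intersected with a fixed set b.
a1, a2, b = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]

# Product: intersecting the averaged set equals averaging the intersections.
prod_of_mean = intersect_product(mean(a1, a2), b)
mean_of_prod = mean(intersect_product(a1, b), intersect_product(a2, b))

# Min: the two orders of computation disagree.
min_of_mean = intersect_goedel(mean(a1, a2), b)
mean_of_min = mean(intersect_goedel(a1, b), intersect_goedel(a2, b))

print(prod_of_mean, mean_of_prod)  # [0.25, 0.25] [0.25, 0.25]
print(min_of_mean, mean_of_min)    # [0.5, 0.5] [0.25, 0.25]
```

Because the product is linear in each argument, replacing several candidate sets by their weighted average before intersecting loses nothing in this case, whereas the min-based version gives a different answer.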


Here we run a concrete example to illustrate the execution process of a program in the visual reasoning domain. Suppose we have an image containing three objects , and . We have two additional vector embeddings for concepts SHINY and CUBE. Furthermore, , and .

Consider the input sentence “How many shiny cubes are there”. Table 5 illustrates a step-by-step execution of the underlying program: .

Program Type Value
Table 5: An illustrative execution trace of the program . sum denotes the “reduced sum” operation of a vector, which returns the summation of all entries in that vector. denotes element-wise multiplication for two vectors.
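A comparable trace can be run with hypothetical numbers. In the sketch below, the three object embeddings, the SHINY and CUBE embeddings, and the nesting order of the two filters are all made up for illustration; only the filter rule (sigmoid of an inner product, multiplied element-wise with the input set) and the reduced-sum count follow the description above.

```python
import math
from typing import List

Vec = List[float]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def inner(u: Vec, v: Vec) -> float:
    return sum(x * y for x, y in zip(u, v))

def scene(num_objects: int) -> Vec:
    """All objects are in the set with probability 1."""
    return [1.0] * num_objects

def filter_op(a: Vec, objects: List[Vec], concept: Vec) -> Vec:
    """b_i = sigmoid(<o_i, c>) * a_i."""
    return [sigmoid(inner(o, concept)) * a_i for o, a_i in zip(objects, a)]

def count(a: Vec) -> float:
    """Expected number of objects in the set: the reduced sum of the vector."""
    return sum(a)

# Hypothetical embeddings: o1 is a shiny cube, o2 a shiny non-cube, o3 a matte cube.
objects = [[4.0, 4.0], [4.0, -4.0], [-4.0, 4.0]]
SHINY, CUBE = [1.0, 0.0], [0.0, 1.0]

step1 = scene(len(objects))               # [1.0, 1.0, 1.0]
step2 = filter_op(step1, objects, SHINY)  # high for o1 and o2, low for o3
step3 = filter_op(step2, objects, CUBE)   # high for o1 only
answer = count(step3)                     # close to 1

for name, value in [("scene", step1), ("shiny", step2), ("shiny cube", step3)]:
    print(name, [round(v, 3) for v in value])
print("count", round(answer, 3))
```

Each intermediate value stays a vector of membership probabilities, and the final count is a single non-negative real number that can be read as an expected count.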

Expected execution.

In the visual reasoning domain, we have only implemented the expected execution mechanism for subordinate program trees whose type is objset, although many other types such as integer and bool also naturally support expected execution. This is because types such as integer and bool only appear at the sentence level, and thus computing the “expectation” of such programs does not reduce the overall complexity.

Formally, the expected execution process compresses a list of semantic programs and their corresponding weights into a single semantic program with weight . Suppose all ’s have type objset. We use to denote the execution result of these programs. Each of them is a vector of length , where is the number of objects in the scene. We compute and as the following:

Intuitively, we normalize the weights using a softmax function to translate them into a distribution. Then we compute the expectation of the vectors. For more details about the definition and properties of expected execution, please refer to our main text and Appendix B.
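The intuition in this paragraph (softmax-normalize the weights, then take the expectation of the execution results) can be sketched as follows. The function names are hypothetical, and the sketch covers only the expected value; it does not compute the weight of the merged program.

```python
import math
from typing import List

Vec = List[float]

def softmax(weights: List[float]) -> List[float]:
    """Turn unnormalized weights into a probability distribution."""
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def expected_objset(values: List[Vec], weights: List[float]) -> Vec:
    """Weighted average of the objset execution results of several programs."""
    probs = softmax(weights)
    num_objects = len(values[0])
    return [
        sum(p * v[i] for p, v in zip(probs, values))
        for i in range(num_objects)
    ]

# Two candidate programs over a 3-object scene, with equal weights.
values = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
merged = expected_objset(values, weights=[0.0, 0.0])
print(merged)  # [0.5, 0.5, 0.0]
```

The merged vector has the same shape as any single objset value, so later operations such as filter or count can consume it exactly as they would consume the result of one program.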

Candidate lexicons.

Recall that the process of lexicon learning has three stages. First, we generate an extensive collection of candidate semantic programs. Second, we generate candidate lexicon entries for each word by enumerating all candidate semantic programs generated in the first step and all possible orderings (linearizations in a sentence) of their arguments. Third, we apply CKY-E2 and gradient-based optimization to update the weights associated with each lexicon entry.
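The second stage can be pictured with a small enumeration sketch. This is an assumption-laden illustration, not the enumeration procedure used by G2L2: it handles only primitive argument types, treats a linearization as a choice of argument order plus a slash direction per argument, and writes syntactic types as left-associative strings. The function name `linearizations` is hypothetical.

```python
from itertools import permutations, product
from typing import Iterator, Sequence

def linearizations(result: str, args: Sequence[str]) -> Iterator[str]:
    """Yield candidate syntactic types for a program with the given
    result type and (primitive) argument types."""
    if not args:
        yield result
        return
    seen = set()
    for order in permutations(args):
        # "/" looks for the argument on the right, "\" on the left.
        for slashes in product("/\\", repeat=len(args)):
            category = result
            for arg, slash in zip(order, slashes):
                category = f"{category}{slash}{arg}"
            if category not in seen:
                seen.add(category)
                yield category

nouns = list(linearizations("objset", []))
modifiers = list(linearizations("objset", ["objset"]))
connectives = sorted(linearizations("V", ["V", "V"]))
print(nouns)        # ['objset']
print(modifiers)    # ['objset/objset', 'objset\\objset']
print(connectives)  # four types, including 'V\\V/V'
```

Each yielded type would be paired with its semantic program and an initial weight to form one candidate lexicon entry, and the third stage would then adjust those weights.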

In our visual reasoning domain, we only consider the following candidate semantic programs and linearizations:

  1. Syntactic type: objset, semantic program: (English noun).

  2. Syntactic type: objset, semantic program: (English noun).

  3. Syntactic type: , semantic program: (English adjective).

  4. Syntactic type: , semantic program: (English preposition I).

  5. Syntactic type: , semantic program: (English preposition II)

  6. Syntactic type: , semantic program: .

  7. Syntactic type: , .

  8. Syntactic type: , .

  9. Syntactic type: , (generalized conjunction).

  10. Syntactic type: , (generalized disjunction).

As we will see later, when we compare the candidate lexicon entries for the visual reasoning domain and the language-driven navigation domain, the visual reasoning domain contains significantly fewer entries than the navigation domain. This is because much of the learning process in this domain is associated with learning the concept embeddings. In the following few paragraphs, we will explain how we instantiate concepts based on these lexicon entry templates and implement generalized conjunction and disjunction in our domain.

First, for each word (more precisely, word type), e.g., shiny, we will instantiate 10 lexicon entries. For semantic programs that contain unbound concept arguments (? marks), we will introduce a series of word-type-specific concepts. Specifically, in this domain, each word type will be associated with 3 concept representations: an object-level concept, a relational concept, and an attribute. Based on Table 3, the first two concepts will be represented as two embedding vectors, and the third concept will be represented as a vector indicating which concepts belong to this attribute category. Next, we will instantiate these lexicon entries by filling in these concept representations. For example, one of the candidate lexicon entries for shiny is syntactic type: objset, semantic program: . During training, all these vector embeddings, as well as the weights associated with each lexicon entry, will be optimized jointly.

Next, we discuss the implementation for two conjunctive lexicon entries. The grammar rule for is:

where is an arbitrary syntactic type (thus called generalized conjunction). There are two typical use cases: what is the shape of the red and shiny object, and what is the shape of the object that is left of the cube and right of the sphere. In the first case, both arguments have syntactic type . In the second case, both arguments have syntactic type . Note that CLEVR contains only the second case.

The grammar rule for is:

It covers the case: how many objects are blue cubes or red spheres. Our implementation differs slightly from human-defined lexicon entries for the word or because the DSL we use is a small set of set-theoretic operations, which does not fully match the expressiveness of truth-conditional semantics. As a result, the current DSL cannot represent all words in the dataset (in particular, or and are), and we have implemented this ad-hoc fix to handle disjunction.

Finally, we want to emphasize again that, since our DSL does not support representing all semantic programs of words, we allow certain words to be associated with an “empty” lexicon entry. This entry can be combined with any words or constituents next to it and does not participate in the composition of syntactic types and semantic programs. In Table 6 we show the lexicon entry associated with each word in the sentence “are there any shiny cubes?”, learned by our model, G2L2.

Word Type Syntactic Type Semantic Program
there <EMPTY> <EMPTY>
cubes objset
Table 6: The learned lexicon entries associated with each word for a simple sentence: are there any shiny cubes?. The derived semantic program for the full sentence is

a.2 Language-Driven Navigation DSL

Our DSL for the language-driven navigation domain is a simple string manipulation language that supports creating new strings, concatenating two strings, and repeating a string multiple times. Our DSL contains only two primitive types: action sequence, abbreviated as ActSeq, and integer.

Formally, we summarize the list of operations in our language-driven navigation domain in Table 7.

Signature                            Note
empty() → ActSeq                     Create an empty string (of length 0).
newprim() → ActSeq                   Create a string containing only one primitive action. In SCAN, there are in total 6 primitives.
newint() → Integer                   Create a single integer. In SCAN, we only support integers {2, 3, 4}.
concat(ActSeq, ActSeq) → ActSeq      Concatenate two input strings.
repeat(ActSeq, Integer) → ActSeq     Repeat the input string multiple times.
Table 7: All operations in the domain-specific language for language-driven navigation.
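As a point of reference, the operations in Table 7 can be sketched deterministically in Python, with an action sequence modeled as a tuple of primitive names. This is only an illustrative sketch: the tuple encoding, the argument names, and the primitive names `WALK` and `LOOK` used in the example are assumptions, and the model itself executes these operations on the probabilistic representation described in the next paragraphs.

```python
from typing import Tuple

# Assumption: an action sequence is a tuple of primitive action names.
ActSeq = Tuple[str, ...]

# The integers supported in SCAN, as stated in Table 7.
ALLOWED_INTS = (2, 3, 4)


def empty() -> ActSeq:
    """Create an empty string (of length 0)."""
    return ()


def newprim(action: str) -> ActSeq:
    """Create a string containing a single primitive action."""
    return (action,)


def newint(value: int) -> int:
    """Create a single integer, restricted to the supported set."""
    if value not in ALLOWED_INTS:
        raise ValueError(f"unsupported integer: {value}")
    return value


def concat(first: ActSeq, second: ActSeq) -> ActSeq:
    """Concatenate two input strings."""
    return first + second


def repeat(seq: ActSeq, times: int) -> ActSeq:
    """Repeat the input string `times` times."""
    return seq * times


# Example: "WALK" repeated twice, followed by "LOOK".
walk_twice_then_look = concat(repeat(newprim("WALK"), newint(2)), newprim("LOOK"))
```

Every operation returns a new immutable value, so programs built from these primitives can be composed freely without side effects.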

Probabilistic string representation.

We represent each string in a “probabilistic” manner. In particular, each string is represented as a tuple . is a categorical distribution of the length. is a three-dimensional tensor, indexed by , where . Thus, has the shape , where is the max length of a string and is the action vocabulary. For simplicity, we constrain that for all .

It is straightforward to represent empty strings: , or strings with a single action primitive : and . Now we explain our implementation of the concat and the repeat operation.

For :

The high-level idea is to enumerate the possible length of both strings.

Similarly, for ,
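The length-enumeration idea for concat can be sketched in plain Python as follows. The data layout is an assumption made for illustration (a length distribution plus, for every possible length, a per-position distribution over actions), and the names `MAX_LEN`, `VOCAB`, `single`, and `empty_string` are hypothetical. Results longer than `MAX_LEN` are simply dropped here, which may differ from the actual implementation.

```python
from typing import List, Tuple

# Assumed layout (not taken verbatim from the paper):
#   length[n]      = P(len = n), for n = 0..MAX_LEN
#   chars[n][k][c] = P(k-th action = VOCAB[c] | len = n), for k < n
ProbString = Tuple[List[float], List[List[List[float]]]]

MAX_LEN = 4
VOCAB = ["WALK", "LOOK"]  # hypothetical two-action vocabulary


def zeros() -> ProbString:
    length = [0.0] * (MAX_LEN + 1)
    chars = [[[0.0] * len(VOCAB) for _ in range(MAX_LEN)] for _ in range(MAX_LEN + 1)]
    return length, chars


def empty_string() -> ProbString:
    length, chars = zeros()
    length[0] = 1.0
    return length, chars


def single(action: str) -> ProbString:
    length, chars = zeros()
    length[1] = 1.0
    chars[1][0][VOCAB.index(action)] = 1.0
    return length, chars


def concat(a: ProbString, b: ProbString) -> ProbString:
    (len_a, ch_a), (len_b, ch_b) = a, b
    length, chars = zeros()
    # Enumerate every pair of lengths whose sum still fits in MAX_LEN.
    for la in range(MAX_LEN + 1):
        for lb in range(MAX_LEN + 1 - la):
            w = len_a[la] * len_b[lb]
            if w == 0.0:
                continue
            total = la + lb
            length[total] += w
            for k in range(total):
                src = ch_a[la][k] if k < la else ch_b[lb][k - la]
                for c in range(len(VOCAB)):
                    chars[total][k][c] += w * src[c]
    # Convert accumulated joint masses back into conditionals given the length.
    for n in range(MAX_LEN + 1):
        if length[n] > 0.0:
            for k in range(n):
                chars[n][k] = [p / length[n] for p in chars[n][k]]
    return length, chars


walk_look = concat(single("WALK"), single("LOOK"))
```

Under the same assumed layout, repeat could be sketched as repeated application of `concat`, weighted by the distribution over the integer argument.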

Expected execution.

In the language-driven navigation domain, we only perform expected execution for semantic programs of type ActSeq, whose execution results can be represented using the probabilistic string representation. Denote as the execution results for programs, and the corresponding weights. We define . We compute the expected string and its weight as:

Candidate lexicons.

We use a simple enumeration algorithm to generate candidate lexicon entries for our language-driven navigation DSL. Specifically, we first enumerate candidate semantic programs for each lexicon entry that satisfy the following constraints:

  1. There are at most three operations.

  2. There are at most two arguments.

  3. There is at most one argument whose type is a functor.

  4. There is no argument of type Integer.
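The four constraints above amount to a simple filter over enumerated programs. The sketch below assumes a hypothetical summary record `Candidate` (an operation count plus a tuple of argument types written as strings); it is not the data structure used by the authors.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of an enumerated semantic program."""
    num_operations: int
    # Argument types, e.g. "ActSeq", "Integer", or "ActSeq->ActSeq" for a functor.
    arg_types: Tuple[str, ...]


def is_functor(arg_type: str) -> bool:
    # Assumption: functor types are written with an arrow.
    return "->" in arg_type


def satisfies_constraints(p: Candidate) -> bool:
    return (
        p.num_operations <= 3                                  # constraint 1
        and len(p.arg_types) <= 2                              # constraint 2
        and sum(is_functor(t) for t in p.arg_types) <= 1       # constraint 3
        and "Integer" not in p.arg_types                       # constraint 4
    )
```

For instance, `Candidate(2, ("ActSeq->ActSeq", "ActSeq"))` passes the filter, while any candidate with an `Integer` argument is rejected.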

Table 8 lists a couple of programs generated by the algorithm and their corresponding types.

Type Program (Note)
The simplest program that constructs a string with a single action
primitive: WALK.
(ActSeq) ActSeq
Prepend a LOOK action to an input string.
(ActSeq, ActSeq) ActSeq
Concatenate two strings.
(ActSeq, ActSeq) ActSeq
Repeat the first string twice and concatenate with the second string.
((ActSeq) -> ActSeq, ActSeq) ActSeq
The first argument () is a function which maps a ActSeq to
another ActSeq. The second argument is an ActSeq.
The function invokes with a simple string WALK, and
concatenate the result with .
Table 8: Sample semantic programs generated by the enumeration process based on our language-driven navigation DSL.

Based on the candidate semantic types, we then instantiate candidate lexicon entries by enumerating the possible orderings (linearizations) of the arguments. For example, the simple program has two possible linearizations: ActSeq/ActSeq and ActSeq\ActSeq. As discussed in the main paper, in order to handle parsing ambiguities, we further introduce two finer-grained syntactic types for the ActSeq type: S and V. In practice, we only allow the following set of syntactic types: V, V/V, V\V, V\V/V, V\V/(V\V), and S\V/V. In total, we have 178 candidate lexicon entries for each word.

Appendix B Delve Into Expected Execution

In this section, we will run a concrete example in a small arithmetic domain to demonstrate the idea of expected execution. Following that, we will prove an important invariance property that has guided our realization of different functional modules in both domains.

b.1 CKY-E2 In An Arithmetic Domain

In this section, we will consider parsing a very simple sentence in an arithmetic domain. We will be using numbers and two arithmetic operations: and . Each number in the domain will be represented as a real value.

Suppose we have the following lexicon entries associated with three words, illustrated in Table 9. There are 4 candidate derivations of the sentence “ONE PLUS_ONE MUL_THREE”, as illustrated in Table 10. For simplicity, we will show the of weights. Thus, the probability of a derivation is proportional to the product of all lexicon entry weights.

Word Type Syntactic Type Semantic Program (Weight)
ONE N 1 1.0
Table 9: A set of candidate lexicon entries and their weights in a simple arithmetic domain.
Syn. Type
Semantic Program
Execution Result
1 N
2 N
3 N
4 N
Table 10: Four candidate derivations of the simple sentence “ONE PLUS_ONE MUL_THREE” in a simple arithmetic domain.

Suppose that we use the ground-truth execution result of this program as supervision, applied through an L2 loss. We are then interested in the expected execution result over all possible derivations. In this case, it is

Next, we will try to accelerate the computation of by doing local marginalization. Consider the constituent “ONE PLUS_ONE”. In CKY-E2, this will be the first constituent that the algorithm constructs. It has two possible derivations, whose corresponding semantic programs are and . Both derivations have the same syntactic type N, and thus, they will be combined with N\N on its right, in the next step. In this case, in CKY-E2, we will merge these two derivations into one (again, since we only care about the expected execution result, not the set of all possible derivations!). The combined derivation has value , and total weight .

Then, when we compose the derivation for the whole sentence, i.e., combine the constituents “ONE PLUS_ONE” and “MUL_THREE”, we no longer need to compute all possible derivations, but only derivations. They are: , with probability , and , with probability . In this case, we see that local marginalization reduces the computational complexity of parsing while retaining the expected execution result!
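The following Python sketch replays this kind of computation on made-up numbers. Only the entry for ONE (value 1, weight 1.0) is taken from the text; the two candidate entries for PLUS_ONE and for MUL_THREE, and all their weights, are hypothetical and chosen purely for illustration. Exact fractions are used so that the two results can be compared with `==`.

```python
from fractions import Fraction
from itertools import product

# Each entry is (semantic value or function, unnormalized weight).
ONE = [(Fraction(1), Fraction(1))]
PLUS_ONE = [(lambda x: x + 1, Fraction(3)), (lambda x: x + 2, Fraction(1))]   # hypothetical
MUL_THREE = [(lambda x: x * 3, Fraction(2)), (lambda x: x * 2, Fraction(1))]  # hypothetical


def expectation(items):
    """Return (weighted mean of values, total weight)."""
    total = sum(w for _, w in items)
    return sum(v * w for v, w in items) / total, total


# Sentence-level marginalization: enumerate all 1 * 2 * 2 = 4 derivations.
full = [
    (g(f(v)), wv * wf * wg)
    for (v, wv), (f, wf), (g, wg) in product(ONE, PLUS_ONE, MUL_THREE)
]
global_value, _ = expectation(full)

# Local marginalization: merge the constituent "ONE PLUS_ONE" into one
# expected value first, then apply each MUL_THREE entry to the merged value.
inner = [(f(v), wv * wf) for (v, wv), (f, wf) in product(ONE, PLUS_ONE)]
merged_value, merged_weight = expectation(inner)
outer = [(g(merged_value), merged_weight * wg) for g, wg in MUL_THREE]
local_value, _ = expectation(outer)
```

With these numbers, both `global_value` and `local_value` equal 6, while the second stage considers 2 derivations instead of 4. The agreement relies on every function in the hypothetical lexicon being affine in its argument.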

b.2 Formal Properties of CKY-E2

Motivated by the intuitive example shown above, now let us formally specify the properties of CKY-E2.

Expectation invariance.

Consider the composition of two consecutive constituents and . We use and to denote possible derivations of both constituents. We assume all ’s are of the same syntactic type without loss of generality, so do all ’s, since we will handle different syntactic types separately.

Denote as the semantic composition function for and . Without local marginalization, we will have in total derivations for the result constituent: . We further use , , and to denote the execution results of these derivations. Without local marginalization, the expected execution result is:

Again, without loss of generality, we will assume and , because any constant scaling of these weights will not change the expectation . Thus, we simplify this definition as,

Let us assume function has the following property: , which expands as,

Thus, locally marginalizing the expected value for and will not change the expected execution result of .
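For readers who want the argument in one place, the derivation can be written out as below. The symbols are assumptions introduced here for illustration: $a_i$ and $b_j$ stand for the execution results of the two constituents, $w_i$ and $w'_j$ for their normalized weights, and $f$ for the composition function.

```latex
% Assumed notation: \sum_i w_i = \sum_j w'_j = 1.
\begin{align*}
\mathbb{E}\big[f(a, b)\big]
  &= \sum_{i}\sum_{j} w_i\, w'_j\, f(a_i, b_j) \\
  &= f\Big(\sum_{i} w_i\, a_i,\; \sum_{j} w'_j\, b_j\Big)
     && \text{(assumed property of } f\text{)} \\
  &= f\big(\mathbb{E}[a],\, \mathbb{E}[b]\big).
\end{align*}
```

The middle step is exactly the assumed property of $f$; it holds, for example, when $f$ is linear in each argument.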

In this simple proof we have made a strong assumption on the composition function : . In practice, this is true when are addition or multiplication functions between scalars, vectors, matrices, and in general, tensors. It will not apply to element-wise min/max operations and other non-linear transformations. Fortunately, this already covers most of the operations we use in the visual reasoning and language-driven navigation DSLs. In practice, even if some operations do not have this property, we can still use this mechanism to approximate the expected execution result.
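A small numerical check illustrates which operations satisfy the property. The values and weights below are made up; the two constituents are treated as independent, as in the derivation above.

```python
from itertools import product

# (value, normalized weight) pairs for two constituents; made-up numbers.
a = [(1.0, 0.25), (3.0, 0.75)]
b = [(2.0, 0.5), (4.0, 0.5)]


def mean(items):
    return sum(v * w for v, w in items)


def expected_composition(f):
    """Expectation of f over all pairs of derivations (no local marginalization)."""
    return sum(wa * wb * f(va, vb) for (va, wa), (vb, wb) in product(a, b))


def composition_of_expectations(f):
    """Apply f once to the locally marginalized values."""
    return f(mean(a), mean(b))


def add(x, y):
    return x + y


def mul(x, y):
    return x * y


gap_add = expected_composition(add) - composition_of_expectations(add)
gap_mul = expected_composition(mul) - composition_of_expectations(mul)
gap_max = expected_composition(max) - composition_of_expectations(max)
```

Here `gap_add` and `gap_mul` are zero, whereas `gap_max` is not, which matches the statement that min/max operations fall outside the invariance property and are only approximated by local marginalization.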

Although we have only proved this property for binary functions, the idea itself easily generalizes to unary functions, such as the negation operation, and higher-arity functions. Furthermore, by induction over derivation trees, we can easily prove that, as long as all composition functions satisfy the expectation invariance property, applying CKY-E2 yields the same result as doing the marginalization at a sentence level.


In general, it is hard to quantify the reduction in computational complexity by doing local marginalization. However, we can still estimate the number of possible derivations constructed in the entire CKY-E2 procedure. For simplicity, consider the case where there is only one primitive syntactic type: X. Moreover, there are candidate lexicon entries of type X; entries of type X/X, and entries of type X\X/X. For each span considered in the CKY-E2 algorithm, all possible derivations associated with this span can be grouped into 4 categories:

  1. Derivations of type X. In this case, only 1 derivation will be retained (merged by the expected execution result).

  2. Derivations of type X/X. In this case, they must be a primitive lexicon entry. Thus, there are at most of them.

  3. Derivations of type X\X/X. Similarly, at most of them.

  4. Derivations of type X\X. This intermediate syntactic type is a result of a partial composition between X\X/X and X (on its right). Thus, there are at most of them.

Overall, there are at most derivations stored for this span. Since the total number of spans is , where is the total length of the input sentence, the overall complexity of CKY-E2 is a polynomial of , , , and , which is significantly lower than an exponential number of derivations.

b.3 Connection with Other Parsing Models

Connection with synchronous grammars.

Our approach can be viewed as defining a synchronous grammar over joint (parse tree, meaning program) pairings. We did not use this framing in the main paper because typical applications of synchronous grammars involve parallel datasets (e.g., sentence pairs in two languages for machine translation, or sentence–image pairs for generating image descriptions) in which the information in both modalities is directly parsed. In our setting, in contrast, the meaning-program component of the synchronous grammar is acquired through more distant supervision. Under this view, the chart parsing process can be seen as synchronously constructing parse trees and meaning programs, and the expected execution can be viewed as a “compression” step over all meaning programs that can potentially be parsed from each span.

Connection with sum-product CKY.

There are also connections between CKY-E2 and sum-product CKY, such as the shared Markovian assumption. The main difference is that CKY-E2 computes only the “expectation” of the execution results of the underlying programs, whereas sum-product CKY computes a full distribution of the parsing results (e.g., in syntactic parsing, it can compute the categorical distribution of the root symbol). Sum-product CKY cannot be applied to our setting because we are dealing with programs, and the space of possible programs is infinite. Modeling a distribution over all possible programs might be intractable; computing the expectation is much easier, and this enables us to do local marginalization.

Appendix C Experimental Setup And Analysis

In this section, we will present in detail the experimental setups for both datasets: CLEVR and SCAN. Both datasets are released under a BSD license. Specifically, we will include details about the setups for different compositional generalization tests. Although some of them (the ones in the SCAN dataset) have already been illustrated in their original paper, we echo them here for completeness. Next, we will also analyze the performance of each model on both datasets, focusing on the inductive biases they have in their model and how these inductive biases contribute to their compositional generalization in different splits.

c.1 Visual Reasoning: CLEVR

We start with the dataset generation process for the CLEVR dataset. Next, we formally present the dataset generation protocol for all splits. Furthermore, we analyze the performance of various models.


As a quick recap of our baselines, MAC Hudson and Manning (2018) uses an end-to-end vision-language attention mechanism to process the question and image jointly; TbD-Nets Mascharka et al. (2018) and NS-VQA Yi et al. (2018) use a neural sequence-to-sequence model (with attention; see Bahdanau et al. (2015)) for semantic parsing, where the parser is trained with sentence-program pairs; NS-CL Mao et al. (2019) uses a customized sequence-to-tree model and jointly learns the visual recognition models and the semantic parser. One crucial implementation detail of the semantic parser module in NS-CL is that it uses additional token embeddings to annotate the concepts appearing in the question. When generating a concept token in the semantic program, it uses an attention mechanism to select the concept from the input question.

Dataset generation.

Since we only consider the cases where each word is associated with a unique lexicon entry, we have manually excluded sentences that will break this assumption. Among all of the 425 templates in the original CLEVR dataset Johnson et al. (2017), we have retained 208 templates. Specifically, we have removed all templates that involve: 1) coreference resolution, 2) “same”-related questions, and 3) number comparison-related questions. To keep the number of questions the same as the original dataset, we choose to re-generate the questions using the selected subset of templates, following the original data generation protocol. All our splits are generated based on this basic version, which we name as the standard training set and the standard test set.

Split: data efficiency.

We test the data efficiency of models by only using 10% of training data in the standard training set, and test the models on the standard test set.

In this split, the semantic parsers used by all program-based methods: TbD-Nets, NS-VQA, and NS-CL, have nearly perfect accuracy. Thus, the performance drops are primarily due to the limited data for training individual modules. Overall, TbD-Nets have the worst data efficiency. There is no performance drop for the NS-VQA model, because the visual recognition modules are pretrained with direct object-level supervision.

Split: compositional generalization (purple).

The training set is generated by selecting all questions that either do not contain the word “purple” or have a length smaller than or equal to 7 (including punctuation). The test set is generated by selecting all questions that contain the word “purple” and have a length greater than 7.
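The selection rule can be sketched as a small Python function. The tokenizer is an assumption (lowercased words, with each punctuation mark counted as its own token), and `split_by_keyword` is a hypothetical helper name; the same rule with a different keyword and threshold covers the word-based “right” split described below.

```python
import re
from typing import Iterable, List, Tuple


def tokenize(question: str) -> List[str]:
    """Assumed tokenization: lowercase words, punctuation marks as separate tokens."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", question.lower())


def split_by_keyword(
    questions: Iterable[str], keyword: str, max_train_len: int
) -> Tuple[List[str], List[str]]:
    train, test = [], []
    for q in questions:
        tokens = tokenize(q)
        # Train: the keyword is absent, or the question is short enough.
        if keyword not in tokens or len(tokens) <= max_train_len:
            train.append(q)
        else:
            # Test: the keyword is present and the question is long.
            test.append(q)
    return train, test


questions = [
    "Is there a purple cube?",
    "How many purple cubes are left of the sphere?",
    "What is the shape of the large red object?",
]
train_set, test_set = split_by_keyword(questions, keyword="purple", max_train_len=7)
```

In this toy example only the second question, which mentions “purple” and has more than 7 tokens, ends up in the test set.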

In this split, the semantic parsers used by all program-based methods (TbD-Nets, NS-VQA, and NS-CL) have nearly perfect accuracy. Thus, the performance drops are primarily due to 1) the limited data for training individual modules (in this case, filter(purple)) and 2) novel compositions of learned modules. Overall, NS-VQA and NS-CL answer more questions correctly than TbD-Nets.

Split: compositional generalization (right).

The training set is generated by selecting all questions that either do not contain the word “right” or have a length smaller than or equal to 12. The test set is generated by selecting all sentences that contain the word “right” and have a length greater than 12.

In this split, the semantic parser of NS-CL still yields almost perfect accuracy. In contrast, the accuracy of the neural sequence-to-sequence parser used by TbD-Nets and NS-VQA is around 91%. Thus, the performance drop of TbD-Nets is mainly due to the inferior performance of the corresponding neural module: relate(right). Compared with the realization of a filter module (used in our purple generalization test), the relate module in TbD-Nets has a significantly deeper neural architecture (6 layers vs. 3 layers). Thus, we hypothesize that this module requires more data to train.

Split: compositional generalization (count).

The training set is generated by selecting all questions that either do not contain the operation “count” or have a length smaller than or equal to 9. The test set is generated by selecting all questions that contain the operation “count” and have a length greater than 9.

Among all compositional generalization tests, this is the most challenging one. The semantic parser, in this case, needs to generalize from “how many cubes are there” and “what’s the shape of the object that is both left of the cube and right of the sphere?” to “how many cubes are both left of the cube and right of the sphere?” We have constructed the training data such that all constituents have been seen during training. In this test, the program-level accuracy of the semantic parser used by TbD-Nets and NS-VQA is 70.8%. NS-VQA achieves slightly higher QA accuracy.

We find that the parsing module in NS-CL fails to output the correct filter operation. Given the input question “what is the number of spheres that are right of the cube”, it sometimes outputs filter_cube relate_right filter_cube count (the correct program has filter_sphere as its third operation instead). This is because the system has never seen sentences that combine “counting” operations with such complex structures; during training, it has only seen short sentences such as “what is the number of spheres?” Only our G2L2 model, with its explicit constituent-level compositionality, is able to make these generalizations.

Split: depth generalization.

We define the “hop number” of a question as the number of intermediate objects being referred to in order to locate the target object. For example, the “hop number” of the question “how many red objects are right of the cube?” is 1. We train different models on 0-hop and 1-hop questions and test them on 2-hop questions.

This generalization test evaluates the generalization to deeper syntactic structures. All methods except for our model G2L2 fail on this test. By evaluating the accuracy of the programs generated by different semantic parsers, we find that the neural sequence-to-sequence model used by TbD-Nets and NS-VQA completely fails on this task, sometimes generating invalid programs (the program-level accuracy is 1.7%). Thus, we see a significant performance drop for both methods. Meanwhile, NS-CL also generates wrong programs, but the programs are always valid due to its sequence-to-tree design. Furthermore, even if a program is not correct (for example, because it misses certain operations), its execution may still lead to a correct answer. For example, as long as the semantic parser gets the outer-most filter operation (i.e., the last hop) correct, it is still possible to generate the correct answer.

c.2 Language-Driven Navigation: SCAN

The SCAN dataset (Lake and Baroni, 2018) consists of several official splits for generalizability evaluation. Following existing work, we evaluate on the splits corresponding to the generalization test of “jump” and “around right”. In addition, we test the generalizability across different output lengths and the data efficiency of models. The performance is measured by exact match–based output accuracy.
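For concreteness, exact match-based output accuracy can be sketched as follows. The function name and the representation of outputs as token sequences are assumptions; an output counts as correct only if it equals the reference token for token.

```python
from typing import Sequence


def exact_match_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
) -> float:
    """Fraction of examples whose predicted action sequence equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return hits / len(references)


accuracy = exact_match_accuracy(
    [["I_JUMP", "I_JUMP"], ["I_WALK"]],
    [["I_JUMP", "I_JUMP"], ["I_RUN"]],
)
```

No partial credit is given: a prediction that differs from the reference in a single token counts as wrong.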

Split: data efficiency.

The official “simple” split randomly samples 80% among all possible example pairs as the training set, and leaves the others as the test set. We use the training set of the official simple split (available at [this url]) as the entire training set, and test the data efficiency by using only 10% of them. We sample the 10% data uniformly for each input length. Both settings are tested in the official simple test split, which is available at [this url].
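One way to read “uniformly for each input length” is stratified sampling: group the training examples by the number of input tokens and draw the same fraction from every group. The sketch below follows that reading; the function name, the whitespace tokenization, the fixed seed, and the choice to keep at least one example per length are all assumptions rather than details from the paper.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (input command, output action sequence)


def sample_per_input_length(
    examples: Sequence[Example], fraction: float = 0.1, seed: int = 0
) -> List[Example]:
    rng = random.Random(seed)
    by_length: Dict[int, List[Example]] = defaultdict(list)
    for ex in examples:
        by_length[len(ex[0].split())].append(ex)
    subset: List[Example] = []
    for length in sorted(by_length):
        group = by_length[length]
        # Assumption: keep at least one example for every input length.
        k = max(1, round(fraction * len(group)))
        subset.extend(rng.sample(group, k))
    return subset


toy = [(f"w{i}", "X") for i in range(20)] + [(f"w{i} twice", "X X") for i in range(10)]
subset = sample_per_input_length(toy, fraction=0.1)
```

On the toy data, this draws 2 of the 20 one-token commands and 1 of the 10 two-token commands.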

Split: compositional generalization (jump).

The training split consists of jump in isolation, i.e., the input is jump while the ground-truth output is I_JUMP, along with all other examples that do not contain jump. The model is expected to recognize that jump has the same syntactic category as other verbs such as run, and to do well on more complicated instructions that include jump, e.g., mapping jump twice to I_JUMP I_JUMP. The training data is available at [this url] and the test data is available at [this url].
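The partition rule described here can be sketched as below. The experiments use the official split files, so this function (a hypothetical helper that assumes whitespace tokenization) only illustrates the rule.

```python
from typing import Iterable, List, Tuple

Example = Tuple[str, str]  # (input command, output action sequence)


def split_jump(examples: Iterable[Example]) -> Tuple[List[Example], List[Example]]:
    train, test = [], []
    for command, actions in examples:
        tokens = command.split()
        # Train: "jump" in isolation, or any command that does not mention jump.
        if tokens == ["jump"] or "jump" not in tokens:
            train.append((command, actions))
        else:
            # Test: every composed command that contains jump.
            test.append((command, actions))
    return train, test


toy_examples = [
    ("jump", "I_JUMP"),
    ("run twice", "I_RUN I_RUN"),
    ("jump twice", "I_JUMP I_JUMP"),
]
jump_train, jump_test = split_jump(toy_examples)
```

In the toy example, `jump twice` is the only command held out for testing.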

Split: compositional generalization (around right).

Similar to the “jump” test, the training set for the “around-right” test consists of all possible examples that do not contain around right in their inputs, while the test set consists of all examples that contain around right. It is worth noting that, unlike jump, around right in isolation is not a valid input, as it lacks a primitive. The model is expected to perform compositional generalization, understanding around right based on existing training phrases such as around left and opposite right. The training data is available at [this url] and the test data is available at [this url].

Split: length generalization.

The model is expected to perform well on examples with long ground-truth outputs while being trained only on those with short ground-truth outputs. In this test, the output of every training example consists of at most 22 tokens, while the output of a test example may consist of up to 48 tokens. The training data is available at [this url] and the test data is available at [this url].

Baseline models.

All baseline models are built on top of a seq2seq model (Sutskever et al., 2014): the original seq2seq model (Sutskever et al., 2014) trains an LSTM-based encoder-decoder model using the training set; the other methods augment the training set using either heuristics or learned models, and train an LSTM-based encoder-decoder model using the augmented data. It is worth noting that, among all considered baseline methods, GECA (Andreas, 2020) may generate examples in the test set, especially for compositional generalization tests, since the heuristics it introduces are compositional by nature.