Compositional Generalization by Learning Analytical Expressions

06/18/2020 ∙ by Qian Liu, et al. ∙ Beihang University, Microsoft

Compositional generalization is a basic but essential intellectual capability of human beings, which allows us to readily recombine known parts. However, existing neural network based models have been shown to be extremely deficient in this capability. Inspired by work in cognition which argues that compositionality can be captured by variable slots with symbolic functions, we present a fresh view that connects a memory-augmented neural model with analytical expressions to achieve compositional generalization. Our model consists of two cooperative neural modules, Composer and Solver, fitting well with the cognitive argument while still being trained in an end-to-end manner via a hierarchical reinforcement learning algorithm. Experiments on the well-known benchmark SCAN demonstrate that our model has a strong capacity for compositional generalization, solving all challenges addressed by previous works with 100% accuracies.



1 Introduction

When using language, humans have a remarkable ability to recombine known parts to understand novel sentences they have never encountered before Chomsky (1957); Fodor and Pylyshyn (1988); Fodor and Lepore (2002). For example, once a person has learned the meanings of “walk”, “jump” and “walk twice”, it is effortless for him or her to understand the meaning of “jump twice”. This kind of ability relies on the compositionality of language. More formally, compositionality refers to the phenomenon in which the meaning of a complex expression (e.g. a sentence) is determined by the meanings of its constituents (e.g. the verb “jump” and the adverb “twice”) together with the way these constituents are combined (e.g. an adverb modifies a verb) Szabó (2017). Understanding compositionality in language is a basic but essential capacity for human beings, which is argued to be one of the key skills towards human-like machine intelligence Fodor and Lepore (2002); Mikolov et al. (2016).

Recently, Lake and Baroni (2018) made a step towards exploring and benchmarking the compositional generalization of neural networks. Compositional generalization is argued to be the ability to understand out-of-domain sentences, when the understanding requires leveraging the compositionality in language Russin et al. (2019). The test suite, their proposed Simplified version of the CommAI Navigation (SCAN) dataset, contains compositional navigation commands such as “walk twice” and corresponding action sequences WALK WALK. Such a task lies in the category of machine translation, and thus is expected to be well solved by current state-of-the-art translation models (e.g. sequence to sequence with attention Sutskever et al. (2014); Bahdanau et al. (2015)). However, experiments on SCAN demonstrated that modern translation models dramatically fail to obtain a satisfactory performance on compositional generalization. For example, although the meanings of “walk”, “walk twice” and “jump” have been seen, current models fail to generalize to understand “jump twice”. Subsequent works verified that it was not an isolated case, since convolutional encoder-decoder model and Transformer met the same problem Dessì and Baroni (2019). There have been several attempts towards SCAN, but so far no neural based model can successfully solve all the compositional challenges on SCAN without extra resources Li et al. (2019); Lake (2019); Gordon et al. (2020).

In this paper, we propose a memory-augmented neural model to achieve compositional generalization by Learning Analytical Expressions (LAnE). Motivated by work in cognition which argues that compositionality can be captured by variable slots with symbolic functions Baroni (2019), our memory-augmented architecture is devised to contain two cooperative neural modules accordingly: Composer and Solver. Composer aims to find structured analytical expressions from unstructured sentences, while Solver focuses on understanding these expressions by accessing Memory (Sec. 3). These two modules are trained to learn analytical expressions together in an end-to-end manner via a hierarchical reinforcement learning algorithm (Sec. 4). Experiments on the well-known benchmark SCAN demonstrate that our model has a strong capacity for compositional generalization, reaching 100% accuracies in all tasks (Sec. 5). As far as we know, our model is the first neural model to pass all compositional challenges addressed by previous works on SCAN without extra resources. We will open-source our code upon acceptance, and we believe our work could shed light on the community.

2 Compositional Generalization Assessment

Since the study on compositional generalization of deep neural models is still in its infancy, the overwhelming majority of previous works employ artificial datasets to conduct assessment. As one of the most important benchmarks, the SCAN dataset is proposed to evaluate the compositional generalization ability of translation models Lake and Baroni (2018). As mentioned above, SCAN describes a simple navigation task that aims to translate compositional navigation sentences into executed action sequences. However, due to the open nature of compositional generalization, there is disagreement about which aspect should be addressed Szabó (2017); Lake et al. (2019); Hupkes et al. (2020); Keysers et al. (2020). To conduct a comprehensive assessment, we consider both systematicity and productivity, two important arguments for compositional generalization.

Systematicity evaluates if models can recombine known parts. To assess it, Lake and Baroni (2018) proposed three tasks: (i) Add Jump. The train and test pairs are split in terms of the primitive JUMP. All commands that contain, but are not exactly, the word “jump” form the test set. The rest form the train set. (ii) Around Right. Any compositional command whose constituents include “around right” is excluded from the train set. This task is proposed to evaluate whether the model can generalize its experience with “left” to “right”, especially on “around right”. (iii) Length. All commands with long outputs (i.e. output length above a fixed threshold), such as “around * twice * around” and “around * thrice”, are never seen in training, where ‘*’ indicates a wildcard. More recently, Keysers et al. (2020) proposed another assessment, distribution-based systematicity. It measures compositional generalization using a setup with a large compound distribution divergence between the train and test sets (Maximum Compound Divergence, MCD) Keysers et al. (2020).
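The Add Jump rule described above can be sketched in a few lines. This is only an illustrative sketch: the function name `add_jump_split` and the toy command list are assumptions made here, not part of the SCAN release.

```python
def add_jump_split(commands):
    """Split commands into (train, test) following the Add Jump rule."""
    train, test = [], []
    for command in commands:
        tokens = command.split()
        # Test set: the command contains the word "jump" but is not exactly "jump".
        if "jump" in tokens and command != "jump":
            test.append(command)
        else:
            train.append(command)
    return train, test


# Toy commands, chosen only for illustration.
commands = ["jump", "walk", "walk twice", "jump twice", "run and jump"]
train, test = add_jump_split(commands)
print(train)
print(test)
```

The primitive command “jump” stays in the train set, so a model sees the word in isolation but never in composition.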

Productivity is thought to be another key argument. It not only requires models to recombine known parts, but also evaluates if they can productively generalize to inputs beyond the length they have seen in training. It relates to the unboundedness of languages, which means languages license a theoretically infinite set of possible sentences Baroni (2019). To evaluate it, we re-create the SCAN dataset (SCAN-ext). Compared with SCAN, which uses at most one “and” in a sentence, SCAN-ext roughly controls the distribution of input lengths by the number of “and” (e.g. “jump and walk twice and turn left”). Input sentences in the train set contain only a limited number of “and”, while the test set allows more. Except for “and”, the generation of other parts follows the procedure in SCAN.

3 Methodology

In this section, we first show the intrinsic connection between language compositionality and analytical expressions. We then describe how these expressions are learned through our model.

3.1 Problem Statement

Cognitive scientists argue that the compositionality of language indeed constitutes an algebraic system, of the sort that can be captured by symbolic functions with variable slots Baroni (2019); Fodor and Lepore (2002). As an illustrative example, any adjective attached with a prefix “super-” can be regarded as applying a symbolic function (i.e. “super-adj”) on a variable slot (e.g. “good”), and will be mapped to a new adjective (e.g. “super-good”) Baroni (2019). Such a formulation frees the symbolic function from specific adjectives and makes it able to generalize on new adjectives (e.g. “super-bad”).

Figure 1: The schematic illustration of our idea on learning analytical expressions (see text).

Taking a more complicated case from SCAN, as shown in Fig. 1, the understanding of “run opposite left after walk twice” can be regarded as a hierarchical application of symbolic functions. In Fig. 1, “$x” and “$y” are variables defined in the source domain, and “$X” and “$Y” are variables defined in the destination domain. We call a sequence of source domain variables or words (e.g. run) a source analytical expression (SrcExp), and a sequence of destination domain variables or action words (e.g. RUN) a destination analytical expression (DstExp). If there is no variable in an SrcExp (or DstExp), it is also a constant SrcExp (or DstExp). From the bottom up, each phrase marked in blue represents an SrcExp which will be superseded by a source domain variable (e.g. $x) when moving to the next level of understanding. These SrcExps can be recognized and translated into their corresponding DstExps by a set of symbolic functions. We call such SrcExps recognizable SrcExps, and their corresponding DstExps recognizable DstExps. By iteratively recognizing and translating recognizable SrcExps, we can construct a tree hierarchy with a set of recognizable DstExps. By recursively assigning values to the destination variables in recognizable DstExps (dotted red arrows in Fig. 1), we finally obtain a constant DstExp as the resulting output sequence.
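The recursive variable assignment can be illustrated with a small hand-built tree. This is a sketch under stated assumptions: the `Node` class, the `resolve` function, and the abbreviated action names (LTURN, RUN, WALK) are introduced here for illustration, and the tree is written by hand rather than learned.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One recognizable DstExp plus the sub-trees bound to its variables."""
    dst_exp: List[str]
    children: Dict[str, "Node"] = field(default_factory=dict)


def resolve(node: Node) -> List[str]:
    """Recursively replace destination variables to obtain a constant DstExp."""
    output: List[str] = []
    for token in node.dst_exp:
        if token in node.children:
            # A destination variable: substitute the value of its sub-tree.
            output.extend(resolve(node.children[token]))
        else:
            # An action word: copy it as-is.
            output.append(token)
    return output


# Hand-built hierarchy for "run opposite left after walk twice".
run = Node(["RUN"])
run_opposite_left = Node(["LTURN", "LTURN", "$X"], {"$X": run})
walk = Node(["WALK"])
walk_twice = Node(["$Y", "$Y"], {"$Y": walk})
root = Node(["$Y", "$X"], {"$X": run_opposite_left, "$Y": walk_twice})

result = resolve(root)
print(" ".join(result))
```

Each node's DstExp mentions only variables and action words, so the same node template (for example `["$Y", "$Y"]` for “twice”) can be reused with any sub-tree.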

It is well known that variables are pieces of memory in computers, and that a memory mechanism can be used to support variable-related operations. Thus we propose a memory-augmented neural model to achieve compositional generalization by automatically learning the above analytical expressions.

3.2 Model Design

Our model consists of three components: Composer, Solver and Memory. Composer accepts an SrcExp as input, and aims to find a recognizable SrcExp inside it. Solver first translates the recognizable SrcExp into a recognizable DstExp, and then assigns values to destination variables in the recognizable DstExp, obtaining a constant DstExp. Memory is designed to support variable-related operations in a differentiable manner Sukhbaatar et al. (2015). The understanding of a sentence is an iterative procedure involving these three components, and below we take step $t$ as an example.

Composer   Given an SrcExp $w^t$, Composer aims to find a recognizable SrcExp $\tilde{w}^t$. There are several ways to implement it, and we choose to gradually merge elements of $w^t$ until a recognizable SrcExp appears. For example, given “$x after $y”, Composer first merges “$x” and “after”. Then it checks if “$x after” is a recognizable SrcExp. The answer in this case is no, so Composer continues to merge “$x after” with “$y”. Next, it checks if “$x after $y” is a recognizable SrcExp. It is, and thus Composer triggers Solver to translate it. As indicated by the example, the overall procedure is iterative, and is thus achieved by iteratively building a binary tree from the bottom up. That is to say, viewing elements of $w^t$ as nodes, Composer iteratively merges two neighboring nodes into a parent node at each layer (i.e. the merge process), and checks if the parent node represents a recognizable SrcExp (i.e. the check process).

The merge process is implemented by first enumerating all possible parent nodes of the current layer, and then selecting the one with the highest merging score. Assuming the $i$-th and $(i+1)$-th nodes at layer $j$ are represented by $r_i^j$ and $r_{i+1}^j$ respectively, their parent representation $r_i^{j+1}$ can be obtained via a standard Tree-LSTM encoding Tai et al. (2015) using $r_i^j$ and $r_{i+1}^j$ as input. As shown inside Composer in Fig. 2, given all parent node representations (blue neurons), Composer selects the parent node (solid lines with arrows) whose merging score is the maximum. The merging score measures the merging priority of $r_i^{j+1}$ using a learnable query vector $q$ by $\langle q, r_i^{j+1} \rangle$, where $\langle \cdot, \cdot \rangle$ represents the inner product. Once the parent node for layer $j$ is determined, the check process is called.
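One merge step can be sketched as follows. This is a simplified illustration, not the paper's implementation: the `compose` function is a hypothetical stand-in (an element-wise mean) for the Tree-LSTM cell, and the toy vectors are made up.

```python
def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def compose(left, right):
    """Hypothetical stand-in for the Tree-LSTM cell: element-wise mean."""
    return [(a + b) / 2.0 for a, b in zip(left, right)]


def merge_once(nodes, query):
    """Merge the neighboring pair whose parent has the highest merging score."""
    # Enumerate every candidate parent of two neighboring nodes.
    parents = [compose(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
    # Merging score: inner product between the query vector and each parent.
    scores = [inner_product(query, parent) for parent in parents]
    best = max(range(len(scores)), key=scores.__getitem__)
    # The chosen parent replaces its two children in the next layer.
    next_layer = nodes[:best] + [parents[best]] + nodes[best + 2:]
    return best, next_layer


# Toy node representations and query vector (illustrative values only).
nodes = [[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
query = [0.0, 1.0]
best, next_layer = merge_once(nodes, query)
print(best, next_layer)
```

In a trained model the node vectors and the query vector would be learned. Here they are fixed so that the selection rule is easy to follow.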

The check process checks whether a parent node represents a recognizable SrcExp. Concretely, denoting $r$ as the parent node representation, an affine transformation is built on it to obtain the probability $p_c = \sigma(W r + b)$, where $W$ and $b$ are learned parameters and $\sigma$ is the sigmoid function. A sufficiently high $p_c$ means that the parent node represents a recognizable SrcExp, and thus Composer triggers Solver to translate it. Otherwise, the parent node and the other unmerged nodes enter a new layer $j+1$, based on which Composer restarts the merge process.
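The check process amounts to a logistic classifier on the parent representation. The sketch below assumes a scalar output and a decision threshold of 0.5. The threshold, the function names, and the toy weights are assumptions made for illustration.

```python
import math


def check_probability(parent, weights, bias):
    """Sigmoid of an affine transformation of the parent representation."""
    z = sum(w * x for w, x in zip(weights, parent)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def is_recognizable(parent, weights, bias, threshold=0.5):
    """Decide whether the parent node is treated as a recognizable SrcExp.

    The 0.5 threshold is an assumption of this sketch.
    """
    return check_probability(parent, weights, bias) > threshold


# Toy parameters (illustrative values only).
weights, bias = [1.0, 0.0], 0.0
print(check_probability([0.0, 0.0], weights, bias))
print(is_recognizable([2.0, 0.0], weights, bias))
print(is_recognizable([-2.0, 0.0], weights, bias))
```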

Figure 2: The illustration of our memory-augmented model at two consecutive steps $t$ and $t+1$. Items in Memory written by Solver are marked with a gray background (see text).

Solver   Given a recognizable SrcExp $\tilde{w}^t$, Solver first translates it into a recognizable DstExp, and then obtains a constant DstExp via variable assignment through interacting with Memory. To achieve this, Solver is designed to be an LSTM-based sequence to sequence network with an attention mechanism Bahdanau et al. (2015). It generates the recognizable DstExp by decoding step by step. At each step, Solver generates either an action word or a destination variable. Using the recognizable DstExp as the skeleton, Solver obtains a constant DstExp by replacing each destination variable with its corresponding constant DstExp stored in Memory.

Memory   There are a number of items in Memory, each of which contains a source vector (SrcVec), a destination vector (DesVec), and a value slot to temporarily store a constant DstExp. Here, the source vectors (yellow neurons) and destination vectors (red neurons) are learnable vectors, which are used to represent source variables (e.g. $x, $y) and destination variables (e.g. $X, $Y) respectively.

Interaction   The understanding of a sentence takes several steps iteratively. At each step, Composer, Solver and Memory interact with each other. To illustrate the interaction clearly, Fig. 2 presents the overall procedure of our model at steps $t$ and $t+1$ in detail (corresponding to two consecutive steps in Fig. 1). Although variables are implemented by vectors, below we still call them variables for clarity. At the beginning of step $t$, an SrcExp “$x after $y twice” is fed into Composer. Then Composer finds a recognizable SrcExp “$y twice” and sends it to Solver. Receiving “$y twice”, Solver first translates it into “$Y $Y”. Using “$Y $Y” as the skeleton, Solver obtains WALK WALK by replacing “$Y” with its corresponding constant DstExp WALK stored in Memory. Meanwhile, since WALK has been used, the value slot which stores WALK is set to be empty. Next, Solver applies for one item with an empty value slot in Memory, i.e. the item containing $y and $Y, and then writes WALK WALK into its value slot. Finally, the recognizable SrcExp “$y twice” in the input SrcExp is superseded with “$y”, producing “$x after $y” as the input for the next step. Such a procedure is repeated until the SrcExp fed into Composer is exactly a recognizable SrcExp. Assuming the step at this point is $T$, the constant DstExp obtained at step $T$ is the final output action sequence.
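The interaction loop above can be traced with a symbolic sketch. Everything learned in the paper is replaced here by hypothetical stand-ins: `RULES` is a fixed lookup table in place of the Solver network, `find_recognizable` is a shortest-span search in place of Composer, and Memory is a plain dictionary of value slots without the source and destination vectors.

```python
# Hypothetical rule table standing in for the learned Solver translation.
RULES = {
    ("$y", "twice"): ["$Y", "$Y"],
    ("$x", "after", "$y"): ["$Y", "$X"],
}
# Each Memory item pairs a source variable with a destination variable.
DST_TO_SRC = {"$X": "$x", "$Y": "$y"}


def find_recognizable(src_exp):
    """Stand-in for Composer: return the span of the shortest known SrcExp."""
    for length in range(1, len(src_exp) + 1):
        for start in range(len(src_exp) - length + 1):
            if tuple(src_exp[start:start + length]) in RULES:
                return start, start + length
    raise ValueError("no recognizable SrcExp found")


def step(src_exp, memory):
    """Run one Composer/Solver/Memory interaction step."""
    start, end = find_recognizable(src_exp)
    skeleton = RULES[tuple(src_exp[start:end])]
    constant, used = [], []
    for token in skeleton:
        if token in DST_TO_SRC:
            # Variable assignment: read the constant DstExp from Memory.
            constant.extend(memory[DST_TO_SRC[token]])
            used.append(DST_TO_SRC[token])
        else:
            constant.append(token)
    for variable in used:
        memory[variable] = None  # a used value slot is emptied
    # Write the result into the first item whose value slot is empty.
    target = next(v for v, value in memory.items() if value is None)
    memory[target] = constant
    is_final = start == 0 and end == len(src_exp)
    return src_exp[:start] + [target] + src_exp[end:], constant, is_final


def understand(src_exp, memory):
    """Repeat steps until the whole input is itself a recognizable SrcExp."""
    while True:
        src_exp, constant, is_final = step(src_exp, memory)
        if is_final:
            return constant


# Memory state before the step discussed in the text (action names abbreviated).
memory = {"$x": ["LTURN", "LTURN", "RUN"], "$y": ["WALK"]}
output = understand(["$x", "after", "$y", "twice"], memory)
print(" ".join(output))
```

The first iteration rewrites “$y twice” to “$y” and stores WALK WALK in the `$y` slot. The second iteration consumes both slots and returns the final action sequence.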

4 Model Training

Training our proposed model is non-trivial for two reasons: (i) since the identification of the recognizable SrcExp $\tilde{w}^t$ is discrete, it is hard to optimize Composer and Solver via back propagation; (ii) since there is no supervision about recognizable SrcExps or their corresponding recognizable DstExps, Composer and Solver cannot be trained separately. Recalling the procedure of these two modules in Fig. 2, it is natural to model the problem via Hierarchical Reinforcement Learning (HRL) Barto and Mahadevan (2003): a high-level agent to find recognizable SrcExps (Composer), and a low-level agent to obtain constant DstExps conditioned on these recognizable SrcExps (Solver).

4.1 Hierarchical Reinforcement Learning

We begin by introducing some preliminary formulations for our HRL algorithm. Denoting $s^t$ as the state at step $t$, it contains both the SrcExp $w^t$ and Memory. The action of Composer, denoted by $G^t$, is the recognizable SrcExp to be found at step $t$. Given $s^t$ as observation, the parameter $\theta$ of Composer defines a high-level policy $\pi_\theta(G^t \mid s^t)$. Once a high-level action $G^t$ is produced, the low-level agent Solver is triggered to react following a low-level policy conditioned on $G^t$. In this sense, the high-level action can be viewed as a sub-goal for the low-level agent. Denoting $a^t$ as the action of Solver, the low-level policy $\pi_\phi(a^t \mid G^t, s^t)$ is parameterized by the parameter $\phi$ of Solver. $a^t$ is the constant DstExp output by Solver at step $t$. More implementation details about $\pi_\theta$ and $\pi_\phi$ can be found in the supplementary material.

Figure 3: Our HRL algorithm contains a high-level policy $\pi_\theta$ (Composer) and a low-level policy $\pi_\phi$ (Solver).

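Policy Gradient   As illustrated in Fig. 3, in our HRL algorithm, Composer and Solver take actions in turn. When it is Composer’s turn to act, it picks a sub-goal $G^t$ according to the high-level policy $\pi_\theta$. Once $G^t$ is set, Solver is triggered to pick a low-level action $a^t$ according to the low-level policy $\pi_\phi$. These two modules act alternately until they reach the endpoint (i.e. step $T$) and predict the output action sequence, forming a trajectory $\tau = s^1 G^1 a^1 \cdots s^T G^T a^T$. Once $\tau$ is determined, the reward is collected to optimize $\theta$ and $\phi$ using policy gradient Sutton et al. (2000). Denoting $R(\tau)$ as the reward of a trajectory $\tau$ (elaborated in Sec. 4.2), the training objective of our model is to maximize the expectation of rewards as:

$$\max_{\theta, \phi} \mathcal{J}(\theta, \phi) = \mathbb{E}_{\tau \sim \pi_{\theta, \phi}} \left[ R(\tau) \right]. \quad (1)$$

Applying the likelihood ratio trick, $\theta$ and $\phi$ can be optimized by ascending the following gradient:

$$\nabla \mathcal{J}(\theta, \phi) = \mathbb{E}_{\tau \sim \pi_{\theta, \phi}} \left[ R(\tau) \, \nabla \log \pi_{\theta, \phi}(\tau) \right]. \quad (2)$$

Expanding the above equation via the chain rule (more details can be found in the supplementary material), we can obtain:

$$\nabla \mathcal{J}(\theta, \phi) = \mathbb{E}_{\tau \sim \pi_{\theta, \phi}} \left[ R(\tau) \sum_{t=1}^{T} \left( \nabla \log \pi_\theta(G^t \mid s^t) + \nabla \log \pi_\phi(a^t \mid G^t, s^t) \right) \right]. \quad (3)$$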
Considering that the search space of $\tau$ is huge, the REINFORCE algorithm Williams (1992) is leveraged to approximate Eq. 3 by sampling $\tau$ from $\pi_{\theta, \phi}$ multiple times. Furthermore, the technique of subtracting a baseline Weaver and Tao (2001) is employed to reduce variance, where the baseline is the mean reward over the sampled trajectories.
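The sampling estimate with a mean-reward baseline can be sketched numerically. The sketch assumes that the gradient of the trajectory log-probability has already been computed for each sample and is passed in as a plain list. The function names and toy numbers are illustrative only.

```python
def advantages_with_mean_baseline(rewards):
    """Subtract the mean reward over the sampled trajectories."""
    baseline = sum(rewards) / len(rewards)
    return [reward - baseline for reward in rewards]


def reinforce_gradient(rewards, grad_log_probs):
    """Monte Carlo policy gradient estimate averaged over sampled trajectories.

    grad_log_probs[k] is assumed to hold the gradient of the log-probability
    of the k-th sampled trajectory with respect to the parameters.
    """
    advantages = advantages_with_mean_baseline(rewards)
    dim = len(grad_log_probs[0])
    gradient = [0.0] * dim
    for advantage, grad in zip(advantages, grad_log_probs):
        for k in range(dim):
            gradient[k] += advantage * grad[k] / len(rewards)
    return gradient


# Four sampled trajectories with toy rewards and toy log-probability gradients.
rewards = [1.0, 0.0, 0.5, 0.5]
grad_log_probs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
estimate = reinforce_gradient(rewards, grad_log_probs)
print(estimate)
```

Trajectories whose reward equals the baseline contribute nothing, so only better-than-average and worse-than-average samples move the parameters.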


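Differential Update   Unlike standard Reinforcement Learning (RL) algorithms, we introduce a differential update strategy to optimize Composer and Solver with different learning rates. It is motivated by the intuition that the actions of a high-level agent cannot be changed quickly. According to Eq. 3, the parameters $\theta$ of Composer and $\phi$ of Solver are optimized as:

$$\theta \leftarrow \theta + \alpha \, \mathbb{E}_{\tau} \left[ R(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(G^t \mid s^t) \right], \qquad \phi \leftarrow \phi + \beta \, \mathbb{E}_{\tau} \left[ R(\tau) \sum_{t=1}^{T} \nabla_\phi \log \pi_\phi(a^t \mid G^t, s^t) \right],$$

where $\pi_\theta(G^t \mid s^t)$ is Composer’s high-level policy, $\pi_\phi(a^t \mid G^t, s^t)$ is Solver’s low-level policy, and Solver’s learning rate $\beta$ is greater than Composer’s learning rate $\alpha$. In our experiments, the AdaDelta optimizer Zeiler (2012) is employed to optimize our model.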
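The idea of updating the two modules at different speeds can be shown with plain gradient ascent. This is a deliberately simplified sketch: it does not use AdaDelta, and the learning-rate values 0.1 and 1.0 are placeholders chosen for illustration, not values taken from the paper.

```python
def differential_update(theta, phi, grad_theta, grad_phi, lr_composer, lr_solver):
    """One gradient-ascent step with a smaller learning rate for Composer."""
    if lr_solver <= lr_composer:
        raise ValueError("Solver's learning rate should exceed Composer's")
    # High-level agent (Composer): slow update.
    new_theta = [p + lr_composer * g for p, g in zip(theta, grad_theta)]
    # Low-level agent (Solver): fast update.
    new_phi = [p + lr_solver * g for p, g in zip(phi, grad_phi)]
    return new_theta, new_phi


# Placeholder parameters, gradients and learning rates.
theta, phi = [0.0], [0.0]
new_theta, new_phi = differential_update(
    theta, phi, grad_theta=[1.0], grad_phi=[1.0], lr_composer=0.1, lr_solver=1.0
)
print(new_theta, new_phi)
```

In a deep learning framework the same effect is usually obtained by giving each module its own optimizer parameter group with a separate learning rate.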

4.2 Reward Design

The design of the reward function is critical to an RL based algorithm. Bearing this in mind, we design our reward from two aspects: similarity and simplicity. It is worth noting that both rewards work globally, i.e., all actions share the same reward, as indicated by dotted lines in Fig. 3.

Similarity-based Reward   It is based on the similarity between the model’s output and the ground-truth. Since the output of our model is an action sequence, a kind of sequence similarity, the Intersection over Union (IoU) similarity, is employed as the similarity-based reward function. Given the sampled output $\hat{y}$ and the ground-truth $y$, the similarity-based reward is computed by:

$$R_1(\tau) = \frac{|\mathrm{LCS}(\hat{y}, y)|}{|\hat{y}| + |y| - |\mathrm{LCS}(\hat{y}, y)|},$$

where $\mathrm{LCS}(\hat{y}, y)$ means the longest common substring between $\hat{y}$ and $y$, and $|\cdot|$ represents the length of a sequence. Compared with exact matching, such a reward alleviates the reward sparsity issue.
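A minimal sketch of this reward is given below, assuming the longest common substring is a contiguous run of action tokens and the union is the sum of both lengths minus that overlap. The function names are introduced here for illustration, and returning 1.0 for two empty sequences is a convention of this sketch.

```python
def longest_common_substring_length(a, b):
    """Length of the longest contiguous run of tokens shared by a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ends at the previous tokens.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def similarity_reward(predicted, target):
    """IoU-style similarity between two action sequences."""
    overlap = longest_common_substring_length(predicted, target)
    union = len(predicted) + len(target) - overlap
    # Two empty sequences are treated as identical (a convention of this sketch).
    return overlap / union if union else 1.0


print(similarity_reward(["WALK", "WALK"], ["WALK", "WALK"]))
print(similarity_reward(["WALK", "JUMP"], ["WALK", "WALK"]))
print(similarity_reward(["RUN"], ["WALK"]))
```

A partially correct output such as WALK JUMP still earns a non-zero reward, which is how this measure eases the sparsity of exact matching.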

Simplicity-based Reward   Inspired by Occam’s Razor principle that “the simplest solution is most likely the right one”, we encourage our model to learn the fewest kinds of recognizable DstExps. In other words, we encourage the model to fully utilize variables and be more generalizable. Taking “jump twice” as an illustration, “jump twice” → JUMP JUMP and “$x twice” → $X $X both result in correct outputs. Intuitively, the latter is more generalizable as it enables Solver to reuse learned recognizable DstExps, more in line with Occam’s Razor. Concretely, when understanding a novel input “walk twice”, “$x twice” → $X $X can be reused. Denoting $N$ as the number of steps where the recognizable DstExp only contains destination variables, we design a reward $R_2(\tau)$ based on $N$ as a measure of simplicity. The final reward function is a linear summation $R(\tau) = R_1(\tau) + \gamma R_2(\tau)$, where $R_1(\tau)$ is the similarity-based reward and $\gamma$ is a hyperparameter.

4.3 Curriculum Learning

One typical strategy for improving model generalization capacity is to use curriculum learning, which arranges examples from easy to hard in training Lyu and Tsang (2019); Ada et al. (2019). Motivated by it, we divide the training into different lessons according to the length of the input sequence. Our model starts training on the simplest lesson, and then the lesson complexity gradually increases. Besides, as done in literature Chen et al. (2017), we accumulate training data from previous lessons to avoid catastrophic forgetting.
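The lesson schedule can be sketched as follows. The sketch assumes one lesson per distinct input length, measured in whitespace-separated tokens, and assumes that each lesson also keeps all shorter examples. The function name `build_lessons` and the toy examples are illustrative.

```python
def build_lessons(examples):
    """Group (command, actions) pairs into cumulative lessons by input length."""
    lengths = sorted({len(command.split()) for command, _ in examples})
    lessons = []
    for max_length in lengths:
        # Accumulate data from all previous lessons to limit forgetting.
        lesson = [ex for ex in examples if len(ex[0].split()) <= max_length]
        lessons.append(lesson)
    return lessons


# Toy training pairs (illustrative only).
examples = [
    ("jump twice", "JUMP JUMP"),
    ("walk", "WALK"),
    ("walk and jump", "WALK JUMP"),
]
lessons = build_lessons(examples)
for index, lesson in enumerate(lessons, start=1):
    print(index, [command for command, _ in lesson])
```

Training would then iterate over `lessons` in order, moving to the next lesson once the current one is learned well enough by whatever criterion is chosen.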

5 Experiments

In this section, we conduct a series of experiments to evaluate our model on various compositional tasks mentioned in Sec. 2. We then verify the importance of each component via a thorough ablation study. Finally we present two real cases to illustrate our model concretely. More implementation details of our model can be found in the supplementary material.

5.1 Experimental Setup

Task   Here we introduce the tasks used in our experiments. Systematicity is evaluated on Add Jump, Around Right and Length of SCAN Lake and Baroni (2018), while distribution-based systematicity is assessed on the MCD splits of SCAN Keysers et al. (2020). MCD uses a nondeterministic algorithm to split examples into the train set and the test set. By using different random seeds, it introduces three tasks: MCD1, MCD2, and MCD3. Productivity is evaluated on the SCAN-ext dataset. Besides them, we also conduct experiments on the Simple task of SCAN, which requires no compositional generalization capacity, and the Limit task of MiniSCAN Lake et al. (2019), which evaluates if models can learn compositional generalization when given limited training data. We follow previous works to split datasets for all tasks.

Baselines   We consider a range of state-of-the-art models on SCAN compositional tasks as our baselines. In terms of the usage of extra resources, we divide them into two groups: (i) No Extra Resources includes vanilla sequence to sequence with attention (Seq2Seq) Lake and Baroni (2018); Loula et al. (2018), convolutional sequence to sequence (CNN) Dessì and Baroni (2019), Transformer Vaswani et al. (2017), Universal Transformer Dehghani et al. (2019), Syntactic Attention Russin et al. (2019) and Compositional Generalization for Primitive Substitutions (CGPS) Li et al. (2019). (ii) Using Extra Resources consists of Good Enough Compositional Data Augmentation (GECA) Andreas (2019), meta sequence to sequence (Meta Seq2seq) Lake (2019), equivariant sequence to sequence (Equivariant Seq2seq) Gordon et al. (2020) and program synthesis Nye et al. (2020). Details of these baselines can be found in Sec. 6.

Extra Resources Model Simple Add Jump Around Right Length
None Seq2Seq Lake and Baroni (2018); Loula et al. (2018)
CNN Dessì and Baroni (2019)
Syntactic Attention Russin et al. (2019)
CGPS  Li et al. (2019)
LAnE (Ours)
Data Augmentation GECA Andreas (2019) - -
Permutation-based Augmentation Meta Seq2Seq Lake (2019) -
Manually Designed Local Groups Equivariant Seq2Seq Gordon et al. (2020)
Manually Designed Meta Grammar Program Synthesis  Nye et al. (2020)
Table 1: Test accuracies of systematicity assessment on the SCAN dataset. All results of LAnE are obtained by averaging over runs, the same for Tab. 2 and Tab. 3.

5.2 Experimental Results

Experiment 1: Systematicity on SCAN   As shown in Tab. 1, LAnE achieves 100% test accuracies on all tasks. Compared with state-of-the-art baselines without extra resources, LAnE achieves significantly higher performance. Even in contrast to baselines with extra resources, LAnE is highly competitive, suggesting that LAnE is capable of learning human prior knowledge to some extent. Although program synthesis Nye et al. (2020) also achieves perfect accuracies, it heavily depends on a predefined meta-grammar in which substantial task-related knowledge is encoded. As far as we know, LAnE is the first neural model to pass all tasks without extra resources.

Model MCD1 MCD2 MCD3
Seq2Seq Keysers et al. (2020)
Transformer Keysers et al. (2020)
Universal Transformer  Keysers et al. (2020)
LAnE (Ours)
Model Limit
Human Lake et al. (2019)
Meta Seq2Seq
LAnE (Ours)

Table 2: Test accuracies of the distribution-based systematicity assessment on the SCAN dataset (left) and the Limit task on the MiniSCAN dataset (right).

Experiment 2: Distribution-based Systematicity on SCAN   LAnE also achieves 100% accuracies on the more challenging distribution-based systematicity tasks (see Tab. 2). Comparing Tab. 1 and Tab. 2, one can find that LAnE maintains a stable and perfect performance regardless of the task, while the strong baseline CGPS shows a sharp drop. Furthermore, to the best of our knowledge, LAnE is also the first model to pass the assessment of distribution-based systematicity on SCAN.

Figure 4: The input length distributions on the train and test sets of SCAN-ext (left) and test accuracies of various methods on different input lengths (right).

Experiment 3: Productivity   As shown in Fig. 4, there is a sharp divergence between input lengths of train and test set on SCAN-ext, suggesting it is a feasible benchmark for productivity. From the results (right), one can find that test accuracies of baselines are mainly ruled by the frequency of input lengths in the train set. In contrast, LAnE maintains a perfect trend as the input length increases, indicating it has productive generalization capabilities. Furthermore, the trend suggests the potential of LAnE on tackling inputs with unbounded length.

Experiment 4: Compositional Generalization on MiniSCAN   Tab. 2 (right) shows the performance of various methods given limited training data, and LAnE remains highly effective. Without extra resources such as the permutation-based augmentation employed by Meta Seq2Seq, our model performs perfectly, i.e. 100% on the Limit task. Compared with the human performance reported by Lake et al. (2019), our model is, to a certain extent, close to the human ability to learn compositional generalization from few examples. However, this does not imply that either our model or Meta Seq2Seq triumphs over humans in terms of compositional generalization, as the Limit task is relatively simple.

5.3 Closer Analysis

Variant Simple Add Jump Length Around Right MCD1 MCD2 MCD3
w/o Composer
w/o Curriculum Learning
w/o Simplicity-based Reward
Table 3: Test accuracies of different variants in all tasks on the SCAN dataset.

We conduct a thorough ablation study in Tab. 3 to verify the effectiveness of each component. “w/o Composer” ablates the check process of Composer, making our model degenerate into a tree to sequence model, which employs a Tree-LSTM to build trees and encode input sequences dynamically. “w/o Curriculum Learning” means training our model on the full train set from the beginning. As the results show, ablating either of them causes an enormous performance drop, indicating the necessity of Composer and curriculum learning. “w/o Simplicity-based Reward”, which only considers the similarity-based reward, fails on several tasks such as Around Right. We attribute its failure to its inability to learn sufficiently general recognizable DstExps from the data. As for the differential update, we compare the results of several learning rate combinations in Fig. 5. As indicated, our designed differential update strategy is essential for successful convergence and high test accuracies. Meanwhile, LAnE does not rely heavily on a particular combination of learning rates, suggesting its robustness. Last, we present learned tree structures of two real cases in Fig. 6. Observing that “twice” behaves differently under different contexts, it is non-trivial to produce such trees.

Figure 5: Accuracies on train set (left) and test set (right) under different learning rate combinations.

Figure 6: Learned tree structures in Composer of two real cases.

6 Related Work

The most related work is the line of research exploring compositional generalization in neural networks, which has attracted considerable attention across different topics in recent years. Under the topic of mathematical reasoning, Veldhoen et al. (2016) explored the algebraic compositionality of neural networks via simple arithmetic expressions, and Saxton et al. (2019) pushed the area forward by probing if the standard Seq2Seq model can resolve complex mathematical problems. Under the topic of logical inference, previous works were devoted to testing the ability of neural networks to infer logical relations between pairs of artificial language utterances Bowman et al. (2015); Mul and Zuidema (2019). Our work instead focuses on the compositionality in languages, benchmarked by the SCAN compositional tasks Lake and Baroni (2018).

As for the SCAN compositional tasks, there have been several attempts. Inspired by work in neuroscience which suggests disjoint processing of syntax and semantics, Russin et al. (2019) proposed the Syntactic Attention model. Analogously, Li et al. (2019) employed different representations for primitives and functions respectively (CGPS). Unlike their separate representations, our proposed Composer and Solver can be seen as separate at the module level. There are also some works which impose prior knowledge of compositionality via extra resources. Andreas (2019) presented a data augmentation technique to enhance standard approaches (GECA). Lake (2019) argued for achieving compositional generalization by meta learning, and thus employed a Meta Seq2Seq model with a memory mechanism. Regarding the memory mechanism, our work is similar to theirs. However, their training process, namely permutation training, requires handcrafted data augmentation. In a follow-up paper Nye et al. (2020), they argued for generalizing via the paradigm of program synthesis. Despite the nearly perfect performance, it also requires a predefined meta-grammar in which substantial knowledge is encoded. Meanwhile, based on group-equivariance theory, Gordon et al. (2020) predefined local groups to make models aware of equivariance between verbs or directions (Equivariant Seq2Seq). The biggest difference between our work and theirs is that we do not utilize any extra resource.

Our work is also related to those which apply RL to language. In this sense, using language as the abstraction for HRL Jiang et al. (2019) is the most related work. They proposed to use sentences as the sub-goal for the low-level policy in vision-based tasks, while we employ recognizable SrcExps as the sub-goal. In addition, the applications of RL to language involve topics such as natural language generation Dethlefs and Cuayáhuitl (2011), conversational semantic parsing Liu et al. (2019) and text classification Zhang et al. (2018).

7 Conclusion & Future Work

In this paper, we propose to achieve compositional generalization by learning analytical expressions. Motivated by work in cognition, we present a memory-augmented neural model which contains two cooperative neural modules, Composer and Solver. These two modules are trained in an end-to-end manner via a hierarchical reinforcement learning algorithm. Experiments on a well-known benchmark demonstrate that our model solves all challenges addressed by previous works with 100% accuracies, surpassing existing baselines significantly. For future work, we plan to extend our model to a recently proposed compositional task CFQ Keysers et al. (2020) and more realistic applications.

Broader Impact

This work explores compositional generalization in neural networks, which is a fundamental problem in artificial intelligence but is not yet involved in real applications. Therefore, there are no foreseeable societal consequences or ethical concerns.


References

  • S. E. Ada, E. Ugur, and H. L. Akin (2019) Generalization in transfer learning. arXiv preprint arXiv:1909.01331. Cited by: §4.3.
  • J. Andreas (2019) Good-enough compositional data augmentation. CoRR abs/1904.09545. External Links: 1904.09545 Cited by: §5.1, Table 1, §6.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §1, §3.2.
  • M. Baroni (2019) Linguistic generalization and compositionality in modern artificial neural networks. CoRR abs/1904.00157. External Links: 1904.00157 Cited by: §1, §2, §3.1.
  • A. G. Barto and S. Mahadevan (2003) Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems 13 (1-2), pp. 41–77. Cited by: §4.
  • S. R. Bowman, C. D. Manning, and C. Potts (2015) Tree-structured composition in neural networks without tree-structured architectures. In Proceedings of the NIPS Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches co-located with the 29th Annual Conference on Neural Information Processing Systems (NIPS 2015), Montreal, Canada, December 11-12, 2015, T. R. Besold, A. S. d’Avila Garcez, G. F. Marcus, and R. Miikkulainen (Eds.), CEUR Workshop Proceedings, Vol. 1583. Cited by: §6.
  • X. Chen, C. Liu, and D. Song (2017) Towards Synthesizing Complex Programs from Input-Output Examples. arXiv e-prints, pp. arXiv:1706.01284. External Links: 1706.01284 Cited by: §4.3.
  • N. Chomsky (1957) Syntactic structures (the hague: mouton, 1957). Review of Verbal Behavior by BF Skinner, Language 35, pp. 26–58. Cited by: §1.
  • M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2019) Universal transformers. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §5.1.
  • R. Dessì and M. Baroni (2019) CNNs found to jump around more skillfully than RNNs: compositional generalization in seq2seq convolutional networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3919–3923. External Links: Document Cited by: §1, §5.1, Table 1.
  • N. Dethlefs and H. Cuayáhuitl (2011) Combining hierarchical reinforcement learning and Bayesian networks for natural language generation in situated dialogue. In Proceedings of the 13th European Workshop on Natural Language Generation, Nancy, France, pp. 110–120. Cited by: §6.
  • J. A. Fodor and Z. W. Pylyshyn (1988) Connectionism and cognitive architecture - a critical analysis. Cognition 28 (1-2), pp. 3–71. Cited by: §1.
  • J. A. Fodor and E. Lepore (2002) The compositionality papers. Oxford University Press. Cited by: §1, §3.1.
  • J. Gordon, D. Lopez-Paz, M. Baroni, and D. Bouchacourt (2020) Permutation equivariant models for compositional generalization in language. In International Conference on Learning Representations, Cited by: §1, §5.1, Table 1, §6.
  • S. Havrylov, G. Kruszewski, and A. Joulin (2019) Cooperative learning of disjoint syntax and semantics. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1118–1128. External Links: Document Cited by: Appendix D.
  • D. Hupkes, V. Dankers, M. Mul, and E. Bruni (2020) Compositionality decomposed: how do neural networks generalise?. J. Artif. Intell. Res. 67, pp. 757–795. External Links: Document Cited by: §2.
  • Y. Jiang, S. Gu, K. Murphy, and C. Finn (2019) Language as an abstraction for hierarchical deep reinforcement learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 9414–9426. Cited by: §6.
  • D. Keysers, N. Schärli, N. Scales, H. Buisman, D. Furrer, S. Kashubin, N. Momchev, D. Sinopalnikov, L. Stafiniak, T. Tihon, D. Tsarkov, X. Wang, M. van Zee, and O. Bousquet (2020) Measuring compositional generalization: a comprehensive method on realistic data. In International Conference on Learning Representations, Cited by: §2, §2, §5.1, Table 2, §7.
  • B. M. Lake and M. Baroni (2018) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 2879–2888. Cited by: §1, §2, §2, §5.1, §5.1, Table 1, §6.
  • B. M. Lake, T. Linzen, and M. Baroni (2019) Human few-shot learning of compositional instructions. In Proceedings of the 41th Annual Meeting of the Cognitive Science Society, CogSci 2019: Creativity + Cognition + Computation, Montreal, Canada, July 24-27, 2019, pp. 611–617. Cited by: §2, §5.1, §5.2, Table 2.
  • B. M. Lake (2019) Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems 32, pp. 9791–9801. Cited by: §1, §5.1, Table 1, §6.
  • Y. Li, L. Zhao, J. Wang, and J. Hestness (2019) Compositional generalization for primitive substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4293–4302. External Links: Document Cited by: §1, §5.1, Table 1, §6.
  • Q. Liu, B. Chen, H. Liu, J. Lou, L. Fang, B. Zhou, and D. Zhang (2019) A split-and-recombine approach for follow-up query analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5316–5326. External Links: Document Cited by: §6.
  • J. Loula, M. Baroni, and B. Lake (2018) Rearranging the familiar: testing compositional generalization in recurrent networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 108–114. External Links: Document Cited by: §5.1, Table 1.
  • Y. Lyu and I. W. Tsang (2019) Curriculum loss: robust learning and generalization against label corruption. arXiv preprint arXiv:1905.10045. Cited by: §4.3.
  • T. Mikolov, A. Joulin, and M. Baroni (2016) A roadmap towards machine intelligence. In International Conference on Intelligent Text Processing and Computational Linguistics, pp. 29–61. Cited by: §1.
  • M. Mul and W. H. Zuidema (2019) Siamese recurrent networks learn first-order logic reasoning and exhibit zero-shot compositional generalization. CoRR abs/1906.00180. Cited by: §6.
  • M. I. Nye, A. Solar-Lezama, J. B. Tenenbaum, and B. M. Lake (2020) Learning compositional rules via neural program synthesis. CoRR abs/2003.05562. External Links: 2003.05562 Cited by: §5.1, §5.2, Table 1, §6.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8026–8037. Cited by: Appendix D.
  • J. Russin, J. Jo, R. C. O’Reilly, and Y. Bengio (2019) Compositional generalization in a deep seq2seq model by separating syntax and semantics. CoRR abs/1904.09708. External Links: 1904.09708 Cited by: §1, §5.1, Table 1, §6.
  • D. Saxton, E. Grefenstette, F. Hill, and P. Kohli (2019) Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, Cited by: §6.
  • S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus (2015) End-to-end memory networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2440–2448. Cited by: §3.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112. Cited by: §1.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. Müller (Eds.), pp. 1057–1063. Cited by: §4.1.
  • Z. G. Szabó (2017) Compositionality. In The Stanford Encyclopedia of Philosophy, E. N. Zalta (Ed.), Cited by: §1, §2.
  • K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp. 1556–1566. External Links: Document Cited by: Appendix A, §3.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §5.1.
  • S. Veldhoen, D. Hupkes, and W. H. Zuidema (2016) Diagnostic classifiers revealing how neural networks process hierarchical structure. In CoCo@NIPS, Cited by: §6.
  • L. Weaver and N. Tao (2001) The optimal reward baseline for gradient-based reinforcement learning. In UAI ’01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, University of Washington, Seattle, Washington, USA, August 2-5, 2001, J. S. Breese and D. Koller (Eds.), pp. 538–545. Cited by: §4.1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, pp. 229–256. External Links: Document Cited by: §4.1.
  • M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701. External Links: 1212.5701 Cited by: Appendix D, §4.1.
  • T. Zhang, M. Huang, and L. Zhao (2018) Learning structured representation for text classification via reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 6053–6060. Cited by: §6.

Appendix A Tree-LSTM Encoding

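As mentioned in Sec. 3, a Tree-LSTM Tai et al. [2015] model is employed to accomplish the merge process in Composer. Similar to LSTM, Tree-LSTM uses gate mechanisms to control the flow of information from child nodes to parent nodes. Meanwhile, it maintains a hidden state and a cell state analogously. Denoting $\mathbf{r}_i^t$ as the node representation of the $i$-th node at layer $t$, it consists of the hidden state vector $\mathbf{h}_i^t$ and the cell state vector $\mathbf{c}_i^t$. For any parent node, its node representation ($\mathbf{r}_i^t$, $t > 0$) can be obtained by merging its left child node representation $\mathbf{r}_i^{t-1}$ and right child node representation $\mathbf{r}_{i+1}^{t-1}$ as:

$$
\begin{aligned}
\begin{bmatrix} \mathbf{i} \\ \mathbf{f}_l \\ \mathbf{f}_r \\ \mathbf{o} \\ \mathbf{g} \end{bmatrix}
&= \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}
\left( \mathbf{W}_{\text{tree}} \begin{bmatrix} \mathbf{h}_i^{t-1} \\ \mathbf{h}_{i+1}^{t-1} \end{bmatrix} + \mathbf{b}_{\text{tree}} \right), \\
\mathbf{c}_i^t &= \mathbf{f}_l \odot \mathbf{c}_i^{t-1} + \mathbf{f}_r \odot \mathbf{c}_{i+1}^{t-1} + \mathbf{i} \odot \mathbf{g}, \\
\mathbf{h}_i^t &= \mathbf{o} \odot \tanh(\mathbf{c}_i^t),
\end{aligned}
$$

where $\mathbf{W}_{\text{tree}}$ is a learnable matrix, $\mathbf{b}_{\text{tree}}$ is a learnable vector, $\sigma$ and $\tanh$ are activation functions, and $\odot$ represents the element-wise product. As for leaf nodes, their representations ($\mathbf{r}_i^0$) can be obtained by applying the leaf transformation on the embeddings of their corresponding elements (e.g. $x, after) as:

$$
\begin{aligned}
\begin{bmatrix} \mathbf{i} \\ \mathbf{o} \\ \mathbf{g} \end{bmatrix}
&= \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}
\left( \mathbf{W}_{\text{leaf}} \, \mathrm{Emb}(w_i) + \mathbf{b}_{\text{leaf}} \right), \\
\mathbf{c}_i^0 &= \mathbf{i} \odot \mathbf{g}, \\
\mathbf{h}_i^0 &= \mathbf{o} \odot \tanh(\mathbf{c}_i^0),
\end{aligned}
$$

where $\mathbf{W}_{\text{leaf}}$ is a learnable matrix, $\mathbf{b}_{\text{leaf}}$ is a learnable vector, $w_i$ is the $i$-th element of $w$, and $\mathrm{Emb}(w_i)$ represents the word embedding if $w_i$ is a word, otherwise the key vector of the source domain variable $w_i$.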
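As a rough illustration of the merge and leaf transformations described in this appendix, the following pure-Python sketch assumes the common binary Tree-LSTM gate layout (input gate, left and right forget gates, output gate, candidate). The function names, the ordering of gates inside the weight matrix, and the list-based vectors are assumptions made for this example and are not taken from the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, x):
    """Return W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def tree_lstm_merge(left, right, W, b):
    """Merge two children, each a (hidden, cell) pair of lists, into a parent pair.

    Assumed layout of the 5*d pre-activations:
    [input gate, left forget gate, right forget gate, output gate, candidate].
    """
    (h_l, c_l), (h_r, c_r) = left, right
    d = len(h_l)
    z = affine(W, b, h_l + h_r)  # W has shape (5*d, 2*d)
    i = [sigmoid(v) for v in z[0:d]]
    f_l = [sigmoid(v) for v in z[d:2 * d]]
    f_r = [sigmoid(v) for v in z[2 * d:3 * d]]
    o = [sigmoid(v) for v in z[3 * d:4 * d]]
    g = [math.tanh(v) for v in z[4 * d:5 * d]]
    c = [f_l[k] * c_l[k] + f_r[k] * c_r[k] + i[k] * g[k] for k in range(d)]
    h = [o[k] * math.tanh(c[k]) for k in range(d)]
    return h, c


def leaf_transform(embedding, W_leaf, b_leaf):
    """Map an element embedding to a leaf (hidden, cell) pair.

    Assumed layout of the 3*d pre-activations: [input gate, output gate, candidate].
    """
    d = len(b_leaf) // 3
    z = affine(W_leaf, b_leaf, embedding)  # W_leaf has shape (3*d, len(embedding))
    i = [sigmoid(v) for v in z[0:d]]
    o = [sigmoid(v) for v in z[d:2 * d]]
    g = [math.tanh(v) for v in z[2 * d:3 * d]]
    c = [i[k] * g[k] for k in range(d)]
    h = [o[k] * math.tanh(c[k]) for k in range(d)]
    return h, c
```

In this sketch the embedding passed to `leaf_transform` would be a word embedding for a word and a key vector for a source domain variable; only the hidden states of the children enter the gates, while the cell states are mixed through the two forget gates.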

Appendix B Details about Policy

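In the following, we will explain the high-level policy and the low-level policy in detail. For the sake of clarity, we simplify $s_t$, $G_t$ and $a_t$ as $s$, $G$ and $a$, respectively.

High-level policy   Given $s$, the high-level agent picks $G$ according to the high-level policy $\pi_\theta(G \mid s)$ parameterized by $\theta$. As mentioned in Sec. 3, $G$ is obtained by applying in turn the merge and check process. Denoting the decisions made in the merge and check process at layer $t$ as $M_t$ and $C_t$, they are governed by parameters $\theta_M$ and $\theta_C$, respectively. A high-level action $G$ is indeed a sequence of $M_t$ and $C_t$ as $G = (M_1, C_1, \dots, M_T, C_T)$, where $T$ represents the highest layer. Therefore, $\pi_\theta(G \mid s)$ is expanded as:

$$
\pi_\theta(G \mid s) = \prod_{t=1}^{T} \pi_{\theta_M}(M_t \mid s, M_{<t}, C_{<t}) \, \pi_{\theta_C}(C_t \mid s, M_{\le t}, C_{<t}),
$$

where $\pi_{\theta_M}$ is implemented by a Tree-LSTM with a learnable query vector $\mathbf{q}$ (mentioned in Sec. 3.2). Assuming there are $K$ parent node candidates for layer $t$, $M_t$ is a one-hot vector drawn from a $K$-dimensional categorical distribution with the weight $\mathbf{p}_M$. For the $i$-th parent node candidate, represented by $\mathbf{r}_i^t$, its selection probability $p_M^i$ is computed by normalizing over all merging scores (mentioned in Sec. 3.2) as:

$$
p_M^i = \frac{\exp\left(\langle \mathbf{q}, \mathbf{h}_i^t \rangle\right)}{\sum_{k=1}^{K} \exp\left(\langle \mathbf{q}, \mathbf{h}_k^t \rangle\right)}.
$$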
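As for $C_t$ in the check process, it follows a Bernoulli distribution with expectation $p_C = \sigma(\mathbf{w}_C^\top \mathbf{h}^t + b_C)$, where $\mathbf{h}^t$ is the hidden state of the parent node selected by $M_t$ and $\mathbf{w}_C$, $b_C$ are learned parameters. $p_C$ is indeed the trigger probability mentioned in Sec. 3.2.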
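The two kinds of high-level decisions can be sketched as follows. This is a minimal illustration, assuming that the merging scores are normalized with a softmax and that both the scores and the trigger probability are already available as plain numbers; the helper names are hypothetical and the neural components that would produce these numbers are omitted.

```python
import math
import random


def softmax(scores):
    """Normalize merging scores into selection probabilities."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_merge(scores, rng):
    """Sample the index of one parent node candidate; return (index, log_prob)."""
    probs = softmax(scores)
    r = rng.random()
    cumulative = 0.0
    for k, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return k, math.log(p)
    # Guard against floating-point rounding in the cumulative sum.
    return len(probs) - 1, math.log(probs[-1])


def sample_check(trigger_prob, rng):
    """Sample the Bernoulli check decision; return (fired, log_prob)."""
    fired = rng.random() < trigger_prob
    p = trigger_prob if fired else 1.0 - trigger_prob
    return fired, math.log(p)


def high_level_log_prob(layer_log_probs):
    """Sum merge and check log-probabilities over layers.

    layer_log_probs is a list of (merge_log_prob, check_log_prob) pairs,
    one per layer, mirroring the product form of the high-level policy.
    """
    return sum(lp_m + lp_c for lp_m, lp_c in layer_log_probs)
```

Summing log-probabilities per layer corresponds to taking the logarithm of the product over merge and check decisions, which is the quantity a policy-gradient update would differentiate.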

Low-level policy   When the high-level action $G$ is determined, the low-level agent is triggered to output $a$ according to the low-level policy $\pi_\phi(a \mid G, s)$. The policy is implemented by an LSTM-based sequence to sequence network with an attention mechanism, i.e.,

$$
\pi_\phi(a \mid G, s) = \prod_{j=1}^{n} \pi_\phi(a_j \mid a_{<j}, G, s),
$$

where $n$ is the number of decoding steps and $a_j$ represents an action word (e.g. JUMP), or a destination variable (e.g. $Y) which will be replaced by its corresponding constant DstExp stored in Memory. At each decoding step, $a_j$ is sampled from a categorical distribution, whose sample space consists of all action words and destination variables with non-empty value slots.
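The restriction of the sample space and the replacement of destination variables can be illustrated with a small sketch. It assumes, purely for illustration, that Memory is a dictionary from destination variable names to lists of action words, with an empty list standing for an empty value slot; the function names are hypothetical.

```python
def sample_space(action_words, memory):
    """Return the tokens the decoder may emit at one step.

    These are all action words plus every destination variable whose
    value slot in memory is non-empty.
    """
    filled = [var for var, value in memory.items() if value]
    return list(action_words) + filled


def resolve(tokens, memory):
    """Replace each destination variable by the DstExp stored in memory."""
    output = []
    for token in tokens:
        if token in memory:
            output.extend(memory[token])
        else:
            output.append(token)
    return output


# Hypothetical memory state: $X already holds a DstExp, $Y is still empty.
memory = {"$X": ["JUMP"], "$Y": []}
candidates = sample_space(["JUMP", "WALK"], memory)
resolved = resolve(["$X", "$X"], memory)
```

Here `candidates` excludes `$Y` because its slot is empty, and decoding the two-token sequence `$X $X` yields the constant action sequence stored for `$X` repeated twice.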

Appendix C Chain Rule Derivation

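Looking back to Eq. 2, the parameters $\theta$ and $\phi$ can be optimized by ascending the following gradient:

$$
\nabla_{\theta,\phi} J(\theta, \phi) = \mathbb{E}_{\tau \sim \pi_{\theta,\phi}} \left[ R(\tau) \, \nabla_{\theta,\phi} \log \pi_{\theta,\phi}(\tau) \right],
$$

where the policy $\pi_{\theta,\phi}(\tau)$ can be further decomposed into a sequence of actions and state transitions:

$$
\pi_{\theta,\phi}(\tau) = p(s_1) \prod_{t} \pi_{\theta,\phi}(G_t, a_t \mid s_t) \, p(s_{t+1} \mid s_t, G_t, a_t).
$$

Consider that the low-level action $a_t$ is conditioned on the high-level action $G_t$, which means that $\pi_{\theta,\phi}(G_t, a_t \mid s_t) = \pi_\theta(G_t \mid s_t) \, \pi_\phi(a_t \mid G_t, s_t)$, and thus $\pi_{\theta,\phi}(\tau)$ can be expanded as:

$$
\pi_{\theta,\phi}(\tau) = p(s_1) \prod_{t} \pi_\theta(G_t \mid s_t) \, \pi_\phi(a_t \mid G_t, s_t) \, p(s_{t+1} \mid s_t, G_t, a_t).
$$

Since the state at step $t+1$ is fully determined by the state and actions at step $t$, not dependent on the policy parameters $\theta$ and $\phi$, the gradients of $\log p(s_1)$ and $\log p(s_{t+1} \mid s_t, G_t, a_t)$ with respect to $\theta$ and $\phi$ are $0$. Therefore, $\nabla_{\theta,\phi} J(\theta, \phi)$ can be expanded as:

$$
\nabla_{\theta,\phi} J(\theta, \phi) = \mathbb{E}_{\tau \sim \pi_{\theta,\phi}} \left[ R(\tau) \sum_{t} \left( \nabla_{\theta,\phi} \log \pi_\theta(G_t \mid s_t) + \nabla_{\theta,\phi} \log \pi_\phi(a_t \mid G_t, s_t) \right) \right].
$$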
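The separation of the high-level and low-level terms can be checked numerically on a toy problem. The sketch below assumes tabular softmax policies over a handful of discrete states and actions, which is far simpler than the neural policies used in the paper; all names and the table layout are hypothetical. It computes the single-trajectory estimate of the gradient with respect to the high-level logits, to which the low-level and transition terms contribute nothing.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def trajectory_log_prob(theta, phi, trajectory):
    """Policy-dependent part of log pi(tau).

    theta[s] holds high-level logits over G in state s, phi[s][G] holds
    low-level logits over a. Transition terms are omitted because they do
    not depend on theta or phi.
    """
    total = 0.0
    for s, G, a in trajectory:
        total += log_softmax(theta[s])[G]   # high-level term
        total += log_softmax(phi[s][G])[a]  # low-level term
    return total


def grad_theta(theta, trajectory, reward):
    """Single-trajectory estimate R(tau) * sum_t grad_theta log pi_theta(G_t | s_t).

    For a softmax over logits, the gradient of the log-probability of the
    chosen index is (one_hot - probabilities).
    """
    grad = [[0.0] * len(row) for row in theta]
    for s, G, _ in trajectory:
        probs = [math.exp(lp) for lp in log_softmax(theta[s])]
        for k, p in enumerate(probs):
            grad[s][k] += reward * ((1.0 if k == G else 0.0) - p)
    return grad
```

Comparing `grad_theta` against a finite-difference derivative of `reward * trajectory_log_prob(...)` with respect to each high-level logit is one way to confirm that the low-level term drops out of the high-level gradient; the analogous statement holds for the low-level logits.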


Appendix D Implementation Details

Dataset SCAN SCAN-ext MiniSCAN
Simple Add Jump Around Right Length MCD (//) Extend Limit
Train Size
Test Size
Table 4: The dataset splits for all tasks.

Our model is implemented in PyTorch Paszke et al. [2019]. All experiments use the same hyperparameters. Dimensions of word embeddings, hidden states, key vectors and value vectors are set as . Hyperparameters and are set as and respectively. All parameters are randomly initialized and updated via the AdaDelta Zeiler [2012] optimizer, with a learning rate of for Composer and for Solver. Meanwhile, as done in previous works Havrylov et al. [2019], we introduce a regularization term to prevent our model from overfitting in the early stage of training. Its weight is set to at the beginning, and exponentially anneals with a rate as the lesson increases. The source code has been submitted as part of the supplementary material. As for data splits, we split each dataset into the train set and the test set for all tasks according to previous works. More details about train and test sizes can be seen in Tab. 4. More specifically, except for the task Limit, we further randomly take training data as the development set to tune the hyperparameters, with the rest being the train set.
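The exponential annealing of the regularization weight over curriculum lessons can be sketched as below. The function name, the 1-based lesson index, and the numeric values in the usage lines are placeholders chosen for illustration and are not the paper's settings.

```python
def regularization_weight(lesson, initial_weight, anneal_rate):
    """Weight of the regularization term at a given curriculum lesson.

    Assumes lessons are numbered from 1 and that the weight is multiplied
    by anneal_rate each time the lesson increases by one.
    """
    if lesson < 1:
        raise ValueError("lesson must be a positive integer")
    return initial_weight * anneal_rate ** (lesson - 1)


# Placeholder values for illustration only.
schedule = [regularization_weight(k, initial_weight=0.1, anneal_rate=0.5) for k in range(1, 5)]
```

With an annealing rate below one, the regularizer dominates in the early lessons and fades as training moves to later lessons, matching its stated purpose of preventing overfitting in the early stage of training.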