
What they do when in doubt: a study of inductive biases in seq2seq learners

by   Eugene Kharitonov, et al.

Sequence-to-sequence (seq2seq) learners are widely used, but we still have only limited knowledge about what inductive biases shape the way they generalize. We address that by investigating how popular seq2seq learners generalize in tasks that have high ambiguity in the training data. We use SCAN and three new tasks to study learners' preferences for memorization, arithmetic, hierarchical, and compositional reasoning. Further, we connect to Solomonoff's theory of induction and propose to use description length as a principled and sensitive measure of inductive biases. In our experimental study, we find that LSTM-based learners can learn to perform counting, addition, and multiplication by a constant from a single training example. Furthermore, Transformer and LSTM-based learners show a bias toward the hierarchical induction over the linear one, while CNN-based learners prefer the opposite. On the SCAN dataset, we find that CNN-based, and, to a lesser degree, Transformer- and LSTM-based learners have a preference for compositional generalization over memorization. Finally, across all our experiments, description length proved to be a sensitive measure of inductive biases.



1 Introduction

Sequence-to-sequence (seq2seq) learners (Sutskever et al., 2014) have demonstrated remarkable performance in machine translation, story generation, and open-domain dialog (Sutskever et al., 2014; Fan et al., 2018; Adiwardana et al., 2020). These capabilities, however, come at the cost of the models being hardly interpretable black boxes, which has spurred interest in better understanding their inner workings (Lake and Baroni, 2017; Gordon et al., 2019; Yu et al., 2019; McCoy et al., 2020).

In this work, we focus on studying inductive biases of seq2seq models. We start from the observation that, generally, multiple explanations can be consistent with a limited training set, each leading to different predictions on unseen data. A learner might systematically prefer one type of explanation over another as a result of its inductive biases (Ritter et al., 2017; Feinman and Lake, 2018).

To illustrate the setup we work in, consider a quiz-like question: if $f(3)$ maps to $6$, what does $f(4)$ map to? The “training” example is consistent with the following answers: $6$ ($f(x) \equiv 6$); $7$ ($f(x) = x + 3$); $8$ ($f(x) = 2 \cdot x$); any number $z$, since we can always construct a function such that $f(3) = 6$ and $f(4) = z$. By analyzing the learner’s output on this new input, we can infer its biases.

This example demonstrates how biases of learners are studied through the lens of the poverty of the stimulus principle (Chomsky, 1965, 1980): if nothing in the training data indicates that a learner should generalize in a certain way, but it does nonetheless, then this is due to the biases of the learner. Inspired by the work of Zhang et al. (2019) in the image domain, we take this principle to the extreme and study biases of seq2seq learners in the regime of very few training examples, often as few as one. Under this setup, we propose three new synthetic tasks that probe seq2seq learners’ preferences for memorization-, arithmetic-, and hierarchy-based “reasoning”. We complement these tasks with SCAN (Lake and Baroni, 2017), a well-known benchmark for compositional generalization.

Next, we connect to the ideas of Solomonoff’s theory of induction (Solomonoff, 1964) and Minimal Description Length (Rissanen, 1978; Grunwald, 2004) and propose to use description length, under a learner, as a principled measure of its inductive biases.

Our experimental study shows that the standard seq2seq learners have strikingly different inductive biases. We find that LSTM-based learners are able to learn non-trivial counting-, multiplication-, and addition-based rules from as little as one example. CNN-based seq2seq learners would prefer linear over hierarchical generalizations, while LSTM-based ones and Transformers would do just the opposite. When experimenting with SCAN, description length proved to be a more sensitive measure of inductive biases than the “intuitive” approach. Equipped with this measure, we found that CNN-, LSTM- and Transformer-based learners prefer, to different degrees, the compositional generalization over the memorization one.

2 Searching for inductive biases

To formalize the way we look for inductive biases of a learner $M$, we consider a training dataset of input/output pairs, $T = \{(x_i, y_i)\}_{i=1}^{n}$, and a hold-out set of inputs, $H = \{x_i\}_{i=n+1}^{k}$. W.l.o.g., we assume that there are two candidate “rules” explaining the training data, $C_1$ and $C_2$, such that $C_1(x_i) = C_2(x_i) = y_i$ for every training pair, but, generally, $C_1(H) \neq C_2(H)$. After we fit the learner $M$ on the training data, its bias toward $C_1$ or $C_2$ is inferred by analysing the similarity in the output space between $M(H)$ and $C_1(H)$ or $C_2(H)$, respectively. We call this approach “intuitive”.

Typically, those measures of similarity in the output-space are task-specific. McCoy et al. (2020) used accuracy of the first term, Zhang et al. (2019) used correlation and MSE, and Lake and Baroni (2017) used accuracy calculated on the entire output sequence.

We too start with an accuracy-based measure. We define the fraction of perfect agreement (FPA) between a learner $M$ and a candidate generalization rule $C$ as the fraction of seeds that generalize perfectly in agreement with that rule on the hold-out set $H$. That is, the larger the FPA of $M$ w.r.t. $C$, the more biased $M$ is toward $C$. However, FPA neither considers imperfect generalization, nor allows direct comparison between two candidate rules, $C_1$ and $C_2$, when both are dominated by a third candidate rule, $C_3$. Hence, below we propose a principled approach based on the description length.
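The FPA computation can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the list-of-tokens representation, and the assumption that each seed's greedy outputs on the hold-out inputs have already been collected are all choices made for this sketch.

```python
from typing import Callable, List, Sequence

Tokens = List[str]


def fraction_of_perfect_agreement(
    seed_outputs: Sequence[Sequence[Tokens]],
    rule: Callable[[Tokens], Tokens],
    holdout_inputs: Sequence[Tokens],
) -> float:
    """Fraction of seeds whose outputs match `rule` on every hold-out input.

    seed_outputs[s][i] is the output of the learner trained with seed s
    on holdout_inputs[i] (hypothetical layout, chosen for this sketch).
    """
    if not seed_outputs:
        raise ValueError("need at least one seed")
    expected = [rule(x) for x in holdout_inputs]
    perfect = 0
    for outputs in seed_outputs:
        if len(outputs) != len(expected):
            raise ValueError("one output per hold-out input is required")
        # A seed counts only if it agrees with the rule on all inputs.
        if all(list(o) == e for o, e in zip(outputs, expected)):
            perfect += 1
    return perfect / len(seed_outputs)
```

A seed that agrees with the rule on all but one hold-out input contributes nothing, which is exactly the insensitivity to imperfect generalization noted above.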

Description Length and Inductive Biases At the core of the theory of induction (Solomonoff, 1964) is the question of continuation of a finite string that is very similar to our setup. Indeed, we can easily re-formulate our motivating example as a string continuation problem: “”. The solution proposed by Solomonoff (1964) is to select the continuation that admits “the simplest explanation” of the entire string, i.e. that is produced by programs of the shortest length (description length).

Our intuition is that if, for a learner, a continuation is “simple,” then this learner is biased toward it. We consider a learner $M$ to be biased toward $C_1$ over $C_2$ if the training set and its extension according to $C_1$ have a shorter description length (for $M$) than the training set and its extension according to $C_2$. Denoting the description length of a dataset $D$ under the learner $M$ as $L_M(D)$, we hypothesise that if $L_M(C_1(T \cup H)) < L_M(C_2(T \cup H))$, then $M$ is biased toward $C_1$.

This definition is seemingly different from the “intuitive” one that we used above. At the end of this Section, we discuss connections of description length to other ways of measuring inductive biases.

Calculating Description Length (Footnote 1: We refer the reader to Supplementary for a more detailed derivation.) To find the description length of data under a fixed learner, we use the online (prequential) code (Rissanen, 1978; Grunwald, 2004; Blier and Ollivier, 2018). We are given a learner $M$ and a dataset $D = \{(x_i, y_i)\}_{i=1}^{k}$. W.l.o.g. we fix some order over $D$. We assume that, given an input $x$, the learner produces a probability distribution $p_M(y \mid x)$ over the space of the outputs $y$. Sequences are composed of tokens from vocabulary $V$.

The problem of calculating $L_M(D)$ is cast as a problem of transferring the outputs $y_i$ one-by-one over a channel in a compressed form, where a sender and a receiver support identical instances of $M$. At each step $t$, the sender and the receiver update their copies using the inputs and the outputs transmitted so far, including the newly transmitted $y_{t-1}$, to get $M_{t-1}$. Next, the sender uses its updated copy $M_{t-1}$ to compress and transmit $y_t$. The very first output $y_1$ can be sent by using not more than $c = |y_1| \cdot \log |V|$ nats, using a naïve encoding. (Footnote 2: As we are interested in comparing candidate generalization rules, the value of the additive constant $c$ is not important, as it is learner- and candidate-independent. In experiments, we subtract it from all measurements.) The cumulative number of nats transmitted is:

$$L_M(D) = -\sum_{t=2}^{k} \log p_{M_{t-1}}(y_t \mid x_t) + c. \tag{A}$$

The obtained code length of Eq. A depends on the order in which the outputs are transmitted and the procedure we use to update $M$. To account for that, we average out the data order by training with multiple random seeds. Further, for larger datasets, full re-training after adding a new example is impractical and, in such cases, examples can be transmitted in blocks.
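The prequential code described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `prequential_length` and `make_counting_trainer` are hypothetical names, the "learner" is a toy add-one-smoothed frequency model that ignores its input, and the constant cost of the first block is omitted, as it is when comparing candidate rules.

```python
import math
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]                 # (input, output)
LogProb = Callable[[str, str], float]     # log p(output | input), in nats
Trainer = Callable[[Sequence[Example]], LogProb]


def prequential_length(blocks: Sequence[Sequence[Example]], train: Trainer) -> float:
    """Nats needed to transmit all blocks after the first one.

    Before each block the learner is re-trained from scratch on everything
    transmitted so far; the block is then encoded with that learner.
    """
    seen: List[Example] = list(blocks[0])  # first block: constant cost, omitted
    total = 0.0
    for block in blocks[1:]:
        log_prob = train(seen)
        total -= sum(log_prob(x, y) for x, y in block)
        seen.extend(block)
    return total


def make_counting_trainer(outputs: Sequence[str]) -> Trainer:
    """Toy learner: add-one-smoothed output frequencies, input ignored."""
    def train(examples: Sequence[Example]) -> LogProb:
        counts = {o: 1 for o in outputs}
        for _, y in examples:
            counts[y] += 1
        total = sum(counts.values())
        return lambda x, y: math.log(counts[y] / total)
    return train


train = make_counting_trainer(["A", "B"])
same = [[("x1", "A")], [("x2", "A")], [("x3", "A")]]    # continuation repeats A
other = [[("x1", "A")], [("x2", "B")], [("x3", "B")]]   # continuation switches to B
length_same = prequential_length(same, train)
length_other = prequential_length(other, train)
```

For this toy learner, the continuation that repeats the training output has the shorter code, so under the description-length criterion the learner is biased toward it.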

If we measure the description length of the training data $T$ shuffled with the hold-out data $H$, both datasets would have symmetric roles. However, there is a certain asymmetry in the extrapolation problem: we are looking for an extrapolation from $T$, not vice versa. To break this symmetry, we always transmit outputs for the entire training data as the first block. Note that if we consider a case where we first transmit the training outputs as the first block and all of the hold-out data outputs under a candidate rule $C$, $C(H)$, as the second block, then the description length reduces to measuring the cross-entropy of the learner on the hold-out data. In this case, we recover a process akin to the “intuitive” measuring of inductive biases. However, in our case, as described in Eq. A, $L_M$ also captures whether a learner is capable of finding regularity in the hold-out data fast, with few data points; hence it also represents the speed-of-learning ideas for measuring inductive biases (Chaabouni et al., 2019).
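The two-block special case mentioned above can be written out explicitly. The notation here is an assumption of this sketch: $T$ is the training set, $H$ the set of hold-out inputs, $C$ a candidate rule, $M_1$ the learner after training on $T$, and $c$ the constant cost of the first block.

```latex
L_M \;=\; c \;-\; \sum_{x \in H} \log p_{M_1}\bigl(C(x) \mid x\bigr)
```

The sum is the unnormalized cross-entropy, on the hold-out data labelled by $C$, of the learner trained only on $T$. When the hold-out outputs are instead sent in several blocks, each block is encoded by a learner re-trained on all earlier blocks, which is what makes the measure sensitive to how quickly the regularity is picked up.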

Apart from being connected to the existing approaches, description length, as a measure of inductive biases, has two attractive properties: (a) it is domain-independent, i.e. it can be applied, for example, in the image domain, and (b) it allows comparisons across models that account for model complexity. However, it requires the learner to have a probabilistic output.

3 Tasks

We describe the four tasks that we use to study inductive biases of seq2seq learners. We select those tasks to cover different aspects of learners’ behavior. For each task, we investigate learners’ generalization when trained on a highly ambiguous training set. In such a scenario, there are infinitely many consistent generalization rules (as described in our motivating example of Section 1). We, however, pre-select only several candidate rules highlighting biases that are useful for language processing and known to exist in humans, or are otherwise reasonable. Remarkably, our experiments show that these rules cover many cases of the actual learners’ behavior.

The first two tasks study biases in arithmetic reasoning: Count-or-Memorization quantifies learners’ preference for counting vs. a simple memorization and Add-or-Multiply further probes the learners’ preferences in arithmetic operations. We believe these tasks are interesting, as counting is needed in some NLP problems like processing linearized parse trees (Weiss et al., 2018). The third task, Hierarchical-or-Linear, contrasts hierarchical and linear inductive reasoning. The hierarchical reasoning bias is believed to be fundamental in learning some syntactical rules in human acquisition of syntax (Chomsky, 1965, 1980). Finally, we investigate biases for systematic compositionality, which is central for human generalization capabilities in language. For that, we use a well-known benchmark, SCAN (Lake and Baroni, 2017).

By $a^l$ we denote a sequence that contains token $a$ repeated $l$ times. For training, we represent sequences in a standard way for seq2seq models: the tokens are one-hot-encoded separately, and we append a special end-of-sequence token to each sequence. Input and output vocabularies are disjoint.

Count-or-Memorization: In this task, we contrast learners’ preferences for counting vs. memorization. We train models to fit a single training example with input $a^l$ and output $b^l$ (i.e., to perform the mapping $a^l \to b^l$) and test them on inputs $a^m$ for a range of lengths $m$. If a learner learns the constant function, outputting $b^l$ independently of its inputs, then it follows the mem strategy. On the other hand, if it generalizes to the $a^m \to b^m$ mapping, then the learner is biased toward the count strategy.
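The two candidate rules of Count-or-Memorization are simple enough to write down directly. The sketch below is illustrative only: sequences are represented as Python lists of tokens, the token names `a` and `b` and the function names are assumptions, and the end-of-sequence token is left out.

```python
from typing import Callable, List

Tokens = List[str]


def make_input(length: int, token: str = "a") -> Tokens:
    """Input sequence: `token` repeated `length` times."""
    return [token] * length


def count_rule(x: Tokens, out_token: str = "b") -> Tokens:
    """count: emit as many output tokens as there are input tokens."""
    return [out_token] * len(x)


def make_mem_rule(train_output: Tokens) -> Callable[[Tokens], Tokens]:
    """mem: always emit the memorized training output, whatever the input."""
    return lambda x: list(train_output)


train_length = 3                                   # illustrative value
train_input = make_input(train_length)
train_output = count_rule(train_input)
mem_rule = make_mem_rule(train_output)

test_input = make_input(5)
count_prediction = count_rule(test_input)          # five output tokens
mem_prediction = mem_rule(test_input)              # three output tokens
```

Both rules reproduce the single training pair, so only the hold-out inputs can tell them apart.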

Add-or-Multiply: This task is akin to the motivating example in Section 1. The single training example represents a mapping of an input string $a^l$ to an output string $b^{2l}$. As test inputs, we generate $a^m$ for $m$ in a fixed interval. We consider the learned rule to be consistent with mul if, for all $m$, the input/output pairs are consistent with $a^m \to b^{2m}$. Similarly, if they are consistent with $a^m \to b^{m+l}$, we say that the learner follows the addition rule, add. Finally, the learner can learn a constant mapping $a^m \to b^{2l}$ for any $m$. Again, we call this rule mem.
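The three candidate rules of Add-or-Multiply can be sketched in the same style. The sketch assumes, for illustration, that the single training example maps l input tokens to 2l output tokens; that factor, the token name, and the function names are assumptions of this example rather than details confirmed by the text here.

```python
from typing import List

Tokens = List[str]
OUT = "b"  # hypothetical output token


def mul_rule(x: Tokens, train_length: int) -> Tokens:
    """mul: output length is twice the input length."""
    return [OUT] * (2 * len(x))


def add_rule(x: Tokens, train_length: int) -> Tokens:
    """add: output length is the input length plus the training length."""
    return [OUT] * (len(x) + train_length)


def mem_rule(x: Tokens, train_length: int) -> Tokens:
    """mem: always emit the memorized training output of length 2 * l."""
    return [OUT] * (2 * train_length)


l = 3                                   # illustrative training length
train_input = ["a"] * l
test_input = ["a"] * (l + 1)
predicted_lengths = {
    rule.__name__: len(rule(test_input, l)) for rule in (mul_rule, add_rule, mem_rule)
}
```

With l = 3 the rules agree on the training input (six output tokens each) and disagree on an input of length four: eight, seven, and six output tokens for mul, add, and mem respectively.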

Hierarchical-or-Linear: For a fixed depth $d$, we train learners on four training examples $x^d y x^d \to y$, where $x, y \in \{a, b\}$. (Footnote 3: This mapping then consists of four combinations: $a^d b a^d \to b$; $a^d a a^d \to a$; $b^d a b^d \to a$; and $b^d b b^d \to b$.) Each training example has a nested structure, where $d$ defines its depth. A learner with a hierarchical bias (hierar) would output the middle symbol. We also consider the linear rule (linear), in which the learner outputs the $(d+1)$-th symbol of its input.
To probe learners’ biases, we test them on inputs $x^m y x^m$ with different depths $m$. Note that to examine the linear rule (i.e. whether the learner outputs the $(d+1)$-th symbol of any test input of depth $m$), we need $m \geq d/2$. Similar to the previous tasks, there is no vocabulary sharing between a model’s inputs and outputs (input and output tokens $a$ and $b$ are different).
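The hierar and linear rules can be contrasted with a small sketch. It is illustrative only: inputs are assumed to consist of one symbol repeated `depth` times, a middle symbol, and the same repetition again; the upper-case output tokens stand in for the disjoint output vocabulary; all names are hypothetical.

```python
from itertools import product
from typing import List

Tokens = List[str]
OUT = {"a": "A", "b": "B"}  # hypothetical map to a disjoint output vocabulary


def make_input(outer: str, middle: str, depth: int) -> Tokens:
    """Nested input: `outer` * depth, then `middle`, then `outer` * depth."""
    return [outer] * depth + [middle] + [outer] * depth


def hierar_rule(seq: Tokens) -> Tokens:
    """hierar: output the middle symbol of the input."""
    return [OUT[seq[len(seq) // 2]]]


def linear_rule(seq: Tokens, train_depth: int) -> Tokens:
    """linear: output the (train_depth + 1)-th symbol of the input."""
    if len(seq) <= train_depth:
        # The input is too short for the linear rule to be defined.
        raise ValueError("linear rule needs at least train_depth + 1 symbols")
    return [OUT[seq[train_depth]]]


d = 4                                              # illustrative training depth
train_set = [make_input(x, y, d) for x, y in product("ab", repeat=2)]
deeper = make_input("a", "b", d + 2)
hierar_prediction = hierar_rule(deeper)            # middle symbol
linear_prediction = linear_rule(deeper, d)         # fifth symbol
```

An input of depth m has 2m + 1 symbols, so the linear rule is defined only when 2m + 1 is at least d + 1; this is where the lower bound on the test depth comes from.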

SCAN: SCAN (Lake and Baroni, 2017) is a dataset and a set of benchmarks used for studying systematic generalization of seq2seq learners. In SCAN, inputs are sequences that represent trajectories and outputs are step-wise instructions for following these trajectories (see Figure 1). We experiment with the SCAN-jump split of the dataset, where the test set (7706 examples) is obtained by filtering all compositional uses of one of the primitives, jump. The train set (14670 examples) contains all uses of all other primitives (including compositional ones), and lines where jump occurs in isolation. This makes the training data ambiguous only for the jump instruction. However, learners can transfer compositional rules across the primitives. We refer to this split simply as SCAN.
We consider two interpretable generalizations: (1) all input sequences with jump are mapped to a single instruction JUMP as memorized from the training examples, or (2) underlying compositional grammar. We call them mem and comp, respectively. We used the original test as a representative of the comp candidate explanation and generated one for mem ourselves.

jump → JUMP
turn left twice → LTURN LTURN
jump opposite left after walk around left → LTURN WALK LTURN WALK LTURN WALK LTURN WALK LTURN LTURN JUMP
Figure 1: Examples of SCAN trajectories and instructions, adapted from Lake and Baroni (2017).
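The mem candidate for SCAN, in which every command containing jump is mapped to the single instruction JUMP, can be generated with a few lines. This is a sketch of that idea, not the authors' script; the function names and the whitespace-tokenized string format are assumptions.

```python
from typing import List, Sequence, Tuple


def mem_candidate(command: str) -> str:
    """mem: any command that contains the primitive `jump` maps to JUMP."""
    if "jump" not in command.split():
        raise ValueError("the mem candidate is only defined for commands with jump")
    return "JUMP"


def build_mem_test_set(commands: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each held-out command with its output under the mem candidate."""
    return [(command, mem_candidate(command)) for command in commands]


mem_test_set = build_mem_test_set(
    ["jump twice", "jump opposite left after walk around left"]
)
```

The comp candidate needs no such generation step, since the original test set already encodes the compositional outputs.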

4 Methodology

4.1 Sequence-to-sequence learners

We experiment with three standard seq2seq models: LSTM-based seq2seq (LSTM-s2s) (Sutskever et al., 2014), CNN-based seq2seq (CNN-s2s) (Gehring et al., 2017), and Transformer (Vaswani et al., 2017). All share a similar Encoder/Decoder architecture (Sutskever et al., 2014). In the main text, we report experiments using models with comparable hyperparameter values. We study the effect of different hyperparameters in Supplementary, and show that their variations have little impact on learners’ biases.

LSTM-s2s Both Encoder and Decoder are implemented as LSTM cells (Hochreiter and Schmidhuber, 1997). Encoder encodes its inputs incrementally from left to right. We experiment with architectures without (LSTM-s2s no att.) and with (LSTM-s2s att.) an attention mechanism (Bahdanau et al., 2014). For the first three tasks, both Encoder and Decoder are single-layer LSTM cells with hidden size of 512 and embedding of dimension 16.
For SCAN, we chose the architecture used in Lake and Baroni (2017). In particular, Encoder and Decoder are two hidden layers with 200 units and embedding of dimension 32. This architecture gets 0.98 test accuracy on the i.i.d. (simple) split of SCAN, averaged over 10 seeds.

CNN-s2s Encoder and Decoder are convolutional networks (LeCun et al., 1990), followed by GLU non-linearities (Dauphin et al., 2017) and an attention layer. To represent positions of input tokens, CNN-s2s uses learned positional embeddings. For the first three tasks, both convolutional networks have one layer with 512 filters and a kernel width of 3. We set the embedding size to 16.
For SCAN, we use one of the successful CNN-s2s models of Dessì and Baroni (2019) that has 5 layers and an embedding size of 128. We also vary the kernel size in {3, 5, 8}, as it was found to impact the performance on SCAN (Dessì and Baroni, 2019). These architectures reach test accuracy above 0.98 on the i.i.d. split of SCAN, averaged over 10 seeds.

Transformer Encoder and Decoder are implemented as a sequence of (self-)attention and feed-forward layers. We use sinusoidal position embedding. When considering the first three tasks, both Encoder and Decoder contain one transformer layer. The attention modules have 8 heads, feed-forward layers have dimension of 512 and the embedding is of dimension 16.
We believe our work is the first study of Transformer’s performance on SCAN. For both Encoder and Decoder, we use 8 attention heads, 4 layers, embedding size of 64, FFN layer dimension of 256. This architecture gets 0.94 test accuracy on the i.i.d. split of SCAN, averaged over 10 seeds.

4.2 Training and evaluation

For all tasks except SCAN, we follow the same training procedure. We train with Adam optimizer (Kingma and Ba, 2014) for epochs. The learning rate starts at and increases for the first warm-up updates till reaching . We include all available examples in a single batch. We use teacher forcing (Goodfellow et al., 2016). We set the dropout probability to 0.5 (we report experiments with other values in Supplementary). For each learner, we perform training and evaluation 100 times, changing random seeds. When generating sequences to calculate FPA, we select the next token greedily. We use the model implementations from fairseq (Ott et al., 2019).
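A learning-rate warm-up of the kind described above can be sketched as follows. The linear shape of the increase, the function name, and all numeric values in the usage lines are assumptions made for illustration; the text only states that the rate increases during the warm-up updates until it reaches its final value.

```python
def warmup_lr(step: int, start_lr: float, peak_lr: float, warmup_updates: int) -> float:
    """Learning rate at `step`, rising linearly from start_lr to peak_lr.

    After `warmup_updates` steps the rate stays at peak_lr.
    """
    if warmup_updates <= 0 or step >= warmup_updates:
        return peak_lr
    return start_lr + (peak_lr - start_lr) * step / warmup_updates


# Placeholder values, chosen only to make the shape easy to read.
schedule = [warmup_lr(step, start_lr=0.0, peak_lr=1.0, warmup_updates=10) for step in (0, 5, 10, 25)]
```

With a single batch per epoch, as in the setup above, one update corresponds to one epoch, so the warm-up length can be read directly in epochs.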

As discussed in Section 2, when calculating the description length, we use the training examples as the first transmitted block. In Count-or-Memorization and Add-or-Multiply this block contains one example, and in Hierarchical-or-Linear it has 4 examples. Next, we transmit examples obtained from the candidate rules in a randomized order, in blocks of size 1, 1, and 4 for Count-or-Memorization, Add-or-Multiply, and Hierarchical-or-Linear, respectively. At each step, the learner is re-trained from the same initialization, using the procedure and hyper-parameters discussed above.
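The transmission schedule just described, with the training examples first and the candidate-rule examples following in shuffled fixed-size blocks, can be sketched as below. The function name and the use of Python's `random.Random` for shuffling are assumptions of this sketch.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, str]


def make_blocks(
    train_examples: Sequence[Example],
    candidate_examples: Sequence[Example],
    block_size: int,
    seed: int,
) -> List[List[Example]]:
    """First block: all training examples. Remaining blocks: the candidate-rule
    examples in a seed-dependent random order, `block_size` at a time."""
    if block_size < 1:
        raise ValueError("block_size must be positive")
    rest = list(candidate_examples)
    random.Random(seed).shuffle(rest)
    blocks = [list(train_examples)]
    blocks += [rest[i:i + block_size] for i in range(0, len(rest), block_size)]
    return blocks


train = [("a a a", "b b b")]
candidates = [(str(i), str(i)) for i in range(5)]
blocks = make_blocks(train, candidates, block_size=2, seed=0)
```

Averaging the resulting code length over several seeds then averages out the dependence on the transmission order.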

For SCAN, we follow the same scenario as above with two differences: (a) when calculating the description length, we use blocks of size 1024, (b) during training, we sample batches with replacement, similar to Lake and Baroni (2017). We use batches of size 16 for LSTM-s2s and CNN-s2s, and 256 for Transformer. We repeat the training/evaluation of each learner 10 times, varying the random seed. Again, we use Adam optimizer and a learning rate of .

5 Experiments

Count-or-Memorization We investigate here learners’ biases toward the count and mem rules. We provide a single example $a^l \to b^l$ as the training set, varying $l$. We report the learners’ performance in Table 1(a). We observe that, independently of the length of the training example $l$, CNN-s2s and Transformer learners inferred the mem rule perfectly, with FPA-mem > 0.90 (i.e. more than 90% of the random seeds output $b^l$ for any given input $a^m$).

However, LSTM-based learners demonstrate a more complex behavior. With , both learners (with and without attention) exhibit a preference for mem. Indeed, while these learners rarely generalize perfectly to any of the hypothesis ( FPA (no att.), / FPA for mem/count (att.)), they have significantly lower -mem. As increases, LSTM-based learners become more biased toward count. Surprisingly, for , most learner instances show sufficiently strong inductive biases to infer perfectly the non-trivial count hypothesis. With , of random seeds of LSTM-s2s att. and all (100%) of LSTM-s2s no att. seeds generalized perfectly to count.

Further, we see that while the description length $L$ shows similar trends, it has a higher sensitivity. For example, in a setting where both LSTM-based learners have a similar FPA, $L$ demonstrates that LSTM-s2s no att. has a stronger count bias.

Add-or-Multiply In this task, we examine learners’ generalization after training on the single example $a^l \to b^{2l}$. We vary $l$. In Table 1(b), we report FPA and description length $L$ for the three generalization hypotheses, add, mul, and mem. We observe, similarly to the previous task, that CNN-s2s and Transformer learners always converge perfectly to memorization.

In contrast, LSTM-based learners show non-trivial generalizations. Examining first LSTM-s2s att., when =, we note that mem has a high FPA and an considerably lower than others. This is consistent with the learner’s behavior in the Count-or-Memorization task. As we increase , more interesting behavior emerges. First, -mem decreases as increases. Second, mul-type preference increases with . Finally, -add presents a U-shaped function of . That is, for the medium example length , the majority of learners switch to approximating the add rule (for ). However, when grows further, a considerable fraction of these learners start to follow a mul-type rule. Strikingly, of LSTM-s2s att. seeds generalized perfectly to the non-trivial mul rule. As for LSTM-s2s no att., we do not observe a strong bias to infer any of the rules when =. However, when increasing , the LSTM-s2s no att. behaves similarly to LSTM-s2s att.: at first it has a preference for add (FPA-add=0.95, for =) then for mul (e.g. FPA-mul=0.94, for =).

Hierarchical-or-Linear We look now at learners’ preference for either hierar or linear generalizations. The architectures we use were only able to consistently learn the training examples with the depth not higher than . Hence, in this experiment, we set to .

We report in Table 1(c) the learners’ FPA and description length $L$. We observe that CNN-s2s exhibits a strikingly different bias compared to all other learners, with a perfect agreement with the linear rule. In contrast, Transformer learners show a clear preference for hierar, with a high FPA (0.69) and a low $L$ (1.21). Surprisingly, this preference increases with the embedding size, and Transformers with a larger embedding size reach an FPA-hierar of 1.00 (see Supplementary for more details). LSTM-s2s att. learners also demonstrate a similar preference for hierar, with an FPA of 0.30 and a considerably lower $L$ for hierar than for linear. Finally, while only a small fraction of LSTM-s2s no att. instances generalized perfectly to hierar (and none to linear), $L$ confirms their considerable preference for the hierar hypothesis.

Table 1: Results for (a) Count-or-Memorization, (b) Add-or-Multiply, (c) Hierarchical-or-Linear, and (d) SCAN, reporting FPA (accuracy for SCAN) and description length $L$ in nats for each candidate rule and learner. (a-c): FPA measures the fraction of seeds that generalize according to a particular rule. (d): Accuracy is averaged across seeds and examples. Description length is averaged across examples and seeds. The lowest $L$ are in bold, and statistically significant differences in $L$ are marked (paired t-test).

SCAN Next, we turn to the larger-scale SCAN task. Due to the large number of test examples and the low performance of the learners, all FPA scores would be equal to zero. Hence, we follow Lake and Baroni (2017) and use per-sequence accuracy. We report our results in Table 1(d). First, we see that CNN-s2s learners have a strong preference for comp, both in accuracy and description length. Furthermore, we observe that with an increase in the kernel size, the description length of mem increases, while the description length of comp decreases, indicating that the preference for comp over mem grows with the kernel width. We believe this echoes the findings of Dessì and Baroni (2019).
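Per-sequence accuracy, as opposed to FPA, gives partial credit across examples but none within an example. A minimal sketch, with a hypothetical function name and token-list representation:

```python
from typing import List, Sequence

Tokens = List[str]


def sequence_accuracy(predictions: Sequence[Tokens], references: Sequence[Tokens]) -> float:
    """Fraction of examples whose predicted sequence equals the reference exactly."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally many predictions and references, at least one")
    correct = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return correct / len(references)


accuracy = sequence_accuracy(
    [["JUMP"], ["LTURN", "LTURN"]],
    [["JUMP"], ["LTURN", "JUMP"]],
)
```

Averaging this quantity over seeds gives a score that stays informative when no single seed matches a candidate rule on the entire test set.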

While accuracy is well below for all other learners/candidate combinations (and rounded to 0.00), according to the description length, Transformer and LSTM-based learners also have preference for comp over mem. This can be due to the transfer from compositional training examples, that can make comp explanation most “simple” given the dataset. Hence, the failure for systematic generalization in SCAN comes not from learners’ preferences for mem. We believe this resonates with the initial qualitative analysis in (Lake and Baroni, 2017).

Overall, across all the above experiments, we see that seq2seq learners demonstrate strikingly different biases. In many cases, these biases lead to non-trivial generalizations when facing ambiguity in the training data. This spans tasks that probe for memorization, arithmetic, hierarchical, and compositional reasoning. We found that a single example is sufficient for LSTM-based learners to learn counting, addition, and multiplication. Moreover, within the same task, they can switch from one explanation to another, depending on the training example length, with Add-or-Multiply being the task where this switch happens twice. In contrast, CNN-s2s and Transformers show a strong bias toward memorization. (Footnote 5: In Supplementary, we verify that such preferences remain when studying biases toward multiplication by 3.) Furthermore, all learners except for CNN-s2s demonstrate a strong bias toward the hierarchical behavior. In the task of compositional generalization (SCAN), our results not only confirm earlier results, but also indicate that the standard seq2seq learners actually prefer compositional generalization over memorization.

We see that the conclusions derived from comparing the description length of the candidate rules are in agreement with the results under accuracy-based metrics, but provide a more nuanced picture.

6 Background

There have been dramatic advances in language processing using seq2seq models. Yet, these models have long been criticized for requiring a tremendous amount of data while, in the end, lacking systematic generalization (Lake and Baroni, 2017; Dupoux, 2018; Loula et al., 2018; Bastings et al., 2018). In contrast, humans rely on their inductive biases to generalize from a limited amount of data (Chomsky, 1965; Lake et al., 2019). Because of the centrality of humans’ biases in language learning, several works have studied seq2seq inductive biases, connecting their poor generalization to their lack of the “right” biases (Lake and Baroni, 2017; Lake et al., 2019; McCoy et al., 2020). However, these works have mostly relied on complex tasks that assume the knowledge of different factors of language, such as semantics. This makes it harder to connect the failures of seq2seq learners with their biases. We differ from this line of work by investigating seq2seq biases in a more focused setup. Our approach follows the ideas of Zhang et al. (2019) in the vision domain, but, in contrast, we study seq2seq learners and consider inductive biases that are crucial for language learning in humans. We demonstrate that, when considering our proposed tasks, some seq2seq learners show strong human-like biases and generalize perfectly to behaviors useful for language learning.

Another line of research theoretically investigates learners’ capabilities, that is, the classes of hypotheses that a learner can discover (Siegelmann and Sontag, 1992; Weiss et al., 2018; Merrill et al., 2020). For example, Weiss et al. (2018) demonstrated that LSTM cells can count (unlike, e.g., GRU cells). In turn, we demonstrate that LSTM-based seq2seq learners are not only capable of but also biased toward arithmetic behavior.

7 Discussion and Conclusion

In this work, we studied inductive biases of standard seq2seq learners, Transformer-, LSTM-, and CNN-based. To do so, we used one well-known task and introduced three new ones, which allowed us to cover an interesting spectrum of behaviors useful for language learning. In particular, we considered arithmetic, hierarchical, and compositional “reasoning”. Next, we connected the problem of finding and measuring inductive biases to Solomonoff’s theory of induction and proposed to use a dataset’s description length under a learner as a tool for sensitive measurement of inductive biases.

In our experiments, we found that the seq2seq learners have strikingly different inductive biases and some of them generalize non-trivially when facing ambiguity. For instance, a single training example is sufficient for LSTM-based learners to learn perfectly how to count, to add, and to multiply by a constant. Transformers and, to a lesser degree, LSTM-s2s demonstrated preferences for the hierarchical bias, a bias that has been argued to govern children’s acquisition of syntax. Interestingly, such biases arose with no explicit wiring for them. Our results thus support the theory of Elman et al. (1998), which states that humans’ inductive biases can arise from low-level architectural constraints in the brain with no need for an explicit encoding of a linguistic structure. However, how the brain, or, more generally, a learner is wired to admit a specific inductive bias is still an important open question.

Across our experiments, we also observed that description length is consistent with “intuitive” measurements of inductive biases, and, at the same time, it turned out to be more sensitive. This also indicates that, in the presence of ambiguity in the training data, a learner is more likely to follow the alternative with the shorter description length (i.e. the simplest one) when applied on unseen data, showing consistency with the prescriptions of the theory of induction (Solomonoff, 1964). A similar simplicity preference is argued to play a role in human language acquisition (Perfors et al., 2011).

Our findings can provide guidance for architecture selection in low-data regimes, where inductive biases might have a higher influence on a model’s generalization performance. Large sparse datasets can also benefit from predictable behavior in few-shot scenarios akin to what we consider.

Our work paves the way for multiple future directions. It would be interesting to understand what drives the “switching” behavior in LSTM-s2s: why, as the training example gets longer, do learners switch from memorization-based to less trivial explanations? Why, in Add-or-Multiply, does LSTM-s2s switch from memorization to addition and then to multiplication? Is it some pressure for explanations with small coefficients, or is it due to the “computation time” that longer sequences require?

Finally, our results demonstrate that relatively large deep learning models can generalize non-trivially from as little as one example, as long as the task is aligned with their inductive biases. We believe this should reinforce interest in future work on injecting useful inductive biases in our learners and, we hope, our findings and setup can provide a fertile ground for such work.


The authors are grateful to Marco Baroni, Emmanuel Dupoux, Emmanuel Chemla and participants of the EViL meeting for their feedback on our work.


  • D. Adiwardana, M. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al. (2020) Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. Cited by: §1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §4.1.
  • J. Bastings, M. Baroni, J. Weston, K. Cho, and D. Kiela (2018) Jump to better conclusions: SCAN both left and right. In EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Cited by: §6.
  • L. Blier and Y. Ollivier (2018) The description length of deep learning models. In NeurIPS, Cited by: Appendix A, §2.
  • L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. SIAM Review 60 (2), pp. 223–311. Cited by: Appendix D.
  • R. Chaabouni, E. Kharitonov, A. Lazaric, E. Dupoux, and M. Baroni (2019) Word-order biases in deep-agent emergent communication. arXiv preprint arXiv:1905.12330. Cited by: §2.
  • N. Chomsky (1965) Aspects of the theory of syntax. Vol. 11, MIT press. Cited by: §1, §3, §6.
  • N. Chomsky (1980) Rules and representations: behavioral and brain sciences. New York: Harcourt Brace Jovanovich, Inc. Cited by: §1, §3.
  • Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In ICML, Cited by: §4.1.
  • R. Dessì and M. Baroni (2019) CNNs found to jump around more skillfully than rnns: compositional generalization in seq2seq convolutional networks. In ACL, Cited by: §4.1, §5.
  • E. Dupoux (2018) Cognitive science in the era of artificial intelligence: a roadmap for reverse-engineering the infant language-learner. Cognition 173, pp. 43–59. Cited by: §6.
  • J. L. Elman, E. A. Bates, M. H. Johnson, A. Karmiloff-Smith, K. Plunkett, and D. Parisi (1998) Rethinking innateness: a connectionist perspective on development. Vol. 10, MIT press. Cited by: §7.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. In ACL, Cited by: §1.
  • R. Feinman and B. M. Lake (2018) Learning inductive biases with simple neural networks. arXiv preprint arXiv:1802.02745. Cited by: §1.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In ICML, Cited by: §4.1.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §4.2.
  • J. Gordon, D. Lopez-Paz, M. Baroni, and D. Bouchacourt (2019) Permutation equivariant models for compositional generalization in language. In ICLR, Cited by: §1.
  • P. Grunwald (2004) A tutorial introduction to the minimum description length principle. arXiv preprint math/0406077. Cited by: §1, §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix D, §4.2.
  • B. M. Lake and M. Baroni (2017) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350. Cited by: §1, §1, §2, Figure 1, §3, §3, §4.1, §4.2, §5, §5, §6.
  • B. M. Lake, T. Linzen, and M. Baroni (2019) Human few-shot learning of compositional instructions. arXiv preprint arXiv:1901.04587. Cited by: §6.
  • Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel (1990) Handwritten digit recognition with a back-propagation network. In NeurIPS, Cited by: §4.1.
  • J. Loula, M. Baroni, and B. M. Lake (2018) Rearranging the familiar: testing compositional generalization in recurrent networks. arXiv preprint arXiv:1807.07545. Cited by: §6.
  • D. J. MacKay (2003) Information theory, inference and learning algorithms. Cambridge university press. Cited by: Appendix A.
  • R. T. McCoy, R. Frank, and T. Linzen (2020) Does syntax need to grow on trees? sources of hierarchical inductive bias in sequence-to-sequence networks. arXiv preprint arXiv:2001.03632. Cited by: §1, §2, §6.
  • W. Merrill, G. Weiss, Y. Goldberg, R. Schwartz, N. A. Smith, and E. Yahav (2020) A formal hierarchy of rnn architectures. External Links: 2004.08500 Cited by: §6.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In NAACL-HLT 2019: Demonstrations, Cited by: §4.2.
  • A. Perfors, J. B. Tenenbaum, and T. Regier (2011) The learnability of abstract syntactic principles. Cognition 118 (3), pp. 306–338. Cited by: §7.
  • J. Rissanen (1978) Modeling by shortest data description. Automatica 14 (5), pp. 465–471. Cited by: §1, §2.
  • S. Ritter, D. G. Barrett, A. Santoro, and M. M. Botvinick (2017) Cognitive psychology for deep neural networks: a shape bias case study. In ICML, Cited by: §1.
  • H. T. Siegelmann and E. D. Sontag (1992) On the computational power of neural nets. In COLT, Cited by: §6.
  • R. J. Solomonoff (1964) A formal theory of inductive inference. part i. Information and control 7 (1), pp. 1–22. Cited by: §1, §2, §7.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NeurIPS, Cited by: §1, §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §4.1.
  • G. Weiss, Y. Goldberg, and E. Yahav (2018) On the practical computational power of finite precision rnns for language recognition. arXiv preprint arXiv:1805.04908. Cited by: §3, §6.
  • X. Yu, N. T. Vu, and J. Kuhn (2019) Learning the Dyck language with attention-based Seq2Seq models. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Cited by: §1.
  • C. Zhang, S. Bengio, M. Hardt, M. C. Mozer, and Y. Singer (2019) Identity crisis: memorization and generalization under extreme overparameterization. arXiv preprint arXiv:1902.04698. Cited by: §1, §2, §6.

Appendix A Computing description length

We generally follow the description of [Blier and Ollivier, 2018]. The problem of calculating the description length $L(D)$ of a dataset $D = \{(x_i, y_i)\}_{i=1}^{k}$ is considered as a problem of transferring the outputs $y_i$ one by one, in a compressed form, between two parties, Alice (sender) and Bob (receiver). Alice has the entire dataset $D$, while Bob only has the inputs $\{x_i\}$. Before the transmission starts, both parties agree on the initialization of the model $M$, the order of the inputs $\{x_i\}$, the random seeds, and the details of the learning procedure. The outputs $\{y_i\}$ are sequences of tokens from a vocabulary $V$.

The very first output $y_1$ can be sent using no more than $c = |y_1| \cdot \log |V|$ nats with a naïve encoding. After that, both Alice and Bob update their learners using the example $(x_1, y_1)$, which is available to both of them, and get identical instances of $M_1$.

Further transfer is done iteratively under the invariant that both Alice and Bob start every step $t$ with exactly the same learner $M_{t-1}$ and finish with identical $M_t$. At step $t$, Alice uses $M_{t-1}$ to encode the next output $y_t$. This can be done using $-\log p_{M_{t-1}}(y_t \mid x_t)$ nats [MacKay, 2003]. Since Bob has exactly the same model, he can decode the message to obtain $y_t$ and use the new pair $(x_t, y_t)$ to update his model and get $M_t$. Alice also updates her model and proceeds to send the next output $y_{t+1}$ (if any), encoding it with the help of $M_t$.

Overall, this procedure gives us Eq. 1 from Section 2 of the main text:

$$L(D) = -\sum_{t=2}^{k} \log p_{M_{t-1}}(y_t \mid x_t) + c.$$
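The transmission protocol described in this appendix can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the authors' implementation: `UnigramLearner` is a hypothetical toy model (a smoothed unigram distribution over output tokens that ignores the input) standing in for a seq2seq learner, and the first output is assumed to cost its length times the log of the vocabulary size, i.e. a uniform code over tokens.

```python
import math
from collections import Counter


class UnigramLearner:
    """Hypothetical toy learner: add-one smoothed unigram model over output tokens."""

    def __init__(self, vocab):
        self.vocab = list(vocab)
        self.counts = Counter()

    def log_prob(self, x, y):
        # The input x is ignored by this toy model.
        total = sum(self.counts.values()) + len(self.vocab)
        return sum(math.log((self.counts[tok] + 1) / total) for tok in y)

    def update(self, x, y):
        self.counts.update(y)


def description_length(learner, dataset, vocab_size):
    """Online code length (in nats) of the outputs given the inputs."""
    if not dataset:
        return 0.0
    (x1, y1), rest = dataset[0], dataset[1:]
    total = len(y1) * math.log(vocab_size)  # assumed uniform code for the first output
    learner.update(x1, y1)
    for x, y in rest:
        total += -learner.log_prob(x, y)  # cost under the learner trained on earlier pairs
        learner.update(x, y)              # sender and receiver update identically
    return total


vocab = ["b", "<eos>"]
data = [("aa", ["b", "b", "<eos>"]), ("aaa", ["b", "b", "b", "<eos>"])]
print(round(description_length(UnigramLearner(vocab), data, len(vocab)), 3))  # 4.528
```

The key design point is that the learner is updated only after an output has been charged, so each output is always encoded by a model that has not yet seen it.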

Appendix B Can seq2seq learners multiply by 3?

In the experiments on the Add-or-Multiply task, reported in the main text, we saw that LSTM-s2s learners are able to learn to multiply by 2 from a single example. Moreover, whether the learner prefers the additive or the multiplicative “explanation” depends on the length of the training example. A natural further question is whether these learners can learn to multiply by larger numbers, and what governs their switching behavior. We believe answering these questions is a promising direction for further work. Here we only provide a preliminary study, in the hope of inspiring more focused studies.

To do so, we build a task that is similar to Add-or-Multiply, but centered around multiplication by 3 instead of 2. The single training example represents a mapping of an input string $a^l$ to an output string $b^{3l}$. As test inputs, we use $a^m$, with $m$ taken from an interval of test lengths.

Since $3l$ can be represented as a combination of addition and multiplication in several ways ($3l = 2l + l = l + 2l$), we have more candidate generalizations than for Add-or-Multiply. In particular, we consider four different candidate rules. As before, by mem we denote the constant mapping from a test input $a^m$ to the training output $b^{3l}$, where $l$ is the length of the training input. mul1 represents the mapping $a^m \to b^{2l+m}$, mul2 corresponds to $a^m \to b^{l+2m}$, and mul3 denotes $a^m \to b^{3m}$. The “explanation” mul1 is akin to the add rule in the Add-or-Multiply task. We use the same hyperparameters and training procedure described in Section 4 of the main text.
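To make the ambiguity concrete, the candidate rules can be written as functions of the test length. This is a hedged sketch: it assumes, as our reading of the rule descriptions, that mul1, mul2 and mul3 produce outputs of length m + 2l, 2m + l and 3m respectively (l being the training length and m the test length), and that mem always reproduces the training output of length 3l. The function and dictionary names are hypothetical.

```python
def mem(m, l):
    return 3 * l          # constant mapping: always the training output length


def mul1(m, l):
    return m + 2 * l      # additive reading, akin to the add rule


def mul2(m, l):
    return 2 * m + l


def mul3(m, l):
    return 3 * m          # multiplication by 3


RULES = {"mem": mem, "mul1": mul1, "mul2": mul2, "mul3": mul3}


def predict(rule, m, l):
    """Output string that `rule` assigns to a test input of length m."""
    return "b" * RULES[rule](m, l)


l = 10
# Every rule reproduces the single training example (m == l) ...
assert all(predict(rule, l, l) == "b" * (3 * l) for rule in RULES)
# ... but the rules disagree on any other test length.
print({rule: len(predict(rule, l + 1, l)) for rule in RULES})
# {'mem': 30, 'mul1': 31, 'mul2': 32, 'mul3': 33}
```

Because all four rules agree on the training example, only the learner's inductive bias decides which one is followed on unseen lengths.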

We report the results in Table 2. As in the Add-or-Multiply task, both CNN-s2s and Transformer learners show a strong preference for the mem rule, while LSTM-based learners switch their generalization according to the length $l$ of the training example. Indeed, for CNN-s2s and Transformer, we note an FPA-mem ≥ 0.97 independently of $l$ (with the description length of mem significantly lower than that of the other rules). LSTM-s2s att. learners start by inferring the mem rule for $l = 5$ (FPA = 0.64, description length 2.44 nats), then switch to a comparable preference for mul2 and mul3 when $l = 10$, and finally show a significant bias toward the mul3 hypothesis for $l \in \{15, 20\}$ (e.g., FPA-mul3 = 0.76 for $l = 15$). LSTM-s2s no att. learners are subject to a similar switch of preference. For $l = 5$, these learners have a significant bias toward mul1 (FPA = 0.49). Strikingly, when $l = 10$, LSTM-s2s no att. learners inferred the mul2 rule almost perfectly after one training example (FPA = 0.92). Lastly, we again observe a switch, this time to approximate mul3, for $l \in \{15, 20\}$.

In sum, while CNN-s2s and Transformer learners show a significant and robust bias toward mem, LSTM-based learners generalize differently depending on the training input length. In particular, our results suggest that these learners avoid adding very large integers by switching to the multiplicative explanation in those cases. Answering the initial question of this section, we see that LSTM-based learners can learn to multiply by 3 from a single example.
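The switching behavior can be read directly off the description lengths. The snippet below is a small illustration: the numbers are copied from the LSTM-s2s no att. rows of Table 2, while the helper name `preferred_rule` is hypothetical. It picks, for each training length, the rule with the shortest description length.

```python
# Description lengths in nats for LSTM-s2s no att., copied from Table 2.
# Keys are the training example lengths.
DESCRIPTION_LENGTHS = {
    5:  {"mul1": 1.97,  "mul2": 26.50, "mul3": 54.97, "mem": 36.13},
    10: {"mul1": 26.48, "mul2": 0.65,  "mul3": 22.26, "mem": 53.81},
    15: {"mul1": 40.46, "mul2": 13.22, "mul3": 6.14,  "mem": 66.10},
    20: {"mul1": 52.27, "mul2": 22.20, "mul3": 1.17,  "mem": 77.75},
}


def preferred_rule(lengths):
    """Rule with the shortest description length, i.e. the preferred explanation."""
    return min(lengths, key=lengths.get)


for train_length in sorted(DESCRIPTION_LENGTHS):
    print(train_length, preferred_rule(DESCRIPTION_LENGTHS[train_length]))
# 5 mul1
# 10 mul2
# 15 mul3
# 20 mul3
```

Note that this only ranks the averaged values; whether a difference is statistically significant requires the per-seed values, which are not listed here.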

| Learner | Train length $l$ | FPA mul1 | FPA mul2 | FPA mul3 | FPA mem | $L$ mul1 | $L$ mul2 | $L$ mul3 | $L$ mem |
|---|---|---|---|---|---|---|---|---|---|
| LSTM-s2s no att. | 20 | 0.00 | 0.01 | 0.78 | 0.00 | 52.27 | 22.20 | **1.17** | 77.75 |
| | 15 | 0.00 | 0.13 | 0.45 | 0.00 | 40.46 | 13.22 | **6.14** | 66.10 |
| | 10 | 0.00 | 0.92 | 0.00 | 0.00 | 26.48 | **0.65** | 22.26 | 53.81 |
| | 5 | 0.49 | 0.00 | 0.00 | 0.00 | **1.97** | 26.50 | 54.97 | 36.13 |
| LSTM-s2s att. | 20⁶ | 0.00 | 0.19 | 0.62 | 0.00 | 36.76 | 20.35 | **9.35** | 49.34 |
| | 15⁷ | 0.00 | 0.14 | 0.76 | 0.00 | 37.84 | 18.42 | **5.53** | 56.43 |
| | 10 | 0.02 | 0.45 | 0.49 | 0.00 | 29.83 | 11.67 | **8.96** | 45.47 |
| | 5 | 0.01 | 0.03 | 0.00 | 0.64 | 32.97 | 48.26 | 60.38 | **2.44** |
| CNN-s2s | 20 | 0.00 | 0.00 | 0.00 | 0.99 | 263.82 | 262.01 | 261.71 | **0.02** |
| | 15 | 0.00 | 0.00 | 0.00 | 1.00 | 250.97 | 253.32 | 253.08 | **0.00** |
| | 10 | 0.00 | 0.00 | 0.00 | 1.00 | 243.17 | 245.16 | 248.28 | **0.00** |
| | 5 | 0.00 | 0.00 | 0.00 | 1.00 | 258.10 | 257.79 | 264.06 | **0.00** |
| Transformer | 20 | 0.00 | 0.00 | 0.00 | 0.97 | 37.90 | 51.47 | 57.57 | **5.31** |
| | 15 | 0.00 | 0.00 | 0.00 | 1.00 | 40.36 | 51.62 | 57.42 | **2.50** |
| | 10 | 0.00 | 0.00 | 0.00 | 1.00 | 38.05 | 49.88 | 55.61 | **2.47** |
| | 5 | 0.00 | 0.00 | 0.00 | 1.00 | 37.96 | 51.83 | 60.19 | **0.74** |

⁶ Only 21% of learners succeeded in learning the training example in this setting.
⁷ Only 51% of learners succeeded in learning the training example in this setting.

Table 2: Multiplication by 3. FPA measures the fraction of seeds that generalize according to a particular rule. Description length $L$ (in nats) is averaged across examples and seeds. The lowest $L$ in each row is in bold. Statistical significance of differences in $L$ was assessed with a paired t-test.
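As a reading aid for the FPA columns, the metric can be computed as follows. This is a hedged sketch: it assumes that a seed "generalizes according to a rule" only when its predictions match the rule's outputs on every test input, and the function name `fpa` and the toy predictions are hypothetical.

```python
def fpa(seed_predictions, rule_outputs):
    """Fraction of seeds whose predictions match the rule on every test input.

    seed_predictions: one list of predicted output strings per seed.
    rule_outputs: the outputs the candidate rule assigns to the same test inputs.
    """
    if not seed_predictions:
        raise ValueError("need predictions from at least one seed")
    target = list(rule_outputs)
    perfect = sum(1 for preds in seed_predictions if list(preds) == target)
    return perfect / len(seed_predictions)


# Toy example with four seeds and two test inputs.
rule = ["bb", "bbb"]
seeds = [["bb", "bbb"], ["bb", "bb"], ["bb", "bbb"], ["b", "bbb"]]
print(fpa(seeds, rule))  # 0.5
```

Since a single wrong output disqualifies a seed, FPA is a strict measure, which may explain why description length can separate rules even when all FPA values are zero.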

Appendix C Robustness to changes in architecture

In this section, we examine how changing different hyper-parameters affects learners’ preferences for memorization, arithmetic, and hierarchical reasoning. In particular, we vary the number of layers and the hidden and embedding sizes of the different learners, and test their generalization on the Count-or-Memorization, Add-or-Multiply and Hierarchical-or-Linear tasks.⁸

⁸ In the main text, the description length reported for the Hierarchical-or-Linear task (Table 1c) is normalized by the length of the block (i.e., the constant 4). Throughout the Supplementary, we do not apply such normalization. Since we always compare biases with the learner and the dataset fixed, this should not cause any confusion. Upon acceptance, we will remove this scaling from Table 1c.

In all these experiments, we fix the length of the training examples, using one fixed value for each of the Count-or-Memorization, Add-or-Multiply and Hierarchical-or-Linear tasks.

Finally, we keep the same training and evaluation procedure as detailed in Section 4.2 of the main text. However, we use 20 different random seeds instead of 100.

C.1 Number of hidden layers

We experiment with the standard seq2seq learners described in the main paper and vary the number of layers in {1, 3, 10}. Results are reported in Table 3.

First, when looking at the interplay between mem and count (Table 3(a)), we observe that, independently of the number of layers, at least 97% of CNN-s2s and Transformer learners inferred the mem rule perfectly (i.e., they output the memorized training output for any given input). Further, although the preference for count decreases as the number of layers increases, LSTM-s2s att. learners display in all cases a significant bias toward count, with a large FPA and a description length significantly lower than that of mem. However, we note a decline of the bias toward count when considering LSTM-s2s no att. learners. With one layer, 100% of the seeds generalize to perfect count, versus 6% with ten layers. Note that this lower preference for count is accompanied by an increased preference for mem. However, there is no significant switch of preferences as the number of layers changes.

Second, we consider the Add-or-Multiply task, where we examine the three generalization hypotheses add, mul and mem. Results are reported in Table 3(b). As in the previous task, Transformer and CNN-s2s learners are robust to changes in the number of layers. They perfectly follow the mem rule with FPA-mem = 1.00. However, LSTM-based learners show a more complex behavior: while single-layer LSTM-s2s att. and no att. demonstrate a considerable bias for mul (FPA-mul ≥ 0.94), this bias fades in favor of add and mem for larger architectures.

Finally, in Table 3(c), we observe that larger learners are slightly less biased toward hierar and linear. However, we do not observe any switch of preferences. That is, across different numbers of layers, CNN-s2s learners prefer the linear rule, whereas Transformers and, to a lesser degree, LSTM-based learners show a significant preference for the hierar rule.

In sum, we note little impact of the number of layers on learners’ generalizations. Indeed, LSTM-based learners can learn to perform counting from a single training example, even when experimenting with 10-layer architectures. They also, for most tested numbers of layers, favor mul and add over mem. In contrast, Transformer and CNN-s2s learners systematically memorize the single training example. Furthermore, independently of the number of layers, Transformer and LSTM-based learners show a bias toward the hierar hypothesis over the linear one, while CNN-based learners do the opposite. Whichever rule learners prefer, these findings reveal strong inductive biases, as the training examples are highly ambiguous. Interestingly, these biases are barely influenced by the number of layers.

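| Learner | Layers | FPA count | FPA mem | $L$ count | $L$ mem |
|---|---|---|---|---|---|
| LSTM-s2s no att. | 1 | 1.00 | 0.00 | **0.01** | 97.51 |
| | 3 | 0.80 | 0.00 | **0.00** | 60.80 |
| | 10 | 0.06 | 0.47 | **16.89** | 22.02 |
| LSTM-s2s att. | 1 | 0.99 | 0.00 | **7.84** | 121.48 |
| | 3 | 0.50 | 0.00 | **11.30** | 57.57 |
| | 10 | 0.39 | 0.22 | **22.86** | 45.13 |
| CNN-s2s | 1 | 0.00 | 0.98 | 660.73 | **0.02** |
| | 3 | 0.00 | 1.00 | 1685.18 | **0.00** |
| | 10 | - | - | - | - |
| Transformer | 1 | 0.00 | 0.97 | 116.34 | **11.10** |
| | 3 | 0.00 | 1.00 | 139.58 | **0.31** |
| | 10 | - | - | - | - |

(a) Count-or-Memorization

| Learner | Layers | FPA add | FPA mul | FPA mem | $L$ add | $L$ mul | $L$ mem |
|---|---|---|---|---|---|---|---|
| LSTM-s2s no att. | 1 | 0.00 | 0.94 | 0.00 | 25.42 | **0.31** | 57.32 |
| | 3 | 0.15 | 0.20 | 0.00 | **9.82** | 10.11 | 33.16 |
| | 10 | 0.28 | 0.00 | 0.33 | **5.83** | 23.16 | 10.01 |
| LSTM-s2s att. | 1 | 0.00 | 0.98 | 0.00 | 30.26 | **1.40** | 58.84 |
| | 3 | 0.65 | 0.20 | 0.00 | **5.88** | 15.67 | 27.82 |
| | 10 | 0.11 | 0.11 | 0.28 | 9.66 | 27.72 | **8.26** |
| CNN-s2s | 1 | 0.00 | 0.00 | 1.00 | 318.12 | 346.19 | **0.00** |
| | 3 | 0.00 | 0.00 | 1.00 | 1058.27 | 824.93 | **2.31** |
| | 10 | - | - | - | - | - | - |
| Transformer | 1 | 0.00 | 0.00 | 1.00 | 38.77 | 50.64 | **3.50** |
| | 3 | 0.00 | 0.00 | 1.00 | 40.03 | 57.73 | **0.18** |
| | 10 | - | - | - | - | - | - |

(b) Add-or-Multiply

| Learner | Layers | FPA hierar | FPA linear | $L$ hierar | $L$ linear |
|---|---|---|---|---|---|
| LSTM-s2s no att. | 1 | 0.05 | 0.00 | **31.04** | 61.84 |
| | 3 | 0.00 | 0.00 | **29.08** | 50.92 |
| | 10 | - | - | - | - |
| LSTM-s2s att. | 1 | 0.30 | 0.00 | **26.32** | 57.20 |
| | 3 | 0.00 | 0.00 | **32.32** | 50.12 |
| | 10 | - | - | - | - |
| CNN-s2s | 1 | 0.00 | 1.00 | 202.64 | **0.00** |
| | 3 | 0.00 | 0.70 | 222.32 | **1.84** |
| | 10 | - | - | - | - |
| Transformer | 1 | 0.69 | 0.00 | **4.84** | 35.04 |
| | 3 | 0.56 | 0.00 | **5.48** | 29.16 |
| | 10 | - | - | - | - |

(c) Hierarchical-or-Linear

Table 3: Effect of the number of layers. FPA measures the fraction of seeds that generalize according to a particular rule. Description length $L$ (in nats) is averaged across examples and seeds. The lowest $L$ in each row is in bold. Statistical significance of differences in $L$ was assessed with a paired t-test. ‘-’ denotes settings where the learners’ success rate on the train set was too low.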
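The captions mention a paired t-test on description lengths. A minimal sketch of the test statistic is given below, under the assumption that the pairing is across seeds (one description length per rule for each seed); the function name and the example values are hypothetical, and the per-seed values themselves are not reported in these tables. For p-values, a library routine such as `scipy.stats.ttest_rel` can be used instead.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(lengths_a, lengths_b):
    """Paired t statistic and degrees of freedom for per-seed description lengths."""
    if len(lengths_a) != len(lengths_b):
        raise ValueError("both rules need one description length per seed")
    diffs = [a - b for a, b in zip(lengths_a, lengths_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two seeds")
    spread = stdev(diffs)  # sample standard deviation of the paired differences
    if spread == 0:
        raise ValueError("all differences are identical; the statistic is undefined")
    return mean(diffs) / (spread / math.sqrt(n)), n - 1


# Hypothetical per-seed description lengths (nats) for two competing rules.
rule_a = [1.0, 2.0, 3.0, 4.0]
rule_b = [2.0, 2.5, 4.5, 5.0]
t, dof = paired_t_statistic(rule_a, rule_b)
print(round(t, 3), dof)  # -4.899 3
```

A negative statistic here means that the first rule has the shorter description length on average across seeds.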

C.2 Hidden size

In the main text, we experimented with standard seq2seq learners with a hidden size of 512. In this section, we look at the effect of varying it in {128, 512, 1024}. We report the learners’ performance in Table 4.

First, Table 4(a) demonstrates how minor an effect the hidden size has on learners’ counting/memorization performance. Indeed, for any given hidden size, between 84% and 100% of LSTM-based learners learn perfect counting after only one training example. Similarly, and with even lower variation, at least 91% of Transformer and CNN-s2s learners memorize the single training example, outputting the memorized training output for any given input.

The same observation can be made when studying the interplay between the hierar and linear biases. Concretely, Table 4(c) shows that learners’ generalizations are stable across hidden sizes, with the exception of LSTM-s2s no att. learners. While the latter display a significant bias toward hierar for hidden sizes of 128 and 512, we do not observe any significant difference between the two rules for a hidden size of 1024.

Finally, as demonstrated in Table 4(b), all Transformer and CNN-s2s learners perform perfect memorization when tested on the Add-or-Multiply task, independently of their hidden sizes. Both LSTM-based learners are significantly biased toward mul for hidden sizes of 512 and above. However, when experimenting with a smaller hidden size (128), we detect a switch of preference for LSTM-s2s no att. learners. The latter start approximating the add-type rule (with a significantly lower description length). Lastly, we do not find any significant difference between add and mul for LSTM-s2s att. when the hidden size is 128.

Taken together, learners’ biases are quite robust to variations of the hidden size. We note, however, a switch of preference from mul to add for LSTM-s2s no att. learners when decreasing the hidden size. Furthermore, we see a loss of significant preference in three distinct settings.

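| Learner | Hidden size | FPA count | FPA mem | $L$ count | $L$ mem |
|---|---|---|---|---|---|
| LSTM-s2s no att. | 128 | 0.84 | 0.00 | **8.25** | 109.84 |
| | 512 | 1.00 | 0.00 | **0.01** | 97.51 |
| | 1024 | 1.00 | 0.00 | **0.00** | 149.86 |
| LSTM-s2s att. | 128 | 0.90 | 0.00 | **0.01** | 89.31 |
| | 512 | 0.99 | 0.00 | **7.84** | 121.48 |
| | 1024 | 1.00 | 0.00 | **0.00** | 300.93 |
| CNN-s2s | 128 | 0.00 | 1.00 | 805.01 | **0.00** |
| | 512 | 0.00 | 0.98 | 660.73 | **0.02** |
| | 1024 | 0.00 | 1.00 | 993.85 | **0.00** |
| Transformer | 128 | 0.00 | 0.91 | 110.68 | **9.57** |
| | 512 | 0.00 | 0.97 | 116.34 | **11.10** |
| | 1024 | 0.00 | 0.94 | 122.38 | **1.42** |

(a) Count-or-Memorization

| Learner | Hidden size | FPA add | FPA mul | FPA mem | $L$ add | $L$ mul | $L$ mem |
|---|---|---|---|---|---|---|---|
| LSTM-s2s no att. | 128 | 0.00 | 0.00 | 0.00 | **7.56** | 19.66 | 43.72 |
| | 512 | 0.00 | 0.94 | 0.00 | 25.42 | **0.31** | 57.32 |
| | 1024 | 0.25 | 0.75 | 0.00 | 30.29 | **5.26** | 77.99 |
| LSTM-s2s att. | 128 | 0.00 | 0.00 | 0.00 | **15.32** | 17.37 | 45.09 |
| | 512 | 0.00 | 0.98 | 0.00 | 30.26 | **1.40** | 58.84 |
| | 1024 | 0.00 | 1.00 | 0.00 | 51.70 | **3.09** | 86.82 |
| CNN-s2s | 128 | 0.00 | 0.00 | 1.00 | 281.34 | 301.75 | **0.02** |
| | 512 | 0.00 | 0.00 | 1.00 | 318.12 | 346.19 | **0.00** |
| | 1024 | 0.00 | 0.00 | 1.00 | 520.75 | 508.75 | **0.00** |
| Transformer | 128 | 0.00 | 0.00 | 1.00 | 35.21 | 46.07 | **2.94** |
| | 512 | 0.00 | 0.00 | 1.00 | 38.77 | 50.64 | **3.50** |
| | 1024 | 0.00 | 0.00 | 1.00 | 38.74 | 51.71 | **0.88** |

(b) Add-or-Multiply

| Learner | Hidden size | FPA hierar | FPA linear | $L$ hierar | $L$ linear |
|---|---|---|---|---|---|
| LSTM-s2s no att. | 128 | 0.00 | 0.00 | **30.40** | 79.00 |
| | 512 | 0.05 | 0.00 | **31.04** | 61.84 |
| | 1024 | 0.00 | 0.00 | 60.24 | **47.24** |
| LSTM-s2s att. | 128 | 0.00 | 0.00 | **32.72** | 72.28 |
| | 512 | 0.30 | 0.00 | **26.32** | 57.20 |
| | 1024 | 0.05 | 0.00 | **65.80** | 73.80 |
| CNN-s2s | 128 | 0.00 | 0.95 | 178.88 | **0.00** |
| | 512 | 0.00 | 1.00 | 202.64 | **0.00** |
| | 1024 | 0.00 | 0.95 | 225.36 | **0.12** |
| Transformer | 128 | 0.75 | 0.00 | **2.96** | 36.92 |
| | 512 | 0.69 | 0.00 | **4.84** | 35.04 |
| | 1024 | 0.75 | 0.00 | **31.04** | 61.84 |

(c) Hierarchical-or-Linear

Table 4: Effect of the hidden size. FPA measures the fraction of seeds that generalize according to a particular rule. Description length $L$ (in nats) is averaged across examples and seeds. The lowest $L$ in each row is in bold. Statistical significance of differences in $L$ was assessed with a paired t-test.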

C.3 Embedding size

We look here at the effect of the embedding size on learners’ generalizations. In particular, we vary it in {16, 64, 256}. Results are reported in Table 5.

Across all sub-tables, we see a small influence of the embedding size on learners’ biases. For example, if we consider the Count-or-Memorization task when varying the embedding size (see Table 5(a)), between 95% and 100% of LSTM-s2s no att. learners inferred the count hypothesis perfectly. More strikingly, between 99% and 100% of LSTM-s2s att. learners learned the count rule after one training example. The same trend is observed for the remaining learners and across the other tasks, Add-or-Multiply (Table 5(b)) and Hierarchical-or-Linear (Table 5(c)). Yet, we still discern in some cases systematic, but small, effects of the embedding size. First, the larger the embedding size, the lower the FPA-mul of LSTM-s2s no att. learners (from 0.94 for an embedding size of 16 to 0.84 for an embedding size of 256). However, LSTM-s2s no att. learners still have a considerable preference for mul for any tested embedding size. Second, we see an increase of the Transformer’s preference for hierar as the embedding size increases. Surprisingly, for embedding sizes of 64 and above, 100% of Transformer learners generalize to the perfect hierar hypothesis.

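| Learner | Embedding size | FPA count | FPA mem | $L$ count | $L$ mem |
|---|---|---|---|---|---|
| LSTM-s2s no att. | 16 | 1.00 | 0.00 | **0.01** | 97.51 |
| | 64 | 0.95 | 0.00 | **0.00** | 91.54 |
| | 256 | 1.00 | 0.00 | **0.00** | 90.32 |
| LSTM-s2s att. | 16 | 0.99 | 0.00 | **7.84** | 121.48 |
| | 64 | 1.00 | 0.00 | **11.12** | 117.39 |
| | 256 | 1.00 | 0.00 | **9.79** | 127.31 |
| CNN-s2s | 16 | 0.00 | 0.98 | 660.73 | **0.02** |
| | 64 | 0.00 | 1.00 | 670.53 | **0.00** |
| | 256 | 0.00 | 0.95 | 826.23 | **0.01** |
| Transformer | 16 | 0.00 | 0.97 | 116.34 | **11.10** |
| | 64 | 0.00 | 1.00 | 232.53 | **0.00** |
| | 256 | 0.00 | 1.00 | 338.88 | **0.00** |

(a) Count-or-Memorization

| Learner | Embedding size | FPA add | FPA mul | FPA mem | $L$ add | $L$ mul | $L$ mem |
|---|---|---|---|---|---|---|---|
| LSTM-s2s no att. | 16 | 0.00 | 0.94 | 0.00 | 25.42 | **0.31** | 57.32 |
| | 64 | 0.05 | 0.90 | 0.00 | 23.70 | **1.41** | 51.33 |
| | 256 | 0.05 | 0.84 | 0.00 | 20.42 | **1.33** | 51.34 |
| LSTM-s2s att. | 16 | 0.00 | 0.98 | 0.00 | 30.26 | **1.40** | 58.84 |
| | 64 | 0.07 | 0.93 | 0.00 | 26.71 | **2.54** | 50.65 |
| | 256 | 0.00 | 1.00 | 0.00 | 26.83 | **1.94** | 51.20 |
| CNN-s2s | 16 | 0.00 | 0.00 | 1.00 | 318.12 | 346.19 | **0.00** |
| | 64 | 0.00 | 0.00 | 1.00 | 293.86 | 294.50 | **0.00** |
| | 256 | 0.00 | 0.00 | 1.00 | 486.81 | 447.20 | **0.00** |
| Transformer | 16 | 0.00 | 0.00 | 1.00 | 38.77 | 50.64 | **3.50** |
| | 64 | 0.00 | 0.00 | 1.00 | 87.11 | 142.83 | **0.00** |
| | 256 | 0.00 | 0.00 | 1.00 | 118.34 | 172.65 | **0.00** |

(b) Add-or-Multiply

| Learner | Embedding size | FPA hierar | FPA linear | $L$ hierar | $L$ linear |
|---|---|---|---|---|---|
| LSTM-s2s no att. | 16 | 0.05 | 0.00 | **31.04** | 61.84 |
| | 64 | 0.10 | 0.00 | **34.44** | 72.20 |
| | 256 | 0.00 | 0.00 | **36.32** | 76.56 |
| LSTM-s2s att. | 16 | 0.30 | 0.00 | **26.32** | 57.20 |
| | 64 | 0.30 | 0.00 | **23.40** | 72.68 |
| | 256 | 0.05 | 0.00 | **55.56** | 90.76 |
| CNN-s2s | 16 | 0.00 | 1.00 | 202.64 | **0.00** |
| | 64 | 0.00 | 1.00 | 227.12 | **0.08** |
| | 256 | 0.00 | 0.94 | 419.28 | **8.84** |
| Transformer | 16 | 0.69 | 0.00 | **4.84** | 35.04 |
| | 64 | 1.00 | 0.00 | **0.00** | 81.16 |
| | 256 | 1.00 | 0.00 | **0.56** | 121.68 |

(c) Hierarchical-or-Linear

Table 5: Effect of the embedding size. FPA measures the fraction of seeds that generalize according to a particular rule. Description length $L$ (in nats) is averaged across examples and seeds. The lowest $L$ in each row is in bold. Statistical significance of differences in $L$ was assessed with a paired t-test.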

In this section, we studied the impact of the number of layers and of the hidden and embedding sizes on learners’ generalizations. We found that, although these hyper-parameters can influence, in some cases, the degree of a learner’s preference for a given rule, inductive biases are quite robust to their changes. In particular, among all tested combinations, we observe only a small number of cases of preference switch.
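The one-factor-at-a-time design of this appendix can be enumerated programmatically. The following is a hedged sketch: the value grids are read off Tables 3 to 5, while the default configuration (1 layer, hidden size 512, embedding size 16) is an inference from the rows that repeat across those tables rather than something stated in this section, and the dictionary keys are hypothetical names.

```python
# Assumed defaults, inferred from the rows shared by Tables 3, 4 and 5.
DEFAULTS = {"layers": 1, "hidden_size": 512, "embedding_size": 16}

# Value grids as they appear in the tables.
SWEEPS = {
    "layers": [1, 3, 10],
    "hidden_size": [128, 512, 1024],
    "embedding_size": [16, 64, 256],
}


def one_at_a_time(defaults, sweeps):
    """Vary a single hyper-parameter at a time, keeping the others at their defaults."""
    configs = []
    for name, values in sweeps.items():
        for value in values:
            config = dict(defaults, **{name: value})
            if config not in configs:  # the default setting appears in every sweep
                configs.append(config)
    return configs


configs = one_at_a_time(DEFAULTS, SWEEPS)
print(len(configs))  # 7
```

Each configuration would then be trained with several random seeds per learner and task; that training loop is outside the scope of this sketch.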

Appendix D Robustness to changes in training parameters

We examine here the effect of the training parameters on learners’ biases. As previously, we only consider the Count-or-Memorization, Add-or-Multiply and Hierarchical-or-Linear tasks, each with a single fixed training example. We experiment with the architectures detailed in the main paper; however, we use 20 different random seeds instead of the 100 used in the main text.

We consider in this section two different hyperparameters: (1) the choice of the optimizer, and (2) the dropout probability.

Optimizer. We experiment with replacing the Adam optimizer [Kingma and Ba, 2014] with SGD [Bottou et al., 2018]. We found experimentally that, in most of the settings, learners failed to consistently learn the training examples. Yet, when successful, they showed the same preferences. In particular, Transformer and CNN-s2s were the only learners that had a high success rate on the Count-or-Memorization and Add-or-Multiply train sets. These learners showed a perfect generalization to the mem rule in both tasks.

Dropout. We then examine how the dropout probability affects learners’ preferences. We use, as mentioned in the main paper, the Adam optimizer and vary the dropout probability in {0.0, 0.2, 0.5}. Results are reported in Table 6.

Both the Count-or-Memorization (Table 6(a)) and Add-or-Multiply (Table 6(b)) tasks show the same trend. First, Transformer and CNN-s2s learners consistently prefer the mem rule. Second, when looking at LSTM-based learners, we find a more complex behavior. For dropout probabilities of 0.2 and 0.5, LSTM-based learners show a significant preference for arithmetic reasoning (count for the Count-or-Memorization task and mul for the Add-or-Multiply task). However, when the dropout probability is 0.0, we see different preferences. In Count-or-Memorization, both LSTM-based learners show a preference for mem (not significant for LSTM-s2s att.). In Add-or-Multiply, LSTM-s2s no att. learners are significantly biased toward add, whereas LSTM-s2s att. learners do not show any significant bias, with a slight preference for mem. In sum, the lower the dropout probability, the more likely learners are to overfit to the mem rule.

Finally, we consider the Hierarchical-or-Linear task (see Table 6(c)). We observe that, for any dropout value, the inductive biases of CNN-s2s and Transformer remain the same. Indeed, CNN-s2s learners show a consistent preference for linear with FPA ≥ 0.75, while Transformers prefer hierar (note that this preference is not very large for a dropout probability of 0.0, with an FPA-hierar of 0.05, compared to an FPA-hierar of 1.00 and 0.69 for dropout probabilities of 0.2 and 0.5, respectively). On the other hand, dropout has a larger impact on LSTM-based learners. When the dropout probability is 0.2 or 0.5, both LSTM-s2s learners prefer the hierar hypothesis. However, for a dropout probability of 0.0, neither learner shows any significant preference for either rule (with FPA = 0.00 for both rules and close description lengths).

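| Learner | Dropout | FPA count | FPA mem | $L$ count | $L$ mem |
|---|---|---|---|---|---|
| LSTM-s2s no att. | 0.0 | 0.00 | 0.20 | 56.23 | **16.52** |
| | 0.2 | 0.95 | 0.00 | **0.17** | 60.15 |
| | 0.5 | 1.00 | 0.00 | **0.01** | 97.51 |
| LSTM-s2s att. | 0.0 | 0.32 | 0.68 | 63.30 | **47.68** |
| | 0.2 | 0.95 | 0.05 | **33.66** | 87.36 |
| | 0.5 | 0.99 | 0.00 | **7.84** | 121.48 |
| CNN-s2s | 0.0 | 0.00 | 0.55 | 1034.67 | **0.43** |
| | 0.2 | 0.00 | 0.98 | 999.62 | **0.01** |
| | 0.5 | 0.00 | 0.98 | 660.73 | **0.02** |
| Transformer | 0.0 | 0.00 | 0.65 | 261.02 | **1.17** |
| | 0.2 | 0.00 | 1.00 | 171.31 | **0.05** |
| | 0.5 | 0.00 | 0.97 | 116.34 | **11.10** |

(a) Count-or-Memorization

| Learner | Dropout | FPA add | FPA mul | FPA mem | $L$ add | $L$ mul | $L$ mem |
|---|---|---|---|---|---|---|---|
| LSTM-s2s no att. | 0.0 | 0.25 | 0.00 | 0.00 | **5.00** | 35.55 | 19.07 |
| | 0.2 | 0.30 | 0.45 | 0.00 | 12.07 | **11.04** | 37.39 |
| | 0.5 | 0.00 | 0.94 | 0.00 | 25.42 | **0.31** | 57.32 |
| LSTM-s2s att. | 0.0 | 0.00 | 0.11 | 0.58 | 25.09 | 36.33 | **12.09** |
| | 0.2 | 0.18 | 0.53 | 0.18 | 22.22 | **9.92** | 34.48 |
| | 0.5 | 0.00 | 0.98 | 0.00 | 30.26 | **1.40** | 58.84 |
| CNN-s2s | 0.0 | 0.00 | 0.00 | 0.65 | 236.41 | 247.52 | **0.42** |
| | 0.2 | 0.00 | 0.00 | 1.00 | 438.06 | 464.26 | **0.00** |
| | 0.5 | 0.00 | 0.00 | 1.00 | 318.12 | 346.19 | **0.00** |
| Transformer | 0.0 | 0.00 | 0.00 | 0.65 | 84.58 | 130.88 | **0.96** |
| | 0.2 | 0.00 | 0.00 | 1.00 | 65.62 | 99.05 | **0.02** |
| | 0.5 | 0.00 | 0.00 | 1.00 | 38.77 | 50.64 | **3.50** |

(b) Add-or-Multiply

| Learner | Dropout | FPA hierar | FPA linear | $L$ hierar | $L$ linear |
|---|---|---|---|---|---|
| LSTM-s2s no att. | 0.0 | 0.00 | 0.00 | **16.38** | 17.72 |
| | 0.2 | 0.00 | 0.00 | **11.09** | 19.87 |
| | 0.5 | 0.05 | 0.00 | **7.76** | 15.46 |
| LSTM-s2s att. | 0.0 | 0.00 | 0.00 | 31.12 | **28.61** |
| | 0.2 | 0.00 | 0.00 | **10.63** | 17.55 |
| | 0.5 | 0.30 | 0.00 | **6.58** | 14.30 |
| CNN-s2s | 0.0 | 0.00 | 0.75 | 68.48 | **0.44** |
| | 0.2 | 0.00 | 1.00 | 99.42 | **1.16** |
| | 0.5 | 0.00 | 1.00 | 50.66 | **0.00** |
| Transformer | 0.0 | 0.05 | 0.00 | **3.99** | 8.56 |
| | 0.2 | 1.00 | 0.00 | **0.09** | 13.09 |
| | 0.5 | 0.69 | 0.00 | **1.21** | 8.76 |

(c) Hierarchical-or-Linear

Table 6: Effect of dropout probability. FPA measures the fraction of seeds that generalize according to a particular rule. Description length $L$ (in nats) is averaged across examples and seeds. The lowest $L$ in each row is in bold. Statistical significance of differences in $L$ was assessed with a paired t-test.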