1 Introduction
Sequence-to-sequence (seq2seq) learners (Sutskever et al., 2014) demonstrated remarkable performance in machine translation, story generation, and open-domain dialog (Sutskever et al., 2014; Fan et al., 2018; Adiwardana et al., 2020). These capabilities, however, come at the expense of being hardly interpretable black boxes, which spurred an interest in better understanding their inner workings (Lake and Baroni, 2017; Gordon et al., 2019; Yu et al., 2019; McCoy et al., 2020).
In this work, we focus on studying inductive biases of seq2seq models. We start from the observation that, generally, multiple explanations can be consistent with a limited training set, each leading to different predictions on unseen data. A learner might prefer one type of explanation over another in a systematic way, as a result of its inductive biases (Ritter et al., 2017; Feinman and Lake, 2018).
To illustrate the setup we work in, consider a quiz-like question: if maps to , what does map to? The “training” example is consistent with the following answers: 6 ; ; ; any number , since we can always construct a function such that and . By analyzing the learner’s output on this new input, we can infer its biases.
This example demonstrates how biases of learners are studied through the lens of the poverty of the stimulus principle (Chomsky, 1965, 1980): if nothing in the training data indicates that a learner should generalize in a certain way, but it does nonetheless, then this is due to the biases of the learner. Inspired by the work of Zhang et al. (2019) in the image domain, we take this principle to the extreme and study biases of seq2seq learners in the regime of very few training examples, often as few as one. Under this setup, we propose three new synthetic tasks that probe seq2seq learners’ preferences for memorization, arithmetic, and hierarchy-based “reasoning”. We complement these tasks with SCAN (Lake and Baroni, 2017), a well-known benchmark for compositional generalization.
Next, we connect to the ideas of Solomonoff’s theory of induction (Solomonoff, 1964) and Minimum Description Length (Rissanen, 1978; Grunwald, 2004) and propose to use description length, under a learner, as a principled measure of its inductive biases.
Our experimental study shows that the standard seq2seq learners have strikingly different inductive biases. We find that LSTM-based learners are able to learn non-trivial counting-, multiplication-, and addition-based rules from as little as one example. CNN-based seq2seq learners prefer linear over hierarchical generalizations, while LSTM-based ones and Transformers do just the opposite. When experimenting with SCAN, description length proved to be a more sensitive measure of inductive biases than the “intuitive” approach. Equipped with this measure, we found that CNN-, LSTM-, and Transformer-based learners prefer, to different degrees, the compositional generalization over the memorization one.
2 Searching for inductive biases
To formalize the way we look for inductive biases of a learner, we consider a training dataset of input/output pairs and a holdout set of inputs. W.l.o.g., we assume that there are two candidate “rules” explaining the training data: they agree on the training inputs but, generally, disagree on the holdout inputs. After we fit the learner on the training data, its bias toward one rule or the other is inferred by analysing the similarity, in the output space, between the learner’s outputs on the holdout inputs and the outputs prescribed by each rule. We call this approach “intuitive”.
Typically, those measures of similarity in the output space are task-specific. McCoy et al. (2020) used accuracy of the first term, Zhang et al. (2019) used correlation and MSE, and Lake and Baroni (2017) used accuracy calculated on the entire output sequence.
We too start with an accuracy-based measure. We define the fraction of perfect agreement (FPA) between a learner and a candidate generalization rule as the fraction of seeds that generalize perfectly in agreement with that rule on the holdout set. That is, the larger the FPA of a learner w.r.t. a rule, the more biased the learner is toward that rule. However, FPA neither considers imperfect generalization, nor allows direct comparison between two candidate rules when both are dominated by a third candidate rule. Hence, below we propose a principled approach based on the description length.
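As a minimal sketch of how FPA can be computed (an illustration, not the code used in the experiments), assume that for every random seed we have collected the greedy outputs of the trained learner on the holdout inputs. The function name, the data layout, and the toy strings below are hypothetical.

```python
from typing import Sequence


def fraction_of_perfect_agreement(
    seed_outputs: Sequence[Sequence[str]],
    rule_outputs: Sequence[str],
) -> float:
    """Fraction of seeds whose outputs match the rule on every holdout input.

    seed_outputs[s][i] is the output produced with seed s on holdout input i;
    rule_outputs[i] is the output the candidate rule prescribes for input i.
    """
    if not seed_outputs:
        raise ValueError("need at least one seed")
    perfect = sum(
        1 for outputs in seed_outputs if list(outputs) == list(rule_outputs)
    )
    return perfect / len(seed_outputs)


# Hypothetical example: three seeds, two holdout inputs.
seeds = [["bb", "bbb"], ["bb", "bbb"], ["bb", "bb"]]
fpa_count = fraction_of_perfect_agreement(seeds, ["bb", "bbb"])
fpa_mem = fraction_of_perfect_agreement(seeds, ["bb", "bb"])
```

A seed that deviates from the rule on even one holdout input contributes nothing, which is why FPA cannot register imperfect generalization.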
Description Length and Inductive Biases At the core of the theory of induction (Solomonoff, 1964) is the question of continuation of a finite string that is very similar to our setup. Indeed, we can easily reformulate our motivating example as a string continuation problem: “”. The solution proposed by Solomonoff (1964) is to select the continuation that admits “the simplest explanation” of the entire string, i.e. that is produced by programs of the shortest length (description length).
Our intuition is that if, for a learner, a continuation is “simple,” then this learner is biased toward it. We consider a learner to be biased toward one candidate rule over another if the training set and its extension according to the first rule have a shorter description length (for that learner) than the training set extended according to the second rule. In other words, we hypothesise that the learner is biased toward whichever candidate yields the shorter description length of the extended dataset under that learner.
This definition is seemingly different from the “intuitive” one that we used above. At the end of this Section, we discuss connections of description length to other ways of measuring inductive biases.
Calculating Description Length (we refer the reader to Supplementary for a more detailed derivation). To find the description length of data under a fixed learner, we use the online (prequential) code (Rissanen, 1978; Grunwald, 2004; Blier and Ollivier, 2018). We are given a learner and a dataset. W.l.o.g., we fix some order over the dataset. We assume that, given an input, the learner produces a probability distribution over the space of outputs. Sequences are composed of tokens from a fixed vocabulary.
The problem of calculating the description length is cast as a problem of transferring outputs one-by-one over a channel in a compressed form, where a sender and a receiver support identical instances of the learner. At each step, the sender and the receiver update their copies of the learner using the outputs transmitted so far, including the newly transmitted one. Next, the sender uses its updated copy to compress and transmit the next output. The very first output can be sent with a naïve encoding, using not more than a constant number of nats. (As we are interested in comparing candidate generalization rules, the value of this additive constant is not important, as it is learner- and candidate-independent. In experiments, we subtract it from all measurements.) The cumulative number of nats transmitted is:
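$$L_{\mathcal{M}}(D) = -\sum_{t=2}^{k} \log p_{\mathcal{M}_{t-1}}(y_t \mid x_t) + c, \qquad (1)$$
where $\mathcal{M}_{t-1}$ denotes the learner updated on the first $t-1$ examples, $(x_t, y_t)$ is the $t$-th input/output pair of the dataset $D$ of size $k$, and $c$ is the cost of the naïve encoding of the first output.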
The obtained code length of Eq. 1 depends on the order in which the outputs are transmitted and the procedure we use to update the learner. To account for that, we average out the data order by training with multiple random seeds. Further, for larger datasets, full retraining after adding a new example is impractical and, in such cases, examples can be transmitted in blocks.
If we measure the description length of the training data shuffled with the holdout data, both datasets would have symmetric roles. However, there is a certain asymmetry in the extrapolation problem: we are looking for an extrapolation from the training data, not vice versa. To break this symmetry, we always transmit outputs for the entire training data as the first block. Note that if we consider a case where we first transmit the training outputs as the first block and all of the holdout outputs under a candidate rule as the second block, then the description length reduces to measuring the cross-entropy of the learner on the holdout data. In this case, we recover a process akin to the “intuitive” measuring of inductive biases. However, in our case, as described in Eq. 1, description length also captures whether a learner is capable of finding regularity in the holdout data fast, with few data points; hence it also represents the speed-of-learning ideas for measuring inductive biases (Chaabouni et al., 2019).
Apart from being connected to the existing approaches, description length, as a measure of inductive biases, has two attractive properties: (a) it is domain-independent, i.e. can be applied e.g. in the image domain, and (b) it allows comparisons across models that account for model complexity. However, it requires the learner to have a probabilistic output.
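The block-wise transmission described above can be sketched as follows. This is an illustration under strong assumptions: `ToyLearner` is a hypothetical stand-in (an add-one-smoothed distribution over whole outputs that ignores the input), not a seq2seq model; "retraining from the same initialization" is mimicked by recomputing its counts from scratch; and the cost of the first block (the training data) is omitted, mirroring the subtraction of the additive constant.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (input, output) pair


class ToyLearner:
    """Hypothetical learner: smoothed frequencies of whole outputs."""

    def __init__(self, output_space: Sequence[str]):
        self.output_space = list(output_space)
        self.counts: Counter = Counter()

    def fit(self, data: Sequence[Example]) -> None:
        # Retrain from scratch on everything transmitted so far.
        self.counts = Counter(y for _, y in data)

    def log_prob(self, x: str, y: str) -> float:
        total = sum(self.counts.values()) + len(self.output_space)
        return math.log((self.counts[y] + 1) / total)


def description_length(learner: ToyLearner, blocks: List[List[Example]]) -> float:
    """Nats needed to send the outputs of blocks[1:], given blocks[0]."""
    seen: List[Example] = list(blocks[0])  # training data goes first
    nats = 0.0
    for block in blocks[1:]:
        learner.fit(seen)  # sender and receiver hold identical learners
        nats -= sum(learner.log_prob(x, y) for x, y in block)
        seen.extend(block)  # the block is now known to both parties
    return nats


# Hypothetical data: one training example, two holdout inputs per candidate.
train = [("aa", "bb")]
mem_blocks = [train, [("aaa", "bb")], [("a", "bb")]]
count_blocks = [train, [("aaa", "bbb")], [("a", "b")]]
space = ["b", "bb", "bbb"]
dl_mem = description_length(ToyLearner(space), mem_blocks)
dl_count = description_length(ToyLearner(space), count_blocks)
```

Because this toy learner can only repeat outputs it has already seen, the memorization candidate is cheaper to transmit (`dl_mem < dl_count`), which is the sense in which a shorter description length signals a bias.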
3 Tasks
We describe the four tasks that we use to study inductive biases of seq2seq learners. We select those tasks to cover different aspects of learners’ behavior. For each task, we investigate learners’ generalization when trained on a highly ambiguous training set. In such a scenario, there are infinitely many consistent generalization rules (as described in our motivating example of Section 1). We, however, pre-select only several candidate rules highlighting biases that are useful for language processing and known to exist in humans, or are otherwise reasonable. Remarkably, our experiments show that these rules cover many cases of the actual learners’ behavior.
The first two tasks study biases in arithmetic reasoning: Count-or-Memorization quantifies learners’ preference for counting vs. a simple memorization and Add-or-Multiply further probes the learners’ preferences in arithmetic operations. We believe these tasks are interesting, as counting is needed in some NLP problems like processing linearized parse trees (Weiss et al., 2018). The third task, Hierarchical-or-Linear, contrasts hierarchical and linear inductive reasoning. The hierarchical reasoning bias is believed to be fundamental in learning some syntactical rules in human acquisition of syntax (Chomsky, 1965, 1980). Finally, we investigate biases for systematic compositionality, which is central for human generalization capabilities in language. For that, we use a well-known benchmark, SCAN (Lake and Baroni, 2017).
By we denote a sequence that contains token repeated times. For training, we represent sequences in a standard way for seq2seq models: the tokens are one-hot-encoded separately, and we append a special end-of-sequence token to each sequence. Input and output vocabularies are disjoint.
Count-or-Memorization: In this task, we contrast learners’ preferences for counting vs. memorization. We train models to fit a single training example with input and output (i.e., to perform the mapping ) and test them on with . If a learner learns the constant function, outputting independently of its inputs, then it follows the mem strategy. On the other hand, if it generalizes to the mapping, then the learner is biased toward the count strategy.
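The two candidate rules of this task can be sketched as a small data generator. Several details are assumptions made for illustration only: the token names `a` and `b`, the training mapping from `l` input tokens to `l` output tokens, and the choice of test lengths, which is passed in as a parameter rather than fixed.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def repeat(token: str, n: int) -> str:
    """Space-separated sequence with `token` repeated n times."""
    return " ".join([token] * n)


def count_or_memorization(
    l: int, test_lengths: Sequence[int]
) -> Tuple[Pair, List[Pair], List[Pair]]:
    """Return the training pair and the holdout pairs under mem and count."""
    train = (repeat("a", l), repeat("b", l))  # assumed training mapping
    # mem: always emit the memorized training output.
    mem = [(repeat("a", m), repeat("b", l)) for m in test_lengths]
    # count: emit as many output tokens as there are input tokens.
    count = [(repeat("a", m), repeat("b", m)) for m in test_lengths]
    return train, mem, count


train, mem, count = count_or_memorization(3, [2, 4])
```

Both candidates reproduce the training pair when the test length equals `l`, so the training data alone cannot tell them apart.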
Add-or-Multiply: This task is akin to the motivating example in Section 1. The single training example represents a mapping of an input string to an output string . As test inputs, we generate for in the interval . We consider the learned rule to be consistent with mul if for all , the input/output pairs are consistent with . Similarly, if they are consistent with , we say that the learner follows the addition rule, add. Finally, the learner can learn a constant mapping for any . Again, we call this rule mem.
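A sketch of the three candidate rules, under assumptions that are ours and serve only as an illustration: the training example maps `l` input tokens to `2 * l` output tokens, mul doubles the input length, add adds `l` to it, and mem always emits `2 * l` tokens. The token names and the test lengths are hypothetical as well.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def repeat(token: str, n: int) -> str:
    """Space-separated sequence with `token` repeated n times."""
    return " ".join([token] * n)


def add_or_multiply(
    l: int, test_lengths: Sequence[int]
) -> Tuple[Pair, Dict[str, List[Pair]]]:
    """Return the training pair and holdout pairs under mem, add, and mul."""
    train = (repeat("a", l), repeat("b", 2 * l))  # assumed training mapping
    rules: Dict[str, Callable[[int], int]] = {
        "mem": lambda m: 2 * l,  # constant output length
        "add": lambda m: m + l,  # input length plus l
        "mul": lambda m: 2 * m,  # input length times 2
    }
    candidates = {
        name: [(repeat("a", m), repeat("b", rule(m))) for m in test_lengths]
        for name, rule in rules.items()
    }
    return train, candidates


train, candidates = add_or_multiply(3, [3, 4])
output_lengths = {
    name: len(pairs[1][1].split()) for name, pairs in candidates.items()
}
```

With `l = 3`, all three rules reproduce the training pair, while on an input of length 4 they prescribe outputs of length 6, 7, and 8, respectively.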
Hierarchical-or-Linear: For a fixed depth , we train learners on four training examples where . (This mapping then consists of four combinations: ; ; and .) Each training example has a nested structure, where defines its depth. A learner with a hierarchical bias (hierar) would output the middle symbol. We also consider the linear rule (linear) in which the learner outputs the + symbol of its input.
To probe learners’ biases, we test them on inputs with different depths . Note that to examine the linear rule (i.e. if the learner outputs the + symbol of any test input of depth ), we need .
Similar to the previous tasks, there is no vocabulary sharing between a model’s inputs and outputs (input and output tokens and are different).
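The contrast between the two rules can be sketched as follows. The concrete input shape is an assumption made for illustration: an outer symbol repeated `depth` times, a middle symbol, and the outer symbol repeated `depth` times again, with both symbols drawn from `a` and `b`. The linear rule is read here as emitting the symbol at the fixed position where the middle symbol sits in the training examples. The output vocabulary `OUT` and the depth value are hypothetical.

```python
from itertools import product
from typing import List

OUT = {"a": "A", "b": "B"}  # hypothetical disjoint output vocabulary


def make_input(outer: str, middle: str, depth: int) -> List[str]:
    """Nested input: outer^depth, middle, outer^depth (assumed shape)."""
    return [outer] * depth + [middle] + [outer] * depth


def hierar(tokens: List[str]) -> str:
    """Hierarchical rule: emit the middle symbol, whatever the depth."""
    return OUT[tokens[len(tokens) // 2]]


def linear(tokens: List[str], train_depth: int) -> str:
    """Linear rule: emit the symbol at the training-time middle position."""
    return OUT[tokens[train_depth]]  # 0-based index of that position


train_depth = 4  # illustrative value only
train = [make_input(x, y, train_depth) for x, y in product("ab", repeat=2)]
deeper = make_input("a", "b", train_depth + 1)
```

On the four training inputs the rules coincide; on the deeper input `hierar` emits `B` while `linear` emits `A`, so such inputs disambiguate the two generalizations.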
SCAN: SCAN (Lake and Baroni, 2017) is a dataset and a set of benchmarks used for studying systematic generalization of seq2seq learners (available at https://github.com/brendenlake/SCAN). In SCAN, inputs are sequences that represent trajectories and outputs are stepwise instructions for following these trajectories (see Figure 1). We experiment with the SCAN-jump split of the dataset, where the test set (7706 examples) is obtained by filtering all compositional uses of one of the primitives, jump. The train set (14670 examples) contains all uses of all other primitives (including compositional), and lines where jump occurs in isolation. This makes the training data ambiguous only for the jump instruction. However, learners can transfer compositional rules across the primitives. We refer to this split simply as SCAN.
We consider two interpretable generalizations: (1) all input sequences with jump are mapped to a single instruction JUMP as memorized from the training examples, or (2) they are mapped according to the underlying compositional grammar. We call them mem and comp, respectively. We used the original test set as a representative of the comp candidate explanation and generated one for mem ourselves.
Figure 1: Examples of SCAN inputs (left) and outputs (right) (Lake and Baroni, 2017).
jump → JUMP
jump around right → RTURN JUMP RTURN JUMP RTURN JUMP RTURN JUMP
turn left twice → LTURN LTURN
jump opposite left after walk around left → LTURN WALK LTURN WALK LTURN WALK LTURN WALK LTURN LTURN JUMP
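One way a reference set for the mem candidate could be produced is sketched below. This is an assumption about the procedure, not a description of how the mem test set was actually generated: every test command containing the primitive jump is paired with the single memorized instruction JUMP. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def mem_candidate(test_inputs: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair every command that contains `jump` with the output JUMP."""
    pairs = []
    for command in test_inputs:
        if "jump" not in command.split():
            raise ValueError(f"command without jump: {command!r}")
        pairs.append((command, "JUMP"))
    return pairs


mem_pairs = mem_candidate(
    ["jump around right", "jump opposite left after walk around left"]
)
```

The comp counterpart needs no generation step, since the original test set already follows the compositional grammar.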
4 Methodology
4.1 Sequence-to-sequence learners
We experiment with three standard seq2seq models: LSTM-based seq2seq (LSTM-s2s) (Sutskever et al., 2014), CNN-based seq2seq (CNN-s2s) (Gehring et al., 2017), and Transformer (Vaswani et al., 2017). All share a similar Encoder/Decoder architecture (Sutskever et al., 2014). In the main text, we report experiments using models with comparable hyperparameter values. We studied, however, the effect of different hyperparameters in Supplementary, and found that their variations have only a small impact on learners’ biases (see Supplementary for more details).
LSTM-s2s Both Encoder and Decoder are implemented as LSTM cells (Hochreiter and Schmidhuber, 1997). Encoder encodes its inputs incrementally from left to right. We experiment with architectures without (LSTM-s2s no att.) and with (LSTM-s2s att.) an attention mechanism (Bahdanau et al., 2014). For the first three tasks, both Encoder and Decoder are single-layer LSTM cells with hidden size of 512 and embedding of dimension 16.
For SCAN, we chose the architecture used in Lake and Baroni (2017). In particular, Encoder and Decoder are two hidden layers with 200 units and embedding of dimension 32. This architecture gets 0.98 test accuracy on the i.i.d. (simple) split of SCAN, averaged over 10 seeds.
CNN-s2s Encoder and Decoder are convolutional networks (LeCun et al., 1990), followed by GLU non-linearities (Dauphin et al., 2017) and an attention layer. To represent positions of input tokens, CNN-s2s uses learned positional embeddings. For the first three tasks, both convolutional networks have one layer with 512 filters and a kernel width of 3. We set the embedding size to 16.
For SCAN, we use one of the successful CNN-s2s models of Dessì and Baroni (2019) that has 5 layers and embedding size of 128. We also vary kernel size in as it was found to impact the performance on SCAN (Dessì and Baroni, 2019). These architectures reach test accuracy above 0.98 on the i.i.d. split of SCAN, averaged over 10 seeds.
Transformer Encoder and Decoder are implemented as a sequence of (self-)attention and feed-forward layers. We use sinusoidal position embedding. When considering the first three tasks, both Encoder and Decoder contain one transformer layer. The attention modules have 8 heads, feed-forward layers have dimension of 512 and the embedding is of dimension 16.
We believe our work is the first study of Transformer’s performance on SCAN. For both Encoder and Decoder, we use 8 attention heads, 4 layers, embedding size of 64, FFN layer dimension of 256. This architecture gets 0.94 test accuracy on the i.i.d. split of SCAN, averaged over 10 seeds.
4.2 Training and evaluation
For all tasks except SCAN, we follow the same training procedure. We train with Adam optimizer (Kingma and Ba, 2014) for epochs. The learning rate starts at and increases for the first warmup updates till reaching . We include all available examples in a single batch. We use teacher forcing (Goodfellow et al., 2016). We set the dropout probability to 0.5 (we report experiments with other values in Supplementary). For each learner, we perform training and evaluation 100 times, changing random seeds. When generating sequences to calculate FPA, we select the next token greedily. We use the model implementations from fairseq (Ott et al., 2019).
As discussed in Section 2, when calculating description length, we use the training examples as the first transmitted block. In Count-or-Memorization and Add-or-Multiply this block contains one example, and in Hierarchical-or-Linear it has 4 examples. Next, we transmit examples obtained from the candidate rules in a randomized order, by blocks of size 1, 1, and 4 for Count-or-Memorization, Add-or-Multiply, and Hierarchical-or-Linear respectively. At each step, the learner is retrained from the same initialization, using the procedure and hyperparameters as discussed above.
For SCAN, we follow the same scenario as above with two differences: (a) when calculating description length, we use blocks of size 1024, (b) during training, we sample batches with replacement, similarly to Lake and Baroni (2017). We use batches of size 16 for LSTM-s2s and CNN-s2s, and 256 for Transformer. We repeat the training/evaluation of each learner 10 times, varying the random seed. Again, we use Adam optimizer and a learning rate of .
5 Experiments
Count-or-Memorization We investigate here learners’ biases toward count and mem rules. We provide a single example as the training set, varying its length. We report the learners’ performances in Table 1(a). We observe that, independently of the length of the training example, CNN-s2s and Transformer learners inferred perfectly the mem rule, with an FPA for mem above 0.90 (i.e. more than 90% of the random seeds output the memorized training output for any given input).
However, LSTM-based learners demonstrate a more complex behavior. With , both learners (with and without attention) exhibit a preference for mem. Indeed, while these learners rarely generalize perfectly to any of the hypotheses ( FPA (no att.), / FPA for mem/count (att.)), they have a significantly lower description length for mem. As the training example gets longer, LSTM-based learners become more biased toward count. Surprisingly, for , most learner instances show sufficiently strong inductive biases to infer perfectly the non-trivial count hypothesis. With , of random seeds of LSTM-s2s att. and all (100%) of LSTM-s2s no att. seeds generalized perfectly to count.
Further, we see that while description length shows similar trends, it has a higher sensitivity. For example, while both LSTM-based learners have a similar FPA with , description length demonstrates that LSTM-s2s no att. has a stronger count bias.
Add-or-Multiply In this task, we examine learners’ generalization after training on the single example . We vary . In Table 1(b), we report FPA and description length for the three generalization hypotheses, add, mul, and mem. We observe, similarly to the previous task, that CNN-s2s and Transformer learners always converge perfectly to memorization.
In contrast, LSTM-based learners show non-trivial generalizations. Examining first LSTM-s2s att., when =, we note that mem has a high FPA and a description length considerably lower than the others. This is consistent with the learner’s behavior in the Count-or-Memorization task. As we increase , more interesting behavior emerges. First, mem decreases as increases. Second, mul-type preference increases with . Finally, add presents a U-shaped function of . That is, for the medium example length , the majority of learners switch to approximating the add rule (for ). However, when grows further, a considerable fraction of these learners start to follow a mul-type rule. Strikingly, of LSTM-s2s att. seeds generalized perfectly to the non-trivial mul rule. As for LSTM-s2s no att., we do not observe a strong bias to infer any of the rules when =. However, when increasing , the LSTM-s2s no att. behaves similarly to LSTM-s2s att.: at first it has a preference for add (FPA for add of 0.95, for =), then for mul (e.g. FPA for mul of 0.94, for =).
Hierarchical-or-Linear We look now at learners’ preference for either hierar or linear generalizations. The architectures we use were only able to consistently learn the training examples with the depth not higher than . Hence, in this experiment, we set to .
We report in Table 1(c) the learners’ FPA and description length. We observe that CNN-s2s exhibits a strikingly different bias compared to all other learners, with a perfect agreement with the linear rule. In contrast, Transformer learners show a clear preference for hierar with a high FPA (0.69) and a low description length (1.21). Surprisingly, this preference increases with the embedding size and Transformers with embedding size admit an FPA for hierar of 1.00 (see Supplementary for more details). LSTM-s2s att. learners also demonstrate a similar preference for hierar, with an FPA of 0.30 and a considerably lower description length for hierar than for linear. Finally, while only of LSTM-s2s no att. instances generalized to perfect hierar (and none to linear), description length confirms their considerable preference for the hierar hypothesis.




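[Table 1: FPA and description length for (a) Count-or-Memorization, (b) Add-or-Multiply, and (c) Hierarchical-or-Linear; per-sequence accuracy and description length for (d) SCAN. Significance is assessed with a paired t-test.]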
SCAN Next, we turn to a larger-scale SCAN task. Due to the large number of test examples and low performance of the learners, all FPA scores would be equal to zero. Hence, we follow Lake and Baroni (2017) and use per-sequence accuracy. We report our results in Table 1(d). First, we see that CNN-s2s learners have a strong preference for comp, both in accuracy and description length. Furthermore, we observe that with an increase in the kernel size, description length of mem increases, while description length of comp decreases, indicating that the preference for comp over mem grows with the kernel width. We believe this echoes the findings of Dessì and Baroni (2019).
While accuracy is well below for all other learners/candidate combinations (and rounded to 0.00), according to the description length, Transformer and LSTM-based learners also have a preference for comp over mem. This can be due to the transfer from compositional training examples, which can make the comp explanation the most “simple” given the dataset. Hence, the failure of systematic generalization in SCAN comes not from learners’ preferences for mem. We believe this resonates with the initial qualitative analysis of Lake and Baroni (2017).
Overall, across all the above experiments, we see that seq2seq learners demonstrate strikingly different biases. In many cases, these biases lead to non-trivial generalizations when facing ambiguity in the training data. This spans tasks that probe for memorization, arithmetic, hierarchical, and compositional reasoning. We found that a single example is sufficient for LSTM-based learners to learn counting, addition, and multiplication. Moreover, within the same task, they can switch from one explanation to another, depending on the training example length, with Add-or-Multiply being the task where this switch happens twice. In contrast, CNN-s2s and Transformers show a strong bias toward memorization (in Supplementary, we verify that such preferences remain when studying biases toward multiplication by 3). Furthermore, all learners except for CNN-s2s demonstrate a strong bias toward the hierarchical behavior. In the task of compositional generalization (SCAN), our results not only confirm earlier results, but also indicate that, actually, the standard seq2seq learners prefer compositional generalization over memorization.
We see that the conclusions derived from comparing the description length of the candidate rules are in agreement with the results under accuracy-based metrics, but provide a more nuanced picture.
6 Background
There have been dramatic advances in language processing using seq2seq models. Yet, these models have long been criticized for requiring a tremendous amount of data with, in the end, a lack of systematic generalization (Lake and Baroni, 2017; Dupoux, 2018; Loula et al., 2018; Bastings et al., 2018). In contrast, humans rely on their inductive biases to generalize from a limited amount of data (Chomsky, 1965; Lake et al., 2019). Because of the centrality of humans’ biases in language learning, several works have studied seq2seq inductive biases, connecting their poor generalization to their lack of the “right” biases (Lake and Baroni, 2017; Lake et al., 2019; McCoy et al., 2020). However, these works have mostly relied on complex tasks that assume the knowledge of different factors of language, such as semantics. This makes it harder to connect the failures of seq2seq learners with their biases. We differ from this line of work by investigating seq2seq biases in a more focused setup. Our approach follows the ideas of Zhang et al. (2019)’s work in the vision domain, but, in contrast, we study seq2seq learners and consider inductive biases that are crucial for language learning in humans. We demonstrate that, when considering our proposed tasks, some seq2seq learners show strong human-like biases and generalize perfectly to behaviors useful for language learning.
Another line of research theoretically investigates learners’ capabilities, that is, the classes of hypotheses that a learner can discover (Siegelmann and Sontag, 1992; Weiss et al., 2018; Merrill et al., 2020). For example, Weiss et al. (2018) demonstrated that LSTM cells can count (unlike, e.g. GRU cells). In turn, we demonstrate that LSTM-based seq2seq learners are not only capable but also biased toward arithmetic behavior.
7 Discussion and Conclusion
In this work, we studied inductive biases of standard seq2seq learners: Transformer-, LSTM-, and CNN-based. To do so, we used one well-known task and introduced three new ones, which allowed us to cover an interesting spectrum of behaviors useful for language learning. In particular, we considered arithmetic, hierarchical, and compositional “reasoning”. Next, we connected the problem of finding and measuring inductive biases to Solomonoff’s theory of induction and proposed to use a dataset’s description length under a learner as a tool for sensitive measurement of inductive biases.
In our experiments, we found that the seq2seq learners have strikingly different inductive biases and some of them generalize non-trivially when facing ambiguity. For instance, a single training example is sufficient for LSTM-based learners to learn perfectly how to count, to add and to multiply by a constant. Transformers and, to a lesser degree, LSTM-s2s demonstrated preferences for the hierarchical bias, a bias that has been argued to govern children’s acquisition of syntax. Interestingly, such biases arose with no explicit wiring for them. Our results thus support Elman et al. (1998)’s theory, which states that humans’ inductive biases can arise from low-level architectural constraints in the brain with no need for an explicit encoding of a linguistic structure. However, how the brain, or, more generally, a learner is wired to admit a specific inductive bias is still an important open question.
Across our experiments, we also observed that description length is consistent with “intuitive” measurements of inductive biases, and, at the same time, it turned out to be more sensitive. This also indicates that, in the presence of ambiguity in the training data, a learner is more likely to follow the alternative with the shorter description length (i.e. the simplest one) when applied to unseen data, showing consistency with the prescriptions of the theory of induction (Solomonoff, 1964). A similar simplicity preference is argued to play a role in human language acquisition (Perfors et al., 2011).
Our findings can provide guidance for architecture selection in low-data regimes, where inductive biases might have a higher influence on a model’s generalization performance. Large sparse datasets can also benefit from predictable behavior in few-shot scenarios akin to what we consider.
Our work paves the way for multiple future directions. It would be interesting to understand what drives the “switching” behavior in LSTM-s2s: why, as the training example gets longer, do learners switch from memorization-based to less trivial explanations? Why, in Add-or-Multiply, does LSTM-s2s switch from memorization to addition and then to multiplication? Is it some pressure for explanations with small coefficients or is it due to the “computation time” that longer sequences require?
Finally, our results demonstrate that relatively large deep learning models can generalize non-trivially from as little as one example, as long as the task is aligned with their inductive biases. We believe this should reinforce interest in future work on injecting useful inductive biases in our learners and, we hope, our findings and setup can provide a fertile ground for such work.
Acknowledgements
The authors are grateful to Marco Baroni, Emmanuel Dupoux, Emmanuel Chemla and participants of the EViL meeting for their feedback on our work.
References
 Towards a humanlike opendomain chatbot. arXiv preprint arXiv:2001.09977. Cited by: §1.
 Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §4.1.

Jump to better conclusions: SCAN both left and right.
In
EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
, Cited by: §6.  The description length of deep learning models. In NeurIPS, Cited by: Appendix A, §2.

Optimization methods for largescale machine learning
. Siam Review 60 (2), pp. 223–311. Cited by: Appendix D.  Wordorder biases in deepagent emergent communication. arXiv preprint arXiv:1905.12330. Cited by: §2.
 Aspects of the theory of syntax. Vol. 11, MIT press. Cited by: §1, §3, §6.
 Rules and representations: behavioral and brain sciences. New York: Harcourt Brace Jovanovich, Inc. Cited by: §1, §3.
 Language modeling with gated convolutional networks. In ICML, Cited by: §4.1.
 CNNs found to jump around more skillfully than rnns: compositional generalization in seq2seq convolutional networks. In ACL, Cited by: §4.1, §5.

Cognitive science in the era of artificial intelligence: a roadmap for reverseengineering the infant languagelearner
. Cognition 173, pp. 43–59. Cited by: §6.  Rethinking innateness: a connectionist perspective on development. Vol. 10, MIT press. Cited by: §7.
 Hierarchical neural story generation. In ACL, Cited by: §1.
 Learning inductive biases with simple neural networks. arXiv preprint arXiv:1802.02745. Cited by: §1.
 Convolutional sequence to sequence learning. In ICML, Cited by: §4.1.
 Deep learning. MIT press. Cited by: §4.2.
 Permutation equivariant models for compositional generalization in language. In ICLR, Cited by: §1.
 A tutorial introduction to the minimum description length principle. arXiv preprint math/0406077. Cited by: §1, §2.
 Long shortterm memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix D, §4.2.
 Generalization without systematicity: on the compositional skills of sequencetosequence recurrent networks. arXiv preprint arXiv:1711.00350. Cited by: §1, §1, §2, Figure 1, §3, §3, §4.1, §4.2, §5, §5, §6.
 Human fewshot learning of compositional instructions. arXiv preprint arXiv:1901.04587. Cited by: §6.
 Handwritten digit recognition with a backpropagation network. In NeurIPS, Cited by: §4.1.
 Rearranging the familiar: testing compositional generalization in recurrent networks. arXiv preprint arXiv:1807.07545. Cited by: §6.
 Information theory, inference and learning algorithms. Cambridge university press. Cited by: Appendix A.
 Does syntax need to grow on trees? sources of hierarchical inductive bias in sequencetosequence networks. arXiv preprint arXiv:2001.03632. Cited by: §1, §2, §6.
 A formal hierarchy of rnn architectures. External Links: 2004.08500 Cited by: §6.
 Fairseq: a fast, extensible toolkit for sequence modeling. In NAACLHLT 2019: Demonstrations, Cited by: §4.2.
 The learnability of abstract syntactic principles. Cognition 118 (3), pp. 306–338. Cited by: §7.
 Modeling by shortest data description. Automatica 14 (5), pp. 465–471. Cited by: §1, §2.
 Cognitive psychology for deep neural networks: a shape bias case study. In ICML, Cited by: §1.
 On the computational power of neural nets. In COLT, Cited by: §6.
 A formal theory of inductive inference. part i. Information and control 7 (1), pp. 1–22. Cited by: §1, §2, §7.
 Sequence to sequence learning with neural networks. In NeurIPS, Cited by: §1, §4.1.
 Attention is all you need. In NeurIPS, Cited by: §4.1.
 On the practical computational power of finite precision rnns for language recognition. arXiv preprint arXiv:1805.04908. Cited by: §3, §6.
 Learning the Dyck language with attention-based Seq2Seq models. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Cited by: §1.
 Identity crisis: memorization and generalization under extreme overparameterization. arXiv preprint arXiv:1902.04698. Cited by: §1, §2, §6.
Appendix A Computing description length
We generally follow the description of [Blier and Ollivier, 2018]. The problem of calculating the description length is treated as a problem of transferring outputs one by one, in a compressed form, between two parties, Alice (sender) and Bob (receiver). Alice has the entire dataset , while Bob only has inputs . Before the transmission starts, both parties agree on the initialization of the model , the order of the inputs , the random seeds, and the details of the learning procedure. Outputs are sequences of tokens from a vocabulary .
The very first output can be sent using not more than nats, using a naïve encoding. After that, both Alice and Bob update their learners using the example , available to both of them, and get identical instances of .
Further transfer is done iteratively under the invariant that both Alice and Bob start every step with exactly the same learners and finish with identical . At step Alice would use to encode the next output . This can be done using nats [MacKay, 2003]. Since Bob has exactly the same model, he can decode the message to obtain and use the new pair to update his model and get . Alice also updates her model and proceeds to send the next output (if any), encoding it with the help of .
Overall, this procedure gives us Eq. 1 from Section 2 of the main text:
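The transmission procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the training setup used in the experiments: the first output is charged a naïve cost of (number of tokens) × log(vocabulary size) nats, every later output is charged its negative log-probability under the learner trained on the preceding pairs, and the hypothetical `UnigramLearner` (an add-one smoothed unigram model that ignores the input) merely stands in for a seq2seq learner.

```python
import math
from collections import Counter


class UnigramLearner:
    """Hypothetical stand-in for a seq2seq learner.

    It ignores the input and models output tokens with an
    add-one smoothed unigram distribution.
    """

    def __init__(self, vocab):
        self.vocab = list(vocab)
        self.counts = Counter()

    def log_prob(self, x, y):
        # Natural-log probability of the output token sequence y.
        total = sum(self.counts.values()) + len(self.vocab)
        return sum(math.log((self.counts[tok] + 1) / total) for tok in y)

    def update(self, x, y):
        # "Training" on the newly shared pair (x, y).
        self.counts.update(y)


def online_description_length(pairs, learner, vocab_size):
    """Cost, in nats, of sending the outputs one by one (assumed form)."""
    if not pairs:
        raise ValueError("need at least one (input, output) pair")
    first_x, first_y = pairs[0]
    # Naive encoding of the first output: log|V| nats per token.
    total = len(first_y) * math.log(vocab_size)
    learner.update(first_x, first_y)
    for x, y in pairs[1:]:
        # Sender and receiver hold identical learners at this point,
        # so y can be coded with -log p(y | x) nats.
        total += -learner.log_prob(x, y)
        learner.update(x, y)
    return total


if __name__ == "__main__":
    vocab = ["a", "b"]
    data = [("a", ["b", "b"]), ("aa", ["b", "b"])]
    print(online_description_length(data, UnigramLearner(vocab), len(vocab)))
```

A dataset that the learner predicts well after few examples yields a small total, which is why a lower description length is read as a stronger preference for the corresponding rule.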
Appendix B Can seq2seq learners multiply by 3?
In the experiments on the Multiply-or-Add task, reported in the main text, we saw that LSTM-s2s learners are able to learn to multiply by 2 from a single example. Moreover, whether the learner prefers the additive or the multiplicative “explanation” depends on the length of the training example. A natural further question is whether these learners can learn to multiply by larger numbers, and what governs their switching behavior. We believe answering these questions is a promising direction for further work. Here we only provide a preliminary study, in the hope of inspiring more focused studies.
To do so, we build a task that is similar to Multiply-or-Add, but centered around multiplication by 3 instead of 2. The single training example represents a mapping of an input string to an output string . As test inputs, we use with coming from an interval .
Since can be represented as a combination of addition and multiplication in several ways (), we have more candidate generalizations than for Multiply-or-Add. In particular, we consider different candidate rules. As before, by mem we denote the constant mapping from . mul1 represents the mapping . mul2 corresponds to and mul3 denotes . The “explanation” mul1 is akin to the add rule in the Multiply-or-Add task. We use the same hyperparameters and training procedure described in Section 4 of the main text.
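To make the ambiguity of the single training example concrete, the sketch below enumerates four candidate rules over output lengths. The exact functional forms are assumptions made for illustration: with a training input of length l mapped to an output of length 3l, mem is taken to return 3l for any input, mul1 to return m + 2l, mul2 to return 2m + l, and mul3 to return 3m for a test input of length m. The helper name `make_rules` is hypothetical.

```python
def make_rules(l):
    """Candidate generalizations after seeing one example: length l -> length 3l.

    The forms below are assumed for illustration; each maps a test input
    length m to a predicted output length.
    """
    return {
        "mem": lambda m: 3 * l,       # repeat the memorized training output
        "mul1": lambda m: m + 2 * l,  # purely additive, akin to the add rule
        "mul2": lambda m: 2 * m + l,  # mixed: multiply by 2, then add l
        "mul3": lambda m: 3 * m,      # purely multiplicative
    }


if __name__ == "__main__":
    l = 10
    rules = make_rules(l)
    # All rules agree on the training input, so the example cannot tell them apart.
    assert all(rule(l) == 3 * l for rule in rules.values())
    # On unseen lengths the rules diverge, which is what the test inputs probe.
    for m in (l - 2, l + 2):
        print(m, {name: rule(m) for name, rule in rules.items()})
```

Under these assumed forms, every rule reproduces the training pair exactly, so any systematic preference among them on held-out lengths has to come from the learner rather than from the data.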
We report the results in Table 2. As in the Multiply-or-Add task, both CNN-s2s and Transformer learners show a strong preference for the mem rule, while LSTM-based learners switch their generalization according to the length of the training example. Indeed, for CNN-s2s and Transformer, FPA_mem is at least 0.97 for every training length, and the description length of mem is significantly lower than that of the other rules. LSTM-s2s att. learners start by inferring the mem rule at length 5 (FPA=0.64, description length 2.44 nats), then show a comparable preference for mul2 and mul3 at length 10, and finally show a significant bias toward the mul3 hypothesis at lengths 15 and 20 (e.g. FPA_mul3=0.76 at length 15). LSTM-s2s no att. learners are subject to a similar switch of preference. At length 5, these learners have a significant bias toward mul1 (FPA=0.49). Strikingly, at length 10, most LSTM-s2s no att. learners (FPA=0.92) inferred the mul2 rule perfectly after one training example. Lastly, we again observe a switch, this time toward approximating mul3, at lengths 15 and 20.
In sum, while CNN-s2s and Transformer learners show a significant and robust bias toward mem, LSTM-based learners generalize differently depending on the training input length. In particular, our results suggest that these learners avoid adding very large integers by switching to the multiplicative explanation in those cases. Answering the initial question of this section, we see that LSTM-based learners can learn to multiply by 3 from a single example.
                             FPA                          Description length, nats
Learner           Length     mul1   mul2   mul3   mem     mul1     mul2     mul3     mem
LSTM-s2s no att.  20         0.00   0.01   0.78   0.00    52.27    22.20    1.17     77.75
                  15         0.00   0.13   0.45   0.00    40.46    13.22    6.14     66.10
                  10         0.00   0.92   0.00   0.00    26.48    0.65     22.26    53.81
                  5          0.49   0.00   0.00   0.00    1.97     26.50    54.97    36.13
LSTM-s2s att.     20 [6]     0.00   0.19   0.62   0.00    36.76    20.35    9.35     49.34
                  15 [7]     0.00   0.14   0.76   0.00    37.84    18.42    5.53     56.43
                  10         0.02   0.45   0.49   0.00    29.83    11.67    8.96     45.47
                  5          0.01   0.03   0.00   0.64    32.97    48.26    60.38    2.44
CNN-s2s           20         0.00   0.00   0.00   0.99    263.82   262.01   261.71   0.02
                  15         0.00   0.00   0.00   1.00    250.97   253.32   253.08   0.00
                  10         0.00   0.00   0.00   1.00    243.17   245.16   248.28   0.00
                  5          0.00   0.00   0.00   1.00    258.10   257.79   264.06   0.00
Transformer       20         0.00   0.00   0.00   0.97    37.90    51.47    57.57    5.31
                  15         0.00   0.00   0.00   1.00    40.36    51.62    57.42    2.50
                  10         0.00   0.00   0.00   1.00    38.05    49.88    55.61    2.47
                  5          0.00   0.00   0.00   1.00    37.96    51.83    60.19    0.74
[6] Only 21% of learners succeeded in learning the training example in this setting.
[7] Only 51% of learners succeeded in learning the training example in this setting.
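The FPA columns above can be read with the following sketch in mind. It assumes that FPA is the fraction of random seeds whose outputs agree with a candidate rule on every held-out test input; the function name `fpa` and the toy predictions are hypothetical and do not come from the experiments.

```python
def fpa(predictions_per_seed, rule_outputs):
    """Fraction of seeds that agree with a rule on ALL test inputs (assumed definition).

    predictions_per_seed: one list of predicted outputs per trained seed.
    rule_outputs: the outputs the candidate rule prescribes for the same inputs.
    """
    if not predictions_per_seed:
        raise ValueError("need predictions from at least one seed")
    perfect = sum(1 for preds in predictions_per_seed if list(preds) == list(rule_outputs))
    return perfect / len(predictions_per_seed)


if __name__ == "__main__":
    # Toy example: output lengths predicted by four seeds on two test inputs.
    seeds = [[30, 36], [30, 30], [30, 36], [31, 36]]
    mul3_outputs = [30, 36]
    print(fpa(seeds, mul3_outputs))
```

Because a seed counts only if it matches the rule on every test input, FPA values across competing rules need not sum to one: a learner that follows no candidate rule exactly contributes to none of them.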
Appendix C Robustness to changes in architecture
In this section, we examine how changing different hyperparameters affects learners’ preferences for memorization, arithmetic, and hierarchical reasoning. In particular, we vary the number of layers, hidden and embedding sizes of the different learners and test their generalization on Count-or-Memorization, Add-or-Multiply and Hierarchical-or-Linear tasks.[8]
[8] In the main text, description length reported for the Hierarchical-or-Linear task (Table 1c) is normalized by the length of the block (i.e., constant 4). Throughout Supplementary, we do not apply such normalization. Since we always only compare biases with a learner and a dataset fixed, this should not add any confusion. Upon acceptance, we will remove this scaling from Table 1c.
In all these experiments, we fix the length of the training examples. Concretely, we fix and for Count-or-Memorization and Add-or-Multiply respectively, and for Hierarchical-or-Linear.
Finally, we keep the same training and evaluation procedure as detailed in Section 4.2 of the main text. However, we use 20 different random seeds instead of 100.
C.1 Number of hidden layers
We experiment with the standard seq2seq learners described in the main paper and vary the number of layers . Results are reported in Table 3.
First, when looking at the interplay between mem and count (Table 3(a)), we observe that, independently of , more than of CNN-s2s and Transformer learners inferred perfectly the mem rule (i.e. output for any given input ). Further, although the preference for count decreases with the increase of , LSTM-s2s att. learners display in all cases a significant bias toward count, with a large FPA and a description length significantly lower than that of mem. However, we note a decline of the bias toward count when considering LSTM-s2s no att. learners. For , of the seeds generalize to perfect count, versus for . Note that this lower preference for count is accompanied by an increase of preference for mem. However, there is no significant switch of preferences according to .
Second, we consider the Add-or-Multiply task, where we examine the three generalization hypotheses add, mul and mem. Results are reported in Table 3(b). Similarly to the previous task, Transformer and CNN-s2s learners are robust to changes in the number of layers. They perfectly follow the mem rule with FPA_mem=1.00. However, LSTM-based learners show a more complex behavior: while single-layer LSTM-s2s att. and no att. demonstrate a considerable bias for mul (FPA_mul>0.94), this bias fades in favor of add and mem for larger architectures.
Finally, in Table 3(c), we observe that larger learners are slightly less biased toward hierar and linear. However, we do not observe any switch of preferences. That is, across different , CNN-s2s learners prefer the linear rule, whereas Transformers and, to a lesser degree, LSTM-based learners show a significant preference for the hierar rule.
In sum, we note little impact of on learners’ generalizations. Indeed, LSTM-based learners can learn to perform counting from a single training example, even when experimenting with 10-layer architectures. They also, for most tested , favor mul and add over mem. In contrast, Transformer and CNN-s2s perform systematic memorization of the single training example. Furthermore, independently of , Transformer and LSTM-based learners show a bias toward the hierar hypothesis over the linear one, while CNN-based learners do the opposite. Whether learners prefer one rule or another, these findings show strong inductive biases, as the training example(s) are highly ambiguous. Interestingly, these biases are barely influenced by the change of the number of layers.



C.2 Hidden size
We experimented in the main text with standard seq2seq learners when hidden size . In this section, we look at the effect of varying it in . We report learners’ performance in Table 4.
First, Table 4(a) shows how minor an effect hidden size has on learners’ counting/memorization performance. Indeed, for any given , between and of LSTM-based learners learn perfect counting after only one training example. Similarly, and with even lower variation, more than of Transformer and CNN-s2s learners memorize the single training example, outputting for any given .
The same observation can be made when studying the interplay between the hierar and linear biases. Concretely, Table 4(c) shows that learners’ generalizations are stable across values, with the exception of LSTM-s2s no att. learners. While the latter display a significant bias toward hierar for , we do not observe any significant difference between both rules for .
Finally, as demonstrated in Table 4(b), all Transformer and CNN-s2s learners perform perfect memorization when tested on the Add-or-Multiply task, independently of their hidden sizes. Both LSTM-based learners are significantly biased toward mul for . However, when experimenting with a smaller hidden size (128), we detect a switch of preference for LSTM-s2s no att. learners. The latter start approximating the add-type rule (with a significantly lower description length). Lastly, we do not distinguish any significant difference between add and mul for LSTM-s2s att. when .
Taken together, learners’ biases are quite robust to variations. However, we note a switch of preference from mul to add for LSTM-s2s no att. learners when decreasing . Furthermore, we see a loss of significant preference in three distinct settings.



C.3 Embedding size
We look here at the effect of the embedding size, , on learners’ generalizations. In particular, we vary . Results are reported in Table 5.
Across all subtables, we see little influence of on learners’ biases. For example, if we consider the Count-or-Memorization task when varying (see Table 5(a)), between and of LSTM-s2s no att. learners inferred perfectly the count hypothesis. More strikingly, between and of LSTM-s2s att. learners learned the count rule after one training example. The same trend is observed for the remaining learners and across the other tasks: Add-or-Multiply (Table 5(b)) and Hierarchical-or-Linear (Table 5(c)). Yet we still discern, in some cases, systematic but small effects of . First, the larger is, the lower FPA_mul of LSTM-s2s no att. learners is (from 0.94 for to 0.84 for ). However, LSTM-s2s no att. learners still have a considerable preference for mul for any tested . Second, we see an increase of Transformer’s preference for hierar with the increase of . Surprisingly, for , of Transformer learners generalize to the perfect hierar hypothesis.



In this section, we studied the impact of the number of layers, hidden and embedding sizes on learners’ generalizations. We found that, although these hyperparameters can in some cases influence the degree of a learner’s preference for a given rule, inductive biases are quite robust to their changes. In particular, among all tested combinations, we observe only cases of preference switch (out of ).
Appendix D Robustness to changes in training parameters
We examine here the effect of the training parameters on learners’ biases. As previously, we only consider the Count-or-Memorization task with , the Add-or-Multiply task with and the Hierarchical-or-Linear task with . We experiment with the architectures detailed in the main paper; however, we use 20 different random seeds instead of the 100 used in the main text.
We consider in this section two different hyperparameters: (1) the choice of the optimizer, and (2) the dropout probability.
Optimizer We experiment with replacing the Adam optimizer [Kingma and Ba, 2014] with SGD [Bottou et al., 2018]. We found experimentally that learners failed to learn the training examples consistently in most of the settings. Yet, when successful, they showed the same preferences. In particular, Transformer and CNN-s2s were the only learners that performed well on the Count-or-Memorization and Add-or-Multiply train sets (success rate higher than ). These learners showed a perfect generalization to the mem rule in both tasks.
Dropout We then examine how the dropout probability affects learners’ preferences. We use, as mentioned in the main paper, the Adam optimizer and vary the dropout probability . Results are reported in Table 6.
Both Count-or-Memorization (Table 6(a)) and Add-or-Multiply (Table 6(b)) tasks show the same trend. First, Transformer and CNN-s2s learners consistently prefer the mem rule. Second, when looking at LSTM-based learners, we distinguish a more complex behavior. For , LSTM-based learners show a significant preference for arithmetic reasoning (count for the Count-or-Memorization task and mul for the Add-or-Multiply task). However, when , we see different preferences. In particular, on Count-or-Memorization both LSTM-based learners show a preference for mem (not significant for LSTM-s2s att.); on Add-or-Multiply, LSTM-s2s no att. learners are significantly biased toward add, whereas LSTM-s2s att. learners do not show any significant bias, with a slight preference for mem. In sum, the lower is, the more likely learners will overfit the mem rule.
Finally, we consider the Hierarchical-or-Linear task (see Table 6(c)). We observe that, for any value, CNN-s2s and Transformer inductive biases remain the same. Indeed, CNN-s2s learners show a consistent preference for linear with FPA while Transformers prefer hierar (note that this preference is not very large for with an FPA_hierar of 0.05, compared to an FPA_hierar of 1.00 and 0.69 for and respectively). On the other hand, has a larger impact on LSTM-based learners. When , both LSTM-s2s learners prefer the hierar hypothesis. However, for , neither learner shows a significant preference for any of the rules (with FPA for both rules and close values).


