1 Introduction
Humans leverage compositional generalization to recombine familiar concepts to understand and create new things. We have used this ability since early civilization. For example, the Sphinx has the face of a human and the body of a lion (Figure 1(a)). We do not see such a living animal, but ancient people could create it and we can recognize it. This shows that we are able to recombine different parts of seen objects into an unseen object. The Sphinx actually also has the wings of an eagle, and the type of wings can be another part to combine. This means the number of combinations grows exponentially with the number of parts. So compositional generalization helps humans learn efficiently from a small amount of training data and generalize to many unseen combinations. We hope machines can also have this ability.
Different compositional generalization approaches have been investigated, such as architecture design [27, 14], independence assumption [15, 7], data augmentation [3, 1], causality [4, 8] [25], group theory [12] and meta-learning [20]. There are also general discussions [6, 13]. Compositional generalization has been applied in many areas, such as instruction learning [20], grounding [26], continual learning [17], question answering [2, 16, 19], reasoning [31], zero-shot learning [30] and language inference [11]. In this report, we focus on summarizing a series of our work for theoretical discussions [9, 10, 24]. Please refer to these papers for concrete examples [22, 23] and applications [21]. Please also find broad related work in the papers.
In this report, we first discuss concepts and properties related to compositional generalization. Based on them, we clarify the setting in our scope (Section 3.1). We then propose an approach with architecture design, training and inference. We also share conjectures to partially explain some human behaviors, such as system 1 and system 2 cognition. This report has three key points. First, what is compositional generalization? Second, what is the conditional independence property, and how does it help compositional generalization? Third, how do we control random variable information, and how does that enable the conditional independence property? We will explain them in the following sections and summarize them in the conclusion.
2 Concepts and properties
In this section, we first introduce compositional generalization and disentangled representation. We then discuss two questions about disentangled representation: subjectivity of components and conditional independence property.
2.1 Compositional generalization
We compare different types of generalization to describe compositional generalization. The images in Figure 2 show only input distributions, for explanation purposes. Conventionally, much machine learning research assumes that the training and test distributions are identical (Figure 2 left). This means a main problem of conventional generalization is to learn a model that works on the correct underlying distribution, and to use it at test time. In this case, the smoothness assumption is important for learning the model, so there exist general-purpose regularization techniques, such as $L_2$ regularization and dropout. Also, with more training data, we are likely to get better test performance.
Out-of-distribution (o.o.d.) generalization [6], however, has different training and test distributions (Figure 2 middle). We focus on the part of the test distribution manifold that is not in the training manifold, because the overlapping part is similar to conventional generalization. Defining the difference between the distributions requires both the training and the test distributions, so the training distribution alone carries no information about the difference. This means the distribution difference can only be given as prior knowledge during training, and this knowledge is not general but specific to the test distribution. So, in this case, more training data or general regularization does not directly help in learning the distribution difference. We are familiar with arguments from conventional generalization, but some of them may not directly apply to o.o.d. generalization.
Compositional generalization, a.k.a. systematic generalization, is a type of o.o.d. generalization. It involves multiple components, and the generalization requires recombining values of different components in a novel way. Each component value appears in training. In Figure 2 (right), a test sample is not in the training distribution, but when we decompose it into horizontal and vertical directions, the value of each component is in the training distribution, and we can combine these values to form the test sample. However, the components might be mixed together, and it is not straightforward to separate them. This means we do not know the horizontal and the vertical directions in such cases. When the representation has these orthogonal directions, we say it is a disentangled representation.
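The definition above can be illustrated with a small check of a train/test split. This is a minimal sketch, assuming two hypothetical components (color and shape) with made-up values; the function name `is_compositional_split` is introduced here only for illustration.

```python
from itertools import product

# Hypothetical component values; the names are illustrative only.
colors = ["red", "yellow"]
shapes = ["moon", "heart"]

# Every combination of component values.
all_pairs = set(product(colors, shapes))

# Hold out one combination for testing.
test_pairs = {("yellow", "heart")}
train_pairs = all_pairs - test_pairs


def is_compositional_split(train, test):
    """Check that each test combination is unseen in training,
    while each of its component values is seen in training."""
    seen_colors = {c for c, _ in train}
    seen_shapes = {s for _, s in train}
    return all(
        pair not in train and pair[0] in seen_colors and pair[1] in seen_shapes
        for pair in test
    )


print(is_compositional_split(train_pairs, test_pairs))
```

Here the held-out pair (yellow, heart) never appears in training, yet yellow appears with moon and heart appears with red, so the split matches the description of compositional generalization.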
2.2 Disentangled representations
A disentangled representation [5] contains several separate component representations. Each component representation corresponds to an underlying component, or generative factor. When a representation is not disentangled, it is an entangled representation. In the examples in Figure 3, we suppose that the components are known to be color and shape. The upper images are entangled representations, where color and shape are in the same image. The lower vectors are disentangled representations, where the left vector is for color and the right vector is for shape.
This raises several questions. Where do the types of components come from? In this case, how do we know that the components are color and shape? Another question is what the relation is between two representations, such as between the entangled and the disentangled representations.
2.3 Subjectivity of components
We discuss the first question of where the types of components come from. This part is still controversial and may not be easy to agree on, but we would like to share the idea. The idea is that components can be subjectively defined by humans. Sometimes they reflect common agreement among humans. This also enables discussing different components in the same machine learning framework. We study a general way to encode human understanding of components into models.
Components can be subjective, but not arbitrary. They are defined according to how humans perceive the world. This means some components may be factors in real-world physics, such as position and rotation. They influence human perception, but humans decide the components.
In the example in Figure 1, we have the Sphinx and the Centaur. Though both are created by compositional generalization, they have different components, one for the face, and the other for the upper body. Another example is color. We often use the primary colors red, green and blue as components for colors. However, their essential difference is the light wavelength. There are three colors because humans generally have three types of photopsin proteins in their eyes, each absorbing a primary color [29]. This means if an animal or a machine has four types of proteins, then it may have four primary colors. So primary colors are neither completely objective nor completely subjective. They depend on the objective biological mechanism of humans.
These are examples of the subjectivity of components. Since machines are not humans, they do not know what subjective components humans have. This means we need a general way to encode human understanding into models as prior knowledge.
2.4 Conditional independence property
Let’s look at the relation between representations. The key idea is the conditional independence property. We first look at a question in Figure 4(a). What is in the right hand, when we see a fork in the left hand? We do not know the exact answer, but the fork tells us something about the answer. We may guess that the right hand has a knife or a spoon.
Let’s ask the question again when we can also observe the right hand (Figure 4(b)). Given the observation of a spoon, we can tell that the right hand has a spoon. Then, let’s hide the left hand (Figure 4(c)). In this case, the answer is the same, and hiding the left hand does not influence the answer. This means the answer depends only on the observation of the right hand, though the left hand is related. In other words, given the observation of the right hand, the answer is conditionally independent of other things. This property is called the conditional independence property [9].
We formalize this property and show how it helps compositional generalization. We consider two representations $X$ and $Y$. They both have $K$ components, $X = (X_1, \dots, X_K)$ and $Y = (Y_1, \dots, Y_K)$, and each pair of components $(X_i, Y_i)$ is aligned. The conditional independence property can be summarized as: $Y_i$ depends only on $X_i$. This can be written in probability.

$$P(Y_i \mid Y_1, \dots, Y_{i-1}, X) = P(Y_i \mid X_i), \quad i = 1, \dots, K$$

We also formalize compositional generalization (Figure 2 right). We consider a particular test sample with values $x$ and $y$. In training, each component value pair $(x_i, y_i)$ appears, but the combined value $(x, y)$ does not appear. Note that this means the components are not marginally independent. When the value $x_i$ appears, the value $y_i$ has a high probability. In the test, the value $x$ appears, and we hope the predicted conditional probability of $y$ given $x$ is high. For example in Figure 3, a test sample can be a yellow heart, and it does not appear in training. However, yellow appears in the yellow moon, and heart appears in the red heart in training. $X$ is the image, and $Y$ is the label pair.

In train: $P(y_i \mid x_i)$ is high for every $i$. In test: $P(y \mid x)$ should be high.

The conditional independence property bridges training and test distributions. We first apply the chain rule, and then use the conditional independence property.

$$P(y \mid x) = \prod_{i=1}^{K} P(y_i \mid y_1, \dots, y_{i-1}, x) = \prod_{i=1}^{K} P(y_i \mid x_i)$$

When the $P(y_i \mid x_i)$ are all high, their product is high, so that $P(y \mid x)$ is high. Therefore, a model satisfying the conditional independence property addresses compositional generalization.
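As a numeric illustration of the argument above, the following sketch multiplies per-component conditional probabilities for the yellow heart example. The probability values are hypothetical, chosen only to show that high per-component probabilities give a high joint probability when the conditional independence property holds.

```python
import math

# Hypothetical probabilities that a trained model assigns to the correct
# output component given only the aligned input component.
p_component = {
    "color": 0.95,  # P(y_color = yellow | x_color)
    "shape": 0.90,  # P(y_shape = heart | x_shape)
}


def joint_probability(component_probs):
    """Product of per-component conditionals.

    This equals P(y | x) only under the conditional independence property.
    """
    return math.prod(component_probs.values())


p_joint = joint_probability(p_component)
print(round(p_joint, 3))
```

Without the property, the chain rule factors would each condition on the whole input, and training statistics for individual components would say nothing about the unseen combination.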
3 Architecture design and training
In this section, we introduce our setting, and describe an approach for compositional generalization. We mainly discuss how to encode the prior knowledge and enable conditional independence property.
3.1 Settings
We focus on a general setting for compositional generalization. We consider a problem with both entangled input $X$ and entangled output $Y$, whose components are aligned. For example, in language translation, both the input and the output languages are entangled with grammar and lexicon. The input grammar decides the output grammar, and the input lexicon decides the output lexicon.
Compositional generalization requires recombining values of different components in a novel way. As we consider component types to be subjective (Section 2.3), it requires knowing what the types of components are, such as shape and color. There are different ways to add this prior knowledge. In some cases, the prior knowledge is in the design of the data structure (e.g., position in an image). Some approaches design the training data distribution to make the components statistically marginally independent. We intend to use the prior knowledge in model architecture design with particular regularization.
We focus on using disentangled representation, because it is conceptually straightforward for compositional generalization. We do not assume statistical independence between components or use annotations on components. In such a setting, we have encoding and decoding modules (Figure 5). The encoder converts entangled input $X$ to a hidden disentangled representation $H$, and the decoder converts $H$ to entangled output $Y$. We can set $Y = X$ for unsupervised representation learning.
3.2 Strategy
We hope to enable conditional independence property (Section 2.4) by encoding prior knowledge for components. This means we expect a component representation to have exact information of the corresponding component. For example, we hope a component representation contains the color information. This requires controlling information of a random variable (component representation).
To achieve it, we hope to design a loss function that has the minimum value when the information is expected. So we study the relation between optimization loss and entropy of a component representation. Note that entropy measures the amount of information, not the contents, but we use entropy for intuitive explanation. Also note that we consider a (multidimensional) representation as a random variable. We discuss the distribution, and its entropy, of this random variable with all the samples in a dataset. This means for one dataset, we have only one distribution for the component representation and only one entropy for the distribution.
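To make the notion of "one distribution and one entropy per dataset" concrete, here is a minimal sketch that computes the empirical Shannon entropy of a component representation over all samples of a toy dataset. It assumes discrete representation values for simplicity, whereas the representations discussed here are multidimensional; the data and the name `empirical_entropy` are hypothetical.

```python
import math
from collections import Counter


def empirical_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A hypothetical color component representation over six samples.
color_only = ["red", "red", "red", "yellow", "yellow", "yellow"]

# The same component when it also carries shape information:
# it takes more distinct values, so its entropy is larger.
color_and_shape = [
    "red-moon", "red-heart", "red-moon",
    "yellow-moon", "yellow-moon", "yellow-heart",
]

print(empirical_entropy(color_only))
print(empirical_entropy(color_and_shape))
```

The second list has higher entropy than the first, which matches the intuition that a component representation carrying more than its intended information has entropy above the target.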
The strategy is to design a convex loss with its minimum at the target entropy (Figure 6(c); note that the horizontal axis is entropy rather than parameters). This requires techniques to increase entropy, to decrease entropy, and to enable a local turning point of the loss at the target entropy (locality). We look at related techniques in machine learning (Table 1). Prediction loss increases entropy, because when we train a model to make correct predictions, an intermediate representation should contain more information to do that. While increasing entropy, we can encode the local turning point by architecture design, as discussed in Section 3.3. Regularization can reduce entropy, as discussed in Section 3.4. However, it is not clear how to encode locality while reducing entropy.
Table 1

|          | Increase entropy    | Decrease entropy |
|----------|---------------------|------------------|
| Loss     | Prediction loss     | Regularization   |
| Locality | Architecture design | Not clear        |
With the above techniques available, we design two losses. For the loss that increases entropy (Figure 6(a)), we use prediction loss and architecture design. This loss decreases rapidly as entropy increases while entropy is below the target value, and it is constant above the target. For the loss that decreases entropy (Figure 6(b)), we use regularization. This loss increases steadily as entropy increases. Together they form the expected curve (Figure 6(c)). This approach has two advantages. First, the target position is encoded in only one loss. Second, it does not need specific values for the losses.
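The shape of the combined curve can be sketched with two stand-in functions of entropy. This is only an illustration under assumed piecewise-linear shapes; the real losses are functions of model parameters, and the names, slopes and target value below are hypothetical.

```python
def increase_loss(entropy, target, steepness=4.0):
    """Stand-in for prediction loss with architecture design:
    falls steeply below the target entropy and is constant (zero) above it."""
    return steepness * max(0.0, target - entropy)


def decrease_loss(entropy, slope=1.0):
    """Stand-in for regularization: grows steadily with entropy."""
    return slope * entropy


def total_loss(entropy, target):
    """Sum of the two losses, with its minimum at the target entropy."""
    return increase_loss(entropy, target) + decrease_loss(entropy)


target = 2.0
grid = [i / 10 for i in range(0, 51)]  # entropy values 0.0, 0.1, ..., 5.0
best = min(grid, key=lambda e: total_loss(e, target))
print(best)
```

The minimum lands at the target because the first loss is steeper than the second below the target and flat above it; only the first loss needs to know where the target is.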
3.3 Architecture design and prediction loss
We first discuss how to encode locality when increasing information. In Figure 6(a), the loss decreases when entropy is not enough, and the loss is constant when entropy is enough. This means that we want a component representation to have at least a certain amount of information. We achieve this by architecture design combined with prediction loss.
When each output component $Y_i$ is connected only to one corresponding hidden component representation $H_i$, information of $Y_i$ can come only from this hidden component representation (Figure 7). Note that this also means $H_i$ is connected forward only to $Y_i$. We consider that if the output is correct, all its components should be correct. Since this component needs to be correct when reducing prediction loss, the hidden component representation should contain at least the information of the component. Please also refer to Appendix A for extended discussions.
This is the way that we encode component prior knowledge in the architecture design, i.e., we describe how it generates output. Note that this generating process might be different from the real generating process. For example, we have generative factors of shape, size and color for an apple. For a machine, a generating process can first choose a shape, then adjust the size and paint color. However, a real apple grows with these three components changing together.
This technique works for the decoding process, because the output needs to be compared with ground truth to compute the loss. Also, for humans, describing the decoding process is easier than the encoding process. For example, computer graphics is easier than computer vision. Computer graphics, in many cases, does not need machine learning, such as developing 3D games. However, computer vision is hard without machine learning.
Let’s look at an example in Figure 8. There are two component representations, one for shape and the other for color. We can design an architecture to achieve the following effects. The output shape changes only when the first component representation changes its value. The output color changes only when the second component representation changes its value. With such design, to produce correct output, the first component representation should at least contain the shape information, and the second component representation should at least contain the color information, because the other representation is not able to provide the information. Please refer to [24] for more analysis.
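The shape and color example can be sketched as a decoder in which each output component reads only its own hidden component. The lookup tables below are hypothetical stand-ins for per-component sub-networks, not the architecture used in the cited papers.

```python
# Hypothetical lookup tables standing in for per-component decoder sub-networks.
SHAPE_DECODER = {0: "moon", 1: "heart"}
COLOR_DECODER = {0: "red", 1: "yellow"}


def decode(hidden):
    """Decode a pair (h_shape, h_color) into an output (shape, color).

    Each output component reads only its own hidden component, so no
    information can flow from h_color to the shape, or from h_shape to the color.
    """
    h_shape, h_color = hidden
    return SHAPE_DECODER[h_shape], COLOR_DECODER[h_color]


print(decode((1, 0)))
print(decode((1, 1)))
```

Changing only the second hidden component changes only the output color, so a correct shape can be produced only if the first component carries the shape information.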
3.4 Entropy regularization
We then talk about reducing the information of a component representation. This does not require component specific prior knowledge. Entropy for a random variable can be roughly understood as the number of possible values.
Entropy regularization [22] aims at reducing the entropy of a component representation. Given a component representation $H_i$, we compute its $L_2$ norm and add normal noise to each element of the representation. This decreases the channel capacity, so the entropy of the representation is reduced. We then feed the noised representation $H_i'$ to the next layer, and add the norm to the loss function.

$$H_i' = H_i + \alpha\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \qquad \mathcal{L} = \mathcal{L}_{\text{prediction}} + \lambda \lVert H_i \rVert_2$$

where $\alpha$ is the weight of the noise, positive for training and zero for inference, and $\lambda$ is a coefficient.
Please see Figure 9 for an intuitive illustration. The noise pushes different values far from each other in vector space. If they are close, the noise makes them indistinguishable, so the prediction would be wrong. At the same time, norm regularization pulls different values close to each other to reduce the region of the manifold. These two forces squash the values, so that unnecessary values are merged. With fewer possible values, the entropy is reduced. Please also refer to Appendix B.
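The two operations described in this section, adding noise and penalizing the norm, can be sketched in a few lines. This is a minimal sketch using plain Python lists, assuming an $L_2$ norm penalty and standard normal noise; the function name `entropy_regularize` and the values are hypothetical, and a real implementation would operate on tensors inside a training loop.

```python
import math
import random


def entropy_regularize(h, alpha, rng):
    """Return the noised representation and the L2 norm penalty of `h`.

    alpha is the noise weight: positive during training, zero at inference.
    The penalty is computed on the representation before noise is added.
    """
    penalty = math.sqrt(sum(v * v for v in h))
    noised = [v + alpha * rng.gauss(0.0, 1.0) for v in h]
    return noised, penalty


rng = random.Random(0)
h = [3.0, 4.0]

# Training: noise is added and the penalty is weighted into the loss.
train_h, penalty = entropy_regularize(h, alpha=0.1, rng=rng)

# Inference: alpha is zero, so the representation passes through unchanged.
infer_h, _ = entropy_regularize(h, alpha=0.0, rng=rng)

print(penalty, infer_h)
```

During training, the total loss would then be the prediction loss plus a coefficient times `penalty`, and `train_h` rather than `h` would be fed to the next layer.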
3.5 Stochastic sampling and gradient descent
We discussed two losses to increase and decrease entropy. However, during the optimization of neural networks, there are other influences acting like losses, and we consider them as losses for simple explanation. These losses come from stochastic gradient descent. It is a widely used optimization algorithm with many variations, and the following arguments apply to them.
One loss is from stochastic sampling. This reduces entropy because it adds noise. The effect is similar to entropy regularization, but this is weak. This effect appears mainly during the later stage of training. Occasionally, this enables learning compositionality without entropy regularization. For more details, please refer to [28].
Another loss comes from gradient descent. It increases the entropy of a component representation. This is because the optimization process imposes a bias toward non-compositional solutions: the gradient seeks the steepest direction, so it uses all available input information, including redundant information. This mainly happens during the early stage of training. This effect can be canceled by entropy regularization, so entropy regularization is important. Note that the effect is not prominent when there is only one solution, e.g., with a linear model. Please refer to [9] for more details.
3.6 Summary of four losses
Let’s summarize the four losses during optimization. Please see Figure 10 for intuitions. The first loss is from prediction loss with architecture design. This loss decreases rapidly when entropy is small, and it is constant after the entropy is above the target value. The second loss is from stochastic sampling [28]. This exists naturally but is weak. The third loss is from gradient descent [9]. The fourth loss is from entropy regularization [22], and it counteracts the effect from gradient descent. This loss should be less steep than the prediction loss, so that the summed loss has the lowest point near to the expected value.
4 Inference
In this section, we look at problems during inference. We also discuss conjectures for human behaviors (Appendix C). We then analyze that language tasks are less likely to suffer from the problems in inference (Appendix D).
4.1 Problem
So far, we have discussed learning compositionality during training. However, our goal is compositional generalization, and we hope for high performance in the test. So a question is whether the model still works on test distribution. We have both encoding and decoding parts (Figure 5).
Decoding still works if encoding is correct. This is because of the architecture design, where only the corresponding component representation produces the component in output. Since the component representation is correct, and it is in the same manifold as training by definition of compositional generalization, the network produces a correct output component.
However, the encoding part may not work on test distribution. It extracts disentangled representation from entangled input representation. This extraction network, however, can be a general network, and we do not have special treatment for it. By definition of compositional generalization, the input manifold changes, and a general network does not work well in such cases. Therefore, the encoding network may not produce correct disentangled representation. Please also refer to [10].
4.2 Solution
One idea to address this problem is to convert the encoding problem to a decoding problem by reversing input and output, and to specify the architecture design with an additional decoding network $g'$ that maps the hidden representation $H$ back to the input $X$ (Figure 11). Similar to the original decoding network $g$, $g'$ works when each component representation $H_i$ is in its training manifold. Since its input and output are reversed, we cannot get the hidden representation with a forward pass. So we use optimization to find the input $H$ that best produces the output $X$.
To keep each test $H_i$ in its training manifold, we may keep the manifold information and use it in the test. A straightforward way to keep the information is to store some training samples. They may be stored as input representations or hidden representations. In the test, we make each test $H_i$ close to the corresponding training ones. The encoding network provides the initial hidden representations.
In summary (Figure 11), we jointly train three models in training. In inference, we first use the encoder to get initial hidden representation. Then we use the additional decoder to optimize the hidden representation to reconstruct the input with manifold regularization. We then use the original decoder to convert the optimized hidden representation to output.
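The inference procedure summarized above can be sketched on a toy problem. This is a minimal sketch under strong assumptions: the additional decoder is a fixed, hypothetical linear map rather than a trained network, the manifold information is a short list of stored training values per component, and gradients are estimated with finite differences. All names and constants are illustrative.

```python
def additional_decoder(h):
    """Hypothetical decoder from two hidden components back to the entangled input."""
    h1, h2 = h
    return (h1 + h2, h1 - h2)


# Stored training values for each hidden component (manifold information).
TRAIN_VALUES = ([0.0, 1.0], [0.0, 1.0])


def objective(h, x, mu=0.1):
    """Reconstruction error plus the squared distance of each component
    to its nearest stored training value (manifold regularization)."""
    recon = sum((a - b) ** 2 for a, b in zip(additional_decoder(h), x))
    manifold = sum(
        min((v - t) ** 2 for t in values) for v, values in zip(h, TRAIN_VALUES)
    )
    return recon + mu * manifold


def refine(h_init, x, steps=200, lr=0.1, eps=1e-5):
    """Gradient descent on the hidden representation with central differences."""
    h = list(h_init)
    for _ in range(steps):
        grad = []
        for i in range(len(h)):
            up = h[:i] + [h[i] + eps] + h[i + 1:]
            down = h[:i] + [h[i] - eps] + h[i + 1:]
            grad.append((objective(up, x) - objective(down, x)) / (2 * eps))
        h = [v - lr * g for v, g in zip(h, grad)]
    return h


# A test input whose hidden combination (1, 1) is assumed to be held out.
x_test = additional_decoder((1.0, 1.0))
h_init = [0.6, 0.7]  # imperfect initial guess, as an encoder might provide
h_opt = refine(h_init, x_test)
print([round(v, 3) for v in h_opt])
```

The refined hidden representation would then be passed to the original decoder to produce the output. A better initial guess shortens the optimization, which connects to the conjecture about familiar and unfamiliar inputs in Appendix C.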
5 Conclusion
This report introduces compositional generalization and an approach to it, with pointers to a series of corresponding papers. It has three key points. The first is what compositional generalization is. It is an out-of-distribution generalization that recombines seen component values in a novel way. The second is the conditional independence property. This is the core property of compositional generalization. It means an output component depends only on the corresponding input component. The last point is controlling random variable information. It enables the conditional independence property. We achieve it by squeezing entropy from above and below. We hope this report helps in understanding compositional generalization and advancing artificial intelligence.
Acknowledgments
We thank Mohamed Elhoseiny, Liang Zhao, Wei Xu, Kenneth Church, Joel Hestness, Jianyu Wang, Yi Yang and Zhuoyuan Chen for helpful suggestions and discussions.
References
[1] (2020) Learning to recombine and resample data for compositional generalization. arXiv preprint arXiv:2010.03706.
[2] (2016) Neural module networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] (2020) Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7556–7566.
[4] (2020) A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations.
[5] (2013) Deep learning of representations: looking forward. In International Conference on Statistical Language and Speech Processing, pp. 1–37.
[6] (2017) The consciousness prior. arXiv preprint arXiv:1709.08568.
[7] (2018) Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599.
[8] (2020) Efficiently disentangle causal representations. OpenReview. https://openreview.net/pdf?id=SvafwURywB
[9] (2020) Gradient descent resists compositionality. OpenReview. https://openreview.net/pdf?id=VMAesov3dfU
[10] (2020) Transferability of compositionality. OpenReview. https://openreview.net/pdf?id=GHCu1utcBvX
[11] (2019) Posing fair generalization tasks for natural language inference. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4485–4495.
[12] (2020) Permutation equivariant models for compositional generalization in language. In International Conference on Learning Representations.
[13] (2020) Inductive biases for deep learning of higher-level cognition. arXiv preprint arXiv:2011.15091.
[14] (2019) Recurrent independent mechanisms. arXiv preprint arXiv:1909.10893.
[15] (2017) β-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR).
[16] (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6700–6709.
[17] (2020) Visually grounded continual learning of compositional semantics. arXiv preprint arXiv:2005.00785.
[18] (2011) Thinking, fast and slow. Macmillan.
[19] (2020) Measuring compositional generalization: a comprehensive method on realistic data. In International Conference on Learning Representations.
[20] (2019) Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, pp. 9788–9798.
[21] (2020) Compositional language continual learning. In International Conference on Learning Representations.
[22] (2019) Compositional generalization for primitive substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4284–4293.
[23] (2020) Grounded compositional generalization with environment interactions. OpenReview. https://openreview.net/pdf?id=b6BdrqTnFs7
[24] (2020) Necessary and sufficient conditions for compositional representations. OpenReview. https://openreview.net/pdf?id=r6I3EvB9eDO
[25] (2020) Compositional generalization by learning analytical expressions. arXiv preprint arXiv:2006.10627.
[26] (2020) A benchmark for systematic generalization in grounded language understanding. arXiv preprint arXiv:2003.05161.
[27] (2019) Compositional generalization in a deep seq2seq model by separating syntax and semantics. arXiv preprint arXiv:1904.09708.
[28] (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
[29] (2007) The machinery of colour vision. Nature Reviews Neuroscience 8 (4), pp. 276–286.
[30] (2020) Locality and compositionality in zero-shot learning. In International Conference on Learning Representations.
[31] (2020) Teaching pretrained models to systematically reason over implicit knowledge. arXiv preprint arXiv:2006.06609.
Appendix A Partial observation of output combinations
In Section 3.3, we discussed architecture design that enables a component representation to contain at least the information of a component. One condition here is that when the prediction $\hat{Y}$ equals the ground truth $Y$, all the component outputs are correct, $\hat{Y}_i = Y_i$ for every $i$. As an extended topic, in some complicated cases, the condition may not be met: the same output can ambiguously correspond to different combinations of component values if it discards a part of the information about the combinations.
However, even in such cases, the arguments still hold with disambiguation. Broadly speaking, how to disambiguate is another type of prior knowledge. For example, reducing the entropy of each component representation makes the ambiguity disappear in some tasks. In that case, the entropy regularization also performs disambiguation. This mechanism is used in [22], where a combination consists of a syntax tree and the words on its nodes, but the output contains only the words.
Appendix B Entropy regularization in language learning
We would like to share a joke about the "law of entropy increase" in human language learning. The law of entropy increase originally says that in an isolated system, entropy increases over time. Here, a beginner learning a second language is likely to "overuse" compositional generalization and create unnatural phrases. As one becomes more fluent over time, the problem diminishes (the language becomes less highly compositional). Since entropy reduction helps compositional generalization, this means entropy increases over time.
Appendix C Conjectures for system 1 and system 2 cognition
We would like to share some conjectures about system 1 and system 2 cognition [18]. System 1 is a fast and unconscious cognition process. System 2 is a slow and conscious cognition process. Figure 12 is a borrowed example (https://youtu.be/4KpZBiKda0k). System 1 is driving on a familiar road. The driver is relaxed, and can drive while chatting. System 2 is driving on an unfamiliar road. The driver needs to focus on driving.
In system 2, humans need more time and attention. What do we do with these resources? The conjecture is we are doing optimization. More precisely, when the input is familiar, the encoding network works well, so that optimization is simple and fast (maybe fewer optimization steps). When the input is unfamiliar, the encoding network does not provide a good initial hidden representation, so the optimization is difficult and slow.
Another conjecture is that our long-term memory is used for manifold regularization, which requires storing training samples.
Appendix D Inference for language
When we read new articles, such as news, there are many new sentences. If sentence structures are simple, we can often read at a constant speed.
This phenomenon might be explained by transferable units, such as words. In different situations, a word is likely to have the same word extraction information, syntactic information and semantic information. These types of information are so stable that we can create a general-purpose dictionary (e.g., Figure 13). The extraction information is the spelling, the syntactic information is the part of speech, and the semantic information is the explanations. Note that the information might not be deterministic, but the entries are stable. This means we can find words in the same way in different situations, and use the same values for each component by concatenating word-level component representations. Because of these properties, we avoid complicated optimization during inference, and read at a constant speed in some language tasks.