Capacity, Bandwidth, and Compositionality in Emergent Language Learning

10/24/2019 ∙ by Cinjon Resnick, et al.

Many recent works have discussed the propensity, or lack thereof, for emergent languages to exhibit properties of natural languages. A favorite in the literature is learning compositionality. We note that most of those works have focused on communicative bandwidth as being of primary importance. While important, it is not the only contributing factor. In this paper, we investigate the learning biases that affect the efficacy and compositionality of emergent languages. Our foremost contribution is to explore how capacity of a neural network impacts its ability to learn a compositional language. We additionally introduce a set of evaluation metrics with which we analyze the learned languages. Our hypothesis is that there should be a specific range of model capacity and channel bandwidth that induces compositional structure in the resulting language and consequently encourages systematic generalization. While we empirically see evidence for the bottom of this range, we curiously do not find evidence for the top part of the range and believe that this is an open question for the community.


1 Introduction

In recent years, compositional language learning in the context of multi-agent emergent communication has been extensively studied (Sukhbaatar et al., 2016; Foerster et al., 2016; Lazaridou et al., 2016; Baroni, 2019). These works have collectively found that while most emergent languages do not tend to be compositional, they can be guided towards this attribute through artificial task-specific constraints (Kottur et al., 2017; Lee et al., 2017).

In this paper, we focus on how a neural network, specifically a generative one, can learn a compositional language. Moreover, we ask how this can occur without task-specific constraints. To accomplish this, we first define what a language is and what we mean by compositionality. In tandem, we introduce precision and recall, two metrics that help us measure how well a generative model at large has learned a grammar from a finite set of training instances. We then use a variational autoencoder with a discrete sequence bottleneck to investigate how well the model learns a compositional language and what biases that learning. This allows us to derive residual entropy, a third metric that reliably measures compositionality in our particular environment. We use this metric to cross-validate precision and recall.

Our environment lets us experiment with a syntactic, compositional language while varying the channel width and the number of parameters, our surrogate for model capacity. Our experiments reveal that our smallest models are only able to solve the task when the channel is wide enough to allow for a surface-level compositional representation. In contrast, large models learn a language whenever the channel is wide enough; however, large models also have the ability to memorize the training set. We hypothesize that this memorization should lead to non-compositional representations and overfitting, although this does not yet manifest empirically. This setup allows us to test our hypothesis that there is a network capacity above which models will tend to produce languages with non-compositional structure.

2 Related Work

In recent years, there has been renewed interest in studies of emergent language (Foerster et al., 2016; Lazaridou et al., 2016; Havrylov and Titov, 2017) that originated with works such as Steels and Kaplan (2000); Steels (1997). Some of these approaches use referential games (Evtimova et al., 2017; Lazaridou et al., 2018) to produce an emergent language that ideally has properties of human languages, with compositionality being a commonly sought after property (Barrett et al., 2018; Baroni, 2019; Chaabouni et al., 2019a).

Our paper is most similar to Kottur et al. (2017), which showed that compositional language arose only when certain constraints on the agents are satisfied. While the constraints they examined were either making their models memoryless or giving the language a minimal vocabulary, we hypothesized about the importance of agents having small capacity relative to the number of disentangled representations (concepts) to which they are exposed. This is more general because both of the scenarios they described fall under the umbrella of reducing model capacity. To ask this question, we built a much bigger dataset to illuminate how capacity and channel width affect the resulting compositionality of the language. Another difference is that they used a back-and-forth exchange between agents, whereas we use a single forward exchange. While a limitation, this helped us focus the scope.

Liska et al. (2018) suggests that the average training run for recurrent neural networks does not converge to a compositional solution, but that a large random search will produce compositional solutions. This implies that the optimization approach biases learning, which is also confirmed in our experiments. However, we further analyze other biases.

Spike et al. (2017) describes three properties that bias models towards successful learned signaling: the creation and transmission of referential information, a bias against ambiguity, and information loss. This lies on a similar spectrum to our work but pursues a different intent: they study biases that lead to optimal signaling, whereas we seek compositionality. Verhoef et al. (2016); Kirby et al. (2015); Zaslavsky et al. (2018) all examine the trade-off between expression and compression in both emergent and natural languages, in addition to how that trade-off affects the learners. We differ in that we target a specific aspect of the agent (capacity) and ask how that aspect biases the learning. Chen et al. (2018) describes how the probability distribution over the set of all strings produced by a recurrent model can be interpreted as a weighted language; this is relevant to our formulation of a language.

Most other works studying compositionality in emergent languages (Andreas, 2019; Mordatch and Abbeel, 2017; Chaabouni et al., 2019b; Lee et al., 2019) have focused on learning interpretable representations. See Hupkes et al. (2019) for a broad survey of the different approaches. By and large, these are orthogonal to our work because none pose the question we ask: how does an agent's capacity affect the resulting language's compositionality?

3 Compositional Language and Learning

We start by defining a language and what it means for a language to be compositional. We then discuss what it means for a network to learn a compositional language, based on which we derive evaluation metrics.

3.1 Compositional Language

A language $L$ is a subset of $\Sigma^*$, where $\Sigma$ denotes a set of alphabet (terminal) symbols and $s \in \Sigma^*$ denotes a string:

$$L \subseteq \Sigma^* = \{ (s_1, \ldots, s_n) \mid s_t \in \Sigma,\ n \geq 0 \}.$$

In this paper, we constrain a language to contain only finite-length strings, i.e., $|s| < \infty$ for all $s \in L$, implying that $L$ is a finite language. We use $\ell_{\max}$ to denote the maximum length of any $s \in L$.

We define a generator $G$ from which we can sample one valid string at a time. It never generates an invalid string and generates all the valid strings in $L$ in a finite number of trials, such that

$$L = \{ s \mid p_G(s) > 0 \}. \tag{1}$$

We define the length of the description of $G$ as the sum of the number of non-terminal symbols $|I|$ and the number of production rules $|R|$, where $I$ is the set of intermediate (non-terminal) symbols and $R$ is the set of production rules. Each production rule $r \in R$ takes as input an intermediate string $\tilde{s} \in (\Sigma \cup I)^*$ and outputs another string $\tilde{s}' \in (\Sigma \cup I)^*$. (This setup recovers generative grammar (Chomsky, 2002) under suitable conditions on the production rules.)

The generator starts from an empty string and applies an applicable production rule (uniformly selected at random) until the output string consists only of alphabet (terminal) symbols.

Languages and compositionality

When the number of such production rules plus the number of intermediate symbols is smaller than the size of the language that $G$ generates, we call $L$ a compositional language. In other words, $L$ is compositional if and only if $|R| + |I| < |L|$.

One such example is when we have sixty alphabet symbols, $\Sigma = \{ a_{j,k} \mid j \in \{1, \ldots, 6\},\ k \in \{1, \ldots, 10\} \}$, and six intermediate symbols, $I = \{ i_1, \ldots, i_6 \}$, with production rules $R$:

  • $\varnothing \rightarrow i_1 i_2 i_3 i_4 i_5 i_6$.

  • For each $j \in \{1, \ldots, 6\}$, $i_j \rightarrow a_{j,k}$, where $k \in \{1, \ldots, 10\}$.

From these production rules and intermediate symbols, we obtain a language of size $10^6$. We thus consider this language to be compositional and will use it in our experiments.
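As a concrete illustration of the gap between description length and language size, the following Python sketch enumerates this example language, assuming the six-intermediates-by-ten-alphabet-symbols construction above (symbol names such as `a_0_0` are illustrative, not the paper's notation):

```python
from itertools import islice, product

N_CONCEPTS = 6   # intermediate symbols i_1 ... i_6
N_VALUES = 10    # alphabet symbols per intermediate symbol

# Alphabet: 60 terminal symbols a_{j,k}, grouped by the intermediate symbol i_j.
alphabet = {j: [f"a_{j}_{k}" for k in range(N_VALUES)] for j in range(N_CONCEPTS)}

# The generated language: every string obtained by expanding the empty string to
# i_1 ... i_6 and then rewriting each i_j to one of its ten alphabet symbols.
def generate_language():
    for choice in product(*alphabet.values()):
        yield " ".join(choice)

# Size of the language vs. size of its description (|I| plus |R|).
language_size = N_VALUES ** N_CONCEPTS                        # 1,000,000 strings
description_size = N_CONCEPTS + (1 + N_CONCEPTS * N_VALUES)   # 6 + 61 = 67

print(list(islice(generate_language(), 3)))   # a few example strings
print(language_size, description_size)        # 1000000 67
```

The point of the sketch is the ratio: a description of a few dozen symbols and rules covers a million valid strings.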

3.2 Learning a language

We consider the problem of learning an underlying language $L$ from a finite set of training strings randomly drawn from it:

$$D = \{ s_1, \ldots, s_N \}, \quad s_n \sim G_L,$$

where $G_L$ is the minimal-length generator associated with $L$. We assume $|D| < |L|$, and our goal is to use $D$ to learn a language $\hat{L}$ that approximates $L$ as well as possible. We know that there exists a generator equivalent to $L$, and so our problem becomes estimating a generator from this finite set rather than reconstructing the entire set of strings belonging to the original language $L$.

We cast the problem of estimating a generator as density modeling, in which case the goal is to estimate a distribution $p(s)$. Sampling from $p(s)$ is then equivalent to generating a string from the estimated generator. Language learning becomes

$$\hat{p} = \arg\max_{p \in \mathcal{M}} \; \frac{1}{N} \sum_{n=1}^{N} \log p(s_n) - \lambda \Omega(p), \tag{2}$$

where $\Omega$ is a regularization term, $\lambda$ is its strength, and $\mathcal{M}$ is a model space.

Evaluation metrics

When the language is learned perfectly, any string sampled from the learned distribution $\hat{p}$ must belong to $L$. Also, any string in $L$ must be assigned a non-zero probability under $\hat{p}$. Otherwise, the set of strings generated from this generator, implicitly defined via $\hat{p}$, is not identical to the original language $L$. This observation leads to two metrics for evaluating the quality of the estimated language with the distribution $\hat{p}$, precision and recall:

$$\mathrm{precision}(L, \hat{p}) = \mathbb{E}_{s \sim \hat{p}} \big[ \mathbb{1}(s \in L) \big], \tag{3}$$
$$\mathrm{recall}(L, \hat{p}) = \mathbb{E}_{s \sim \mathcal{U}(L)} \big[ \mathbb{1}(\hat{p}(s) > 0) \big], \tag{4}$$

where $\mathbb{1}$ is the indicator function and $\mathcal{U}(L)$ is the uniform distribution over $L$. These metrics are designed to fit any compositional structure rather than being one-off evaluation approaches. Because they are often intractable to compute exactly, we approximate them with Monte Carlo estimates, drawing $N_p$ samples from $\hat{p}$ for precision and $N_r$ uniform samples from $L$ for recall:

$$\widehat{\mathrm{precision}} = \frac{1}{N_p} \sum_{n=1}^{N_p} \mathbb{1}\big(s^{(n)} \in L\big), \tag{5}$$
$$\widehat{\mathrm{recall}} = \frac{1}{N_r} \sum_{m=1}^{N_r} \mathbb{1}\big(\hat{p}(\tilde{s}^{(m)}) > 0\big), \tag{6}$$

where $s^{(n)}$ is a sample from $\hat{p}$ and $\tilde{s}^{(m)}$ is a uniform sample from $L$.
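A minimal sketch of these Monte Carlo estimates in Python, assuming a hypothetical `sample()` function that draws strings from the learned model and a hypothetical `log_prob()` function that scores a string (stand-ins for the trained autoencoder; the probability threshold is likewise an assumption standing in for "non-zero"):

```python
import random

def estimate_precision(sample, language, n_samples=10_000):
    """Eq. (5): fraction of model samples that are valid strings of L."""
    hits = sum(tuple(sample()) in language for _ in range(n_samples))
    return hits / n_samples

def estimate_recall(log_prob, language, n_samples=10_000, threshold=-1e9):
    """Eq. (6): fraction of uniformly drawn strings of L that the model
    assigns non-negligible probability."""
    strings = random.sample(list(language), n_samples)
    hits = sum(log_prob(s) > threshold for s in strings)
    return hits / n_samples

# Usage sketch: `language` is a set of symbol tuples; `model.sample` and
# `model.log_prob` come from whatever density estimator was trained on D.
# precision = estimate_precision(model.sample, language)
# recall = estimate_recall(model.log_prob, language)
```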

3.3 Compositionality, learning, and capacity

When learning an underlying compositional language $L$, there are three possible outcomes:

Overfitting: $\hat{p}$ could memorize all the strings that were presented when solving Eq. (2) and assign non-zero probabilities to those strings and zero probabilities to all others. This would maximize precision, but recall will be low, as the estimated generator does not cover $L$.

Systematic generalization: $\hat{p}$ could capture the underlying compositional structure of $L$ characterized by the production rules $R$ and intermediate symbols $I$. In this case, $\hat{p}$ will assign non-zero probabilities to all the strings that are reachable via these production rules (and zero probability to all others) and generalize to strings from $L$ that were unseen during training, leading to high precision and recall. This behavior was characterized in Lake and Baroni (2017).

Failure: $\hat{p}$ may neither memorize the entire training set nor capture the underlying production rules and intermediate symbols, resulting in both low precision and low recall.

We hypothesize that a major factor determining the compositionality of the resulting language is the capacity of the most complicated distribution within the model space $\mathcal{M}$. (As the definition of a model's capacity heavily depends on the specific construction, we do not concretely define it here but do so later, when we introduce the specific family of models with which we run experiments.) The hypothesis is that when the model capacity is too high, the first case of total memorization is likely. When it is too low, the third case of catastrophic failure will happen. Only when the model capacity is just right will language learning correctly capture the compositional structure underlying the original language and exhibit systematic generalization (Bahdanau et al., 2019). We are interested in empirically investigating this hypothesis using a neural network as a language learner. We do so by using a variational autoencoder with a discrete sequence bottleneck to model $\hat{p}$. This is useful because it admits the interpretation of a two-player ReferIt game (Lewis, 1969) in addition to being a density estimator, attributes that together allow us to utilize recall and precision to describe a language that has resulted from an emergent communication process.

4 Variational autoencoders and their capacity

A variational autoencoder (Kingma and Welling, 2013) consists of neural networks that are often referred to as an encoder $q_\phi(z \mid s)$, a decoder $p_\theta(s \mid z)$, and a prior $p(z)$. These networks are jointly updated to maximize the variational lower bound on the marginal log-probability of training instances $s \in D$:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{z \sim q_\phi(z \mid s)} \big[ \log p_\theta(s \mid z) \big] - \mathrm{KL}\big( q_\phi(z \mid s) \,\|\, p(z) \big). \tag{7}$$

We can use $\mathcal{L}$ as a proxy to the true $\log p(s)$ captured by this model. Once trained, we can efficiently sample a latent string $z$ from the prior and then sample a string $s$ from the decoder $p_\theta(s \mid z)$.

The usual formulation of a variational autoencoder uses a continuous latent variable $z \in \mathbb{R}^d$, which conveniently admits a reparametrization that reduces the variance of the gradient estimate. However, this infinitely large latent space (arising from the continuous variable) makes it difficult to reason about the resulting capacity of the model. We thus constrain the latent variable to be a binary string of a fixed length $m$, i.e., $z \in \{0, 1\}^m$. Assuming deterministic decoding, i.e., $s = \arg\max_{s'} p_\theta(s' \mid z)$, this puts a strict upper bound of $2^m$ on the size of the language captured by the variational autoencoder.
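A quick back-of-the-envelope calculation, assuming the $10^6$-string language described in §3.1, shows how this bound translates into a minimum channel width:

```latex
% Deterministic decoding maps each latent string to at most one output string,
% so the learned language can contain at most 2^m strings:
|\hat{L}| \le 2^m .
% For the model to be able to cover the example language of Section 3.1,
2^m \ge |L| = 10^6
\quad\Longleftrightarrow\quad
m \ge \log_2 10^6 \approx 19.93 ,
% i.e., at least m = 20 latent bits are required.
```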

4.1 Variational autoencoder as a communication channel

As described above, using variational autoencoders with a discrete sequence bottleneck allows us to analyze the capacity of the model in terms of computation and bandwidth. We can now interpret this variational autoencoder as a communication channel in which a novel protocol must emerge as a by-product of learning. If each string in the original language is a description of underlying concepts, then the goal of the encoder is to encode those concepts in a binary string following an emergent communication protocol. The decoder receives this string and must interpret which set of concepts was originally described by the speaker.

Our setup

We simplify and assume that all of the alphabet symbols in the string indicate the underlying concepts. While the inputs are ordered according to the sequence of concepts, our model encodes them using a bag-of-words (BoW) representation.

The encoder $q_\phi(z \mid s)$ is parameterized by a recurrent network that receives the sequence of concatenated one-hot input tokens of $s$ and converts each of them to an embedding. It then runs an LSTM (Hochreiter and Schmidhuber, 1997) non-autoregressively for $m$ timesteps, taking the flattened representation of the input embeddings as its input at each step and linearly projecting each output to a probability distribution over $\{0, 1\}$. This results in a sequence of Bernoulli distributions over the $m$ latent variables, defined as

$$q_\phi(z \mid s) = \prod_{i=1}^{m} q_\phi(z_i \mid s).$$

From this distribution, we can sample a latent string $z$ of length $m$.

The decoder $p_\theta(s \mid z)$ receives $z$ and uses a bag-of-words (BoW) representation to encode it into its own embedding space. Taking the flattened representation of these embeddings as input, we run an LSTM for $\ell_{\max}$ time steps, each time outputting a probability distribution over the full alphabet set $\Sigma$:

$$p_\theta(s \mid z) = \prod_{t=1}^{\ell_{\max}} p_\theta(s_t \mid z).$$

To train the whole system end-to-end (Sukhbaatar et al., 2016; Mordatch and Abbeel, 2017) via backpropagation, we apply a continuous approximation to $q_\phi(z \mid s)$ that depends on a learned temperature parameter $\tau$. We use the Straight-Through Gumbel-Softmax estimator (Jang et al., 2016; Maddison et al., 2017) to convert the continuous relaxation into a discrete sample for each $z_i$. This corresponds to the original discrete distribution in the zero-temperature limit. The final sequence of one-hot vectors encoding $z$ is our message, which is passed to the decoder $p_\theta(s \mid z)$. If $g_0$ and $g_1$ are independent Gumbel noise samples and the Bernoulli random variable corresponding to $z_i$ has class probabilities $\pi_0$ and $\pi_1$, then

$$z_i = \arg\max_{k \in \{0, 1\}} \big( \log \pi_k + g_k \big),$$

with gradients taken through the softmax relaxation at temperature $\tau$.

The prior $p(z)$ encodes the message using a bag-of-words (BoW) representation. It gives the probability of $z_i$ according to the prior (binary) distribution for each $i$ and is defined as

$$p(z) = \prod_{i=1}^{m} p(z_i \mid z_{<i}).$$

This can be used both to compute the prior probability of a latent string $z$ and to efficiently sample a string from $p(z)$ using ancestral sampling. Penalizing the KL divergence between the speaker's distribution and the prior distribution in Eq. (7) encourages the emergent protocol to use latent strings that are as diverse as possible.
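The following PyTorch sketch illustrates this architecture under the description above; the layer sizes, the bag-of-words-as-linear-layer shortcut, the per-bit factorized prior, and the loss wiring are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteSeqVAE(nn.Module):
    """Variational autoencoder with a binary sequence bottleneck (illustrative sizes)."""

    def __init__(self, n_concepts=6, n_values=10, n_bits=24,
                 embed_dim=100, hidden_dim=300):
        super().__init__()
        self.n_concepts, self.n_values, self.n_bits = n_concepts, n_values, n_bits
        # Encoder: embed the flattened one-hot concept tokens, run an LSTM
        # non-autoregressively for n_bits steps, project to one Bernoulli logit per bit.
        self.enc_embed = nn.Linear(n_concepts * n_values, embed_dim)
        self.enc_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.enc_out = nn.Linear(hidden_dim, 1)
        # Decoder: embed the binary message (bag of bits), run an LSTM for
        # n_concepts steps, and predict a distribution over the alphabet at each step.
        self.dec_embed = nn.Linear(n_bits, embed_dim)
        self.dec_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dec_out = nn.Linear(hidden_dim, n_values)
        # Prior: a learned per-bit Bernoulli (a simplification of the paper's prior).
        self.prior_logits = nn.Parameter(torch.zeros(n_bits))
        self.temperature = 10.0  # annealed/learned during training

    def encode(self, x_onehot):                       # x_onehot: (B, n_concepts * n_values)
        h = self.enc_embed(x_onehot).unsqueeze(1).expand(-1, self.n_bits, -1).contiguous()
        out, _ = self.enc_lstm(h)
        return self.enc_out(out).squeeze(-1)          # (B, n_bits) Bernoulli logits

    def sample_message(self, logits):
        # Straight-Through Gumbel-Softmax over {on, off} for every bit.
        two_class = torch.stack([logits, torch.zeros_like(logits)], dim=-1)
        z = F.gumbel_softmax(two_class, tau=self.temperature, hard=True)[..., 0]
        return z                                      # (B, n_bits), values in {0, 1}

    def decode(self, z):
        h = self.dec_embed(z).unsqueeze(1).expand(-1, self.n_concepts, -1).contiguous()
        out, _ = self.dec_lstm(h)
        return self.dec_out(out)                      # (B, n_concepts, n_values) logits

    def loss(self, x_onehot, targets):                # targets: (B, n_concepts) value indices
        logits = self.encode(x_onehot)
        z = self.sample_message(logits)
        recon = self.decode(z)
        # Negative reconstruction term plus KL term of Eq. (7) (minimize = maximize ELBO).
        recon_loss = F.cross_entropy(recon.flatten(0, 1), targets.flatten())
        q = torch.distributions.Bernoulli(logits=logits)
        p = torch.distributions.Bernoulli(logits=self.prior_logits.expand_as(logits))
        kl = torch.distributions.kl_divergence(q, p).sum(-1).mean()
        return recon_loss + kl
```

In training, `loss` would be minimized with a standard optimizer; the paper's actual prior, embedding scheme, and temperature schedule differ in their details.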

4.2 Capacity of a variational autoencoder with discrete sequence bottleneck

This view of a variational autoencoder with a discrete sequence bottleneck presents an opportunity for us to separate the model's capacity into two parts. The first part is the capacity of the communication channel, imposed by the size of the latent variable. As described earlier, the size of the original language that can be perfectly captured by this model is strictly upper bounded by $2^m$, where $m$ is the preset length of the latent string $z$. If $2^m < |L|$, the model will not be able to learn the language completely, although it may still memorize all the training strings if $2^m \ge |D|$. A resulting question is whether $2^m \ge |L|$ is a sufficient condition for the model to learn $L$ from a finite set of training strings.

The second part involves the capacity of the encoder and decoder to map between the latent variable and a string in the original language $L$. Taking the parameter count as a proxy for the number of patterns that can be memorized by a neural network (this is true in certain scenarios, such as radial-basis function networks), we can argue that the problem can be solved if the encoder and decoder each have enough parameters, in which case they can implement a hashmap between the strings of the original language and those of the learned latent language defined by the $2^m$ binary strings.

However, when the underlying language is compositional as defined in §3.1, we can have a much more compact representation of the entire language than a hashmap. Given the status quo understanding of neural networks, it is unfortunately impossible to even approximately relate the parameter count to the language specification (production rules and intermediate symbols) and the complexity of using that language. It is however reasonable to assume that there is a monotonic relationship between the number of parameters, $|\theta_{\mathrm{enc}}|$ or $|\theta_{\mathrm{dec}}|$, and the capacity of the network to encode the compositional structures underlying the original language (Collins et al., 2016). Thus, we use the number of parameters as a proxy for capacity.

In summary, there are two axes determining the capacity of the proposed variational autoencoder with a discrete sequence bottleneck: the length $m$ of the latent sequence and the number of parameters $|\theta_{\mathrm{enc}}|$ ($|\theta_{\mathrm{dec}}|$) in the encoder (decoder). (We design the variational autoencoder to be symmetric so that the numbers of parameters in the encoder and decoder are roughly the same.) We vary these two quantities in the experiments and investigate how they affect compositional language learning by the proposed variational autoencoder.

4.3 Implications and hypotheses on compositionality

Under this framework for language learning, we can make the following observations:

If the length of the latent sequence is such that $2^m < |L|$, it is impossible for the model to avoid the failure case, because there will be strings in $L$ that cannot be generated from the trained model. Consequently, recall cannot be maximized. However, this may be difficult to detect using the sample-based estimate, as the chance of sampling one of the missing strings decreases proportionally to the size of $L$. This is especially true when the gap between $2^m$ and $|L|$ is narrow.

When $2^m \ge |L|$, there are three cases. The first is when there are not enough parameters to learn the underlying compositional grammar given by $\Sigma$, $I$, and $R$, in which case $L$ cannot be learned. The second case is when the number of parameters is greater than that required to store all the training strings in $D$. Here, it is highly likely for the model to overfit, as it can map each training string to a unique latent string without having to learn any of $L$'s compositional structure. Lastly, when the number of parameters lies between these two poles, we hypothesize that the model will capture the underlying compositional structure and exhibit systematic generalization.

In short, we hypothesize that the effectiveness of compositional language learning is maximized when the length of the latent sequence is large enough, i.e., $2^m \ge |L|$, and the number of parameters lies between the amount needed to encode the grammar and the amount needed to memorize the training set. We test this experimentally by varying the length of the latent sequence and the number of parameters while checking the sample-based estimates of precision and recall (Eq. (5)-(6)).

5 Experiments

Data

As described in §3.1, we run experiments where the size of the language is much larger than the number of production rules. The task is to communicate six concepts, each of which has ten possible values, for a total dataset size of $10^6$ strings. We build three datasets (training, validation, and test), each containing a finite number of strings, by uniformly selecting random items from $L$ without replacement; randomly selected held-out concept values ensure that concept combinations are unique to each set. The symbol $*$ refers to any number of concepts, as in regular expressions.
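A sketch of how such splits could be constructed, assuming the six-concept, ten-value language and a plain random partition (the paper's exact held-out-combination scheme and split sizes are not reproduced here):

```python
import random
from itertools import product

random.seed(0)

N_CONCEPTS, N_VALUES = 6, 10

# Every string of the language: one value index per concept.
all_strings = list(product(range(N_VALUES), repeat=N_CONCEPTS))  # 10**6 tuples
random.shuffle(all_strings)

# Uniform selection without replacement into disjoint splits (sizes illustrative).
n_train, n_val = 800_000, 100_000
train = all_strings[:n_train]
val = all_strings[n_train:n_train + n_val]
test = all_strings[n_train + n_val:]

assert len(set(train) & set(val)) == 0 and len(set(val) & set(test)) == 0
```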

Models and Learning

We train the proposed variational autoencoder (described in §4.1) on the training set using the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate and weight decay. The Gumbel-Softmax temperature parameter $\tau$ is initialized to 10. Since systematic generalization may only happen in some training runs (Weber et al., 2018), each model is trained with each of 10 seeds for 200k steps.

Model   Total params (Enc / Dec)   Enc. embedding   Enc. LSTM   Enc. linear   Dec. embedding   Dec. LSTM
A       708k / 825k                100              200         300           300              300
B       670k / 690k                40               300         60            125              325

Table 1: The base model hyperparameters used in our experiments (see §4.1 for architecture details), along with the total number of parameters for the two models.

We train two sets of models. Each set is built from a base model, whose architecture is described in Table 1. We gradually decrease the number of LSTM units from the base model by a constant factor. This is how we control the number of parameters ($|\theta_{\mathrm{enc}}|$ and $|\theta_{\mathrm{dec}}|$), the factor we hypothesize to influence the resulting compositionality. We obtain seven models from each of these by varying the length of the latent sequence. These lengths were chosen both because we wanted to show a range of bits and because we need at least 20 bits to cover the $10^6$ strings in $L$ ($2^{20} > 10^6$).
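A sketch of the resulting experimental grid and training loop, reusing the DiscreteSeqVAE sketch from §4.1; the learning rate, weight decay, LSTM scaling factors, bit range, and random-data loader below are placeholders, not the paper's values:

```python
import itertools
import torch
import torch.nn.functional as F

# Hypothetical sweep: scale factors shrink the base LSTM width, bit widths vary
# the channel. The exact values swept in the paper are not reproduced here.
LSTM_SCALES = [1.0, 0.75, 0.5, 0.25, 0.1]
N_BITS = [20, 21, 22, 23, 24]
SEEDS = range(10)

def random_batch(batch_size=32, n_concepts=6, n_values=10):
    """Stand-in for the real data loader: random concept-value combinations."""
    targets = torch.randint(0, n_values, (batch_size, n_concepts))
    x_onehot = F.one_hot(targets, n_values).float().view(batch_size, -1)
    return x_onehot, targets

def train_one(scale, n_bits, seed, steps=200_000):
    torch.manual_seed(seed)
    # DiscreteSeqVAE is the sketch from Section 4.1.
    model = DiscreteSeqVAE(n_bits=n_bits, hidden_dim=int(300 * scale))
    # Placeholder optimizer hyperparameters (not the paper's learning rate / weight decay).
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    for _ in range(steps):
        x_onehot, targets = random_batch()
        loss = model.loss(x_onehot, targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Full grid: 10 seeds for each (LSTM scale, channel width) combination.
# for scale, n_bits, seed in itertools.product(LSTM_SCALES, N_BITS, SEEDS):
#     model = train_one(scale, n_bits, seed)
```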

5.1 Evaluation: residual entropy

Our setup allows us to design a metric with which we can check the compositionality of the learned language by examining how the underlying concepts are described by a string. For instance, a string $s = (s_1, \ldots, s_{n_c})$ describes that the first concept takes the value indicated by $s_1$, the second the value indicated by $s_2$, and so on. Furthermore, we know that the value of a concept $c_j$ is independent of the other concepts $c_{j'}$ ($j' \neq j$), and so our custom generative setup with a discrete latent sequence allows us to inspect a learned language by considering how each concept is encoded by subsets of the latent bits.

Let $b = (b_1, \ldots, b_{n_c})$ be a sequence of partitions of the latent positions $\{1, \ldots, m\}$. We define the degree of compositionality as the ratio between the variability of each concept and the variability explained by the latent subsequence indexed by its associated partition $b_j$. More formally, the degree of compositionality given the partition sequence $b$ is defined as the residual entropy

$$\mathrm{re}(b) = \frac{1}{n_c} \sum_{j=1}^{n_c} \frac{H(c_j \mid z_{b_j})}{H(c_j)},$$

where there are $n_c = 6$ concepts by the definition of our language. When each term inside the summation is close to zero, it implies that the subsequence $z_{b_j}$ explains most of the variability of the specific concept $c_j$, and we consider this situation compositional. The residual entropy of a trained model is then the smallest $\mathrm{re}(b)$ over all possible sequences of partitions, and it spans from 0 (compositional) to 1 (non-compositional).
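A sketch of how this metric could be estimated from samples, using a greedy assignment of bits to concepts in place of the exhaustive search over partitions (a tractability assumption; array shapes and names are illustrative):

```python
import numpy as np
from collections import Counter, defaultdict

def entropy(labels):
    """Empirical entropy (in nats) of a sequence of hashable labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def conditional_entropy(concept, codes):
    """Empirical H(concept | codes); codes is one hashable code per example."""
    groups = defaultdict(list)
    for c, z in zip(concept, codes):
        groups[z].append(c)
    n = len(concept)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def residual_entropy(concepts, latents):
    """
    concepts: (N, n_c) integer array of concept values.
    latents:  (N, n_bits) binary array of messages produced for those inputs.
    Greedily assigns each bit to the concept it is most informative about,
    then averages H(c_j | z_{b_j}) / H(c_j) over concepts (0 = compositional).
    """
    n_examples, n_c = concepts.shape
    n_bits = latents.shape[1]
    partition = defaultdict(list)
    for bit_idx in range(n_bits):
        bit_codes = [(v,) for v in latents[:, bit_idx]]
        j = int(np.argmin([conditional_entropy(concepts[:, j], bit_codes)
                           for j in range(n_c)]))
        partition[j].append(bit_idx)
    total = 0.0
    for j in range(n_c):
        h = entropy(concepts[:, j])
        codes = [tuple(row) for row in latents[:, partition[j]]]
        total += conditional_entropy(concepts[:, j], codes) / h
    return total / n_c

# Example with a perfectly compositional code: each (binary) concept is written
# into its own bit, so the residual entropy is 0.
rng = np.random.default_rng(0)
concepts = rng.integers(0, 2, size=(1000, 3))
latents = concepts.copy()
print(residual_entropy(concepts, latents))  # ~0.0
```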

5.2 Results

Figure 1: Main results for model A, showing the best and worst performances under the proposed metrics over 10 seeds. See Section 5.2 for detailed analysis. Panels (a) and (f) show accuracy on the training data, (b) and (d) show entropy, (e) and (g) show recall over the test data, and (c) plots the maximum difference in accuracy between training and test.

Fig. 1 shows the main findings of our research. In plot (a), we see the minimum parameter count below which the model cannot solve the task but above which it can. Further, note the curve delineated by the lower-left corner of the shift from unsuccessful to successful models. This inverse relationship between bits and parameters shows that the more parameters a model has, the fewer bits it needs to solve the task. Note, however, that it could only solve the task with fewer bits by forming a non-compositional code, suggesting that higher-parameter models are able to do so while lower-parameter ones are not.

Observe further that all of our models above the minimum threshold (72,400 parameters) have the capacity to learn a compositional code. This is shown by the perfect training accuracy achieved by all of those models in plot (a) for 24 bits and by the perfect compositionality (zero entropy) in plot (b) for 24 bits. Together with the above, this validates that learning compositional codes requires less capacity than learning non-compositional codes.

Plot (c) confirms our hypothesis that large models can memorize the entire dataset. Observe that the 24-bit model with 971,400 parameters achieves a training accuracy of 1.0 and a validation accuracy of 0.0. Cross-validating this with plots (d) and (g), we find that a member of the same parameter class is non-compositional and that one achieves unusually low recall. We verified that these are all the same seed, which shows that the agents in this model are memorizing the dataset.

Plots (b) and (e) show that our compositionality metrics pass two sanity checks: high recall and perfect entropy can only be achieved with a channel that is sufficiently large (i.e., 24 bits) to allow for a compositional latent representation.

Plot (f) shows that while capacity does not affect the ability to learn a compositional language across the model range, it does dramatically change learnability. Here we find that smaller models can fail to solve the task at any bandwidth, which coincides with literature suggesting a link between overparameterization and learnability (Li and Liang, 2018; Du et al., 2018).

Finally, as expected, we find that no model learns to solve the task with fewer than 20 bits, validating that the minimum required number of bits for learning a language of size $|L|$ is $\lceil \log_2 |L| \rceil$. We also see that no model learns to solve it at exactly this minimum number of bits, which is likely due to optimization difficulties.

Figure 2: Histograms showing (a) precision and (b) recall over the test set, and (c) entropy, for model A. Each bit/parameter combination is trained for 10 seeds over 200k steps. Precision and recall are computed as described in Eqs. (5) and (6).

In Fig. 2, we present histograms showing precision, recall, and residual entropy measured over the test set for each bit and parameter combination. The histograms show the distributions of these metrics, based on which we make a number of observations.

We first confirm the effectiveness of training by observing that almost all the models achieve perfect precision (Fig. 2 (a)), implying that $\hat{L} \subseteq L$, where $\hat{L}$ is the language learned by the model. This occurs even though our learning objective in Eq. (7) encourages the model to capture all training strings rather than to focus on only a few of them.

A natural follow-up question is how large $\hat{L}$ is. We measure this with recall (Fig. 2 (b)), which shows a clear phase transition according to the model capacity when $2^m \ge |L|$. This agrees with what we saw in Fig. 1 and places the transition at a value close to our predicted boundary. We attribute the remaining gap to the difficulty of learning a perfectly parameterized neural network.

Even when $2^m \ge |L|$, we observe training runs that fail to achieve optimal recall when the number of parameters is below a certain threshold. Due to our insufficient understanding of the relationship between the number of parameters and the capacity of a deep neural network, we cannot draw a rigorous conclusion. We conjecture, however, that this threshold is an upper bound on the minimal model capacity necessary to capture the tested compositional language. Above this threshold, recall is almost always perfect, implying that the model has likely captured the compositional structure underlying $L$ from a finite set of training strings. We run further experiments with substantially more parameters, but do not observe the expected overfitting. We also run experiments with the number of categories per concept reduced (see Appendix for the associated histograms) and similarly do not find the upper bound. It is left for future studies to determine whether this is due to a lack of model capacity to memorize the hashmap between all the strings in $L$ and the latent strings, or due to an inclination towards compositionality in our variational autoencoder.

These results clearly confirm the first part of our hypothesis: the latent sequence length must be at least as large as $\log_2 |L|$. They further confirm that there is a lower bound on the number of parameters above which this variational autoencoder can learn the underlying language well. We have not been able to verify the upper bound in our experiments, which may require either a more (computationally) extensive set of experiments with even more parameters or a better theoretical understanding of the inherent biases behind learning variational autoencoders with a discrete sequence bottleneck, such as recent work on overparameterized models (Belkin et al., 2019).

6 Conclusion

In this paper, we hypothesize a thus far ignored connection between learnability, capacity, bandwidth, and compositionality in language learning. We empirically verify that learning the underlying compositional structure requires less capacity than memorizing a dataset. We also introduce a set of metrics to analyze the compositional properties of a learned language. These metrics are not only well motivated by theoretical insights, but are also cross-validated by our task-specific metric.

This paper opens the door to a vast amount of follow-up research. All of our models were sufficiently large to represent the compositional structure of the language when given sufficient bandwidth; however, there should be an upper bound on capacity above which models no longer represent this compositional structure, and we did not reach it. We consider answering that to be the foremost open question.

Furthermore, while large models did overfit, this was the exception rather than the rule. We hypothesize that this is due to the large number of examples in our language, which almost forces the model to generalize, but note that there are likely additional biases at play that warrant further investigation.

References

7 Appendix

Figure 3: Model A, entropy vs. overfitting: charts showing per-bit results for entropy vs. (training minus validation accuracy) over the parameter range. Observe the two models at 23 and 24 bits that were successful in producing a non-compositional code and consequently overfit the data.
Figure 4: Efficacy results for models A and B: (a) model A training, (b) model A testing, (c) model B training, (d) model B testing.
Figure 5: Precision and recall for models A and B: (a) model A precision, (b) model A recall, (c) model B precision, (d) model B recall. As with model A, we see that model B has perfect precision. However, its recall chart tells a different story. The first takeaway is that while there is still a strong region in the top right bounded below by a minimum parameter count, it does not extend as far down in bits on the left side. This supports our notion of a minimal capacity threshold but adds a wrinkle in that the architecture influences the model's ability to succeed with fewer bits.
Figure 6: Entropy metric for (a) model A and (b) model B, as described in §5.1. As in our analysis of model A in the main text, model B's chart supports the view we had of its recall in Figure 5.
Figure 7: Results when running model A with a reduced number of categories per concept: (a) training, (b) testing, (c) recall, (d) entropy. Here fewer bits are needed to cover all the input combinations. Observe that the histograms do not differ much from the original scenario.