Countering Language Drift with Seeded Iterated Learning

03/28/2020, by Yuchen Lu et al.

Supervised learning methods excel at capturing statistical properties of language when trained over large text corpora. Yet, these models often produce inconsistent outputs in goal-oriented language settings, as they are not trained to complete the underlying task. Moreover, as soon as the agents are finetuned to maximize task completion, they suffer from the so-called language drift phenomenon: they slowly lose syntactic and semantic properties of language as they only focus on solving the task. In this paper, we propose a generic approach to counter language drift by using iterated learning. We alternate between finetuning agents with interactive training steps and periodically replacing them with new agents that are seeded from the last iteration and trained to imitate the latest finetuned models. Iterated learning requires neither external syntactic constraints nor semantic knowledge, making it a valuable task-agnostic finetuning protocol. We first explore iterated learning in the Lewis Game. We then scale up the approach to the translation game. In both settings, our results show that iterated learning drastically counters language drift while also improving the task completion metric.


1 Introduction

Recently, neural language modeling methods have achieved a high level of performance on standard natural language processing tasks (Adiwardana et al., 2020; Radford et al., 2019). Those agents are trained to capture the statistical properties of language by applying supervised learning techniques over large datasets (Bengio et al., 2003; Collobert et al., 2011). While such approaches correctly capture the syntactic and semantic components of language, they give rise to inconsistent behaviors in goal-oriented language settings, such as question answering and other dialogue-based tasks (Gao et al., 2019). Conversational agents trained via traditional supervised methods tend to output uninformative utterances, for example recommending generic locations when booking a restaurant (Bordes et al., 2017). As models are optimized towards generating grammatically valid sentences, they fail to correctly ground utterances to task goals (Strub et al., 2017; Lewis et al., 2017).

A natural follow-up is to reward the agent for solving the actual language task, rather than solely training it to generate grammatically valid sentences. Ideally, such training would incorporate human interaction (Skantze & Hjalmarsson, 2010; Li et al., 2016a), but doing so quickly faces sample-complexity and reproducibility issues. As a consequence, agents are often trained by interacting with a second model that simulates the goal-oriented scenarios (Levin et al., 2000; Schatzmann et al., 2006; Lemon & Pietquin, 2012). In the recent literature, a common setting is to pretrain two neural models with supervised learning to acquire the language structure; then, at least one of the agents is finetuned to maximize task completion with either reinforcement learning, e.g., policy gradient (Williams, 1992), or the Gumbel-softmax straight-through estimator (Jang et al., 2017; Maddison et al., 2017). This finetuning step has shown consistent improvements in dialogue games (Li et al., 2016b; Strub et al., 2017; Das et al., 2017), referential games (Havrylov & Titov, 2017; Yu et al., 2017) and instruction following (Fried et al., 2018).

Figure 1: Sketch of Seeded Iterated Learning. The model is first trained with human data to initialize the prime student. We then follow an iterative procedure where the student is first duplicated. First, one copy of the student is finetuned through interaction to obtain the teacher. Second, the teacher generates training behavioral data under different task scenarios. Third, the other copy of the student is used to seed the imitation learning on the teacher's dataset to obtain the offspring. Finally, the offspring is copied as the student of the next loop.

Unfortunately, interactive learning gives rise to the language drift phenomenon. As the agents are solely optimizing for task completion, they have no incentive to preserve the initial language structure. They start drifting away from the pretrained language output by shaping a task-specific communication protocol. We thus observe a co-adaptation and overspecialization of the agent toward the task, resulting in significant changes to the agent's language distribution. In practice, there are different forms of language drift (Anonymous, 2020) including (i) structural drift: removing grammar redundancy (e.g. "is it a cat?" becomes "is cat?" (Strub et al., 2017)), (ii) semantic drift: altering word meaning (e.g. "an old teaching" means "an old man" (Lee et al., 2019)) or (iii) functional drift: the language results in unexpected actions (e.g. after agreeing on a deal, the agent performs another trade (Li et al., 2016b)). As a result, these agents perform poorly when paired with humans (Chattopadhyay et al., 2017; Zhu et al., 2017; Anonymous, 2020).

In this paper, we introduce the Seeded Iterated Learning (SIL) protocol to counter language drift. This process is directly inspired by the iterated learning procedure used to model the emergence and evolution of language structure (Kirby, 2001; Kirby et al., 2014). SIL requires no human intervention or external knowledge, it is task-agnostic, and it preserves natural language properties while improving task objectives.

As illustrated in Figure 1, SIL starts from a pretrained agent that instantiates a first generation of student agent. The student is first trained for a short period through interaction with another agent in a simulator. Through this interactive training, the student becomes a teacher agent. The teacher then generates a training dataset by performing the task over multiple scenarios. Finally, a duplicate of the student is finetuned via supervised learning to imitate the teacher data, resulting in the iterated learning offspring. This process is then repeated by re-initializing the student agent with the offspring. As further detailed in Section 3, the imitation learning step induces a bias toward preserving the well-structured language, while discarding the emergence of specialized and inconsistent language structure (Kirby, 2001). In sum, SIL interleaves interactive and supervised learning to improve task completion while preserving language properties.

Our contribution In this work, we propose Seeded Iterated Learning and empirically demonstrate its effectiveness in countering language drift. More precisely,

  1. We study core Seeded Iterated Learning properties on the one-turn Sender-Receiver version of the Lewis Game.

  2. We demonstrate the practical viability of Seeded Iterated Learning on the French-German translation game that was specifically designed to assess natural language drift (Lee et al., 2019). We observe that our method preserves both the semantic and syntactic structure of language, successfully countering language drift while outperforming strong baseline methods.

  3. We provide empirical evidence towards understanding the algorithm's mechanisms.

2 Related Work

Countering Language Drift

The recent literature on countering language drift includes a few distinct groups of methods. The first group requires an external labeled dataset, which can be used for visual grounding (i.e., aligning language with visual cues (Lee et al., 2019)), reward shaping (i.e., incorporating a language metric in the task success score (Li et al., 2016b)), or KL minimization (Havrylov & Titov, 2017). Yet, these methods depend on the existence of an extra supervision signal and ad-hoc reward engineering, making them less suitable for general tasks. The second group consists of population-based methods, which enforce social grounding through a population of agents, preventing them from straying away from the common language (Agarwal et al., 2019).

The third group of methods alternates between an interactive training phase and a supervised training phase on a pretraining dataset (Wei et al., 2018; Lazaridou et al., 2016). This approach has been formalized by Gupta et al. (2019) as Supervised-2-selfPlay (S2P). Empirically, S2P has shown impressive resistance to language drift and, being relatively task-agnostic, it can be considered a strong baseline for SIL. However, the success of S2P is highly dependent on the quality of the fixed training dataset, which in practice may be noisy, small, and only tangentially related to the task. In comparison, SIL is less dependent on the initial training dataset since we keep generating new training samples from the teacher throughout training.

Iterated Learning in Emergent Communication

Iterated learning was initially proposed in the field of cognitive science to explore the fundamental mechanisms of language evolution and the persistence of language structure across human generations (Kirby, 2001, 2002). In particular, Kirby et al. (2014) showed that iterated learning consistently turns unstructured proto-language into stable compositional communication protocols in both mathematical modelling and human experiments. Recent works (Guo et al., 2019; Li & Bowling, 2019; Ren et al., 2020; Cogswell et al., 2019; Dagan et al., 2020) have extended iterated learning to deep neural networks. They show that the inductive learning bottleneck during the imitation learning phase encourages compositionality in the emerged language. Our contribution differs from previous work in this area as we seek to preserve the structure of an existing language rather than to emerge a new structured language.

Lifelong Learning

One of the key problems for neural networks is catastrophic forgetting (McCloskey & Cohen, 1989). We argue that language drift can also be viewed as a lifelong learning problem, since the agent needs to retain its knowledge of language while acquiring new knowledge on using language to solve the task. From this perspective, S2P can be viewed as a task rehearsal strategy (Silver & Mercer, 2002) for lifelong learning. The success of iterated learning against language drift could motivate the development of similar methods for countering catastrophic forgetting.

Self-training

Self-training augments the original labeled dataset with unlabeled data paired with the model's own predictions (He et al., 2020). After noisy self-training, the student may outperform the teacher in fields like conditional text generation (He et al., 2020), image classification (Xie et al., 2019) and unsupervised machine translation (Lample et al., 2018). This process is similar to the imitation learning phase of SIL, except that we only use the self-labeled data.

3 Method

Learning Bottleneck in Iterated Learning

The core component of iterated learning is the existence of the learning bottleneck (Kirby, 2001): a newly initialized student only acquires the language from a limited number of examples generated by the teacher. This bottleneck implicitly favors any structural property of the language that can be exploited by the learner to generalize, such as compositionality.

Yet, Kirby (2001) assumes the student to be a perfect inductive learner that can achieve systematic generalization (Bahdanau et al., 2019). Neural networks are still far from achieving such a goal. Instead of using a limited amount of data as suggested, we propose to use a regularization technique, such as limiting the number of imitation steps, to reduce the ability of the student network to memorize the teacher's data, effectively simulating the learning bottleneck.

Seeded Iterated Learning

As previously mentioned, Seeded Iterated Learning (SIL) is an extension of iterated learning that aims at preserving an initial language distribution while finetuning the agent to maximize the task score. SIL iteratively refines a pretrained agent, namely the student. The first student generation is initialized as the pretrained model that is trained from human data. The student then undergoes an interactive training phase to maximize the task score and produces a new agent: the teacher. The teacher generates a new training dataset by providing pseudo-labels. This synthetic dataset is used in an imitation learning phase via supervised learning of the offspring, which we seed with a copy of the student. Finally, the trained offspring is used to initialize the student of the next iteration. We repeat the process until the task score converges. The full pipeline is illustrated in Figure 1. Methodologically, the key modification of SIL with respect to the original iterated learning framework is the use of the student agent to seed the offspring, rather than initializing it from tabula rasa or a pretrained model. Our motivation is to smoothly refine the seeded model, reducing at the same time the training sample-complexity. Although this paper focuses on countering language drift, we emphasize that SIL is task-agnostic and can be extended to other machine learning settings.

4 The Sender-Receiver Framework

Figure 2: In the translation game, the sentence is translated into English then into German. The second and fourth cases are regular failures, while the third case reveals a form of agent co-adaptation.

We here introduce the experimental framework we use to study the impact of SIL on language drift. We first introduce the Sender-Receiver (S/R) Game to assess language learning and then detail the instantiation of SIL for this setting.

Sender-Receiver Games

S/R Games are cooperative two-player language games in which the first player, the sender, must communicate its knowledge to the second player, the receiver, to solve an arbitrary given task. The game can be multi-turn with feedback messages, or single-turn where the sender outputs a single utterance. In this paper, we focus on the single-turn scenario as it eases the language analysis. Yet, our approach may be generalized to multi-turn scenarios. Figures 2 and 3 show two instances of the S/R games studied here: the Translation game (Lee et al., 2019) and the Lewis game (Kottur et al., 2017).

Formally, a single-turn S/R game is defined as a tuple $\langle \mathcal{X}, \mathcal{V}, \mathcal{A}, R \rangle$. At the beginning of each episode, an observation (or scenario) $x \in \mathcal{X}$ is sampled. Then, the sender emits a message $m = [w_1, \dots, w_L]$, where the message is a sequence of words from a vocabulary $\mathcal{V}$. The receiver gets the message and performs an action $a \in \mathcal{A}$. Finally, both agents receive the same reward $R(a, x)$, which they aim to maximize.
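To make this formalism concrete, the following is a minimal Python sketch of a one-turn S/R game loop. The class and function names (SenderReceiverGame, play_episode, and the toy identity sender/receiver) are illustrative assumptions, not part of the paper's implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class SenderReceiverGame:
    """One-turn S/R game: the sender maps a scenario x to a message m over the
    vocabulary, the receiver maps m to an action a, and both get reward R(a, x)."""
    vocabulary: Sequence[str]
    sample_scenario: Callable[[], object]          # draws x from X
    reward: Callable[[object, object], float]      # R(a, x)

    def play_episode(self, sender, receiver) -> float:
        x = self.sample_scenario()
        m: List[str] = sender(x)                   # message (word sequence)
        a = receiver(m)                            # receiver action
        return self.reward(a, x)                   # shared reward

# Toy instance: a 2-property Lewis-style game with an identity language.
game = SenderReceiverGame(
    vocabulary=[f"w{i}" for i in range(5)],
    sample_scenario=lambda: [random.randrange(5) for _ in range(2)],
    reward=lambda a, x: float(a == x),
)
reward = game.play_episode(
    sender=lambda x: [f"w{v}" for v in x],         # perfect identity mapping
    receiver=lambda m: [int(w[1:]) for w in m],    # parses words back to values
)
assert reward == 1.0
```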

0:  Pretrained parameters of the sender $\theta_s^{pre}$ and receiver $\theta_r^{pre}$
0:  Training scenarios $\{x\}$ {or scenario generator}
1:  Copy $\theta_s^{pre}, \theta_r^{pre}$ to $\theta_s^{S}, \theta_r^{S}$ {Prepare Iterated Learning}
2:  repeat
3:     Copy $\theta_s^{S}, \theta_r^{S}$ to $\theta_s^{T}, \theta_r^{T}$ {Initialize Teacher}
4:     for $t = 1$ to $k_1$ do
5:         Sample a batch of scenarios $x$
6:         Get $m \sim \pi_{\theta_s^{T}}(\cdot \mid x)$ and $a \sim \pi_{\theta_r^{T}}(\cdot \mid m)$ to obtain $R(a, x)$
7:         Update $\theta_s^{T}$ and $\theta_r^{T}$ to maximize $R(a, x)$
8:     end for
9:     Copy $\theta_s^{S}$ to $\theta_s^{O}$ {Initialize Offspring Sender}
10:     for $t = 1$ to $k_2$ do
11:         Sample a batch of scenarios $x$
12:         Sample $m \sim \pi_{\theta_s^{T}}(\cdot \mid x)$
13:         Update $\theta_s^{O}$ with supervised learning on $(x, m)$
14:     end for
15:     Copy $\theta_r^{S}$ to $\theta_r^{O}$ {Initialize Offspring Receiver}
16:     for $t = 1$ to $k_3$ do
17:         Sample a batch of scenarios $x$
18:         Get $m \sim \pi_{\theta_s^{O}}(\cdot \mid x)$ and $a \sim \pi_{\theta_r^{O}}(\cdot \mid m)$ to obtain $R(a, x)$
19:         Update $\theta_r^{O}$ to maximize $R(a, x)$
20:     end for
21:     Copy $\theta_s^{O}, \theta_r^{O}$ to $\theta_s^{S}, \theta_r^{S}$ {Re-initialize Student}
22:  until convergence or maximum steps reached
Algorithm 1 Seeded Iterated Learning for S/R Games

SIL For S/R Game

We consider two parametric models, the sender $\pi_{\theta_s}$ and the receiver $\pi_{\theta_r}$. Following the SIL pipeline, we use the superscripts $S$, $T$, and $O$ to respectively denote the parameters of the student, teacher, and offspring. For instance, $\theta_r^{T}$ refers to the teacher receiver. We also assume that we have a set of scenarios that are fixed or generated on the fly. We detail the SIL protocol for single-turn S/R games in Algorithm 1.

In one-turn S/R games, the language is only emitted by the sender, while the receiver's role is to interpret the sender's message and use it to perform the remaining task. With this in mind, we train the sender through the SIL pipeline as defined in Section 3 (i.e., interaction, generation, imitation), while we train the receiver to quickly adapt to the sender's language distribution with the goal of stabilizing training (Ren et al., 2020). First, we jointly train $\theta_s^{T}$ and $\theta_r^{T}$ during the SIL interactive learning phase. Second, the sender offspring imitates the labels generated by the teacher sender through greedy sampling. Third, the receiver offspring, initialized with the student receiver parameters, is trained by maximizing the task score $R(a, x)$ where $m \sim \pi_{\theta_s^{O}}(\cdot \mid x)$ and $a \sim \pi_{\theta_r^{O}}(\cdot \mid m)$. In other words, we finetune the receiver with interactive learning while freezing the sender parameters. After each iteration, the student agents are re-initialized with the offspring parameters. SIL has three training hyperparameters: (i) $k_1$, the number of interactive learning steps that are performed to obtain the teacher agents, (ii) $k_2$, the number of sender imitation steps, and (iii) $k_3$, the number of interactive steps that are performed to finetune the receiver with the offspring sender. Unless stated otherwise, $k_3$ is kept at a fixed default value.

Figure 3: Lewis game. Given the input object, the sender emits a compositional message that is parsed by the receiver to retrieve the object properties. In the language drift setting, both models are trained toward the identity mapping while solving the reconstruction task.

Gumbel Straight-Through Estimator

In the one-turn S/R game, the task success can generally be described as a differentiable loss, such as the cross-entropy used to update the receiver parameters. Therefore, we here assume that the receiver can maximize task completion by minimizing classification or regression errors. To estimate the task loss gradient with respect to the sender parameters, the receiver gradient can be further backpropagated using the Gumbel-softmax straight-through estimator (GSTE) (Jang et al., 2017; Maddison et al., 2017). Hence, the sender parameters are directly optimized toward the task loss. Given a sequential message $m = [w_1, \dots, w_L]$, we define the relaxed word sample $\tilde{w}_l$ as follows:

$$\tilde{w}_l = \mathrm{softmax}\Big(\big(\log p(w \mid w_{<l}, x) + g\big)/\tau\Big), \qquad g \sim \mathrm{Gumbel}(0, 1), \tag{1}$$

where $p(w \mid w_{<l}, x)$ is the categorical probability of the next word given the sender observation $x$ and the previously generated tokens, and $\tau$ is the Gumbel temperature that levels exploration. Unless stated otherwise, $\tau$ is kept fixed. Finally, we sample the next word by taking $w_l = \mathrm{one\_hot}(\arg\max_w \tilde{w}_l)$ before using the straight-through gradient estimator to approximate the sender gradient:

$$\frac{\partial w_l}{\partial \theta_s} \approx \frac{\partial \tilde{w}_l}{\partial \theta_s}. \tag{2}$$

SIL can be applied with RL methods when dealing with non-differentiable reward metrics (Lee et al., 2019); however, RL suffers from high gradient variance, so we use GSTE as a starting point. Since GSTE only optimizes for task completion, language drift still appears.
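Equations (1) and (2) correspond to the standard straight-through Gumbel-softmax estimator, which PyTorch exposes as `torch.nn.functional.gumbel_softmax`. Below is a minimal sketch of one decoding step; the logits tensor and the toy loss are stand-ins for the sender's per-step word distribution and the receiver's task loss, not the paper's actual model.

```python
import torch
import torch.nn.functional as F

def gumbel_st_step(word_logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """One Gumbel straight-through decoding step (Eqs. 1-2).

    `word_logits` has shape (batch, vocab_size) and plays the role of
    log p(w | w_<l, x). The returned tensor is one-hot in the forward pass
    (argmax of the Gumbel-perturbed logits) while its gradient flows through
    the relaxed softmax sample, so the sender can be trained on the task loss.
    """
    return F.gumbel_softmax(word_logits, tau=tau, hard=True)

# Toy check: the output is one-hot yet differentiable w.r.t. the logits.
logits = torch.randn(4, 10, requires_grad=True)
one_hot_words = gumbel_st_step(logits)
loss = (one_hot_words * torch.randn(4, 10)).sum()   # stand-in for a task loss
loss.backward()                                      # gradients reach `logits`
```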

5 Building Intuition: The Lewis Game

In this section, we explore a toy referential game based on the Lewis game (Lewis, 1969) to obtain a fine-grained analysis of language drift and of the impact of SIL.

(a) Task Score
(b) Sender Language Score
Figure 4: Task score and language score for SIL vs. baselines on the Lewis game. SIL clearly outperforms the baselines, and the emergent-language baseline's language score is close to zero. All results are averaged over four seeds.

Experimental Setting

We summarize the Lewis game instantiation described in Gupta et al. (2019) to study language drift, and we illustrate it in Figure 3. The sender observes an object with $p$ properties, where each property takes one of $v$ possible values. The sender then sends a message of $p$ words drawn from a vocabulary whose size equals the total number of property values. Our predefined language uniquely maps each property value to a word, and the ground-truth message for an object lists the words associated with its property values. We study whether this language mapping is preserved during S/R training.

The sender and receiver are modeled by two-layer feed-forward networks. In our task, we use $p = v = 5$, for a total of 3125 unique objects. We split this set of objects into three parts: the first split (pre-train) is labeled with correct messages to pretrain the initial agents, the second split is used for the training scenarios, and the third split is held out (HO) for final evaluation. The dataset split and hyperparameters can be found in Appendix B.1.

We use two main metrics to monitor training: the Sender Language Score (LS) and the Task Score (TS). For the sender language score, we enumerate the held-out objects and compare the generated messages with the ground-truth language on a per-token basis. For task accuracy, we compare the reconstructed object with the ground-truth object for each property. Formally, given a held-out object $x = (x_1, \dots, x_p)$ with ground-truth message $m^\star(x)$, sender message $m(x)$, and receiver reconstruction $\hat{x}$, we have:

$$\mathrm{LS} = \frac{1}{|\mathrm{HO}|} \sum_{x \in \mathrm{HO}} \frac{1}{p} \sum_{i=1}^{p} \mathbb{1}\big[m_i(x) = m_i^\star(x)\big], \tag{3}$$

$$\mathrm{TS} = \frac{1}{|\mathrm{HO}|} \sum_{x \in \mathrm{HO}} \frac{1}{p} \sum_{i=1}^{p} \mathbb{1}\big[\hat{x}_i = x_i\big], \tag{4}$$

where $\mathbb{1}[\cdot]$ is the indicator function.
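As a small illustration, Eqs. (3) and (4) can be computed with a few lines of NumPy, assuming messages, reconstructions, and objects are encoded as integer arrays of shape (num_objects, p); the function names here are ours, not the paper's.

```python
import numpy as np

def language_score(messages: np.ndarray, gold_messages: np.ndarray) -> float:
    """Per-token agreement between generated and ground-truth messages (Eq. 3)."""
    return float((messages == gold_messages).mean())

def task_score(reconstructions: np.ndarray, objects: np.ndarray) -> float:
    """Per-property agreement between receiver reconstructions and objects (Eq. 4)."""
    return float((reconstructions == objects).mean())

# Example on 3 held-out objects with p = 5 properties.
gold = np.array([[0, 1, 2, 3, 4], [1, 1, 1, 1, 1], [4, 3, 2, 1, 0]])
pred = np.array([[0, 1, 2, 3, 0], [1, 1, 1, 1, 1], [4, 3, 2, 1, 0]])
print(language_score(pred, gold))   # 14 of 15 tokens correct
```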

Baselines

In our experiments, we compare SIL with several baselines. All methods are initialized with the same pretrained model unless stated otherwise. The Gumbel baseline is finetuned with GSTE during interaction; it corresponds to a naive application of interactive training and is expected to exhibit language drift. Emergent is a random initialization trained with GSTE. S2P indicates that the agents are trained with Supervised-2-selfPlay. Our S2P uses a weighted sum of the losses at each step, $\mathcal{L} = \mathcal{L}_{\mathrm{interactive}} + \alpha\,\mathcal{L}_{\mathrm{sup}}$, where $\mathcal{L}_{\mathrm{sup}}$ is the supervised loss on the pre-train dataset and $\alpha$ is a hyperparameter with a default value of 1, as detailed in (Lazaridou et al., 2016; Anonymous, 2020).
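A sketch of one S2P update under this weighted-loss formulation; the two loss callables and the optimizer are placeholders for the actual agent code rather than the authors' implementation.

```python
import torch

def s2p_step(interactive_loss_fn, supervised_loss_fn, optimizer,
             interactive_batch, pretrain_batch, alpha: float = 1.0) -> float:
    """One S2P update: self-play loss on an interactive batch plus alpha times
    the supervised loss on a batch from the fixed pretraining dataset."""
    loss = interactive_loss_fn(interactive_batch) \
         + alpha * supervised_loss_fn(pretrain_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```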

(a) SIL
(b) Emergent
(c) Gumbel
Figure 5: Comparison of the sender's mapping, where columns are words and rows are property values. Emergent communication uses the same word to refer to multiple property values. A perfectly mapped language would be the identity matrix.

Results

We present the main results for the Lewis game in Figure 4. For each method, we used its best-performing hyperparameters. We also observed that SIL outperforms the baselines across the swept hyperparameter values. Additional results are given in Appendix B (Figures 12 and 13).

The pretrained agent has an initial task score and language score of around 65%, showing an imperfect language mapping while allowing room for task improvement. Both Gumbel and S2P are able to increase the task and language score on the held-out dataset. For both baselines, the final task score is higher than the language score. This means that some objects are reconstructed successfully with incorrect messages, suggesting language drift has occurred.

Note that, for S2P, there is some instability of the language score at the end of training. We hypothesize that this is because our pretraining dataset in this toy setting is too small, and as a result, S2P overfits that small dataset. Emergent communication has a sender language score close to zero, which is expected. However, it is interesting that emergent communication also has a slightly lower held-out task score than Gumbel, suggesting that starting from a pretrained model provides some prior that helps the model generalize better. Finally, we observe that SIL achieves a significantly higher task score and sender language score, outperforming the other baselines. A high language score also shows that the sender leverages the initial language structure rather than merely re-inventing a new language, countering language drift in this synthetic experiment.

To better visualize the underlying language drift in this setting, we display the sender's mapping from property values to words in Figure 5. We observe that the freely emerged language re-uses the same words for different property values. The higher a method's language score, the closer the resulting map is to the identity matrix.

(a) Heatmap for Task Score
(b) Heatmap for Language Score
Figure 6: Sweep over the length $k_1$ of the interactive learning phase and the length $k_2$ of the imitation phase on the Lewis game (darker is higher). Both low and high $k_1$ result in poor task and language scores. Similarly, low $k_2$ induces poor results, while high $k_2$ does not reduce performance as one would expect.

SIL Properties

We perform a hyperparameter sweep for the Lewis game in Figure 6 over the core SIL parameters $k_1$ and $k_2$, which are, respectively, the lengths of the interactive and imitation training phases. We keep $k_3$ fixed, since in this toy setting the receiver can always adjust to the sender quickly. We find that, for each $k_2$, the best $k_1$ lies in the middle of the range. This is expected, since a small $k_1$ lets the imitation phase constantly disrupt the normal interactive learning, while a large $k_1$ entails an already drifted teacher. We also see that $k_2$ must be high enough to successfully transfer the teacher distribution to the offspring. However, when an extremely large $k_2$ is set, we do not observe the performance drop predicted by the learning bottleneck: overfitting of the student to the teacher should reduce SIL's resistance to language drift. To resolve this dilemma, we slightly modify the imitation learning process. Instead of doing supervised learning on samples from the teacher, we explicitly let the student imitate the complete teacher distribution by minimizing $D_{\mathrm{KL}}\big(\pi_{\theta_s^{T}} \,\|\, \pi_{\theta_s^{O}}\big)$, which is equivalent to maximizing $\mathbb{E}_{m \sim \pi_{\theta_s^{T}}}\big[\log \pi_{\theta_s^{O}}(m \mid x)\big]$. The result is shown in Figure 7: increasing $k_2$ now leads to a loss of performance, which confirms our hypothesis. In conclusion, SIL performs well in a (large) valley of parameters, and a proper imitation learning process is crucial for constructing the learning bottleneck.
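A sketch of this distillation variant of the imitation phase, assuming both sender networks expose per-token logits of shape (batch, seq_len, vocab); minimizing this loss matches the full teacher distribution instead of its greedy samples.

```python
import torch
import torch.nn.functional as F

def distill_imitation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) summed over tokens, averaged over the batch.

    Minimizing this loss is equivalent to maximizing the expected student
    log-likelihood under the teacher distribution, i.e. soft-label distillation.
    """
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```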

6 Experiments: The Translation Game

Although insightful, the Lewis game misses some core language properties, e.g., word ambiguity and a realistic word distribution. As it relies on a basic finite language, it would be premature to draw too many conclusions from this simple setting (Hayes, 1988). In this section, we present a larger-scale application of SIL in a natural language setting by exploring the translation game (Lee et al., 2019).

(a) Language score (argmax)
(b) Language score (KL-distill)
Figure 7: Language score for different $k_2$ when imitating greedy teacher samples with cross-entropy (left) vs. distilling the full teacher distribution with KL minimization (right). As distillation relaxes the learning bottleneck, we observe a drop in language score when the student starts overfitting the teacher distribution.
(a) BLEU De (Task Score)
(b) BLEU En
(c) R1
(d) NLL
Figure 8: The task score and the language scores of SIL, S2P, and the Gumbel baselines. Fix Sender indicates the maximum performance the sender may achieve without agent co-adaptation. We observe that the Gumbel language starts drifting as the task score increases. Gumbel Ref Len artificially limits the English message length, which caps the drift. Finally, SIL manages to increase both the language and the task score.
(a) BLEU De (Task Score)
(b) BLEU En
Figure 9: S2P sweep over the imitation loss weight $\alpha$ relative to the interactive loss. S2P displays a trade-off between a high task score, which requires a low imitation weight, and a high language score, which requires a high imitation weight. SIL appears less susceptible to a trade-off between these metrics.

Experimental Setting

The translation game is an S/R game in which two agents translate a text from a source language, French (Fr), to a target language, German (De), through a pivot language, English (En). This framework allows the evaluation of the English language evolution through translation metrics while optimizing for the Fr→De translation task, making it a perfect fit for our language drift study.

The translation agents are sequence-to-sequence models with gated recurrent units (Cho et al., 2014) and attention (Bahdanau et al., 2015). They are first independently pretrained on the IWSLT dataset (Cettolo et al., 2012) to learn the initial language distribution. The agents are then finetuned with interactive learning by sampling new translation scenarios from the Multi30k dataset (Elliott et al., 2016), which contains 30k images with the same caption translated into French, English, and German. We generally follow the experimental setting of Lee et al. (2019) for model architecture, dataset, and pre-processing, which we describe in Appendix C.1 for completeness. However, in our experiments, we use GSTE to optimize the sender, whereas Lee et al. (2019) rely on policy gradient methods to directly maximize the task score.

Evaluation metrics

We monitor the task score with BLEU(De) (Papineni et al., 2002), which estimates the quality of the Fr→De translation by comparing the translated German sentences to the ground-truth German. We then measure the sender language score with three metrics. First, we evaluate the overall language drift with the BLEU(En) score against the ground-truth English captions. As this BLEU score controls the alignment between the intermediate English messages and the French input texts, it captures basic syntactic and semantic language variations. Second, we evaluate the structural drift with the negative log-likelihood (NLL) of the generated English under a pretrained language model. Third, we evaluate the semantic drift by computing the image retrieval accuracy (R1) with a pretrained image ranker: the model must fetch the ground-truth image among 19 distractors given the generated English. The language model and image ranker are further detailed in Appendix D.2.
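A sketch of how these three sender-language metrics could be computed: `sacrebleu` provides corpus BLEU, while the language-model and image-ranker scorers are passed in as callables, since the paper's pretrained scoring models are not reproduced here.

```python
import sacrebleu

def bleu_en(generated_english, reference_english):
    """Corpus BLEU of the intermediate English messages vs. the gold captions."""
    return sacrebleu.corpus_bleu(generated_english, [reference_english]).score

def mean_nll(generated_english, lm_nll):
    """Average negative log-likelihood under a pretrained LM.
    `lm_nll(sentence) -> float` is a placeholder for the paper's scorer."""
    return sum(lm_nll(s) for s in generated_english) / len(generated_english)

def recall_at_1(generated_english, candidate_images, ranker_score, gold_index=0):
    """Image-retrieval accuracy: fraction of captions whose ground-truth image
    (assumed to sit at `gold_index` among the 19 distractors) is ranked first
    by `ranker_score(caption, image)`."""
    hits = 0
    for caption, images in zip(generated_english, candidate_images):
        scores = [ranker_score(caption, img) for img in images]
        hits += int(max(range(len(images)), key=lambda i: scores[i]) == gold_index)
    return hits / len(generated_english)
```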

Results

We show our main results in Figure 8, with a full summary in Table 2 in Appendix C. Runs are averaged over five seeds, and shaded areas denote one standard deviation. The x-axis shows the number of interactive learning steps.

After pretraining our language agents on the IWSLT corpus, we measure the single-agent BLEU scores for Fr→En and En→De on the Multi30k captions. When combining the two agents, the Fr→De task score drops to 15.68 BLEU(De) (Table 2), showing a compounding error in the translation pipeline. We thus aim to overcome this misalignment between translation agents through interactive learning while preserving fluent intermediate English.

As a first step, we freeze the sender to evaluate the maximum task score achievable without agent co-adaptation. Fix Sender improves the task score by 5.3 BLEU(De) while keeping the language score constant by construction. As we later achieve a higher task score with Gumbel, this shows that merely fixing the sender greatly limits the overall task performance.

We observe that the Gumbel agent improves the task score by 11.32 BLEU(De) points, but the language score collapses by 10.2 BLEU(En) points, clearly showing language drift as the two agents co-adapt to solve the translation game. Lee et al. (2019) also constrain the English message length to not exceed the French input caption length, as they observe that language drift often entails long messages. Yet, this strong inductive bias only slows down language drift, and the language score still falls by 6.0 BLEU(En) points. Finally, SIL improves the task score by 12.6 BLEU(De) while preserving the language score of the pretrained model. Thus, SIL successfully counters language drift in the translation game while optimizing for task completion.

S2P vs SIL

We compare the S2P and SIL learning dynamics in Figure 9 and in Figure 15 in Appendix C. S2P balances the supervised and interactive losses through the imitation loss weight $\alpha$ (Lazaridou et al., 2016). First, we observe that a low value, i.e., $\alpha = 0.1$, improves the task score by 11.8 BLEU(De), matching SIL performance, but the language score diverges. We thus increase $\alpha$ to 1 and then 5, which stops the language drift and even outperforms the SIL language score by 1.2 BLEU(En) points. However, this language stabilization also lowers the task score by 0.9 BLEU(De) and 3.6 BLEU(De), respectively, compared to SIL. In other words, S2P has an inherent trade-off between task score (with low $\alpha$) and language score (with high $\alpha$), whereas SIL consistently excels on both. We conjecture that S2P is inherently constrained by its initial training dataset.

Example 1: SIL successfully prevents language drift
Human:    two men, one in blue and one in red, compete in a boxing match.
Pretrain: two men, one in blue and the other in red, fight in a headaching game
Gumbel:   two men one of one in blue and the other in red cfighting in a acacgame………
S2P:      two men, one in blue and the other in red, fighting in a kind of a kind.
SIL:      two men, one in blue and the other in red, fighting in a game.

Example 2: SIL remains close to the valid pretrained model
Human:    there are construction workers working hard on a project
Pretrain: there are workers working hard work on a project.
Gumbel:   there are construction working hard on a project ………..
S2P:      there are workers working hard working on a project ..
SIL:      there are workers working hard on a project .

Example 3: SIL partially recovers the sentence without drifting
Human:    a group of friends lay sprawled out on the floor enjoying their time together.
Pretrain: a group of friends on the floor of fun together.
Gumbel:   a group of defriends comadeof on the floor together of of of of of together……………
S2P:      a group of friends of their commodities on the floor of fun together.
SIL:      a group of friends that are going on the floor together.

Example 4: SIL/S2P still drift when facing rare word occurrences ("shaped lollipop")
Human:    a closeup of a child's face eating a blue , heart shaped lollipop.
Pretrain: a big one 's face plan a blue box.
Gumbel:   a big face of a child eating a blue th-acof of of of chearts…….
S2P:      a big face plan of eating a blue of the kind of hearts.
SIL:      a big plan of a child eating a blue datadof the datadof the datadof the data@@

Table 1: Selected generated English captions. Vanilla Gumbel drifts by losing grammatical structure, repeating patches of words, and injecting noisy words. Both S2P and SIL counter language drift by generating approximately correct and understandable sentences. However, they become unstable when dealing with rare word occurrences.

Syntactic and Semantic Drifts

As described in Section 6, we decompose language drift into syntactic drift, measured by language likelihood (NLL), and semantic drift, measured by aligning images with the generated captions (R1). In Figure 8, we observe a clear correlation between these two metrics and the drop in the language BLEU(En) score. For instance, vanilla Gumbel simultaneously diverges on all three scores, while the sequence-length constraint caps the drifts. We observe that SIL does not improve language semantics, i.e., R1 remains constant during training, whereas it produces more likely sentences as the NLL improves by 11%. S2P preserves semantics slightly better, but its language likelihood does not improve since the agent stays close to the initial distribution.

Figure 10: NLL of the teacher and the offspring after each imitation learning phase. In the majority of iterations, the offspring obtains a lower NLL than the teacher after supervised training on the teacher's generated data.

SIL Mechanisms

We here verify the initial motivations behind SIL by examining the impact of the learning bottleneck in Figure 10 and the structure-preserving ability of SIL in Figure 11. As motivated in Section 3, each imitation phase in SIL aims to filter out emergent unstructured language by generating an intermediate dataset to train the offspring. To verify this hypothesis, we examine the evolution of the language likelihood (NLL) from one teacher to its offspring during training. We observe that the offspring consistently improves the language likelihood of its ancestor, indicating a more regular language production induced by the imitation step. In another experiment, we stop the iterated learning loop after 20k, 40k, and 60k steps, only finetuning the final offspring through interaction. We observe that the agent's language score starts dropping dramatically as soon as we stop SIL, while the task score keeps improving. This finding supports the view that SIL keeps preventing language drift throughout training, and that the language drift phenomenon itself is robust rather than the result of an unstable initialization point.

Qualitative Analysis

In Table 1, we show some hand-selected examples of English messages from the translation game. As expected, we observe that the vanilla Gumbel agent diverges from the pretrained language model into unstructured sentences, repeating final dots or words. It also introduces unrecognizable words such as "cfighting" or "acacgame" by randomly pairing up sub-words whenever it faces rare word tokens. S2P and SIL successfully counter the language drift, producing syntactically valid language. However, they can still produce semantically inconsistent captions, which may be due to the weak pretrained model and the lack of grounding (Lee et al., 2019). Finally, we still observe language drift when dealing with rare word occurrences. Additional global language statistics in the Appendix support that SIL preserves the statistical properties of language.

(a) BLEU De
(b) BLEU En
Figure 11: Effect of stopping SIL earlier in the training process, with the SIL maximum steps set to 20k, 40k, and 60k. SIL appears to be important for preventing language drift throughout training.

7 Conclusion

In this paper, we proposed a method to counter language drift in task-oriented language settings. The method, named Seeded Iterated Learning, is based on the broader principle of iterated learning and alternates between imitation learning and task optimization steps. We modified the iterated learning principle so that it starts from a seed model trained on actual human data and preserves the language properties during training. Our extensive experimental study revealed that this method outperforms standard baselines both in keeping a syntactic language structure and in solving the task. As future work, we plan to test this method on complex dialogue tasks involving stronger cooperation between agents.

Acknowledgements

We thank the authors of the paper Countering Language Drift via Visual Grounding (Jason Lee, Kyunghyun Cho, and Douwe Kiela) for sharing their original codebase with us. We thank Angeliki Lazaridou for her insightful guidance throughout this project. We also thank Anna Potapenko, Olivier Tieleman, and Philip Paquette for helpful discussions. This research was enabled in part by computational support provided by Compute Canada (www.computecanada.ca).

References

  • Adiwardana et al. (2020) Adiwardana, D., Luong, M.-T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.
  • Agarwal et al. (2019) Agarwal, A., Gurumurthy, S., Sharma, V., Lewis, M., and Sycara, K. Community regularization of visually-grounded dialog. In Proc. of International Conference on Autonomous Agents and MultiAgent Systems, 2019.
  • Anonymous (2020) Anonymous. Multi-agent communication meets natural language: Synergies between functional and structural language learning. ACL, Under Submission, 2020.
  • Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. of of International Conference on Learning Representations, 2015.
  • Bahdanau et al. (2019) Bahdanau, D., Murty, S., Noukhovitch, M., Nguyen, T. H., de Vries, H., and Courville, A. Systematic generalization: What is required and can it be learned? In Proc. of International Conference on Learning Representations, 2019.
  • Bengio et al. (2003) Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
  • Bordes et al. (2017) Bordes, A., Boureau, Y.-L., and Weston, J. Learning end-to-end goal-oriented dialog. In Proc. of International Conference on Learning Representations, 2017.
  • Cettolo et al. (2012) Cettolo, M., Girardi, C., and Federico, M. Wit3: Web inventory of transcribed and translated talks. In Proc. of Conference of european association for machine translation, 2012.
  • Chattopadhyay et al. (2017) Chattopadhyay, P., Yadav, D., Prabhu, V., Chandrasekaran, A., Das, A., Lee, S., Batra, D., and Parikh, D. Evaluating visual conversational agents via cooperative human-ai games. In Proc. of AAAI Conference on Human Computation and Crowdsourcing, 2017.
  • Chazelle & Wang (2017) Chazelle, B. and Wang, C. Self-sustaining iterated learning. In Proc. of the Innovations in Theoretical Computer Science Conference, 2017.
  • Chazelle & Wang (2019) Chazelle, B. and Wang, C. Iterated learning in dynamic social networks. The Journal of Machine Learning Research, 20(1):979–1006, 2019.
  • Cho et al. (2014) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Proc. of Empirical Methods in Natural Language Processing, 2014.
  • Cogswell et al. (2019) Cogswell, M., Lu, J., Lee, S., Parikh, D., and Batra, D. Emergence of compositional language with deep generational transmission. arXiv preprint arXiv:1904.09067, 2019.
  • Collobert et al. (2011) Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of machine learning research, 12(Aug):2493–2537, 2011.
  • Dagan et al. (2020) Dagan, G., Hupkes, D., and Bruni, E. Co-evolution of language and agents in referential games. arXiv preprint arXiv:2001.03361, 2020.
  • Das et al. (2017) Das, A., Kottur, S., Moura, J. M., Lee, S., and Batra, D. Learning cooperative visual dialog agents with deep reinforcement learning. In Proc. of International Conference on Computer Vision, 2017.
  • Elliott et al. (2016) Elliott, D., Frank, S., Sima’an, K., and Specia, L. Multi30k: Multilingual english-german image descriptions. In Proc. of Workshop on Vision and Language, 2016.
  • Faghri et al. (2017) Faghri, F., Fleet, D. J., Kiros, J. R., and Fidler, S. Vse++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612, 2017.
  • Fried et al. (2018) Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.-P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., and Darrell, T. Speaker-follower models for vision-and-language navigation. In Proc. of Neural Information Processing Systems, 2018.
  • Gao et al. (2019) Gao, J., Galley, M., Li, L., et al. Neural approaches to conversational ai. Foundations and Trends in Information Retrieval, 13(2-3):127–298, 2019.
  • Griffiths & Kalish (2005) Griffiths, T. L. and Kalish, M. L. A bayesian view of language evolution by iterated learning. In Proc. of the Annual Meeting of the Cognitive Science Society, 2005.
  • Guo et al. (2019) Guo, S., Ren, Y., Havrylov, S., Frank, S., Titov, I., and Smith, K. The emergence of compositional languages for numeric concepts through iterated learning in neural agents. arXiv preprint arXiv:1910.05291, 2019.
  • Gupta et al. (2019) Gupta, A., Lowe, R., Foerster, J., Kiela, D., and Pineau, J. Seeded self-play for language learning. In Proc. of Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN), 2019.
  • Havrylov & Titov (2017) Havrylov, S. and Titov, I. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In Proc. of Neural Information Processing Systems, 2017.
  • Hayes (1988) Hayes, P. J. The second naive physics manifesto. Formal theories of the common sense world, 1988.
  • He et al. (2020) He, J., Gu, J., Shen, J., and Ranzato, M. Revisiting self-training for neural sequence generation. In Proc. of International Conference on Learning Representations, 2020.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Jang et al. (2017) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. In Proc. of International Conference on Learning Representations, 2017.
  • Kalish et al. (2007) Kalish, M. L., Griffiths, T. L., and Lewandowsky, S. Iterated learning: Intergenerational knowledge transmission reveals inductive biases. Psychonomic Bulletin & Review, 14(2):288–294, 2007.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kirby (2001) Kirby, S. Spontaneous evolution of linguistic structure: an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation, 5(2):102–110, 2001.
  • Kirby (2002) Kirby, S. Natural language from artificial life. Artificial life, 8(2):185–215, 2002.
  • Kirby et al. (2014) Kirby, S., Griffiths, T., and Smith, K. Iterated learning and the evolution of language. Current opinion in neurobiology, 28:108–114, 2014.
  • Koehn et al. (2007) Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al. Moses: Open source toolkit for statistical machine translation. In Proc. of the Annual Meeting of the Association for Computational Linguistics, Companion Volume: Demo and Poster Sessions, 2007.
  • Kottur et al. (2017) Kottur, S., Moura, J. M., Lee, S., and Batra, D. Natural language does not emerge ’naturally’ in multi-agent dialog. In Proc. of Empirical Methods in Natural Language Processing, 2017.
  • Lample et al. (2018) Lample, G., Conneau, A., Denoyer, L., and Ranzato, M. Unsupervised machine translation using monolingual corpora only. Proc. of Internation Conference on Learning Representations, 2018.
  • Lazaridou et al. (2016) Lazaridou, A., Peysakhovich, A., and Baroni, M. Multi-agent cooperation and the emergence of (natural) language. In Proc. of Internation Conference on Learning Representations, 2016.
  • Lee et al. (2019) Lee, J., Cho, K., and Kiela, D. Countering language drift via visual grounding. In Proc. of Empirical Methods in Natural Language Processing, 2019.
  • Lemon & Pietquin (2012) Lemon, O. and Pietquin, O. Data-driven methods for adaptive spoken dialogue systems: Computational learning for conversational interfaces. Springer Science & Business Media, 2012.
  • Levin et al. (2000) Levin, E., Pieraccini, R., and Eckert, W. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on speech and audio processing, 8(1):11–23, 2000.
  • Lewis (1969) Lewis, D. K. Convention: A Philosophical Study. Wiley-Blackwell, 1969.
  • Lewis et al. (2017) Lewis, M., Yarats, D., Dauphin, Y., Parikh, D., and Batra, D. Deal or no deal? end-to-end learning of negotiation dialogues. In Proc. of Empirical Methods in Natural Language Processing, pp. 2443–2453, 2017.
  • Li & Bowling (2019) Li, F. and Bowling, M. Ease-of-teaching and language structure from emergent communication. In Proc. of Neural Information Processing Systems, 2019.
  • Li et al. (2016a) Li, J., Miller, A. H., Chopra, S., Ranzato, M., and Weston, J. Dialogue learning with human-in-the-loop. In Proc. of International Conference on Learning Representations, 2016a.
  • Li et al. (2016b) Li, J., Monroe, W., Ritter, A., Jurafsky, D., Galley, M., and Gao, J. Deep reinforcement learning for dialogue generation. In Proc. of Empirical Methods in Natural Language Processing, 2016b.
  • Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Proc. of European Conference on Computer Vision, 2014.
  • Maddison et al. (2017) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In Proc. of International Conference on Learning Representations, 2017.
  • Marcus et al. (1993) Marcus, M., Santorini, B., and Marcinkiewicz, M. A. Building a large annotated corpus of english: The penn treebank. 1993.
  • McCloskey & Cohen (1989) McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165. Elsevier, 1989.
  • Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
  • Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 1.8, 2019.
  • Ren et al. (2020) Ren, Y., Guo, S., Labeau, M., Cohen, S. B., and Kirby, S. Compositional languages emerge in a neural iterated learning model. In Proc. of International Conference on Learning Representations, 2020.
  • Schatzmann et al. (2006) Schatzmann, J., Weilhammer, K., Stuttle, M., and Young, S. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The Knowledge Engineering Review, 21(2):97–126, 2006.
  • Silver & Mercer (2002) Silver, D. L. and Mercer, R. E. The task rehearsal method of life-long learning: Overcoming impoverished data. In Conference of the Canadian Society for Computational Studies of Intelligence, pp. 90–101. Springer, 2002.
  • Skantze & Hjalmarsson (2010) Skantze, G. and Hjalmarsson, A. Towards incremental speech generation in dialogue systems. In Proc.of the Annual Meeting of the Special Interest Group on Discourse and Dialogue. Association for Computational Linguistics, 2010.
  • Strub et al. (2017) Strub, F., De Vries, H., Mary, J., Piot, B., Courville, A., and Pietquin, O. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proc. of International Joint Conference on Artificial Intelligence, 2017.
  • Wei et al. (2018) Wei, W., Le, Q., Dai, A., and Li, J. Airdialogue: An environment for goal-oriented dialogue research. In Proc. of Empirical Methods in Natural Language Processing, 2018.
  • Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Xie et al. (2019) Xie, Q., Hovy, E., Luong, M.-T., and Le, Q. V. Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252, 2019.
  • Yu et al. (2017) Yu, L., Tan, H., Bansal, M., and Berg, T. L. A joint speaker-listener-reinforcer model for referring expressions. In Proc. of Computer Vision and Pattern Recognition, 2017.
  • Zhu et al. (2017) Zhu, Y., Zhang, S., and Metaxas, D. Interactive reinforcement learning for object grounding via self-talking. Visually Grounded Interaction and Language Workshop, 2017.

Appendix A Complementary Theoretical Intuition for SIL and Its Limitation

We here provide a complementary intuition for Seeded Iterated Learning by referring to mathematical tools that have been used to study iterated learning dynamics in the general case. These are not rigorous proofs, but they guided the design of SIL.

One concern is whether, since natural language is not fully compositional, iterated learning may favor the emergence of a new compositional language on top of the initial one. In this spirit, Griffiths & Kalish (2005) and Kalish et al. (2007) modeled iterated learning as a Markov process and showed that vanilla iterated learning indeed converges to a language distribution that (i) is independent of the initial language distribution, and (ii) depends on the student's prior over languages before the inductive learning step.

Fortunately, Chazelle & Wang (2017) show that iterated learning can converge towards a distribution close to the initial one with high probability if the intermediate student distributions remain close enough to their teacher distributions, and if the number of training observations increases logarithmically with the number of iterations.

This theoretical result motivates one difference between our framework and classical iterated learning: as we want to preserve the pretrained language distribution, we do not initialize new students from scratch as in (Li & Bowling, 2019; Guo et al., 2019; Ren et al., 2020), because the latter approach exerts a uniform prior on the space of languages, while we would like a prior that favors natural language (e.g., languages whose token frequencies satisfy Zipf's law).

A straightforward instantiation of the above theoretical results is to initialize new students as the pretrained model. However, we empirically observe that periodically resetting the model to the initial pretrained model quickly saturates the task score. As a result, we initialize new students from the previous offspring to propagate the task score while retaining the natural language properties of the pretraining checkpoint.

However, we also point out the limitations of existing theoretical results in the context of deep learning: they assume the agent to be a perfect Bayesian learner (i.e., learning amounts to inferring the posterior distribution over hypotheses given data). We only apply standard deep learning training procedures in our setup, which might not have this property. Because of the perfect-Bayesian-learner assumption, Chazelle & Wang (2019) suggest using training sessions of increasing length. In practice, however, increasing this length may be counter-productive because of overfitting issues (especially when we have a limited number of training scenarios).

Appendix B Lewis Game

B.1 Experiment Details

In the Lewis game, the sender and receiver architectures are 2-layer MLPs with a hidden size of 200 and no activation (adding nonlinear activations leads to similar scores). During interactive learning, we use a learning rate of 1e-4 for SIL. We use a learning rate of 1e-3 for the baselines, as it provides better performance on both the language and task scores. In both cases, we use a training batch size of 100. For the teacher imitation phase, the student uses a learning rate of 1e-4.

In the Lewis game setting, we generate objects with 5 properties, where each property may take 5 values. Thus, there are 3125 objects, which we split into three datasets: the pretraining, interactive, and testing splits. The pretraining split contains only 10 objects; as soon as we provide more pretraining objects, the sender and receiver fully solve the game using the target language, which is not suitable for studying the language drift phenomenon. The interactive split contains 30 objects; this choice is arbitrary, and adding more objects gives similar results. Finally, the remaining roughly 3.1k objects are held out for evaluation.

B.2 Additional Plots

We sweep over different Gumbel temperatures to assess the impact of exploration on language drift; results for two temperatures are shown in Fig. 12 and Fig. 13. We observe that the baselines are very sensitive to the Gumbel temperature: a high temperature decreases both the language and task scores. On the other hand, Seeded Iterated Learning performs equally well at both temperatures and manages to maintain both task and language accuracy even with a high temperature.

(a) Task Score (Held-Out)
(b) Sender Language Score (Held-Out)
(c) Receiver Language Score (Held-Out)
(d) Task Score (Train)
(e) Sender Language Score (Train)
(f) Receiver Language Score (Train)
Figure 12: Complete training curves for the task and language scores in the Lewis game, comparing SIL vs. baselines on the held-out dataset (top) and the interactive training split (bottom). We observe that the three methods reach 100% accuracy on the training task score, but their scores differ on the held-out split.
(a) Task Score (Held-Out)
(b) Sender Language Score (Held-Out)
(c) Receiver Language Score (Held-Out)
(d) Task Score (Train)
(e) Sender Language Score (Train)
(f) Receiver Language Score (Train)
Figure 13: Complete training curves for the task and language scores in the Lewis game, comparing SIL vs. baselines for the other Gumbel temperature of the sweep, on the held-out dataset (top) and the interactive training split (bottom).

B.3 Tracking Language Drift with Token Accuracy

To further visualize the language drift in the Lewis game, we focus on the evolution of the probability of emitting different words when facing the same concept. Formally, we track the change of the conditional probability $P(w \mid c)$ of emitting word $w$ for a given property value $c$. The result is shown in Figure 14.

Figure 14: Change of the conditional probability $P(w \mid c)$ for a fixed property value $c$ and competing words $w$. Following pretraining, the correct word 22 starts with the highest probability. However, language drift gradually happens, and eventually word 21 replaces the correct word 22.

Appendix C Translation Game

| Method | ref len | BLEU De ↑ | BLEU En ↑ | NLL ↓ | R1% ↑ |
|---|---|---|---|---|---|
| Lee et al. (2019): Pretrained | N/A | 16.3 | 27.18 | N/A | N/A |
| Lee et al. (2019): PG | | 24.51 | 12.38 | N/A | N/A |
| Lee et al. (2019): PG+LM+G | | 28.08 | 24.75 | N/A | N/A |
| Ours: Pretrained | N/A | 15.68 | 29.39 | 2.49 | 21.9 |
| Ours: Fix Sender | N/A | 22.02 ± 0.18 | 29.39 | 2.49 | 21.9 |
| Ours: Gumbel | | 27.11 ± 0.14 | 14.5 ± 0.83 | 5.33 ± 0.39 | 9.7 ± 1.2 |
| Ours: Gumbel | | 26.94 ± 0.20 | 23.41 ± 0.50 | 5.04 ± 0.01 | 18.9 ± 0.8 |
| Ours: S2P() | | 27.43 ± 0.36 | 19.16 ± 0.63 | 4.05 ± 0.16 | 13.6 ± 0.7 |
| Ours: S2P() | | 27.35 ± 0.19 | 29.73 ± 0.15 | 2.59 ± 0.02 | 23.7 ± 0.7 |
| Ours: S2P() | | 24.64 ± 0.16 | 30.84 ± 0.07 | 2.51 ± 0.02 | 23.5 ± 0.5 |
| Ours: SIL | | 28.29 ± 0.16 | 29.4 ± 0.25 | 2.15 ± 0.12 | 21.7 ± 0.2 |
Table 2: Translation Game results. A checkmark in the "ref len" column means that the method uses the reference length to constrain its output during training/testing. ↑ means higher is better and ↓ means lower is better. Our results are averaged over 5 seeds, and the reported values are taken at the best BLEU (De) score reached during training. We here use a Gumbel temperature of 0.5.
[Figure 15 panels: (a) BLEU De (Task Score); (b) BLEU En; (c) NLL; (d) R1]
Figure 15: S2P trades off the task score against the language score, while SIL remains consistently high on both metrics.

c.1 Model Details and Hyperparameters

The model is a standard seq2seq translation model with attention (Bahdanau et al., 2015). Both the encoder and the decoder are single-layer GRUs (Cho et al., 2014) with hidden size 256, and the embedding size is 256. Dropout is applied after the embedding layers of both the encoder and the decoder. At each decoder step, we concatenate the input embedding with the attention context from the previous step.
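A minimal sketch of this decoder step; the module and variable names are ours, and the attention mechanism itself is omitted.

```python
import torch
import torch.nn as nn

EMB, HID = 256, 256

class DecoderStep(nn.Module):
    """One decoder step: the token embedding is concatenated with the
    attention context from the previous step before entering the GRU."""
    def __init__(self, vocab_size, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, EMB)
        self.drop = nn.Dropout(dropout)           # dropout after the embedding
        self.gru = nn.GRUCell(EMB + HID, HID)     # input = [embedding; prev. context]
        self.out = nn.Linear(HID, vocab_size)

    def forward(self, prev_token, prev_context, hidden):
        emb = self.drop(self.embedding(prev_token))          # (batch, EMB)
        rnn_input = torch.cat([emb, prev_context], dim=-1)   # (batch, EMB + HID)
        hidden = self.gru(rnn_input, hidden)
        return self.out(hidden), hidden
```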

Pretraining

For the Fr-En agent, we use a dropout ratio of 0.2, a batch size of 2000, and a learning rate of 3e-4, with a linear learning-rate schedule over 500k anneal steps and a minimum learning rate of 1e-5. We use the Adam optimizer (Kingma & Ba, 2014) and a gradient clipping of 0.1. For En-De, the dropout ratio is 0.3. We obtain BLEU scores of 32.17 for Fr-En and 20.2 for En-De on the IWSLT test set (Cettolo et al., 2012).
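The optimizer setup can be sketched as follows; `model` is a placeholder, the Adam beta values are left at their defaults, and we assume norm-based gradient clipping.

```python
import torch

# Adam at 3e-4, linear decay to a floor of 1e-5 over 500k steps,
# and gradient clipping at 0.1.
def make_optimizer(model, lr=3e-4, min_lr=1e-5, anneal_steps=500_000):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    def linear_decay(step):
        # Multiplicative factor so the learning rate goes from `lr` toward
        # `min_lr` linearly and then stays at the floor.
        frac = max(0.0, 1.0 - step / anneal_steps)
        return max(min_lr / lr, frac)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)
    return optimizer, scheduler

# Inside the training loop (sketch):
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```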

Finetuning

During finetuning, we use a batch size of 1024 and a learning rate of 1e-5 with no schedule. The maximum decoding length is 40 and the minimum decoding length is 3. For iterated learning, we use , and . We set the Gumbel temperature to 0.5, as in Table 2. We use greedy samples from the teacher speaker for imitation.
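A sketch of one speaker-imitation update with greedy teacher decoding; `teacher.greedy_decode` and `student.logits` are hypothetical interfaces standing in for the actual models.

```python
import torch
import torch.nn.functional as F

def imitation_step(student, teacher, src_batch, optimizer):
    """The teacher decodes greedily, and the freshly seeded student is trained
    with cross-entropy to reproduce that output."""
    with torch.no_grad():
        targets = teacher.greedy_decode(src_batch)      # greedy teacher samples
    logits = student.logits(src_batch, targets)         # (batch, time, vocab)
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                           ignore_index=0)              # assume 0 = padding id
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```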

Appendix D S2P, Reward Shaping and KL Minimization

We find that multiple baselines for countering language drift can be summarized under the framework of KL minimization. Let $p_\theta$ denote the distribution of our model and $p_0$ the reference (pretrained) model. To prevent $p_\theta$ from drifting, we minimize either $\mathrm{KL}(p_\theta \,\|\, p_0)$ or $\mathrm{KL}(p_0 \,\|\, p_\theta)$ in addition to the normal interactive training loss. We show that $\mathrm{KL}(p_\theta \,\|\, p_0)$ is related to the reward shaping of Lee et al. (2019) and that $\mathrm{KL}(p_0 \,\|\, p_\theta)$ is related to S2P (Gupta et al., 2019).

One finds that

$$\mathrm{KL}(p_0 \,\|\, p_\theta) = \mathbb{E}_{x \sim p_0}\left[\log p_0(x) - \log p_\theta(x)\right] = -H(p_0) - \mathbb{E}_{x \sim p_0}\left[\log p_\theta(x)\right].$$

Since the first term does not depend on $\theta$, minimizing this divergence is equivalent to maximum-likelihood training on samples from $p_0$; S2P is thus recovered if we let $p_0$ be the underlying data distribution. In the same spirit, one finds that

$$\mathrm{KL}(p_\theta \,\|\, p_0) = \mathbb{E}_{x \sim p_\theta}\left[\log p_\theta(x) - \log p_0(x)\right] = -H(p_\theta) - \mathbb{E}_{x \sim p_\theta}\left[\log p_0(x)\right].$$

Minimizing the first term amounts to maximizing the entropy of $p_\theta$ (an entropy-regularization term), while minimizing the second term amounts to maximizing the reward $\log p_0(x)$. We implement this baseline by using the same Gumbel-Softmax trick to optimize the $\mathrm{KL}(p_\theta \,\|\, p_0)$ term, where $p_0$ is the language model pretrained on MSCOCO captions. The training loss is thus $\mathcal{L} = \mathcal{L}_{\text{interactive}} + \alpha\,\mathrm{KL}(p_\theta \,\|\, p_0)$ for some weighting coefficient $\alpha$. We only report results for a single value of $\alpha$ here; other values of $\alpha$ do not yield better results.
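A sketch of how such a KL penalty can be estimated from Gumbel-Softmax samples; `lm_log_probs` is a placeholder for the frozen pretrained language model, and the coefficient $\alpha$ is applied outside the function.

```python
import torch
import torch.nn.functional as F

# Monte-Carlo estimate of KL(p_theta || p_0): sample sentences from the
# finetuned speaker with straight-through Gumbel-Softmax and score them under
# both the speaker and the frozen language model.

def kl_penalty(speaker_logits, lm_log_probs, tau=0.5):
    # speaker_logits: (B, T, V) per-step vocabulary logits of the speaker.
    sample = F.gumbel_softmax(speaker_logits, tau=tau, hard=True)            # (B, T, V)
    log_p_theta = (sample * F.log_softmax(speaker_logits, dim=-1)).sum(-1)   # (B, T)
    log_p0 = (sample * lm_log_probs(sample)).sum(-1)                         # (B, T)
    # Sum over time, average over the batch.
    return (log_p_theta - log_p0).sum(dim=-1).mean()

# total_loss = interactive_loss + alpha * kl_penalty(speaker_logits, lm_log_probs)
```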

The result can be found in Figure 16. Since $\mathrm{KL}(p_\theta \,\|\, p_0)$ decomposes into a reward-shaping term and an entropy-maximization term, we also compare against an extra baseline, RwdShaping, which removes the entropy term, since encouraging exploration would make the drift worse. We find that the KL baseline is even worse than the Gumbel baseline on both the task score and the language score, mainly because of its emphasis on the entropy-maximization term. Once that term is removed, RwdShaping outperforms Gumbel on both the task score and the language score, but it still drifts more than SIL.

[Figure 16 panels: (a) BLEU De (Task Score); (b) BLEU En; (c) NLL; (d) R1]
Figure 16: Comparison between SIL and the different KL baselines.

d.1 Mixing Pretraining Data in SIL Pipeline

We also tried mixing pretraining data with teacher-generated data in the speaker-imitation phase, controlled by a probability $p_{\text{mix}}$. At each speaker-imitation step, the student performs a supervised update on pretraining labeled data from IWSLT with probability $p_{\text{mix}}$, and on teacher-generated data with probability $1 - p_{\text{mix}}$. When $p_{\text{mix}} = 0$, we recover SIL. To our surprise, mixing the datasets does not yield any extra benefit: a larger $p_{\text{mix}}$ yields worse task performance as well as a worse language score.
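A sketch of this mixed imitation phase; the step functions, data iterators, and the default mixing probability are placeholders.

```python
import random

def mixed_imitation_phase(supervised_step, imitation_step,
                          iwslt_batches, interaction_batches,
                          n_steps, p_mix=0.25):
    """With probability p_mix the student takes a supervised step on IWSLT
    pretraining pairs; otherwise it imitates teacher-generated targets.
    p_mix = 0 recovers plain SIL."""
    for _ in range(n_steps):
        if random.random() < p_mix:
            supervised_step(next(iwslt_batches))        # human-labelled data
        else:
            imitation_step(next(interaction_batches))   # teacher-generated data
```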

[Figure 17 panels: (a) BLEU De (Task Score); (b) BLEU En; (c) NLL; (d) R1]
Figure 17: Mixing pretraining data into SIL.

d.2 Language Model and Image Ranker Details

Our language model is a single-layer LSTM (Hochreiter & Schmidhuber, 1997) with hidden size 512 and embedding size 512, trained with Adam at a learning rate of 3e-4, a batch size of 256, and a linear schedule with 30k anneal steps. The language model is trained on captions from MSCOCO (Lin et al., 2014). For the image ranker, we use a pretrained ResNet-152 (He et al., 2016) to extract image features and a GRU (Cho et al., 2014) with hidden size 1024 and embedding size 300 to encode captions. The ranker is trained with the VSE loss (Faghri et al., 2017) using a batch size of 256 and Adam (Kingma & Ba, 2014) with a learning rate of 3e-4 and a schedule with 3000 anneal steps.
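For reference, a VSE-style hinge ranking loss over a batch of image and caption embeddings can be sketched as follows; whether the sum or the max over negatives is used, and the margin value, are assumptions rather than our exact configuration.

```python
import torch

def vse_ranking_loss(img_emb, cap_emb, margin=0.2, max_violation=True):
    """Hinge-based image-caption ranking loss over a batch of L2-normalised
    embeddings: matching pairs sit on the diagonal of the score matrix."""
    scores = img_emb @ cap_emb.t()                            # (B, B) similarities
    diagonal = scores.diag().view(-1, 1)
    cost_cap = (margin + scores - diagonal).clamp(min=0)      # image -> wrong caption
    cost_img = (margin + scores - diagonal.t()).clamp(min=0)  # caption -> wrong image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    if max_violation:                                         # hardest negatives (VSE++)
        return cost_cap.max(1)[0].sum() + cost_img.max(0)[0].sum()
    return cost_cap.sum() + cost_img.sum()
```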

d.3 Data Preprocessing

We use Moses to tokenize the text (Koehn et al., 2007) and learn a byte-pair encoding on Multi30k (Elliott et al., 2016) over all languages. We then apply the same BPE to every dataset. Our vocabulary sizes for En, Fr, and De are 11552, 13331, and 12124, respectively.
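A sketch of this pipeline using the sacremoses and subword-nmt Python packages; our implementation may rely on the original Moses and BPE scripts instead, and the file names and number of merge operations below are placeholders.

```python
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Moses-style tokenization.
mt = MosesTokenizer(lang="en")
print(mt.tokenize("A man drives an old-fashioned red race car.", return_str=True))

# Learn BPE merges on the concatenated Multi30k training text (all languages)...
with open("multi30k.all.tok") as infile, open("bpe.codes", "w") as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# ...and apply the same codes to every dataset and language.
with open("bpe.codes") as codes:
    bpe = BPE(codes)
print(bpe.process_line("a man drives an old-fashioned red race car ."))
```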

d.4 Language Statistics

[Figure 18 panels: (a) POS tag distribution; (b) Word frequency analysis; (c) Difference of log word frequency]
Figure 18: Language statistics on samples from the different methods.

We here compute several linguistic statistics on the generated samples to assess language quality.

POS Tag Distribution

We compute the part-of-speech tag (POS tag; Marcus et al., 1993) distribution by counting the frequency of each POS tag and normalizing the counts (a computation sketch is given below). The POS tags are sorted according to their frequencies in the reference, and we pick the 11 most common tags for visualization, which include:

  • NN Noun, singular or mass

  • DT Determiner

  • IN Preposition or subordinating conjunction

  • JJ Adjective

  • VBG Verb, gerund or present participle

  • NNS Noun, plural

  • VBZ Verb, 3rd person singular present

  • CC Coordinating conjunction

  • CD Cardinal number

The results are shown in Figure 18(a). The peak on "period" shows that Gumbel has a tendency to repeat periods at the end of sentences. However, we observe that both S2P and SIL stay much closer to the reference distribution.
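The sketch referenced above, assuming NLTK's Penn Treebank tagger (the tagger we actually used may differ).

```python
import collections
import nltk

# Requires the 'punkt' and 'averaged_perceptron_tagger' NLTK resources.

def pos_distribution(sentences, top_k=11):
    """Count Penn Treebank POS tags over a list of sentences and return the
    normalized frequencies of the top_k most common tags."""
    counts = collections.Counter()
    for sent in sentences:
        tokens = nltk.word_tokenize(sent)
        counts.update(tag for _, tag in nltk.pos_tag(tokens))
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common(top_k)}

print(pos_distribution(["a man drives an old-fashioned red race car ."]))
```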

Word Frequency

For each set of generated samples, we sort the words by frequency and plot the log of frequency against the log of rank. We set a minimum frequency of 50 to exclude long-tail results. The result is shown in Figure 18(b).
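A sketch of the underlying computation (plotting omitted).

```python
import collections
import math

def zipf_points(sentences, min_freq=50):
    """Return (log rank, log frequency) pairs for words whose frequency is at
    least min_freq, as plotted in Figure 18(b)."""
    counts = collections.Counter(w for s in sentences for w in s.split())
    freqs = sorted((n for n in counts.values() if n >= min_freq), reverse=True)
    return [(math.log(rank + 1), math.log(freq)) for rank, freq in enumerate(freqs)]
```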

Word Frequency Difference

To further visualize the difference between the generated samples and the reference, we plot the difference of their log word frequencies in Figure 18(c).

d.5 Samples

We list more samples from the Multi30k dataset for the different methods, i.e., Pretrain, Gumbel, S2P, and SIL. The Gumbel temperature is set to 0.5. The complete samples can be found in our code.

ref : a female playing a song on her violin .
Pretrain: a woman playing a piece on her violin .
Gumbel : a woman playing a piece on his violin . . . . . . . . . . . . .
S2P : a woman playing a piece on his violin .
SIL : a woman playing a piece on his violin .

ref : a cute baby is smiling at another child .
Pretrain: a nice baby smiles at another child .
Gumbel : a nice baby smiles of another child . . . . . . . . . .
S2P : a nice baby smiles at another child .
SIL : a beautiful baby smiles smiles at another child .

ref : a man drives an old-fashioned red race car .
Pretrain: a man conducted an old race car .
Gumbel : a man drives a old race of red race . . . .
S2P : a man drives an old of the red race .
SIL : a man drives a old race of the red race .

ref : a man in a harness climbing a rock wall
Pretrain: a man named after a rock man .
Gumbel : a man thththththththdeacdeaacc. of th. . . . . . .
S2P : a man ’s being a kind of a kind of a kind .
SIL : a man that the datawall of the datad.

ref : a man and woman fishing at the beach .
Pretrain: a man and a woman is a woman .
Gumbel : a man and a woman thaccbeach the beach . . . . . . . . . .
S2P : a man and a woman is in the beach .
SIL : a man and a woman that ’s going to the beach .

ref : a man cooking burgers on a black grill .
Pretrain: a man making the meets on a black slick of a black slick .
Gumbel : a man doing it of on a black barbecue . . . . . . . . . . . . . . . .
S2P : a man doing the kind on a black barbecue .
SIL : a man doing the datadon a black barbecue .

ref : little boy in cami crawling on brown floor
Pretrain: a little boy in combination with brown soil .
Gumbel : a small boy combincombinaccon a brown floor . . . brown . . . . . . . . .
S2P : a small boy combining the kind of brown floor .
SIL : a small boy in the combination of on a brown floor .

ref : dog in plants crouches to look at camera .
Pretrain: a dog in the middle of plants are coming to look at the goal .
Gumbel : a dog in middle of of of of thlooking at looking at objeobje. . . . . . . . . . . . . . . . . . .
S2P : a dog in the middle of the plants to watch objective .
SIL : a dog at the middle of plants are going to look at the objective .

ref : men wearing blue uniforms sit on a bus .
Pretrain: men wearing black uniforms are sitting in a bus .
Gumbel : men wearing blue uniforms sitting in a bus . . . . . . .
S2P : men wearing blue uniforms sitting in a bus .
SIL : men wearing blue uniforms are sitting in a bus .

ref : a group of scottish officers doing a demonstration .
Pretrain: a scottish officers group is doing a demonstration .
Gumbel : a group of officers scottish doing a dedemonstration . . . .
S2P : a group of officers scottish doing a demonstration .
SIL : a group of officers scottish doing a demo .

ref : the brown dog is wearing a black collar .
Pretrain: the brown dog is wearing a black collar .
Gumbel : the brown dog carries a black collar . . . . . . .
S2P : the brown dog carries a black collar .
SIL : the brown dog is wearing a black collar .

ref : twp children dig holes in the dirt .
Pretrain: two children are going to dig holes in the earth .
Gumbel : two children dig holes in the planplanplanplan. . . . . . . .
S2P : two children are going holes in the dirt .
SIL : two children dig holes in the earth .

ref : the skiers are in front of the lodge .
Pretrain: the health are in front of the bed .
Gumbel : the ththare ahead the thth. . . . . . .
S2P : the health are front of the whole .
SIL : the dataare are ahead of the datad.

ref : a seated man is working with his hands .
Pretrain: a man sitting working with his hands .
Gumbel : a man sitting working with his hands . . . . . . . . .
S2P : a man sitting working with his hands .
SIL : a man sitting working with its hands .

ref : a young girl is swimming in a pool .
Pretrain: a girl swimming in a swimming pool .
Gumbel : a young girl swimming in a pool . . . . . . . . . .
S2P : a young girl swimming in a pool .
SIL : a young girl swimming in a pool .

ref : a small blond girl is holding a sandwich .
Pretrain: a little girl who is a sandwich .
Gumbel : a yedegirl holding a sandwich . . . .
S2P : a small 1girl holding a sandwich .
SIL : a small 1girl holding a sandwich .

ref : two women look out at many houses below .
Pretrain: two women are looking at many of the houses in the computer .
Gumbel : two women looking many of many houses in itdeacede. . . . . . . .
S2P : two women looking at many houses in the kind .
SIL : two women looking at many houses in the data.

ref : a person is hang gliding in the ocean .
Pretrain: ( wind up instead of making a little bit of the board ) a person who is the board of the sailing .
Gumbel : ( cdthinplace of acacc) a person does thacthof th-acin the ocean . . . . . . . . . . . . . . . .
S2P : ( wind ’s instead of a kind ) a person does the kind in the ocean .
SIL : ( datadinstead of the input of the clinability ) a person does the board in the ocean .

ref : a man in a green jacket is smiling .
Pretrain: a green man in the green man .
Gumbel : a man jacket green smiles . . . . . . . . . . . .
S2P : a man in jacket green smiles .
SIL : a man in the green jacket smiles .

ref : a young girl standing in a grassy field .
Pretrain: a girl standing in a meadow .
Gumbel : a young girl standing in a gmeadow . . . . . . . . .
S2P : a young girl standing in a meadow .
SIL : a young girl standing in a meadow .