1 Introduction
Mastering the ability to communicate with humans is a fundamental goal of AI. Our interaction with machines is crucial for any future application of intelligent agents in our daily lives.
There are various ways for a model to determine what to say next in a conversation, though these methods can be distilled into two main approaches: generative models, which generate a sequence of text, and retrieval/ranking models, which rank candidates among a fixed set and select the optimal next utterance. We focus on the latter approach in this work, as it remains superior to the former in terms of engagingness Shuster et al. (2018); Zhang et al. (2018a), and allows for more control over the possible outcomes.
Recently, substantial improvements to state-of-the-art benchmarks on a variety of language understanding tasks have been achieved through the use of deep pretrained language models Devlin et al. (2018); more generally, researchers have shown that by simply fine-tuning these large pretrained models, one can obtain performance gains on a number of language-related tasks. Specifically, we use the BERT models from Devlin et al. (2018), which have been pretrained on Wikipedia and the Toronto Books Corpus Zhu et al. (2015b). In our work, we additionally explore the pretraining of these large transformers using a different dataset that is more related to dialogue. We find that such pretraining yields better performance on the dialogue tasks that we have chosen to focus on.
When designing systems that will actually communicate with humans, it is of paramount importance that models be able to perform accurate real-time inference. Achieving higher accuracy, however, generally comes at a cost in computational complexity. Cross-encoders, which perform full self-attention over a given context and response candidate, tend to attain much higher accuracies than their counterparts, Bi-encoders, which perform self-attention over the context and response separately and combine them at the end into a final representation. Because they encode the context and response separately, Bi-encoders are able to cache the encoded responses and reuse these representations for each given context. Unfortunately, Cross-encoders must recompute the context-response encoding for each context; as a result, we find that Cross-encoders are prohibitively slow in a real-time setting. To resolve this, we introduce Poly-encoders, which add an additional attention mechanism for better use of the response candidate representation when choosing the optimal next utterance. We show that these Poly-encoders outperform Bi-encoders at little to no cost in computation time, yielding an architecture that is suitable for real-time inference while being more accurate than the architectures commonly used in the literature today.
A variety of datasets exist that have been used to train dialogue-based models, including those curated from social media platforms Ritter et al. (2011); Mazaré et al. (2018), scraped from scripts or chatlogs Tiedemann (2012); Lowe et al. (2015), or deliberately collected via crowdsourcing Zhang et al. (2018a); Dinan et al. (2019b). For this work, we focus on two recent competitions with dialogue datasets: the Conversational Intelligence Challenge 2 (ConvAI2) Dinan et al. (2019a), and the Dialog System Technology Challenge 7 (DSTC7; http://workshop.colips.org/dstc7/index.html).
2 Related Work
The task of scoring candidate labels given an input context is a classical problem in machine learning. While multi-class classification is a special case, the more general task involves candidates as structured objects rather than discrete classes; in our work we consider the inputs and the candidate labels to be sequences of text.
There is a broad class of models that map the input and a candidate label into a common feature space, wherein typically a dot product, cosine similarity, or (parameterized) non-linearity is used to measure their similarity. We refer to these models as Bi-encoders. Such methods include vector space models Salton et al. (1975), LSI Deerwester et al. (1990), supervised embeddings Bai et al. (2009) and classical siamese networks Bromley et al. (1994). For the next utterance prediction tasks we consider in this work, several Bi-encoder neural approaches have been considered, in particular Memory Networks Zhang et al. (2018a) and Transformer Memory Networks Dinan et al. (2019b), as well as LSTMs Lowe et al. (2015) and CNNs Kadlec et al. (2015), all of which encode input and label separately. A major advantage of Bi-encoder methods is their ability to cache the representations of a large, fixed candidate set. Since the candidate encodings are independent of the input, Bi-encoders are very efficient during evaluation.

Researchers have also studied a richer class of models, which we refer to as Cross-encoders, that make no assumptions about the similarity scoring function between input and label. Instead, the concatenation of the input and a candidate serves as a new input to a nonlinear function that scores their match based on arbitrary dependencies between them. This has been explored with Sequential Matching Network CNN-based architectures Wu et al. (2016), Deep Matching Networks Yang et al. (2018), Gated Self-Attention Zhang et al. (2018b), and most recently transformers Wolf et al. (2019); Vig and Ramea (2019); Urbanek et al. (2019). For the latter, concatenating the two sequences of text results in self-attention being applied at every layer. This yields rich interactions between the input context and the candidate, as every word in the candidate label can attend to every word in the input context, and vice-versa. The previous approach closest to ours is that of Urbanek et al. (2019), who also fine-tune a pretrained BERT model as both a Bi-encoder and a Cross-encoder to train a dialogue agent within a game-like world. They find that, thanks to a much deeper level of interaction between context and candidate, Cross-encoders tend to outperform Bi-encoders, which may lose information due to the feature map bottleneck. However, the performance gains come at a steep computational cost: Cross-encoder representations are much slower to compute, which is often prohibitive when the number of candidates is large.
3 Tasks
We consider the setting of sentence selection in dialogue, a task extensively studied and recently featured in two competitions: the NeurIPS ConvAI2 competition, and the DSTC7 challenge, Track 1. This task involves selecting the next sentence in a dialogue given the dialogue context/history. To measure the success of a model, the task provides a set of candidate utterances for each test example; the model ranks the utterances according to its prediction of the best fit for the next utterance, and from these rankings automatic ranking metrics can be computed.
The ConvAI2 task is based on the Persona-Chat dataset Zhang et al. (2018a), which involves dialogues between pairs of speakers. Each speaker is given a persona, which is a few sentences that describe a character they will imitate, e.g. ‘I love romantic movies’ or ‘I work in the catering industry.’ The speakers are then instructed to simply chat to get to know each other. Models that performed well in the competition conditioned their chosen responses on both the dialogue history and the persona sentences. The sentence selection task involved picking the correct annotated utterance from a set of 20 choices, where the remaining 19 were other randomly chosen utterances from the evaluation set. The best performing competitor in this task achieved 80.7% accuracy on the test set, utilizing a pretrained Transformer Radford et al. (2018) fine-tuned for this task Wolf et al. (2019).
The DSTC7 challenge we focus on is the Track 1 sentence selection task that uses the Ubuntu corpus Lowe et al. (2015). The corpus consists of two-person conversations extracted from Ubuntu chat logs, where one partner receives technical support for various Ubuntu-related problems from the other. The best performing competitor in this task achieved 64.5% R@1 Chen and Wang (2019). We summarize these two datasets and their statistics in Table 1.
| | ConvAI2 | DSTC7 |
|---|---|---|
| Train Exs. | 131,438 | 100,000 |
| Valid Exs. | 7,801 | 10,000 |
| Test Exs. | 6,634 | 5,000 |
| Candidates per Ex. | 20 | 100 |
4 Methods
In this section we describe the various models and methods that we explored.
4.1 Transformers and BERT
BERT model architecture
Our Bi-, Cross-, and Poly-encoders, described in sections 4.2, 4.3 and 4.4 respectively, are based on large pretrained transformer models. The weights of our models are initialized from those of the transformer model in Devlin et al. (2018), which was trained on a dataset combining sentences from Wikipedia and the Toronto Books Corpus. The specific model that we use, denoted BERT-base, has 12 layers, 12 attention heads, and a hidden size of 768.
In section 4.5, we describe a different pretraining scheme in which we use the same architecture as BERT-base. However, we instead train the transformer on a dataset of 800 million sentences derived from the online platform Reddit Mazaré et al. (2018), with the same process as BERT. For training we used the open-source implementation of XLM Lample and Conneau (2019).
Input representation
Each input token is represented as the sum of three embeddings: the token embedding, the position (in the sequence) embedding, and the segment embedding. The segment embedding arises from Devlin et al. (2018), in which the segment refers to the sentence to which the token belongs. If the input sequence is a single sentence, the segment input is 0. If the input sequence is the concatenation of two sentences (e.g. [QUESTION ANSWER]), the segment inputs of the first sentence's tokens are 0 and the segment inputs of the second sentence's tokens are 1.
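As an illustration of this input representation, the following is a minimal PyTorch sketch that sums the three embeddings; the sizes follow BERT-base, while the module and argument names are our own and not the exact implementation.

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Sums token, position, and segment embeddings for each input token.

    Minimal sketch: the vocabulary size is approximate and the names are
    illustrative only; hidden size and max length follow BERT-base.
    """
    def __init__(self, vocab_size=30522, max_len=512, n_segments=2, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_len, hidden)
        self.segment = nn.Embedding(n_segments, hidden)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        return (self.token(token_ids)
                + self.position(positions)
                + self.segment(segment_ids))
```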
Pretraining Procedure
The pretraining loss is the sum of a masked language model (MLM) loss and a next-sentence prediction loss. The MLM loss is chosen over a traditional language model loss as it allows for the training of bidirectional attention, and is computed as follows: 15% of the tokens are randomly selected and are either replaced by a [MASK] token (80% of the time), replaced by a random token (10% of the time) or kept unchanged (10% of the time). The masked sentence is encoded by the transformer, and the final hidden vectors corresponding to the masked tokens are fed into a linear layer and softmax function to predict the probability of the original token over the full vocabulary. The loss is a standard cross entropy loss.
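The masking procedure above can be sketched as follows; the `mask_token_id` and `vocab_size` arguments are placeholders, and this is an illustration rather than the exact implementation.

```python
import torch

def mask_tokens(token_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Randomly select 15% of tokens; of those, 80% become [MASK],
    10% become a random token, and 10% are left unchanged.
    Returns the corrupted inputs and the labels (-100 = not predicted)."""
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape, device=token_ids.device) < mlm_prob
    labels[~selected] = -100  # loss is only computed on selected positions

    corrupted = token_ids.clone()
    # 80% of selected tokens -> [MASK]
    replace_mask = selected & (torch.rand(token_ids.shape, device=token_ids.device) < 0.8)
    corrupted[replace_mask] = mask_token_id
    # half of the remaining selected tokens (10% overall) -> random token
    random_mask = selected & ~replace_mask & (
        torch.rand(token_ids.shape, device=token_ids.device) < 0.5)
    corrupted[random_mask] = torch.randint(
        vocab_size, token_ids.shape, device=token_ids.device)[random_mask]
    # the rest (10% overall) stay unchanged
    return corrupted, labels
```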
In the next-sentence prediction task, the input sequence is the concatenation of sentence A and sentence B. 50% of the time A and B are two consecutive sentences in the dataset, and 50% of the time they are randomly picked. This pair of sentences is encoded through the transformer, and the hidden state corresponding to the [CLS] token is fed into a linear layer to predict if A and B are consecutive sentences. The loss is a binary cross entropy loss.
The pretraining procedure uses the Adam optimizer with a learning rate of 1e-4, $\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay of 0.01, linear learning rate warmup, and linear decay of the learning rate.
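A sketch of this optimization setup; we use PyTorch's AdamW as an approximation of Adam with L2 weight decay, and the warmup and total step counts are placeholders rather than the values used in pretraining.

```python
import torch

def build_bert_optimizer(model, total_steps, warmup_steps=10000):
    # Adam-style optimizer: lr 1e-4, betas (0.9, 0.999), weight decay 0.01
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4,
                            betas=(0.9, 0.999), weight_decay=0.01)

    def lr_lambda(step):
        # linear warmup followed by linear decay to zero
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```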
4.2 Bi-encoders
The Bi-encoder allows for quick, real-time inference, as the candidate representations can be cached. In this setting, both the context and the candidate are encoded into vectors:

$y_{ctxt} = red(T_1(ctxt)) \qquad y_{cand} = red(T_2(cand))$

where $T_1$ and $T_2$ are two transformers that have been pre-trained following the procedure described in 4.1, and $red(\cdot)$ is a function that reduces the sequence of vectors produced by the transformers into one vector. That is, suppose $T(x) = (h_1, \dots, h_N)$ is the output of a transformer T. When using BERT, both context and candidates are surrounded by the special tokens [CLS] and [SEP], and therefore $h_1$ corresponds to [CLS]. We considered two ways of reducing the output into one representation:
- Choose the first output of the transformer (corresponding to the special token [CLS]).
- Compute the average over all outputs.
Scoring
The score of a candidate $cand_i$ is given by the dot-product $s(ctxt, cand_i) = y_{ctxt} \cdot y_{cand_i}$.

The network is trained to minimize a cross-entropy loss in which the logits are $y_{ctxt} \cdot y_{cand_1}, \dots, y_{ctxt} \cdot y_{cand_n}$, where $cand_1$ is the correct label and the others are chosen from the train set. Similar to what is done in Mazaré et al. (2018), during training we consider the other elements of the batch as negatives. This allows for much faster training, as we can reuse the embeddings computed for each candidate, and also use a larger batch size; e.g., in our experiments on ConvAI2, we were able to use batches of 512 elements.

Evaluation speed
Within the context of a retrieval system, a Bi-encoder allows for the precomputation of the embeddings of all possible candidates of the system. After the context embedding $y_{ctxt}$ is computed, the only remaining operation is a dot product between $y_{ctxt}$ and every candidate embedding, which can scale to millions of candidates on a modern GPU, and potentially billions using nearest-neighbor libraries such as FAISS Johnson et al. (2017).
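The following is a minimal sketch (not the exact implementation) of Bi-encoder scoring and in-batch-negative training, assuming the context and candidate embeddings have already been produced by the two encoders and reduced to single vectors; all function names are illustrative.

```python
import torch
import torch.nn.functional as F

def score_candidates(y_ctxt, y_cands):
    """Dot-product scores between contexts and cached candidate embeddings.

    y_ctxt:  (batch, hidden)   one embedding per context
    y_cands: (n_cands, hidden) precomputed candidate embeddings
    returns: (batch, n_cands)  scores
    """
    return y_ctxt @ y_cands.t()

def in_batch_negative_loss(y_ctxt, y_cand):
    """Cross-entropy loss where, for each context i, candidate i is the
    positive and the other candidates in the batch serve as negatives.

    y_ctxt, y_cand: (batch, hidden); row i of y_cand is the gold response
    for row i of y_ctxt.
    """
    logits = y_ctxt @ y_cand.t()                        # (batch, batch)
    targets = torch.arange(y_ctxt.size(0), device=y_ctxt.device)
    return F.cross_entropy(logits, targets)
```

At inference time the candidate embeddings are computed once, cached, and reused for every incoming context, so scoring reduces to a single matrix product (or an approximate nearest-neighbor lookup for very large candidate sets).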
4.3 Cross-encoder
The Cross-encoder allows for rich interactions between the context and candidate, as they are jointly encoded to obtain a final representation. In this setting, the context and candidate are surrounded by the special tokens [CLS] and [SEP], concatenated into a single sequence, and encoded using one transformer. We consider the first output of the transformer as the context-candidate embedding:

$y_{ctxt,cand} = h_1 = first(T(ctxt, cand))$

where $first(\cdot)$ is the function that takes the first vector of the sequence of vectors produced by the transformer. By using a single transformer, the Cross-encoder is able to perform self-attention between the context and candidate, resulting in a richer extraction of information.
Scoring
To score one candidate, a linear layer $W$ is applied to the embedding $y_{ctxt,cand_i}$ to reduce it from a vector to a scalar: $s(ctxt, cand_i) = y_{ctxt,cand_i} W$.

Similarly to what is done for the Bi-encoder, the network is trained to minimize a cross-entropy loss where the logits are $s(ctxt, cand_1), \dots, s(ctxt, cand_n)$, where $cand_1$ is the correct candidate and the others are negatives taken from the training set. Unlike in the Bi-encoder, we cannot recycle the other labels of the batch as negatives, so we use external negatives from the training set. The Cross-encoder uses much more memory than the Bi-encoder, resulting in a much smaller batch size.
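A minimal sketch of the Cross-encoder scoring head, assuming a `transformer` module that returns per-token hidden states for the concatenated [CLS] context [SEP] candidate [SEP] sequence; the class and argument names are illustrative, not the exact implementation.

```python
import torch
import torch.nn as nn

class CrossEncoderScorer(nn.Module):
    """Scores a (context, candidate) pair by jointly encoding the
    concatenated sequence and projecting the first ([CLS]) output to a
    scalar. `transformer` is any encoder returning (batch, seq, hidden)."""
    def __init__(self, transformer, hidden=768):
        super().__init__()
        self.transformer = transformer
        self.linear = nn.Linear(hidden, 1)

    def forward(self, concat_token_ids, segment_ids):
        # concat_token_ids: (batch, seq_len) = [CLS] context [SEP] candidate [SEP]
        h = self.transformer(concat_token_ids, segment_ids)  # (batch, seq, hidden)
        first = h[:, 0]                                       # [CLS] output
        return self.linear(first).squeeze(-1)                 # (batch,) scores
```

Because every candidate must be re-encoded together with the context, scoring n candidates requires n full forward passes; this is the bottleneck discussed in Section 5.4.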
Evaluation speed
The Cross-encoder does not allow for precomputation of the candidate embeddings. At inference time, every candidate must be concatenated with the context and must go through a forward pass of the entire model. Thus, this method cannot scale to a large amount of candidates. We discuss this bottleneck further in Section 5.4.
4.4 Poly-encoders

The Poly-encoder attempts to obtain the best of both worlds from the Bi-encoder and the Cross-encoder:
- The candidates are represented by one vector as in the Bi-encoder, which allows for caching and fast inference.
- The context is jointly encoded with the candidate, as in the Cross-encoder, allowing more information to be extracted.
More generally, the Poly-encoder can be interpreted as an extension of the Bi-encoder. That is, we still use two separate transformer encoders for the context and the candidate, and the candidate is still encoded into a single vector $y_{cand}$. As such, the Poly-encoder method can be implemented using a precomputed cache of encoded responses, allowing its usage in a production setup. However, we represent the context with several vectors $(y^1_{ctxt}, \dots, y^m_{ctxt})$ instead of just one. To reduce these m vectors of context to a final representation, we use an attention layer with $y_{cand}$ as the query:

$y_{ctxt} = \sum_i w_i \, y^i_{ctxt}$

where:

$(w_1, \dots, w_m) = \mathrm{softmax}(y_{cand} \cdot y^1_{ctxt}, \dots, y_{cand} \cdot y^m_{ctxt})$

An important question is how to obtain these m vectors; we describe our process below.
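Before turning to how the m context vectors are obtained, the attention step itself can be sketched as follows. This is a minimal sketch assuming the m context vectors and the candidate embedding are already computed; the final dot-product scoring follows the Bi-encoder convention and the names are illustrative.

```python
import torch

def poly_attention(ctxt_vectors, y_cand):
    """Reduce m context vectors to one, attending with the candidate as query.

    ctxt_vectors: (batch, m, hidden)
    y_cand:       (batch, hidden)
    returns:      (batch, hidden) final context representation y_ctxt
    """
    # attention weights: softmax over the m dot products y_cand . y^i_ctxt
    weights = torch.softmax(
        torch.einsum("bh,bmh->bm", y_cand, ctxt_vectors), dim=-1)  # (batch, m)
    # weighted sum of the m context vectors
    return torch.einsum("bm,bmh->bh", weights, ctxt_vectors)

def poly_score(ctxt_vectors, y_cand):
    """Score: dot product between the attended context and the candidate
    (assumed here to mirror the Bi-encoder scoring)."""
    y_ctxt = poly_attention(ctxt_vectors, y_cand)
    return (y_ctxt * y_cand).sum(dim=-1)  # (batch,)
```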
A simple way to encode the context as m different word vectors is to consider the first m outputs of the context encoder (see Figure 1 for more details). Here m is a chosen hyperparameter, and the immediate drawback is that m cannot exceed the number of tokens in the context. However, the encoder is followed by an attention layer which is flexible in the number of inputs; therefore, whenever the length of the context is below m, we simply consider all of its outputs. Note that in this setting the model must be able to dedicate a different role to each of those outputs during fine-tuning. This was a motivation to take the first m outputs, in order to best leverage the position embeddings provided to the encoder.

4.5 Domain-specific Pretraining
In addition to using the pretrained transformers from Devlin et al. (2018), which were pretrained on Wikipedia and the Toronto Books Corpus Zhu et al. (2015a), we explore our own pretraining scheme, in which we use a dataset more adapted to dialogue. Specifically, we pretrain a transformer from scratch on 800 million comments from Reddit, while using the same transformer architecture as BERT-base: 12 layers, 12 attention heads, and a hidden size of 768.

The vocabulary used is slightly different from BERT's: it is computed using BPE trained on lower-cased Wikipedia, the Toronto Books Corpus, and Open Subtitles Lison and Tiedemann (2016) with 30k merges. The resulting dictionary has 54,940 terms, with slightly different special tokens.
Our input is the concatenation of context and candidate, where both are surrounded with the special token [S], following Lample and Conneau (2019). The context is the concatenation of the utterances in the dialogue history separated by a special [NEWLINE] token. As in Devlin et al. (2018), we add segment embeddings to each token input; i.e., we add segment 0 embedding for the context and segment 1 for the candidate.
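A minimal sketch of how such an input could be assembled; whitespace tokenization stands in for the BPE tokenizer, and the function name is illustrative rather than the exact implementation.

```python
def build_input(history, candidate):
    """Concatenate the dialogue history and a candidate into one sequence.

    history:   list of utterance strings (oldest first)
    candidate: the candidate next utterance
    Returns tokens and per-token segment ids (0 = context, 1 = candidate).
    """
    context = " [NEWLINE] ".join(history)
    context_part = f"[S] {context} [S]"
    candidate_part = f"[S] {candidate} [S]"
    tokens = (context_part + " " + candidate_part).split()
    n_context = len(context_part.split())
    segments = [0] * n_context + [1] * (len(tokens) - n_context)
    return tokens, segments
```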
Pretraining Procedure
Our transformer is trained with a masked language model (MLM) task and a next-utterance prediction task, which is slightly different from Devlin et al. (2018), who use a next-sentence prediction task; an utterance can be composed of several sentences. During training, 50% of the time the candidate is the actual next utterance and 50% of the time it is an utterance randomly taken from the dataset. The first output of the transformer is fed into a linear layer to produce a binary classification. We alternate between batches of the MLM task and the next-utterance prediction task.
As in Lample and Conneau (2019), we use the Adam optimizer with a learning rate of 2e-4, no L2 weight decay, linear learning rate warmup, and inverse square root decay of the learning rate. We use a dropout probability of 0.1 on all layers, and batches of 32,000 tokens composed of [dialogue history, next utterance] pairs of similar lengths. We train the model on 32 GPUs.
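A sketch of a linear-warmup, inverse-square-root-decay schedule of the kind described above; the warmup length is a placeholder and the exact schedule constants are assumptions.

```python
import math
import torch

def build_pretraining_optimizer(model, warmup_steps=10000):
    opt = torch.optim.Adam(model.parameters(), lr=2e-4)  # no weight decay

    def lr_lambda(step):
        step = max(1, step)
        if step < warmup_steps:
            return step / warmup_steps            # linear warmup
        return math.sqrt(warmup_steps / step)     # inverse square root decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```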
5 Experiments
We perform a variety of experiments to test our model architectures. For both tasks, we measure "recall at k", abbreviated R@k, which is the percentage of the time the correct response appears in the model's top k ranked candidates.
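A minimal sketch of how R@k can be computed from a matrix of model scores; names are illustrative.

```python
import torch

def recall_at_k(scores, correct_index, k):
    """scores: (n_examples, n_candidates) model scores for each candidate.
    correct_index: (n_examples,) index of the gold response per example.
    Returns the fraction of examples whose gold response is in the top k."""
    topk = scores.topk(k, dim=-1).indices                      # (n_examples, k)
    hits = (topk == correct_index.unsqueeze(-1)).any(dim=-1)   # (n_examples,)
    return hits.float().mean().item()
```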
5.1 Input data
In all of our experiments, the context is the concatenation of the history so far in the dialogue. In the case of ConvAI2, the context also contains the persona sentences. For both ConvAI2 and DSTC7, we cap the length of the context at 360 tokens and the length of each candidate at 72 tokens. These values ensure that 99.9% of the context and candidates are not truncated. Finally, we adopt the same strategy of data augmentation as Chen and Wang (2019): we consider each utterance of a training sample as a potential response, with the previous utterances as its context.
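A minimal sketch of this augmentation, which turns one multi-turn training dialogue into several (context, response) pairs; names are illustrative.

```python
def augment_dialogue(utterances, min_context=1):
    """Turn a dialogue [u1, u2, ..., un] into training pairs where each
    utterance is a response and the preceding utterances form its context."""
    examples = []
    for i in range(min_context, len(utterances)):
        context = utterances[:i]
        response = utterances[i]
        examples.append((context, response))
    return examples

# e.g. a 4-turn dialogue yields 3 (context, response) training examples
```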
5.2 Bi-encoders and Cross-encoders
We fine-tune the Bi- and Cross-encoder architectures initialized with the weights provided by Devlin et al. (2018). In the case of the Bi-encoder, we can use a large number of negatives by considering the other batch elements as negative training samples, avoiding recomputation of their embeddings. On 8 Nvidia Volta V100 GPUs and using half-precision operations (i.e. float16 operations), this allows us to reach batches of 512 elements. Table 2 shows that in this setting, we obtain higher performance with a larger batch size, i.e. more negatives, with a batch size of 512 yielding the best results. The Cross-encoder is more computationally intensive, as the embeddings for the (context, response) pair must be recomputed each time. For the Cross-encoder, we keep the batch size fixed at 16 and provide as negatives random samples from the training set. For DSTC7, we choose 15 such negatives; for ConvAI2, the dataset provides 19 negative samples.
| Negatives | 31 | 63 | 127 | 255 | 511 |
|---|---|---|---|---|---|
| Accuracy | 81.0 | 81.7 | 82.3 | 83.0 | 83.4 |
We try two optimizers: Adam Kingma and Ba (2014) with weight decay of 0.01, as recommended by Devlin et al. (2018), and Adamax Kingma and Ba (2014) without weight decay. The learning rate is initialized to 5e-5 with a warmup of 100 iterations for the Bi- and Poly-encoders, and 1000 iterations for the Cross-encoder. The learning rate decays by a factor of 0.4 upon plateau of the loss evaluated on the validation set every half epoch. In Table 3 we show validation performance when fine-tuning various subsets of the weights provided by Devlin et al. (2018), using the Adam optimizer with weight decay. We notice that the performance is slightly better if we do not optimize the word embeddings. When initialized with the weights provided by Devlin et al. (2018), the Bi-encoder reaches 81.7% R@1 on ConvAI2 and 66.3% R@1 on DSTC7. The Cross-encoder scores 84.9% R@1 on ConvAI2 and 67.7% R@1 on DSTC7. Complete results can be found in Table 4.

| Trained parameters | Bi-encoder | Cross-encoder |
|---|---|---|
| Top layer | 74.2 | 80.6 |
| Top 4 layers | 82.0 | 86.3 |
| All but Embeddings | 83.2 | 87.4 |
| Every Layer | 83.0 | 86.6 |
| Dataset | ConvAI2 | | DSTC7 | | | |
|---|---|---|---|---|---|---|
| split | dev | test | dev | test | | |
| metric | R@1/20 | R@1/20 | R@1/100 | R@1/100 | R@10/100 | MRR |
| Hugging Face Dinan et al. (2019a) | 82.1 | 80.7 | - | - | - | - |
| Chen and Wang (2019) | - | - | 57.3 | 64.5 | 90.2 | 73.5 |
| (BERT-base) Bi-encoder | 83.3 ± 0.2 | 81.7 ± 0.2 | 55.5 ± 0.4 | 66.3 ± 0.7 | 88.4 ± 0.5 | 73.9 ± 0.5 |
| (BERT-base) Poly-encoder 1 | 83.2 ± 0.2 | 81.5 ± 0.1 | 56.4 ± 0.3 | 66.8 ± 0.7 | 88.8 ± 0.5 | 74.4 ± 0.4 |
| (BERT-base) Poly-encoder 4 | 83.4 ± 0.2 | 81.6 ± 0.1 | 56.9 ± 0.5 | 67.2 ± 1.3 | 88.9 ± 0.6 | 74.8 ± 1.0 |
| (BERT-base) Poly-encoder 16 | 85.2 ± 0.1 | 83.9 ± 0.2 | 56.1 ± 1.7 | 66.8 ± 0.7 | 89.1 ± 0.9 | 74.4 ± 0.8 |
| (BERT-base) Poly-encoder 64 | 86.0 ± 0.2 | 84.2 ± 0.2 | 57.7 ± 0.6 | 67.1 ± 0.1 | 89.0 ± 0.5 | 74.7 ± 0.2 |
| (BERT-base) Poly-encoder 360 | 86.3 ± 0.1 | 84.6 ± 0.3 | 58.1 ± 0.4 | 66.8 ± 0.7 | 89.8 ± 0.8 | 74.8 ± 0.5 |
| (BERT-base) Cross-encoder | 87.3 ± 0.3 | 84.9 ± 0.3 | 59.6 ± 0.3 | 67.7 ± 0.3 | 90.4 ± 0.2 | 74.7 ± 0.5 |
5.3 Poly-encoders
We perform experiments with both Poly-encoder variants on the DSTC7 and ConvAI2 datasets. Specifically, we report in Table 4 the recall@1 metrics for various numbers m of intermediate context codes for each architecture.
We find that the Poly-encoder indeed achieves better performance on both tasks than the Bi-encoder. On ConvAI2, the Poly-encoder architecture reaches 84.6% R@1 with 360 intermediate context codes, compared to the 81.7% R@1 of the Bi-encoder. As expected, these numbers are slightly below our best Cross-encoder result of 84.9%. We find that the performance of the Poly-encoder increases as we increase the number of intermediate context codes, with performance roughly equivalent to that of the Bi-encoder when only one context code is used.
5.4 Inference Speed
An important motivation for the Poly-encoder architecture is to achieve better results than the Bi-encoder while also performing at a reasonable speed. Though our Cross-encoder yields the highest results on all metrics, it is prohibitively slow. We perform speed experiments to determine the cost of the Poly-encoder's improved performance. Specifically, we predict the next utterance for 100 dialogue examples in the ConvAI2 validation set, where the model has access to candidates from the train set. We perform these experiments on both CPU-only and GPU setups. CPU computations were run on an 80-core Intel Xeon E5-2698 processor. GPU computations were done on a single Nvidia Quadro GP100 using CUDA 10.0 and cuDNN 7.4.
We show the average time per example for each architecture in the CPU-only setup in Table 5. The difference in timing between the Bi-encoder and the Poly-encoder architectures is rather minimal when there are only 1000 candidates for the model to consider. The difference is more pronounced when considering 100k candidates, a setup more similar to what a real chatbot may encounter, where we see a 5-6x slowdown for the Poly-encoder variants.

These differences, however, are much smaller than the slowdown from using a Cross-encoder; to evaluate one example with 1000 candidates, the Cross-encoder is two orders of magnitude slower than the Bi-encoder and Poly-encoder. These results indicate that the Poly-encoder's improved accuracy over the Bi-encoder comes at a relatively small cost in inference speed. In real-time inference, the difference between 0.1s and 0.6s per response is not nearly as noticeable as that between 0.1s and 21.7s.
Scoring time (ms)

| Model | CPU, 1k candidates | CPU, 100k candidates | GPU, 1k candidates | GPU, 100k candidates |
|---|---|---|---|---|
| Bi-encoder | 115 | 160 | 19 | 22 |
| Poly-encoder 16 | 119 | 551 | 17 | 37 |
| Poly-encoder 64 | 124 | 570 | 17 | 39 |
| Poly-encoder 360 | 160 | 837 | 17 | 45 |
| Cross-encoder | 21692 | - | 2655 | - |
| Dataset | ConvAI2 | | DSTC7 | | | |
|---|---|---|---|---|---|---|
| split | dev | test | dev | test | | |
| metric | R@1/20 | R@1/20 | R@1/100 | R@1/100 | R@10/100 | MRR |
| Hugging Face Dinan et al. (2019a) | 82.1 | 80.7 | - | - | - | - |
| Chen and Wang (2019) | - | - | 57.3 | 64.5 | 90.2 | 73.5 |
| (ours) Bi-encoder | 86.3 ± 0.1 | 84.2 ± 0.2 | 59.3 ± 0.4 | 69.2 ± 0.6 | 90.7 ± 0.2 | 76.8 ± 0.5 |
| (ours) Poly-encoder 16 | 85.9 ± 0.2 | 83.9 ± 0.2 | 59.7 ± 0.6 | 68.9 ± 0.5 | 90.4 ± 0.5 | 76.4 ± 0.1 |
| (ours) Poly-encoder 64 | 88.5 ± 0.2 | 86.7 ± 0.1 | 60.2 ± 0.2 | 70.0 ± 0.6 | 91.4 ± 0.3 | 77.3 ± 0.2 |
| (ours) Poly-encoder 360 | 89.1 ± 0.1 | 86.6 ± 0.2 | 61.3 ± 0.3 | 70.2 ± 0.7 | 91.0 ± 0.3 | 77.7 ± 0.2 |
| (ours) Cross-encoder | 89.9 ± 0.3 | 87.4 ± 0.2 | 63.1 ± 0.5 | 72.0 ± 0.2 | 92.1 ± 0.3 | 79.0 ± 0.4 |
5.5 Domain-specific Pretraining
We fine-tune our Reddit-pretrained transformer on ConvAI2 and DSTC7. The results are shown in Table 6. The training schedule remains the same as in Subsection 4.2. We also compare Adamax with Adam with weight decay and use the better of the two. In order to avoid any saturation of the attention layer in the Poly-encoder, we rescale the very last linear layer of the transformer so that the standard deviation of its output matches that of BERT. Our pretraining outperforms BERT in the Bi-encoder, Cross-encoder, and Poly-encoder settings, with the Cross-encoder reaching a score of 87.4% R@1 on ConvAI2. Note that due to the large differences between the ways the two pretrained models were obtained, we cannot clearly determine whether the performance improvement comes from the dataset or from the pretraining algorithm itself; additional ablations are left for future work.
6 Conclusions
In this paper we explore how to use pretrained deep bidirectional transformers in next-sentence selection tasks. We note that the three methods we introduced in this work are not specific to dialogue, and can be used for any task where one is scoring a set of candidates.
On the one hand, the Cross-encoder allows for deep attention between context and candidate and obtains the highest accuracies; however, it is too slow to be effectively used in production settings. On the other hand, the Bi-encoder is very fast and can be scaled to a large number of candidates, as the candidate representations can be precomputed and cached. With the intention of finding a trade-off between these two methods, we introduce the Poly-encoder. Our method provides a mechanism for attending over the context using the response candidate, while maintaining the ability to precompute each candidate's representation, which makes it suitable for real-time inference in a production setup. Moreover, the Poly-encoder obtains an accuracy close to that of the Cross-encoder. Finally, we show that the deep bidirectional transformer that we pretrained from scratch on Reddit allows us to outperform the results we obtain with BERT, for all three model architectures.
References
- Bai et al. (2009) Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Olivier Chapelle, and Kilian Weinberger. 2009. Supervised semantic indexing. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 187–196. ACM.
- Bromley et al. (1994) Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pages 737–744.
- Chen and Wang (2019) Qian Chen and Wen Wang. 2019. Sequential attention-based network for noetic end-to-end response selection. CoRR, abs/1901.02609.
- Deerwester et al. (1990) Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
- Dinan et al. (2019a) Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2019a. The second conversational intelligence challenge (convai2). arXiv preprint arXiv:1902.00098.
- Dinan et al. (2019b) Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019b. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations (ICLR).
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734.
- Kadlec et al. (2015) Rudolf Kadlec, Martin Schmid, and Jan Kleindienst. 2015. Improved deep learning baselines for ubuntu corpus dialogs. arXiv preprint arXiv:1510.03753.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lample and Conneau (2019) Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
- Lison and Tiedemann (2016) Pierre Lison and Jörg Tiedemann. 2016. Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles.
- Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.
- Mazaré et al. (2018) P.-E. Mazaré, S. Humeau, M. Raison, and A. Bordes. 2018. Training Millions of Personalized Dialogue Agents. ArXiv e-prints.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf.
- Ritter et al. (2011) Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 583–593, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Salton et al. (1975) Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.
- Shuster et al. (2018) Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. 2018. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945.
- Tiedemann (2012) Jörg Tiedemann. 2012. Parallel data, tools and interfaces in opus. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).
- Urbanek et al. (2019) Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. arXiv preprint arXiv:1903.03094.
- Vig and Ramea (2019) Jesse Vig and Kalai Ramea. 2019. Comparison of transfer-learning approaches for response selection in multi-turn conversations. Workshop on DSTC7.
- Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. Transfertransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
- Wu et al. (2016) Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2016. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. arXiv preprint arXiv:1612.01627.
- Yang et al. (2018) Liu Yang, Minghui Qiu, Chen Qu, Jiafeng Guo, Yongfeng Zhang, W Bruce Croft, Jun Huang, and Haiqing Chen. 2018. Response ranking with deep matching networks and external knowledge in information-seeking conversation systems. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 245–254. ACM.
- Zhang et al. (2018a) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018a. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
- Zhang et al. (2018b) Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, Hai Zhao, and Gongshen Liu. 2018b. Modeling multi-turn conversation with deep utterance aggregation. arXiv preprint arXiv:1806.09102.
- Zhu et al. (2015a) Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015a. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. CoRR, abs/1506.06724.
- Zhu et al. (2015b) Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan R. Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015b. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV), pages 19–27.