Real-time Inference in Multi-sentence Tasks with Deep Pretrained Transformers

04/22/2019 ∙ by Samuel Humeau, et al. ∙ Facebook

The use of deep pretrained bidirectional transformers has led to remarkable progress in learning multi-sentence representations for downstream language understanding tasks (Devlin et al., 2018). For tasks that make pairwise comparisons, e.g. matching a given context with a corresponding response, two approaches have permeated the literature. A Cross-encoder performs full self-attention over the pair; a Bi-encoder performs self-attention for each sequence separately, and the final representation is a function of the pair. While Cross-encoders nearly always outperform Bi-encoders on various tasks, both in our work and others' (Urbanek et al., 2019), they are orders of magnitude slower, which hampers their ability to perform real-time inference. In this work, we develop a new architecture, the Poly-encoder, that is designed to approach the performance of the Cross-encoder while maintaining reasonable computation time. Additionally, we explore two pretraining schemes with different datasets to determine how these affect the performance on our chosen dialogue tasks: ConvAI2 and DSTC7 Track 1. We show that our models achieve state-of-the-art results on both tasks; that the Poly-encoder is a suitable replacement for Bi-encoders and Cross-encoders; and that even better results can be obtained by pretraining on a large dialogue dataset.

1 Introduction

Mastering the ability to communicate with humans is a fundamental goal of AI. Our interaction with machines is crucial for any future application of intelligent agents in our daily lives.

There are various ways for a model to determine what to say next in a conversation, though these methods can be distilled into two main approaches: generative models, which generate a sequence of text, and retrieval/ranking models, which rank candidates from a fixed set and select the best next utterance. We focus on the latter approach in this work, as it remains superior to the former in terms of engagingness Shuster et al. (2018); Zhang et al. (2018a), and allows for more control over the possible outcomes.

Recently, substantial improvements to state-of-the-art benchmarks on a variety of language understanding tasks have been achieved through the use of deep pretrained language models Devlin et al. (2018); more generally, researchers have shown that by simply fine-tuning these large pretrained models, one can obtain performance gains on a number of language-related tasks. Specifically, we use the BERT models from Devlin et al. (2018), which have been pretrained on Wikipedia and the Toronto Books Corpus Zhu et al. (2015b). In our work, we additionally explore the pretraining of these large transformers using a different dataset that is more related to dialogue. We find that such pretraining yields better performance on the dialogue tasks that we have chosen to focus on.

With the design of systems that will actually communicate with humans, it is of paramount importance that models are able to perform accurate real-time inference. Achieving higher accuracy generally comes with a cost to computational complexity. Cross-encoders, which perform full self-attention over a given context and response candidate, tend to attain much higher accuracies than their counterparts, Bi-encoders, which perform self-attention over the context and response separately, combining them at the end for a final representation. As they encode the context and response separately, Bi-encoders are able to cache the encoded responses and reuse these representations for each given context. Unfortunately, Cross-encoders must recompute, for each context, the context-response encoding; as a result, we find that Cross-encoders, in a real-time setting, are prohibitively slow. To resolve this, we introduce Poly-encoders, which add an additional attention mechanism for better use of the response candidate representation when choosing the optimal next utterance. We show that these Poly-encoders outperform Bi-encoders at little to no cost in computation time, yielding an architecture that is suitable for real-time inference while more accurate than what is common today in the literature.

A variety of datasets exist that have been used to train dialogue-based models, including those curated from social media platforms Ritter et al. (2011); Mazaré et al. (2018), scraped from scripts or chatlogs Tiedemann (2012); Lowe et al. (2015), or deliberately collected via crowdsourcing Zhang et al. (2018a); Dinan et al. (2019b). For this work, we focus on two recent competitions with dialogue datasets: the Conversational Intelligence Challenge 2 (ConvAI2) Dinan et al. (2019a), and the Dialog System Technology Challenge 7 (DSTC7) (http://workshop.colips.org/dstc7/index.html).

2 Related Work

The task of scoring candidate labels given an input context is a classical problem in machine learning. While multi-class classification is a special case, the more general task involves candidates as structured objects rather than discrete classes; in our work we consider the inputs and the candidate labels to be sequences of text.

There is a broad class of models that map the input and a candidate label into a feature space wherein typically a dot product, cosine or (parameterized) non-linearity is used to measure their similarity. We refer to these models as Bi-encoders. Such methods include vector space models Salton et al. (1975), LSI Deerwester et al. (1990), supervised embeddings Bai et al. (2009) and classical siamese networks Bromley et al. (1994). For the next utterance prediction tasks we consider in this work, several Bi-encoder neural approaches have been considered, in particular Memory Networks Zhang et al. (2018a) and Transformer Memory Networks Dinan et al. (2019b), as well as LSTMs Lowe et al. (2015) and CNNs Kadlec et al. (2015), which encode input and label separately. A major advantage of Bi-encoder methods is their ability to cache the representations of a large, fixed candidate set. Since the candidate encodings are independent of the input, Bi-encoders are very efficient during evaluation.

Researchers have also studied a richer class of models we refer to as Cross-encoders, which make no assumptions about the similarity scoring function between input and label. Instead, the concatenation of the input and a candidate serves as a new input to a nonlinear function that scores their match based on arbitrary dependencies between them. This has been explored with Sequential Matching Network CNN-based architectures Wu et al. (2016), Deep Matching Networks Yang et al. (2018), Gated Self-Attention Zhang et al. (2018b), and most recently transformers Wolf et al. (2019); Vig and Ramea (2019); Urbanek et al. (2019). For the latter, concatenating the two sequences of text results in applying self-attention at every layer. This yields rich interactions between the input context and the candidate, as every word in the candidate label can attend to every word in the input context, and vice-versa. The previous approach closest to ours is that of Urbanek et al. (2019), who also use a pretrained BERT model to fine-tune a Bi-encoder and a Cross-encoder for a dialogue agent within a game-like world. They likewise find that, with a much deeper level of interaction between context and candidate, Cross-encoders tend to outperform Bi-encoders, since Bi-encoders may lose information through the feature-map bottleneck. However, the performance gains come at a steep computational cost; Cross-encoder representations are typically much slower to compute, which is often prohibitive when the number of candidates is large.

3 Tasks

We consider the setting of sentence selection in dialogue, a task extensively studied and recently featured in two competitions: the Neurips ConvAI2 competition, and the DSTC7 challenge, Track 1. This task involves selecting the next sentence in a dialogue given the dialogue context/history. To measure the success of a model, the task provides a set of candidate utterances for each test example; the model then ranks the utterances according to its prediction of the best fit for the next utterance, and from these rankings automatic ranking metrics can be computed.

The ConvAI2 task is based on the Persona-Chat dataset Zhang et al. (2018a), which involves dialogues between pairs of speakers. Each speaker is given a persona, which is a few sentences that describe a character they will imitate, e.g. ‘I love romantic movies’ or ‘I work in the catering industry.’ The speakers are then instructed to simply chat to get to know each other. Models that performed well in the competition conditioned their chosen responses on the dialogue history and the persona sentences. The sentence selection task involved picking the correct annotated utterance from a set of 20 choices, where the remaining 19 were other randomly chosen utterances from the evaluation set. The best performing competitor in this task achieved 80.7% accuracy on the test set, utilizing a pre-trained Transformer Radford et al. (2018) fine-tuned for this task Wolf et al. (2019).

The DSTC7 challenge we focus on is the Track 1 sentence selection task that uses the Ubuntu corpus Lowe et al. (2015). The corpus consists of two-person conversations extracted from Ubuntu chat logs, where one partner receives technical support for various Ubuntu-related problems from the other. The best performing competitor in this task achieved 64.5% R@1 Chen and Wang (2019). We summarize these two datasets and their statistics in Table 1.

                     ConvAI2    DSTC7
Train Exs.           131,438    100,000
Valid Exs.             7,801     10,000
Test Exs.              6,634      5,000
Candidates per Ex.        20        100
Table 1: Datasets used in this paper.

4 Methods

In this section we describe the various models and methods that we explored.

4.1 Transformers and BERT

BERT model architecture

Our Bi-, Cross-, and Poly-encoders, described in sections 4.2, 4.3 and 4.4 respectively, are based on large pretrained transformer models. The weights of our models are initialized from those of the transformer model in Devlin et al. (2018), which was trained on a dataset combining sentences from Wikipedia and the Toronto Books Corpus. The specific model that we use, denoted BERT-base, has 12 layers, 12 attention heads, and a hidden size of 768.

In section 4.5, we describe a different pretraining scheme in which we use the same architecture as BERT-base. However, we instead train the transformer on a dataset of 800 million sentences, derived from the online platform Reddit Mazaré et al. (2018), with the same process as BERT. For training we use the open-source implementation of XLM Lample and Conneau (2019).

Input representation

Each token input is represented as the sum of three embeddings: the token embedding, the position (in the sequence) embedding and the segment embedding. The segment embedding arises from Devlin et al. (2018), in which the segment refers to the sentence to which the token belongs. If the input sequence is a single sentence, the segment input is 0. If the input sequence is the concatenation of two sentences (e.g. [QUESTION ANSWER]), the segment inputs of the first sentence's tokens are 0 and those of the second sentence's tokens are 1.
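
As a concrete illustration, the sketch below sums the three embedding tables per token; the class name, sizes, and example token ids are ours, not the exact BERT implementation.

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Token + position + segment embeddings, summed per input token
    (a minimal sketch of the representation described above)."""

    def __init__(self, vocab_size=30522, max_len=512, hidden=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.seg = nn.Embedding(2, hidden)  # segment id 0 or 1

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        return self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)

# e.g. a [QUESTION ANSWER] pair: segment 0 for the first sentence, 1 for the second
emb = InputEmbedding()
token_ids = torch.tensor([[101, 2054, 2003, 102, 2023, 102]])   # hypothetical ids
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])
print(emb(token_ids, segment_ids).shape)  # (1, 6, 768)
```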

Pretraining Procedure

The pretraining loss is the sum of a masked language model (MLM) loss and a next-sentence prediction loss. The MLM loss is chosen over a traditional language model loss as it allows for the training of bidirectional attention, and is computed as follows: 15% of the tokens are randomly selected and are either replaced by a [MASK] token (80% of the time), replaced by a random token (10% of the time) or kept unchanged (10% of the time). The masked sentence is encoded by the transformer, and the final hidden vectors corresponding to the masked tokens are fed into a linear layer and softmax function to predict the probability of the original token over the full vocabulary. The loss is a standard cross entropy loss.
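
The masking scheme can be sketched as follows (illustrative code, not the reference implementation; the mask token id and vocabulary size are passed in as placeholders):

```python
import torch

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """Select 15% of tokens as MLM targets; of those, 80% become [MASK],
    10% a random token, 10% stay unchanged. Returns (inputs, labels) where
    labels are -100 at positions that do not contribute to the loss."""
    inputs = token_ids.clone()
    labels = token_ids.clone()

    selected = torch.rand_like(inputs, dtype=torch.float) < mlm_prob
    labels[~selected] = -100  # ignored by the cross-entropy loss

    masked = selected & (torch.rand_like(inputs, dtype=torch.float) < 0.8)
    inputs[masked] = mask_id

    randomized = selected & ~masked & (torch.rand_like(inputs, dtype=torch.float) < 0.5)
    inputs[randomized] = torch.randint(vocab_size, inputs.shape)[randomized]
    # remaining selected positions are kept unchanged
    return inputs, labels

ids = torch.randint(5, 30000, (2, 16))
inputs, labels = mask_tokens(ids, mask_id=103, vocab_size=30000)
```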

In the next-sentence prediction task, the input sequence is the concatenation of sentence A and sentence B. 50% of the time A and B are two consecutive sentences in the dataset, and 50% of the time they are randomly picked. This pair of sentences is encoded through the transformer, and the hidden state corresponding to the [CLS] token is fed into a linear layer to predict if A and B are consecutive sentences. The loss is a binary cross entropy loss.

The pretraining procedure uses the Adam optimizer with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, linear learning rate warmup, and linear decay of the learning rate.

4.2 Bi-encoders

The Bi-encoder allows for quick, real-time inference, as the candidate representations can be cached. In this setting, both the context and the candidate are encoded into vectors:

y_ctxt = red(T_1(ctxt)),   y_cand = red(T_2(cand))

where T_1 and T_2 are two transformers that have been pre-trained following the procedure described in 4.1, and red(·) is a function that reduces the sequence of vectors produced by a transformer into one vector. That is, suppose T(x) = h_1, ..., h_N is the output of a transformer T. When using BERT, both context and candidates are surrounded by the special tokens [CLS] and [SEP], so h_1 corresponds to [CLS]. We consider two ways of reducing the output into one representation:

  • Choose the first output of the transformer (corresponding to the special token [CLS]).

  • Compute the average over all outputs.

Scoring

The score of a candidate cand_i is given by the dot-product s(ctxt, cand_i) = y_ctxt · y_cand_i.

The network is trained to minimize a cross-entropy loss in which the logits are y_ctxt · y_cand_1, ..., y_ctxt · y_cand_n, where cand_1 is the correct label and the others are chosen from the train set. Similar to what is done in Mazaré et al. (2018), during training we consider the other elements of the batch as negatives. This allows for much faster training, as we can reuse the embeddings computed for each candidate, and also use a larger batch size; e.g., in our experiments on ConvAI2, we were able to use batches of 512 elements.
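
A minimal sketch of this in-batch negative loss, assuming the context and candidate encoders have already produced one vector per batch element (names are ours):

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(ctx_vecs, cand_vecs):
    """ctx_vecs, cand_vecs: (batch, hidden) encodings of contexts and their
    correct candidates. Every other candidate in the batch serves as a
    negative, so the logits form a (batch, batch) dot-product score matrix
    whose diagonal holds the correct pairs."""
    scores = ctx_vecs @ cand_vecs.t()
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)

# e.g. a batch of 512 pairs gives each context 511 negatives for free
ctx = torch.randn(512, 768)
cand = torch.randn(512, 768)
loss = in_batch_negative_loss(ctx, cand)
```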

Evaluation speed

Within the context of a retrieval system, a Bi-encoder allows for the precomputation of the embeddings of all possible candidates of the system. After computing the context embedding y_ctxt, the only operation remaining is a dot product between y_ctxt and every candidate embedding, which can scale to millions of candidates on a modern GPU, and potentially billions using nearest-neighbor libraries such as FAISS Johnson et al. (2017).
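
For instance, once the candidate embeddings are precomputed and stacked into a matrix, ranking a new context reduces to a single matrix-vector product (a library such as FAISS can replace the dense product at larger scales); the tensors below are stand-ins:

```python
import torch

candidate_matrix = torch.randn(100_000, 768)   # precomputed once and cached
context_vec = torch.randn(768)                 # computed per incoming context

scores = candidate_matrix @ context_vec        # one dot product per candidate
top_scores, top_idx = scores.topk(20)          # indices of the 20 best responses
```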

4.3 Cross-encoder

The Cross-encoder allows for rich interactions between the context and candidate, as they are jointly encoded to obtain a final representation. In this setting, the context and candidate are surrounded by the special tokens [CLS] and [SEP] and concatenated into a single sequence, which is encoded with one transformer. We consider the first output of the transformer as the context-candidate embedding:

y_ctxt,cand = first(T(ctxt, cand))

where first is the function that takes the first vector of the sequence of vectors produced by the transformer. By using a single transformer, the Cross-encoder is able to perform self-attention between the context and candidate, resulting in a much richer extraction of information.

Scoring

To score a candidate, a linear layer is applied to the embedding y_ctxt,cand_i to reduce it from a vector to a scalar score s(ctxt, cand_i).

Similarly to what is done for the Bi-encoder, the network is trained to minimize a cross-entropy loss where the logits are s(ctxt, cand_1), ..., s(ctxt, cand_n), where cand_1 is the correct candidate and the others are negatives taken from the training set. Unlike in the Bi-encoder, we cannot recycle the other labels of the batch as negatives, so we use external negatives drawn from the training set. The Cross-encoder uses much more memory than the Bi-encoder, resulting in a much smaller batch size.
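
A sketch of this scoring path (concatenated input, joint encoding, first output, linear layer to a scalar); the encoder argument is a placeholder for whichever transformer implementation is used:

```python
import torch
import torch.nn as nn

class CrossEncoderScorer(nn.Module):
    """Scores (context, candidate) pairs with a single joint transformer."""

    def __init__(self, encoder, hidden=768):
        super().__init__()
        self.encoder = encoder            # any callable returning (batch, seq, hidden)
        self.to_scalar = nn.Linear(hidden, 1)

    def forward(self, token_ids, segment_ids):
        hidden_states = self.encoder(token_ids, segment_ids)  # joint self-attention
        first = hidden_states[:, 0]       # output at the [CLS] position
        return self.to_scalar(first).squeeze(-1)              # one scalar per pair

# usage with a stand-in encoder (a real BERT model would be plugged in here)
dummy_encoder = lambda ids, segs: torch.randn(ids.size(0), ids.size(1), 768)
scorer = CrossEncoderScorer(dummy_encoder)
ids = torch.randint(0, 30000, (16, 72))   # 16 concatenated context+candidate sequences
segs = torch.zeros_like(ids)
print(scorer(ids, segs).shape)            # torch.Size([16])
```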

Evaluation speed

The Cross-encoder does not allow for precomputation of the candidate embeddings. At inference time, every candidate must be concatenated with the context and must go through a forward pass of the entire model. Thus, this method cannot scale to a large amount of candidates. We discuss this bottleneck further in Section 5.4.

4.4 Poly-encoders

Figure 1: Diagrams of the three model architectures we consider. (a) The Bi-encoder encodes the context and candidate separately, allowing for the caching of candidate representations during real-time inference. (b) The Cross-encoder jointly encodes the context and candidate in a single transformer, yielding richer interactions between context and candidate at the cost of slower computation. (c) The Poly-encoder combines the strengths of the Bi-encoder and Cross-encoder by both allowing for caching of candidate representations and adding an additional attention mechanism to extract more information from the candidate before computing a final score.

The Poly-encoder attempts to obtain the best of both worlds from the Bi-encoder and the Cross-encoder:

  • The candidates are represented by one vector as in the Bi-encoder, which allows for caching for fast inference time.

  • The context is jointly encoded with the candidate, as in the Cross-encoder, allowing the model to extract more information.

More generally, the Poly-encoder can be interpreted as an extension of the Bi-encoder. That is, we still use two separate transformer encoders for the context and the candidate, and the candidate is still encoded into a single vector y_cand. As such, the Poly-encoder method can be implemented using a precomputed cache of encoded responses, allowing its usage in a production setup. However, we represent the context with several vectors (y_ctxt^1, ..., y_ctxt^m) instead of just one. To reduce these m context vectors to a final representation, we use an attention layer with y_cand as the query:

y_ctxt = sum_i w_i y_ctxt^i,   where (w_1, ..., w_m) = softmax(y_cand · y_ctxt^1, ..., y_cand · y_ctxt^m)

The candidate is then scored with the dot product y_ctxt · y_cand, as in the Bi-encoder. An important question is how to obtain these m context vectors; we describe our process below.

A simple way to encode the context as m different vectors is to consider the first m outputs of the context encoder (see Figure 1 for more details). Here m is a chosen hyperparameter, and the immediate drawback is that m cannot exceed the number of tokens in the context. However, the encoder is followed by an attention layer which is flexible in the number of inputs, so whenever the length of the context is below m we simply use the outputs that are available. Note that in this setting the model must be able to dedicate a different role to each of those outputs during fine-tuning. This was a motivation to take the first m outputs, in order to best leverage the position embeddings provided to the encoder.
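
A minimal sketch of the attention step described above, using the first m context outputs as codes and the cached candidate vector as the query (tensor names are ours):

```python
import torch
import torch.nn.functional as F

def poly_encoder_score(ctx_outputs, cand_vec, m=16):
    """ctx_outputs: (seq_len, hidden) outputs of the context encoder.
    cand_vec: (hidden,) encoding of one candidate (cacheable).
    Attends over the first m context outputs with the candidate as query,
    then scores with a final dot product."""
    codes = ctx_outputs[:m]                          # first m outputs (fewer if the context is short)
    weights = F.softmax(codes @ cand_vec, dim=0)     # (m,) attention weights
    ctx_vec = weights @ codes                        # weighted sum -> final context vector
    return ctx_vec @ cand_vec                        # final score

ctx_outputs = torch.randn(360, 768)
cand_vec = torch.randn(768)
score = poly_encoder_score(ctx_outputs, cand_vec, m=64)
```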

4.5 Domain-specific Pretraining

In addition to using the pretrained transformers from Devlin et al. (2018), which were pretrained on Wikipedia and the Toronto Books Corpus Zhu et al. (2015a), we explore our own pretraining scheme, in which we use a dataset more adapted to dialogue. Specifically, we pretrain a transformer from scratch on 800 million comments from Reddit, while using the same transformer architecture as BERT-base - 12 layers, 12 attention heads and hidden size of 768.

The vocabulary used is slightly different from BERT - it is computed using BPE trained on lower-cased Wikipedia, the Toronto Books Corpus, and Open Subtitles Lison and Tiedemann (2016) with 30k merges. The resulting dictionary has 54,940 terms, with slightly different special tokens.

Our input is the concatenation of context and candidate, where both are surrounded with the special token [S], following Lample and Conneau (2019). The context is the concatenation of the utterances in the dialogue history separated by a special [NEWLINE] token. As in Devlin et al. (2018), we add segment embeddings to each token input; i.e., we add segment 0 embedding for the context and segment 1 for the candidate.
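
For concreteness, a rough sketch of how one such input could be assembled; the whitespace tokenization and exact special-token placement are simplifying assumptions on our part:

```python
def build_input(history_utterances, candidate):
    """Builds tokens and segment ids for one [dialogue history, candidate] pair.
    A sketch of the format described above, not the exact preprocessing code."""
    context = " [NEWLINE] ".join(history_utterances)
    ctx_part = f"[S] {context} [S]".split()
    cand_part = f"{candidate} [S]".split()
    tokens = ctx_part + cand_part
    segments = [0] * len(ctx_part) + [1] * len(cand_part)  # 0 = context, 1 = candidate
    return tokens, segments

tokens, segments = build_input(
    ["how do I mount a USB drive?", "which Ubuntu version are you on?"],
    "18.04, installed last week",
)
```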

Pretraining Procedure

Our transformer is trained with a masked language model (MLM) task and a next-utterance prediction task, which is slightly different from Devlin et al. (2018), who use a next-sentence prediction task. An utterance can be composed of several sentences. During training, 50% of the time the candidate is the actual next utterance and 50% of the time it is an utterance randomly taken from the dataset. The first output of the transformer is fed to a linear layer that produces the binary classification. We alternate between batches of the MLM task and the next-utterance prediction task.

As in Lample and Conneau (2019), we use the Adam optimizer with a learning rate of 2e-4, no L2 weight decay, linear learning rate warmup, and inverse square root decay of the learning rate. We use a dropout probability of 0.1 on all layers, and batches of 32,000 tokens composed of [dialogue history, next utterance] pairs with similar lengths. We train the model on 32 GPUs.

5 Experiments

We perform a variety of experiments to test our model architectures. For both tasks, we measure "recall at k", abbreviated R@k, which is the percentage of the time the correct response appears in the model's top k ranked candidates.

5.1 Input data

In all of our experiments, the context is the concatenation of the history so far in the dialogue. In the case of ConvAI2, the context also contains the persona sentences. For both ConvAI2 and DSTC7, we cap the length of the context at 360 tokens and the length of each candidate at 72 tokens. These values ensure that 99.9% of the context and candidates are not truncated. Finally, we adopt the same strategy of data augmentation as Chen and Wang (2019): we consider each utterance of a training sample as a potential response, with the previous utterances as its context.
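
A sketch of that augmentation (a hypothetical helper; the real pipeline applies the token caps above with the model's tokenizer rather than whitespace splits):

```python
def augment_dialogue(utterances, max_context=360, max_candidate=72):
    """Turns one training dialogue into several (context, response) pairs:
    each utterance becomes a potential response, with all previous
    utterances concatenated as its context."""
    examples = []
    for i in range(1, len(utterances)):
        context = " ".join(utterances[:i]).split()[-max_context:]   # keep most recent tokens
        response = utterances[i].split()[:max_candidate]
        examples.append((" ".join(context), " ".join(response)))
    return examples
```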

5.2 Bi-encoders and Cross-encoders

We fine-tune the Bi- and Cross-encoder architectures initialized with the weights provided by Devlin et al. (2018). In the case of the Bi-encoder, we can use a large number of negatives by considering the other batch elements as negative training samples, avoiding recomputation of their embeddings. On 8 Nvidia V100 GPUs and using half-precision operations (i.e. float16), this allows us to reach batches of 512 elements. Table 2 shows that in this setting we obtain higher performance with a larger batch size, i.e. more negatives, with a batch size of 512 yielding the best results. The Cross-encoder is more computationally intensive, as the embeddings for the (context, response) pair must be recomputed each time. For the Cross-encoder, we therefore keep the batch size fixed at 16 and provide negatives sampled at random from the training set. For DSTC7, we choose 15 such negatives; for ConvAI2, the dataset provides 19 negative samples.

Negatives 31 63 127 255 511
Accuracy 81.0 81.7 82.3 83.0 83.4
Table 2: Validation performance on ConvAI2 after fine-tuning a Bi-encoder pretrained with BERT, averaged over 5 runs. The batch size is the number of training negatives + 1 as we use the other elements of the batch as negatives during training. The accuracy metric is recall@1/20, i.e., the accuracy when predicting the response among 19 distractors.

We try two optimizers: Adam Kingma and Ba (2014) with weight decay of 0.01, as recommended by Devlin et al. (2018), and Adamax Kingma and Ba (2014) without weight decay. The learning rate is initialized to 5e-5 with a warmup of 100 iterations for Bi- and Poly-encoders, and 1000 iterations for the Cross-encoder. The learning rate decays by a factor of 0.4 upon plateau of the loss evaluated on the valid set every half epoch. In Table 3 we show validation performance when fine-tuning various subsets of the weights provided by Devlin et al. (2018), using the Adam optimizer with weight decay. We notice that the performance is slightly better if we do not optimize the word embeddings. When initialized with the weights provided by Devlin et al. (2018), the Bi-encoder reaches 81.7% R@1 on ConvAI2 and 66.3% R@1 on DSTC7. The Cross-encoder scores 84.9% R@1 on ConvAI2 and 67.7% R@1 on DSTC7. Complete results can be found in Table 4.

Trained parameters Bi-encoder Cross-encoder
Top layer 74.2 80.6
Top 4 layers 82.0 86.3
All but Embeddings 83.2 87.4
Every Layer 83.0 86.6
Table 3: Validation performance on ConvAI2 when fine-tuning different sets of parameters. Averaged over 5 runs (Bi-encoders) or 3 runs (Cross-encoders).
Dataset                            ConvAI2                  DSTC7
split                              dev         test         dev         test
metric                             R@1/20      R@1/20       R@1/100     R@1/100     R@10/100    MRR
Hugging Face Dinan et al. (2019a)  82.1        80.7         -           -           -           -
Chen and Wang (2019)               -           -            57.3        64.5        90.2        73.5
(BERT-base) Bi-encoder             83.3 ± 0.2  81.7 ± 0.2   55.5 ± 0.4  66.3 ± 0.7  88.4 ± 0.5  73.9 ± 0.5
(BERT-base) Poly-encoder 1         83.2 ± 0.2  81.5 ± 0.1   56.4 ± 0.3  66.8 ± 0.7  88.8 ± 0.5  74.4 ± 0.4
(BERT-base) Poly-encoder 4         83.4 ± 0.2  81.6 ± 0.1   56.9 ± 0.5  67.2 ± 1.3  88.9 ± 0.6  74.8 ± 1.0
(BERT-base) Poly-encoder 16        85.2 ± 0.1  83.9 ± 0.2   56.1 ± 1.7  66.8 ± 0.7  89.1 ± 0.9  74.4 ± 0.8
(BERT-base) Poly-encoder 64        86.0 ± 0.2  84.2 ± 0.2   57.7 ± 0.6  67.1 ± 0.1  89.0 ± 0.5  74.7 ± 0.2
(BERT-base) Poly-encoder 360       86.3 ± 0.1  84.6 ± 0.3   58.1 ± 0.4  66.8 ± 0.7  89.8 ± 0.8  74.8 ± 0.5
(BERT-base) Cross-encoder          87.3 ± 0.3  84.9 ± 0.3   59.6 ± 0.3  67.7 ± 0.3  90.4 ± 0.2  74.7 ± 0.5
Table 4: BERT-base Models: Validation and test performance of Bi-, Poly- and Cross-encoders using the BERT-base transformer. Scores are shown for ConvAI2 and DSTC7 Track 1, and include the previous state-of-the-art models in the literature.

5.3 Poly-encoders

We perform experiments with the Poly-encoder on the DSTC7 and ConvAI2 datasets. Specifically, we report in Table 4 the recall@1 metrics for various numbers of intermediate context codes.

We find that the Poly-encoder indeed achieves better performance on both tasks than the Bi-encoder. On ConvAI2, the Poly-encoder architecture reaches 84.6% R@1 with 360 intermediate context codes, compared to the 81.7% R@1 score of the Bi-encoder. As expected, these numbers are slightly worse than our best Cross-encoder result of 84.9%. We find that the performance of the Poly-encoder increases as we increase the number of intermediate context codes, with the performance being roughly equivalent to that of the Bi-encoder when only one context code is provided.

5.4 Inference Speed

An important motivation for the Poly-encoder architecture is to achieve better results than the Bi-encoder while also performing at a reasonable speed. Though our Cross-encoder yields the highest results in all metrics, it is prohibitively slow. We perform speed experiments to determine the cost of the improved performance of the Poly-encoder. Specifically, we predict the next utterance for 100 dialogue examples in the ConvAI2 validation set, where the model has access to candidates from the train set. We perform these experiments on both CPU-only and GPU setups. CPU computations were run on an 80-core Intel Xeon E5-2698 processor. GPU computations were done on a single Nvidia Quadro GP100 using CUDA 10.0 and cuDNN 7.4.

We show the average time per example for each architecture in Table 5. The difference in timing between the Bi-encoder and the Poly-encoder architectures is rather minimal when there are only 1000 candidates for the model to consider. On the other hand, the difference is more pronounced when considering 100k candidates, a setup more similar to what a real chatbot may encounter, as we see a 5-6x slowdown for the Poly-encoder variants.

These differences, however, are much smaller than the slowdown from using a Cross-encoder; to evaluate one example with 1000 candidates, the Cross-encoder experiences a slowdown of two orders of magnitude compared to the Bi-encoder and Poly-encoder. These results indicate that the Poly-encoder's improved performance over the Bi-encoder comes at a relatively small computational cost. In real-time inference, a difference between 0.1s and 0.6s per response is far less noticeable than one between 0.1s and 21.7s.

Scoring time (ms)
Model              CPU              GPU
Candidates         1k      100k     1k      100k
Bi-encoder         115     160      19      22
Poly-encoder 16    119     551      17      37
Poly-encoder 64    124     570      17      39
Poly-encoder 360   160     837      17      45
Cross-encoder      21692   -        2655    -
Table 5: Average time in milliseconds to predict the next dialogue utterance from the given number of possible candidates.
Dataset                            ConvAI2                  DSTC7
split                              dev         test         dev         test
metric                             R@1/20      R@1/20       R@1/100     R@1/100     R@10/100    MRR
Hugging Face Dinan et al. (2019a)  82.1        80.7         -           -           -           -
Chen and Wang (2019)               -           -            57.3        64.5        90.2        73.5
(ours) Bi-encoder                  86.3 ± 0.1  84.2 ± 0.2   59.3 ± 0.4  69.2 ± 0.6  90.7 ± 0.2  76.8 ± 0.5
(ours) Poly-encoder 16             85.9 ± 0.2  83.9 ± 0.2   59.7 ± 0.6  68.9 ± 0.5  90.4 ± 0.5  76.4 ± 0.1
(ours) Poly-encoder 64             88.5 ± 0.2  86.7 ± 0.1   60.2 ± 0.2  70.0 ± 0.6  91.4 ± 0.3  77.3 ± 0.2
(ours) Poly-encoder 360            89.1 ± 0.1  86.6 ± 0.2   61.3 ± 0.3  70.2 ± 0.7  91.0 ± 0.3  77.7 ± 0.2
(ours) Cross-encoder               89.9 ± 0.3  87.4 ± 0.2   63.1 ± 0.5  72.0 ± 0.2  92.1 ± 0.3  79.0 ± 0.4
Table 6: Domain-specific Pretraining: Validation and test performance of Bi-, Poly- and Cross-encoders, using our domain-specific pretrained transformers. Scores are shown for ConvAI2 and DSTC7 Track 1, and include the previous state-of-the-art models in the literature.

5.5 Domain-specific Pretraining

We fine-tune our Reddit-pretrained transformer on ConvAI2 and DSTC7. The results are shown in Table 6. The training schedule remains the same as in Subsection 4.2. We again compare Adamax with Adam with weight decay and use the better of the two. In order to avoid any saturation of the attention layer in the Poly-encoder, we rescale the very last linear layer of the transformer so that the standard deviation of its output matches that of BERT. Our pretraining outperforms BERT for the Bi-encoder, Cross-encoder, and Poly-encoder settings, with the Cross-encoder reaching a score of 87.4% R@1 on ConvAI2. Note that due to the large differences between the ways the two pretrained models were obtained, we cannot clearly determine whether the performance improvement is due to the dataset or to the pretraining algorithm itself; additional ablations are left for future work.

6 Conclusions

In this paper we explore how to use pretrained deep bidirectional transformers in next-sentence selection tasks. We note that the three methods we introduced in this work are not specific to dialogue, and can be used for any task where one is scoring a set of candidates.

On the one hand, the Cross-encoder allows for deep attention between context and candidate and obtains the highest accuracies; however, it is too slow to be effectively used in production settings. On the other hand, the Bi-encoder is very fast and can be scaled to a large number of candidates, given its ability to cache the candidate representations for each input example. With the intention of finding a trade-off between these two methods, we introduce the Poly-encoder. Our method provides a mechanism for attending over the response candidate, while maintaining the ability to precompute each candidate's representation, which allows for real-time inference in a production setup. Moreover, the Poly-encoder has the advantage of obtaining an accuracy close to that of the Cross-encoder. Finally, we show that using the deep bidirectional transformer that we pretrained from scratch on Reddit allows us to outperform the results we obtain with BERT, for all three model architectures.

References