Domain Adaptive Training BERT for Response Selection

08/13/2019 ∙ by Taesun Whang, et al. ∙ 0

We focus on multi-turn response selection in a retrieval-based dialog system. In this paper, we utilize the powerful pre-trained language model Bidirectional Encoder Representations from Transformers (BERT) for a multi-turn dialog system and propose a highly effective post-training method on a domain-specific corpus. Although BERT is easily adapted to various NLP tasks and outperforms previous baselines on each task, it still has limitations if a task corpus is too focused on a certain domain. Post-training on a domain-specific corpus (e.g., Ubuntu Corpus) helps the model learn contextualized representations and words that do not appear in a general corpus (e.g., English Wikipedia). Experimental results show that our approach achieves new state-of-the-art performance on two response selection benchmark datasets (i.e., Ubuntu Corpus V1 and Advising Corpus), with a performance improvement of 5.9% on Ubuntu Corpus V1.




1 Introduction

A human-computer dialog system aims to hold natural and consistent conversations. One method used in building dialog systems is response selection, which predicts the most likely response from a pool of candidates in a retrieval-based dialog system. Previous studies of response selection concentrate on recurrent neural models to enhance dialog-response matching Lowe et al. (2015); Zhou et al. (2016); Wu et al. (2017). The most recent models focus on attention-based matching that strengthens segment representations Zhou et al. (2018). The pre-trained language model BERT has also been applied to response selection in the work of Vig and Ramea (2019).
Recently, pre-trained language models (e.g., ELMo Peters et al. (2018), BERT Devlin et al. (2018)) have proven important across a wide range of NLP tasks, such as Natural Language Inference (NLI) and Question Answering (QA). Despite their huge success, they are still limited in representing contextual information in a domain-specific corpus, since they are trained on a general corpus (e.g., English Wikipedia). For example, Ubuntu Corpus, the most widely used corpus in the response selection task, contains many technical terms and manual entries that do not usually appear in a general corpus (e.g., apt-get, mkdir, and grep). Since the corpus is focused on a certain domain, existing works are limited in matching a dialog context and a response. In addition, conversation corpora, such as Twitter and Reddit, are mainly composed of colloquial expressions that are often grammatically incorrect. In response selection, Chaudhuri et al. (2018) proposed using both domain knowledge embeddings and general word embeddings so that a model fully understands the Ubuntu manual. In another domain, there has been an attempt to learn domain specificity on top of a pre-trained model: Xu et al. (2019) proposed a BERT-based post-training method for the Review Reading Comprehension (RRC) task to enhance domain awareness. Since reviews and opinion-based texts differ considerably from the original corpus of BERT, post-training BERT with two powerful unsupervised objectives (i.e., masked language model (MLM) and next sentence prediction (NSP)) on a task-specific corpus helps produce task-aware contextualized representations.
In this work, we propose an effective post-training method for a multi-turn conversational system. To the best of our knowledge, it is the first attempt to adopt BERT on the most popular response selection benchmark data set, Ubuntu Corpus V1. We demonstrate that NSP is an especially important task for response selection, since classifying whether two given sentences are IsNext or NotNext is the ultimate objective of response selection. Also, we append [EOT] to the end of each utterance so that the model can learn relationships among utterances during post-training. Furthermore, our approach outperforms the previous state of the art by 5.9% on $R_{10}@1$. We also evaluate on the data set recently released in the Dialog System Technology Challenges 7 (DSTC 7) Lasecki (2019) and outperform the 1st-place system of the challenge.

2 Related Work

Lowe et al. (2015) introduced a new benchmark dataset, the Ubuntu IRC Corpus, and simultaneously proposed baselines for the task (e.g., TF-IDF, RNN, LSTM). Bi-directional LSTM and convolutional neural networks were also applied in the work of Kadlec et al. (2015). Zhou et al. (2016) utilized both token-level and utterance-level representations, matching token-level gated recurrent unit (GRU) representations with hierarchically constructed utterance-level representations. CNN-based utterance-response matching techniques have been applied to enhance the relationships between dialog context and response Wu et al. (2017); Zhang et al. (2018). The response selection task is similar to the natural language inference problem; therefore, the most recent works are based on the Enhanced Sequential Inference Model (ESIM) Dong and Huang (2018); Gu et al. (2019); Chen and Wang (2019). Some approaches using the transformer encoder have been introduced recently. For example, the deep attention matching network (DAM) showed the benefit of aggregating self-attention and cross-attention Zhou et al. (2018). Tao et al. (2019) proposed a fusion strategy that fuses multiple types of representations, such as word, contextual, and attention-based representations.

3 BERT for Response Selection

Our overall approach is described in Figure 1. We transform the task of ranking responses from the candidate pool into a binary classification problem by taking a pointwise approach. We denote a training set as triples $D = \{(c_j, r_j, y_j)\}_{j=1}^{N}$, where $c = \{u_1, u_2, \ldots, u_n\}$ is a dialog context consisting of a set of utterances. An utterance $u_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,l_i}\}$ is composed of a set of word tokens, where $i \in [1, n]$ and $l_i$ is the length of the $i$-th utterance. The response is denoted as $r = \{r_1, r_2, \ldots, r_{l_r}\}$ ($l_r$ is the number of tokens in the response), and the ground truth as $y \in \{0, 1\}$. The maximum sequence lengths of each dialog and response are denoted as $L_c$ and $L_r$, respectively. We define the BERT input $x$ = ([CLS], $u_1$, [EOT], $u_2$, [EOT], …, $u_n$, [SEP], $r_1$, …, $r_{l_r}$, [SEP]). Unlike a general sentence, a multi-turn dialog is composed of a set of utterances. Therefore, we append an "End Of Turn" tag [EOT] to the end of each turn so that the model can recognize where each utterance ends. Position, segment, and token embeddings are added and fed into the BERT layers. The BERT contextual representation of the [CLS] token, $T_{[CLS]}$, is utilized to classify whether a given dialog context and response is IsNextUtterance or not. We feed $T_{[CLS]}$ to a single-layer perceptron to compute the model prediction score

$$g(c, r) = \sigma(W_{task} T_{[CLS]} + b),$$

where $W_{task}$ is a task-specific trainable parameter (with bias $b$). We use cross-entropy loss as the objective function to optimize our model, formulated as

$$Loss = -\sum_{(c, r, y) \in D} \left[ y \log g(c, r) + (1 - y) \log \left(1 - g(c, r)\right) \right].$$
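To make the scoring step and the objective concrete, the following is a minimal sketch in plain Python, assuming the [CLS] representation is already available as a list of floats. The function names `score` and `cross_entropy`, the bias argument, and the clipping constant `eps` are illustrative assumptions and are not taken from the authors' implementation.

```python
import math
from typing import List, Sequence


def score(cls_vector: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Single-layer perceptron score: sigmoid(w . T_cls + b)."""
    z = sum(w * x for w, x in zip(weights, cls_vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(scores: List[float], labels: List[int], eps: float = 1e-12) -> float:
    """Binary cross-entropy summed over (context, response, label) triples."""
    total = 0.0
    for g, y in zip(scores, labels):
        g = min(max(g, eps), 1.0 - eps)  # clip to avoid log(0); an added safeguard
        total -= y * math.log(g) + (1 - y) * math.log(1.0 - g)
    return total
```

At inference time, the same score can be computed for every candidate response of a context, and the candidates ranked by it.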

Figure 1: BERT for Response Selection
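The input layout defined in this section can be sketched as follows. This is an illustration only: it assumes utterances and the response are already WordPiece-tokenized, the helper name `build_input` is hypothetical, and placing [EOT] after the last utterance as well follows the sentence "we append [EOT] to the end of each turn" rather than any released code.

```python
from typing import List, Tuple

CLS, SEP, EOT = "[CLS]", "[SEP]", "[EOT]"


def build_input(utterances: List[List[str]], response: List[str]) -> Tuple[List[str], List[int]]:
    """Concatenate a dialog context and a response into one BERT input.

    Returns the token sequence and the matching segment ids
    (0 for the dialog context, 1 for the response).
    """
    tokens = [CLS]
    for utterance in utterances:
        tokens.extend(utterance)
        tokens.append(EOT)  # mark the end of each turn
    tokens.append(SEP)
    segment_ids = [0] * len(tokens)

    response_part = list(response) + [SEP]
    tokens.extend(response_part)
    segment_ids.extend([1] * len(response_part))
    return tokens, segment_ids
```

For example, `build_input([["hi"], ["how", "are", "you"]], ["fine"])` yields a ten-token sequence in which the last two positions belong to segment 1.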

3.1 Domain Post-Training

BERT is designed to be easily applied to other NLP tasks through fine-tuning. Since it is pre-trained on a general corpus (e.g., Wikipedia), fine-tuning alone does not provide enough supervision for task-specific words and phrases. To alleviate this issue, we post-train BERT on our task-specific corpora, which helps the model understand a certain domain. The model is trained with two objective functions, masked language model (MLM) and next sentence prediction (NSP), which are highly effective for learning contextual representations from the corpora. One example (Ubuntu Corpus) of domain post-training of BERT for response selection is shown below:

[Masked Language Model]
Input : sud ##o apt - get install cuda
        [MASK] ##o apt - get install cuda
[Next Sentence Prediction]
Input : how do I [MASK] cuda library in terminal ? [EOT] [SEP] (Text A)
        You can search with apt - cache search [MASK] [EOT] [SEP] (Text B)
Label : IsNext

In the Masked LM example, the model can learn that the sud##o command is needed when running apt install on an Ubuntu system, which is not generally seen in universal corpora. By conducting NSP during post-training, the model can also learn whether two given sentences are sequential and relevant, which is the ultimate goal of response selection. To optimize the model, the domain post-training (DPT) loss is calculated by adding the mean masked LM likelihood and the mean next sentence prediction (NSP) likelihood, formulated as

$$L_{DPT} = L_{MLM} + L_{NSP}.$$
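As a rough sketch of how a post-training example might be prepared, the code below masks tokens and sums the two mean losses. The 15% masking rate and the 80/10/10 replacement rule are assumptions carried over from the original BERT recipe, since this section does not state them; the names `mask_tokens` and `dpt_loss` are hypothetical, and leaving [EOT] unmasked is likewise an assumption.

```python
import random
from typing import Dict, List, Sequence, Tuple

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[EOT]"}


def mask_tokens(tokens: List[str], vocab: Sequence[str], rng: random.Random,
                mask_prob: float = 0.15) -> Tuple[List[str], Dict[int, str]]:
    """Return a masked copy of `tokens` and a {position: original token} map."""
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    rng.shuffle(candidates)
    num_to_mask = max(1, round(len(candidates) * mask_prob))

    masked = list(tokens)
    labels: Dict[int, str] = {}
    for i in sorted(candidates[:num_to_mask]):
        labels[i] = tokens[i]
        p = rng.random()
        if p < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif p < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


def dpt_loss(mlm_losses: Sequence[float], nsp_losses: Sequence[float]) -> float:
    """Domain post-training loss: mean MLM loss plus mean NSP loss."""
    return sum(mlm_losses) / len(mlm_losses) + sum(nsp_losses) / len(nsp_losses)
```

The NSP label for a pair is IsNext when the response actually follows the context in the corpus, and NotNext when the response is drawn from elsewhere.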


4 Experiments

4.1 Datasets and Training Setup

We evaluate our model on two multi-turn dyadic data sets, Ubuntu IRC (Internet Relay Chat) Corpus V1 Lowe et al. (2015) and Advising Corpus Lasecki (2019). For the Ubuntu Corpus, the training set is composed of 0.5M dialog contexts, with positive and negative responses at a ratio of 1:1. The validation and test sets each contain 50k dialog contexts with one positive response and nine negative responses. The Advising Corpus consists of 100k dialogs for the training set and 500 for the validation and test sets. All sets contain one positive response and 99 negative responses. We use only one negative sample for training to match the conditions of the Ubuntu Corpus. As an evaluation metric, we use $R_n@k$, which evaluates whether the ground truth appears in the top $k$ of $n$ candidates Lowe et al. (2015); Wu et al. (2017). We also use mean reciprocal rank (MRR) Voorhees and others (1999) as another evaluation metric. Implementation details are summarized in Appendix A.
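The two metrics can be illustrated with a short sketch that works on the scores of one candidate list, assuming exactly one positive response per list as in the test sets described above. The function names are illustrative; corpus-level results are the averages of these per-context values.

```python
from typing import List, Sequence


def _ranking(scores: Sequence[float]) -> List[int]:
    """Candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def recall_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """1.0 if the ground-truth response is ranked within the top k, else 0.0."""
    return 1.0 if any(labels[i] == 1 for i in _ranking(scores)[:k]) else 0.0


def reciprocal_rank(scores: Sequence[float], labels: Sequence[int]) -> float:
    """1 / rank of the ground-truth response (0.0 if no positive is present)."""
    for rank, i in enumerate(_ranking(scores), start=1):
        if labels[i] == 1:
            return 1.0 / rank
    return 0.0


def mean(values: Sequence[float]) -> float:
    """Average a per-context metric over the evaluation set."""
    return sum(values) / len(values)
```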

Model $R_{10}@1$ $R_{10}@2$ $R_{10}@5$
DualEncoder (RNN) 0.403 0.547 0.819
DualEncoder (CNN) 0.549 0.684 0.896
DualEncoder (LSTM) 0.638 0.784 0.949
DualEncoder (BiLSTM) 0.630 0.780 0.944
MultiView 0.662 0.801 0.951
DL2R 0.626 0.783 0.944
AK-DE-biGRU 0.747 0.868 0.972
SMN 0.726 0.847 0.961
DUA 0.752 0.868 0.962
DAM 0.767 0.874 0.969
IMN 0.777 0.888 0.974
ESIM 0.796 0.894 0.975
MRFN 0.786 0.886 0.976
BERT 0.817 0.904 0.977
BERT-DPT 0.851 0.924 0.984
BERT-VFT 0.855 0.928 0.985
BERT-VFT(DA) 0.858 0.931 0.985

Table 1: Model comparison on Ubuntu Corpus V1.

4.2 Baseline Methods

We compared our model with the following baseline methods:
Dual Encoder is based on RNN, CNN, LSTM, and BiLSTM Kadlec et al. (2015).
MultiView is constructed with token-level and utterance-level GRU layers Zhou et al. (2016).
DL2R Yan et al. (2016) proposed a method of reformulating the response with utterances.
AK-DE-biGRU Chaudhuri et al. (2018) proposed a method incorporating domain knowledge (i.e., Ubuntu manual description).
SMN Wu et al. (2017) and DUA Zhang et al. (2018) proposed utterance-response matching methods.
IMN Gu et al. (2019) and ESIM Chen and Wang (2019) are based on the ESIM model proposed by Chen et al. (2017).
DAM Zhou et al. (2018) is based on the transformer Vaswani et al. (2017) encoder and applies both self-attention and cross-attention.
MRFN Tao et al. (2019) proposed a multi-representation fusion network and highlighted the effectiveness of the fusing strategy.
BERT Devlin et al. (2018) is a vanilla BERT model for the response selection task.
BERT-DPT is a BERT model that is post-trained on a domain-specific corpus (e.g., Ubuntu, Advising). Masked language model (MLM) and next sentence prediction (NSP) tasks are conducted during post-training.
BERT-VFT is a model that selects the number of top layers to fine-tune in BERT-DPT. We evaluate the model with varying $m$, where $m$ is the number of layers that are tuned during training.
BERT-VFT(DA) applies a data augmentation technique that increases the number of negative samples. We randomly choose the samples for training.

Model $R_{100}@1$ $R_{100}@10$ $R_{100}@50$ MRR
Vig and Ramea (2019) 0.186 0.580 0.942 0.312
Chen and Wang (2019) 0.214 0.630 0.948 0.339
BERT 0.236 0.656 0.946 0.359
BERT-DPT 0.270 0.668 0.942 0.395
BERT-VFT 0.274 0.654 0.932 0.400
BERT-VFT(DA) 0.274 0.664 0.942 0.399

Table 2: Evaluation results on the Advising Corpus.

4.3 Results and Analysis

We conduct experiments on two data sets, Ubuntu Corpus V1 and Advising Corpus. Evaluation results and baseline comparisons for Ubuntu Corpus V1 are given in Table 1. BERT-VFT achieves new state-of-the-art performance, improving $R_{10}@k$ by 5.9%, 3.4%, and 0.9% for $k = 1, 2, 5$, respectively, compared to the previous state-of-the-art methods. On $R_{10}@1$, vanilla BERT scores 0.817, and our main approach, BERT-DPT, performs much better (an improvement of 3.4%). In addition, we compare BERT-VFT with the deep attention matching network (DAM) Zhou et al. (2018) in particular, since both models are based on the transformer encoder. The domain-optimized BERT-VFT model outperforms this general transformer-based model by 8.8%.
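The improvements quoted above are absolute differences between Table 1 entries, which the following snippet reproduces. The values are copied from Table 1; taking the best earlier baseline separately for each column is an assumption about how the comparison was made.

```python
# Recall values for k = 1, 2, 5, copied from Table 1.
baselines = {
    "DAM": (0.767, 0.874, 0.969),
    "ESIM": (0.796, 0.894, 0.975),
    "MRFN": (0.786, 0.886, 0.976),
}
bert = (0.817, 0.904, 0.977)
bert_dpt = (0.851, 0.924, 0.984)
bert_vft = (0.855, 0.928, 0.985)


def gain(ours: float, theirs: float) -> float:
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round((ours - theirs) * 100, 1)


# Best previous value per column (ESIM for k = 1, 2 and MRFN for k = 5).
best_previous = [max(v[i] for v in baselines.values()) for i in range(3)]
gains_over_sota = [gain(o, p) for o, p in zip(bert_vft, best_previous)]

gain_dpt_over_bert = gain(bert_dpt[0], bert[0])
gain_vft_over_dam = gain(bert_vft[0], baselines["DAM"][0])
```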

As shown in Table 2, we compare our approach with two existing baselines on the Advising Corpus, proposed by Vig and Ramea (2019) and Chen and Wang (2019) in DSTC 7. The former baseline evaluates BERT on the data set, but its performance differs substantially from what we obtain. We believe that different implementation frameworks are the main reason for the difference between our work and that of Vig and Ramea (2019). The 1st place in the challenge was taken by the work of Chen and Wang (2019), and BERT-DPT outperforms it by 6% on $R_{100}@1$.

Variable Fine-Tuning Ubuntu Advising
All layers 0.907 0.395
Top 10 layers 0.908 0.393
Top 8 layers 0.909 0.389
Top 6 layers 0.909 0.400
Top 4 layers 0.910 0.392
Top 2 layers 0.907 0.387
Only Embedding Layers 0.904 0.386
Table 3: Variable fine-tuning in BERT. All results are evaluated using mean reciprocal rank (MRR).

Houlsby et al. (2019) proposed an efficient fine-tuning approach that only updates a few top layers of BERT. They suggested that fine-tuning all BERT layers may be sub-optimal on small data sets. Inspired by this assumption, we conduct an experiment (Table 3) varying the number of top layers $m$ that are fine-tuned during training. The models achieve the best performance when $m = 4$ and $m = 6$ for Ubuntu and Advising, respectively. We demonstrate that this technique is effective not only on small sets but also on domain-specific sets.
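One way to realize variable fine-tuning is to filter the list of trainable variables by encoder layer index, as sketched below. The variable-name pattern `bert/encoder/layer_<i>/...` follows the naming used in the public TensorFlow BERT checkpoints; the 12-layer depth, the choice to keep the pooler and the task head trainable, and the helper name `select_trainable` are assumptions made for this illustration.

```python
import re
from typing import List


def select_trainable(var_names: List[str], top_m: int, num_layers: int = 12) -> List[str]:
    """Keep only variables in the top `top_m` encoder layers, the pooler, and the task head."""
    first_trainable = num_layers - top_m
    selected = []
    for name in var_names:
        match = re.search(r"encoder/layer_(\d+)/", name)
        if match:
            if int(match.group(1)) >= first_trainable:
                selected.append(name)
        elif not name.startswith("bert/embeddings"):
            # Pooler and classifier weights sit above the encoder stack.
            selected.append(name)
    return selected
```

The returned names could then be used to restrict which variables the optimizer updates, for example with `top_m=4` for Ubuntu and `top_m=6` for Advising, matching the best rows of Table 3.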
We also examine how effective data augmentation is for both data sets. Originally, both training sets contain positive and negative responses at a ratio of 1:1. We change this ratio by increasing the number of negative samples. The negative samples are selected heuristically, and the reported results use the best-performing ratio. Augmentation improves performance on Ubuntu, but not on Advising. We hypothesize that this is because the Advising Corpus, unlike Ubuntu, is created from a number of sub-dialogs extracted from the original dialogs. On Ubuntu, we obtain a 0.3% higher $R_{10}@1$ compared with the model trained on the original set. We use the same experimental conditions as the BERT-VFT model that shows the best performance.
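A simple form of this augmentation is to pair each context with several randomly drawn responses from other dialogs, as in the sketch below. It assumes uniform random sampling from the pool of training responses, which is only one possible heuristic; the function name `augment_negatives` and the `num_negatives` parameter are hypothetical.

```python
import random
from typing import List, Tuple


def augment_negatives(pairs: List[Tuple[str, str]], num_negatives: int,
                      rng: random.Random) -> List[Tuple[str, str, int]]:
    """Build (context, response, label) triples with `num_negatives` negatives per positive."""
    all_responses = [response for _, response in pairs]
    examples = []
    for context, positive in pairs:
        examples.append((context, positive, 1))
        # Draw negatives from responses that belong to other contexts.
        pool = [r for r in all_responses if r != positive]
        for negative in rng.sample(pool, min(num_negatives, len(pool))):
            examples.append((context, negative, 0))
    return examples
```

With `num_negatives=1` this reproduces the original 1:1 setting; larger values increase the share of negative samples.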

5 Conclusion

In this paper, a highly effective post-training method for a multi-turn response selection is proposed and evaluated. Our approach achieved new state-of-the-art performance for two response selection benchmark data sets, Ubuntu Corpus V1 and Advising Corpus.


References

  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
  • D. Chaudhuri, A. Kristiadi, J. Lehmann, and A. Fischer (2018) Improving response selection in multi-turn dialogue systems by incorporating domain knowledge. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 497–507.
  • Q. Chen and W. Wang (2019) Sequential attention-based network for noetic end-to-end response selection. In 7th Edition of the Dialog System Technology Challenges at AAAI 2019.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1657–1668.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. Computing Research Repository arXiv:1810.04805.
  • J. Dong and J. Huang (2018) Enhance word representation for out-of-vocabulary on Ubuntu dialogue corpus. Computing Research Repository arXiv:1802.02614.
  • J. Gu, Z. Ling, and Q. Liu (2019) Interactive matching network for multi-turn response selection in retrieval-based chatbots. Computing Research Repository arXiv:1901.01824.
  • N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. Computing Research Repository arXiv:1902.00751.
  • R. Kadlec, M. Schmid, and J. Kleindienst (2015) Improved deep learning baselines for Ubuntu corpus dialogs. Computing Research Repository arXiv:1510.03753.
  • W. S. Lasecki (2019) DSTC7 task 1: noetic end-to-end response selection. In 7th Edition of the Dialog System Technology Challenges at AAAI 2019.
  • R. Lowe, N. Pow, I. Serban, and J. Pineau (2015) The Ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, Czech Republic, pp. 285–294.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237.
  • C. Tao, W. Wu, C. Xu, W. Hu, D. Zhao, and R. Yan (2019) Multi-representation fusion network for multi-turn response selection in retrieval-based chatbots. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 267–275. ISBN 978-1-4503-5940-5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • J. Vig and K. Ramea (2019) Comparison of transfer-learning approaches for response selection in multi-turn conversations.
  • E. M. Voorhees et al. (1999) The TREC-8 question answering track report. In TREC, Vol. 99, pp. 77–82.
  • Y. Wu, W. Wu, C. Xing, M. Zhou, and Z. Li (2017) Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 496–505.
  • H. Xu, B. Liu, L. Shu, and P. S. Yu (2019) BERT post-training for review reading comprehension and aspect-based sentiment analysis. Computing Research Repository arXiv:1904.02232.
  • R. Yan, Y. Song, and H. Wu (2016) Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’16, pp. 55–64.
  • Z. Zhang, J. Li, P. Zhu, H. Zhao, and G. Liu (2018) Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3740–3752.
  • X. Zhou, D. Dong, H. Wu, S. Zhao, D. Yu, H. Tian, X. Liu, and R. Yan (2016) Multi-view response selection for human-computer conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 372–381.
  • X. Zhou, L. Li, D. Dong, Y. Liu, Y. Chen, W. X. Zhao, D. Yu, and H. Wu (2018) Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1118–1127.

Appendix A Implementation Details

The models are implemented using the TensorFlow library Abadi et al. (2016). We use the uncased BERT model as the base code for our experiments. The batch size is set to 32 and the maximum sequence length is set to 320, specifically 280 for a dialog context and 40 for a response. We post-train the model on Ubuntu Corpus V1 and Advising Corpus for 200,000 steps and 100,000 steps, respectively. The model is optimized using the Adam weight decay optimizer with a learning rate of 3e-5.
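For reference, the hyperparameters listed above can be collected in a small configuration object. This is only a convenience sketch: the class and field names are hypothetical, while the values are those stated in this appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters reported in Appendix A (field names are illustrative)."""
    batch_size: int = 32
    max_context_length: int = 280    # tokens reserved for the dialog context
    max_response_length: int = 40    # tokens reserved for the response
    learning_rate: float = 3e-5
    post_train_steps_ubuntu: int = 200_000
    post_train_steps_advising: int = 100_000

    @property
    def max_sequence_length(self) -> int:
        """Total input length: context budget plus response budget."""
        return self.max_context_length + self.max_response_length
```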