Unsupervised Domain Adaptation of Contextual Embeddings for Low-Resource Duplicate Question Detection

11/06/2019 ∙ by Alexandre Rochette, et al. ∙ Microsoft

Answering questions is a primary goal of many conversational systems or search products. While most current systems have focused on answering questions against structured databases or curated knowledge graphs, on-line community forums or frequently asked questions (FAQ) lists offer an alternative source of information for question answering systems. Automatic duplicate question detection (DQD) is the key technology needed for question answering systems to utilize existing online forums like StackExchange. Existing annotations of duplicate questions in such forums are community-driven, making them sparse or even completely missing for many domains. Therefore, it is important to transfer knowledge from related domains and tasks. Recently, contextual embedding models such as BERT have been outperforming many baselines by transferring self-supervised information to downstream tasks. In this paper, we apply BERT to DQD and advance it by unsupervised adaptation to StackExchange domains using self-supervised learning. We show the effectiveness of this adaptation for low-resource settings, where little or no training data is available from the target domain. Our analysis reveals that unsupervised BERT domain adaptation on even small amounts of data boosts the performance of BERT.




1 Introduction

Answering questions is a primary goal of many conversational systems or search products. While most current systems have focused on answering questions against structured databases or curated knowledge graphs, on-line community forums or frequently asked question (FAQ) lists offer an alternative source of information for question answering systems to utilize. For example, StackExchange (SE) is a popular community forum website containing posted questions and community-supplied answers that span many different domains (https://stackexchange.com/). To take advantage of such data sets, conversational systems need the ability to recognize when a question asked by a user is semantically identical to a previously asked and answered question contained within a forum site or FAQ list.

In this paper, we address the problem of duplicate question detection (DQD) [3, 8, 10, 11]. There are two main application scenarios of DQD in the forums: (i) given a database of questions, find the semantically equivalent duplicates; (ii) given a question as a query, rank the database of questions based on their pair-wise similarity to the query. Both cases are important for efficient information seeking.

To learn DQD models, we need question pairs annotated with duplicate labels. In SE, expert users provide these annotations voluntarily. In practice, the annotations are sparse or even missing for many domains. It is therefore important to transfer knowledge from related tasks and domains in order to perform DQD in new domains. With deep learning models, such domain adaptation is typically achieved with various forms of transfer learning (i.e., fine-tuning models learned from other tasks and domains to new tasks and domains). Adversarial domain adaptation [5, 1] has also been successfully applied to multiple domains to substantially improve cross-domain generalization in DQD [13].

Recently, contextualized word embeddings such as ELMo [12] and BERT [4], trained on large data, have demonstrated remarkable performance gains across many NLP tasks. In our experiments, we use BERT as the base model that serves as the input into our duplicate question detection task model. To improve low-resource DQD, both within- and cross-domain, we address learning a domain-adapted DQD task model within a two stage approach starting from pretrained BERT. This process is depicted in Figure 1.

In the first stage, because BERT is pretrained on general purpose text from Wikipedia and books which are largely different from user generated posts in the SE domains, we explore unsupervised domain adaptation of the base BERT model to a new domain. The adaptation of BERT to scientific [2], biomedical [9], and historical English [6] domains has previously shown promising performance. Following a similar approach, we adapt the existing pretrained BERT model using self-supervised objectives of masked language modeling and next sentence prediction on unlabeled data from StackExchange. In the second stage, the BERT-adapted model on the target domain is then finetuned on the DQD task data and objectives to train the BERT task model.

We experiment on four domains related to computer systems and two other non-related domains, and show the effectiveness of our proposed approach. Our main findings are as follows: (i) We show that BERT helps the task of DQD in SE domains and outperforms the previous LSTM-based baseline. (ii) Our unsupervised adaptation of BERT on unlabeled domain data improves the results substantially, especially in low-resource settings where less labeled training data is available. (iii) We show that adapting BERT on even a small amount of unlabeled data from target domains is very effective. (iv) We demonstrate that unsupervised adaptation of BERT on a large number of diverse SE domains further improves performance for a variety of target domains.

2 Unsupervised Domain Adaptation of Contextual Embeddings: Background

Contextual embeddings are pretrained on large, topically-diverse text to learn generic representations useful for various downstream tasks. However, the effectiveness of these models decreases as the mismatch between the pretraining material and the task domains increases. To alleviate this issue, ULMFiT [7] trains LSTM models on generic unsupervised data, but then fine-tunes them on domain data before training task-specific models from them. When large amounts of unlabeled domain data are available, this domain adaptation step provides significant improvements even with only small amounts (1) of labeled training examples. We propose a similar process but adopt BERT [4] and apply it to our DQD task. We find that adaptation, even on little unlabeled data from the target domain, is effective.

Adapting BERT to target domains has also been studied in several recent works. BioBERT [9] shows significant improvement for a suite of biomedical tasks by fine-tuning BERT on large biomedical data. SciBERT [2] does the same for scientific domains. AdaptaBERT [6] fine-tunes BERT on unlabeled historical English text in a domain adaptation scenario where training data comes from contemporary corpora, and significantly improves BERT for POS tagging in this setting. Similar to this line of related work, we also fine-tune BERT on domain-specific data, but for the task of DQD in community forums. We complement this line of research with novel findings. For example, to the best of our knowledge, our scenario of unsupervised domain adaptation using small unlabeled domain data has not been addressed in any prior work on BERT (or other contextual models). This is a critical result given the importance of contextual models, because it shows that for low-resource domains there is substantial room for improvement even when only a small amount of unlabeled data is available. Also, for the cross-domain setting, we are the first to apply BERT to a semantic task (i.e., DQD). The other work [6] addresses POS tagging, a low-level task in NLP.

Figure 1: Our training process.

3 BERT for Duplicate Question Detection in Community Forums

The task is to determine whether two questions are duplicates, based on their semantic similarity. To compute the similarity, we follow a setup similar to BERT's sequence pair classification experiments in [4]. First, we learn a representation for the pair of questions. Then, a linear binary classifier with cross entropy loss is applied to the representation.

A question includes a title and a body. To obtain a representation for each pair of questions $(q_1, q_2)$, we compute two vectors: one for the titles and one for the bodies. Each vector is generated from the last layer's [CLS] token embedding, starting from an input of the form [CLS] $x_1$ [SEP] $x_2$ [SEP], where $x_i$ takes the value of the title or the body for question $q_i$. [CLS] and [SEP] are special tokens in BERT. Similar to [13], the two vectors are then summed to produce a single representation for the pair of questions.
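The pair representation and classifier described in this section can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the vectors stand in for the [CLS] embeddings that a BERT encoder would produce, and modelling the linear binary classifier as a single logit with a sigmoid is an assumption.

```python
import math

CLS, SEP = "[CLS]", "[SEP]"


def build_pair_input(text_a, text_b):
    """Format two text fields (two titles or two bodies) as one BERT-style sequence pair."""
    return f"{CLS} {text_a} {SEP} {text_b} {SEP}"


def sum_vectors(title_vec, body_vec):
    """Combine the title-pair and body-pair vectors by element-wise summation."""
    return [t + b for t, b in zip(title_vec, body_vec)]


def duplicate_probability(pair_vec, weights, bias):
    """Linear classifier on the pair representation (assumed: one logit + sigmoid)."""
    logit = sum(w * x for w, x in zip(weights, pair_vec)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def cross_entropy(prob, label):
    """Binary cross entropy for one example; label is 1 (duplicate) or 0."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1.0 - prob + eps))


# Toy usage: the two vectors below are placeholders for encoder outputs.
title_input = build_pair_input("wifi drops after suspend", "no wifi after sleep")
pair_vec = sum_vectors([0.2, -0.1], [0.4, 0.3])
loss = cross_entropy(duplicate_probability(pair_vec, [1.0, -1.0], 0.0), 1)
```

In a real system the encoder would be run twice per question pair, once on the title input and once on the body input, and the classifier weights would be trained jointly with the encoder during task fine-tuning.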

3.1 Low-Resource Settings

We aim to improve generalization of DQD in low-resource settings. In SE, many domains have limited data, e.g., “Android” domain has 48,490 posts with around 2,000 duplicate annotations. Using domains in SE such as “AskUbuntu” or “Android”, we propose procedures to improve within- and cross-domain generalization for DQD. Our primary setup has small or no labeled within-domain data, limited unlabeled within-domain data, and large out-of-domain unlabeled data. When there is no within-domain labeled data, we use labeled data from other domains to learn the task.

3.2 Unsupervised BERT Domain Adaptation

Previous work in DQD [13] has addressed domain adaptation when no within-domain training data is available. They train randomly initialized BiLSTMs, with pretrained word embeddings on SE as input, and apply adversarial objectives for domain adaptation. We instead focus on transfer learning to adapt pre-trained models to new target domains. Our models are based on BERT, which has proven effective for many different NLP tasks. BERT is pre-trained on unlabeled data using two self-supervised objectives: masked language modeling (MLM) and next sentence prediction (NSP). The corpora BERT is trained on are 2.5B tokens of Wikipedia and 0.8B tokens of BookCorpus [14].

Our domains are taken from SE, which includes various topics, such as sports, travel, food, programming, etc. The posts are written by diverse internet users, with variations in their vocabulary and syntax compared to BERT’s pretraining corpora written and edited mostly by professionals. Therefore, we adapt BERT by fine-tuning it on SE posts using the same self-supervised objectives of MLM and NSP. We refer to this model as BERT-adapted as opposed to the original BERT model.
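To make the MLM objective used for adaptation concrete, the sketch below shows how one training example could be corrupted. The 15% selection rate and the 80/10/10 replacement split follow the BERT paper [4]; this paper does not restate them, so treat them as an assumption about the adaptation setup. The function name and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Special tokens are never selected for prediction.
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


post = "[CLS] how do i mount an ntfs drive [SEP] it fails at boot [SEP]".split()
corrupted, labels = mask_tokens(post, vocab=["ubuntu", "kernel", "usb"], seed=13)
```

Adaptation then continues BERT's pretraining on SE posts with this corruption (plus NSP), so no duplicate labels are needed at this stage.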

4 Experiments

4.1 Baseline Experiments

Dataset Unsupervised Train Dev Test
AskUbuntu 305,769 9,106 1,000 1,000
SuperUser 390,378 9,106 1,000 1,000
Apple 93,399 2,000 1,000 1,000
Android 48,490 - 1,000 1,000
Table 1: Datasets statistics. The Unsupervised column indicates the number of questions. Train, Test and Dev columns specify the number of positive duplicates.

We use the DQD datasets of [13] as well as their training and testing protocol. Our target domains are AskUbuntu, Android, Apple, and SuperUser. The data contains pairs of questions. The positive examples are taken from the duplicate marks in SE. Unlabeled examples are extracted from the SE dumps (https://archive.org/details/stackexchange). We append the body to the title for each question and make that a contiguous paragraph for BERT self-supervised adaptation. (We use the PyTorch implementation: https://github.com/huggingface/pytorch-pretrained-BERT.) Some statistics are shown in Table 1. For unsupervised adaptation of BERT, we also use unlabeled data sets from additional SE domains. Along with the datasets for the specific target domains, we craft two additional data sets from 20 and from 33 different StackExchange domains (including the four target domains). We list the selected 20 and 33 domains with total data amounts in the Appendix.

Since the annotations are incomplete, [13] propose to use AUC as the metric for DQD performance because it is more robust against false negatives. They report the normalized AUC@0.05, which is the area under the curve of the true positive rate as a function of the false positive rate ($fpr$), from $fpr = 0$ to $fpr = 0.05$. We follow the same protocol and use the AUC@0.05 metric.
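A minimal sketch of the metric may help. It assumes that "normalized" means dividing the partial area by the false-positive-rate limit, so that a perfect ranking scores 1.0; the reference implementation of [13] may differ in details such as tie handling. The function name is hypothetical.

```python
def normalized_auc(labels, scores, max_fpr=0.05):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    labels: 1 for duplicate pairs, 0 for non-duplicates.
    scores: higher means "more likely duplicate".
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    # Build ROC points, moving through tied scores in a single step.
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / num_neg, tp / num_pos))
        i = j

    # Trapezoidal integration, clipping the last segment at max_fpr.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr
```

Restricting the curve to low false positive rates focuses the evaluation on the top of the ranking, which is the region least affected by unlabeled duplicates that are counted as negatives.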

ln Source Target Baseline BERT Target 20d 33d 20d-noTarget 20d-noNSP 20d-frozen
1 Apple Android .764 .826 .883 .910 .919 .889 .866 .771
2 AskUbuntu Android .790 .810 .907 .883 .909 .849 .863 .830
3 SuperUser Android .790 .849 .908 .907 .914 .886 .891 .805
4 AskUbuntu Apple .855 .857 .927 .931 .942 .916 .916 .889
5 SuperUser Apple .861 .881 .939 .943 .954 .931 .929 .887
6 Apple Apple .976 .982 .989 .988 .991 .991 .987 .870
7 Apple AskUbuntu .756 .683 .864 .872 .897 .812 .833 .767
8 SuperUser AskUbuntu .796 .779 .870 .874 .891 .826 .835 .812
9 AskUbuntu AskUbuntu .858 .899 .923 .920 .942 .921 .924 .845
10 Apple SuperUser .873 .925 .965 .968 .973 .959 .958 .916
11 AskUbuntu SuperUser .911 .932 .964 .966 .975 .957 .955 .937
12 SuperUser SuperUser .930 .958 .967 .974 .977 .968 .967 .937
13 Avg (Source ≠ Target) .822 .838 .914 .917 .930 .892 .897 .846
14 Avg (Source = Target) .921 .946 .960 .961 .970 .960 .959 .884
15 Overall Avg .847 .865 .926 .928 .940 .909 .912 .855
Table 2: AUC@0.05 for the baseline [13], BERT, and BERT-adapted models. BERT-adapted models are shown by the domain(s) they adapted on. Results of BERT-adapted on 20d-noTarget and Target in each row correspond to one of the four target models, depending on the row’s target domain. The rows with white (gray) background belong to cross(within)-domain experiments.
Figure 2: AUC@0.05 vs training set size for AskUbuntu→AskUbuntu (a) and SuperUser→SuperUser (b). After 50%, all graphs remain flat.
Figure 3: AUC@0.05 vs unsupervised data size for the Apple→AskUbuntu cross-domain scenario.

Results. Table 2 shows the main results for 12 different combinations (3 within- and 9 cross-domain) of training a DQD model using a source domain and testing on a target domain, with early stopping on the dev set of the target domain, similar to [13]. We also add three average rows for cross-domain (line 13), within-domain (line 14), and overall (line 15). We use the model of [13] as our baseline. For the scenarios not covered in the baseline's paper, we run their published code. (To confirm, we were able to replicate the paper's results using their published code.)
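As a quick check on how the three average rows relate to the twelve scenarios, the sketch below recomputes them for the BERT column of Table 2. The values are copied from the table; the variable names are only illustrative.

```python
# AUC@0.05 of the BERT column in Table 2, keyed by (source, target).
bert_auc = {
    ("Apple", "Android"): 0.826,
    ("AskUbuntu", "Android"): 0.810,
    ("SuperUser", "Android"): 0.849,
    ("AskUbuntu", "Apple"): 0.857,
    ("SuperUser", "Apple"): 0.881,
    ("Apple", "Apple"): 0.982,
    ("Apple", "AskUbuntu"): 0.683,
    ("SuperUser", "AskUbuntu"): 0.779,
    ("AskUbuntu", "AskUbuntu"): 0.899,
    ("Apple", "SuperUser"): 0.925,
    ("AskUbuntu", "SuperUser"): 0.932,
    ("SuperUser", "SuperUser"): 0.958,
}


def mean(values):
    return sum(values) / len(values)


cross = [auc for (src, tgt), auc in bert_auc.items() if src != tgt]
within = [auc for (src, tgt), auc in bert_auc.items() if src == tgt]

cross_avg = round(mean(cross), 3)                      # line 13
within_avg = round(mean(within), 3)                    # line 14
overall_avg = round(mean(list(bert_auc.values())), 3)  # line 15
print(cross_avg, within_avg, overall_avg)
```

The overall average weights every scenario equally, so the nine cross-domain scenarios dominate it.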

For BERT, we use the BASE-CASED pretrained model with a fixed set of hyperparameters for task fine-tuning.

As Table 2 demonstrates, BERT outperforms the baseline on average. However, it is worse in two cases (lines 7 & 8). BERT-adapted on the target domain is better than BERT and the baseline in all rows. It substantially improves BERT for cross-domain (.838 to .914). This indicates that, through its unsupervised training on unlabeled target data, BERT-adapted learns better representations for the target-specific terms than BERT does. (We also check the effect of seeing the target's test examples in the unsupervised fine-tuning. We analyze BERT-adapted on Apple→AskUbuntu, and the results show that for the same size of data, whether or not the unlabeled test questions were included had no effect on the DQD performance.) BERT-adapted on 20 domains (20d) does not degrade the performance of BERT-adapted on the target domain on average. Note that this model is trained only once and used in all rows. We see further improvements by adding more domains: BERT-adapted on 33 domains outperforms all models consistently.

Excluding the row's target domain from 20d hurts the performance for the cross-domain cases (see the 20d-noTarget column). However, it still outperforms BERT, emphasizing that related unlabeled data is also beneficial for adaptation. This is also clear from the effectiveness of the 33 domains, which span more related data.

We also investigate the contribution of the next sentence prediction (NSP) objective to the performance of BERT-adapted. Our 20d-noNSP results show that removing NSP in adaptation only decreases the overall average from .928 to .912, still significantly higher than BERT (.865).

We also analyze what happens if we freeze the BERT-adapted parameters when we train the task (20d-frozen in Table 2). Only the final-layer DQD classifier parameters have to be learned in this case. Across the 12 scenarios, the frozen BERT-adapted results are significantly worse than their fine-tuned counterparts but, interestingly, they are almost as good as the fine-tuned BERT models (i.e., models where BERT parameters are also updated) which are not adapted to target domains.

4.2 Limited Labeled Data Scenario

In Figure 2, we vary the DQD training size and report the performance of the AskUbuntu→AskUbuntu (Figure 2(a)) and SuperUser→SuperUser (Figure 2(b)) scenarios, performing supervised DQD training starting from both the BERT and target-specific BERT-adapted models. The notation $S \to T$ indicates that we train on $S$ and evaluate on $T$. For both datasets, BERT-adapted models tuned on only 1% of labeled data achieve better results than the BERT models tuned on 10%. For AskUbuntu, the BERT-adapted model trained for DQD on only 1% of the labeled data even beats the BERT model trained for DQD on the full labeled data set. These results demonstrate that unsupervised adaptation of BERT to a domain significantly reduces the need for annotated data for supervised training of the DQD task. As we increase the training size, the gap between BERT and BERT-adapted shrinks. This confirms our assumption that adapting BERT is especially effective for scenarios with sparse annotated data.

4.3 Limited Unlabeled Data Scenario

In Figure 3, we show how the DQD performance in the Apple→AskUbuntu scenario improves when increasing the size of randomly sampled unlabeled data (number of questions) from only AskUbuntu or 33 different domains.

BERT-adapted models improve as more unlabeled data is added, but even small amounts (1) of unlabeled data are helpful. Another interesting observation is that, even though sampling unsupervised examples from the target domain appears to be superior, the ability to sample more data from the combined set of 33 domains closes the gap (see Figure 3) and eventually achieves superior performance (see Table 2).

Quora AskUbuntu Apple Android SuperUser Academia
Quora 1.000 - - - - -
AskUbuntu 0.123 1.000 - - - -
Apple 0.154 0.169 1.000 - - -
Android 0.134 0.128 0.240 1.000 - -
SuperUser 0.131 0.221 0.174 0.118 1.000 -
Academia 0.157 0.085 0.168 0.173 0.083 1.000
Table 3: Vocab Jaccard Index between datasets' question titles.

Dataset Unigram-Positive Unigram-Negative Bigram-Positive Bigram-Negative
AskUbuntu 0.160 0.030 0.049 0.003
Apple 0.170 0.027 0.053 0.003
Android 0.157 0.031 0.044 0.003
SuperUser 0.188 0.027 0.064 0.003
Academia 0.125 0.040 0.028 0.002
Quora 0.468 0.302 0.248 0.152
Table 4: Vocab Jaccard Index between question title pairs in each dataset. This shows how Quora's annotations differ from those of the StackExchange datasets in terms of lexical similarity between question pairs.
Source Target BERT Target Target-frozen
Academia AskUbuntu .601 .854 .758
Quora AskUbuntu .515 .609 .670
SuperUser AskUbuntu .779 .870 .821
AskUbuntu AskUbuntu .899 .923 .845
Table 5: AUC@0.05 DQD results on AskUbuntu when the source training data comes from a topically different domain (Academia and Quora) and when the non-duplicate distribution is quite different (Quora), compared with a related source domain (SuperUser) and the target domain itself (AskUbuntu).

4.4 Large Domain Variation Scenario

To examine the effectiveness of adaptation when the domain difference between source and target domains is large, we examine performance on AskUbuntu with two additional datasets for training: Quora and Academia. Academia is built from SE in the same way as the datasets in [13]. Quora is taken from [13]. The annotation of Quora comes from the released Quora question pairs dataset (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs).

In Table 3, we show the lexical similarity between questions in different datasets. We also measure the similarity of pairs for positive (duplicate) vs negative (non-duplicate) examples, shown in Table 4. Focusing on AskUbuntu, we see that Academia and Quora both have low similarity with AskUbuntu. This is clearest for Academia, where the vocab Jaccard Index is only 0.085. In addition to lexical variation with AskUbuntu, Quora has another significant difference: its negative pairs were deliberately selected to have high lexical overlap, while for AskUbuntu (and the other SE datasets here) the negative examples are chosen randomly. As a result, the unigram vocabulary Jaccard Index between duplicate/non-duplicate questions is much higher in Quora (0.468/0.302) than in Academia (0.125/0.040) and the other SE domains. This gives Quora's labeled annotations different distributional characteristics than those in SE. In other words, the labeling function of Quora is different from the labeling function used in the SE datasets.
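The Jaccard Index behind Tables 3 and 4 is the size of the intersection of two sets divided by the size of their union. The sketch below illustrates both uses: between the vocabularies of two datasets, and between the two titles of a question pair. Lowercasing and whitespace tokenization are assumptions made for illustration, since the preprocessing is not specified here, and the function names are hypothetical.

```python
def ngrams(text, n=1):
    """Set of word n-grams in a text (assumed: lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(set_a, set_b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def vocab_jaccard(titles_a, titles_b, n=1):
    """Dataset-level overlap (as in Table 3): compare the pooled n-gram vocabularies."""
    vocab_a = set().union(*(ngrams(t, n) for t in titles_a))
    vocab_b = set().union(*(ngrams(t, n) for t in titles_b))
    return jaccard(vocab_a, vocab_b)


def pair_jaccard(title_a, title_b, n=1):
    """Pair-level overlap (as in Table 4): compare the two titles of one question pair."""
    return jaccard(ngrams(title_a, n), ngrams(title_b, n))


unigram = pair_jaccard("how to install python", "how to remove python")       # 3 shared / 5 total
bigram = pair_jaccard("how to install python", "how to remove python", n=2)   # 1 shared / 5 total
```

Averaging `pair_jaccard` separately over duplicate and non-duplicate pairs would give positive and negative columns in the style of Table 4. A dataset whose negatives were chosen for high lexical overlap, as in Quora, would show a high value in the negative column.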

In Table 5, we show the results of domain adaptation from Quora and Academia to AskUbuntu, for BERT and BERT-adapted on AskUbuntu (fine-tuned or frozen BERT) as the target domain. To understand how these results are compared when more related domains are used for training, we also include the results of SuperUser (as the best performing source domain for AskUbuntu) and AskUbuntu itself.

As shown, the base BERT works poorly when either Academia or Quora is the source dataset for DQD training. Results for Academia improve substantially when BERT is adapted on AskUbuntu and fine-tuned for DQD on Academia training data. In fact, performance from training on Academia is only slightly worse than training on the more topically-similar domain of SuperUser (0.854 vs. 0.870). Freezing the BERT-adapted parameters before training for DQD on Academia decreases the performance, similar to the SuperUser and AskUbuntu results. For Quora, however, BERT-adapted results are still very low and, interestingly, freezing the BERT parameters performs better. We believe this is primarily caused by the difference in the labeling function used in Quora, i.e., the model trained on Quora learns a different definition of question similarity, which does not generalize well to SE.

In summary, our results show that for domain-adaptation DQD: (i) if the labeling function is similar, the behaviour for topically-different domains is similar to cases where domains are topically similar; (ii) when training data comes from a dataset with a different labeling function, it is better to freeze the BERT parameters and only update the classification layer.

5 Discussion and Conclusion

In this paper, we introduced a new process to improve the task of low-resource DQD. Our process includes two main steps: (i) domain adaptation of BERT on unlabeled data using its self-supervised objectives and (ii) fine-tuning the adapted BERT from step (i) on a DQD dataset using the supervised objective (cross-entropy). The main focus of the paper is on (i) and how it affects the task results. Through extensive evaluation with BERT, we showed that this adaptation greatly improves the generalization of DQD models when there is no or limited training data available for the target domain. Two main scenarios, within- and cross-domain, were addressed in the evaluation on five domains of StackExchange and also Quora.

Our results suggest that:

  • A combination of unsupervised training on a target domain with supervised training on a different source domain is an effective strategy for the DQD task.

  • Significantly less supervised DQD training data is needed if we first adapt BERT with unsupervised training to data from the target domain.

  • Unsupervised adaptation of BERT on even a small amount of target data yields better models.

  • If the annotation function of the source material is different than the target material, then domain adaptation faces issues which cannot be fixed by unsupervised adaptation.

As future work, we plan to apply our process to other similar scenarios; we believe our approach is generally effective, especially for low-resource applications. As another direction, applying other techniques such as adversarial domain adaptation to BERT while learning the task appears to be complementary to our method. Our initial attempts at this using simple approaches did not yield any improvement, but more investigation is needed, which we leave for future work.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223. Cited by: §1.
  • [2] I. Beltagy, A. Cohan, and K. Lo (2019) SciBERT: pretrained contextualized embeddings for scientific text. CoRR abs/1903.10676. Cited by: §1, §2.
  • [3] R. D. Burke, K. J. Hammond, V. Kulyukin, S. L. Lytinen, N. Tomuro, and S. Schoenberg (1997) Question answering from frequently asked question files: experiences with the faq finder system. AI magazine 18 (2), pp. 57–57. Cited by: §1.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. Cited by: §1, §2, §3.
  • [5] Y. Ganin and V. S. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1180–1189. Cited by: §1.
  • [6] X. Han and J. Eisenstein (2019) Unsupervised domain adaptation of contextualized embeddings: A case study in early modern english. CoRR abs/1904.02817. Cited by: §1, §2.
  • [7] J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 328–339. Cited by: §2.
  • [8] J. Jeon, W. B. Croft, and J. H. Lee (2005) Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 84–90. Cited by: §1.
  • [9] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. CoRR abs/1901.08746. Cited by: §1, §2.
  • [10] T. Lei, H. Joshi, R. Barzilay, T. S. Jaakkola, K. Tymoshenko, A. Moschitti, and L. Màrquez (2016) Semi-supervised question retrieval with gated convolutions. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 1279–1289. Cited by: §1.
  • [11] P. Nakov, L. Màrquez, A. Moschitti, W. Magdy, H. Mubarak, A. A. Freihat, J. Glass, and B. Randeree (2016) SemEval-2016 task 3: community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation, pp. 525–545. Cited by: §1.
  • [12] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2227–2237. Cited by: §1.
  • [13] D. Shah, T. Lei, A. Moschitti, S. Romeo, and P. Nakov (2018) Adversarial domain adaptation for duplicate question detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1056–1063. Cited by: §1, §3.2, §3, §4.1, §4.1, §4.1, §4.4, Table 2.
  • [14] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015-12) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision. Cited by: §3.2.

Appendix: List of domains

20 domains
(1,162,487 posts) academia android apple askubuntu aviation bitcoin boardgames christianity cooking cs gaming hinduism judaism linguistics mechanics meta.superuser philosophy politics superuser workplace
33 domains
(1,531,797 posts) 20 domains + anime astronomy bicycles biology buddhism chemistry cogsci crypto islam meta.stackexchange skeptics sports unix
List of the 20 and 33 domains used in our unsupervised BERT adaptation experiments.