BERTSel: Answer Selection with Pre-trained Models

05/18/2019 ∙ by Dongfang Li et al. ∙ Harbin Institute of Technology

Recently, pre-trained models have become the dominant paradigm in natural language processing, achieving remarkable state-of-the-art performance across a wide range of related tasks such as textual entailment, natural language inference, and question answering. BERT, proposed by Devlin et al., has achieved markedly better results on the GLUE leaderboard with a deep transformer architecture. Despite its soaring popularity, however, BERT has not yet been applied to answer selection. This task differs from others in a few nuances: first, modeling the relevance and correctness of candidates matters more than semantic relatedness and syntactic structure; second, the length of an answer may differ substantially from that of other candidates and of the question. In this paper, we are the first to explore the performance of fine-tuning BERT for answer selection. We achieve state-of-the-art results across five popular datasets, demonstrating the success of pre-trained models on this task.


1 Introduction

Answer selection is the task of finding which of the candidates can answer a given question correctly. Since the emergence and development of deep learning methods for this task, impressive results have been achieved without relying on feature engineering or external knowledge bases Madabushi et al. (2018); Tay et al. (2018b); Sha et al. (2018); Tay et al. (2018a); Gonzalez et al. (2018); Wu et al. (2018). However, many of these approaches are based on shallow pre-trained word embeddings and task-specific network structures. Recently, pre-trained language representation models such as ELMo, GPT, and BERT Peters et al. (2018); Radford et al. (2018); Devlin et al. (2018) have achieved state-of-the-art performance on many natural language processing tasks. In general, BERT is first pre-trained on vast amounts of text with expensive computational resources, using two unsupervised objectives: masked language modeling and next-sentence prediction. The pre-trained network is then fine-tuned on task-specific labeled data. However, BERT has not yet been fine-tuned for answer selection. In this paper, we explore fine-tuning BERT for this task.

Figure 1: The fine-tuning method we use for the answer selection task.

Briefly, our contributions are:

  • We explore the BERT pre-trained model to address the poor generalization capability of existing answer selection models.

  • We conduct extensive experiments to analyze the effectiveness of BERT on this task, and achieve state-of-the-art results on five benchmark datasets.

Dataset #Train #Dev #Test #TrainPairs
TrecQA 1229 65 68 504421
WikiQA 873 126 243 8995
YahooQA 50112 6289 6283 253440
SemEvalcQA-16 4879 244 327 75181
SemEvalcQA-17 4879 244 293 75181
Table 1: Statistics of the answer selection datasets. #Train, #Dev and #Test give the number of questions per split; #TrainPairs gives the number of question-answer training pairs.
TrecQA WikiQA YahooQA SemEvalcQA-16 SemEvalcQA-17
MRR MAP MRR MAP MRR MAP MRR MAP MRR MAP
epoch=3 0.927 0.877 0.770 0.753 0.942 0.942 0.872 0.810 0.951 0.909
epoch=5 0.944 0.883 0.784 0.769 0.942 0.942 0.890 0.816 0.953 0.908
SOTA 0.865 0.904 0.758 0.746 - 0.801 0.872 0.801 0.926 0.887
Table 2: Results of BERT on the test sets of five datasets with different numbers of training epochs. The SOTA results are from Madabushi et al. (2018) (TrecQA), Sha et al. (2018) (WikiQA, SemEvalcQA-16), Tay et al. (2018b) (YahooQA), and Nakov et al. (2017) (SemEvalcQA-17).
TrecQA WikiQA YahooQA SemEvalcQA-16 SemEvalcQA-17
base large base large base large base large base large
MRR 0.927 0.961 0.770 0.875 0.942 0.938 0.872 0.911 0.951 0.958
MAP 0.877 0.904 0.753 0.860 0.942 0.938 0.810 0.844 0.909 0.907
Table 3: Results of BERT-base and BERT-large on the test sets of five datasets. The number of training epochs is 3.

2 Method

Our method is based on the pairwise approach, as depicted in Figure 1, which takes a pair of candidate answer sentences and explicitly learns to predict which sentence is more relevant to the question. Given a question q, a positive answer a^+ and a sampled negative answer a^-, the model inputs are triples (q, a^+, a^-). For fine-tuning, we split each triple into (q, a^+) and (q, a^-), and send them to BERT to obtain the respective [CLS] embeddings. A fully connected layer and a sigmoid function are then applied to each output to obtain the final score.
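
As a concrete illustration of this pairwise scoring scheme, the following is a minimal sketch built on the Hugging Face transformers library; the class name, example sentences, and model choice are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumption: Hugging Face transformers + PyTorch are available).
# Each (question, answer) pair is encoded jointly; the [CLS] embedding is passed
# through a fully connected layer and a sigmoid to produce a relevance score.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class BertSelScorer(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.fc = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls_embedding = out.last_hidden_state[:, 0]          # [CLS] token
        return torch.sigmoid(self.fc(cls_embedding)).squeeze(-1)


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
scorer = BertSelScorer()

question = "who wrote the declaration of independence"
pos_answer = "Thomas Jefferson was the principal author of the Declaration."
neg_answer = "The Eiffel Tower was completed in 1889."

# Encode (q, a+) and (q, a-) separately, as in the pairwise setup above.
batch = tokenizer([question, question], [pos_answer, neg_answer],
                  padding=True, truncation=True, return_tensors="pt")
scores = scorer(**batch)   # scores[0]: positive pair, scores[1]: negative pair
```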

3 Experiments

3.1 Datasets

TrecQA Wang et al. (2007), WikiQA Yang et al. (2015), YahooQA Tay et al. (2017) and SemEval cQA task Nakov et al. (2016, 2017) have been widely used for benchmarking answer selection models. Table 1 summarizes the statistics of the datasets.

3.2 Loss Function

The training objective of our model consists of two parts. The first part is the cross-entropy loss over positive and negative examples; the second part is a hinge loss. Adding the two, our final loss function is obtained as follows:

L = λ_1 · ( −log s^+ − log(1 − s^-) ) + λ_2 · max(0, 1 − s^+ + s^-)    (1)

where s^+ and s^- denote the predicted scores of the positive and negative answers, and λ_1 and λ_2 are weighting parameters.
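
For concreteness, the following is a small PyTorch sketch of the combined objective in Eq. (1); the default weights and the margin of 1 are illustrative hyperparameters, not values reported by the authors.

```python
# Sketch of the combined loss: binary cross-entropy on the pairwise scores
# plus a margin-based hinge term (weights and margin are assumed defaults).
import torch

def bertsel_loss(pos_scores, neg_scores, lambda_ce=1.0, lambda_hinge=1.0, margin=1.0):
    """pos_scores / neg_scores: sigmoid outputs in (0, 1) for positive / negative answers."""
    eps = 1e-8
    ce = -(torch.log(pos_scores + eps) + torch.log(1.0 - neg_scores + eps)).mean()
    hinge = torch.clamp(margin - pos_scores + neg_scores, min=0.0).mean()
    return lambda_ce * ce + lambda_hinge * hinge
```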

3.3 Performance Measure

The performance of an answer selection system is measured by Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP), which are standard metrics in information retrieval and question answering. Given a set of questions Q, MRR is calculated as follows:

MRR = (1/|Q|) · Σ_{i=1..|Q|} 1 / rank_i    (2)

where rank_i refers to the rank position of the first correct candidate answer for the i-th question. In other words, MRR is the average of the reciprocal ranks of the results for the questions in Q. On the other hand, if the set of correct candidate answers for a question q_j ∈ Q is {a_1, ..., a_{m_j}} and R_{j,k} is the set of ranked retrieval results from the top result down to the answer a_k, then MAP is calculated as follows:

MAP = (1/|Q|) · Σ_{j=1..|Q|} (1/m_j) · Σ_{k=1..m_j} Precision(R_{j,k})    (3)

When a relevant answer is not retrieved at all for a question, the precision value for that question in the above equation is taken to be 0. Whereas MRR considers only the rank of the first correct answer, MAP examines the ranks of all the correct answers.
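
The two metrics are straightforward to compute from ranked candidate lists; the following self-contained Python sketch uses our own input convention (per question, candidate labels sorted by predicted score, 1 for correct and 0 for incorrect) rather than any standard evaluation script.

```python
# MRR and MAP as in Eqs. (2)-(3). Questions with no correct answer contribute 0.
def mrr(ranked_labels_per_question):
    total = 0.0
    for labels in ranked_labels_per_question:
        for rank, label in enumerate(labels, start=1):
            if label == 1:
                total += 1.0 / rank      # reciprocal rank of the first correct answer
                break
    return total / len(ranked_labels_per_question)

def mean_average_precision(ranked_labels_per_question):
    total = 0.0
    for labels in ranked_labels_per_question:
        hits, precisions = 0, []
        for rank, label in enumerate(labels, start=1):
            if label == 1:
                hits += 1
                precisions.append(hits / rank)   # precision at each correct answer
        total += sum(precisions) / hits if hits else 0.0
    return total / len(ranked_labels_per_question)

# Example: two questions, candidates already sorted by model score.
print(mrr([[0, 1, 0], [1, 0]]))                     # (1/2 + 1) / 2 = 0.75
print(mean_average_precision([[0, 1, 1], [1, 0]]))  # ((1/2 + 2/3)/2 + 1) / 2 ≈ 0.79
```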

3.4 Results

We show the main results in Tables 2 and 3. Despite training on only a fraction of the available data, the proposed BERT-based models surpass the previous state-of-the-art models by a large margin on all datasets.

4 Conclusion

Answer selection is an important problem in natural language processing, and many deep learning methods have been proposed for the task. In this paper, we have described a simple adaptation of BERT that achieves state-of-the-art results on five datasets.

References

  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
  • Gonzalez et al. (2018) Ana Gonzalez, Isabelle Augenstein, and Anders Søgaard. 2018. A strong baseline for question relevancy ranking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4810–4815.
  • Madabushi et al. (2018) Harish Tayyar Madabushi, Mark Lee, and John Barnden. 2018. Integrating question classification and deep learning for improved answer selection. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 3283–3294.
  • Nakov et al. (2017) Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, and Karin Verspoor. 2017. Semeval-2017 task 3: Community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, pages 27–48.
  • Nakov et al. (2016) Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, Abed Alhakim Freihat, Jim Glass, and Bilal Randeree. 2016. Semeval-2016 task 3: Community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016, pages 525–545.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf.
  • Sha et al. (2018) Lei Sha, Xiaodong Zhang, Feng Qian, Baobao Chang, and Zhifang Sui. 2018. A multi-view fusion neural network for answer selection. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5422–5429.
  • Tay et al. (2017) Yi Tay, Minh C. Phan, Anh Tuan Luu, and Siu Cheung Hui. 2017. Learning to rank question answer pairs with holographic dual LSTM architecture. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017, pages 695–704.
  • Tay et al. (2018a) Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018a. Cross temporal recurrent networks for ranking question answer pairs. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5512–5519.
  • Tay et al. (2018b) Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018b. Hyperbolic representation learning for fast and efficient neural question answering. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pages 583–591.
  • Wang et al. (2007) Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pages 22–32.
  • Wu et al. (2018) Wei Wu, Xu Sun, and Houfeng Wang. 2018. Question condensing networks for answer selection in community question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1746–1755.
  • Yang et al. (2015) Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. Wikiqa: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2013–2018.