1 Introduction
Answer selection is the task of identifying which of a set of candidate sentences correctly answers a given question. Since the emergence of deep learning methods for this task, impressive results have been obtained without relying on feature engineering or external knowledge bases (Madabushi et al., 2018; Tay et al., 2018b; Sha et al., 2018; Tay et al., 2018a; Gonzalez et al., 2018; Wu et al., 2018). However, many of these models are built on shallow pre-trained word embeddings and task-specific network structures. Recently, pre-trained language representation models such as ELMo, GPT, and BERT (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018) have achieved state-of-the-art performance on many natural language processing tasks. In general, BERT is first pre-trained on vast amounts of text, at considerable computational cost, with two unsupervised objectives: masked language modeling and next-sentence prediction. The pre-trained network is then fine-tuned on task-specific labeled data. However, BERT has not yet been fine-tuned for answer selection. In this paper, we explore fine-tuning BERT for this task.
Briefly, our contributions are:
- We explore the pre-trained BERT model to address the poor generalization capability of existing answer selection models.
- We conduct extensive experiments to analyze the effectiveness of BERT on this task, and achieve state-of-the-art results on five benchmark datasets.
Table 1: Statistics of the five answer selection datasets (number of questions per split and number of training pairs).

| Dataset | #Train | #Dev | #Test | #TrainPairs |
|---|---|---|---|---|
| TrecQA | 1229 | 65 | 68 | 504421 |
| WikiQA | 873 | 126 | 243 | 8995 |
| YahooQA | 50112 | 6289 | 6283 | 253440 |
| SemEvalcQA-16 | 4879 | 244 | 327 | 75181 |
| SemEvalcQA-17 | 4879 | 244 | 293 | 75181 |
Table 2: Test-set MRR and MAP of fine-tuned BERT-base after 3 and 5 training epochs, compared with previously reported state-of-the-art (SOTA) results.

| | TrecQA MRR | TrecQA MAP | WikiQA MRR | WikiQA MAP | YahooQA MRR | YahooQA MAP | SemEvalcQA-16 MRR | SemEvalcQA-16 MAP | SemEvalcQA-17 MRR | SemEvalcQA-17 MAP |
|---|---|---|---|---|---|---|---|---|---|---|
| epoch=3 | 0.927 | 0.877 | 0.770 | 0.753 | 0.942 | 0.942 | 0.872 | 0.810 | 0.951 | 0.909 |
| epoch=5 | 0.944 | 0.883 | 0.784 | 0.769 | 0.942 | 0.942 | 0.890 | 0.816 | 0.953 | 0.908 |
| SOTA | 0.865 | 0.904 | 0.758 | 0.746 | - | 0.801 | 0.872 | 0.801 | 0.926 | 0.887 |
Table 3: MRR and MAP of BERT-base versus BERT-large on the five datasets.

| | TrecQA base | TrecQA large | WikiQA base | WikiQA large | YahooQA base | YahooQA large | SemEvalcQA-16 base | SemEvalcQA-16 large | SemEvalcQA-17 base | SemEvalcQA-17 large |
|---|---|---|---|---|---|---|---|---|---|---|
| MRR | 0.927 | 0.961 | 0.770 | 0.875 | 0.942 | 0.938 | 0.872 | 0.911 | 0.951 | 0.958 |
| MAP | 0.877 | 0.904 | 0.753 | 0.860 | 0.942 | 0.938 | 0.810 | 0.844 | 0.909 | 0.907 |
2 Method
Our method follows the pairwise approach, as depicted in Figure 1: the model takes a pair of candidate answer sentences and explicitly learns to predict which of the two is more relevant to the question. Given a question $q$, a positive answer $a^{+}$, and a sampled negative answer $a^{-}$, the model input is the triple $(q, a^{+}, a^{-})$. For fine-tuning, we split the triple into the pairs $(q, a^{+})$ and $(q, a^{-})$ and feed each pair to BERT to obtain its [CLS] embedding. A fully connected layer followed by a sigmoid function is then applied to each [CLS] embedding to produce the final scores $s^{+}$ and $s^{-}$.
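To make the procedure concrete, a minimal sketch in PyTorch with the Hugging Face transformers library is shown below. The class name `PairwiseBertRanker`, the example sentences, and all hyperparameters are our own illustrative choices, not details taken from the paper.

```python
# Pairwise BERT scorer sketch: score (q, a+) and (q, a-) with a shared encoder.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class PairwiseBertRanker(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # Single fully connected layer on top of the [CLS] embedding.
        self.fc = nn.Linear(self.bert.config.hidden_size, 1)

    def score(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]              # [CLS] embedding
        return torch.sigmoid(self.fc(cls)).squeeze(-1)  # score in (0, 1)

    def forward(self, pos_inputs, neg_inputs):
        # Score the (question, positive) and (question, negative) pairs
        # separately with the same BERT encoder.
        return self.score(**pos_inputs), self.score(**neg_inputs)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = PairwiseBertRanker()
q = "what is the capital of France?"
pos = tokenizer(q, "Paris is the capital of France.", return_tensors="pt")
neg = tokenizer(q, "Berlin is the capital of Germany.", return_tensors="pt")
s_pos, s_neg = model(dict(pos), dict(neg))
```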
3 Experiments
3.1 Datasets
We evaluate on five benchmark datasets: TrecQA (Wang et al., 2007), WikiQA (Yang et al., 2015), YahooQA, SemEval-2016 cQA (Nakov et al., 2016), and SemEval-2017 cQA (Nakov et al., 2017). Table 1 reports their statistics.
3.2 Loss Function
The training objective of our model consists of two parts. The first part minimizes the cross-entropy of the positive and negative examples; the second part is a pairwise hinge loss. Adding the two gives the final loss function:
$$ \mathcal{L} = -\alpha \left[ \log s^{+} + \log\left(1 - s^{-}\right) \right] + \beta \max\left(0,\; M - s^{+} + s^{-}\right) \qquad (1) $$
where $s^{+}$ and $s^{-}$ denote the predicted scores of the positive and the negative answer, $M$ is the margin of the hinge loss, and $\alpha$, $\beta$ are weighting parameters.
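A small sketch of this objective is given below, assuming the scores produced by the pairwise model above. The function name `combined_loss` and the default values of `alpha`, `beta`, and `margin` are illustrative assumptions, not values reported in the paper.

```python
# Combined cross-entropy + hinge objective as in Eq. (1); hyperparameter
# defaults are placeholders only.
import torch

def combined_loss(s_pos, s_neg, alpha=1.0, beta=1.0, margin=0.5):
    eps = 1e-8
    # Cross-entropy term: push positive scores toward 1, negative toward 0.
    ce = -(torch.log(s_pos + eps) + torch.log(1.0 - s_neg + eps)).mean()
    # Hinge term: positive score should exceed the negative score by the margin.
    hinge = torch.clamp(margin - s_pos + s_neg, min=0.0).mean()
    return alpha * ce + beta * hinge
```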
3.3 Performance Measure
The performance of an answer selection system is measured by Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP), which are standard metrics in Information Retrieval and Question Answering. Given a set of questions $Q$, MRR is calculated as follows:
$$ \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} \qquad (2) $$
where $\mathrm{rank}_i$ refers to the rank position of the first correct candidate answer for the $i$-th question. In other words, MRR is the average of the reciprocal ranks of the results for the questions in $Q$. On the other hand, if the set of correct candidate answers for a question $q_j \in Q$ is $\{d_1, \ldots, d_{m_j}\}$ and $R_{jk}$ is the set of ranked retrieval results from the top result down to the answer $d_k$, then MAP is calculated as follows:
$$ \mathrm{MAP} = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk}) \qquad (3) $$
When no relevant answer is retrieved at all for a question, the precision value for that question in the above equation is taken to be 0. Whereas MRR considers only the rank of the first correct answer, MAP examines the ranks of all the correct answers.
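The following sketch computes both metrics from ranked relevance labels according to Eqs. (2) and (3); the function names and the toy example are ours.

```python
# MRR and MAP from per-question relevance labels in predicted ranking order
# (1 = correct answer, 0 = incorrect answer).

def mrr(ranked_labels):
    total = 0.0
    for labels in ranked_labels:
        for rank, rel in enumerate(labels, start=1):
            if rel:                      # reciprocal rank of the first correct answer
                total += 1.0 / rank
                break
    return total / len(ranked_labels)

def mean_average_precision(ranked_labels):
    total = 0.0
    for labels in ranked_labels:
        hits, precisions = 0, []
        for rank, rel in enumerate(labels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)   # precision at each correct answer
        total += sum(precisions) / max(hits, 1)  # 0 if no correct answer is retrieved
    return total / len(ranked_labels)

# Example: two questions with three ranked candidates each.
print(mrr([[0, 1, 0], [1, 0, 1]]))                     # (1/2 + 1/1) / 2 = 0.75
print(mean_average_precision([[0, 1, 0], [1, 0, 1]]))  # (0.5 + (1 + 2/3)/2) / 2 ≈ 0.667
```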
3.4 Results
Table 2 compares fine-tuned BERT-base after 3 and 5 training epochs against previously reported state-of-the-art results, and Table 3 compares the base and large BERT models on the same five datasets.
4 Conclusion
Answer selection is an important problem in natural language processing, and many deep learning methods have been proposed for the task. In this paper, we have described a simple adaptation of BERT that establishes new state-of-the-art results on five datasets.
References
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
- Gonzalez et al. (2018) Ana Gonzalez, Isabelle Augenstein, and Anders Søgaard. 2018. A strong baseline for question relevancy ranking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4810–4815.
- Madabushi et al. (2018) Harish Tayyar Madabushi, Mark Lee, and John Barnden. 2018. Integrating question classification and deep learning for improved answer selection. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 3283–3294.
- Nakov et al. (2017) Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, and Karin Verspoor. 2017. Semeval-2017 task 3: Community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, pages 27–48.
- Nakov et al. (2016) Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, Abed Alhakim Freihat, Jim Glass, and Bilal Randeree. 2016. Semeval-2016 task 3: Community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016, pages 525–545.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- Sha et al. (2018) Lei Sha, Xiaodong Zhang, Feng Qian, Baobao Chang, and Zhifang Sui. 2018. A multi-view fusion neural network for answer selection. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5422–5429.
- Tay et al. (2017) Yi Tay, Minh C. Phan, Anh Tuan Luu, and Siu Cheung Hui. 2017. Learning to rank question answer pairs with holographic dual LSTM architecture. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017, pages 695–704.
- Tay et al. (2018a) Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018a. Cross temporal recurrent networks for ranking question answer pairs. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5512–5519.
- Tay et al. (2018b) Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018b. Hyperbolic representation learning for fast and efficient neural question answering. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pages 583–591.
- Wang et al. (2007) Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pages 22–32.
- Wu et al. (2018) Wei Wu, Xu Sun, and Houfeng Wang. 2018. Question condensing networks for answer selection in community question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1746–1755.
- Yang et al. (2015) Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. Wikiqa: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2013–2018.