Crossing Variational Autoencoders for Answer Retrieval

05/06/2020 ∙ Wenhao Yu et al. ∙ IBM ∙ University of Notre Dame

Answer retrieval aims to find the answer most aligned to a given question from a large set of candidates. Learning vector representations of questions and answers is the key factor. Question-answer alignment and question/answer semantics are two important signals for learning these representations. Existing methods learn semantic representations with dual encoders or dual variational auto-encoders, where the semantic information comes from language models or question-to-question (answer-to-answer) generative processes. However, the alignment and the semantics are learned too separately to capture the aligned semantics between a question and its answer. In this work, we propose to cross variational auto-encoders by generating questions from aligned answers and generating answers from aligned questions. Experiments show that our method outperforms the state-of-the-art answer retrieval method on SQuAD.


1 Introduction

Answer retrieval aims to find the answer most aligned to a given question from a large set of candidates Ahmad et al. (2019); Abbasiyantaeb and Momtazi (2020). It has received increasing attention from the NLP and information retrieval communities Yoon et al. (2019); Chang et al. (2020). Sentence-level answer retrieval approaches rely on learning vector representations (i.e., embeddings) of questions and answers from question-answer text pairs. Both the question-answer alignment and the question/answer semantics are expected to be preserved in the representations. In other words, the question/answer embeddings must reflect both the semantics of the texts and the fact that they are aligned as pairs.

Question (1): What three stadiums did the NFL decide between for the game?
Question (2): What three cities did the NFL consider for the game of Super Bowl 50?
...
Question (17): How many sites did the NFL narrow down Super Bowl 50's location to?
Answer: The league eventually narrowed the bids to three sites: New Orleans' Mercedes-Benz Superdome, Miami's Sun Life Stadium, and the San Francisco Bay Area's Levi's Stadium.
Table 1: The answer at the bottom of this table is aligned to 17 different questions at the sentence level; three of them are shown.

One popular scheme, "Dual-Encoders" (also known as "Siamese networks" Triantafillou et al. (2017); Das et al. (2016)), has two separate encoders to generate question and answer embeddings and a predictor to match the two embedding vectors Cer et al. (2018); Yang et al. (2019). Unfortunately, it has proven difficult to train deep encoders with the weak signal of matching prediction Bowman et al. (2015). There has thus been growing interest in developing deep generative models, such as variational auto-encoders (VAEs) and generative adversarial networks (GANs), for learning text embeddings Xu et al. (2017); Xie and Ma (2019). As shown in Figure 1(b), the scheme of "Dual-VAEs" has two VAEs, one for the question and the other for the answer Shen et al. (2018). It uses the tasks of generating plausible question and answer texts from latent spaces to preserve semantics in the latent representations.

(a) Dual-Encoders Yang et al. (2019)
(b) Dual-VAEs Shen et al. (2018)
(c) Dual-CrossVAEs (Ours)
Figure 1: (a)–(b) The Q-A alignment and Q/A semantics were learned too separately to capture the aligned semantics between question and answer. (c) We propose to cross VAEs by generating questions with aligned answers and generating answers with aligned questions.

Although Dual-VAEs is trained jointly on question-to-question and answer-to-answer reconstruction, the question and answer embeddings can only preserve the isolated semantics of each side. In this model, the Q-A alignment and the Q/A semantics are too separate to capture the aligned semantics (as mentioned at the end of the first paragraph) between question and answer. Learning the alignment from the weak Q-A matching signal, even when based on generatable embeddings, can lead to confusing results when (1) different questions have similar answers and (2) similar questions have different answers. Table 1 shows an example in SQuAD: 17 different questions share the same sentence-level answer.

Our idea is that if the aligned semantics were preserved, the embedding of a question would be able to generate its answer, and the embedding of an answer would be able to generate the corresponding question. In this work, we propose to cross variational auto-encoders, as shown in Figure 1(c), by reconstructing answers from question embeddings and reconstructing questions from answer embeddings. Note that compared with Dual-VAEs, the encoders do not change, but the decoders work across the question and answer semantics.

Experiments show that our method improves MRR and R@1 over the state-of-the-art method by 1.06% and 2.44% on SQuAD, respectively. On a subset of the data where any answer has at least 10 different aligned questions, our method improves MRR and R@1 by 1.46% and 3.65%, respectively.

2 Related Work

Answer retrieval (AR) is the task of finding, for a given question, the most relevant answer among multiple candidate answers Abbasiyantaeb and Momtazi (2020). Another popular task on the SQuAD dataset is machine reading comprehension (MRC), which asks the machine to answer questions based on a given context Liu et al. (2019). In this section, we review existing work related to answer retrieval and variational autoencoders.

Answer Retrieval.

Answer retrieval has been widely studied with information retrieval techniques and has received increasing attention in recent years with deep neural network approaches. Recent works have proposed different deep neural models for text-based QA that compare two segments of text and produce a similarity score. Document-level retrieval Chen et al. (2017); Wu et al. (2018); Seo et al. (2018, 2019) has been studied on many public datasets, including SQuAD Rajpurkar et al. (2016), MsMarco Nguyen et al. (2016), and NQ Kwiatkowski et al. (2019). ReQA proposed to investigate sentence-level retrieval and provided strong baselines over a reproducible construction of a retrieval evaluation set from the SQuAD data Ahmad et al. (2019). We also focus on sentence-level answer retrieval.

Variational Autoencoders. A VAE consists of encoder and generator networks, which encode a data example to a latent representation and generate samples from the latent space, respectively Kingma and Welling (2013). Recent advances in neural variational inference have produced deep latent-variable models for natural language processing tasks Bowman et al. (2016); Kingma et al. (2016); Hu et al. (2017a, b); Miao et al. (2016). The general idea is to map the sentence into a continuous latent variable, or code, via an inference network (encoder), and then use the generative network (decoder) to reconstruct the input sentence conditioned on samples from the latent code (via its posterior distribution). Recent work in cross-modal generation adopted cross-alignment VAEs to jointly learn representative features from multiple modalities Liu et al. (2017); Shen et al. (2017); Schonfeld et al. (2019). DeConv-LVM Shen et al. (2018) and VAR-Siamese Deudon (2018) are most relevant to us; both adopt Dual-VAE models (see Figure 1(b)) for text sequence matching tasks. In our work, we propose Cross-VAEs for question-answer alignment to enhance QA matching performance.

3 Proposed Method

Problem Definition. Suppose we have a question set $\mathcal{Q}$ and an answer set $\mathcal{A}$, where each question and each answer consists of a single sentence. Each question-answer pair can be represented as $(q_i, a_j, y_{ij})$, where $y_{ij} \in \{0, 1\}$ is a binary variable indicating whether $q_i$ and $a_j$ are aligned. Therefore, the sentence-level retrieval task can be cast as a matching problem: given a question $q_i$ and a list of answer candidates $\{a_1, \dots, a_n\}$, our goal is to predict $y_{ij}$ for the input question $q_i$ and each answer candidate $a_j$.
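
Concretely, at retrieval time every candidate is scored against the question and the top-scoring answers are returned. A minimal PyTorch sketch of this matching view (the function and tensor names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def retrieve(question_emb: torch.Tensor, answer_embs: torch.Tensor, k: int = 5):
    """Rank answer candidates for one question by cosine similarity.

    question_emb: (d,) embedding of the question.
    answer_embs:  (n, d) embeddings of the n candidate answers.
    Returns the indices of the top-k candidates, best first.
    """
    q = F.normalize(question_emb.unsqueeze(0), dim=-1)  # (1, d)
    a = F.normalize(answer_embs, dim=-1)                # (n, d)
    scores = (q @ a.t()).squeeze(0)                     # (n,) cosine scores
    return scores.topk(k).indices
```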

3.1 Crossing Variational Autoencoder

Learning cross-domain constructions under the generative assumption is essentially learning the conditional distributions $p_\theta(q|a)$ and $p_\theta(a|q)$, where two continuous latent variables $z_q$ and $z_a$ are independently sampled from $q_\phi(z_q|q)$ and $q_\phi(z_a|a)$:

$$p_\theta(q|a) = \int_{z_a} p_\theta(q|z_a)\, q_\phi(z_a|a)\, dz_a, \qquad (1)$$

$$p_\theta(a|q) = \int_{z_q} p_\theta(a|z_q)\, q_\phi(z_q|q)\, dz_q. \qquad (2)$$

The question-answer pair matching can then be represented as the conditional distribution of the alignment label given the latent variables $z_q$ and $z_a$:

$$p(y|q, a) = p(y|z_q, z_a). \qquad (3)$$
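
In practice, the expectations over $z_q$ and $z_a$ are approximated with samples drawn via the standard reparameterization trick Kingma and Welling (2013), which keeps sampling differentiable; a minimal sketch (names are illustrative):

```python
import torch

def reparameterize(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I): z stays stochastic
    while gradients flow through mu and sigma."""
    eps = torch.randn_like(sigma)
    return mu + sigma * eps
```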

Objectives. We denote $q_\phi(z_q|q)$ and $q_\phi(z_a|a)$ as the question and answer encoders that infer the latent variables $z_q$ and $z_a$ from a given question-answer pair $(q, a)$, and $p_\theta(q|z_a)$ and $p_\theta(a|z_q)$ as two different decoders that generate the corresponding question and answer from the latent variables $z_a$ and $z_q$. Then we have the cross reconstruction loss:

$$\mathcal{L}_{cross} = -\,\mathbb{E}_{q_\phi(z_q|q)}\big[\log p_\theta(a|z_q)\big] - \mathbb{E}_{q_\phi(z_a|a)}\big[\log p_\theta(q|z_a)\big]. \qquad (4)$$

The variational autoencoder Kingma and Welling (2013) imposes a KL-divergence regularizer to align both posteriors $q_\phi(z_q|q)$ and $q_\phi(z_a|a)$ with the prior:

$$\mathcal{L}_{KL} = \mathrm{KL}\big(q_\phi(z_q|q)\,\|\,p(z_q)\big) + \mathrm{KL}\big(q_\phi(z_a|a)\,\|\,p(z_a)\big), \qquad (5)$$

where $\theta$ and $\phi$ are all parameters to be optimized. Besides, we have the question-answer matching loss from Eq. (3):

$$\mathcal{L}_{match} = -\,\mathbb{E}\big[\log p_\psi(y\,|\,z_q, z_a)\big], \qquad (6)$$

where $p_\psi(\cdot)$ is a matching function and $\psi$ are its parameters to be optimized. Finally, we obtain the overall objective function to be minimized:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{cross} + \lambda_2 \mathcal{L}_{KL} + \lambda_3 \mathcal{L}_{match}, \qquad (7)$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyper-parameters that control the importance of each task.
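
For concreteness, one training step under Eqs. (4)-(7) can be sketched as follows. This is a simplified sketch under our reading of the objective: the module names (`encode_q`, `decode_a`, `match`, etc.), the closed-form Gaussian KL, and the binary-cross-entropy form of the matching loss are assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

def cross_vae_loss(model, q_tokens, a_tokens, y, lambdas=(1.0, 1.0, 1.0)):
    """One training step of the Cross-VAE objective, Eq. (7).

    `model` is assumed to expose encode_q/encode_a (returning mu, logvar),
    decode_q/decode_a (returning logits of shape (batch, vocab, seq_len)),
    and match(z_q, z_a) (returning a probability in [0, 1]).
    """
    mu_q, logvar_q = model.encode_q(q_tokens)
    mu_a, logvar_a = model.encode_a(a_tokens)
    z_q = mu_q + (0.5 * logvar_q).exp() * torch.randn_like(mu_q)
    z_a = mu_a + (0.5 * logvar_a).exp() * torch.randn_like(mu_a)

    # Eq. (4): crossed reconstruction -- answer from z_q, question from z_a.
    rec = F.cross_entropy(model.decode_a(z_q), a_tokens) \
        + F.cross_entropy(model.decode_q(z_a), q_tokens)

    # Eq. (5): KL regularizers on both posteriors.
    kl = (kl_to_standard_normal(mu_q, logvar_q)
          + kl_to_standard_normal(mu_a, logvar_a)).mean()

    # Eq. (6): matching loss on the binary alignment label y.
    match = F.binary_cross_entropy(model.match(z_q, z_a), y.float())

    l1, l2, l3 = lambdas  # the lambda_i hyper-parameters of Eq. (7)
    return l1 * rec + l2 * kl + l3 * match
```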

3.2 Model Implementation

Dual Encoders.

We use Gated Recurrent Units (GRUs) as encoders to learn contextual word embeddings Cho et al. (2014). Question and answer embeddings are reduced by a weighted sum through multi-hop self-attention Lin et al. (2017) over the GRU hidden states, and then fed into two linear transformations to obtain the posterior mean and standard deviation, $\mu$ and $\sigma$.
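
A sketch of one such encoder is given below, simplified to a single attention hop (the paper uses multi-hop self-attention); the dimensions follow Section 4.3, and the class is illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class GRUVariationalEncoder(nn.Module):
    """GRU contextualizer + self-attentive pooling + linear heads that
    output the posterior mean and log-variance of the latent variable."""
    def __init__(self, emb_dim=768, hidden=768, latent=512):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)        # one attention hop, for brevity
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)

    def forward(self, token_embs):               # (batch, seq, emb_dim)
        h, _ = self.gru(token_embs)              # (batch, seq, hidden)
        w = torch.softmax(self.attn(h), dim=1)   # (batch, seq, 1) weights
        pooled = (w * h).sum(dim=1)              # weighted sum over tokens
        return self.to_mu(pooled), self.to_logvar(pooled)
```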

Dual Decoders. We adopt another Gated Recurrent Unit (GRU) for generating token sequences conditioned on the latent variables $z_q$ and $z_a$.

Question Answer Matching.

We adopt cosine similarity with $\ell_2$ normalization to measure the matching probability of a question-answer pair.
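
A sketch of this matching function (rescaling cosine similarity from [-1, 1] into [0, 1] is our assumption for treating it as a probability):

```python
import torch
import torch.nn.functional as F

def match_probability(z_q: torch.Tensor, z_a: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of the (implicitly L2-normalized) latents,
    rescaled from [-1, 1] to [0, 1] to act as a matching probability."""
    cos = F.cosine_similarity(z_q, z_a, dim=-1)
    return (cos + 1.0) / 2.0
```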

4 Experiment

4.1 Dataset

Our experiments were conducted on SQuAD 1.1 Rajpurkar et al. (2016), which has over 100,000 questions composed to be answerable by text from Wikipedia documents. Each question has one corresponding answer sentence extracted from the Wikipedia document. Since the test set is not publicly available, we partition the dataset into 79,554 (training) / 7,801 (dev) / 10,539 (test) examples.

Method SQuAD
MRR R@1 R@5
InferSent 36.90 27.91 46.92
SenBERT 38.01 27.34 49.59
BERT 48.07 40.63 57.45
QA-Lite 50.29 40.69 61.38
USE-QA 61.23 53.16 69.93
Dual-GRUs 61.06 54.70 68.25
Dual-VAEs 61.48 55.01 68.49
Cross-VAEs 62.29 55.60 70.05
Table 2: Performance of answer retrieval on SQuAD.
Method SQuAD Subset
MRR R@1 R@5 SSE
BERT 37.90 30.81 45.24 0.23
USE-QA 47.06 40.90 53.44 0.14
Cross-VAEs 48.52 44.55 53.52 0.09
Table 3: Performance of answer retrieval on a subset of SQuAD in which every answer has more than 8 questions. Our method outperforms the baselines by a larger margin here. SSE indicates the sum of squared distances/errors between different questions aligned to the same answer.

4.2 Baselines

InferSent Conneau et al. (2017). It is not explicitly designed for answer retrieval, but it produces results on semantic tasks without requiring additional fine-tuning.

USE-QA Yang et al. (2019). It is based on the Universal Sentence Encoder Cer et al. (2018), but trained with multilingual QA retrieval and two other tasks: translation ranking and natural language inference. The training corpus contains over a billion question-answer pairs from popular online forums and QA websites (e.g., Reddit).

QA-Lite. Like USE-QA, this model is also trained on online forum data and based on the Transformer. The main differences are a reduction in the width and depth of the model layers and a smaller sub-word vocabulary.

BERT Devlin et al. (2019). BERT first concatenates the question and answer into a single text sequence, then passes it through a 12-layer BERT model and feeds the [CLS] vector into a binary classifier.

SenBERT Reimers and Gurevych (2019). It consists of twin-structured BERT-like encoders that represent the question and answer sentences, and then applies a similarity measure at the top layer.

4.3 Experimental Settings

Implementation details.

We initialize each word with a 768-dimensional BERT token embedding vector. If a word is not in the vocabulary, we use the average of its sub-word embedding vectors in the vocabulary. The number of hidden units in the GRU encoders is set to 768. All decoders are multi-layer perceptrons (MLPs) with one hidden layer of 768 units. The latent embedding size is 512. The model is trained for 100 epochs by stochastic gradient descent with the Adam optimizer Kingma and Ba (2014). For the KL-divergence, we use a KL cost annealing scheme Bowman et al. (2016), which lets the VAE learn useful representations before they are smoothed out: we increase the weight of the KL-divergence by a fixed rate per epoch until it reaches 1. We set the learning rate to 1e-5 and implement the model in PyTorch.
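
The annealing schedule can be sketched as follows, where `rate` stands in for the per-epoch increment elided above:

```python
def kl_weight(epoch: int, rate: float) -> float:
    """KL cost annealing (Bowman et al., 2016): the KL term's weight grows
    by `rate` each epoch, capped at 1, so the decoders learn useful
    reconstructions before the posteriors are pulled toward the prior."""
    return min(1.0, epoch * rate)
```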

Competitive Methods.

We compare our proposed cross variational autoencoder (Cross-VAEs) with the dual-encoder model (Dual-GRUs) and the dual variational autoencoder (Dual-VAEs). For a fair comparison, all models use GRUs as encoders and decoders, and all other hyperparameters are kept the same.

Evaluation Metrics. The models are evaluated on retrieving and ranking answers to questions using mean reciprocal rank (MRR) and recall at K (R@K, reported for K = 1 and 5). R@K is the percentage of questions for which the correct answer appears in the top K candidates. MRR is the average of the reciprocal ranks of the correct answers over the set of queries.
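
For concreteness, both metrics can be computed from the 1-based rank of the correct answer for each query (a standard formulation, not the authors' evaluation script):

```python
def mrr(ranks):
    """Mean reciprocal rank over queries; `ranks` holds the 1-based rank
    of the correct answer for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct answer appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```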

(a) USE-QA
(b) CrossVAEs
(c) Two questions were incorrectly matched by USE-QA, but correctly matched by CrossVAEs.
Figure 2: A case of 14 different questions aligned to the same answer. We use SVD to reduce the embedding dimensions to 2 and project them onto the X-Y plane. The scale of the axes is relative and has no practical significance. We observe that our method draws questions that share the same answer closer to each other.
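
The 2-D projection in Figure 2 can be reproduced with a truncated SVD; a sketch with scikit-learn (any equivalent SVD implementation would do):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Reduce (n, d) question/answer embeddings to (n, 2) for plotting."""
    return TruncatedSVD(n_components=2).fit_transform(embeddings)
```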

Comparing performance with baselines. As shown in Table 2, the two BERT-based models do not perform well, which indicates that fine-tuning BERT may not be a good choice for the answer retrieval task due to its unrelated pre-training tasks (e.g., masked language modeling). In contrast, using BERT token embeddings performs better in our retrieval task. Our proposed method outperforms all baseline methods. Compared with USE-QA, our method improves MRR and R@1 by +1.06% and +2.44% on SQuAD, respectively. In addition, the dual variational autoencoder (Dual-VAEs) does not bring much improvement on the answer retrieval task because it only preserves the isolated semantics of questions and answers. Our proposed crossing variational autoencoder (Cross-VAEs) outperforms both the dual-encoder model and the dual variational autoencoder, improving MRR by +1.23% and +0.81% and R@1 by +0.90% and +0.59%, respectively.

Analyzing performance on a sub-dataset. We extract a subset of SQuAD in which every answer has at least eight different questions. As shown in Table 3, our proposed cross variational autoencoder (Cross-VAEs) outperforms the baseline methods on this subset, improving MRR and R@1 by +1.46% and +3.65% over USE-QA. Cross-VAEs significantly improves performance when an answer has multiple aligned questions. Additionally, the SSE of our method is smaller than that of USE-QA; the questions of the same answer are therefore closer in the latent space.

4.4 Case Study

Figures 2(a) and 2(b) visualize the embeddings of 14 questions that share the same answer. We observe that crossing variational autoencoders (CrossVAEs) better capture the aligned semantics between questions and answers, making the latent representations of questions and answers more distinctive. Figure 2(c) shows two example questions and the corresponding answers produced by USE-QA and CrossVAEs. We observe that CrossVAEs can better distinguish similar answers even when they all share several of the same words with the question.

5 Conclusion

Given a question, answer retrieval aims to find the most relevant answer text among candidate answer texts. In this paper, we proposed to cross variational autoencoders by generating questions from aligned answers and generating answers from aligned questions. Experiments show that our method improves MRR and R@1 over the best baseline by 1.06% and 2.44% on SQuAD, respectively.

Acknowledgements

We thank Drs. Nicholas Fuller, Sinem Guven, and Ruchi Mahindru for their constructive comments and suggestions. This project was partially supported by National Science Foundation (NSF) IIS-1849816 and Notre Dame Global Gateway Faculty Research Award.

References

  • Z. Abbasiyantaeb and S. Momtazi (2020) Text-based question answering from information retrieval and deep neural network perspectives: a survey. arXiv preprint arXiv:2002.06612. Cited by: §1, §2.
  • A. Ahmad, N. Constant, Y. Yang, and D. Cer (2019) ReQA: an evaluation for end-to-end answer retrieval models. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, Cited by: §1, §2.
  • S. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Cited by: §1.
  • S. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio (2016) Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Cited by: §2, §4.3.
  • D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. (2018) Universal sentence encoder. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Cited by: §1, §4.2.
  • W. Chang, F. X. Yu, Y. Chang, Y. Yang, and S. Kumar (2020) Pre-training tasks for embedding-based large-scale retrieval. In Proceedings of 8th International Conference for Learning Representation (ICLR). Cited by: §1.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: §2.
  • K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §3.2.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Cited by: §4.2.
  • A. Das, H. Yenala, M. Chinnakotla, and M. Shrivastava (2016) Together we stand: siamese networks for similar question retrieval. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: §1.
  • M. Deudon (2018) Learning semantic similarity in a continuous space. In Advances in neural information processing systems (NeurIPS), pp. 986–997. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §4.2.
  • Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017a) Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning (ICML), Cited by: §2.
  • Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing (2017b) On unifying deep generative models. In Proceedings of 5th International Conference for Learning Representation (ICLR). Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proceedings of 2nd International Conference for Learning Representation (ICLR). Cited by: §4.3.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. In Proceedings of 1st International Conference for Learning Representation (ICLR). Cited by: §2, §3.1.
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems (NeurIPS), Cited by: §2.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics. Cited by: §2.
  • Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. In Proceedings of 5th International Conference for Learning Representation (ICLR). Cited by: §3.2.
  • M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in neural information processing systems (NeurIPS), Cited by: §2.
  • S. Liu, X. Zhang, S. Zhang, H. Wang, and W. Zhang (2019) Neural machine reading comprehension: methods and trends. Applied Sciences. Cited by: §2.
  • Y. Miao, L. Yu, and P. Blunsom (2016) Neural variational inference for text processing. In International conference on machine learning, Cited by: §2.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS marco: a human-generated machine reading comprehension dataset. Cited by: §2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Cited by: §2, §4.1.
  • N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: §4.2.
  • E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero- and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • M. Seo, T. Kwiatkowski, A. Parikh, A. Farhadi, and H. Hajishirzi (2018) Phrase-indexed question answering: a new challenge for scalable document comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Cited by: §2.
  • M. Seo, J. Lee, T. Kwiatkowski, A. Parikh, A. Farhadi, and H. Hajishirzi (2019) Real-time open-domain question answering with dense-sparse phrase index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Cited by: §2.
  • D. Shen, Y. Zhang, R. Henao, Q. Su, and L. Carin (2018) Deconvolutional latent-variable model for text sequence matching. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), Cited by: Figure 1(b), §1, §2.
  • T. Shen, T. Lei, R. Barzilay, and T. Jaakkola (2017) Style transfer from non-parallel text by cross-alignment. In Advances in neural information processing systems (NeurIPS), Cited by: §2.
  • E. Triantafillou, R. Zemel, and R. Urtasun (2017) Few-shot learning through an information retrieval lens. In Advances in neural information processing systems (NeurIPS), Cited by: §1.
  • L. Wu, I. E. Yen, K. Xu, F. Xu, A. Balakrishnan, P. Chen, P. Ravikumar, and M. J. Witbrock (2018) Word mover’s embedding: from word2vec to document embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
  • Z. Xie and S. Ma (2019) Dual-view variational autoencoders for semi-supervised text matching. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Cited by: §1.
  • W. Xu, H. Sun, C. Deng, and Y. Tan (2017) Variational autoencoder for semi-supervised text classification. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §1.
  • Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Abrego, S. Yuan, C. Tar, Y. Sung, et al. (2019) Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307. Cited by: 1(a), §1, §4.2.
  • S. Yoon, F. Dernoncourt, D. S. Kim, T. Bui, and K. Jung (2019) A compare-aggregate model with latent clustering for answer selection. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Cited by: §1.