Contextualized Sparse Representation with Rectified N-Gram Attention for Open-Domain Question Answering

A sparse representation is known to be an effective means to encode precise lexical cues in information retrieval tasks by associating each dimension with a unique n-gram-based feature. However, it has often relied on term frequency (such as tf-idf and BM25) or hand-engineered features that are coarse-grained (document-level) and often task-specific, hence not easily generalizable and not appropriate for fine-grained (word or phrase-level) retrieval. In this work, we propose an effective method for learning a highly contextualized, word-level sparse representation by utilizing rectified self-attention weights on the neighboring n-grams. We kernelize the inner product space during training for memory efficiency without the explicit mapping of the large sparse vectors. We particularly focus on the application of our model to the phrase retrieval problem, which has recently been shown to be a promising direction for open-domain question answering (QA) and requires lexically sensitive phrase encoding. We demonstrate the effectiveness of the learned sparse representations by not only drastically improving the phrase retrieval accuracy (by more than 4%), but also outperforming all other (pipeline-based) open-domain QA methods with up to x97 faster inference on SQuAD-open and CuratedTrec.



1 Introduction

Retrieving text from a large collection of documents is an important problem in several natural language tasks such as question answering (QA) and information retrieval. In the literature, sparse representations have been successfully used in encoding text at sentence or document level, capturing precise lexical information that can be sparsely activated by n-gram based features. For instance, frequency-based sparse representations such as tf-idf map each text segment to a vocabulary space where the weight of each dimension is determined by the associated word’s term and inverse document frequency. However, these sparse representations are mainly coarse-grained, task-specific, and not suitable for word-level representations since they statically assign the identical weight to each n-gram and do not change dynamically depending on the context.

In this paper, we introduce an effective method for learning a word-level sparse representation that encodes precise lexical information. Our model, CoSPR, learns a contextualized sparse phrase representation by leveraging rectified self-attention weights on the neighboring n-grams and dynamically encoding important lexical information of each phrase (such as named entities) given its context. Consequently, in contrast to previous sparse regularization techniques on dense embeddings of at most a few thousand dimensions (Faruqui et al., 2015; Subramanian et al., 2018), our method is able to produce more interpretable representations with billion-scale cardinality. Such a large sparse vector space is prohibitive for explicit mapping; we leverage the fact that our sparse representations only interact through inner products and kernelize the inner-product space for memory-efficient training, inspired by the kernel method in SVMs (Cortes and Vapnik, 1995). This allows us to handle extremely large sparse vectors without worrying about computational bottlenecks. The overview of our model is illustrated in Figure 1.

We demonstrate the effectiveness of our model in open-domain question answering (QA), the task of retrieving answer phrases given a web-scale collection of documents. Following Seo et al. (2019), we concatenate both sparse and dense vectors to encode every phrase in Wikipedia and use maximum similarity search to find the closest candidate phrase to answer each question. We only substitute (or augment) the baseline sparse encoding, which is entirely based on frequency-based embedding (tf-idf), with our contextualized sparse representation (CoSPR). Our empirical results demonstrate its state-of-the-art performance on two open-domain QA datasets, SQuAD-open and CuratedTrec. Notably, our method significantly outperforms DenSPI (Seo et al., 2019), the previous end-to-end QA model, by more than 4% with a negligible drop in inference speed. Moreover, our method achieves up to 2% better accuracy and x97 speedup in inference compared to pipeline (retrieval-based) approaches. Our analysis particularly shows that fine-grained sparse representation is crucial for doing well in the phrase retrieval task. In summary, the contributions of our paper are:

  1. we show that learning sparse representations for embedding lexically important context words of a phrase can be achieved by contextualized sparse representations,

  2. we introduce an efficient training strategy that leverages the kernelization of the sparse inner-product space, and

  3. we achieve the state-of-the-art performance in two open-domain QA datasets with up to x97 faster inference time.

Figure 1: Overview of our model. Using the billions of precomputed phrase representations, we perform a maximum inner product search between the phrase vectors and an input question vector. We propose to learn contextualized sparse phrase representations which are also very interpretable.

2 Related Work

Open-Domain Question Answering

Most open-domain QA models for unstructured texts use a retriever to find documents to read, and then apply a reading comprehension (RC) model to find answers (Chen et al., 2017; Wang et al., 2018a; Lin et al., 2018; Das et al., 2019; Lee et al., 2019; Yang et al., 2019; Wang et al., 2019). To improve the performance of open-domain question answering, various modifications to the pipelined models have been studied, including improving the retriever-reader interaction (Wang et al., 2018a; Das et al., 2019), re-ranking paragraphs and/or answers (Wang et al., 2018b; Lin et al., 2018; Lee et al., 2018; Kratzwald et al., 2019), learning end-to-end models with weak supervision (Lee et al., 2019), or simply making a better retriever and reader model (Yang et al., 2019; Wang et al., 2019). Due to their pipeline nature, however, these models inevitably suffer from error propagation from the retrievers.

To mitigate this problem, Seo et al. (2019) propose to learn query-agnostic representations of phrases in Wikipedia and retrieve the phrases that best answer a question. While Seo et al. (2019) have shown that encoding both dense and sparse representations for each phrase could keep the lexically important words of a phrase to some extent, their sparse representations are based on static tf-idf vectors, which assign globally the same weight to each n-gram.

Phrase Representations

In NLP, phrase representations can be obtained either in a similar manner as word representations (Mikolov et al., 2013), or by learning a parametric function of word representations (Cho et al., 2014). In extractive question answering, phrases are often referred to as spans, but most models do not consider explicitly learning phrase representations, as these answer spans can be obtained by predicting only start and end positions in a paragraph (Wang and Jiang, 2017; Seo et al., 2017). Nevertheless, a few studies have focused on directly learning and classifying phrase representations (Lee et al., 2017), which achieve strong performance when combined with an attention mechanism. In this work, we are interested in learning query-agnostic sparse phrase representations, which enables the precomputation of re-usable phrase representations (Seo et al., 2018).

Sparse Representations

One of the most basic sparse representations is bag-of-words modeling, and recent works often emphasize the use of bag-of-words models as strong baselines for sentence classification and question answering (Joulin et al., 2017; Weissenborn et al., 2017). tf-idf is another good example of a sparse representation that is used for document retrieval, and it is still widely adopted in both the IR and QA communities. Our work could be seen as an attempt to build a trainable tf-idf model for phrase (n-gram) representations, which should be more fine-grained than paragraphs or documents. There have been some attempts in IR to learn sparse representations of documents for duplicate detection (Hajishirzi et al., 2010), inverted indexing (Zamani et al., 2018), and more. Unlike previous works, however, our method does not require hand-engineered features for sparse n-grams while keeping the original vocabulary space. In NLP, there are some works on training sparse representations specifically designed for improving the interpretability of word representations (Faruqui et al., 2015; Subramanian et al., 2018), but they lose an important role of sparse representations, which is keeping the exact lexical information, as they are trained by sparsifying dense representations into a space of higher dimension that is still much smaller than the n-gram vocabulary.

3 Background: Open-Domain QA through Phrase Retrieval

Open-Domain QA

We primarily focus on open-domain QA on unstructured data where the answer is a text span in the corpus. Formally, given a set of $K$ documents $x^1, \dots, x^K$ and a question $q$, the task is to design a model that obtains the answer $\hat{a}$ by

$$\hat{a} = \operatorname*{argmax}_{x^k_{i:j}} F(x^k_{i:j}, q),$$

where $F$ is the score model to learn and $x^k_{i:j}$ is a phrase consisting of words from the $i$-th to the $j$-th word in the $k$-th document. Typically, the number of documents ($K$) is in the order of millions for open-domain QA (e.g. 5 million for English Wikipedia), which makes the task computationally challenging. Pipeline-based methods typically leverage a document retriever to reduce the number of documents to read, but they suffer from error propagation when wrong documents are retrieved and can be slow if the reader model is computationally cumbersome.

Open-domain QA with Phrase Encoding and Retrieval

As an alternative, phrase-retrieval approaches (Seo et al., 2018, 2019) mitigate this problem by directly accessing all phrases in the documents. For a phrase $x^k_{i:j}$ (words $i$ through $j$ of the $k$-th document) and a question $q$, they decompose the score model $F$ into two functions,

$$F(x^k_{i:j}, q) = H_x(x^k_{i:j}) \cdot H_q(q),$$

where $\cdot$ denotes the inner product operation, $H_x$ is the phrase encoder, and $H_q$ is the question encoder. Instead of running a complex reading comprehension model as in pipeline-based approaches, $H_x$ query-agnostically encodes (all possible phrases of) each document just once, so that we only need to compute $H_q(q)$ (which is very fast) and perform similarity search on the phrase encodings (which is also fast). Seo et al. (2019) have shown that encoding each phrase with a concatenation of dense and sparse representations is effective, where the dense part is computed from BERT (Devlin et al., 2019) and the sparse part is obtained from the tf-idf vector of the document and the paragraph of the phrase. We briefly describe how the dense part is obtained below.
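The decomposition above can be illustrated with a brute-force maximum inner product search. This is only a minimal sketch: random vectors stand in for real encoder outputs, and `phrase_index`, `encode_question`, `retrieve`, and the sizes are hypothetical names and values chosen for illustration. A real index over billions of phrases would rely on approximate search rather than a full matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical precomputed phrase index: one row per candidate phrase.
num_phrases, dim = 1000, 16
phrase_index = rng.standard_normal((num_phrases, dim))

def encode_question(seed: int) -> np.ndarray:
    """Stand-in for the question encoder; returns a random vector."""
    return np.random.default_rng(seed).standard_normal(dim)

def retrieve(question_vec: np.ndarray, index: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k phrases by inner product (brute force)."""
    scores = index @ question_vec
    # argpartition finds the top_k largest scores without a full sort.
    top = np.argpartition(-scores, top_k - 1)[:top_k]
    # Sort only the selected candidates by descending score.
    return top[np.argsort(-scores[top])]

question_vec = encode_question(seed=1)
top_ids = retrieve(question_vec, phrase_index)
```

The phrase side is computed once and reused for every question, so the per-question cost is one question encoding plus the search.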

Dense Representation

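Assuming that the document $x$ has $N$ words as $x_1, \dots, x_N$, Seo et al. (2019) use BERT (Devlin et al., 2019) to compute a contextualized representation of each word as $\mathbf{H} = [\mathbf{h}_1, \dots, \mathbf{h}_N] \in \mathbb{R}^{N \times d}$. Based on the contextualized embeddings, we obtain dense phrase representations as follows: We split each $\mathbf{h}_i$ into four parts $[\mathbf{h}^1_i, \mathbf{h}^2_i, \mathbf{h}^3_i, \mathbf{h}^4_i]$ where $\mathbf{h}^1_i, \mathbf{h}^2_i \in \mathbb{R}^{d^{se}}$ and $\mathbf{h}^3_i, \mathbf{h}^4_i \in \mathbb{R}^{d^{c}}$ (dimensions $d^{se}$ and $d^{c}$ are chosen to make $2d^{se} + 2d^{c} = d$). Then each phrase $x_{i:j}$ is densely represented as follows:

$$\mathbf{d}_{i:j} = [\mathbf{h}^1_i, \mathbf{h}^2_j, \mathbf{h}^3_i \cdot \mathbf{h}^4_j] \in \mathbb{R}^{2d^{se}+1},$$

where $\cdot$ denotes the inner product operation. $\mathbf{h}^1_i$ and $\mathbf{h}^2_j$ are the start/end representations of a phrase, and the inner product of $\mathbf{h}^3_i$ and $\mathbf{h}^4_j$ is used for computing the coherency of the phrase. Refer to Seo et al. (2019) for details; we mostly reuse its architecture.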
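The construction of the dense phrase vector can be sketched as follows. This is an illustrative sketch under assumed toy sizes, with random numbers in place of BERT outputs; the names `H`, `d_se`, `d_c`, and `dense_phrase` are hypothetical and not taken from the DenSPI implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_se, d_c = 6, 4, 2            # toy sizes (assumed)
d = 2 * d_se + 2 * d_c
H = rng.standard_normal((N, d))   # stand-in for contextualized word vectors

# Split each word vector into start, end, and two coherency parts.
h1, h2, h3, h4 = np.split(H, [d_se, 2 * d_se, 2 * d_se + d_c], axis=1)

def dense_phrase(i: int, j: int) -> np.ndarray:
    """Dense vector for the phrase spanning word i to word j (inclusive)."""
    coherency = h3[i] @ h4[j]     # scalar coherency of the (start, end) pair
    return np.concatenate([h1[i], h2[j], [coherency]])

vec = dense_phrase(1, 3)
```

Because the start and end parts depend only on single words, only per-word vectors need to be stored; a phrase vector is assembled on demand from its two boundary words.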

4 Sparse Encoding of Phrases

Sparse representations are often suitable for keeping the precise lexical information present in the text, complementing dense representations that are often good for encoding semantic and syntactic information. While the sparsification of dense word embeddings has been explored previously (Faruqui et al., 2015), its main limitations are that (1) it starts from the dense embedding, which might have already lost rich lexical information, and (2) its cardinality is in the order of thousands at most, which we hypothesize is too small to encode sufficient information. Our work hence focuses on creating a sparse representation of each phrase that is not bottlenecked by a dense embedding and is capable of increasing the cardinality to billion-scale without much computational cost.

4.1 Why do we need to learn sparse representations?

Suppose we are given a question “How many square kilometres of the Amazon forest was lost by 1991?” and the target answer is in the following sentence,

Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres (160,000 to 227,000 sq mi).

To answer the question, the model should know that the target answer (415,000) corresponds to the year 1991 while the (confusing) phrase 587,000 corresponds to the year 2000, which requires syntactic understanding of English parallelism. The dense phrase encoding is likely to have difficulty in precisely differentiating between 1991 and 2000 since it also needs to encode several different kinds of information. Window-based tf-idf would not help either because the year 2000 is closer (in word distance) to 415,000. This example illustrates the strong need to create an n-gram-based sparse encoding that is highly syntax- and context-aware.
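The word-distance claim can be checked with a few lines of code. This is a small illustration that assumes plain whitespace tokenization (an assumption made here for simplicity; a subword tokenizer would give different absolute distances).

```python
sentence = ("Between 1991 and 2000, the total area of forest lost in the "
            "Amazon rose from 415,000 to 587,000 square kilometres")

# Whitespace tokens with trailing commas removed (commas inside numbers stay).
tokens = [t.rstrip(",") for t in sentence.split()]

answer_pos = tokens.index("415,000")
distances = {year: abs(tokens.index(year) - answer_pos) for year in ("1991", "2000")}
```

Under this tokenization, 2000 is 12 words away from 415,000 and 1991 is 14 words away, so a purely distance-based window would favor the wrong year.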

4.2 Contextualized Sparse Representations

Our sparse model, unlike pre-computed sparse embeddings such as tf-idf, dynamically computes the weight of each n-gram that depends on the context. Intuitively, for each phrase, we want to compute a positive weight for each n-gram near the phrase depending on how important the n-gram is to the phrase.

Sparse Representation

The sparse representation of each phrase is also obtained as the concatenation of its start word’s and end word’s sparse embeddings, i.e. $\mathbf{s}_{i:j} = [\mathbf{s}^{\text{start}}_i, \mathbf{s}^{\text{end}}_j]$ for a phrase spanning the $i$-th to the $j$-th word. This way, similarly to how the dense phrase embedding is obtained, we can efficiently compute them without explicitly enumerating all possible phrases.

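We obtain each (start and end) sparse embedding in the same way (with unshared parameters), so we just describe how we obtain the start sparse embedding here and omit the superscript ‘start’. Given the contextualized encoding of each document $\mathbf{H} = [\mathbf{h}_1, \dots, \mathbf{h}_N] \in \mathbb{R}^{N \times d}$, we obtain its (start or end) sparse encoding $\mathbf{S} = [\mathbf{s}_1, \dots, \mathbf{s}_N] \in \mathbb{R}^{N \times F}$ by

$$\mathbf{S} = \mathrm{ReLU}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{F}, \tag{2}$$

where $\mathbf{Q}, \mathbf{K} \in \mathbb{R}^{N \times d}$ are query and key matrices obtained by applying (different) linear transformations on $\mathbf{H}$ (i.e., using $\mathbf{Q} = \mathbf{H}\mathbf{W}^Q$ and $\mathbf{K} = \mathbf{H}\mathbf{W}^K$), and $\mathbf{F} \in \mathbb{R}^{N \times F}$ is a one-hot n-gram feature representation of the input document $x$. That is, for instance, if we want to encode unigram (1-gram) features, the $i$-th row of $\mathbf{F}$ will simply be a one-hot representation of the word $x_i$, and $F$ will be equivalent to the vocabulary size. Note that $\mathbf{F}$ will be very large, so it should always exist in an efficient sparse matrix format (e.g. csc) and one should not explicitly create its dense form. We also discuss how we can leverage it during training in an efficient manner in Section 4.3.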

Note that Equation 2 is similar to the scaled dot-product self-attention (Vaswani et al., 2017), with two key differences: (1) ReLU is used instead of softmax, and (2) the sparse n-gram feature matrix $\mathbf{F}$ is used instead of a (dense) value matrix. In fact, our sparse embedding is related to the attention mechanism in that we want to compute how important each n-gram is for each phrase’s start and end word. Intuitively, the sparse vector $\mathbf{s}_i$ of the $i$-th word contains a weighted bag-of-n-grams representation where each n-gram is weighted by its relative importance to each start or end word of a phrase. Unlike the attention mechanism, whose role is to summarize the target vectors via weighted summation (thus softmax is used, which sums to 1), we do not perform the summation and directly output the unnormalized attention weights to a large sparse embedding space. In fact, we experimentally observe that ReLU is more effective than softmax for this objective.
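The rectified n-gram attention described above can be sketched on a toy example. This is a sketch under stated assumptions: random matrices stand in for BERT outputs and learned projections, the scaling by the square root of the hidden size follows the scaled dot-product form, and the names `W_q`, `W_k`, `token_ids`, and `sparse_vector` are hypothetical. The one-hot feature matrix is never materialized; each position only carries its n-gram id.

```python
from typing import Dict

import numpy as np

rng = np.random.default_rng(0)
N, d, vocab_size = 5, 8, 10              # toy sizes (assumed)
H = rng.standard_normal((N, d))          # stand-in for contextualized word vectors
W_q = rng.standard_normal((d, d))        # hypothetical query projection
W_k = rng.standard_normal((d, d))        # hypothetical key projection
token_ids = np.array([3, 7, 3, 1, 9])    # unigram id at each position

# Rectified attention weights: ReLU instead of softmax, so no normalization.
Q, K = H @ W_q, H @ W_k
A = np.maximum(Q @ K.T / np.sqrt(d), 0.0)    # shape (N, N)

def sparse_vector(i: int) -> Dict[int, float]:
    """Sparse vector of word i as {n-gram id: weight}, without a dense one-hot matrix."""
    weights: Dict[int, float] = {}
    for j, ngram_id in enumerate(token_ids):
        if A[i, j] > 0:
            key = int(ngram_id)
            # Repeated n-grams accumulate their attention weights.
            weights[key] = weights.get(key, 0.0) + float(A[i, j])
    return weights

s_0 = sparse_vector(0)
```

Each word's vector has at most N non-zero entries regardless of the vocabulary size, which is what makes a very large sparse space practical to store.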

Since we want to handle several different sizes of n-grams, we create the sparse encoding for each n-gram size and concatenate the resulting sparse encodings. That is, let $\mathbf{S}^{(n)}$ be the sparse encoding for $n$-grams; then the final sparse encoding is the concatenation of the encodings for the different $n$-grams we consider. In practice, we experimentally find that unigrams and bigrams are sufficient for most use cases; in this case, the sparse vector for the $i$-th (start) word will be $\mathbf{s}_i = [\mathbf{s}^{(1)}_i, \mathbf{s}^{(2)}_i]$. We do not share linear transformation parameters across different $n$-grams. Note that this is also analogous to multiple attention heads in Vaswani et al. (2017).
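One way to place unigram and bigram features in a single concatenated index space is sketched below. The layout (unigrams first, then bigrams indexed by their two token ids) and the function name `ngram_feature_ids` are illustrative assumptions, not the paper's implementation; only the vocabulary size of 30522 comes from the implementation details.

```python
V = 30522  # BERT vocabulary size reported in the implementation details

def ngram_feature_ids(token_ids):
    """Return (unigram ids, bigram ids) in one concatenated index space.

    Unigrams occupy [0, V); the bigram (a, b) is mapped to V + a * V + b.
    This particular layout is an illustrative assumption.
    """
    unigrams = list(token_ids)
    bigrams = [V + a * V + b for a, b in zip(token_ids, token_ids[1:])]
    return unigrams, bigrams

# Cardinality of the concatenated unigram + bigram space.
total_dims = V + V * V

uni, bi = ngram_feature_ids([101, 2054, 2003, 102])
```

Here `total_dims` evaluates to 931,623,006, which matches the billion-scale cardinality mentioned in the text.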

Question Encoding

For a question $q = [\text{[CLS]}, q_1, \dots, q_M]$, where [CLS] denotes a special token for BERT inputs, contextualized question representations are computed in a similar way ($\mathbf{H}' = [\mathbf{h}'_{\text{[CLS]}}, \mathbf{h}'_1, \dots, \mathbf{h}'_M]$). We share the same BERT used for the phrase encoding. We compute sparse encodings on the question side ($\mathbf{s}'$) in a similar way to the document side, with the only difference that we use the [CLS] token instead of start and end words to represent the entire question. Linear transformation weights are not shared.

4.3 Training

Kernel Function

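As training phrase encoders on the whole Wikipedia is computationally prohibitive, we use training examples from an extractive question answering dataset (SQuAD) to train our encoders. Given a pair of a question $q$ and a golden document $x$ (a paragraph in the case of SQuAD), we first compute the dense logit of each phrase $x_{i:j}$ by $l_{i,j} = \mathbf{d}_{i:j} \cdot \mathbf{d}'$, where $\mathbf{d}_{i:j}$ and $\mathbf{d}'$ are the dense phrase and question representations.

Unlike Seo et al. (2019), each phrase’s sparse embedding is also trained, so it needs to be considered in the loss function. We define the sparse logit for phrase $x_{i:j}$ as $l^{\text{sparse}}_{i,j} = \mathbf{s}^{\text{start}}_i \cdot \mathbf{s}'^{\,\text{start}} + \mathbf{s}^{\text{end}}_j \cdot \mathbf{s}'^{\,\text{end}}$, where primed symbols denote question-side quantities. For brevity, we describe how we compute the first term corresponding to the start word (dropping the superscript ‘start’); the second term can be computed in the same way:

$$\mathbf{s}_i \cdot \mathbf{s}' = \left[\mathrm{ReLU}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{F}\right]_i \left[\mathrm{ReLU}\!\left(\frac{\mathbf{Q}'\mathbf{K}'^\top}{\sqrt{d}}\right)\mathbf{F}'\right]_{\text{[CLS]}}^\top = \mathrm{ReLU}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)_i \left(\mathbf{F}\mathbf{F}'^\top\right) \mathrm{ReLU}\!\left(\frac{\mathbf{Q}'\mathbf{K}'^\top}{\sqrt{d}}\right)_{\text{[CLS]}}^\top,$$

where $\mathbf{Q}, \mathbf{K}, \mathbf{F}$ are the document-side query, key, and n-gram feature matrices, $\mathbf{Q}', \mathbf{K}', \mathbf{F}'$ denote the question-side query, key, and n-gram feature matrices, respectively, and the subscripts $i$ and [CLS] select the corresponding rows. We can efficiently compute it if we precompute $\mathbf{F}\mathbf{F}'^\top$. Note that $\mathbf{F}\mathbf{F}'^\top$ can be considered as applying a kernel function, i.e. $\mathcal{K}(\mathbf{F}, \mathbf{F}') = \mathbf{F}\mathbf{F}'^\top$, where its $(i, j)$-th entry is 1 if and only if the n-gram at the $i$-th position of the context is equivalent to the $j$-th n-gram of the question, which can be efficiently computed as well. One can also think of this as the kernel trick (in the literature of SVMs (Cortes and Vapnik, 1995)) that allows us to compute the loss function without explicit mapping.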
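The equivalence between the explicit sparse inner product and its kernelized form can be checked numerically on a tiny vocabulary. This is a sketch with assumed toy data: `a_ctx` and `a_qst` are random non-negative stand-ins for the rectified attention rows of one context word and of the question's [CLS] token, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50                            # tiny, so explicit mapping is feasible
ctx_ids = np.array([4, 17, 8, 17, 30])     # context n-gram ids (N = 5)
qst_ids = np.array([0, 17, 30, 2])         # question n-gram ids (M = 4)

# Stand-ins for rectified attention rows (non-negative by construction).
a_ctx = np.maximum(rng.standard_normal(len(ctx_ids)), 0.0)
a_qst = np.maximum(rng.standard_normal(len(qst_ids)), 0.0)

# Explicit mapping: build one-hot feature matrices and the full sparse vectors.
F_ctx = np.eye(vocab_size)[ctx_ids]
F_qst = np.eye(vocab_size)[qst_ids]
explicit = (a_ctx @ F_ctx) @ (a_qst @ F_qst)

# Kernelized: only the (N x M) n-gram match matrix is needed.
match = (ctx_ids[:, None] == qst_ids[None, :]).astype(float)
kernelized = a_ctx @ match @ a_qst
```

The match matrix has one entry per (context position, question position) pair, so its size depends on the sequence lengths and not on the size of the n-gram vocabulary.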

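The loss to minimize is computed from the negative log likelihood over the sum of the dense and sparse logits:

$$L = -\left(l_{i^*,j^*} + l^{\text{sparse}}_{i^*,j^*}\right) + \log \sum_{i,j} \exp\left(l_{i,j} + l^{\text{sparse}}_{i,j}\right),$$

where $l_{i,j}$ and $l^{\text{sparse}}_{i,j}$ are the dense and sparse logits of the phrase spanning the $i$-th to the $j$-th word, and $i^*, j^*$ denote the true start and end positions of the answer phrase. While the loss above is an unbiased estimator, in practice, we adopt early loss summation as suggested by Seo et al. (2019) for larger gradient signals in early training. Additionally, we also add a dense-only loss that omits the sparse logits (i.e. the original loss in Seo et al. (2019)) to the final loss, in which case we find that we obtain higher-quality dense phrase representations.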
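The negative log likelihood over summed dense and sparse logits can be sketched as follows. This is a minimal sketch, not the training code: it omits early loss summation and the extra dense-only term, masking phrases whose end precedes their start is an assumption made here, and `phrase_nll` is a hypothetical name.

```python
import numpy as np

def phrase_nll(dense_logits, sparse_logits, start, end):
    """Negative log likelihood of the gold (start, end) phrase.

    Both logit arrays have shape (N, N), indexed by [start, end]. Entries
    with end < start are masked out as invalid phrases (an assumption).
    """
    logits = dense_logits + sparse_logits
    valid = np.triu(np.ones_like(logits, dtype=bool))
    masked = np.where(valid, logits, -np.inf)
    m = masked.max()
    log_z = m + np.log(np.exp(masked - m).sum())   # stable log-sum-exp
    return log_z - logits[start, end]

rng = np.random.default_rng(0)
N = 4
dense_logits = rng.standard_normal((N, N))
# The sparse logit is a start-word term plus an end-word term.
start_term, end_term = rng.random(N), rng.random(N)
sparse_logits = start_term[:, None] + end_term[None, :]

loss = phrase_nll(dense_logits, sparse_logits, start=1, end=2)
```

Since the sparse logit decomposes into a start term and an end term, it can be formed by broadcasting two length-N vectors instead of scoring every phrase separately.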

Negative Sampling

We train our model on SQuAD v1.1, which always has a positive paragraph that contains the answer. To learn robust phrase representations, we concatenate negative paragraphs to the original SQuAD paragraphs. To each paragraph, we concatenate the paragraph that was paired with the question whose dense representation is most similar to the original dense question representation, following Seo et al. (2019). Note the difference, however, that we concatenate the negative example instead of considering it as an independent example with a no-answer option (Levy et al., 2017). During training, we find that adding tf-idf matching scores to the word-level logits of the negative paragraphs further improves the quality of the sparse representations, as our sparse models have to give stronger attention to positively related words in this biased setting.
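The negative paragraph selection can be sketched as a nearest-neighbor lookup over question vectors. This is an illustrative sketch with toy data; inner product is assumed as the similarity measure, and `pick_negative`, `paragraphs`, and `question_vecs` are hypothetical names.

```python
import numpy as np

def pick_negative(question_vecs: np.ndarray, idx: int) -> int:
    """Index of the training question most similar to question idx, excluding itself."""
    sims = question_vecs @ question_vecs[idx]
    sims[idx] = -np.inf           # never pick the question itself
    return int(np.argmax(sims))

# Toy data: paragraph k is the one paired with question k.
paragraphs = ["paragraph A", "paragraph B", "paragraph C"]
question_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

neg = pick_negative(question_vecs, 0)
# The negative paragraph is appended to the positive one, not used on its own.
training_input = paragraphs[0] + " " + paragraphs[neg]
```

This sketch does not handle the case where the most similar question shares the same paragraph, which a real data pipeline would need to filter out.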

5 Experiments

5.1 Datasets

We use two popular open-domain QA datasets to evaluate our model: SQuAD-open and CuratedTrec. SQuAD-open is the open-domain version of SQuAD (Rajpurkar et al., 2016). We use 87,599 examples with the golden evidence paragraph to train our encoders, and use 10,570 QA pairs from the development set to test our model, as suggested by Chen et al. (2017). CuratedTrec consists of question-answer pairs from TREC QA (Voorhees et al., 1999) curated by Baudiš and Šedivỳ (2015). The queries mostly come from search engine logs generated by real users. We use 694 test set QA pairs for testing our model. Note that we only train on SQuAD and test on both SQuAD-open and CuratedTrec, relying on the generalization ability of our model for zero-shot inference on CuratedTrec. This is in clear contrast to previous work that utilizes distant supervision (Chen et al., 2017) or weak supervision (Lee et al., 2018; Min et al., 2019) on CuratedTrec.

5.2 Implementation Details

We use and finetune BERT for our encoders. We use the BERT vocabulary, which has 30522 unique tokens based on byte pair encodings. As a result, the sparse dimension is $F \approx 30\text{K}$ when using unigram features, and $F \approx 1\text{B}$ when using both uni/bigram features. We do not finetune the word embedding during training. We pre-compute and store all encoded phrase representations of all documents in Wikipedia (more than 5 million documents). It takes 600 GPU hours to index all phrases in Wikipedia. Each phrase representation has $2d^{se} + 2F + 1$ dimensions (dense start and end vectors of size $d^{se}$, a scalar coherency term, and start and end sparse vectors of size $F$). We use the same storage reduction and search techniques as Seo et al. (2019). For storage, the total size of the index is 1.3 TB including unigram and bigram sparse representations. For search, we either perform dense search first and then rerank with sparse scores (DFS) or perform sparse search first and rerank with dense scores (SFS), and also consider a combination of both (Hybrid).
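The dense-first search (DFS) strategy can be sketched as a two-stage ranking. This is a simplified sketch on toy scores: combining the two scores by summation is assumed because the inner product of concatenated dense and sparse vectors is the sum of the two parts, and `dense_first_search` is a hypothetical name. SFS and Hybrid are not shown, since their details are not given here.

```python
import numpy as np

def dense_first_search(dense_scores, sparse_scores, k=10):
    """Take the top-k phrases by dense score, then rerank them by dense + sparse."""
    candidates = np.argsort(-dense_scores)[:k]
    combined = dense_scores[candidates] + sparse_scores[candidates]
    return candidates[np.argsort(-combined)]

# Toy scores for four candidate phrases.
dense_scores = np.array([0.9, 0.8, 0.1, 0.7])
sparse_scores = np.array([0.0, 0.5, 3.0, 0.1])

ranked = dense_first_search(dense_scores, sparse_scores, k=3)
```

In this toy example phrase 2 has the highest combined score but never reaches the reranking stage because its dense score is low, which illustrates why the order of the two stages matters.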

Constraining Sparse Encoding

We find that injecting some prior domain knowledge to avoid spurious matchings allows us to learn better sparse representations. For example, in machine reading comprehension, answer phrases do not occur as an exact phrase in questions (e.g., the phrase “August 4, 1961” would not appear in the question “When was Barack Obama born?”). Therefore, we can zero the diagonal elements of the rectified attention matrix in Equation 2, which correspond to the attention values of the target phrase itself. Additionally, we mask special tokens in BERT such as [SEP] or [CLS] to have zero weights, as matching these tokens means nothing.
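Both constraints amount to zeroing entries of the attention weight matrix before it is mapped to n-gram features. The following is a minimal sketch, assuming an (N, N) weight matrix whose columns correspond to the attended tokens; `constrain_attention` is a hypothetical helper name.

```python
import numpy as np

def constrain_attention(A, tokens, special=("[CLS]", "[SEP]")):
    """Zero the diagonal and the columns of special tokens in an (N, N) weight matrix."""
    A = A.copy()                       # leave the caller's matrix untouched
    np.fill_diagonal(A, 0.0)           # a word does not attend to itself
    special_cols = [j for j, tok in enumerate(tokens) if tok in special]
    A[:, special_cols] = 0.0           # special tokens contribute no n-gram weight
    return A

tokens = ["[CLS]", "barack", "obama", "[SEP]"]
A = np.ones((4, 4))
A_masked = constrain_attention(A, tokens)
```

Only columns are masked, so the row of [CLS] remains usable as the query that represents a whole question.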

5.3 Results

| Model | SQuAD-open EM | SQuAD-open F1 | CuratedTrec EM | s/Q |
|---|---|---|---|---|
| DrQA (Chen et al., 2017) | 29.8** | - | 25.4* | 35 |
| R³ (Wang et al., 2018a) | 29.1 | 37.5 | 28.4* | - |
| Paragraph Ranker (Lee et al., 2018) | 30.2 | - | 35.4* | - |
| Multi-Step-Reasoner (Das et al., 2019) | 31.9 | 39.2 | - | - |
| BERTserini (Yang et al., 2019) | 38.6 | 46.1 | - | 115 |
| ORQA (Lee et al., 2019) | 20.2 | - | 30.1 | - |
| Multi-passage BERT (Wang et al., 2019) | 53.0 | 60.9 | - | - |
| DenSPI (Seo et al., 2019) | 36.2 | 44.4 | 31.6 | 0.81 |
| DenSPI + CoSPR (Ours) | 40.7 | 49.0 | 35.7 | 1.19 |

  * Trained on distantly supervised training data.

  ** Trained on multiple datasets.

  • No supervision using target training data.

  • Concurrent work.

Table 1: Results on open-domain question answering datasets. EM denotes exact match and s/Q denotes seconds per query.

We evaluate the effectiveness of CoSPR by augmenting DenSPI (Seo et al., 2019) with contextualized sparse representations (DenSPI+CoSPR). We extensively compare the model with the original DenSPI and previous pipeline-based QA models.

Open-Domain QA Experiments

Table 1 shows experimental results on two open-domain question answering datasets. For DenSPI and ours, we use Hybrid search strategy. On both datasets, our model with contextualized sparse representations (DenSPI + CoSPR) significantly improves the performance of the phrase-indexing baseline model (DenSPI). Moreover, our model outperforms BERT-based pipeline approaches such as BERTserini (Yang et al., 2019) while being almost two orders of magnitude faster. We expect much bigger speed gaps between ours and other pipeline methods as most of them put additional complex components to the original pipelined methods.

On CuratedTrec, which is constructed from real user queries, our model also achieves the state-of-the-art performance. Even though our model is only trained on SQuAD (i.e. zero-shot), it outperforms all other models which are either distant- or semi-supervised with at least 29x faster inference. Note that our method is orthogonal to end-to-end training (Lee et al., 2019) or weak supervision (Min et al., 2019) methods and future work can potentially benefit from these.

Ablation Study

Table 2 (left) shows the effect of contextualized sparse representations by comparing different variants of our method on SQuAD. For these evaluations, we use subsets of the Wikipedia dump (1/100 and 1/10 scale, approximately 50K and 500K documents, respectively). We evaluate the effect of removing (1) CoSPR, (2) tf-idf sparse representations, and (3) the tf-idf bias in negative sample training (Section 4.3), as well as (4) adding trigram features to our sparse representation.

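| Model | SQuAD 1/100 scale | SQuAD 1/10 scale |
|---|---|---|
| Ours (DFS) | 60.0 | 51.6 |
| − CoSPR | 55.9 (−4.1) | 48.4 (−3.2) |
| − doc./para. tf-idf | 58.1 (−1.9) | 50.9 (−0.7) |
| − tf-idf bias training | 58.1 (−1.9) | 49.1 (−2.5) |
| + trigram sparse | 58.0 (−2.0) | 49.8 (−1.8) |

| Search strategy | SQuAD: DenSPI | SQuAD: + CoSPR | CuratedTrec: DenSPI | CuratedTrec: + CoSPR |
|---|---|---|---|---|
| SFS | 33.3 | 36.9 (+3.6) | 28.8 | 30.0 (+1.2) |
| DFS | 28.5 | 34.4 (+5.9) | 29.5 | 34.3 (+4.8) |
| Hybrid | 36.2 | 40.7 (+4.5) | 31.6 | 35.7 (+4.1) |

Table 2: Ablations of our model. On the left (first table), we show how each sparse representation contributes to the performance, and on the right (second table), we show how much gain we obtain from CoSPR in different search strategies. Exact match scores are reported.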

One interesting observation is that adding trigram features in our sparse representations is worse than using uni-/bigram representations only. We suspect that the model becomes too dependent on trigram features, which means we might need a stronger regularization for high-order n-gram features.

In Table 2 (right), we show how we consistently improve over DenSPI when CoSPR is added in different search strategies. Note that on SQuAD, SFS is better than DFS, as questions in SQuAD are created by turkers given particular paragraphs, whereas on CuratedTrec, where the questions more closely resemble real user queries, DFS outperforms SFS, showing the effectiveness of dense search when not knowing which documents to read. A similar observation was made by Lee et al. (2019), who showed different performance tendencies between the two datasets, but as we are using both dense and sparse representations, our model can achieve state-of-the-art performance on both datasets.

| Context (phrase), Question | Sparse rep. | Top n-gram (weight) |
|---|---|---|
| Context: Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres. | tf-idf | amazon rose (9.65), …, 1991 (2.26), 2000 (1.85) |
| | CoSPR | 1991 (1.65), amazon (0.82), …, 2000 (0.28) |
| Context: Amongst these include 5 University of California campuses (…); 12 California State University campuses (…); | tf-idf | 12 california (8.80), 5 university (8.04) |
| | CoSPR | california state (1.94), state university (1.77) |
| Question: Who did Newton get a pass to in the Panther starting plays of Super Bowl 50? | tf-idf | starting plays (9.31), panther starting (9.14) |
| | CoSPR | 50 (2.20), newton (1.62), super bowl (1.52) |

Table 3: Analysis of CoSPR compared to static tf-idf vectors. Given a context or a question (left), we list top n-grams weighted by tf-idf or CoSPR (right).
| Question | Model | [Document title] Context |
|---|---|---|
| Q: When is Independence Day? | DrQA | [Independence Day (1996 film)] Independence Day is a 1996 American science fiction … |
| | DrQA | [Independence Day: Resurgence] It is the sequel to the 1996 film “Independence Day”. |
| | DenSPI+CoSPR | [Independence Day (India)] … is annually observed on 15 August as a national holiday in India. |
| | DenSPI+CoSPR | [Independence Day (Pakistan)] (…), observed annually on 14 August, is a national holiday … |
| Q: What was the GDP of South Korea in 1950? | DrQA | [Economy of South Korea] In 1980, the South Korean GDP per capita was $2,300. |
| | DenSPI | [Economy of South Korea] In 1980, the South Korean GDP per capita was $2,300. |
| | DenSPI+CoSPR | [Developmental State] South Korea’s GDP grew from $876 in 1950 to $22,151 in 2010. |

Table 4: Prediction samples from DrQA, DenSPI, and DenSPI+CoSPR. Each sample shows [document title], context, and predicted answer.

5.4 Analysis

Interpretability of Sparse Representations

Sparse representations often have better interpretability than dense representations as each dimension of a sparse vector corresponds to a specific word. We compare tf-idf vectors and CoSPR (uni/bigram) by showing top weighted n-grams in each representation. Note that the scale of weights in tf-idf vectors is normalized in open-domain setups to match the scale between tf-idf vectors and dense vectors. We observe that tf-idf vectors usually assign high weights on infrequent (often meaningless) n-grams, while CoSPR focuses on contextually important entities such as 1991 for 415,000 or california state, state university for 12. Our sparse question representation also learns meaningful n-gram weights compared to tf-idf vectors.

Prediction Samples

Table 4 shows the outputs of three OpenQA models: DrQA (Chen et al., 2017), DenSPI (Seo et al., 2019), and our DenSPI+CoSPR. DenSPI+CoSPR is able to retrieve various correct answers from different documents, and it often correctly answers questions with specific dates or numbers where DenSPI does not, showing the effectiveness of the learned sparse representations.

6 Conclusion

In this paper, we demonstrate the effectiveness of contextualized sparse vectors, CoSPR, for encoding phrases with rich lexical information in open-domain question answering. We efficiently train our sparse representations by kernelizing the sparse inner product space. Experimental results show that our fast open-domain QA model, which augments the previous model (DenSPI) with the learned sparse representation (CoSPR), outperforms previous open-domain QA models, including recent BERT-based pipeline models, with two orders of magnitude faster inference time. Future work includes extending the contextualized sparse representations to other challenging QA settings such as multi-hop reasoning and building a full inverted index of the learned sparse representations (Zamani et al., 2018) for more powerful Sparse-First Search (SFS).


References

  • P. Baudiš and J. Šedivỳ (2015) Modeling of the question answering task in the YodaQA system. In CLEF.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to answer open-domain questions. In ACL.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
  • C. Cortes and V. Vapnik (1995) Support-vector networks. Machine Learning.
  • R. Das, S. Dhuliawala, M. Zaheer, and A. McCallum (2019) Multi-step retriever-reader interaction for scalable open-domain question answering. In ICLR.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • M. Faruqui, Y. Tsvetkov, D. Yogatama, C. Dyer, and N. A. Smith (2015) Sparse overcomplete word vector representations. In ACL-IJCNLP.
  • H. Hajishirzi, W. Yih, and A. Kolcz (2010) Adaptive near-duplicate detection via similarity learning. In SIGIR.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In ACL.
  • B. Kratzwald, A. Eigenmann, and S. Feuerriegel (2019) RankQA: neural question answering with answer re-ranking. In ACL.
  • J. Lee, S. Yun, H. Kim, M. Ko, and J. Kang (2018) Ranking paragraphs for improving answer recall in open-domain question answering. In EMNLP.
  • K. Lee, M. Chang, and K. Toutanova (2019) Latent retrieval for weakly supervised open domain question answering. In ACL.
  • K. Lee, S. Salant, T. Kwiatkowski, A. Parikh, D. Das, and J. Berant (2017) Learning recurrent span representations for extractive question answering. In ICLR.
  • O. Levy, M. Seo, E. Choi, and L. Zettlemoyer (2017) Zero-shot relation extraction via reading comprehension. In CoNLL.
  • Y. Lin, H. Ji, Z. Liu, and M. Sun (2018) Denoising distantly supervised open-domain question answering. In ACL.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS.
  • S. Min, D. Chen, H. Hajishirzi, and L. Zettlemoyer (2019) A discrete hard EM approach for weakly supervised question answering. In EMNLP.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In ACL.
  • M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2017) Bidirectional attention flow for machine comprehension. In ICLR.
  • M. Seo, T. Kwiatkowski, A. Parikh, A. Farhadi, and H. Hajishirzi (2018) Phrase-indexed question answering: a new challenge for scalable document comprehension. In EMNLP.
  • M. Seo, J. Lee, T. Kwiatkowski, A. P. Parikh, A. Farhadi, and H. Hajishirzi (2019) Real-time open-domain question answering with dense-sparse phrase index. In ACL.
  • A. Subramanian, D. Pruthi, H. Jhamtani, T. Berg-Kirkpatrick, and E. Hovy (2018) SPINE: sparse interpretable neural embeddings. In AAAI.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS.
  • E. M. Voorhees et al. (1999) The TREC-8 question answering track report. In TREC.
  • S. Wang and J. Jiang (2017) Machine comprehension using Match-LSTM and answer pointer. In ICLR.
  • S. Wang, M. Yu, X. Guo, Z. Wang, T. Klinger, W. Zhang, S. Chang, G. Tesauro, B. Zhou, and J. Jiang (2018a) R³: reinforced ranker-reader for open-domain question answering. In AAAI.
  • S. Wang, M. Yu, J. Jiang, W. Zhang, X. Guo, S. Chang, Z. Wang, T. Klinger, G. Tesauro, and M. Campbell (2018b) Evidence aggregation for answer re-ranking in open-domain question answering. In ICLR.
  • Z. Wang, P. Ng, X. Ma, R. Nallapati, and B. Xiang (2019) Multi-passage BERT: a globally normalized BERT model for open-domain question answering. In EMNLP.
  • D. Weissenborn, G. Wiese, and L. Seiffe (2017) Making neural QA as simple as possible but not simpler. In CoNLL.
  • W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin (2019) End-to-end open-domain question answering with BERTserini. In NAACL Demo.
  • H. Zamani, M. Dehghani, W. B. Croft, E. Learned-Miller, and J. Kamps (2018) From neural re-ranking to neural ranking: learning a sparse representation for inverted indexing. In CIKM.