Duet at TREC 2019 Deep Learning Track

12/10/2019 ∙ by Bhaskar Mitra, et al. ∙ Microsoft

This report discusses three submissions based on the Duet architecture to the Deep Learning track at TREC 2019. For the document retrieval task, we adapt the Duet model to ingest a "multiple field" view of documents—we refer to the new architecture as Duet with Multiple Fields (DuetMF). A second submission combines the DuetMF model with other neural and traditional relevance estimators in a learning-to-rank framework and achieves improved performance over the DuetMF baseline. For the passage retrieval task, we submit a single run based on an ensemble of eight Duet models.




1 Introduction

The Duet architecture was proposed by Mitra et al. (2017) for document ranking. Fig. 7 of the original paper shows that the retrieval effectiveness of the model is still improving as the size of the training data approaches samples. The training data employed in that paper is a proprietary dataset from Bing. A similar plot was later reproduced on a public benchmark by Nanni et al. (2017), but in the context of a passage ranking dataset with synthetic queries. Variations of the Duet model (Mitra and Craswell, 2019; Mitra et al., 2019; Cohen et al., 2018) have since been evaluated on other public passage ranking datasets. However, the lack of large-scale training data has prevented the public evaluation of Duet for document ranking.

The deep learning track at TREC 2019 makes large training datasets, suitable for training deep models with a large number of learnable parameters, publicly available in the context of a document ranking task and a passage ranking task. We benchmark the Duet model on both tasks.

In the context of the document ranking task, we adapt the Duet model to ingest a “multiple field” view of the documents, based on findings from Zamani et al. (2018). We refer to this new architecture as Duet with Multiple Fields (DuetMF) in the paper. Furthermore, we combine the relevance estimates from DuetMF with several other traditional and neural retrieval methods in a learning-to-rank (LTR) (Liu, 2009) framework.

For the passage ranking task, we submit a single run based on an ensemble of eight Duet models. The architecture and the training scheme resemble those of the “Duet V2 (Ensembled)” baseline listed on the MS MARCO leaderboard (http://www.msmarco.org/leaders.aspx).

2 TREC 2019 deep learning track

The TREC 2019 deep learning track introduces two tasks: a document retrieval task and a passage retrieval task. For both tasks, participants are provided a set of candidates (documents and passages, respectively) per query that should be ranked. Participants can choose to either rerank the provided candidates or retrieve from the full collection. For the passage retrieval task, the track reuses the set of K+ manually-assessed binary training labels released as part of the Microsoft Machine Reading COmprehension (MS MARCO) challenge (Bajaj et al., 2016). For the document retrieval task, the passage-level labels are transferred to their corresponding source documents, producing a training dataset of size close to 400K labels. For evaluation, a shared test set of queries is provided for both tasks, of which two different overlapping sets of queries were later selected for manual NIST assessments corresponding to the two tasks. Full details of all datasets are available on the track website (https://microsoft.github.io/TREC-2019-Deep-Learning/) and in the track overview paper (Craswell et al., 2019).

3 Methods and results

The Duet model proposed by Mitra et al. (2017) employs two deep neural networks trained jointly towards a retrieval task: the “distributed” sub-model learns useful representations of text for matching and the “local” sub-model estimates relevance based on patterns of exact term matches between query and document. Mitra and Craswell (2019) propose several modifications to the original Duet model that show improved performance on the MS MARCO passage ranking challenge. We adopt the updated Duet model from Mitra and Craswell (2019) and incorporate additional modifications, in particular to consider multiple fields for the document retrieval task. Table 1 summarizes the official evaluation results for all three runs.

Table 1: Official TREC results. The recall metric is computed at position 100 for the document retrieval task and at position 1000 for the passage retrieval task.

Run description            Run ID           Subtask   MRR    NDCG@10  MAP    Recall
Document retrieval task
LTR w/ DuetMF as feature   ms_ensemble      fullrank  0.876  0.578    0.237  0.368
DuetMF model               ms_duet          rerank    0.810  0.533    0.229  0.387
Passage retrieval task
Ensemble of 8 Duet models  ms_duet_passage  rerank    0.806  0.614    0.348  0.694
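To make two of the metrics in Table 1 concrete, the following is a minimal sketch of the reciprocal rank and recall at a cutoff for a single query. The function names and toy data are hypothetical, and the official numbers come from the track's own evaluation, which may differ in details such as how graded labels are thresholded.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k positions."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Toy example (made-up identifiers): one query, four ranked candidates.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
rr = reciprocal_rank(ranking, relevant)      # first relevant item is at rank 2
rec = recall_at_k(ranking, relevant, k=3)    # one of three relevant items in top 3
```

MRR is then the mean of the per-query reciprocal ranks over all evaluated queries.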

Duet model with Multiple Fields (DuetMF) for document ranking.

Zamani et al. (2018) study neural ranking models in the context of documents with multiple fields. In particular, they make the following observations:

  Obs. 1: It is more effective to summarize the match between query and individual document fields by a vector, as opposed to a single score, before aggregating to estimate full document relevance to the query.

  Obs. 2: It is better to learn different query representations corresponding to each document field under consideration.

  Obs. 3: Structured dropout (e.g., field-level dropout) is effective for regularization during training.

We incorporate all of these ideas to modify the Duet model from Mitra and Craswell (2019). The updated model is shown in Fig. 1.


Figure 1: The modified Duet model (DuetMF) that considers multiple document fields.

Documents in the deep learning track dataset contain three text fields: URL, title, and body. We employ the Duet architecture to match the query against each individual document field. In line with Obs. 1 from Zamani et al. (2018), the field-specific Duet architecture outputs a vector instead of a single score. Based on Obs. 2, we do not share the parameters of the Duet architectures between the field-specific instances. Following Obs. 3, we introduce structured dropout at different stages of the model. We randomly drop out each of the local sub-models for of the training samples. Similarly, we also drop out different combinations of field-level models uniformly at random, taking care that at least one field-level model is always retained.

We consider the first terms for queries and for document URLs and titles. For document body text, we consider the first terms. Similar to Mitra and Craswell (2019), we employ pretrained word embeddings as the input text representation for the distributed sub-models. We train the word embeddings using a standard word2vec (Mikolov et al., 2013) implementation in FastText (Joulin et al., 2016) on a combination of the MS MARCO document corpus and training queries. Similar to previous work (Mitra et al., 2017; Mitra and Craswell, 2019), the query and document field embeddings are learned by deep convolutional-pooling layers. We set the hidden layer size at all stages of the model to and dropout rate for different layers to .

For training, we employ the RankNet loss (Burges et al., 2005) over triples of a query, a positively labeled document, and a negative document, and the Adam optimizer (Kingma and Ba, 2014), with a minibatch size of and a learning rate of . We sample the negative document uniformly at random from the top candidates provided that are not positively labeled. When employing structured dropout, the same sub-models are masked for both the positive and the negative document.
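The field-level dropout and the pairwise loss described above can be illustrated with a small sketch. This is not the DuetMF implementation: the names `sample_field_mask`, `masked_score`, and `ranknet_loss` are hypothetical, each field is reduced to a single made-up score (the model itself produces a vector per field and aggregates with learned layers), and "uniformly at random" is interpreted here as uniform over all non-empty subsets of fields, which is an assumption.

```python
import math
import random

FIELDS = ("url", "title", "body")


def sample_field_mask(rng, fields=FIELDS):
    """Sample which fields to keep, uniformly over all non-empty subsets.

    Fair coin flips per field plus rejection of the all-dropped outcome
    guarantee that at least one field-level model is retained.
    """
    while True:
        mask = {field: rng.random() < 0.5 for field in fields}
        if any(mask.values()):
            return mask


def masked_score(field_scores, mask):
    """Sum the scores of retained fields (a stand-in for learned aggregation)."""
    return sum(score for field, score in field_scores.items() if mask[field])


def ranknet_loss(score_pos, score_neg):
    """RankNet pairwise loss log(1 + exp(-(s_pos - s_neg))), computed stably."""
    diff = score_pos - score_neg
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


rng = random.Random(0)
mask = sample_field_mask(rng)

# Made-up per-field scores for a positive and a negative document.
pos_scores = {"url": 0.2, "title": 1.1, "body": 0.7}
neg_scores = {"url": 0.1, "title": 0.3, "body": 0.9}

# The same mask is applied to both documents of the training triple.
loss = ranknet_loss(masked_score(pos_scores, mask), masked_score(neg_scores, mask))
```

Sharing one mask across the positive and negative document keeps the two scores comparable within a triple, which matches the masking behavior stated in the text.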
In light of the recent success of large pretrained language models, e.g., Nogueira and Cho (2019), we also experiment with an unsupervised pretraining scheme using the MS MARCO document collection. The pretraining is performed over triples of a pseudo-query, a positive document, and a negative document, where both documents are randomly sampled from the collection and the pseudo-query is generated by picking the URL or the title of the positive document at random (with equal probability) and masking the corresponding field on the document side for both documents. We see faster convergence during supervised training when the DuetMF model is pretrained in this fashion on the MS MARCO document collection. We posit that a more formal study should be performed in the future on pretraining Duet models on large collections, such as Wikipedia and the BookCorpus (Zhu et al., 2015).
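The pseudo-query construction used for pretraining can be sketched as follows. The dictionary layout of a document, the function name `make_pretraining_triple`, and the choice to represent a masked field as an empty string are all assumptions made for illustration; the source does not specify these details.

```python
import random


def make_pretraining_triple(collection, rng):
    """Build (pseudo_query, positive_doc, negative_doc, masked_field).

    Two distinct documents are sampled at random. The URL or title of the
    first one (chosen with equal probability) becomes the pseudo-query, and
    that same field is blanked out on both documents.
    """
    d_pos, d_neg = rng.sample(collection, 2)
    field = rng.choice(["url", "title"])
    pseudo_query = d_pos[field]
    masked_pos = {**d_pos, field: ""}
    masked_neg = {**d_neg, field: ""}
    return pseudo_query, masked_pos, masked_neg, field


# Toy collection with made-up documents.
collection = [
    {"url": "example.org/a", "title": "Alpha page", "body": "text about alpha"},
    {"url": "example.org/b", "title": "Beta page", "body": "text about beta"},
    {"url": "example.org/c", "title": "Gamma page", "body": "text about gamma"},
]
rng = random.Random(7)
pseudo_query, doc_pos, doc_neg, field = make_pretraining_triple(collection, rng)
```

Masking the chosen field on the document side prevents the model from solving the pretraining task by trivially matching the pseudo-query against its own source field.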

Learning-to-rank model for document ranking.

We train a neural LTR model with two hidden layers, each with hidden nodes. The LTR run reranks a set of document candidates retrieved by query likelihood (QL) (Ponte and Croft, 1998) with Dirichlet smoothing () (MacKay and Peto, 1995). Several ranking algorithms based on neural and inference networks act as features: DuetMF, Sequential Dependence Model (SDM) (Metzler and Croft, 2005), Pseudo-Relevance Feedback (PRF) (Lavrenko and Croft, 2001; Lavrenko, 2008), BM25 (Robertson et al., 2009), and Dual Embedding Space Model (DESM) (Nalisnick et al., 2016; Mitra et al., 2016). We employ SDM with an order of , combine weight of , ordered window weight of , and an unordered window weight of as our base candidate scoring function. We use these parameters to retrieve from the target corpus as well as auxiliary corpora of English language Wikipedia (enwiki-20180901-pages-articles-multistream.xml.bz2) and LDC Gigaword (LDC2011T07). For PRF, the initial retrievals (from the target, Wikipedia, or Gigaword corpora) adopt the SDM parameters above, but are used to rank 75-word passages with a 25-word overlap. These passages are then interpolated using the top passages and standard relevance modeling techniques, from which we select the top words to use as an expanded query for the final ranking of the target candidates. We do not explicitly adopt RM3 (Abdul-Jaleel et al., 2004) because our LTR model implicitly combines our initial retrieval score and the score from the expanded query. All code for the SDM and PRF feature computation is available at https://github.com/diazf/indri.
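The 75-word passages with a 25-word overlap used for PRF correspond to a sliding window with a stride of 50 words. The sketch below shows one way to produce such windows. It assumes whitespace tokenization and a hypothetical function name `sliding_passages`; the actual passage extraction in the authors' Indri-based pipeline may tokenize and handle the final window differently.

```python
def sliding_passages(text, size=75, overlap=25):
    """Split text into word windows of `size` words sharing `overlap` words."""
    words = text.split()
    stride = size - overlap  # 50 words for the 75/25 setting
    passages = []
    for start in range(0, len(words), stride):
        passages.append(" ".join(words[start:start + size]))
        # Stop once a window reaches the end of the text.
        if start + size >= len(words):
            break
    return passages


# Toy document of 200 distinct tokens: windows start at words 0, 50, 100, 150.
doc = " ".join(f"w{i}" for i in range(200))
passages = sliding_passages(doc)
```

In this sketch the last window may be shorter than 75 words when the document length is not aligned with the stride.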

We evaluate two different BM25 models with hyperparameters and . Corresponding to each of the DuetMF, SDM, PRF, and BM25 runs, we generate two features based on the score and the rank that the model predicts for a document w.r.t. the target query. We generate eight features by comparing the query against two different document fields (title and body) and using different DESM similarity estimates (INxIN, INxOUT, OUTxIN, OUTxOUT). Lastly, we add a couple of features based on query length and domain quality, where the latter is defined simply as the ratio between how often documents from a given domain appear in the positively labeled training data and in the overall document collection.
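The score-and-rank features and the domain quality feature can be sketched as below. The function names are hypothetical, domains are taken to be the network location of the URL (an assumption), and the ratio is computed over raw counts; normalizing both counts by their totals would only rescale every value by the same constant.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Extract a lowercased network location such as 'example.org'."""
    return urlparse(url).netloc.lower()


def domain_quality(positive_urls, collection_urls):
    """Per-domain ratio of positively labeled documents to all documents."""
    positives = Counter(domain_of(u) for u in positive_urls)
    overall = Counter(domain_of(u) for u in collection_urls)
    return {domain: positives[domain] / count for domain, count in overall.items()}


def score_and_rank_features(scores):
    """Map doc_id -> (score, rank), where rank 1 is the highest-scoring document."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc_id: (scores[doc_id], rank) for rank, doc_id in enumerate(ordered, start=1)}


# Toy data with made-up URLs and scores.
collection_urls = [
    "https://a.example.com/1",
    "https://a.example.com/2",
    "https://b.example.org/1",
    "https://b.example.org/2",
]
positive_urls = ["https://a.example.com/1"]
quality = domain_quality(positive_urls, collection_urls)
features = score_and_rank_features({"d1": 2.0, "d2": 3.5})
```

Each base ranker would contribute one such (score, rank) pair per candidate document to the LTR feature vector.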

Ensemble of Duet models for passage ranking.

For the passage ranking task, we adopt the exact same model and training procedure as Mitra and Craswell (2019). Our final submission is an ensemble of eight Duet models.
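This report does not state how the eight models' outputs are combined. As an illustration only, the sketch below assumes that each model scores every candidate passage and that the ensemble ranks by the mean score. The averaging rule and the function name `ensemble_rerank` are assumptions, not a description of the submitted run.

```python
def ensemble_rerank(per_model_scores):
    """Rank passages by their mean score across models, best first.

    `per_model_scores` is a list with one dict per model, each mapping
    passage_id -> score for the same set of candidate passages.
    """
    n_models = len(per_model_scores)
    passage_ids = per_model_scores[0].keys()
    mean_scores = {
        pid: sum(scores[pid] for scores in per_model_scores) / n_models
        for pid in passage_ids
    }
    return sorted(mean_scores, key=mean_scores.get, reverse=True)


# Toy example with three hypothetical models and two candidate passages.
model_outputs = [
    {"p1": 0.2, "p2": 0.9},
    {"p1": 0.6, "p2": 0.5},
    {"p1": 0.9, "p2": 0.1},
]
ranking = ensemble_rerank(model_outputs)
```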

4 Discussion and conclusion

One of the main goals of the deep learning track is to create a public reusable dataset for benchmarking the growing body of neural information retrieval literature (Mitra and Craswell, 2018). We submit three runs based on the Duet architecture for the two retrieval tasks (document and passage). Our main goal is to enrich the set of pooled documents for NIST assessments with documents that a Duet-based architecture is likely to rank highly. As a secondary goal, we are also interested in benchmarking Duet against other state-of-the-art neural and traditional methods. A more detailed comparison of the performance of these Duet runs with other TREC submissions is provided in the track overview paper (Craswell et al., 2019).


References

  • N. Abdul-Jaleel, J. Allan, W. B. Croft, F. Diaz, L. Larkey, X. Li, M. D. Smucker, and C. Wade (2004) UMass at TREC 2004: novelty and HARD.
  • P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, et al. (2016) MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender (2005) Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pp. 89–96.
  • D. Cohen, B. Mitra, K. Hofmann, and W. B. Croft (2018) Cross domain regularization for neural ranking models using adversarial learning. In Proc. SIGIR, pp. 1025–1028.
  • N. Craswell, B. Mitra, E. Yilmaz, and D. Campos (2019) Overview of the TREC 2019 deep learning track. In TREC (to appear).
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • V. Lavrenko and W. B. Croft (2001) Relevance based language models. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 120–127.
  • V. Lavrenko (2008) A generative theory of relevance. Vol. 26, Springer Science & Business Media.
  • T. Liu (2009) Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3 (3), pp. 225–331.
  • D. J. MacKay and L. C. B. Peto (1995) A hierarchical Dirichlet language model. Natural Language Engineering 1 (3), pp. 289–308.
  • D. Metzler and W. B. Croft (2005) A Markov random field model for term dependencies. In Proc. SIGIR, pp. 472–479.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Proc. NIPS, pp. 3111–3119.
  • B. Mitra and N. Craswell (2018) An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval (to appear).
  • B. Mitra and N. Craswell (2019) An updated Duet model for passage re-ranking. arXiv preprint arXiv:1903.07666.
  • B. Mitra, F. Diaz, and N. Craswell (2017) Learning to match using local and distributed representations of text for web search. In Proc. WWW, pp. 1291–1299.
  • B. Mitra, E. Nalisnick, N. Craswell, and R. Caruana (2016) A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137.
  • B. Mitra, C. Rosset, D. Hawking, N. Craswell, F. Diaz, and E. Yilmaz (2019) Incorporating query term independence assumption for efficient retrieval and ranking using deep neural networks. arXiv preprint arXiv:1907.03693.
  • E. Nalisnick, B. Mitra, N. Craswell, and R. Caruana (2016) Improving document ranking with dual word embeddings. In Proc. WWW.
  • F. Nanni, B. Mitra, M. Magnusson, and L. Dietz (2017) Benchmark for complex answer retrieval. In Proc. ICTIR.
  • R. Nogueira and K. Cho (2019) Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.
  • J. M. Ponte and W. B. Croft (1998) A language modeling approach to information retrieval. In Proc. SIGIR, pp. 275–281.
  • S. Robertson, H. Zaragoza, et al. (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3 (4), pp. 333–389.
  • H. Zamani, B. Mitra, X. Song, N. Craswell, and S. Tiwary (2018) Neural ranking models with multiple document fields. In Proc. WSDM, pp. 700–708.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27.