Every day, billions of non-English-speaking users interact with search engines; however, commercial retrieval systems have traditionally been tailored to English queries, causing an information access divide between those who can and those who cannot speak this language. Non-English search applications have been equally under-studied by most information retrieval researchers. Historically, ad-hoc retrieval systems have been primarily designed, trained, and evaluated on English corpora (e.g., [1, 5, 6, 23]). More recently, a new wave of supervised state-of-the-art ranking models has been proposed [11, 14, 21, 24, 26, 35, 37]; these models rely on neural architectures to rerank the head of search results retrieved by a traditional unsupervised ranking algorithm, such as BM25. Like previous ad-hoc ranking algorithms, these methods are almost exclusively trained and evaluated on English queries and documents.
The absence of rankers designed to operate on languages other than English can largely be attributed to a lack of suitable publicly available datasets. This particularly limits supervised ranking methods, as they require samples for training and validation. For English, previous research has relied on collections such as TREC Robust 2004, the 2009-2014 TREC Web Track, and MS MARCO. No datasets of similar size exist for other languages.
While most recent approaches have focused on ad-hoc retrieval for English, some researchers have studied the problem of cross-lingual information retrieval. Under this setting, document collections are typically in English, while queries are issued in several other languages; sometimes, the opposite setup is used. Throughout the years, several cross-lingual tracks were included as part of TREC. TREC 6, 7, and 8 offer queries in English, German, Dutch, Spanish, French, and Italian. For all three years, the document collection was kept in English. CLEF also hosted multiple cross-lingual ad-hoc retrieval tasks from 2000 to 2009. Early systems for these tasks leveraged dictionary and statistical translation approaches, as well as other indexing optimizations. More recently, approaches that rely on cross-lingual semantic representations (such as multilingual word embeddings) have been explored. For example, Vulić and Moens proposed BWESG, an algorithm to learn word embeddings on aligned documents that can be used to calculate document-query similarity. Sasaki et al. leveraged a dataset of Wikipedia pages in 25 languages to train a learning-to-rank algorithm for Japanese-English and Swahili-English cross-language retrieval. Litschko et al. proposed an unsupervised framework that relies on aligned word embeddings. Ultimately, while related, these approaches only benefit users who can understand documents in two or more languages, instead of directly tackling non-English document retrieval.
A few monolingual ad-hoc datasets exist, but most are too small to train a supervised ranking method. For example, TREC produced several non-English test collections: Spanish, Chinese Mandarin, and Arabic. Other languages were explored, but the document collections are no longer available. The CLEF initiative includes some non-English monolingual datasets, though these are primarily focused on European languages. Recently, Zheng et al. introduced Sogou-QCL, a large query log dataset in Mandarin. Such datasets are only available for languages that already have large, established search engines.
Inspired by the success of neural retrieval methods, this work focuses on the problem of monolingual ad-hoc retrieval in non-English languages using supervised neural approaches. In particular, to circumvent the lack of training data, we leverage transfer learning techniques to train Arabic, Mandarin, and Spanish retrieval models using English training data. In the past few years, transfer learning between languages has proven to be a remarkably effective approach for low-resource multilingual tasks (e.g., [16, 17, 29, 38]). Our model leverages a pre-trained multi-language transformer model to obtain encodings for queries and documents in different languages; at training time, this encoding is used to predict the relevance of query-document pairs in English. We evaluate our models in a zero-shot setting; that is, we use them to predict relevance scores for query-document pairs in languages never seen during training. By leveraging a pre-trained multilingual language model, which can be easily trained from abundant aligned or unaligned web text, we achieve competitive retrieval performance without having to rely on language-specific relevance judgments. During the peer review of this article, a preprint was published with similar observations to ours. In summary, our contributions are:
We study zero-shot transfer learning for IR in non-English languages.
We propose a simple yet effective technique that leverages contextualized word embeddings as a multilingual encoder for query and document terms. Our approach outperforms several baselines on multiple non-English collections.
We show that including additional in-language training samples may help further improve ranking performance.
We release our code for pre-processing, initial retrieval, training, and evaluation of non-English datasets (https://github.com/Georgetown-IR-Lab/multilingual-neural-ir). We hope that this encourages others to consider cross-lingual modeling implications in future work.
Zero-shot Multi-Lingual Ranking. Because large-scale relevance judgments are largely absent in languages other than English, we propose a new setting to evaluate learning-to-rank approaches: zero-shot cross-lingual ranking. This setting makes use of relevance data from one language that has a considerable amount of training data (e.g., English) for model training and validation, and applies the trained model to a different language for testing.
More formally, let S be a collection of relevance tuples in the source language, and T be a collection of relevance judgments from another language. Each relevance tuple (q, d, r) consists of a query, a document, and a relevance score, respectively. In typical evaluation environments, S is segmented into multiple splits for training (S_train) and testing (S_test), such that there is no overlap of queries between the two splits. A ranking algorithm is tuned on S_train to define the ranking function R(q, d), which is subsequently tested on S_test. We propose instead tuning a model on all data from the source language (i.e., training on all of S), and testing on a collection from the second language (T).
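The protocol above can be sketched in a few lines. The toy ranker and data below are placeholders for illustration, not the paper's model:

```python
from typing import Callable, List, Tuple

# A relevance tuple: (query, document, relevance score).
RelTuple = Tuple[str, str, int]

def zero_shot_eval(
    source: List[RelTuple],                 # all tuples S from the source language
    target: List[RelTuple],                 # tuples T from the target language
    fit: Callable[[List[RelTuple]], Callable[[str, str], float]],
) -> List[float]:
    """Tune a ranker on ALL source-language data, then score target pairs."""
    rank_fn = fit(source)                   # no target-language tuples are seen here
    return [rank_fn(q, d) for q, d, _ in target]

# Toy ranker: scores a document by term overlap with the query.
def fit_overlap(train):
    def score(q, d):
        return len(set(q.split()) & set(d.split()))
    return score

scores = zero_shot_eval(
    [("cats", "cats purr", 1)],             # "English" training tuples (S)
    [("gatos", "los gatos ronronean", 1)],  # "Spanish" test tuples (T)
    fit_overlap,
)
```

The key property is that `fit` never observes target-language tuples; the model is applied to T exactly as trained on S.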
Datasets. We evaluate on monolingual newswire datasets in three languages: Arabic, Mandarin, and Spanish. The Arabic collection consists of the newswire documents in LDC2001T55, and we use topics/relevance information from the 2001-02 TREC Multilingual track (25 and 50 topics, respectively). For Mandarin, we use news articles from LDC2000T52; Mandarin topics and relevance judgments come from TREC 5 and 6 (26 and 28 topics, respectively). Finally, the Spanish collection contains articles from LDC2000T51, and we use topics from TREC 3 and 4 (25 topics each). We use the topic titles, rather than the query descriptions, in all cases except TREC Spanish 4, for which only descriptions are provided; titles more closely resemble real user queries than descriptions do. (Some have observed that the context provided by query descriptions is valuable for neural ranking, particularly when using contextualized language models.) We test on these collections because they are the only non-English document collections available from TREC at this time (https://trec.nist.gov/data/docs_noneng.html).
We index the text content of each document using a modified version of Anserini with support for the languages we investigate. Specifically, we add Anserini support for Lucene's Arabic and Spanish light stemming and stop word lists (via ArabicAnalyzer and SpanishAnalyzer). We treat each character in Mandarin text as a single token.
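As a standalone illustration of the Mandarin tokenization rule described above (the actual indexing uses Anserini/Lucene analyzers; this sketch only mirrors the character-per-token behavior):

```python
def tokenize_mandarin(text: str) -> list:
    """Treat each CJK character as its own token; keep other runs intact.

    A simplified stand-in for the paper's Anserini/Lucene setup, which
    tokenizes Mandarin text character by character.
    """
    tokens, buf = [], []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":      # CJK Unified Ideographs block
            if buf:
                tokens.append("".join(buf))
                buf = []
            tokens.append(ch)
        elif ch.isspace():
            if buf:
                tokens.append("".join(buf))
                buf = []
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf))
    return tokens
```

For example, mixed-script text yields one token per Mandarin character and whole tokens for Latin-script words.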
Modeling. We explore the following ranking models:
Unsupervised baselines. We use the Anserini implementation of BM25, RM3 query expansion, and the Sequential Dependence Model (SDM) as unsupervised baselines. In the spirit of the zero-shot setting, we use the default parameters from Anserini (i.e., assuming no data in the target language).
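For reference, a minimal BM25 scorer; the k1 = 0.9 and b = 0.4 values match Anserini's defaults, though this simplified function is a sketch, not Anserini's exact Lucene implementation:

```python
import math

def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=0.9, b=0.4):
    """BM25 score of one document for a query.

    df maps term -> document frequency, N is the collection size, and
    avgdl is the average document length; k1/b default to Anserini's values.
    """
    score = 0.0
    dl = len(doc_terms)
    for t in set(query_terms):
        tf = doc_terms.count(t)
        if tf == 0 or t not in df:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```

A matching query term contributes a positive score; queries with no overlap score zero.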
KNRM uses learned Gaussian kernel pooling functions over the query-document similarity matrix to rank documents.
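A rough sketch of KNRM-style kernel pooling (simplified: a fixed kernel width and no learned ranking layer on top; the `mus` and `sigma` values are illustrative, not the paper's configuration):

```python
import numpy as np

def knrm_features(sim_matrix, mus, sigma=0.1):
    """Soft-match features from a query x document similarity matrix.

    Each Gaussian kernel (centered at a mu) softly counts how many
    similarity values fall near it; per-query-term counts are
    log-transformed and summed over query terms, KNRM-style.
    """
    sim = np.asarray(sim_matrix, dtype=float)[:, :, None]   # (q_len, d_len, 1)
    kernels = np.exp(-((sim - np.asarray(mus)) ** 2) / (2 * sigma ** 2))
    per_query = kernels.sum(axis=1)          # soft count per query term, per kernel
    return np.log1p(per_query).sum(axis=0)   # pool over query terms
```

In full KNRM, these features feed a learned linear layer that produces the final ranking score.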
We use the embedding layer output from the base-multilingual-cased BERT model for PACRR and KNRM. In pilot studies, we investigated using cross-lingual MUSE vectors and the contextualized output representations from BERT, but found the BERT embedding-layer outputs to be more effective.
Experimental Setup. We train and validate models using the TREC Robust 2004 collection, which contains 249 topics with English documents and relevance judgments (folds 1-4 for training, fold 5 for validation). Thus, the model is exposed only to English text during training and validation (though the embedding and contextualized language models are pretrained on large amounts of unlabeled text in the target languages). The validation set is used for parameter tuning and for selecting the optimal training epoch (via nDCG@20). We train using pairwise softmax loss with Adam.
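For a relevant/non-relevant document pair with scores s+ and s-, the pairwise softmax loss is -log(exp(s+) / (exp(s+) + exp(s-))). A numerically stable sketch:

```python
import math

def pairwise_softmax_loss(pos_score, neg_score):
    """-log( exp(s+) / (exp(s+) + exp(s-)) ), with max-subtraction for stability."""
    m = max(pos_score, neg_score)
    return -(pos_score - m) + math.log(
        math.exp(pos_score - m) + math.exp(neg_score - m)
    )
```

When the two scores tie, the loss is log 2; it decreases toward zero as the relevant document's score pulls ahead.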
We evaluate the performance of the trained models by re-ranking the top 100 documents retrieved with BM25. We report MAP, Precision@20, and nDCG@20 to gauge the overall performance of our approach, and the percentage of judged documents in the top 20 ranked documents (judged@20) to evaluate how suitable the datasets are to approaches that did not contribute to the original judgments.
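Sketches of the reported metrics, simplified for illustration (binary relevance for P@20; `None` marks an unjudged document):

```python
import math

def precision_at_k(ranked_rels, k=20):
    """ranked_rels: relevance per ranked doc, top-first; None = unjudged."""
    top = ranked_rels[:k]
    return sum(1 for r in top if r is not None and r > 0) / k

def judged_at_k(ranked_rels, k=20):
    """Fraction of the top k that received any judgment at all."""
    top = ranked_rels[:k]
    return sum(1 for r in top if r is not None) / k

def ndcg_at_k(ranked_rels, k=20):
    """nDCG with log2 discounting; unjudged documents count as non-relevant."""
    rels = [r or 0 for r in ranked_rels]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

judged@k is what makes older pooled test collections comparable for systems that did not contribute to the original judgment pools.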
Table 1: Re-ranking results (P@20, nDCG@20, MAP, and judged@20). Statistical significance is assessed with a paired t-test by query.

| Dataset / Model | P@20 | nDCG@20 | MAP | judged@20 |
|---|---|---|---|---|
| Arabic (TREC 2002) | | | | |
| BM25 + RM3 | 0.3320 | 0.3705 | 0.2641 | 95.1% |
| Vanilla BERT (multilingual) | 0.3790 | 0.4205 | 0.2876 | 97.4% |
| Arabic (TREC 2001) | | | | |
| BM25 + RM3 | 0.4700 | 0.5458 | 0.2903 | 85.6% |
| Vanilla BERT (multilingual) | 0.5240 | 0.5628 | 0.3432 | 91.0% |
| Mandarin (TREC 6) | | | | |
| BM25 + RM3 | 0.5019 | 0.5571 | 0.2696 | 75.6% |
| Vanilla BERT (multilingual) | 0.6615 | 0.6959 | 0.3589 | 92.7% |
| Mandarin (TREC 5) | | | | |
| BM25 + RM3 | 0.2768 | 0.3021 | 0.1698 | 64.6% |
| Vanilla BERT (multilingual) | 0.4589 | 0.5196 | 0.2906 | 92.0% |
| Spanish (TREC 4) | | | | |
| BM25 + RM3 | 0.3360 | 0.3358 | 0.2024 | 85.2% |
| Vanilla BERT (multilingual) | 0.4400 | 0.4898 | 0.1800 | 85.6% |
| Spanish (TREC 3) | | | | |
| BM25 + RM3 | 0.6100 | 0.6236 | 0.3887 | 93.0% |
| Vanilla BERT (multilingual) | 0.6400 | 0.6672 | 0.2623 | 90.8% |
We present the ranking results in Table 1. We first point out that there is considerable variability in the performance of the unsupervised baselines; in some cases, RM3 and SDM outperform BM25, whereas in others they under-perform. Similarly, the PACRR and KNRM neural models vary in effectiveness, though they more frequently perform much worse than BM25. This makes sense because these models capture matching characteristics that are specific to English. For instance, n-gram patterns captured by PACRR for English do not necessarily transfer well to languages with a different constituent order, such as Arabic (VSO instead of SVO). An interesting observation is that the Vanilla BERT model (which, recall, is tuned only on English text) generally outperforms the other approaches across all three test languages. This is particularly remarkable because it is a single trained model that is effective across all three languages, without any difference in parameters. The exceptions are the Arabic 2001 dataset, on which it performs only comparably to BM25, and the MAP results for Spanish. For Spanish, RM3 substantially improves recall (as evidenced by MAP), and since Vanilla BERT acts as a re-ranker atop BM25, it cannot take advantage of this improved recall, despite significantly improving the precision-focused metrics. In all cases, Vanilla BERT exhibits judged@20 above 85%, indicating that these test collections remain valuable for evaluation.
To test whether a small amount of in-language training data can further improve BERT ranking performance, we conduct an experiment that uses the other collection for each language as additional training data. The in-language samples are interleaved into the English training samples. Results for this few-shot setting are shown in Table 2. We find that the added topics for Arabic 2001 (+50) and Spanish 4 (+25) significantly improve performance. This results in a model significantly better than BM25 for Arabic 2001, which suggests that there may be substantial distributional differences between the English TREC 2004 training collection and the Arabic 2001 test collection. We back this up by training an "oracle" BERT model (trained on the test data) for Arabic 2001, which yields substantially better performance (P@20 = 0.7340, nDCG@20 = 0.8093, MAP = 0.4250).
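One possible interleaving scheme consistent with the description above; the paper does not specify the exact mixing ratio, so the every-second-sample rate here is purely illustrative:

```python
import itertools

def interleave(english_samples, in_language_samples):
    """Mix a small set of in-language training pairs into the English stream.

    The scarce in-language samples are cycled so that every window of the
    English training stream also sees some target-language supervision.
    """
    if not in_language_samples:
        return list(english_samples)
    in_lang = itertools.cycle(in_language_samples)
    mixed = []
    for i, sample in enumerate(english_samples):
        mixed.append(sample)
        if i % 2 == 1:                       # after every 2nd English sample
            mixed.append(next(in_lang))
    return mixed
```

Cycling avoids exhausting the small in-language set early in training, at the cost of repeating its samples.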
We introduced a zero-shot multilingual setting for the evaluation of neural ranking methods. This is an important setting due to the lack of training data available in many languages. We found that contextualized language models (namely, BERT) hold a clear advantage, and are generally better suited to cross-lingual transfer than prior models (which may rely more heavily on phenomena exclusive to English). We also found that additional in-language training data may improve performance, though not necessarily. By releasing our code and models, we hope that cross-lingual evaluation will become more commonplace.
- (2002) Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (TOIS) 20(4), pp. 357-389.
- (2016) MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
- (2000) Cross-language information retrieval (CLIR) track overview. In TREC.
- (2003) CLEF 2003 - overview of results. In Workshop of the Cross-Language Evaluation Forum for European Languages, pp. 44-63.
- (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129-136.
- (2012) A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR) 44(1), pp. 1.
- (2015) TREC 2014 Web Track overview. Technical report, University of Michigan, Ann Arbor.
- (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087.
- (2019) Deeper text understanding for IR with contextual neural language modeling. In SIGIR.
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171-4186.
- (2016) A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 55-64.
- (1995) Overview of the Third Text REtrieval Conference (TREC-3). DIANE Publishing.
- (1996) Overview of the Fourth Text REtrieval Conference (TREC-4). NIST Special Publication, pp. 1-24.
- (2017) PACRR: a position-aware neural IR model for relevance matching. arXiv preprint arXiv:1704.03940.
- (2014) Parameters learned in the comparison of retrieval models using term dependencies. Technical report.
- (2017) Google's multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339-351.
- (2017) Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2832-2838.
- (2015) Adam: a method for stochastic optimization. In ICLR.
- (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
- (2018) Unsupervised cross-lingual information retrieval using monolingual data only. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1253-1256.
- (2019) CEDR: contextualized embeddings for document ranking. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19), New York, NY, USA, pp. 1101-1104.
- (2019) Internet. https://ourworldindata.org/internet. Last accessed: 2019/09/15.
- (2005) A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05), New York, NY, USA, pp. 472-479.
- (2018) An introduction to neural information retrieval. Foundations and Trends in Information Retrieval 13(1), pp. 1-126.
- (2002) The TREC 2002 Arabic/English CLIR track. In TREC.
- (2018) Neural information retrieval: at the end of the early years. Information Retrieval Journal 21(2-3), pp. 111-182.
- (2012) Multilingual information retrieval: from research to practice. Springer Science & Business Media.
- (2018) Cross-lingual learning-to-rank with shared representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 458-463.
- (2019) Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3795-3805.
- (2019) Cross-lingual relevance transfer for document retrieval. arXiv preprint arXiv:1911.02989.
- (1998) The Sixth Text REtrieval Conference (TREC-6). In The Text REtrieval Conference (TREC), Vol. 500, pp. 240.
- (1996) Overview of the Fifth Text REtrieval Conference (TREC-5). In TREC, Vol. 97, pp. 1-28.
- (2005) Overview of the TREC 2005 Robust Retrieval Track. In TREC.
- (2015) Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15), New York, NY, USA, pp. 363-372.
- (2017) End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 55-64.
- (2018) Anserini: reproducible ranking baselines using Lucene. Journal of Data and Information Quality 10, pp. 16:1-16:20.
- (2019) Simple applications of BERT for ad hoc document retrieval. arXiv preprint arXiv:1903.10972.
- (2017) Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv preprint arXiv:1703.06345.
- (2015) The digital language divide. http://labs.theguardian.com/digital-language-divide/. Last accessed: 2019/09/15.
- (2018) Sogou-QCL: a new dataset with click relevance label. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1117-1120.