
Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning

While billions of non-English speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-English languages. This is primarily due to a lack of data sets suitable for training ranking algorithms. In this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents. Our models are evaluated in a zero-shot setting, meaning that we use them to predict relevance scores for query-document pairs in languages never seen during training. Our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Chinese Mandarin, and Spanish. We also show that augmenting the English training collection with some examples from the target language can sometimes improve performance.


1 Introduction

Every day, billions of non-English speaking users [22] interact with search engines; however, commercial retrieval systems have traditionally been tailored to English queries, causing an information access divide between those who can and those who cannot speak this language [39]. Non-English search applications have been equally under-studied by most information retrieval researchers. Historically, ad-hoc retrieval systems have been primarily designed, trained, and evaluated on English corpora (e.g., [1, 5, 6, 23]). More recently, a new wave of supervised state-of-the-art ranking models has been proposed by researchers [11, 14, 21, 24, 26, 35, 37]; these models rely on neural architectures to rerank the head of search results retrieved using a traditional unsupervised ranking algorithm, such as BM25. Like previous ad-hoc ranking algorithms, these methods are almost exclusively trained and evaluated on English queries and documents.

The absence of rankers designed to operate on languages other than English can largely be attributed to a lack of suitable publicly available data sets. This aspect particularly limits supervised ranking methods, as they require samples for training and validation. For English, previous research relied on English collections such as TREC Robust 2004 [33], the 2009-2014 TREC Web Track [7], and MS MARCO [2]. No datasets of similar size exist for other languages.

While most recent approaches have focused on ad-hoc retrieval for English, some researchers have studied the problem of cross-lingual information retrieval. Under this setting, document collections are typically in English, while queries are translated into several languages; sometimes, the opposite setup is used. Throughout the years, several cross-lingual tracks were included as part of TREC. TREC 6, 7, and 8 [3] offered queries in English, German, Dutch, Spanish, French, and Italian; for all three years, the document collection was kept in English. CLEF also hosted multiple cross-lingual ad-hoc retrieval tasks from 2000 to 2009 [4]. Early systems for these tasks leveraged dictionary and statistical translation approaches, as well as other indexing optimizations [27]. More recently, approaches that rely on cross-lingual semantic representations (such as multilingual word embeddings) have been explored. For example, Vulić and Moens [34] proposed BWESG, an algorithm to learn word embeddings on aligned documents that can be used to calculate document-query similarity. Sasaki et al. [28] leveraged a data set of Wikipedia pages in 25 languages to train a learning-to-rank algorithm for Japanese-English and Swahili-English cross-language retrieval. Litschko et al. [20] proposed an unsupervised framework that relies on aligned word embeddings. Ultimately, while related, these approaches only benefit users who can understand documents in two or more languages, rather than directly tackling non-English document retrieval.

A few monolingual ad-hoc data sets exist, but most are too small to train a supervised ranking method. For example, TREC produced several non-English test collections: Spanish [13], Chinese Mandarin [31], and Arabic [25]. Other languages were explored, but the document collections are no longer available. The CLEF initiative includes some non-English monolingual datasets, though these are primarily focused on European languages [4]. Recently, Zheng et al. [40] introduced Sogou-QCL, a large query log dataset in Mandarin. Such datasets are only available for languages that already have large, established search engines.

Inspired by the success of neural retrieval methods, this work focuses on studying the problem of monolingual ad-hoc retrieval in non-English languages using supervised neural approaches. In particular, to circumvent the lack of training data, we leverage transfer learning techniques to train Arabic, Mandarin, and Spanish retrieval models using English training data. In the past few years, transfer learning between languages has proven to be a remarkably effective approach for low-resource multilingual tasks (e.g., [16, 17, 29, 38]). Our model leverages a pre-trained multilingual transformer model to obtain an encoding for queries and documents in different languages; at training time, this encoding is used to predict the relevance of query-document pairs in English. We evaluate our models in a zero-shot setting; that is, we use them to predict relevance scores for query-document pairs in languages never seen during training. By leveraging a pre-trained multilingual language model, which can be easily trained from abundant aligned [19] or unaligned [8] web text, we achieve competitive retrieval performance without having to rely on language-specific relevance judgments. During the peer review of this article, a preprint [30] was published with similar observations to ours. In summary, our contributions are:

  • We study zero-shot transfer learning for IR in non-English languages.

  • We propose a simple yet effective technique that leverages contextualized word embeddings as a multilingual encoder for query and document terms. Our approach outperforms several baselines on multiple non-English collections.

  • We show that including additional in-language training samples may help further improve ranking performance.

  • We release our code for pre-processing, initial retrieval, training, and evaluation of non-English datasets (https://github.com/Georgetown-IR-Lab/multilingual-neural-ir). We hope that this encourages others to consider cross-lingual modeling implications in future work.

2 Methodology

Zero-shot Multi-Lingual Ranking. Because large-scale relevance judgments are largely absent in languages other than English, we propose a new setting to evaluate learning-to-rank approaches: zero-shot cross-lingual ranking. This setting makes use of relevance data from one language that has a considerable amount of training data (e.g., English) for model training and validation, and applies the trained model to a different language for testing.

More formally, let S be a collection of relevance tuples in the source language, and T be a collection of relevance judgments from another language. Each relevance tuple (q, d, r) consists of a query, a document, and a relevance score, respectively. In typical evaluation environments, S is segmented into multiple splits for training (S_train) and testing (S_test), such that there is no overlap of queries between the two splits. A ranking algorithm is tuned on S_train to define the ranking function R_{S_train}(q, d), which is subsequently tested on S_test. We propose instead tuning a model on all data from the source language (i.e., training R_S), and testing on a collection from the second language (T).
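The following minimal Python sketch restates this setup; the helpers train_ranker and evaluate and the tuple layout are hypothetical placeholders, not the authors' released code.

```python
# Minimal sketch of the zero-shot cross-lingual ranking setting described
# above. Relevance tuples are (query, document, relevance) triples;
# `train_ranker` and `evaluate` are hypothetical helpers.

def zero_shot_run(source_tuples, target_tuples, train_ranker, evaluate):
    # Conventional setting: S is split into S_train / S_test with disjoint
    # queries. Zero-shot setting: tune R_S on *all* source-language data ...
    ranker = train_ranker(source_tuples)
    # ... and test only on the target-language collection T.
    return evaluate(ranker, target_tuples)
```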

Datasets. We evaluate on monolingual newswire datasets in three languages: Arabic, Mandarin, and Spanish. The Arabic document collection (LDC2001T55) is paired with topics and relevance information from the 2001–02 TREC Multilingual track (25 and 50 topics, respectively). For Mandarin, we use news articles from LDC2000T52, with topics and relevance judgments from TREC 5 and 6 (26 and 28 topics, respectively). Finally, the Spanish collection contains articles from LDC2000T51, and we use topics from TREC 3 and 4 (25 topics each). We use the topics, rather than the query descriptions, in all cases except TREC Spanish 4, for which only descriptions are provided; the topics more closely resemble real user queries than descriptions (though some have observed that the context provided by query descriptions is valuable for neural ranking, particularly when using contextualized language models [9]). We test on these collections because they are the only non-English document collections available from TREC at this time (https://trec.nist.gov/data/docs_noneng.html).

We index the text content of each document using a modified version of Anserini with support for the languages we investigate [36]. Specifically, we add Anserini support for Lucene's Arabic and Spanish light stemming and stop word lists (via ArabicAnalyzer and SpanishAnalyzer). We treat each character in Mandarin text as a single token.
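As a concrete illustration of the Mandarin treatment (each character as a single token), here is a small Python sketch; the regex-based splitting is our own illustrative choice, not the modified Anserini analyzer itself.

```python
# Illustrative sketch: split runs of CJK characters into individual tokens,
# while leaving Latin/numeric chunks intact. This mirrors the "one character
# per token" treatment of Mandarin text described above.
import re

def tokenize_mandarin(text: str) -> list[str]:
    tokens = []
    for chunk in text.split():
        # First alternative matches a single CJK character; second matches
        # a maximal run of non-CJK characters (e.g., an English word).
        tokens.extend(re.findall(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff]+", chunk))
    return tokens

print(tokenize_mandarin("信息检索 evaluation 2002"))
# ['信', '息', '检', '索', 'evaluation', '2002']
```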

Modeling. We explore the following ranking models:


  • Unsupervised baselines. We use the Anserini [36] implementation of BM25, RM3 query expansion, and the Sequential Dependency Model (SDM) as unsupervised baselines. In the spirit of the zero-shot setting, we use the default parameters from Anserini (i.e., assuming no relevance data in the target language).

  • PACRR [14] models n-gram relationships in the text using learned 2D convolutions and max pooling atop a query-document similarity matrix.

  • KNRM [35] uses learned Gaussian kernel pooling functions over the query-document similarity matrix to rank documents.

  • Vanilla BERT [21] uses the BERT [10] transformer model, with a dense layer atop the classification token to compute a ranking score. To support multiple languages, we use the base-multilingual-cased pretrained weights. These weights were trained on Wikipedia text from 104 languages.

We use the embedding layer output from the base-multilingual-cased model for PACRR and KNRM. In pilot studies, we investigated using cross-lingual MUSE vectors [8] and the output representations from BERT, but found the BERT embeddings to be more effective.
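A hedged sketch of the Vanilla BERT ranker described above, using the Hugging Face transformers library: multilingual BERT encodes each query-document pair, and a dense layer over the [CLS] representation produces a relevance score. The class name, truncation length, and use of AutoModel are our illustrative choices, not the authors' exact implementation.

```python
# Sketch of a multilingual BERT re-ranker: score = Linear([CLS] encoding).
import torch
from transformers import AutoModel, AutoTokenizer

class VanillaMultilingualBERTRanker(torch.nn.Module):
    def __init__(self, model_name: str = "bert-base-multilingual-cased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.bert = AutoModel.from_pretrained(model_name)
        self.score = torch.nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, queries: list[str], documents: list[str]) -> torch.Tensor:
        # Encode each (query, document) pair jointly, truncating long documents.
        enc = self.tokenizer(queries, documents, padding=True, truncation=True,
                             max_length=512, return_tensors="pt")
        cls = self.bert(**enc).last_hidden_state[:, 0]  # [CLS] representation
        return self.score(cls).squeeze(-1)              # one score per pair
```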

Experimental Setup. We train and validate models using the TREC Robust 2004 collection [33]. TREC Robust 2004 contains 249 topics, along with documents and relevance judgments, in English (folds 1–4 from [15] for training, fold 5 for validation). Thus, the model is only exposed to English text in the training and validation stages (though the embedding and contextualized language models are trained on large amounts of unlabeled text in these languages). The validation dataset is used for parameter tuning and for the selection of the optimal training epoch (via nDCG@20). We train using pairwise softmax loss with Adam [18].
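A minimal PyTorch sketch of the pairwise softmax loss mentioned above; the batch layout (one relevant and one non-relevant document per query) and the learning rate are our assumptions.

```python
# Pairwise softmax loss: softmax over each (relevant, non-relevant) score
# pair, taking the negative log-probability assigned to the relevant document.
import torch

def pairwise_softmax_loss(pos_scores: torch.Tensor,
                          neg_scores: torch.Tensor) -> torch.Tensor:
    pairs = torch.stack([pos_scores, neg_scores], dim=1)  # shape [batch, 2]
    return -torch.log_softmax(pairs, dim=1)[:, 0].mean()

# Hypothetical training step (ranker, q, pos_docs, neg_docs are assumptions):
# optimizer = torch.optim.Adam(ranker.parameters(), lr=1e-5)
# loss = pairwise_softmax_loss(ranker(q, pos_docs), ranker(q, neg_docs))
# loss.backward(); optimizer.step()
```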

We evaluate the performance of the trained models by re-ranking the top 100 documents retrieved with BM25. We report MAP, Precision@20, and nDCG@20 to gauge the overall performance of our approach, and the percentage of judged documents in the top 20 ranked documents (judged@20) to evaluate how suitable the datasets are to approaches that did not contribute to the original judgments.
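The evaluation protocol can be illustrated with a short sketch: re-rank the BM25 top 100 by the neural scores, then measure judged@20 against the relevance judgments. The dictionary-based data structures here are assumptions, not the paper's actual pipeline.

```python
# Re-rank BM25 candidates with neural scores and compute judged@20, i.e. the
# fraction of the top-k re-ranked documents that appear in the judgments.
def rerank_and_judged_at_20(bm25_results, neural_scores, qrels, k=20):
    """bm25_results: {qid: [docid, ...]} (top 100 per query);
    neural_scores:  {(qid, docid): float};
    qrels:          {qid: {docid: relevance}}."""
    judged_fracs = []
    for qid, docids in bm25_results.items():
        reranked = sorted(docids, key=lambda d: neural_scores[(qid, d)],
                          reverse=True)
        judged = sum(1 for d in reranked[:k] if d in qrels.get(qid, {}))
        judged_fracs.append(judged / k)
    return sum(judged_fracs) / len(judged_fracs)
```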

Ranker P@20 nDCG@20 MAP judged@20
Arabic (TREC 2002) [25]
BM25 0.3470 0.3863 0.2804 99.0%
BM25 + RM3 0.3320 0.3705 0.2641 95.1%
SDM 0.3380 0.3775 0.2572 98.1%
PACRR multilingual 0.3270 0.3499 0.2517 96.4%
KNRM multilingual 0.3210 0.3415 0.2503 95.2%
Vanilla BERT multilingual 0.3790 0.4205 0.2876 97.4%
Arabic (TREC 2001) [25]
BM25 0.5420 0.5933 0.3462 97.2%
BM25 + RM3 0.4700 0.5458 0.2903 85.6%
SDM 0.5140 0.5843 0.3213 96.2%
PACRR multilingual 0.3880 0.3933 0.2724 90.6%
KNRM multilingual 0.4140 0.4327 0.2742 91.0%
Vanilla BERT multilingual 0.5240 0.5628 0.3432 91.0%
Mandarin (TREC 6) [31]
BM25 0.5962 0.6409 0.3316 89.6%
BM25 + RM3 0.5019 0.5571 0.2696 75.6%
SDM 0.5942 0.6320 0.3472 92.1%
PACRR multilingual 0.4923 0.5238 0.2856 79.0%
KNRM multilingual 0.5308 0.5497 0.3107 80.8%
Vanilla BERT multilingual 0.6615 0.6959 0.3589 92.7%
Mandarin (TREC 5) [32]
BM25 0.3893 0.4113 0.2548 85.4%
BM25 + RM3 0.2768 0.3021 0.1698 64.6%
SDM 0.4536 0.4744 0.2855 94.1%
PACRR multilingual 0.3786 0.3998 0.2331 83.2%
KNRM multilingual 0.3232 0.3449 0.2223 77.5%
Vanilla BERT multilingual 0.4589 0.5196 0.2906 92.0%
Spanish (TREC 4) [13]
BM25 0.3080 0.3314 0.1459 83.8%
BM25 + RM3 0.3360 0.3358 0.2024 85.2%
SDM 0.2780 0.3061 0.1377 78.6%
PACRR multilingual 0.2440 0.2494 0.1294 69.4%
KNRM multilingual 0.3120 0.3402 0.1444 79.2%
Vanilla BERT multilingual 0.4400 0.4898 0.1800 85.6%
Spanish (TREC 3) [12]
BM25 0.5220 0.5536 0.2420 84.8%
BM25 + RM3 0.6100 0.6236 0.3887 93.0%
SDM 0.4920 0.5178 0.2258 83.8%
PACRR multilingual 0.4140 0.4092 0.2260 76.0%
KNRM multilingual 0.5560 0.5700 0.2449 85.2%
Vanilla BERT multilingual 0.6400 0.6672 0.2623 90.8%
Table 1: Zero-shot multilingual results for various baseline and neural methods. Significance of improvements and reductions in performance compared with BM25 is assessed with a paired t-test by query.

3 Results

We present the ranking results in Table 1. We first point out that there is considerable variability in the performance of the unsupervised baselines; in some cases, RM3 and SDM outperform BM25, whereas in other cases they under-perform. Similarly, the PACRR and KNRM neural models also vary in effectiveness, though they more frequently perform much worse than BM25. This makes sense because these models capture matching characteristics that are specific to English. For instance, n-gram patterns captured by PACRR for English do not necessarily transfer well to languages with a different constituent order, such as Arabic (VSO instead of SVO). An interesting observation is that the Vanilla BERT model (which, recall, is tuned only on English text) generally outperforms a variety of approaches across the three test languages. This is particularly remarkable because it is a single trained model that is effective across all three languages, without any difference in parameters. The exceptions are the Arabic 2001 dataset, on which it performs only comparably to BM25, and the MAP results for Spanish. For Spanish, RM3 is able to substantially improve recall (as evidenced by MAP), and since Vanilla BERT acts as a re-ranker atop BM25, it is unable to take advantage of this improved recall, despite significantly improving the precision-focused metrics. In all cases, Vanilla BERT exhibits judged@20 above 85%, indicating that these test collections are still valuable for evaluation.

To test whether a small amount of in-language training data can further improve BERT ranking performance, we conduct an experiment that uses the other collection for each language as additional training data. The in-language samples are interleaved into the English training samples. Results for this few-shot setting are shown in Table 2. We find that the added topics for Arabic 2001 (+50) and Spanish 4 (+25) significantly improve performance. For Arabic 2001, this yields a model significantly better than BM25, which suggests that there may be substantial distributional differences between the English TREC Robust 2004 training collection and the Arabic 2001 test collection. We further back this up by training an "oracle" BERT model (trained on the test data) for Arabic 2001, which yields substantially better results (P@20=0.7340, nDCG@20=0.8093, MAP=0.4250).
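As a rough illustration of how in-language samples might be interleaved into the English training samples, consider the sketch below; the round-robin spacing is our assumption, since the paper does not specify the exact interleaving scheme.

```python
# Hypothetical interleaving of in-language training samples into the English
# training stream, spacing them evenly so every batch sees a mix of both.
from itertools import cycle

def interleave(english_samples, in_language_samples):
    if not in_language_samples:
        yield from english_samples
        return
    in_lang = cycle(in_language_samples)
    step = max(1, len(english_samples) // len(in_language_samples))
    for i, sample in enumerate(english_samples):
        yield sample
        if (i + 1) % step == 0:
            yield next(in_lang)  # splice in one in-language sample
```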

Dataset P@20-ZS P@20-FS nDCG@20-ZS nDCG@20-FS MAP-ZS MAP-FS
Arabic 2002 0.3790 0.3690 0.4205 0.3905 0.2876 0.2822
Arabic 2001 0.5240 0.6020 0.5628 0.6405 0.3432 0.3529
Mandarin 6 0.6615 0.6808 0.6959 0.7099 0.3589 0.3537
Mandarin 5 0.4589 0.4643 0.5196 0.5014 0.2906 0.2895
Spanish 4 0.4400 0.5060 0.4898 0.5636 0.1800 0.2020
Spanish 3 0.6400 0.6560 0.6672 0.6825 0.2623 0.2684
Table 2: Zero-Shot (ZS) and Few-Shot (FS) comparison for Vanilla BERT (multilingual) on each dataset. Significance of increases from using FS is assessed with a paired t-test.

4 Conclusion

We introduced a zero-shot multilingual setting for the evaluation of neural ranking methods. This is an important setting due to the lack of training data available in many languages. We found that contextualized language models (namely, BERT) have a clear advantage, and are generally better suited to cross-lingual transfer than prior models (which may rely more heavily on phenomena exclusive to English). We also found that additional in-language training data may improve performance, though not necessarily. By releasing our code and models, we hope that cross-lingual evaluation will become more commonplace.

References

  • [1] G. Amati and C. J. Van Rijsbergen (2002) Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (TOIS) 20 (4), pp. 357–389. Cited by: §1.
  • [2] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, et al. (2016) MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268. Cited by: §1.
  • [3] M. Braschler, P. Schäuble, and C. Peters (2000) Cross-language information retrieval (clir) track overview. In TREC, Cited by: §1.
  • [4] M. Braschler (2003) CLEF 2003–overview of results. In Workshop of the Cross-Language Evaluation Forum for European Languages, pp. 44–63. Cited by: §1, §1.
  • [5] Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129–136. Cited by: §1.
  • [6] C. Carpineto and G. Romano (2012) A survey of automatic query expansion in information retrieval. Acm Computing Surveys (CSUR) 44 (1), pp. 1. Cited by: §1.
  • [7] K. Collins-Thompson, C. Macdonald, P. Bennett, F. Diaz, and E. M. Voorhees (2015) TREC 2014 web track overview. Technical report MICHIGAN UNIV ANN ARBOR. Cited by: §1.
  • [8] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087. Cited by: §1, §2.
  • [9] Z. Dai and J. Callan (2019) Deeper text understanding for ir with contextual neural language modeling. In SIGIR, Cited by: footnote 2.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Cited by: 4th item.
  • [11] J. Guo, Y. Fan, Q. Ai, and W. B. Croft (2016) A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 55–64. Cited by: §1.
  • [12] D. K. Harman (1995) Overview of the third text retrieval conference (trec-3). DIANE Publishing. Cited by: Table 1.
  • [13] D. Harman (1996) Overview of the fourth text retrieval conference (trec-4). NIST SPECIAL PUBLICATION SP, pp. 1–24. Cited by: §1, Table 1.
  • [14] K. Hui, A. Yates, K. Berberich, and G. de Melo (2017) PACRR: a position-aware neural ir model for relevance matching. arXiv preprint arXiv:1704.03940. Cited by: §1, 2nd item.
  • [15] S. Huston and W. B. Croft (2014) Parameters learned in the comparison of retrieval models using term dependencies. Technical Report. Cited by: §2.
  • [16] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2017) Google's multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. Cited by: §1.
  • [17] J. Kim, Y. Kim, R. Sarikaya, and E. Fosler-Lussier (2017) Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2832–2838. Cited by: §1.
  • [18] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Cited by: §2.
  • [19] G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §1.
  • [20] R. Litschko, G. Glavaš, S. P. Ponzetto, and I. Vulić (2018) Unsupervised cross-lingual information retrieval using monolingual data only. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1253–1256. Cited by: §1.
  • [21] S. MacAvaney, A. Yates, A. Cohan, and N. Goharian (2019) CEDR: contextualized embeddings for document ranking. In Proceedings of the 42Nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’19, New York, NY, USA, pp. 1101–1104. External Links: ISBN 978-1-4503-6172-9 Cited by: §1, 4th item.
  • [22] M. Roser, H. Ritchie, and E. Ortiz-Ospina (2019) Internet. Note: https://ourworldindata.org/internet. Last accessed: 2019/09/15. Cited by: §1.
  • [23] D. Metzler and W. B. Croft (2005) A markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’05, New York, NY, USA, pp. 472–479. External Links: ISBN 1-59593-034-5 Cited by: §1.
  • [24] B. Mitra, N. Craswell, et al. (2018) An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval 13 (1), pp. 1–126. Cited by: §1.
  • [25] D. W. Oard and F. C. Gey (2002) The TREC 2002 arabic/english clir track. In TREC, Cited by: §1, Table 1.
  • [26] K. D. Onal, Y. Zhang, I. S. Altingovde, M. M. Rahman, P. Karagoz, A. Braylan, B. Dang, H. Chang, H. Kim, Q. McNamara, et al. (2018) Neural information retrieval: at the end of the early years. Information Retrieval Journal 21 (2-3), pp. 111–182. Cited by: §1.
  • [27] C. Peters, M. Braschler, and P. Clough (2012) Multilingual information retrieval: from research to practice. Springer Science & Business Media. Cited by: §1.
  • [28] S. Sasaki, S. Sun, S. Schamoni, K. Duh, and K. Inui (2018-06) Cross-lingual learning-to-rank with shared representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 458–463. Cited by: §1.
  • [29] S. Schuster, S. Gupta, R. Shah, and M. Lewis (2019-06) Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3795–3805. Cited by: §1.
  • [30] P. Shi and J. Lin (2019) Cross-lingual relevance transfer for document retrieval. ArXiv abs/1911.02989. Cited by: §1.
  • [31] E. Voorhees, D. Harman, and R. Wilkinson (1998) The sixth text retrieval conference (trec-6). In The Text REtrieval Conference (TREC), Vol. 500, pp. 240. Cited by: §1, Table 1.
  • [32] E. M. Voorhees and D. Harman (1996) Overview of the fifth text retrieval conference (trec-5). In TREC, Vol. 97, pp. 1–28. Cited by: Table 1.
  • [33] E. M. Voorhees (2005) Overview of the TREC 2005 Robust Retrieval Track.. In TREC, Cited by: §1, §2.
  • [34] I. Vulić and M. Moens (2015) Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, New York, NY, USA, pp. 363–372. External Links: ISBN 978-1-4503-3621-5 Cited by: §1.
  • [35] C. Xiong, Z. Dai, J. Callan, Z. Liu, and R. Power (2017) End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval, pp. 55–64. Cited by: §1, 3rd item.
  • [36] P. Yang, H. Fang, and J. Lin (2018) Anserini: reproducible ranking baselines using lucene. J. Data and Information Quality 10, pp. 16:1–16:20. Cited by: 1st item, §2.
  • [37] W. Yang, H. Zhang, and J. Lin (2019) Simple applications of BERT for ad hoc document retrieval. arXiv preprint arXiv:1903.10972. Cited by: §1.
  • [38] Z. Yang, R. Salakhutdinov, and W. W. Cohen (2017) Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv preprint arXiv:1703.06345. Cited by: §1.
  • [39] H. Young (2015) The digital language divide. Note: http://labs.theguardian.com/digital-language-divide/. Last accessed: 2019/09/15. Cited by: §1.
  • [40] Y. Zheng, Z. Fan, Y. Liu, C. Luo, M. Zhang, and S. Ma (2018) Sogou-qcl: a new dataset with click relevance label. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1117–1120. Cited by: §1.