Multilingual models are a topic of growing interest in NLP given their criticality to the universal adoption of AI. Cross-lingual information retrieval (CLIR) braschler1999cross; shakery2013leveraging; saleh-etal-2019-extended; jiang2020cross; asai-etal-2021-xor; shi2021cross, for example, can find relevant text in a high-resource language such as English even when the query is posed in a different, possibly low-resource, language. In this work, we aim to develop useful CLIR models for this constrained, yet important, setting where a retrieval corpus is available only in a single high-resource language (English in our experiments).
A solution to this problem can take one of two general forms. The first is machine translation (MT) of the query into English followed by monolingual (English) IR asai-etal-2021-xor. While such a pipeline approach can provide accurate predictions, the second form, a solution that tackles the problem purely cross-lingually, i.e., without involving MT, can be more efficient and cost-effective. Pre-trained multilingual masked language models (PLMs) such as multilingual BERT devlin-etal-2019-bert and XLM-RoBERTa (XLM-R henceforth) conneau-etal-2020-unsupervised can provide the foundation for such a cross-lingual solution, as one can simply fine-tune such a model with labeled CLIR data asai-etal-2021-one.
Here we first run an empirical evaluation of these two approaches on a public CLIR benchmark asai-etal-2021-xor. We use ColBERT khattab-etal-2020-colbert; khattab-etal-2021-relevance as our IR architecture and XLM-R as the underlying PLM for both methods (§2); we choose ColBERT for its state-of-the-art (SOTA) performance, which exceeds that of DPR karpukhin2020dense. Results indicate that the MT-based solution can be vastly more effective than CLIR fine-tuning, with an observed difference in Recall@5kt of 28.6 points (§3). Crucially, the modular design of the former allows it to leverage additional English-only training data to train its IR component, which contributes significantly to its performance.
The above results lead naturally to our second research question: Can a more accurate CLIR model be trained that can operate without having to rely on MT? To answer the question, instead of viewing the MT-based approach as a competing one, we propose to leverage its strength via knowledge distillation (KD) into a single-step CLIR model. KD hinton-etal-2015-distilling is a powerful supervision technique typically used to distill the knowledge of a large teacher model about some task into a smaller student model sanh2019distilbert. Here we propose to use KD in a slightly different context, where the teacher and the student IR models are identical in size, but the former has superior performance simply due to having access to an MT module and consequently operating in a high-resource and low-difficulty monolingual environment.
We perform two different KD operations (§2.2). The first one directly optimizes an IR objective, where we use a labeled CLIR dataset consisting of parallel (English and non-English) questions and their corresponding positive and negative passages for training. The teacher and the student are shown the English and non-English versions of a question, respectively, and the training objective is for the student to match the soft query-passage relevance predictions of the teacher. Our second KD task is representation learning from parallel text, where the student learns to encode a non-English text that matches the teacher’s encoding of the aligned English text at the token level. The cross-lingual token alignment needed to create the training data for this task is generated using a novel greedy alignment process which, despite its noisy nature, proves to be highly effective.
Experimental results (§3) show that our proposed methods recover most of the performance lost in moving from the two-stage solution to the purely cross-lingual model. On an XOR-TyDi test set, the student outperforms the cross-lingual ColBERT baseline by 25.4 points in terms of Recall@5kt, trailing the teacher processing English queries by just 3.2 points. Ablation studies also show that each of our two KD processes contributes significantly to the final performance of the student.
Our contributions can be summarized as follows:
We present an empirical study of the effectiveness of a SOTA IR model (ColBERT) on cross-lingual IR with and without MT.
We propose a novel, purely cross-lingual solution that uses knowledge distillation to learn both improved text representation and IR.
We demonstrate with a novel cross-lingual alignment algorithm that distillation using parallel text can strongly augment cross-lingual IR training.
We achieve new SOTA results on the XOR-TyDi cross-lingual IR task.
Here we describe our base IR model (ColBERT) and the proposed KD-based cross-lingual training algorithms.
2.1 The ColBERT Model
The ColBERT khattab-etal-2020-colbert architecture consists of a shared transformer-based encoder that separately encodes the input query and document, followed by a linear compression layer. Each training instance is a $\langle q, d^+, d^- \rangle$ triple, where $q$ is a query, $d^+$ is a positive (relevant) document and $d^-$ is a negative (irrelevant) document. ColBERT first computes a relevance score $S_{q,d}$ for the pair $(q, d)$ using Equation 1, where $E_{q_i}$ and $E_{d_j}$ are the output embeddings of query token $q_i$ and document token $d_j$, respectively:
$$S_{q,d} = \sum_{i} \max_{j} E_{q_i} \cdot E_{d_j}^{T} \tag{1}$$
For a given training triple, a pairwise softmax cross-entropy loss is minimized over the computed scores $S_{q,d^+}$ and $S_{q,d^-}$.
For inference, the embeddings of all documents are calculated a priori, while the query embeddings and the relevance score are computed at runtime.
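The scoring and loss described in this subsection can be illustrated with a minimal pure-Python sketch. It assumes the standard ColBERT late-interaction score (for each query token, the maximum dot product over document tokens, summed over query tokens); the function names and the toy 2-dimensional embeddings are hypothetical and stand in for real encoder outputs.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_emb, doc_emb):
    """Late-interaction relevance score: for every query token take the
    maximum dot product over all document tokens, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_emb) for q in query_emb)

def pairwise_softmax_ce(score_pos, score_neg):
    """Cross-entropy of a softmax over (positive, negative) scores,
    with the positive document as the target class."""
    m = max(score_pos, score_neg)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(score_pos - m) + math.exp(score_neg - m))
    return log_z - score_pos

# Toy embeddings (hypothetical values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_pos = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_neg = [[-1.0, 0.0], [0.0, -1.0]]

s_pos = maxsim_score(query, doc_pos)
s_neg = maxsim_score(query, doc_neg)
loss = pairwise_softmax_ce(s_pos, s_neg)
```

Because the document embeddings do not depend on the query, they can be computed once offline, which is what makes the inference scheme above practical.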
2.2 Knowledge Distillation
Our teacher and student are both ColBERT models that fine-tune the same underlying multilingual PLM for IR. The teacher is first trained with all-English triples using the above ColBERT objective. The goal of the subsequent KD training is to teach the student how to reproduce the behavior of the teacher when the former is asked a non-English question and the latter its English translation.
We apply KD at two different stages of the ColBERT workflow: (1) relevance score computation (the score $S_{q,d}$ in Equation 1), and (2) encoding (e.g., the token embeddings $E_{q_i}$). Figure 1 depicts the former in detail, where training minimizes the KL divergence between the student’s and teacher’s output softmax distributions (with temperature) over the scores of the positive and negative documents, $S_{q,d^+}$ and $S_{q,d^-}$.
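The relevance-score distillation objective can be sketched as follows. This is an illustration under stated assumptions rather than the exact training code: the direction of the divergence (teacher relative to student) and the temperature value of 2.0 are assumptions, and the function names are hypothetical.

```python
import math

def log_softmax(scores, temperature):
    """Log of the temperature-scaled softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def kd_score_loss(student_scores, teacher_scores, temperature=2.0):
    """KL(teacher || student) between the softmax distributions over the
    (positive, negative) relevance scores of one training triple.
    The KL direction and default temperature are assumptions."""
    log_p_t = log_softmax(teacher_scores, temperature)
    log_p_s = log_softmax(student_scores, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))

# The teacher scores the English question, the student the non-English one.
teacher = [2.0, 0.0]          # hypothetical (positive, negative) scores
student_good = [2.0, 0.0]     # student agrees with the teacher
student_bad = [0.0, 2.0]      # student ranks the negative document higher

loss_good = kd_score_loss(student_good, teacher)
loss_bad = kd_score_loss(student_bad, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two rankings diverge.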
There is limited availability of labeled training data for CLIR. MT, on the other hand, is a more established area of research that has produced a large amount of parallel text. We seek to exploit parallel corpora in our second KD training stage, where we train the student to compute a representation for a non-English text that closely matches the teacher’s representation of the aligned English text. Crucially, since ColBERT computes a single vector for each individual input token (i.e., a PLM vocabulary item) and not for the entire input text, our algorithm must support distillation at the token level.
To achieve this, we apply an iterative cross-lingual alignment algorithm. Given the ordered tuple of tokens in a non-English text and the corresponding tuple of tokens from the aligned English text, each iteration of this algorithm greedily aligns the next (non-English, English) token pair with the highest cosine similarity of their output embeddings. Algorithm 1 implements this idea by repositioning the teacher’s tokens so that they are position-wise aligned with the corresponding student tokens. Note that the design choice of using a common multilingual PLM in the teacher and the student, even though the former is tasked only with handling English content, is key for the operation of this algorithm as it relies on the pre-trained PLM’s multilingual representations.
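A minimal sketch of such a greedy alignment is given below. It is not a reproduction of Algorithm 1: it assumes a one-to-one alignment in which each token is used at most once and leftover student tokens stay unaligned, and the names `greedy_align` and `reposition_teacher` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def greedy_align(student_emb, teacher_emb):
    """For each student (non-English) token position, return the index of the
    aligned teacher (English) token, or None if no teacher token is left."""
    pairs = [
        (cosine(s, t), i, j)
        for i, s in enumerate(student_emb)
        for j, t in enumerate(teacher_emb)
    ]
    # Visiting pairs in order of decreasing similarity is equivalent to
    # repeatedly picking the best remaining pair; ties are broken by position.
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
    alignment = [None] * len(student_emb)
    used_teacher = set()
    for _, i, j in pairs:
        if alignment[i] is None and j not in used_teacher:
            alignment[i] = j
            used_teacher.add(j)
    return alignment

def reposition_teacher(teacher_emb, alignment):
    """Reorder teacher vectors so they line up with student positions."""
    return [teacher_emb[j] if j is not None else None for j in alignment]

# Toy embeddings (hypothetical): three student tokens, two teacher tokens.
student_emb = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
teacher_emb = [[1.0, 0.1], [0.1, 1.0]]

alignment = greedy_align(student_emb, teacher_emb)
aligned_teacher = reposition_teacher(teacher_emb, alignment)
```

In this toy case the first two student tokens claim the two teacher tokens they are most similar to, and the third student token is left unaligned.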
In addition to cross-lingual alignment, we also perform a similar KD procedure in which both the teacher and the student are shown the same English text. This step is useful because ColBERT uses a shared encoder for the query and the document, necessitating a student that is able to effectively encode text of both English documents and non-English queries.
Using the alignment information, we train the student by minimizing the Euclidean distance between its representation of a token (English or non-English) and the teacher’s representation of the corresponding English token. Figure 2 shows the KD process for representation learning.
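The representation-learning objective can be sketched as below. The sketch assumes that the teacher's vectors have already been repositioned to match student positions, that unaligned positions (marked `None`) are skipped, and that distances are averaged over aligned tokens; the averaging and the function names are assumptions made for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def representation_kd_loss(student_emb, aligned_teacher_emb):
    """Mean Euclidean distance between each student token vector and the
    teacher vector aligned to it; positions without a teacher vector
    (None) are skipped. Averaging is an assumption of this sketch."""
    dists = [
        euclidean(s, t)
        for s, t in zip(student_emb, aligned_teacher_emb)
        if t is not None
    ]
    return sum(dists) / len(dists) if dists else 0.0

# Toy vectors (hypothetical): the third student token has no aligned teacher token.
student_emb = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
aligned_teacher_emb = [[3.0, 4.0], [1.0, 1.0], None]

loss = representation_kd_loss(student_emb, aligned_teacher_emb)
```

When the student and the teacher see the same English text, the alignment is simply the identity and the same function applies.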
3 Experiments and Results
Our primary CLIR dataset is XOR-TyDi asai-etal-2021-xor, which consists of examples in seven typologically diverse languages: Arabic, Bengali, Finnish, Japanese, Korean, Russian and Telugu. The official training set contains 15,221 natural language queries, their short answers, and examples of relevant (positive) and non-relevant (negative) Wikipedia text snippets with respect to the queries. For most queries, there is one positive example and three negative examples. We remove the 1,699 (11%) questions that have no answers in the dataset. In our experiments, we use a random selection of 90% of the remaining examples for training and the rest for validation.
Following the original XOR-TyDi process, we also obtain additional training examples by running BM25-based retrieval against a corpus of Wikipedia text snippets and using answer string match as the relevance criterion. We add these examples to the original set to obtain three positive and 100 negative examples per query. For blind testing, we use the 2,113 examples in the official XOR-TyDi dev set.
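The answer-string-match labeling of retrieved candidates can be sketched as follows. The sketch takes a ranked list of already retrieved passages as input (the BM25 retrieval step itself is not shown) and assumes case-insensitive substring matching; the function name and the matching details are hypothetical.

```python
def label_by_answer_match(ranked_passages, answers, max_pos=3, max_neg=100):
    """Split ranked passages into positives (contain an answer string) and
    negatives (contain none), keeping at most max_pos and max_neg of each.
    Case-insensitive substring matching is an assumption of this sketch."""
    positives, negatives = [], []
    lowered_answers = [a.lower() for a in answers]
    for passage in ranked_passages:
        text = passage.lower()
        if any(a in text for a in lowered_answers):
            if len(positives) < max_pos:
                positives.append(passage)
        elif len(negatives) < max_neg:
            negatives.append(passage)
    return positives, negatives

# Toy ranked list (hypothetical passages).
passages = [
    "Paris is the capital of France.",
    "Berlin is in Germany.",
    "The Eiffel Tower is in paris.",
]
pos, neg = label_by_answer_match(passages, ["Paris"])
```

String matching yields noisy labels, since a passage can mention the answer without actually answering the question.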
Table 1: R@5kt scores on XOR-TyDi (only the teacher row is recoverable).
|Teacher: MT + English IR||76.3|
Our monolingual (English) training data are derived from the third fine-tuning round (ColBERT-QA3) of ColBERT relevance-guided supervision khattab-etal-2021-relevance based on OpenNQ data kwiatkowski-etal-2019-nq. There are about 17.5M (query, positive, negative) triples in this set. We use an in-house IBM neural MT system for question translation. Finally, the parallel corpus used in our KD experiments for representation learning is collected from three sources: (1) an in-house crawl of Korean text, (2) LDC releases (Arabic), and (3) OPUS (https://opus.nlpl.eu). The corpus has a total of 6.9M passage pairs, which include 0.9M pairs in Telugu and 1M pairs in each of the other six languages.
Our evaluation metric is Recall@5000-tokens (R@5kt) asai-etal-2021-xor, which computes the fraction of questions for which the ground truth short answer is contained within the top 5,000 tokens of the retrieved passages.
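A simplified sketch of this metric is shown below. It assumes whitespace tokenization and case-insensitive substring matching, which may differ from the official XOR-TyDi evaluation script; the function names are hypothetical.

```python
def answer_in_top_tokens(ranked_passages, answers, budget=5000):
    """True if any answer string occurs within the first `budget`
    whitespace tokens of the concatenated ranked passages."""
    tokens = []
    for passage in ranked_passages:
        tokens.extend(passage.split())
        if len(tokens) >= budget:
            break
    text = " ".join(tokens[:budget]).lower()
    return any(a.lower() in text for a in answers)

def recall_at_kt(retrieved, gold_answers, budget=5000):
    """Fraction of questions whose answer is found within the token budget.
    retrieved[i] is the ranked passage list for question i and
    gold_answers[i] is its list of acceptable short answers."""
    if not retrieved:
        return 0.0
    hits = sum(
        answer_in_top_tokens(passages, answers, budget)
        for passages, answers in zip(retrieved, gold_answers)
    )
    return hits / len(retrieved)

# Toy example (hypothetical), with a tiny budget to show the cutoff.
retrieved = [
    ["the capital is paris", "other text"],
    ["nothing here", "answer is tokyo"],
]
gold = [["Paris"], ["Tokyo"]]

recall_small = recall_at_kt(retrieved, gold, budget=4)
recall_full = recall_at_kt(retrieved, gold, budget=5000)
```

With a budget of four tokens the second answer falls outside the cutoff, so only one of the two questions counts as a hit.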
3.3 Evaluation on XOR-TyDi
The CLIR baseline for our experiments is a ColBERT model with an underlying XLM-R PLM, which we first fine-tune with 17.5M NQ examples for one epoch and then with 2.9M XOR-TyDi triples for five epochs. Our student model is initialized with the parameter weights of the baseline, and is further fine-tuned using the two KD objectives. The monolingual teacher model, also a ColBERT model on top of the pre-trained XLM-R, is trained with only the 17.5M NQ triples for one epoch.
Table 1 compares the performance of our models measured by their R@5kt scores. The baseline underperforms the MT + English IR pipeline by 28.6 points. Applying KD first on the output representations using the parallel corpus, and then on the query-passage relevance scores using XOR-TyDi triples, improves the baseline model by 25.4 points overall, which, quite impressively, leaves the student only 3.2 points behind the teacher’s score.
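As a consistency check on the reported gaps, the implied scores can be worked out from the differences stated above. This assumes that the 76.3 in the teacher row of Table 1 is the score against which both gaps are measured; the baseline and student values below are derived from the stated differences, not read from the table.

```latex
\begin{align*}
\text{baseline} &= 76.3 - 28.6 = 47.7 \\
\text{student} &= 47.7 + 25.4 = 73.1 \\
\text{teacher} - \text{student} &= 76.3 - 73.1 = 3.2
\end{align*}
```

The three reported differences (28.6, 25.4 and 3.2) are therefore mutually consistent.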
3.4 Ablation Study
To study the effect of the two different KD operations on our student model’s performance, we train two additional students. Each of these students goes through only one of the two KD training steps. Table 2 summarizes the results: KD with only CLIR examples and KD with only the parallel corpus improve the system’s score by 15.9 and 20.9 points, respectively. Interestingly, although the parallel corpus does not contain any IR signal, it contributes more to the final performance of our model. These results also suggest that our cross-lingual alignment algorithm does indeed produce useful alignments.
We show that without the help of machine translation (MT) at inference time, the accuracy of a state-of-the-art IR framework on cross-lingual IR drops considerably. As a solution, we propose new algorithms to distill the knowledge of a teacher model that performs monolingual IR on MT output into a cross-lingual student model capable of operating without MT. We utilize knowledge distillation (KD) for both IR and text representation, and present (for the latter) a novel cross-lingual alignment algorithm that only relies on the underlying masked language model’s multilingual representation capabilities. In empirical evaluation, our student model recovers most of the performance drop due to operating in a single-pass cross-lingual mode. Future work will explore, among other ideas, zero-shot application of our models to new datasets and utilization of our approach for end-to-end QA.
Appendix A
Following are the hyperparameter combinations used in our different models. They were selected based on performance on the validation set.
|Parameters in ColBERT model|
|linear transfer dim||128|
|Parameters in regular training|
|Parameters in distillation|