MuCoT: Multilingual Contrastive Training for Question-Answering in Low-resource Languages

by   Gokul Karthik Kumar, et al.

Accuracy of English-language Question Answering (QA) systems has improved significantly in recent years with the advent of Transformer-based models (e.g., BERT). These models are pre-trained in a self-supervised fashion with a large English text corpus and further fine-tuned with a massive English QA dataset (e.g., SQuAD). However, QA datasets on such a scale are not available for most other languages. Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages. Since these models are pre-trained with huge text corpora containing multiple languages, they typically learn language-agnostic embeddings for tokens from different languages. However, directly training an mBERT-based QA system for low-resource languages is challenging due to the paucity of training data. In this work, we augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model, which is already pre-trained in English. Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance, whereas the performance degrades in the case of cross-language families. We further show that introducing a contrastive loss between the translated question-context feature pairs during the fine-tuning process prevents such degradation with cross-lingual family translations and leads to marginal improvement. The code for this work is available at



Code Repositories


MuCoT - Multilingual Contrastive Training for Question-Answering in Low-resource Languages - Accepted at #ACL 2022 Workshop


1 Introduction

India has a population of 1.4 billion people speaking 447 languages and over 10,000 dialects, making it the country with the fourth-highest number of languages. However, Indian languages are highly under-represented on the Internet, and Natural Language Processing (NLP) systems for Indian languages are in their nascency. Even state-of-the-art multilingual NLP systems perform sub-optimally on Indian languages. This can be explained by the fact that multilingual language models are often jointly trained on 100+ languages and Indian languages constitute only a small fraction of their vocabulary and training data (as shown in Figure 2).


Figure 1: Example of a QA record from the ChAII QA dataset along with the translation and transliteration done on that record.
Figure 2: Amount of data in GB (log-scale) for the 88 languages that appear in both the Wiki-100 wiki100 corpus used for mBERT and XLM-100 conneau2020unsupervised. None of the Indian languages feature among top-25 languages with the largest amount of data.

Machine learning models and tools have been proposed for many Natural Language Understanding tasks. In this work, we focus on Extractive Question-Answering (QA), where the goal is to localize the answer to a question within a large context (see Figure 1). Specifically, we aim to develop a common multilingual question answering model for multiple Indian languages. A multilingual model has several advantages: (1) learning of cues across different languages, (2) a single model for many languages, and (3) avoiding dependency on English translation during inference. In this work, we start with a pre-trained multilingual Bidirectional Encoder Representations from Transformers (mBERT) model and further pre-train it with SQuAD rajpurkar2016SQuAD, a large-scale question answering dataset in English. The resulting English-language mBERT-QA model is fine-tuned and evaluated for Indian languages Tamil and Hindi using the ChAII dataset ChAII.

Fine-tuning the mBERT-QA model using only the training instances in the ChAII dataset is less effective because of the small number of training samples (1114 records with approximately two-thirds in Hindi and the rest in Tamil). To overcome this problem, we use translation and transliteration into other languages as a data augmentation strategy. Translation transforms the source content from one language into another, while transliteration only rewrites each word of the source content in another script. Both operations are applied to the contexts, questions, and answers of the training dataset separately; the new locations of the transformed answers in the transformed contexts are then computed, as shown in Figure 1. Using translation and transliteration increases the size of the ChAII dataset several-fold.
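The augmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `augment_record`, the record field names, and `toy_transform` (a stand-in for a real translation or transliteration model) are all hypothetical. It shows the three fields being transformed separately, the answer being re-located by string search, and the record being dropped when that search fails.

```python
from typing import Callable, Optional


def augment_record(record: dict, transform: Callable[[str], str]) -> Optional[dict]:
    """Transform context, question and answer separately, then re-locate the answer.

    Returns None when the transformed answer cannot be found in the
    transformed context, in which case the record cannot be used.
    """
    context = transform(record["context"])
    question = transform(record["question"])
    answer = transform(record["answer_text"])

    # Search for the transformed answer inside the transformed context.
    start = context.find(answer)
    if start == -1:
        return None
    return {
        "context": context,
        "question": question,
        "answer_text": answer,
        "answer_start": start,
    }


def toy_transform(text: str) -> str:
    """Hypothetical stand-in for a translation/transliteration model."""
    return text.upper()


original = {
    "context": "The capital of Tamil Nadu is Chennai.",
    "question": "What is the capital of Tamil Nadu?",
    "answer_text": "Chennai",
    "answer_start": 29,
}
augmented = augment_record(original, toy_transform)
```

Because each field is transformed independently, a real translation model may render the same word differently in the context and in the answer; the `None` branch is where such records would be lost.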

The choice of languages used for translation and transliteration is critical. kudugunta2019investigating showed that languages under the same family have similar representations in multilingual models. Hence, we put together translations and transliterations from related languages within the same language family to achieve better performance. This will also help with better use of the vocabulary corpora from the low-resource languages. We also study the impact of translation and transliteration on languages outside the family of the target language. Since the cross-family language transfer degraded the QA performance, we introduce a contrastive loss (radford2021learning) between the translated pairs to help retain or improve the original performance by encouraging the embeddings from all languages to be similar regardless of the family group. Thus, the contributions of the paper are three-fold:

  • We propose a three-stage training pipeline for question-answering in low-resource languages.

  • We evaluate mBERT for question-answering in Tamil and Hindi with translations and transliterations as data augmentation techniques and show that same language family translations improve the performance. In contrast, we show that transliterations do not improve the QA performance on the ChAII dataset, regardless of the language family combinations.

  • We propose a contrastive loss between the features of translated pairs to align the cross-family language representations.

2 Related Work

Bidirectional Encoder Representations from Transformers (BERT) devlin2018bert is a deep learning model for general-purpose language representations. BERT is often used as the backbone model for several NLP tasks like semantic analysis, question answering, and named entity recognition. The bidirectional transformer used in BERT has a deeper sense of language context and generates intricate semantic feature representations. These representations are learned through a pre-training step using Next Sentence Prediction (NSP) and Masked Language Modelling (MLM) as pretext tasks and transferred to the downstream NLP tasks. The goal of the Next Sentence Prediction task is to identify whether the two input sentences are consecutive or not. In Masked Language Modelling, BERT is trained to predict randomly masked words in a sentence. The Transformer network receives a sequence of tokens as input and utilizes the attention mechanism to learn the contextual relationships between words in a text. These relationships can then be used to extract high-quality language features, which can be fine-tuned for applications like semantic analysis and question answering. Multi-lingual-BERT (mBERT) is a BERT model pre-trained using the Wikipedia text corpus wiki100 in more than 100 languages around the world. XLM-RoBERTa conneau2020unsupervised scaled this idea with more than 2 terabytes of common crawl data.

Deep models such as Transformers rely heavily on the availability of a large amount of annotated data, which is available only for prominent languages like English, Russian, German or Spanish ponti2019modeling; joshi2020state. For a majority of other languages with a minimal number of annotations, cross-lingual transfer learning prettenhofer2011cross; wan2011bi; ruder2019survey has been proposed as a possible solution. This approach can transfer knowledge from the annotation-rich source language to low-resource or zero-resource target languages. Furthermore, multilingual models lewis2019mlqa; clark2020tydi can be used to mitigate the data scarcity problem. For example, LASER artetxe2019massively used a bidirectional LSTM 10.1162/neco.1997.9.8.1735 encoder with a byte pair encoding vocabulary shared between languages. This work showed that joint training of multiple languages helped to improve the model performance for low-resource languages. LaBSE feng2020language used the mBERT devlin2018bert encoder pre-trained with masked language modelling and translation language modelling lample2019cross tasks. It attempted to optimize the dual encoder translation ranking guo-etal-2018-effective loss during pre-training to achieve similar embeddings for the same text in different languages.

The work of bornea2020multilingual showed that large pre-trained multilingual models are not enough for question-answering in under-represented languages and presented several novel strategies to improve the performance of mBERT with translations. This work achieved language-independent embeddings, which improved the cross-lingual transfer performance with additional pre-training on adversarial tasks. It also introduced a Language Arbitration Framework (LAF), which consolidated the embedding representations across languages using properties of translation. Cross-lingual manifold mixup (X-Mixup) yang2021enhancing achieved better cross-lingual transfer by calibrating the representation discrepancy, which resulted in a compromised representation for target languages. It was shown that the multilingual pre-training process can be improved by implementing X-Mixup on parallel data. Contrastive Language-Image pre-training (CLIP) radford2021learning introduced an efficient way to learn scalable image representations with natural language supervision. Drawing inspiration from ConVIRT zhang2020contrastive, CLIP used a contrastive objective that maximizes the cosine similarity of the correct pairs of images and text, while minimizing the same for incorrect pairs.

Building upon the work of bornea2020multilingual, we show that translations of a small-scale dataset into cross-family languages could degrade the QA performance. To overcome this problem, we propose multilingual contrastive training to encourage cross-lingual invariance. Our approach is simpler than the adversarial training and LAF used in bornea2020multilingual. Though the proposed contrastive loss has a similar objective to the pre-training loss in guo-etal-2018-effective, there are subtle differences because we use it in a multi-task learning setup along with the original task loss for fine-tuning.

3 Methodology

Figure 3: Proposed training pipeline of MuCoT for question answering in low resource languages Tamil and Hindi.

3.1 Data Representation and Baseline Model

We adopt the standard data representations that are commonly used in Transformer-based question-answering models. We use the same WordPiece tokenizer as mBERT to tokenize the concatenated input of question-context pairs. For the question answering task, the context is usually very long. In some NLP applications, truncating the input text is a viable choice because it only leads to some loss of information. But in the extractive question answering task, removing part of the context may result in loss of the answer as well. To overcome this challenge, we follow the popular approach of splitting the long context into parts that fit into the model and regulate this splitting using an additional hyper-parameter called ’max length’. Moreover, to cover cases where the answer might be distributed over multiple splits of the context, an overlap factor is introduced, which in turn is controlled by another hyper-parameter, ’doc stride’.
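The splitting scheme can be illustrated with a small, dependency-free sketch. It assumes that ’doc stride’ denotes the number of tokens shared by two consecutive windows and that ’max length’ counts context tokens only (the question and special tokens are ignored here); the function name `split_context` and the toy values are hypothetical and are not the settings used in the experiments.

```python
def split_context(tokens, max_length, doc_stride):
    """Split a token list into overlapping windows.

    max_length: maximum number of context tokens per window.
    doc_stride: number of tokens shared by consecutive windows (assumed meaning).
    Returns a list of (start_offset, window_tokens) pairs.
    """
    if not 0 <= doc_stride < max_length:
        raise ValueError("doc_stride must satisfy 0 <= doc_stride < max_length")

    step = max_length - doc_stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_length]))
        if start + max_length >= len(tokens):
            break  # the current window already reaches the end of the context
        start += step
    return windows


# Toy example: 10 tokens, windows of 4 tokens, 2 tokens of overlap.
tokens = list(range(10))
windows = split_context(tokens, max_length=4, doc_stride=2)
for offset, window in windows:
    print(offset, window)
```

With an overlap of two tokens, any answer span of up to three tokens is guaranteed to lie entirely inside at least one window, which is the purpose of the overlap factor.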

Our baseline is the mBERT model devlin2018bert, which is pre-trained using pretext tasks like Masked Language Modelling and Next Sentence Prediction on a multilingual text corpus that includes our target languages, Hindi and Tamil. The default output head of mBERT is replaced with the head for the question-answering task. This is done by adding separate output heads for classifying the start and end positions as shown in


3.2 Proposed Framework for Effective Cross-lingual Transfer

We propose a three-stage pipeline called Multilingual Contrastive Training (MuCoT) to effectively train the mBERT model for question-answering in low-resource languages. An illustration of this pipeline for two low-resource languages, namely Tamil and Hindi, is shown in Figure 3. The first stage is pre-training the baseline multilingual model (mBERT). The second stage involves pre-training the QA head using the large-scale dataset(s) in high-resource language(s). In Figure 3, English is considered the high-resource language and the SQuAD rajpurkar2016SQuAD dataset is used to pre-train the QA head and obtain the mBERT-QA model. The final stage involves fine-tuning the mBERT-QA model using both original and augmented samples from the target low-resource languages. In this work, the ChAII dataset ChAII is used for obtaining training samples in Tamil and Hindi.

Since the SQuAD rajpurkar2016SQuAD and ChAII ChAII datasets have similar Wikipedia-style contexts, it is possible to train a multilingual QA model jointly using both datasets. However, to take advantage of the engineering and training efforts of publicly available models, we use these datasets sequentially. After obtaining the mBERT-QA model pre-trained for the English-language QA task, we fine-tune it on the ChAII dataset using the following loss function:

$$\mathcal{L} = \mathcal{L}_{QA} + \lambda \, \mathcal{L}_{MC}$$

where $\mathcal{L}_{QA}$ and $\mathcal{L}_{MC}$ are the QA task loss and multilingual contrastive loss, respectively, $\mathcal{L}$ is the total loss, and $\lambda$ is the relative weight assigned to the contrastive loss. Note that fine-tuning using only the QA task loss is often not sufficient to achieve good performance, especially if the dataset used for fine-tuning is small. To mitigate this problem, we translate the training samples into other languages and use both original and translated samples for fine-tuning. While this approach works well for translations into other languages within the same language family, it leads to sub-optimal performance in the case of cross-family language translations, due to divergence in the representations across language families. To solve this issue, we introduce the multilingual contrastive loss $\mathcal{L}_{MC}$.

Figure 4: Logits matrix computation for the input to contrastive loss, similar to CLIP radford2021learning

3.3 Multilingual Contrastive Loss

During fine-tuning, for each data point in the original batch ($B_o$) of size $N$, we pick one of its corresponding translations uniformly at random and form a translated batch ($B_t$) of the same size $N$. It is important to note that $B_o$ itself is taken from the combined dataset of source instances and translated instances. The two batches that form a pair are denoted as original batch and pair batch, respectively, in Figure 4. We use the same mBERT network up to a specific layer as our encoder ($E$) to transform $B_o$ and $B_t$ into the embeddings $Z_o$ and $Z_t$, respectively. Then, we apply a global average pooling (GAP) operation to aggregate the vector representations of the $T$ tokens into a single vector representation of dimension $D$ for each instance in each batch. This results in the aggregated embeddings $F_o$ and $F_t$, each of size $N \times D$, for $B_o$ and $B_t$, respectively. With these feature vectors in the original and the translated batch, we follow the CLIP radford2021learning approach and compute the contrastive loss using the cross-entropy loss (CE). Specifically, we multiply the matrices $F_o$ and $F_t^\top$ to get the $N \times N$ logits matrix $M$. Then, we apply the cross-entropy loss row-wise and column-wise to the logits matrix $M$, with its diagonal locations as the correct classes for each row and column, respectively.
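The computation above can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the row-wise and column-wise cross-entropy terms are averaged as in CLIP (the text does not say whether they are summed or averaged), and no feature normalization or temperature is applied because the text does not mention either.

```python
import math


def mean_pool(token_embeddings):
    """Global average pooling: a T x D list of token vectors -> one D-vector."""
    n_tokens = len(token_embeddings)
    return [sum(column) / n_tokens for column in zip(*token_embeddings)]


def cross_entropy(logits_row, target):
    """Cross-entropy of one row of logits against the index of the correct class."""
    m = max(logits_row)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum_exp - logits_row[target]


def contrastive_loss(orig_feats, pair_feats):
    """Symmetric cross-entropy over the N x N logits matrix (diagonal = correct pairs)."""
    n = len(orig_feats)
    # Logits matrix: dot product of every original feature with every pair feature.
    logits = [[sum(a * b for a, b in zip(u, v)) for v in pair_feats] for u in orig_feats]
    row_loss = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    col_loss = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return 0.5 * (row_loss + col_loss)  # averaging the two terms is an assumption


# Toy batch: N = 2 instances, T = 2 tokens, D = 2 dimensions.
orig_tokens = [[[4.0, 0.0], [4.0, 0.0]], [[0.0, 4.0], [0.0, 4.0]]]
orig_feats = [mean_pool(t) for t in orig_tokens]

aligned = contrastive_loss(orig_feats, orig_feats)           # translations match
mismatched = contrastive_loss(orig_feats, orig_feats[::-1])  # translations swapped
print(aligned, mismatched)
```

The loss is small when each instance is most similar to its own translation and large when the pairing is wrong, which is what pushes embeddings of the same content in different languages towards each other.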


4 Experimental Results

4.1 Datasets

In our experiments, we use ChAII ChAII question-answering dataset for fine-tuning and evaluation. This dataset was recently released by Google Research India and has 1,114 records of context, question, answer, and its corresponding start position in the context for Tamil and Hindi languages. Hindi is represented predominantly in the dataset with nearly two-thirds of the records. As the ChAII dataset has been published as part of an ongoing Kaggle competition ChAII, the complete test dataset has not been disclosed to the public. Hence, we have used Scikit-learn’s train_test_split method with a test size of , stratified on language and with a random seed of , to get the split from the training data. Similarly, we applied the same method over the filtered split to get the split of 100 samples. We also use the translations and transliterations of this training split as augmented samples for fine-tuning the QA model.
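The idea of a split stratified on language can be illustrated with a dependency-free stand-in for Scikit-learn’s train_test_split. The sketch below is not Scikit-learn's implementation; the function name `stratified_split`, the test fraction of 0.2, the seed of 0, and the toy records are all hypothetical values chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, key, test_fraction, seed):
    """Split records so that each value of `key` keeps the same test proportion."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    train, test = [], []
    for label in sorted(groups):
        items = groups[label][:]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Toy data: 20 Hindi and 10 Tamil records, mirroring the roughly 2:1 ratio.
records = [{"id": i, "language": "tamil" if i % 3 == 0 else "hindi"} for i in range(30)]
train, test = stratified_split(records, key="language", test_fraction=0.2, seed=0)
print(len(train), len(test))
```

Stratifying on language keeps the Hindi-to-Tamil ratio of the held-out split close to that of the full data, so per-language scores are computed on proportionate samples.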

The Stanford Question Answering Dataset (SQuAD) rajpurkar2016SQuAD is the most popular question-answering dataset in English. It was crowdsourced and contains 100K records of answerable question-answer pairs along with their contexts. This dataset is used to pre-train the QA head added to the pre-trained mBERT model, which is subsequently fine-tuned using the ChAII dataset.

4.2 Translation and Transliteration Details

We use AI4Bharat’s IndicTrans ramesh2021samanantar for translation, which is a Transformer-4X model trained on the Samanantar dataset ramesh2021samanantar. In IndicTrans, translation can be done from Indian languages to English and vice versa. Available Indian languages include Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. We first translate the ChAII dataset from Hindi and Tamil to English and then from English to Bengali, Marathi, Malayalam, and Telugu. In the FLORES devset benchmark goyal2021flores, the BLEU scores of IndicTrans for translating Hindi and Tamil to English are 37.9 and 28.6, respectively. The scores for translating English to Bengali, Marathi, Malayalam, and Telugu are 20.3, 16.1, 16.3, and 22.0, respectively. We were not able to translate nearly 500 of the ChAII instances to English because the automatic search for the translated answers in the translated contexts failed. This happened because the same word was translated differently in the context and in the answer. For the same reason, we lost nearly another 200 instances when translating from English to other Indian languages.

For transliteration, we use the open-source Indic-trans transliteration module Bhat:2014:ISS:2824864.2824872, which is available for many Indian language scripts including English and Urdu. Here, we directly transliterate from Hindi and Tamil to Bengali, Marathi, Malayalam, and Telugu.

4.3 Model Training Details

We used mBERT as our baseline model. It is modified for the question-answering task by replacing the output head using HuggingFace’s auto model. At first, we evaluated this model after directly fine-tuning on the train split of the ChAII dataset. Then, we introduced intermediate SQuAD pre-training and fine-tuned on the train split of the ChAII dataset with and without translations or transliterations. The hyperparameter settings listed in Table 1 are used for all the experiments. We experimented with computing the contrastive loss at different mBERT layers. Layer 3 performed consistently well compared to the initial layer 1 and deeper layers such as 5. Initially, we used contrastive training for all the steps. However, forcing the model to learn exact representations across languages could make the model forget the task-specific patterns learned with intermediate pre-training on a large-scale dataset. Hence, we applied the contrastive loss only up to a maximum number of training steps, tried values that are multiples of 500, and picked the best one. Other hyperparameters are tuned based on a standard search over multiple choices.
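One way to read the limited use of contrastive training is sketched below. The sketch assumes that the total loss is the QA loss plus a weighted contrastive term and that the contrastive term is active only during the first `max_contrastive_steps` optimization steps; both the additive form and this schedule are assumptions, the function name `total_loss` is hypothetical, and the defaults mirror the contrastive loss weight and maximum contrastive steps reported in Table 1.

```python
def total_loss(qa_loss, contrastive_loss, step, weight=0.05, max_contrastive_steps=1000):
    """Combine the QA loss with the contrastive loss for one optimization step.

    The contrastive term is added only while step < max_contrastive_steps
    (assumed schedule); afterwards only the QA task loss is optimized.
    """
    if step < max_contrastive_steps:
        return qa_loss + weight * contrastive_loss
    return qa_loss


early = total_loss(qa_loss=1.0, contrastive_loss=2.0, step=0)
late = total_loss(qa_loss=1.0, contrastive_loss=2.0, step=1000)
print(early, late)
```

Switching the contrastive term off after a fixed number of steps lets the remaining updates focus on the QA objective, in line with the concern that enforcing identical representations throughout training could erase task-specific patterns.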

| Hyperparameter | Value |
|---|---|
| Maximum feature length | 128 |
| Document stride | 384 |
| Batch size | 16 |
| Maximum optimization steps | 5000 |
| Learning rate | 0.00003 |
| Weight decay | 0.01 |
| Contrastive loss layer | 3 |
| Contrastive loss weight | 0.05 |
| Maximum contrastive steps | 1000 |

Table 1: Hyperparameter configuration of all the models for fine-tuning on ChAII dataset

| SQuAD pre-training | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
|---|---|---|---|---|---|---|---|---|
| Translations | No | No | Dravidian (ml, te) | Dravidian (ml, te) | Indo-Aryan (bn, mr) | Indo-Aryan (bn, mr) | All languages | All languages |
| Contrastive Training | No | No | No | Yes | No | Yes | No | Yes |
| Overall | 0.44 | 0.5 | 0.49 | 0.53 | 0.51 | 0.52 | 0.49 | 0.52 |
| Hindi | 0.47 | 0.57 | 0.52 | 0.57 | 0.59 | 0.58 | 0.54 | 0.57 |
| Tamil | 0.39 | 0.37 | 0.44 | 0.45 | 0.35 | 0.4 | 0.39 | 0.41 |

Table 2: Jaccard scores with translation used as augmentation in different training settings. ml, te, bn, and mr denote Malayalam, Telugu, Bengali, and Marathi, respectively.

| SQuAD pre-training | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
|---|---|---|---|---|---|---|---|---|
| Transliterations | No | No | Dravidian (ml, te) | Dravidian (ml, te) | Indo-Aryan (bn, mr) | Indo-Aryan (bn, mr) | All languages | All languages |
| Contrastive Training | No | No | No | Yes | No | Yes | No | Yes |
| Overall | 0.44 | 0.5 | 0.5 | 0.49 | 0.53 | 0.47 | 0.49 | 0.46 |
| Hindi | 0.47 | 0.57 | 0.52 | 0.55 | 0.56 | 0.53 | 0.52 | 0.53 |
| Tamil | 0.39 | 0.37 | 0.45 | 0.36 | 0.44 | 0.36 | 0.44 | 0.32 |

Table 3: Jaccard scores with transliteration used as augmentation in different training settings. ml, te, bn, and mr denote Malayalam, Telugu, Bengali, and Marathi, respectively.

4.4 Evaluation Metric

Given the noisy nature of the ChAII dataset, we employed the Jaccard score as the evaluation metric. The Jaccard similarity coefficient is widely used for determining similarity between sets/intervals and is defined as

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

Here, $A$ and $B$ are sets/intervals, and $\cap$ and $\cup$ represent intersection and union, respectively. We compute the evaluation metric for the overall test split as well as for individual language test sets in intervals of 500 optimization steps. For each experiment, we pick the model at the optimization step that gives the best overall Jaccard score and report its performance.
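A minimal sketch of the metric for answer strings is given below. It assumes that the two sets are the sets of lowercased, whitespace-separated words of the predicted and reference answers; this tokenization, the handling of two empty answers, and the function name `jaccard` are assumptions made for illustration.

```python
def jaccard(prediction: str, reference: str) -> float:
    """Jaccard similarity between the word sets of two answer strings."""
    a = set(prediction.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0  # assumed convention: two empty answers count as a match
    return len(a & b) / len(a | b)


print(jaccard("New Delhi", "Delhi"))
print(jaccard("Chennai", "Chennai"))
print(jaccard("Mumbai", "Chennai"))
```

Unlike exact match, the score gives partial credit when the predicted span overlaps the reference without matching it exactly, which suits noisy annotations.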

4.5 Performance

As shown in Tables 2 and 3, translation and transliteration affect the performance in different ways. While some data is lost during the translation process due to the failed automatic search of the translated answer in the translated context, transliteration does not cause any such loss. However, to ensure a fair comparison, records lost during translation are dropped from the transliteration experiments as well. Note that we use the same hyper-parameters from Table 1 for the baseline models and for the later stages with additional augmentation and contrastive training.

First, we observe from Table 2 that just having intermediate SQuAD pre-training in English improves the overall Jaccard score significantly from 0.44 to 0.5. Furthermore, we fine-tune by dividing translated and transliterated data into Indo-Aryan and Dravidian language families to study how translated and transliterated pairs serve as supervised cross-lingual signals when languages share semantics and structure mikolov2013exploiting. Although transliteration improves the Jaccard scores in certain cases compared to the baseline, the trend is not consistent. Moreover, contrastive training does not help in the case of transliteration, as shown in Table 3. This could be because the QA model is pre-trained only with regular text and not with transliteration-style text.

From Table 2, we observe that grouping translated data by language family helps in improving performance. The translated Indo-Aryan data (Bengali and Marathi) increases the Jaccard score of Hindi answers to 0.59 from 0.57. Similarly, Dravidian language data (Telugu and Malayalam) significantly increases the Jaccard score of Tamil answers from 0.37 to 0.44. At the same time, the overall Jaccard score did not change much because of the degradation in cross-family language performance. Interestingly, we observe in Table 2 that contrastive training helps in preventing such degradation and improves the overall score by encouraging similar representations between languages from across families.

5 Conclusion and Future Work

With Internet usage expanding every day, there is an increasing need to develop better NLP models for a variety of downstream tasks in vernacular languages. As most of these languages do not have labeled resources that are sufficient to train stand-alone modern deep learning models, we need to rely on multilingual models and enhance them. Our work is a step in this direction and is an attempt to understand and evaluate the impact of cross-lingual knowledge transfer through pre-training and fine-tuning. We utilize modern open-source deep learning models to translate the ChAII dataset into different languages from two language families, namely Dravidian and Indo-Aryan, and use them to improve the question-answering performance. Our analysis shows an effective way to pick languages for translation, which can be used for fine-tuning. We also showed that combining a contrastive loss with the original task training loss increases the performance even for cross-family languages.

Despite the inclusion of translations and contrastive loss, we observed that there is only a marginal improvement in the QA performance. This can be attributed to the smaller size of the ChAII dataset with 1114 instances (Tamil and Hindi combined; Train, Validation, and Test combined), which is clearly insufficient to fine-tune a 177M parameter model. Hence, the proposed techniques have to be evaluated on other larger datasets as well as using other multilingual models like XLM-RoBERTa conneau2020unsupervised, Distill-mBERT sanh2019distilbert, MURIL khanuja2021muril and Indic-BERT kakwani2020indicnlpsuite. We hope that the proposed techniques will motivate further research in this field, including exploration of the same phenomenon of cross-lingual transfer in other language families and multilingual tasks.