Enhancing Model Performance in Multilingual Information Retrieval with Comprehensive Data Engineering Techniques

02/14/2023
by Qi Zhang, et al.

In this paper, we present our solution to the Multilingual Information Retrieval Across a Continuum of Languages (MIRACL) challenge of WSDM CUP 2023 (https://project-miracl.github.io/). Our solution focuses on enhancing the ranking stage, where we fine-tune pre-trained multilingual transformer-based models on the MIRACL dataset. The improvement comes mainly from diverse data engineering techniques, including the collection of additional relevant training data, data augmentation, and negative sampling. Our fine-tuned model effectively determines the semantic relevance between queries and documents, resulting in a significant improvement in the efficiency of the multilingual information retrieval process. Our team secured 2nd place in the Surprise-Languages track with a score of 0.835 and 3rd place in the Known-Languages track with an average nDCG@10 score of 0.716 across the 16 known languages on the final leaderboard.
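For readers unfamiliar with the metric reported above, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query, followed by a per-language average. It is not the challenge's official evaluation code. The function names, the linear gain, and the binary relevance labels are assumptions made for illustration only.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k labels (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_rels: Sequence[float],
    judged_rels: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """nDCG@k for one query.

    ranked_rels: relevance labels of the returned documents, in ranked order.
    judged_rels: labels of all judged documents for the query; if omitted,
                 the ideal ranking is built from ranked_rels alone (an assumption
                 that only holds when every relevant document was retrieved).
    """
    pool = ranked_rels if judged_rels is None else judged_rels
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


def mean_score(per_language: dict) -> float:
    """Unweighted average of per-language scores (hypothetical helper)."""
    return sum(per_language.values()) / len(per_language)


# Hypothetical labels: relevant documents at ranks 1 and 3.
example = ndcg_at_k([1, 0, 1])
```

With binary labels, linear and exponential gain give the same value, so the choice of gain function does not matter in this sketch.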


research
08/05/2022

A Semantic Alignment System for Multilingual Query-Product Retrieval

This paper mainly describes our winning solution (team name: www) to Ama...
research
02/26/2023

Cross-lingual Knowledge Transfer via Distillation for Multilingual Information Retrieval

In this paper, we introduce the approach behind our submission for the M...
research
09/03/2022

Multilingual ColBERT-X

ColBERT-X is a dense retrieval model for Cross Language Information Retr...
research
08/15/2022

Continuous Active Learning Using Pretrained Transformers

Pre-trained and fine-tuned transformer models like BERT and T5 have impr...
research
02/28/2023

Extending English IR methods to multi-lingual IR

This paper describes our participation in the 2023 WSDM CUP - MIRACL cha...
research
09/28/2022

Multilingual Search with Subword TF-IDF

Multilingual search can be achieved with subword tokenization. The accur...
research
02/14/2022

DS4DH at TREC Health Misinformation 2021: Multi-Dimensional Ranking Models with Transfer Learning and Rank Fusion

This paper describes the work of the Data Science for Digital Health (DS...
