Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

08/26/2022
by Yabing Wang, et al.

Despite recent developments in cross-modal retrieval, there has been little research on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, MT is not perfect: it tends to introduce noise during translation, corrupting the textual embeddings and thereby compromising retrieval performance. To alleviate this, we introduce a multi-view self-distillation method for learning noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets that provide direct supervision from both a similarity-based view and a feature-based view. In addition, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancy between original sentences and their back-translated counterparts to further improve the noise robustness of the textual encoder. Extensive experiments on three video-text and image-text cross-modal retrieval benchmarks across different languages show that our method significantly improves overall performance without using extra human-labeled data. Furthermore, when equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant additional performance gain, showing that our method is compatible with popular pre-trained models. Code and data are available at https://github.com/HuiGuanLab/nrccr.
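The two robustness objectives sketched in the abstract, soft pseudo-targets for self-distillation and a back-translation consistency penalty, can be illustrated in a few lines. This is a minimal NumPy sketch under our own assumptions (function names, temperature value, and cosine-based consistency are illustrative), not the authors' implementation:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project vectors onto the unit sphere so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_distillation_loss(student_sims, teacher_sims, tau=0.05):
    # Soft pseudo-targets: the teacher's similarity distribution over candidates,
    # sharpened by temperature tau; the student is trained toward it via cross-entropy.
    targets = softmax(teacher_sims / tau)
    log_probs = np.log(softmax(student_sims / tau) + 1e-12)
    return float(-(targets * log_probs).sum(axis=-1).mean())

def back_translation_consistency(orig_emb, back_emb):
    # Penalize semantic drift between embeddings of the original sentences
    # and their back-translated counterparts (1 - cosine similarity).
    o = l2_normalize(orig_emb)
    b = l2_normalize(back_emb)
    cos = (o * b).sum(axis=-1)
    return float((1.0 - cos).mean())
```

In this sketch, `self_distillation_loss` is minimized when the student's similarity distribution matches the teacher's soft targets, and `back_translation_consistency` is zero when original and back-translated embeddings coincide.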


Related research

09/11/2023
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval
Current research on cross-modal retrieval is mostly English-oriented, as...

11/17/2022
CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge
Automatically generating textual descriptions for massive unlabeled imag...

09/11/2023
From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery
Molecule discovery serves as a cornerstone in numerous scientific domain...

10/13/2022
Low-resource Neural Machine Translation with Cross-modal Alignment
How to achieve neural machine translation with limited parallel data? Ex...

06/01/2022
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training
In this paper, we introduce Cross-View Language Modeling, a simple and e...

04/10/2022
Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data
Multimodal pre-training for audio-and-text has recently been proved to b...

05/24/2023
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) aims to convert speech from o...
