Multi-Modal Representation Learning with Text-Driven Soft Masks

04/03/2023
by Jaeyoo Park, et al.

We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, a new loss, and a new data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the image regions most relevant to a given word in the corresponding caption, rather than removing them completely. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder. Second, we encourage the model to focus on hard but diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates inherent overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning, mining varied examples by masking text and applying distortions to images. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.
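
The abstract gives no formulas, so the following is only a minimal sketch of how the soft-masking operation and a focal-weighted ITC loss might look, not the authors' implementation. Everything specific here is an assumption: the function names `soft_mask` and `focal_itc_loss`, the min-max normalisation of the attention scores, the hypothetical `strength` parameter, the temperature value, and the use of the standard focal factor `(1 - p) ** gamma` on top of a symmetric InfoNCE-style contrastive loss.

```python
import numpy as np

def soft_mask(features, attention, strength=0.5):
    """Damp region features in proportion to word-conditional attention.

    features:  (num_regions, dim) visual features (assumed layout)
    attention: (num_regions,) relevance scores for one caption word
    strength:  hypothetical knob; below 1 no region is fully removed
    """
    # Rescale attention to [0, 1]; this normalisation is an assumption.
    a = attention - attention.min()
    peak = a.max()
    if peak > 0:
        a = a / peak
    # Soft masking: the most relevant regions are attenuated, not zeroed.
    return features * (1.0 - strength * a)[:, None]

def _log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def focal_itc_loss(img, txt, temperature=0.07, gamma=2.0):
    """Focal-weighted contrastive loss; row i of img matches row i of txt."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))

    def one_direction(l):
        log_p = _log_softmax(l, axis=1)[idx, idx]  # log-prob of the true pair
        p = np.exp(log_p)
        # Focal factor down-weights pairs the model already matches easily.
        return np.mean(-((1.0 - p) ** gamma) * log_p)

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (one_direction(logits) + one_direction(logits.T))
```

With `gamma=0` the loss reduces to the plain symmetric contrastive loss, which makes it easy to compare the two settings on the same batch. In this sketch the soft-masked features would feed the ITM branch as additional training examples, which is one reading of the abstract rather than a confirmed detail.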


