Fine-grained Multi-Modal Self-Supervised Learning

12/22/2021
by Duo Wang, et al.

Multi-modal self-supervised learning from videos has been shown to improve models' performance on various downstream tasks. However, such self-supervised pre-training requires large batch sizes and substantial computational resources because of the noise present in uncurated data. This is partly because the prevalent training scheme operates in a coarse-grained setting, in which vectors representing whole video clips or natural-language sentences are used to compute similarity. Such a scheme makes training noisy, since parts of a video clip can be entirely uncorrelated with the input from the other modality, such as the text description. In this paper, we propose a fine-grained multi-modal self-supervised training scheme that computes similarity between embeddings at a finer scale (such as individual feature-map embeddings and embeddings of phrases), and uses attention mechanisms to reduce the weighting of noisy pairs in the loss function. We show that with the proposed pre-training scheme, we can train smaller models with smaller batch sizes and far less computation, while achieving downstream-task performance comparable to the state of the art on tasks including action recognition and text-image retrieval.
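The abstract points at two ingredients: token-level similarity between the modalities (feature-map embeddings vs. phrase embeddings) and an attention mechanism that down-weights noisy pairs in the contrastive loss. The sketch below is only an illustration of that idea, not the authors' implementation; it assumes PyTorch, and the function name, tensor shapes, and attention-weighted aggregation are hypothetical choices made for the example.

```python
# Illustrative sketch of a fine-grained multi-modal contrastive loss.
# Shapes, names, and the aggregation scheme are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def fine_grained_contrastive_loss(video_tokens, text_tokens, temperature=0.07):
    """Contrast fine-grained video and text embeddings within a batch.

    video_tokens: (B, Nv, D) per-clip feature-map embeddings
    text_tokens:  (B, Nt, D) per-caption phrase/word embeddings
    """
    B = video_tokens.shape[0]

    video_tokens = F.normalize(video_tokens, dim=-1)
    text_tokens = F.normalize(text_tokens, dim=-1)

    # Token-level similarities between every video token and every text token
    # for every (video, text) pairing in the batch: shape (B, B, Nv, Nt).
    sim = torch.einsum('ivd,jtd->ijvt', video_tokens, text_tokens)

    # Attention-style weights over text tokens: poorly matching token pairs
    # get small weights, so uncorrelated regions/phrases contribute less.
    weights = torch.softmax(sim / temperature, dim=-1)
    clip_sim = (weights * sim).sum(dim=-1).mean(dim=-1)  # (B, B) clip/caption scores

    # Symmetric InfoNCE over the aggregated clip-level similarities.
    logits = clip_sim / temperature
    targets = torch.arange(B, device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == '__main__':
    v = torch.randn(4, 49, 256)   # e.g. a 7x7 feature map per clip
    t = torch.randn(4, 12, 256)   # e.g. 12 phrase embeddings per caption
    print(fine_grained_contrastive_loss(v, t).item())
```

Because the loss is aggregated from many token-pair scores rather than a single clip-level dot product, the weighting can suppress mismatched regions or phrases, which is the property the abstract credits for training with smaller batches.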

Related research

11/20/2020 · Self-Supervised learning with cross-modal transformers for emotion recognition
Emotion recognition is a challenging task due to limited availability of...

04/03/2023 · Multi-Modal Representation Learning with Text-Driven Soft Masks
We propose a visual-linguistic representation learning approach within a...

07/16/2022 · Multi-Modal Unsupervised Pre-Training for Surgical Operating Room Workflow Analysis
Data-driven approaches to assist operating room (OR) workflow analysis d...

10/11/2021 · Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types
In the genome biology research, regulatory genome modeling is an importa...

03/09/2020 · Multi-modal Self-Supervision from Generalized Data Transformations
Self-supervised learning has advanced rapidly, with several results beat...

08/28/2023 · MS-Net: A Multi-modal Self-supervised Network for Fine-Grained Classification of Aircraft in SAR Images
Synthetic aperture radar (SAR) imaging technology is commonly used to pr...

10/06/2020 · Guiding Attention for Self-Supervised Learning with Transformers
In this paper, we propose a simple and effective technique to allow for ...
