TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data

06/03/2021
by Pengda Qin, et al.

Among the ubiquitous multimodal data in the real world, text is a modality generated by humans, while images honestly reflect the physical world. In visual understanding applications, machines are expected to understand images the way humans do. Inspired by this, we propose a novel self-supervised learning method, named Text-enhanced Visual Deep InfoMax (TVDIM), which learns better visual representations by fully utilizing naturally occurring multimodal data. The core idea of our self-supervised learning is to maximize, to a rational degree, the mutual information between features extracted from multiple views of a shared context. Unlike previous methods, which consider multiple views from a single modality only, our method produces multiple views from different modalities and jointly optimizes the mutual information of both intra-modality and inter-modality feature pairs. Because data noise creates an information gap between inter-modality feature pairs, we adopt ranking-based contrastive learning to optimize the mutual information. During evaluation, we apply the pre-trained visual representations directly to various image classification tasks. Experimental results show that TVDIM significantly outperforms previous visual self-supervised methods when processing the same set of images.
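The abstract describes, but does not formalize, the ranking-based contrastive objective used for the noisy inter-modality pairs. The sketch below is one plausible reading of such an objective, written as a bidirectional margin (hinge) ranking loss over a batch of image-text pairs; the function name `ranking_contrastive_loss`, the margin value, and the in-batch-negatives setup are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ranking_contrastive_loss(img_feats: torch.Tensor,
                             txt_feats: torch.Tensor,
                             margin: float = 0.2) -> torch.Tensor:
    """Bidirectional margin ranking loss for a batch of (image, text) pairs.

    Row i of img_feats and row i of txt_feats form a positive pair;
    every other row in the batch acts as a negative.
    """
    # Cosine similarities between every image and every text in the batch.
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    sim = img @ txt.t()                          # (B, B)

    pos = sim.diag().unsqueeze(1)                # (B, 1) matched-pair scores

    # Hinge terms: a negative is penalized only while it sits within
    # `margin` of the positive, so noisy pairs are not pushed apart forever.
    cost_i2t = (margin + sim - pos).clamp(min=0)      # image -> wrong texts
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)  # text  -> wrong images

    # Remove the diagonal: the positives themselves incur no cost.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = cost_i2t.masked_fill(eye, 0.0)
    cost_t2i = cost_t2i.masked_fill(eye, 0.0)

    return cost_i2t.mean() + cost_t2i.mean()

# Toy usage with random features (batch of 32, 256-d embeddings).
if __name__ == "__main__":
    imgs = torch.randn(32, 256)
    txts = torch.randn(32, 256)
    print(ranking_contrastive_loss(imgs, txts))
```

In a joint intra-/inter-modality setup of the kind the abstract describes, a term like this would be added to a standard single-modality contrastive loss over two augmented views of each image; the hinge caps the penalty from any single negative, so weakly aligned (noisy) captions do not dominate the gradient.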


