Multi-Modal Mixup for Robust Fine-tuning

03/08/2022
by Junhyuk So, et al.

Pre-trained large-scale models provide transferable embeddings and show comparable performance across diverse downstream tasks. However, the transferability of multi-modal learning is restricted, and the learned embeddings have not been analyzed in depth. This paper offers a perspective for understanding multi-modal embeddings in terms of uniformity and alignment. We newly find that the representations learned by multi-modal models such as CLIP occupy two separated subspaces, one per modality, with poor alignment between them. Moreover, a large intermediate region between the two modalities remains unexplored, with low uniformity. Such a less robust embedding may restrict the transferability of the representation to downstream tasks. This paper presents a new end-to-end fine-tuning method for robust representations that encourages better uniformity and alignment scores. First, we propose a multi-modal Mixup, m^2-Mix, which mixes image and text representations to generate hard negative samples. Second, we fine-tune the multi-modal model with contrastive learning on these hard negatives as well as the ordinary negative and positive samples. Our multi-modal Mixup yields a robust representation, and we validate our method on classification, retrieval, and structure-awareness tasks.
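The two steps described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names (`m2_mix`, `contrastive_loss_with_hard_negatives`) and the hyperparameters `alpha` and `temperature` are assumptions; the mixing coefficient is drawn from a Beta distribution as is conventional for Mixup, and the mixed image-text vectors are appended as extra negatives to an InfoNCE-style contrastive loss.

```python
import numpy as np

def m2_mix(img_emb, txt_emb, alpha=1.0, rng=None):
    """Sketch of m^2-Mix: mix paired image and text embeddings.

    lambda ~ Beta(alpha, alpha); the mixed vector lies in the
    intermediate region between the two modality subspaces and is
    used as a hard negative. `alpha` is an assumed hyperparameter.
    """
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed = lam * img_emb + (1.0 - lam) * txt_emb
    # Re-normalize so the hard negatives stay on the unit hypersphere,
    # since CLIP-style embeddings are compared by cosine similarity.
    return mixed / np.linalg.norm(mixed, axis=-1, keepdims=True)

def contrastive_loss_with_hard_negatives(img, txt, mixed, temperature=0.07):
    """InfoNCE-style loss: for each image, the matching text is the
    positive; other texts are normal negatives, and the m^2-Mix
    embeddings are appended as additional hard negatives."""
    candidates = np.concatenate([txt, mixed], axis=0)       # (2N, D)
    logits = img @ candidates.T / temperature               # (N, 2N)
    # Log-softmax over each row, then pick the diagonal (positive pair).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(img))
    return -log_probs[idx, idx].mean()
```

A usage sketch: normalize a batch of image and text embeddings, generate the mixed hard negatives, and evaluate the loss; in actual fine-tuning this scalar would be backpropagated through the encoders.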

