MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning

03/09/2023
by   Ruize Xu, et al.

Audio-visual learning helps to comprehensively understand the world by fusing practical information from multiple modalities. However, recent studies show that the imbalanced optimization of uni-modal encoders in a joint-learning model is a bottleneck to improving the model's performance. We further find that recent imbalance-mitigating methods fail on some audio-visual fine-grained tasks, which place a higher demand on distinguishable feature distributions. Motivated by the success of cosine loss, which builds hyperspherical feature spaces and achieves lower intra-class angular variability, this paper proposes the Multi-Modal Cosine loss, MMCosine. It applies a modality-wise L_2 normalization to features and weights, aiming at balanced and better multi-modal fine-grained learning. We demonstrate that our method can alleviate the imbalanced optimization from the perspective of weight norm and fully exploit the discriminability of the cosine metric. Extensive experiments demonstrate the effectiveness of our method and its compatibility with advanced multi-modal fusion strategies and recent imbalance-mitigating methods.
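
The modality-wise L_2 normalization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes a classifier whose logits are the sum of per-modality cosine similarities, multiplied by a scaling factor. The names `l2_normalize`, `mmcosine_logits`, `cross_entropy`, and the `scale` parameter are hypothetical and chosen for illustration only.

```python
import numpy as np


def l2_normalize(x, axis, eps=1e-12):
    """Divide x by its L2 norm along `axis` (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def mmcosine_logits(feat_a, feat_v, w_a, w_v, scale=10.0):
    """Assumed form: scaled sum of per-modality cosine similarities.

    feat_a: (batch, d_a) audio features, w_a: (classes, d_a) audio weights
    feat_v: (batch, d_v) visual features, w_v: (classes, d_v) visual weights
    `scale` is a hypothetical hyperparameter, not taken from the source.
    """
    # Normalize features and weights separately for each modality, so that
    # neither modality can dominate the logits through a larger norm.
    cos_a = l2_normalize(feat_a, axis=1) @ l2_normalize(w_a, axis=1).T
    cos_v = l2_normalize(feat_v, axis=1) @ l2_normalize(w_v, axis=1).T
    return scale * (cos_a + cos_v)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy computed in a numerically stable way."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Under these assumptions, rescaling the features or weights of one modality leaves the logits unchanged, which is the sense in which the normalization removes norm-driven imbalance between modalities.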
