MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning

by Ruize Xu, et al.

Audio-visual learning helps to comprehensively understand the world by fusing practical information from multiple modalities. However, recent studies show that the imbalanced optimization of uni-modal encoders in a joint-learning model is a bottleneck to enhancing the model's performance. We further find that up-to-date imbalance-mitigating methods fail on some audio-visual fine-grained tasks, which place a higher demand on distinguishable feature distributions. Motivated by the success of cosine loss, which builds hyperspherical feature spaces and achieves lower intra-class angular variability, this paper proposes the Multi-Modal Cosine loss, MMCosine. It performs a modality-wise L2 normalization of features and weights towards balanced and better multi-modal fine-grained learning. We demonstrate that our method can alleviate the imbalanced optimization from the perspective of weight norm and fully exploit the discriminability of the cosine metric. Extensive experiments demonstrate the effectiveness of our method and its versatility with advanced multi-modal fusion strategies and up-to-date imbalance-mitigating methods.
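The abstract does not give the exact formulation, so the following is only a minimal sketch of what a modality-wise L2-normalized cosine loss could look like. It assumes a fusion head whose classifier weights can be split into an audio block and a visual block, and it sums the two per-modality cosine similarities before applying a scale factor. The function names, the `scale` value, and the additive combination are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mmcosine_logits(feat_a, feat_v, w_a, w_v, scale=10.0, eps=1e-12):
    """Sketch of modality-wise cosine logits (assumed form, not the paper's code).

    feat_a: (B, Da) audio features    w_a: (C, Da) audio classifier weights
    feat_v: (B, Dv) visual features   w_v: (C, Dv) visual classifier weights
    `scale` is a hypothetical temperature-like hyperparameter.
    """
    def unit(x):
        # L2-normalize each row; eps guards against division by zero.
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.maximum(norms, eps)

    # Each modality contributes only the cosine between its own feature
    # and its own weight block, so neither feature norm nor weight norm
    # can let one modality dominate the logits.
    cos_a = unit(feat_a) @ unit(w_a).T
    cos_v = unit(feat_v) @ unit(w_v).T
    return scale * (cos_a + cos_v), cos_a, cos_v

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Under these assumptions, rescaling the features or weights of one modality leaves the logits unchanged, which is one way to read the abstract's claim about addressing imbalance "from the perspective of weight norm".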
