Contrastive Multi-Modal Clustering

by Jie Xu, et al.

Multi-modal clustering, which explores complementary information across multiple modalities or views, has attracted increasing attention. However, existing works rarely focus on extracting high-level semantic information from multiple modalities for clustering. In this paper, we propose Contrastive Multi-Modal Clustering (CMMC), which mines high-level semantic information via contrastive learning. Concretely, our framework consists of three parts. (1) Multiple autoencoders are optimized to preserve each modality's diversity and thereby learn complementary information. (2) A feature contrastive module learns common high-level semantic features across the different modalities. (3) A label contrastive module learns consistent cluster assignments for all modalities. Through the proposed multi-modal contrastive learning, the mutual information of the high-level features is maximized while the diversity of the low-level latent features is maintained. In addition, to exploit the learned high-level semantic features, we generate pseudo labels by solving a maximum matching problem and use them to fine-tune the cluster assignments. Extensive experiments demonstrate that CMMC scales well and outperforms state-of-the-art multi-modal clustering methods.
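The feature contrastive module described above can be illustrated with an InfoNCE-style objective: features of the same sample from two modalities form a positive pair, and all other cross-modal pairs serve as negatives, so minimizing the loss maximizes a lower bound on cross-modal mutual information. The sketch below is a minimal NumPy illustration of this general idea, not the paper's exact implementation; the function name, temperature value, and use of cosine similarity are assumptions.

```python
import numpy as np

def cross_modal_contrastive_loss(z_a, z_b, temperature=0.5):
    """InfoNCE-style loss between features of two modalities.

    z_a, z_b: (n, d) arrays; row i of each holds the same sample's
    high-level feature in modality A and modality B. Matching rows are
    positive pairs; every other cross-modal pair acts as a negative.
    """
    # Cosine similarity via L2 normalization of both feature sets.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temperature          # (n, n) similarities
    # Softmax cross-entropy treating the diagonal as the positive class.
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

When the two modalities' features are perfectly aligned, the diagonal dominates each row and the loss is small; for unrelated features the loss approaches log n, reflecting near-zero shared information.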



