CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network

04/07/2017
by   Yuxin Peng, et al.

Cross-modal retrieval, which retrieves across multimedia data such as image and text, has become a prominent research topic. Most existing methods based on deep neural networks (DNNs) adopt a two-stage learning framework: the first learning stage generates a separate representation for each modality, and the second learning stage learns the cross-modal common representation. However, the existing methods have three limitations: (1) In the first learning stage, they only model intra-modality correlation, and ignore inter-modality correlation, which carries rich complementary context. (2) In the second learning stage, they only adopt shallow networks with single-loss regularization, and ignore the intrinsic relevance of intra-modality and inter-modality correlation. (3) They only consider original instances, and ignore the complementary fine-grained clues provided by their patches. To address these problems, this paper proposes a cross-modal correlation learning (CCL) approach with multi-grained fusion by hierarchical network, with the following contributions: (1) In the first learning stage, CCL exploits multi-level association with joint optimization to preserve the complementary context from intra-modality and inter-modality correlation simultaneously. (2) In the second learning stage, a multi-task learning strategy is designed to adaptively balance the intra-modality semantic category constraints and the inter-modality pairwise similarity constraints. (3) CCL adopts multi-grained modeling, which fuses coarse-grained instances and fine-grained patches to make cross-modal correlation more precise. In experiments against 13 state-of-the-art methods on 6 widely used cross-modal datasets, the CCL approach achieves the best performance.
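To make the second-stage idea more concrete, the following is a minimal sketch of how a semantic category term and a pairwise similarity term might be combined into one objective, together with a simple fusion of instance and patch representations. It is not the paper's implementation: the function names, the contrastive form of the pairwise term, the concatenation-based fusion, and the fixed weights `w_sem` and `w_pair` are all assumptions made for illustration (the abstract states that CCL balances the two constraints adaptively, which this sketch does not do).

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample given raw class scores (semantic category term)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_pair_loss(img_vec, txt_vec, is_match, margin=1.0):
    """Pairwise similarity term (assumed contrastive form).

    Matching image/text pairs are pulled together; non-matching pairs
    are pushed at least `margin` apart.
    """
    d = euclidean(img_vec, txt_vec)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


def fuse_grains(instance_vec, patch_vecs):
    """Assumed fusion: concatenate the coarse-grained instance vector
    with the mean of the fine-grained patch vectors."""
    dim = len(patch_vecs[0])
    mean_patch = [sum(p[i] for p in patch_vecs) / len(patch_vecs) for i in range(dim)]
    return list(instance_vec) + mean_patch


def multi_task_loss(img_logits, txt_logits, label, img_vec, txt_vec, is_match,
                    w_sem=0.5, w_pair=0.5):
    """Weighted sum of the intra-modality and inter-modality terms.

    The weights are fixed here (an assumption); an adaptive scheme
    would learn or adjust them during training.
    """
    sem = softmax_cross_entropy(img_logits, label) + softmax_cross_entropy(txt_logits, label)
    pair = contrastive_pair_loss(img_vec, txt_vec, is_match)
    return w_sem * sem + w_pair * pair
```

In this sketch, `fuse_grains` would be applied to each modality before the common representation is scored, and `multi_task_loss` would be averaged over a training batch of image/text pairs.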


research
08/16/2017

Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network

Nowadays, cross-modal retrieval plays an indispensable role to flexibly ...
research
03/21/2017

Cross-modal Deep Metric Learning with Multi-task Regularization

DNN-based cross-modal retrieval has become a research hotspot, by which ...
research
10/19/2022

CLIP-Driven Fine-grained Text-Image Person Re-identification

TIReID aims to retrieve the image corresponding to the given text query ...
research
06/28/2023

Knowledge-Enhanced Hierarchical Information Correlation Learning for Multi-Modal Rumor Detection

The explosive growth of rumors with text and images on social media plat...
research
12/16/2021

Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization

Multimodal summarization with multimodal output (MSMO) generates a summa...
research
05/25/2022

You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos

Moment retrieval in videos is a challenging task that aims to retrieve t...
research
09/30/2019

Cross-Modal Subspace Learning with Scheduled Adaptive Margin Constraints

Cross-modal embeddings, between textual and visual modalities, aim to or...
