AnANet: Modeling Association and Alignment for Cross-modal Correlation Classification

by Nan Xu, et al.

The explosive growth of multimodal data creates great demand for cross-modal applications, many of which rely on a strict prior assumption of relatedness. Researchers have therefore studied the definition of cross-modal correlation categories and constructed various classification systems and predictive models. However, those systems focus on the fine-grained relevant types of cross-modal correlation while ignoring a large amount of implicitly relevant data, which is often assigned to the irrelevant types. Worse still, none of the previous predictive models manifest the essence of cross-modal correlation, as defined, at the modeling stage. In this paper, we present a comprehensive analysis of image-text correlation and define a new classification system based on implicit association and explicit alignment. To predict the type of image-text correlation, we propose the Association and Alignment Network (AnANet), which, following our proposed definition, implicitly represents the global discrepancy and commonality between image and text and explicitly captures cross-modal local relevance. Experimental results on our newly constructed image-text correlation dataset demonstrate the effectiveness of our model.
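The abstract describes two complementary signals: an implicit "association" branch that models the global discrepancy and commonality between the image and text embeddings, and an explicit "alignment" branch that scores local cross-modal relevance. The sketch below illustrates one plausible way such features could be computed; all function names, the element-wise commonality/discrepancy features, and the max-over-regions alignment surrogate are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of association + alignment features, assuming global image/text
# embeddings and lists of local region/token vectors. Not the paper's code.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def association_features(img_vec, txt_vec):
    """Implicit global association: commonality (element-wise product)
    and discrepancy (element-wise difference) of the global embeddings,
    concatenated into one feature vector."""
    commonality = [a * b for a, b in zip(img_vec, txt_vec)]
    discrepancy = [a - b for a, b in zip(img_vec, txt_vec)]
    return commonality + discrepancy

def alignment_score(img_regions, txt_tokens):
    """Explicit local alignment: each text token is matched to its
    best-scoring image region, and the matches are averaged."""
    per_token = [max(cosine(t, r) for r in img_regions) for t in txt_tokens]
    return sum(per_token) / len(per_token)

# Toy usage: two 2-d image regions, one text token close to the first region.
feats = association_features([1.0, 0.0], [0.0, 1.0])
score = alignment_score([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1]])
```

In a full model, the concatenated association features and the alignment evidence would feed a classifier over the correlation types; here they are kept as plain lists and a scalar to keep the illustration self-contained.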




