Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering

09/09/2023
by Yifan Dong, et al.

Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of an input text-image pair. Dominant methods mainly focus on multi-modal fusion for keyphrase generation. Nevertheless, two main drawbacks remain: 1) only a limited number of sources, such as image captions, are used to provide auxiliary information, and these may not be sufficient for the subsequent keyphrase generation; 2) the input text and image are often not perfectly matched, so the image may introduce noise into the model. To address these limitations, we propose a novel multi-modal keyphrase generation model that not only enriches the model input with external knowledge but also effectively filters image noise. First, we introduce external visual entities of the image as supplementary input to the model, which benefits cross-modal semantic alignment for keyphrase generation. Second, we simultaneously calculate an image-text matching score and image region-text correlation scores to perform multi-granularity image noise filtering. In particular, we introduce the correlation scores between image regions and ground-truth keyphrases to refine the calculation of the image region-text correlation scores. To demonstrate the effectiveness of our model, we conduct several groups of experiments on the benchmark dataset. Experimental results and in-depth analyses show that our model achieves state-of-the-art performance. Our code is available at https://github.com/DeepLearnXMU/MM-MKP.
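
The multi-granularity filtering idea can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the scores are computed, so the cosine similarity, the sigmoid gate, the softmax over regions, and the function name `filter_image_noise` are all assumptions made for illustration. The refinement that uses ground-truth keyphrases is not modeled here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def filter_image_noise(text_vec, image_vec, region_vecs):
    """Scale region vectors by a coarse and a fine relevance score.

    Assumed scoring (hypothetical, not taken from the paper):
    - coarse: sigmoid of the image-text cosine similarity
    - fine: softmax over the region-text cosine similarities
    """
    # Coarse granularity: one image-text matching score in (0, 1).
    match_score = sigmoid(cosine(text_vec, image_vec))
    if not region_vecs:
        return match_score, [], []

    # Fine granularity: one correlation score per image region.
    sims = [cosine(text_vec, r) for r in region_vecs]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    region_scores = [e / total for e in exps]

    # Down-weight every region by both scores.
    filtered = [
        [match_score * w * x for x in r]
        for w, r in zip(region_scores, region_vecs)
    ]
    return match_score, region_scores, filtered


match_score, region_scores, filtered = filter_image_noise(
    text_vec=[1.0, 0.0],
    image_vec=[1.0, 0.0],
    region_vecs=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this toy input the first region points in the same direction as the text vector, so it receives the larger region score, while the region orthogonal to the text is suppressed. A poorly matched image would lower `match_score` and shrink all regions together.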


research
08/12/2019

Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking

A major challenge in matching images and text is that they have intrinsi...
research
06/28/2023

Knowledge-Enhanced Hierarchical Information Correlation Learning for Multi-Modal Rumor Detection

The explosive growth of rumors with text and images on social media plat...
research
07/10/2023

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

The field of text-conditioned image generation has made unparalleled pro...
research
08/19/2022

Aspect-based Sentiment Classification with Sequential Cross-modal Semantic Graph

Multi-modal aspect-based sentiment classification (MABSC) is an emerging...
research
08/23/2023

Understanding Dark Scenes by Contrasting Multi-Modal Observations

Understanding dark scenes based on multi-modal image data is challenging...
research
04/16/2021

Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Recent advances in using retrieval components over external knowledge so...
research
10/07/2022

Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement

Sarcasm is a linguistic phenomenon indicating a discrepancy between lite...