Informative Visual Storytelling with Cross-modal Rules

07/07/2019
by Jiacheng Li, et al.

Existing methods in the visual storytelling field often suffer from generating overly general descriptions, even though the images contain many meaningful contents that go unnoticed. This failure to produce informative stories can be attributed to the model's inability to capture enough meaningful concepts. These concepts fall into categories such as entities, attributes, actions, and events, which are in some cases crucial to grounded storytelling. To solve this problem, we propose a method that mines cross-modal rules to help the model infer such informative concepts from a given visual input. We first build multimodal transactions by concatenating CNN activations and word indices. We then apply an association rule mining algorithm to extract cross-modal rules, which are used for concept inference. With the help of these rules, the generated stories are more grounded and informative. In addition, our proposed method offers interpretability, expandability, and transferability, indicating its potential for wider application. Finally, we leverage the inferred concepts in our encoder-decoder framework with an attention mechanism. We conduct several experiments on the VIsual StoryTelling (VIST) dataset, and the results demonstrate the effectiveness of our approach in terms of both automatic metrics and human evaluation. Additional experiments show that the mined cross-modal rules, used as extra knowledge, help the model achieve better performance when trained on a small dataset.
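
To make the mining step concrete, below is a minimal Python sketch of one way the multimodal transactions and cross-modal rules described above could be built, using the off-the-shelf Apriori implementation from mlxtend. The top-k channel binarization, the support and confidence thresholds, and the `training_pairs` placeholder data are illustrative assumptions for this sketch, not the paper's exact procedure.

```python
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

def build_transaction(cnn_activations, word_indices, top_k=20):
    """One multimodal transaction: the image's most activated CNN channels
    plus the word indices of the paired sentence, as distinct item ids."""
    visual = {f"v{c}" for c in np.argsort(cnn_activations)[-top_k:]}
    textual = {f"w{i}" for i in word_indices}
    return visual | textual

# Placeholder training data: (activation vector, word ids) per image-sentence
# pair. In practice these come from a pretrained CNN and the VIST annotations.
rng = np.random.default_rng(0)
training_pairs = [(rng.random(512), rng.integers(0, 1000, size=12))
                  for _ in range(200)]
transactions = [build_transaction(a, w) for a, w in training_pairs]

# One-hot encode the transactions and mine frequent itemsets with Apriori.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
frequent = apriori(onehot, min_support=0.005, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)

# Keep only cross-modal rules: purely visual antecedents implying
# purely textual consequents.
visual_only = rules["antecedents"].map(lambda s: all(i.startswith("v") for i in s))
textual_only = rules["consequents"].map(lambda s: all(i.startswith("w") for i in s))
cross_modal = rules[visual_only & textual_only]

def infer_concepts(cnn_activations, top_k=20):
    """Fire every rule whose visual antecedent is active for this image
    and collect the implied concept words."""
    active = {f"v{c}" for c in np.argsort(cnn_activations)[-top_k:]}
    fired = cross_modal[cross_modal["antecedents"].map(active.issuperset)]
    return set().union(*fired["consequents"])
```

At generation time, the concept words returned by `infer_concepts` would be supplied, alongside the image features, to the attention-equipped encoder-decoder described above.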
