ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation

08/01/2023
by Bo Zhang, et al.

Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains. The code is available at https://github.com/zhangbo-nlp/ZRIGF.
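The abstract describes a contrastive pre-training stage whose text-image matching module maps images and texts into a unified encoded vector space. The paper's exact objective is not reproduced here, but such modules are typically trained with a symmetric InfoNCE-style contrastive loss over a batch of paired embeddings; the sketch below illustrates that generic idea with NumPy (the function name, temperature value, and embedding shapes are illustrative assumptions, not details from the paper).

```python
import numpy as np

def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over paired embeddings.

    Illustrative sketch only: both modalities are L2-normalized into a
    shared space, and each text is pushed toward its paired image while
    being pushed away from the other images in the batch (and vice versa).
    """
    # Normalize each row so similarity reduces to a dot product
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature        # (B, B) similarity matrix
    diag = np.arange(logits.shape[0])     # positives lie on the diagonal

    def cross_entropy(l):
        # Row-wise log-softmax with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()

    # Symmetric: text-to-image and image-to-text directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
loss = info_nce_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(float(loss))
```

For random, unaligned embeddings the loss sits near log(batch size); it decreases as paired text and image vectors become more similar than mismatched pairs, which is the alignment effect the unified vector space is meant to achieve.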

Related research

07/02/2019
Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems
Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is c...

05/22/2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Large-scale image-text contrastive pre-training models, such as CLIP, ha...

10/12/2022
Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features
Hateful memes are a growing menace on social media. While the image and ...

05/14/2021
Joint Retrieval and Generation Training for Grounded Text Generation
Recent advances in large-scale pre-training such as GPT-3 allow seemingl...

05/18/2023
Causal Document-Grounded Dialogue Pre-training
The goal of document-grounded dialogue (DocGD) is to generate a response...

10/16/2021
Multimodal Dialogue Response Generation
Responding with images has been recognized as an important capability for...

07/07/2020
DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue
The Visual Dialogue task requires an agent to be engaged in a conversation w...
