Multimodal Conditionality for Natural Language Generation

09/02/2021
by Michael Sollami, et al.

Large-scale pretrained language models have demonstrated state-of-the-art performance in language understanding tasks. Their application has recently expanded into multimodal learning, leading to improved representations that combine vision and language. However, progress in adapting language models towards conditional Natural Language Generation (NLG) has been limited to a single modality, generally text. We propose MAnTiS, Multimodal Adaptation for Text Synthesis, a general approach for multimodal conditionality in transformer-based NLG models. In this method, we pass inputs from each modality through modality-specific encoders, project the encoded outputs into the textual token space, and join them to form a conditionality prefix. We fine-tune the pretrained language model and the encoders, with the conditionality prefix guiding the generation. We apply MAnTiS to the task of product description generation, conditioning a network on both product images and titles to generate descriptive text. We demonstrate that MAnTiS outperforms strong baseline approaches on standard NLG scoring metrics. Furthermore, qualitative assessments demonstrate that MAnTiS can generate human-quality descriptions consistent with the given multimodal inputs.
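
The prefix construction described in the abstract can be illustrated with a small, shape-level sketch. This is not the authors' implementation: the abstract does not give dimensions, encoder architectures, or the ordering of modalities, so `EMBED_DIM`, the feature sizes, the random projection, and the helper names `make_projection`, `project`, and `build_prefix` are all assumptions made for illustration. Random vectors stand in for the outputs of the image and text encoders.

```python
import random

EMBED_DIM = 8  # assumed token-embedding width of the language model


def make_projection(in_dim, out_dim, rng):
    """Random in_dim x out_dim matrix standing in for a learned linear layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weights):
    """Map each feature vector (row) into the token-embedding space."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
        for row in features
    ]


def build_prefix(image_features, title_embeddings, image_proj):
    """Project image features to token space, then join them with the title."""
    image_tokens = project(image_features, image_proj)
    # Assumed ordering: image tokens first, title tokens second.
    return image_tokens + title_embeddings


rng = random.Random(0)

# Stand-ins for encoder outputs (assumed sizes): 4 image feature vectors of
# width 16, and 5 title tokens already embedded in the token space.
image_features = [[rng.random() for _ in range(16)] for _ in range(4)]
title_embeddings = [[rng.random() for _ in range(EMBED_DIM)] for _ in range(5)]
image_proj = make_projection(16, EMBED_DIM, rng)

prefix = build_prefix(image_features, title_embeddings, image_proj)

# During fine-tuning, the embeddings of the target description would follow
# the prefix in the sequence given to the language model.
description_embeddings = [[rng.random() for _ in range(EMBED_DIM)] for _ in range(3)]
model_input = prefix + description_embeddings
```

In this sketch the prefix is simply a sequence of vectors with the same width as the model's token embeddings, which is what lets a pretrained text-only transformer attend to image-derived positions in the same way it attends to ordinary tokens.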

research
02/16/2022

XFBoost: Improving Text Generation with Controllable Decoders

Multimodal conditionality in transformer-based natural language models h...
research
01/31/2023

Grounding Language Models to Images for Multimodal Generation

We propose an efficient method to ground pretrained text-only language m...
research
10/31/2020

Personalized Multimodal Feedback Generation in Education

The automatic evaluation for school assignments is an important applicat...
research
08/19/2019

Encoder-Agnostic Adaptation for Conditional Language Generation

Large pretrained language models have changed the way researchers approa...
research
03/04/2022

Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Most methods for conditional video synthesis use a single modality as th...
research
06/21/2023

Mass-Producing Failures of Multimodal Systems with Language Models

Deployed multimodal systems can fail in ways that evaluators did not ant...
research
03/20/2023

Multimodal Shannon Game with Images

The Shannon game has long been used as a thought experiment in linguisti...
