Grafting Pre-trained Models for Multimodal Headline Generation

11/14/2022
by Lingfeng Qiao, et al.

Multimodal headline generation uses both video frames and transcripts to generate a natural-language title for a video. Large-scale, manually annotated data is lacking, since annotating grounded headlines for videos is labor intensive and impractical. Previous research on pre-trained language models and video-language models has achieved significant progress in related downstream tasks. However, none of these models can be directly applied to multimodal headline generation, which requires both a multimodal encoder and a sentence decoder. A major challenge in simply gluing a language model to a video-language model is modality balance, that is, combining the complementary abilities of the visual and language modalities. In this paper, we propose a novel approach that grafts the video encoder from a pre-trained video-language model onto a generative pre-trained language model. We also present a consensus fusion mechanism that integrates the different components via inter- and intra-modality relations. Experiments show that the grafted model achieves strong results on a brand-new dataset collected from real-world applications.


