CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning

11/30/2021
by Bang Yang, et al.

For video captioning, "pre-training and fine-tuning" has become the de facto paradigm: ImageNet Pre-training (INP) is typically used to help encode the video content, while a task-oriented network is trained from scratch to handle caption generation. Comparing INP with the recently proposed CLIP (Contrastive Language-Image Pre-training), this paper investigates the potential deficiencies of INP for video captioning and explores what is key to generating accurate descriptions. Specifically, our empirical study of INP vs. CLIP shows that INP makes it difficult for video caption models to capture the semantics of attributes and leaves them sensitive to irrelevant background information. By contrast, CLIP's significant boost in caption quality highlights the importance of attribute-aware representation learning. We are thus motivated to introduce Dual Attribute Prediction, an auxiliary task that requires a video caption model to learn both the correspondence between video content and attributes and the co-occurrence relations among attributes. Extensive experiments on benchmark datasets demonstrate that our approach enables better learning of attribute-aware representations, bringing consistent improvements to models with different architectures and decoding algorithms.
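
The abstract motivates Dual Attribute Prediction but does not spell out its formulation. Below is a minimal PyTorch sketch of one plausible realization, assuming a fixed attribute vocabulary and multi-label targets: one head predicts attributes from pooled video features (the video-attribute correspondence), while a second head reconstructs the full attribute set from a randomly masked copy of the ground-truth attributes (the attribute co-occurrence relations). All names and details here (`DualAttributePredictionSketch`, `mask_ratio`, the pooling choice) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch only: module and parameter names are illustrative,
# not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttributePredictionSketch(nn.Module):
    """Auxiliary heads for attribute-aware representation learning.

    Branch 1 ties video features to attributes; Branch 2 models which
    attributes tend to co-occur by reconstructing masked attribute vectors.
    """

    def __init__(self, feat_dim: int, num_attributes: int, hidden: int = 512):
        super().__init__()
        # Branch 1: pooled video features -> multi-label attribute logits.
        self.video_head = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_attributes),
        )
        # Branch 2: partially masked attribute vector -> full attribute logits.
        self.cooccur_head = nn.Sequential(
            nn.Linear(num_attributes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_attributes),
        )

    def forward(self, video_feats, attr_targets, mask_ratio=0.5):
        # video_feats: (B, T, D) frame features; attr_targets: (B, A) floats in {0, 1}.
        pooled = video_feats.mean(dim=1)  # naive temporal average pooling
        video_logits = self.video_head(pooled)

        # Hide a random subset of ground-truth attributes so the second head
        # must infer them from the ones that remain visible.
        keep = (torch.rand_like(attr_targets) > mask_ratio).float()
        cooccur_logits = self.cooccur_head(attr_targets * keep)

        # Both branches use multi-label binary cross-entropy.
        return (
            F.binary_cross_entropy_with_logits(video_logits, attr_targets)
            + F.binary_cross_entropy_with_logits(cooccur_logits, attr_targets)
        )

# Quick shape check with random data.
dap = DualAttributePredictionSketch(feat_dim=512, num_attributes=1000)
aux_loss = dap(torch.randn(8, 20, 512), torch.randint(0, 2, (8, 1000)).float())
```

In training, such an auxiliary loss would typically be added to the standard captioning cross-entropy with a weighting coefficient, e.g. loss = caption_loss + lambda_dap * dap_loss, where lambda_dap is a hyperparameter.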

Related research

05/15/2023 · PLIP: Language-Image Pre-training for Person Representation Learning
Pre-training has emerged as an effective technique for learning powerful...

02/20/2023 · STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Although large-scale video-language pre-training models, which usually b...

04/04/2023 · Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Scaling up weakly-supervised datasets has shown to be highly effective i...

04/21/2016 · Walk and Learn: Facial Attribute Representation Learning from Egocentric Video and Contextual Data
The way people look in terms of facial attributes (ethnicity, hair color...

04/01/2021 · CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning
This work concerns video-language pre-training and representation learni...

08/06/2020 · Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards
Generating accurate descriptions for online fashion items is important n...

11/28/2014 · Deep Learning Face Attributes in the Wild
Predicting face attributes in the wild is challenging due to complex fac...
