CLIP4Caption: CLIP for Video Caption

10/13/2021
by Mingkang Tang, et al.

Video captioning is a challenging task, since it requires generating sentences that describe diverse and complex videos. Existing video captioning models lack adequate visual representation because they neglect the gap between videos and texts. To bridge this gap, we propose CLIP4Caption, a framework that improves video captioning with a CLIP-enhanced video-text matching network (VTM). The framework takes full advantage of information from both vision and language and forces the model to learn strongly text-correlated video features for text generation. In addition, unlike most existing models, which use an LSTM or GRU as the sentence decoder, we adopt a Transformer-based decoder network to effectively capture long-range visual and language dependencies. We also introduce a novel ensemble strategy for captioning tasks. Experimental results demonstrate the effectiveness of our method on two datasets: 1) on the MSR-VTT dataset, our method achieves a new state-of-the-art result with a significant gain of up to 10 CIDEr points; 2) on the private test data, our method ranks 2nd in the ACM Multimedia 2021 Pre-training for Video Understanding Grand Challenge. Note that our model is trained only on the MSR-VTT dataset.
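The abstract gives no implementation details beyond the description above, but the VTM pre-training stage can be pictured roughly as follows: per-frame CLIP image embeddings are aggregated by a temporal Transformer and aligned with CLIP text embeddings of the captions through a symmetric contrastive loss. The module names, dimensions, mean pooling, and loss details in this sketch are illustrative assumptions, not the authors' released code.

# Minimal sketch of CLIP-enhanced video-text matching (VTM) pre-training.
# Assumes per-frame features from a CLIP image encoder and caption features
# from the CLIP text encoder are computed elsewhere; all names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextMatcher(nn.Module):
    def __init__(self, feat_dim=512, n_layers=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # learnable temperature, initialized to log(1/0.07) as in CLIP
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, D) per-frame CLIP image embeddings
        # text_feats:  (B, D)    CLIP text embeddings of the paired captions
        v = self.temporal(frame_feats).mean(dim=1)   # temporal aggregation
        v = F.normalize(v, dim=-1)
        t = F.normalize(text_feats, dim=-1)
        logits = self.logit_scale.exp() * v @ t.t()  # (B, B) similarity matrix
        labels = torch.arange(v.size(0), device=v.device)
        # symmetric InfoNCE: match each video to its caption and vice versa
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2

After this matching stage, the text-aligned video features would serve as the visual input to the captioning decoder.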
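For the caption generator, a standard Transformer decoder with causal self-attention over the partial caption and cross-attention over the video features is one plausible reading of the described architecture. Vocabulary size, depth, and token handling below are hypothetical.

# Rough sketch of a Transformer-based caption decoder trained with teacher
# forcing; configuration values are placeholders, not the paper's settings.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=512, n_layers=4,
                 n_heads=8, max_len=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, video_feats):
        # tokens: (B, L) caption token ids; video_feats: (B, T, d_model)
        B, L = tokens.shape
        pos = torch.arange(L, device=tokens.device)
        x = self.embed(tokens) + self.pos(pos)
        # causal mask so each position attends only to earlier tokens
        mask = torch.triu(torch.full((L, L), float('-inf'),
                                     device=tokens.device), diagonal=1)
        h = self.decoder(x, video_feats, tgt_mask=mask)
        return self.out(h)  # (B, L, vocab) next-token logits

At inference time the decoder would run autoregressively (e.g., greedy or beam search), feeding previously generated tokens back in; the ensemble strategy mentioned in the abstract is not specified, so it is not sketched here.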
