CLIP4Caption ++: Multi-CLIP for Video Caption

10/11/2021
by Mingkang Tang, et al.

This report describes our solution to the captioning task of the VALUE Challenge 2021. Our solution, named CLIP4Caption++, is built on X-Linear/X-Transformer, an advanced encoder-decoder architecture, and makes the following improvements: 1) we utilize three strong pre-trained CLIP models to extract text-related appearance features from the video; 2) we adopt the TSN sampling strategy for data enhancement; 3) we incorporate video subtitle information, fusing it with the visual features as guidance to provide richer semantics; 4) we design word-level and sentence-level ensemble strategies. Our proposed method achieves CIDEr scores of 86.5, 148.4, and 64.5 on the VATEX, YC2C, and TVC datasets, respectively, demonstrating the superior performance of CLIP4Caption++ on all three datasets.
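Two of the ideas named above, TSN-style frame sampling and word-level ensembling, can be illustrated in plain Python. This is a minimal sketch under assumptions, not the paper's actual code: the function names, the dict-based probability representation, and probability averaging as the ensemble rule are all illustrative choices.

```python
import random

def tsn_sample(num_frames, num_segments, training=True):
    """TSN-style sampling: split the frame index range into equal
    segments and pick one frame per segment (a random offset within
    the segment during training, the segment center at evaluation)."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start, int((k + 1) * seg_len) - 1)
        if training:
            indices.append(random.randint(start, end))
        else:
            indices.append((start + end) // 2)
    return indices

def ensemble_next_word(prob_dists):
    """Word-level ensemble sketch: average the next-word probability
    distributions produced by several decoders and return the argmax.
    `prob_dists` is a list of dicts mapping token -> probability."""
    combined = {}
    for dist in prob_dists:
        for tok, p in dist.items():
            combined[tok] = combined.get(tok, 0.0) + p / len(prob_dists)
    return max(combined, key=combined.get)
```

At evaluation time the sampling is deterministic, e.g. `tsn_sample(100, 10, training=False)` picks the center frame of each of the ten segments; the ensemble helper would be called once per decoding step with one distribution per model.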

