Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

08/04/2021
by Chiori Hori, et al.

Video captioning is an essential technology for understanding scenes and describing events in natural language. To apply it to real-time monitoring, a system must not only describe events accurately but also produce the captions as soon as possible. Low-latency captioning is needed to realize such functionality, but online video captioning has not yet been pursued as a research area. This paper proposes a novel approach that optimizes each caption's output timing based on a trade-off between latency and caption quality. An audio-visual Transformer is trained to generate ground-truth captions using only a small portion of all video frames, and to mimic the outputs of a pre-trained Transformer that is given all the frames. A CNN-based timing detector is also trained to detect a proper output timing, namely the point at which the captions generated by the two Transformers become sufficiently close to each other. With the jointly trained Transformer and timing detector, a caption can be generated in the early stages of an event-triggered video clip, as soon as an event happens or when it can be forecasted. Experiments with the ActivityNet Captions dataset show that our approach achieves 94% of the caption quality of the pre-trained Transformer that uses the entire video clips, while using only 28% of the frames from the beginning.
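The output-timing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `emit_caption_early`, its parameters, and the fixed `threshold` are hypothetical names chosen for illustration, and the captioner and timing detector are stand-in callables rather than the jointly trained audio-visual Transformer and CNN described above. The sketch only shows the control flow, in which frames arrive incrementally and a caption is emitted at the first point where the detector judges the partial-input caption to be ready.

```python
from typing import Callable, Sequence, Tuple


def emit_caption_early(
    frames: Sequence[object],
    caption_fn: Callable[[Sequence[object]], str],
    ready_score_fn: Callable[[Sequence[object]], float],
    threshold: float = 0.5,
    step: int = 1,
) -> Tuple[str, int, float]:
    """Emit a caption from the shortest frame prefix the detector accepts.

    All names here are hypothetical. `caption_fn` stands in for a captioning
    model and `ready_score_fn` for a timing detector that scores whether the
    caption from the prefix is likely close enough to the full-video caption.

    Returns (caption, frames_used, fraction_of_frames_used).
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    if step < 1:
        raise ValueError("step must be at least 1")

    total = len(frames)
    # Grow the observed prefix step by step, as in online processing.
    for end in range(step, total + 1, step):
        prefix = frames[:end]
        if ready_score_fn(prefix) >= threshold:
            return caption_fn(prefix), end, end / total

    # The detector never fired: fall back to captioning the whole clip.
    return caption_fn(frames), total, 1.0


# Toy usage with stand-in callables (assumed behavior, not real models).
toy_frames = list(range(100))
toy_caption = lambda prefix: f"caption from {len(prefix)} frames"
toy_detector = lambda prefix: 1.0 if len(prefix) >= 28 else 0.0

caption, used, fraction = emit_caption_early(toy_frames, toy_caption, toy_detector)
```

In this toy run the stand-in detector fires once 28 of the 100 frames have been seen, so the caption is produced from that prefix. Raising `threshold` (or using a stricter detector) trades more latency for a caption that is more likely to match the full-video output.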


