Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning

06/11/2019
by Junchao Zhang, et al.

Video captioning aims to automatically generate natural language descriptions of video content, and has drawn much attention in recent years. Generating accurate and fine-grained captions requires not only understanding the global content of a video, but also capturing detailed object information. Meanwhile, video representations have a great impact on the quality of generated captions. It is therefore important for video captioning to capture salient objects together with their detailed temporal dynamics, and to represent them with discriminative spatio-temporal features. In this paper, we propose a new video captioning approach based on object-aware aggregation with a bidirectional temporal graph (OA-BTG), which captures detailed temporal dynamics for the salient objects in a video and learns discriminative spatio-temporal representations by performing object-aware local feature aggregation on detected object regions. The main novelties and advantages are: (1) Bidirectional temporal graph: a bidirectional temporal graph is constructed along and reversely along the temporal order, providing complementary ways to capture the temporal trajectory of each salient object. (2) Object-aware aggregation: learnable VLAD (Vector of Locally Aggregated Descriptors) models are constructed on the object temporal trajectories and the global frame sequence, and perform object-aware aggregation to learn discriminative representations. A hierarchical attention mechanism is also developed to distinguish the different contributions of multiple objects. Experiments on two widely used datasets demonstrate that our OA-BTG achieves state-of-the-art performance in terms of the BLEU@4, METEOR, and CIDEr metrics.
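To make the trajectory idea concrete, the sketch below links detected object regions across frames by greedy cosine-similarity matching; running the same procedure on the reversed frame sequence gives the backward half of a bidirectional graph. The function name `link_trajectories` and the greedy one-to-one linking rule are illustrative assumptions, not the paper's exact graph construction.

```python
import torch
import torch.nn.functional as F

def link_trajectories(region_feats):
    """Greedily link each region in frame t to its most similar region in
    frame t+1 by cosine similarity, yielding per-object trajectories as
    lists of region indices (one index per frame).
    region_feats: list of (num_regions, dim) tensors, one per frame."""
    trajs = [[i] for i in range(region_feats[0].size(0))]
    for t in range(len(region_feats) - 1):
        cur = F.normalize(region_feats[t], dim=1)
        nxt = F.normalize(region_feats[t + 1], dim=1)
        match = (cur @ nxt.T).argmax(dim=1)  # best next-frame region per current region
        for traj in trajs:
            traj.append(match[traj[-1]].item())
    return trajs

# A bidirectional temporal graph pairs these forward links with the same
# procedure applied to the reversed frame list, giving complementary
# trajectories for each salient object.
```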
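The object-aware aggregation follows the NetVLAD idea of soft-assigning local descriptors to learnable cluster centers and accumulating per-cluster residuals. Below is a minimal PyTorch sketch; the class name, initialization, and hyperparameters are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableVLAD(nn.Module):
    """NetVLAD-style aggregation: soft-assigns local descriptors to
    learnable cluster centers and accumulates residuals per cluster."""
    def __init__(self, dim, num_clusters):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim) * 0.1)
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits

    def forward(self, x):
        # x: (batch, num_descriptors, dim) local features, e.g. the region
        # features along one object's temporal trajectory
        soft = F.softmax(self.assign(x), dim=-1)        # (B, N, K)
        residuals = x.unsqueeze(2) - self.centers       # (B, N, K, D)
        vlad = (soft.unsqueeze(-1) * residuals).sum(1)  # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)     # (B, K*D)

# Usage sketch: aggregate 10 trajectory descriptors into one representation.
vlad = LearnableVLAD(dim=512, num_clusters=64)
feats = torch.randn(2, 10, 512)
out = vlad(feats)  # (2, 64*512) discriminative aggregated feature
```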


