Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion

08/13/2023
by   Yutao Jin, et al.

Video captioning models aim to describe the content of videos in accurate natural language. Because of the complex interactions among objects in a video, a comprehensive understanding of their spatio-temporal relations remains challenging, and existing methods often fail to generate sufficiently rich feature representations of video content. In this paper, we propose a video captioning model based on dual graphs and gated fusion: we adopt two types of graphs to generate feature representations of video content and use gated fusion to integrate these different levels of information. The dual-graphs model generates appearance features and motion features separately, exploiting the content correlation across frames to produce diverse features from multiple perspectives. Dual-graph reasoning strengthens the content correlation within frame sequences to yield higher-level semantic features, while gated fusion aggregates the information in the multiple feature representations for a comprehensive understanding of the video content. Experiments conducted on the widely used MSVD and MSR-VTT datasets demonstrate state-of-the-art performance of our proposed approach.
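To illustrate the gated-fusion idea described in the abstract, the following is a minimal PyTorch-style sketch of fusing two feature streams (e.g., appearance and motion features) with a learned gate. The module name, dimensions, and the exact gating formula are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion of two per-frame feature streams."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate is predicted from the concatenation of both streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # appearance, motion: (batch, num_frames, dim)
        g = self.gate(torch.cat([appearance, motion], dim=-1))
        # Element-wise gate weighs how much of each stream is kept.
        return g * appearance + (1.0 - g) * motion

if __name__ == "__main__":
    fuse = GatedFusion(dim=512)
    app = torch.randn(2, 26, 512)  # appearance features per frame
    mot = torch.randn(2, 26, 512)  # motion features per frame
    print(fuse(app, mot).shape)    # torch.Size([2, 26, 512])
```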
