Adaptive Feature Abstraction for Translating Video to Text

11/23/2016
by Yunchen Pu, et al.

Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features. However, the variable, context-dependent semantics of a video may make it more appropriate to select features adaptively from multiple CNN layers. We propose a new approach for generating adaptive spatiotemporal representations of videos for the captioning task. We develop a novel attention mechanism that adaptively and sequentially focuses on different layers of CNN features (levels of feature "abstraction"), as well as on local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD, and MSR-VTT. Along with visualizations of the results and of how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.
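To make the idea of attending over CNN layers concrete, here is a minimal sketch of soft attention across layers. It is not the authors' implementation: the abstract does not specify the scoring function, so this sketch assumes a plain dot product between each layer's feature vector and the decoder's hidden state, and it assumes every layer's features have already been pooled and projected to a common dimension. The names `attend_over_layers`, `layer_features`, and `hidden_state` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_layers(layer_features, hidden_state):
    """Soft attention over per-layer feature vectors.

    layer_features: list of vectors, one per CNN layer, all of the same
        length (assumption: pooled and projected beforehand).
    hidden_state: decoder state at the current word, same length.
    Returns (weights, context), where context is the weighted sum of
    the layer features.
    """
    # Assumed scoring function: dot product with the decoder state.
    scores = [
        sum(f * h for f, h in zip(feat, hidden_state))
        for feat in layer_features
    ]
    weights = softmax(scores)
    dim = len(hidden_state)
    context = [
        sum(w * feat[i] for w, feat in zip(weights, layer_features))
        for i in range(dim)
    ]
    return weights, context


# Three hypothetical "layers" with 2-dimensional features.
layer_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_state = [0.0, 2.0]
weights, context = attend_over_layers(layer_features, hidden_state)
```

Because the weights are recomputed from the decoder state at every generated word, the emphasis can shift between lower and higher layers as the sentence unfolds. The same weighting could in principle be applied over spatiotemporal locations within a layer, which the abstract describes as the second part of the mechanism.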


