MTLE: A Multitask Learning Encoder of Visual Feature Representations for Video and Movie Description

09/19/2018
by   Oliver Nina, et al.

Learning visual feature representations for video analysis is a daunting task that requires a large number of training samples and a proper generalization framework. Many current state-of-the-art methods for video captioning and movie description rely on simple encoding mechanisms through recurrent neural networks to encode temporal visual information extracted from video data. In this paper, we introduce a novel multitask encoder-decoder framework for automatic semantic description and captioning of video sequences. In contrast to current approaches, our method relies on distinct decoders that train a shared visual encoder in a multitask fashion. Our system does not depend on multiple labels per sample and remains effective when training data is scarce, working even with datasets where only a single annotation is available per video. Our method shows improved performance over current state-of-the-art methods on several metrics across both multi-caption and single-caption datasets. To the best of our knowledge, ours is the first method to use a multitask approach for encoding video features. Our method demonstrated its robustness in the Large Scale Movie Description Challenge (LSMDC) 2017, where it won the movie description task and its results were ranked by evaluators as the most helpful for the visually impaired among all competitors.
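The core idea of the abstract — one shared visual encoder trained jointly through several task-specific decoders — can be sketched as follows. This is a minimal, illustrative NumPy mock-up with hypothetical dimensions and a mean-pooled linear encoder standing in for the paper's recurrent encoder; it is not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedEncoder:
    """One visual encoder shared by every decoder (the multitask idea).
    Stand-in for the paper's recurrent encoder: a linear map + mean pool."""
    def __init__(self, feat_dim, hid_dim):
        self.W = rng.normal(0.0, 0.1, (feat_dim, hid_dim))

    def __call__(self, frames):
        # frames: (T, feat_dim) per-frame CNN features -> (hid_dim,) code
        h = np.tanh(frames @ self.W)
        return h.mean(axis=0)

class DecoderHead:
    """Task-specific decoder head with its own parameters."""
    def __init__(self, hid_dim, out_dim):
        self.V = rng.normal(0.0, 0.1, (hid_dim, out_dim))

    def __call__(self, code):
        logits = code @ self.V
        e = np.exp(logits - logits.max())   # softmax over the task's outputs
        return e / e.sum()

# Hypothetical sizes: 2048-d CNN features, 256-d code, two example tasks.
feat_dim, hid_dim, vocab_size, n_attrs = 2048, 256, 1000, 50
encoder = SharedEncoder(feat_dim, hid_dim)
caption_head = DecoderHead(hid_dim, vocab_size)   # word distribution
attribute_head = DecoderHead(hid_dim, n_attrs)    # auxiliary semantic task

frames = rng.normal(size=(16, feat_dim))          # 16 frames of features
code = encoder(frames)

# Multitask training signal: losses from *distinct* decoders would both
# backpropagate into the single shared encoder (toy targets: index 0).
p_word = caption_head(code)
p_attr = attribute_head(code)
loss = -np.log(p_word[0]) - np.log(p_attr[0])
print(code.shape, loss > 0)
```

Because both heads read the same code, gradients from every task shape the encoder's representation, which is what lets the framework exploit multiple supervision signals even when each video carries only one caption.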


Related research

- 11/28/2016 · Hierarchical Boundary-Aware Neural Encoder for Video Captioning
  The use of Recurrent Neural Networks for video captioning has recently g...

- 03/04/2019 · M-VAD Names: a Dataset for Video Captioning with Naming
  Current movie captioning architectures are not capable of mentioning cha...

- 04/04/2019 · An End-to-End Baseline for Video Captioning
  Building correspondences across different modalities, such as video and ...

- 03/21/2018 · End-to-End Video Captioning with Multitask Reinforcement Learning
  Although end-to-end (E2E) learning has led to promising performance on a...

- 03/29/2023 · AutoAD: Movie Description in Context
  The objective of this paper is an automatic Audio Description (AD) model...

- 12/09/2015 · Video captioning with recurrent networks based on frame- and video-level features and visual content classification
  In this paper, we describe the system for generating textual description...

- 05/10/2019 · Memory-Attended Recurrent Network for Video Captioning
  Typical techniques for video captioning follow the encoder-decoder frame...
