Multi-attention Networks for Temporal Localization of Video-level Labels

11/15/2019
by   Moshe Y. Vardi, et al.

Temporal localization remains an important challenge in video understanding. In this work, we present our solution to the 3rd YouTube-8M Video Understanding Challenge organized by Google Research. Participants were required to build a segment-level classifier using a large-scale training data set with noisy video-level labels and a relatively small validation data set with accurate segment-level labels. We formulated the task as a multiple-instance multi-label learning problem and developed an attention-based mechanism that uses attention weights to selectively emphasize the important frames. Constructing multiple sets of attention networks further improved model performance, and we then fine-tuned the model on the segment-level data set. Our final model is an ensemble of attention/multi-attention networks, deep bag-of-frames models, recurrent neural networks, and convolutional neural networks. It ranked 13th on the private leaderboard and stands out for its efficient use of resources.
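To make the attention idea concrete, the following is a minimal sketch of attention-weighted frame pooling with several attention sets, written in plain Python. It is not the authors' implementation: the abstract does not specify the scoring function, so the dot-product scores, the softmax normalization, the concatenation of the pooled vectors, and the per-label sigmoid classifier are all assumptions, and every function and parameter name here (`attention_pool`, `multi_attention_pool`, `predict_labels`, `attention_vectors`) is hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def attention_pool(frames, w):
    """Pool T frame vectors into one vector using one attention vector w.

    Assumption: each frame's score is the dot product of its features with w.
    """
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    weights = softmax(scores)  # one weight per frame
    dim = len(frames[0])
    pooled = [sum(a * frame[d] for a, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


def multi_attention_pool(frames, attention_vectors):
    """Apply several attention vectors and concatenate the pooled results."""
    video_vector = []
    for w in attention_vectors:
        pooled, _ = attention_pool(frames, w)
        video_vector.extend(pooled)
    return video_vector


def predict_labels(video_vector, label_weights, label_biases):
    """Multi-label output: an independent sigmoid probability per label."""
    return [sigmoid(sum(v * wi for v, wi in zip(video_vector, w)) + b)
            for w, b in zip(label_weights, label_biases)]


# Toy usage: 3 frames with 2 features each, 2 attention sets, 2 labels.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention_vectors = [[2.0, 0.0], [0.0, 2.0]]
video_vector = multi_attention_pool(frames, attention_vectors)
probabilities = predict_labels(video_vector,
                               [[0.5, 0.0, 0.0, 0.5], [-0.5, 0.0, 0.0, -0.5]],
                               [0.0, 0.0])
```

In this sketch each attention set can emphasize different frames, which is one plausible reading of why multiple sets help; the frames play the role of instances in the multiple-instance setting, and the independent sigmoids reflect the multi-label setting.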


