Attention Transfer from Web Images for Video Recognition

08/03/2017
by Junnan Li, et al.

Training deep-learning-based video classifiers for action recognition requires a large number of labeled videos, and the labeling process is labor-intensive and time-consuming. On the other hand, a large number of weakly labeled images are uploaded to the Internet by users every day. To harness this rich and highly diverse set of Web images, a scalable approach is to crawl them and use them to train deep-learning-based classifiers, such as Convolutional Neural Networks (CNNs). However, due to the domain shift problem, the performance of deep classifiers trained on Web images tends to degrade when they are deployed directly on videos. One way to address this problem is to fine-tune the trained models on videos, but this still requires a sufficient number of annotated videos. In this work, we propose a novel approach to transfer knowledge from the image domain to the video domain. The proposed method can adapt to the target domain (i.e., video data) with a limited amount of training data. Our method maps the video frames into a low-dimensional feature space using the class-discriminative spatial attention map for CNNs. We design a novel Siamese EnergyNet structure to learn energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground-truth concept has higher energy. We conduct extensive experiments on two challenging video recognition datasets (i.e., TVHI and UCF101) and demonstrate the efficacy of our proposed method.
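The abstract does not say how the class-discriminative spatial attention map is computed. As a rough illustration only, the sketch below assumes a class-activation-map style construction: a weighted sum of the last convolutional feature maps, rectified and normalized, then average-pooled into a short vector. The function names `class_attention_map` and `to_low_dim`, the pooling grid size, and the use of per-class linear weights are all hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np

def class_attention_map(features, class_weights):
    """Hypothetical CAM-style attention map (an assumption, not the paper's method).

    features:      array of shape (C, H, W), last-layer conv activations for one frame.
    class_weights: array of shape (C,), weights tying each channel to one class.
    Returns an (H, W) map scaled to [0, 1].
    """
    # Weighted sum over the channel axis gives one spatial map for the class.
    cam = np.tensordot(class_weights, features, axes=([0], [0]))
    # Keep only positive evidence for the class.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    # Normalize by the peak; an all-zero map is returned unchanged.
    return cam / peak if peak > 0 else cam

def to_low_dim(cam, grid=4):
    """Average-pool the map to grid x grid cells and flatten (assumes H, W >= grid)."""
    h, w = cam.shape
    # Crop so both sides divide evenly by the grid size.
    cam = cam[: h - h % grid, : w - w % grid]
    cell_h, cell_w = cam.shape[0] // grid, cam.shape[1] // grid
    pooled = cam.reshape(grid, cell_h, grid, cell_w).mean(axis=(1, 3))
    return pooled.ravel()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_features = rng.standard_normal((512, 14, 14))  # stand-in for CNN output
    weights = rng.standard_normal(512)                   # stand-in for class weights
    vec = to_low_dim(class_attention_map(frame_features, weights))
    print(vec.shape)
```

In a sketch like this, each candidate class yields its own low-dimensional vector per frame, which is the kind of input an energy function over attention maps could score. The Siamese EnergyNet and its two loss functions are not reproduced here because the abstract gives no detail on them.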


research
12/22/2015

Do Less and Achieve More: Training CNNs for Action Recognition Utilizing Action Images from the Web

Recently, attempts have been made to collect millions of videos to train...
research
09/15/2014

Transfer Learning for Video Recognition with Scarce Training Data for Deep Convolutional Neural Network

Unconstrained video recognition and Deep Convolution Network (DCN) are t...
research
09/16/2020

Red Carpet to Fight Club: Partially-supervised Domain Transfer for Face Recognition in Violent Videos

In many real-world problems, there is typically a large discrepancy betw...
research
11/25/2019

Deep Image-to-Video Adaptation and Fusion Networks for Action Recognition

Existing deep learning methods for action recognition in videos require ...
research
01/26/2019

DistInit: Learning Video Representations without a Single Labeled Video

Video recognition models have progressed significantly over the past few...
research
05/04/2017

Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Video

Watching a 360 sports video requires a viewer to continuously select a v...
research
06/07/2016

Semi-Supervised Domain Adaptation for Weakly Labeled Semantic Video Object Segmentation

Deep convolutional neural networks (CNNs) have been immensely successful...
