Unsupervised Learning of Video Representations using LSTMs

02/16/2015
by   Nitish Srivastava, et al.
0

We use multilayer Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences - patches of image pixels and high-level representations ("percepts") of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We try to visualize and interpret the learned features. We stress test the model by running it on longer time scales and on out-of-domain data. We further evaluate the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.

READ FULL TEXT

page 6

page 7

page 8

page 9

research
11/12/2015

Action Recognition using Visual Attention

We propose a soft attention based model for the task of action recogniti...
research
08/02/2018

RGB Video Based Tennis Action Recognition Using a Deep Weighted Long Short-Term Memory

Action recognition has attracted increasing attention from RGB input in ...
research
05/04/2020

Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video

In this paper, we tackle the problem of egocentric action anticipation, ...
research
05/06/2021

Unsupervised Visual Representation Learning by Tracking Patches in Video

Inspired by the fact that human eyes continue to develop tracking abilit...
research
12/12/2018

Effective Feature Learning with Unsupervised Learning for Improving the Predictive Models in Massive Open Online Courses

The effectiveness of learning in massive open online courses (MOOCs) can...
research
10/03/2018

Action Model Acquisition using LSTM

In the field of Automated Planning and Scheduling (APS), intelligent age...
research
07/16/2017

RED: Reinforced Encoder-Decoder Networks for Action Anticipation

Action anticipation aims to detect an action before it happens. Many rea...

Please sign up or login with your details

Forgot password? Click here to reset