
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization

by Okan Köpüklü et al.

Spatiotemporal action localization requires incorporating two sources of information into the designed architecture: (1) temporal information from previous frames and (2) spatial information from the key frame. Current state-of-the-art approaches usually extract this information with separate networks and use an extra fusion mechanism to obtain detections. In this work, we present YOWO, a unified CNN architecture for real-time spatiotemporal action localization in video streams. YOWO uses a single neural network to extract temporal and spatial information concurrently and to predict bounding boxes and action probabilities directly from video clips in one evaluation. Since the whole architecture is unified, it can be optimized end-to-end. YOWO is fast, running at 34 frames per second on 16-frame input clips and 62 frames per second on 8-frame input clips. Remarkably, YOWO outperforms the previous state-of-the-art results on J-HMDB-21 (71.1%) and UCF101-24 (75.0%).
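The abstract's two-branch idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the tiny convolutional stacks stand in for the paper's actual backbones, and a simple channel-wise concatenation stands in for its dedicated fusion mechanism. The shape of the output follows the YOLO-style detection convention the paper builds on: per grid cell and anchor, 4 box coordinates, 1 objectness score, and per-class action scores, all produced in a single forward pass.

```python
import torch
import torch.nn as nn

class YOWOSketch(nn.Module):
    """Sketch of a YOWO-style unified detector (illustrative backbones only):
    a 3D-CNN branch encodes the whole clip (temporal information), a 2D-CNN
    branch encodes the key frame (spatial information), the two feature maps
    are fused channel-wise, and a 1x1 conv head predicts boxes and action
    scores in one evaluation."""

    def __init__(self, num_classes=24, num_anchors=5):
        super().__init__()
        # 3D branch: temporal context from the clip (stand-in backbone).
        self.backbone_3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 7, 7)),  # collapse the time dimension
        )
        # 2D branch: spatial detail from the key frame (stand-in backbone).
        self.backbone_2d = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
        )
        # Head over fused features: per anchor, 4 box coords + objectness
        # + num_classes action scores, on a 7x7 grid.
        out_ch = num_anchors * (5 + num_classes)
        self.head = nn.Conv2d(32 + 32, out_ch, kernel_size=1)

    def forward(self, clip):
        # clip: (B, 3, T, H, W); the key frame is the last frame of the clip.
        feat3d = self.backbone_3d(clip).squeeze(2)   # (B, 32, 7, 7)
        feat2d = self.backbone_2d(clip[:, :, -1])    # (B, 32, 7, 7)
        fused = torch.cat([feat3d, feat2d], dim=1)   # channel-wise fusion
        return self.head(fused)                      # (B, A*(5+C), 7, 7)

model = YOWOSketch()
out = model(torch.randn(2, 3, 16, 224, 224))  # batch of two 16-frame clips
print(out.shape)  # torch.Size([2, 145, 7, 7]): 5 anchors * (5 + 24) = 145
```

Because both branches live in one `nn.Module` and share a single loss on the head's output, the whole model can be trained end-to-end, which is the key difference from two-stream pipelines that train and fuse separate networks.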





Code Repositories

Automatic Detection of Tennis Strokes using Spatio-Temporal Localization: the source code of the ITSS project.