Learnable Sampling 3D Convolution for Video Enhancement and Action Recognition

11/22/2020
by Shuyang Gu, et al.

A key challenge in video enhancement and action recognition is fusing useful information from neighboring frames. Recent works suggest establishing accurate correspondences between neighboring frames before fusing temporal information. However, the results then depend heavily on the quality of the correspondence estimation. In this paper, we propose a more robust solution: sampling and fusing multi-level features across neighboring frames to generate the results. Based on this idea, we introduce a new module that improves the capability of 3D convolution, namely learnable sampling 3D convolution (LS3D-Conv). We add learnable 2D offsets to 3D convolution so that it can sample locations on the spatial feature maps across frames. The offsets can be learned for specific tasks. LS3D-Conv can flexibly replace the 3D convolution layers in existing 3D networks, yielding new architectures that learn the sampling at multiple feature levels. Experiments on video interpolation, video super-resolution, video denoising, and action recognition demonstrate the effectiveness of our approach.
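The idea of adding 2D offsets to the sampling locations of a 3D convolution can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single-channel video, one 3x3x3 kernel, zero padding, bilinear interpolation for fractional positions, and offsets passed in as fixed numbers rather than learned. The names `bilinear_sample` and `ls3d_conv_at` are hypothetical.

```python
import math


def bilinear_sample(frame, y, x):
    """Bilinearly sample a 2D list `frame` at fractional (y, x).

    Positions outside the frame contribute zero (zero padding).
    """
    h, w = len(frame), len(frame[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                wy = 1.0 - abs(y - yi)  # weight shrinks with distance
                wx = 1.0 - abs(x - xi)
                value += wy * wx * frame[yi][xi]
    return value


def ls3d_conv_at(video, weights, offsets, t, y, x):
    """Output of one 3x3x3 kernel at (t, y, x) on a single-channel video.

    video:   nested list indexed [T][H][W]
    weights: nested list indexed [3][3][3] (time, height, width taps)
    offsets: nested list indexed [3][3][3] of (off_y, off_x) pairs.
             Assumption: in a trained network these 2D offsets would be
             learned; here they are plain inputs for illustration.
    """
    out = 0.0
    for kt in range(3):
        tt = t + kt - 1
        if not 0 <= tt < len(video):
            continue  # zero padding along the time axis
        for ky in range(3):
            for kx in range(3):
                off_y, off_x = offsets[kt][ky][kx]
                # Regular grid position plus a 2D spatial offset;
                # the time index itself is not shifted.
                sy = y + ky - 1 + off_y
                sx = x + kx - 1 + off_x
                out += weights[kt][ky][kx] * bilinear_sample(video[tt], sy, sx)
    return out
```

With all offsets set to `(0.0, 0.0)` the function reduces to an ordinary 3D convolution at that position; non-zero offsets move each tap to a different spatial location within its frame, which is the degree of freedom the abstract describes as learnable.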

research
02/10/2021

AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition

Temporal modelling is the key for efficient video action recognition. Wh...
research
03/13/2019

Two-Stream Oriented Video Super-Resolution for Action Recognition

We study the video super-resolution (SR) problem not for visual quality,...
research
03/23/2021

Learning Comprehensive Motion Representation for Action Recognition

For action recognition learning, 2D CNN-based methods are efficient but ...
research
03/04/2020

VESR-Net: The Winning Solution to Youku Video Enhancement and Super-Resolution Challenge

This paper introduces VESR-Net, a method for video enhancement and super...
research
04/20/2020

Facial Action Unit Intensity Estimation via Semantic Correspondence Learning with Dynamic Graph Convolution

The intensity estimation of facial action units (AUs) is challenging due...
research
09/11/2018

Parallel Separable 3D Convolution for Video and Volumetric Data Understanding

For video and volumetric data understanding, 3D convolution layers are w...
research
10/24/2019

Controllable Attention for Structured Layered Video Decomposition

The objective of this paper is to be able to separate a video into its n...