ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis

11/20/2020
by   Zhouyong Liu, et al.

Deep Convolutional Neural Networks (CNNs) are powerful models that have achieved excellent performance on difficult computer vision tasks. Although CNNs perform well whenever large labeled training sets are available, they perform poorly on video frame synthesis because objects deform and move, scene lighting changes, and the camera moves over the course of a video sequence. In this paper, we present a novel and general end-to-end architecture, called convolutional Transformer or ConvTransformer, for video frame sequence learning and video frame synthesis. The core ingredient of ConvTransformer is the proposed attention layer, i.e., multi-head convolutional self-attention, which learns the sequential dependence of a video sequence. ConvTransformer uses an encoder, built upon multi-head convolutional self-attention layers, to map the input sequence to a feature map sequence; another deep network, also incorporating multi-head convolutional self-attention layers, then decodes the target synthesized frames from the feature map sequence. Experiments on the video future frame extrapolation task show ConvTransformer to be superior in quality, while being more parallelizable, than recent approaches built upon convolutional LSTM (ConvLSTM). To the best of our knowledge, this is the first time that the ConvTransformer architecture has been proposed and applied to video frame synthesis.
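
To make the idea of attention over a sequence of feature maps more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the layer's details, so everything here is an assumption for illustration: the function name `conv_self_attention`, the use of 1x1 convolutions to produce queries, keys, and values, and the choice to apply softmax attention across frames independently at each pixel.

```python
import numpy as np


def conv_self_attention(frames, w_q, w_k, w_v, num_heads):
    """Per-pixel temporal self-attention over a sequence of feature maps.

    frames: array of shape (T, C, H, W), one feature map per video frame.
    w_q, w_k, w_v: (C, C) weights of 1x1 convolutions (an assumption made
        for brevity; the paper's kernels may differ).
    Returns (output, attn) with shapes (T, C, H, W) and (heads, T, T, H, W).
    """
    T, C, H, W = frames.shape
    assert C % num_heads == 0, "channels must divide evenly across heads"
    d = C // num_heads  # channels per head

    def project(w):
        # A 1x1 convolution is a channel-mixing matrix applied at every pixel.
        out = np.einsum("oc,tchw->tohw", w, frames)
        return out.reshape(T, num_heads, d, H, W)

    q, k, v = project(w_q), project(w_k), project(w_v)

    # scores[n, t, s, h, w]: for head n, how strongly frame t attends to
    # frame s at pixel (h, w).
    scores = np.einsum("tndhw,sndhw->ntshw", q, k) / np.sqrt(d)

    # Numerically stable softmax over the attended-frame axis s.
    scores = scores - scores.max(axis=2, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=2, keepdims=True)

    # Weighted sum of value maps, then merge the heads back into channels.
    out = np.einsum("ntshw,sndhw->tndhw", attn, v)
    return out.reshape(T, C, H, W), attn


# Usage: 4 frames, 8 channels, 5x5 feature maps, 2 heads (all hypothetical sizes).
rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 8, 5, 5))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = conv_self_attention(frames, w_q, w_k, w_v, num_heads=2)
```

Because every frame attends to every other frame in a single batched operation, there is no step-by-step recurrence as in a ConvLSTM, which is the sense in which such a layer is more parallelizable.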


