MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation

04/12/2023
by Rezaul Karim, et al.

Multiscale video transformers have been explored in a wide variety of vision tasks. To date, however, the multiscale processing has been confined to the encoder or decoder alone. We present a unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in videos. Multiscale representation at both the encoder and the decoder yields two key benefits. At encoding, it provides implicit extraction of spatiotemporal features (i.e., without reliance on input optical flow) as well as temporal consistency. At decoding, it provides coarse-to-fine detection, in which high-level (e.g., object) semantics guide precise localization. Moreover, we propose a transductive learning scheme through many-to-many label propagation to provide temporally consistent predictions. We showcase our Multiscale Encoder-Decoder Video Transformer (MED-VT) on Automatic Video Object Segmentation (AVOS) and actor/action segmentation, where we outperform state-of-the-art approaches on multiple benchmarks using only raw images, without using optical flow.
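
The abstract does not spell out how many-to-many label propagation is computed, so the following is only a generic illustration of the idea, not the authors' formulation. It assumes each frame is flattened into N tokens with C-dimensional features and K-class per-token predictions, and it refines every token's prediction as an affinity-weighted average over all tokens in all frames of the clip. The function name `propagate_labels`, the cosine-similarity affinity, and the `temperature` parameter are all hypothetical choices made for this sketch.

```python
import numpy as np


def propagate_labels(features, preds, temperature=0.1):
    """Generic many-to-many label propagation sketch (hypothetical, not MED-VT's).

    features: array of shape (T, N, C), per-token features for T frames.
    preds:    array of shape (T, N, K), per-token class scores or probabilities.
    Returns an array of shape (T, N, K) where each token's prediction is a
    softmax-weighted average of the predictions of every token in every frame.
    """
    T, N, C = features.shape
    K = preds.shape[-1]

    # Flatten time and space so every token can attend to every other token.
    f = features.reshape(T * N, C)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)

    # Cosine-similarity affinity, sharpened by the temperature (an assumption).
    sim = (f @ f.T) / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(sim)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each row of `weights` sums to 1, so probabilities stay normalized.
    out = weights @ preds.reshape(T * N, K)
    return out.reshape(T, N, K)
```

Because the affinity matrix spans all frames at once, a token whose features resemble tokens in other frames inherits their predictions, which is one simple way to encourage temporally consistent outputs without optical flow.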


