Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective

07/19/2022
by Li Zhang, et al.

Visual representation learning is the key to solving various vision problems. Relying on the seminal grid structure priors, convolutional neural networks (CNNs) have been the de facto standard architectures of most deep vision models. For instance, classical semantic segmentation methods often adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on enlarging the receptive field, either through dilated (i.e., atrous) convolutions or by inserting attention modules. However, the FCN-based architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating visual representation learning generally as a sequence-to-sequence prediction task. Specifically, we deploy a pure Transformer to encode an image as a sequence of patches, without local convolution and resolution reduction. With the global context modeled in every layer of the Transformer, stronger visual representations can be learned for better tackling vision tasks. In particular, our segmentation model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, the first position in the test leaderboard on the day of submission) and Pascal Context (55.83% mIoU), and achieves competitive results on Cityscapes. Further, we formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a hierarchical and pyramidal architecture. Extensive experiments show that our method achieves appealing performance on a variety of visual recognition tasks (e.g., image classification, object detection, instance segmentation and semantic segmentation).
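
To make the sequence-to-sequence formulation concrete, below is a minimal PyTorch sketch of the encoding pipeline the abstract describes: the image is flattened into a sequence of patch embeddings, encoded by a pure Transformer whose every layer applies global self-attention, and mapped back to a pixel-level prediction by simple reshaping and bilinear upsampling. The hyperparameters, class names and the naive decoder here are illustrative assumptions, not the released SETR implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleSETR(nn.Module):
        """Minimal sequence-to-sequence segmentation sketch (not the
        official SETR code). An image is split into fixed-size patches,
        each patch is linearly projected to an embedding, and the token
        sequence is encoded by a plain Transformer with no convolutional
        downsampling. A naive decoder reshapes the sequence back into a
        2D map and bilinearly upsamples it to full resolution."""

        def __init__(self, img_size=256, patch_size=16, in_chans=3,
                     embed_dim=256, depth=4, num_heads=8, num_classes=150):
            super().__init__()
            assert img_size % patch_size == 0
            self.patch_size = patch_size
            self.grid = img_size // patch_size          # patches per side
            num_patches = self.grid ** 2

            # Linear projection of flattened patches (patch embedding).
            self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)
            # One learnable position embedding per patch.
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

            # Pure Transformer encoder: global self-attention in every layer.
            layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=num_heads,
                dim_feedforward=4 * embed_dim, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

            # Naive decoder head: per-token classification.
            self.classifier = nn.Linear(embed_dim, num_classes)

        def forward(self, x):                           # x: (B, C, H, W)
            B, C, H, W = x.shape
            p = self.patch_size
            # Flatten non-overlapping p x p patches: (B, N, p*p*C).
            patches = F.unfold(x, kernel_size=p, stride=p).transpose(1, 2)
            tokens = self.proj(patches) + self.pos_embed    # (B, N, D)
            tokens = self.encoder(tokens)               # global context per layer
            logits = self.classifier(tokens)            # (B, N, num_classes)
            # Reshape the sequence back to a 2D grid, then upsample.
            logits = logits.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
            return F.interpolate(logits, size=(H, W),
                                 mode="bilinear", align_corners=False)

    model = SimpleSETR()
    out = model(torch.randn(1, 3, 256, 256))    # -> (1, 150, 256, 256)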
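
The local-global design of the HLG blocks can likewise be illustrated schematically: self-attention is first restricted to non-overlapping windows (local), and a pooled summary token per window then attends across windows (global), with the result broadcast back to the window tokens. This is only a sketch of the idea under assumed shapes and mean pooling; it is not the paper's HLG architecture.

    import torch
    import torch.nn as nn

    class LocalGlobalBlock(nn.Module):
        """Schematic local-plus-global attention block (illustrative,
        not the HLG implementation). Local branch: self-attention within
        each window. Global branch: attention over one mean-pooled token
        per window, broadcast back to that window's tokens."""

        def __init__(self, dim=96, num_heads=3, window=7):
            super().__init__()
            self.window = window
            self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, x):       # x: (B, H, W, D), H and W divisible by window
            B, H, W, D = x.shape
            w = self.window
            nh, nw = H // w, W // w

            # Local attention within each w x w window.
            win = x.reshape(B, nh, w, nw, w, D).permute(0, 1, 3, 2, 4, 5)
            win = win.reshape(B * nh * nw, w * w, D)    # one sequence per window
            z = self.norm1(win)
            win = win + self.local_attn(z, z, z, need_weights=False)[0]

            # Global attention across window summaries.
            summaries = win.mean(dim=1).reshape(B, nh * nw, D)  # one token per window
            s = self.norm2(summaries)
            summaries = self.global_attn(s, s, s, need_weights=False)[0]
            win = win + summaries.reshape(B * nh * nw, 1, D)    # broadcast back

            # Restore the (B, H, W, D) layout.
            win = win.reshape(B, nh, nw, w, w, D).permute(0, 1, 3, 2, 4, 5)
            return win.reshape(B, H, W, D)

    block = LocalGlobalBlock()
    y = block(torch.randn(2, 14, 14, 96))    # -> (2, 14, 14, 96)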
