Representation Separation for Semantic Segmentation with Vision Transformers

12/28/2022
by Yuanduo Hong, et al.

Vision transformers (ViTs), which encode an image as a sequence of patches, bring a new paradigm to semantic segmentation. We present an efficient framework of representation separation at the local-patch and global-region levels for semantic segmentation with ViTs. It targets the peculiar over-smoothness of ViTs in semantic segmentation, and therefore differs from current popular paradigms of context modeling and from most existing related methods that reinforce the advantage of attention. We first deliver a decoupled two-pathway network in which a second pathway enhances and passes down local-patch discrepancy complementary to the global representations of transformers. We then propose a spatially adaptive separation module that obtains more separated deep representations, and a discriminative cross-attention that yields more discriminative region representations through novel auxiliary supervisions. The proposed methods achieve impressive results: 1) incorporated with large-scale plain ViTs, our methods achieve new state-of-the-art performance on five widely used benchmarks; 2) using masked pre-trained plain ViTs, we set a new record of 68.9; 3) pyramid ViTs integrated with the decoupled two-pathway network even surpass well-designed high-resolution ViTs on Cityscapes; 4) the representations improved by our framework transfer favorably to images with natural corruptions. The code will be released publicly.
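The code has not yet been released, so for intuition only, the following is a minimal sketch of the region-level idea: a generic cross-attention between learnable region queries and ViT patch tokens, supervised by an auxiliary loss as described above. All names, shapes, and hyperparameters here (e.g., RegionCrossAttention, num_regions=150) are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch only: generic cross-attention producing per-region
    # representations from ViT patch tokens. Not the paper's released code.
    import torch
    import torch.nn as nn

    class RegionCrossAttention(nn.Module):
        def __init__(self, dim: int, num_regions: int, num_heads: int = 8):
            super().__init__()
            # One learnable query per semantic region (class).
            self.region_queries = nn.Parameter(torch.randn(num_regions, dim) * 0.02)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)

        def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
            # patch_tokens: (B, N, dim) patch embeddings from the ViT encoder.
            B = patch_tokens.size(0)
            q = self.norm_q(self.region_queries).expand(B, -1, -1)  # (B, R, dim)
            kv = self.norm_kv(patch_tokens)                         # (B, N, dim)
            region_repr, _ = self.attn(q, kv, kv)                   # (B, R, dim)
            # In the paper's framework, such region representations are further
            # supervised with auxiliary losses to make them more discriminative.
            return region_repr

    # Usage (assumed shapes): 32x32 patches at ViT-Base width, 150 classes.
    tokens = torch.randn(2, 1024, 768)
    module = RegionCrossAttention(dim=768, num_regions=150)
    print(module(tokens).shape)  # torch.Size([2, 150, 768])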


Related research

12/15/2022  Full Contextual Attention for Multi-resolution Transformers in Semantic Segmentation
Transformers have proved to be very effective for visual recognition tas...

05/31/2021  SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
We present SegFormer, a simple, efficient yet powerful semantic segmenta...

06/08/2021  Fully Transformer Networks for Semantic Image Segmentation
Transformers have shown impressive performance in various natural langua...

03/19/2023  MECPformer: Multi-estimations Complementary Patch with CNN-Transformers for Weakly Supervised Semantic Segmentation
The initial seed based on the convolutional neural network (CNN) for wea...

05/19/2022  Masked Image Modeling with Denoising Contrast
Since the development of self-supervised visual representation learning ...

08/12/2022  BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
Masked image modeling (MIM) has demonstrated impressive results in self-...

10/21/2022  High-Fidelity Visual Structural Inspections through Transformers and Learnable Resizers
Visual inspection is the predominant technique for evaluating the condit...
