
TFill: Image Completion via a Transformer-Based Architecture

by Chuanxia Zheng, et al.

Bridging distant context interactions is important for high-quality image completion with large masks. Previous methods attempting this via deep or large receptive field (RF) convolutions cannot escape the dominance of nearby interactions, which can degrade results. In this paper, we propose treating image completion as a directionless sequence-to-sequence prediction task and, in a first phase, deploy a transformer to directly capture long-range dependencies in the encoder. Crucially, we employ a restrictive CNN with small, non-overlapping RFs for token representation, which allows the transformer to explicitly model long-range context relations with equal importance in all layers, without implicitly confounding neighboring tokens as larger RFs would. In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced to better exploit distantly related features while avoiding the insular effect of standard attention. Extensive experiments demonstrate superior performance compared to state-of-the-art methods on several datasets.
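The key property of the restrictive tokenizer described above is that each token's receptive field covers exactly one small, non-overlapping patch, so no pixel influences two tokens. A minimal sketch of this idea (in NumPy, with a hypothetical 8×8×3 image and patch size 4, not the paper's actual dimensions or CNN weights) follows:

```python
import numpy as np

def patchify(image, patch=4):
    """Split an H x W x C image into non-overlapping patch tokens.

    Each token sees exactly one patch (a small, non-overlapping
    receptive field), so no pixel contributes to more than one token
    and neighboring tokens are never confounded.
    """
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    tokens = (
        image.reshape(H // patch, patch, W // patch, patch, C)
             .transpose(0, 2, 1, 3, 4)          # group patches together
             .reshape(-1, patch * patch * C)    # one flat vector per patch
    )
    return tokens  # shape: (num_tokens, patch * patch * C)

img = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
tok = patchify(img, patch=4)
print(tok.shape)  # (4, 48): four tokens, each covering one 4x4x3 patch
```

With this tokenization, every transformer layer relates patches purely through attention, which is what lets long-range interactions carry equal weight at every depth.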
