Dynamic Token Normalization Improves Vision Transformer

12/05/2021
by Wenqi Shao, et al.

Vision Transformer (ViT) and its variants (e.g., Swin, PVT) have achieved great success in various computer vision tasks, owing to their capability to learn long-range contextual information. Layer Normalization (LN) is an essential ingredient in these models. However, we find that ordinary LN makes tokens at different positions similar in magnitude because it normalizes the embedding within each token, which makes it difficult for Transformers to capture inductive biases such as the positional context in an image. We tackle this problem by proposing a new normalizer, termed Dynamic Token Normalization (DTN), where normalization is performed both within each token (intra-token) and across different tokens (inter-token). DTN has several merits. First, it is built on a unified formulation and thus can represent various existing normalization methods. Second, DTN learns to normalize tokens in both intra-token and inter-token manners, enabling Transformers to capture both global contextual information and local positional context. Third, by simply replacing LN layers, DTN can be readily plugged into various vision transformers, such as ViT, Swin, PVT, LeViT, T2T-ViT, BigBird, and Reformer. Extensive experiments show that Transformers equipped with DTN consistently outperform their baselines with minimal extra parameters and computational overhead. For example, DTN outperforms LN by 0.5%-1.2% top-1 accuracy on ImageNet, by 1.2-1.4 box AP in object detection on the COCO benchmark, by 2.3%-3.9% mCE in robustness experiments on ImageNet-C, and by 0.5%-0.8% accuracy in Long ListOps on Long-Range Arena. Code will be made public at <https://github.com/wqshao126/DTN>
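To make the intra-token vs. inter-token distinction concrete, below is a minimal PyTorch sketch of a DTN-style normalizer. This is not the paper's exact parameterization: the per-channel gate `lam`, its initialization at 0.5, and the simple convex mix of the two sets of statistics are illustrative assumptions. The sketch only shows the key property of the unified formulation, namely that it recovers plain LN when the gate collapses to purely intra-token statistics.

```python
import torch
import torch.nn as nn

class DynamicTokenNorm(nn.Module):
    """Illustrative sketch of a DTN-style normalizer (not the paper's exact design).

    Input shape: (batch, num_tokens, embed_dim). A learnable per-channel gate
    `lam` (a hypothetical simplification) interpolates between intra-token
    statistics (as in LayerNorm) and inter-token statistics (per channel,
    across tokens).
    """

    def __init__(self, embed_dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(embed_dim))   # affine scale
        self.beta = nn.Parameter(torch.zeros(embed_dim))   # affine shift
        # Gate between intra-token and inter-token statistics; 0.5 init is an assumption.
        self.lam = nn.Parameter(torch.full((embed_dim,), 0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        # Intra-token statistics: each token normalized over its channels (LN-style).
        mu_intra = x.mean(dim=-1, keepdim=True)                       # (B, N, 1)
        var_intra = x.var(dim=-1, keepdim=True, unbiased=False)       # (B, N, 1)
        # Inter-token statistics: each channel normalized across all tokens.
        mu_inter = x.mean(dim=1, keepdim=True)                        # (B, 1, C)
        var_inter = x.var(dim=1, keepdim=True, unbiased=False)        # (B, 1, C)
        # Convex mix of the two; lam == 1 recovers standard LayerNorm.
        lam = self.lam.clamp(0.0, 1.0)
        mu = lam * mu_intra + (1.0 - lam) * mu_inter                  # (B, N, C)
        var = lam * var_intra + (1.0 - lam) * var_inter               # (B, N, C)
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        return x_hat * self.gamma + self.beta

# Usage: drop-in replacement for nn.LayerNorm on ViT-style token embeddings,
# e.g. DynamicTokenNorm(192)(torch.randn(2, 197, 192)).
```

With `lam` pinned at 1 this reduces to standard LN; letting it move toward 0 mixes in channel-wise statistics computed across tokens, which is what exposes position-dependent differences between tokens that LN's per-token normalization washes out.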


