Revisiting Vision Transformer from the View of Path Ensemble

08/12/2023
by   Shuning Chang, et al.

Vision Transformers (ViTs) are normally regarded as a stack of transformer layers. In this work, we propose a novel view of ViTs, showing that they can be seen as ensemble networks containing multiple parallel paths of different lengths. Specifically, we equivalently transform the traditional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) in each transformer layer into three parallel paths. We then exploit the identity connections in this new form to further transform the ViT into an explicit multi-path ensemble network. From this perspective, each path performs two functions: it provides features to the classifier directly, and it provides lower-level feature representations to subsequent, longer paths. We investigate the influence of each path on the final prediction and discover that some paths even degrade performance. We therefore propose two techniques, path pruning and EnsembleScale, which remove the underperforming paths and re-weight the ensemble components, respectively, to optimize the path combination and let short paths focus on providing high-quality representations for subsequent paths. We also demonstrate that our path-combination strategies help ViTs go deeper and act as high-pass filters that attenuate part of the low-frequency signal. To further enhance the representations that short paths supply to subsequent paths, we apply self-distillation to transfer knowledge from the long paths to the short paths. This work calls for more future research that explains and designs ViTs from new perspectives.
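To make the reformulation concrete, the following minimal PyTorch sketch writes one pre-LN transformer layer as three parallel paths (identity, MSA, FFN) with EnsembleScale-style learnable weights. This is a sketch under stated assumptions: the class name ThreePathLayer, the one-scalar-per-path parameterization, and all hyperparameters are illustrative and are not taken from the paper's code.

```python
import torch
import torch.nn as nn

class ThreePathLayer(nn.Module):
    """A pre-LN transformer layer rewritten as three parallel paths
    (identity, MSA, FFN); names and details are illustrative."""

    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )
        # EnsembleScale-style learnable weight per path (an assumed
        # parameterization: one scalar for each of the three paths).
        self.scale = nn.Parameter(torch.ones(3))

    def forward(self, x):
        # Path 1: the identity shortcut.
        identity = x
        # Path 2: multi-head self-attention over the input.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        # Path 3: the FFN applied to the sum of the first two paths,
        # i.e. it consumes the lower-level features they provide.
        ffn_out = self.ffn(self.norm2(identity + attn_out))
        # With all scales fixed to 1, this weighted sum equals the usual
        # cascade x -> x + MSA(x) -> (+ FFN); learning the scales
        # re-weights the ensemble components.
        return (self.scale[0] * identity
                + self.scale[1] * attn_out
                + self.scale[2] * ffn_out)

tokens = torch.randn(2, 197, 384)  # (batch, tokens, dim)
out = ThreePathLayer(dim=384, num_heads=6)(tokens)
print(out.shape)  # torch.Size([2, 197, 384])
```

In this view, path pruning corresponds to dropping a path's direct contribution to the final sum (for instance, fixing its scale to zero) while still letting its output feed the longer paths of later layers.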


research
10/02/2021

Recurrent circuits as multi-path ensembles for modeling responses of early visual cortical neurons

In this paper, we showed that adding within-layer recurrent connections ...
research
05/06/2021

Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

The strong performance of vision transformers on image classification an...
research
05/20/2016

Residual Networks Behave Like Ensembles of Relatively Shallow Networks

In this work we propose a novel interpretation of residual networks show...
research
04/14/2022

MiniViT: Compressing Vision Transformers with Weight Multiplexing

Vision Transformer (ViT) models have recently drawn much attention in co...
research
09/23/2020

Multi-Pass Transformer for Machine Translation

In contrast with previous approaches where information flows only toward...
research
02/06/2022

Estimating the Euclidean Quantum Propagator with Deep Generative Modelling of Feynman paths

Feynman path integrals provide an elegant, classically-inspired represen...
research
04/10/2018

What's (Not) Validating Network Paths: A Survey

Validating network paths taken by packets is critical for a secure Inter...
