ExMobileViT: Lightweight Classifier Extension for Mobile Vision Transformer

09/04/2023
by Gyeongdong Yang, et al.

The paper proposes an efficient structure for enhancing the performance of mobile-friendly vision transformers with small computational overhead. The vision transformer (ViT) is attractive because it outperforms conventional convolutional neural networks (CNNs) in image classification. Due to its high demand for computational resources, MobileNet-based ViT models such as MobileViT-S have been developed; however, their performance does not reach that of the original ViT model. The proposed structure relieves this weakness by storing information from early attention stages and reusing it in the final classifier. The work is motivated by the idea that the data from early attention stages can itself carry important meaning for the final classification. To reuse this early information, the average-pooling results of variously scaled features from early attention stages are used to expand the channels in the fully-connected layer of the final classifier. The inductive bias introduced by the averaged features is expected to enhance the final performance. Because the proposed structure requires only average pooling of features from the attention stages and channel expansion in the final classifier, its computational and storage overheads are very small, preserving the benefits of the low-cost MobileNet-based ViT (MobileViT). Compared with the original MobileViTs on the ImageNet dataset, the proposed ExMobileViT has noticeable accuracy enhancements, having only about 5
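The channel-expansion idea described above can be sketched in a few lines: global-average-pool the feature maps from several attention stages, concatenate the pooled vectors, and feed the expanded channel vector to a single fully-connected classifier. The stage shapes, channel counts, and random weights below are illustrative assumptions, not the paper's exact MobileViT configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from three attention stages of a MobileViT-like
# backbone, shaped (channels, height, width); earlier stages are larger.
stage_feats = [
    rng.standard_normal((96, 32, 32)),   # early attention stage
    rng.standard_normal((128, 16, 16)),  # middle attention stage
    rng.standard_normal((160, 8, 8)),    # final attention stage
]

def global_avg_pool(feat):
    """Average each channel over its spatial dimensions -> (channels,)."""
    return feat.mean(axis=(1, 2))

# Pool every stage and concatenate: the early-stage averages expand the
# channel dimension seen by the final fully-connected classifier.
pooled = np.concatenate([global_avg_pool(f) for f in stage_feats])

num_classes = 1000
# Single fully-connected layer over the expanded channels (random weights
# here, standing in for trained classifier parameters).
W = rng.standard_normal((num_classes, pooled.shape[0])) * 0.01
b = np.zeros(num_classes)
logits = W @ pooled + b

print(pooled.shape)  # expanded channel vector: 96 + 128 + 160 = 384 channels
print(logits.shape)  # one score per class
```

Because the extension is just per-stage average pooling plus a wider final linear layer, the extra compute and parameters grow only with the summed channel counts, which is consistent with the small overhead claimed in the abstract.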

