DeepAI AI Chat
Log In Sign Up

GMML is All you Need

by   Sara Atito, et al.

Vision transformers have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether it is sharply confined local, or long range global. However, they are known to be data hungry. This has motivated the research in self-supervised transformer pretraining, which does not need to decode the semantic information conveyed by labels to link it to the image properties, but rather focuses directly on extracting a concise representation of the image data that reflects the notion of similarity, and is invariant to nuisance factors. The key vehicle for the self-learning process used by the majority of self-learning methods is the generation of multiple views of the training data and the creation of pretext tasks which use these views to define the notion of image similarity, and data integrity. However, this approach lacks the natural propensity to extract contextual information. We propose group masked model learning (GMML), a self-supervised learning (SSL) mechanism for pretraining vision transformers with the ability to extract the contextual information present in all the concepts in an image. GMML achieves this by manipulating randomly groups of connected tokens, ensuingly covering a meaningful part of a semantic concept, and then recovering the hidden semantic information from the visible part of the concept. GMML implicitly introduces a novel data augmentation process. Unlike most of the existing SSL approaches, GMML does not require momentum encoder, nor rely on careful implementation details such as large batches and gradient stopping, which are all artefacts of most of the current self-supervised learning techniques. The source code is publicly available for the community to train on bigger corpora:


page 2

page 7

page 10


ASiT: Audio Spectrogram vIsion Transformer for General Audio Representation

Vision transformers, which were originally developed for natural languag...

MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning

Self-supervised pretraining is the method of choice for natural language...

Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers

Automatic data augmentation (AutoAugment) strategies are indispensable i...

Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction

Self-supervised learning for computer vision has achieved tremendous pro...

Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology

Tissue phenotyping is a fundamental task in learning objective character...

Reason from Context with Self-supervised Learning

A tiny object in the sky cannot be an elephant. Context reasoning is cri...

Understanding Masked Image Modeling via Learning Occlusion Invariant Feature

Recently, Masked Image Modeling (MIM) achieves great success in self-sup...