Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input

10/26/2022
by Daisuke Niizumi, et al.

Masked Autoencoders (MAE) is a simple yet powerful self-supervised learning method. However, it learns representations indirectly, by reconstructing masked input patches. Several methods instead learn representations directly by predicting the representations of masked patches; however, we argue that using all patches to encode the training-signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M2D), that learns representations directly while obtaining training signals from the masked patches only. In M2D, the online network encodes the visible patches and predicts the representations of the masked patches, while the target network, a momentum encoder, encodes only the masked patches. To predict the target representations well, the online network must model the input well; likewise, the target network must model the input well for its outputs to agree with the online predictions. The learned representations should therefore model the input better. We validated M2D by learning general-purpose audio representations, where it set new state-of-the-art performance on tasks such as UrbanSound8K, VoxCeleb1, AudioSet20K, GTZAN, and SpeechCommandsV2.
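The training step described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: plain linear maps stand in for the Vision-Transformer encoders and predictor, mean pooling stands in for the predictor's attention over visible tokens, and the 10-of-16 masking ratio is an assumption chosen for the toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, dim = 16, 8
patches = rng.normal(size=(n_patches, dim))  # e.g. spectrogram patches

# Randomly mask 10 of the 16 patches; the rest stay visible.
mask = np.zeros(n_patches, dtype=bool)
mask[rng.permutation(n_patches)[:10]] = True
visible, masked = patches[~mask], patches[mask]

# Toy stand-ins for the networks: plain linear maps.
W_online = rng.normal(size=(dim, dim))   # online encoder
W_target = W_online.copy()               # target (momentum) encoder
W_pred = rng.normal(size=(dim, dim))     # predictor head

# Online branch: encode visible patches, then predict one
# representation per masked patch (here crudely, via mean pooling).
z_visible = visible @ W_online
summary = z_visible.mean(axis=0)
pred = np.tile(summary, (mask.sum(), 1)) @ W_pred

# Target branch: the momentum encoder sees ONLY the masked patches,
# which is the point that distinguishes M2D from prior methods.
target = masked @ W_target  # treated as a constant (no gradient)

# Loss: negative cosine similarity between predictions and targets.
pred_n = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
target_n = target / np.linalg.norm(target, axis=-1, keepdims=True)
loss = -np.mean(np.sum(pred_n * target_n, axis=-1))

# After each gradient step on the online network, the target network
# is updated as an exponential moving average of the online weights.
ema = 0.99
W_target = ema * W_target + (1 - ema) * W_online
```

Because the target sees only masked patches and the online network sees only visible ones, neither branch can take a shortcut through shared input: both must model the underlying signal for the predictions and targets to agree.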


Related research:
- Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation (04/26/2022)
- Context Autoencoder for Self-Supervised Representation Learning (02/07/2022)
- Mask Hierarchical Features For Self-Supervised Learning (04/01/2023)
- Masked Autoencoders that Listen (07/13/2022)
- SSiT: Saliency-guided Self-supervised Image Transformer for Diabetic Retinopathy Grading (10/20/2022)
- Self-Supervised Point Cloud Representation Learning with Occlusion Auto-Encoder (03/26/2022)
- HindSight: A Graph-Based Vision Model Architecture For Representing Part-Whole Hierarchies (04/08/2021)
