Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning

08/18/2023
by Yeming Chen, et al.

With the success of self-supervised learning, multimodal foundation models driven by vision-and-language (VL) pre-training have rapidly adapted to a wide range of downstream tasks. State-of-the-art methods achieve impressive performance by pre-training on large-scale datasets. However, bridging the semantic gap between the two modalities remains a non-negligible challenge for VL tasks. In this work, we propose an efficient computation framework for multimodal alignment that introduces a novel visual semantic module to further improve performance on VL tasks. Specifically, we propose a flexible model, Artificial-Spiking Hierarchical Networks (ASH-Nets), which combines the complementary advantages of artificial neural networks (ANNs) and spiking neural networks (SNNs) to enrich visual semantic representations. In particular, a visual concrete encoder and a semantic abstract encoder are constructed to learn continuous and discrete latent variables, enhancing the flexibility of semantic encoding. Considering the spatio-temporal properties of SNN modeling, we introduce a contrastive learning method to optimize the inputs of similar samples; this improves the computational efficiency of the hierarchical network, while the augmentation of hard samples benefits the learning of visual representations. Furthermore, we propose the Spiking-to-Text Uni-Alignment Learning (STUA) pre-training method, which relies only on text features to enhance the encoding ability of abstract semantics. We validate the performance on multiple well-established downstream VL tasks, and experiments show that the proposed ASH-Nets achieve competitive results.
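The abstract gives only a high-level description of the alignment objective. As a rough illustration of the kind of spiking-to-text contrastive alignment that STUA describes, the sketch below computes an InfoNCE-style loss between pooled features from a spiking (abstract semantic) branch and text features, with the text side detached so that only the spiking branch is pulled toward the text space. All names (STUAContrastiveLoss, spike_feats, text_feats) and the detach-based "uni-alignment" choice are illustrative assumptions, not the authors' released code.

# Minimal sketch of a spiking-to-text contrastive alignment loss.
# This is an assumption-based illustration, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class STUAContrastiveLoss(nn.Module):
    """InfoNCE-style loss aligning spiking-branch features with text features."""

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, spike_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # spike_feats: (B, D) pooled (e.g. rate-averaged) outputs of the SNN branch
        # text_feats:  (B, D) pooled text-encoder features; detached here so only
        #              the spiking branch is aligned toward the text space
        spike_feats = F.normalize(spike_feats, dim=-1)
        text_feats = F.normalize(text_feats.detach(), dim=-1)

        # (B, B) similarity matrix; diagonal entries are the matched pairs
        logits = spike_feats @ text_feats.t() / self.temperature
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric cross-entropy over both retrieval directions
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    loss_fn = STUAContrastiveLoss()
    spike_feats = torch.randn(8, 256)   # placeholder SNN-branch features
    text_feats = torch.randn(8, 256)    # placeholder text features
    print(loss_fn(spike_feats, text_feats).item())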

research 12/02/2022
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
We present a simple yet effective end-to-end Video-language Pre-training...

research 03/27/2023
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Contrastive learning-based vision-language pre-training approaches, such...

research 10/17/2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Recent years have witnessed the fast development of large-scale pre-trai...

research 06/16/2022
MixGen: A New Multi-Modal Data Augmentation
Data augmentation is a necessity to enhance data efficiency in deep lear...

research 12/11/2019
Multimodal Self-Supervised Learning for Medical Image Analysis
In this paper, we propose a self-supervised learning approach that lever...

research 07/31/2022
Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
Language modality within the vision language pretraining framework is in...

research 07/29/2021
UIBert: Learning Generic Multimodal Representations for UI Understanding
To improve the accessibility of smart devices and to simplify their usag...
