X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

08/18/2021
by Yehao Li, et al.

With the rise and development of deep learning over the past decade, there has been a steady momentum of innovation and breakthroughs that convincingly push the state-of-the-art of cross-modal analytics between vision and language in the multimedia field. Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion. In this work, we propose X-modaler, a versatile and high-performance codebase that decomposes state-of-the-art cross-modal analytics into several general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decoding strategy). Each stage encapsulates a series of modules widely adopted in state-of-the-art methods and allows seamless switching between them. This design naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, aiming to facilitate rapid development within the research community. Moreover, since the effective modular designs of several stages (e.g., cross-modal interaction) are shared across different vision-language tasks, X-modaler can be easily extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed codebase, and its source code, sample projects, and pre-trained models are available online: https://github.com/YehLi/xmodaler.
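To make the staged, plug-and-play design concrete, here is a minimal sketch of a registry-driven pipeline in which each stage is a named, swappable module selected by configuration. This is a hypothetical illustration, not X-modaler's actual API: the names STAGE_REGISTRY, register, build_pipeline, and the toy stage classes are invented for this example.

```python
# Hypothetical sketch (not X-modaler's actual API) of a staged,
# registry-driven pipeline where modules are swapped by config name.
from typing import Callable, Dict

# One registry per general-purpose stage.
STAGE_REGISTRY: Dict[str, Dict[str, Callable]] = {
    "encoder": {},
    "cross_modal": {},
    "decoder": {},
    "decode_strategy": {},
}

def register(stage: str, name: str):
    """Decorator that records a module class under a stage and a name."""
    def deco(cls):
        STAGE_REGISTRY[stage][name] = cls
        return cls
    return deco

@register("encoder", "visual_mlp")
class VisualMLPEncoder:
    def __call__(self, features):
        # Toy transform standing in for a real visual encoder.
        return [f * 0.5 for f in features]

@register("decode_strategy", "greedy")
class GreedyDecoder:
    def __call__(self, states):
        # Pick the index of the highest-scoring state.
        return max(range(len(states)), key=states.__getitem__)

def build_pipeline(cfg: Dict[str, str]):
    """Instantiate each configured stage from the registry."""
    return {stage: STAGE_REGISTRY[stage][name]() for stage, name in cfg.items()}

pipeline = build_pipeline({"encoder": "visual_mlp", "decode_strategy": "greedy"})
encoded = pipeline["encoder"]([0.2, 1.4, 0.6])
print(pipeline["decode_strategy"](encoded))  # -> 1
```

Under this kind of design, switching the decoding strategy (say, from greedy search to beam search) becomes a one-line configuration change rather than a code change, which is the seamless module switching the abstract highlights.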
