
Masked Vision-Language Transformer in Fashion

10/27/2022
by   Ge-Peng Ji, et al.

We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture to replace BERT in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT easily generalizes to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at https://github.com/GewelsJI/MVLT.

Related Research

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training (05/24/2021)
Recently a number of studies demonstrated impressive performance on dive...

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks (03/04/2023)
In the fashion domain, there exists a variety of vision-and-language (V+...

FashionViL: Fashion-Focused Vision-and-Language Representation Learning (07/17/2022)
Large-scale Vision-and-Language (V+L) pre-training for representation le...

Predicting Density of States via Multi-modal Transformer (03/13/2023)
The density of states (DOS) is a spectral property of materials, which p...

MDETR – Modulated Detection for End-to-End Multi-Modal Understanding (04/26/2021)
Multi-modal reasoning systems rely on a pre-trained object detector to e...

Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition (04/10/2022)
Human fashion understanding is one important computer vision task since ...