Masked Vision-Language Transformer in Fashion

10/27/2022
by Ge-Peng Ji, et al.

We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace BERT in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17) over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at https://github.com/GewelsJI/MVLT.
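To make the idea of masked image reconstruction concrete, the following is a minimal sketch of the general technique: split an image into patches, hide a subset, and score a reconstruction only on the hidden patches. It is not the MVLT implementation. The patch size, masking ratio, zero fill value, L1 loss, and all function names here are assumptions chosen for illustration; the abstract does not specify them.

```python
import random


def split_into_patches(image, patch):
    """Split an H x W grid (list of rows) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append(
                [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def choose_masked(num_patches, ratio, rng):
    """Pick patch indices to hide (hypothetical uniform random strategy)."""
    k = max(1, int(num_patches * ratio))
    return sorted(rng.sample(range(num_patches), k))


def apply_mask(patches, masked, fill=0.0):
    """Return a copy of the patches with the masked ones replaced by a fill value."""
    masked_set = set(masked)
    return [
        [fill] * len(p) if i in masked_set else list(p)
        for i, p in enumerate(patches)
    ]


def masked_l1(pred, target, masked):
    """Mean absolute error computed over masked patches only (assumed loss)."""
    total, count = 0.0, 0
    for i in masked:
        for a, b in zip(pred[i], target[i]):
            total += abs(a - b)
            count += 1
    return total / count


rng = random.Random(0)
# Toy 8x8 single-channel "image" standing in for a fashion photo.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
patches = split_into_patches(image, patch=4)
masked = choose_masked(len(patches), ratio=0.25, rng=rng)
corrupted = apply_mask(patches, masked)

# A model would predict the hidden patches from the visible ones (and the text);
# here the corrupted input and the ground truth serve as stand-in predictions.
loss_corrupted = masked_l1(corrupted, patches, masked)
loss_perfect = masked_l1(patches, patches, masked)
```

In a real pipeline the stand-in predictions would be replaced by the transformer's output for the masked positions, and the loss would be backpropagated through the model.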

