FashionViL: Fashion-Focused Vision-and-Language Representation Learning

07/17/2022
by   Xiao Han, et al.

Large-scale Vision-and-Language (V+L) pre-training for representation learning has proven effective in boosting various downstream V+L tasks. However, when it comes to the fashion domain, existing V+L methods are inadequate as they overlook the unique characteristics of both fashion V+L data and fashion downstream tasks. In this work, we propose a novel fashion-focused V+L representation learning framework, dubbed FashionViL. It contains two novel fashion-specific pre-training tasks designed particularly to exploit two intrinsic attributes of fashion V+L data. First, in contrast to other domains where a V+L data point contains only a single image-text pair, a fashion item in the fashion domain can come with multiple images. We thus propose a Multi-View Contrastive Learning task that pulls the visual representation of one image closer to the compositional multimodal representation of another image plus the text. Second, fashion text (e.g., a product description) often contains rich fine-grained concepts (attributes/noun phrases). To exploit this, a Pseudo-Attributes Classification task is introduced to encourage the learned unimodal (visual/textual) representations of the same concept to be adjacent. Further, fashion V+L tasks uniquely include ones that do not conform to the common one-stream or two-stream architectures (e.g., text-guided image retrieval). We thus propose a flexible, versatile V+L model architecture consisting of a modality-agnostic Transformer so that it can be flexibly adapted to any downstream task. Extensive experiments show that FashionViL achieves a new state of the art across five downstream tasks. Code is available at https://github.com/BrandonHanx/mmf.
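To make the Multi-View Contrastive Learning objective described above concrete, below is a minimal sketch (not the authors' implementation from the mmf repository): the unimodal embedding of one product image is pulled towards the fused image+text embedding of another view of the same item with a symmetric InfoNCE loss, while the other items in the batch act as negatives. All names here (multi_view_contrastive_loss, img_emb, fused_emb, temperature) are illustrative assumptions, not identifiers from the released code.

import torch
import torch.nn.functional as F

def multi_view_contrastive_loss(img_emb, fused_emb, temperature=0.07):
    # img_emb:   (B, D) unimodal embeddings of one image per fashion item.
    # fused_emb: (B, D) multimodal embeddings of another image of the same
    #            item composed with its text description.
    # Matching rows are positives; all other rows in the batch are negatives.
    img_emb = F.normalize(img_emb, dim=-1)
    fused_emb = F.normalize(fused_emb, dim=-1)
    logits = img_emb @ fused_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric InfoNCE: image-to-fused and fused-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for encoder outputs.
img = torch.randn(8, 256)
fused = torch.randn(8, 256)
print(multi_view_contrastive_loss(img, fused).item())

The symmetric two-direction loss mirrors standard image-text contrastive training; the only fashion-specific twist is that one side of the pair is a fused image+text representation rather than text alone.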

