FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

03/04/2023
by Xiao Han, et al.

In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. They differ drastically in their individual input/output formats and dataset sizes. It has been common to design a task-specific model and fine-tune it independently from a pre-trained V+L model (e.g., CLIP). This results in parameter inefficiency and an inability to exploit inter-task relatedness. To address these issues, we propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL) in this work. Compared with existing approaches, FAME-ViL applies a single model to multiple heterogeneous fashion tasks and is therefore much more parameter-efficient. It is enabled by two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-task training strategy that supports learning from heterogeneous data and prevents negative transfer. Extensive experiments on four fashion tasks show that FAME-ViL saves 61.5% of parameters over alternatives while significantly outperforming conventional independently trained single-task models. Code is available at https://github.com/BrandonHanx/FAME-ViL.
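To make the adapter-based design concrete, below is a minimal PyTorch sketch of a task-specific bottleneck adapter wrapped around a frozen, shared backbone layer. The names, dimensions, and wrapping scheme are illustrative assumptions for exposition, not FAME-ViL's actual implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class TaskSpecificAdapter(nn.Module):
    """Bottleneck adapter: down-project, activate, up-project, add residual."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedLayer(nn.Module):
    """A frozen backbone layer shared across tasks, followed by one
    lightweight adapter per task (only the adapters are trained)."""
    def __init__(self, backbone_layer: nn.Module, task_names, hidden_dim: int = 768):
        super().__init__()
        self.backbone = backbone_layer
        for p in self.backbone.parameters():
            p.requires_grad = False  # shared backbone weights stay frozen
        self.adapters = nn.ModuleDict(
            {t: TaskSpecificAdapter(hidden_dim) for t in task_names}
        )

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        return self.adapters[task](self.backbone(x))
```

Only the adapters carry trainable per-task parameters, which is where the parameter savings over four independently fine-tuned models come from. The second component, multi-task training over heterogeneous data, can be sketched as a task-sampling loop. The uniform sampling below is a deliberate simplification; the paper's actual strategy additionally stabilizes optimization to prevent negative transfer, which this sketch does not reproduce.

```python
import itertools
import random

def multitask_train(model, optimizer, loaders: dict, num_steps: int):
    """Each step draws one batch from one task's dataloader. Illustrative:
    uniform task sampling; `model` is assumed to return a scalar loss
    when called with a batch and a task name."""
    iters = {task: itertools.cycle(dl) for task, dl in loaders.items()}
    tasks = list(loaders)
    for _ in range(num_steps):
        task = random.choice(tasks)       # pick a task for this step
        batch = next(iters[task])
        loss = model(batch, task=task)    # task-conditioned forward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```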


