FashionVQA: A Domain-Specific Visual Question Answering System

08/24/2022
by Min Wang, et al.

Humans apprehend the world through various sensory modalities, yet language is their predominant communication channel. Machine learning systems need to draw on the same multimodal richness to hold informed discourse with humans in natural language; this is particularly true for systems specialized in visually dense information, such as dialogue, recommendation, and search engines for clothing. To this end, we train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images. The key to successfully training our VQA model is the automatic creation of a visual question-answering dataset of 168 million samples from the item attributes of 207 thousand images using diverse templates. The sample generation employs a strategy that takes the difficulty of question-answer pairs into account in order to emphasize challenging concepts. Contrary to the recent trend of pretraining visual question answering models on several datasets, we keep the dataset fixed while training various models from scratch, isolating the improvements that come from changes in model architecture. We find that using the same transformer to encode the question and decode the answer, as in language models, achieves the highest accuracy, showing that visual language models (VLMs) make the best visual question answering systems for our dataset. The accuracy of the best model surpasses the human expert level, even when answering human-generated questions that are not confined to the template formats. Our approach for generating a large-scale, multimodal, domain-specific dataset provides a path for training specialized models capable of communicating in natural language. The training of such domain-expert models, e.g., our fashion VLM, cannot rely solely on large-scale general-purpose datasets collected from the web.
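To make the generation recipe concrete, below is a minimal Python sketch of template-based sample creation with difficulty-aware sampling. The attribute schema, template strings, and difficulty weights are hypothetical placeholders; the abstract does not specify them.

```python
import random

# Hypothetical per-image attribute annotations; the actual schema behind
# the 168M-sample dataset is not given in the abstract.
IMAGE_ANNOTATIONS = {
    "img_001.jpg": {"category": "dress", "color": "red", "sleeve": "long"},
    "img_002.jpg": {"category": "jacket", "color": "black", "sleeve": "short"},
}

# Illustrative question templates, keyed by the attribute they ask about.
TEMPLATES = {
    "color": ["What color is the {category}?",
              "Which color does the {category} come in?"],
    "sleeve": ["What sleeve length does the {category} have?"],
}

# Assumed difficulty weights: harder concepts are sampled more often,
# mirroring the difficulty-aware strategy the abstract describes.
DIFFICULTY = {"color": 1.0, "sleeve": 2.5}

def generate_samples(annotations, n_samples, rng=random.Random(0)):
    """Yield (image, question, answer) triples by filling templates."""
    attrs = list(TEMPLATES)
    weights = [DIFFICULTY[a] for a in attrs]
    images = list(annotations)
    for _ in range(n_samples):
        attr = rng.choices(attrs, weights=weights, k=1)[0]
        image = rng.choice(images)
        item = annotations[image]
        question = rng.choice(TEMPLATES[attr]).format(**item)
        yield image, question, item[attr]

for sample in generate_samples(IMAGE_ANNOTATIONS, 5):
    print(sample)
```

The architectural finding, that one shared transformer both encodes the question and decodes the answer, can likewise be sketched as a decoder-only model over concatenated image and text tokens. Dimensions here are invented and positional encodings are omitted for brevity; this illustrates the idea, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FashionVLMSketch(nn.Module):
    """One transformer stack handles both the question context and the
    answer tokens being decoded, as in decoder-only language models."""

    def __init__(self, vocab_size=30522, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(2048, d_model)  # e.g. CNN region features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, question_ids, answer_ids):
        # Concatenate [image tokens; question tokens; answer tokens] into
        # one sequence processed by the shared transformer.
        x = torch.cat([self.img_proj(img_feats),
                       self.token_emb(question_ids),
                       self.token_emb(answer_ids)], dim=1)
        # Causal mask so answer decoding is autoregressive.
        seq_len = x.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                          diagonal=1)
        h = self.blocks(x, mask=mask)
        # Predict next tokens only over the answer positions.
        return self.lm_head(h[:, -answer_ids.size(1):, :])
```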

Related research

09/20/2023

Visual Question Answering in the Medical Domain

Medical visual question answering (Med-VQA) is a machine learning task t...

08/05/2022

ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding

Visual question answering is an important task in both natural language ...

10/12/2017

Adapting general-purpose speech recognition engine output for domain-specific natural language question answering

Speech-based natural language question-answering interfaces to enterpris...

12/28/2017

A Syntactic Approach to Domain-Specific Automatic Question Generation

Factoid questions are questions that require short fact-based answers. A...

04/03/2020

Template-based Question Answering using Recursive Neural Networks

We propose a neural network-based approach to automatically learn and cl...

04/19/2018

Putting Question-Answering Systems into Practice: Transfer Learning for Efficient Domain Customization

Traditional information retrieval (such as that offered by web search en...

12/06/2021

MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering

Textbook Question Answering (TQA) is a complex multimodal task to infer ...
