Champion Solution for the WSDM2023 Toloka VQA Challenge

01/22/2023
by Shengyi Gao, et al.

In this report, we present our champion solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge. Unlike common VQA and visual grounding (VG) tasks, this challenge involves a more complex scenario, i.e., inferring and locating the object that is implicitly specified by a given interrogative question. For this task, we leverage ViT-Adapter, a pre-training-free adapter network, to adapt the multi-modal pre-trained Uni-Perceiver for better cross-modal localization. Our method ranks first on the leaderboard, achieving 77.5 and 76.347 IoU on the public and private test sets, respectively. This result shows that ViT-Adapter is also an effective paradigm for adapting a unified perception model to vision-language downstream tasks. Code and models will be released at https://github.com/czczup/ViT-Adapter/tree/main/wsdm2023.
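To illustrate the adapter idea at a high level, the PyTorch sketch below shows how a lightweight, pre-training-free side module can refine the tokens of a multi-modal transformer encoder between stages before a box head regresses the location of the queried object. This is a minimal conceptual sketch, not the released implementation; all class names (SimpleViTBackbone, SpatialAdapter, BoxHead) and dimensions are hypothetical stand-ins.

```python
# Conceptual sketch of an adapter-style side network for cross-modal localization.
# Not the authors' code; module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SimpleViTBackbone(nn.Module):
    """Stand-in for a multi-modal pre-trained transformer encoder."""
    def __init__(self, dim=256, depth=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.blocks = nn.ModuleList([nn.TransformerEncoder(layer, 1) for _ in range(depth)])

    def forward(self, tokens, adapter=None):
        for i, blk in enumerate(self.blocks):
            tokens = blk(tokens)
            if adapter is not None:
                tokens = adapter(tokens, stage=i)  # adapter refines tokens between stages
        return tokens

class SpatialAdapter(nn.Module):
    """Lightweight, pre-training-free side module injected between backbone stages."""
    def __init__(self, dim=256, num_stages=4):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim)) for _ in range(num_stages)]
        )

    def forward(self, tokens, stage):
        return tokens + self.proj[stage](tokens)  # residual refinement per stage

class BoxHead(nn.Module):
    """Regresses a normalized (cx, cy, w, h) box from the fused tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, tokens):
        return self.mlp(tokens.mean(dim=1)).sigmoid()

# Toy forward pass: image patch tokens and question tokens are concatenated and
# processed jointly, then a single box is predicted for the queried object.
img_tokens = torch.randn(2, 196, 256)  # e.g., 14x14 image patches
txt_tokens = torch.randn(2, 20, 256)   # embedded interrogative question
backbone, adapter, head = SimpleViTBackbone(), SpatialAdapter(), BoxHead()
boxes = head(backbone(torch.cat([img_tokens, txt_tokens], dim=1), adapter=adapter))
print(boxes.shape)  # torch.Size([2, 4])
```

The key design choice this sketch mirrors is that the adapter adds spatial refinement without requiring its own pre-training, so the pre-trained multi-modal weights can be reused largely unchanged for the localization task.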
