Less Is More: Linear Layers on CLIP Features as Powerful VizWiz Model

06/10/2022
by Fabian Deuser, et al.

Current architectures for multimodal tasks such as visual question answering suffer from high complexity. As a result, they are difficult to train and require substantial computational resources. To address these problems, we present a CLIP-based architecture that does not require any fine-tuning of the feature extractors. A simple linear classifier is applied to the concatenated features of the image and text encoders. During training, an auxiliary loss is added that operates on the answer types. The resulting answer-type classification is then used as an attention gate on the answer class selection. On the VizWiz 2022 Visual Question Answering Challenge, we achieve an accuracy of 60.15 on Task 1: Predict Answer to a Visual Question and an AP score of 83.78 on Task 2: Predict Answerability of a Visual Question.
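
The pipeline described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the feature dimensions, the number of answers and answer types, the sigmoid gate, the unweighted sum of the two losses, and all names (`forward`, `type_head`, `gate`, `answer_head`) are assumptions made for illustration, and random vectors stand in for the frozen CLIP image and text features.

```python
import math
import random


def init_linear(n_in, n_out, rng):
    """Return a random (n_out x n_in) weight matrix and a zero bias."""
    weight = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(x, weight, bias):
    """Compute weight @ x + bias for a single feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def forward(image_feat, text_feat, params):
    # Concatenate the (frozen) image and text features.
    x = image_feat + text_feat
    # Auxiliary head: predicts the answer type.
    type_logits = linear(x, *params["type_head"])
    # Assumed gating: map the type prediction to one gate value per answer.
    gate = [sigmoid(g) for g in linear(type_logits, *params["gate"])]
    # Linear answer classifier, scaled element-wise by the gate.
    answer_logits = [a * g
                     for a, g in zip(linear(x, *params["answer_head"]), gate)]
    return answer_logits, type_logits


# Toy sizes, chosen only to keep the example small (assumptions).
D_IMG, D_TXT, N_ANSWERS, N_TYPES = 8, 8, 5, 4

rng = random.Random(0)
params = {
    "type_head": init_linear(D_IMG + D_TXT, N_TYPES, rng),
    "gate": init_linear(N_TYPES, N_ANSWERS, rng),
    "answer_head": init_linear(D_IMG + D_TXT, N_ANSWERS, rng),
}

# Random stand-ins for precomputed CLIP features.
image_feat = [rng.gauss(0.0, 1.0) for _ in range(D_IMG)]
text_feat = [rng.gauss(0.0, 1.0) for _ in range(D_TXT)]

answer_logits, type_logits = forward(image_feat, text_feat, params)

# Training objective: answer loss plus the auxiliary answer-type loss.
# The target indices (2 and 1) are arbitrary placeholders.
loss = cross_entropy(answer_logits, 2) + cross_entropy(type_logits, 1)
```

Because the feature extractors stay frozen in this setup, only the three small linear maps would be trained, which is the source of the low training cost the abstract emphasizes.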


research
03/04/2016

Dynamic Memory Networks for Visual and Textual Question Answering

Neural network architectures with memory and attention mechanisms exhibi...
research
12/07/2015

Simple Baseline for Visual Question Answering

We describe a very simple bag-of-words baseline for visual question answ...
research
12/14/2021

Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Answering semantically-complicated questions according to an image is ch...
research
10/09/2016

Open-Ended Visual Question-Answering

This thesis report studies methods to solve Visual Question-Answering (V...
research
10/06/2020

DaNetQA: a yes/no Question Answering Dataset for the Russian Language

DaNetQA, a new question-answering corpus, follows (Clark et. al, 2019) d...
research
09/22/2021

Tecnologica cosa: Modeling Storyteller Personalities in Boccaccio's Decameron

We explore Boccaccio's Decameron to see how digital humanities tools can...
research
04/12/2020

Explaining Question Answering Models through Text Generation

Large pre-trained language models (LMs) have been shown to perform surpr...
