KAT: A Knowledge Augmented Transformer for Vision-and-Language

12/16/2021
by Liangke Gui, et al.

The primary focus of recent work with large-scale transformers has been on optimizing the amount of information packed into the model's parameters. In this work, we ask a different question: can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored a paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge and about how reasoning over implicit and explicit knowledge should be integrated. To address these challenges, we propose a novel model - the Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result (+6 points absolute) on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture and jointly reasons over both knowledge sources during answer generation. In our analysis, explicit knowledge integration also yields more interpretable model predictions.
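
To make the retrieve-then-generate idea concrete, the sketch below is a minimal illustration, not the authors' released code: the class name KATSketch, the tensor shapes, and the random stand-in features are all assumptions. It encodes multimodal backbone features (the implicit knowledge carried by a pretrained model) and retrieved knowledge-base entries (explicit knowledge) with separate transformer encoders, then decodes the answer while cross-attending to both memories at once.

```python
# Minimal sketch (hypothetical, not the paper's implementation): a
# retrieve-then-generate model whose decoder jointly attends to implicit
# and explicit knowledge while generating the answer.
import torch
import torch.nn as nn


class KATSketch(nn.Module):  # illustrative name, not from the paper
    def __init__(self, d_model=256, nhead=4, num_layers=2, vocab_size=1000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.implicit_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.explicit_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.answer_embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, implicit_feats, explicit_feats, answer_tokens):
        # implicit_feats:  (B, N_vq, d) features of the image/question backbone
        # explicit_feats:  (B, N_kb, d) embeddings of retrieved knowledge entries
        # answer_tokens:   (B, T) teacher-forced answer token ids
        implicit_mem = self.implicit_encoder(implicit_feats)
        explicit_mem = self.explicit_encoder(explicit_feats)
        # Joint reasoning: the decoder cross-attends to both knowledge
        # sources in a single concatenated memory at every layer.
        memory = torch.cat([implicit_mem, explicit_mem], dim=1)
        tgt = self.answer_embed(answer_tokens)
        T = answer_tokens.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)  # (B, T, vocab_size) next-token logits


# Toy usage with random features standing in for real retrieval and encoding.
model = KATSketch()
logits = model(torch.randn(2, 40, 256), torch.randn(2, 8, 256),
               torch.randint(0, 1000, (2, 6)))
print(logits.shape)  # torch.Size([2, 6, 1000])
```

Concatenating the two encoder memories is one simple way to let every decoder layer see both knowledge sources at each generation step, matching the joint-reasoning idea described above; the actual KAT architecture may combine them differently.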

Related research

05/24/2023 - Interpretable by Design Visual Question Answering
Model interpretability has long been a hard problem for the AI community...

02/22/2021 - Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer
We propose UniT, a Unified Transformer model to simultaneously learn the...

11/14/2019 - Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
Many visual scenes contain text that carries crucial information, and it...

09/09/2021 - TxT: Crossmodal End-to-End Learning with Transformers
Reasoning over multiple modalities, e.g. in Visual Question Answering (V...

02/17/2023 - CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension
The task of multimodal referring expression comprehension (REC), aiming ...

02/02/2023 - QR-CLIP: Introducing Explicit Open-World Knowledge for Location and Time Reasoning
Daily images may convey abstract meanings that require us to memorize an...

03/29/2023 - MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
The development of language models have moved from encoder-decoder to de...
