Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis

02/11/2023
by Zhu Wang, et al.

Deep network models are often purely inductive, both during training and when performing inference on unseen data. As a result, when such models are used for prediction, they are known to often miss the semantic information and implicit dependencies that exist among objects (or concepts) at the population level. Moreover, it is still unclear how domain or prior modal knowledge can be specified in a backpropagation-friendly manner, especially in large-scale and noisy settings. In this work, we propose an end-to-end vision and language model that incorporates explicit knowledge graphs. We also introduce an interactive out-of-distribution (OOD) layer based on an implicit network operator. The layer filters out noise introduced by the external knowledge base. In practice, we apply our model to several vision and language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval, on different datasets. Our experiments show that it is possible to design models that perform comparably to state-of-the-art results while using significantly fewer samples and less training time.
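
To make the idea of a backpropagation-friendly noise filter concrete, here is a minimal sketch of a soft outlier gate. It is not the implicit network operator proposed in the paper, whose details the abstract does not give. It assumes that each retrieved knowledge entry is an embedding vector, that a single context vector summarizes the image-text input, and that cosine similarity to that context is a reasonable outlier score. The names `soft_outlier_weights` and `aggregate`, and the `threshold` and `temperature` parameters, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_outlier_weights(context, candidates, threshold=0.5, temperature=0.1):
    """Return one weight in (0, 1) per candidate embedding.

    Assumption (not from the paper): a candidate whose cosine similarity to
    the context falls below `threshold` is treated as a likely outlier. The
    sigmoid keeps the gate smooth, so the same expression could be trained
    with gradients in an autodiff framework instead of using a hard cutoff.
    """
    return [
        sigmoid((cosine(context, cand) - threshold) / temperature)
        for cand in candidates
    ]


def aggregate(candidates, weights):
    """Weighted average of the candidates; likely outliers contribute little."""
    total = sum(weights)
    dim = len(candidates[0])
    if total == 0:
        return [0.0] * dim
    return [
        sum(w * cand[i] for w, cand in zip(weights, candidates)) / total
        for i in range(dim)
    ]


# Toy example: the first entry agrees with the context, the second does not.
context = [1.0, 0.0]
candidates = [[0.9, 0.1], [-1.0, 0.2]]
weights = soft_outlier_weights(context, candidates)
filtered = aggregate(candidates, weights)
```

In this toy example the second entry points away from the context, so it receives a weight close to zero and `filtered` stays close to the first entry. A hard threshold would produce a similar result here but would block gradients, which is the reason for the smooth gate in this sketch.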


