
MCQA: Multimodal Co-attention Based Network for Question Answering

by Abhishek Kumar, et al.
University of Maryland

We present MCQA, a learning-based algorithm for multimodal question answering. MCQA explicitly fuses and aligns the multimodal input (i.e., text, audio, and video), which forms the context for the query (question and answer). Our approach fuses and aligns the question and the answer within this context. Moreover, we use the notion of co-attention to perform cross-modal alignment and multimodal context-query alignment. Our context-query alignment module matches the relevant parts of the multimodal context and the query with each other and aligns them to improve the overall performance. We evaluate the performance of MCQA on Social-IQ, a benchmark dataset for multimodal question answering. We compare the performance of our algorithm with prior methods and observe an accuracy improvement of 4-7%.
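The context-query alignment described above can be sketched with a minimal co-attention mechanism: a similarity matrix between context and query features is normalized in both directions, so each context step attends over the query and each query step attends over the context. The function name, shapes, and dot-product similarity below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(context, query):
    """Illustrative context-query co-attention (assumed form).

    context: (T_c, d) fused multimodal context features
    query:   (T_q, d) question/answer features
    Returns context-to-query and query-to-context attended features.
    """
    S = context @ query.T                 # (T_c, T_q) similarity scores
    c2q = softmax(S, axis=1) @ query      # each context step attends over the query
    q2c = softmax(S, axis=0).T @ context  # each query step attends over the context
    return c2q, q2c
```

The two attended outputs can then be concatenated with the original features and passed to a downstream answer-scoring module; the bidirectional normalization is what lets relevant parts of the context and the query match each other.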

