DeepAI AI Chat
Log In Sign Up

MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering

12/06/2021
by   Fangzhi Xu, et al.
8

Textbook Question Answering (TQA) is a complex multimodal task to infer answers given large context descriptions and abundant diagrams. Compared with Visual Question Answering (VQA), TQA contains a large number of uncommon terminologies and various diagram inputs. It brings new challenges to the representation capability of language model for domain-specific spans. And it also pushes the multimodal fusion to a more complex level. To tackle the above issues, we propose a novel model named MoCA, which incorporates multi-stage domain pretraining and multimodal cross attention for the TQA task. Firstly, we introduce a multi-stage domain pretraining module to conduct unsupervised post-pretraining with the span mask strategy and supervised pre-finetune. Especially for domain post-pretraining, we propose a heuristic generation algorithm to employ the terminology corpus. Secondly, to fully consider the rich inputs of context and diagrams, we propose cross-guided multimodal attention to update the features of text, question diagram and instructional diagram based on a progressive strategy. Further, a dual gating mechanism is adopted to improve the model ensemble. The experimental results show the superiority of our model, which outperforms the state-of-the-art methods by 2.21

READ FULL TEXT
03/24/2016

A Diagram Is Worth A Dozen Images

Diagrams are common tools for representing complex concepts, relationshi...
04/25/2020

MCQA: Multimodal Co-attention Based Network for Question Answering

We present MCQA, a learning-based algorithm for multimodal question answ...
04/08/2019

Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

In this paper, we propose a novel end-to-end trainable Video Question An...
02/25/2019

MUREL: Multimodal Relational Reasoning for Visual Question Answering

Multimodal attentional networks are currently state-of-the-art models fo...
11/25/2020

XTQA: Span-Level Explanations of the Textbook Question Answering

Textbook Question Answering (TQA) is a task that one should answer a dia...
03/01/2023

RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training

Vision-and-language multi-modal pretraining and fine-tuning have shown g...
11/27/2017

Dynamic Graph Generation Network: Generating Relational Knowledge from Diagrams

In this work, we introduce a new algorithm for analyzing a diagram, whic...