MCQA: Multimodal Co-attention Based Network for Question Answering

by Abhishek Kumar, et al.
University of Maryland

We present MCQA, a learning-based algorithm for multimodal question answering. MCQA explicitly fuses and aligns the multimodal input (i.e., text, audio, and video), which forms the context for the query (question and answer). Our approach fuses and aligns the question and the answer within this context. Moreover, we use the notion of co-attention to perform cross-modal alignment and multimodal context-query alignment. Our context-query alignment module matches the relevant parts of the multimodal context and the query with each other and aligns them to improve the overall performance. We evaluate the performance of MCQA on Social-IQ, a benchmark dataset for multimodal question answering. We compare the performance of our algorithm with prior methods and observe an accuracy improvement of 4-7%.




1 Introduction

Intelligent Question Answering Woods (1978) has seen great progress in the past decade Zadeh et al. (2019). Initial attempts in question answering were limited to a single modality of text Rajpurkar et al. (2018, 2016); Welbl et al. (2018); Weston et al. (2015). There has been a shift in terms of using multiple modalities that also include audio or video. For instance, recent works have shown good performance in single-image based question answering, which include image and text modalities Agrawal et al. (2017); Jang et al. (2017); Yu et al. (2015), and video-based question answering tasks with video, audio and text as the underlying modalities Tapaswi et al. (2016); Kim et al. (2018); Lei et al. (2018).

Recently, there has been interest in the community in shifting to more challenging questions of the ‘why’ and ‘how’ kind, rather than ‘what’ and ‘when’, to make the models more intelligent. These questions are harder as they require a sense of causal reasoning. One recent benchmark in this context is the Social-IQ (Social Intelligence Queries) dataset Zadeh et al. (2019), which contains a set of questions and answers for in-the-wild social videos. The dataset comprises a diverse set of videos collected from YouTube, is completely unconstrained and unscripted, and is regarded as challenging because of the gap between human performance and that of prior methods. The average answer length in Social-IQ is also longer than in previous datasets. This makes it challenging to develop accurate algorithms for such datasets.

Main Contributions: Social-IQ is a challenging dataset with video, text, and audio input. On Social-IQ, we demonstrate better results than systems that are state-of-the-art on other datasets such as MovieQA and TVQA. Given an input video and an input query (question and answer), we present a novel learning-based algorithm to predict whether the answer is correct. Our main contributions include:

  1. We present MCQA, a multimodal question answering algorithm that includes two main novel components: Multimodal Fusion and Alignment, which fuses and aligns the three multimodal inputs (text, video, and audio) to serve as a context to the query, and Multimodal Context-Query Alignment, which performs cross-alignment of the modalities with the query.

  2. We propose the use of Co-Attention, a concept borrowed from Machine Reading Comprehension literature Wang et al. (2018b), to perform these alignments.

  3. We analyze the performance of MCQA and compare it with the existing state-of-the-art QA methods on the Social-IQ dataset. We report an improvement of around 4-7% over prior methods.

2 Related Work

Figure 1: MCQA Network Architecture: The inputs to our network are encoded using BiLSTMs. Our multimodal fusion and alignment component aligns and fuses the multimodal inputs using co-attention and a BiLSTM to output the multimodal context. Our multimodal context-query alignment module uses co-attention and a BiLSTM to soft-align and combine the context and the query to generate the latent representation. The final answer is then computed by applying linear self-alignment followed by a feed-forward network (Section 3.4).

Question Answering Datasets: Datasets like COCO-QA Ren et al. (2015), VQA Antol et al. (2015), FM-IQA Gao et al. (2015), and Visual7w Zhu et al. (2016) are single-image based question answering datasets. MovieQA Tapaswi et al. (2016) and TVQA Lei et al. (2018) extend the task of question answering from a single image to videos. All of these datasets have focused on questions like ‘what’ and ‘when’.

Multimodal Fusion: With multiple input modalities, it is important to consider how to fuse different modalities. Fusion methods like Tensor Fusion Network Zadeh et al. (2017) and Memory Fusion Network Zadeh et al. (2018) have been used for multimodal sentiment analysis. These methods use tensor fusion and attention-based memory fusion, respectively. Lei et al. (2018) use a context matching module and a BiLSTM to model the inputs in a combined manner for the multimodal question answering task. Sun et al. (2019) proposed the VideoBERT model to learn the bidirectional joint distributions over text and video data. Our model takes text, video, and audio features as input, whereas VideoBERT only takes video and text data. It is not evident whether VideoBERT can be extended directly to handle audio inputs.

Multimodal Context-Query Alignment: Alignment has been studied extensively for reading comprehension and question answering tasks. For instance, Wang et al. (2018b) use a hierarchical attention fusion network to model the query and the context. Xiong et al. (2016) use a co-attentive encoder that captures the interactions between the question and the document, and a dynamic pointing decoder to answer the questions. Wang et al. (2017) propose a self-matching attention mechanism and use pointer networks.

3 Our Approach: MCQA

In this section, we present the details of our learning-based algorithm (MCQA). In Section 3.1, we give an overview of the approach. This is followed by a discussion of co-attention in Section 3.2. We use this notion for Multimodal Fusion and Alignment and for Multimodal Context-Query Alignment. We present these two alignments in Sections 3.3 and 3.4, respectively.

3.1 Overview

We present MCQA for the task of multimodal question answering. We directly used the features publicly released by the Social-IQ authors. The input text, audio, and video features were generated using BERT embeddings, COVAREP features, and a pretrained DenseNet161 network, respectively. The video frames were sampled at 30fps. We used the CMU-SDK library to align the text, audio, and video modalities automatically. Given an input video with three input modalities, text, audio, and video, of feature length 768, 74, and 2208, respectively, we perform Multimodal Fusion and Alignment to obtain the context for the input query. Next, we perform Multimodal Context-Query Alignment to obtain the predicted answer. Figure 1 highlights the overall network. We tried multiple approaches and different attention mechanisms. Ultimately, co-attention performed the best.

3.2 Co-attention

Our approach is motivated by Wang et al. (2018b), who use the notion of co-attention for machine reading comprehension. The co-attention component calculates the shallow semantic similarity between two inputs. Given two input sequences, co-attention aligns them by constructing a soft-alignment matrix. Here, the two inputs can be the encoded outputs of any BiLSTM. Each entry of the matrix is the product of the ReLU Nair and Hinton (2010) activations of the two inputs.

We use the attention weights of the soft-alignment matrix to obtain the attended vectors for both inputs. The associated weights are trainable and are learned jointly. Each attended vector of the first input is a combination of the vectors of the second input that are most relevant to it. Similarly, each attended vector of the second input is a combination of the vectors of the first input that are most relevant to it. In both directions, the attention weights for a given vector are distributed over all relevant vectors of the other input.

The final attended representation is the concatenation of the two sets of attended vectors, which captures the most important parts of the two inputs with respect to each other. We use this representation in our network to capture the soft-alignment between the two inputs. As an example, given two encoded inputs, the co-attention component outputs their attended representation, as shown in Fig 1.
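The co-attention step described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `co_attention`, `W_p`, and `W_q`, the placement of the trainable weights as linear projections inside the ReLU, the softmax normalization, and the feature-axis concatenation of equal-length inputs are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(P, Q, W_p, W_q):
    """Soft-align two sequences P (T, d_p) and Q (U, d_q).

    W_p (d_p, k) and W_q (d_q, k) are assumed trainable projections.
    """
    # Soft-alignment matrix: product of ReLU activations of both inputs.
    S = np.maximum(P @ W_p, 0.0) @ np.maximum(Q @ W_q, 0.0).T  # (T, U)
    alpha = softmax(S, axis=1)  # for each row of P, weights over rows of Q
    beta = softmax(S, axis=0)   # for each row of Q, weights over rows of P
    P_att = alpha @ Q           # (T, d_q): parts of Q relevant to each row of P
    Q_att = beta.T @ P          # (U, d_p): parts of P relevant to each row of Q
    return S, alpha, beta, P_att, Q_att

# Toy inputs with hypothetical sizes; random values stand in for BiLSTM outputs.
rng = np.random.default_rng(0)
T, d_p, d_q, k = 5, 6, 8, 4
P = rng.standard_normal((T, d_p))
Q = rng.standard_normal((T, d_q))
W_p = rng.standard_normal((d_p, k))
W_q = rng.standard_normal((d_q, k))

S, alpha, beta, P_att, Q_att = co_attention(P, Q, W_p, W_q)
# Assumed concatenation along the feature axis (requires equal sequence lengths).
fused = np.concatenate([P_att, Q_att], axis=1)
```

Because both inputs pass through a ReLU before the product, every entry of the soft-alignment matrix is non-negative, and each attended vector is a convex combination of vectors from the other input.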

3.3 Multimodal Fusion and Alignment

The multimodal input from the Social-IQ dataset consists of text, audio, and video components. These components are encoded using BiLSTMs to capture the contextual information. We utilize all the outputs of each BiLSTM and not just the final vector representation. We use a separate BiLSTM, with its own dimension, for each of text, audio, and video. Similarly, the question and the answer are encoded using BiLSTMs.

The interactions between the different modalities are captured by the multimodal fusion and alignment component. We use co-attention, as described in Section 3.2, to combine the different modalities. Next, we use a BiLSTM to combine the encoded inputs and the outputs of the modality fusion to obtain the multimodal context representation.
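As a rough sketch of how the fusion input could be assembled, the snippet below co-attends every pair of encoded modalities and concatenates the results with the encoded inputs. Which modality pairs are co-attended, the hidden sizes, the random (untrained) projections, and the helper name `co_attend` are assumptions for illustration; the final BiLSTM that produces the multimodal context is omitted.

```python
from itertools import combinations

import numpy as np

def _softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attend(A, B, rng, k=4):
    """Compact co-attention with randomly initialised, untrained projections."""
    W_a = rng.standard_normal((A.shape[1], k))
    W_b = rng.standard_normal((B.shape[1], k))
    S = np.maximum(A @ W_a, 0.0) @ np.maximum(B @ W_b, 0.0).T  # (T, T)
    A_att = _softmax(S, axis=1) @ B    # (T, d_b)
    B_att = _softmax(S, axis=0).T @ A  # (T, d_a)
    return np.concatenate([A_att, B_att], axis=1)

rng = np.random.default_rng(0)
T = 6  # hypothetical number of aligned time steps
# Stand-ins for the BiLSTM-encoded modalities; the sizes are hypothetical.
encoded = {
    "text": rng.standard_normal((T, 10)),
    "audio": rng.standard_normal((T, 4)),
    "video": rng.standard_normal((T, 8)),
}

# Assumption: every pair of modalities is co-attended.
pair_outputs = [
    co_attend(encoded[m1], encoded[m2], rng)
    for m1, m2 in combinations(encoded, 2)
]

# Encoded inputs plus fusion outputs, ready for a sequence model such as a BiLSTM.
fusion_input = np.concatenate(list(encoded.values()) + pair_outputs, axis=1)
```

Keeping the encoded inputs alongside the co-attended outputs mirrors the text's statement that the BiLSTM combines both.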

3.4 Multimodal Context-Query Alignment

The alignment between a context and a question is an important step in locating the answer to the question Wang et al. (2018b). We use the notion of co-attention to align the multimodal context and the question to obtain their aligned fused representation. Similarly, we align the multimodal context and the answer choice to compute a second aligned representation.


We fuse the four resulting representations using a BiLSTM. In order to make the final prediction, we apply linear self-alignment Wang et al. (2018b), parameterized by a trainable weight, to the fused output and pass the result through a feed-forward neural network.
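The final prediction step, linear self-alignment followed by a feed-forward network, might look like the following sketch. The function names, the single ReLU hidden layer, the sigmoid output, and all sizes are assumptions chosen for illustration; the source only states that a trainable weight drives the self-alignment and that its output is passed to a feed-forward network.

```python
import numpy as np

def linear_self_align(H, w):
    """Collapse a sequence H (T, d) into one vector using a trainable vector w (d,)."""
    scores = H @ w                          # one scalar score per time step
    gamma = np.exp(scores - scores.max())   # stable softmax over time steps
    gamma = gamma / gamma.sum()
    return gamma @ H, gamma                 # weighted sum (d,), weights (T,)

def feed_forward(h, W1, b1, W2, b2):
    """Assumed one-hidden-layer network returning P(answer is correct)."""
    hidden = np.maximum(W1 @ h + b1, 0.0)
    logit = W2 @ hidden + b2
    return 1.0 / (1.0 + np.exp(-logit))

# Toy example with hypothetical sizes; H stands in for the fused BiLSTM output.
rng = np.random.default_rng(0)
T, d, hidden_dim = 7, 5, 3
H = rng.standard_normal((T, d))
w = rng.standard_normal(d)
W1 = 0.1 * rng.standard_normal((hidden_dim, d))
b1 = np.zeros(hidden_dim)
W2 = 0.1 * rng.standard_normal(hidden_dim)
b2 = 0.0

summary, gamma = linear_self_align(H, w)
prob = feed_forward(summary, W1, b1, W2, b2)
```

The self-alignment turns a variable-length sequence into a fixed-size vector, which is what allows a plain feed-forward layer to score the answer.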

4 Experiments and Results

Models        A2       A4
LMN           61.1%    31.8%
FVTA          60.9%    31.0%
E2EMemNet     62.6%    31.5%
MDAM          60.2%    30.7%
MSM           60.0%    29.9%
TFN           63.2%    29.8%
MFN           62.8%    30.8%
Tensor-MFN    64.8%    34.1%
MCQA          68.8%    38.3%
Table 1: Accuracy Performance: We compare the performance of our method with eight prior methods on the Social-IQ dataset and observe an accuracy improvement.

Models                               A2
MCQA w/o fusion and alignment        66.9%
MCQA w/o context-query alignment     67.4%
MCQA                                 68.8%
Table 2: Ablation Experiments: We analyze the contribution of each of the proposed components by performing ablation experiments and report accuracy numbers after removing individual components.

4.1 Training Details

We train MCQA with a batch size of 32 for 100 epochs. We use the Adam optimizer Kingma and Ba (2014) with a learning rate of 0.001. All our results were generated on an NVIDIA GeForce GTX 1080 Ti GPU. We performed a grid search over the hyperparameter space of the number of epochs, the learning rate, and the dimensions of the BiLSTM.
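The reported settings and the grid search can be summarized in a small configuration sketch. The batch size, epoch count, optimizer, and learning rate come from the text; the `TrainConfig` class, the single `lstm_dim` field, and every candidate value in `GRID` are hypothetical, since the searched values are not given.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 32          # reported in the text
    epochs: int = 100             # reported in the text
    learning_rate: float = 1e-3   # reported in the text
    optimizer: str = "adam"       # reported in the text
    lstm_dim: int = 100           # hypothetical: BiLSTM sizes are not stated

REPORTED = TrainConfig()

# Hypothetical search space over the three hyperparameters named in the text.
GRID = {
    "epochs": [50, 100],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "lstm_dim": [50, 100, 200],
}

def grid_configs(grid):
    """Yield one TrainConfig per combination of grid values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield TrainConfig(**dict(zip(keys, values)))

configs = list(grid_configs(GRID))
```

Each configuration would be trained and scored on a validation split, and the best one kept; that training loop is outside the scope of this sketch.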

4.2 Evaluation Methods

We analyze the performance of MCQA by comparing it against Tensor-MFN Zadeh et al. (2019), the previous state-of-the-art system on the Social-IQ dataset. We also compare our system with the End2End Multimodal Memory Network (E2EMemNet) Sukhbaatar et al. (2015) and Multimodal Dual Attention Memory (MDAM) Kim et al. (2018), which uses self-attention based on visual frames and cross-attention based on the question. These systems showed good performance on the MovieQA dataset. We further compare our system with the Layered Memory Network (LMN) Wang et al. (2018a), the winner of ICCV 2017, which uses a Static Word Memory Module and a Dynamic Subtitle Memory Module, and with Focal Visual-Text Attention (FVTA) Liang et al. (2018). Both approaches perform strongly on the MovieQA dataset. We also compare our approach with Multi-stream Memory (MSM) Lei et al. (2018), a top-performing baseline for TVQA that encodes all the modalities using recurrent networks. Finally, we compare with Tensor Fusion Network (TFN) Zadeh et al. (2017) and Memory Fusion Network (MFN) Zadeh et al. (2018).

Figure 2: Qualitative Results: Example snippets from the Social-IQ dataset.

4.3 Analysis and Discussion

Comparison with SOTA Methods: Table 1 shows the accuracy of MCQA on the A2 and A4 tasks. The A2 task is to select the correct answer from two given answer choices. The A4 task is to select the correct answer from four given answer choices. We observe that MCQA is about 4% better than the Tensor-MFN network on both tasks (4.0 and 4.2 percentage points on A2 and A4, respectively).

Qualitative Results: In Fig 2, we show one frame from each of two videos from the Social-IQ dataset for which our model answers correctly. The choice highlighted in green is the correct answer to the question asked, whereas the red choices indicate the incorrect answers.
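The A2 and A4 accuracies described above can be computed as in the following sketch. It assumes that the model produces one correctness score per answer choice and that the highest-scoring choice is taken as the prediction; the function name and the toy scores are made up for illustration.

```python
def choice_accuracy(score_lists, correct_indices):
    """Fraction of questions whose top-scoring answer choice is the correct one."""
    hits = sum(
        max(range(len(scores)), key=scores.__getitem__) == gold
        for scores, gold in zip(score_lists, correct_indices)
    )
    return hits / len(correct_indices)

# A2: two answer choices per question (toy scores, not real model output).
a2_scores = [[0.9, 0.2], [0.4, 0.6], [0.7, 0.3]]
a2_gold = [0, 1, 1]
a2_acc = choice_accuracy(a2_scores, a2_gold)

# A4: four answer choices per question.
a4_scores = [[0.1, 0.5, 0.3, 0.1], [0.6, 0.2, 0.1, 0.1]]
a4_gold = [1, 2]
a4_acc = choice_accuracy(a4_scores, a4_gold)
```

Under this reading, random guessing would score 50% on A2 and 25% on A4, which gives a reference point for the numbers in Table 1.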

4.4 Ablation Experiments

MCQA without Multimodal Fusion and Alignment: We removed the component of the network that is used to fuse and align the input modalities (text, video, and audio) using co-attention. We observe a drop in performance to 66.9%, as shown in Row 1 of Table 2. We believe this reduction is due to the fact that the network cannot exploit the three modalities to their full potential without the alignment. Our approach builds on the Social-IQ paper, which established the baseline for multimodal question answering. The Social-IQ authors observed that the combination of text, audio, and video modalities produced the best results, and we use this finding to incorporate all three modalities in all our experiments.

MCQA without Multimodal Context-Query Alignment: As shown in Row 2 of Table 2, when we run experiments without the context-query alignment component, the performance of our approach reduces to 67.4%. This can be attributed to the fact that this component is responsible for the soft-alignment of the multimodal context and the query; without it, the network struggles to locate the relevant answer to the question in the context.
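The margins implied by Tables 1 and 2 can be checked with a few lines of arithmetic. The accuracy values are copied from the tables; the variable names are illustrative.

```python
# (A2, A4) accuracies in percent, copied from Table 1.
TABLE1 = {
    "LMN": (61.1, 31.8),
    "FVTA": (60.9, 31.0),
    "E2EMemNet": (62.6, 31.5),
    "MDAM": (60.2, 30.7),
    "MSM": (60.0, 29.9),
    "TFN": (63.2, 29.8),
    "MFN": (62.8, 30.8),
    "Tensor-MFN": (64.8, 34.1),
    "MCQA": (68.8, 38.3),
}

# A2 accuracies in percent, copied from Table 2.
ABLATION_A2 = {
    "w/o fusion and alignment": 66.9,
    "w/o context-query alignment": 67.4,
    "MCQA": 68.8,
}

mcqa_a2, mcqa_a4 = TABLE1["MCQA"]

# Gain of MCQA over each baseline, in percentage points.
gains = {
    name: (round(mcqa_a2 - a2, 1), round(mcqa_a4 - a4, 1))
    for name, (a2, a4) in TABLE1.items()
    if name != "MCQA"
}

# Drop in A2 accuracy when a component is removed, in percentage points.
drops = {
    name: round(ABLATION_A2["MCQA"] - acc, 1)
    for name, acc in ABLATION_A2.items()
    if name != "MCQA"
}

for name, (gain_a2, gain_a4) in gains.items():
    print(f"{name:<12} A2 +{gain_a2:.1f}  A4 +{gain_a4:.1f}")
for name, drop in drops.items():
    print(f"{name:<30} A2 -{drop:.1f}")
```

The gain over Tensor-MFN is 4.0 points on A2 and 4.2 points on A4, and removing the fusion and alignment component costs more (1.9 points) than removing the context-query alignment (1.4 points).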

4.5 Emotion and Sentiment Understanding

We observed that Social-IQ has many questions and answers that contain emotion or sentiment information. For instance, in Figure 2, the answer choices contain sentiment-oriented phrases like happy, despise, not happy, and not comfortable. To study this, we divided the questions and answers into two disjoint sets: one set contained all the questions and answers that have emotion/sentiment-oriented words, and the other contained those that do not. We observe that both Tensor-MFN and MCQA performed better on the set without emotion/sentiment-oriented words.
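A split of this kind could be produced with a simple lexicon match, as in the sketch below. The lexicon, the tokenization, and the example question-answer pairs are all hypothetical; the text does not say how emotion/sentiment-oriented words were identified.

```python
import re

# Hypothetical lexicon; the first three words echo the examples in the text.
EMOTION_WORDS = {"happy", "despise", "comfortable", "sad", "angry", "afraid"}

def has_emotion_word(text, lexicon=EMOTION_WORDS):
    """True if any lowercase token of the text appears in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in lexicon for token in tokens)

def split_by_emotion(qa_pairs):
    """Partition (question, answer) pairs into two disjoint lists."""
    with_emotion, without_emotion = [], []
    for question, answer in qa_pairs:
        if has_emotion_word(question + " " + answer):
            with_emotion.append((question, answer))
        else:
            without_emotion.append((question, answer))
    return with_emotion, without_emotion

# Made-up examples, not taken from Social-IQ.
qa_pairs = [
    ("How does the man feel about the gift?", "He is not happy with it."),
    ("Why does the woman leave the room?", "She needs to take a phone call."),
]
with_emotion, without_emotion = split_by_emotion(qa_pairs)
```

Accuracy can then be reported separately on each subset to compare how a model handles emotion-laden and neutral queries.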

5 Conclusion, Limitations and Future Work

We present MCQA, a learning-based multimodal question answering algorithm, and evaluate it on Social-IQ, a benchmark dataset. We use co-attention to fuse and align the three modalities, which serve as the context for the query. We also use co-attention to perform context-query alignment, which enables our network to focus on the parts of the video that are relevant for answering the query. Co-attention is a critical component of our model, as reflected by the drop in accuracy in the ablation experiments. While vanilla attention has been used for many NLP tasks, co-attention has not been used for multimodal question answering; our results demonstrate its benefits.

In practice, MCQA has certain limitations: it gets confused and fails to predict the right answer in a number of cases. Social-IQ contains videos of social situations and questions that require causal reasoning, and we would like to explore better methods that can capture this reasoning. Our analysis in Section 4.5 suggests that current question answering models lack an understanding of emotions and sentiments, which can be a challenging problem in a question-answering setup. As part of future work, we would like to explicitly model this information using multimodal emotion recognition techniques Mittal et al. (2019); Kim et al. (2018); Majumder et al. (2018) to improve the performance of question answering systems.


References

  • A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh, and D. Batra (2017) VQA: visual question answering. International Journal of Computer Vision 123 (1), pp. 4–31. Cited by: §1.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) VQA: visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433. Cited by: §2.
  • H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu (2015) Are you talking to a machine? dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pp. 2296–2304. Cited by: §2.
  • Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim (2017) TGIF-QA: toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2758–2766. Cited by: §1.
  • K. Kim, S. Choi, J. Kim, and B. Zhang (2018) Multimodal dual attention memory for video story question answering. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 673–688. Cited by: §1, §4.2, §5.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • J. Lei, L. Yu, M. Bansal, and T. L. Berg (2018) TVQA: localized, compositional video question answering. arXiv preprint arXiv:1809.01696. Cited by: §1, §2, §2, §4.2.
  • J. Liang, L. Jiang, L. Cao, L. Li, and A. G. Hauptmann (2018) Focal visual-text attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6135–6143. Cited by: §4.2.
  • N. Majumder, D. Hazarika, A. Gelbukh, E. Cambria, and S. Poria (2018) Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowledge-Based Systems 161, pp. 124–133. Cited by: §5.
  • T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha (2019) M3ER: multiplicative multimodal emotion recognition using facial, textual, and speech cues. arXiv preprint arXiv:1911.05659. Cited by: §5.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814. Cited by: §3.2.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822. Cited by: §1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §1.
  • M. Ren, R. Kiros, and R. Zemel (2015) Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pp. 2953–2961. Cited by: §2.
  • S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. In Advances in Neural Information Processing Systems, pp. 2440–2448. Cited by: §4.2.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) VideoBERT: a joint model for video and language representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7464–7473. Cited by: §2.
  • M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler (2016) MovieQA: understanding stories in movies through question-answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4631–4640. Cited by: §1, §2.
  • B. Wang, Y. Xu, Y. Han, and R. Hong (2018a) Movie question answering: remembering the textual cues for layered visual contents. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §4.2.
  • W. Wang, M. Yan, and C. Wu (2018b) Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. arXiv preprint arXiv:1811.11934. Cited by: item 2, §2, §3.2, §3.4.
  • W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou (2017) Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 189–198. Cited by: §2.
  • J. Welbl, P. Stenetorp, and S. Riedel (2018) Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6, pp. 287–302. Cited by: §1.
  • J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, and T. Mikolov (2015) Towards AI-complete question answering: a set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698. Cited by: §1.
  • W. A. Woods (1978) Semantics and quantification in natural language question answering. In Advances in Computers, Vol. 17, pp. 1–87. Cited by: §1.
  • C. Xiong, V. Zhong, and R. Socher (2016) Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604. Cited by: §2.
  • L. Yu, E. Park, A. C. Berg, and T. L. Berg (2015) Visual Madlibs: fill in the blank description generation and question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2461–2469. Cited by: §1.
  • A. Zadeh, M. Chan, P. P. Liang, E. Tong, and L. Morency (2019) Social-IQ: a question answering benchmark for artificial social intelligence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8807–8817. Cited by: §1, §1, §4.2.
  • A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency (2017) Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250. Cited by: §2, §4.2.
  • A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L. Morency (2018) Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2, §4.2.
  • Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei (2016) Visual7W: grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4995–5004. Cited by: §2.