A Joint Sequence Fusion Model for Video Question Answering and Retrieval

08/07/2018
by   Youngjae Yu, et al.
0

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pairs of multimodal sequence data (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of two sequence data into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence modalities. Both modules leverage hierarchical attention mechanisms that learn to promote well-matched representation patterns while prune out misaligned ones in a bottom-up manner. Although the JSFusion is a universal model to be applicable to any multimodal sequence data, this work focuses on video-language tasks including multimodal retrieval and video QA. We evaluate the JSFusion model in three retrieval and VQA tasks in LSMDC, for which our model achieves the best performance reported so far. We also perform multiple-choice and movie retrieval tasks for the MSR-VTT dataset, on which our approach outperforms many state-of-the-art methods.

READ FULL TEXT

page 5

page 13

research
09/21/2018

Multimodal Dual Attention Memory for Video Story Question Answering

We propose a video story question-answering (QA) architecture, Multimoda...
research
10/08/2018

Dense Multimodal Fusion for Hierarchically Joint Representation

Multiple modalities can provide more valuable information than single on...
research
10/15/2020

MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

This paper presents MAST, a new model for Multimodal Abstractive Text Su...
research
03/29/2023

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

The development of language models have moved from encoder-decoder to de...
research
07/20/2020

Multimodal Dialogue State Tracking By QA Approach with Data Augmentation

Recently, a more challenging state tracking task, Audio-Video Scene-Awar...
research
06/06/2023

Diversifying Joint Vision-Language Tokenization Learning

Building joint representations across images and text is an essential st...
research
10/10/2016

End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering

We propose a high-level concept word detector that can be integrated wit...

Please sign up or login with your details

Forgot password? Click here to reset