Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

02/04/2023
by Min Peng, et al.

This paper presents a new method for end-to-end Video Question Answering (VideoQA), in contrast to the current trend of large-scale pre-training with heavyweight feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which incorporates only a learnable word embedding layer and a few convolutional and transformer layers. We use an anisotropic pyramid to perform video-language interactions across different spatio-temporal scales. Beyond the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, we propose novel strategies that decompose the visual feature stream into spatial and temporal sub-streams at different scales and model their interactions with the linguistic semantics while preserving the integrity of local and global semantics. Against state-of-the-art methods on five VideoQA benchmarks, we demonstrate better or on-par performance with high computational efficiency. Our ablation study confirms the effectiveness of the pyramid and shows the scalability of our model: it achieves competitive results for text-to-video retrieval when leveraging feature extractors with reusable pre-trained weights.
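The abstract only outlines the pyramid design, so the following is a minimal PyTorch sketch of the general idea: video features are pooled to progressively coarser temporal scales (bottom-up), each scale cross-attends to the question embedding, and the scales are fused back through upsampling with lateral connections (top-down). The class names, dimensions, and the pooling/attention choices here are illustrative assumptions, not the authors' implementation; in particular, the paper's decomposition into spatial and temporal sub-streams is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidLevel(nn.Module):
    """One pyramid level (hypothetical): halves the temporal resolution of
    the clip features, then lets them cross-attend to the question tokens."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, T, D) frame/clip features; text: (B, L, D) word embeddings
        video = self.pool(video.transpose(1, 2)).transpose(1, 2)  # (B, T//2, D)
        attended, _ = self.cross_attn(self.norm1(video), text, text)
        video = video + attended                  # residual cross-modal fusion
        return video + self.ffn(self.norm2(video))

class BottomUpTopDownPyramid(nn.Module):
    """Bottom-up pathway builds coarser temporal scales; the top-down
    pathway upsamples and fuses them back via lateral connections."""
    def __init__(self, dim: int, num_levels: int = 3):
        super().__init__()
        self.levels = nn.ModuleList(PyramidLevel(dim) for _ in range(num_levels))
        self.laterals = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_levels - 1)
        )

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Bottom-up: question-conditioned video features at each scale.
        feats = []
        for level in self.levels:
            video = level(video, text)
            feats.append(video)
        # Top-down: start from the coarsest scale, upsample, add laterals.
        top = feats[-1]
        for feat, lateral in zip(reversed(feats[:-1]), reversed(self.laterals)):
            top = F.interpolate(
                top.transpose(1, 2), size=feat.size(1), mode="linear"
            ).transpose(1, 2)
            top = top + lateral(feat)
        return top  # multi-scale, language-conditioned video representation

# Shape check: 32 frames, 12 question tokens, 256-d features (all assumed).
pmt = BottomUpTopDownPyramid(dim=256)
out = pmt(torch.randn(2, 32, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```

Note the fused output keeps the resolution of the finest pyramid level, so local detail survives the coarser scales; this mirrors, in spirit, the abstract's claim of preserving local and global semantics across scales.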

Related research

Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering (09/10/2021)
Video question answering (VideoQA) is challenging given its multimodal c...

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending (05/22/2023)
Large-scale image-text contrastive pre-training models, such as CLIP, ha...

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training (05/01/2020)
We present HERO, a Hierarchical EncodeR for Omni-representation learning...

TxT: Crossmodal End-to-End Learning with Transformers (09/09/2021)
Reasoning over multiple modalities, e.g. in Visual Question Answering (V...

End-to-end Video-level Representation Learning for Action Recognition (11/11/2017)
From the frame/clip-level feature learning to the video-level representa...

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling (09/04/2022)
Masked visual modeling (MVM) has been recently proven effective for visu...
