Cross-Modal and Hierarchical Modeling of Video and Text

10/16/2018
by   Bowen Zhang, et al.
0

Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, where each depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively conveys a coherent message or story. In this paper, we investigate the modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (HSE), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrated superior performance by the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks. We show its utility in zero-shot action recognition and video captioning.

READ FULL TEXT

page 22

page 23

page 24

page 25

research
08/09/2019

Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

We address the problem of cross-modal fine-grained action retrieval betw...
research
03/15/2023

MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

Large scale Vision-Language (VL) models have shown tremendous success in...
research
05/03/2022

Cross-modal Representation Learning for Zero-shot Action Recognition

We present a cross-modal Transformer-based framework, which jointly enco...
research
01/27/2021

Syntactically Guided Generative Embeddings for Zero-Shot Skeleton Action Recognition

We introduce SynSE, a novel syntactically guided generative approach for...
research
04/12/2018

Cross-Modal Retrieval with Implicit Concept Association

Traditional cross-modal retrieval assumes explicit association of concep...
research
08/08/2022

Boosting Video-Text Retrieval with Explicit High-Level Semantics

Video-text retrieval (VTR) is an attractive yet challenging task for mul...

Please sign up or login with your details

Forgot password? Click here to reset