Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference

06/27/2021
by Riko Suzuki, et al.

This paper introduces a new video-and-language dataset of human actions for multimodal logical inference, focusing on intentional and aspectual expressions that describe dynamic human actions. The dataset consists of 200 videos, 5,554 action labels, and 1,942 action triplets of the form <subject, predicate, object> that can be translated into logical semantic representations. The dataset is expected to be useful for evaluating systems that perform multimodal inference between videos and semantically complex sentences, including those involving negation and quantification.
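The paper does not give an implementation, but a minimal sketch of how a <subject, predicate, object> triplet might be rendered as a logical semantic representation (here, a neo-Davidsonian event-semantics formula) could look like the following. The `ActionTriplet` class and `to_logical_form` function are hypothetical names introduced for illustration and are not part of the released dataset; the paper's actual representation may differ.

```python
from dataclasses import dataclass


@dataclass
class ActionTriplet:
    """Hypothetical container for one <subject, predicate, object> label."""
    subject: str
    predicate: str
    obj: str  # 'object' shadows a Python builtin, so 'obj' is used here


def to_logical_form(t: ActionTriplet) -> str:
    """Render a triplet as an event-semantics formula (illustrative sketch only)."""
    return (f"exists e. {t.predicate}(e) "
            f"& Subj(e, {t.subject}) & Obj(e, {t.obj})")


# Example: a triplet such as <person, open, book> would be rendered as
#   exists e. open(e) & Subj(e, person) & Obj(e, book)
print(to_logical_form(ActionTriplet("person", "open", "book")))
```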

