Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference

06/27/2021
by Riko Suzuki, et al.

This paper introduces a new video-and-language dataset with human actions for multimodal logical inference, focusing on intentional and aspectual expressions that describe dynamic human actions. The dataset consists of 200 videos, 5,554 action labels, and 1,942 action triplets of the form <subject, predicate, object> that can be translated into logical semantic representations. The dataset is expected to be useful for evaluating systems that perform multimodal inference between videos and semantically complex sentences, including those involving negation and quantification.
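To make the triplet-to-logic idea concrete, here is a minimal Python sketch of how a <subject, predicate, object> action triplet could be rendered as an event-semantics (neo-Davidsonian) style formula. The class and function names (ActionTriplet, to_formula) and the exact formula shape are assumptions for illustration only; the paper's actual semantic representation may differ.

```python
from dataclasses import dataclass

# Hypothetical sketch, not the paper's actual format or API:
# one <subject, predicate, object> action triplet and a translation
# into a neo-Davidsonian first-order formula with an event variable.

@dataclass
class ActionTriplet:
    subject: str    # e.g. "person"
    predicate: str  # e.g. "open"
    obj: str        # e.g. "door"

def to_formula(t: ActionTriplet) -> str:
    """Render the triplet as a first-order formula over an event variable e."""
    return (
        f"exists e x y. ({t.predicate.capitalize()}(e) "
        f"& {t.subject.capitalize()}(x) & {t.obj.capitalize()}(y) "
        f"& Subj(e, x) & Obj(e, y))"
    )

if __name__ == "__main__":
    triplet = ActionTriplet("person", "open", "door")
    print(to_formula(triplet))
    # exists e x y. (Open(e) & Person(x) & Door(y) & Subj(e, x) & Obj(e, y))
```

Representations of this kind make it straightforward to compose the triplet with negation or quantification at the sentence level, which is the sort of inference the dataset is intended to evaluate.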


Related research

- Multimodal Logical Inference System for Visual-Textual Entailment (06/10/2019): A large amount of research about multimodal inference across text and vi...
- UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild (12/03/2012): We introduce UCF101 which is currently the largest dataset of human acti...
- Video Caption Dataset for Describing Human Actions in Japanese (03/10/2020): In recent years, automatic video caption generation has attracted consid...
- Identifying Visible Actions in Lifestyle Vlogs (06/10/2019): We consider the task of identifying human actions visible in online vide...
- EEV Dataset: Predicting Expressions Evoked by Diverse Videos (01/15/2020): When we watch videos, the visual and auditory information we experience ...
- Muscles in Action (12/05/2022): Small differences in a person's motion can engage drastically different ...
- Learning and Predicting Multimodal Vehicle Action Distributions in a Unified Probabilistic Model Without Labels (12/14/2022): We present a unified probabilistic model that learns a representative se...
