Implicit and Explicit Commonsense for Multi-sentence Video Captioning

03/14/2023
by Shih-Han Chou, et al.

Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack the commonsense knowledge of the world required to reason about the progression of events, causality, and even the function of certain objects within a scene. To address this limitation, we propose a novel Transformer-based video captioning model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge. We show that these forms of knowledge, in isolation and in combination, enhance the quality of the produced captions. Further, inspired by imitation learning, we propose a new task of instruction generation, where the goal is to produce a set of linguistic instructions from a video demonstration of its performance. We formalize the task using the ALFRED dataset [52], generated using the AI2-THOR environment. While instruction generation is conceptually similar to paragraph captioning, it differs in that it exhibits stronger object persistence, as well as spatially aware and causal sentence structure. We show that our commonsense-knowledge-enhanced approach produces significant improvements on this task (up to 57%), as well as the state-of-the-art result on more traditional video captioning on the ActivityNet Captions dataset [29].

Related research

03/11/2020  Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
Captioning is a crucial and challenging task for video understanding. In...

06/22/2018  RUC+CMU: System Report for Dense Captioning Events in Videos
This notebook paper presents our system in the ActivityNet Dense Caption...

11/17/2022  Visual Commonsense-aware Representation Network for Video Captioning
Generating consecutive descriptions for videos, i.e., Video Captioning, ...

08/05/2021  Hybrid Reasoning Network for Video-based Commonsense Captioning
The task of video-based commonsense captioning aims to generate event-wi...

02/23/2022  Commonsense Reasoning for Identifying and Understanding the Implicit Need of Help and Synthesizing Assistive Actions
Human-Robot Interaction (HRI) is an emerging subfield of service robotic...

04/13/2023  A-CAP: Anticipation Captioning with Commonsense Knowledge
Humans possess the capacity to reason about the future based on a sparse...

10/10/2022  Generating image captions with external encyclopedic knowledge
Accurately reporting what objects are depicted in an image is largely a ...
