Multi-modal Capsule Routing for Actor and Action Video Segmentation Conditioned on Natural Language Queries

12/02/2018
by Bruce McIntosh, et al.

In this paper, we propose an end-to-end capsule network for pixel-level localization of actors and actions in a video. Localization is conditioned on a natural language query that specifies an actor and an action. We encode both the video and the textual input as capsules, which provide a more effective representation than standard convolution-based features. We introduce a novel capsule-based attention mechanism that fuses video and text capsules for text-selected video segmentation; the attention is performed via joint EM routing over the video and text capsules. Existing works on actor-action localization mainly focus on a single frame rather than the full video. In contrast, we perform localization on all frames of the video. To validate the proposed network in this setting, we extend an existing actor-action dataset (A2D) with annotations for all frames. The experimental evaluation demonstrates the effectiveness of the proposed capsule network for text-selective actor and action localization in videos, and the network also improves upon existing state-of-the-art works on single-frame localization.


research
03/20/2018

Actor and Action Video Segmentation from a Sentence

This paper strives for pixel-level segmentation of actors and their acti...
research
05/21/2018

VideoCapsuleNet: A Simplified Network for Action Detection

The recent advances in Deep Convolutional Neural Networks (DCNNs) have s...
research
04/05/2018

Guess Where? Actor-Supervision for Spatiotemporal Action Localization

This paper addresses the problem of spatiotemporal localization of actio...
research
11/02/2020

Actor and Action Modular Network for Text-based Video Segmentation

The actor and action semantic segmentation is a challenging problem that...
research
11/22/2020

We don't Need Thousand Proposals: Single Shot Actor-Action Detection in Videos

We propose SSA2D, a simple yet effective end-to-end deep network for act...
research
03/15/2021

Siamese Network Features for Endoscopy Image and Video Localization

Conventional Endoscopy (CE) and Wireless Capsule Endoscopy (WCE) are kno...
research
07/23/2018

Actor-Action Semantic Segmentation with Region Masks

In this paper, we study the actor-action semantic segmentation problem, ...
