Object Segmentation with Audio Context

01/04/2023
by Kaihui Zheng, et al.

Visual objects often have acoustic signatures that are naturally synchronized with them in audio-bearing video recordings. In this project, we explore multimodal feature aggregation for the video instance segmentation task: we integrate audio features into a video segmentation model to form an audio-visual learning scheme. Our method builds on an existing video instance segmentation method that leverages rich contextual information across video frames. Since this is the first attempt to investigate audio-visual instance segmentation, we collect a novel dataset covering 20 vocal classes with synchronized video and audio recordings. By using a combined decoder to fuse video and audio features, our model shows a slight improvement over the base model. Additionally, we demonstrate the effectiveness of the individual modules through extensive ablations.

