Repetitive Activity Counting by Sight and Sound

03/24/2021
by   Yunhua Zhang, et al.

This paper addresses repetitive activity counting in videos. Unlike existing works, which analyze only the visual content, we incorporate the corresponding sound into the repetition counting process for the first time. This benefits accuracy under challenging vision conditions such as occlusion, dramatic camera view changes, and low resolution. We propose a model that first analyzes the sight and sound streams separately. An audiovisual temporal stride decision module and a reliability estimation module are then introduced to exploit cross-modal temporal interaction. For learning and evaluation, an existing dataset is repurposed and reorganized to allow for repetition counting with sight and sound. We also introduce a variant of this dataset for repetition counting under challenging vision conditions. Experiments demonstrate the benefit of sound, as well as of the other introduced modules, for repetition counting. Our sight-only model already outperforms the state of the art by itself; when we add sound, results improve notably, especially under harsh vision conditions.
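The abstract does not spell out how the two streams are combined, so the following is only a minimal sketch of one plausible reading of a reliability estimation step: a convex blend of the per-stream counts. The function name `fuse_counts`, the parameter `visual_reliability`, and the blending rule itself are illustrative assumptions, not the authors' published formulation.

```python
def fuse_counts(visual_count: float, audio_count: float,
                visual_reliability: float) -> float:
    """Blend two repetition-count estimates into one.

    Hypothetical helper: `visual_reliability` is an assumed score in [0, 1]
    expressing how much the sight stream is trusted; the remaining weight
    goes to the sound stream. The paper may fuse predictions differently.
    """
    if not 0.0 <= visual_reliability <= 1.0:
        raise ValueError("visual_reliability must lie in [0, 1]")
    # Convex combination: the result always lies between the two estimates.
    return (visual_reliability * visual_count
            + (1.0 - visual_reliability) * audio_count)


if __name__ == "__main__":
    # Occluded video: trust the visual estimate less than the audio one.
    print(fuse_counts(visual_count=10.0, audio_count=6.0,
                      visual_reliability=0.25))
```

Under this assumed rule, a low reliability score (for example under occlusion or low resolution) pulls the final count toward the sound-based estimate, which matches the abstract's claim that sound helps most under harsh vision conditions.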


