Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers

09/28/2018
by Yutong Ban, et al.

In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information. We exploit the complementary nature of these two modalities in order to accurately estimate smooth trajectories of the tracked persons, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status (speaking or silent) of each tracked person over time. We cast the problem as a generative audio-visual fusion (or association) model formulated as a latent-variable temporal graphical model. This can be viewed as the problem of maximizing the posterior joint distribution of a set of continuous and discrete latent variables given the past and current observations, which is intractable. We therefore propose a variational inference model that approximates the joint distribution with a factorized distribution. The solution takes the form of a closed-form expectation-maximization procedure. We describe the inference algorithm in detail, evaluate its performance, and compare it with several baseline methods. These experiments show that the proposed audio-visual tracker performs well in informal meetings involving a time-varying number of people.
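
As a reading aid, the factorized approximation mentioned in the abstract can be sketched in its generic mean-field form. The notation below is an assumption introduced here for illustration and is not taken from the paper: $\mathbf{s}_t$ stands for the continuous latent variables, $\mathbf{z}_t$ for the discrete latent variables, and $\mathbf{o}_{1:t}$ for the observations up to time $t$. The paper's actual factorization may be finer than this two-factor split.

```latex
% Generic mean-field sketch; the symbols are illustrative, not the paper's.
\begin{align}
  p(\mathbf{s}_t, \mathbf{z}_t \mid \mathbf{o}_{1:t})
    &\approx q(\mathbf{s}_t)\, q(\mathbf{z}_t), \\
  \ln q^{\star}(\mathbf{s}_t)
    &= \mathbb{E}_{q(\mathbf{z}_t)}\!\left[ \ln p(\mathbf{s}_t, \mathbf{z}_t \mid \mathbf{o}_{1:t}) \right] + \text{const}, \\
  \ln q^{\star}(\mathbf{z}_t)
    &= \mathbb{E}_{q(\mathbf{s}_t)}\!\left[ \ln p(\mathbf{s}_t, \mathbf{z}_t \mid \mathbf{o}_{1:t}) \right] + \text{const}.
\end{align}
```

Alternating the last two updates until convergence gives the expectation step of a variational EM procedure. Each factor is updated while the other is held fixed, which is what makes closed-form updates possible when the model is conjugate.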

