Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers

by Yutong Ban, et al.

In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information. We propose to exploit the complementary nature of these two modalities in order to accurately estimate smooth trajectories of the tracked persons, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status, either speaking or silent, of each tracked person over time. We propose to cast the problem at hand into a generative audio-visual fusion (or association) model formulated as a latent-variable temporal graphical model. This may well be viewed as the problem of maximizing the posterior joint distribution of a set of continuous and discrete latent variables given the past and current observations, which is intractable. We propose a variational inference model which amounts to approximating the joint distribution with a factorized distribution. The solution takes the form of a closed-form expectation maximization procedure. We describe the inference algorithm in detail, evaluate its performance, and compare it with several baseline methods. These experiments show that the proposed audio-visual tracker performs well in informal meetings involving a time-varying number of people.



I Introduction

In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information [1, 2, 3, 4, 5, 6, 7]. We propose to exploit the complementary nature of these two modalities in order to accurately estimate the position of each person at each time step, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status, either speaking or silent, of each tracked person. We propose to cast the problem at hand into a generative audio-visual fusion (or association) model formulated as a latent-variable temporal graphical model. We propose a tractable solver via a variational approximation.

We are particularly interested in tracking people involved in informal meetings and social gatherings. In this type of scenario, participants wander around, cross each other, move in and out of the camera field of view, take speech turns, etc. Acoustic room conditions, e.g. reverberation, and overlapping audio sources of various kinds drastically deteriorate or modify the microphone signals. Likewise, occluded persons, lighting conditions and middle-range camera viewing complicate the task of visual processing. It is therefore impossible to gather reliable and continuous flows of visual and audio observations. Hence one must design a fusion and tracking method that is able to deal with intermittent visual and audio data.

We propose a multi-speaker tracking method based on a dynamic Bayesian model that fuses audio and visual information over time from their respective observation spaces. This may well be viewed as a generalization of single-observation and single-target Kalman filtering, which yields an exact recursive solution, to multiple observations and multiple targets, for which the recursive solution becomes intractable. We propose a variational approximation of the posterior distribution over the continuous variables (positions and velocities of tracked persons) and discrete variables (observation-to-person associations) at each time step, given all the past and present audio and visual observations. The approximation of this joint distribution with a factorized distribution makes the tracking problem tractable: the solution takes the form of a closed-form expectation maximization (EM) procedure.

In general, multiple object tracking consists of the temporal estimation of the kinematic state of each object, i.e. position and velocity. In computer vision, local descriptors are used to better discriminate between objects, e.g. person detectors/descriptors based on hand-crafted features or on deep neural networks [9]. If the tracked objects emit sounds, their states can be inferred as well using sound-source localization techniques combined with tracking. These techniques are often based on the estimation of the sound's direction of arrival (DOA) using a microphone array, e.g. [10]. DOA estimation can be carried out either in the temporal domain [11] or in the spectral (Fourier) domain [12]. Spectral-domain DOA estimation methods are, however, more robust than temporal-domain methods, in particular in the presence of background noise and reverberation [13, 14].

Via proper camera-microphone calibration, audio and visual observations can be aligned such that a DOA corresponds to a 2D location in the image plane. In this paper we adopt the audio-visual alignment method of [15], which learns a mapping from a vector space spanned by multichannel spectral features (or audio features, in short) to the image plane, as well as the inverse of this mapping. This allows us to exploit the richness of representing acoustic signals in the short-time Fourier domain [16] and to extract noise- and reverberation-free audio features [13].

We propose to represent the audio-visual fusion problem via two sets of independent variables, i.e. visual-feature-to-person and audio-feature-to-person sets of assignment variables. An interesting characteristic of this formulation is that the proposed tracking algorithm can use visual features, audio features, or a combination of both, and can make this choice independently for every target and at every time step. Indeed, audio and visual information are rarely available simultaneously and continuously. Visual information suffers from limited camera field-of-view, occlusions, false positives, missed detections, etc. Audio information is often corrupted by room acoustics, environmental noise and overlapping acoustic signals. In particular, speech signals are sparse, non-stationary and emitted intermittently, with silence intervals between speech utterances. Hence a robust audio-visual tracker must explicitly take into account the temporal sparsity of the two modalities, and this is exactly what is proposed in this paper.

We use the AVDIAR dataset [17] to evaluate the performance of the proposed audio-visual tracker. We use the MOT (multiple object tracking) metrics to quantitatively assess method performance. In particular the tracking accuracy (MOTA), which combines false positives, false negatives and identity switches and compares them with the ground-truth trajectories, is a commonly used score to assess the quality of a multiple person tracker. We use the MOT metrics to compare our method with two recently proposed audio-visual tracking methods [4, 7] and with a visual tracker [8]. An interesting outcome of the proposed method is that speaker diarization, i.e. who speaks when, can be coarsely inferred from the tracking output, thanks to the audio-feature-to-person assignment variables. The speaker diarization results obtained with our method are compared with two other methods [18, 17] based on the diarization error rate (DER) score.
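As a concrete reading of the MOTA description above, the following minimal sketch uses the usual CLEAR-MOT definition, in which the three error counts are summed over all frames and divided by the total number of ground-truth objects. The function name and the counts are hypothetical and are not taken from the experiments reported in this paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    """CLEAR-MOT multiple object tracking accuracy.

    All arguments are totals accumulated over every frame of a sequence;
    num_gt_objects is the total number of ground-truth boxes.
    """
    if num_gt_objects <= 0:
        raise ValueError("num_gt_objects must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_objects


# Hypothetical counts, chosen only to illustrate the arithmetic.
score = mota(false_negatives=120, false_positives=60, id_switches=20,
             num_gt_objects=1000)
print(f"MOTA = {score:.2f}")
```

Note that the score can become negative when the number of errors exceeds the number of ground-truth objects.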

The remainder of the paper is organized as follows. Section II describes the related work. Section III describes in detail the proposed formulation. Section IV describes the proposed variational approximation and Section V details the variational expectation-maximization procedure. The algorithm implementation is described in Section VI. Tracking results and comparisons with other methods are reported in Section VII. Finally, Section VIII draws a few conclusions. Supplemental materials are available on our website.

II Related Work

In computer vision, there is a long history of multiple object tracking methods. While these methods provide interesting insights concerning the problem at hand, a detailed account of existing visual trackers is beyond the scope of this paper. Several audio-visual tracking methods were proposed in the recent past, e.g. [19, 1, 2, 3]. These papers proposed to use approximate inference of the filtering distribution using Markov chain Monte Carlo particle filter sampling (MCMC-PF). These methods cannot provide estimates of the accuracy and merit of each modality with respect to each tracked person. Sampling and distribution estimation are performed in parameter space but no statistics are gathered in the observation spaces.

More recently, audio-visual trackers based on particle filtering (PF) and probability hypothesis density (PHD) filters were proposed, e.g. [4, 5, 6, 7, 20, 21, 22]. In [6], DOAs of audio sources were used to guide the propagation of particles, and the filter was combined with a mean-shift algorithm to reduce the computational complexity. Some PHD filter variants were proposed to improve tracking performance [20, 21]. The method of [4] also used DOAs of active audio sources to give more importance to particles located around DOAs. Along the same line of thought, [7] proposed a mean-shift sequential Monte Carlo PHD (SMC-PHD) algorithm that used audio information to improve the performance of a visual tracker. This implies that the persons being tracked must emit acoustic signals continuously and that multiple-source audio localization is reliable enough for proper audio-visual alignment.

PF- and PHD-based tracking methods are computationally efficient but their inherent limitation is that they are unable to associate observations to tracks. Hence they require an external post-processing mechanism that provides associations. Also, in the case of PF-based filtering, the number of tracked persons must be set in advance. Moreover, both PF- and PHD-based trackers provide non-smooth trajectories since the state dynamics are not explicitly enforced. In contrast, the proposed variational formulation embeds association variables within the model, uses a birth process to estimate the initial number of persons and to add new ones over time, and relies on an explicit dynamic model that yields smooth trajectories.

Another limitation of the methods proposed in [1, 3, 6, 20, 21, 22] is that they need as input a continuous flow of audio and visual observations. To some extent, this is also the case with [4, 7], where only the audio observations are supposed to be continuous. All these methods showed good performance in the case of the AV16.3 dataset [23] in which the participants spoke simultaneously and continuously, which is somewhat artificial. The AV16.3 dataset was recorded in a specially equipped meeting room using a large number of cameras to guarantee that frontal views of the participants were always available. This contrasts with the AVDIAR dataset which was recorded with one sensor unit composed of two cameras and six microphones. The AVDIAR scenarios are composed of participants that take speech turns while they look at each other, hence they speak intermittently and they do not always face the cameras.

Recently, we proposed an audio-visual clustering method [24] and an audio-visual speaker diarization method [17]. The weighted-data clustering method of [24] analyzed a short time window composed of several audio and visual frames and hence it was assumed that the speakers were static within such temporal windows. Binaural audio features were mapped onto the image plane and were clustered with nearby visual features. There was no dynamic model that allowed speakers to be tracked. The audio-visual diarization method [17] used an external multi-object visual tracker that provided trajectories for each tracked person. The audio-feature-space to image-plane mapping [15] was used to assign audio information to each tracked person at each time step. Diarization itself was modeled with a binary state variable (speaking or silent) associated with each person. The diarization transition probabilities (state dynamics) were hand-crafted, with the assumption that the speaking status of a person was independent of all the other persons. Because of the small number of state configurations, i.e. 2^N (where N is the maximum number of tracked persons), the MAP solution could be found by exhaustively searching the state space. In Section VII-H we use the AVDIAR recordings to compare our diarization results with the results obtained with [17].

The variational inference method proposed in this paper may well be viewed as a multimodal generalization of [8]. We show that the model of [8] can be extended to deal with observations living in completely different mathematical spaces. Indeed, we show that two (or several) different data-processing pipelines can be embedded and treated on an equal footing in the proposed formulation. Special attention is given to audio-visual alignment and to audio-to-person assignments: (i) we learn a mapping from the space of audio features to the image plane, as well as the inverse of this mapping, which are integrated in the proposed generative approach, and (ii) we show that the additional assignment variables due to the audio modality do not affect the complexity of the algorithm. Absence of observed data of any kind or erroneous data are carefully modeled: this enables the algorithm to deal with intermittent observations, whether audio, visual, or both. This is probably one of the most prominent features of the method, in contrast with most existing audio-visual tracking methods which require continuous and simultaneous flows of visual and audio data.

This paper is an extended version of [25] and of [26]. The probabilistic model and its variational approximation were briefly presented in [25] together with preliminary results obtained with three AVDIAR sequences. Reverberation-free audio features were used in [26] where it was shown that good performance could be obtained with these features when the audio mapping was trained in one room and tested in another room. With respect to these two papers, we provide detailed descriptions of the proposed formulation, of the variational expectation maximization solver and of the implemented algorithm. We explain in detail the birth process, which is crucial for track initialization and for detecting potentially new tracks at each time step. We experiment with the entire AVDIAR dataset and we benchmark our method against the state-of-the-art multiple-speaker audio-visual tracking methods [4, 7] and against [8]. Moreover, we show that our tracker can be used for speaker diarization.

III Proposed Model

III-A Mathematical Definitions and Notations

Unless otherwise specified, uppercase letters denote random variables while lowercase letters denote their realizations, e.g. p(X = x), where p(·) denotes either a probability density function (pdf) or a probability mass function (pmf). For the sake of conciseness we generally write p(x). Vectors are written in slanted bold, whereas matrices are written in bold. Video and audio data are assumed to be synchronized, and let t denote the common frame index. Let N be the upper bound of the number of persons that can simultaneously be tracked at any time t, and let n ∈ {1, …, N} be the person index. Let n = 0 denote nobody. A subscript t denotes variable concatenation at time t, and the subscript 1:t denotes concatenation from 1 to t.

Let , and be three latent variables that correspond to the 2D position, 2D velocity and 2D size (width and height) of person at . Typically, and correspond to the center and size of a bounding box of a person while is the velocity of . Let be the complete set of continuous latent variables at , where denotes the transpose operator. Without loss of generality, in this paper a person is characterized with the bounding box of her/his head and the center of this bounding box is assumed to be the location of the corresponding speech source.

We now define the observations. Let and be realizations of the visual and audio random observed variables and , respectively. A visual observation, , corresponds to the bounding box of a detected face and it is the concatenation of the bounding-box center, width and height, , and of a feature vector that describes the photometric content of that bounding box, i.e. a -dimensional face descriptor (Section VII-C). An audio observation, , corresponds to an inter-microphone spectral feature, where is a frequency sub-band index. Let’s assume that there are sub-bands, that sub-bands are active at , i.e. with sufficient energy, and that there are frequencies per sub-band. Hence, corresponds to complex-valued Fourier coefficients which are represented by their real and imaginary parts. In practice, the inter-microphone features contain audio-source localization information and are obtained by applying the multi-channel audio processing method described in detail below (Section VII-B). Note that both the number of visual and of audio observations at , and , vary over time. Let denote the set of observations from 1 to , where .

We now define the assignment variables of the proposed latent variable model. There is an assignment variable (a discrete random variable) associated with each observed variable. Namely, let

and be associated with and with , respectively, e.g. denotes the probability of assigning visual observation at to person . Note that and are the probabilities of assigning visual observation and audio observation to none of the persons, or to nobody. In the visual domain, this may correspond to a false detection while in the audio domain this may correspond to an audio signal that is not uttered by a person. There is an additional assignment variable, that is associated with the audio generative model described in Section III-D. The assignment variables are jointly denoted with .

III-B The Filtering Distribution

Recall that the objective is to estimate the positions and velocities of participants (multiple person tracking) and, possibly, to estimate their speaking status (speaker diarization). The audio-visual multiple-person tracking problem is cast into the problems of estimating the filtering distribution and of inferring the state variable . Subsequently, speaker diarization can be obtained from audio-feature-to-person information via the estimation of the assignment variables (Section VI-C).

We reasonably assume that the state variable follows a first-order Markov model, and that the visual and audio observations only depend on and . By applying Bayes' rule, one can then write the filtering distribution of as:




Eq. (2) is the joint (audio-visual) observed-data likelihood. Visual and audio observations are assumed independent conditionally on , and their distributions will be detailed in Sections III-C and III-D, respectively. We will see that depends on but depends neither on nor on , and depends on and but not on . Eq. (3) is the prior distribution of the assignment variable. The observation-to-person assignments are assumed to be a priori independent so that the probabilities in (3) factorize as:


It makes sense to assume that these distributions do not depend on and that they are uniform. The following notations are introduced: and . The probability is discussed below (Section III-D).

Eq. (4) is the predictive distribution of given the past observations, i.e. from 1 to . The state dynamics in (4) are modeled with a linear-Gaussian first-order Markov process. Moreover, it is assumed that the dynamics are independent over speakers:


where is the dynamics’ covariance matrix and is the state transition matrix, given by:

As described in Section IV below, an important feature of the proposed model is that the predictive distribution (4) at frame is computed from the state dynamics model (8) and an approximation of the filtering distribution at frame , which also factorizes across speakers. As a result, the computation of (4) factorizes across speakers as well.
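The linear-Gaussian dynamics and the resulting per-person prediction can be sketched as follows. This is an illustration under stated assumptions rather than the exact model: the state ordering (x, y, w, h, vx, vy), the constant-velocity form of the transition matrix, and the helper names are all hypothetical. The prediction itself is the standard Gaussian result: if the previous posterior has mean mu and covariance Gamma, the predicted state has mean D mu and covariance D Gamma D^T + Lambda.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def transition_matrix():
    """Constant-velocity transition for the ASSUMED state ordering
    (x, y, w, h, vx, vy): position += velocity, the rest is unchanged."""
    d = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
    d[0][4] = 1.0  # x <- x + vx
    d[1][5] = 1.0  # y <- y + vy
    return d


def predict(mean, cov, dyn_cov):
    """One-step prediction: mean D*mu, covariance D*Gamma*D^T + Lambda."""
    d = transition_matrix()
    new_mean = [row[0] for row in matmul(d, [[m] for m in mean])]
    dgd = matmul(matmul(d, cov), transpose(d))
    new_cov = [[dgd[i][j] + dyn_cov[i][j] for j in range(6)]
               for i in range(6)]
    return new_mean, new_cov


# Hypothetical previous posterior for one person (pixels, pixels per frame).
identity = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
mean_prev = [100.0, 50.0, 40.0, 60.0, 2.0, -1.0]
mean_pred, cov_pred = predict(mean_prev, identity, identity)
print(mean_pred)
```

Because the dynamics are independent across persons, this prediction would be run once per tracked person.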

III-C The Visual Observation Model

As already mentioned above (Section III-A), a visual observation consists of the center, width and height of a bounding box, namely , as well as of a feature vector describing the region inside the bounding box. Since the velocity is not observed, a projection matrix is used to project onto . Assuming that the visual observations available at are independent, and that the appearance of a person is independent of his/her position in the image, the visual likelihood in (2) is defined as:


where the observed bounding-box centers, widths, heights, and feature vectors are drawn from the following distributions:


where is a covariance matrix quantifying the measurement error in the bounding-box center and size, is the uniform distribution with being the support volume of the variable space, is the Bhattacharyya distribution with parameter , and is a set of prototype feature vectors that model the appearances of the persons.
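The geometric part of this visual model can be illustrated with a short sketch. It assumes the state ordering (x, y, w, h, vx, vy), so that the projection simply keeps the first four entries, and it assumes a diagonal measurement covariance to avoid a matrix inverse. Both are simplifications introduced here, all names are hypothetical, and the appearance (descriptor) term is left out.

```python
import math


def project_to_box(state):
    """Drop the unobserved velocity: keep (x, y, w, h) from the ASSUMED
    state ordering (x, y, w, h, vx, vy)."""
    return state[:4]


def log_gaussian_diag(obs, mean, variances):
    """Log-density of a Gaussian with a diagonal covariance (an assumption
    of this sketch)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * var) + (o - m) ** 2 / var)
               for o, m, var in zip(obs, mean, variances))


def box_log_likelihood(box, state, variances, volume):
    """Geometric part of the visual likelihood; state=None means 'nobody'."""
    if state is None:
        return -math.log(volume)  # uniform over the observation space
    return log_gaussian_diag(box, project_to_box(state), variances)


# Hypothetical person state, detection, noise variances and support volume.
state = [100.0, 50.0, 40.0, 60.0, 2.0, -1.0]
box = [101.0, 50.0, 41.0, 59.0]
variances = [4.0, 4.0, 9.0, 9.0]
volume = 640.0 * 480.0 * 200.0 * 200.0
print(box_log_likelihood(box, state, variances, volume))
print(box_log_likelihood(box, None, variances, volume))
```

A detection close to a tracked person scores higher under that person's Gaussian than under the uniform "nobody" density, while a detection far from every person does not.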

III-D The Audio Observation Model

It is well established in the multichannel audio signal processing literature that inter-microphone spectral features encode sound-source localization information [15, 12, 13]. Therefore, observed audio features, are obtained by considering all pairs of a microphone array. Audio observations depend neither on (size of the bounding box) nor on (velocity). Hence one can replace with in the equations below, with . By assuming independence across frequency sub-bands (indexed by ), the audio likelihood in (2) can be factorized as:


While the inter-microphone spectral coefficients contain localization information, in complex acoustic environments there is no explicit function that maps source locations onto inter-microphone spectral features. Moreover, this mapping is non-linear. We therefore resort to learning a regression function that models this relationship. We propose to use the piecewise-linear regression of [27], which belongs to the mixture of experts (MOE) class of models. For that purpose we consider a training set of audio features and their associated source locations, and let . The joint probability of is written as:


Assuming Gaussian variables, we have , , and , where matrix and vector characterize the -th affine transformation that maps the space of source locations onto the space spanned by inter-microphone sub-band spectral features, is the associated covariance matrix, and is drawn from a Gaussian mixture model with components, each component being characterized by , and . The parameter set of this model is:


These parameters can be estimated via a closed-form EM procedure from a training dataset, e.g. (please consult [27, 15] and Section VII-B below for more details). Note that there is a parameter set for each sub-band , , hence there are models that need to be trained in our case. It follows that (12) can be written as:


The right-hand side of (7) can now be written as:
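The piecewise-affine audio model of this section can be illustrated with a small sketch: a Gaussian mixture over source locations selects which affine map is active, and each map sends a 2D location to an audio feature vector. The isotropic mixture covariances, the two-dimensional feature space and all names and values below are assumptions made for illustration; in the paper the parameters are learned per sub-band from training data.

```python
import math


def gaussian_iso(x, mean, var):
    """Density of an isotropic 2D Gaussian (isotropy is an assumption)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return math.exp(-0.5 * sq / var) / (2.0 * math.pi * var)


def component_posteriors(x, weights, means, variances):
    """Posterior probability of each affine component given location x."""
    scores = [w * gaussian_iso(x, m, v)
              for w, m, v in zip(weights, means, variances)]
    total = sum(scores)
    return [s / total for s in scores]


def predict_feature(x, weights, means, variances, slopes, offsets):
    """Expected audio feature: posterior-weighted sum of the affine maps."""
    post = component_posteriors(x, weights, means, variances)
    dim = len(offsets[0])
    out = [0.0] * dim
    for p, slope, offset in zip(post, slopes, offsets):
        for i in range(dim):
            out[i] += p * (sum(slope[i][j] * x[j] for j in range(2))
                           + offset[i])
    return out


# Two hypothetical components with hand-picked parameters.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 0.0]]
variances = [1.0, 1.0]
eye = [[1.0, 0.0], [0.0, 1.0]]
slopes = [eye, eye]
offsets = [[0.0, 0.0], [2.0, 2.0]]
print(component_posteriors([5.0, 0.0], weights, means, variances))
print(predict_feature([5.0, 0.0], weights, means, variances, slopes, offsets))
```

Halfway between the two component centers both maps contribute equally, which is what makes the overall regression piecewise affine with soft transitions.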


IV Variational Approximation

Direct estimation of the filtering distribution is intractable. In particular, the integral (4) does not have an analytic solution. Consequently, evaluating expectations over this distribution is intractable as well. We overcome this problem via variational inference and an associated closed-form EM solver [28, 29]. More precisely is approximated with the following factorized form:


which implies


where and are the variational posterior probabilities of assigning visual observation to person and audio observation to person , respectively. The proposed variational approximation (17) amounts to breaking the conditional dependence of and with respect to , which causes the computational intractability. Note that the visual, , and audio, , , assignment variables are independent, that the assignment variables for each observation are also independent, and that and are conditionally dependent on the audio observation. This factorized approximation makes the calculation of tractable. The optimal solution is given by an instance of the variational expectation maximization (VEM) algorithm [28, 29], which alternates between two steps:

  • Variational E-step: the approximate log-posterior distribution of each one of the latent variables is estimated by taking the expectation of the complete-data log-likelihood over the remaining latent variables, i.e. (19), (20), and (21) below, and

  • M-step: model parameters are estimated by maximizing the variational expected complete-data log-likelihood.
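For reference, the variational E-step above rests on the standard mean-field identity. It is sketched here in generic notation that is not the notation of this paper: $\mathbf{z}_1, \dots, \mathbf{z}_J$ stand for the blocks of latent variables of a factorized posterior $q(\mathbf{z}) = \prod_j q_j(\mathbf{z}_j)$, and $\mathbf{o}$ stands for the observations.

```latex
\log q_j^{\star}(\mathbf{z}_j)
  = \mathbb{E}_{\prod_{i \neq j} q_i(\mathbf{z}_i)}
    \bigl[ \log p(\mathbf{o}, \mathbf{z}_1, \dots, \mathbf{z}_J) \bigr]
  + \text{const}.
```

Each factor is thus obtained by averaging the complete-data log-likelihood over all the other factors, and the constant is fixed by normalization.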

In the case of the proposed model, the latent-variable log-posteriors are written as:


A remarkable consequence of the factorization (17) is that is replaced with ; consequently, (4) becomes:


It is now assumed that the variational posterior distribution is Gaussian with mean and covariance :


By substituting (23) into (22) and combining it with (8), the predictive distribution (22) becomes:


Note that the above distribution factorizes across persons. Now that all the factors in (1) have tractable expressions, a VEM algorithm can be applied.

V Variational Expectation Maximization

The proposed VEM algorithm iterates between an E-S-step, an E-Z-step, and an M-step, as described below.

V-1 E-S-step

The per-person variational posterior distribution of the state vector is evaluated by developing (19). The complete-data likelihood in (19) is the product of (2), (3) and (24). We thus first sum the logarithms of (2), of (3) and of (24). Then we ignore the terms that do not involve . Evaluation of the expectation over all the latent variables except yields the following Gaussian distribution:




where and are computed in the E-Z-step below. A key point is that, because of the recursive nature of the formulas above, it is sufficient to make the Gaussian assumption at , i.e. , whose parameters may be easily initialized. It follows that is Gaussian at each frame.

We note that both (26) and (27) are composed of three terms: the first (#1), second (#2) and third (#3) terms of (26) correspond to the visual, audio, and model-dynamics contributions to the precision, respectively. Recall that covariance is associated with the visual observed variable in (10). Matrices and vectors characterize the piecewise affine mappings from the space of person locations to the space of audio features, and covariances capture the errors that are associated with both audio measurements and the piecewise affine approximation in (15). A similar interpretation holds for the three terms of (27).
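The three-term structure described above is the information-form combination of Gaussians: precisions add, and each measurement is weighted by the posterior probability that it belongs to the person. The scalar sketch below illustrates only that structure. It assumes every measurement is already expressed in position space, whereas the audio term of the model goes through the affine mappings, and the function name and values are hypothetical.

```python
def fuse_scalar(pred_mean, pred_var, measurements):
    """Combine a predicted Gaussian with weighted Gaussian measurements.

    measurements: iterable of (weight, value, variance), where weight is
    the posterior probability that the measurement belongs to this person.
    Returns the posterior (mean, variance) computed in information form.
    """
    precision = 1.0 / pred_var            # dynamics contribution
    info = pred_mean / pred_var
    for weight, value, var in measurements:
        precision += weight / var         # visual or audio contribution
        info += weight * value / var
    post_var = 1.0 / precision
    return post_var * info, post_var


# Hypothetical prediction, one visual and one (noisier) audio measurement.
mean, var = fuse_scalar(100.0, 25.0,
                        [(0.9, 104.0, 4.0), (0.5, 110.0, 100.0)])
print(mean, var)
```

When all assignment weights are zero, for example when a person is neither seen nor heard, the posterior falls back to the prediction, which is how intermittent observations are absorbed.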

V-2 E-Z-step

By developing (20), and following the same reasoning as above, we obtain the following closed-form expression for the variational posterior distribution of the visual assignment variable:


where is given by:

Similarly, for the variational posterior distribution of the audio assignment variables, developing (21) leads to:


where is given by:


To obtain (30), an additional approximation is made. Indeed, the logarithm of (16) is part of the complete-data log-likelihood and the denominator of (16) contains a weighted sum of Gaussian distributions. Taking the expectation of this term is not tractable because of the denominator. Based on the dynamical model (8), we replace the state variable in (16) with a “naive” estimate predicted from the position and velocity inferred at : .
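The E-Z-step computes, for each observation, a normalized posterior over the tracked persons plus the "nobody" class. The sketch below shows that normalization for a 2D observation with isotropic Gaussian person likelihoods and a uniform density for nobody. It is a simplification: it ignores the uncertainty of the state estimate and the appearance or affine-mapping terms, and all names and values are hypothetical.

```python
import math


def assignment_posteriors(obs, person_means, variance, priors, volume):
    """Posterior over {nobody, person 1, ..., person N} for one observation.

    priors[0] and the uniform density 1/volume belong to 'nobody';
    persons use isotropic 2D Gaussians (an assumption of this sketch).
    """
    scores = [priors[0] / volume]
    for prior, mean in zip(priors[1:], person_means):
        sq = sum((o - m) ** 2 for o, m in zip(obs, mean))
        dens = math.exp(-0.5 * sq / variance) / (2.0 * math.pi * variance)
        scores.append(prior * dens)
    total = sum(scores)
    return [s / total for s in scores]


# Two hypothetical persons in a 640 x 480 image, uniform priors.
persons = [[100.0, 100.0], [300.0, 120.0]]
priors = [1.0 / 3.0] * 3
volume = 640.0 * 480.0
print(assignment_posteriors([105.0, 98.0], persons, 100.0, priors, volume))
print(assignment_posteriors([600.0, 400.0], persons, 100.0, priors, volume))
```

An observation near a person is assigned to that person with high probability, while an isolated observation is absorbed by the nobody class, which is how false detections are discounted.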

V-3 M-step

The entries of the covariance matrix of the state dynamics, , are the only parameters that need to be estimated. To this aim, we develop and ignore the terms that do not depend on . We obtain:

which can be further developed as:


Hence, by differentiating the expression above with respect to and equating to zero, we obtain:


VI Algorithm Implementation

The VEM procedure above will be referred to as VAVIT, which stands for variational audio-visual tracking, and its pseudo-code is shown in Algorithm 1. In theory, the order in which the two expectation steps are executed is not important. In practice, the issue of initialization is crucial. In our case, it is more convenient to start with the E-Z-step rather than with the E-S-step because the former is easier to initialize than the latter (see below). We start by explaining how the algorithm is initialized at and then how the E-Z-step is initialized at each iteration. Next, we explain in detail the birth process. An interesting feature of the proposed method is that it allows one to estimate who speaks when, i.e. speaker diarization, which is explained last.

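Input: visual observations;
         audio observations;
Output: parameters of the variational posterior of each person's state (the estimated position of each person is given by the first two entries of the posterior mean);
           speaking status of each person
Initialization (see Section VI-A);
for each frame t, from the first to the last, do
       Gather visual and audio observations at frame t;
       Perform voice activity detection;
       Initialization of E-Z-step (see Section VI-A);
       for each VEM iteration do
             E-Z-step (vision):
             for each visual observation do
                   for each person (including nobody) do
                         Evaluate the visual assignment posterior with (28);
                   end for
             end for
             E-Z-step (audio):
             for each audio observation do
                   for each person (including nobody) and each affine mapping do
                         Evaluate the audio assignment posterior with (30);
                   end for
             end for
             E-S-step:
             for each person do
                   Evaluate the posterior covariance and mean with (26) and (27);
             end for
             M-step: Evaluate the dynamics covariance with (32);
       end for
       Perform birth (see Section VI-B);
       Output the results;
end for
Algorithm 1 Variational audio-visual tracking algorithm.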

VI-A Initialization

At one must provide initial values for the parameters of the distributions (25), namely and for all . These parameters are initialized as follows. The means are initialized at the image center and the covariances are given very large values, such that the variational distributions are non-informative. Once these parameters are initialized, they remain constant for a few frames, i.e. until the birth process is activated (see Section VI-B below).

As already mentioned, it is preferable to start with the E-Z-step rather than with the E-S-step because the initialization of the former is straightforward. Indeed, the E-S-step (Section V) requires current values for the posterior probabilities (28) and (30), which are estimated during the E-Z-step and which are both difficult to initialize. Conversely, the E-Z-step only requires current mean values, , which can be easily initialized by using the model dynamics (8), namely .

VI-B Birth Process

We now explain in detail the birth process, which is executed at the start of the tracking to initialize a latent variable for each detected person, as well as at any time to detect new persons. The birth process considers consecutive visual frames. At , with , we consider the set of visual observations assigned to from to , namely observations whose posteriors (28) are maximized for (at initialization all the observations are in this case). We then build observation sequences from this set, namely sequences of the form , where indexes the set of observations at assigned to and indexes the set of all such sequences. Notice that the birth process only uses the bounding-box center, width and height, , and that the descriptor is not used. Hence the birth process is only based on the smoothness of an observed sequence of bounding boxes. Let's consider the marginal likelihood of a sequence , namely:


where is the latent variable already defined and indexes the set

. All the probability distributions in (

33) were already defined, namely (8) and (10), with the exception of

. Without loss of generality, we can assume that the latter is a normal distribution centered at

and with a large covariance. Therefore, the evaluation of (33) yields a closed-form expression for . A sequence generated by a person is likely to be smooth and hence is high, while for a non-smooth sequence the marginal likelihood is low. A newborn person is therefore created from a sequence of observations if , where is a user-defined parameter. As just mentioned, the birth process is executed to initialize persons as well as along time to add new persons. In practice, in (33) we set B=3 and hence, from t=1 to t=4 all the observations are initially assigned to .
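To illustrate why a smooth sequence scores higher than an erratic one, the following sketch evaluates the log marginal likelihood of a short sequence of bounding-box centers in closed form. It is not the paper's expression (33): it assumes a hypothetical constant-velocity linear-Gaussian model, observes only the 2-D center, and uses arbitrary noise variances. All names and values are illustrative.

```python
import numpy as np

def sequence_log_likelihood(obs, dyn_var=1.0, obs_var=4.0, prior_var=1e4):
    """Log marginal likelihood of a 2-D point sequence under an assumed
    constant-velocity linear-Gaussian model (latent states integrated out)."""
    obs = np.asarray(obs, dtype=float)
    D = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # assumed dynamics: position += velocity
    P = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # only the position is observed
    Q = dyn_var * np.eye(4)                     # assumed dynamics noise
    R = obs_var * np.eye(2)                     # assumed observation noise
    # Broad prior centered at the first observation, with zero velocity.
    mean = np.array([obs[0, 0], obs[0, 1], 0.0, 0.0])
    cov = prior_var * np.eye(4)
    log_lik = 0.0
    for k, v in enumerate(obs):
        if k > 0:                               # prediction step
            mean = D @ mean
            cov = D @ cov @ D.T + Q
        innov = v - P @ mean                    # prediction error
        S = P @ cov @ P.T + R                   # predictive covariance
        log_lik += -0.5 * (innov @ np.linalg.solve(S, innov)
                           + np.log(np.linalg.det(S))
                           + 2.0 * np.log(2.0 * np.pi))
        K = cov @ P.T @ np.linalg.inv(S)        # update step
        mean = mean + K @ innov
        cov = cov - K @ P @ cov
    return log_lik

def is_birth(obs, threshold):
    """Create a new person if the sequence likelihood exceeds the threshold."""
    return sequence_log_likelihood(obs) > threshold

# Four frames, matching B=3 (frames t-3, ..., t).
smooth = [(100, 100), (102, 101), (104, 102), (106, 103)]
jittery = [(100, 100), (400, 50), (90, 300), (500, 500)]
```

The per-frame terms are the usual Gaussian prediction-error decomposition of the marginal likelihood; the threshold passed to `is_birth` plays the role of the user-defined parameter and would have to be tuned.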

VI-C Speaker Diarization

Speaker diarization consists of assigning temporal segments of speech to persons [30]. We introduce a binary variable for each person and each frame, equal to 1 if the person speaks at that frame and to 0 otherwise. Traditionally, speaker diarization is based on the following assumptions. First, it is assumed that speech signals are sparse in the time-frequency domain. Second, it is assumed that each time-frequency point in such a spectrogram corresponds to a single speech source. Therefore, the proposed speaker diarization method is based on assigning time-frequency points to persons.

In the case of the proposed model, speaker diarization can be coarsely inferred from frequency sub-bands in the following way. For each frequency sub-band and each frame, we consider the posterior probability that the speech signal available in that sub-band was uttered by a given person, given the corresponding audio observation. This posterior involves the audio assignment variable and the affine-mapping assignment variable defined in Section III-D, and it is evaluated using the variational approximation (29). The probabilities are then accumulated over all the frequency sub-bands, and a person is declared to be speaking at the current frame if the accumulated probability exceeds a user-defined threshold. Note that there is no dynamic model associated with diarization: the speaking status is estimated independently at each frame and for each person. More sophisticated diarization models can be found in [31, 17].
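The decision rule can be sketched as follows, assuming the per-sub-band assignment posteriors are already available as an array with one row per frequency sub-band and one column per person. The array layout, the function name and the threshold value are illustrative assumptions.

```python
import numpy as np

def speaking_status(posteriors, threshold):
    """Frame-wise diarization by accumulation and thresholding.

    posteriors: array of shape (K, N); entry (k, n) is the posterior probability
                that sub-band k was uttered by person n (assumed layout).
    Returns a length-N array of 0/1 speaking flags.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    accumulated = posteriors.sum(axis=0)          # accumulate over sub-bands
    return (accumulated >= threshold).astype(int)

# Three sub-bands, two persons; the threshold is a hypothetical value.
example = [[0.9, 0.1],
           [0.8, 0.2],
           [0.7, 0.3]]
flags = speaking_status(example, threshold=1.5)
```

Because each frame is processed independently, the output can flicker between speaking and silent; this is the price of having no diarization dynamics.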

VII Experiments

VII-A Dataset

We use the AVDIAR dataset [17] to evaluate the performance of the proposed audio-visual tracking method. This dataset is challenging in terms of audio-visual analysis. Several participants are involved in informal conversations while wandering around, between two and four meters away from the audio-visual recording device. They take speech turns, their speech often overlaps, and they turn their faces away from the camera. The dataset is annotated as follows (please consult the dataset documentation for a detailed description). The visual annotations comprise the centers, widths and heights of two bounding boxes for each person and in each video frame: a face bounding box and an upper-body bounding box. An identity (a number) is associated with each person through the entire dataset. The audio annotations comprise the speech status of each person over time (speaking or silent), with a minimum speech duration of 0.2 seconds. The audio source locations correspond to the centers of the face bounding boxes.

The dataset was recorded with a sensor composed of two cameras and six microphones, but only one camera is used in the experiments described below. The videos were recorded at 25 FPS. The frames are 1920 pixels wide, corresponding to a horizontal field of view of 97° (see Section VII-D). The microphone signals are sampled at 16000 Hz. The dataset was recorded in two different rooms, living-room and meeting-room (see, e.g., Fig. 1 and Fig. 2). These two rooms have quite different lighting conditions and acoustic properties (size, presence of furniture, background noise, etc.). Altogether there are 18 sequences associated with living-room (26928 video frames) and 6 sequences with meeting-room (6031 video frames). Additionally, there are two training datasets (one for each room) that contain input-output pairs of multichannel audio features and audio-source locations, which make it possible to estimate the parameters (14) using the method of [15]. This yields a mapping between source locations in the image plane and audio features. Audio feature extraction is described in detail below.

VII-B Audio Features

The STFT (short-time Fourier transform) [16] is applied to each microphone signal using a 16 ms Hann window (256 audio samples per window) and an 8 ms shift (50% overlap), leading to 128 frequency bins and to 125 audio FPS. Inter-channel features are then computed using [14]. These features, referred to as direct-path relative transfer function (DP-RTF) features, are robust against both background noise and reverberation; hence they do not depend on the room acoustic properties, as they encode the direct path from the audio source to the microphones. The audio features are averaged over five audio frames in order to properly align them with the video frames (125 / 5 = 25 FPS). The feature vector is then split into frequency sub-bands; sub-bands with low energy are disregarded. This yields the set of audio observations at each frame (see Section III-D).
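The framing arithmetic above can be checked with a short sketch. It computes plain STFT magnitudes only, as a stand-in for the DP-RTF features of [14], which are not reproduced here. Dropping the Nyquist bin to obtain 128 bins is an assumption, as are the function names.

```python
import numpy as np

FS = 16000                 # sampling rate in Hz (from the text)
WIN_MS, HOP_MS = 16, 8     # window length and shift in ms (from the text)
AVG_FRAMES = 5             # audio frames averaged per video frame (from the text)

win = FS * WIN_MS // 1000            # samples per window
hop = FS * HOP_MS // 1000            # samples per shift
audio_fps = FS / hop                 # audio frames per second
video_fps = audio_fps / AVG_FRAMES   # rate after averaging

def stft_magnitude(signal):
    """Hann-windowed STFT magnitudes, shape (n_frames, win // 2)."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hanning(win)
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)       # win // 2 + 1 bins
    return np.abs(spectrum[:, :win // 2])        # assumption: drop the Nyquist bin

def average_to_video_rate(features):
    """Average consecutive groups of AVG_FRAMES audio frames."""
    n_video = features.shape[0] // AVG_FRAMES
    trimmed = features[:n_video * AVG_FRAMES]
    return trimmed.reshape(n_video, AVG_FRAMES, -1).mean(axis=1)

one_second = np.random.default_rng(0).standard_normal(FS)   # synthetic test signal
spec = stft_magnitude(one_second)
per_video_frame = average_to_video_rate(spec)
```

Without padding, one second of audio yields slightly fewer than 125 audio frames, so the averaged output has slightly fewer than 25 rows; a real pipeline would pad or align the streams explicitly.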

VII-C Visual Processing

Because in AVDIAR people do not necessarily face the camera, face detection is not very robust. Instead we use a body-pose detector [32] from which we infer a full-body bounding-box and a head bounding-box. We use the CNN-based person re-identification method of [33] to extract an embedding from the full-body bounding-box. This yields the feature vectors (Section III-C). Similarly, the center, width and height of the head bounding-box yield the observations at each frame.

VII-D Experimental Settings

One interesting feature of the proposed tracker is its flexibility in dealing with visual data, audio data, or both. Moreover, the algorithm is able to automatically switch from unimodal to multimodal tracking. In order to quantitatively assess the performance and merits of each one of these variants we used two configurations:

  • Full camera field of view (FFOV): The entire horizontal field of view of the camera, i.e. 1920 pixels, or 97°, is being used, such that visual and audio observations, if any, are simultaneously available, and

  • Partial camera field of view (PFOV): The horizontal field of view is restricted to 768 pixels (or 49°) and there are two blind strips (576 pixels each) on its left- and right-hand sides; the audio field of view remains unchanged, 1920 pixels, or 97°.

The PFOV configuration allows us to test scenarios in which a participant may leave the camera field of view and still be heard. Notice that since ground-truth annotations are available for the full field of view, it is possible to assess the performance of the tracker using audio observations only, as well as to analyse the behavior of the tracker when it switches from audio-only tracking to audio-visual tracking.
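The geometry of the two configurations can be sketched as follows. Whether a visual detection is kept or discarded according to the column of its bounding-box center is an assumption made for this illustration, and the function name is hypothetical.

```python
FULL_WIDTH = 1920      # pixels, full horizontal field of view (from the text)
VISIBLE_WIDTH = 768    # pixels kept in the PFOV configuration (from the text)
STRIP = (FULL_WIDTH - VISIBLE_WIDTH) // 2   # width of each blind strip

def is_visually_observable(x_center, partial=True):
    """True if a detection centered at column x_center is seen by the camera.

    Assumption: visibility is decided from the bounding-box center only.
    """
    if not partial:
        return 0 <= x_center < FULL_WIDTH
    return STRIP <= x_center < FULL_WIDTH - STRIP
```

Audio observations are never filtered in this way, since the audio field of view covers the full 1920 pixels in both configurations.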

VII-E Evaluation Metrics

We used standard multi-object tracking (MOT) metrics to quantitatively evaluate the performance of the proposed tracking algorithm. The multi-object tracking accuracy (MOTA) is the most commonly used metric for MOT. It is a combination of false positives (FP), false negatives (FN; aka missed tracks), and identity switches (IDs), and is defined as:

$$\mathrm{MOTA} = 1 - \frac{\sum_t (\mathrm{FP}_t + \mathrm{FN}_t + \mathrm{IDs}_t)}{\sum_t \mathrm{GT}_t},$$

where the sums run over the frames t and GT stands for the ground-truth person trajectories, as annotated in the AVDIAR dataset. After comparison with GT trajectories, each estimated trajectory can be classified as mostly tracked (MT), partially tracked (PT) or mostly lost (ML). If a trajectory is covered by a correct estimation at least 80% of the time, it is considered as MT. Similarly, it is considered as ML if it is covered less than 20% of the time. In our experiments, MT and ML scores represent the percentage of trajectories which are considered as mostly tracked and mostly lost, respectively. In addition, the number of track fragmentations (FM) counts how many times the estimated trajectories are discontinuous (whereas the corresponding GT trajectories are continuous).
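A minimal sketch of these scores, computed from totals accumulated over a sequence, is given below. The MT and ML thresholds default to the conventional 80% and 20% values, which is an assumption of this example, and the function names are hypothetical.

```python
def mota(fp, fn, ids, gt):
    """MOTA from totals over all frames: 1 - (FP + FN + IDs) / GT."""
    if gt <= 0:
        raise ValueError("gt must be a positive count of ground-truth instances")
    return 1.0 - (fp + fn + ids) / gt

def track_status(covered_fraction, mt_thresh=0.8, ml_thresh=0.2):
    """Classify a ground-truth trajectory by the fraction of time it is covered."""
    if covered_fraction >= mt_thresh:
        return "MT"   # mostly tracked
    if covered_fraction < ml_thresh:
        return "ML"   # mostly lost
    return "PT"       # partially tracked

# Synthetic counts, for illustration only.
score = mota(fp=10, fn=20, ids=5, gt=1000)
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth instances.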

In our experiments, the overlap threshold above which a ground-truth bounding box is considered to be covered by an estimate is set to 0.1. In the PFOV configuration, we also need to evaluate audio-only tracking, i.e. when the speakers are in the blind areas. As mentioned before, audio localization is less accurate than visual localization. Therefore, for evaluating the audio-only tracker we relax by a factor of two the expected localization accuracy with respect to the audio-visual localization accuracy.

VII-F Benchmarking with Baseline Methods

Method             MOTA(↑)  FP(↓)   FN(↓)   IDs(↓)  FM(↓)  MT(↑)   ML(↓)
AS-VA-PF [4]       10.37    44.64%  43.95%  732     918    20%     7.5%
AV-MSSMC-PHD [7]   18.96    8.13%   72.09%  581     486    17.5%   52.5%
OBVT [8]           96.32    1.77%   1.79%   80      131    92.5%   0%
VAVIT (proposed)   96.03    1.85%   2.0%    86      152    92.5%   0%
TABLE I: MOT scores for the living-room sequences (full camera field of view)
Method             MOTA(↑)  FP(↓)   FN(↓)   IDs(↓)  FM(↓)  MT(↑)    ML(↓)
AS-VA-PF [4]       62.43    18.63%  17.19%  297     212    70.59%   0%
AV-MSSMC-PHD [7]   28.48    0.93%   69.68%  155     60     0%       52.94%
OBVT [8]           98.50    0.25%   1.11%   25      10     100.00%  0%
VAVIT (proposed)   98.16    0.38%   1.27%   32      15     100.00%  0%
TABLE II: MOT scores for the meeting-room sequences (full camera field of view)
Method             MOTA(↑)  FP(↓)   FN(↓)   IDs(↓)  FM(↓)  MT(↑)   ML(↓)
AS-VA-PF [4]       17.82    36.86%  42.88%  1722    547    32.50%  7.5%
AV-MSSMC-PHD [7]   20.61    5.54%   72.45%  989     471    12.5%   40%
OBVT [8]           66.39    0.48%   32.95%  129     203    45%     7.5%
VAVIT (proposed)   69.62    8.97%   21.18%  152     195    70%     5%
TABLE III: MOT scores for the living-room sequences (partial camera field of view)
Method             MOTA(↑)  FP(↓)   FN(↓)   IDs(↓)  FM(↓)  MT(↑)   ML(↓)
AS-VA-PF [4]       29.04    23.05%  45.19%  461     246    29.41%  17.65%
AV-MSSMC-PHD [7]   26.95    1.05%   70.62%  234     64     5.88%   52.94%
OBVT [8]           64.24    0.43%   35.18%  24      25     36.84%  15.79%
VAVIT (proposed)   65.27    5.07%   29.5%   26      26     47.37%  10.53%
TABLE IV: MOT scores for the meeting-room sequences (partial camera field of view)
Fig. 1: Four frames sampled from Seq13-4P-S2M1. First row: green digits denote speakers while red digits denote silent participants. Second, third and fourth rows: visual, audio, and dynamic contours of constant densities (covariances), respectively, of each tracked person. The tracked persons are color-coded: green, yellow, blue, and red.
Fig. 2: Four frames sampled from Seq19-2P-S1M1. The camera field of view is limited to the central strip. Whenever the participants are outside the central strip, the tracker entirely relies on audio observations and on the model’s dynamics.

To quantitatively evaluate its performance, we benchmarked the proposed method against two state-of-the-art audio-visual tracking methods. The first one is the audio-assisted video adaptive particle filtering (AS-VA-PF) method of [4], and the second one is the sparse audio-visual mean-shift sequential Monte-Carlo probability hypothesis density (AV-MSSMC-PHD) method of [7]. The method of [4] takes as input a video and a sequence of sound locations. Sound locations are used to reshape the typical Gaussian noise distribution of the particles in the propagation step, and the particles are then used to weight the observation model. The method of [7] uses audio information to improve the performance and robustness of a visual SMC-PHD filter. Both methods show good performance in meeting configurations, e.g. the AV16.3 dataset [23], whose recordings used a circular microphone array placed on a table at the center of the room, as well as several cameras fixed on the ceiling. The scenarios associated with AV16.3 are somewhat artificial in the sense that the participants speak simultaneously and continuously. This is in contrast with the AVDIAR recordings, where people take speech turns in informal conversations.

Since both [4] and [7] require input from a multiple sound-source localization (SSL) algorithm, the multi-speaker localization method proposed in [14] is used to provide input to [4] and [7] (the authors of [4] and [7] kindly provided their software packages). We also compare the proposed method with a visual multiple-person tracker, more specifically the online Bayesian variational tracker (OBVT) of [8], which is based on variational inference similar to the one presented in this paper. In [8] visual observations were provided by color histograms. In our benchmark, for the sake of fairness, the proposed tracker and [8] share the same visual observations (Section VII-C).

The MOT scores obtained with these methods as well as with the proposed method are reported in Table I, Table II, Table III and Table IV. The symbols ↑ and ↓ indicate higher the better and lower the better, respectively. The tables report results obtained with the living-room and meeting-room sequences and for the two configurations mentioned above: full and partial camera fields of view. The most informative metric is MOTA (MOT accuracy) and one can easily see that both [8] and the proposed method outperform the other two methods. The poorer performance of both [4] and [7] for all the configurations is generally explained by the fact that these two methods assume that audio and visual observations are simultaneously available. In particular, [4] is not robust against visual occlusions, which leads to poor IDs (identity switches) scores.

The AV-MSSMC-PHD method [7] uses audio information in order to count the number of speakers. The algorithm detects multiple speakers whenever multiple audio sources are detected. In practice, the algorithm rarely finds multiple speakers and in most cases it only tracks one speaker. This explains why both FN (false negatives) and IDs (identity switches) scores are high; see Tables I, II, and III.

One can notice that in the case of FFOV, [8] and the proposed method yield similar results in terms of MOT scores: they both exhibit low FP, FN and IDs scores and, consequently, high MOTA scores. Moreover, they have very good MT, PT and ML scores (for the living-room sequences, out of 40 trajectories, 37 are mostly tracked, 3 are partially tracked, and none is mostly lost). As expected, the inferred trajectories are more accurate for visual tracking (whenever visual observations are available) than for audio-visual tracking: indeed, the latter fuses visual and audio observations, which slightly degrades the accuracy because audio localization is less accurate than visual localization.

As for the PFOV configuration (Table III and Table IV), the proposed algorithm yields the best MOTA scores both for the meeting and for the living rooms. Both [4] and [7] have difficulties when visual information is not available, e.g. in the left- and right-hand blind strips on both sides of the restricted field of view: both algorithms fail to track speakers when they walk outside the visual field of view. While [7] is able to detect a speaker when the speaker re-enters the visual field of view, [4] is not. By construction, the visual tracking algorithm of [8] fails in the absence of visual observations.

VII-G Audio-Visual Tracking Examples

We now provide and discuss results obtained with three recordings: one FFOV sequence, Seq13-4P-S2M1 (Fig. 1), and two PFOV sequences, Seq19-2P-S1M1 (Fig. 2) and Seq22-1P-S0M1 (Fig. 3). These sequences are challenging in terms of audio-visual tracking: participants are seated, then they stand up or they wander around. Some participants take speech turns and interrupt each other, while other participants remain silent.

The first row of Fig. 1 shows four frames sampled from a video recording with first two and then four participants, labeled 1, 2, 3, and 4. Green digits designate participants detected as speakers and red digits correspond to participants detected as listeners. The second row shows ellipses of constant density (visual covariances), i.e. the inverse of precision term #1 in (26). Notice that in the second frame the detection of person 3, who turns his back to the camera, was missed. The third row shows the audio covariances, i.e. the inverse of precision term #2 in (26). The audio covariances are much larger than the visual ones since audio localization is less accurate than visual localization. There are two distinct audio sources close to each other that are correctly detected, localized and assigned to persons 1 and 4; it is therefore still possible to assign audio activities to both 1 and 4. The fourth row shows the contribution of the dynamic model to the covariance, i.e. the inverse of precision term #3 in (26). Notice that these “dynamic” covariances are small in comparison with the “observation” covariances, which reflects a smooth trajectory and ensures tracking continuity when audio or visual observations are either weak or totally absent.

Fig. 2 shows a tracking example with a PFOV (partial camera field of view) configuration. In this case, audio and visual observations are rarely available simultaneously. The independence of the visual and audio observation models and their fusion within the same dynamic model make tracking robust in this situation.

Fig. 3 shows the ground-truth trajectory of a person and the trajectories estimated with the audio-visual tracker [4], with the visual tracker [8], and with the proposed method. The ground-truth trajectory corresponds to a sequence of bounding-box centers. Both [4] and [8] failed to estimate a correct trajectory. Indeed, [4] requires simultaneous availability of audio-visual data while [8] cannot track outside the visual field of view. Notice the erratic trajectory obtained with [4] in comparison with the smooth trajectories obtained with variational inference, i.e. [8] and the proposed method.

VII-H Speaker Diarization Results

As already mentioned in Section VI-C, speaker diarization information can be extracted from the output of the proposed VAVIT algorithm. Notice that, while audio diarization is an extremely well investigated topic, audio-visual diarization has received much less attention. An audio-visual diarization method based on a dynamic Bayesian network and applied to video conferencing is proposed in [31]. The method assumes that participants take speech turns, which is an unrealistic hypothesis in the general case. The diarization method of [34] requires audio, depth and RGB data. More recently, [17] proposed a Bayesian dynamic model for audio-visual diarization that takes as input fused audio-visual information. Since diarization is not the main objective of this paper, we only compared our diarization results with [17], which achieves state-of-the-art results, and with the diarization toolkit of [18], which only considers audio information.

Fig. 3: Trajectories associated with a tracked person under the PFOV configuration. Panels: ground-truth trajectory, AS-VA-PF [4], OBVT [8], and VAVIT (proposed). The ground-truth trajectory corresponds to the center of the bounding-box of the head. The trajectory of [4] is erratic. Both [4] and [8] fail to track outside the camera field of view. In the case of OBVT, there is an identity switch, from “red” (before the person leaves the visual field of view) to “blue” (after the person re-enters the visual field of view).

The diarization error rate (DER) is generally used as a quantitative measure. As for MOT, DER combines FP, FN and IDs scores. The NIST-RT evaluation toolbox is used. The results obtained with these two methods and with ours are reported in Table V, with both the full field-of-view and partial field-of-view configurations (FFOV and PFOV). The proposed method performs better than the audio-only baseline method [18]. In comparison with [17], the proposed method performs only slightly less well, despite lacking a diarization dynamic model. Indeed, [17] estimates diarization within a temporal model that takes into account both diarization dynamics and audio activity at each time step, whereas our method is only based on audio activity at each time step.

The ability of the proposed audio-visual tracker to perform diarization is illustrated with the FFOV sequence Seq13-4P-S2M1 (Fig. 1) and with the PFOV sequence Seq19-2P-S1M1 (Fig. 2); see Fig. 4 and Fig. 5, respectively.

Sequence DiarTK [18] [17] Proposed (FFOV) Proposed (PFOV)
Seq01-1P-S0M1 43.19 3.32 1.64 1.86
Seq02-1P-S0M1 49.9 - 2