I Introduction
In this paper we address the problem of tracking multiple speakers via the fusion of visual and auditory information [1, 2, 3, 4, 5, 6, 7]. We propose to exploit the complementary nature of these two modalities in order to accurately estimate the position of each person at each time step, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status, either speaking or silent, of each tracked person. We cast the problem at hand into a generative audiovisual fusion (or association) model formulated as a latent-variable temporal graphical model, and we propose a tractable solver via a variational approximation.
We are particularly interested in tracking people involved in informal meetings and social gatherings. In this type of scenario, participants wander around, cross each other, move in and out of the camera field of view, take speech turns, etc. Acoustic room conditions, e.g. reverberation, and overlapping audio sources of various kinds drastically deteriorate or modify the microphone signals. Likewise, occluded persons, lighting conditions and middle-range camera viewing complicate the task of visual processing. It is therefore impossible to gather reliable and continuous flows of visual and audio observations. Hence one must design a fusion and tracking method that is able to deal with intermittent visual and audio data.
We propose a multispeaker tracking method based on a dynamic Bayesian model that fuses audio and visual information over time from their respective observation spaces. This may well be viewed as a generalization of single-observation and single-target Kalman filtering, which yields an exact recursive solution, to multiple observations and targets, which makes the recursive solution intractable. We propose a variational approximation of the posterior distribution over the continuous variables (positions and velocities of tracked persons) and discrete variables (observation-to-person associations) at each time step, given all the past and present audio and visual observations. The approximation of this joint distribution with a factorized distribution makes the tracking problem tractable: the solution takes the form of a closed-form expectation-maximization (EM) procedure.
In general, multiple object tracking consists of the temporal estimation of the kinematic state of each object, i.e. position and velocity. In computer vision, local descriptors are used to better discriminate between objects, e.g. person detectors/descriptors based on handcrafted features [8] or on deep neural networks [9]. If the tracked objects emit sounds, their states can be inferred as well using sound-source localization techniques combined with tracking. These techniques are often based on the estimation of the sound’s direction of arrival (DOA) using a microphone array, e.g. [10]. DOA estimation can be carried out either in the temporal domain [11], or in the spectral (Fourier) domain [12]. Spectral-domain DOA estimation methods are, however, more robust than temporal-domain methods, in particular in the presence of background noise and reverberation [13, 14].
Via proper camera-microphone calibration, audio and visual observations can be aligned such that a DOA corresponds to a 2D location in the image plane. In this paper we adopt the audiovisual alignment method of [15], which learns a mapping from a vector space spanned by multichannel spectral features (or audio features, in short) to the image plane, as well as the inverse of this mapping. This allows us to exploit the richness of representing acoustic signals in the short-time Fourier domain [16] and to extract noise- and reverberation-free audio features [13].
We propose to represent the audiovisual fusion problem via two sets of independent variables, i.e. visual-feature-to-person and audio-feature-to-person sets of assignment variables. An interesting characteristic of this formulation is that the proposed tracking algorithm can indifferently use visual features, audio features, or a combination of both, and choose independently for every target and at every time step. Indeed, audio and visual information are rarely available simultaneously and continuously. Visual information suffers from limited camera field of view, occlusions, false positives, missed detections, etc. Audio information is often corrupted by room acoustics, environmental noise and overlapping acoustic signals. In particular, speech signals are sparse and nonstationary, and they are emitted intermittently, with silence intervals between speech utterances. Hence a robust audiovisual tracker must explicitly take into account the temporal sparsity of the two modalities, and this is exactly what is proposed in this paper.
We use the AVDIAR dataset [17] to evaluate the performance of the proposed audiovisual tracker. We use the MOT (multiple object tracking) metrics to quantitatively assess method performance. In particular the tracking accuracy (MOTA), which combines false positives, false negatives and identity switches and compares them with the ground-truth trajectories, is a commonly used score to assess the quality of a multiple person tracker (see https://motchallenge.net/). We use the MOT metrics to compare our method with two recently proposed audiovisual tracking methods [4, 7] and with a visual tracker [8]. An interesting outcome of the proposed method is that speaker diarization, i.e. who speaks when, can be coarsely inferred from the tracking output, thanks to the audio-feature-to-person assignment variables. The speaker diarization results obtained with our method are compared with two other methods [18, 17] based on the diarization error rate (DER) score.
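To make the MOTA score concrete, the following minimal sketch computes it from error counts accumulated over all frames, using the standard CLEAR MOT definition (one minus the ratio of errors to ground-truth objects). The function name and the counts in the example are hypothetical and are not taken from the experiments reported in this paper.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple object tracking accuracy (standard CLEAR MOT definition).

    All arguments are counts summed over every frame of a sequence.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# Hypothetical counts, for illustration only.
score = mota(false_negatives=120, false_positives=80,
             id_switches=10, num_ground_truth=1000)
```

Note that MOTA can be negative when the total number of errors exceeds the number of ground-truth objects.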
The remainder of the paper is organized as follows. Section II describes the related work. Section III describes in detail the proposed formulation. Section IV describes the proposed variational approximation and Section V details the variational expectation-maximization procedure. The algorithm implementation is described in Section VI. Tracking results and comparisons with other methods are reported in Section VII. Finally, Section VIII draws a few conclusions. Supplemental materials are available on our website (https://team.inria.fr/perception/research/variational_av_tracking/).
II Related Work
In computer vision, there is a long history of multiple object tracking methods. While these methods provide interesting insights concerning the problem at hand, a detailed account of existing visual trackers is beyond the scope of this paper. Several audiovisual tracking methods were proposed in the recent past, e.g. [19, 1, 2, 3]. These papers proposed approximate inference of the filtering distribution using Markov chain Monte Carlo particle filter sampling (MCMC-PF). These methods cannot provide estimates of the accuracy and merit of each modality with respect to each tracked person. Sampling and distribution estimation are performed in parameter space but no statistics are gathered in the observation spaces.
More recently, audiovisual trackers based on particle filtering (PF) and probability hypothesis density (PHD) filters were proposed, e.g. [4, 5, 6, 7, 20, 21, 22]. In [6], DOAs of audio sources were used to guide the propagation of particles, and the filter was combined with a mean-shift algorithm to reduce the computational complexity. Some PHD filter variants were proposed to improve tracking performance [20, 21]. The method of [4] also used DOAs of active audio sources to give more importance to particles located around DOAs. Along the same line of thought, [7] proposed a mean-shift sequential Monte Carlo PHD (SMC-PHD) algorithm that used audio information to improve the performance of a visual tracker. This implies that the persons being tracked must emit acoustic signals continuously and that multiple-source audio localization is reliable enough for proper audiovisual alignment.
PF- and PHD-based tracking methods are computationally efficient but their inherent limitation is that they are unable to associate observations to tracks. Hence they require an external post-processing mechanism that provides associations. Also, in the case of PF-based filtering, the number of tracked persons must be set in advance. Moreover, both PF- and PHD-based trackers provide non-smooth trajectories since the state dynamics are not explicitly enforced. In contrast, the proposed variational formulation embeds association variables within the model, uses a birth process to estimate the initial number of persons and to add new ones over time, and relies on an explicit dynamic model that yields smooth trajectories.
Another limitation of the methods proposed in [1, 3, 6, 20, 21, 22] is that they need as input a continuous flow of audio and visual observations. To some extent, this is also the case with [4, 7], where only the audio observations are supposed to be continuous. All these methods showed good performance in the case of the AV16.3 dataset [23], in which the participants spoke simultaneously and continuously, which is somewhat artificial. The AV16.3 dataset was recorded in a specially equipped meeting room using a large number of cameras to guarantee that frontal views of the participants were always available. This contrasts with the AVDIAR dataset, which was recorded with one sensor unit composed of two cameras and six microphones. The AVDIAR scenarios are composed of participants that take speech turns while they look at each other, hence they speak intermittently and they do not always face the cameras.
Recently, we proposed an audiovisual clustering method [24] and an audiovisual speaker diarization method [17]. The weighted-data clustering method of [24] analyzed a short time window composed of several audio and visual frames and hence it was assumed that the speakers were static within such temporal windows. Binaural audio features were mapped onto the image plane and were clustered with nearby visual features. There was no dynamic model that allowed speakers to be tracked. The audiovisual diarization method [17] used an external multi-object visual tracker that provided trajectories for each tracked person. The audio-feature-space to image-plane mapping [15] was used to assign audio information to each tracked person at each time step. Diarization itself was modeled with a binary state variable (speaking or silent) associated with each person. The diarization transition probabilities (state dynamics) were hand-crafted, with the assumption that the speaking status of a person was independent of all the other persons. Because of the small number of state configurations, i.e. (where is the maximum number of tracked persons), the MAP solution could be found by exhaustively searching the state space. In Section VII-H we use the AVDIAR recordings to compare our diarization results with the results obtained with [17].
The variational inference method proposed in this paper may well be viewed as a multimodal generalization of [8]. We show that the model of [8] can be extended to deal with observations living in completely different mathematical spaces. Indeed, we show that two (or several) different data-processing pipelines can be embedded and treated on an equal footing in the proposed formulation. Special attention is given to audiovisual alignment and to audio-to-person assignments: (i) we learn a mapping from the space of audio features to the image plane, as well as the inverse of this mapping, which are integrated in the proposed generative approach, and (ii) we show that the additional assignment variables due to the audio modality do not affect the complexity of the algorithm. Absence of observed data of any kind or erroneous data are carefully modeled: this enables the algorithm to deal with intermittent observations, whether audio, visual, or both. This is probably one of the most prominent features of the method, in contrast with most existing audiovisual tracking methods, which require continuous and simultaneous flows of visual and audio data.
This paper is an extended version of [25] and of [26]. The probabilistic model and its variational approximation were briefly presented in [25] together with preliminary results obtained with three AVDIAR sequences. Reverberation-free audio features were used in [26], where it was shown that good performance could be obtained with these features when the audio mapping was trained in one room and tested in another room. With respect to these two papers, we provide detailed descriptions of the proposed formulation, of the variational expectation maximization solver and of the implemented algorithm. We explain in detail the birth process, which is crucial for track initialization and for detecting potentially new tracks at each time step. We experiment with the entire AVDIAR dataset and we benchmark our method against the state-of-the-art multiple-speaker audiovisual tracking methods [4, 7] and against [8]. Moreover, we show that our tracker can be used for speaker diarization.
III Proposed Model
III-A Mathematical Definitions and Notations
Unless otherwise specified, uppercase letters denote random variables while lowercase letters denote their realizations, e.g. , where denotes either a probability density function (pdf) or a probability mass function (pmf). For the sake of conciseness we generally write . Vectors are written in slanted bold, e.g. , whereas matrices are written in bold, e.g. . Video and audio data are assumed to be synchronized, and let denote the common frame index. Let be the upper bound of the number of persons that can simultaneously be tracked at any time , and let be the person index. Let denote nobody. A subscript denotes variable concatenation at time , e.g. , and the subscript denotes concatenation from 1 to , e.g. .
Let , and be three latent variables that correspond to the 2D position, 2D velocity and 2D size (width and height) of person at . Typically, and correspond to the center and size of a bounding box of a person while is the velocity of . Let be the complete set of continuous latent variables at , where denotes the transpose operator. Without loss of generality, in this paper a person is characterized by the bounding box of her/his head, and the center of this bounding box is assumed to be the location of the corresponding speech source.
We now define the observations. Let and be realizations of the visual and audio random observed variables and , respectively. A visual observation, , corresponds to the bounding box of a detected face and it is the concatenation of the bounding-box center, width and height, , and of a feature vector that describes the photometric content of that bounding box, i.e. a dimensional face descriptor (Section VII-C). An audio observation, , corresponds to an inter-microphone spectral feature, where is a frequency subband index. Let’s assume that there are subbands, that subbands are active at , i.e. with sufficient energy, and that there are frequencies per subband. Hence, corresponds to complex-valued Fourier coefficients which are represented by their real and imaginary parts. In practice, the inter-microphone features contain audio-source localization information and are obtained by applying the multichannel audio processing method described in detail below (Section VII-B). Note that both the number of visual and of audio observations at , and , vary over time. Let denote the set of observations from 1 to , where .
We now define the assignment variables of the proposed latent variable model. There is an assignment variable (a discrete random variable) associated with each observed variable. Namely, let and be associated with and with , respectively, e.g. denotes the probability of assigning visual observation at to person . Note that and are the probabilities of assigning visual observation and audio observation to none of the persons, or to nobody. In the visual domain, this may correspond to a false detection while in the audio domain this may correspond to an audio signal that is not uttered by a person. There is an additional assignment variable, that is associated with the audio generative model described in Section III-D. The assignment variables are jointly denoted with .
III-B The Filtering Distribution
Recall that the objective is to estimate the positions and velocities of participants (multiple person tracking) and, possibly, to estimate their speaking status (speaker diarization). The audiovisual multiple-person tracking problem is cast into the problems of estimating the filtering distribution and of inferring the state variable . Subsequently, speaker diarization can be obtained from audio-feature-to-person information via the estimation of the assignment variables (Section VI-C).
We reasonably assume that the state variable follows a first-order Markov model, and that the visual and audio observations only depend on and . By applying Bayes’ rule, one can then write the filtering distribution of as:
(1)
with:
(2)  
(3)  
(4) 
Eq. (2) is the joint (audiovisual) observed-data likelihood. Visual and audio observations are assumed to be conditionally independent given , and their distributions will be detailed in Sections III-C and III-D, respectively. (We will see that depends on but depends neither on nor on , and depends on and but not on .) Eq. (3) is the prior distribution of the assignment variable. The observation-to-person assignments are assumed to be a priori independent so that the probabilities in (3) factorize as:
(5)  
(6)  
(7) 
It makes sense to assume that these distributions do not depend on and that they are uniform. The following notations are introduced: and . The probability is discussed below (Section III-D).
Eq. (4) is the predictive distribution of given the past observations, i.e. from 1 to . The state dynamics in (4) are modeled with a linear-Gaussian first-order Markov process. Moreover, it is assumed that the dynamics are independent over speakers:
(8) 
where is the dynamics’ covariance matrix and is the state transition matrix, given by:
As described in Section IV below, an important feature of the proposed model is that the predictive distribution (4) at frame is computed from the state dynamics model (8) and an approximation of the filtering distribution at frame , which also factorizes across speakers. As a result, the computation of (4) factorizes across speakers as well.
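As a reading aid, the structure described by Eqs. (1)-(4) can be sketched as follows. The symbols are labels assumed for this sketch only: $\mathbf{s}_t$ stands for the state, $\mathbf{z}_t$ for the assignment variables and $\mathbf{o}_{1:t}$ for the observations up to the current frame. The assignment prior is written conditionally on the state because the audio term (16) involves the state variable.

```latex
\begin{align*}
p(\mathbf{z}_t, \mathbf{s}_t \mid \mathbf{o}_{1:t})
  &\propto p(\mathbf{o}_t \mid \mathbf{z}_t, \mathbf{s}_t)\,
           p(\mathbf{z}_t \mid \mathbf{s}_t)\,
           p(\mathbf{s}_t \mid \mathbf{o}_{1:t-1}), \\
p(\mathbf{s}_t \mid \mathbf{o}_{1:t-1})
  &= \int p(\mathbf{s}_t \mid \mathbf{s}_{t-1})\,
          p(\mathbf{s}_{t-1} \mid \mathbf{o}_{1:t-1})\,
          \mathrm{d}\mathbf{s}_{t-1}.
\end{align*}
```

The three factors of the first line play the roles of the observed-data likelihood, the assignment prior and the predictive distribution, in that order. The second line is the first-order Markov prediction, whose integral is the quantity that becomes tractable once the previous filtering distribution is approximated by a Gaussian.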
III-C The Visual Observation Model
As already mentioned above (Section III-A), a visual observation consists of the center, width and height of a bounding box, namely , as well as of a feature vector describing the region inside the bounding box. Since the velocity is not observed, a projection matrix is used to project onto . Assuming that the visual observations available at are independent, and that the appearance of a person is independent of his/her position in the image, the visual likelihood in (2) is defined as:
(9) 
where the observed bounding-box centers, widths, heights, and feature vectors are drawn from the following distributions:
(10)  
(11) 
where is a covariance matrix quantifying the measurement error in the bounding-box center and size, is the uniform distribution with being the support volume of the variable space, is the Bhattacharyya distribution with parameter , and is a set of prototype feature vectors that model the appearances of the persons.
III-D The Audio Observation Model
It is well established in the multichannel audio signal processing literature that inter-microphone spectral features encode sound-source localization information [15, 12, 13]. Therefore, observed audio features, are obtained by considering all microphone pairs of a microphone array. Audio observations depend neither on (size of the bounding box) nor on (velocity). Hence one can replace with in the equations below, with . By assuming independence across frequency subbands (indexed by ), the audio likelihood in (2) can be factorized as:
(12) 
While the inter-microphone spectral coefficients contain localization information, in complex acoustic environments there is no explicit function that maps source locations onto inter-microphone spectral features. Moreover, this mapping is nonlinear. We therefore resort to learning a regression function that models this relationship. We propose to use the piecewise-linear regression of [27], which belongs to the mixture of experts (MOE) class of models. For that purpose we consider a training set of audio features and their associated source locations, and let . The joint probability of is written as:
(13)
Assuming Gaussian variables, we have , , and , where matrix and vector characterize the th affine transformation that maps the space of source locations onto the space spanned by inter-microphone subband spectral features, is the associated covariance matrix, and is drawn from a Gaussian mixture model with components, each component being characterized by , and . The parameter set of this model is:
(14)
These parameters can be estimated via a closed-form EM procedure from a training dataset, e.g. (please consult [27, 15] and Section VII-B below for more details). One should notice that there is a parameter set for each subband , , hence there are models that need to be trained in our case. It follows that (12) can be written as:
(15)  
The right-hand side of (7) can now be written as:
(16) 
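The piecewise-affine audio model can be illustrated with a minimal sketch of a mixture of affine regressions with Gaussian gating, which is the general form of the model class referred to above. It computes the posterior weight of each affine component given a source location (a normalized, prior-weighted sum of Gaussians, as in the denominator mentioned for (16)) and the resulting expected audio feature. All names, dimensions and parameter values are hypothetical; in practice the parameters would come from the EM training procedure of [27, 15].

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + maha)


def gating_weights(x, priors, centers, covs):
    """Posterior probability of each affine component given location x."""
    log_w = np.array([
        np.log(p) + gaussian_logpdf(x, c, s)
        for p, c, s in zip(priors, centers, covs)
    ])
    log_w -= log_w.max()          # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()


def expected_feature(x, priors, centers, covs, slopes, offsets):
    """Gate-weighted average of the affine predictions A_r x + b_r."""
    w = gating_weights(x, priors, centers, covs)
    return sum(w_r * (a_r @ x + b_r)
               for w_r, a_r, b_r in zip(w, slopes, offsets))


# Toy model (assumed values): two components, 2D locations, 3D features.
priors = np.array([0.5, 0.5])
centers = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
covs = [np.eye(2), np.eye(2)]
slopes = [np.ones((3, 2)), -np.ones((3, 2))]
offsets = [np.zeros(3), np.ones(3)]

x = np.array([1.0, 0.0])
weights = gating_weights(x, priors, centers, covs)
feature = expected_feature(x, priors, centers, covs, slopes, offsets)
```

In this toy configuration the location lies close to the first component center, so the first affine map dominates the prediction.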
IV Variational Approximation
Direct estimation of the filtering distribution is intractable. In particular, the integral (4) does not have an analytic solution. Consequently, evaluating expectations over this distribution is intractable as well. We overcome this problem via variational inference and an associated closed-form EM solver [28, 29]. More precisely is approximated with the following factorized form:
(17) 
which implies
(18) 
where and are the variational posterior probabilities of assigning visual observation to person and audio observation to person , respectively. The proposed variational approximation (17) amounts to breaking the conditional dependence of and with respect to , which causes the computational intractability. Note that the visual, , and audio, , , assignment variables are independent, that the assignment variables for each observation are also independent, and that and are conditionally dependent on the audio observation. This factorized approximation makes the calculation of tractable. The optimal solution is given by an instance of the variational expectation maximization (VEM) algorithm [28, 29], which alternates between two steps:
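E-step: the variational posterior of each latent variable is estimated by taking the expectation of the complete-data log-likelihood over the remaining latent variables, and
M-step: model parameters are estimated by maximizing the variational expected complete-data log-likelihood.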
In the case of the proposed model the latent variable log-posteriors are written as:
(19)  
(20)  
(21) 
A remarkable consequence of the factorization (17) is that is replaced with ; consequently, (4) becomes:
(22) 
It is now assumed that the variational posterior distribution is Gaussian with mean and covariance :
(23) 
By substituting (23) into (22) and combining it with (8), the predictive distribution (22) becomes:
(24) 
Note that the above distribution factorizes across persons. Now that all the factors in (1) have tractable expressions, a VEM algorithm can be applied.
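Combining a Gaussian posterior at the previous frame with linear-Gaussian dynamics gives a Gaussian predictive distribution, by the standard result for linear transformations of Gaussian variables. The sketch below illustrates this prediction step for one person. The state ordering (center, size, velocity), the constant-velocity transition matrix and all numerical values are assumptions made for the example, not the exact quantities used in the paper.

```python
import numpy as np


def constant_velocity_matrix():
    """Transition matrix for an assumed state (cx, cy, w, h, vx, vy).

    The center is advanced by the velocity; size and velocity are kept.
    """
    d = np.eye(6)
    d[0, 4] = 1.0  # cx <- cx + vx
    d[1, 5] = 1.0  # cy <- cy + vy
    return d


def predict(mean_prev, cov_prev, transition, dynamics_cov):
    """Mean and covariance of the Gaussian predictive distribution."""
    mean_pred = transition @ mean_prev
    cov_pred = transition @ cov_prev @ transition.T + dynamics_cov
    return mean_pred, cov_pred


# Hypothetical posterior at the previous frame.
mean_prev = np.array([100.0, 50.0, 20.0, 30.0, 2.0, -1.0])
cov_prev = np.eye(6)
dynamics_cov = 0.1 * np.eye(6)

mean_pred, cov_pred = predict(mean_prev, cov_prev,
                              constant_velocity_matrix(), dynamics_cov)
```

Because the dynamics are assumed independent across persons, this step can be run separately for each tracked person.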
V Variational Expectation Maximization
The proposed VEM algorithm iterates between an E-S-step, an E-Z-step, and an M-step on the following grounds.
V-1 E-S-step
The per-person variational posterior distribution of the state vector is evaluated by developing (19). The complete-data likelihood in (19) is the product of (2), (3) and (24). We thus first sum the logarithms of (2), of (3) and of (24). Then we ignore the terms that do not involve . Evaluation of the expectation over all the latent variables except yields the following Gaussian distribution:
(25) 
with:
(26)  
(27)  
where and are computed in the EZstep below. A key point is that, because of the recursive nature of the formulas above, it is sufficient to make the Gaussian assumption at , i.e. , whose parameters may be easily initialized. It follows that is Gaussian at each frame.
We note that both (26) and (27) are composed of three terms: the first term (#1), second term (#2) and third term (#3) of (26) correspond to the visual, audio, and model dynamics contributions to the precision, respectively. Recall that covariance is associated with the visual observed variable in (10). Matrices and vectors characterize the piecewise affine mappings from the space of person locations to the space of audio features, and covariances capture the errors that are associated with both audio measurements and the piecewise affine approximation in (15). A similar interpretation holds for the three terms of (27).
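The additive structure of the precision can be illustrated with a generic information-form fusion of a Gaussian prior with softly weighted linear-Gaussian observations. This is a sketch under stated assumptions rather than a transcription of (26) and (27): each observation is described by a hypothetical tuple (weight, H, R, y), meaning that y is modeled as H s plus noise of covariance R, and the weight plays the role of an assignment posterior from the E-Z-step. A visual term would use the projection onto center and size, and an audio term would use an affine map restricted to the position (with the offset subtracted from the feature).

```python
import numpy as np


def fuse(prior_mean, prior_cov, observations):
    """Posterior mean and covariance in information (precision) form.

    observations: iterable of (weight, H, R, y) tuples (assumed layout).
    """
    precision = np.linalg.inv(prior_cov)        # dynamics contribution
    information = precision @ prior_mean
    for weight, h, r, y in observations:
        r_inv = np.linalg.inv(r)
        precision = precision + weight * (h.T @ r_inv @ h)
        information = information + weight * (h.T @ r_inv @ y)
    cov = np.linalg.inv(precision)
    return cov @ information, cov


# Toy 2D example: one fully assigned observation of the whole state.
prior_mean = np.zeros(2)
prior_cov = np.eye(2)
obs = [(1.0, np.eye(2), np.eye(2), np.array([2.0, 2.0]))]
post_mean, post_cov = fuse(prior_mean, prior_cov, obs)

# With a zero weight the observation is ignored and the prior is returned.
same_mean, same_cov = fuse(prior_mean, prior_cov,
                           [(0.0, np.eye(2), np.eye(2), np.array([2.0, 2.0]))])
```

The design point is that an observation with a small assignment weight contributes little to either the precision or the mean, which is how intermittent or missing modalities are absorbed without special cases.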
V-2 E-Z-step
By developing (20), and following the same reasoning as above, we obtain the following closed-form expression for the variational posterior distribution of the visual assignment variable:
(28) 
where is given by:
Similarly, for the variational posterior distribution of the audio assignment variables, developing (21) leads to:
(29) 
where is given by:
(30)  
To obtain (30), an additional approximation is made. Indeed, the logarithm of (16) is part of the complete-data log-likelihood and the denominator of (16) contains a weighted sum of Gaussian distributions. Taking the expectation of this term is not tractable because of the denominator. Based on the dynamical model (8), we replace the state variable in (16) with a “naive” estimate predicted from the position and velocity inferred at : .
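The assignment posteriors have the generic form of a normalized product of a likelihood term and a prior, taken over all persons plus the "nobody" class. The sketch below shows this normalization for visual observations, computed in the log domain for numerical stability. The array layout, the function name and the toy values are assumptions; the per-person log-likelihood terms are taken as given inputs rather than derived from (28).

```python
import numpy as np


def assignment_posteriors(log_lik_persons, log_lik_nobody, priors):
    """Posterior assignment probabilities, one row per observation.

    log_lik_persons: (M, N) log-likelihood of observation m under person n.
    log_lik_nobody:  scalar or (M,) log-density of the "nobody" class
                     (e.g. the log of a uniform density).
    priors:          (N + 1,) prior probabilities, index 0 being "nobody".
    """
    num_obs, num_persons = log_lik_persons.shape
    log_post = np.empty((num_obs, num_persons + 1))
    log_post[:, 0] = log_lik_nobody + np.log(priors[0])
    log_post[:, 1:] = log_lik_persons + np.log(priors[1:])
    log_post -= log_post.max(axis=1, keepdims=True)  # avoid underflow
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


# Toy case: two observations, two persons, uniform priors.
log_lik = np.array([[-1.0, -50.0],     # close to person 1
                    [-60.0, -70.0]])   # far from everybody
posteriors = assignment_posteriors(log_lik, -10.0, np.full(3, 1.0 / 3.0))
```

The second observation ends up assigned to "nobody", which is the behavior expected for a false detection.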
V-3 M-step
The entries of covariance matrix of the state dynamics, , are the only parameters that need to be estimated. To this aim, we develop and ignore the terms that do not depend on . We obtain:
which can be further developed as:
(31) 
Hence, by differentiating (31) with respect to and equating to zero, we obtain:
(32) 
VI Algorithm Implementation
The VEM procedure above will be referred to as VAVIT, which stands for variational audiovisual tracking, and pseudocode is shown in Algorithm 1. In theory, the order in which the two expectation steps are executed is not important. In practice, the issue of initialization is crucial. In our case, it is more convenient to start with the E-Z-step rather than with the E-S-step because the former is easier to initialize than the latter (see below). We start by explaining how the algorithm is initialized at and then how the E-Z-step is initialized at each iteration. Next, we explain in detail the birth process. An interesting feature of the proposed method is that it allows one to estimate who speaks when, i.e. speaker diarization, which is then explained in detail.
VI-A Initialization
At one must provide initial values for the parameters of the distributions (25), namely and for all . These parameters are initialized as follows. The means are initialized at the image center and the covariances are given very large values, such that the variational distributions are non-informative. Once these parameters are initialized, they remain constant for a few frames, i.e. until the birth process is activated (see Section VI-B below).
As already mentioned, it is preferable to start with the E-Z-step rather than with the E-S-step because the initialization of the former is straightforward. Indeed, the E-S-step (Section V) requires current values for the posterior probabilities (28) and (30), which are estimated during the E-Z-step and which are both difficult to initialize. Conversely, the E-Z-step only requires current mean values, , which can be easily initialized by using the model dynamics (8), namely .
VI-B Birth Process
We now explain in detail the birth process, which is executed at the start of the tracking to initialize a latent variable for each detected person, as well as at any time to detect new persons. The birth process considers consecutive visual frames. At , with , we consider the set of visual observations assigned to from to , namely observations whose posteriors (28) are maximized for (at initialization all the observations are in this case). We then build observation sequences from this set, namely sequences of the form , where indexes the set of observations at assigned to and indexes the set of all such sequences. Notice that the birth process only uses the bounding-box center, width and height, , and that the descriptor is not used. Hence the birth process is only based on the smoothness of an observed sequence of bounding boxes. Let’s consider the marginal likelihood of a sequence , namely:
(33)  
where is the latent variable already defined and indexes the set . All the probability distributions in (33) were already defined, namely (8) and (10), with the exception of . Without loss of generality, we can assume that the latter is a normal distribution centered at and with a large covariance. Therefore, the evaluation of (33) yields a closed-form expression for . A sequence generated by a person is likely to be smooth and hence is high, while for a non-smooth sequence the marginal likelihood is low. A newborn person is therefore created from a sequence of observations if , where is a user-defined parameter. As just mentioned, the birth process is executed to initialize persons as well as over time to add new persons. In practice, in (33) we set B = 3 and hence, from t = 1 to t = 4, all the observations are initially assigned to .
VI-C Speaker Diarization
Speaker diarization consists of assigning temporal segments of speech to persons [30]. We introduce a binary variable such that if person speaks at time and otherwise. Traditionally, speaker diarization is based on the following assumptions. First, it is assumed that speech signals are sparse in the time-frequency domain. Second, it is assumed that each time-frequency point in such a spectrogram corresponds to a single speech source. Therefore, the proposed speaker diarization method is based on assigning time-frequency points to persons.
In the case of the proposed model, speaker diarization can be coarsely inferred from frequency subbands in the following way. The posterior probability that the speech signal available in the frequency subband at frame was uttered by person , given the audio observation , is:
(34) 
where is the audio assignment variable and is the affine-mapping assignment variable defined in Section III-D. Using the variational approximation (29), this probability becomes:
(35) 
and by accumulating probabilities over all the frequency subbands, we obtain the following:
(36) 
where is a user-defined threshold. Note that there is no dynamic model associated with diarization: is estimated independently at each frame and for each person. More sophisticated diarization models can be found in [31, 17].
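The frame-wise diarization rule can be sketched as an accumulation of audio assignment posteriors over the active subbands followed by a threshold. The sketch assumes that the posteriors have already been summed over the affine-mapping index and that the accumulated score is normalized by the number of active subbands; this normalization, the array layout and the threshold value are assumptions made for the example.

```python
import numpy as np


def speaking_status(audio_posteriors, threshold):
    """Binary speaking status per person at one frame.

    audio_posteriors: (K, N + 1) array; row k holds the posterior of
        assigning active subband k to "nobody" (column 0) or to person n.
    threshold: user-defined value compared with the accumulated score.
    Returns (status, score), both of length N.
    """
    num_persons = audio_posteriors.shape[1] - 1
    if audio_posteriors.shape[0] == 0:
        # No active subband: everybody is considered silent.
        return np.zeros(num_persons, dtype=int), np.zeros(num_persons)
    score = audio_posteriors[:, 1:].mean(axis=0)
    return (score >= threshold).astype(int), score


# Hypothetical posteriors: four active subbands, two persons.
beta = np.array([[0.1, 0.8, 0.1],
                 [0.2, 0.7, 0.1],
                 [0.1, 0.6, 0.3],
                 [0.6, 0.3, 0.1]])
status, score = speaking_status(beta, threshold=0.5)
```

Since each frame is processed independently, the resulting status sequence can flicker; a temporal model would be needed to smooth it.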
VII Experiments
VII-A Dataset
We use the AVDIAR dataset [17] to evaluate the performance of the proposed audiovisual tracking method. This dataset is challenging in terms of audiovisual analysis. There are several participants involved in informal conversations while wandering around. They stay between two and four meters away from the audiovisual recording device. They take speech turns and often there are speech overlaps. They turn their faces away from the camera. The dataset is annotated as follows (please consult https://team.inria.fr/perception/avdiar/ for a detailed description of the dataset). The visual annotations comprise the centers, widths and heights of two bounding boxes for each person and in each video frame, a face bounding box and an upper-body bounding box. An identity (a number) is associated with each person throughout the entire dataset. The audio annotations comprise the speech status of each person over time (speaking or silent), with a minimum speech duration of 0.2 seconds. The audio source locations correspond to the centers of the face bounding boxes.
The dataset was recorded with a sensor composed of two cameras and six microphones, but only one camera is used in the experiments described below. The videos were recorded at 25 FPS. The frame resolution is of pixels corresponding to a field of view of . The microphone signals are sampled at 16000 Hz. The dataset was recorded in two different rooms, living-room and meeting-room, e.g. Fig. 1 and Fig. 2. These two rooms have quite different lighting conditions and acoustic properties (size, presence of furniture, background noise, etc.). Altogether there are 18 sequences associated with living-room (26928 video frames) and 6 sequences with meeting-room (6031 video frames). Additionally, there are two training datasets, and (one for each room), that contain input-output pairs of multichannel audio features and audio-source locations, which allow the parameters (14) to be estimated using the method of [15]. This yields a mapping between source locations in the image plane, , and audio features, . Audio feature extraction is described in detail below.
VII-B Audio Features
The STFT (short-time Fourier transform) [16] is applied to each microphone signal using a 16 ms Hann window (256 audio samples per window) with an 8 ms shift (50% overlap), leading to 128 frequency bins and to 125 audio FPS. Inter-channel features are then computed using [14]. These features, referred to as direct-path relative transfer function (DPRTF) features, are robust against both background noise and reverberation: because they encode the direct path from the audio source to the microphones, they do not depend on the room acoustic properties. The audio features are averaged over five audio frames in order to properly align them with the video frames (125 audio FPS versus 25 video FPS). The feature vector is then split into subbands, each subband being composed of a subset of the frequencies; subbands with low energy are disregarded. This yields the set of audio observations at each frame (see Section III-D).
VII-C Visual Processing
Because in AVDIAR people do not necessarily face the camera, face detection is not very robust. Instead we use a body-pose detector [32] from which we infer a full-body bounding box and a head bounding box. We use the CNN-based person re-identification method of [33] to extract an embedding from the full-body bounding box. This yields the feature vectors (Section III-C). Similarly, the center, width and height of the head bounding box yield the observations at each frame.
VII-D Experimental Settings
One interesting feature of the proposed tracker is its flexibility in dealing with visual data only, audio data only, or both visual and audio data. Moreover, the algorithm is able to switch automatically between unimodal and multimodal tracking. In order to quantitatively assess the performance and merits of each one of these variants we used two configurations:

Full camera field of view (FFOV): The entire horizontal field of view of the camera, i.e. 1920 pixels, or 97°, is used, such that visual and audio observations, if any, are simultaneously available, and

Partial camera field of view (PFOV): The horizontal field of view is restricted to the central 768 pixels (or 49°), leaving two blind strips (576 pixels each) on its left- and right-hand sides; the audio field of view remains unchanged, 1920 pixels, or 97°.
The PFOV configuration allows us to test scenarios in which a participant may leave the camera field of view and still be heard. Notice that since ground-truth annotations are available for the full field of view, it is possible to assess the performance of the tracker using audio observations only, as well as to analyse the behavior of the tracker when it switches from audio-only tracking to audiovisual tracking.
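The geometry of the two configurations can be sketched in a few lines. This is only an illustration under stated assumptions: the function and constant names are hypothetical, pixel coordinates are taken to be zero-based, and the visible strip is assumed to be centered, which follows from the two 576-pixel blind strips.

```python
# Illustrative sketch of the FFOV/PFOV visibility test.
# All names are hypothetical; only the pixel widths come from the text.
FULL_WIDTH_PX = 1920      # full horizontal field of view (FFOV), also the audio field of view
VISIBLE_WIDTH_PX = 768    # central strip kept in the PFOV configuration
BLIND_STRIP_PX = (FULL_WIDTH_PX - VISIBLE_WIDTH_PX) // 2  # 576 pixels on each side


def is_visible(x_px, partial=True):
    """Return True if the horizontal pixel coordinate lies in the camera view."""
    if not 0 <= x_px < FULL_WIDTH_PX:
        return False
    if not partial:
        return True
    return BLIND_STRIP_PX <= x_px < BLIND_STRIP_PX + VISIBLE_WIDTH_PX


def available_modalities(x_px, speaking, partial=True):
    """Modalities that can provide observations for a person located at x_px."""
    modalities = set()
    if is_visible(x_px, partial):
        modalities.add("visual")
    # The audio field of view always spans the full 1920 pixels.
    if speaking and 0 <= x_px < FULL_WIDTH_PX:
        modalities.add("audio")
    return modalities
```

Under these assumptions, a speaking person at x = 100 is observable through audio only in PFOV, which is exactly the audio-only tracking case evaluated in the blind strips.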
VII-E Evaluation Metrics
We used standard multi-object tracking (MOT) metrics to quantitatively evaluate the performance of the proposed tracking algorithm. The multi-object tracking accuracy (MOTA) is the most commonly used metric for MOT. It combines false positives (FP), false negatives (FN, i.e. missed tracks) and identity switches (IDs), and is defined as:
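$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDs}_t\right)}{\sum_t \mathrm{GT}_t} \tag{37}$$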
where GT stands for the ground-truth person trajectories, as annotated in the AVDIAR dataset. After comparison with the GT trajectories, each estimated trajectory can be classified as mostly tracked (MT), partially tracked (PT) or mostly lost (ML). If a trajectory is covered by a correct estimation at least 80% of the time, it is considered as MT. Similarly, it is considered as ML if it is covered less than 20% of the time. In our experiments, the MT and ML scores represent the percentages of trajectories that are considered as mostly tracked and mostly lost, respectively. In addition, the number of track fragmentations (FM) counts how many times the estimated trajectories are discontinuous (whereas the corresponding GT trajectories are continuous).
In our experiments, the overlap threshold above which a ground truth is considered to be covered by an estimate is set to 0.1. In the PFOV configuration, we also need to evaluate audio-only tracking, i.e. when the speakers are in the blind areas. As mentioned before, audio localization is less accurate than visual localization. Therefore, for evaluating the audio-only tracker we relax by a factor of two the expected localization accuracy with respect to the audiovisual localization accuracy.
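The metrics above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: the function names are hypothetical, MOTA is computed in its usual form (one minus the summed FP, FN and IDs counts divided by the total number of ground-truth objects), and the 80% and 20% coverage thresholds are the conventional MOT values, taken here as an assumption.

```python
def mota(fp, fn, ids, gt):
    """MOTA from per-frame counts; each argument is a sequence with one entry per frame."""
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("no ground-truth objects")
    return 1.0 - (sum(fp) + sum(fn) + sum(ids)) / total_gt


def classify_trajectory(covered_frames, total_frames, mt_ratio=0.8, ml_ratio=0.2):
    """Label a ground-truth trajectory as 'MT', 'PT' or 'ML'.

    mt_ratio and ml_ratio are assumed thresholds (conventional MOT values).
    """
    if total_frames <= 0:
        raise ValueError("trajectory must span at least one frame")
    ratio = covered_frames / total_frames
    if ratio >= mt_ratio:
        return "MT"
    if ratio < ml_ratio:
        return "ML"
    return "PT"


def percentage(labels, wanted):
    """Percentage of trajectories carrying the label `wanted`."""
    return 100.0 * sum(label == wanted for label in labels) / len(labels)
```

For instance, with 40 ground-truth trajectories of which 37 are labeled MT, `percentage` returns 92.5, which is how an MT score of 92.5% arises.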
VII-F Benchmarking with Baseline Methods
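MOT scores for the livingroom sequences, full camera field of view (FFOV).
| Method | MOTA(↑) | FP(↓) | FN(↓) | IDs(↓) | FM(↓) | MT(↑) | ML(↓) |
|---|---|---|---|---|---|---|---|
| ASVAPF [4] | 10.37 | 44.64% | 43.95% | 732 | 918 | 20% | 7.5% |
| AVMSSMCPHD [7] | 18.96 | 8.13% | 72.09% | 581 | 486 | 17.5% | 52.5% |
| OBVT [8] | 96.32 | 1.77% | 1.79% | 80 | 131 | 92.5% | 0% |
| VAVIT (proposed) | 96.03 | 1.85% | 2.0% | 86 | 152 | 92.5% | 0% |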
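MOT scores for the meetingroom sequences, full camera field of view (FFOV).
| Method | MOTA(↑) | FP(↓) | FN(↓) | IDs(↓) | FM(↓) | MT(↑) | ML(↓) |
|---|---|---|---|---|---|---|---|
| ASVAPF [4] | 62.43 | 18.63% | 17.19% | 297 | 212 | 70.59% | 0% |
| AVMSSMCPHD [7] | 28.48 | 0.93% | 69.68% | 155 | 60 | 0% | 52.94% |
| OBVT [8] | 98.50 | 0.25% | 1.11% | 25 | 10 | 100.00% | 0% |
| VAVIT (proposed) | 98.16 | 0.38% | 1.27% | 32 | 15 | 100.00% | 0% |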
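MOT scores for the livingroom sequences, partial camera field of view (PFOV).
| Method | MOTA(↑) | FP(↓) | FN(↓) | IDs(↓) | FM(↓) | MT(↑) | ML(↓) |
|---|---|---|---|---|---|---|---|
| ASVAPF [4] | 17.82 | 36.86% | 42.88% | 1722 | 547 | 32.50% | 7.5% |
| AVMSSMCPHD [7] | 20.61 | 5.54% | 72.45% | 989 | 471 | 12.5% | 40% |
| OBVT [8] | 66.39 | 0.48% | 32.95% | 129 | 203 | 45% | 7.5% |
| VAVIT (proposed) | 69.62 | 8.97% | 21.18% | 152 | 195 | 70% | 5% |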
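MOT scores for the meetingroom sequences, partial camera field of view (PFOV).
| Method | MOTA(↑) | FP(↓) | FN(↓) | IDs(↓) | FM(↓) | MT(↑) | ML(↓) |
|---|---|---|---|---|---|---|---|
| ASVAPF [4] | 29.04 | 23.05% | 45.19% | 461 | 246 | 29.41% | 17.65% |
| AVMSSMCPHD [7] | 26.95 | 1.05% | 70.62% | 234 | 64 | 5.88% | 52.94% |
| OBVT [8] | 64.24 | 0.43% | 35.18% | 24 | 25 | 36.84% | 15.79% |
| VAVIT (proposed) | 65.27 | 5.07% | 29.5% | 26 | 26 | 47.37% | 10.53% |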
To quantitatively evaluate its performance, we benchmarked the proposed method against two state-of-the-art audiovisual tracking methods. The first one is the audio-assisted video adaptive particle filtering (ASVAPF) method of [4], and the second one is the sparse audiovisual mean-shift sequential Monte-Carlo probability hypothesis density (AVMSSMCPHD) method of [7]. The method of [4] takes as input a video and a sequence of sound locations. The sound locations are used to reshape the typical Gaussian noise distribution of the particles in the propagation step, and the particles are then used to weight the observation model. The method of [7] uses audio information to improve the performance and robustness of a visual SMCPHD filter. Both methods show good performance in meeting configurations, e.g. the AV16.3 dataset [23], which was recorded with a circular microphone array placed on a table at the center of the room and with several cameras fixed to the ceiling. The scenarios associated with AV16.3 are somewhat artificial in the sense that the participants speak simultaneously and continuously. This is in contrast with the AVDIAR recordings, where people take speech turns in informal conversations.
Since both [4] and [7] require input from a multiple sound-source localization (SSL) algorithm, the multi-speaker localization method proposed in [14] is used to provide input to [4] and [7] (the authors of [4] and [7] kindly provided their software packages). We also compare the proposed method with a visual multiple-person tracker, more specifically the online Bayesian variational tracker (OBVT) of [8], which is based on variational inference similar to the one presented in this paper. In [8] visual observations were provided by color histograms. In our benchmark, for the sake of fairness, the proposed tracker and [8] share the same visual observations (Section VII-C).
The MOT scores obtained with these methods as well as with the proposed method are reported in the four tables above (livingroom FFOV, meetingroom FFOV, livingroom PFOV and meetingroom PFOV, in this order). The symbols ↑ and ↓ indicate higher the better and lower the better, respectively. The tables report results obtained with the livingroom and meetingroom sequences and for the two configurations mentioned above, i.e. full and partial camera fields of view. The most informative metric is MOTA (MOT accuracy), and one can easily see that both [8] and the proposed method outperform the other two methods. The poorer performance of both [4] and [7] in all the configurations is largely explained by the fact that these two methods assume that audio and visual observations are simultaneously available. In particular, [4] is not robust against visual occlusions, which leads to poor IDs (identity switches) scores.
The AVMSSMCPHD method [7] uses audio information in order to count the number of speakers: the algorithm detects multiple speakers whenever multiple audio sources are detected. In practice, the algorithm rarely finds multiple speakers and in most cases it tracks only one speaker. This explains why both the FN (false negatives) and IDs (identity switches) scores are high, e.g. in the livingroom FFOV, meetingroom FFOV and livingroom PFOV tables.
One can notice that in the case of FFOV, [8] and the proposed method yield similar results in terms of MOT scores: they both exhibit low FP, FN and IDs scores and, consequently, high MOTA scores. Moreover, they have very good MT, PT and ML scores (e.g. for livingroom, out of 40 trajectories, 37 are mostly tracked, 3 are partially tracked, and none is mostly lost). As expected, the inferred trajectories are more accurate for visual tracking (whenever visual observations are available) than for audiovisual tracking: indeed, the latter fuses visual and audio observations, which slightly degrades the accuracy because audio localization is less accurate than visual localization.
As for the PFOV configuration (livingroom PFOV and meetingroom PFOV tables), the proposed algorithm yields the best MOTA scores for both the meeting room and the living room. Both [4] and [7] have difficulties when visual information is not available, e.g. in the left- and right-hand blind strips on both sides of the restricted field of view: both algorithms fail to track speakers when they walk outside the visual field of view. While [7] is able to detect a speaker when he or she re-enters the visual field of view, [4] is not. By construction, the visual tracking algorithm of [8] fails in the absence of visual observations.
VII-G Audio-Visual Tracking Examples
We now provide and discuss results obtained with three recordings: one FFOV sequence, Seq134PS2M1 (Fig. 1), and two PFOV sequences, Seq192PS1M1 (Fig. 2) and Seq221PS0M1 (Fig. 3); see https://team.inria.fr/perception/research/variational_av_tracking/. These sequences are challenging in terms of audiovisual tracking: participants are seated, then they stand up or wander around. Some participants take speech turns and interrupt each other, while other participants remain silent.
The first row of Fig. 1 shows four frames sampled from a video recording with two, then four participants, labeled 1, 2, 3, and 4. Green digits designate participants detected as speakers and red digits correspond to participants detected as listeners. The second row shows ellipses of constant density (visual covariances), i.e. the inverse of precision term #1 in (26). Notice that in the second frame the detection of person 3, who turns his back to the camera, was missed. The third row shows the audio covariances, i.e. the inverse of precision term #2 in (26). The audio covariances are much larger than the visual ones since audio localization is less accurate than visual localization. There are two distinct audio sources close to each other that are correctly detected, localized and assigned to persons 1 and 4, so that it is still possible to assign audio activities to both 1 and 4. The fourth row shows the contribution of the dynamic model to the covariance, i.e. the inverse of precision term #3 in (26). Notice that these “dynamic” covariances are small in comparison with the “observation” covariances, which reflects a smooth trajectory and ensures tracking continuity when audio or visual observations are either weak or totally absent.
Fig. 2 shows a tracking example with a PFOV (partial camera field of view) configuration. In this case, audio and visual observations are rarely available simultaneously. The independence of the visual and audio observation models and their fusion within the same dynamic model yield robust tracking results.
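The interplay between the three covariance contributions displayed in Fig. 1 can be illustrated with a one-dimensional toy computation. The sketch assumes, as is standard for Gaussian models, that the visual, audio and dynamic precisions add up and that the resulting variance is the inverse of their sum; the function name and the numerical values are purely illustrative and are not taken from (26) or from the experiments.

```python
def fused_variance(variances):
    """Combine independent 1-D Gaussian contributions in information form.

    Each contribution has precision 1/variance; precisions add, and the
    fused variance is the inverse of their sum.
    """
    variances = list(variances)
    if not variances:
        raise ValueError("at least one contribution is required")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    return 1.0 / sum(1.0 / v for v in variances)


# Illustrative values only (squared pixels), not taken from the paper.
VISUAL_VAR = 4.0     # visual observations: fairly precise
AUDIO_VAR = 100.0    # audio observations: much less precise
DYNAMIC_VAR = 1.0    # dynamic-model contribution: small, smooth motion

all_terms = fused_variance([VISUAL_VAR, AUDIO_VAR, DYNAMIC_VAR])
audio_only = fused_variance([AUDIO_VAR, DYNAMIC_VAR])
```

With these toy numbers the fused variance stays below the dynamic-model variance even when the visual term is absent, which matches the observation that the dynamic model keeps the track continuous when one modality is missing.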
Fig. 3 shows the ground-truth trajectory of a person and the trajectories estimated with the audiovisual tracker [4], with the visual tracker [8], and with the proposed method. The ground-truth trajectory corresponds to a sequence of bounding-box centers. Both [4] and [8] failed to estimate a correct trajectory. Indeed, [4] requires simultaneous availability of audio and visual data, while [8] cannot track outside the visual field of view. Notice the erratic trajectory obtained with [4] in comparison with the smooth trajectories obtained with variational inference, i.e. [8] and the proposed method.
VII-H Speaker Diarization Results
As already mentioned in Section VI-C, speaker diarization information can be extracted from the output of the proposed VAVIT algorithm. Notice that, while audio diarization is an extremely well-investigated topic, audiovisual diarization has received much less attention. An audiovisual diarization method based on a dynamic Bayesian network and applied to video conferencing is proposed in [31]. The method assumes that participants take speech turns, which is an unrealistic hypothesis in the general case. The diarization method of [34] requires audio, depth and RGB data. More recently, [17] proposed a Bayesian dynamic model for audiovisual diarization that takes as input fused audiovisual information. Since diarization is not the main objective of this paper, we only compared our diarization results with [17], which achieves state-of-the-art results, and with the diarization toolkit of [18], which only considers audio information.
[Fig. 3 panels: ground-truth trajectory, ASVAPF [4], OBVT [8], VAVIT (proposed).]
The diarization error rate (DER) is generally used as a quantitative measure. As for MOT, DER combines FP, FN and IDs scores. The NIST-RT evaluation toolbox (https://www.nist.gov/itl/iad/mig/richtranscriptionevaluation) is used. The results obtained with these two methods and with ours are reported in the diarization results table, for both the full field-of-view and partial field-of-view configurations (FFOV and PFOV). The proposed method performs better than the audio-only baseline method [18]. In comparison with [17], the proposed method performs only slightly worse, despite the lack of a diarization dynamic model. Indeed, [17] estimates diarization within a temporal model that takes into account both diarization dynamics and audio activity at each time step, whereas our method is only based on audio activity at each time step.
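To make the combination of FP, FN and IDs in DER concrete, the following frame-level sketch may help. It is a simplification and not the NIST-RT toolbox: it assumes that reference and hypothesis are given as one set of active speaker identities per frame, that the speaker identities have already been matched between the two, and that no forgiveness collar is applied. The function name is hypothetical.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER sketch.

    reference, hypothesis: sequences of sets of active speaker ids, one per frame.
    Returns (false alarm + missed speech + speaker confusion) / total reference speech.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")
    false_alarm = missed = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)
        total += n_ref
        missed += max(0, n_ref - n_hyp)        # FN: reference speakers not reported
        false_alarm += max(0, n_hyp - n_ref)   # FP: reported speakers with no reference
        confusion += min(n_ref, n_hyp) - n_correct  # IDs: speech credited to the wrong person
    if total == 0:
        raise ValueError("reference contains no speech")
    return (false_alarm + missed + confusion) / total
```

A lower value is better; a value of 0 means that every frame carries exactly the reference set of speakers.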