Natural Language Processing Tasks and References
Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons. The main advantage of this method over previous work is that it processes in a principled way speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process, executed at each time slice, and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue is introduced. The proposed method is thoroughly tested and benchmarked with respect to several state-of-the-art diarization algorithms.
In human-computer interaction (HCI) and human-robot interaction (HRI) it is often necessary to solve multi-party dialogue problems. For example, if two or more persons are engaged in a conversation, one important task to be solved, prior to automatic speech recognition (ASR) and natural language processing (NLP), is to correctly assign temporal segments of speech to corresponding speakers. In the speech and language processing literature this problem is referred to as speaker diarization, or “who speaks when?” A number of diarization methods were recently proposed. If only unimodal data are available, the task is extremely difficult. Acoustic data are inherently ambiguous because they contain mixed speech signals emitted by several persons, corrupted by reverberations, by other sound sources and by background noise. Likewise, the detection of speakers from visual data is very challenging and it is limited to lip and facial motion detection from frontal close-range images of people: in more general settings, such as informal gatherings, people are not always facing the cameras, hence lip reading cannot be readily achieved.
Therefore, an interesting and promising alternative consists of combining the merits of audio and visual data. The two modalities provide complementary information and hence audio-visual approaches to speaker diarization are likely to be more robust than audio-only or vision-only approaches. Several audio-visual diarization methods have been investigated over the last decade, e.g. [2, 3, 4, 5, 6, 7]. Diarization is based on audio-visual association, on the premise that a speech signal coincides with the visible face of a speaker. This coincidence must occur both in space and time.
In formal scenarios, e.g. meetings, diarization is facilitated by the fact that participants take speech turns, which results in (i) a clear-cut distinction between speech and non-speech and (ii) the presence of short silent intervals between speech segments. Moreover, participants are seated, or are static, and there are often dedicated close-field microphones and cameras for each participant. In these cases, the task consists of associating audio signals that contain clean speech with frontal images of faces: audio-visual association methods based on temporal coincidence between the audio and visual streams seem to provide satisfactory results, e.g. canonical correlation analysis (CCA) [9, 10, 11] or mutual information (MI) [12, 13, 2, 3]. Nevertheless, temporal association between the two modalities is only effective on the premises that (i) speech segments are uttered by a single person at a time, (ii) single-speaker segments are relatively long, and (iii) speakers continuously face the cameras.
In informal scenarios, e.g. ad-hoc social events, the audio signals are provided by distant microphones, hence the signals are corrupted by environmental noise and by reverberations. Speakers interrupt each other, hence short speech signals may occasionally be uttered simultaneously by different speakers. Moreover, people often wander around, turn their head away from the cameras, may be occluded by other people, suddenly appear or disappear from the cameras’ fields of view, etc. Some of these problems were addressed in the framework of audio-visual speaker tracking, e.g. [14, 15, 16]. Nevertheless, audio-visual tracking is mainly concerned with finding speaker locations and speaker trajectories, rather than solving the speaker diarization problem.
In this paper we propose a novel spatiotemporal diarization model that is well suited for challenging scenarios that consist of several participants engaged in multi-party dialogue. The participants are allowed to move around and to turn their heads towards the other participants rather than facing the cameras. We propose to combine multiple-person visual tracking with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: acoustic spectral features are extracted from a microphone pair, a novel supervised audio-visual alignment technique maps these features onto the image plane such that the audio and visual modalities are represented in the same mathematical space, and a semi-supervised clustering method assigns the acoustic features to visible persons. The main advantage of this method over previous work is twofold: it processes in a principled way speech signals uttered simultaneously by multiple persons, and it enforces spatial coincidence between audio and visual features.
Moreover, we cast the diarization process into a latent-variable temporal graphical model that infers over time both speaker identities and speech turns. This inference is based on combining the output of the proposed audio-visual fusion, which occurs at each time-step, with a dynamic model of the diarization variable (from the previous time-step to the current time-step), i.e. a state transition model. We describe in detail the proposed formulation, which is efficiently solved via an exact inference procedure. We introduce a novel dataset that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue. We thoroughly test and benchmark the proposed method with respect to several state-of-the-art diarization algorithms.
The remainder of this paper is organized as follows. Section II describes the related work. Section III describes in detail the temporal graphical model. Section IV describes visual feature detection and Section V describes the proposed audio features and their detection. Section VI describes the proposed semi-supervised audio-visual association method. The novel audio-visual dataset is presented in detail in Section VII while numerous experiments, tests, and benchmarks are presented in Section VIII. Finally, Section IX draws some conclusions. Videos, Matlab code and additional examples are available online at https://team.inria.fr/perception/avdiarization/.
The task of speaker diarization is to detect speech segments and to group segments that correspond to the same speaker without any prior knowledge about the speakers involved or their number. This can be done using auditory features alone, or a combination of auditory and visual features. Mel-frequency cepstral coefficients (MFCC) are often the representation of choice whenever audio signal segments correspond to a single speaker. The diarization pipeline then consists of splitting the audio frames into speech and non-speech frames, of extracting an MFCC feature vector from each speech frame, and of performing agglomerative clustering such that each cluster found at the end corresponds to a different speaker. Consecutive speech frames are assigned either to the same speaker and grouped into segments, or to different speakers, by using a state transition model, e.g. an HMM.
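The clustering stage of such an audio-only pipeline can be sketched as follows. This is a minimal illustration, not the implementation of any cited method: the function name, the average-linkage criterion, the stopping threshold and the two-dimensional toy vectors (stand-ins for per-frame MFCC vectors) are all assumptions made for the example.

```python
import math

def agglomerative_cluster(features, stop_distance):
    """Average-linkage agglomerative clustering (illustrative sketch).

    features: list of equal-length tuples, stand-ins for per-frame MFCC vectors.
    stop_distance: merging stops once the closest two clusters are farther apart.
    Returns a sorted list of clusters, each a sorted list of frame indices.
    """
    clusters = [[i] for i in range(len(features))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(math.dist(features[i], features[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > stop_distance:
            break  # remaining clusters are treated as distinct speakers
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return sorted(clusters)

# Toy speech frames: three frames near (0, 0), two frames near (5, 5).
frames = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9), (0.05, 0.05)]
speakers = agglomerative_cluster(frames, stop_distance=1.0)
print(speakers)
```

Each resulting cluster plays the role of one speaker. Such a scheme presupposes that every frame contains a single speaker, which is the limitation discussed in the remainder of this section.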
The use of visual features for diarization has been motivated by the importance of audio-visual synchrony. Indeed, it was shown that facial and lip movements are strongly correlated with speech production and hence visual features, extracted from frontal views of speaker faces, can be used to increase the discriminative power of audio features in numerous tasks, e.g. speech recognition, source separation [20, 21], and diarization [13, 22, 23, 24]. In the latter case, the most common approaches involve the analysis of temporal correlation between the two modalities such that the face/lip movements that best correlate with speech correspond to an active speaker.
Garau et al. compare two audio-visual synchronization methods, based on mutual information (MI) and on canonical correlation analysis (CCA), and using MFCC auditory features combined with motion amplitude computed from facial feature tracks. They conclude that MI performs slightly better than CCA and that vertical facial displacements (lip and chin movements) are the visual features most correlated with speech production. MI that combines gray-scale pixel-value variations extracted from a face region with acoustic energy is also used by Noulas et al. The audio-visual features thus extracted are plugged into a dynamic Bayesian network (DBN) that performs speaker diarization. The method was tested on video meetings involving up to four participants which are recorded with several cameras, such that each camera faces a participant. More recently, both El Khoury et al. and Kapsouras et al. propose to cluster audio features and face features independently and then to correlate these features based on temporal alignments between speech and face segments.
The methods mentioned so far yield good results whenever clean speech signals and frontal views of faces are available. A speech signal is said to be clean if it is noise free and if it corresponds to a single speaker; hence audio clustering based on MFCC features performs well. Moreover, time series of MFCC features seem to correlate well with facial-feature trajectories. If several faces are present, it is possible to select the facial feature trajectory that correlates the most with the speech signal, e.g. [9, 10]. However, in realistic settings, participants are not always facing the camera, and consequently the detection of facial and lip movements is problematic. Moreover, methods based on cross-modal temporal correlation, e.g. [13, 22, 19, 23, 24, 3], require long sequences of audio-visual data, hence they can only be used offline, e.g. for the analysis of broadcast news, of audio-visual conferences, etc.
In the presence of simultaneous speakers, the task of diarization is more challenging because multiple-speaker information must be extracted from the audio data, on one hand, and the speech-to-face association problem must be properly addressed, on the other hand. In mixed-speech microphone signals, or dirty speech, there are many audio frames that contain acoustic features uttered by several speakers and MFCC features are not reliable anymore because they are designed to characterize acoustic signals uttered by single speakers. The multi-speech-to-multi-face association problem can be solved neither by performing temporal correlation between a single microphone signal and an image sequence nor by clustering MFCC features.
One way to overcome the problems just mentioned is to perform multiple speech-source localization [25, 26, 27] and to associate speech sources with persons. These methods, however, do not address the problems of aligning speech-source locations with visible persons and of tracking them over time. Moreover, they often use circular or linear microphone arrays, e.g. planar microphone setups, hence they provide sound-source directions with one degree of freedom, e.g. azimuth, which may not be sufficient to achieve robust audio-visual association. Hence, some form of microphone-camera calibration is needed. Khalidov et al. propose to estimate the microphone locations in a camera-centered coordinate system and to use a binocular-binaural setup in order to jointly cluster visual and auditory features via a conjugate mixture model. Minotto et al. learn an SVM classifier using labeled audio-visual features. This training depends on the acoustic properties of the experimental setup. They combine voice activity detection with sound-source localization using a linear microphone array which provides horizontal (azimuth) speech directions. In terms of visual features, their method relies on lip movements, hence frontal speaker views are required.
Multiple-speaker scenarios were thoroughly addressed in the framework of audio-visual tracking. Gatica-Perez et al. proposed a multi-speaker tracker using approximate inference implemented with a Markov chain Monte Carlo particle filter (MCMC-PF). Naqvi et al. proposed a 3D visual tracker, also based on MCMC-PF, to estimate the positions and velocities of the participants, which are then passed to blind source separation based on beamforming. The experiments reported in both [14, 15] require a network of distributed cameras to guarantee that frontal views of the speakers are always available. More recently, Kilic et al. proposed to use audio information to assist the particle propagation process and to weight the observation model. This implies that audio data are always available and that they are reliable enough to properly relocate the particles. While audio-visual multiple-person tracking methods provide an interesting methodology, they do not address the diarization problem. Indeed, they assume that people speak continuously, which facilitates the task of the proposed audio-visual trackers. With one exception, audio analysis is reduced to sound-source localization using a microphone array, in order to enforce spatial coincidence between faces and speech.
Recently we addressed audio-visual speaker diarization under the assumption that participants take speech turns and that there is no overlap between their emitted speech signals. We proposed a simple model that consists of a speech-turn discrete latent variable that associates the speech signal with one of the participants [31, 32]. The main idea of this work was to track multiple persons, to extract a single sound-source direction from short time intervals, and to map sound directions onto the image plane. Audio and visual observations can then be associated using a recently proposed weighted-data EM algorithm. In the present paper we propose a novel dynamic audio-visual fusion model that can deal with simultaneously speaking participants. In particular, we exploit the spectral sparsity of speech signals and we propose a novel multiple speech-source localization method based on a semi-supervised complex-Gaussian mixture model in the Fourier domain. We also generalize the single speaker-turn diarization model of [31, 32] to multiple speaking persons.
Recently we addressed audio-visual speaker diarization under the assumption that participants take speech turns and that there is no overlap between their speech segments. We proposed a model that consists of a speech-turn discrete latent variable that associates the current speech signal, if any, with one of the visible participants [31, 32]. The main idea was to perform multiple-person tracking in the visual domain, to extract sound-source directions (one direction at a time), and to map this sound direction onto the image plane. Audio and visual observations can then be associated using a recently proposed weighted-data EM algorithm.
In the present paper we propose a novel DBN-based cross-modal diarization model. Unlike several recently proposed audio-visual diarization works, e.g. [31, 32], the proposed model can deal with simultaneously speaking participants that may wander around and turn their faces away from the cameras. Unlike methods that require long sequences of past, present, and future frames, and hence are well suited for post-processing, our method is causal and therefore it can be used online. To deal with mixed speech signals, we exploit the sparsity of speech spectra and we propose a novel multiple speech-source localization method based on audio-visual data association implemented with a cohort of frequency-wise semi-supervised complex-Gaussian mixture models.
We start by introducing a few notations and definitions. Unless otherwise specified, upper-case letters denote random variables while lower-case letters denote their realizations. Vectors are in slanted bold, e.g. $\boldsymbol{X}$, while matrices are in bold, e.g. $\mathbf{Y}$. We consider an image sequence that is synchronized with two microphone signals and let $t$ denote the time-step index of the audio-visual stream of data.
Let $N$ be the maximum number of visual objects, e.g. persons, available at any time $t$. Hence at $t$ we have at most $N$ persons with locations on the image plane $\mathbf{X}_t = (\boldsymbol{X}_{t1}, \dots, \boldsymbol{X}_{tn}, \dots, \boldsymbol{X}_{tN}) \in \mathbb{R}^{2 \times N}$, where the observed random variable $\boldsymbol{X}_{tn} \in \mathbb{R}^2$ is the pixel location of person $n$ at $t$. We also introduce a set of binary (or control) variables $\boldsymbol{V}_t = (V_{t1}, \dots, V_{tN}) \in \{0, 1\}^N$ such that $V_{tn} = 1$ if person $n$ is visible at $t$ and $V_{tn} = 0$ if the person is not visible. Let $N_t = \sum_{n=1}^{N} V_{tn}$ denote the number of visible persons at $t$. The time series $\mathbf{X}_{1:t}$ and associated visibility binary masks $\boldsymbol{V}_{1:t}$ can be estimated using a multi-person tracker, see Section IV.
We now describe the audio data. Without loss of generality, the audio signals are recorded with two microphones: let $\mathbf{Y}_t \in \mathbb{C}^{F \times K}$ be a binaural spectrogram containing $F$ frequencies and $K$ frames. Each frame is a binaural vector $\boldsymbol{Y}_{tk} \in \mathbb{C}^{F}$, $1 \le k \le K$. Binaural spectrograms are obtained in the following way. The short-time Fourier transform (STFT) is first applied to the left- and right-microphone signals acquired at time-step $t$ such that two spectrograms, $\mathbf{L}_t, \mathbf{R}_t \in \mathbb{C}^{F \times K}$, are associated with the left and right microphones, respectively. Each spectrogram is composed of $F \times K$ complex-valued STFT coefficients. The binaural spectrogram $\mathbf{Y}_t$ is composed of $F \times K$ complex-valued coefficients and each coefficient $Y^f_{tk}$, $1 \le f \le F$ and $1 \le k \le K$, can be estimated from the corresponding left- and right-microphone STFT coefficients $L^f_{tk}$ and $R^f_{tk}$, see Section V. One important characteristic of speech signals is that they have sparse spectrograms. As explained below, this sparsity is explicitly exploited by the proposed speech-source localization method. Moreover, the microphone signals are contaminated by background noise and by sounds emitted by other non-speech sources. Therefore, speech activity associated with each binaural spectrogram entry must be properly detected and characterized with the help of a binary-mask matrix $\mathbf{A}_t \in \{0, 1\}^{F \times K}$: $A^f_{tk} = 1$ if the corresponding spectrogram coefficient contains speech, and $A^f_{tk} = 0$ if it does not contain speech. To summarize, the binaural spectrograms $\mathbf{Y}_{1:t}$ and associated speech-activity masks $\mathbf{A}_{1:t}$ characterize the audio observations.
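Recall that the objective of our work is to assign speech signals to persons, which amounts to one-to-one spatiotemporal associations between several speech sources (if any) and one or several observed persons. For this purpose we introduce a time series of discrete latent variables, $\boldsymbol{S}_{1:t}$, where the vector $\boldsymbol{S}_t = (S_{t1}, \dots, S_{tN})$ has binary-valued entries such that $S_{tn} = 1$ if person $n$ speaks during the time-step $t$, and $S_{tn} = 0$ if person $n$ is silent. The temporal speaker diarization problem at hand can be formulated as finding a maximum-a-posteriori (MAP) solution, namely finding the most probable configuration of the latent state $\boldsymbol{S}_t$ that maximizes the following posterior probability distribution, also referred to as the filtering distribution:

$$P(\boldsymbol{s}_t \mid \mathbf{x}_{1:t}, \mathbf{y}_{1:t}, \mathbf{a}_{1:t}, \boldsymbol{v}_{1:t}). \tag{1}$$

We introduce the notation $\boldsymbol{u}_t = (\mathbf{x}_t, \mathbf{y}_t, \mathbf{a}_t)$ for the observed variables, while the $\boldsymbol{v}_t$ are referred to as control variables. The filtering distribution (1) can be expanded as:

$$P(\boldsymbol{s}_t \mid \boldsymbol{u}_{1:t}, \boldsymbol{v}_{1:t}) = \frac{P(\boldsymbol{u}_t \mid \boldsymbol{s}_t, \boldsymbol{v}_t)\, P(\boldsymbol{s}_t \mid \boldsymbol{u}_{1:t-1}, \boldsymbol{v}_{1:t})}{P(\boldsymbol{u}_t \mid \boldsymbol{u}_{1:t-1}, \boldsymbol{v}_{1:t})}. \tag{2}$$

We assume that the observed variables $\boldsymbol{u}_t$ are conditionally independent of all other variables, given the speaking state $\boldsymbol{s}_t$ and control input $\boldsymbol{v}_t$; $\boldsymbol{s}_t$ is conditionally independent of $\boldsymbol{s}_{1:t-2}$, given $\boldsymbol{s}_{t-1}$ and $\boldsymbol{v}_{t-1:t}$. Fig. 1 shows the graphical model representation of the proposed model.
The numerator of (2) is the product of two terms: the observation likelihood (left), whose expansion (3) is evaluated in Section VI, and the predictive distribution (right).
The predictive distribution (right-hand side of the numerator of (2)) expands as:

$$P(\boldsymbol{s}_t \mid \boldsymbol{u}_{1:t-1}, \boldsymbol{v}_{1:t}) = \sum_{\boldsymbol{s}_{t-1}} P(\boldsymbol{s}_t \mid \boldsymbol{s}_{t-1}, \boldsymbol{v}_{t-1}, \boldsymbol{v}_t)\, P(\boldsymbol{s}_{t-1} \mid \boldsymbol{u}_{1:t-1}, \boldsymbol{v}_{1:t-1}),$$

where the first factor (4) is the state transition probability and the second factor (5) is the filtering distribution at $t-1$.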
To summarize, the evaluation of the filtering distribution at an arbitrary time-step $t$ requires the evaluation of (i) the observation likelihood (3), see Section VI, (ii) the state transition probabilities (4), see Section III-B, (iii) the filtering distribution at $t-1$ (5), and (iv) the normalization term, i.e. the denominator of (2). Notice that the number of possible state configurations is $2^N$, where $N$ is the maximum number of people. For small values of $N$ (2 to 6 persons), solving the MAP problem (1) is computationally efficient.
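The recursion summarized above can be sketched as an exhaustive loop over all $2^N$ speaking states. This is only an illustration under stated assumptions: `filter_step`, `toy_transition` and `toy_likelihood` are hypothetical names, the per-person evidence values and the self-transition value 0.8 are arbitrary, and the factorized toy likelihood is a stand-in rather than the observation model of Section VI.

```python
from itertools import product

def filter_step(prev_posterior, likelihood, transition, n_persons):
    """One exact filtering step over all 2**N binary speaking states.

    prev_posterior: dict, state tuple -> filtering probability at t-1.
    likelihood: callable, state -> observation likelihood at t.
    transition: callable, (prev_state, state) -> transition probability.
    Returns (filtering distribution at t, MAP state at t).
    """
    states = list(product((0, 1), repeat=n_persons))
    unnormalized = {}
    for s in states:
        # Predictive distribution: marginalize over the previous state.
        predictive = sum(transition(sp, s) * prev_posterior[sp] for sp in states)
        unnormalized[s] = likelihood(s) * predictive
    norm = sum(unnormalized.values())  # normalization term
    posterior = {s: p / norm for s, p in unnormalized.items()}
    map_state = max(posterior, key=posterior.get)
    return posterior, map_state

# --- Toy usage with two persons (all numbers are illustrative assumptions) ---
Q = 0.8                # assumed probability of staying in the same state
EVIDENCE = (0.9, 0.2)  # assumed per-person "speaking" evidence at time t

def toy_transition(prev_state, state):
    p = 1.0
    for a, b in zip(prev_state, state):
        p *= Q if a == b else 1.0 - Q
    return p

def toy_likelihood(state):
    p = 1.0
    for s, e in zip(state, EVIDENCE):
        p *= e if s else 1.0 - e
    return p

uniform = {s: 0.25 for s in product((0, 1), repeat=2)}
posterior, map_state = filter_step(uniform, toy_likelihood, toy_transition, 2)
print(map_state, round(posterior[map_state], 3))
```

Each step costs on the order of $4^N$ transition evaluations, which stays small for the 2 to 6 persons considered here.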
Priors over the dynamics of the state variables in (4) exploit the simplifying assumption that the speaking dynamics of a person is independent of all the other persons, i.e. the transition probability factorizes as $P(\boldsymbol{s}_t \mid \boldsymbol{s}_{t-1}, \boldsymbol{v}_{t-1}, \boldsymbol{v}_t) = \prod_{n=1}^{N} P(s_{tn} \mid s_{t-1,n}, v_{t-1,n}, v_{tn})$. Several existing speech-turn models rely on non-verbal cues, such as filled pauses, breath, facial gestures, gaze, etc. [35, 36], and a speech-turn classifier can be built from annotated dialogues. An existing state transition model considers all possible transitions, e.g., speaking/non-speaking, visible/not-visible, etc., which results in a large number of parameters that need to be estimated. These models cannot be easily extended when there are speech overlaps and one has to rely on features extracted from the data. To define the speaking transition priors $P(s_{tn} \mid s_{t-1,n}, v_{t-1,n}, v_{tn})$, we consider three cases: (i) person $n$ is visible at $t-1$ and visible at $t$, or $v_{t-1,n} = v_{tn} = 1$, in which case the transitions are parametrized by a self-transition prior $q \in [0, 1]$ which models the probability of remaining in the same state, either speaking or not speaking; (ii) person $n$ is not visible at $t-1$ and visible at $t$, or $v_{t-1,n} = 0$, $v_{tn} = 1$, in which case the prior to be either speaking or not speaking at $t$ is uniform; and (iii) person $n$ is not visible at $t$, or $v_{tn} = 0$, in which case the prior not to be speaking is equal to 1. The following equation summarizes all these cases:

$$P(s_{tn} \mid s_{t-1,n}, v_{t-1,n}, v_{tn}) = v_{t-1,n}\, v_{tn}\, q^{\delta_{s_{t-1,n}}(s_{tn})} (1 - q)^{1 - \delta_{s_{t-1,n}}(s_{tn})} + \tfrac{1}{2}\,(1 - v_{t-1,n})\, v_{tn} + (1 - v_{tn})\, \delta_{0}(s_{tn}), \tag{6}$$

where $\delta_i(j) = 1$ if $i = j$ and $\delta_i(j) = 0$ if $i \neq j$. Note that this does not consider the case of a person not visible at $t-1$ and at $t$, for which the prior probability to be speaking is 0. In all our experiments the same value of $q$ was used.
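The three cases above translate directly into a small function. The sketch below assumes binary integers for the speaking and visibility variables; the function name is hypothetical and the value `q=0.8` in the usage lines is an arbitrary illustration, not the value used in the experiments.

```python
def speaking_transition_prior(s_now, s_prev, v_now, v_prev, q):
    """Per-person transition prior P(s_now | s_prev, v_prev, v_now).

    s_now, s_prev: 1 if the person speaks at t / t-1, else 0.
    v_now, v_prev: 1 if the person is visible at t / t-1, else 0.
    q: self-transition prior, probability of staying in the same state.
    """
    if v_now == 0:
        # Case (iii): a person that is not visible is silent with probability 1.
        return 1.0 if s_now == 0 else 0.0
    if v_prev == 0:
        # Case (ii): the person just became visible, uniform prior.
        return 0.5
    # Case (i): visible at both time-steps.
    return q if s_now == s_prev else 1.0 - q

# Usage: the prior is a proper distribution over s_now in every case.
for s_prev in (0, 1):
    for v_prev in (0, 1):
        for v_now in (0, 1):
            total = sum(
                speaking_transition_prior(s, s_prev, v_now, v_prev, q=0.8)
                for s in (0, 1)
            )
            assert abs(total - 1.0) < 1e-12
```

Because the dynamics of each person are assumed independent, the joint transition probability over all persons is the product of these per-person terms.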
The multiple-speaker tracking and diarization model proposed in this work only considers persons that are both seen and heard. Indeed, in informal scenarios there may be acoustic sources (speech or other sounds such as music) that are not in the camera field of view and hence cannot be visually detected and tracked. The proposed audio-visual association model addresses this problem, see Section VI.
We propose to use visual tracking of multiple persons in order to infer realizations of the random variables introduced above, namely the person locations and the associated visibility masks. The advantage of a multiple-person tracker is that it is able to detect a variable number of persons, possibly appearing and disappearing from the visual field of view, to estimate their velocities, and to track their locations and identities. Multiple object/person tracking is an extremely well studied topic in the computer vision literature and many methods with their associated software packages are available. Among all these methods, we chose a multiple-person tracker that, in the context of our work, has several advantages: (i) it robustly handles fragmented tracks (due to occlusions, to the limited camera field of view, or simply to unreliable detections), (ii) it handles changes in person appearance, such as a person that faces the camera and then suddenly turns his/her head away from the camera, e.g. towards a speaker, and (iii) it performs online discriminative learning such that it can distinguish between similar appearances of different persons.
Visual tracking is implemented in the following way. An upper-body detector is used to extract bounding boxes of persons in every frame. This allows the tracker to initialize new tracks, to re-initialize lost ones, to avoid tracking drift, and to cope with a large variety of poses and resolutions. Moreover, an appearance model, based on the color histogram of a bounding box associated with a person's upper body (head and torso), is associated with each detected person. The appearance model is updated whenever the upper-body detector returns a reliable bounding box (no overlap with another bounding box). We observed that upper-body detection is more robust than face detection, which yields many false positives. Nevertheless, in the context of audio-visual fusion, the face locations are important. Therefore, the locations estimated by the tracker correspond to the face centers of the tracked persons.
In this section we present a methodology for extracting binaural features in the presence of either a single audio source or several speech sources. We consider audio signals recorded with a binaural microphone pair. As already explained in Section III, the short-time Fourier transform (STFT) is applied to the two microphone signals acquired at time-slice $t$ and two spectrograms are thus obtained, namely $\mathbf{L}_t$ and $\mathbf{R}_t$.
Let us assume that there is a single (speech or non-speech) signal emitted by an audio source during the time slice $t$. In the STFT domain, the relationships between the source-STFT spectrogram and microphone-STFT spectrograms are, for each frame $k$ and each frequency $f$ (for convenience we omit the time index $t$):

$$L^f_k = H^f_{\mathrm{L},k}\, Z^f_k + E^f_{\mathrm{L},k}, \tag{7}$$
$$R^f_k = H^f_{\mathrm{R},k}\, Z^f_k + E^f_{\mathrm{R},k}, \tag{8}$$

where $Z^f_k$ is the unknown source spectrogram, $E^f_{\mathrm{L},k}$ and $E^f_{\mathrm{R},k}$ are the unknown noise spectrograms associated with the left and right channels, and $H^f_{\mathrm{L},k}$ and $H^f_{\mathrm{R},k}$ are the unknown left and right acoustic transfer functions, which are frequency-dependent. The above equations correspond to the general case of a moving sound source. However, if we assume that the audio source is static during the time slice $t$, i.e. the source emitter is in a fixed position during the time slice $t$, the acoustic transfer functions are time-invariant and only depend on the source position relative to the microphones. We further define binaural features, i.e. the ratio between the left and right acoustic transfer functions, $Y^f = H^f_{\mathrm{L}} / H^f_{\mathrm{R}}$. Notice that we omitted the frame index $k$ because in the case of a static source, the acoustic transfer function is invariant over frames. Like the acoustic transfer functions, the binaural features do not depend on $k$ and they only contain audio-source position information.
In practice the binaural features are estimated from the cross-power spectral density (cross-PSD) between the two channels and the auto-power spectral density (auto-PSD) of one channel, computed over the $K$ frames:

$$\Phi^f_{\mathrm{LR}} = \frac{1}{K} \sum_{k=1}^{K} L^f_k\, (R^f_k)^{*}, \qquad \Phi^f_{\mathrm{RR}} = \frac{1}{K} \sum_{k=1}^{K} R^f_k\, (R^f_k)^{*},$$

where $(R^f_k)^{*}$ is the complex conjugate of $R^f_k$ and it is assumed that the signal-noise cross terms can be neglected. If the noise signals are spatially uncorrelated then the noise-noise cross terms can also be neglected. The binaural feature vector at $t$ can be approximated with the ratio between the cross-PSD and auto-PSD functions, i.e. the vector with entries

$$Y^f \approx \frac{\Phi^f_{\mathrm{LR}}}{\Phi^f_{\mathrm{RR}}}, \qquad 1 \le f \le F. \tag{12}$$
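We now consider the case of $J$ speakers ($J > 1$) that emit speech signals simultaneously (for convenience we omit again the time index $t$):

$$L^f_k = \sum_{j=1}^{J} H^f_{\mathrm{L},j}\, Z^f_{j,k} + E^f_{\mathrm{L},k}, \tag{13}$$
$$R^f_k = \sum_{j=1}^{J} H^f_{\mathrm{R},j}\, Z^f_{j,k} + E^f_{\mathrm{R},k}, \tag{14}$$

where $H^f_{\mathrm{L},j}$ and $H^f_{\mathrm{R},j}$ are the acoustic transfer functions from the speech source $j$ to the left and right microphones, respectively. The STFT-based estimate of the cross-PSD for each frequency-frame point $(f, k)$ is

$$\hat{\Phi}^f_{\mathrm{LR},k} = L^f_k\, (R^f_k)^{*}.$$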
In order to further characterize simultaneously emitting speech signals, we exploit the well-known fact that speech signals have sparse spectrograms in the Fourier domain. Because of this sparsity it is realistic to assume that only one speech source is active at each frequency-frame point of the two microphone spectrograms (13) and (14). Therefore these spectrograms are composed of STFT coefficients that contain either (i) speech emitted by a single speaker or (ii) noise. Using this assumption, the binaural spectrogram and associated binary mask matrix can be estimated from the cross-PSD and auto-PSD in the following way. We start by estimating a binary mask for each frequency-frame point, using an adaptive threshold whose value is estimated based on noise statistics. Then, for each frequency-frame point at time-slice $t$ that the mask labels as speech, we compute the binaural spectrogram coefficient from the cross-PSD and auto-PSD estimates.
It is important to stress that while these binaural coefficients are source-independent, they are location-dependent. This is to say that the binaural spectrogram only contains information about the location of the sound source and not about the content of the source. This crucial property allows one to use different types of sound sources for training a sound source localizer and for predicting the location of a speech source, as explained in the next section.
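The source-independence property can be checked numerically with a small noise-free simulation at a single frequency. The sketch below assumes the left-over-right convention for the binaural feature and a cross-PSD divided by the right-channel auto-PSD; the transfer-function values, the function names and the Gaussian toy sources are hypothetical choices made for the illustration.

```python
import random

def binaural_feature(left, right):
    """Ratio of cross-PSD to auto-PSD at one frequency, summed over frames.

    left, right: lists of complex STFT coefficients (one per frame).
    """
    cross = sum(l * r.conjugate() for l, r in zip(left, right))
    auto = sum(abs(r) ** 2 for r in right)
    return cross / auto

def simulate(h_left, h_right, n_frames, rng):
    """Static source, no noise: each channel is the source times its transfer function."""
    source = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_frames)]
    return [h_left * z for z in source], [h_right * z for z in source]

rng = random.Random(0)
H_LEFT, H_RIGHT = 0.8 + 0.3j, 0.5 - 0.4j  # assumed transfer functions for one location

# Two different source signals emitted from the same location.
feat_a = binaural_feature(*simulate(H_LEFT, H_RIGHT, 50, rng))
feat_b = binaural_feature(*simulate(H_LEFT, H_RIGHT, 50, rng))

print(abs(feat_a - feat_b) < 1e-9)            # same feature for both sources
print(abs(feat_a - H_LEFT / H_RIGHT) < 1e-9)  # equal to the transfer-function ratio
```

In this idealized setting the source power cancels in the ratio, which is why a localizer trained with one type of sound can be applied to speech. With noise, the equality only holds approximately.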
In this section we propose an audio-visual spatial alignment model that will allow us to evaluate the observation likelihood (3). The proposed audio-visual alignment is weakly supervised and hence it requires training data. We start by briefly describing the audio-visual training data. The training data contain pairs of audio recordings and their associated directions. Let $\mathcal{W} = \{\boldsymbol{w}_1, \dots, \boldsymbol{w}_M\}$ be a training dataset containing $M$ binaural vectors. Each binaural vector is extracted from its corresponding audio recording using the method described in Section V-A, i.e. $\boldsymbol{w}_m = (w^1_m, \dots, w^F_m)$ where each entry $w^f_m$ is computed with (12).
Each audio sample in the training set consists of a white-noise signal that is emitted by a loudspeaker placed at different locations, e.g. Fig. 5. The PSD of a white-noise signal is significant at each frequency, thus a binaural coefficient can be computed at every frequency. A visual marker placed onto the loudspeaker makes it possible to associate its pixel location with each sound direction, hence the $M$ source directions correspond to an equal number of pixel locations $\boldsymbol{p}_1, \dots, \boldsymbol{p}_M$. To summarize, the training data consist of $M$ pairs of binaural features and associated pixel locations: $\{(\boldsymbol{w}_m, \boldsymbol{p}_m)\}_{m=1}^{M}$.
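We now consider the two sets of visual and auditory observations during the time slice $t$, namely $\mathbf{x}_t$, $\mathbf{y}_t$, and $\mathbf{a}_t$. If person $n$, located at $\boldsymbol{x}_{tn}$, is both visible and speaks at $t$, the binaural features associated with the emitted speech signal depend on the person’s location only, hence they must be similar to the binaural features of the training source emitting from the same location. This can be simply written as a nearest-neighbor search over the training set of audio-source pixel locations $\boldsymbol{p}_1, \dots, \boldsymbol{p}_M$:

$$\hat{m}_n = \operatorname*{argmin}_{1 \le m \le M} \|\boldsymbol{x}_{tn} - \boldsymbol{p}_m\|^2, \tag{18}$$

and let $\bar{\boldsymbol{w}}_n = \boldsymbol{w}_{\hat{m}_n}$ be the binaural feature vector associated with this location. Hence, the training pair $(\bar{\boldsymbol{w}}_n, \boldsymbol{p}_{\hat{m}_n})$ can be associated with person $n$.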
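The nearest-neighbor lookup described above is a one-liner in practice. The sketch below assumes that each training pair is stored as a tuple (binaural vector, pixel location); the function name, the toy binaural vectors and the pixel coordinates are hypothetical.

```python
def nearest_training_pair(person_xy, training_pairs):
    """Return the training pair whose pixel location is closest to a person.

    person_xy: (u, v) pixel location of the tracked person (face center).
    training_pairs: list of (binaural_vector, (u, v)) tuples.
    """
    def squared_distance(pair):
        _, (u, v) = pair
        return (u - person_xy[0]) ** 2 + (v - person_xy[1]) ** 2

    return min(training_pairs, key=squared_distance)

# Toy training set: three loudspeaker positions, two frequencies each.
training = [
    ([0.9 + 0.1j, 1.1 - 0.2j], (100, 200)),
    ([0.4 + 0.6j, 0.7 + 0.3j], (800, 220)),
    ([1.3 - 0.5j, 0.2 + 0.9j], (1500, 240)),
]

feature, pixel = nearest_training_pair((760, 300), training)
print(pixel)
```

The returned binaural vector then serves as the expected feature for a speaker standing at that image location. For large training sets a spatial index could replace the linear scan.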
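We choose to model the likelihood of an observed binaural feature $y^f_k$, at any frequency $f$, with the following complex-Gaussian mixture model (for convenience, we omit the time index $t$):

$$P(y^f_k \mid \Theta^f) = \sum_{n=1}^{N} \pi^f_n\, \mathcal{N}_c(y^f_k;\, \bar{w}^f_n, \sigma^f_n) + \pi^f_{N+1}\, \mathcal{N}_c(y^f_k;\, 0, \sigma^f_{N+1}), \tag{19}$$

where $\mathcal{N}_c$ is the complex-normal distribution and $\Theta^f$ is the set of real-valued model parameters, namely the priors $\pi^f_1, \dots, \pi^f_{N+1}$ with $\sum_{n=1}^{N+1} \pi^f_n = 1$, and the variances $\sigma^f_1, \dots, \sigma^f_{N+1}$. This model states that the binaural feature $y^f_k$ is either generated by one of the persons, located at $\boldsymbol{x}_n$, hence it is an inlier generated by a complex-normal mixture model with means $\bar{w}^f_n$ (the training binaural feature associated with the location of person $n$), or is emitted by an unknown sound source, hence it is an outlier generated by a zero-centered complex-normal distribution with a very large variance.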
The parameter set of (19) can be easily estimated via a simplified variant of the EM algorithm for Gaussian mixtures. The algorithm alternates between an E-step that evaluates the posterior probabilities , where is the assignment variable and means that is generated by component :
and an M-step that estimates the variances and the priors:
The algorithm can be easily initialized by setting all the priors equal to and by setting all the variances equal to a positive scalar . Because the component means are fixed, the algorithm converges in only a few iterations.
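The alternation described above can be illustrated with a short sketch. It is written under several assumptions that the text does not confirm: a single observed feature vector with one complex value per frequency, one scalar variance per mixture component shared across frequencies, a circular complex-normal density, and an outlier component whose variance is also re-estimated. The function names are hypothetical.

```python
import numpy as np

def complex_normal_logpdf(y, mu, var):
    """Log-density of a circular complex normal with mean mu and variance var."""
    return -np.log(np.pi * var) - np.abs(y - mu) ** 2 / var

def em_fixed_means(y, means, n_iter=20, init_var=1.0, var_floor=1e-6):
    """EM for a complex-Gaussian mixture whose component means are fixed.

    y     : (F,) complex observed binaural features, one per frequency.
    means : (N, F) complex fixed means, one row per visible person.
    A zero-mean outlier component is appended as the last component.
    Returns priors (N+1,), variances (N+1,) and the posteriors (F, N+1)
    computed in the last E-step.
    """
    y = np.asarray(y, dtype=complex)
    mu = np.vstack([np.asarray(means, dtype=complex), np.zeros((1, y.size))])
    n_comp = mu.shape[0]
    priors = np.full(n_comp, 1.0 / n_comp)         # equal initial priors
    variances = np.full(n_comp, float(init_var))   # common initial variance
    sq_err = np.abs(y[None, :] - mu) ** 2          # constant, since means are fixed

    for _ in range(n_iter):
        # E-step: posterior probability that y_f was generated by component n.
        log_w = np.log(priors)[:, None] + complex_normal_logpdf(
            y[None, :], mu, variances[:, None])
        log_w -= log_w.max(axis=0, keepdims=True)  # numerical stability
        post = np.exp(log_w)
        post /= post.sum(axis=0, keepdims=True)
        # M-step: update priors and variances only.
        mass = post.sum(axis=1)
        priors = np.maximum(mass / y.size, 1e-12)
        priors /= priors.sum()
        variances = np.maximum(
            (post * sq_err).sum(axis=1) / np.maximum(mass, 1e-12), var_floor)
    return priors, variances, post.T

# Toy example: the observation is a noisy copy of the first person's mean.
rng = np.random.default_rng(0)
F = 64
means = np.stack([np.exp(1j * np.linspace(0, 3, F)),
                  np.exp(-1j * np.linspace(0, 3, F))])
noise = 0.05 * (rng.standard_normal(F) + 1j * rng.standard_normal(F))
y = means[0] + noise
priors, variances, post = em_fixed_means(y, means)
```

Because only priors and variances are updated, each iteration costs a single pass over the frequencies, which is consistent with the remark that a few iterations suffice.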
Based on these results one can evaluate (3), namely the speaking probability of person located at : the probability that a visible person either speaks:
or is silent:
In this section we describe the audio-visual datasets that are used to test the proposed method and to compare it with several state-of-the-art methods. We start by describing a novel dataset that was purposely gathered and recorded to encompass a wide range of multiple-speaker scenarios, e.g. speakers facing the camera, moving speakers, speakers looking at each other, etc. This novel dataset is referred to as AVDIAR (https://team.inria.fr/perception/avdiar/).
In order to record both training and test data we used the following camera-microphone setup. A color camera is rigidly attached to an acoustic dummy head. The camera is a PointGrey Grasshopper3 unit equipped with a Sony Pregius IMX174 CMOS sensor of size . The camera is equipped with a Kowa 6 mm wide-angle lens and it delivers 1920×1200 color pixels at 25 FPS. This camera-lens setup has a horizontal × vertical field of view of .
For the audio recordings we used a binaural Sennheiser MKE 2002 dummy head with two microphones plugged into its left and right ears, respectively. The original microphone signals are captured at Hz; we have downsampled them to Hz. The STFT, implemented with a 32 ms Hann window and 16 ms shifts between consecutive windows, is then applied separately to the left and right microphone signals. Therefore, there are 512 samples per frame and the audio frame rate is approximately 64 FPS. Each audio frame consists of a vector composed of Fourier coefficients covering frequencies in the range .
The camera and the microphones are connected to a single PC and they are finely synchronized using time stamps delivered by the computer’s internal clock. This audio-visual synchronization allows us to align the visual frames with the audio frames. The time index corresponds to the visual-frame index. For each we consider a spectrogram of length frames, or a time slice of 0.4 s, hence there is an overlap between the spectrograms corresponding to consecutive time indexes.
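The frame bookkeeping implied by these numbers can be checked with a few lines of arithmetic. The sketch below uses only values stated in the text (32 ms window, 16 ms shift, 512 samples per frame, 25 FPS video, 0.4 s time slice). The derived quantities, in particular the implied sampling rate and the number of audio frames per slice, are inferences from those values and not figures quoted from the paper, and all variable names are hypothetical.

```python
window_ms = 32           # STFT window length
hop_ms = 16              # shift between consecutive windows
samples_per_frame = 512  # samples in one STFT window
video_fps = 25           # camera frame rate
slice_ms = 400           # duration of one time slice

# Sampling rate implied by 512 samples in a 32 ms window (inferred).
implied_rate_hz = samples_per_frame * 1000 // window_ms   # 16000

# Exact audio frame rate; the text rounds this to about 64 FPS.
audio_fps = 1000 / hop_ms                                 # 62.5

# Number of 16 ms shifts that fit in one 0.4 s slice (inferred).
frames_per_slice = slice_ms // hop_ms                     # 25

# The time index follows the video frames, so consecutive slices
# start one video frame apart and therefore overlap.
slice_shift_ms = 1000 // video_fps                        # 40
overlap_ms = slice_ms - slice_shift_ms                    # 360
```

Under these assumptions two consecutive time slices share 360 ms of audio, which is why the spectrograms of consecutive time indexes overlap.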
|Recordings|Description|
|---|---|
|Seq01-1P-S0M1, Seq04-1P-S0M1, Seq22-1P-S0M1|A single person moving randomly and alternating between speech and silence.|
|Seq37-2P-S0M0, Seq43-2P-S0M0|Two static participants taking speech turns.|
|Seq38-2P-S1M0, Seq40-2P-S1M0, Seq44-2P-S2M0|Two static participants speaking almost simultaneously, i.e. there are large speech overlaps.|
|Seq20-2P-S1M1, Seq21-2P-S2M1|Two participants, wandering in the room and engaged in a conversation, sometimes speaking simultaneously.|
|Seq12-3P-S2M1, Seq27-3P-S2M1|Three participants engaged in an informal conversation. They are moving around and sometimes they speak simultaneously.|
|Seq13-4P-S1M1, Seq32-4P-S1M1|Three to four participants engaged in a conversation. Sometimes they speak simultaneously and there are many short speech turns.|
The training data were recorded by manually moving a loudspeaker in front of the camera-microphone unit, e.g. Fig. 5. A visual marker placed at the center of the loudspeaker enables recording of audio signals with their associated pixel positions in the image plane. The loudspeaker is moved in two planes roughly parallel to the image plane, at 1.5 m and 2.5 m, respectively. For each plane we record positions lying on a uniform 20 grid that covers the entire field of view of the camera, hence there are training samples. The training data consist of 1 s of white-noise (WN) signals. Using the STFT we therefore obtain two WN spectrograms of size 256×64, corresponding to the left and right microphones, respectively. These two spectrograms are then used to compute binaural feature vectors, see Section V-A (one feature vector for each loudspeaker position), and hence to build a training dataset of audio recordings and their associated image locations , see Section VI.
Similarly, we gathered a test dataset that contains several scenarios. Each scenario involves participants that are either static and speak or move and speak, in front of the camera-microphone unit at distances varying between 1.0 m and 3.5 m. In an attempt to record natural human-human interactions, participants were allowed to wander around the scene and to interrupt each other while speaking. We recorded the following scenario categories, e.g. Fig. 21:
Static participants facing the camera. This scenario can be used to benchmark diarization methods requiring the detection of frontal faces and of facial and lip movements.
Static participants facing each other. This scenario can be used to benchmark diarization methods that require static participants not necessarily facing the camera.
Moving participants. This is a general-purpose scenario that can be used to benchmark diarization as well as audio-visual person tracking.
In addition to the AVDIAR dataset, we used three other datasets, e.g. Fig. 34. They are briefly described as follows:
The MVAD dataset described in . The visual data were recorded with a Microsoft Kinect sensor at 20 FPS (note that our method does not use the depth image available with this sensor), and the audio signals were recorded with a linear array of omnidirectional microphones sampled at 44100 Hz. The recorded sequences are from 40 s to 60 s long and contain one to three participants that speak in Portuguese. The speech and silence segments are 4 s to 8 s long. Since the diarization method proposed in  requires frontal faces, the participants are facing the camera and remain static through all the recordings.
The AVASM dataset contains both training and test recordings used to test the single and multiple speaker localization method described in . The recording setup is similar to the one described above, namely a binaural acoustic dummy head with two microphones plugged into its ears and a camera placed underneath the head. The images and the audio signals were captured at 25 FPS and 44100 Hz, respectively. The recorded sequences contain up to two participants that face the camera and speak simultaneously. In addition, the dataset has audio-visual alignment data collected in a similar fashion as the AVDIAR dataset.
The AV16P3 dataset is designed to benchmark audio-visual tracking of several moving speakers without taking diarization into account . The sensor setup used for these recordings is composed of three cameras attached to the room ceiling, and two circular eight-microphone arrays. The recordings include mainly dynamic scenarios, comprising a single, as well as multiple moving speakers. In all the recordings there is a large overlap between the speaker-turns.
These datasets contain a large variety of recorded scenarios, aimed at a wide range of applications, e.g. formal and informal interaction in meetings and gatherings, human-computer interaction, etc. Some of the datasets were not purposely recorded to benchmark diarization. Nevertheless they are challenging because they contain a large amount of overlap between speakers, hence they are well suited to test the limits and failures of diarization methods. Unlike recordings of formal meetings, which are composed of long single-speech segments with almost no overlap between the participants, the above datasets contain the following challenging situations, e.g. Table I:
The participants do not always face the cameras; moreover, they turn their heads while they speak or listen;
The participants, rather than being static, move around and hence the tasks of tracking and diarization must be finely intertwined;
In informal meetings participants interrupt each other, hence not only is there no silence between speech segments, but the speech segments also overlap each other, and
Participants take speech turns quite rapidly, which results in short speech segments and makes audio-visual temporal alignment quite challenging.
To effectively benchmark our model against state-of-the-art methods, we use the diarization error rate (DER) to quantitatively measure the performance: the smaller the DER value, the better the performance. DER is defined by the NIST-RT evaluation testbed (http://www.nist.gov/speech/tests/rt/2006-spring/) and corresponds to the percentage of audio frames that are not correctly assigned to one or more speakers, or to none of them in case of a silent frame. DER is composed of the following measurements:
False-alarm error, when speech has been incorrectly detected;
Miss error, when a person is speaking but the method fails to detect the speech activity, and
Speaker-labeling error, when a person-to-speech association does not correspond to the ground truth.
To compute DER, the md-eval software package of NIST-RT is used, setting the forgiveness collar to one video frame, e.g. 40 ms for 25 FPS videos.
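To make the three error components concrete, the following sketch computes a frame-level DER from boolean speaker-activity matrices. It is a simplified illustration and not a replacement for md-eval: it assumes that hypothesis and reference speakers are already matched column by column, applies no forgiveness collar, and normalizes by the total reference speech time. The function name `frame_der` is hypothetical.

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level diarization error rate.

    ref, hyp : (T, S) boolean arrays; entry [t, s] is True when speaker s
    is active in frame t. Columns are assumed to be already matched.
    Returns (der, false_alarm, miss, confusion), each expressed as a
    fraction of the total reference speech time.
    """
    ref = np.asarray(ref, dtype=bool)
    hyp = np.asarray(hyp, dtype=bool)
    n_ref = ref.sum(axis=1)               # active reference speakers per frame
    n_hyp = hyp.sum(axis=1)               # active hypothesized speakers per frame
    n_correct = (ref & hyp).sum(axis=1)   # correctly attributed speakers per frame
    total = n_ref.sum()
    if total == 0:
        raise ValueError("reference contains no speech")
    false_alarm = np.maximum(n_hyp - n_ref, 0).sum() / total
    miss = np.maximum(n_ref - n_hyp, 0).sum() / total
    confusion = (np.minimum(n_ref, n_hyp) - n_correct).sum() / total
    return false_alarm + miss + confusion, false_alarm, miss, confusion

# Four frames, two speakers: one labeling error, one miss, one false alarm.
ref = [[1, 0], [1, 0], [1, 1], [0, 0]]
hyp = [[1, 0], [0, 1], [1, 0], [1, 0]]
der, fa, miss, conf = frame_der(ref, hyp)   # der = 0.75
```

In the toy example each error type accounts for one of the four reference speaker-frames, so the three components are 0.25 each.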
Vijayasenan et al.  (DiarTK) use audio information only. DiarTK allows the user to incorporate a large number of audio features. In our experiments and comparisons we used the following features: mel-frequency cepstral coefficients (MFCC), frequency-domain linear prediction (FDLP), time difference of arrival (TDOA), and modulation spectrum (MS). Notice that TDOA features can only be used with static sound-sources, hence we did not use TDOA in the case of moving speakers.
Minotto et al.  learn an SVM classifier based on labeled audio-visual features. Sound-source localization provides horizontal sound directions which are combined with the output of a mouth tracker.
Barzelay et al.  calculate audio-visual correlations based on extracting onsets from both modalities and on aligning these onsets. The method consists of detecting faces and of tracking face landmarks, such that each landmark yields a trajectory. Onset signals are then extracted from each of these trajectories as well as from the microphone signal. These onsets are used to compare each visual trajectory with the microphone signal, and the trajectories that best match the microphone signal correspond to the active speaker. We implemented this method based on  since there is no publicly available code. Extensive experiments with this method revealed that frontal views of speakers are needed. Therefore, we tested this method with all the sequences from the MVAD and AVASM datasets and with the sequences from the AVDIAR dataset featuring frontal images of faces.
Gebru et al.  track the active speaker, provided that participants take speech turns with no signal overlap. Therefore, whenever two persons speak simultaneously, this method extracts the dominant speaker.
Additionally, we used the following multiple sound-source localization methods:
GCC-PHAT which detects the local maxima of the generalized cross-correlation method: we used the implementation from the BSS Locate Toolbox .
TREM which considers a regular grid of source locations and selects the most probable locations based on maximum likelihood: we used the Matlab code provided by the authors, .
GCC-PHAT and TREM were used in conjunction with the proposed diarization method using the AVDIAR dataset as well as the MVAD and AV16P3 datasets.
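For readers unfamiliar with GCC-PHAT, the sketch below estimates the time delay between two microphone signals from the peak of the phase-transform-weighted cross-correlation. It is a generic single-source textbook version written with NumPy, not the BSS Locate implementation used in the experiments, which detects several local maxima in order to localize multiple sources. The function name and its parameters are hypothetical.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the delay (in seconds) of signal x relative to signal y.

    The cross-power spectrum is normalized by its magnitude (PHAT weighting),
    so only phase information contributes to the correlation peak.
    """
    n = len(x) + len(y)                          # zero-padding avoids circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.maximum(np.abs(cross), 1e-12)    # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

# Toy check: x is white noise y delayed by 5 samples.
fs = 16000
rng = np.random.default_rng(0)
y = rng.standard_normal(1024)
x = np.concatenate((np.zeros(5), y[:-5]))
tau = gcc_phat(x, y, fs)
```

With a known microphone spacing, such a delay can be converted into a horizontal direction of arrival. A multiple-source variant would keep several local maxima of the correlation instead of the single global one.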
Overall, it can be noticed that the method of  is the worst-performing method. As explained above, this method is based on detecting signal onsets in the two modalities and on finding cross-modal correlations based on onset coincidence. Unfortunately, the visual onsets are unable to properly capture complex speech dynamics. The DiarTK method of  is the second worst-performing method. This is mainly due to the fact that this method is designed to rely on long speech segments with almost no overlap between consecutive segments. Whenever several speech signals overlap, it is very difficult to extract reliable information with MFCC features, since the latter are designed to characterize clean speech. DiarTK is based on clustering MFCC features using a Gaussian mixture model. Consider, for example, MFCC feature vectors of dimension 19, extracted from 20 ms-long audio frames, and a GMM with diagonal covariance matrices. If it is assumed that a minimum of 50 samples are needed to properly estimate the GMM parameters, speech segments of at least 50 × 19 × 20 ms, or 19 s, are needed. Therefore it is not surprising that DiarTK performs poorly on all these datasets.
Table II shows that the method of  performs much better than DiarTK. This is not surprising, since the speech turns taken by the participants in the MVAD dataset are very brief. Minotto et al.  use a combination of visual features extracted from frontal views of faces (lip movements) and audio features (speech-source directions) to train an SVM classifier. The method fails whenever the participants do not face the camera, e.g. sequences Two12, Two13 and Two14, where participants purposely occlude their faces several times throughout the recordings. The method proposed in this paper in combination with TREM achieves the best results on almost all the tested scenarios. This is due to the fact that the audio-visual fusion method is capable of associating very short speech segments with one or several participants. However, the performance of our method, with either TREM or GCC-PHAT, drops as the number of people increases. This is mainly due to the limited resolution of multiple sound-source localization algorithms (of the order of horizontally), which makes it difficult to disambiguate two nearby speaking/silent persons. Notice that tracking the identity of the participants is performed by visual tracking, which is a trivial task for most of these recordings, since participants are mostly static.
Table III shows the results obtained with the AVASM dataset. In these recordings the participants speak simultaneously, with the exception of the Moving-Speaker-01 recording. We do not report results obtained with DiarTK since this method does not yield meaningful results with this dataset. The proposed method performs reasonably well in the presence of simultaneously speaking persons.
Table IV shows results obtained with the AV16P3 dataset. As with the AVASM dataset we were unable to obtain meaningful results with the DiarTK method. As expected, the proposed method has the same performance as  in the presence of a single active speaker, e.g. seq11-1p-0100 and seq15-1p-0111. Nevertheless, the performance of  rapidly degrades in the presence of two and three persons speaking almost simultaneously. Notice that this dataset was recorded to benchmark audio-visual tracking, not diarization.
Table V shows the results obtained with the AVDIAR dataset. The content of each scenario is briefly described in Table I. The proposed method outperforms all other methods. It is also interesting to notice that our full method performs better than its variants that rely on either TREM or GCC-PHAT. This is due to the robust semi-supervised audio-visual association method proposed above. Fig. 40, Fig. 46, and Fig. 52 illustrate the audio-visual diarization results obtained by our method with three scenarios. Videos illustrating the performance of the proposed method using these scenarios are available at https://team.inria.fr/perception/avdiarization/.
We proposed an audio-visual diarization method well suited for challenging scenarios consisting of participants that either interrupt each other, or speak simultaneously. In both cases, the speech-to-person association problem is a difficult one. We proposed to combine multiple-person visual tracking with multiple speech-source localization in a principled spatiotemporal Bayesian fusion model. Indeed, the diarization process was cast into a latent-variable dynamic graphical model. We described in detail the derivation of the proposed model and we showed that, in the presence of a limited number of speakers (of the order of ten), the diarization formulation is efficiently solved via an exact inference procedure. Then we described a novel multiple speech-source localization method and a weakly supervised audio-visual clustering method.
We also introduced a novel dataset, AVDIAR, that was carefully annotated and that makes it possible to assess the performance of audio-visual (or audio-only) diarization methods using scenarios that were not available with existing datasets, e.g. the participants were allowed to freely move in a room and to turn their heads towards the other participants, rather than always facing the camera. We also benchmarked our method against several other recent methods using publicly available datasets. Unfortunately, we were not able to compare our method with the methods of [2, 3] for two reasons: first, these methods require long speech segments (of the order of 10 s), and second, the associated software packages, which would have facilitated the comparison task, are not publicly available.
In the future we plan to incorporate richer visual features, such as head pose estimation and head-pose tracking, in order to facilitate the detection of speech turns on the basis of gaze or of people that look at each other over time. We also plan to incorporate richer audio features, such as the possibility to extract speech signals emitted by each participant (sound-source separation) followed by speech recognition, and hence to enable not only diarization but also speech-content understanding. Another extension is to consider distributed sensors, wearable devices, or a combination of both, in order to be able to deal with more complex scenarios involving tens of participants [44, 45].
International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39.
Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 88–95.
M. I. Mandel, R. J. Weiss, and D. P. Ellis, “Model-based expectation-maximization source separation and localization,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 382–394, 2010.
A. Deleforge, R. Horaud, Y. Y. Schechner, and L. Girin, “Co-localization of audio sources in images using binaural features and locally-linear regression,” IEEE Transactions on Audio, Speech and Language Processing, vol. 23, no. 4, pp. 718–731, 2015.