Who said that?: Audio-visual speaker diarisation of real-world meetings

06/24/2019 ∙ by Joon Son Chung, et al. ∙ NAVER Corp.

The goal of this work is to determine 'who spoke when' in real-world meetings. The method takes surround-view video and single or multi-channel audio as inputs, and generates robust diarisation outputs. To achieve this, we propose a novel iterative approach that first enrolls speaker models using audio-visual correspondence, then uses the enrolled models together with the visual information to determine the active speaker. We show strong quantitative and qualitative performance on a dataset of real-world meetings. The method is also evaluated on the public AMI meeting corpus, on which we demonstrate results that exceed all comparable methods. We also show that beamforming can be used together with the video to further improve the performance when multi-channel audio is available.






1 Introduction

In recent years, there has been a growing demand to record and search human communications in a machine-readable format. There have been significant advances in automatic speech recognition due to the availability of large-scale datasets [panayotov2015librispeech, barker2018fifth] and the accessibility of deep learning frameworks [Abadi16, paszke2017automatic, Vedaldi15], but to give the transcript more meaning beyond just a sequence of words, the information on ‘who spoke when’ is crucial.

Speaker diarisation, the task of breaking up multi-speaker audio into single-speaker segments, has been an active field of study over the years. Speaker diarisation is mostly addressed as a single-modality problem where only the audio is used, but there are also a number of papers that have used additional modalities such as video. Previous works on speaker diarisation, both audio and audio-visual, can be divided into two strands.

The first is based on speaker modelling (SM), which uses the assumption that each individual has different voice characteristics. Traditionally, speaker models are constructed with Gaussian mixture models (GMMs) and i-vectors [dehak2011front, cumani2013probabilistic, matvejka2011full], but more recently deep learning has been proven effective for speaker modelling [variani2014deep, lei2014novel, ghalehjegh2015deep, snyder2017deep, snyder2018x]. In many systems, the models are pre-trained for the target speakers [hung2008towards, biagetti2016robust] and are not applicable to unknown participants. Other algorithms are capable of adapting to unseen speakers by using generic models and clustering [friedland2012icsi, sell2018diarization]. There are also a number of works in the audio-visual domain that are based on feature clustering [friedland2009multi, sarafianos2016audio].

The second strand uses a technique referred to as sound source localisation (SSL), which is claimed to demonstrate better performance than the SM-based approaches according to a recent study [rozgic2010multimodal], particularly with powerful beamforming methods such as SRP-PHAT [dibiase2000high]. However, SSL-based methods are only effective when the locations of the speakers are either fixed or known. Therefore SSL has been used as part of audio-visual methods, where the location of the identities can be tracked using the visual information [schmalenstroeer2009fusing]. This approach is dependent on the ability to effectively track the participants. A recent paper [cabanas2018multimodal] combines SSL with a visual analysis module that measures motion and lip movements, which is relevant to our work.

A number of works have combined SM and SSL approaches using independent models for each type of observation, then fused this information within a probabilistic framework based on the Viterbi algorithm [schmalenstroeer2009fusing] or with Bayesian filtering [rozgic2010multimodal].

In this paper, we present an audio-visual speaker diarisation system based on self-enrollment of speaker models that is able to handle movements and occlusions. We first use a state-of-the-art deep audio-visual synchronisation network to detect speaking segments from each participant when the mouth motion is clearly visible. This information is used to enroll speaker models for each participant, which can then determine who is speaking even when the speaker is occluded. By generating speaker models for each participant, we are able to reformulate the task from an unsupervised clustering problem into a supervised classification problem, where the probability of a speech segment belonging to each participant can be estimated. In contrast to the previous works that compute likelihoods for each type of observation before the multi-modal fusion, the audio-visual synchronisation is used in the self-enrollment process. Finally, when a multi-channel microphone is available, beamforming is employed to estimate the location of the sound source, then the spatial cues from both modalities are used to further improve the system’s performance. The effectiveness of the method is demonstrated on an internal dataset of real-world meetings and the public AMI corpus.

This paper is organised as follows. In Section 2.1, we first describe the audio-only baseline system based on state-of-the-art methods for speech enhancement, activity detection and speaker diarisation. Section 2.2 introduces the proposed audio-visual system. Finally, Section 3 describes the datasets and the experiments in which we demonstrate the effectiveness of our method on the public AMI dataset.

2 System description

2.1 Audio-only baseline

The baseline system provided for the second DIHARD challenge is used as our audio-only baseline. The system takes key components from the top-scoring systems in the first DIHARD challenge and shows state-of-the-art performance on audio-only diarisation.

2.1.1 Speech enhancement

The speech enhancement is based on the system used by USTC-iFLYTEK in their submission to the first DIHARD challenge [sun2018speaker]. The system uses a long short-term memory (LSTM) based speech denoising model trained on simulated training data. It has demonstrated significant improvements in deep learning-based single-channel speech enhancement over the state-of-the-art, and the authors have shown its effectiveness for diarisation with a second-place result in the first DIHARD challenge.

2.1.2 Speech activity detection

The speech activity detection baseline uses WebRTC [johnston2012webrtc] operating on enhanced audio processed by the speech enhancement baseline.

2.1.3 Speaker embeddings and diarisation

The diarisation system is based on the JHU Sys4 used in their winning entry to DIHARD I, with the exception that it omits the Variational-Bayes refinement step. Speech is segmented into 1.5 second windows with 0.75 second hops, 24 MFCCs are extracted every 10ms, and a 256-dimensional x-vector is extracted for each segment. The extracted vectors are scored with PLDA (trained with segments labelled only for one speaker) and clustered with AHC (average score combination at merges).
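The segmentation step above can be illustrated with a minimal sketch. It only shows how 1.5-second windows with 0.75-second hops tile a recording; the function name `sliding_windows` is hypothetical, and dropping a trailing remainder shorter than one window is an assumption, not something the baseline description specifies.

```python
def sliding_windows(duration_s, window_s=1.5, hop_s=0.75):
    """Return (start, end) times in seconds of fixed-length analysis windows.

    Assumption: a trailing remainder shorter than one window is dropped.
    """
    windows = []
    i = 0
    # Small tolerance so a window ending exactly at the signal end is kept.
    while i * hop_s + window_s <= duration_s + 1e-9:
        start = i * hop_s
        windows.append((start, start + window_s))
        i += 1
    return windows


if __name__ == "__main__":
    for start, end in sliding_windows(3.0):
        print(f"{start:.2f}s - {end:.2f}s")
```

With MFCCs extracted every 10 ms, each 1.5-second window covers 150 feature frames, and consecutive windows overlap by half.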

The x-vector extractor and PLDA parameters were trained on the VoxCeleb [Nagrani17] and VoxCeleb2 [Chung18a] datasets with data augmentation (additive noise), while the whitening transformation was learned from the DIHARD I development set [sell2018diarization]. We use the pre-trained model released by the organisers of the DIHARD challenge.

The system is not designed to handle overlapped speech, and additional speakers are counted as missed speech in evaluation.

2.2 Multi-modal diarisation

Figure 1: Pipeline overview.

The audio processing part of the audio-visual system shares most of the baseline methods described above: the speech enhancement and speech activity detection modules are identical to those in the baseline system, and for experiments on the AMI corpus, we also use the pre-trained x-vector model used by the JHU system to extract speaker embeddings.

Three modes of information are used to determine the current speaker in the video. The pipeline is summarised in Figure 1 and described in the following paragraphs.

2.2.1 Audio-to-video correlation

Cross-modal embeddings of the audio and the mouth motion are used to represent the respective signals. The strategy to train this joint embedding is described in [chung2018perfect], but we give a brief overview here.

The network consists of two streams: the audio stream that encodes Mel-frequency cepstral coefficient (MFCC) inputs into 512-dimensional vectors; and the video stream that encodes cropped face images also into 512-dimensional vectors. The network is trained as a multi-way matching task between one video clip and $N$ audio clips. Euclidean distances between the audio and video features are computed, resulting in $N$ distances. The network is trained with a cross-entropy loss on the inverse of this distance after passing through a softmax layer, so that the similarity between matching pairs is greater than that between non-matching pairs.
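The training objective described above, in which one video clip is matched against several candidate audio clips, can be sketched as follows. This is an illustration under stated assumptions rather than the training code of [chung2018perfect]: "inverse of the distance" is read here as the reciprocal, the small constant `eps` is added only to avoid division by zero, and the function name is hypothetical.

```python
import math


def multiway_matching_loss(video_emb, audio_embs, target_index, eps=1e-6):
    """Cross-entropy over a softmax of reciprocal Euclidean distances.

    video_emb: one video embedding; audio_embs: candidate audio embeddings;
    target_index: position of the matching audio clip.
    Assumption: "inverse" is taken as 1 / (distance + eps).
    """
    distances = [math.dist(video_emb, a) for a in audio_embs]
    logits = [1.0 / (d + eps) for d in distances]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target_index]), probs


if __name__ == "__main__":
    video = [0.0, 0.0]
    audios = [[0.1, 0.0], [3.0, 4.0], [1.0, 1.0]]
    loss, probs = multiway_matching_loss(video, audios, target_index=0)
    print(loss, probs)
```

The loss is small when the matching audio clip is the closest one to the video embedding and grows when a non-matching clip is closer.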

The cosine distance between the two embeddings is used to measure correspondence between the two inputs. Therefore, we expect a small distance between the features if the face image corresponds to the current speaker and is in sync, and a large distance otherwise. Since the video is from a single continuous source, we assume that the AV offset is fixed throughout the session. The embedding distance is smoothed over time using a median filter in order to eliminate outliers.
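The temporal smoothing can be illustrated with a simple sliding median. The kernel size is not given in the text, so the value used here is a placeholder, and shrinking the window at the sequence boundaries is one of several possible conventions.

```python
from statistics import median


def median_smooth(values, kernel=5):
    """Sliding median filter; the window shrinks at the sequence boundaries.

    Assumption: `kernel` (in frames) is a hypothetical setting.
    """
    half = kernel // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


if __name__ == "__main__":
    # A single spurious spike in the per-frame AV distance is removed.
    distances = [0.2, 0.2, 0.9, 0.2, 0.2]
    print(median_smooth(distances, kernel=3))
```

A median is preferred over a mean for this purpose because an isolated outlier does not leak into neighbouring frames.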

2.2.2 Speaker verification

We develop speaker models for each individual (identified in Sec. 2.4.2) so that the active speaker can be determined even when audio-visual synchronisation cannot be established due to occlusion.

The audio-visual pipeline (Sec. 2.2.1) is run over the whole video in advance, in order to determine the $N$ most confident speaking segments (each of 1.5 seconds) for each identity. In our case we use $N=10$, and if there are fewer than $N$ confident segments above an AV correlation threshold, we only use the segments whose correlation is above the threshold. These are used to enroll the speaker models.
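A minimal sketch of this selection rule is given below. The threshold value and the convention that a higher score means stronger AV correspondence are assumptions made for illustration; only the limit of ten segments comes from the text.

```python
def select_enrollment_segments(segments, n=10, threshold=0.5):
    """Pick up to `n` most confident speaking segments for one identity.

    segments: list of (start_time_s, av_confidence) pairs.
    Assumptions: higher confidence means better AV correspondence, and
    `threshold` is a hypothetical value that would be tuned on validation data.
    """
    confident = [s for s in segments if s[1] > threshold]
    confident.sort(key=lambda s: s[1], reverse=True)
    # If fewer than n segments pass the threshold, all of them are used.
    return confident[:n]


if __name__ == "__main__":
    candidates = [(0.0, 0.9), (1.5, 0.4), (3.0, 0.7), (4.5, 0.95)]
    print(select_enrollment_segments(candidates, n=2))
```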

For the experiments on the AMI dataset, we use the x-vector network (described in Sec. 2.1.3) to extract speaker embeddings, so that the results can be compared like-for-like to the baseline.

For the experiments on the internal meeting dataset, we use a deeper ResNet-50 model [He16] also trained on the same data as the baseline. The deeper model is used here since its features generalise better to this more challenging dataset compared to the shallower x-vector model.

At test time, speaker embeddings are extracted by computing features over a 1.5-second window, moving 0.75 seconds at a time, in line with the baseline system. By comparing the embeddings at each timestep to the enrolled speaker models, the likelihood of the speech segment belonging to any individual can be estimated. Even without any visual information at inference time, this now becomes a supervised classification problem, which is typically more robust than unsupervised clustering.
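The classification step can be sketched as follows. The text does not state how enrolled embeddings are aggregated or scored, so averaging the enrollment embeddings and scoring with cosine similarity are assumptions made for this example; all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def enroll(enrollment_embeddings):
    """Build one model per speaker by averaging its enrollment embeddings.

    Assumption: mean pooling; the paper does not specify the aggregation.
    """
    models = {}
    for speaker, vectors in enrollment_embeddings.items():
        dim = len(vectors[0])
        models[speaker] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return models


def classify(embedding, models):
    """Return the best-matching speaker and the per-speaker scores."""
    scores = {spk: cosine_similarity(embedding, m) for spk, m in models.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    models = enroll({"A": [[1.0, 0.0], [0.9, 0.1]], "B": [[0.0, 1.0]]})
    print(classify([0.8, 0.2], models))
```

Because every test segment is scored against a fixed set of enrolled models, the number of speakers does not need to be estimated, unlike in clustering.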

2.2.3 Sound source localisation

Besides the speaker embeddings, the direction of the sound source can provide useful cues on who is speaking.

Recordings from the GoPro camera's 4-channel microphone can be converted to Ambisonics B-format using the GoPro Fusion Studio software. By solving the B-format representation for azimuth and elevation, the direction of the audio source can be estimated for each audio sample. The direction for every video frame is then determined from a histogram of these per-sample estimates, computed over a fixed-length time window with a fixed angular bin size.

For the AMI videos, the Time Delay of Arrival (TDOA) information is calculated using the BeamformIt [anguera2007acoustic] package. As with the internal dataset, the direction of arrival is computed from a histogram of values over a fixed-length time window. However, only 4 bins are used, since the video is split over 4 cameras and the exact geometry between them is unknown.
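The histogram step can be sketched as follows: per-sample direction estimates within a window are binned, and the centre of the most populated bin is taken as the direction for that frame. The default bin size below is a placeholder, not a value taken from the text, and the function name is hypothetical.

```python
def dominant_direction(angles_deg, bin_size_deg=10.0):
    """Return the centre (in degrees) of the most populated angular bin.

    Assumption: `bin_size_deg` is a hypothetical value and should divide 360.
    """
    if not angles_deg:
        raise ValueError("no direction estimates in this window")
    n_bins = int(round(360.0 / bin_size_deg))
    counts = [0] * n_bins
    for angle in angles_deg:
        # Wrap into [0, 360) and guard against floating-point edge cases.
        index = int((angle % 360.0) // bin_size_deg) % n_bins
        counts[index] += 1
    best = max(range(n_bins), key=counts.__getitem__)
    return (best + 0.5) * bin_size_deg


if __name__ == "__main__":
    window = [12.0, 14.0, 15.0, 18.0, 200.0, 355.0]
    print(dominant_direction(window))                     # fine bins
    print(dominant_direction(window, bin_size_deg=90.0))  # 4 coarse bins
```

Taking the histogram mode rather than the mean keeps the estimate stable when a minority of samples point at a reflection or a second source.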

The likelihood of the audio belonging to any person at a given time correlates to the angle between the estimated audio source and the face detection in the video for the identity in question.

2.3 Multi-modal fusion

Each of the three modalities (AV correlation, speaker models, direction of audio) gives a confidence score for each speaker and timestep. These scores are combined into a single confidence score for every speaker and timestep using a simple weighted fusion of three terms: the confidence score from the speaker model, the score from the AV correspondence, and a term that depends on the direction of the face and the estimated direction of arrival (DoA) of the audio. When the identity is not visible on the camera, the second and third terms are set to zero.
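One plausible form of such a weighted fusion is sketched below. The exact expression is not recoverable from the text, so everything specific here is an assumption: the unit weight on the speaker-model term, the weights `w_avc` and `w_ssl`, and the use of the cosine of the angular difference as the direction term.

```python
import math


def fused_score(c_sm, c_avc, face_dir_deg, doa_deg, visible, w_avc=0.5, w_ssl=0.5):
    """Weighted fusion of three confidence terms for one speaker and timestep.

    Assumptions (not from the paper): the weights, and the direction term
    cos(face direction - audio DoA), which is 1 when the two agree.
    """
    score = c_sm
    if visible:
        score += w_avc * c_avc
        score += w_ssl * math.cos(math.radians(face_dir_deg - doa_deg))
    # When the identity is not visible, only the speaker-model term remains.
    return score


if __name__ == "__main__":
    print(fused_score(0.4, 0.6, 90.0, 90.0, visible=True))
    print(fused_score(0.4, 0.6, 90.0, 90.0, visible=False))
```

In such a scheme the active speaker at each timestep would be the identity with the highest fused score, with the weights tuned on a validation set.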


2.4 Implementation details

2.4.1 Face detection and tracking

A CNN face detector based on the Single Shot MultiBox Detector (SSD) [Liu16] is used to detect face appearances on every frame of the video. This detector allows faces to be tracked across a wide range of poses and lighting conditions. A position-based face tracker is used to group individual face detections into face tracks.

2.4.2 Face recognition

The method requires face images for each participant so that they can be identified and tracked regardless of their position in the room. These can come from user input or from profile images. The face images for all participants are supplied to the VGGFace2 [cao2017vggface2] network, and their embeddings are stored. For each face track detected (Sec. 2.4.1), face embeddings are extracted using the VGGFace2 network and compared to each of the stored embeddings, so that the track can be classified as one of the known identities. We apply the constraint that co-occurring face tracks at any point in time cannot be of the same identity.
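The co-occurrence constraint can be illustrated with a greedy assignment over a set of face tracks that overlap in time. The greedy strategy is an assumption made for brevity (an optimal one-to-one assignment would be an alternative), and the function name and score convention are hypothetical.

```python
def assign_identities(similarity):
    """Assign each co-occurring face track to a distinct identity.

    similarity: dict mapping track id -> dict mapping identity -> score,
    where a higher score means the track looks more like that identity.
    Assumption: greedy matching, highest-scoring pairs first.
    """
    pairs = sorted(
        ((score, track, identity)
         for track, row in similarity.items()
         for identity, score in row.items()),
        reverse=True,
    )
    assigned, used = {}, set()
    for score, track, identity in pairs:
        if track not in assigned and identity not in used:
            assigned[track] = identity
            used.add(identity)
    return assigned


if __name__ == "__main__":
    sim = {
        "track1": {"alice": 0.90, "bob": 0.80},
        "track2": {"alice": 0.85, "bob": 0.30},
    }
    # Both tracks resemble "alice" most, but only one of them can be her.
    print(assign_identities(sim))
```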

| Model | Dataset | Subset | Input | System VAD: MS | System VAD: FA | System VAD: SPKE | System VAD: DER | Reference VAD: MS | Reference VAD: FA | Reference VAD: SPKE | Reference VAD: DER |
|---|---|---|---|---|---|---|---|---|---|---|---|
| JHU Baseline [sell2018diarization] | ES | All | 1ch | 10.5 | 6.6 | 12.8 | 30.0 | 5.6 | 0.0 | 12.2 | 17.8 |
| Ours (SM) | ES | All | 1ch+V | 10.5 | 6.6 | 6.7 | 23.8 | 5.6 | 0.0 | 7.9 | 13.5 |
| Ours (SM+AVC) | ES | All | 1ch+V | 10.5 | 6.6 | 4.0 | 21.1 | 5.6 | 0.0 | 4.8 | 10.4 |
| Ours (SM+AVC+SSL) | ES | All | 8ch+V | 10.5 | 6.6 | 2.8 | 19.9 | 5.6 | 0.0 | 3.6 | 9.2 |
| Cabanas et al. [cabanas2018multimodal] | ES | WB | 8ch+V | - | - | - | 27.2 | - | - | - | - |
| Ours (SM+AVC) | ES | WB | 1ch+V | 11.4 | 7.1 | 4.9 | 23.3 | 6.1 | 0.0 | 5.9 | 12.0 |
| Ours (SM+AVC+SSL) | ES | WB | 8ch+V | 11.4 | 7.1 | 3.8 | 22.3 | 6.1 | 0.0 | 4.9 | 10.9 |
| Cabanas et al. [cabanas2018multimodal] | ES | NWB | 8ch+V | - | - | - | 20.6 | - | - | - | - |
| Ours (SM+AVC) | ES | NWB | 1ch+V | 9.5 | 5.7 | 2.7 | 17.8 | 5.1 | 0.0 | 3.3 | 8.4 |
| Ours (SM+AVC+SSL) | ES | NWB | 8ch+V | 9.5 | 5.7 | 1.4 | 16.6 | 5.1 | 0.0 | 1.9 | 7.0 |
| JHU Baseline [sell2018diarization] | IS | All | 1ch | 11.2 | 4.0 | 10.2 | 25.4 | 6.5 | 0.0 | 11.2 | 17.7 |
| Ours (SM) | IS | All | 1ch+V | 11.2 | 4.0 | 7.6 | 22.9 | 6.5 | 0.0 | 8.8 | 15.3 |
| Ours (SM+AVC) | IS | All | 1ch+V | 11.2 | 4.0 | 6.2 | 21.3 | 6.5 | 0.0 | 7.1 | 13.6 |
| Ours (SM+AVC+SSL) | IS | All | 8ch+V | 11.2 | 4.0 | 4.9 | 20.0 | 6.5 | 0.0 | 5.8 | 12.3 |
| Cabanas et al. [cabanas2018multimodal] | IS | WB | 8ch+V | - | - | - | 32.3 | - | - | - | - |
| Ours (SM+AVC) | IS | WB | 1ch+V | 13.3 | 5.1 | 7.7 | 26.1 | 7.9 | 0.0 | 8.9 | 16.9 |
| Ours (SM+AVC+SSL) | IS | WB | 8ch+V | 13.3 | 5.1 | 6.5 | 24.8 | 7.9 | 0.0 | 7.8 | 15.7 |
| Cabanas et al. [cabanas2018multimodal] | IS | NWB | 8ch+V | - | - | - | 21.7 | - | - | - | - |
| Ours (SM+AVC) | IS | NWB | 1ch+V | 9.3 | 2.8 | 4.8 | 16.8 | 5.3 | 0.0 | 5.4 | 10.6 |
| Ours (SM+AVC+SSL) | IS | NWB | 8ch+V | 9.3 | 2.8 | 3.4 | 15.5 | 5.3 | 0.0 | 4.0 | 9.3 |
| JHU Baseline [sell2018diarization] | Internal | - | 1ch | 1.8 | 4.5 | 72.2 | 78.6 | 0.0 | 0.0 | 73.3 | 73.3 |
| Ours (SM) | Internal | - | 1ch+V | 1.8 | 4.5 | 24.8 | 31.1 | 0.0 | 0.0 | 25.6 | 25.6 |
| Ours (SM+AVC) | Internal | - | 1ch+V | 1.8 | 4.5 | 18.7 | 25.0 | 0.0 | 0.0 | 19.4 | 19.4 |
| Ours (SM+AVC+SSL) | Internal | - | 8ch+V | 1.8 | 4.5 | 13.1 | 19.4 | 0.0 | 0.0 | 13.7 | 13.7 |

Table 1: Diarisation results in % (lower is better). The results are on the AMI dataset except for the last four rows. WB: Whiteboard; NWB: No whiteboard; nch+V: n-channel audio + video; SM: Speaker Modelling; AVC: Audio Visual Correspondence; SSL: Sound Source Localisation; MS: Missed Speech; FA: False Alarm; SPKE: Speaker Error; DER: Diarisation Error Rate.

3 Experiments

The proposed method is evaluated on two independent datasets: our internal dataset of meetings recorded with a 360° camera, and the publicly available AMI meeting corpus. Each is described in the following paragraphs.

3.1 Internal meeting dataset

The internal meeting dataset consists of audio-visual recordings of regular meetings in which no particular instructions are given to the participants with regard to the recording of the video. The meetings form part of daily discussions from the workspace of the authors and are not set up in any way with the diarisation task in mind. A large proportion of the dataset consists of very short utterances with frequent speaker changes, providing an extremely challenging condition.

The video is recorded using a GoPro Fusion camera, which captures videos of the meeting with two fish-eye lenses. The videos are stitched together into a single surround-view video of 5228x2624 resolution at 25 frames per second. The audio is recorded using a 4-channel microphone at 48 kHz. A still image from the dataset is shown in Figure 2.

The dataset contains a validation set of approximately 3 hours and a carefully annotated test set of 40 minutes. The test video contains 9 speakers. In the case of overlapped speech, we only annotated the ID of the main (loudest) speaker. The embedding extractor and the AV synchronisation network are trained on external datasets, and the validation set is only used for tuning the AHC threshold in the baseline system and the fusion weights in the proposed system.

Figure 2: Still image from the internal meeting dataset.
Figure 3: Still images from the AMI corpus.

3.2 AMI corpus

The AMI corpus consists of 100 hours of video recorded across a number of locations and has been used by many previous works on audio-only and audio-visual diarisation. Of the 100 hours of video, we evaluate on meetings in the ES (Edinburgh) and IS (Idiap) categories, which contain approximately 30 and 17 hours of video respectively. Of the IS videos, IS1002a, IS1003b, IS1005d and IS1007d were not used in the experiments due to partially missing data. The image quality is relatively low, with a video resolution of 288x352 pixels.

The audio is recorded from an 8-element circular equi-spaced microphone array with a diameter of 20cm. However, we only use one microphone from the array in most of our experiments. The video is recorded with 4 cameras providing close-up views of each of the meeting’s participants, and unlike the internal dataset (Sec. 3.1), the images are not stitched together.

The ES videos are used as the validation set for tuning the thresholds.

3.3 Evaluation metric

We use Diarisation Error Rate (DER) as our performance metric. The DER can be decomposed into three components: missed speech (MS, speaker in reference, but not in hypothesis), false alarm (FA, speaker in hypothesis, but not in reference) and speaker error (SPKE, speaker ID is assigned to the wrong speaker).
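The decomposition can be made concrete with a simplified frame-level sketch. It is not the NIST scoring tool: it assumes at most one speaker label per frame, hypothesis labels that are already mapped to reference speakers, and no acceptance margin around segment boundaries.

```python
def diarisation_error(reference, hypothesis):
    """Frame-level MS, FA, SPKE and DER as fractions of reference speech.

    reference, hypothesis: equal-length lists with one speaker label per
    frame, or None for non-speech.
    Assumptions: single-speaker frames, pre-mapped labels, no collar.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    speech = sum(r is not None for r in reference)
    if speech == 0:
        raise ValueError("reference contains no speech")
    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    speaker_error = sum(
        r is not None and h is not None and r != h for r, h in pairs
    )
    return {
        "MS": missed / speech,
        "FA": false_alarm / speech,
        "SPKE": speaker_error / speech,
        "DER": (missed + false_alarm + speaker_error) / speech,
    }


if __name__ == "__main__":
    ref = ["A", "A", "B", "B", None, None, "A", "A", "A", "A"]
    hyp = ["A", None, "B", "A", None, "B", "A", "A", "A", "A"]
    print(diarisation_error(ref, hyp))
```

Since all three components share the same denominator, the DER is simply their sum, which is why a change in the diarisation back-end alone shows up only in the SPKE term when the speech activity detector is fixed.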

The tool used for evaluating the system is the one developed for the RT Diarization evaluations by NIST [istrate2005nist], and includes an acceptance margin of 250 ms to compensate for human errors in reference annotation.

3.4 Results

Results on the AMI corpus [carletta2005ami] are given in Table 1. The numbers for meetings where the whiteboard is used are provided separately, so that the results can be compared to [cabanas2018multimodal].

Missed speech and false alarm rates are the same across different models for each dataset since we use the same VAD system in all of our experiments. Therefore the speaker error rate (SPKE) is the only metric affected by the diarisation system.

Our speaker model only system (SM) uses the visual information only to find out when to enroll the speaker models, and during inference only uses the audio. Since the audio processing pipeline and the embedding extractor are common across our system and the JHU-based baseline, the performance gain arises from changing a clustering problem into a classification problem. This alone results in 48% and 26% relative improvement in speaker error on the ES and IS sets, respectively.

It is also clear from the results that the addition of the AV correspondence (AVC) and sound source localisation (SSL) at inference time both provide a boost to the performance. The contributions of these modalities to overall relative performance are 20-40% and 19-39% respectively, depending on the test set.

Note that our results exceed the recent audio-visual method of [cabanas2018multimodal] across all test conditions by a significant margin, whilst using the same input modalities. [friedland2010dialocalization] also reports competitive results on a subset of the IS videos (SPKE of 7.3%, DER of 19.5% using 4 cameras and 8 microphones); however, the results cannot be compared directly to our work since some of the test videos were no longer available at the time of writing.

The speaker error rates are markedly worse on the internal meeting dataset, presumably due to the more challenging nature of the dataset and the larger number of speakers. From the results in Table 1, it can be seen that the baseline system does not generalise to this dataset, but the proposed multi-modal systems perform relatively well on this ‘in the wild’ data.

4 Conclusion

In this paper, we have introduced a multi-modal system which takes advantage of audio-visual correspondence to enroll speaker models. We have shown that speaker modelling with audio-visual enrollment has significant advantages over the clustering methods typically used for diarisation. Areas for further research include learnable methods for multimodal fusion, improvements to the speech activity detection (SAD) modules and the combination of audio-visual diarisation and audio-visual speech separation for meeting transcription and for handling overlapped speech.

Acknowledgment. We would like to thank Chiheon Ham, Han-Gyu Kim, Jaesung Huh, Minjae Lee, Minsub Yim, Soyeon Choe and Soonik Kim for helpful comments and discussion.