Speaker diarisation is the challenging task of breaking up multi-speaker video into homogeneous single speaker segments, effectively solving “who spoke when”. Beyond being an interesting research problem in itself, it is also a valuable pre-processing step for a number of applications, including speech-to-text.
While state-of-the-art diarisation systems perform remarkably well for speech from constrained domains (e.g. conversational telephone speech [sell2014speaker, zhu2016online, garcia2017speaker, zhang2019fully] or meeting speech [yella2013improved]), this success does not transfer to more challenging conditions found in online videos ‘in the wild’. The challenges here include the lack of a fixed domain (videos can be from talk shows, news broadcasts, celebrity interviews, home vlogs), a large number of speakers (some of whom are off-screen), short rapid exchanges with cross-talk, and background degradation consisting of channel noise, laughter and applause.
These conditions make manual annotation of online videos a daunting task for human annotators, leading to a dearth of large-scale public diarisation datasets of unconstrained speech. While large-scale evaluations are held regularly by the National Institute of Standards and Technology (the NIST-RTE series), these are limited to constrained audio-only datasets, which are not freely available to the research community (Table 1).
To attempt to remedy some of these issues, the DIHARD challenges [sell2018diarization, ryant2019second] were introduced in 2018. These are valuable annual challenges that cover 11 different data domains, including mother-child conversations, meetings and courtroom settings. One of these domains is web videos; however, the amount of data is limited (only 2 hours). The datasets are also audio-only, and are only available to challenge participants (not released freely to the research community).
A large-scale diarisation dataset of videos ‘in the wild’ would encourage the development of new audio-visual diarisation techniques that deal with unconstrained conditions. Inspired by the recent success of automatic audio-visual dataset creation pipelines (VoxCeleb [Nagrani17, Chung18a, nagrani2020voxceleb], VGGSound [vedaldi2020vggsound]), we propose a scalable, audio-visual method for speaker diarisation in web videos. Our method relies heavily on the recent successes of active speaker detection [Chung16a] and face and speaker verification [Parkhi15, Schroff15, Xie19a]. We then integrate this method into a semi-automatic dataset creation pipeline – consisting of both automatic annotation and manual verification. We use this pipeline to curate VoxConverse, a challenging and diverse speaker diarisation dataset from ‘in the wild’ videos.
Our automatic diarisation method exploits the following three key ideas. First, the speech for on-screen identities can be accurately segmented automatically using active speaker detection and then identified using face recognition, the core basis for the VoxCeleb pipeline [nagrani2020voxceleb]. Second, there has been great progress in creating audio-visual speech enhancement models [Afouras18, Ephrat18], which separate overlapping speech into single speaker streams. Given the amount of cross-talk and background noise in web videos, we use such a model to better isolate and identify speaker identities. These two ideas allow us to accurately identify and isolate speech for on-screen speakers. Finally, to accurately recognise off-screen speakers, we utilise state-of-the-art speaker recognition embeddings that verify identities from audio alone (Figure 1).
Concretely, we make the following three contributions: (i) We create an automatic audio-visual diarisation method using active speaker detection, face recognition, speech enhancement and audio-only speaker recognition. (ii) We integrate our method into a semi-automatic dataset creation pipeline which consists of human annotation and automatic diarisation. Our pipeline is scalable, and significantly reduces the number of hours required to annotate videos. (iii) We use this pipeline to curate VoxConverse, a challenging ‘in the wild’ audio-visual diarisation dataset. We compare our audio-visual diarisation method to existing audio-only baselines on our dataset, and show that large performance gains can be obtained from integrating visual information.
2 Related Works
Speaker diarisation has been an active field of research for many years, but remains one of the most challenging tasks in speech processing. Deep learning techniques have not been applied to speaker diarisation to the same degree that they have for other tasks, partially due to the lack of end-to-end models for diarisation, but also due to the lack of diverse, large-scale datasets like ImageNet[Deng09] and VoxCeleb [Nagrani17].
Much of the progress in the field has been driven by a series of NIST Rich Transcription challenges (NIST-RTE), which focus almost solely on the meeting domain. The series also proposed the diarisation error rate (DER) as an evaluation metric for speaker diarisation, which is now used as the primary metric across all domains and evaluations. Research into speaker diarisation has largely evolved independently for different domains, with broadcast news [tranter2004speaker], telephone speech [canavan1997callhome], and meetings [janin2003icsi, carletta2005ami] being the most popular domains. For each domain, specific datasets have been introduced and used, all created by manual annotation. We provide a summary in Table 1.
The DIHARD series of challenges [sell2018diarization, ryant2019second] were introduced to overcome the domain dependency in the field – the data consists of recordings from different conversational domains, including audiobooks, broadcast interviews, child speech and so on. The evaluation conditions are challenging, and even the best performing systems score relatively high diarisation error rates of around 20% with ground truth voice activity detector (VAD), and 30% with system VAD. The annotation is performed with very fine granularity, which allows evaluation without a forgiveness collar. Barring DIHARD, all other datasets and evaluations include a generous forgiveness collar and exclude overlapping speech from scoring. Inspired by DIHARD, we annotate overlapping speech in VoxConverse and include it in evaluation. For almost all existing datasets, annotation is done manually and solely using the audio. Annotation without visual information is challenging, particularly when the number of speakers is large, since it is easy to be confused between voices without the additional identity redundancy provided by the face. Unlike other works, our dataset creation pipeline is semi-automatic, scalable and audio-visual.
|Dataset||Domain||Freely available||Annotation|
|2005 NIST RTE||Meetings||✗||Manual|
|AMI Meeting Corpus [carletta2005ami]||Meetings||✓||Manual|
|ICSI Meeting Corpus [janin2003icsi]||Meetings||✓||Manual|
|Fisher I and II [cieri2004fisher]||Telephony||✗||Manual|
|DIHARD [sell2018diarization, ryant2019second]||Mixed||✗||Manual|
3 Dataset description
The development set of VoxConverse consists of 217 multispeaker videos covering 1,222 minutes with 7,903 speaker turns annotated. The test set contains approximately 1,800 mins (30 hours) and will be released after the VoxCeleb Speaker Recognition Challenge in October 2020. The statistics of the development dataset can be seen in Table 2.
Videos included in the dataset are shot in a large number of challenging multi-speaker acoustic environments, including political debates, panel discussions, celebrity interviews, comedy news segments and talk shows. This provides a number of background degradations, including dynamic environmental noise with some speech-like characteristics, such as laughter and applause. Our dataset is audio-visual, and contains face detections and tracks as part of the annotation.
The videos in the dataset contain quick, short speech segments. On average, 94% of the video time contains speech, and 2–3% of this contains speech where one speaker overlaps with another speaker. The overlap percentage varies between videos; one video, for example, has an overlap percentage of 27.4%. Videos vary in length from 22 seconds to 18 minutes. Unlike other domains such as telephony, each video has on average between 4 and 5 speakers, with one video in the dataset having 19 speakers.
|set||# videos||# mins||# spks||# speaker turns||video duration (s)||speech %||overlap %|
|Dev||217||1,222||1 / 4.5 / 19||7,903||22.1 / 337.8 / 1097.4||11.1 / 93.8 / 99.8||0 / 2.9 / 27.4|
|Test||To be released in October 2020|

(Entries of the form a / b / c give the minimum / average / maximum per video.)
4 Dataset collection
The dataset collection process consists of two stages – initial annotations are generated automatically using our proposed audio-visual method, and the annotations are then checked and refined by humans.
4.1 Automatic pipeline
The automatic computer vision pipeline to curate VoxConverse is similar to that used to compile VoxCeleb1 [Nagrani17] and VoxCeleb2 [Chung18a].
Stage 1. Collection of videos. The first stage is to obtain a list of videos. We start from a number of keywords including ‘panel debate’ and ‘discussion’ in order to obtain videos where multiple people are talking alternately or at the same time. The list of videos is obtained by searching for the keywords on YouTube, and duplicate URLs that appear in the search results of multiple keywords are removed. The resulting list spans a range of videos, from US presidential debates and talk shows to documentaries.
Stage 2. Shot detection. Shot boundaries are then determined to find within-shot frames for which face tracking is to be run. The boundaries are found by comparing intensity and brightness across consecutive frames [pyscenedetect].
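As an illustration of this stage, a minimal frame-differencing detector in the spirit of [pyscenedetect] can be sketched as follows. The function name and threshold are our own for illustration; the actual pipeline uses the PySceneDetect tool, which also compares brightness in a colour space rather than raw greyscale intensity.

```python
import numpy as np

def detect_shot_boundaries(frames, threshold=30.0):
    """Mark a shot boundary wherever the mean absolute intensity
    change between consecutive frames exceeds `threshold`.

    frames: iterable of greyscale frames as 2-D uint8 arrays.
    Returns a list of frame indices at which a new shot starts.
    """
    boundaries = []
    prev = None
    for i, frame in enumerate(frames):
        f = frame.astype(np.float32)
        if prev is not None:
            # Average per-pixel intensity difference to the previous frame.
            if np.abs(f - prev).mean() > threshold:
                boundaries.append(i)
        prev = f
    return boundaries
```

Face tracking is then run only within each detected shot, since tracks cannot persist across a cut.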
Stage 3. Face detection and tracking. A CNN face detector based on the Single Shot Scale-invariant Face Detector (S3FD) [zhang2017s3fd] is used to detect faces on every frame of the video. This detector allows the detection of faces at various scales and poses. Within each shot, face detections are grouped together into face tracks using a position-based tracker, as in [Nagrani17, Chung18a].
Stage 4. Face-track clustering. A face recognition CNN is used to extract embeddings for every face track. The network used here is based on ResNet-50 [He16] trained on the VGGFace2 dataset. The embeddings are extracted 5 times per face track at uniform intervals, and then averaged. The embeddings are clustered using Agglomerative Hierarchical Clustering [day1984efficient], but a large penalty is added to the distance matrix between overlapping face tracks so that they are never clustered together.
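The overlap penalty can be sketched as follows, assuming L2-normalised track embeddings and using SciPy's hierarchical clustering. The distance threshold and penalty value here are illustrative, not the values used in the actual pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_face_tracks(embeddings, intervals, dist_threshold=0.5, penalty=1e6):
    """Agglomeratively cluster L2-normalised face-track embeddings,
    forbidding merges between temporally overlapping tracks.

    embeddings: (N, D) array, one averaged embedding per face track.
    intervals:  list of (start, end) frame spans, one per track.
    Returns an array of N cluster labels.
    """
    n = len(embeddings)
    # Cosine distance matrix between track embeddings.
    dist = 1.0 - embeddings @ embeddings.T
    np.fill_diagonal(dist, 0.0)
    # Overlapping tracks must be different people: push them far apart.
    for i in range(n):
        for j in range(i + 1, n):
            s1, e1 = intervals[i]
            s2, e2 = intervals[j]
            if s1 < e2 and s2 < e1:  # temporal overlap
                dist[i, j] = dist[j, i] = penalty
    # Complete linkage keeps penalised pairs out of the same cluster.
    z = linkage(squareform(dist, checks=False), method='complete')
    return fcluster(z, t=dist_threshold, criterion='distance')
```

With complete linkage, any cluster containing a penalised pair would have a merge distance of at least the penalty, so cutting the dendrogram at a normal cosine-distance threshold guarantees overlapping tracks end up in different clusters.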
Stage 5. Active speaker detection (ASD). The goal of this stage is to determine whether the visible face is the speaker. Two systems are used for this purpose. The first method uses a variant [chung2019perfect] of SyncNet [Chung17a], a two-stream CNN that determines the active speaker by estimating the correlation between the audio track and the mouth motion in the video. The second method isolates the speech of the target speaker from a mixture of sounds using an audio-visual speech enhancement (AVSE) network [Afouras18], then uses an off-the-shelf voice activity detector, WebRTC [johnston2012webrtc], to determine the speech segments.
Each method has its weaknesses – the SyncNet ASD activates when a phoneme in the background speech matches the viseme shown on the target face, since the model does not consider temporal context; the WebRTC voice activity detector is often triggered by the residual signal left in the AVSE output, which, despite its reduced power, causes false alarms. Therefore, a face track is considered to be speaking only if both methods agree, which helps reduce false alarms from laughter and music.
Stage 6. Labelling off-screen speech. A pre-trained speaker recognition model [chung2020defence] is used to verify the identity of speech that comes from off-screen speakers. Any part of the audio with positive voice activity detector (VAD) output, but with no visible active speaker, is considered to be an off-screen speech segment. Speaker embeddings are extracted for the whole video, then the off-screen speech segments are compared to all speech segments with a visible active speaker using the cosine distance between the embeddings. If the cosine distance is below a threshold, the off-screen segment is assigned to that speaker; if not, the segment is left as unknown for the human annotator to verify. This procedure is closely related to the multi-modal diarisation method of [chung2019said].
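The assignment step can be sketched as below, assuming L2-normalised embeddings. The function name and the distance threshold are our own for illustration; the actual threshold is tuned on the development set.

```python
import numpy as np

def label_offscreen_segments(seg_embs, spk_embs, spk_names, max_dist=0.4):
    """Assign each off-screen speech segment to the closest known
    on-screen speaker by cosine distance, or to 'unknown' if no
    speaker is within `max_dist`. Embeddings are L2-normalised.

    seg_embs:  (M, D) embeddings of off-screen segments.
    spk_embs:  (K, D) reference embeddings of on-screen speakers.
    spk_names: list of K speaker labels.
    """
    labels = []
    for e in seg_embs:
        # Cosine distance to every known speaker (dot product of unit vectors).
        dists = 1.0 - spk_embs @ e
        k = int(np.argmin(dists))
        labels.append(spk_names[k] if dists[k] < max_dist else 'unknown')
    return labels
```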
Discussion. In creating the VoxCeleb datasets, very conservative thresholds were used in both active speaker detection and face verification, since it was necessary to be very certain about the speaker labels without any human intervention. This is, however, at the cost of high false rejection, which meant that a large number of true speech segments were discarded.
In contrast, a speaker diarisation dataset must contain continuous audio recordings with different identities speaking in turns. Therefore, we cannot discard parts of a video based on low confidence; the entire video must be labelled in full. The thresholds are optimised to minimise the overall Diarisation Error Rate (DER) (see section 5), since high false alarms and high false rejections both lead to increased man-hours during the manual verification and correction stage. Note that our pipeline uses two ASD methods; as we show later, this redundancy is beneficial for performance.
4.2 Manual verification
The output of the automatic pipeline has been checked and corrected by the authors of this paper using a customised version of the VGG Image Annotator [dutta2019vgg, dutta2016via]. This was done so that the authors can identify the failure modes and make guidelines for the external annotators when the process is scaled up. The tool allows the user to watch and verify the annotation at various speeds, and with aid of video.
During the annotation process, a number of failure modes were identified. The most common is a non-visible speech segment being assigned to the wrong speaker, but false alarms from the VAD and missed overlapping speech are also relatively common.
Guidelines. Speech segments are split when pauses are greater than 0.25 seconds. Unlike some previous datasets in diarisation, laughter is not assigned to identities, as it is difficult to assign an accurate label to audience laughter. Anything that can be transcribed, including short utterances such as ‘yes’ and ‘right’, are considered to be speech. Known speakers are named in the annotation process to facilitate easier cross-checking. The annotators are asked to be as careful as possible that the marked boundaries are within 0.1 seconds of the true boundary.
Quality check. In order to verify the quality of manually checked annotations, a subset of the data has been labelled independently by two different annotators. This subset contains 1 hour of material from 15 YouTube videos. The diarisation error rate between the two annotations is approximately 1%, using the labels from one annotator as the reference and the other as the prediction. This error can be mostly attributed to disagreements on the source of off-screen speech segments.
Discussion. Diarisation labels for ‘in the wild’ conversations are difficult to obtain. It is almost impossible for a human to annotate the videos in our dataset without the video. Even with the video, it takes at least 10 times the video duration to annotate the speech segments to satisfactory quality if starting from scratch, particularly when the number of speakers is large. In contrast, the verification of our audio-visual method output takes around twice the video duration, and can be obtained by less experienced annotators.
The time taken to annotate correlates strongly to the quality of the output from the automatic method. The first few videos in the development set were annotated with initial hyperparameters that gave relatively poor performance. The diarisation labels were then manually fixed, and the parameters were re-tuned on this data to minimise the diarisation error rate. More videos were then generated using the new set of parameters and this process was repeated a few times. While it is possible that some types of errors are more time-consuming for humans to fix compared to others, we have observed that the annotation became faster after each iteration.
5 Experiments
We compare our audio-visual method to an audio-only DIHARD 2019 baseline, and also to two ablations of our method.
DIHARD 2019 baseline. The second DIHARD challenge [sell2018diarization, ryant2019second] provides a baseline system based on the JHU submission to the first DIHARD challenge. We use this public code (https://github.com/iiscleap/DIHARD_2019_baseline_alltracks) as an audio-only baseline.
The overall procedure is as follows. Speech segments are obtained using VAD, and divided into short overlapping segments (1.5s with 0.75s overlap). Speaker embeddings are extracted using the x-vector[snyder2018x] system, and the similarities between the embeddings are scored with a pre-trained probabilistic linear discriminant analysis (PLDA) [ioffe2006probabilistic, kenny2013plda] model also provided in the code. Segments are then grouped using agglomerative hierarchical clustering (AHC) based on PLDA scores. We report the best performance by tuning the threshold of the AHC on the development set.
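The sliding-window segmentation used by the baseline (1.5 s windows with 0.75 s between window starts, i.e. 50% overlap) can be sketched as follows; the function name is our own and the final window is clipped at the segment end, one reasonable handling among several:

```python
def sliding_windows(seg_start, seg_end, win=1.5, hop=0.75):
    """Split a speech segment into overlapping windows of `win`
    seconds, with `hop` seconds between window starts. The last
    window is clipped to the segment end.
    """
    windows = []
    t = seg_start
    while t < seg_end:
        windows.append((t, min(t + win, seg_end)))
        if t + win >= seg_end:
            break
        t += hop
    return windows
```

One x-vector is then extracted per window before PLDA scoring and clustering.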
Two variants are compared, with and without the speech enhancement module [sun2018speaker], which has been made publicly available (https://github.com/staplesinLA/denoising_DIHARD18). The module uses a long short-term memory (LSTM) based speech denoising model trained on simulated data. This model shows state-of-the-art performance on speech enhancement, and proved effective for diarisation in the first DIHARD challenge.
Ablations. A crucial design choice in our method is the use of two active speaker detection methods, with a segment marked positive only when both methods agree. We therefore consider two ablations – one using only the SyncNet-based ASD, and the other using only the AVSE-based ASD.
Evaluation protocol. The methods are evaluated on the VoxConverse test set. We use the diarisation error rate (DER), defined as the sum of missed speech (MS), false alarm speech (FA), and speaker misclassification error rates (Speaker Confusion, SC). A forgiveness collar of 0.25 seconds is applied in order to compensate for small inconsistencies in annotation.
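Written out, with $T_{\mathrm{MS}}$, $T_{\mathrm{FA}}$ and $T_{\mathrm{SC}}$ the total durations of missed speech, false alarm speech and speaker confusion, and $T_{\mathrm{total}}$ the total scored speech time, the metric is the standard definition:

```latex
\mathrm{DER} = \frac{T_{\mathrm{MS}} + T_{\mathrm{FA}} + T_{\mathrm{SC}}}{T_{\mathrm{total}}}
```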
Training. All thresholds are tuned on the VoxConverse development set. The AHC threshold for speaker clustering is the only hyperparameter to be tuned in the audio-only baseline. The audio-visual method requires three key thresholds – cosine distance for face clustering, SyncNet confidence for active speaker detection, and cosine distance for speaker identification. The first of these affects performance the most, since any error in the identity clustering directly causes speaker confusion.
|Method||MS (%)||FA (%)||SC (%)||DER (%)|
|DIHARD 2019 baseline [sell2018diarization]||11.1||1.4||11.3||23.8|
|DIHARD 2019 baseline w/ SE [sell2018diarization, sun2018speaker]||9.3||1.3||9.7||20.2|
|Ours (SyncNet ASD only)||2.2||3.7||4.0||10.0|
|Ours (AVSE ASD only)||1.7||5.7||4.5||11.9|
Results. Table 3 shows the results of all the evaluations. Our audio-visual method obtains a DER much lower than the audio-only state-of-the-art baselines, showing the efficacy of using visual information for diarisation on this dataset. The ablation analysis for the ASD methods demonstrates the effectiveness of using two active speaker detectors – the combined method achieves a significant decrease in false alarm rate for only a small increase in missed speech.
With regards to the difficulty of VoxConverse, we note that the DIHARD 2019 baseline obtains a DER of about 20% on our dataset (Table 3), and hence there is a lot of room for improvement. While this is lower than the 26% that the same model achieves on the extremely challenging DIHARD development set (with ground truth VAD), we hypothesize that this difference may be attributed to the use of a 0.25-second forgiveness collar in our evaluation protocol.
We have described a high-performance audio-visual algorithm for automated diarisation, and used it to generate a new speaker diarisation dataset, VoxConverse, from ‘in the wild’ videos. The pipeline is fully scalable and is effective across a range of domains. VoxConverse currently contains 50 hours of annotated video, but we are in the process of scaling up. The final data will be used in the second VoxCeleb Speaker Recognition Challenge in October 2020 and, after that, will be released publicly to the research community free of charge.
Acknowledgements. This work is funded by the EPSRC Programme Grant Seebibyte EP/M013774/1. Arsha is funded by a Google PhD Fellowship. Triantafyllos is funded by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems and the Oxford-Google DeepMind Graduate Scholarship.