Discriminative Neural Clustering for Speaker Diarisation
This paper proposes a novel method for supervised data clustering. The clustering procedure is modelled by a discriminative sequence-to-sequence neural network that learns from examples. The effectiveness of the Transformer-based Discriminative Neural Clustering (DNC) model is validated on a speaker diarisation task using the challenging AMI data set, where audio segments need to be clustered into an unknown number of speakers. The AMI corpus contains only 147 meetings as training examples for the DNC model, which is very limited for training an encoder-decoder neural network. This data scarcity is mitigated through three data augmentation schemes proposed in this paper, including Diaconis Augmentation, a novel technique for discriminative embeddings trained using cosine similarities. A comparison between DNC and the spectral clustering algorithm commonly used for speaker diarisation shows that the DNC approach outperforms its unsupervised counterpart by 29.4%. Furthermore, DNC requires no explicit definition of a similarity measure between samples, which is a significant advantage given that such a measure can be difficult to specify.
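
The abstract does not spell out how Diaconis Augmentation works. The following is a minimal sketch under the assumption that it applies one random orthogonal transform to all segment embeddings of a meeting, which leaves every pairwise cosine similarity unchanged while producing new input vectors. The function names (`random_orthogonal`, `transform`, `cosine`) and the toy dimensions are hypothetical and are not taken from the paper.

```python
import math
import random


def random_orthogonal(dim, rng):
    """Build a random orthogonal matrix (assumed stand-in for the paper's
    rotation sampling) by Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components along the rows already accepted.
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip degenerate draws
            rows.append([a / norm for a in v])
    return rows


def transform(embeddings, matrix):
    """Apply the same orthogonal matrix to every embedding."""
    return [[sum(m * x for m, x in zip(row, e)) for row in matrix]
            for e in embeddings]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "meeting": three segment embeddings of dimension 4 (made-up values).
rng = random.Random(0)
segments = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
augmented = transform(segments, random_orthogonal(4, rng))

# The vectors change, but pairwise cosine similarities are preserved.
before = cosine(segments[0], segments[1])
after = cosine(augmented[0], augmented[1])
print(abs(before - after) < 1e-9)
```

An orthogonal matrix drawn this way may include a reflection rather than being a pure rotation; either way, inner products and norms are preserved, which is the property this sketch relies on.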