TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context

10/08/2021
by   Nithin Rao Koluguri, et al.

In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer, to map variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture and achieves state-of-the-art performance on the speaker verification task with an equal error rate (EER) of 0.68% on the VoxCeleb1 cleaned test trial file, and also on speaker diarization tasks with a diarization error rate (DER) of 1.73% on the AMI-MixHeadset test set. Furthermore, we investigate various sizes of TitaNet and present a light TitaNet-S model with only 6M parameters that achieves near state-of-the-art results on diarization tasks.


1 Introduction

Speaker recognition is a broad research area that solves two major tasks based on the characteristics of voices: speaker identification and speaker verification. Speaker identification is about identifying a person, and speaker verification is about verifying whether a speaker is who they claim to be. Speaker diarization is the task of partitioning audio recordings into speaker-homogeneous segments belonging to each individual speaker. Typically, speaker recognition and diarization systems operate on unconstrained speech utterances, which are converted to fixed-length vectors called speaker embeddings. These speaker embeddings represent the identity of each speaker and are used for both speaker recognition and speaker diarization tasks.

In recent years, deep neural networks (DNNs) have been actively employed as speaker embedding extractors since the d-vector [27] was proposed. Subsequently, the x-vector [26] was widely used because of the superior performance achieved by employing statistical pooling and a time delay neural network (TDNN). Other architectures such as ResNet-based convolutional neural networks (CNNs) [29] and CNNs with cross convolutional layers [9] were employed for capturing the traits of speech. In addition, to cope with variable-length inputs, Transformer [24], CNN-LSTM [13], and a slew of TDNN variants [28, 31, 6] were applied as DNN-based speaker embedding extractors.

In this work, we develop a speaker embedding extractor model that shows superior performance on both speaker verification and diarization tasks. We adopt the architecture from ContextNet [12], a state-of-the-art automatic speech recognition (ASR) model, which combines local features from 1D depth-wise separable convolutions with global context from Squeeze-and-Excitation layers.

We train text-independent speaker recognition models, where the identity of the speaker is based on how speech is spoken, not necessarily on what is being said. The following are the main contributions of this paper:


  • We propose the use of 1D depth-wise separable convolutions in place of full 1D convolutions.

  • We bring in global context to speaker embedding models by introducing global average pooling after the squeeze and excitation module. This contrasts with the embedding extractors based on TDNN, such as x-vector [26] or most recently ECAPA-TDNN [8].

  • TitaNet-M is half the size of comparable speaker embedding extractors like ECAPA-TDNN or Conformer-based baselines and achieves superior performance in speaker diarization.

  • We train our networks end-to-end using an angular margin softmax loss and use cosine similarity as the back-end for comparing speaker representations. This approach avoids the burden of training external models such as probabilistic linear discriminant analysis (PLDA) [14] and agglomerative hierarchical clustering (AHC) [25], as in well-known speaker verification systems [26, 28] or speaker diarization systems [25, 18].

We introduce three models of different sizes. The architecture is easily scalable in both depth and width by design, and we show that scaling in width is very effective in reducing the model size with only a small change in performance. In the experimental section, we demonstrate the performance of the model on the VoxCeleb1 cleaned test trial file for speaker verification. In addition, we evaluate diarization performance on popular evaluation datasets such as AMI (Lapel), AMI (MixHeadset) [3], NIST-SRE-2000 [19], and CH109 [2].

2 Model Architecture

Figure 1: TitaNet Encoder and Decoder Architecture

2.1 Encoder

The model is based on the ContextNet ASR architecture [12], comprising an encoder and a decoder. We use the encoder of the ContextNet model as a top-level feature extractor and feed its output to the attentive pooling layer. This layer computes attention features across the channel dimension to capture time-independent, utterance-level speaker representations.

TitaNet is a 1D time-channel (depth-wise) separable convolutional model with a ContextNet-like architecture combined with channel attention pooling. Fig. 1 shows the encoder and attention pooling decoder; the architecture is parameterized by the number of blocks B, the number of repeated sub-blocks R per block, and the number of filters C in the convolution layers of each block. The encoder starts with a prologue block, followed by the mega blocks, and ends with an epilogue block. Prologue and epilogue blocks differ from mega blocks: both consist of the same convolution module (Conv), batch norm, and ReLU layers, with fixed kernel sizes of 3 in the prologue and 1 in the epilogue for all the network architectures we propose, and they contain no residual connections or dropout layers. Each mega block begins with a time-channel separable convolutional layer [16] with stride 1 and dilation 1, followed by batch norm, ReLU, and dropout.

Each time-channel separable convolution module is made up of two parts: a depth-wise convolutional layer and a point-wise convolutional layer. Depth-wise convolutions apply a single filter per input channel (input depth). Point-wise convolutions are 1x1 convolutions used to create a linear combination of the outputs of the depth-wise layer. These layers are repeated R times, which can be modified to vary the depth of the network. The repeated layers are residually connected with Squeeze-and-Excitation (SE) layers that use global average pooling for context inclusion. Using this global context, the SE layer squeezes a sequence of local feature vectors into a single global context vector, broadcasts this context back to each local feature vector, and merges the two via multiplication. The width of the network can be increased or decreased by varying the output channel filter sizes of each mega block. For TitaNet models, width and depth are changed by varying these filter sizes and the number of repeated layers, respectively.
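
To make the block structure concrete, the following is a minimal PyTorch sketch of one mega block as described above: R repeated sub-blocks of a time-channel separable 1D convolution with batch norm, ReLU, and dropout, plus an SE layer with global average pooling and a residual connection. Class and parameter names (e.g. MegaBlock, reduction) are illustrative assumptions, not the NeMo implementation.

```python
import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """SE layer with global context: squeeze via global average pooling over time,
    excite via a bottleneck MLP, then rescale each channel."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):             # x: (batch, channels, time)
        context = x.mean(dim=-1)      # global average pooling over time
        scale = self.fc(context).unsqueeze(-1)
        return x * scale              # broadcast context back to every frame

class TimeChannelSeparableConv1d(nn.Module):
    """Depth-wise convolution over time followed by a 1x1 point-wise convolution."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MegaBlock(nn.Module):
    """One TitaNet-style mega block: R repeated separable-conv sub-blocks,
    an SE layer with global context, and a residual connection."""
    def __init__(self, in_ch, out_ch, kernel_size, repeat=3, dropout=0.1):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(repeat):
            layers += [TimeChannelSeparableConv1d(ch, out_ch, kernel_size),
                       nn.BatchNorm1d(out_ch), nn.ReLU(inplace=True),
                       nn.Dropout(dropout)]
            ch = out_ch
        self.body = nn.Sequential(*layers)
        self.se = SqueezeExcite1d(out_ch)
        self.residual = nn.Sequential(nn.Conv1d(in_ch, out_ch, 1),
                                      nn.BatchNorm1d(out_ch))

    def forward(self, x):             # x: (batch, channels, time)
        return torch.relu(self.se(self.body(x)) + self.residual(x))
```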

2.2 Decoder and Embeddings

The top-level acoustic features obtained from the output of the encoder are used to compute intermediate features that are passed to the decoder to obtain utterance-level speaker embeddings. The intermediate time-independent features are computed using an attentive statistics pooling layer [6], where channel attention features are computed across time to produce a fixed-size, time-independent representation. These intermediate features are passed through the decoder, which consists of two linear layers: one with output size 192 and another that performs a linear transformation from 192 to the final number of classes, computing the probability that the current segment belongs to a speaker from the training set. In this fashion, the network extracts a fixed-length representation from a variable-length speech segment. We extract t-vectors of fixed size 192 before the final logits linear layer.
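
A minimal sketch of this pooling and decoder stage is given below, assuming a standard attentive statistics pooling formulation (attention weights over time, then weighted mean and standard deviation). The class names and the hidden size of the attention network are illustrative, not the exact NeMo implementation.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Collapse (batch, channels, time) into a fixed-size vector by computing
    attention weights over time and taking the weighted mean and std."""
    def __init__(self, channels: int, hidden: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):                          # x: (B, C, T)
        w = torch.softmax(self.attention(x), dim=-1)
        mean = (x * w).sum(dim=-1)
        var = (x.pow(2) * w).sum(dim=-1) - mean.pow(2)
        std = var.clamp(min=1e-8).sqrt()
        return torch.cat([mean, std], dim=-1)      # (B, 2C)

class Decoder(nn.Module):
    """Two linear layers: one producing the 192-d t-vector, one producing logits."""
    def __init__(self, channels: int, num_speakers: int, emb_dim: int = 192):
        super().__init__()
        self.pool = AttentiveStatsPooling(channels)
        self.embedding = nn.Linear(2 * channels, emb_dim)
        self.logits = nn.Linear(emb_dim, num_speakers)

    def forward(self, x):
        t_vector = self.embedding(self.pool(x))    # fixed-length speaker embedding
        return self.logits(t_vector), t_vector
```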

2.3 Loss function

The TitaNet model is trained end-to-end with the additive angular margin (AAM) loss [7], which helps to optimize the cosine distance between speaker embeddings:

$$ \mathcal{L}_{\mathrm{AAM}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\,s\cos(\theta_{y_i} + m)}}{e^{\,s\cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{\,s\cos\theta_j}} \quad (1) $$

where m is the margin, s is the scale, and θ_j is the angle between the final linear layer weight w_j and the incoming feature x_i. Here m and s are predefined hyperparameters. For all the verification and diarization experiments presented in this paper, we use cosine similarity as the back-end.
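
A compact sketch of the AAM (ArcFace-style) objective in Eq. (1) is shown below, assuming L2-normalized embeddings and class weights. The default hyperparameters mirror the values reported later in the paper (s = 30, m = 0.2), but the class itself is an illustrative re-implementation, not the authors' training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAngularMarginLoss(nn.Module):
    """ArcFace-style AAM softmax: add a margin m to the target angle theta_y,
    scale all cosines by s, then apply cross-entropy."""
    def __init__(self, emb_dim: int, num_classes: int, s: float = 30.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # cos(theta_j) between each embedding and each class weight
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin only to the target class
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)
```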

3 Experiments

We designed three TitaNet models: TitaNet-S with 256 channels, TitaNet-M with 512 channels, and TitaNet-L with 1024 channels. All models use the same number of repeated sub-blocks (R = 3) and the same kernel sizes of 3, 7, 11, and 15. TitaNet-S has 6.4M parameters, TitaNet-M has 13.4M, and TitaNet-L has 25.3M.
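
For reference, the three variants can be summarized as a small configuration table; the dictionaries below simply restate the sizes listed above (the actual NeMo config files are structured differently).

```python
# Width (channel count) is the only difference between the variants;
# all of them use R = 3 repeated sub-blocks and kernel sizes 3, 7, 11, 15.
TITANET_VARIANTS = {
    "titanet-s": {"channels": 256,  "repeat": 3, "kernels": [3, 7, 11, 15], "params_m": 6.4},
    "titanet-m": {"channels": 512,  "repeat": 3, "kernels": [3, 7, 11, 15], "params_m": 13.4},
    "titanet-l": {"channels": 1024, "repeat": 3, "kernels": [3, 7, 11, 15], "params_m": 25.3},
}
```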

3.1 Datasets

3.1.1 Training Data

We use the following datasets to train TitaNet: VoxCeleb1 and VoxCeleb2 dev [4], the NIST SRE portion of datasets from 2004-2008 (LDC2009E100), Switchboard-Cellular1 and Switchboard-Cellular2 [10], Fisher [5], and LibriSpeech [21] (see Table 1). Combined, these datasets consist of about 4.8M utterances from 16.6K speakers. We augment the training data with the RIR impulse response corpora [15], speed perturbation with 0.95x and 1.05x speeds, and SpecAugment [22].
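
As an illustration of the speed perturbation step (0.95x and 1.05x), a minimal torchaudio-based sketch is shown below; the RIR convolution and SpecAugment steps are omitted, and the function name is hypothetical.

```python
import random
import torch
import torchaudio

def speed_perturb(waveform: torch.Tensor, sample_rate: int,
                  factors=(0.95, 1.0, 1.05)) -> torch.Tensor:
    """Randomly change playback speed by resampling, which alters both
    duration and pitch (the common Kaldi-style speed perturbation)."""
    factor = random.choice(factors)
    if factor == 1.0:
        return waveform
    # Treat the audio as if it were recorded at sample_rate * factor,
    # then resample back to the original rate.
    return torchaudio.functional.resample(
        waveform, orig_freq=int(sample_rate * factor), new_freq=sample_rate)
```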

3.1.2 Evaluation Data

We use the VoxCeleb1 cleaned test trial file to evaluate EER for speaker verification. We use the following three datasets for evaluating the speaker diarization system:


  • NIST-SRE-2000 [19]: all sessions from LDC2001S97.

  • AMI Corpus [3]: Lapel and MixHeadset audio subsets from partition set [1].

  • CH109 [2]: we use a subset of CALLHOME American English speech (CHAES), which contains only two speakers. There are 109 sessions in this subset. The remaining 11 sessions in CHAES are used as a dev set for CH109 and NIST-SRE-2000.

Dataset     | # of Speakers | Duration (Hrs) | # Utterances (K)
VoxCeleb1   | 1211          | 227            | 332
VoxCeleb2   | 5994          | 1895           | 2274
SRE         | 3787          | 503            | 944
Fisher      | 951           | 162            | 278
Switchboard | 2400          | 247            | 425
LibriSpeech | 2338          | 336            | 634
Total       | 16681         | 3373           | 4890

Table 1: Statistics of each dataset used for training TitaNet
Figure 2: DET curve for VoxCeleb1 cleaned trial comparing with previous studies
Models (Backend)     | # Params (M) | VoxCeleb1 EER (%) | MinDCF
x-vector (PLDA) [26] | 9            | 2.97              | 0.323
ECAPA (CS) [6]       | 22.3         | 0.69              | 0.082
Conformer (CS) [11]  | 26.4         | 2.43              | 0.264
TitaNet-S (CS)       | 6.4          | 1.15              | 0.131
TitaNet-M (CS)       | 13.4         | 0.81              | 0.106
TitaNet-L (CS)       | 25.3         | 0.68              | 0.087

Table 2: TitaNet comparison with other models on the speaker verification task. All models have been evaluated with a cosine similarity (CS) back-end except x-vector, which used PLDA.

3.2 Experiment Setup

Every speaker recognition experiment consists of common data pre-processing steps for the training, development, and evaluation stages. During pre-processing, we do not use a speech activity detector (SAD), to avoid dependence on an additional model. Instead, we split speech segments longer than 3 sec into random chunks of 1.5, 2, and 3 sec. We compute acoustic features for every 25 ms frame window, shifted over 10 ms. The acoustic features are 80-dimensional mel spectrograms computed using a 512-point FFT and a Hann window. Next, the mel spectrogram features are normalized over the frequency axis. Every utterance fed to the encoder has size 80 x T, where T is the number of frames in the given speech utterance file. The accuracy of speaker verification systems is measured using EER and the minimum normalized detection cost (MinDCF) with a fixed target prior and equal miss and false-alarm costs. Both the EER for verification and the DER for diarization are computed using a cosine similarity (CS) back-end.
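
A sketch of this acoustic front end (80-dim log-mel features from 25 ms windows with a 10 ms hop, 512-point FFT, Hann window) is shown below using torchaudio; the per-bin mean/variance normalization is our reading of "normalized over the frequency axis", not a detail taken from the paper.

```python
import torch
import torchaudio

SAMPLE_RATE = 16000

mel_spec = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=512,
    win_length=int(0.025 * SAMPLE_RATE),   # 25 ms window
    hop_length=int(0.010 * SAMPLE_RATE),   # 10 ms shift
    window_fn=torch.hann_window,
    n_mels=80,
)

def extract_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples). Returns an (80, T) log-mel feature matrix,
    normalized per mel bin over time."""
    feats = torch.log(mel_spec(waveform) + 1e-6).squeeze(0)   # (80, T)
    mean = feats.mean(dim=-1, keepdim=True)
    std = feats.std(dim=-1, keepdim=True).clamp(min=1e-5)
    return (feats - mean) / std
```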

In our diarization experiments, the evaluation datasets are divided into two groups: telephonic and non-telephonic speech. Based on experiments with the dev sets, we found that a window size of 1.5 sec with a shift of 0.75 sec works best for telephonic speech. For non-telephonic speech, the best settings were 3 sec and 1.75 sec for window and shift, respectively. Among the evaluation datasets used in this paper, AMI Lapel and MixHeadset fall under non-telephonic speech, and the rest belong to the telephonic speech group. Unlike previous studies [6, 25], we do not use external dev data to tune the clustering parameters, relying instead on an auto-tuning approach [23]. Similar to previous systems [6], we use a collar of 0.25 sec and ignore overlapped speech regions for speaker error rate calculation. All TitaNet models in Table 2 are trained for 250 epochs with the SGD optimizer and a cosine annealing learning rate (LR) schedule on 4 nodes with 8 V100 GPUs per node.
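
A minimal sketch of this optimizer and scheduler setup using PyTorch's built-in SGD and CosineAnnealingLR is given below; the learning rate, momentum, and weight decay values are placeholders, since they are not listed here.

```python
import torch

def build_optimizer(model: torch.nn.Module, epochs: int = 250,
                    steps_per_epoch: int = 1000, init_lr: float = 0.08):
    """SGD with a cosine-annealed learning rate over the full training run.
    init_lr, momentum, and weight_decay are illustrative placeholders."""
    optimizer = torch.optim.SGD(model.parameters(), lr=init_lr,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * steps_per_epoch)
    return optimizer, scheduler
```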

3.3 Evaluation Results

Models (Backend)           | NIST-SRE 2000 | AMI Lapel | AMI MixHeadset | CH109
x-vector (PLDA + AHC) [26] | 8.39          | -         | -              | 9.72
ECAPA (SC) [6]             | -             | 2.36      | 1.78           | -
x-vector (MCGAN) [20]      | 5.73          | -         | -              | -
TitaNet-S (NME-SC)         | 6.37          | 2.00      | 2.22           | 1.11
TitaNet-M (NME-SC)         | 6.47          | 1.99      | 1.79           | 1.13
TitaNet-L (NME-SC)         | 6.73          | 2.03      | 1.73           | 1.19

Table 3: TitaNet comparison with other models for speaker diarization with oracle SAD and known number of speakers, DER (%).

3.3.1 Speaker Verification

In the speaker verification experiments, we train the model on the datasets shown in Table 1. We initially train these systems as speaker identification models, with 10 percent of the audio files of each speaker set aside from the training sets as validation data. With this setup, we trained the TitaNet models end-to-end using the additive angular margin loss. Table 2 shows the performance of the TitaNet models on the VoxCeleb1 cleaned trial file. We observed a high degree of sensitivity in the validation curves to slight variations in the margin (m) and scale (s) of the angular loss. With s = 30 and m = 0.2, TitaNet-L showed state-of-the-art performance with an EER of 0.68% on the VoxCeleb1 cleaned test trial file, outperforming previously reported results in [6, 30].

As can be noticed from Table 2, TitaNet models scale easily and achieve very competitive performance even with relatively few parameters (around 6M). These models show a direct relationship between accuracy and the number of parameters. We show the Detection Error Trade-off (DET) curves in Fig. 2 to compare the TitaNet-L model with other previously reported CNN-based models.
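
For completeness, the EER reported in Table 2 can be computed from trial scores and labels with a short helper such as the one below; this is the standard ROC-based estimate using scikit-learn, and the averaging at the crossover point is an implementation choice, not taken from the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where the false-accept rate equals the false-reject rate.
    `scores` are cosine similarities, `labels` are 1 for target trials, 0 otherwise."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # closest point to FPR == FNR
    return float((fpr[idx] + fnr[idx]) / 2.0)
```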

3.3.2 Speaker Diarization

We employ our proposed speaker embedding extractor models for speaker diarization tasks. Cosine similarity is used to measure the distance between speaker embeddings, and Normalized Maximum Eigengap Spectral Clustering (NME-SC) [23] is applied to the extracted embeddings to obtain the clustering result. We report the performance of each TitaNet model on popular evaluation datasets in Tables 3 and 4. The diarization experiments are based on oracle SAD to evaluate SAD-independent performance. In Table 3, we show the results for the known number of speakers case, and in Table 4, we present the results for an unknown number of speakers, where the speaker count is estimated using the NME-SC clustering algorithm.
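
The clustering stage can be sketched as follows: compute a cosine-affinity matrix over the segment embeddings and cluster it spectrally. The sketch uses scikit-learn's SpectralClustering with a precomputed affinity as a stand-in for NME-SC [23], whose eigengap-based auto-tuning is not implemented here; `segment_embeddings` and the oracle speaker count are assumed inputs.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import normalize

def cluster_speakers(segment_embeddings: np.ndarray, num_speakers: int) -> np.ndarray:
    """Assign a speaker label to each segment embedding.

    segment_embeddings: (num_segments, emb_dim) t-vectors from TitaNet.
    num_speakers: oracle (or estimated) number of speakers.
    """
    emb = normalize(segment_embeddings)        # unit-norm rows
    affinity = emb @ emb.T                     # cosine similarity matrix
    affinity = np.clip(affinity, 0.0, 1.0)     # keep affinities non-negative
    clustering = SpectralClustering(n_clusters=num_speakers,
                                    affinity="precomputed",
                                    assign_labels="kmeans",
                                    random_state=0)
    return clustering.fit_predict(affinity)
```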

TitaNet models outperform the previous state-of-the-art models on the AMI-Lapel, AMI-MixHeadset, and CH109 evaluation datasets. It is worth noting that the small and medium TitaNet models show only minor performance differences even though their parameter counts are reduced by roughly 75% and 47%, respectively, compared to the largest model. We believe that, unlike in verification, larger TitaNet models bring no major performance improvement in diarization systems. We hypothesize that this is because separability in the embedding space does not require a high level of precision when the clustering process only involves relatively few speakers. The TitaNet models show very good improvement on all datasets except the NIST-SRE-2000 evaluation set; note, however, that the clustering approaches in [18, 26] involve additional training of a hidden Markov model or PLDA.

Models (Backend)           | NIST-SRE 2000 | AMI Lapel | AMI MixHeadset | CH109
x-vector (PLDA + AHC) [26] | 7.12          | -         | -              | -
x-vector (VBx) [18]        | 4.42          | -         | 2.17           | -
ECAPA (SC) [6]             | -             | 2.13      | 2.17           | -
x-vector (MCGAN) [20]      | 6.76          | -         | -              | -
TitaNet-S (NME-SC)         | 5.49          | 2.3       | 1.97           | 1.42
TitaNet-M (NME-SC)         | 5.75          | 2.59      | 1.89           | 1.51
TitaNet-L (NME-SC)         | 5.38          | 2.03      | 1.89           | 1.63

(This number is reported based on the Kaldi evaluation partition [18].)

Table 4: TitaNet comparison with other models for speaker diarization with oracle SAD and estimated number of speakers, DER (%).

4 Conclusion

In this paper, we present TitaNet, a new speaker representation learning model that combines the global context of squeeze-and-excitation layers with channel attention pooling to extract fixed-length speaker embeddings. The model employs 1D depth-wise separable convolutions, which have shown state-of-the-art performance in ASR tasks, for speaker embedding extraction. The TitaNet-M model, which is half the size of previous state-of-the-art systems, outperforms them on speaker diarization tasks while achieving competitive numbers on verification tasks. The TitaNet-L model significantly outperforms existing models on speaker verification and diarization tasks.

The models’ implementation and pre-trained checkpoints are made available through the NVIDIA NeMo toolkit [17] (https://github.com/NVIDIA/NeMo).

5 Acknowledgments

We would like to thank the NVIDIA AI Applications team for their help and valuable feedback.

References

  • [1] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M. Gill (2020) pyannote.audio: neural building blocks for speaker diarization. In ICASSP, Cited by: 2nd item.
  • [2] A. Canavan, D. Graff, and G. Zipperlen (1997) CallHome american english speech. Linguistic Data Consortium. Cited by: §1, 3rd item.
  • [3] J. Carletta et al. (2005) The AMI meeting corpus: a pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction, Cited by: §1, 2nd item.
  • [4] J.S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. In Interspeech, Cited by: §3.1.1.
  • [5] C. Cieri, D. Graff, O. Kimball, D. Miller, and K. Walker (2004) Fisher english training speech part 1 transcripts. Linguistic Data Consortium. Cited by: §3.1.1.
  • [6] N. Dawalatabad, M. Ravanelli, F. Grondin, J. Thienpondt, B. Desplanques, and H. Na (2021) ECAPA-TDNN embedding for speaker diarization. arXiv:2104.01466. Cited by: §1, §2.2, §3.2, §3.3.1, Table 2, Table 3, Table 4.
  • [7] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In CVPR, Cited by: §2.3.
  • [8] B. Desplanques, J. Thienpondt, and K. Demuynck (2020) ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143. Cited by: 2nd item.
  • [9] Z. Gao, Y. Song, I. McLoughlin, W. Guo, and L. Dai (2018) An improved deep embedding learning method for short duration speaker verification. In Interspeech, Cited by: §1.
  • [10] J. Godfrey and E. Holliman (1993) Switchboard-1 release 2 LDC97S62. Linguistic Data Consortium. Cited by: §3.1.1.
  • [11] A. Gulati, J. Qin, C. Chiu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. In Interspeech, Cited by: Table 2.
  • [12] W. Han, Z. Zhang, Y. Zhang, et al. (2020) ContextNet: improving convolutional neural networks for automatic speech recognition with global context. arXiv:2005.03191. Cited by: §1, §2.1.
  • [13] J. Jung, H. Heo, et al. (2018) A complete end-to-end speaker verification system using deep neural networks: from raw signals to verification result. In ICASSP, Cited by: §1.
  • [14] P. Kenny (2010) Bayesian speaker verification with heavy-tailed priors.. In Odyssey, Cited by: 4th item.
  • [15] T. Ko, V. Peddinti, D. Povey, M. Seltzer, and S. Khudanpur (2017) A study on data augmentation of reverberant speech for robust speech recognition. In ICASSP, Cited by: §3.1.1.
  • [16] S. Kriman, S. Beliaev, B. Ginsburg, et al. (2020) QuartzNet: deep automatic speech recognition with 1D time-channel separable convolutions. In ICASSP, Cited by: §2.1.
  • [17] O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V. Lavrukhin, J. Cook, et al. (2019) Nemo: a toolkit for building ai applications using neural modules. arXiv:1909.09577. Cited by: §4.
  • [18] F. Landini, J. Profant, M. Diez, and L. Burget (2022) Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks. Computer Speech & Language 71, pp. 101254. Cited by: 4th item, §3.3.2, Table 4.
  • [19] A. Martin and M. Przybocki (2001) The NIST speaker recognition evaluations: 1996-2001. In Odyssey, Cited by: §1, 1st item.
  • [20] M. Pal, M. Kumar, et al. (2021) Meta-learning with latent space clustering in generative adversarial network for speaker diarization. IEEE/ACM TASLP. Cited by: Table 3, Table 4.
  • [21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) LibriSpeech: an asr corpus based on public domain audio books. In ICASSP, Cited by: §3.1.1.
  • [22] D. Park, W. Chan, et al. (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv:1904.08779. Cited by: §3.1.1.
  • [23] T. Park, K. Han, et al. (2019) Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap. IEEE Signal Processing Letters 27, pp. 381–385. Cited by: §3.2, §3.3.2.
  • [24] P. Safari, M. India, and J. Hernando (2020) Self-attention encoding and pooling for speaker recognition. In Interspeech, Cited by: §1.
  • [25] G. Sell, D. Snyder, et al. (2018) Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge.. In Interspeech, Cited by: 4th item, §3.2.
  • [26] D. Snyder et al. (2018) X-vectors: robust dnn embeddings for speaker recognition. In ICASSP, Cited by: 2nd item, 4th item, §1, §3.3.2, Table 2, Table 3, Table 4.
  • [27] E. Variani, X. Lei, E. McDermott, I. Moreno, and J. Gonzalez-Dominguez (2014) Deep neural networks for small footprint text-dependent speaker verification. In ICASSP, Cited by: §1.
  • [28] J. Villalba, N. Chen, D. Snyder, et al. (2019) State-of-the-art speaker recognition for telephone and video speech: the JHU-MIT submission for NIST SRE18.. In Interspeech, Cited by: 4th item, §1.
  • [29] Y. Yu, L. Fan, and W. Li (2019) Ensemble additive margin softmax for speaker verification. In ICASSP, Cited by: §1.
  • [30] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot (2019) BUT system description to voxceleb speaker recognition challenge 2019. arXiv:1910.12592. Cited by: §3.3.1.
  • [31] Y. Zhu and B. Mak (2020) Orthogonal training for text-independent speaker verification. In ICASSP, Cited by: §1.