Author's repository for reproducing RawNet2 with PyTorch and RawNet with PyTorch and Keras.
Recently, direct modeling of raw waveforms using deep neural networks has been widely studied for a number of tasks in audio domains. In speaker verification, however, utilization of raw waveforms is in its preliminary phase, requiring further investigation. In this study, we explore end-to-end deep neural networks that input raw waveforms to improve various aspects: front-end speaker embedding extraction including model architecture, pre-training scheme, additional objective functions, and back-end classification. Adjustment of model architecture using a pre-training scheme can extract speaker embeddings, giving a significant improvement in performance. Additional objective functions simplify the process of extracting speaker embeddings by merging conventional two-phase processes: extracting utterance-level features such as i-vectors or x-vectors and the feature enhancement phase, e.g., linear discriminant analysis. Effective back-end classification models that suit the proposed speaker embedding are also explored. We propose an end-to-end system that comprises two deep neural networks, one front-end for utterance-level speaker embedding extraction and the other for back-end classification. Experiments conducted on the VoxCeleb1 dataset demonstrate that the proposed model achieves state-of-the-art performance among systems without data augmentation. The proposed system is also comparable to the state-of-the-art x-vector system that adopts heavy data augmentation.
Direct modeling of raw waveforms using deep neural networks (DNNs) is increasingly prevalent in a number of tasks due to advances in deep learning [1, 2, 3, 4, 5, 6, 7, 8]. In speech recognition, studies such as those of Palaz et al., Sainath et al., and Hoshen et al. deal with raw waveforms as input [1, 2, 3]. In speaker recognition, studies by Jung et al. and Muckenhirn et al. were among the first to build systems that directly input raw waveforms [5, 6, 7]. Other domains such as spoofing detection and automatic music tagging are also adopting raw waveform inputs [4, 8].
DNNs that directly input raw waveforms have a number of advantages over conventional acoustic feature-based DNNs. First, minimization of pre-processing removes the need for exploration of various hyper-parameters such as the type of acoustic feature to use, window size, shift length, and feature dimension. This is expected to lower entry barriers to conducting studies and lessen the burden of follow-up studies. Additionally, with recent trends of DNN replacing more sub-processes in various tasks, a raw waveform DNN is well positioned to benefit from future advances in deep learning.
Studies across various tasks have shown that an assembly of multiple frequency responses can be extracted when raw waveforms are processed by each kernel of convolutional layers [6, 9]. Spectrograms, in contrast, have linearly positioned frequency bands, meaning that the first convolutional layer sees only adjacent frequency bands (although repetition of convolutions can aggregate various frequency responses at deeper layers). In other words, a spectrogram-based CNN can see only fixed frequency regions depending on the internal pooling rule. This difference is hypothesized to increase the potential of directly modeling raw waveforms; as increasing amounts of data become available, this data-driven approach can extract an aggregation of informative frequency responses appropriate to the target task.
We build upon the CNN-LSTM model of [5, 7]. This model extracts frame-level embeddings using residual blocks with convolutional neural networks (CNNs) [10, 11], and then aggregates them into an utterance-level representation using long short-term memory (LSTM) [12, 13]. Key improvements made by our study include the following:
Model architecture: adjustments to various network configurations
Objective function: additional objective functions to incorporate speaker embedding extraction and feature enhancement phase
Back-end classification models: comparison of various DNN-based back-end classifiers and proposal of a simple, effective back-end DNN classifier
Through changing these aspects, performance is significantly enhanced. The equal error rate (EER) of the utterance-level speaker embedding DNN with cosine similarity on the VoxCeleb1 dataset is 4.8 %, showing a 44.8 % relative error rate reduction (RER) compared to the baseline. An EER of 4.0 % was achieved for the end-to-end model using two DNNs, showing an RER of 46.0 %.
The rest of this paper is organized as follows. Section 2 describes the front-end speaker embedding extraction model. Section 3 addresses various back-end classification models. Experiments and results are in Sections 4 and 5 and the paper is concluded in Section 6.
We propose a model (referred to as “RawNet” for convenience) that is an improvement of the CNN-LSTM model in [5, 7] by changing architectural details (Section 2.1.), proposing a new pre-training scheme (Section 2.2.), and incorporating a speaker embedding enhancement phase (Section 2.3.).
The DNN used in this study comprises residual blocks, a gated recurrent unit (GRU) layer[15, 16], a fully-connected layer (used for extraction of speaker embedding), and an output layer. In this architecture, input features are first processed using the residual blocks  to extract frame-level embeddings. The residual blocks comprise convolutional layers with identity mapping  to facilitate the training of deep architectures. A GRU is then employed to aggregate the frame-level features into a single utterance-level embedding. Utterance-level embedding is then fed into one fully-connected layer. The output of the fully-connected layer is used as the speaker embedding and is connected to the output layer, where the number of nodes is identical to the number of speakers in the training set. The proposed RawNet architecture is depicted in Table 1.
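The pipeline above can be sketched in PyTorch. This is a minimal illustration, not the exact configuration of Table 1: the layer sizes given in the text (1024-node GRU, 128-dim speaker embedding, 1211 output speakers) are kept, while the kernel lengths, strides, and fixed 128-channel width are illustrative assumptions.

```python
# Minimal sketch of the described front-end: strided conv -> residual
# blocks -> GRU aggregation -> speaker embedding -> speaker classifier.
# Kernel/stride/channel choices are assumptions, not the paper's Table 1.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two conv layers with BN; the identity mapping is added before the
    final activation and max pooling, as described in the text."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.act = nn.LeakyReLU(0.3)
        self.pool = nn.MaxPool1d(3)

    def forward(self, x):
        out = self.bn2(self.conv2(self.act(self.bn1(self.conv1(x)))))
        out = out + x                       # residual (identity) connection
        return self.pool(self.act(out))     # applied after the connection

class RawNetSketch(nn.Module):
    def __init__(self, n_speakers=1211, emb_dim=128):
        super().__init__()
        self.front = nn.Conv1d(1, 128, kernel_size=3, stride=3)  # strided conv
        self.blocks = nn.Sequential(*[ResBlock(128) for _ in range(6)])
        self.gru = nn.GRU(128, 1024, batch_first=True)
        self.fc_emb = nn.Linear(1024, emb_dim)      # speaker embedding layer
        self.fc_out = nn.Linear(emb_dim, n_speakers)

    def forward(self, wav):                         # wav: (batch, samples)
        x = self.front(wav.unsqueeze(1))            # (batch, 128, frames)
        x = self.blocks(x)
        _, h = self.gru(x.transpose(1, 2))          # utterance-level state
        emb = self.fc_emb(h.squeeze(0))             # 128-dim speaker embedding
        return emb, self.fc_out(emb)
```

With a 59,049-sample input, the strided convolution and six pooling operations each reduce the temporal resolution by a factor of 3, so the GRU sees 27 frames, consistent with the output shapes listed in Table 1.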
Several architectural details are modified relative to the baseline, allowing for further improvement. First, activation functions are changed from rectified linear units (ReLU) to leaky ReLU. Second, the LSTM layer is changed to a GRU layer. Third, the number of parameters is significantly decreased, including lower dimensionality of the speaker embedding (from 1024 to 128).
Extracting utterance-level speaker embeddings directly from raw waveforms often leads to overfitting toward the training set. In prior work, a multi-step training scheme was used to avoid this phenomenon. This scheme first trains a CNN (for frame-level training), and then expands it to a CNN-LSTM (for utterance-level training). It demonstrates significant improvement compared to training a CNN-LSTM with random initialization.
However, this multi-step training approach is inefficient: after training 9 residual blocks, 3 residual blocks containing a number of layers are removed when expanding the trained CNN model to a CNN-LSTM model. In our study, a new approach that interprets the CNN training phase as pre-training is applied. This approach adopts fewer residual convolutional blocks, i.e., 6, connected to a global average pooling layer. After training the CNN, only the global average pooling layer is removed. The objective is to use a number of convolutional blocks appropriate for training with the recurrent layer and not remove any trained parameters. This modification enables more efficient and faster training. Applying the model architecture modifications detailed in Section 2.1 together with the CNN pre-training scheme exhibited an RER of 26.4 % (see Table 2).
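The two-phase scheme can be sketched as follows. The convolutional stack and head sizes here are hypothetical stand-ins (a single small conv layer in place of the strided conv plus six residual blocks); the point is that only the global average pooling head is discarded between phases, so every pre-trained convolutional parameter carries over.

```python
# Sketch of the described pre-training scheme: phase 1 trains the CNN
# with a global average pooling head; phase 2 discards only that head
# and attaches a GRU, keeping all convolutional parameters.
import torch
import torch.nn as nn

convs = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=3, stride=3),
    nn.LeakyReLU(0.3),
    nn.MaxPool1d(3),
)  # hypothetical stand-in for the strided conv + six residual blocks

# Phase 1: frame-level pre-training with a global average pooling head.
pretrain_head = nn.Sequential(
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1211)
)
pretrain_model = nn.Sequential(convs, pretrain_head)
# ... train pretrain_model with cross-entropy here ...

# Phase 2: remove only the pooling head and attach the recurrent layer.
class CnnGru(nn.Module):
    def __init__(self, convs):
        super().__init__()
        self.convs = convs            # re-used, pre-trained parameters
        self.gru = nn.GRU(32, 64, batch_first=True)

    def forward(self, wav):           # wav: (batch, samples)
        x = self.convs(wav.unsqueeze(1)).transpose(1, 2)
        _, h = self.gru(x)
        return h.squeeze(0)           # utterance-level embedding

model = CnnGru(convs)  # no trained parameter is thrown away
```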
| Layer | Input: 59,049 samples | Output shape |
| --- | --- | --- |
| Res block (×2) | | (2187, 128) |
| Res block (×4) | | (27, 256) |
RawNet architecture. For convolutional layers, numbers inside parentheses refer to filter length, stride size, and number of filters. For gated recurrent unit (GRU) and fully-connected layers, numbers inside the parentheses indicate the number of nodes. An input sequence of 59,049 samples is based on the training mini-batch configuration; at the evaluation phase, input sequence length differs. Center loss and speaker basis loss are omitted for simplicity. For residual blocks, layers under the dotted line are applied after the residual connection.
For speaker verification, a number of studies enhance extracted utterance-level features through an additional process before back-end classification. Linear discriminant analysis (LDA) in i-vector/PLDA systems is one example [17, 18]. Well-known methods such as LDA, as well as recent DNN-based deep embedding enhancement techniques including the discriminative autoencoder (DCAE), have been applied for this purpose. In such approaches, one of the main objectives is to minimize intra-class covariance and maximize inter-class covariance of utterance-level features. In this study, we aim to incorporate the two phases of speaker embedding extraction and feature enhancement into a single phase, using two additional objective functions.
To consider both inter-class and intra-class covariance, we utilize center loss and speaker basis loss in addition to categorical cross-entropy loss for DNN training. We adopt center loss to minimize intra-class covariance while the embedding in the last hidden layer remains discriminative. To achieve this goal, the center loss function was proposed as

$$L_c = \frac{1}{2}\sum_{i=1}^{m} \left\lVert \mathbf{x}_i - \mathbf{c}_{y_i} \right\rVert_2^2$$

where $\mathbf{x}_i$ refers to the embedding of the $i$-th utterance, $\mathbf{c}_{y_i}$ refers to the center of class $y_i$, and $m$ refers to the size of a mini-batch.

Speaker basis loss aims to further maximize inter-class covariance. This loss function considers the weight vector between the last hidden layer and a node of the softmax output layer as a basis vector for the corresponding speaker, and penalizes the average pairwise cosine similarity between basis vectors:

$$L_{bs} = \frac{2}{N(N-1)}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} \cos(\mathbf{w}_i, \mathbf{w}_j)$$

where $\mathbf{w}_s$ is the basis vector of speaker $s$ and $N$ is the number of speakers within the training set. Hence, the final objective function used in this study is

$$L = L_{cce} + \lambda L_c + L_{bs}$$

where $L_{cce}$ refers to the categorical cross-entropy loss and $\lambda$ refers to the weight of $L_c$.
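The two auxiliary objectives can be sketched directly in PyTorch. Center loss follows Wen et al.; for the speaker basis loss, the mean-over-pairs normalization is an assumption matching the description above (a pairwise cosine similarity penalty on the output-layer weight vectors).

```python
# Sketch of the two auxiliary objectives. The center loss follows
# Wen et al.; the speaker basis loss penalizes the mean pairwise cosine
# similarity between per-speaker basis vectors (output-layer weights).
# The exact normalization constant is an assumption.
import torch
import torch.nn.functional as F

def center_loss(emb, labels, centers):
    """emb: (M, D) embeddings, labels: (M,), centers: (C, D) class centers."""
    return 0.5 * ((emb - centers[labels]) ** 2).sum(dim=1).mean()

def speaker_basis_loss(weight):
    """weight: (N, D), row i is the basis vector of speaker i."""
    w = F.normalize(weight, dim=1)
    sim = w @ w.t()                           # pairwise cosine similarities
    n = weight.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()
    return off_diag / (n * (n - 1))           # mean over ordered pairs

# total = cce_loss + lam * center_loss(emb, labels, centers) \
#       + speaker_basis_loss(output_layer_weight)
```

Orthogonal basis vectors give a speaker basis loss of zero, while identical basis vectors give a loss of one, so minimizing it pushes speaker bases apart.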
In speaker verification, cosine similarity and PLDA are widely used for back-end classification to determine whether two speaker embeddings belong to the same speaker. Although PLDA has shown competitive results in a number of studies, DNN-based classifiers have also shown potential in previous research. A number of DNN-based back-end classifiers are explored: concatenation of speaker embeddings, and the b-vector and rb-vector systems [22, 23]. A novel DNN-based back-end classifier is also introduced based on an analysis of the b-vector system.
The b-vector proposed by Lee et al. exploits element-wise binary operations on speaker embeddings to represent their relationship. Operations include addition, subtraction, and multiplication. The results are concatenated, composing a b-vector that has three times the dimensions of a single speaker embedding. Although the technique itself is simple, results demonstrate that binary operations effectively represent the relationship between the speaker embeddings.
The rb-vector is an expansion of the b-vector, where an additional r-vector is used alongside the b-vector. The main purpose of the r-vector approach is to represent the relationship of the speaker embeddings to the training set. To compose r-vectors, a fixed number of representative vectors are derived by running k-means on the speaker embeddings of the training set. For every trial, b-vectors are composed by conducting binary operations between the representative vectors and the speaker embedding. These b-vectors are dimensionally reduced using principal component analysis (PCA) and then concatenated, producing r-vectors.
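The r-vector composition can be sketched as follows. The number of clusters, the PCA dimensionality, and the projection matrix `proj` are illustrative assumptions; in practice the projection would be fitted on b-vectors from the training set.

```python
# Sketch of r-vector composition: k-means centroids over training
# embeddings act as representatives; the trial embedding's b-vector
# against each representative is PCA-projected and all projections are
# concatenated. Cluster count and projection size are assumptions.
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means; returns k representative vectors (centroids)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                C[j] = X[assign == j].mean(0)
    return C

def b_vector(a, b):
    """Element-wise binary operations, concatenated (3x the input dim)."""
    return np.concatenate([a + b, a - b, a * b])

def r_vector(emb, reps, proj):
    """proj: (3D, d) projection (e.g., PCA components fitted on
    training b-vectors). Returns the concatenated reduced b-vectors."""
    return np.concatenate([b_vector(emb, r) @ proj for r in reps])
```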
The b-vector approach uses various element-wise binary operations to derive the relationship between the speaker embeddings. However, because weighted summation operations in a DNN can replace (or even find better combinations of) addition and subtraction, we hypothesized that the core term contributing to the success of the b-vector is the multiplication operation. Therefore, we propose an approach using the concatenation of the speaker embedding, test utterance, and their element-wise multiplication. Experimental results show that by only adding element-wise multiplication, performance exceeds that of the b-vector (see Table 4, ‘concat&mul’).
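The proposed back-end input is thus simple to compose. The sketch below uses the 128-dim embeddings and the four 1024-node fully-connected layers reported in the experimental setup; the single-score output head and activation choice are assumptions.

```python
# Sketch of the proposed concat&mul back-end: concatenate the enrolment
# embedding, the test embedding, and their element-wise product, then
# score the pair with a small DNN. Output head and activations are
# illustrative assumptions.
import torch
import torch.nn as nn

def concat_mul(enrol, test):
    """Compose the back-end input from two speaker embeddings."""
    return torch.cat([enrol, test, enrol * test], dim=-1)

def make_backend(emb_dim=128):
    """Four fully-connected layers with 1024 nodes, as in the text."""
    layers, in_dim = [], 3 * emb_dim
    for _ in range(4):
        layers += [nn.Linear(in_dim, 1024), nn.LeakyReLU(0.3)]
        in_dim = 1024
    layers.append(nn.Linear(in_dim, 1))   # same/different-speaker score
    return nn.Sequential(*layers)
```

The weighted summations inside the first linear layer can recover (or improve on) the addition and subtraction terms of the b-vector, so only the multiplication term needs to be supplied explicitly.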
| System | Input Feature | Front-end | Back-end | Loss | Dims | Augment | EER (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Shon et al. | MFCC | x-vector | PLDA | Softmax | 600 | Light | 6.0 |
| Shon et al. | MFCC | 1D-CNN | PLDA | Softmax | 512 | Light | 5.3 |
| Hajibabaei et al. | Spectrogram | ResNet-20 | N/A | A-Softmax | 128 | Heavy | 4.4 |
| Hajibabaei et al. | Spectrogram | ResNet-20 | N/A | AM-Softmax | 128 | Heavy | 4.3 |
| Okabe et al. | MFCC | x-vector | PLDA | Softmax | 1500 | Heavy | 3.8 |
| Nagrani et al. | MFCC | i-vector | PLDA | - | - | - | 8.8 |
| Nagrani et al. | Spectrogram | VGG-M | Cosine | Metric learning | 256 | - | 7.8 |
| Jung et al. | Raw waveform | CNN-LSTM | Cosine | Softmax | 1024 | - | 8.7 |
| Jung et al. | Raw waveform | CNN-LSTM | b-vector | Softmax | 1024 | - | 7.7 |
| Shon et al. | MFCC | x-vector | PLDA | Softmax | 512 | - | 7.1 |
| Shon et al. | MFCC | 1D-CNN | PLDA | Softmax | 600 | - | 5.9 |
We use the VoxCeleb1 dataset, which comprises approximately 330 hours of recordings from 1251 speakers in text-independent scenarios and has a number of comparable recent studies in the literature. All utterances are encoded at a 16 kHz sampling rate with 16-bit resolution. As the dataset comprises various utterances of celebrities from YouTube, it includes diverse background noise and varied durations. We followed the official guidelines, which divide the dataset into training and evaluation sets of 1211 and 40 speakers, respectively.
We did not apply any pre-processing, such as normalization, to the raw waveforms, except pre-emphasis. For mini-batch construction, utterances were either cropped or duplicated to 59,049 samples in the training phase, following [5, 7]. In the evaluation phase, no adjustments were made to length; the whole utterance was used.
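The length handling described above can be sketched as follows; the random-crop offset and tile-then-trim policy for short utterances are assumptions about implementation detail.

```python
# Sketch of the training-time length handling: crop long utterances to
# 59,049 samples and duplicate (tile) short ones up to that length.
# Random crop offsets and the tiling policy are assumptions.
import numpy as np

TARGET = 59049  # fixed training length used in the paper

def fix_length(wave, target=TARGET, rng=None):
    """Return a waveform of exactly `target` samples."""
    if rng is None:
        rng = np.random.default_rng()
    if len(wave) >= target:                     # crop a random segment
        start = rng.integers(0, len(wave) - target + 1)
        return wave[start:start + target]
    reps = int(np.ceil(target / len(wave)))     # duplicate, then trim
    return np.tile(wave, reps)[:target]
```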
RawNet comprises one strided convolutional layer, six residual blocks, one GRU layer, one fully-connected layer, and an output layer (see Table 1). Each residual block comprises two convolutional layers, two batch normalization (BN) layers, two leaky ReLU layers, and a max pooling layer, as shown in Table 1. The residual connection adds the input of each residual block to the output of the second BN layer. A GRU layer with 1024 nodes aggregates frame-level embeddings into an utterance-level embedding. One fully-connected layer is used to extract speaker embeddings. The output layer has 1211 nodes, corresponding to the number of speakers in the training set. Back-end classifiers, including the b-vector, rb-vector, and concat&mul systems, comprise four fully-connected layers with 1024 nodes.
L2 regularization (weight decay) was applied to all layers. An AMSGrad optimizer with learning rate decay was used. A fixed weight was applied to the center loss term. Training was conducted using a mini-batch size of 102. Recurrent dropout at a rate of 0.3 was applied to the GRU layer.
Table 2 demonstrates the effectiveness of modifications made to the model architecture (Section 2.1.) and the pre-training scheme (Section 2.2.). It also shows the effect of additional objective functions for incorporating explicit feature enhancement phases (Section 2.3.). First, modifications to the model architecture and the pre-training scheme reduced the EER from 8.7 % to 6.8 %. Application of additional objective functions further decreased the EER to 4.8 %. The proposed RawNet demonstrates an RER of 44.8 % compared to the baseline. The intermediate model, which does not include additional objective functions, could benefit from explicit feature enhancement techniques. In contrast, RawNet, which includes additional objective functions, exhibits an EER of 4.8 % without explicit feature enhancement techniques. RawNet also outperforms the intermediate model with feature enhancement, showing that it has successfully incorporated the feature enhancement phase.
Comparison with state-of-the-art front-end embedding extraction systems with a cosine similarity back-end is shown in Table 3. The proposed RawNet demonstrates the lowest EER compared to both i-vector systems and x-vector systems. For i-vector systems, we compared two configurations: with and without LDA feature enhancement. For x-vector systems, two configurations based on Shon et al. were compared, where data augmentation was conducted using reverberation and various noise.
Table 4 describes the performance of various back-end classifiers with the proposed RawNet speaker embeddings. In our experiments, PLDA did not show improved results compared to the baseline cosine similarity. On the other hand, DNN-based back-end classifiers demonstrated significant improvement, with an RER of 16 %. The rb-vector system did not show additional improvement over the b-vector system. Among DNN-based back-end classifiers, the proposed approach of using element-wise multiplication of speaker embeddings from enrol and test utterances performed best, with an EER of 4.0 %.
Recent studies using the VoxCeleb1 dataset are compared in Table 5. The first five rows depict systems that utilize data augmentation techniques. “Heavy” refers to the augmentation scheme of [24, 25], with various noise from the PRISM dataset and reverberations from the REVERB challenge dataset. “Light” refers to the scheme used in [21, 34] that doubles the size of the dataset using noise and reverberations. The i-vector system with PLDA back-end in two implementation versions (one from Nagrani et al. and the other from our implementation) demonstrates EERs of 8.8 % and 5.1 %, respectively. The x-vector system with PLDA back-end without data augmentation, as studied by Shon et al., demonstrates an EER of 7.1 %. The proposed RawNet system with concat&mul demonstrates the best performance among systems without data augmentation, exhibiting an EER of 4.0 %. The x-vector/PLDA system conducted by Okabe et al. is the only system showing a lower EER than our proposed system, but the former is subjected to intensive data augmentation, which hinders direct comparison.
In this paper, we propose an end-to-end speaker verification system using two DNNs, one for speaker embedding extraction and the other for back-end classification. The proposed system has a simple, yet efficient, process pipeline where speaker embeddings are extracted directly from raw waveforms and verification results are produced directly by the two DNNs. Various techniques that compose the proposed RawNet have been explored, including the pre-training scheme and additional objective functions. RawNet with the concat&mul back-end classifier demonstrates an EER of 4.0 % on the VoxCeleb1 dataset, which is state-of-the-art among systems without data augmentation, including the x-vector system.
The proposed RawNet with concat&mul inputs raw waveforms and outputs verification results. Such a simplified process pipeline is expected to lower barriers to research and provide opportunities for many researchers to apply new techniques.
Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 499–515.
M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” 2015. [Online]. Available: http://download.tensorflow.org/paper/whitepaper2015.pdf
Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 1019–1027.