[Preprint] "AutoSpeech: Neural Architecture Search for Speaker Recognition" by Shaojin Ding*, Tianlong Chen*, Xinyu Gong, Weiwei Zha, Zhangyang Wang
Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet. However, these backbones were originally proposed for image classification, and therefore may not be naturally fit for speaker recognition. Due to the prohibitive complexity of manually exploring the design space, we propose the first neural architecture search approach approach for the speaker recognition tasks, named as AutoSpeech. Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell for multiple times. The final speaker recognition model can be obtained by training the derived CNN model through the standard scheme. To evaluate the proposed approach, we conduct experiments on both speaker identification and speaker verification tasks using the VoxCeleb1 dataset. Results demonstrate that the derived CNN architectures from the proposed approach significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.READ FULL TEXT VIEW PDF
[Preprint] "AutoSpeech: Neural Architecture Search for Speaker Recognition" by Shaojin Ding*, Tianlong Chen*, Xinyu Gong, Weiwei Zha, Zhangyang Wang
Speaker recognition aims to retrieve the identity of the speaker given his/her utterances. According to the content similarity of the utterances, speaker recognition can be categorized into text-dependent and text-independent, the latter being more general and realistic for practical applications. Additionally, according to different application settings, speaker recognition usually fall into one of the two categories: speaker identification (SID) and speaker verification (SV). SID identifies the speaker of an utterance from a known speaker set, and SV determines if the speaker of an utterance matches its given enrollment.
End-to-end speaker recognition systems [variani2014deep, wan2018generalized, xie2019utterance, sadjadi2016ibm, li2017deep, snyder2018x] have emerged in recent years and achieved the state-of-the-art performance. These models usually admit a three-stage pipeline
: (1) A deep neural network, typically Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), as a feature extractor to generate frame-wise speaker embedding; (2) a temporal aggregation layer, to produce fixed length speaker embedding (i.e., d-vector); (3) A loss function to optimize the entire network. In testing stage, the network along with the temporal aggregation layer are first used to produce d-vectors for testing utterances, and then a similarity metric (e.g., cosine similarity) follows to generate final same/different speaker decision based on the d-vectors.
Most CNN/RNN-based speaker recognition works focus on improving the effectiveness of the speaker embedding through advanced training objectives and temporal aggregation strategies [gao2019improving, hajibabaei2018unified, xie2019utterance, cai2018exploring, li2017deep]. By contrast, network architecture design has received relatively less attention. Existing studies usually use off-the-shelf backbones that have been shown to be successful in standard tasks such as image classification (e.g., VGG-Net [simonyan2014very], ResNet [he2016deep]). However, these backbones are not designed for and may not be optimal for speaker recognition. Meanwhile, in other speech processing tasks such as speech recognition and speech synthesis, the improved design of network architecture has already yielded evident performance gains [chan2016listen, shen2018natural]. In view of those, we conjecture that enhancing network architecture design matters for improving speaker recognition performance too. Yet as the major hurdle for improving the backbone of any new task, manually exploring the gigantic design space of deep networks is notoriously tedious and ad-hoc.
This paper proposes an automated approach to identify the optimal CNN architecture for text-independent speaker recognition, named as AutoSpeech. It is inspired by recent advances on neural architecture search (NAS) [liu2018darts, real2019regularized, pham2018efficient, zoph2016neural, negrinho2017deeparchitect], which has proven success in designing deep networks that outperform hand-designed best performers in various tasks [baruwa2019leveraging, liu2019auto, gong2019autogan, quan2019auto]. We evaluated the effectiveness of AutoSpeech on both SID and SV tasks using VoxCeleb1 dataset [nagrani2017voxceleb]. Our derived CNN architectures significantly outperforms several networks used in state-of-the-art speaker recognition systems (based on VGG-M, ResNet-18, and ResNet-34), at lower model complexities.
Earlier speaker recognition systems used spectral features (e.g., MFCCs) as the speaker embedding [martinez2012speaker, tiwari2010mfcc], but these systems have been superseded by systems based on “i-vectors” [dehak2010front, garcia2011analysis, kenny2010bayesian]
. An i-vector takes speech from a speaker and uses it to adapt a speaker-independent Gaussian Mixture Model (GMM, referred to as a Universal Background Model). The means of the adapted GMM are concatenated to form a supervector, which is then reduced in dimensionality using joint factor analysis. More recently, end-to-end speaker recognition[variani2014deep, wan2018generalized, xie2019utterance, sadjadi2016ibm, li2017deep] systems have been shown to surpass the i-vector based systems and achieve state-of-the-art performance. Variani et al. first proposed neural network based speaker recognition system. In their work, they used maxout fully-connected network to produce d-vectors, and then they used cosine similarity to make the final decision. Following this, advance network architectures such as CNNs [xie2019utterance, li2017deep, cai2018exploring, hajibabaei2018unified, bhattacharya2019deep] and RNNs [li2017deep, wan2018generalized]
are used for feature extracts. Additionally, advanced training objectives[wan2018generalized, hajibabaei2018unified, bhattacharya2019deep] and temporal aggregation strategies [gao2019improving, xie2019utterance, cai2018exploring, li2017deep] are also proposed to improve speaker recognition performance.
The goal of neural architecture search is to search for an optimal model architecture for a given task. To accomplish this, a search space will be constructed firstly, which may vary across different tasks and objectives. Towards accurate image classification, many previous work [liu2018darts, real2019regularized, pham2018efficient] adopt a cell-based search space with complicated inner connection. To improve its efficiency, [tan2019mnasnet, wu2019fbnet, dai2019chamnet] adopts a sequential-based search space, benefiting a lot in reducing latency and FLOPs inherently. More recently, neural architecture search has also been applied to other tasks (e.g., image segmentation [liu2019auto, chen2019fasterseg], generative adversarial network [gong2019autogan], and phone recognition [baruwa2019leveraging]
), and task-specific search spaces are usually designed. To conduct architecture search over the defined search space, various optimization methods are proposed. Barret and Quoc first use the reinforcement learning (RL) to search for architecture[zoph2016neural], where a RNN serves as an architecture sampler. However, it is quite time-consuming (2,000 GPU days). To address this limitation, Hieu et al. propose to use weight sharing (ENAS[pham2018efficient]) accelerate RL-based search method. Furthermore, Liu et al. present a gradient-based method [liu2018darts] to accelerate the searching process.
In this section, we introduce how to use NAS to automatically find the optimal CNN architecture for speaker recognition. Following previous NAS methods [liu2018darts, zoph2016neural], we use a two-stage pipeline. First, we search for the optimal architecture of a neural cell that consists of several different types of operations. Second, we derive a CNN model using the searched cell, and then train the CNN model through the standard speaker recognition training scheme. We first define the search space of a neural cell in Section 3.1. Then, we describe the NAS algorithm in Section 3.2. Finally, we introduce how to derive a CNN model using searched neural cells and how to train it, in Section 3.3.
During the NAS process, the network is constructed by stacking multiple neural cells. We define a neural cell to have nodes, as illustrated in Figure 1. A cell can be viewed as a directed acyclic graph, where a node
corresponds to a tensor and an directed edgecorresponds to an operation Following [liu2018darts, zoph2016neural, real2019regularized, liu2018progressive], we set each cell to have two input nodes, four intermediate nodes, and one output node. The input nodes are set equal to the outputs of previous two cells, respectively (e.g., The first node in the -th cell is equal to the output of the (-2)-th cell, and the second node in the -th cell is equal to the output of the (-1)-th cell). The four intermediate nodes ( to ) are densely connected (red arrows in Figure 1, 14 edges in total), and each of them is computed as the summation based on all of its predecessors:
A best combination of the operations on these edges are found at the end of the searching process. Finally, the output node is defined as the concatenation of all intermediate nodes.
The search space defines all possible candidate operations in a neural cell. We include the following 8 operations that are prevalent in modern CNNs in our search space, and the resulting the search space size is (14 edges, 8 candidate operations for each edge).
no connection (zero)
We stack the neural cell 8 times to form the backbone CNN during the NAS process. Additionally, we define two types of neural cells: normal cells (cells that keep the spatial resolution of the feature tensor) and reduction cells (cells that divide the spatial resolution by 2 and multiply the number of filters by 2). Following previous studies [zoph2018learning, liu2018darts, real2019regularized], we set the cells whose position is located at the and of the total depth to reduction cells, and the other cells are normal cells. All the normal cells share the same architecture, and all the reduction cells share the same architecture, respectively. The output of the last cell is then fed to an average pooling layer, followed by a fully-connected layer that outputs the softmax probability.
To search for the optimal operation combinations in a neural cell, we define two sets of parameters: (1) a set of architecture parameters that controls the choice of operations and (2) a set of weight parameters of all operations in . We use architecture parameters to relax the categorical choice of a particular operation on edge to a softmax over all possible operations in the search space:
As a result, the search space becomes continuous, and NAS can be realized by optimizing the architecture parameters. In addition, since we have two types of neural cells (normal cells and reduction cells), the architecture parameters becomes , where are shared among all normal cells and shared among all reduction cells.
We use a differentiable NAS algorithm [liu2018darts] to learn and jointly through back-propagation. We denote the training loss as and the validation loss as . The NAS process can be viewed as a bi-level optimization problem [colson2007overview] (eq. 3), which aims to find an optimal that minimizes , where the optimal is determined by minimizing .
In our task, we use cross-entropy loss for both and :
where is the indicator function, is the number of speakers, is the ground-truth speaker, and is the softmax probability of speaker .
In practice, this bi-level optimization problem can be solved iteratively, as shown in Algorithm 1. During each iteration, we perform two steps: weight parameters update and architecture parameters update. In the weight parameters update step, we update by minimizing the the cross-entropy loss on training set , while fixing . In the architecture parameter update step, we update by minimizing the cross-entropy loss on validation set , while fixing .
The algorithm terminates when the choice of operations in the neural cell converges, which is empirically measured by the entropy of the architecture parameters as:
A smaller the entropy indicates a higher confidence of choosing a particular operation among all possible operations. As a result, we consider the architecture of the neural cell being converged when the entropy does not decrease.
Once the search process is converged, we can derive a neural cell architecture from the learned architecture parameters. For each node , we retain two operations with highest softmax probabilities (excepting zero operation) from all its predecessors , following [liu2018darts]. The softmax probability of an operation between nodes is defined as:
With the derived neural cell, we construct a CNN by stacking multiple neural cells. We explore different number of neural cells and different number of channels in our experiments, as we will describe in Section 4.3. Similar to the CNN architecture in NAS process, we still set the cells whose position is located at the and of the total depth to reduction cells, and the other cells are normal cells.
To obtain the final speaker recognition model, we train the CNN model from random initialized weights (denoted as training from scratch) on training set by minimizing the cross-entropy loss in eq. 4. We use the output of the average pooling layer of the trained CNN as the speaker embedding.
|VGG-M [nagrani2017voxceleb]||80.50||92.1||10.20||1,024||67 million|
|ResNet-18 [hajibabaei2018unified, bhattacharya2019deep]||79.48||90.97||8.17||512||12 million|
|ResNet-34 [chung2018voxceleb2, xie2019utterance, cai2018exploring]||81.34||94.49||4.64||512||22 million|
|Proposed ()||84.45||94.74||2.56||256||5 million|
|Proposed ()||83.45||94.21||1.99||256||18 million|
|Proposed ()||87.66||96.01||1.45||512||18 million|
We conducted experiments on VoxCeleb1 [nagrani2017voxceleb] dataset consisting of 153,516 utterances produced by 1,251 speakers. VoxCeleb1 has an identification split and a verification split for SID and SV, respectively. The identification split divides the entire dataset to a training set of 138,316 utterances, a development set of 6,904 utterances, and a test set of 8,251 utterances. All the three sets were produced by the same 1,251 speakers. The verification split divides the dataset into a training set of 148,642 utterances produced by 1,211 speakers and a test set of 4,874 utterances produced by 40 speakers. The test speakers do not overlap with training. The verification split also provided 37,720 trial pairs for SV evaluation. During the NAS process, we followed the identification split. These subsets were used as introduced in Section 3.2. During training from scratch, we train different models for SID and SV tasks following identification split and verification split, respectively.
For each utterance, we extracted a 257-dim spectrogram with a 25ms window and 10ms shift. We performed mean and variance normalization on each frequency bin of the spectrogram. To form a mini-batch, we randomly selected 3-second segments from utterances. As a result, the input size is.
We implemented the proposed method based on PyTorch[paszke2017automatic] and NVIDIA TITAN RTX GPU111Codes available at https://github.com/TAMU-VITA/AutoSpeech. During the NAS process, the network with 8 neural cells was training using the algorithm in Section 3.2
for 50 epochs with a batch size of 16. We set the initial channels to 16, due to the limited GPU memory. We used Adam Optimizer to optimize both the architecture parametersand weight parameters . We set the initial learning rate of the optimizer for to , and we set that of the optimizer for to . Both learning rate was annealed down to zero following a cosine schedule [loshchilov2016sgdr]. We set the weight decay of both optimizer to . The entire searching process takes around 5 days on a single GPU.
During training from scratch, the model was trained for 300 epochs with a batch size of 128. We explored different initial channels, as we will describe in Section 4.3. We used Adam Optimizer with initial learning rate of , which was annealed down to zero following a cosine schedule. We set the weight decay of the optimizer to . The entire training process takes around 1 day on a single GPU.
We explored the effect of using different number of neural cells and different number of channels during training from scratch. First, we derived a relatively light-weight model by setting the number of neural cells to and the number of initial channels . Following this, we tried to construct two larger models by growing the number of neural cells and the number of channels, respectively. For a fair comparison between the two large models, we kept the total number of parameters the same in the two models. Consequently, we have a second model with , and a third model with . The embedding dimensions and the number of parameters of these models are shown in Table 1.
When evaluating the derived architecture for speaker identification, we followed the identification split of VoxCeleb1 dataset. We trained the derived CNN on training set and evaluated it on testing set. To obtain utterance-level speaker embeddings, we divided each test utterance into 3-second segments with 1.5-second overlap. Following this, we extracted the speaker embeddings on these segments, and used the average of these embeddings as the utterance-level speaker embedding. We computed Top-1 and Top-5 accuracies on these utterance-level speaker embeddings to measure the performance.
When evaluating the derived architecture for speaker verification, we followed the verification split of VoxCeleb1 dataset. We trained the derived CNN on training set and evaluated it on the trial pairs. We obtain utterance-level speaker embeddings as described above, and we used cosine similarity as the similarity metric. We used Equal Error Rate (EER) to measure the performance, which is commonly used for speaker verification systems [nagrani2017voxceleb].
We first analyze the effect of the number of neural cells and the number of channels in the proposed method. We found that the model with 8 neural cells and 128 initial channels achieved the best performance among the proposed models on both identification (Top-1: , Top-5: ) and verification (EER: ). Additionally, the two large models both significantly outperform the light-weight ones. When comparing between the two large models, we found that the model with () achieves better performance that the model with (). This result suggests that increasing the number of channels is more effective than increasing the number of neural cells, when constructing a larger model from the searched neural cell.
We next compared the proposed method against three CNN architectures that were commonly used for speaker recognition: VGG-M [nagrani2017voxceleb], ResNet-18 [hajibabaei2018unified, bhattacharya2019deep], and RestNet-34 [chung2018voxceleb2, xie2019utterance, cai2018exploring]
. As there are no open-source implementations from previous works on ResNet-18 and ResNet-34, we implemented these two architecture under similar training settings as the proposed method for a fair comparison. As shown in Table1, all our proposed models outperform the baseline CNN architectures. Specifically, our light-weight model () outperforms ResNet-34, the best baseline architecture, by Top-1 and EER, with only 5 million parameters (ResNet-34: 22 million parameters). Our best model () outperforms ResNet-34 by Top-1 and EER, with only 18 million parameters. These results showed that the searched CNN architectures significantly improved both speaker identification and speaker verification performance, while achieving higher parameter efficiency and enjoying lighter weights.
This paper proposed an automatic approach to find the optimal CNN architecture for speaker recognition. The proposed approach has two stages. The first stage searches for the optimal architecture of a neural cell consisting of several different types of operations. The second stage derives a CNN model by stacking the searched neural cell several times and then trains the CNN model through the standard speaker recognition training process. Evaluation results on both speaker identification and speaker verification demonstrate significantly improved performance, while having lower model complexity.