HI-MIA: A Far-field Text-Dependent Speaker Verification Database and the Baselines

by Xiaoyi Qin, et al.

This paper presents a large far-field text-dependent speaker verification database named HI-MIA. We aim to meet the data requirements of far-field, microphone-array-based speaker verification, since most publicly available databases are single-channel, close-talking and text-independent. Our database contains recordings of 340 people in rooms designed for the far-field scenario. Recordings are captured by a high-fidelity close-talking microphone and by multiple microphone arrays located in different directions and at different distances from the speaker. In addition, we propose a set of end-to-end neural network based baseline systems that are trained on single-channel and multi-channel data, respectively. Results show that the fusion systems achieve 3.29% EER in the far-field enrollment with far-field testing task and 4.02% EER in the close-talking enrollment with far-field testing task.



1 Introduction

The goal of speaker verification is to verify, from the digital audio signal, whether a speaker's identity matches that of the enrolled target speaker. A speaker verification system typically consists of a speaker embedding extraction module and a verification module. Many approaches for these two modules have been proposed in recent years, and the performance of speaker verification has improved dramatically. In addition, many open and free speech databases covering thousands of speakers have become publicly available. Most of these databases (e.g. AISHELL2 [aishell2_2018], Librispeech [librispeech], Voxceleb1&2 [nagrani_voxceleb:_2017] [chung_voxceleb2_2018]) are recorded in a close-talking environment without noise. This recording environment does not match the far-field scenarios found in real-world smart home or Internet of Things applications. Speaker verification under noisy and reverberant conditions remains a challenging topic. The performance of speaker verification systems drops significantly in the far-field condition, where the speech is recorded from an unknown direction and distance (usually between 1 m and 10 m). The same problem occurs in speech recognition. Although simulation toolkits can convert close-talking speech into simulated far-field speech, a significant channel mismatch remains compared with real recordings. Moreover, the goals of front-end processing differ between speaker verification and speech recognition. Therefore, it is essential to develop an open and publicly available far-field multi-channel speaker verification database.

Various approaches using a single-channel microphone or a multi-channel microphone array have been proposed to reduce the impact of reverberation and environmental noise. These approaches address the problem at different levels of the text-independent automatic speaker verification (ASV) system. At the signal level, linear prediction inverse modulation transfer function [borgstrom_linear_2012] and weighted prediction error (WPE) [mosner_dereverberation_2018, yoshioka_generalization_2012] methods are used for dereverberation. Deep neural network (DNN) based denoising methods for single-channel speech enhancement [zhao_robust_2014, kolboek_speech_2016, oo_dnn-based_2016, eskimez_front-end_2018] and beamforming for multi-channel speech enhancement [mosner_dereverberation_2018, heymann_neural_2016, warsitz_blind_2007] have been explored for ASV systems under complex environments. At the feature level, sub-band Hilbert envelope based features [falk_modulation_2010, sadjadi_blind_2014, ganapathy_feature_2011], warped minimum variance distortionless response (MVDR) cepstral coefficients [jin_speaker_2010], power-normalized cepstral coefficients (PNCC) [PNCC] and DNN bottleneck features [yamada_improvement_2013] have been applied to ASV systems to suppress the adverse impacts of reverberation and noise. At the model level, reverberation matching with multi-condition training models has achieved good performance.

Deep learning has greatly promoted the application of speaker verification technology. Recognition performance has improved significantly from the traditional i-vector method [dehak_front-end_2011] to the DNN-based x-vector method [snyder_x-vectors:_2018]. Recently, CNN-based neural networks [cai_exploring_2018] have also performed well in the speaker verification task. However, both traditional methods and deep learning approaches are data-driven and need a large amount of training data. The lack of real-world far-field data collected with microphone arrays limits the development and application of far-field speaker verification technology in different scenarios.

In this paper, we introduce a database named HI-MIA containing recordings of wake-up words in the smart home scenario. This database covers 340 speakers and a wide range of channels, from close-talking microphones to multiple far-field microphone arrays. It can be used for far-field wake-up word recognition, far-field speaker verification and speech enhancement. In addition, we provide a set of speaker verification baseline systems [xiaoyi_farfield] that are trained on the far-field speaker verification data using transfer learning. With a model pre-trained on large-scale close-talking data, the system performs well on both the far-field enrollment with far-field testing task and the close-talking enrollment with far-field testing task.

2 The HI-MIA database

HI-MIA includes two sub-databases: AISHELL-wakeup (http://www.aishelltech.com/wakeup_data), with utterances of 254 speakers, and AISHELL-2019B-eval (http://www.aishelltech.com/aishell_2019B_eval), with utterances of 86 speakers.

2.1 AISHELL-wakeup

The AISHELL-wakeup database has 3,936,003 wake-up utterances with 1,561.12 hours in total. The utterances cover two wake-up words, "ni hao, mi ya" (你好,米雅) in Chinese and "Hi, Mia" in English. The average duration of the utterances is around 1 second. The dataset is fairly gender-balanced, with 131 male speakers and 123 female speakers. The distribution of age and gender is shown in Figure 2. During the recording process, seven recording devices (one close-talking microphone and six 16-channel circular microphone arrays) were set up in a real smart home environment. The duration of the utterances recorded by each microphone is 16 hours. The 16-channel circular microphone arrays record at 16 kHz, 16 bit, and the close-talking microphone records at 44.1 kHz, 16 bit, providing high-fidelity (HiFi) clean speech.

Each speaker recorded 160 utterances, with 120 utterances recorded in a noisy environment (TV or music playing) and the remaining 40 utterances recorded in the clean home environment. The details of the database are shown in Table 1.

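| Text ID | Content | Speed | Environment (AISHELL-wakeup) | Environment (AISHELL-2019B-eval) |
| --- | --- | --- | --- | --- |
| 001-020 | ni hao, mi ya | Normal | TV / Music | Clean |
| 021-040 | hi, mia | Normal | TV / Music | Clean |
| 041-060 | ni hao, mi ya | Fast | TV / Music | Clean |
| 061-080 | hi, mia | Fast | TV / Music | Clean |
| 081-100 | ni hao, mi ya | Slow | TV / Music | Clean |
| 101-120 | hi, mia | Slow | TV / Music | Clean |
| 121-140 | ni hao, mi ya | Normal | Clean | TV / Music |
| 141-160 | hi, mia | Normal | Clean | TV / Music |

Table 1: Details of the utterances recorded by each speaker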
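The headline figures quoted in Section 2.1 can be cross-checked with a little arithmetic. The sketch below assumes that every array channel and the close-talking microphone is stored as a separate single-channel recording of each utterance; that storage layout is an assumption made for illustration, not something the text states.

```python
# Figures quoted in Section 2.1 of the text.
speakers = 254
utterances_per_speaker = 160
arrays = 6
channels_per_array = 16
close_talking_mics = 1
reported_utterances = 3_936_003
reported_hours = 1561.12

# Assumption: one single-channel file per (speaker, utterance, channel).
channels = arrays * channels_per_array + close_talking_mics
upper_bound = speakers * utterances_per_speaker * channels

# Fraction of the theoretical maximum that is actually present.
coverage = reported_utterances / upper_bound

# Average length of one single-channel utterance, in seconds.
mean_duration_s = reported_hours * 3600 / reported_utterances

# Hours of audio per recording channel.
hours_per_channel = reported_hours / channels

print(f"channels per utterance: {channels}")
print(f"upper bound on files:   {upper_bound}")
print(f"coverage:               {coverage:.4f}")
print(f"mean duration (s):      {mean_duration_s:.2f}")
print(f"hours per channel:      {hours_per_channel:.2f}")
```

Under this assumption the reported utterance count is slightly below the theoretical maximum, and the per-channel duration comes out close to the 16 hours quoted in the text.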

The recordings of each speaker can be categorized into three subsets according to the speaking speed (normal, fast and slow). We simulated real smart home scenes by adding noise sources such as TV, music and background noise to the room. The room setting is shown in Figure 1. The high-fidelity microphone is 25 cm away from the speaker. The circular microphone arrays are placed around the speaker at distances of 1 m, 3 m and 5 m. For each speaker, the noise source is randomly placed close to one of the microphone arrays.

2.2 AISHELL-2019B-eval

The details of AISHELL-2019B-eval are also shown in Table 1. The dataset contains recordings of 44 male speakers and 42 female speakers. Unlike in AISHELL-wakeup, each speaker records 160 utterances of which 120 are recorded in a quiet environment and the remaining 40 in a noisy environment. The room setting of AISHELL-2019B-eval is the same as that of AISHELL-wakeup. Instead of placing the noise source close to a microphone array, we place it at a fixed location four meters away from the speaker.

Figure 1: The setup of the recording environment
Figure 2: Gender and age distribution

3 The Baseline Methods

3.1 Deep speaker embedding system

3.1.1 Model architecture

The superiority of deep speaker embedding systems has been shown in text-independent speaker recognition for close-talking [snyder_x-vectors:_2018, cai_exploring_2018] and far-field scenarios [nandwana_robust_2018, dku-voices]. In this paper, we adopt the deep speaker embedding system, which was originally designed for text-independent speaker verification, as the baseline for far-field speaker verification. We train models for both single-channel and multi-channel input.

The single-channel network structure is the same as in [cai_exploring_2018]. There are three main components in this framework. The first component is a deep CNN based on the well-known ResNet-34 (residual convolutional neural network) architecture, in which we increase the widths (number of channels) of the residual blocks from {16, 32, 64, 128} to {32, 64, 128, 256}. A global statistics pooling (GSP) layer is then placed after the ResNet-34 as the encoding layer, which transforms the feature maps into a fixed-dimensional utterance-level representation. The output of GSP is normalized by its mean and standard deviation. A fully-connected layer then processes the utterance-level representation, followed by a classification output layer. We add dropout with a rate of 0.5 before the output layer to prevent over-fitting. Each unit in the output layer corresponds to a target speaker. The cross-entropy loss is adopted as the training objective.
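The role of the encoding layer can be illustrated with a minimal pure-Python sketch. It assumes the common formulation of statistics pooling in which the per-dimension mean and standard deviation over time are concatenated; the function name `global_statistics_pooling` and the list-of-lists input layout are hypothetical and are not taken from the authors' implementation.

```python
from statistics import fmean, pstdev


def global_statistics_pooling(frames):
    """Pool T frame vectors of dimension D into one vector of dimension 2*D.

    Assumption: the pooled vector is the per-dimension mean over time
    followed by the per-dimension (population) standard deviation.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    # Transpose so that each entry holds all T values of one dimension.
    per_dimension = list(zip(*frames))
    means = [fmean(values) for values in per_dimension]
    stds = [pstdev(values) for values in per_dimension]
    return means + stds


# Two utterances of different length map to vectors of the same size.
short_utt = [[1.0, 2.0], [3.0, 2.0]]
long_utt = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
print(global_statistics_pooling(short_utt))
print(len(global_statistics_pooling(long_utt)))
```

The point of the sketch is that the output size depends only on the feature dimension, not on the number of frames, which is what lets a fully-connected layer follow the pooling layer.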

The network is trained using standard stochastic gradient descent (SGD) with momentum 0.9 and weight decay 1e-4. We use ReduceLROnPlateau in PyTorch to adjust the learning rate, with an initial value of 0.01. For each training step, an integer frame length is randomly drawn from a preset interval, and every sample in the mini-batch is cropped or extended to that number of frames.
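The variable-length batching step can be sketched as follows. The interval bounds are not recoverable from the text, so `min_frames` and `max_frames` are hypothetical parameters, and extending short utterances by repetition is an assumption about how "extended" is realized.

```python
import random


def fit_to_length(frames, target_len, rng=random):
    """Crop (at a random offset) or extend (by repetition) to target_len frames."""
    n = len(frames)
    if n == 0:
        raise ValueError("utterance has no frames")
    if n >= target_len:
        start = rng.randint(0, n - target_len)
        return frames[start:start + target_len]
    # Assumption: short utterances are extended by repeating themselves.
    repeats = -(-target_len // n)  # ceiling division
    return (frames * repeats)[:target_len]


def make_batch(utterances, min_frames, max_frames, rng=random):
    """Draw one target length per training step and apply it to the whole batch."""
    target_len = rng.randint(min_frames, max_frames)
    return [fit_to_length(utt, target_len, rng) for utt in utterances]


# Hypothetical bounds, chosen only for the demonstration.
rng = random.Random(0)
batch = make_batch([list(range(50)), list(range(500))], 200, 300, rng)
print([len(sample) for sample in batch])
```

Drawing a single length per step keeps every sample in the mini-batch the same size, so the batch can be stacked into one tensor while the network still sees different durations across steps.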

After training, the utterance-level speaker embedding of a given utterance is extracted from the penultimate layer of the neural network. Cosine similarity and PLDA serve as back-end scoring methods during testing.
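Cosine scoring between an enrollment embedding and a test embedding can be written in a few lines. This is a generic sketch; the decision threshold in `accept` is a hypothetical parameter that would in practice be tuned on development data.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between two embeddings of equal dimension."""
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm_enroll = math.sqrt(sum(a * a for a in enroll))
    norm_test = math.sqrt(sum(b * b for b in test))
    if norm_enroll == 0.0 or norm_test == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_enroll * norm_test)


def accept(enroll, test, threshold):
    """Accept the trial when the score reaches the (hypothetical) threshold."""
    return cosine_score(enroll, test) >= threshold


print(cosine_score([1.0, 0.0], [1.0, 0.0]))
print(accept([1.0, 0.0], [0.0, 1.0], threshold=0.5))
```

Because the score depends only on the angle between the two vectors, it is insensitive to the overall scale of the embeddings.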

3.1.2 Training data augmentation for far-field ASV

Data augmentation can effectively improve the robustness of the deep speaker embedding model. Therefore, we augment the data by adding reverberation and noise to simulate far-field speech in real environments. This will reduce the mismatch between training data and test data.

We use the same method as in [xiaoyi_farfield] for data augmentation, employing pyroomacoustics [pyroomacoustics] to simulate real room recordings. By randomly setting the size of the room and the positions of the microphone and the noise source, we obtain simulated far-field data. As noise sources, we choose both the environmental and the music noise in the MUSAN dataset [musan] and set the signal-to-noise ratio (SNR) between 0 and 20 dB. In addition, we set up a 6-channel microphone array for the simulated recordings, which matches the input of the ResNet3d model.
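The SNR setting can be made concrete with a simplified sketch. It mixes noise into speech additively at a requested SNR and deliberately ignores the room simulation, so it is not the pyroomacoustics pipeline described above; drawing the SNR uniformly from the 0 to 20 dB range is also an assumption.

```python
import math
import random


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power equals snr_db, then add."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as the speech")
    noise = noise[:len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must have non-zero power")
    # Solve 10 * log10(p_speech / (gain**2 * p_noise)) = snr_db for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals; real use would load waveforms from disk.
speech = [1.0, -1.0] * 100
noise = [0.5, 0.5, -0.5, -0.5] * 50
snr_db = random.uniform(0.0, 20.0)  # assumption: uniform over the stated range
noisy = mix_at_snr(speech, noise, snr_db)
print(len(noisy))
```

In the simulated-room setting the same gain logic would be applied to the noise source signal before it is placed in the room, so that the ratio holds at the source rather than at each microphone.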

3.2 Model Fine-tuning

We only have a limited amount of text-dependent far-field speaker data. If we trained directly on these data, the text-dependent deep speaker embedding model could not learn discriminative speaker information well and would likely overfit to a few speakers. It is therefore important to first train a baseline speaker model on a large amount of text-independent speaker data.

Following [xiaoyi_farfield], we adopt a transfer learning strategy that adapts a text-independent deep speaker embedding model to a text-dependent model. With transfer learning, the adapted text-dependent model takes advantage of a model pre-trained on a large number of speakers, without training the whole network from scratch. After the text-independent deep speaker model is trained, transfer learning adapts the front-end local pattern extractor, the encoding layer and the embedding extraction layer to the text-dependent task.

Figure 3 shows the transfer learning process of the text-dependent deep speaker embedding model.

3.3 Enrollment data augmentation

In the close-talking enrollment with far-field testing task, the mismatch between the enrollment and testing data degrades the performance significantly.

We reduce the mismatch by augmenting the enrollment data with different simulation strategies. During testing, the deep speaker embeddings of the simulated utterances are fused with the embedding of the original enrollment utterance.
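Section 4.4 states that the simulated and original enrollment utterances are averaged at the embedding level. A minimal sketch of that fusion, with hypothetical function names and toy two-dimensional embeddings, could look like this:

```python
def average_embeddings(embeddings):
    """Element-wise mean of a non-empty list of equal-length embeddings."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def build_enrollment(original, simulated):
    """Fuse the original enrollment embedding with its simulated far-field copies."""
    return average_embeddings([original] + list(simulated))


# Toy example: one close-talking embedding and two simulated far-field ones.
original = [1.0, 0.0]
simulated = [[0.0, 1.0], [0.5, 0.5]]
print(build_enrollment(original, simulated))
```

The averaged vector then takes the place of the single close-talking embedding when the enrollment model is scored against far-field test utterances.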

Figure 3: Transfer the text-independent deep speaker embedding model to text-dependent model.

4 Experiments

4.1 Text-independent corpora

AISHELL-2 (http://www.aishelltech.com/aishell_2) is an open and publicly available Mandarin Chinese speech recognition dataset. In this study, we use the iOS channel of the dataset, which contains 984,907 close-talking utterances from 1,997 speakers. We use this dataset to simulate far-field data, which serve as the text-independent database for pre-training an ASV model.

4.2 Text-dependent corpora

The Mandarin wake-up word "ni hao, mi ya" was chosen for our experiments. We use the AISHELL-wakeup data for fine-tuning and AISHELL-2019B-eval as the test set. Based on our previous experimental results, the last 42 speakers in AISHELL-2019B-eval are more challenging, so we select their utterances as the test data.

In this paper, we consider two tasks: close-talking enrollment and far-field enrollment. Both tasks are tested with far-field data. For close-talking enrollment with far-field testing, we use the data from the close-talking HiFi microphone for enrollment. For far-field enrollment with far-field testing, we use the data from one microphone array, located 1 m away from the speaker, for enrollment.

4.3 Baseline system and fine-tuned model

In this work, we train two single-channel models and one multi-channel model. The performance of these models is shown in Table 2.

Compared with the close-talking enrollment task, the far-field enrollment task achieves about 20% relative improvement in equal error rate (EER) on the same far-field test data. This means that, although the enrollment data may no longer be clean after augmentation, they match the test data better. The basic model (ResNet34-Cosine) in Table 2 is trained on the AISHELL-2 data and scored with cosine similarity. The fine-tuned model (ResNet34-FT-Cosine) improves on the basic model by about 20%. The PLDA back-end (ResNet34-FT-PLDA) compensates for channel effects and yields a further improvement of about 20%. Fusing the speaker embedding features of the 16 channels improves the performance further.
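For reference, the EER is the operating point at which the false acceptance rate and the false rejection rate are equal. The sketch below is a simple threshold sweep over toy scores and is not the evaluation script used for the reported results; the function name and the tie-breaking rule (first threshold with the smallest gap) are choices made for illustration.

```python
def compute_eer(target_scores, nontarget_scores):
    """Return (eer, threshold) from a sweep over all observed scores.

    A trial is accepted when its score is >= threshold. The EER is
    approximated by the mean of FAR and FRR at the threshold where
    the two rates are closest.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best = None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Toy scores: one target trial scores below two non-target trials.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer, threshold = compute_eer(targets, nontargets)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

The sweep is quadratic in the number of trials, which is fine for a demonstration; a production scorer would sort the scores once and accumulate the error counts in a single pass.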

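| ID | Model | Enrollment | EER |
| --- | --- | --- | --- |
| 1 | ResNet34-Cosine | far-field | 6.54% |
| 2 | ResNet34-Cosine | close-talking | 7.41% |
| 3 | ResNet34-FT-Cosine | far-field | 5.08% |
| 4 | ResNet34-FT-Cosine | close-talking | 6.66% |
| 5 | ResNet34-FT-PLDA | far-field | 3.92% |
| 6 | ResNet34-FT-PLDA | close-talking | 5.36% |
| 7 | ResNet34-FT-PLDA Multi-feature Fusion | far-field | 3.7% |
| 8 | ResNet34-FT-PLDA Multi-feature Fusion | close-talking | 4.71% |
|  | Fusion (1 + 3 + 5 + 7) | far-field | 3.29% |
|  | Fusion (2 + 4 + 6 + 8) | close-talking | 4.02% |

Table 2: EER of different speaker embedding systems.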
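The relative improvements discussed in the text can be recomputed from the EER values in Table 2. The snippet below is only a worked calculation; the dictionary keys are labels chosen for this sketch.

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction of new_eer with respect to baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer


# EER values (in percent) copied from Table 2: (far-field, close-talking).
table2 = {
    "ResNet34-Cosine": (6.54, 7.41),
    "ResNet34-FT-Cosine": (5.08, 6.66),
    "ResNet34-FT-PLDA": (3.92, 5.36),
    "ResNet34-FT-PLDA Multi-feature Fusion": (3.7, 4.71),
    "Fusion": (3.29, 4.02),
}

# Far-field enrollment compared with close-talking enrollment, per system.
for model, (far_field, close_talking) in table2.items():
    gain = relative_reduction(close_talking, far_field)
    print(f"{model}: {gain:.1%} lower EER with far-field enrollment")

# Effect of fine-tuning and of the PLDA back-end under far-field enrollment.
fine_tune_gain = relative_reduction(table2["ResNet34-Cosine"][0],
                                    table2["ResNet34-FT-Cosine"][0])
plda_gain = relative_reduction(table2["ResNet34-FT-Cosine"][0],
                               table2["ResNet34-FT-PLDA"][0])
print(f"fine-tuning: {fine_tune_gain:.1%}, PLDA: {plda_gain:.1%}")
```

Under far-field enrollment, both the fine-tuning step and the PLDA back-end come out at a little over 20% relative reduction, which agrees with the rounded figures quoted in the text.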

4.4 Enrollment data augmentation

In Table 2, close-talking enrollment with far-field testing always performs worse than far-field enrollment with far-field testing. The main reason is the channel mismatch between the enrollment utterance and the test utterance. We therefore investigate enrollment data augmentation to compensate for this mismatch. We use the pyroomacoustics toolkit to simulate far-field speech and augment the original enrollment utterance with different numbers of simulated far-field utterances. The simulated far-field enrollment utterances and the original enrollment utterance are averaged at the embedding level. The results show that enrollment data augmentation reduces the gap between the far-field enrollment with far-field testing task and the close-talking enrollment with far-field testing task. The details of the simulation method are as follows.

4.5 System fusion

For the far-field enrollment task, we fuse the results of systems 1, 3, 5 and 7. For the close-talking enrollment task, we fuse the results of systems 2, 4, 6 and 8. The performance improves significantly after system fusion, which indicates that our systems are complementary. The AISHELL-2 database is still not very large; we believe that training the basic model on more text-independent data would improve the system's performance further.
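The text does not specify how the system outputs are combined. As one plausible illustration, the sketch below fuses systems at the score level by z-normalizing each system's trial scores and taking a weighted average; both the normalization and the equal default weights are assumptions, not the authors' stated method.

```python
from statistics import fmean, pstdev


def zscore(scores):
    """Normalize one system's scores to zero mean and unit variance."""
    mean = fmean(scores)
    std = pstdev(scores)
    if std == 0.0:
        return [0.0] * len(scores)
    return [(s - mean) / std for s in scores]


def fuse_scores(system_scores, weights=None):
    """Weighted average of z-normalized scores; one fused score per trial."""
    if not system_scores:
        raise ValueError("at least one system is required")
    n_trials = len(system_scores[0])
    if any(len(scores) != n_trials for scores in system_scores):
        raise ValueError("all systems must score the same trials")
    if weights is None:
        # Assumption: equal weights when nothing better is known.
        weights = [1.0 / len(system_scores)] * len(system_scores)
    normalized = [zscore(scores) for scores in system_scores]
    return [
        sum(w * scores[i] for w, scores in zip(weights, normalized))
        for i in range(n_trials)
    ]


# Two hypothetical systems with very different score ranges.
cosine_system = [0.1, 0.5, 0.9]
plda_system = [10.0, 20.0, 30.0]
print(fuse_scores([cosine_system, plda_system]))
```

Normalizing before averaging keeps a system with a wide score range, such as a PLDA back-end, from dominating one with a narrow range, such as cosine similarity.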

5 Conclusions

In this paper, we describe the HI-MIA database, collected in a far-field scenario. The database contains multi-channel far-field speech data that can be used for text-dependent far-field speaker verification, wake-up word detection and speech enhancement. The database has two sub-datasets: AISHELL-wakeup, which can be used as training data, and AISHELL-2019B-eval, which is designed as development and test data. In addition, we propose several baseline systems and define two tasks, far-field enrollment and close-talking enrollment. We also introduce methods and strategies for training with limited text-dependent data, together with the corresponding enrollment data augmentation strategies. Results show that augmenting the enrollment utterance towards the test utterance can effectively improve system performance.

6 Acknowledgements

This research was funded in part by the National Natural Science Foundation of China (61773413), the Natural Science Foundation of Guangzhou City (201707010363), the Six Talent Peaks project in Jiangsu Province (JY-074) and the Science and Technology Program of Guangzhou City (201903010040).