CN-CELEB: a challenging Chinese speaker recognition dataset

10/31/2019 ∙ by Yue Fan, et al.

Recently, researchers set an ambitious goal of conducting speaker recognition in unconstrained conditions, where the variations in ambient noise, channel and emotion can be arbitrary. However, most publicly available datasets are collected under constrained environments, i.e., with little noise and limited channel variation. These datasets tend to deliver over-optimistic performance and do not meet the requirements of research on speaker recognition in unconstrained conditions. In this paper, we present CN-Celeb, a large-scale speaker recognition dataset collected `in the wild'. This dataset contains more than 130,000 utterances from 1,000 Chinese celebrities, and covers 11 different real-world genres. Experiments conducted with two state-of-the-art speaker recognition approaches (i-vector and x-vector) show that the performance on CN-Celeb is far inferior to that obtained on VoxCeleb, a widely used speaker recognition dataset. This result demonstrates that in real-life conditions, the performance of existing techniques might be much worse than previously thought. Our database is free for researchers and can be downloaded from http://project.cslt.org.


1 Introduction

Speaker recognition, including identification and verification, aims to recognize the claimed identity of a speaker. After decades of research, the performance of speaker recognition systems has improved vastly, and the technique has been deployed in a wide range of practical applications. Nevertheless, present speaker recognition approaches are still far from reliable in unconstrained conditions, where the uncertainties within the speech recordings can be arbitrary. These uncertainties might be caused by multiple factors, including free text, multiple channels, environmental noise, speaking styles, and physiological status, and they make the speaker recognition task highly challenging [21, 23].

Researchers have devoted much effort to addressing the difficulties in unconstrained conditions. Early methods are based on probabilistic models that treat these uncertainties as additive Gaussian noise; JFA [12, 4] and PLDA [11] are the most famous among such models. These models, however, are shallow and linear, and therefore cannot deal with the complexity of real-life applications. Recent advances in deep learning offer a new opportunity [10, 13, 19, 9]. By exploiting the power of deep neural networks (DNNs) in representation learning, these methods can remove unwanted uncertainties by propagating speech signals through the DNN layer by layer and retaining only speaker-relevant features [14]. The significant improvement in robustness achieved by the DNN-based approach [20] makes it more suitable for applications in unconstrained conditions.

The success of DNN-based methods, however, relies largely on a large amount of data, in particular data that reflect the true complexity of unconstrained conditions. Unfortunately, most existing datasets for speaker recognition are collected in constrained conditions, where the acoustic environment, channel and speaking style do not change significantly for each speaker [7, 8, 18]. These datasets tend to deliver over-optimistic performance and do not meet the requirements of research on speaker recognition in unconstrained conditions.

To address this shortage of datasets, researchers have started to collect data ‘in the wild’. The most successful ‘wild’ dataset may be VoxCeleb [16, 2], which contains more than one million utterances from over 7,000 speakers. The utterances were collected from open-source media using a fully automated pipeline based on computer vision techniques, in particular face detection, tracking and recognition, plus video-audio synchronization. The automated pipeline is almost costless, and thus greatly improves the efficiency of data collection.

In this paper, we re-implement the automated pipeline of VoxCeleb and collect a new large-scale speaker dataset, named CN-Celeb. Compared with VoxCeleb, CN-Celeb has three distinct features:

  • CN-Celeb focuses specifically on Chinese celebrities, and contains more than 130,000 utterances from 1,000 persons.

  • CN-Celeb covers more genres of speech. We intentionally collected data from 11 genres: entertainment, interview, singing, play, movie, vlog, live broadcast, speech, drama, recitation and advertisement. The speech of a particular speaker may appear in more than 5 genres. As a comparison, most of the utterances in VoxCeleb were extracted from interview videos. The diversity in genres makes our database more representative of true unconstrained conditions, but also more challenging.

  • CN-Celeb is not fully automated, but involves a human check. We found that the more complex the genre, the more errors the automated pipeline tends to produce. Ironically, these error-prone segments can be highly valuable, as they tend to be boundary samples. We therefore chose a two-stage strategy that employs the automated pipeline for pre-selection and then performs a human check.

The rest of the paper is organized as follows. Section 2 presents a detailed description of CN-Celeb, and Section 3 presents quantitative comparisons between CN-Celeb and VoxCeleb on the speaker recognition task. Section 4 concludes the paper.

2 The CN-Celeb dataset

2.1 Data description

The original purpose of the CN-Celeb dataset is to investigate the true difficulties of speaker recognition in unconstrained conditions, and to provide a resource for researchers to build prototype systems and evaluate their performance. Ideally, it can be used as a standalone data source, or together with other datasets, in particular VoxCeleb, which is free and large. For this reason, CN-Celeb was designed from the beginning to be distinct from, yet complementary to, VoxCeleb. This leads to the three features discussed in the previous section: Chinese focus, complex genres, and quality guaranteed by a human check.

In summary, CN-Celeb contains over 130,000 utterances from 1,000 Chinese celebrities. It covers 11 genres, and the total amount of speech is about 274 hours. Table 1 gives the data distribution over genres, and Table 2 presents the data distribution over utterance length.

Genre # of Spks # of Utters # of Hours
Entertainment 483 22,064 33.67
Interview 780 59,317 135.77
Singing 318 12,551 28.83
Play 69 4,245 4.95
Movie 62 2,749 2.20
Vlog 41 1,894 4.15
Live Broadcast 129 8,747 16.35
Speech 122 8,401 36.22
Drama 160 7,274 6.43
Recitation 41 2,747 4.98
Advertisement 17 120 0.18
Overall 1,000 130,109 273.73
Table 1: The distribution over genres.
Length (s) # of Utterances Proportion
<2 41,658 32.0%
2-5 38,629 30.0%
5-10 23,497 18.0%
10-15 10,687 8.0%
15-20 5,334 4.0%
20-25 3,218 2.5%
25-30 1,991 1.5%
>30 5,095 4.0%
Table 2: The distribution over utterance length.
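
To give a concrete picture of how the length statistics in Table 2 can be derived, the following minimal Python sketch buckets utterance durations into the same bins. The metadata format (a plain-text file with one "utterance_id duration_in_seconds" pair per line) and the file name are hypothetical assumptions for illustration, not the official CN-Celeb layout.

```python
# Hypothetical sketch: bucket utterance durations into the bins of Table 2.
from collections import Counter

def bucket(duration: float) -> str:
    """Map a duration in seconds to a Table 2 bin label."""
    if duration < 2:
        return "<2"
    if duration >= 30:
        return ">30"
    for lo, hi in [(2, 5), (5, 10), (10, 15), (15, 20), (20, 25), (25, 30)]:
        if lo <= duration < hi:
            return f"{lo}-{hi}"

def length_distribution(path: str) -> Counter:
    """Count utterances per length bin from an assumed 'utt_id duration' file."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            _, dur = line.split()
            counts[bucket(float(dur))] += 1
    return counts

if __name__ == "__main__":
    dist = length_distribution("cnceleb_durations.txt")  # hypothetical metadata file
    total = sum(dist.values())
    for label in ["<2", "2-5", "5-10", "10-15", "15-20", "20-25", "25-30", ">30"]:
        n = dist.get(label, 0)
        print(f"{label:>6}: {n:7d} ({100.0 * n / total:5.1f}%)")
```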

2.2 Challenges with CN-Celeb

Table 3 summarizes the main differences between CN-Celeb and VoxCeleb. Compared to VoxCeleb, CN-Celeb is a more complex dataset and more challenging for speaker recognition research. More details of these challenges are as follows.

  • Most of the utterances involve real-world noise, including ambient noise, background babbling, music, cheering and laughter.

  • A considerable number of utterances involve strong and overlapping background speakers, especially in the drama and movie genres.

  • Most speakers have utterances from multiple genres, which results in significant variation in speaking styles.

  • The utterances of the same speaker may be recorded at different times and with different devices, leading to serious cross-time and cross-channel problems.

  • Most of the utterances are short, which matches the scenarios of most real applications but leads to unreliable decisions.

CN-Celeb VoxCeleb
Source media bilibili.com youtube.com
Language Chinese Mostly English
Genre 11 Mostly interview
# of Spks 1,000 7,363
# of Utters 130,109 1,281,762
# of Hours 274 2,794
Human Check Yes No
Table 3: Comparison between CN-Celeb and VoxCeleb.

2.3 Collection pipeline

CN-Celeb was collected following a two-stage strategy: we first used an automated pipeline to extract potential segments of each Person of Interest (POI), and then applied a human check to remove incorrect segments. This process is much faster than purely manual segmentation, and reduces the errors caused by a purely automated process.

Briefly, the automated pipeline is similar to the one used to collect VoxCeleb1 [16] and VoxCeleb2 [2], though we made some modifications to increase efficiency and precision. In particular, we introduced a new face-speaker double-check step that fuses information from both the image and the speech signals to increase recall while maintaining precision.

The detailed steps of the collection process are summarized as follows.

  • STEP 1. POI list design. We manually selected 1,000 Chinese celebrities as our target speakers. These speakers are mostly from the entertainment sector, such as singers, drama actors/actresses, news reporters and interviewers. Regional diversity was also taken into account so that variation in accent was covered.

  • STEP 2. Picture and video download. Pictures and videos of the POIs were downloaded from the data source (https://www.bilibili.com/) by searching for the names of the persons. To specify that we were searching for POI names, the word ‘human’ was added to the search queries. The downloaded videos were manually examined and categorized into the 11 genres.

  • STEP 3. Face detection and tracking. For each POI, we first obtained a portrait of the person by detecting and clipping the face images from all pictures of that person, using the RetinaFace algorithm [6]. Afterwards, video segments containing the target person were extracted in three steps: (1) for each frame, detect all faces appearing in the frame using RetinaFace; (2) determine whether the target person appears by comparing the POI portrait with the faces detected in the frame, using the ArcFace face recognition system [5]; (3) apply the MOSSE face tracking system [1] to produce face streams.

  • STEP 4. Active speaker verification. As in [16], an active speaker verification system was employed to verify that the speech was really spoken by the target person. This is necessary because the target person may appear in the video while the speech comes from someone else. We used the SyncNet model [3], as in [16], to perform this task. SyncNet was trained to detect whether a stream of mouth movement and a stream of speech are synchronized. In our implementation, the stream of mouth movement was derived from the face stream produced by the MOSSE system.

  • STEP 5. Double check by speaker recognition. Although SyncNet worked well for videos of simple genres, it failed for videos of complex genres such as movie and vlog. A possible reason is that the video content of these genres may change dramatically over time, which leads to unreliable estimation of the mouth-movement stream and hence unreliable synchronization detection. To improve the robustness of active speaker verification in complex genres, we introduced a double-check procedure based on speaker recognition. The idea is simple: whenever the speaker recognition system reports a very low confidence for the target speaker, the segment is discarded even if the confidence from SyncNet is high; conversely, if the speaker recognition system reports a very high confidence, the segment is retained. We used an off-the-shelf speaker recognition system [22] to perform this double check (a schematic sketch of this decision rule is shown after this list). In our study, this double check improved the recall rate by % absolutely.

  • STEP 6. Human check. The segments produced by the above automated pipeline were finally checked by human annotators. According to our experience, this human check is rather efficient: one could check hour of speech in hour. As a comparison, without the automated pre-selection, checking hour of speech requires hours.
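
As referenced in STEP 5, the sketch below illustrates one way the face-speaker double check could be implemented: a decision rule that fuses the SyncNet confidence with an independent speaker recognition score against the POI's model. The data structure, threshold values and scores are hypothetical placeholders introduced for illustration, not the authors' exact configuration.

```python
# Minimal sketch of the STEP 5 "double check" decision rule (assumed values).
from dataclasses import dataclass

@dataclass
class Segment:
    video_id: str
    start: float            # segment start time in seconds
    end: float              # segment end time in seconds
    sync_confidence: float  # from SyncNet (higher = better audio-visual sync)
    spk_confidence: float   # score of the segment against the POI's speaker model

SYNC_ACCEPT = 0.7  # assumed threshold: SyncNet alone is trusted above this
SPK_ACCEPT = 0.8   # assumed threshold: speaker recognition overrides to keep
SPK_REJECT = 0.2   # assumed threshold: speaker recognition overrides to drop

def keep_segment(seg: Segment) -> bool:
    """Return True if the candidate segment should be passed to the human check."""
    if seg.spk_confidence <= SPK_REJECT:
        # Very low speaker confidence: discard even if SyncNet is confident.
        return False
    if seg.spk_confidence >= SPK_ACCEPT:
        # Very high speaker confidence: retain even if SyncNet is not confident.
        return True
    # Otherwise fall back to the SyncNet decision.
    return seg.sync_confidence >= SYNC_ACCEPT

# Example: a movie segment where SyncNet is unsure but the speaker model is confident.
seg = Segment("example_video", 12.4, 19.8, sync_confidence=0.35, spk_confidence=0.91)
print(keep_segment(seg))  # True: retained and forwarded to the human-check stage
```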

3 Experiments on speaker recognition

In this section, we present a series of experiments on speaker recognition using VoxCeleb and CN-Celeb, to compare the complexity of the two datasets.

3.1 Data

VoxCeleb: The entire dataset involves two parts: VoxCeleb1 and VoxCeleb2. We used SITW [15], a subset of VoxCeleb1, as the evaluation set. The rest of VoxCeleb1 was merged with VoxCeleb2 to form the training set (simply denoted by VoxCeleb). The training set involves utterances from speakers, and the evaluation set involves utterances from speakers (precisely, this is the Eval. Core set of SITW).

CN-Celeb: The entire dataset was split into two parts: CN-Celeb(T), which involves utterances from speakers and was used as the training set; and CN-Celeb(E), which involves utterances from speakers and was used as the evaluation set.

3.2 Settings

Two state-of-the-art baseline systems were built following the Kaldi SITW recipe [17]: an i-vector system [4] and an x-vector system [20].

For the i-vector system, the acoustic features comprised -dimensional MFCCs plus the log energy, augmented by first- and second-order derivatives. We also applied cepstral mean normalization (CMN) and energy-based voice activity detection (VAD). The universal background model (UBM) consisted of Gaussian components, and the dimensionality of the i-vector space was . LDA was applied to reduce the dimensionality of the i-vectors to . The PLDA model was used for scoring [11].
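
As a rough illustration of the front-end processing described above, the sketch below computes MFCCs with first- and second-order derivatives, applies cepstral mean normalization, and drops low-energy frames with a simple energy-based VAD. It uses librosa for convenience; the feature dimensionality, the VAD rule and the Kaldi recipe internals are not reproduced here, so the parameter values are assumptions.

```python
# Approximate i-vector front-end: MFCC + deltas, CMN, simple energy-based VAD.
import numpy as np
import librosa

def ivector_frontend(wav_path: str, sr: int = 16000, n_mfcc: int = 20) -> np.ndarray:
    y, _ = librosa.load(wav_path, sr=sr)

    # MFCCs (c0 acts as a log-energy-like term in this illustrative setup).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)

    # First- and second-order derivatives (deltas and delta-deltas).
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    feats = np.vstack([mfcc, d1, d2]).T                      # (T, 3 * n_mfcc)

    # Cepstral mean normalization over the utterance.
    feats = feats - feats.mean(axis=0, keepdims=True)

    # Simple energy-based VAD: keep frames whose energy exceeds a percentile
    # of the utterance energy (an assumed heuristic, not the Kaldi VAD).
    frame_energy = librosa.feature.rms(y=y)[0]               # (T',)
    T = min(len(frame_energy), feats.shape[0])
    keep = frame_energy[:T] > np.percentile(frame_energy[:T], 30)
    return feats[:T][keep]
```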

For the x-vector system, the feature-learning component was a 5-layer time-delay neural network (TDNN). The slicing parameters for the five time-delay layers were {t-2, t-1, t, t+1, t+2}, {t-2, t, t+2}, {t-3, t, t+3}, {t}, {t}. The statistics pooling layer computed the mean and standard deviation of the frame-level features of a speech segment. The size of the output layer was consistent with the number of speakers in the training set. Once trained, the activations of the penultimate hidden layer were read out as x-vectors. In our experiments, the dimension of the x-vectors trained on VoxCeleb was set to , while for CN-Celeb it was set to , considering the smaller number of speakers in the training set. Afterwards, the x-vectors were projected to -dimensional vectors by LDA, and finally the PLDA model was employed to score the trials. Refer to [20] for more details.
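
For readers who prefer a concrete picture of the x-vector topology outlined above, the following PyTorch sketch mirrors the five TDNN layers (implemented as dilated 1-D convolutions realizing the slicing patterns listed), statistics pooling, and the embedding layer from which x-vectors are read. The layer widths, input feature dimension and speaker count are illustrative assumptions; the systems in this paper were built with the Kaldi recipe, not with this code.

```python
# Illustrative x-vector TDNN in PyTorch (assumed layer widths and dimensions).
import torch
import torch.nn as nn

class XVectorTDNN(nn.Module):
    def __init__(self, feat_dim: int = 30, embed_dim: int = 512, num_spks: int = 1000):
        super().__init__()
        # Each (kernel, dilation) pair realizes one slicing configuration,
        # e.g. kernel=3, dilation=2 corresponds to frames {t-2, t, t+2}.
        cfg = [(5, 1), (3, 2), (3, 3), (1, 1), (1, 1)]
        dims = [feat_dim, 512, 512, 512, 512, 1500]
        layers = []
        for (k, d), din, dout in zip(cfg, dims[:-1], dims[1:]):
            layers += [nn.Conv1d(din, dout, kernel_size=k, dilation=d),
                       nn.ReLU(), nn.BatchNorm1d(dout)]
        self.frame_layers = nn.Sequential(*layers)
        self.segment1 = nn.Linear(2 * 1500, embed_dim)  # x-vectors are read here
        self.segment2 = nn.Linear(embed_dim, embed_dim)
        self.output = nn.Linear(embed_dim, num_spks)

    def forward(self, x):                    # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)             # (batch, 1500, frames')
        # Statistics pooling: concatenate mean and standard deviation over time.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.segment1(stats)           # penultimate activations = x-vector
        return self.output(torch.relu(self.segment2(torch.relu(emb))))

    def extract(self, x):
        """Return the x-vector embedding for a batch of feature sequences."""
        h = self.frame_layers(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.segment1(stats)
```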

3.3 Basic results

We first present the basic results evaluated on SITW and CN-Celeb(E). Both the front-end (i-vector or x-vector) and back-end (LDA-PLDA) models were trained on the VoxCeleb training set. Note that for SITW, the average utterance length is more than seconds, while it is about seconds for CN-Celeb(E). For a fairer comparison, we re-segmented the SITW data and created a new dataset, denoted by SITW(S), where the average lengths of the enrollment and test utterances are and seconds, respectively. These numbers are similar to the statistics of CN-Celeb(E).

The results in terms of equal error rate (EER) are reported in Table 4. It can be observed that for both the i-vector and the x-vector systems, the performance on CN-Celeb(E) is much worse than on SITW and SITW(S). This indicates that there is a big difference between the two datasets. From another perspective, it demonstrates that the models trained on VoxCeleb do not generalize well, although they achieve reasonable performance on data from a similar source (SITW).

Training Set Evaluation Set
System Front-end Back-end SITW SITW(S) CN-Celeb(E)
i-vector VoxCeleb VoxCeleb 5.30 7.30 19.05
x-vector VoxCeleb VoxCeleb 3.75 4.78 15.52
Table 4: EER(%) results of the i-vector and x-vector systems trained on VoxCeleb and evaluated on three evaluation sets.
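
For completeness, here is a small, generic sketch of how the EER values in Tables 4 and 5 can be computed from verification trial scores; it is a standard threshold sweep, not the authors' scoring script, and the toy scores below are synthetic.

```python
# Generic EER computation from trial scores (1 = target trial, 0 = non-target).
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate from verification scores and binary trial labels."""
    order = np.argsort(scores)[::-1]              # sort trials by descending score
    labels = labels[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # Sweep the threshold: at position i, the top-(i+1) trials are accepted.
    false_accepts = np.cumsum(1 - labels)         # non-targets accepted so far
    false_rejects = n_target - np.cumsum(labels)  # targets still rejected
    far = false_accepts / n_nontarget
    frr = false_rejects / n_target
    idx = np.argmin(np.abs(far - frr))            # point where FAR and FRR cross
    return float((far[idx] + frr[idx]) / 2)

# Toy usage with random scores (real trials would come from PLDA scoring).
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000)])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"EER = {100 * compute_eer(scores, labels):.2f}%")
```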

3.4 Further comparison

To further compare CN-Celeb and VoxCeleb quantitatively, we built systems based on CN-Celeb and VoxCeleb, respectively. For a fair comparison, we randomly sampled speakers from VoxCeleb and built a new dataset, VoxCeleb(L), whose size is comparable to CN-Celeb(T). This dataset was used for back-end (LDA-PLDA) training.

The experimental results are shown in Table 5. Since the i-vector and x-vector systems show the same trend across all comparative experiments, we only analyze the i-vector results.

Firstly, it can be seen that the system trained purely on VoxCeleb obtains good performance on SITW(S) (1st row). This is understandable, as VoxCeleb and SITW(S) were collected from the same source. For the pure CN-Celeb system (2nd row), the performance is still poor (14.24%), even though CN-Celeb(T) and CN-Celeb(E) are from the same source. More importantly, when the back-end model is retrained with VoxCeleb(L) (4th row), the performance on SITW(S) becomes better than the same-source result on CN-Celeb(E) (11.34% vs. 14.24%). All these results reconfirm the significant difference between the two datasets, and indicate that CN-Celeb is more challenging than VoxCeleb.

Training Set Evaluation Set
System Front-end Back-end SITW(S) CN-Celeb(E)
i-vector VoxCeleb VoxCeleb(L) 8.34 17.43
CN-Celeb(T) CN-Celeb(T) 14.87 14.24
VoxCeleb CN-Celeb(T) 12.96 15.00
CN-Celeb(T) VoxCeleb(L) 11.34 15.50
x-vector VoxCeleb VoxCeleb(L) 5.93 13.64
CN-Celeb(T) CN-Celeb(T) 15.23 14.78
VoxCeleb CN-Celeb(T) 10.72 11.99
CN-Celeb(T) VoxCeleb(L) 12.68 15.62
Table 5: EER(%) results with different data settings.

4 Conclusions

We introduced CN-Celeb, a free dataset for speaker recognition research. The dataset contains more than 130,000 utterances from 1,000 Chinese celebrities, and covers 11 different real-world genres. We compared CN-Celeb and VoxCeleb, a widely used dataset in speaker recognition, through a series of experiments based on two state-of-the-art speaker recognition models. Experimental results demonstrate that CN-Celeb is significantly different from VoxCeleb, and that it is more challenging for speaker recognition research. The EER results obtained in this paper suggest that in unconstrained conditions, the performance of current speaker recognition techniques might be much worse than previously thought.

References

  • [1] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui (2010) Visual object tracking using adaptive correlation filters. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2544–2550.
  • [2] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622.
  • [3] J. S. Chung and A. Zisserman (2016) Out of time: automated lip sync in the wild. In Asian Conference on Computer Vision, pp. 251–263.
  • [4] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
  • [5] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699.
  • [6] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou (2019) RetinaFace: single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641.
  • [7] W. M. Fisher (1986) The DARPA speech recognition research database: specifications and status. In Proc. DARPA Workshop on Speech Recognition, Feb. 1986, pp. 93–99.
  • [8] J. J. Godfrey, E. C. Holliman, and J. McDaniel (1992) SWITCHBOARD: telephone speech corpus for research and development. In 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 517–520.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [10] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29.
  • [11] S. Ioffe (2006) Probabilistic linear discriminant analysis. In European Conference on Computer Vision, pp. 531–542.
  • [12] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel (2007) Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 15 (4), pp. 1435–1447.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [14] L. Li, Y. Chen, Y. Shi, Z. Tang, and D. Wang (2017) Deep speaker feature learning for text-independent speaker verification. arXiv preprint arXiv:1705.03670.
  • [15] M. McLaren, L. Ferrer, D. Castan, and A. Lawson (2016) The Speakers in the Wild (SITW) speaker recognition database. In Interspeech, pp. 818–822.
  • [16] A. Nagrani, J. S. Chung, and A. Zisserman (2017) VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
  • [17] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. (2011) The Kaldi speech recognition toolkit. In IEEE Workshop on Automatic Speech Recognition and Understanding.
  • [18] S. O. Sadjadi, T. Kheyrkhah, A. Tong, C. S. Greenberg, D. A. Reynolds, E. Singer, L. P. Mason, and J. Hernandez-Cordero (2017) The 2016 NIST speaker recognition evaluation. In Interspeech, pp. 1353–1357.
  • [19] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [20] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • [21] L. L. Stoll (2011) Finding difficult speakers in automatic speaker recognition. Ph.D. Thesis, UC Berkeley.
  • [22] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman (2019) Utterance-level aggregation for speaker recognition in the wild. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5791–5795.
  • [23] T. F. Zheng and L. Li (2017) Robustness-related issues in speaker recognition. Springer.