Short-duration Speaker Verification (SdSV) Challenge 2020: the Challenge Evaluation Plan

12/13/2019 ∙ by Hossein Zeinali, et al.

This document describes task1 of the Short-Duration Speaker Verification Challenge (SdSVC) 2020. The main aim of the challenge is to evaluate new technologies for text-dependent speaker verification (TD-SV). The SdSVC has a second task, text-independent speaker verification, which is described in a separate document. The evaluation dataset of the challenge is the recently released multi-purpose DeepMine dataset. The dataset has three parts, of which part1 is intended for text-dependent speaker verification.




1 Introduction

This document describes task1 of the Short-Duration Speaker Verification Challenge (SdSVC) 2020. The main aim of the challenge is to evaluate new technologies for text-dependent speaker verification (TD-SV). The SdSVC has a second task, text-independent speaker verification, which is described in a separate document.

The evaluation dataset of the challenge is the recently released multi-purpose DeepMine dataset. The dataset has three parts, of which part1 is intended for text-dependent speaker verification.

2 Task Description

Task1 of the SdSVC 2020 challenge is speaker verification in text-dependent mode: given a test segment of speech and the target speaker's enrollment data, automatically determine whether the target speaker is speaking a specific phrase in the segment. In contrast to text-independent speaker verification, the lexical content of the utterance also matters here. The TD-SV task is therefore a twofold verification task in which both the speaker and the phrase must be verified.

Each trial in the challenge consists of a test segment of speech and a model identifier, which points to three enrollment utterances and the id of the phrase uttered in them. The system is required to process each trial independently and produce a log-likelihood ratio (LLR) score that combines the speaker and phrase verification scores.

There are 5 Persian phrases and 5 English phrases in the challenge. The in-domain training data contains utterances from 963 speakers. Some of the training speakers have only Persian phrases because they could not read English. Model enrollment is phrase- and language-dependent and uses three utterances per model.

2.1 Trial types

There are four trial types in text-dependent speaker verification. The first is Target-Correct (TC), where the target speaker utters the correct pass-phrase. The second is Target-Wrong (TW), where the target speaker utters a phrase other than the pass-phrase. In the same manner, Imposter-Correct (IC) and Imposter-Wrong (IW) denote trials where an imposter speaker utters the correct or a wrong pass-phrase. The system should accept TC trials as target trials and reject the other three types as non-target (imposter) trials. The main difference between text-dependent and text-independent speaker verification is therefore that TW trials count as imposter trials here, whereas both TC and TW are target trials in text-independent mode. There are no cross-language or cross-gender trials in the challenge.
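The four trial types and their target/non-target labels can be written down as a small lookup. The following is only an illustrative sketch; the function names `trial_type` and `is_target` are hypothetical and not part of the challenge tooling.

```python
def trial_type(same_speaker: bool, same_phrase: bool) -> str:
    """Return the trial type code: TC, TW, IC or IW."""
    speaker_part = "T" if same_speaker else "I"   # Target or Imposter
    phrase_part = "C" if same_phrase else "W"     # Correct or Wrong phrase
    return speaker_part + phrase_part


def is_target(trial: str, text_dependent: bool = True) -> bool:
    """In text-dependent mode only TC is a target trial.

    In text-independent mode the phrase is ignored, so TC and TW are targets.
    """
    if text_dependent:
        return trial == "TC"
    return trial in ("TC", "TW")
```

For instance, `is_target(trial_type(True, False))` is false in this challenge, because a target speaker uttering the wrong phrase must be rejected.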

2.2 Training condition

The training condition is defined as the amount of data and resources used to build a Speaker Recognition (SR) system. Unlike SRE19, there is only a fixed training condition here: only the datasets listed below may be used for training, and using any other public or private speech data for training is forbidden. The available training data is as follows:

  • VoxCeleb1

  • VoxCeleb2

  • SITW

  • LibriSpeech

  • DeepMine training part: it contains 963 speakers from the DeepMine dataset.

Other non-speech data can be used for data augmentation purposes.

The in-domain DeepMine training part can be used for any purpose, such as adding it to the network training data, training an LDA or PLDA model, score normalization, or holding out part of it as development data, because there is no separate development data for the challenge.
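One possible way to carve a development set out of the in-domain training part is a speaker-disjoint holdout. The sketch below is an assumption about how a participant might do this, not a procedure prescribed by the challenge; the function name, the 10% default and the tuple layout (file id, speaker id, phrase id) are illustrative choices.

```python
import random
from typing import List, Tuple

Utterance = Tuple[str, str, str]  # (train-file-id, speaker-id, phrase-id)


def split_by_speaker(
    utterances: List[Utterance], dev_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Utterance], List[Utterance]]:
    """Hold out a fraction of the speakers (not of the utterances) as dev data."""
    speakers = sorted({speaker for _, speaker, _ in utterances})
    random.Random(seed).shuffle(speakers)  # deterministic for a given seed
    n_dev = max(1, round(len(speakers) * dev_fraction))
    dev_speakers = set(speakers[:n_dev])
    train = [u for u in utterances if u[1] not in dev_speakers]
    dev = [u for u in utterances if u[1] in dev_speakers]
    return train, dev
```

Splitting by speaker rather than by utterance keeps the held-out speakers unseen during training, which mirrors the evaluation setting more closely.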

2.3 Enrollment Condition

In contrast to the text-independent case, enrollment here uses only several repetitions of a specific phrase for each model. We decided to use three utterances for model enrollment because increasing the number of utterances makes the text-dependent task easier. Note that using the enrollment utterances of other models during the enrollment of a model, for example to calculate score normalization parameters, is forbidden.

2.4 Test Condition

Each trial in the evaluation contains a test utterance and a target model. As explained before, there are four types of trials in the evaluation; only TC trials will be considered target trials and the rest will be considered imposter trials. Similar to the SRE 2019 CTS challenge, the full set of trials will be divided into two subsets: a progress subset and an evaluation subset. The progress subset will comprise around 30% of the trials and will be used to monitor progress on the leaderboard. The remaining 70% of the trials will form the evaluation subset and will be used to generate the official final results at the end of the challenge.

3 Performance Measurement

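The main metric for the challenge is the normalized minimum Detection Cost Function (DCF) as defined in SRE08. This detection cost function is a weighted sum of the miss and false alarm error probabilities:

$$C_{Det} = C_{Miss} \times P_{Miss|Target} \times P_{Target} + C_{FalseAlarm} \times P_{FalseAlarm|NonTarget} \times (1 - P_{Target})$$

where $C_{Miss} = 10$, $C_{FalseAlarm} = 1$ and $P_{Target} = 0.01$. Based on these parameters, the normalized DCF ($DCF_{norm}$) is the DCF divided by 0.1, the best cost that could be obtained without processing the input data. In addition to the minimum $DCF_{norm}$, the Equal Error Rate (EER) will be reported.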
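To make the metric concrete, here is a minimal pure-Python sketch of the normalized minimum DCF and an approximate EER. It assumes the SRE08 cost parameters C_Miss = 10, C_FalseAlarm = 1 and P_Target = 0.01, which are consistent with the normalization by 0.1 stated above. It is not the official scoring tool, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed SRE08 cost parameters; they yield the 0.1 normalizer from the text.
C_MISS, C_FA, P_TARGET = 10.0, 1.0, 0.01


def error_rates(tar: Sequence[float], non: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (P_miss, P_fa) for each threshold; accept when score > threshold."""
    labeled = sorted([(s, True) for s in tar] + [(s, False) for s in non])
    points = [(0.0, 1.0)]  # threshold below every score: all trials accepted
    missed = rejected = 0
    i = 0
    while i < len(labeled):
        score = labeled[i][0]
        # Raise the threshold to this score; tied trials are rejected together.
        while i < len(labeled) and labeled[i][0] == score:
            if labeled[i][1]:
                missed += 1
            else:
                rejected += 1
            i += 1
        points.append((missed / len(tar), (len(non) - rejected) / len(non)))
    return points


def min_norm_dcf(tar: Sequence[float], non: Sequence[float]) -> float:
    """Minimum detection cost over all thresholds, divided by the trivial cost."""
    normalizer = min(C_MISS * P_TARGET, C_FA * (1.0 - P_TARGET))  # 0.1 here
    best = min(
        C_MISS * P_TARGET * p_miss + C_FA * (1.0 - P_TARGET) * p_fa
        for p_miss, p_fa in error_rates(tar, non)
    )
    return best / normalizer


def eer(tar: Sequence[float], non: Sequence[float]) -> float:
    """Approximate EER: the operating point where P_miss and P_fa are closest."""
    p_miss, p_fa = min(error_rates(tar, non), key=lambda p: abs(p[0] - p[1]))
    return (p_miss + p_fa) / 2.0
```

A system that rejects every trial has P_miss = 1 and P_fa = 0, so its cost is 10 × 1 × 0.01 = 0.1 and its normalized DCF is 1; useful systems should score below that.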

4 Data Description

The main data for the challenge is the DeepMine dataset, which was collected using crowdsourcing. Participants in the data collection project installed an Android application and recorded phrases in it. The full description of the project and the dataset can be found in the conference papers [1, 2]; please cite them when referring to the dataset.

@inproceedings{deepmine2018odyssey,
  title={{DeepMine} Speech Processing Database: Text-Dependent and Independent Speaker Verification and Speech Recognition in {Persian and English}.},
  author={Zeinali, Hossein and Sameti, Hossein and Stafylakis, Themos},
  year=2018,
  booktitle={Proc. Odyssey 2018 The Speaker and Language Recognition Workshop},
  pages={386--392},
}

@inproceedings{deepmine2019asru,
  title={A Multi Purpose and Large Scale Speech Corpus in {Persian and English} for Speaker and Speech Recognition: the {DeepMine} Database},
  author={Zeinali, Hossein and Burget, Lukas and Cernocky, Jan},
  year=2019,
  booktitle={Proc. ASRU 2019 The 2019 IEEE Automatic Speech Recognition and Understanding Workshop},
}

In short, the data was recorded in real environments in Iran, so it contains various kinds of noise. The main language of the data is Farsi (Persian), while most of the participants also took part in the English part because they could read English too. Part1 of the dataset contains 5 Persian phrases and 5 English phrases, which are used in this challenge. The English phrases and transliterations of the Persian phrases are shown in Table 1. We also provide phoneme transcriptions of the phrases, and participants can use them in any way they want.

Id  Phrase
01  sedaye man neshandahandeye hoviyyate man ast.
02  sedaye har kas monhaser be fard ast.
03  hoviyyate man ra ba sedaye man tayid kon.
04  sedaye man ramze obure man ast.
05  baniadam azaye yekdigarand.
06  My voice is my password.
07  OK Google.
08  Artificial intelligence is for real.
09  Actions speak louder than words.
10  There is no such thing as a free lunch.

Table 1: Phrases in Task1 of the challenge.

4.1 Data Organization

The data will be provided in three separate zip (tar) files. The first file contains only the in-domain DeepMine training data. The second file contains the enrollment data, the model definition file and the trial files. The last file contains only the test utterances. The test utterances are provided in a separate file because the same test data is used in both tasks. If all three files are extracted into one directory, the directory structure is as follows:

<base directory>/

4.2 Format of Model Enrollment File

The enrollment file is a five-column space-separated text file named model_enrollment.txt, located in the docs directory. There is a header line at the beginning of the file. The first field in each line is a model-id and the second is the phrase-id of the phrase uttered in the corresponding utterances. The remaining three columns are the enrollment file ids. Fields are separated by exactly one space. The format of the enrollment file is as follows:

model-id<SPACE>phrase-id<SPACE>enroll-file-id1<SPACE>enroll-file-id2<SPACE>enroll-file-id3<NEWLINE>

where model-id is the model identifier, phrase-id is the phrase identifier and the enroll-file-ids are the enrollment utterance identifiers.

For example:

model-id phrase-id enroll-file-id1 enroll-file-id2 enroll-file-id3
model_00000 07 enr_007492 enr_023277 enr_012882
model_00001 02 enr_035341 enr_027674 enr_032835
model_00002 09 enr_020095 enr_015193 enr_024742
model_00003 06 enr_032246 enr_014610 enr_014698
model_00004 09 enr_033841 enr_037127 enr_033859
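Reading this file is straightforward. The following sketch parses the lines into a dictionary keyed by model-id; the `Model` type and the `parse_enrollment` function are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple


class Model(NamedTuple):
    phrase_id: str
    enroll_file_ids: List[str]


def parse_enrollment(lines: Iterable[str]) -> Dict[str, Model]:
    """Parse model_enrollment.txt lines into {model-id: Model}."""
    rows = iter(lines)
    next(rows, None)  # skip the header line
    models: Dict[str, Model] = {}
    for line in rows:
        line = line.strip()
        if not line:
            continue  # tolerate a trailing blank line
        fields = line.split(" ")
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        model_id, phrase_id, *enroll_ids = fields
        models[model_id] = Model(phrase_id, enroll_ids)
    return models
```

Because an open text file is an iterable of lines, the function can be called as `parse_enrollment(open("docs/model_enrollment.txt"))` once the data has been extracted. Phrase ids are kept as strings so that the leading zero in ids such as 07 is preserved.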

4.3 Format of Trial File

The trial file is a two-column space-separated text file named trials.txt, located in the docs directory. There is a header line at the beginning of the file. The first field in each line is a model-id and the second is an evaluation file id. Fields are separated by exactly one space. The format of the trial file is as follows:

model-id<SPACE>evaluation-file-id<NEWLINE>

where model-id is the model identifier and evaluation-file-id is the test utterance identifier.

For example:

model-id segment-id
model_00000 evl_000018
model_00000 evl_000021
model_00000 evl_000035
model_00000 evl_000109
model_00000 evl_000117
model_00000 evl_000165
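A trial list in this format can be loaded while preserving the original order, which matters because scores must later be written in the same order. This is a minimal sketch with a hypothetical function name, not part of any official toolkit.

```python
from typing import Iterable, List, Tuple


def parse_trials(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse trials.txt lines into an ordered list of (model-id, segment-id)."""
    rows = iter(lines)
    next(rows, None)  # skip the header line
    trials: List[Tuple[str, str]] = []
    for line in rows:
        line = line.strip()
        if not line:
            continue  # tolerate a trailing blank line
        fields = line.split(" ")
        if len(fields) != 2:
            raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
        trials.append((fields[0], fields[1]))
    return trials
```

A list is used instead of a dictionary because the same model-id appears in many trials and the line order has to be kept.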

5 In-domain Training Set

As explained before, the in-domain data consists of utterances from 963 speakers. All training utterances are stored in the wav/train directory. The train_labels.txt file in the docs directory is a tab-separated text file that contains the information provided for each utterance. Each line in this file contains three columns: the first is the train-file-id, the second is the speaker-id and the last is the phrase-id. There is a header line at the beginning of the file. The format of the train label file is as follows:

train-file-id<TAB>speaker-id<TAB>phrase-id<NEWLINE>

where train-file-id is the training utterance identifier, speaker-id is the speaker label and phrase-id is the identifier of the phrase in the utterance.

For example:


6 Evaluation Rules and Requirements

The overall rules are largely the same as in the NIST SREs. First of all, participants must follow the data restriction: there is only a fixed training condition. Participants agree to process the test data according to the following rules and to upload results to the challenge website for evaluation. The rules are:

  • The participants agree to make at least one valid submission for the main task which should perform better than the provided baseline.

  • The participants agree to process each trial independently. That is, each decision for a trial is to be based only upon the specified test segment and target speaker enrollment data. The use of information about other test segments and/or other target speaker data is not allowed.

  • The participants agree not to probe the enrollment or test segments via manual/human means, such as listening to the data or producing a manual transcript of the speech.

  • The participants are allowed to use any automatically derived information for training, development, enrollment, or test segments.

  • The participants may make multiple challenge submissions (only one per day). Based on the leaderboard results, participants should select up to three systems and submit them, together with a system description, to the provided links.

7 Evaluation Protocol

7.1 Challenge Website

7.2 Leaderboard platform

As mentioned before, there is an online leaderboard system and participants can submit one system per day for evaluation. During the challenge period, the leaderboard only shows the results of the systems on the progress set. At the end of the challenge, participants should submit the results of the selected systems to the provided links. The leaderboard results of the evaluation subset will be shown to participants 12 hours after the challenge deadline. The challenge leaderboard platform is:

7.3 Required submissions

In addition to submitting score files to the leaderboard system, the participants should submit the selected systems and the corresponding system description using a separate link, which will be provided later. The selected systems are as follows:

  • Primary system: this is the participants' primary system, and submitting it is mandatory.

  • Single system: this submission is the single system. If the primary system is a fusion of several systems, the participants should submit the single system as well; if the primary system consists only of the single system, this submission is not necessary. The single system definition for the challenge is given in Section 7.3.1.

  • Contrastive: based on the leaderboard results, the participants can select a second system as contrastive and submit it alongside the primary and single systems.

The final submission file should be a zip (tar) file containing the following files:

  • primary.sco

  • single.sco [optional if the primary system is itself a single system]

  • contrastive.sco [optional]
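Packaging these files can be automated with the standard library. The sketch below assumes the score files already exist in one directory; the function name `build_submission` and the archive name are hypothetical, since the plan does not specify how the archive itself should be named.

```python
import zipfile
from pathlib import Path
from typing import List, Union

SCORE_FILES = ("primary.sco", "single.sco", "contrastive.sco")


def build_submission(score_dir: Union[str, Path], archive_path: Union[str, Path]) -> List[str]:
    """Zip the score files found in score_dir and return the names included.

    primary.sco is mandatory; single.sco and contrastive.sco are optional.
    """
    score_dir = Path(score_dir)
    if not (score_dir / "primary.sco").is_file():
        raise FileNotFoundError("primary.sco is mandatory")
    included: List[str] = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in SCORE_FILES:
            path = score_dir / name
            if path.is_file():
                archive.write(path, arcname=name)  # store without directories
                included.append(name)
    return included
```

Storing each file under its bare name keeps the archive flat, so the three expected file names appear at the top level.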

7.3.1 Single system definition

The definition of a single system is often unclear, and this causes problems in many challenges. For this challenge, we have therefore decided to define the single system in advance.

Let us divide a verification system into front-end and back-end subparts. The front-end is any modeling mechanism, such as i-vector or x-vector, for extracting embeddings from utterances, while the back-end is used for scoring, for example LDA-Cosine or LDA-PLDA. Both subparts should be trained and used sequentially. In other words, the output of one subpart should be used as the input of the subsequent subpart. So we have the following rules:

  • Feature fusion by concatenating is allowed.

  • The network can have any format and size, but it should be trained in a single training pass (for example, training two separate models is not allowed, while one big network built by combining the two networks is fine). The exception is phrase- or language-specific modeling, for which separate models can be trained for each phrase or each language. Note that in phrase- or language-dependent modeling, only one of the networks will be used for scoring each trial.

  • The network can be used for scoring in end-to-end fashion or for extracting embeddings.

  • Any phrase- or language-dependent scoring can be used, but only one of them should be used for each trial. For example, in phrase-dependent mode, only the model corresponding to the trial's phrase should be used.

  • The only allowed score fusion for the single system is a fusion between DNN posteriors and backend outputs.

7.4 System Output Format

The system output should be a one-column text file. Each line of the file contains the LLR score (a floating-point number) of the corresponding trial. The order of the scores must be the same as in the trials file, and all of the trials must be scored. Any inconsistency will cause an error in the evaluation of the system.
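A small writer that enforces these constraints before creating the file can catch mistakes early. This is a sketch under assumptions: the function name `write_scores` and the six-decimal formatting are illustrative choices, and the plan does not prescribe a number format beyond a float per line.

```python
import math
from typing import Sequence


def write_scores(scores: Sequence[float], n_trials: int, path: str) -> None:
    """Write one LLR score per line, in trial order.

    Raises ValueError if a trial is unscored or a score is not a finite number.
    """
    if len(scores) != n_trials:
        raise ValueError(f"{n_trials} trials but {len(scores)} scores")
    if not all(math.isfinite(score) for score in scores):
        raise ValueError("scores must be finite numbers")
    with open(path, "w") as handle:
        for score in scores:
            handle.write(f"{score:.6f}\n")
```

Validation happens before the file is opened, so a failed check never leaves a partial score file behind.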

For example:


7.5 Data License Agreement

As mentioned before, the evaluation data for this challenge is a subpart of the DeepMine dataset. Because the dataset is not free, each participant must sign the data license agreement specified for the challenge. Under this license, participants can use the data for submitting systems to the challenge and for the corresponding papers. Any other usage of the data is not allowed. The license agreement file can be found on the challenge website.

7.6 System Description

Each participant is required to submit a full system description. The system descriptions will be made available online to everybody, so participants may use their team id as the team name and write the description in a fully blind format. We will ask all participants for their opinion on this. In addition to the system description, we strongly recommend that participants submit a full conference paper to the challenge's special session at Interspeech 2020. The papers will be reviewed as regular papers, so they should be properly formatted and sufficiently novel for acceptance.

The system description should be at least 2 pages long and must include the following information about the submitted systems:

  • a complete description of the system components, including front-end and back-end modules along with their configurations.

  • a complete description of the data partitions used to train the various models.

  • performance of the submitted systems on the progress and evaluation sets reported in the leaderboard website.

If you refer to this evaluation plan, you can use the following reference.

title={Short-duration Speaker Verification Challenge ({SdSVC}) 2020: Challenge Description of Task1 Text-Dependent Speaker Verification.},
author={Zeinali, Hossein and Lee, Kong Aik and Alam, Jahangir and Burget, Luka\v{s}},
institution={arXiv preprint arXiv:2112.32142},

8 Planned Evaluation Schedule

Release of evaluation plan: December, 2019
Evaluation platform open: December, 2019
Release of Train, Dev, and Eval sets: December, 2019
Challenge deadline: February, 2020
Post-challenge evaluation: Early March, 2020
INTERSPEECH Paper submission: March 29, 2020


  • [1] H. Zeinali, L. Burget, and J. Cernocky (2019) A multi purpose and large scale speech corpus in Persian and English for speaker and speech recognition: the DeepMine database. In Proc. ASRU 2019 The 2019 IEEE Automatic Speech Recognition and Understanding Workshop.
  • [2] H. Zeinali, H. Sameti, and T. Stafylakis (2018) DeepMine speech processing database: text-dependent and independent speaker verification and speech recognition in Persian and English. In Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, pp. 386–392.