
Speech2Phone: A Multilingual and Text Independent Speaker Identification Model

by   Edresson Casanova, et al.

Voice recognition is an area with a wide application potential. Speaker identification is useful in several voice recognition tasks, as seen in voice-based authentication, transcription systems and intelligent personal assistants. Some tasks benefit from open-set models which can handle new speakers without the need of retraining. Audio embeddings for speaker identification is a proposal to solve this issue. However, choosing a suitable model is a difficult task, especially when the training resources are scarce. Besides, it is not always clear whether embeddings are as good as more traditional methods. In this work, we propose the Speech2Phone and compare several embedding models for open-set speaker identification, as well as traditional closed-set models. The models were investigated in the scenario of small datasets, which makes them more applicable to languages in which data scarceness is an issue. The results show that embeddings generated by artificial neural networks are competitive when compared to classical approaches for the task. Considering a testing dataset composed of 20 speakers, the best models reach accuracies of 100 scenarios, respectively. Results suggest that the models can perform language independent speaker identification. Among the tested models, a fully connected one, here presented as Speech2Phone, led to the higher accuracy. Furthermore, the models were tested for different languages showing that the knowledge learned was successfully transferred for close and distant languages to Portuguese (in terms of vocabulary). Finally, the models can scale and can handle more speakers than they were trained for, identifying 150 while still maintaining 55





1 Introduction

Voice recognition is widely used in many applications, such as intelligent personal assistants (siri), telephone-banking systems (2001voice) and automatic question response (watson), among others. In several of these tasks it is useful to identify the speaker, for example in automatic subtitling, voice-enabled authentication and meeting loggers. The last one motivated this work: since new speakers are frequently added in a company, open-set speaker identification is desirable in meeting loggers because it avoids any need for model retraining. This work investigates the task for the Portuguese language, which has limited resources for training. The results are also tested against English audio in order to verify whether the model can generalize to other languages. Thus, the work presents methods and experiments for building text- and language-independent speaker identification systems.

Embeddings or encodings can be thought of as a distributed knowledge representation technique which builds low-dimensional representations for input data. One of their most widespread uses is to capture semantic relations among words (mikolov2013efficient; mikolov2013distributed). However, they have many other applications, such as representing raw images (e.g., face recognition (2015deepface)), representing users and items in recommendation systems (goodfellow2016) and speaker diarization (bredin2017). Modern techniques for embedding creation usually rely on neural networks applied to huge unsupervised datasets. The learned representations can then be used in supervised models applied to smaller datasets. Embeddings have several advantages, such as reducing the dimensionality of input data, being relatively cheap to build compared to other techniques, and being able to represent new and previously unseen input data.

In the context of speaker identification, embeddings can be used to build open-set models that efficiently represent new speakers as they are included in an application. In this paper, we propose a new text-independent multilingual open-set model and compare embeddings and classic models in order to analyze whether open-set systems are as efficient as closed-set ones.

This work is organized as follows. Section 2 presents related work using embeddings for speaker identification. Section 3 presents the audio dataset created and explains models and experiments performed. Section 4 contains results comparison and discussion. Finally, Section 5 presents the conclusions of this paper.

2 Related work

Performing speaker identification (discovering the identity of the speaker in the audio) and speaker diarization (segmenting an audio containing multiple speakers) are difficult tasks, especially if performed directly from raw audio. Authors normally opt to use higher-level representations of the audio signal. Some of the chosen representations include Mel-Frequency Cepstral Coefficients (MFCC) (kansara2016) and their variations Delta and Delta-Delta (bisio2017speaker; bisio2018smart), PLP (Perceptual Linear Prediction) (kaur2018genetic), LPCC (Linear Predictive Cepstral Coefficients) (li2009i4u) and I-Vector, among others. Such representations have a much smaller dimensionality than the original audio signal, making them well suited for use in machine learning models. Several models can be constructed to tackle this problem. Popular choices include SVMs (Support Vector Machines), GMMs (Gaussian Mixture Models) (bisio2017performance; liu2018gmm) and neural networks (bredin2017).

One of the first works using neural networks for speaker identification is (konig1998nonlinear). The method relies on non-linear discriminant analysis to extract features relevant for speaker identification. The approach uses a MultiLayer Perceptron network and resulted in a 15% gain compared to another similar system developed by the authors.


propose a neural network for feature extraction in speaker diarization in the following way: the network is given MFCC inputs from audio fragments and should identify if fragments are from the same speaker or not. One of the network’s hidden layers is specifically structured to generate two embeddings, one for each audio fragment. The authors found that combining MFCC and feature extraction led to an improvement in diarization.


propose the use of a deep autoencoder combined with MFCC inputs in order to extract characteristics from the audio. According to the proposal, a second (supervised) neural network is trained to perform speaker identification by comparing two audio fragments and indicating whether they are from the same speaker or not. This work follows a similar idea for open-set speaker identification; however, we had more success combining the generated embeddings with the K-Nearest Neighbour (KNN) algorithm.

bisio2017speaker proposed a SVM based technique for speaker identification. The approach is enhanced by multiple observations for each speaker extracted from different devices. The authors investigated the task with two kinds of models: one consisting of four known speakers, and the other considering a fifth unknown speaker. The audio files were extracted from speakers in different distances from the device (one to five meters). The experiments presented accuracies up to 98% (four known speakers) and 66% (with a new unknown speaker).


proposes speaker embedding generation using a variation of Long Short-Term Memory (LSTM) networks, focusing on speaker diarization. The method uses the triplet loss function, which has been successful in face recognition tasks. The embedding generation is done with triples containing an audio fragment of a given speaker called the anchor, a positive fragment (that is, from the same speaker) and a negative fragment (from another speaker). Training processes these triples with a neural network and minimizes the loss function, which relies on the distances between the anchor-positive pair and the anchor-negative pair. It was reported that the results outperformed classical methods in terms of purity and coverage.
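The triplet objective described above can be illustrated with a minimal sketch. It uses a generic hinge form with Euclidean distance; the `margin` value and the function names are assumptions made for illustration and are not taken from the cited works, which may use squared or cosine distances instead.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (margin is an assumed, illustrative value).

    The loss is zero once the anchor is closer to the positive fragment
    than to the negative fragment by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Same-speaker fragment is close, other-speaker fragment is far: no penalty.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

Swapping the positive and negative fragments in the call above yields a positive loss, which is the signal that pushes same-speaker embeddings together during training.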

deepspeaker proposed the Deep Speaker model, which also uses the triplet loss function. The model has been trained and tested on three datasets that are not publicly available. In the article, the authors showed that training on text-dependent datasets can achieve better performance than training on text-independent datasets for text-dependent speaker identification scenarios. In addition, the authors tested their Mandarin-trained model in English and found a significant loss of performance; however, the model could still be used for the task. The authors also fine-tuned the Mandarin-trained model to English, thus improving the identification of speakers in English.

Considering our scope, the closest works are bredin2017 and deepspeaker. However, there are some important differences. First, the authors of both papers focus on recurrent networks, while we test various models. Second, bredin2017 focuses on speaker diarization and deepspeaker focuses on both tasks, while we focus on speaker identification. Third, bredin2017 used smaller 16-dimensional embeddings and deepspeaker used larger 512-dimensional embeddings, while we used 80-dimensional embeddings. Finally, an important difference is that the triplet loss approach works better when the number of speakers increases. As our approach focuses on low-resource languages, we investigated a technique less sensitive to the number of speakers. In this approach, only audio from a single speaker is analyzed when generating embeddings for that speaker.

3 Methodology

To perform the experiments it was necessary to create a dataset, as described in Section 3.1. Section 3.2 details the preprocessing performed on the dataset to enable the execution of all proposed experiments. Section 3.3 describes the closed-set models. Section 3.4 describes the proposed open-set models. The most interesting experiments are described in Section 3.5.

3.1 Audio dataset

The audio dataset used in this paper includes 40 male speakers, aged between 20 and 50 years. The main dataset includes only Portuguese utterances, because that is the native language of the speakers. We opted to use only male speakers to increase the challenge for the models, since speakers in a dataset mixed with female voices would be easier to recognize, especially for models focused on audio pair analysis.

To collect the data, each speaker was given a phonetically balanced text comprised of 149 words. The reading time ranged from 42 to 95 seconds. Additionally, we asked each speaker to say the phoneme /a/ for approximately three seconds. The central second of each capture was extracted and then used as expected output for the embedding models. The phoneme /a/ was chosen because it is simple to articulate and very frequent in the Portuguese language.

3.2 Preprocessing

The first preprocessing step was to extract MFCCs using the Librosa (mcfee2015) library. The default sampling rate (22 kHz) was used. We chose to extract 13 MFCCs sampled 44 times each second. Windowed frames were used as defined by the default parameters in Librosa 0.6, namely a hop length of 512 and a window length of 2,048 for the Fast Fourier Transform.
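The figures quoted in this section can be cross-checked with a little arithmetic. The sketch below assumes a sampling rate of 22,050 Hz (Librosa's default, rounded to 22 kHz in the text) and centred framing, for which the frame count is one plus the number of samples divided by the hop length; the function names are illustrative only.

```python
SAMPLE_RATE = 22050  # assumed exact value behind the "22 kHz" in the text
HOP_LENGTH = 512     # hop length quoted in the text
N_MFCC = 13          # coefficients per frame, as in the text


def n_frames(duration_s):
    """Frame count for centred framing: 1 + samples // hop (an assumption)."""
    n_samples = int(round(duration_s * SAMPLE_RATE))
    return 1 + n_samples // HOP_LENGTH


def n_attributes(duration_s):
    """Total MFCC values (coefficients x frames) for a clip of this length."""
    return N_MFCC * n_frames(duration_s)


print(n_frames(1.0), n_attributes(1.0))  # 44 572
print(n_frames(5.0), n_attributes(5.0))  # 216 2808
```

Under these assumptions one second yields the 44 frames mentioned above, and a five-second instance yields 13 x 216 = 2,808 values, which agrees with the 2,808 attributes quoted for Experiment 0.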

We extracted five-second instances from the original audio. The five-second window was defined after preliminary experiments varying the input duration. In order to maximize the number of instances, we used the overlapping technique, in which the window is shifted by one second at a time and an instance is extracted at each position until the end of the audio. The main dataset resulted in 2,394 instances. The next step was to divide the dataset into smaller ones to meet the needs of each proposed experiment. Therefore, the original dataset was divided into eight, as shown in Table 1.

Datasets Main Purpose Instances per dataset
, , , Training 2,194
, , , Testing 200
Table 1: Datasets
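The overlapping extraction described in Section 3.2 can be sketched as a sliding window over the raw samples. This is a minimal illustration under the assumption that only full windows are kept; the function name and the toy sampling rate are hypothetical.

```python
def sliding_windows(samples, sample_rate, window_s=5, shift_s=1):
    """Return every full window of `window_s` seconds, shifted by `shift_s`.

    Windows that would run past the end of the recording are discarded
    (an assumption; the text does not say how the tail is handled).
    """
    win = window_s * sample_rate
    hop = shift_s * sample_rate
    return [samples[start:start + win]
            for start in range(0, len(samples) - win + 1, hop)]


# Toy recording: 42 "seconds" at 10 samples per second.
toy_rate = 10
recording = list(range(42 * toy_rate))
instances = sliding_windows(recording, toy_rate)
print(len(instances), len(instances[0]))  # 38 50
```

With this scheme a recording of T whole seconds contributes T - 4 instances, so the shortest reading reported (42 seconds) would give 38 instances.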

3.3 Closed-set models

Closed-set models specialize in a closed set of speakers, in other words, in specific speakers. As a result, testing samples are different from training samples, but belong to the same known speakers. These models are classifiers that receive a five-second MFCC window. The expected output is a one-hot vector whose size is fixed as the number of speakers in the training set.

Figures 1, 2 and 3 present the architectures of Models 1, 2 and 3, respectively. The first model is based on dense fully connected neural networks, the second on recurrent neural networks and the third on convolutional neural networks. All closed-set models were trained for 3,000 epochs using a learning rate of 0.00005 and a batch size of 64. In the figures, Bnorm represents batch normalization, FC is one dense fully connected layer and RNN is a single recurrent layer. Several other architectures were tested in preliminary experiments, and the best ones were fine-tuned based on the coordinate descent approach, but focusing on adjusting hyperparameters instead of model parameters. As a result of this fine-tuning, Model 2 uses a classical recurrent approach rather than the more modern LSTM (Long Short-Term Memory), since it presented the best performance.

Figure 1: Model 1 – Dense Neural Network
Figure 2: Model 2 – Recurrent Neural Network
Figure 3: Model 3 – Convolutional Neural Network

3.4 Open-set models

The goal of open-set models is to be speaker independent; additionally, it is desirable for them to be multilingual and text independent. In pursuit of these goals, we propose training the neural network with five-second speech fragments as input and, as expected output, the reconstruction of a simple phoneme (/a/ in our experiments). As the phoneme sounds different according to the speaker, a good reconstruction would allow the model to distinguish between speakers. Focusing on a single phoneme allows for dimensionality reduction in the embedding layer. Our approach is loosely inspired by semantic word embeddings (mikolov2013distributed), where some of the models should predict the context of a given word. The open-set models receive as input the equivalent of five seconds of speech and try to reconstruct one second of MFCCs for the phoneme /a/.

Open-set models can be tested in closed-set scenarios, so experiments have been proposed to compare the performance of open-set models in scenarios where speakers are known (closed-set). Ideally, an open-set model performs as well as a closed-set one for a known speaker, while also being able to handle unknown speakers.
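The difference between the two evaluation scenarios comes down to how instances are held out. The sketch below is an illustration only: the function names, the `(speaker, features)` tuple layout and the toy data are assumptions, not the partitioning code used in the paper.

```python
def closed_set_split(instances, test_per_speaker):
    """Hold out the last instances of EVERY speaker: test speakers are known."""
    if test_per_speaker < 1:
        raise ValueError("test_per_speaker must be at least 1")
    by_speaker = {}
    for speaker, features in instances:
        by_speaker.setdefault(speaker, []).append((speaker, features))
    train, test = [], []
    for items in by_speaker.values():
        train.extend(items[:-test_per_speaker])
        test.extend(items[-test_per_speaker:])
    return train, test


def open_set_split(instances, test_speakers):
    """Hold out whole speakers: test speakers are never seen in training."""
    train = [x for x in instances if x[0] not in test_speakers]
    test = [x for x in instances if x[0] in test_speakers]
    return train, test


# Toy data: three hypothetical speakers with four instances each.
data = [(s, i) for s in ("s1", "s2", "s3") for i in range(4)]
closed_train, closed_test = closed_set_split(data, test_per_speaker=1)
open_train, open_test = open_set_split(data, test_speakers={"s3"})
print(len(closed_train), len(closed_test), len(open_train), len(open_test))
```

In the closed-set split every test speaker also appears in the training data, whereas in the open-set split the two speaker sets are disjoint.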

Figures 4 through 9 present Models 4 to 9, respectively. Model 4 will be referred to as Speech2Phone and consists of a dense neural network with one hidden layer. Model 5 combines convolutional and dense layers to generate the output. Model 6 is inspired by Convolutional Autoencoders, also known as Fully Convolutional Networks, and therefore uses only convolutional layers. Model 7 explores recurrent networks. Model 8 combines recurrent and convolutional layers. Finally, Model 9 is combined with Model 4: while the latter generates embeddings, the former analyzes a pair of embeddings to indicate whether they belong to the same speaker. KNN is an alternative to Model 9 for finding the most similar embedding of a speaker. The hyperparameters used to train each model are presented in Table 2.


Figure 4: Model 4 – Fully Connected Shallow Neural Network (Speech2Phone)
Figure 5: Model 5 – Convolutional Neural Network
Figure 6: Model 6 – Fully Convolutional Neural Network
Figure 7: Model 7 – Recurrent Neural Network
Figure 8: Model 8 – Recurrent Convolutional Neural Network
Figure 9: Model 9 – Embedding Pair Comparator
Model Epochs Learning Rate Batch Size
4 1,000 0.00070 128
5 1,000 0.00005 256
6 10 0.00100 16
7 100 0.00500 256
8 1,500 0.00070 128
9 1,000 0.00010 16
Table 2: Open-set Hyperparameters

3.5 Experiments

We propose several experiments that use both the closed-set and open-set models presented in Sections 3.3 and 3.4. Additionally, we propose experiments to compare the performance of open-set models in a closed-set scenario. TensorFlow (abadi2016) and TFLearn (tflearn2016) were used to build the neural networks for all experiments. All models were trained using the Adam optimizer. Two cost functions were used for training. Open-set models should reconstruct an audio segment and were trained using Mean Squared Error (MSE). Closed-set models should perform a classification task and were trained using Categorical Cross-Entropy. The convolutional layers in all experiments have a stride of 1. For ease of reproduction, this work and the Python code used to reproduce all experiments are publicly available on GitHub.
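For reference, the two cost functions named above can be written out in a few lines. This is a plain-Python sketch of the standard definitions, not the framework implementation used in the experiments; the `eps` clamp is an added assumption to avoid taking the logarithm of zero.

```python
import math


def mean_squared_error(target, predicted):
    """Average squared difference, used here for reconstruction outputs."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


def categorical_cross_entropy(one_hot, probabilities, eps=1e-12):
    """Negative log-probability assigned to the true class of a one-hot target."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot, probabilities))


# Reconstruction example: one of three values is off by 2.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
# Classification example: the true speaker (index 1) gets probability 0.8.
print(categorical_cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]))
```

MSE suits the open-set models because their target is a real-valued MFCC matrix, while cross-entropy suits the closed-set models because their target is a one-hot speaker label.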

3.5.1 Specific closed-set experiments

The closed-set experiments were divided in four groups: (1) a baseline based on Gaussian Mixture Models (GMM) (automaticspeechrecognition) capable of only closed set identification; (2) fully connected neural networks; (3) convolutional neural networks; and (4) recurrent neural networks. In this scenario, 4 experiments were proposed.

  • Experiment 0: relies on a GMM, a classical model in the literature before the advent of deep learning approaches. The Python library Sklearn (pedregosa2011) was used to generate the model. The aim of this experiment is to provide a baseline for the other models. This model and the majority of the others receive five seconds (2,808 attributes) of audio. The audios were extracted from 20 speakers.

  • Experiment 1: is a closed-set neural network for speaker identification. In this model, each known speaker corresponds to one output neuron. This experiment uses Model 1, presented in Figure 1.


  • Experiment 2: is similar to Experiment 1, but uses recurrent layers and dense fully connected layers. This experiment uses Model 2, presented in Figure 2.

  • Experiment 3: is similar to Experiment 2, but uses a convolutional layer and dense fully connected layers. This experiment uses Model 3, presented in Figure 3.

3.5.2 Open-set experiments

In many tasks it is desirable to recognize previously unknown speakers. Closed-set models have several drawbacks in this setting. First, they need to be retrained. Second, depending on the number of new speakers and the amount of data, retraining can be time consuming. Third, when adding several speakers, the model may require architectural reengineering to improve its performance. Open-set models can solve these problems. Open-set models have the ability to adapt to new speakers, and some open-set models may be language-independent, assisting with tasks such as voice conversion between languages. In this paper we propose five open-set architectures, and the best one is called Speech2Phone. The experiments using these architectures are:

  • Experiment 4a: this first experiment is based on the Speech2Phone model (Model 4), the model which presented the most advantages in our experiments, previously presented in Figure 4. It consists of a fully connected, shallow feed-forward network for embedding extraction. In our tests, a shallow network performed better than deeper ones among the fully connected models. In order to identify the speaker, it is also necessary to search for the extracted embedding in an embedding database. This is performed by running the KNN algorithm with Euclidean distance. This way, the label (speaker) for a given instance is the same as that of the closest embedding in the embedding database. A new speaker can be inserted into the system without retraining, simply by extracting their embeddings during the first usage and inserting the embeddings into the database.

  • Experiment 5: it is based on Experiment 4a, but it uses a convolutional neural network. This network has two subcomponents: (a) convolutional layers acting as an encoder of the information; (b) fully connected layers acting as a decoder. The instances can be seen as a bidimensional image matrix where the columns are time steps and the rows are cepstral coefficients. Convolutions have the advantage of being translationally invariant in this matrix. As we try to reconstruct a specific phoneme, this is a desired property, since the target phoneme, when present in the instance, may occur in different parts of the window. This experiment uses Model 5, presented in Figure 5.

  • Experiment 6a: based on Experiment 5, but its decoder uses convolutional and upsampling layers. The idea behind this model is that the output also has a bidimensional structure, so convolutions can also be useful in constructing the output. This experiment uses Model 6, previously presented in Figure 6. Experiment 6b also uses Model 6, but in a closed-set scenario, as described in Section 3.5.3.

  • Experiment 7a: similar to Experiment 5, it consists of a recurrent fully connected neural network for embedding generation. The five-second window is split into five one-second segments, which the model analyzes one at a time. Considering that the recurrence window is small, we used classical recurrence instead of long-term models like LSTM, since vanishing gradients are less likely to occur. If the phoneme of interest happens in one of the five fragments, the recurrent network should store it in its memory before reconstructing it in the final step, potentially improving the reconstruction accuracy. The approach reduces the number of learned parameters and consequently also improves training times. Both this experiment and Experiment 7b (Section 3.5.3) use Model 7, presented in Figure 7.

  • Experiment 8: combines both recurrent and convolutional layers and focuses on the open-set case. It relies on Model 8, presented in Figure 8.

  • Experiment 9: based on Experiment 4a, but instead of KNN, a second neural network is trained to find the label for a new speaker. This second model receives pairs of embeddings and should identify whether the pair is from the same speaker or not. This experiment uses Model 9, presented in Figure 9.
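The embedding lookup used in Experiment 4a can be sketched as a nearest-neighbour search over stored embeddings. The sketch assumes a single nearest neighbour, which is how the text describes the label assignment; the class name, speaker labels and two-dimensional toy vectors are hypothetical, and real embeddings would come from the trained network.

```python
import math


class EmbeddingDatabase:
    """Toy store of (speaker, embedding) pairs with nearest-neighbour lookup."""

    def __init__(self):
        self._entries = []  # list of (speaker_label, embedding) tuples

    def enroll(self, speaker, embedding):
        """Add a reference embedding; no model retraining is involved."""
        self._entries.append((speaker, list(embedding)))

    def identify(self, embedding):
        """Return the label of the closest stored embedding (Euclidean)."""
        if not self._entries:
            raise ValueError("no speakers enrolled")
        speaker, _ = min(self._entries,
                         key=lambda entry: math.dist(entry[1], embedding))
        return speaker


db = EmbeddingDatabase()
db.enroll("speaker_a", [0.0, 0.0])
db.enroll("speaker_b", [10.0, 10.0])
print(db.identify([1.0, 1.0]))   # speaker_a

# A new speaker is added by enrolling an embedding, nothing else changes.
db.enroll("speaker_c", [5.0, 5.0])
print(db.identify([5.2, 4.9]))   # speaker_c
```

Because enrolment only appends to the database, the cost of adding a speaker is independent of how the embedding network was trained.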

3.5.3 Open-set models in Closed-set scenarios

To answer whether open-set models can perform as well as closed-set models in closed-set scenarios, three extra experiments have been proposed. Here, we selected three neural network architectures from the previous experiments: (a) fully connected (4b); (b) convolutional (6b); and (c) recurrent (7b):

  • Experiment 4b: based on Model 4, but it uses closed-set testing. This means the instances for training and testing Model 4, although different from each other, belong to the same speakers. Ideally, an embedding model is as good as a classical one in a closed-set scenario, while still maintaining a good performance in the open-set case.

  • Experiment 6b: very similar to Experiment 6a, but evaluated in the closed-set scenario.

  • Experiment 7b: based on Experiment 7a, but adapted for the closed-set evaluation.

3.5.4 How multilingual is Speech2Phone?

To assess whether the Speech2Phone model (Model 4) transfers learning from one language to another, we trained it with our entire dataset, totaling 40 speakers, and carried out the following experiments:

  • Experiment 4c: in this experiment, the test is performed on the English language using 20 speakers from the LibriSpeech dataset (librispeech).

  • Experiment 4d: the model is evaluated on the Spanish language, so we used all male speakers (totaling 13 speakers) from the crowdsourced high-quality Argentinian Spanish speech dataset.

  • Experiment 4e: the model is evaluated on the Chinese language, so we chose 20 speakers from the training set of the Common Voice dataset.

4 Results and discussion

Results are presented in tables 3 and 4.1 through 6 and discussed in Section 4.5. To facilitate model comparison, most experiments are trained with 20 speakers.

4.1 Closed-set experiments

Table 3 shows accuracies for the closed-set evaluation. These models are trained and tested with new instances from the same speakers.

Exp. Acc. (%) Training Instances Testing Instances
0 85 1,037 () 100 ()
1 73 1,037 () 100 ()
2 100 1,037 () 100 ()
3 97 1,037 () 100 ()
Table 3: Results for Closed-set experiments

In the closed-set scenario, where the models must identify new samples of speakers previously seen during the training, four experiments were proposed.

Experiment 0, a baseline that explored the use of GMM, had the second worst accuracy (85%) in the closed-set scenario. The model surpassed only Experiment 1, a fully connected artificial neural network, which obtained an accuracy of 73%. Experiment 3, which explored the use of a convolutional neural network (Figure 3), presented the second best accuracy (97%) and was only surpassed by Experiment 2, which explored the use of a recurrent neural network with fully connected layers and made no incorrect predictions on the test set, obtaining an accuracy of 100%.

In the closed-set scenario, the superiority of recurrent and convolutional neural networks is notable. Recurrent neural networks have the ability to remember context, and this ability can greatly assist in identifying speakers in closed-set scenarios. Convolutional neural networks are excellent feature extractors and performed very close to the recurrent neural networks. However, the fully connected neural network had the worst performance, losing even to the closed-set baseline experiment. It is important to note that convolutional and recurrent models are powerful models and, in turn, more prone to overfitting. This is an unwanted property that negatively affected the open-set models, as discussed in Section 4.2.

4.2 Open-set experiments

Table 4 shows results for the open-set evaluations, which use different speakers for training and testing the models. It presents accuracies and the R2 metric, obtained by analyzing the similarity between expected and obtained outputs.

Exp. Acc. (%) R2 Training Instances Testing Instances
4a 76.96 0.9101 1,037 () 1,257 ()
5 58.20 0.8993 1,037 () 1,257 ()
6a 64.43 0.8456 1,037 () 1,257 ()
7a 50.28 0.7899 1,037 () 1,257 ()
8 62.04 0.9023 1,037 () 1,257 ()
9 58.06 0.9101 79,116 () 620 ()
Table 4: Results for Open-set experiments

In the open-set scenario, where the models must identify samples of new speakers that were not seen during training, six experiments were proposed, exploring the use of recurrent, convolutional and fully connected neural networks. In most of these experiments, the neural model received 5 seconds of audio from a particular speaker and needed to reconstruct 1 second of the phoneme /a/. This new approach was proposed with low-resource languages in mind. The goal is to get good results with little training data, in contrast to the models proposed by (bredin2017) and (deepspeaker), which use the triplet loss function and need a large dataset to perform well. The proposed experiments used samples from 20 speakers for training and samples from another 20 speakers for testing.

Experiment 4a obtained an accuracy of 76.96%, the best accuracy among the proposed open-set models. This result is similar to that of the fully connected model in the closed-set scenario (Experiment 1). However, a surprising result is that fully connected models perform worst in the closed-set task and best in the open-set task. In theory, Model 4 has more parameters than the majority of the open-set models and thus should be more prone to overfitting. The opposite happened: it seems that the recurrent and convolutional models specialized in extracting particular features of the training speakers in order to reconstruct the output. This is not overfitting in the classical sense, as the model is not being used directly in the task for which it was trained. The proposed method is somewhat similar to the transfer learning concept, in that a model trained on known speakers is tested on new speakers. In other words, recurrent and convolutional models do generalize better, but only for known speakers, as will be shown in Section 4.3.

Experiment 5, which explored the use of a convolutional neural network with a fully connected decoder, obtained the fourth best accuracy (58.20%). Translational invariance is a useful feature of convolutional networks and can be used to detect specific phonemes independently of where they occur in the audio. However, this model was not able to surpass Experiment 4a for the reasons previously discussed, and also could not surpass the other convolutional model (Experiment 6a).

Experiment 6a, which explored the use of a fully convolutional neural network for embedding generation, presented the second best result (64.43%) in the open-set scenario, second only to Experiment 4a, suggesting that convolutional models are suitable for the evaluated task.

Experiment 7a explored the use of a recurrent neural network with fully connected layers for embedding generation, resulting in the worst accuracy in the open-set scenario (50.28%). Recurrent models can perform a more detailed analysis of the input audio, searching for patterns in one input fragment at a time. However, a problem may arise when a pattern is split across different analysis windows. The simple recurrent model tested could not overcome this issue. We also evaluated an LSTM recurrent network, but it did not perform as well as simple recurrence, as there is no need for long-term memory in a 5-step analysis process.

Experiment 8 explored the use of a mixed recurrent and convolutional network, where the encoder is fully convolutional and the decoder is fully connected and recurrent. This experiment achieved the third best result (62.04%) for the open-set scenario. The issues affecting Experiment 7a do not seem to have a high impact on this experiment, possibly because the recurrent layers are positioned after the convolutional layers.

Experiment 9 is a special experiment based on two neural networks, where Model 4 extracts embeddings while Model 9 is used to compare embedding pairs and indicate whether they are from the same speaker or not. The experiment led to an accuracy of 58.06%, approximately 19 percentage points lower than the result obtained with Model 4 and KNN. Model 9 is possibly at a disadvantage because, when deciding whether both embeddings are from the same speaker, it receives only the two speaker embeddings, while KNN compares the distance between the new sample embedding and the embeddings of all reference samples, thus having a great advantage.

In the open-set scenario, the superiority of fully connected models is noticeable. This is because they generalize better to new speakers and proved to be less prone to overfitting in the chosen task. Furthermore, it is clear that KNN is superior to a neural network in deciding which speaker an embedding belongs to. The fully convolutional model also showed promising results, with an accuracy 12.53 percentage points below that of the fully connected model. Experiment 4a presented the highest value for the R2 metric, indicating that there is a high correlation between the expected and predicted MFCCs; that is, the model can efficiently reconstruct the pronunciation of the phoneme /a/ in the speaker’s voice.
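The R2 values in Table 4 compare expected and reconstructed MFCCs. The text does not define the metric, so the sketch below assumes the usual coefficient of determination computed over flattened MFCC values; the function name and the toy numbers are illustrative only.

```python
def r2_score(expected, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(expected) / len(expected)
    ss_res = sum((e - p) ** 2 for e, p in zip(expected, predicted))
    ss_tot = sum((e - mean) ** 2 for e in expected)
    return 1.0 - ss_res / ss_tot


expected = [1.0, 2.0, 3.0]
print(r2_score(expected, [1.0, 2.0, 3.0]))  # perfect reconstruction
print(r2_score(expected, [2.0, 2.0, 2.0]))  # no better than predicting the mean
```

Under this definition a value close to 1 means the reconstruction explains almost all of the variance in the target MFCCs, while a value of 0 means it is no better than a constant prediction.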

4.3 Open-set models in Closed-set scenarios

Table 5 indicates how open-set models perform in the closed-set scenario, that is, when trained and tested over the same speakers and still using the embeddings for speaker identification.

Exp.   Acc. (%)   R2       Training instances   Testing instances
4b     77.50      0.9668   1,037 ()             100 ()
6b     100.0      0.9981   1,037 ()             100 ()
7b     88.75      0.8364   1,037 ()             100 ()
Table 5: Results for open-set models in closed-set scenarios

Experiments 4b, 6b and 7b use the same models as Experiments 4a, 6a and 7a, respectively. Although a direct comparison is not fair, since open-set scenarios are more difficult than closed-set scenarios, Experiment 4b maintained an accuracy in the closed-set scenario very close to that of 4a. However, it presented the worst accuracy (77.50%) and did not surpass even the baseline in the closed-set scenario (Experiment 0), even though it was the best model in the open-set scenario. We believe the model lost to the others because it is less prone to overfitting to speakers and generalizes better, avoiding memorizing specific speaker details.

Experiment 6b obtained the best accuracy (100%), tied with Model 3, which is a recurrent model specific to the closed-set scenario, while 6a obtained the second best accuracy in the open-set scenario. This shows that, with our training technique, open-set convolutional models can achieve results as good as closed-set models in closed-set scenarios. However, it is clear that the open-set models that perform best in closed-set scenarios do not maintain their performance in open-set scenarios.

While 7a presented the worst result among the explored models, 7b was the second best open-set model in the closed-set scenario, reaching an accuracy of 88.75%.

Therefore, for speaker identification in open-set scenarios we recommend the use of Model 4, which we call Speech2Phone. In addition to having the same performance as closed-set models, this model does not need to be architecturally re-engineered after more speakers are added to training, although some performance loss should still be expected when many speakers are added.

4.4 How multilingual is Speech2Phone?

Table 6 shows how well Speech2Phone scales to other languages.

Exp. (Language)   Accuracy (%)   Training instances   Testing instances
4c (English)      74.98          2,394                7,887
4d (Spanish)      73.32          2,394                986
4e (Chinese)      75.66          2,394                415
Table 6: Results for the Speech2Phone model in other languages

To verify whether our training technique can be useful in other languages, we chose the best model in the open-set scenario, which we call Speech2Phone, trained it over our entire dataset (40 speakers), and tested it in English, Spanish and Chinese (the variety spoken in Taiwan). In addition to verifying that the Speech2Phone model is multilingual, these experiments are important to verify how the model's performance depends on speech quality, since in our training and testing dataset all audio was recorded at the same quality, using the same equipment and the same environment.

In Experiment 4c, where the model’s generalization capacity was tested for the English language, the model reached an accuracy of 74.98%. In Experiment 4d, where the model was tested in Spanish, it reached an accuracy of 73.32%. For Taiwan Chinese (Experiment 4e), the model reached an accuracy of 75.66%.

These experiments show that the model can generalize to other languages without losing much performance. The accuracy difference was small, ranging from about 1.5% to 3.5%. It should also be noted that the model can generalize to databases recorded with equipment different from that used in training.

4.5 Scalability analysis

As the purpose of this paper is to investigate good models in the scenario of limited resources, an important research question is how well the model can scale to bigger datasets. To investigate this question, we selected our best embedding model (Model 4) and trained a new model over all of the 40 Portuguese speakers. As a follow-up, multiple test sets were extracted from the LibriSpeech dataset (librispeech), varying the number of new speakers in each one. Model 4 was used because it presented good transfer learning capabilities between languages, as shown in Experiment 4c. We chose the LibriSpeech dataset due to its quality and its number of speakers.

The results can be seen in Figure 10.

Figure 10: Scalability Analysis

The model has approximately 85% accuracy in the open-set case for two new speakers. This accuracy drops as the number of speakers in the testing set increases. For a test set with the same number of speakers as the training set (40), it presents about 62% accuracy. For 100 speakers it presented about 55% accuracy. The number of test speakers can thus be more than double the number of training speakers with a relatively small drop in accuracy. These results were considered satisfactory. Since this test relies on transfer learning to another language (English), the model may present higher accuracies for Portuguese.
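The shape of such a scalability test can be sketched as follows. This is an illustration under strong assumptions, not the procedure used in the paper: the embeddings are synthetic (one random centre per speaker plus Gaussian noise) instead of model outputs on LibriSpeech, the enrollment protocol (three reference samples per speaker, 1-nearest-neighbour decision) is assumed, and the names `make_speaker_samples` and `open_set_accuracy` are hypothetical.

```python
import math
import random

def make_speaker_samples(rng, n_speakers, dim=8, per_speaker=6, noise=0.6):
    """Synthetic stand-in for embeddings: one random centre per speaker plus noise."""
    data = {}
    for spk in range(n_speakers):
        centre = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        data[spk] = [[c + rng.gauss(0.0, noise) for c in centre]
                     for _ in range(per_speaker)]
    return data

def open_set_accuracy(data, speakers, n_enroll=3):
    """Enroll n_enroll samples per speaker, classify the rest by nearest reference."""
    refs = [(emb, spk) for spk in speakers for emb in data[spk][:n_enroll]]
    correct = total = 0
    for spk in speakers:
        for emb in data[spk][n_enroll:]:
            nearest = min(refs, key=lambda ref: math.dist(emb, ref[0]))
            correct += nearest[1] == spk
            total += 1
    return correct / total

rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = make_speaker_samples(rng, n_speakers=100)
# Accuracy for test sets with a growing number of speakers.
curve = {n: open_set_accuracy(data, list(range(n))) for n in (2, 10, 40, 100)}
```

With more speakers in the test set there are more candidate labels competing for each query, which is the mechanism behind the downward trend reported in Figure 10.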

5 Conclusions and future work

In this paper we propose a new training technique for speaker identification models, focusing on languages with few resources available. To enable training with this technique, we built a dataset. Several models were proposed for the closed-set and open-set scenarios and were trained and tested on this dataset. We noted that fully connected models perform better for detecting new speakers, even though they are not ideal for the known-speakers scenario. The best model in the open-set scenario was named Speech2Phone, and experiments with this model showed its generalization abilities for other datasets and languages. Finally, a scalability analysis was performed to see how well the model scales to large datasets.

This work contributes directly to the area of speaker identification, presenting a new text-independent and multilingual model, with contributions that are especially relevant to languages with few available resources.

Additionally, the model proposed here can be used in tasks such as speech synthesis (ping2017deep), voice cloning (arik2018neural) and cross-lingual voice conversion (zhou2019cross). In these tasks, embeddings from a speaker identification system are used to represent the speaker. The model presented is useful for cross-lingual voice conversion due to its language independence. Furthermore, an advantage of Speech2Phone compared to the models proposed by bredin2017 and deepspeaker is its speed of execution, since Speech2Phone has no recurrent units and is a simple fully connected network. This feature is very desirable for applications that need to run in real time.

For future work, we intend to increase the dataset used for training to the greatest extent possible, and make it public. In addition, we intend to explore the use of spectrograms and to combine MFCCs with other representations, such as Delta and Delta-Delta features, to increase the Speech2Phone model’s performance.
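As a hedged illustration of what combining MFCCs with Delta and Delta-Delta features could look like, the sketch below uses the simplest two-frame central difference, d[t] = (c[t+1] - c[t-1]) / 2, with edge frames repeated. The function name `delta`, the window size and the toy MFCC values are assumptions; the paper does not specify how these features would be computed.

```python
def delta(frames):
    """Central difference over time: d[t] = (c[t+1] - c[t-1]) / 2, edges repeated."""
    # Repeat the first and last frame so every frame has two neighbours.
    padded = [frames[0]] + frames + [frames[-1]]
    return [[(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
            for t in range(len(frames))]

# Toy MFCC matrix: 4 frames x 2 coefficients.
mfcc = [[1.0, 0.0], [2.0, 0.5], [4.0, 1.0], [7.0, 1.5]]
d1 = delta(mfcc)  # Delta: rate of change of each coefficient
d2 = delta(d1)    # Delta-Delta: rate of change of the Delta
# Stack static, Delta and Delta-Delta coefficients for each frame.
features = [c + a + b for c, a, b in zip(mfcc, d1, d2)]
```

Stacking triples the number of values per frame, so the input layer of a fully connected model would need to be resized accordingly.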


We would like to thank Itaipu Technological Park (Parque Tecnológico Itaipu, PTI) and the Coordination for the Improvement of Higher Education Personnel (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, CAPES) for financial support for this paper, especially from the Latin American Center of Open Technologies (Centro Latino-Americano de Tecnologias Abertas, CELTAB). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used in part of the experiments presented in this research.