”Good words are worth much and cost little.” - George Herbert
In many speech-based applications, the construction of a speaker representation is required [irum2019speaker]. Automatic Speaker Recognition (ASR) and Speech Synthesis are two examples that have recently made steady progress by leveraging large amounts of data and neural networks. Text-To-Speech (TTS) convincingly captures someone’s voice [shen2018natural], and modern Speaker Recognition systems identify a speaker [irum2019speaker] among thousands of possible candidates with high accuracy. Speaker recognition systems are trained to extract speaker-specific features from speech signals, and during evaluation, test speaker utterances are compared with the already existing utterances. However, dozens of test recordings are necessary, which limits usage when interacting with humans. When identifying a speaker or trying to create a convincing TTS system, only some key features might be necessary, such as certain inflexions or speech mannerisms. In this paper, we build a speaker recognition system that can identify a speaker by using a limited and personalized number of words. Instead of relying on full test utterances across all individuals, we interact with the speakers to iteratively select the most discriminative words.
Some pronunciations might be typical of certain speakers. For example, the phoneme ’r’ might be pronounced differently depending on the speaker’s accent. Thus, starting with general phonemes and refining based on the utterances received could result in better recognition systems. More generally, a desirable feature of a speaker recognition system is to adapt its strategy to the current speaker, as important features vary from person to person.
Here we propose to frame the problem of building a representation of the speaker as a sequential decision-making problem. The system we want to develop will select words that a speaker must utter so that the speaker can be recognized as fast as possible. Reinforcement learning (RL) [sutton2018reinforcement] is a framework to solve such sequential decision-making problems. It has been used in speech-based applications such as dialog [chandramohan2010optimizing, strub2017end] but has not been applied to the problem of speaker identification (note that [pietquin2005comparing] combines RL and phone similarity). We adapt a standard RL algorithm to interact with a speaker to maximize the identification accuracy given as little data as possible. After introducing an Interactive Speaker Recognition (ISR) game based on the TIMIT dataset to simulate the speaker-ASR interaction, we show that the RL agent builds an iterative strategy that achieves better recognition performance while querying only a few words.
Our contributions are thus:
Finally, we test our method on the TIMIT dataset and show that the ISR model successfully personalizes the words it requests toward improving speaker identification, outperforming two non-interactive baselines (Sec. 5).
2 Interactive Speaker Recognition Game
In this paper, we aim to design an Interactive Speaker Recognition (ISR) module that identifies a speaker from a list of speakers by requesting only a few user-specific words to be uttered. To do so, we first formalize the ISR task as an interactive game involving the speaker and the ISR module. We then define the notation used to formally describe the game before detailing how we designed the ISR module.
2.1 Game Rules
To instantiate the ISR game, we first build a list of random individuals, or guests. Each guest is characterized by a few spoken sentences (enrolment phase), which act as their signature that we call voice print. In a second step, we label one of the guests as the target speaker that we aim to identify. Hence a game is defined by guests characterized with voice prints, and one of these guests is labeled as the speaker.
As the game starts, the voice prints are provided to the ISR module, and it needs to identify the speaker among the guests. To do so, the ISR engine may interact with the speaker, but it can only request the speaker to utter words within a predefined vocabulary list. At each turn of the game, the ISR module asks the speaker to say a word, the speaker pronounces it, and the ISR engine updates its internal speaker representation, as detailed in subsection 4.3, before asking the next word. Again, the ISR module may only request words. Thus, it needs to carefully choose them to correctly identify the speaker.
2.2 Game notation
A game is composed of a list of $K$ guests characterized by their voice prints $g = [g_k]_{k=1}^{K}$, where $g$ is a subset from a larger group of registered guests of size $N$, and a predefined vocabulary $\mathcal{V}$ of size $V$. The ISR module aims at building a list of words $w = [w_t]_{t=1}^{T}$ to be uttered by the speaker. The uttered version of $w$ is $x = [x_t]_{t=1}^{T}$, where $x_t$ is the representation of word $w_t$ pronounced by the speaker. Note that, for a given $w_t$, $x_t$ differs from one speaker to another.
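As a minimal sketch of this setup, the snippet below samples one game from a pool of registered speakers. The names `sample_game` and `registry`, the pool size, and the toy three-dimensional voice prints are hypothetical and only serve the illustration; the paper relies on X-Vector voice prints.

```python
import random

def sample_game(registry, num_guests, rng):
    """Draw a list of guests from the registered pool and pick the target speaker.

    registry: dict mapping a speaker id to its voice print (list of floats).
    Returns (guest_ids, voice_prints, target_index).
    """
    guest_ids = rng.sample(sorted(registry), num_guests)  # distinct guests
    voice_prints = [registry[g] for g in guest_ids]
    target_index = rng.randrange(num_guests)  # one of the guests is the speaker
    return guest_ids, voice_prints, target_index

# Toy registry: 10 hypothetical speakers with 3-dimensional voice prints.
rng = random.Random(0)
registry = {f"spk{i}": [rng.gauss(0, 1) for _ in range(3)] for i in range(10)}
guest_ids, voice_prints, target_index = sample_game(registry, num_guests=5, rng=rng)
```

The target is kept as an index into the guest list, which matches the view of speaker retrieval as a classification over the guests of the current game.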
2.3 Modelling the Speaker Recognition Module
From a machine learning perspective, we aim to design an ISR module that actively builds an internal speaker representation to perform voice print classification. As further discussed in subsection 4.2, this setting differs from standard SR methods that rely on generic but often long utterances [snyder2018x]. In practice, we can split this task into two sub-modules: 1) an interactive module that queries the speaker to build the representation, and 2) a module that performs the voice print classification. In the following, we refer to these modules as enquirer and guesser.
Formally, the guesser must retrieve the speaker in a list of $K$ guests characterized by their voice prints $g = [g_k]_{k=1}^{K}$, given a sequence of $T$ words uttered by the speaker $x = [x_t]_{t=1}^{T}$. Thus, the guesser has to link the speaker’s uttered words to the speaker’s voice print. The enquirer must select the next word $w_{t+1}$ that should be pronounced by the speaker given the list of guests $g$ and the sequence of previously spoken words $x_{1:t}$. Thus, the enquirer’s goal is to pick the word that maximizes the guesser’s success rate. Therefore, the ISR module first queries the speaker with the enquirer. Once the $T$ words are collected, they are forwarded to the guesser to perform the speaker retrieval. In practice, this artificial split allows training the guesser with vanilla supervised learning, i.e., by randomly sampling words to retrieve speakers. The enquirer can then be trained through reinforcement learning, as explained in the next section.
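The enquirer/guesser split can be sketched as a simple interaction loop. Everything below is a toy stand-in under stated assumptions: `play_game`, `nearest_guest`, `fixed_order_enquirer`, the two-dimensional vectors, and the word offsets are hypothetical, and the nearest-voice-print rule replaces the neural guesser purely for illustration.

```python
def play_game(enquirer, guesser, speaker_utter, voice_prints, num_words):
    """Run one ISR game: query words one by one, then guess the speaker."""
    uttered = []  # representations of the words spoken so far
    for _ in range(num_words):
        word = enquirer(voice_prints, uttered)  # pick the next word to request
        uttered.append(speaker_utter(word))     # the speaker pronounces it
    return guesser(voice_prints, uttered)       # index of the predicted guest

def nearest_guest(voice_prints, uttered):
    """Toy guesser: average the uttered vectors, return the closest voice print."""
    dim = len(voice_prints[0])
    mean = [sum(u[d] for u in uttered) / len(uttered) for d in range(dim)]
    dists = [sum((m - v) ** 2 for m, v in zip(mean, vp)) for vp in voice_prints]
    return dists.index(min(dists))

voice_prints = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
target = 1
vocabulary = ["dark", "year", "carry"]
offsets = {"dark": [0.1, -0.1], "year": [-0.1, 0.0], "carry": [0.0, 0.1]}

def speaker_utter(word):
    # Toy speaker: the target voice print shifted by a word-specific offset.
    return [v + o for v, o in zip(voice_prints[target], offsets[word])]

def fixed_order_enquirer(voice_prints, uttered):
    # Toy enquirer: request the vocabulary words in a fixed order.
    return vocabulary[len(uttered)]

prediction = play_game(fixed_order_enquirer, nearest_guest, speaker_utter,
                       voice_prints, num_words=3)
```

Because the guesser only sees the voice prints and the uttered words, it can be trained with randomly chosen words, while any word-selection rule can be plugged in as the enquirer.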
3 Speaker Recognition as an RL Problem
Reinforcement Learning addresses the problem of sequential decision making under uncertainty, where an agent interacts with the environment to maximize its cumulative reward [sutton2018reinforcement]. In this paper, we aim at maximizing the guesser success ratio by allowing the enquirer to interact with the speaker, which makes RL a natural fit to solve the task. In this section, we thus provide the necessary RL terminology before relating the enquirer to the RL setting and defining the optimization protocol.
3.1 Markov Decision Process
In RL, the environment is modeled as a Markov Decision Process (MDP), where the MDP is formally defined as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ [howard1960dynamic, puterman2014markov]. At each time step $t$, the agent is in a state $s_t \in \mathcal{S}$, where it selects an action $a_t \in \mathcal{A}$ according to its policy $\pi$. The agent then moves to the state $s_{t+1}$ according to a transition kernel $\mathcal{P}$ and receives a reward $r_t$ drawn from the environment’s reward function $\mathcal{R}$. In this paper, we define the enquirer as a parametric policy $\pi_\theta$, where $\theta$ is a vector of neural network weights that will be learnt with RL. At the beginning of an episode, the initial state corresponds to the list of guests: $s_1 = g$. At each time step $t$, the enquirer picks the action $a_t$ by selecting the next word to utter $w_t$, where $w_t \in \mathcal{V}$. The speaker then pronounces the word $w_t$, which is processed to obtain $x_t$ before being appended to the state: $s_{t+1} = s_t \cup \{x_t\}$. After $T$ words, the state $s_{T+1}$ is provided to the guesser. The enquirer is rewarded whenever the guesser identifies the speaker, i.e. $r_t = 0$ if $t < T$ and $r_T = \mathbb{1}[\hat{k} = k^*]$, where $\hat{k}$ is the guest selected by the guesser, $k^*$ is the target speaker, and $\mathbb{1}$ is the indicator function.
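The reward described above can be sketched as a small function. This is an illustration under two assumptions: intermediate steps yield no reward, and the guesser's decision is the guest with the highest predicted probability. The name `enquirer_reward` and its arguments are hypothetical.

```python
def enquirer_reward(step, horizon, guesser_probs, target_index):
    """Sparse reward: zero before the last step, then 1 if the guesser's
    most likely guest is the true speaker and 0 otherwise."""
    if step < horizon:
        return 0.0  # assumption: no intermediate reward
    predicted = max(range(len(guesser_probs)), key=guesser_probs.__getitem__)
    return 1.0 if predicted == target_index else 0.0

# Example: three requested words, two guests, guest 1 is the speaker.
rewards = [enquirer_reward(t, 3, [0.1, 0.9], 1) for t in (1, 2, 3)]
```

With such a terminal reward, the return of an episode is simply the guesser's success on that game, so maximizing the expected return amounts to maximizing the guesser success ratio.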
3.2 Enquirer optimization Process
In RL, policy search aims at learning the policy $\pi_{\theta^*}$ that maximizes the expected return by directly optimizing the policy parameters $\theta$. More precisely, we search for the parameters $\theta^*$ that maximize the mean value defined as $J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right]$, where $r_t$ is the reward received at step $t$ and $\gamma$ is the discount factor. To do so, the policy parameters are updated in the direction of the gradient of $J(\theta)$. In practice, a direct approximation of $\nabla_\theta J(\theta)$ may lead to destructively large policy updates, may converge to a poor deterministic policy early in training, and has a high variance. In this paper, we thus use the recent Proximal Policy Optimization approach (PPO) [schulman2017proximal]. PPO clips the gradient estimate to have smooth policy updates, adds an entropy term to soften the policy distribution [geist2019theory], and introduces a parametric baseline to reduce the gradient variance [mnih2016asynchronous, schulman2017proximal].
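PPO's clipping is usually written as a clipped surrogate objective on the probability ratio between the new and the old policy. The sketch below computes that objective for a batch of transitions; the function name and the toy inputs are hypothetical, and the entropy bonus and the learned baseline are left out for brevity.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, clip=0.2):
    """Mean clipped surrogate objective over a batch of transitions."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip, min(ratio, 1.0 + clip))
        total += min(ratio * adv, clipped * adv)       # pessimistic bound
    return total / len(advantages)

# Identical policies: the ratio is 1 and the objective is the mean advantage.
same_policy = ppo_clipped_objective([0.0, 0.0], [0.0, 0.0], [2.0, 4.0])
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the ratio outside the interval [1 - clip, 1 + clip] when doing so would increase the objective, which is what keeps the updates small.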
4 Experimental Protocol
We first detail the data we used to create the ISR game before describing the speech processing phase. Finally, we present the neural training procedure.
We build the ISR game using the TIMIT corpus [garofolo1992timit]. This dataset contains the voice recordings of 630 speakers with eight different accents. Each speaker uttered ten sentences, where two sentences are shared among all the speakers, and the eight others differ. Sentences are encoded as 16-bit, 16kHz waveforms. First, we define the ISR vocabulary by extracting the words of the two shared sentences, so the enquirer module may always request these words whatever the target speaker. In total, we obtained twenty distinct words such as dark, year, carry while dropping the uninformative specifier a. Second, we use the eight remaining sentences to build the speakers’ voice print.
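The vocabulary construction can be sketched from the two sentences that TIMIT shares across speakers (SA1 and SA2). The helper `build_vocabulary` is hypothetical, and the set of dropped words is an assumption: the article is matched here as either "a" or "an" so that the count comes out at twenty words.

```python
# The two TIMIT sentences shared by all speakers (SA1 and SA2).
SHARED_SENTENCES = [
    "She had your dark suit in greasy wash water all year",
    "Don't ask me to carry an oily rag like that",
]
# Assumption: the dropped article is matched as either "a" or "an".
DROPPED = {"a", "an"}

def build_vocabulary(sentences, dropped):
    """Collect distinct lower-cased words in order of first appearance."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in dropped and word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

vocabulary = build_vocabulary(SHARED_SENTENCES, DROPPED)
```

Restricting the vocabulary to the shared sentences guarantees that a recording exists for every word and every speaker, so any request of the enquirer can be answered.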
4.2 Audio Processing
Following [snyder2018x, snyder2017deep, snyder2016deep], we first down-sample the waveform to 8kHz before extracting the Mel Frequency Cepstral Coefficients (MFCCs). We use MFCCs of dimension 20 with a frame-length of 25ms, mean-normalized over a sliding window of three seconds. We then process the MFCC features through a pretrained X-Vector network to obtain a high-quality voice embedding of fixed dimension 128, where the X-Vector network is trained on augmented Switchboard [godfrey1992switchboard], Mixer 6 [chodroff2016new], and NIST SREs [doddington2000nist] (available in the Kaldi library [Povey_ASRU2011] at http://www.kaldi-asr.org/models/m3). To get the spoken word representations (the words that the enquirer will query), we split the two shared sentences into individual words by following the TIMIT word timestamps. We then extract the X-Vector of each word of every speaker to obtain the uttered-word representations $x$. We compute the voice print by extracting the X-Vectors of the eight remaining sentences before averaging them into a final vector of size 128 for each guest, $g_k$.
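The last step, averaging sentence-level embeddings into a voice print, can be sketched as follows. The embeddings are toy values of dimension 4 rather than real 128-dimensional X-Vectors, and `average_vectors` is a hypothetical helper; the X-Vector extraction itself is not reproduced here.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Toy stand-ins for the eight sentence-level X-Vectors of one guest
# (4 dimensions here instead of 128).
sentence_xvectors = [[float(i + d) for d in range(4)] for i in range(8)]
voice_print = average_vectors(sentence_xvectors)
```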
4.3 Speaker Recognition Neural Modules
We here describe the ISR training details and illustrate the neural architectures in Figure 2.
Guesser. To model the guesser, we first model the guests by averaging the voice prints into a single vector $\hat{g} = \frac{1}{K} \sum_{k=1}^{K} g_k$. We then pool the X-Vectors with an attention layer conditioned on $\hat{g}$ to get the guesser embedding $\hat{x}$ [bahdanau2014neural]:

$$e_t = \mathrm{MLP}([x_t, \hat{g}]), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{t'=1}^{T} \exp(e_{t'})}, \qquad \hat{x} = \sum_{t=1}^{T} \alpha_t x_t,$$

where [.,.] is the concatenation operator and MLP is a multilayer perceptron with one hidden layer of size 256. We concatenate the guesser embedding with the guest voice print before projecting them through an MLP of size 512. Finally, we use a softmax to estimate the probability of each guest being the speaker. Both MLPs have ReLU activations [nair2010rectified] with a dropout ratio of 0.5% [srivastava2014dropout]. The guesser is trained by minimizing the cross-entropy with ADAM [kingma2014adam], a batch size of 1024 and an initial learning rate of over 45k games with five random guests.
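The attention pooling used by the guesser can be illustrated with a small, dependency-free sketch. The scoring function is passed in as a parameter; `dot_score` is a hypothetical stand-in for the MLP applied to the concatenation of a word vector and the guest context, and the vectors are toy values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(word_vectors, guest_context, score_fn):
    """Weight each word vector by softmax(score_fn(word, context)) and sum."""
    weights = softmax([score_fn(x, guest_context) for x in word_vectors])
    dim = len(word_vectors[0])
    pooled = [sum(w * x[d] for w, x in zip(weights, word_vectors))
              for d in range(dim)]
    return pooled, weights

def dot_score(x, context):
    # Stand-in for the MLP over the concatenation [x, context].
    return sum(a * b for a, b in zip(x, context))

pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], dot_score)
```

With an uninformative context every word receives the same weight and the pooling reduces to a plain average; a learned scoring function lets the guesser emphasize the words that best separate the current guests.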
Enquirer. To model the enquirer, we first represent the pseudo-sequence of words by feeding the X-Vectors into a bidirectional LSTM [hochreiter1997long] to get the word hidden state of dimension 2×128. Note that we use a start token for the first iteration. In parallel, we average the voice prints into a single vector to get the guest context. We then concatenate the word hidden state and the guest context before processing them through a one-hidden-layer MLP of size 256 with ReLU. Finally, a softmax activation estimates the probability of requesting the speaker to utter $w_{t+1}$ as the next word: $\pi_\theta(w_{t+1} \mid x_{1:t}, g)$. The enquirer is trained by maximizing the reward encoded as the guesser success ratio with PPO [schulman2017proximal]. We use the ADAM optimizer [kingma2014adam] with a learning rate of 5e-3 and gradient clipping of 1 [pascanu2013difficulty]. We performed 80k episodes of length $T$ steps with $K$ random guests. When applying PPO, we use an entropy coefficient of 0.01, a PPO clipping of 0.2, a discount factor of 0.9, an advantage coefficient of 0.95, and we apply four training batches of size 512 every 1024 transitions.
We run all experiments over five seeds, and report the mean and one-standard deviation when not specified otherwise.
(a-b) Guesser test accuracy, respectively varying the number of words (resp. guests) being used (c) Enquirer test accuracy varying the number of queried words. The RL enquirer outperforms the heuristic baseline when selecting a low number of words.
5.1 Guesser Evaluation
In this section, we evaluate the guesser accuracy in different settings. As mentioned, we opt to request words to identify the speaker among guests. In this default setting, a random policy has a success ratio of , whereas the neural model reaches on the test set. As the guesser is trained on random words, these scores may be seen as an ISR lower-bound for the enquirer, which would later refine the word selection toward improving the guesser success ratio. Thus, this setting shows an excellent ratio between task difficulty and guesser initial success, allowing us to train the enquirer with a relatively dense reward signal.
Word Sweep. We assess the guesser’s ability to perform speaker recognition when increasing the number of words in (a). We observe that a single word only gives speaker retrieval, but the accuracy keeps improving when requesting more words. Notably, collecting the full vocabulary only scores up to 97% accuracy.
Guest Sweep. We report the impact of the number of guests in (b). The guesser accuracy quickly collapses when increasing the number of guests with having a success ratio. As the number of words remains small, the guesser experiences increasing difficulty in discriminating the guests. One way to address this problem would be to use a Probabilistic Linear Discriminant Analysis (PLDA) [ioffe2006probabilistic] to enforce a discriminative space and explicitly separate the guests based on their class.
5.2 Enquirer Evaluation
Model. As previously mentioned, the enquirer aims to find the best sequence of words that maximizes the guesser accuracy by interacting with the speaker. At each time step, we thus select the word with the highest probability according to the policy without replacement, i.e., the model never requests the same word twice.
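Greedy selection without replacement can be sketched as follows. The probabilities are hypothetical and held fixed for readability; in the actual model the policy is re-evaluated after each uttered word, so the distribution would change from one step to the next.

```python
def pick_next_word(word_probs, already_requested):
    """Return the index of the most probable word not yet requested."""
    candidates = [i for i in range(len(word_probs)) if i not in already_requested]
    return max(candidates, key=lambda i: word_probs[i])

probs = [0.5, 0.3, 0.2]  # hypothetical, fixed policy output
requested = []
for _ in range(3):
    requested.append(pick_next_word(probs, requested))
```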
Baseline. We compare our approach to two baselines: a random policy, and a heuristic policy. As the name suggests, the random baseline picks random words without replacement. To obtain a strong baseline, we pre-select words by taking advantage of the guesser model, where we value a sequence of words by computing the guesser accuracy over games. Optimally, we want to iterate over every tuple of words to retrieve the optimal set; yet, it is computationally intractable as it requires estimations. Therefore, we opt for a heuristic sampling mechanism. We curated a list of the most discriminant words (words that globally increase the recognition scores) and sampled among those instead of the whole list.
Results. In our default setting, the random baseline reaches speaker identification, and the heuristic baseline scores up to . The RL enquirer obtains up to , showing that it successfully leverages the guests’ voice prints to refine its policy. We show the RL training in Figure 4. Early in training, we observe that the ISR module still has high variance, and may behave randomly. However, the RL enquirer steadily improves during training, and it consistently outperforms the heuristic baseline.
Word Diversity. To verify whether the enquirer adapts its policy to the guests, we generate a game for every speaker in the test set, and collect the requested words. We then compute the overlap between the tuples of words by estimating the averaged Jaccard-index [jaccard1901distribution] of every pair of speakers as follows:

$$\frac{2}{M(M-1)} \sum_{m=1}^{M} \sum_{n=m+1}^{M} \frac{|w^{m} \cap w^{n}|}{|w^{m} \cup w^{n}|},$$

where $M$ is the number of speakers in the test set and $w^{m}$ is the word tuple of game $m$. Intuitively, the lower this number, the more diverse the policy, e.g., a deterministic policy has a Jaccard-index of 1. In the default setting, the random policy has an index of 0.14 while the RL agent has an index of 0.65. Thus, the requested words are indeed diverse.
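The diversity measure can be sketched as an average of pairwise Jaccard indices. Averaging over every unordered pair of distinct games is an assumption made for this illustration, and `mean_pairwise_jaccard` is a hypothetical name.

```python
from itertools import combinations

def mean_pairwise_jaccard(word_tuples):
    """Average Jaccard index over every unordered pair of games."""
    scores = []
    for a, b in combinations(word_tuples, 2):
        sa, sb = set(a), set(b)
        scores.append(len(sa & sb) / len(sa | sb))
    return sum(scores) / len(scores)

# A policy that always requests the same words has an index of 1.
deterministic = mean_pairwise_jaccard([("dark", "year", "carry")] * 4)
```

Two games that share one word out of three distinct ones, for instance `("dark", "year")` and `("year", "carry")`, give an index of 1/3.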
Requesting Additional Words. We here study the impact of increasing the number of words requested by the enquirer (see (c) for results). First, we observe that the ISR module manages to outperform the heuristic policy when requesting two to four words, showing that the interaction with the speaker is beneficial in the low data regime. This effect unsurprisingly diminishes when increasing the number of words. However, we noticed that the enquirer always outputs the same words when . This suggests that the model faces some difficulties contextualizing the guests’ voice prints before listening to the first speaker utterance. We assume that more advanced multimodal architectures, e.g., multimodal transformers [lu2019vilbert, tan2019lxmert], may ease representation learning, further improving the ISR agent.
6 Conclusions and Future Directions
In this paper, we introduced the Interactive Speaker Recognition paradigm as an interactive game to improve speaker recognition accuracy while querying only a few words. We formalize it as a Markov Decision Process and train a neural model using Reinforcement Learning. We showed empirically that the ISR model successfully personalizes the words it requests to improve speaker identification, outperforming two non-interactive baselines. Future directions include scaling to bigger datasets [Nagrani17, Chung18b] and scaling up the vocabulary size [dulac2015deep, he2016deep, seurin2019m]. Our protocol may go beyond speaker recognition. The model can be adapted to select speech segments in the context of Text-To-Speech training. Interactive querying may also prevent malicious voice generator usage by asking complex words to the generator in a speaker verification setting.
We would like to thank Corentin Tallec for his helpful recommendations and the SequeL-Team for being a stimulating environment. We acknowledge the following agencies for research funding and computing support: Project BabyRobot (H2020-ICT-24-2015, grant agreement no.687831), and CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020