Speaker recognition is the process of recognizing the identity of the person who produced a spoken utterance. Depending on the number of speaker candidates to be recognized, it is often referred to as speaker verification (single candidate) or speaker identification (multiple candidates). According to the textual content of the spoken utterance being recognized, a speaker recognition task falls into one of three categories: text-dependent speaker recognition [hebert2008text], where the text of the utterance is always the same (e.g. a keyword [chen2014small] or password) or comes from a very small set; text-prompted speaker recognition, where the text of the utterance is randomly selected from a large pre-defined set to prevent spoofing attacks; and text-independent speaker recognition [kinnunen2010overview], where there is no restriction on the text of the utterance.
Regardless of the number of speaker candidates, the text of the utterance, or the specific underlying technology, all speaker recognition systems require two stages of user interaction when deployed to a production environment: the enrollment stage and the runtime stage.
During the enrollment stage, a user provides multiple audio samples to the system, and the system generates a user profile to represent the voice characteristics of this user, as shown in Fig. 1.
Once the users have completed enrollment, the system is ready for runtime recognition, where the voice characteristics of the runtime audio are compared against the enrolled user profiles, as shown in Fig. 2.
In both the enrollment stage and the runtime stage, the speaker recognition system needs to extract acoustic features such as PLP [hermansky1990perceptual], MFCC [davis1980comparison], PNCC [kim2016power] or log Mel-filterbanks from the audio signals. After the acoustic features have been extracted, a speaker encoder model will be used to represent the audio by a speaker embedding, such as a GMM supervector [reynolds2000speaker], speaker factors from joint factor analysis [kenny2005joint], an i-vector [dehak2010front], or a neural network embedding [wan2018generalized, li2017deep, snyder2018x]. In the context of this paper, the software that implements feature extraction and the speaker encoder will be referred to as the speech engine, as they are the most computationally expensive components in the speaker recognition system.
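The two stages can be illustrated with a minimal sketch. It assumes, purely for illustration, that the user profile is the normalized average of per-utterance speaker embeddings and that runtime comparison uses cosine similarity against a fixed threshold; the function names and the threshold value are hypothetical, and a real system may aggregate and score differently.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def enroll(embeddings: Sequence[Vector]) -> Vector:
    """Enrollment stage (assumed aggregation): average the unit-length embeddings."""
    if not embeddings:
        raise ValueError("at least one enrollment embedding is required")
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def score(profile: Vector, runtime_embedding: Vector) -> float:
    """Runtime stage (assumed scoring): cosine similarity with the profile."""
    runtime = l2_normalize(runtime_embedding)
    return sum(p * r for p, r in zip(profile, runtime))


def verify(profile: Vector, runtime_embedding: Vector, threshold: float = 0.7) -> bool:
    """Accept the speaker when the similarity reaches a hypothetical threshold."""
    return score(profile, runtime_embedding) >= threshold
```

Because the profile is computed from embeddings produced by one specific speaker encoder, it is only comparable with runtime embeddings from that same encoder.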
2 The version control problem
After a speaker recognition system has been deployed to a production environment, we may still want to update the system for many reasons, including:
Updating the feature extraction component for better performance (e.g. using more frequency bands).
Updating the underlying speaker encoder technology for better performance (e.g. migrating from an i-vector model to a neural network based model).
Based on the same technology, updating the speaker encoder model to use a different neural network topology, a different loss function during training, or different training datasets.
Software optimization and refactoring to improve system robustness, scalability and maintainability.
Because of the enrollment stage, speaker recognition is a stateful system: the recognition result of a runtime audio depends on the output of other audio (i.e. the enrollment audio). This is very different from other speech systems such as automatic speech recognition (ASR) and language recognition, where the systems are typically stateless.
As a consequence, the user profiles obtained during the enrollment process (Fig. 1) are “version dependent”. Once the speaker recognition system has been updated to a newer version, existing user profiles can no longer be used.
In the following sections, we will discuss strategies to re-enroll the user profiles based on a new version of the system in a production environment. (Without loss of generality, we will refer to the new version of the system as the new “model” in the following sections for simplicity.) The strategies differ based on the type of deployment. According to where the speech engine runs and where the user profiles are stored, we categorize deployment solutions into three types:
Device-side deployment: The speech engine runs on user devices, and the user profiles are also stored on user devices.
Server-side deployment: The speech engine runs on cloud computing servers, and the user profiles are stored on cloud databases.
Hybrid deployment: The speech engine runs on cloud computing servers, but the user profiles are stored on user devices.
3 Device-side deployment
3.1 Device-side architecture
In device-side deployment, both the speech engine execution and the user profile storage happen on the user device. The user device could be a smartphone, a smart home speaker, or a smart security device. The biggest advantage of device-side deployment is that it does not require any Internet communication with servers. This means both the enrollment and runtime stages can run smoothly even when there is no Internet connection.
One big challenge of device-side deployment is the limited computational resources, such as CPU, memory, storage, and power. In most use cases, the user device (e.g. a smartphone) needs to perform many other tasks in parallel, thus the resource budget for speaker recognition is usually very limited. There are many approaches to reduce the computational cost of the speaker recognition system, such as model quantization [alvarez2016efficient, shangguan2019optimizing], model compression [nakkiran2015compressing], model sparsification [lecun1990optimal], or implementing part of the system on specialized hardware (e.g. digital signal processors).
3.2 Single version updating strategy
Version control for device-side deployment is relatively straightforward, as illustrated in Fig. 3. The user device only keeps a single model. After the enrollment stage, the user’s enrollment audio will be stored on the device. When a newer version of the model becomes available on the model storage server, the user device will download this newer model. Once the download completes, it will immediately trigger a process that uses the newly downloaded model to generate the new version of the user profiles based on the enrollment audio. This process guarantees that the version of the user profiles stored on the user device always matches the version of the model.
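The update flow described above can be sketched as follows. This is a minimal sketch under stated assumptions: the `Model`, `Profile`, and `Device` classes and the `on_model_downloaded` hook are hypothetical names, the encoder is passed in as a plain callable, and profiles are built by averaging embeddings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Audio = List[float]
Embedding = List[float]


@dataclass
class Model:
    version: str
    encode: Callable[[Audio], Embedding]  # stands in for the speech engine


@dataclass
class Profile:
    version: str  # version of the model that produced this profile
    embedding: Embedding


def build_profile(model: Model, enrollment_audio: List[Audio]) -> Profile:
    """Average the per-utterance embeddings (an assumed aggregation rule)."""
    embeddings = [model.encode(audio) for audio in enrollment_audio]
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return Profile(version=model.version, embedding=mean)


@dataclass
class Device:
    model: Model
    enrollment_audio: List[Audio] = field(default_factory=list)
    profile: Optional[Profile] = None

    def enroll(self, samples: List[Audio]) -> None:
        # The enrollment audio is kept on the device for later re-enrollment.
        self.enrollment_audio = list(samples)
        self.profile = build_profile(self.model, self.enrollment_audio)

    def on_model_downloaded(self, new_model: Model) -> None:
        # Replace the single on-device model, then immediately regenerate the profile.
        self.model = new_model
        if self.enrollment_audio:
            self.profile = build_profile(new_model, self.enrollment_audio)
```

Swapping the model and regenerating the profile in the same step is what keeps the two versions aligned.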
4 Server-side deployment
4.1 Server-side architecture
In server-side deployment, both speech engine execution and user profile storage happen on backend servers, which is the opposite of device-side deployment. The biggest advantage of server-side deployment is that the user device only needs to perform very simple operations, such as obtaining the enrollment audio from the user and communicating with the servers. All complicated logic and resource-intensive tasks will be implemented on the servers.
The typical architecture of server-side deployment is illustrated in Fig. 4:
During the enrollment stage, the user device first uploads the enrollment audio to the backend database via the frontend reverse proxy server; next, the speech engine on the cloud computing server generates the user profile based on the enrollment audio; and finally, the user profile will be stored in the backend database. Both the enrollment audio and the user profile are stored together with the user’s unique ID.
During the runtime stage, the user device sends the runtime audio together with a set of candidate user IDs to the frontend server; the frontend will fetch the profiles for the candidate users from the backend database, and send them together with the runtime audio to the cloud computing server; finally, the speech engine on the cloud computing server will send the recognition result back to the user device.
The request and response schema for enrollment and runtime stages can be roughly described as below:
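A minimal sketch of what such a schema might look like is given here. All message and field names are hypothetical and chosen only to mirror the description above: enrollment carries a user ID plus audio, and runtime carries the audio plus candidate user IDs. The `build_response` helper and its threshold are likewise assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EnrollmentRequest:
    user_id: str                   # unique ID the audio and profile are stored under
    enrollment_audio: List[bytes]  # several encoded utterances from this user


@dataclass
class EnrollmentResponse:
    success: bool
    error_message: Optional[str] = None


@dataclass
class RuntimeRequest:
    candidate_user_ids: List[str]  # one ID for verification, several for identification
    runtime_audio: bytes


@dataclass
class RuntimeResponse:
    scores: Dict[str, float] = field(default_factory=dict)  # similarity per candidate
    recognized_user_id: Optional[str] = None  # None when no candidate is accepted


def build_response(scores: Dict[str, float], threshold: float) -> RuntimeResponse:
    """Pick the best-scoring candidate if it reaches a hypothetical threshold."""
    best = max(scores, key=scores.get, default=None)
    accepted = best if best is not None and scores[best] >= threshold else None
    return RuntimeResponse(scores=dict(scores), recognized_user_id=accepted)
```

Note that neither message in this sketch carries a model or profile version, which is where the mismatch described next can go unnoticed.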
The problem with the above architecture is obvious: during the runtime stage, if the model on the cloud computing server has been updated to a newer version, it will no longer match the user profiles stored in the database. In the remainder of this section, we will introduce three different version control strategies to handle this problem.
4.2 Single version offline updating strategy
Among all model updating strategies for server-side deployment, single version offline updating is the simplest one. Before we update the speaker recognition models in the cloud computing servers, the frontend server will first stop dispatching any new enrollment or runtime requests to the backend. Instead, the frontend will respond to the user device with a special error message, indicating that the backend servers are currently being maintained and updated, and that the user device should try again later.
Once the models in the cloud servers have been updated, a background process will be triggered to rerun the enrollment process for all users: the speech engine will process the enrollment audio for each user, generate a new user profile based on the new model, and replace the existing user profile in the database. Once this large-scale re-enrollment process has been completed, all user profiles in the database will have the same version as the models in the cloud computing servers, and the frontend can resume accepting new enrollment and runtime requests.
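The procedure above can be sketched as follows, assuming an in-memory dictionary as a stand-in for the database and a hypothetical `build_profile` callable for the updated speech engine. The error payload and all names are illustrative only.

```python
from typing import Callable, Dict, List

# Hypothetical error payload telling the device to retry later.
MAINTENANCE_ERROR = {"ok": False, "error": "UNDER_MAINTENANCE", "retry_later": True}


class Frontend:
    def __init__(self) -> None:
        self.accepting_requests = True

    def handle(self, backend_call: Callable[[], dict]) -> dict:
        # During the update no request is dispatched to the backend.
        if not self.accepting_requests:
            return dict(MAINTENANCE_ERROR)
        return backend_call()


def offline_update(
    frontend: Frontend,
    database: Dict[str, dict],
    new_version: str,
    build_profile: Callable[[List[List[float]]], List[float]],
) -> None:
    """Stop traffic, re-enroll every user with the new model, then resume."""
    frontend.accepting_requests = False
    for record in database.values():
        record["profile"] = build_profile(record["enrollment_audio"])
        record["profile_version"] = new_version
    # Reached only if every user was re-enrolled; on failure the service stays
    # in maintenance mode rather than serving mismatched profiles.
    frontend.accepting_requests = True
```

The loop over all users is the part whose cost grows with the user base, which leads to the disadvantages discussed next.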
Although this single version offline updating strategy is relatively simple and easy to implement, its disadvantages are also obvious:
It requires a downtime period for the entire speaker recognition service. If the users are geographically concentrated and the use cases are relatively simple, the update can typically be scheduled late at night in local time, when we expect very few requests. However, if the users are distributed across multiple time zones, we may expect requests to the service 24 hours a day, so the downtime will significantly degrade the user experience.
Unlike device-side deployment, where each device only stores the profiles for the owners of the device, in server-side deployment, the database needs to store the profiles of all users. For large-scale applications, the number of users could be huge, thus rerunning enrollment for all users will be a very computationally intensive task. It may not complete within the scheduled downtime.
4.3 Single version online updating strategy
To avoid the downtime issue in the single version offline updating strategy, an alternative solution is the single version online updating strategy. In this strategy, we associate each speaker recognition model with a unique version identifier string. During the enrollment stage, when we store the user profile in the database, it is stored together with the version identifier of the model that generated it. Then in the runtime stage, when the frontend server receives a new runtime request, it will first check whether the version identifier of the user profile in the database matches the version identifier of the model in the cloud computing server:
If the version identifiers match each other, the frontend server will directly trigger the runtime logic as illustrated in Fig. 4b.
If the version identifiers do not match, the frontend server will trigger another process to rerun the enrollment for the user. After the re-enrollment completes, the versions of the user profile and the model are guaranteed to match each other, and the frontend server will trigger the runtime logic.
As we can see, the single version online updating strategy postpones the re-enrollment process to an on-demand, per-request manner. This guarantees that the speaker recognition service will be available 24 hours a day without downtime.
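The per-request version check can be sketched as follows. This is a minimal sketch that assumes each database record stores the enrollment audio, the profile, and a `profile_version` string; the `ServedModel` type and its callables are hypothetical stand-ins for the speech engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServedModel:
    version: str  # unique version identifier string
    build_profile: Callable[[List[List[float]]], List[float]]
    score: Callable[[List[float], List[float]], float]


def handle_runtime(
    user_id: str,
    runtime_audio: List[float],
    database: Dict[str, dict],
    model: ServedModel,
) -> float:
    """Re-enroll on demand when the stored profile version is stale, then score."""
    record = database[user_id]
    if record["profile_version"] != model.version:
        # Version mismatch: rerun enrollment with the currently served model.
        record["profile"] = model.build_profile(record["enrollment_audio"])
        record["profile_version"] = model.version
    return model.score(record["profile"], runtime_audio)
```

Only the first request after a model update pays the re-enrollment cost; later requests find matching versions and go straight to scoring.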
However, this strategy also has one disadvantage. Once the model in the cloud computing server has been updated, the next runtime request from each user will always experience increased latency due to the re-enrollment. The size of the latency increase depends on the efficiency of the re-enrollment process. However, since model updating typically happens every few weeks or months, this increased latency is likely acceptable for most applications, because it only happens once for each user after each model update.
Additionally, for large-scale distributed systems, the single version online updating strategy has another challenge known as version bouncing. In a distributed system, there will be multiple cloud computing servers, each serving a copy of the speech engine. When we update the models for the cloud computing servers to a newer version, the update process typically will not finish at the same time on different machines. This will result in a state where some of the cloud computing servers are serving the new model version, while the others are still serving the old model version. If a user device sends runtime requests to different servers, the re-enrollment process may happen multiple times, upgrading and downgrading the profile version in turn, as illustrated in Fig. 5.
There are several methods to avoid the version bouncing problem:
The frontend server can periodically send synchronization requests to all cloud computing servers, and maintain a table to record the current model version of each cloud computing server. With this table, if a user profile has been updated, the runtime request will only be dispatched to a cloud computing server with the updated model.
The frontend server can implement a load balancing algorithm based on the hash value of the user’s ID, such that requests for each user are always dispatched to the same cloud computing server. This will guarantee that re-enrollment updates the user profile from the old version to the new version only once.
Finally, we can store multiple versions of profiles for each user in the database. Once the re-enrollment for a user has completed, we will store both the old version and the new version of this user’s profile. For future runtime requests, no matter which version of the model is served in the cloud computing server, no re-enrollment will be needed as both versions of profiles are available.
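The hash-based dispatching option can be sketched as follows. The function name and the choice of SHA-256 are assumptions made for illustration; a stable digest is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib
from typing import List


def pick_server(user_id: str, servers: List[str]) -> str:
    """Map a user ID to one server deterministically (hypothetical helper)."""
    if not servers:
        raise ValueError("no cloud computing servers available")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce modulo the server count.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

As long as the server list is unchanged, every request for a given user reaches the same server and therefore sees a single model version. With this simple modulo scheme, adding or removing servers remaps many users.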
4.4 Double version updating strategy
As we mentioned before, the single version offline updating strategy requires service downtime for each model update, and the single version online updating strategy will cause increased latency for runtime requests. Here we introduce the double version updating strategy, which will overcome these drawbacks.
In the double version updating strategy, we always serve two versions of models in the cloud computing servers at the same time, and always store two versions of user profiles in the database. During the enrollment stage, we always enroll with both models; and during the runtime stage, we use the “newest available model”, i.e. the newest served model for which the user already has a profile. The coexistence of two versions guarantees that even if we have updated one model to a newer version, the other model is still available, allowing for a grace period for the user profiles to be updated.
There are typically two ways to simultaneously serve two models in the cloud computing servers. First, we could divide the cloud computing servers into two groups, each group serving one model. The group partition is fixed, so the frontend server does not need to periodically synchronize with the cloud computing servers. Alternatively, each cloud computing server could serve two models at the same time with separate processes.
Assuming different versions of models are served in different groups of servers, we use Fig. 6 as an example to illustrate this strategy. Originally, the cloud computing servers are serving model V1 and model V2 simultaneously, and we store both user profile V1 and V2 in the database. When the development team releases a newer model V3, it will replace the group of cloud computing servers that are still serving the oldest model V1. During this process, the frontend is still handling all enrollment and runtime requests:
Enrollment requests will be dispatched to both cloud computing servers serving model V2 and V3. User profiles for both V2 and V3 will be produced and stored in the database.
For a runtime request, if the user profiles have not been updated (only V1 and V2), the request will be dispatched to a cloud computing server serving model V2. Because user profile V2 is available, the runtime recognition can be performed smoothly without additional latency (as is the case in Fig. 6). At the same time, the frontend will trigger a re-enrollment process in the background to replace user profile V1 with user profile V3.
For a runtime request, if the user profiles have already been updated to V2 and V3, the request will be dispatched to a cloud computing server serving model V3 (another case not described in Fig. 6).
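The dispatch rules above can be sketched as follows. This is a minimal sketch assuming each user's profiles are kept in a dictionary keyed by version string and that `served_versions` lists the two served models from oldest to newest; both function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Profile = List[float]


def dispatch_runtime(
    profiles: Dict[str, Profile], served_versions: List[str]
) -> Tuple[str, Optional[str]]:
    """Return (version to score with, version to re-enroll in the background)."""
    usable = [v for v in reversed(served_versions) if v in profiles]
    if not usable:
        # No stored profile matches any served model.
        raise LookupError("no stored profile matches a served model")
    newest = served_versions[-1]
    pending = None if newest in profiles else newest
    return usable[0], pending


def finish_reenrollment(
    profiles: Dict[str, Profile],
    served_versions: List[str],
    new_version: str,
    new_profile: Profile,
) -> None:
    """Store the new profile and drop profiles of models no longer served."""
    profiles[new_version] = new_profile
    for version in list(profiles):
        if version not in served_versions:
            del profiles[version]
```

For a user holding profiles V1 and V2 while V2 and V3 are served, `dispatch_runtime` selects V2 for scoring and reports V3 as pending, matching the case shown in Fig. 6.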
As we can see, in the double version updating strategy, while background processes are updating the models on cloud servers and the user profiles to the newer version, the speaker recognition service remains available without additional latency. There will usually be sufficient time to update all user profiles before the next model release. Arguably, this is the most elegant version control solution for server-side deployment. However, the implementation of the double version updating strategy is quite complicated, so it may not be the optimal solution for smaller projects with short development cycles.
5 Hybrid deployment
5.1 Hybrid architecture
In Section 3 and Section 4, we discussed the version control strategies for device-side and server-side deployment. Although device-side deployment is simple and requires no Internet communication, it is not feasible for many applications where the on-device computational resource budgets are limited. At the same time, storing user profiles on server-side databases may result in privacy concerns [de2017europe].
An alternative solution is the hybrid deployment, where the speech engine execution happens on cloud computing servers, but the user profiles are stored on user devices, as illustrated in Fig. 7:
During the enrollment stage, the user device first sends the enrollment audio to the frontend server; then the speech engine produces the user profile from the enrollment audio; finally, the frontend server will send the user profile back to the user device. Once the enrollment stage completes, the servers will immediately delete the user profile from memory; the user device is responsible for storing the user profiles.
In the runtime stage, the user device sends the runtime audio together with candidate user profiles to the frontend server; the speech engine will compare the voice identity of the runtime audio against the candidate user profiles; finally, the recognition results will be sent back to the user device. The user profiles are typically encrypted when being stored on the user device and communicated to the servers for security.
Similar to Section 4.1, we provide the rough request and response schema for enrollment and runtime stages of hybrid deployment as below:
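A minimal sketch of what such a schema might look like for hybrid deployment is given here. All names are hypothetical. Unlike the server-side case, the profile travels in the messages, and a `profile_version` field is assumed so that the server can tell which model produced each profile.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class HybridEnrollmentRequest:
    enrollment_audio: List[bytes]


@dataclass
class HybridEnrollmentResponse:
    encrypted_profile: bytes  # stored by the device, not kept by the server
    profile_version: str      # assumed field: version of the producing model


@dataclass
class CandidateProfile:
    label: str                # device-local name for the enrolled user
    encrypted_profile: bytes
    profile_version: str


@dataclass
class HybridRuntimeRequest:
    runtime_audio: bytes
    candidates: List[CandidateProfile]


@dataclass
class HybridRuntimeResponse:
    scores: Dict[str, float]
    recognized_label: Optional[str] = None


def stale_candidates(request: HybridRuntimeRequest, model_version: str) -> List[str]:
    """List the candidates whose profiles were produced by a different model."""
    return [c.label for c in request.candidates if c.profile_version != model_version]
```

A server could use a check like `stale_candidates` to decide whether a request can be scored directly or whether the device first needs updated profiles.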
The hybrid deployment is very similar to the server-side deployment, except that user profiles are stored on the user devices instead of in a backend database. Because all device-server communication can only be initiated by the user device, the servers cannot access the user profiles on demand, which poses a new challenge to the hybrid deployment.
5.2 Single version online updating strategy
For hybrid deployment, we could use a single version online updating strategy that is very similar to the strategy we introduced in Section 4.3 for server-side deployment. When the user device sends a runtime request to the frontend server, the frontend server will first check whether the user profile version matches the version of the model in the cloud computing server. If the versions do not match, it will trigger the enrollment stage to update the user profile, then perform runtime recognition after the re-enrollment completes.
Similar to server-side deployment, the single version online updating strategy will cause increased latency for the first runtime request from each user after the model has been updated. This could be mitigated by implementing a daily handshaking communication between the user device and the server, initiated by the user device. This handshaking communication will simply check whether the versions on the device and the server match; if they do not, it will silently trigger the re-enrollment in the background. The handshaking communication could happen late at night in the device’s local time zone to minimize interference with the user.
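The handshake can be sketched as follows. This sketch assumes that the device keeps its enrollment audio so that it can re-enroll without involving the user; the `DeviceState` type and the two callables standing in for network requests are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DeviceState:
    enrollment_audio: List[bytes]  # assumed to be retained on the device
    encrypted_profile: bytes
    profile_version: str


def daily_handshake(
    device: DeviceState,
    fetch_server_version: Callable[[], str],
    request_enrollment: Callable[[List[bytes]], Tuple[bytes, str]],
) -> bool:
    """Device-initiated check; returns True if a background re-enrollment ran."""
    if fetch_server_version() == device.profile_version:
        return False
    # The server reports the version it actually used, in case the served
    # model changed between the two calls.
    profile, version = request_enrollment(device.enrollment_audio)
    device.encrypted_profile = profile
    device.profile_version = version
    return True
```

Scheduling this function once per day on the device moves the re-enrollment cost away from the user's first runtime request after a model update.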
5.3 Double version updating strategy
For hybrid deployment, we could also use a double version updating strategy similar to the one introduced in Section 4.4 for server-side deployment. In this strategy, the cloud computing servers always serve two versions of models, and the user devices always store two versions of user profiles. During the enrollment stage, the server always produces two versions of user profiles and sends them back to the user device. At runtime, even if one server-side model has been updated to a newer version, the other model is still available for those devices whose user profiles have not been updated.
For hybrid deployment, even if we use the double version updating strategy, we still need to make sure that all user devices complete the update within a certain time frame. Otherwise, if a user device misses two server-side model updates, neither version of the user profiles stored on the device will be usable. One solution is to implement a periodic handshaking communication between the user device and the server, as we mentioned in Section 5.2.
In this paper, we introduced the concept of version control in speaker recognition systems. Version control is a common and challenging problem when deploying speaker recognition systems to production environments. Based on how we execute the speech engine and how we store the user profiles, we categorize speaker recognition deployment into three types: device-side deployment, server-side deployment, and hybrid deployment. We introduced version control strategies for each type of deployment, and discussed the advantages and disadvantages of each strategy.
Appendix A Glossary
Speaker embedding: A vector representing the voice characteristics of a spoken utterance.
Speaker encoder: The algorithm that generates the speaker embedding from the acoustic features of an utterance.
Speech engine: The software that implements acoustic feature extraction and speaker encoder.
User profile: The aggregated speaker embedding generated from multiple enrollment audio samples provided by the user.
Model: The model used by the speaker encoder algorithm. In deep learning based approaches, the model is usually a neural network.
Frontend: In server-side and hybrid deployment, the reverse proxy server that dispatches requests from user devices to backend servers.
Cloud server: In server-side and hybrid deployment, the backend server that runs the speech engine.
Database: In server-side deployment, the backend database that stores enrollment audio and user profiles.