Automatic Speech Recognition (ASR) plays a major role in several emerging smart applications and services. Recent studies show that ASR can be used to detect emerging medical conditions such as Parkinson’s disease, Post-Traumatic Stress Disorder (PTSD) and neurodegenerative diseases such as Alzheimer’s disease and dementia by continuously and passively observing the user’s speech. ASR is also used in the banking and financial sector for biometric verification. Moreover, several smart devices (e.g., smart TVs and speakers) now embed ASR functionality. What all of these applications and services have in common is that they require user speech features to be sent to servers (or the cloud) for classification. These speech features are fed into machine learning models and matched to a known class. Fig. 1 shows a typical ASR system. If the ASR is used for speaker verification, the user’s feature is matched against the enrolled identity of the claimed speaker.
While these technologies are very useful for healthcare monitoring, for enhancing security in banking and finance, and for improving user experience, continuously sending speech features to servers poses serious privacy threats to users. Users can be tracked by service providers, their medical conditions can be inferred and sold to insurance companies, or their speech biometrics can be stolen by adversaries. These harms are irrevocable. Within this context, this paper develops a privacy-preserving solution that redesigns the back-end of the speaker verification system so that the user’s speech features remain in the encrypted domain during transmission, storage and processing.
Speaker verification is the task of verifying a user using their voice biometrics. As shown in Fig. 1, speaker verification has two parts: 1) enrolment and 2) matching. During the first stage, the user enrols their speech biometrics via the speaker enrolment process. These enrolled biometric templates might be stored in an authentication server (which resides along with other servers). During verification, a fresh speech feature is extracted and sent to the server, which compares it against the stored template using a machine learning technique. If the comparison is successful, the authentication server allows the user to access the service.
Traditionally, the user’s speech features (or templates) are encrypted only during storage and are decrypted during the processing (i.e., verification) stage. This means that the server (or adversaries who compromise the server) can access the features and track the users. This is where the privacy risk lies, and this paper develops a novel technique that moves the back-end processing into the encrypted domain. In the proposed technique, users encrypt their speech biometric templates using their own keys prior to enrolling them in the authentication server. During the matching stage, the user again encrypts the freshly generated speech feature using their own encryption key and sends only the encrypted feature for matching. Since the server holds only encrypted features, it has to perform the matching process in the encrypted domain. Hence, this paper redesigns the back-end matching process to support encrypted-domain processing.
This can be achieved via fully homomorphic encryption (FHE) techniques. FHE, invented by Craig Gentry in 2009, supports both multiplication and addition in the encrypted domain without the need to decrypt the data. State-of-the-art FHE schemes are efficient and have been used to redesign several machine learning algorithms to process encrypted data. Therefore, if the speech features are encrypted with FHE, the server should be able to perform the verification without decrypting the feature vectors. Hence, this paper proposes a methodology that exploits the properties of FHE to develop a privacy-preserving speaker verification system in the encrypted domain.
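We use bold lower-case letters like $\mathbf{x}$ to denote column vectors; for row vectors we use the transpose $\mathbf{x}^T$. We use bold upper-case letters like $\mathbf{A}$ to denote matrices, and identify a matrix with its ordered set of column vectors. Real numbers are denoted as $\mathbb{R}$ and a real number matrix with $m \times n$ entries is denoted as $\mathbb{R}^{m \times n}$. We use $\mathbb{Z}_q$ to denote the ring of integers modulo $q$, and $\mathbb{Z}_q^{m \times n}$ to denote the set of $m \times n$ matrices with entries in $\mathbb{Z}_q$. An integer polynomial ring with degree $N$ is denoted as $\mathbb{Z}_q[X]/(X^N + 1)$, where the coefficients of the polynomial are always bounded by $q$. $\|\mathbf{x}\|$ denotes the vector norm, where $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T \mathbf{x}}$.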
I-B Paper organisation
The rest of this paper is organised as follows. State-of-the-art works related to the proposed scheme are summarised in Section II. The building blocks required for the proposed work are provided in Section III. Section IV proposes the privacy-preserving speaker verification system using CKKS homomorphic encryption. The testing environment, dataset, and parameter selection to achieve 128-bit security are provided in Section V. Experimental results and efficiency compared with the traditional scheme are given in Section VI. The security and privacy analysis is given in Section VII, followed by conclusions in Section VIII.
II Related Works
Several services now exploit unique features of speech for healthcare monitoring [2, 3, 4], for authenticating banking applications, and for smart home applications. These services need to collect and store users’ speech data over the Internet. At the same time, privacy regulations such as GDPR in Europe require organisations to provide sufficient privacy guarantees when they use, process and store customer data. Since speech data is unique and contains personal information, the privacy of voice data should be guaranteed.
To achieve this, novel techniques are required to redesign speech processing back-end systems so that privacy is protected while the utility of the data is preserved. There are several privacy-preserving techniques in the literature that transform various types of data into the encrypted domain using traditional homomorphic encryption or randomisation techniques, e.g., facial biometrics [32, 33], emotions [34, 36, 35], or voice biometrics.
In the domain of speech processing, only a few notable privacy-preserving works exist [29, 31, 37, 24]. Smaragdis and Shashanka proposed the first application of secure multi-party computation (SMC) concepts to privacy-constrained speech technology. In their work, they realised secure speech recognition using the hidden Markov model (HMM) and a generalised version of the Paillier public-key scheme, which allowed training and classification between multiple parties and achieved perfect accuracy.
Pathak et al. redesigned Gaussian Mixture Model (GMM) based speaker recognition to achieve a similar privacy goal. This work relies on homomorphic cryptosystems such as BGN and Paillier encryption, and provides a proof of concept of privacy-preserving speaker recognition without compromising accuracy. However, the shortcoming of the above cryptographic approaches is that far too much time is spent on encryption, i.e., a few minutes are required for processing.
Recently, a randomisation technique from information theory was used to develop a privacy-preserving speaker verification scheme. This work is computationally efficient and does not compromise privacy, and it is significantly more advanced than the existing solutions in terms of accuracy, privacy and speed. However, it is interactive, requires multiple rounds of computation, and cannot be used with different front-end systems. Moreover, the security of all the schemes mentioned above relies on mathematically intractable problems such as integer factorisation (schemes based on randomisation) and discrete logarithm (schemes based on homomorphic encryption). With the rise of quantum computers, the security of all these schemes might be broken soon.
In contrast to traditional partially homomorphic encryption schemes (e.g., Paillier, BGN), fully homomorphic encryption (FHE) schemes have recently shown promising results in terms of efficiency. FHE schemes resist attacks arising from quantum computers (due to lattice hard problems) and also support non-interactive computation in the encrypted domain. Some of the notable works at the intersection of FHE and machine learning are [20, 18, 19]. One work trains 30,000 logistic regression models in the encrypted domain within 20 minutes and performs encrypted-domain inference in 5 seconds using the CKKS FHE scheme. Another work, jointly done by Princeton University and Microsoft in 2016, transforms a trained Convolutional Neural Network (CNN) into a model suitable for encrypted-domain inference. It uses a simple CNN with 5 layers and a 28x28 input dimension for the MNIST dataset, and requires 400MB of bandwidth and 5 minutes to perform inference in the encrypted domain. Finally, a third work uses a novel discretisation approach to make neural networks suitable for an advanced FHE scheme. A simple neural network with 3 layers (with a hidden layer of 100 neurons) took only 1.7 seconds to perform image classification at 96% accuracy for 128-bit security. Several other works in this domain focus on redesigning traditional machine learning (mainly deep learning) algorithms to work in the FHE domain.
However, to the best of our knowledge, no FHE-based speech processing machine learning algorithm exists in the literature that achieves end-to-end privacy in real time. Within this context, we develop a novel algorithm that changes the back-end of the speaker verification system to process encrypted speech data in real time without the need for multiple rounds of communication. Moreover, the proposed algorithm supports real-time end-to-end encrypted speaker verification with negligible loss of accuracy at 128-bit security.
III Background Information
This section briefly describes the four building blocks required for the proposed algorithm.
III-A The Speaker Verification Systems
As shown in Fig. 1, speaker verification systems are composed of two components: 1) front-end and 2) back-end. The front-end focuses mainly on extracting feature vectors from speech. The back-end performs noise reduction and similarity calculation on speaker features.
The front-end extracts a number of acoustic features such as linear predictive cepstral coefficients, perceptual linear prediction coefficients, and mel-frequency cepstral coefficients. Several techniques are then used to enhance these features to obtain better verification accuracy. In 1995, Reynolds et al. applied a Gaussian Mixture Model technique based on a Universal Background Model (GMM-UBM) to these features to increase the accuracy by a significant percentage. Since then, GMM-UBM based speaker verification has been the foundation of speaker verification research. Fifteen years later, Dehak et al. proposed a groundbreaking model called i-Vector to further decrease speaker and channel variation while increasing verification accuracy. Moreover, i-Vectors have a significantly lower dimension (around 200x1) than GMM-UBM models (GMM-UBM super-vectors are around 40,000x1).
Recently, motivated by the powerful feature extraction capability of deep neural networks (DNNs), many deep learning based speaker recognition methods have been proposed [10, 8]. DNN-based schemes boost the performance of speaker verification to a new level, even in in-the-wild environments. Similar to the i-Vector, DNN-based feature extraction methods output x-Vectors and d-Vectors. The dimensions of these vectors are very similar to those of i-Vectors.
As depicted in Fig. 1, the front-end can either use GMM-UBM or DNN-UBM to obtain i-, x- or d-Vectors [10, 8]. Hence, only these features are sent to the server for enrolment and matching. This paper focuses on protecting these feature vectors stored and processed in the back-end. One of the dominant techniques used in the back-end to perform similarity calculation is Cosine distance between the enrolled (or claimed) and test feature vectors of the user [9, 11, 12].
III-B Cosine Distance Calculation
Suppose the user has enrolled a target feature vector at the server and, during verification, sends a test feature vector. The server then calculates the cosine distance between the target and test vectors as follows:
where dimension is the size of the i-, d- or x-Vectors (the is between and in the state-of-the-art works). To further reduce the channel-and speaker depended noise, a projection matrix is used as follows :
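The scoring described above can be sketched in plain (unencrypted) Python. This is a minimal illustration only: the names `w_target`, `w_test` and `A` are assumptions, and applying the projection as `A.T @ w` is one common convention that may differ from the exact form of equation (2).

```python
import numpy as np

def cosine_score(w_target, w_test, A=None):
    """Cosine similarity between two feature vectors, optionally after projection."""
    w_target = np.asarray(w_target, dtype=float)
    w_test = np.asarray(w_test, dtype=float)
    if A is not None:
        # Assumed convention: both vectors are projected with the same matrix.
        w_target = A.T @ w_target
        w_test = A.T @ w_test
    numerator = w_target @ w_test
    denominator = np.sqrt(w_target @ w_target) * np.sqrt(w_test @ w_test)
    return numerator / denominator

# Toy data standing in for i-Vectors (random values, not real speech features).
rng = np.random.default_rng(0)
enrolled = rng.normal(size=200)
fresh = enrolled + 0.1 * rng.normal(size=200)
projection = rng.normal(size=(200, 200))
print(cosine_score(enrolled, fresh))
print(cosine_score(enrolled, fresh, projection))
```

The division and square roots in the denominator are the operations that have no direct counterpart in the encrypted domain.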
III-C Fully Homomorphic Encryption
Fully Homomorphic Encryption (FHE) schemes support homomorphic properties such as addition and multiplication in the encrypted domain. To explain this briefly, consider two numbers in the plain domain and their corresponding homomorphically encrypted values in the encrypted domain. The encryption function takes a plain-domain number and the public key as inputs and outputs the corresponding encrypted value. The decryption function takes an encrypted value and the secret key as inputs and outputs the plain-domain value. Within this context, the FHE properties allow addition and multiplication to be computed in the encrypted domain without the need to decrypt the values: decrypting the sum (or product) of two encrypted values yields the sum (or product) of the underlying plain-domain numbers. Therefore, mathematical functions can be computed in the encrypted domain using only encrypted values. For example, if a cloud wants to compute a function of two inputs but holds only their encrypted values, it can exploit FHE to evaluate the function directly on the encrypted inputs. Since the cloud does not hold the secret key, the evaluated function remains in the encrypted domain.
An encryption scheme with the above FHE properties was invented by Craig Gentry in 2009. The scheme is based on lattice-based cryptography and is hence secure against attacks arising from quantum computers [15, 16, 17]. Since Gentry’s groundbreaking work, numerous improvements have been made by several researchers to improve efficiency and scalability. FHE has now reached an inflection point where several relatively complex algorithms can be evaluated in the encrypted domain in near-real time [18, 19, 20]. Single-instruction-multiple-data (SIMD) is one of the powerful techniques that has enhanced the efficiency of FHE by more than 3 orders of magnitude. While there are a handful of FHE schemes, this paper focuses on the Cheon-Kim-Kim-Song (CKKS) FHE scheme, since it is the most efficient method for performing approximate homomorphic computations over real and complex numbers.
III-D CKKS FHE Scheme
The CKKS scheme supports real numbers and SIMD operations; therefore, it is a suitable candidate for applications that rely on vectors of real numbers. CKKS works with polynomials because they provide a good trade-off between security and efficiency compared to standard computations on vectors.
Given a message, i.e., a vector of real values, it is first encoded into a plaintext integer polynomial, where $N$ denotes the degree of the polynomial. CKKS encryption then encrypts the plaintext into two ciphertext polynomials, where $q$ is the size of the ciphertext modulus. In the ciphertext domain, CKKS supports homomorphic addition, multiplication, and rotation operations. The rotation operation homomorphically performs a cyclic shift of the vector by some step. The multiplication and rotation operations in the CKKS scheme need additional corresponding evaluation keys and key-switching procedures.
Moreover, each real number is scaled by a big integer, called the scaling factor, and then rounded to an integer prior to encryption. When two values encrypted with the CKKS scheme are multiplied homomorphically, their scaling factors are also multiplied. The scaling factor should then be reduced to its original value using the rescaling operation (i.e., dividing by the scaling factor).
In CKKS, the ciphertexts are large (i.e., $q$ is big), which leads to high computational complexity. To reduce the complexity, the residue number system can be used. In the residue number system, the big integer is split into several small integers, and addition and multiplication of the original big integers are equivalent to the corresponding component-wise operations on the small integers. The number of multiplications that can be performed on a ciphertext correctly is referred to as the number of multiplication levels. For example, given four CKKS ciphertexts, multiplying two of them requires one level of multiplication and multiplying three of them in sequence requires two levels. Instead of multiplying all four in sequence via three multiplications, multiplying them in two pairs and then multiplying the two results requires only two levels of multiplication. The efficiency of an algorithm therefore depends on using circuits with small multiplicative depth.
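The four-ciphertext example can be checked with a small depth count. This is a sketch in plain Python that only counts levels; it performs no encryption, and the function names are hypothetical.

```python
def sequential_depth(n_factors):
    """Levels used when multiplying left to right: ((c1*c2)*c3)*c4 ..."""
    return max(n_factors - 1, 0)

def tree_depth(n_factors):
    """Levels used when multiplying in pairs: (c1*c2)*(c3*c4) ..."""
    depth = 0
    while n_factors > 1:
        # Each round pairs up the remaining factors and uses one level.
        n_factors = (n_factors + 1) // 2
        depth += 1
    return depth

for n in (2, 3, 4, 8):
    print(n, sequential_depth(n), tree_depth(n))
```

Pairing factors keeps the depth logarithmic in the number of factors, which is why the order of multiplications matters when the number of available levels is small.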
The security of the CKKS scheme relies on the polynomial degree $N$ and the ciphertext modulus $q$. Table I shows the parameters for achieving 128-, 192- and 256-bit security. For a given $N$, the maximum size of $q$ decreases as the security level increases. If the application requires more levels of multiplication in the ciphertext domain, it requires a larger $q$. For a given security level, the only way to increase the size of $q$ is to increase $N$, which in turn increases the computational complexity.
|N||Max. size of q (128-bit security)||Max. size of q (192-bit security)||Max. size of q (256-bit security)|
III-E Newton-Raphson Method for Inverse Square Root Calculation
While FHE computes multiplication and addition in the encrypted domain, several fundamental mathematical operations, such as finding the inverse or the square root of a number, are not directly supported. However, we can use the Newton iterative method, introduced by Isaac Newton in 1669, to calculate these in an FHE-friendly way. Since the cosine distance calculation in (2) requires an inverse square root operation, this section describes how the Newton iterative method performs this operation using only multiplication and addition.
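Let us define a function $f(y) = \frac{1}{y^2} - x$, where the root of this function gives the inverse square root of $x$, i.e., $f(y) = 0$ leads to $y = \frac{1}{\sqrt{x}}$. The Newton iterative formula for finding the root is given by the following equation:
$$y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} \qquad (3)$$
where $f'(y_n)$ is the derivative of $f(y)$ at $y_n$. Since $f'(y_n) = -\frac{2}{y_n^3}$, equation (3) can be rewritten as:
$$y_{n+1} = \frac{1}{2}\, y_n \left(3 - x\, y_n^2\right) \qquad (4)$$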
To find an inverse square root of , the equation (4) must be repetitively computed. The number of iteration required is heavily dependent on i.e., the initial value for (4). If the is bounded by and (i.e., ) then a good starting point is the average of the bounds i.e., . With this initialisation, (4) can be computed using only multiplications, hence, it is FHE friendly replacement for inverse square root operation.
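The iteration can be sketched in plain Python as follows. The update used here is the standard Newton step for the inverse square root, written with only multiplications and one subtraction; the names `x`, `y0` and `newton_inv_sqrt` are illustrative assumptions.

```python
def newton_inv_sqrt(x, y0, iterations=1):
    """Approximate 1/sqrt(x) from the initial guess y0 without division."""
    y = y0
    for _ in range(iterations):
        # Newton step for f(y) = 1/y**2 - x, using only * and -.
        y = 0.5 * y * (3.0 - x * y * y)
    return y

x = 4.0
for n in (1, 2, 3):
    approx = newton_inv_sqrt(x, y0=0.45, iterations=n)
    print(n, approx, abs(approx - x ** -0.5))
```

The closer the initial guess is to the true inverse square root, the fewer iterations are needed, which is why the choice of the starting value matters when each iteration costs multiplication levels.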
IV The Proposed Scheme
In this section, we put together all the techniques explained in Section III to develop a privacy-preserving speaker verification technique using CKKS based fully homomorphic encryption scheme. The user will be provided with a client application to extract features from their speech, generate secret and public keys required for CKKS FHE scheme, and interact with the server.
IV-A Feature Extraction
As shown in Fig. 1, the speech data can be converted into a feature vector with dimension . Regardless of the feature extraction models, the dimension of the feature vector is around . The raw speech data goes through several speech processing modules to get Mel frequency cepstral coefficients (MFCC) followed by GMM supervectors with large dimension. These high dimensional vectors can be reduced via several advanced techniques such as i-Vector models (GMM/UBM i-Vectors), d- and x- vectors via Deep Neural Networks (GMM/UBM DNN) . Since this work focuses on the back-end of the speaker verification system, we selected a computationally efficient GMM/UBM based i-Vector model for feature extraction. The proposed scheme is directly applied to any front-end feature extraction scheme that outputs a low-dimensional vector (i.e., is around ).
IV-B Key generation for CKKS FHE scheme
Key generation relies on several factors and depends on the underlying application. As shown in Table I, the high-level parameters $N$ and $q$ must be selected by considering efficiency and security. Moreover, the scaling factor and the number of multiplication levels must be set in advance. Since the application might be used by several users, the server presets these parameters so that they are common to all users. Given these global parameters, each user (i.e., the client application running on the user’s device) generates a public key and a secret key. The public key can be used for encryption, rescaling and rotation, and is sent to the server. The secret key never leaves the user’s device.
IV-C Enrolling the feature vector
Using the client application, the user can extract speech features, generate keys for encryption, and start the enrolment process. The enrolment process is simple and requires executing the following four steps:
1) Extract a speech feature vector from speech.
2) Obtain the initialisation variable (more details about this are provided in the next section).
3) Generate and store the secret-public key pair.
4) Apply CKKS encryption to obtain the encrypted vectors.
The user then sends these to the server for enrolment, together with the user ID. The server stores the data in a database against the user ID.
IV-D Speaker verification
The speaker verification part is the core contribution of this work. Similar to enrolment, the user extracts a feature vector and applies CKKS encryption to it. For the encryption, the user uses the same key that was generated during the enrolment stage. To complete the verification stage, the user sends the encrypted feature vector and the user ID to the server. The server then retrieves the stored data from the database using the user ID and evaluates (2) to obtain the verification score. The projection matrix in (2) is available to the server in the plain domain. Note that this matrix is obtained by the server during the training process and is not derived from the user’s speech data (see [9, 24] for more details).
If we closely look at verification equation (2), the server computes the numerator to get a scalar, then computes the denominator to get a scalar followed by division between these scalars. Hence, we can reformulate (2) as follows:
Since the quantity in (6) is encrypted, it is not possible to directly compute its inverse square root as required for (5). Hence, we exploit the Newton-Raphson method explained in Section III-E. The Newton-Raphson method is iterative; hence, using (4), the approximated result after the first iteration is given by:
and after second iteration:
Using (5), we can expand (9) into (10). As described in Section III-E, the initialisation variable in (10) is already supplied by the user to the server during enrolment. Hence, equation (10) can be revised as (11). As shown in (12), the server requires four multiplication levels to compute (11). Similarly, we can incorporate the second iterative result in (8), which consumes six multiplicative levels. The result of the third iteration consumes seven multiplication levels, the fourth iteration nine levels, and so on. Increasing the number of multiplication levels leads to larger parameters for CKKS encryption, which directly impacts the efficiency of the scheme. Given this context, let us focus on how the server can compute the score using only the first iteration result as shown in (12).
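The idea of replacing the division by a Newton-Raphson approximation can be sketched in the plain domain as follows. This is an illustration under assumptions rather than the exact equations (10) to (12): it assumes the quantity whose inverse square root is needed is the product of the two squared norms of the projected vectors, and the names `approx_score`, `A` and `y0` are hypothetical.

```python
import numpy as np

def approx_score(w_target, w_test, A, y0, iterations=1):
    """Cosine score with the inverse square root replaced by Newton steps."""
    p_target = A.T @ w_target   # projected enrolled vector
    p_test = A.T @ w_test       # projected test vector
    numerator = p_target @ p_test
    # Assumed form of the denominator: sqrt(x) with x a product of squared norms.
    x = (p_target @ p_target) * (p_test @ p_test)
    y = y0
    for _ in range(iterations):
        y = 0.5 * y * (3.0 - x * y * y)   # multiplication-only update
    return numerator * y

rng = np.random.default_rng(1)
w_target = rng.normal(size=16)
w_test = w_target + 0.2 * rng.normal(size=16)
A = rng.normal(size=(16, 8))

p_t, p_s = A.T @ w_target, A.T @ w_test
x = (p_t @ p_t) * (p_s @ p_s)
exact = (p_t @ p_s) / np.sqrt(x)
y0 = 1.05 / np.sqrt(x)   # a starting value 5% away from the true answer
print(exact, approx_score(w_target, w_test, A, y0))
```

Every operation in `approx_score` is an addition or a multiplication, so the same sequence of steps could in principle be carried out on encrypted values.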
The server first computes all four Level 1 multiplications shown in (12) which is a encrypted-vector-plain-matrix computation involving plain matrix . The SIMD feature in CKKS supports element-wise multiplication and addition operation. To exploit this feature, the server needs to reassemble the matrix into vectors , , , as described below. If
where diagonals of are reassembled as vectors in a cyclic manner in . Hence we can perform the vector-matrix computation as follows:
where denotes the element-wise Hadamard product operation. Since the vector is encrypted, the result of (13) is a -dimensional encrypted vector. Moreover, in (13), multiplications are element-wise hence the whole operation consumes only one CKKS multiplication level.
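The diagonal reassembly can be illustrated with NumPy, where `np.roll` stands in for the homomorphic rotation and `*` for the SIMD element-wise product. This is a plain-domain sketch that assumes a square matrix; the function names are hypothetical and no encryption is involved.

```python
import numpy as np

def cyclic_diagonals(M):
    """Return the d cyclic diagonals of a d x d matrix as vectors."""
    d = M.shape[0]
    rows = np.arange(d)
    # The i-th diagonal holds M[j, (j + i) mod d] for j = 0..d-1.
    return [M[rows, (rows + i) % d] for i in range(d)]

def diagonal_matvec(M, v):
    """Compute M @ v using only rotations, element-wise products and additions."""
    acc = np.zeros(M.shape[0])
    for i, diag in enumerate(cyclic_diagonals(M)):
        # np.roll(v, -i) places v[(j + i) mod d] in slot j, like a left rotation.
        acc += diag * np.roll(v, -i)
    return acc

rng = np.random.default_rng(2)
M = rng.normal(size=(8, 8))
v = rng.normal(size=8)
print(np.allclose(diagonal_matvec(M, v), M @ v))
```

Because each diagonal is a plain vector multiplied element-wise with a rotated copy of the input, the products all sit at the same depth and are then only added together.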
All four Level 1 computations provide encrypted vectors which will be used for Level 2 computation. At Level 2, there are four multiplications which are encrypted-vector-encrypted-vector dot product computation. To perform this dot product computation, we exploit CKKS SIMD element-wise multiplication and rotation features. Lets suppose, and , then to compute , we first perform element-wise Hadamard product using CKKS SIMD operation as follows: . To obtain the final answer, we repetitively shift the vector elements and perform addition. For example, if then we shift the vector by 2 elements and add as follows:
Then we rotate the added vector by 1 element and add it again as follows:
Now the first element of the vector contains the correct answer for the dot product computation. One condition for this repeated rotation and addition is that the vector dimension should be a power of two. This condition can easily be met by concatenating zeros at the end of the vectors. We need to perform repeated rotation and addition. Finally, the dot product computation consumes only one multiplication level, for the element-wise multiplication. The rotations and additions do not consume any multiplication level.
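The rotate-and-add procedure can be sketched with NumPy, again using `np.roll` as a stand-in for homomorphic rotation. This is a plain-domain illustration with hypothetical names; in an actual CKKS ciphertext the rotation is cyclic over the available slots rather than over the padded vector, which this sketch does not model.

```python
import numpy as np

def rotate_and_sum_dot(a, b):
    """Dot product via one element-wise product plus repeated rotate-and-add."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = 1
    while d < len(a):          # pad the length up to the next power of two
        d *= 2
    prod = np.zeros(d)
    prod[: len(a)] = a * b     # the only multiplication in the procedure
    shift = d // 2
    while shift >= 1:
        prod = prod + np.roll(prod, -shift)   # rotate left by `shift`, then add
        shift //= 2
    return prod[0]             # the first slot now holds the full sum

print(rotate_and_sum_dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

For a padded length that is a power of two, the loop runs once per halving of the shift, so the number of rotations grows only logarithmically with the vector dimension.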
The Level 2 computation described above produces an encrypted CKKS scalar (not vector). Now for the Level 3 and Level 4, we only need to perform encrypted-scalar-encrypted-scalar multiplication which is straightforward to compute. Using these computations, we can obtain the approximated score in encrypted domain.
This encrypted score is then sent to the client application. The client application decrypts it and checks whether the score is above the threshold to authenticate the user. While this approach can be used for different applications (e.g., if the underlying application measures a medical condition, the score will represent its severity), this paper considers only the speaker verification use case.
V Parameter Selection and Performance Analysis
This section describes the dataset used for the experiments, results and the complexity, security, and privacy analysis of the proposed algorithm.
V-A Parameter Selection
We start by selecting parameters for the CKKS encryption. In this experiment, we stick with 128-bit security. We select three sets of parameters as shown in Table II. Set I considers the smallest possible $N$ suitable for the application. Since $q$ is limited to 218 bits when $N = 8192$, the maximum number of multiplication levels is limited to 4 without losing a lot of accuracy. Therefore, we use only one Newton-Raphson iteration to find the inverse square root. If we set the base prime size and the special prime size to 41 bits each, i.e., $41 + 41 = 82$ bits, then we are left with 136 bits. We split this into four 34-bit primes required for the four multiplication levels. Since we are using all the available bits, the security of Set I is 128-bit.
|Parameter||Set I||Set II||Set III|
|No. of multiplication levels||4||4||6|
|Used size of q (bits)||218||280||360|
Set II and Set III use a higher-order polynomial with degree $N = 16384$. This supports a maximum size of $q$ of 438 bits, which gives a lot of flexibility in prime sizes and multiplication levels. Set II considers only one Newton-Raphson iteration; hence four multiplication levels are required. We set large bit sizes for the base prime, special prime and scales (60, 60, and 40, respectively), totalling only 280 bits, which is smaller than the allowable 438 bits. Therefore the security of Set II is higher than 128-bit.
To increase the accuracy of the inverse square root of an encrypted number, we need to use the result of the second Newton-Raphson iteration, which requires 6 multiplication levels. The parameters for this are shown as Set III in Table II. Similar to Set II, the security of Set III is higher than 128-bit.
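The bit budgets of the three parameter sets can be checked with a few lines of Python. The helper `modulus_chain` and the ordering of the primes in the list are assumptions made for illustration; the bit sizes, level counts and the 218-bit and 438-bit limits are the values given in the text. In libraries such as TenSEAL, a list of this shape is typically what is passed as the coefficient modulus bit sizes.

```python
def modulus_chain(base_bits, special_bits, scale_bits, levels):
    """Bit sizes of the primes: base prime, one prime per level, special prime."""
    return [base_bits] + [scale_bits] * levels + [special_bits]

# (chain, maximum size of q in bits at 128-bit security, as stated in the text)
PARAMETER_SETS = {
    "Set I": (modulus_chain(41, 41, 34, 4), 218),
    "Set II": (modulus_chain(60, 60, 40, 4), 438),
    "Set III": (modulus_chain(60, 60, 40, 6), 438),
}

for name, (chain, max_bits) in PARAMETER_SETS.items():
    used = sum(chain)
    # A set stays within the 128-bit budget if it uses no more than max_bits.
    print(name, chain, used, used <= max_bits)
```

Set I uses its whole budget, while Sets II and III leave headroom, which matches the statement that their security is higher than 128-bit.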
V-B The dataset
The TIMIT speech corpus has been used to evaluate the accuracy and reliability of the proposed algorithm. The TIMIT speech corpus contains broadband recordings (each lasting around 3 seconds) of 630 speakers of eight major dialects of American English. Each speaker has 10 speech samples. Out of the 10 samples, 8 were used to extract the feature vector for enrolment. We use the GMM/UBM based i-Vector for the experiments. However, as described earlier, the proposed model can also be used for DNN/UBM based x- or d-Vector speaker verification systems.
For experiment, we follow the same approach used in  as a baseline. In , the TIMIT data corpus has been split into two: 1) the first two dialect regions with speakers are used for testing and 2) the last four dialect region with speakers were used to build background model. Table III shows the statistics of the TIMIT dataset.
Since 8 speech samples from each speaker are used for enrolling the user in the server, the remaining 2 samples per user are used for verification. Initially, we perform the following two baseline tests in the plain domain using (2):
V-B1 Genuine Attempts: Client-Client
In this test, for each speaker, the score is calculated using the speaker’s enrolled data against the speaker’s two test utterances. Hence, the scores for tests are obtained using (2).
V-B2 Imposter Attempts: Imposter-Client
In this test, each speaker’s test utterances are tested against other users’ enrolled feature vectors. This leads to tests and the score for each test has been obtained using (2).
Before we present the results, let us define False Acceptance Rate (FAR), False Rejection Rate (FRR) and Accuracy.
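$\mathrm{FAR} = \frac{\text{number of false acceptances}}{\text{number of imposter attempts}}$,
$\mathrm{FRR} = \frac{\text{number of false rejections}}{\text{number of genuine attempts}}$,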
where FAR and FRR are the two types of errors and False Acceptance means the system grants access to an impostor, and False Rejection means the system denies access to an enrolled speaker. From FRR and FAR, we can get Equal Error Rate (EER). EER represents the operating point where the FAR is equal to the FRR.
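These error rates can be computed from two lists of scores as sketched below. The sketch assumes that an attempt is accepted when its score is at or above the threshold, and it approximates the EER by scanning the observed scores for the threshold at which FAR and FRR are closest; the function names and the toy scores are hypothetical.

```python
import numpy as np

def far_frr(genuine, imposter, threshold):
    """FAR and FRR at a given threshold (accept when score >= threshold)."""
    genuine = np.asarray(genuine, dtype=float)
    imposter = np.asarray(imposter, dtype=float)
    far = np.mean(imposter >= threshold)   # imposters wrongly accepted
    frr = np.mean(genuine < threshold)     # genuine users wrongly rejected
    return far, frr

def equal_error_rate(genuine, imposter):
    """Return (EER estimate, threshold) where FAR and FRR are closest."""
    candidates = np.sort(np.concatenate([genuine, imposter]))
    best = None
    for t in candidates:
        far, frr = far_frr(genuine, imposter, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]

genuine_scores = [0.9, 0.8, 0.7, 0.4]
imposter_scores = [0.1, 0.2, 0.3, 0.6]
print(far_frr(genuine_scores, imposter_scores, 0.5))
print(equal_error_rate(genuine_scores, imposter_scores))
```

Because FAR and FRR are each normalised by their own number of attempts, the EER is not dominated by the much larger number of imposter attempts.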
Using these definition, we can present the baseline results as shown in Fig. 2. Since number of imposter attempts are significantly higher than the genuine attempts, the Accuracy curve in Fig. 2 might be misleading (i.e., it’s approaching as it rejects large number of imposter attempts). Hence, we will stick to EER to compare the performance. The EER of the baseline model is around when the threshold is . In the following section, we analyse the proposed scheme.
VI Experimental Results
We implement the proposed algorithm in Python using the TenSEAL library to interact with the C++ SEAL FHE library (https://github.com/Microsoft/SEAL). The source code of our implementation can be found at https://github.com/rahulay1/iVectorTenSEAL/tree/master. We essentially repeat the same steps that we used to evaluate the baseline model. We tested all three sets of CKKS parameters shown in Table II. We compare the time requirements using a high-end and a medium-end laptop. For the high end, we use a Razor laptop with 16GB RAM and 6 cores (12 CPUs) with up to 4.1GHz speed; this can be treated as the server. For the medium end, we use a MacBook Pro laptop with 8GB RAM and 2 cores (4 CPUs) with up to 2.5GHz speed. The specification of the medium-end laptop is comparable to that of medium-end smartphones (e.g., Samsung Galaxy A Series phones); hence, it can be considered for running the client application.
VI-A Initialisation of Newton-Raphson parameter
Before we start the experiment, its very important to initialise the variable in (4) for Newton-Rapshon method. As explained in Section III-E, if is closer to the actual inverse-square root, then the convergence is much faster. Therefore, finding the distribution of within this context is important. According to (6), and we can obtain the distribution of this value using the TIMIT dataset. Using all 630 speakers we could obtain more than 0.7 million sample values for i.e.,(). Using these samples, we plot the distribution of in Fig. 3. From this, we can clearly see that should be initialised between 400 and 900. Instead of initialising the average of 400 and 900, we initialised as as bigger chunk of data is around .
Now we calculate the inverse square root of using the iterative approach and compare it with the actual answer in the same figure (Fig. 3). The two convex curves show the relative error percentage of the iterative approach compared to the actual value (i.e., ) for the first and second iterations. While there are no significant differences between the first and second iterations, the error is less than for the bulk of . Therefore, we can safely use the Newton-Raphson method to compute the inverse square root of an encrypted number as we proposed. We will evaluate the loss of EER due to this approximation in the next section.
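The effect of the starting value on the iterative inverse square root can be sketched in plaintext as follows. This is an illustrative sketch only: it uses the textbook Newton-Raphson update for 1/sqrt(x), which may be parameterised differently from (4), and the centre value 650 used to derive the starting point is a hypothetical choice, not the initialisation used in the experiments. The update needs only additions and multiplications, which is what makes it suitable for evaluation on CKKS ciphertexts.

```python
import math


def inv_sqrt_newton(x, y0, iterations=2):
    """Approximate 1/sqrt(x) with the Newton-Raphson update
    y <- 0.5 * y * (3 - x * y^2), which uses only + and *."""
    y = y0
    for _ in range(iterations):
        y = 0.5 * y * (3.0 - x * y * y)
    return y


def relative_error_percent(x, y0, iterations):
    """Relative error (in percent) of the iterative result against 1/sqrt(x)."""
    exact = 1.0 / math.sqrt(x)
    return abs(inv_sqrt_newton(x, y0, iterations) - exact) / exact * 100.0


# Hypothetical centre of the input range; the starting value is fixed in
# advance because the encrypted input cannot be inspected.
X_CENTRE = 650.0
Y0 = 1.0 / math.sqrt(X_CENTRE)

for x in (400.0, 650.0, 900.0):
    err1 = relative_error_percent(x, Y0, 1)
    err2 = relative_error_percent(x, Y0, 2)
    print(f"x={x:6.1f}  error after 1 iteration: {err1:.4f}%  after 2: {err2:.4f}%")
```

Inputs near the value used to derive the starting point converge almost immediately, while inputs at the edges of the range benefit most from the second iteration.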
VI-B EER loss comparison of the proposed scheme against baseline approach
Before experimenting with the encrypted speaker verification algorithm proposed in Section IV-D, we need to check the loss of EER when we replace the actual inverse square-root function with the Newton-Raphson iterative function. The result of this experiment is presented in Fig. 4 (see the first and third bars in Fig. 4). Using 2 iterations of the Newton-Raphson method leads to loss in EER while 1 iteration leads to loss of EER compared to the baseline EER. Thanks to the careful selection of the initialisation value, these EER losses are negligible.
Now we can compare the results of the proposed encrypted speaker verification system. These results are depicted in the remaining three bars of Fig. 4 (second, fourth and fifth bars). These bars correspond to the CKKS parameters in Set-III, Set-II, and Set-I in Table II, respectively. With Set-III parameters ( with 2 iterations), the loss of EER compared to the baseline approach is around . For the other two sets, the EER losses are and , respectively. The main reason for this is the approximation scaling factor of the CKKS scheme. Since Set-I uses small compared to Set-II, the experiment based on Set-I loses more precision in the underlying values and hence loses more in EER than Set-II. Nevertheless, the loss in EER is not significant when we consider that the time required for this verification is near real-time, as discussed below.
VI-C Computational time and processing requirements
One of the challenges that hinders the adoption of FHE in real applications is its ability to perform computation in real time. As we discussed in the literature review section, FHE has reached an inflection point where real-time applications can be implemented fully using FHE schemes. The results of the proposed scheme also support this statement. For all three sets of CKKS parameters in Table II, we measured the time required to complete key generation, enrolment, verification and decryption. These results are presented in Table IV. Table IV depicts results for four sets of experiments for each CKKS parameter set, totalling 12 experiments.
For each CKKS parameter set, the experiment was conducted on the high-end (C1) and medium-end (C2) laptops for two different feature dimensions ( and ). From Table IV, we can observe that a significant amount of time is spent on generating the public and secret keys. However, this is a one-time effort and can be done offline. The other three operations impact the real-time performance. The time required to encrypt a speech template (noted as Enrol) is between 11ms and 55ms. As expected, the most time-consuming operation is verification. For Set-I CKKS parameters, the verification can be done within 1.3 seconds (the EER loss of this set is ). For Set-III, while the EER loss is limited to , the time required to perform verification on the C1 laptop is around 7 seconds, which may be suitable for near-real-time applications. The most efficient operation is decryption, which requires between 1ms and 12ms for all 12 experiments.
If we consider a typical scenario where the user has medium-end hardware and the server uses high-end hardware, then the total delay due to the FHE scheme could be 11ms + 1.2 seconds + 2ms, i.e., roughly 1.3 seconds, which is suitable for many real-time applications such as mobile banking and healthcare monitoring.
When it comes to processing power, key generation, encryption and decryption do not require much CPU power (refer to the screenshot in Fig. 6). However, almost of the available processing power is consumed by the verification. Since the verification involves several vector dot products, these can be highly parallelised to exploit all the CPUs. As shown in Fig. 6, all 12 CPUs in C1 are used to complete the verification. Since C2 has only four CPUs, its performance is almost 3 times slower than C1 (refer to Table IV).
VI-D Analysis of storage, memory and bandwidth requirements
In the proposed scheme, both the client and the server need storage, memory and communication bandwidth to exchange data. The client on the user device must keep the secret key and share the public key with the server, which keeps the public key for verification. Moreover, the client needs to share the encrypted templates with the server during enrolment and verification. Hence, the server needs more storage to keep the encrypted templates. Fig. 5 shows the storage requirements for several of the components discussed above. The storage required for the secret key and the templates is less than 5MB for both polynomial degrees considered for the CKKS scheme. However, a higher-order polynomial requires more storage. Since these polynomials contain several slots for the input vector, there is no difference in storage when the feature dimension increases from 100 to 1000. The dominant element that requires a large amount of storage is the public key (110MB for and 0.75GB for ). Since this key must be communicated to the server, we also need relatively high bandwidth during the enrolment process. The main reason for the size of the public key is that it contains several keys for the rescaling and rotation operations in the encrypted domain.
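The observation that the template size does not change between feature dimensions 100 and 1000 follows from CKKS slot packing, which can be sketched as follows. A CKKS ciphertext with polynomial modulus degree N offers N/2 slots, and its size is set by N and the coefficient modulus rather than by how many slots are occupied. The degrees 8192 and 16384 below are illustrative placeholders and are not necessarily the values in Table II; the function names are hypothetical.

```python
import math


def ckks_slot_count(poly_modulus_degree):
    """Number of values one CKKS ciphertext can pack (N / 2)."""
    return poly_modulus_degree // 2


def ciphertexts_needed(feature_dim, poly_modulus_degree):
    """Ciphertexts required to hold one feature vector of the given dimension."""
    return math.ceil(feature_dim / ckks_slot_count(poly_modulus_degree))


# Illustrative polynomial modulus degrees (assumptions, not taken from Table II).
for degree in (8192, 16384):
    for dim in (100, 1000):
        print(f"N={degree:5d}  dimension={dim:4d}  "
              f"slots={ckks_slot_count(degree)}  "
              f"ciphertexts={ciphertexts_needed(dim, degree)}")
```

Under these assumptions both feature dimensions fit in a single ciphertext, so the stored template occupies the same space in either case.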
The usage of RAM is shown in Fig. 6. When the process starts, only about 20% of the total available memory is used, in contrast to the usage of the CPUs. Hence, the efficiency of the proposed algorithms (mainly the use of FHE) is not limited by the available memory, and medium-end devices with up to 4GB of RAM are sufficient to run the client.
VII Privacy and Security Analysis
This section analyses the privacy of the stored speech features followed by the security of the whole system.
VII-A Privacy Analysis
The aim of the proposed algorithm is to stop the server from learning the result of the inference. The proposed algorithm exploits the CKKS FHE scheme, where the public key is used for encryption, rescaling and rotation operations while the secret key is used for decryption. The proposed scheme requires only the public keys to be sent to the server, and the secret key never leaves the user’s device. Therefore, the server cannot obtain the inference results.
Another potential privacy vulnerability is the identity linkage attack: if users enrol their biometrics with multiple services, the service providers might collude and profile the users using the similarities of the speech features. However, this attack is not possible because CKKS encryption is probabilistic; hence, the server cannot distinguish between encrypted messages even if they contain the same message and are encrypted using the same keys. As long as the users’ secret keys are protected, it is infeasible for rogue service providers to profile the users.
VII-B Security Analysis
While it is infeasible to decrypt a CKKS ciphertext without the secret keys, there might be other ways in which the system can be compromised. For example, the attacker might have stolen the user device with the secret keys, compromised the encrypted templates stored on the server, or obtained a speech recording of the user. In this section, we investigate each of these scenarios and show how the proposed scheme mitigates the security vulnerabilities.
VII-B1 Compromised user device attacks
In this attack, the adversary has access to the user device and the CKKS parameters stored during enrolment, but does not have access to the user’s speech to generate a legitimate speech feature. Hence, the adversary tries to combine the parameters from the compromised user device with the features of other users, and then tries to verify against the compromised user’s enrolled template residing at the server. To evaluate this, 2 × 150 × 151 tests are conducted (300 test utterances from other users are combined with the parameters of the compromised user device, and this is repeated for all the users) and the corresponding decision scores are obtained. Essentially, the result of this experiment is already presented by the EER loss comparison in Fig. 4. The loss of EER means that the FAR curve in Fig. 2 is shifted by leading to accepting 28 more false claims for every 1000 impostor attempts. However, this can be reduced by using a small threshold for verification which will impact the FRR.
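The trial count and the false-acceptance figure quoted above can be checked with a few lines of arithmetic. This is only a worked illustration: the variable names are hypothetical, and the last quantity is an extrapolation of the stated rate of 28 per 1000 to the full trial list, not a number reported in the experiments.

```python
# Numbers quoted in the text: 2 utterances from each of 150 other users,
# repeated for each of the 151 users acting as the compromised target.
utterances_per_other_user = 2
other_users_per_target = 150
target_users = 151

impostor_trials = utterances_per_other_user * other_users_per_target * target_users
print("impostor trials:", impostor_trials)

# Stated effect of the EER loss: 28 more false accepts per 1000 impostor attempts.
extra_false_accept_rate = 28 / 1000
print("shift in FAR (percent):", extra_false_accept_rate * 100)

# Illustrative extrapolation of that rate to the whole trial list.
extra_false_accepts = impostor_trials * extra_false_accept_rate
print("extra false accepts over all trials (approx.):", round(extra_false_accepts))
```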
The adversary may also generate a completely random feature vector or a patterned feature vector to maximise the score. For the patterned feature vector, we generated a vector with ones. These artificial feature vectors can be encrypted and decrypted using the stolen credentials and used to conduct speaker verification. The result of this experiment is shown in Fig. 7. The FAR of these attacks is lower than that of the baseline approach; hence, the adversary is worse off with these attacks.
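How such artificial vectors would be scored can be sketched in plaintext as follows. The sketch assumes that the decision score is a cosine similarity between the enrolled template and the test vector; the scoring function actually used is defined earlier in the paper and is not reproduced here. The dimension, the Gaussian template and the interpretation of the patterned vector as an all-ones vector are hypothetical choices made only for illustration.

```python
import math
import random


def cosine_score(template, test):
    """Cosine similarity between an enrolled template and a test vector."""
    dot = sum(a * b for a, b in zip(template, test))
    norm_template = math.sqrt(sum(a * a for a in template))
    norm_test = math.sqrt(sum(b * b for b in test))
    return dot / (norm_template * norm_test)


rng = random.Random(0)   # fixed seed so the sketch is repeatable
DIM = 100                # hypothetical feature dimension

# Stand-in for an enrolled speaker template (not real speech features).
template = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

# Two artificial attack vectors: completely random and all ones.
random_attack = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
ones_attack = [1.0] * DIM

self_score = cosine_score(template, template)
random_score = cosine_score(template, random_attack)
ones_score = cosine_score(template, ones_attack)

print("template vs itself:      ", self_score)
print("template vs random:      ", random_score)
print("template vs all-ones:    ", ones_score)
```

Comparing the attack scores with the decision threshold over many enrolled templates is what yields an attack FAR of the kind plotted in Fig. 7.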
VII-B2 Compromised server attacks
In this attack, the adversary has access to the enrolled encrypted data and the public keys of all the users stored at the server. Hence, the adversary might attempt to modify the encrypted templates using FHE properties, or might use those encrypted templates during the verification process. However, none of these attacks will succeed without the secret keys. Moreover, if there is a compromise, the users can re-enrol using a different set of public and secret keys. Since the compromised speech vectors are encrypted, they can be revoked (similar to passwords) even though the underlying data is biometric and unique to the user.
VII-B3 Compromised user voice attacks
In this attack, the attacker has access to the user’s voice recording but does not have access to the parameters stored on the user device. The attacker therefore generates random public and secret key pairs and tries to impersonate the user. The success of this attack is equivalent to breaking the CKKS FHE scheme; hence, this attack is also infeasible.
VIII Conclusion
This paper presents a novel algorithm to process encrypted speech features using fully homomorphic encryption, suitable for real-time speaker verification systems. The proposed algorithm exploits fully homomorphic encryption for arithmetic of approximate numbers (the CKKS scheme) to achieve 128-bit security against classical and quantum computers. To measure the performance, a well-known speech corpus was used to conduct rigorous experiments. The end-to-end encrypted privacy-preserving scheme requires only 1.3 seconds to complete the verification in the FHE domain. In terms of equal error rate, the proposed scheme is off by only 2.8%. The privacy analysis shows that the proposed scheme mitigates privacy vulnerabilities such as tracking and profiling that exist in traditional systems. Moreover, the proposed scheme is secure and the system cannot be exploited without access to the secret keys.
References
- J. Rusz et al., “Smartphone allows capture of speech abnormalities associated with high risk of developing Parkinson’s disease,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 26, no. 8, pp. 1495–1507, Aug. 2018.
- R. Xu et al., “A voice-based automated system for PTSD screening and monitoring,” in Proc. of Med. Meets Virtual R XII, pp. 552–558, 2012.
- Y. Yamada, K. Shinkawa, and K. Shimmei, “Atypical repetition in daily conversation on different days for detecting Alzheimer disease: Evaluation of phone-call data from a regular monitoring service,” JMIR Ment. Health, vol. 7, no. 1, Art. no. e16790, Jan. 2020.
- D. Shibata, S. Wakamiya, K. Ito, M. Miyabe, A. Kinoshita, and E. Aramaki, “Vocabchecker: Measuring language abilities for detecting early stage Dementia,” in Proc. Int. Conf. Intell. User Interfaces Companion, pp. 1–2, 2018.
- A. K. Jain and K. Nandakumar, “Biometric Authentication: System Security and User Privacy,” Computer, vol. 45, no. 11, pp. 87–92, 2012.
- D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, Jan. 1995.
- Z. Bai and X. L. Zhang, “Speaker recognition based on deep learning: An overview,” Neural Networks, vol. 140, pp. 65–99, 2021.
- N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
- Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in IEEE Int’l Conf. Acoustics, Speech and Signal Processing (ICASSP), pp. 1695–1699, May 2014.
- D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in INTERSPEECH, pp. 999–1003, 2017.
- E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in IEEE Int’l Conf. Acoustics, Speech and Signal Processing, pp. 4052–4056, 2014.
- I. Newton, Methodus Fluxionem et Serierum Infinit, 1966.
- C. Gentry, “A fully homomorphic encryption scheme,” Ph.D. thesis, Stanford University, https://crypto.stanford.edu/craig, 2009.
- M. Ajtai, “Generating hard instances of lattice problems,” in Proc. of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pp. 99–108, July 1996.
- O. Regev, “On lattices, learning with errors, random linear codes, and cryptography,” Journal of the ACM (JACM), vol. 56, no. 6, p. 34, 2009.
- Y. Rahulamathavan, S. Dogan, X. Shi, R. Lu, M. Rajarajan, and A. Kondoz, “Scalar product lattice computation for efficient privacy-preserving systems,” IEEE Internet of Things Journal, vol. 8, no. 3, pp. 1417–1427, 2020.
- N. Dowlin et al., “Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy,” in Int’l Conf. Machine Learning, 2016.
- F. Bourse, M. Minelli, M. Minihold, and P. Paillier, “Fast homomorphic evaluation of deep discretized neural networks,” in Annual Int’l Cryptology Conf., pp. 483–512, Springer, Cham, Aug. 2018.
- F. Bergamaschi et al., “Homomorphic Training of 30,000 Logistic Regression Models,” in International Conference on Applied Cryptography and Network Security, Springer, Cham, 2019.
- N. P. Smart and F. Vercauteren, “Fully homomorphic SIMD operations,” Designs, Codes and Cryptography, vol. 71, no. 1, pp. 57–81, 2014.
- J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic encryption for arithmetic of approximate numbers,” in ASIACRYPT’17, pp. 409–437, 2017.
- J. Garofolo et al., “TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1,” Web Download, Philadelphia: Linguistic Data Consortium, 1993.
- Y. Rahulamathavan, K. R. Sutharsini, I. G. Ray, R. Lu, and M. Rajarajan, “Privacy-Preserving iVector-Based Speaker Verification,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 3, pp. 496–506, 2019.
- A. Benaissa, B. Retiat, B. Cebere, and A. E. Belfedhal, “TenSEAL: A Library for Encrypted Tensor Operations Using Homomorphic Encryption,” Int’l Conf. Learning Representations, Workshop on Distributed and Private Machine Learning, 2021.
- P. Kornerup and J. M. Muller, “Choosing starting values for certain Newton–Raphson iterations,” Theoretical Computer Science, vol. 351, no. 1, pp. 101–110, 2006.
- A. Phipps, K. Ouazzane, and V. Vassilev, “Your password is music to my ears: cloud-based authentication using sound,” 11th Int’l Conf. Cloud Computing, 2021.
- H. Isyanto, A. Arifin, and M. Suryanegara, “Design and implementation of IoT-based smart home voice commands for disabled people using Google Assistant,” in IEEE Int’l Conf. Smart Technology and Applications (ICoSTA), pp. 1–6, Feb. 2020.
- P. Smaragdis and M. V. S. Shashanka, “A Framework for Secure Speech Recognition,” in IEEE Int’l Conf. Acoustics, Speech and Signal Processing, vol. 4, pp. IV-969–IV-972, 15–20 April 2007.
- O. Goldreich, “Secure multiparty computation,” (working draft), available: http://www.wisdom.wei zmann.ac.il/ oded/pp.html, Sep. 1998.
- M. Pathak and B. Raj, “Privacy-Preserving Speaker Verification and Identification Using Gaussian Mixture Models,” IEEE Trans. Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 397–406, Feb. 2013.
- Z. Erkin, M. Franz, J. Guajardo, S. Katzenbeisser, I. Lagendijk, and T. Toft, “Privacy-preserving face recognition,” in Proc. 9th International Symposium on Privacy Enhancing Technologies, PETS ’09, pp. 235–253, 2009.
- Y. Rahulamathavan and M. Rajarajan, “Efficient privacy-preserving facial expression classification,” IEEE Transactions on Dependable and Secure Computing, vol. 14, no. 3, pp. 326–338, 2015.
- Y. Rahulamathavan, R. Phan, S. Veluru, K. Cumanan, and M. Rajarajan, “Privacy-preserving multi-class support vector machine for outsourcing the data classification in cloud,” IEEE Trans. Dependable Secure Computing, vol. 11, no. 5, pp. 467–479, Sept. 2014.
- Y. Rahulamathavan, S. Veluru, R. Phan, J. Chambers, and M. Rajarajan, “Privacy-preserving clinical decision support system using gaussian kernel based classification,” IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 1, pp. 56–66, Jan. 2014.
- Y. Rahulamathavan, R. Phan, J. Chambers, and D. Parish, “Facial expression recognition in the encrypted domain based on local fisher discriminant analysis,” IEEE Trans. Affective Computing, vol. 4, no. 1, pp. 83–92, Jan.-Mar. 2012.
- J. Portêlo, B. Raj, A. Abad, and I. Trancoso, “Privacy-preserving speaker verification using secure binary embeddings,” Information and Communication Technology, Electronics and Microelectronics (MIPRO), 37th International Convention on, Opatija, pp. 1268–1272, 2014.
- O. Regev, “On Lattices, Learning with Errors, Random Linear Codes, and Cryptography,” in Proc. 37th ACM Symp. on Theory of Computing (STOC), pp. 84–93, 2005.