I. Introduction
Automatic Speech Recognition (ASR) plays a major role in several emerging smart applications and services. Recent studies show that ASR can be used to detect emerging medical conditions such as Parkinson’s disease [2], Post-Traumatic Stress Disorder (PTSD) [3] and neurodegenerative diseases such as Alzheimer’s disease [4] and dementia [5] by continuously and passively observing the user’s speech. ASR is also used in the banking and financial sectors for biometric verification purposes [6]. Moreover, several smart devices (e.g., smart TVs and speakers) are now embedded with ASR functionality. The commonality across all of these applications and services is that they all require user speech features to be sent to servers (or the cloud) for classification purposes. These speech features are fed into machine learning models and matched to a known class. Fig. 1 shows a typical ASR system. If the ASR is used for speaker verification, then the user’s feature vector is matched against the enrolled identity of the claimed speaker.
While these technologies are very useful for healthcare monitoring, for enhancing security in banking and finance, and for improving user experience, continuously sending speech features to servers poses serious privacy threats to the users. The users can be tracked by the service providers, their medical conditions can be inferred and sold to insurance companies, or their speech biometrics can be stolen by adversaries. These are irrevocable problems. Within this context, this paper develops a privacy-preserving solution that redesigns the backend of the speaker verification system so that the user’s speech features remain in the encrypted domain during transmission, storage and processing.
Speaker verification is the task of verifying a user using their voice biometrics. As shown in Fig. 1, speaker verification has two parts: 1) enrolment and 2) matching. During the first stage, the user enrols their speech biometrics via the speaker enrolment process. These enrolled biometric templates are typically stored in an authentication server (which resides alongside other servers). During verification, a fresh speech feature is extracted and sent to the server, which performs a comparison against the stored template using a machine learning technique. If the comparison is successful, the authentication server allows the user to access the service.
Traditionally, the user’s speech features (or templates) are encrypted only during storage and decrypted during the processing (i.e., verification) stage. This means that the server (or adversaries who compromise the server) can access the features and track the users. This is where the privacy risk arises, and this paper develops a novel technique that transforms the backend processing into the encrypted domain. This paper proposes a technique where users encrypt their speech biometric templates using their own keys prior to enrolling them in the authentication server. During the matching stage, the user again encrypts the freshly generated speech feature using their own encryption key and sends only the encrypted feature for matching. Since the server holds only the encrypted features, it has to perform the matching process in the encrypted domain. Hence, this paper redesigns the backend matching process to support encrypted-domain processing.
This can be achieved via fully homomorphic encryption (FHE) techniques. FHE was invented by Craig Gentry in 2009 [14] and enables both multiplication and addition in the encrypted domain without the need to decrypt the data. The state-of-the-art FHE schemes are efficient and have been used to redesign several machine learning algorithms to process encrypted data. Therefore, if the speech features are encrypted with FHE, the server should be able to perform the verification without the need to decrypt the feature vectors. Hence, this paper proposes a methodology that exploits the properties of an FHE scheme to develop a privacy-preserving speaker verification system in the encrypted domain.
I-A. Notations
We use bold lowercase letters like $\mathbf{a}$ to denote column vectors; for row vectors we use the transpose $\mathbf{a}^T$. We use bold uppercase letters like $\mathbf{A}$ to denote matrices, and identify a matrix with its ordered set of column vectors. Real numbers are denoted as $\mathbb{R}$, and an $n \times m$ real matrix is denoted as $\mathbb{R}^{n \times m}$. We use $\mathbb{Z}_q$ to denote the ring of integers modulo $q$, and $\mathbb{Z}_q^{n \times m}$ to denote the set of $n \times m$ matrices with entries in $\mathbb{Z}_q$. An integer polynomial ring with degree $N$ is denoted as $R_q = \mathbb{Z}_q[X]/(X^N + 1)$, where the coefficients of the polynomials are always bounded by $q$. $\|\mathbf{a}\|_p$ denotes the $\ell_p$ vector norm of $\mathbf{a}$.
I-B. Paper Organisation
The rest of this paper is organised as follows: state-of-the-art works related to the proposed scheme are summarised in Section II. The building blocks required for the proposed work are provided in Section III. Section IV proposes the privacy-preserving speaker verification system using CKKS homomorphic encryption. The testing environment, dataset, and parameter selection to achieve 128-bit security are provided in Section V. Experimental results and efficiency compared with the traditional scheme are given in Section VI. The security and privacy analysis is given in Section VII, followed by conclusions in Section VIII.
II. Related Works
Several services are now exploiting unique features of speech for healthcare monitoring [2, 3, 4], authenticating banking applications [27], and smart home applications [28]. These services need to collect and store users’ speech data over the Internet. At the same time, privacy regulations like the GDPR in Europe require organisations to provide sufficient privacy guarantees when they use, process and store customer data. Since speech data is unique and contains personal information, the privacy of the voice data should be guaranteed.
To achieve this, we require novel techniques to redesign speech processing backend systems to protect privacy while ensuring the utility of the data. There are several privacy-preserving techniques in the literature that transform various types of data into the encrypted domain using traditional homomorphic encryption or randomisation techniques, e.g., facial biometrics [32, 33], emotions [34, 36, 35], or voice biometrics [31].
In the domain of speech processing, only a few notable privacy-preserving works exist [29, 31, 37, 24]. Smaragdis and Shashanka proposed the first application of secure multi-party computation (SMC) concepts to privacy-constrained speech technology [29]. In their work, they realised secure speech recognition using the hidden Markov model (HMM) and a generalised version of the Paillier public-key scheme, which allowed training and classification between multiple parties and achieved perfect accuracy.
Pathak et al. redesigned Gaussian Mixture Model (GMM) based speaker recognition [31] to achieve a similar privacy goal. This work relies on homomorphic cryptosystems such as the BGN and Paillier encryption schemes and has shown a proof-of-concept of privacy-preserving speaker recognition without compromising accuracy. However, the shortcoming of the above cryptographic approaches is that far too much time is spent on encryption, i.e., a few minutes are required for processing. Recently, the work in [24] used a randomisation technique from information theory to develop a privacy-preserving speaker verification scheme. This work is neither computationally inefficient nor privacy-compromising. The solution presented in [24] is significantly more advanced than the existing solutions in terms of accuracy, privacy and speed. However, [24] is interactive, requires multiple rounds of computation, and cannot be used with different frontend systems. Moreover, the security of all the schemes mentioned above relies on mathematically intractable problems such as integer factorisation and the discrete logarithm. As we start to see the rise of quantum computers, the security of all of these might be broken soon [38].
In contrast to traditional partially homomorphic encryption schemes (e.g., Paillier, BGN), the rise of fully homomorphic encryption schemes (FHEs) has recently shown promising results in terms of efficiency. While FHE resists attacks arising from quantum computers (due to lattice hard problems [38]), it also supports non-interactive computation in the encrypted domain. Some of the notable works at the intersection of FHE and machine learning are [20, 18, 19], among many others. The work in [20] trains 30,000 logistic regression models in the encrypted domain within 20 minutes and performs encrypted-domain inference in 5 seconds using the CKKS FHE scheme. The work in [18], jointly done by Princeton University and Microsoft in 2016, transforms a trained Convolutional Neural Network (CNN) into a model suitable for encrypted-domain inferencing. The work uses a simple CNN with 5 layers and 28x28 input dimension for the MNIST dataset and requires 400MB of bandwidth and 5 minutes to perform inference in the encrypted domain. Finally, the work in [19] uses a novel discretisation approach to transform neural networks to suit advanced FHE schemes. A simple neural network with 3 layers (with a hidden layer of 100 neurons) took only 1.7 seconds to perform image classification at 96% accuracy with 128-bit security [19]. There are several other works in this domain focusing on redesigning traditional machine learning (mainly deep learning) algorithms to work in the FHE domain.
However, to the best of our knowledge, no FHE-based speech processing machine learning algorithms exist in the literature that achieve end-to-end privacy in real time. Within this context, we develop a novel algorithm that changes the backend of the speaker verification system to process encrypted speech data in real time without the need for multiple rounds of communication. Moreover, the proposed algorithm supports real-time end-to-end encrypted speaker verification with negligible loss of accuracy at 128-bit security.
III. Background Information
This section briefly describes the building blocks required for the proposed algorithm.
III-A. The Speaker Verification System
As shown in Fig. 1, a speaker verification system is composed of two components: 1) the frontend and 2) the backend. The frontend is mainly focused on extracting feature vectors from speech. The backend performs noise reduction and similarity calculation on the speaker features.
The frontend extracts a number of acoustic features such as linear predictive cepstral coefficients, perceptual linear prediction coefficients, and mel-frequency cepstral coefficients. Several techniques are then used to enhance these features to obtain better verification accuracy. In 1995, Reynolds et al. [7] applied the Gaussian Mixture Model technique based on the Universal Background Model (GMM-UBM) to these features to increase the accuracy by a significant percentage. Since then, GMM-UBM based speaker verification has become the foundation of speaker verification research. Fifteen years later, Dehak et al. [9] proposed a groundbreaking model called the i-Vector to further decrease the speaker and channel variation while increasing the verification accuracy. Moreover, i-Vectors have significantly lower dimension (around 200x1) than the GMM-UBM models (GMM-UBM supervectors are around 40,000x1).
Recently, motivated by the powerful feature extraction capability of deep neural networks (DNNs), many deep learning based speaker recognition methods have been proposed [10, 8]. The DNN-based schemes boost the performance of speaker verification to a new level, even in wild environments. Similar to the i-Vector, the DNN-based feature extraction methods output the x-Vector [11] and d-Vector [12]. The dimensions of these vectors are very similar to those of i-Vectors. As depicted in Fig. 1, the frontend can use either GMM-UBM or DNN-UBM to obtain i-, x- or d-Vectors [10, 8]. Hence, only these features are sent to the server for enrolment and matching. This paper focuses on protecting these feature vectors as they are stored and processed in the backend. One of the dominant techniques used in the backend to perform similarity calculation is the cosine distance between the enrolled (or claimed) and test feature vectors of the user [9, 11, 12].
III-B. Cosine Distance Calculation
Let us suppose the user enrolled a feature vector $\mathbf{w} \in \mathbb{R}^n$ at the server. During the verification, the user sends a test vector $\mathbf{v} \in \mathbb{R}^n$. Now the server calculates the cosine distance between the target and test vectors as follows:

$$\text{score}(\mathbf{w}, \mathbf{v}) = \frac{\mathbf{w}^T \mathbf{v}}{\sqrt{\mathbf{w}^T \mathbf{w}}\,\sqrt{\mathbf{v}^T \mathbf{v}}} \qquad (1)$$

where the dimension $n$ is the size of the i-, d- or x-Vectors in the state-of-the-art works. To further reduce the channel- and speaker-dependent noise, a projection matrix $\mathbf{P}$ is used as follows [9]:

$$\text{score}(\mathbf{w}, \mathbf{v}) = \frac{(\mathbf{P}\mathbf{w})^T (\mathbf{P}\mathbf{v})}{\sqrt{(\mathbf{P}\mathbf{w})^T (\mathbf{P}\mathbf{w})}\,\sqrt{(\mathbf{P}\mathbf{v})^T (\mathbf{P}\mathbf{v})}} \qquad (2)$$

where $\mathbf{P} \in \mathbb{R}^{n \times n}$.
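As an illustrative sketch, the plain-domain score of equation (2) can be computed in a few lines of Python (the function name and the toy 2x2 projection matrix are ours, not taken from the paper's implementation):

```python
import math

def cosine_score(w, v, P):
    """Plain-domain cosine score of equation (2): project both vectors
    with P, then take the cosine similarity of the projections."""
    n = len(w)
    Pw = [sum(P[i][j] * w[j] for j in range(n)) for i in range(n)]
    Pv = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return dot(Pw, Pv) / (math.sqrt(dot(Pw, Pw)) * math.sqrt(dot(Pv, Pv)))

# A vector compared with itself scores 1.0 regardless of the projection.
P = [[2.0, 0.0], [0.0, 3.0]]
print(cosine_score([1.0, 2.0], [1.0, 2.0], P))  # close to 1.0
```

The rest of the paper is concerned with evaluating exactly this expression when $\mathbf{w}$ and $\mathbf{v}$ are only available in encrypted form.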
III-C. Fully Homomorphic Encryption
Fully Homomorphic Encryption (FHE) schemes support homomorphic operations such as addition and multiplication in the encrypted domain. To explain this briefly, let us denote two numbers in the plain domain as $x$ and $y$, and the corresponding homomorphically encrypted values as $\bar{x}$ and $\bar{y}$. Denote the encryption and decryption functions as $Enc$ and $Dec$. The encryption function takes a plain-domain value and the public key $pk$ as inputs and outputs the corresponding encrypted value, i.e., $\bar{x} = Enc(x, pk)$ and $\bar{y} = Enc(y, pk)$. The decryption function takes the encrypted value and the secret key $sk$ as inputs and outputs the plain-domain value, i.e., $x = Dec(\bar{x}, sk)$ and $y = Dec(\bar{y}, sk)$. Within this context, the FHE properties allow addition and multiplication to be computed in the encrypted domain without the need to decrypt the values, i.e., $Dec(\bar{x} \oplus \bar{y}, sk) = x + y$ and $Dec(\bar{x} \otimes \bar{y}, sk) = x \cdot y$. Therefore, mathematical functions can be computed in the encrypted domain using only encrypted values. For example, if a cloud wants to compute a function $f(x, y)$ but only has the encrypted inputs $\bar{x}$ and $\bar{y}$, the cloud can exploit FHE to evaluate the function as $\bar{z} = f(\bar{x}, \bar{y})$, where $Dec(\bar{z}, sk) = f(x, y)$. Since the cloud does not hold the secret key $sk$, the evaluated function remains in the encrypted domain.
An encryption scheme with the above FHE properties was invented by Craig Gentry in 2009 [14]. The scheme is based on lattice-based cryptography and is hence secure against attacks arising from quantum computers [15, 16, 17]. Since Gentry’s groundbreaking work, numerous improvements have been made by several researchers to improve efficiency and scalability. Currently, FHE has reached an inflection point where several relatively complex algorithms can be evaluated in the encrypted domain in near-real time [18, 19, 20]. Single-instruction-multiple-data (SIMD) is one of the powerful techniques that has enhanced the efficiency of FHE by more than 3 orders of magnitude [21]. While there are a handful of FHE schemes, this paper focuses on the Cheon-Kim-Kim-Song (CKKS) FHE scheme [22], since it is the most efficient method for performing approximate homomorphic computations over real and complex numbers.
III-D. The CKKS FHE Scheme
The CKKS scheme supports real numbers and SIMD operations; therefore, it is a suitable candidate for applications relying on vectors of real numbers. CKKS works with polynomials because they provide a good trade-off between security and efficiency compared to standard computations on vectors.
Given a message $\mathbf{m}$, a vector of real values, it is first encoded into a plaintext integer polynomial in $R_q$, where $N$ denotes the degree of the polynomial. The CKKS encryption encrypts this plaintext into two ciphertext polynomials, where $q$ is the size of the ciphertext modulus. In the ciphertext domain, CKKS supports homomorphic addition, multiplication, and rotation operations. The rotation operation homomorphically performs a cyclic shift of the vector by some step. The multiplication and rotation operations in the CKKS scheme need additional corresponding evaluation keys and key-switching procedures.
Moreover, each real number is scaled by some big integer $\Delta$, called the scaling factor, and then rounded to an integer prior to encryption. When two values encrypted with the CKKS scheme are multiplied homomorphically, the scaling factors of the two values are also multiplied. This scaling factor should be reduced to the original value using the rescaling operation (i.e., dividing by $\Delta$).
In CKKS, the size of the ciphertext is big (i.e., $q$ is big), hence it requires higher computational complexity. To reduce the complexity, the residue number system can be used. In the residue number system, the big integer $q$ is split into several small integers, and the addition and multiplication of the original big integers are equivalent to the corresponding component-wise operations on the small integers. The parameter $L$ denotes the number of multiplications that can be performed correctly on a ciphertext. For example, if there are four CKKS ciphertexts $c_1$, $c_2$, $c_3$ and $c_4$, then $c_1 \cdot c_2$ requires one level of multiplication and $(c_1 \cdot c_2) \cdot c_3$ requires two levels of multiplication. Instead of computing $c_1 \cdot c_2 \cdot c_3 \cdot c_4$ via three sequential multiplications, computing $c_1 \cdot c_2$ and $c_3 \cdot c_4$ followed by $(c_1 \cdot c_2) \cdot (c_3 \cdot c_4)$ requires only two levels of multiplication. The efficiency of an algorithm therefore depends on designing circuits with smaller multiplicative depths.
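The level-counting argument above can be illustrated with a small sketch (the helper names are ours; "depth" here counts ciphertext multiplications on the critical path):

```python
import math

def sequential_depth(n):
    """Levels consumed when n ciphertexts are multiplied in a chain:
    ((c1*c2)*c3)*c4... adds one level per multiplication."""
    return n - 1

def balanced_depth(n):
    """Levels consumed when the same product is evaluated as a
    balanced binary tree, e.g. (c1*c2)*(c3*c4) for n = 4."""
    return math.ceil(math.log2(n))

# Four ciphertexts: three levels sequentially, only two with a tree.
print(sequential_depth(4), balanced_depth(4))  # → 3 2
```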
The security of the CKKS scheme relies on the polynomial degree $N$ and the ciphertext modulus $q$. Table I shows the parameters for achieving 128-, 192- and 256-bit security. For a given $N$, the maximum size of $q$ decreases with increasing security level. If the application requires more levels of multiplication in the ciphertext domain, then it requires a larger $q$. For a given security level, the only way to increase the size of $q$ is by increasing $N$. Increasing $N$ has consequences in terms of computational complexity.
N     | Max. size of q (128-bit) | Max. size of q (192-bit) | Max. size of q (256-bit)
1024  | 27  | 19  | 14
2048  | 54  | 37  | 29
4096  | 109 | 75  | 58
8192  | 218 | 152 | 118
16384 | 438 | 305 | 237
32768 | 881 | 611 | 476
III-E. Newton-Raphson Method for Inverse Square Root Calculation
While FHE computes multiplication and addition in the encrypted domain, several fundamental mathematical operations, such as finding the inverse or the square root of a number, are not directly feasible. However, we can use the Newton iterative method, introduced by Isaac Newton in 1669 [13], to calculate these in an FHE-friendly way. Since the cosine distance calculation in (2) requires an inverse square root operation, this section describes the Newton iterative method for performing this operation using only multiplication and addition.
Let us define a function $f(x) = \frac{1}{x^2} - b$, where the root of this function gives the inverse square root of $b$, i.e., $f(x) = 0$ leads to $x = b^{-1/2}$. The Newton iterative formula for finding the root is given by the following equation [26]:

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} \qquad (3)$$

where $f'(x_k)$ is the derivative of $f$ at $x_k$. Hence, using this derivative, equation (3) can be rewritten as:

$$x_{k+1} = \frac{x_k}{2}\left(3 - b\, x_k^2\right) \qquad (4)$$

To find the inverse square root of $b$, equation (4) must be computed repetitively. The number of iterations required depends heavily on $x_0$, i.e., the initial value for (4). If $b$ is bounded below and above (i.e., $b_{min} \le b \le b_{max}$), then a good starting point is the average of the corresponding bounds, i.e., $x_0 = \frac{1}{2}\left(b_{min}^{-1/2} + b_{max}^{-1/2}\right)$ [26]. With this initialisation, (4) can be computed using only multiplications and additions; hence, it is an FHE-friendly replacement for the inverse square root operation.
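Since the iteration in (4) uses only multiplications and additions, it can be prototyped directly; the following plain-domain Python sketch (our own, with illustrative values) shows the convergence behaviour:

```python
def inv_sqrt_newton(b, x0, iterations):
    """Approximate b**(-1/2) using only additions and multiplications
    via the Newton iteration x_{k+1} = (x_k / 2) * (3 - b * x_k**2)."""
    x = x0
    for _ in range(iterations):
        x = 0.5 * x * (3.0 - b * x * x)
    return x

# b = 625: the true inverse square root is 0.04; a rough initial
# guess converges within a handful of iterations.
print(inv_sqrt_newton(625.0, 0.03, 5))  # converges to ~0.04
```

Because convergence is quadratic, a good initial guess keeps the iteration count, and hence the multiplicative depth of the homomorphic circuit, small.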
IV. The Proposed Scheme
In this section, we put together all the techniques explained in Section III to develop a privacy-preserving speaker verification technique using the CKKS fully homomorphic encryption scheme. The user is provided with a client application to extract features from their speech, generate the secret and public keys required for the CKKS FHE scheme, and interact with the server.
IV-A. Feature Extraction
As shown in Fig. 1, the speech data can be converted into a feature vector $\mathbf{w}$ with dimension $n$. Regardless of the feature extraction model, the dimension of the feature vector is around 200. The raw speech data goes through several speech processing modules to obtain mel-frequency cepstral coefficients (MFCC), followed by GMM supervectors of large dimension. These high-dimensional vectors can be reduced via several advanced techniques such as i-Vector models (GMM/UBM i-Vectors) or d- and x-Vectors via deep neural networks (GMM/UBM DNN). Since this work focuses on the backend of the speaker verification system, we selected a computationally efficient GMM/UBM based i-Vector model for feature extraction. The proposed scheme applies directly to any frontend feature extraction scheme that outputs a low-dimensional vector (i.e., $n$ around 200).
IV-B. Key Generation for the CKKS FHE Scheme
The security key generation relies on several factors and depends on the underlying application. As shown in Table I, the high-level parameters $N$ and $q$ must be selected by considering both efficiency and security. Moreover, the scaling factor $\Delta$ and the number of multiplication levels $L$ must be set in advance. Since the application might be used by several users, the server presets these parameters to be common for all users. Given these global parameters ($N$, $q$, $\Delta$ and $L$), each user (i.e., the client application running on the user’s device) generates a public key and a secret key. The public key, which can be used for encryption, rescaling and rotation, is sent to the server. The secret key never leaves the user’s device.
IV-C. Enrolling the Feature Vector
Using the client application, the user can extract speech features, generate keys for encryption, and start the enrolment process. The enrolment process is simple and requires executing the following four steps:

1) Extract a speech feature vector $\mathbf{w}$ from speech.

2) Obtain the initialisation variable $x_0$ (more details about this will be provided in the next section).

3) Generate and store the secret-public key pair ($sk$ and $pk$).

4) Apply CKKS encryption to obtain the encrypted values $\bar{\mathbf{w}}$ and $\bar{x}_0$.
Now the user sends the encrypted data to the server for enrolment along with the user ID, denoted $ID$. The server stores the data in a database against the user ID $ID$.
IV-D. Speaker Verification
The speaker verification part is the core contribution of this work. Similar to enrolment, the user extracts a feature vector $\mathbf{v}$ and applies CKKS encryption to get $\bar{\mathbf{v}}$. For the encryption, the user uses the same key that was generated during the enrolment stage. To complete the verification stage, the user sends $\bar{\mathbf{v}}$ to the server. Now the server retrieves the stored data from the database using $ID$ and evaluates (2) to obtain the verification score. The projection matrix $\mathbf{P}$ in (2) is available to the server in the plain domain. Please note that the matrix $\mathbf{P}$ is obtained by the server during the training process and is not derived from the user’s speech data (see [9, 24] for more details).
If we look closely at the verification equation (2), the server computes the numerator to get a scalar, then computes the denominator to get a scalar, followed by a division between these scalars. Hence, we can reformulate (2) as follows:

$$\text{score} = a \cdot b^{-1/2} \qquad (5)$$

where

$$a = (\mathbf{P}\mathbf{w})^T(\mathbf{P}\mathbf{v}), \qquad b = \left((\mathbf{P}\mathbf{w})^T(\mathbf{P}\mathbf{w})\right)\left((\mathbf{P}\mathbf{v})^T(\mathbf{P}\mathbf{v})\right) \qquad (6)$$
Since $b$ in (6) is encrypted, it is not possible to directly compute the inverse square root of $b$ required for (5). Hence, we exploit the Newton-Raphson method as explained in Section III-E. The Newton-Raphson method is iterative; hence, using (4), the approximated result after the first iteration is given by:

$$\tilde{x}_1 = \frac{x_0}{2}\left(3 - b\, x_0^2\right) \qquad (7)$$

and after the second iteration:

$$\tilde{x}_2 = \frac{\tilde{x}_1}{2}\left(3 - b\, \tilde{x}_1^2\right) \qquad (8)$$
and so on and so forth ($\tilde{x}_1$ and $\tilde{x}_2$ denote the approximations after the first and second iterations, respectively). Using the first approximated value in (7), we can get an approximated value for (5) as follows:

$$\widetilde{\text{score}} = a \cdot \tilde{x}_1 \qquad (9)$$

Using (5), we can expand (9) into (10):

$$\widetilde{\text{score}} = (\mathbf{P}\mathbf{w})^T(\mathbf{P}\mathbf{v}) \cdot \frac{x_0}{2}\left(3 - \left((\mathbf{P}\mathbf{w})^T(\mathbf{P}\mathbf{w})\right)\left((\mathbf{P}\mathbf{v})^T(\mathbf{P}\mathbf{v})\right) x_0^2\right) \qquad (10)$$

As described in Section III-E, $x_0$ in (10) is the initialisation variable, and it is already supplied by the user to the server, in encrypted form, during the enrolment. Hence, equation (10) can be revised as (11), with the encrypted values $\bar{x}_0$, $\bar{\mathbf{w}}$ and $\bar{\mathbf{v}}$ replacing their plain-domain counterparts. As shown in (12), which annotates (11) with the level at which each product is evaluated, the server requires four multiplication levels to compute (11). Similarly, we can incorporate the second iterative result in (8), which consumes six multiplication levels. The result of the third iteration consumes seven multiplication levels, nine levels are needed for the fourth iteration, and so on. Increasing the number of multiplication levels leads to larger parameters for the CKKS encryption, which directly impacts the efficiency of the scheme. Given this context, let us focus on how the server can compute the score using only the first iteration result, as shown in (12).
The server first computes all four Level 1 multiplications shown in (12) which is a encryptedvectorplainmatrix computation involving plain matrix . The SIMD feature in CKKS supports elementwise multiplication and addition operation. To exploit this feature, the server needs to reassemble the matrix into vectors , , , as described below. If
then
where diagonals of are reassembled as vectors in a cyclic manner in . Hence we can perform the vectormatrix computation as follows:
$$\mathbf{P}\bar{\mathbf{v}} = \sum_{j=0}^{n-1} \tilde{\mathbf{p}}_j \odot \mathrm{rot}(\bar{\mathbf{v}}, j) \qquad (13)$$

where $\odot$ denotes the element-wise Hadamard product operation and $\mathrm{rot}(\cdot, j)$ denotes a cyclic rotation by $j$ slots. Since the vector $\bar{\mathbf{v}}$ is encrypted, the result of (13) is an $n$-dimensional encrypted vector. Moreover, the multiplications in (13) are element-wise; hence, the whole operation consumes only one CKKS multiplication level.
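A plain-Python simulation of this diagonal encoding may clarify equation (13); this is our own sketch in which lists stand in for ciphertext slots and `rotate` mimics the CKKS rotation operation:

```python
def rotate(vec, step):
    """Cyclic left-shift, mimicking the CKKS rotation operation."""
    return vec[step:] + vec[:step]

def diag_matvec(P, v):
    """Compute the matrix-vector product P @ v using only element-wise
    (SIMD-style) products and rotations, as in equation (13)."""
    n = len(v)
    result = [0.0] * n
    for j in range(n):
        # j-th generalised diagonal of P, read cyclically.
        d = [P[i][(i + j) % n] for i in range(n)]
        r = rotate(v, j)
        result = [acc + a * b for acc, a, b in zip(result, d, r)]
    return result

P = [[1, 2], [3, 4]]
v = [5, 6]
print(diag_matvec(P, v))  # → [17.0, 39.0], i.e. P @ v
```

In the real scheme, each `d` is a plain vector and `v` is a ciphertext, so every term of the sum is a plaintext-ciphertext Hadamard product plus a rotation.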
All four Level 1 computations produce encrypted vectors, which are used for the Level 2 computation. At Level 2, there are four multiplications, which are encrypted-vector-encrypted-vector dot product computations. To perform a dot product, we exploit the CKKS SIMD element-wise multiplication and rotation features. Suppose $\bar{\mathbf{u}} = (u_1, \ldots, u_n)$ and $\bar{\mathbf{v}} = (v_1, \ldots, v_n)$; then, to compute $\bar{\mathbf{u}}^T\bar{\mathbf{v}}$, we first perform the element-wise Hadamard product using the CKKS SIMD operation: $\bar{\mathbf{z}} = \bar{\mathbf{u}} \odot \bar{\mathbf{v}}$. To obtain the final answer, we repetitively shift the vector elements and add. For example, if $n = 4$, we first rotate the vector by 2 elements and add:

$$(z_1, z_2, z_3, z_4) + (z_3, z_4, z_1, z_2) = (z_1 + z_3,\; z_2 + z_4,\; z_3 + z_1,\; z_4 + z_2)$$

Then we rotate the resulting vector by 1 element and add it again, so that the first slot accumulates $z_1 + z_2 + z_3 + z_4$.

Now the first element of the vector contains the correct answer for the dot product computation. One of the conditions for this repeated rotation and addition is that $n$ should be a power of two. This condition can be easily met by concatenating zeros at the end of the vectors. We need to perform $\log_2 n$ repeated rotations and additions. Finally, the dot product computation consumes only one multiplication level, for the element-wise multiplication; the rotations and additions do not consume any multiplication level.
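The rotate-and-sum procedure can be simulated in plain Python as follows (our own sketch; lists stand in for ciphertext slots, and the vector length is assumed to be a power of two):

```python
def rotate(vec, step):
    """Cyclic left-shift, mimicking the CKKS rotation operation."""
    return vec[step:] + vec[:step]

def simd_dot(u, v):
    """Dot product via one element-wise multiply followed by
    log2(n) rotate-and-add steps; n must be a power of two."""
    z = [a * b for a, b in zip(u, v)]   # the only multiplication level
    step = len(z) // 2
    while step >= 1:                    # rotations/additions are level-free
        z = [a + b for a, b in zip(z, rotate(z, step))]
        step //= 2
    return z[0]  # the first slot now holds the full sum

print(simd_dot([1, 2, 3, 4], [5, 6, 7, 8]))  # → 70
```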
The Level 2 computation described above produces an encrypted CKKS scalar (not a vector). For Level 3 and Level 4, we only need to perform encrypted-scalar-encrypted-scalar multiplications, which are straightforward to compute. Using these computations, we can obtain the approximated score in the encrypted domain.
This encrypted score is then sent to the client application. The client application decrypts it and checks whether the score is above the threshold to authenticate the user. While this approach can be used for different applications (e.g., if the underlying application is about measuring a medical condition, then this score would represent its severity), this paper only considers the speaker verification use case.
V. Parameter Selection and Performance Analysis
This section describes the dataset used for the experiments, the results, and the complexity, security, and privacy analysis of the proposed algorithm.
V-A. Parameter Selection
We start with selecting parameters for the CKKS encryption. In this experiment, we target 128-bit security. We select three sets of parameters, as shown in Table II. Set I considers the smallest possible $N$ suitable for the application. Since $\log q$ is limited to 218 when $N = 8192$, the maximum number of multiplication levels is limited to 4 without losing a lot of accuracy. Therefore, we will only use one Newton-Raphson iteration to find the inverse square root. If we set the base prime size and the special prime size to 41 bits each, then we are left with 136 bits. We split this into the four 34-bit primes required for the four multiplication levels. Since we are using all the available bits, the security of Set I is exactly 128-bit.
                          | Set I   | Set II  | Set III
N                         | 8192    | 16384   | 16384
No. Iterations            | 1       | 1       | 2
No. Multiplication Levels | 4       | 4       | 6
Max. size of q            | 218     | 438     | 438
Base/special prime size   | 41      | 60      | 60
Scale size                | 34      | 40      | 40
Used size of q            | 218     | 280     | 360
Security                  | 128-bit | >128-bit | >128-bit
Set II and Set III use a higher-order polynomial with degree $N = 16384$. This supports a maximum $\log q$ of 438 bits, which gives a lot of flexibility on prime sizes and multiplication levels. Set II considers only one Newton-Raphson iteration, hence four multiplication levels are required. We set generous bit sizes for the base prime, special prime and scales (60, 60 and 40 bits, respectively), totalling only 280 bits, which is smaller than the allowable 438 bits. Therefore, the security of Set II is higher than 128-bit.
To increase the accuracy of finding the inverse square root of an encrypted number, we need to use the result of the second Newton-Raphson iteration, which requires 6 multiplication levels. The parameters for this are shown as Set III in Table II. Similar to Set II, the security of Set III is higher than 128-bit.
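The bit budgets in Table II can be sanity-checked with simple arithmetic; the breakdown into one base prime, one special (key-switching) prime, and one scale-sized prime per multiplication level follows the description above (the helper name is ours):

```python
def used_modulus_bits(base_prime, special_prime, scale_bits, levels):
    """Total bits of the ciphertext modulus chain: one base prime,
    one special (key-switching) prime, and one scale-sized prime
    per multiplication level."""
    return base_prime + special_prime + scale_bits * levels

# Set I:  N = 8192,  max log q = 218 (fully consumed)
print(used_modulus_bits(41, 41, 34, 4))  # → 218
# Set II: N = 16384, max log q = 438 (280 bits used)
print(used_modulus_bits(60, 60, 40, 4))  # → 280
# Set III: two Newton-Raphson iterations, six levels (360 bits used)
print(used_modulus_bits(60, 60, 40, 6))  # → 360
```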
V-B. The Dataset
The TIMIT speech corpus has been used to evaluate the accuracy and reliability of the proposed algorithm [23]. The TIMIT speech corpus contains broadband recordings (each recording lasts around 3 seconds) of 630 speakers of eight major dialects of American English. Each speaker has 10 speech samples, of which 8 were used to extract the feature vector for enrolment. We use GMM/UBM based i-Vectors for the experiments. However, as described earlier, the proposed model can also be used with DNN/UBM based x- or d-Vector speaker verification systems.
For the experiments, we follow the same approach used in [24] as a baseline. In [24], the TIMIT corpus is split into two parts: 1) the first two dialect regions are used for testing, and 2) the last four dialect regions are used to build the background model. Table III shows the statistics of the TIMIT dataset.
Dialect Region | #Male | #Female | Total
DR1   | 31  | 18  | 49
DR2   | 71  | 31  | 102
DR3   | 79  | 23  | 102
DR4   | 69  | 31  | 100
DR5   | 62  | 36  | 98
DR6   | 30  | 16  | 46
DR7   | 74  | 26  | 100
DR8   | 22  | 11  | 33
Total | 438 | 192 | 630
Since 8 speech samples from each speaker are used for enrolling the user at the server, the remaining 2 samples per user have been used for verification. Initially, we perform the following two baseline tests in the plain domain using (2):
V-B1. Genuine Attempts: Client-Client
In this test, for each speaker, the score is calculated using the speaker’s enrolled data against the speaker’s two test utterances. Hence, the scores for all genuine tests are obtained using (2).
V-B2. Imposter Attempts: Imposter-Client
In this test, each speaker’s test utterances are tested against the other users’ enrolled feature vectors. This leads to a much larger number of tests, and the score for each test has been obtained using (2).
Before we present the results, let us define the False Acceptance Rate (FAR), the False Rejection Rate (FRR) and the Accuracy:

FAR = (number of false acceptances) / (number of imposter attempts),

FRR = (number of false rejections) / (number of genuine attempts),

Accuracy = (number of correctly classified attempts) / (total number of attempts),

where FAR and FRR capture the two types of errors: a false acceptance means the system grants access to an imposter, and a false rejection means the system denies access to an enrolled speaker. From the FRR and FAR, we can obtain the Equal Error Rate (EER). The EER represents the operating point where the FAR is equal to the FRR.
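These definitions can be sketched in plain Python (our own illustration with toy score lists; we assume the accept decision is score >= threshold, and the EER is found by a simple threshold sweep):

```python
def far_frr(genuine, imposter, threshold):
    """FAR = fraction of imposter attempts wrongly accepted;
    FRR = fraction of genuine attempts wrongly rejected."""
    far = sum(s >= threshold for s in imposter) / len(imposter)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def eer(genuine, imposter, steps=1000):
    """Sweep thresholds in [0, 1] and return the error rate at the
    operating point where FAR and FRR are closest."""
    best = (1.0, None)
    for k in range(steps + 1):
        t = k / steps
        far, frr = far_frr(genuine, imposter, t)
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

genuine = [0.9, 0.8, 0.75, 0.6]   # toy cosine scores
imposter = [0.4, 0.3, 0.55, 0.2]
print(eer(genuine, imposter))  # → 0.0 (the toy scores separate cleanly)
```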
Using these definitions, we can present the baseline results as shown in Fig. 2. Since the number of imposter attempts is significantly higher than the number of genuine attempts, the Accuracy curve in Fig. 2 might be misleading (i.e., it approaches 100% as it rejects a large number of imposter attempts). Hence, we will stick to the EER to compare performance. The EER of the baseline model is obtained at the threshold where the FAR and FRR curves cross. In the following section, we analyse the proposed scheme.
VI. Experimental Results
We implemented the proposed algorithm in Python using the TenSEAL library [25] to interact with the C++ Microsoft SEAL FHE library (https://github.com/Microsoft/SEAL). The source code of our implementation can be found here: https://github.com/rahulay1/iVectorTenSEAL/tree/master. We essentially repeat the same steps that we used to evaluate the baseline model and test all three sets of CKKS parameters shown in Table II. We compare the time requirements using a high-end and a medium-end laptop. For the high end, we use a Razer laptop with 16GB RAM and 6 cores (12 CPUs) with up to 4.1GHz speed; this can be treated as the server. For the medium end, we use a MacBook Pro laptop with 8GB RAM and 2 cores (4 CPUs) with up to 2.5GHz speed. The specification of the medium-end laptop is comparable to that of a medium-end smartphone (e.g., Samsung Galaxy A series phones); hence, it can be considered for running the client application.
VI-A Initialisation of the Newton-Raphson parameter
Before we start the experiment, it is important to initialise the variable in (4) for the Newton-Raphson method. As explained in Section III-E, if the initial value is close to the actual inverse square root, convergence is much faster. Therefore, finding the distribution of the quantity being inverted is important in this context. According to (6), we can obtain this distribution using the TIMIT dataset. Using all 630 speakers, we obtained more than 0.7 million sample values, and we plot their distribution in Fig. 3. From this, we can clearly see that the initialisation should target the range between 400 and 900. Instead of initialising with the average of 400 and 900, we chose an initial value closer to where the bigger chunk of the data is concentrated.
Now we compute the inverse square root using the iterative approach and compare it with the exact answer in the same Fig. 3. The two convex curves show the relative error (in percent) of the iterative approach compared with the actual value for the first and second iterations. While there is no significant difference between the first and second iterations, the error remains small over the range where the bulk of the data lies. Therefore, we can safely use the Newton-Raphson method to compute the inverse square root of an encrypted number as proposed. We evaluate the loss of EER due to this approximation in the next section.
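The iteration can be sketched in plaintext as follows; the operating value x and the initial guess y0 below are illustrative choices within the 400-900 range discussed above, not the paper's exact initialisation:

```python
def inv_sqrt_newton(x, y0, iterations):
    """Newton-Raphson for 1/sqrt(x): y <- y * (1.5 - 0.5 * x * y^2).
    The update uses only additions and multiplications, so the same
    computation can be evaluated on CKKS ciphertexts."""
    y = y0
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

# Relative error (%) after one and two iterations for x = 500,
# with an assumed initial guess y0 = 0.04 (i.e. 1/25).
x, y0 = 500.0, 0.04
truth = x ** -0.5
err1 = abs(inv_sqrt_newton(x, y0, 1) - truth) / truth * 100
err2 = abs(inv_sqrt_newton(x, y0, 2) - truth) / truth * 100
```

With a well-chosen starting point the error is already small after one iteration and shrinks further after the second, which is why the number of (expensive) homomorphic iterations can be kept at one or two.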
VI-B EER loss comparison of the proposed scheme against the baseline approach
Before experimenting with the encrypted speaker verification algorithm proposed in Section IV-D, we need to check the loss of EER when the exact inverse square-root function is replaced with the Newton-Raphson iterative function. The result of this experiment is presented in Fig. 4 (see the first and third bars). Both one and two iterations of the Newton-Raphson method lead to only a small loss in EER compared to the baseline. Thanks to the careful selection of the initialisation value, these EER losses are negligible.
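The effect of substituting the Newton-Raphson approximation into the verification score can be sketched in plaintext, assuming a cosine-style score as in (2); the vectors and the initial guess here are illustrative assumptions:

```python
def nr_inv_sqrt(x, y0=0.04, iterations=2):
    # Newton-Raphson update for 1/sqrt(x); only + and * are used,
    # so the same computation can be carried out under CKKS.
    y = y0
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

def cosine_score(enrol, probe, inv_sqrt):
    # score = (w_e . w_t) * invsqrt(w_e . w_e) * invsqrt(w_t . w_t)
    dot = sum(a * b for a, b in zip(enrol, probe))
    return (dot * inv_sqrt(sum(a * a for a in enrol))
                * inv_sqrt(sum(b * b for b in probe)))

enrol = [1.0] * 500   # toy template with squared norm 500
probe = [1.0] * 500
exact = cosine_score(enrol, probe, lambda x: x ** -0.5)
approx = cosine_score(enrol, probe, nr_inv_sqrt)
```

The approximate score stays within a small fraction of the exact one, which is why replacing the exact inverse square root barely shifts the score distributions and hence the EER.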
Now we can compare the results of the proposed encrypted speaker verification system, depicted in the remaining three bars of Fig. 4 (second, fourth and fifth bars). These bars correspond to the CKKS parameters in Set III, Set II and Set I of Table II, respectively. With Set III parameters (using two iterations), the loss of EER compared to the baseline approach is small; the other two sets incur larger losses. The main reason is the approximation scaling factor of the CKKS scheme: since Set I uses a smaller scaling factor than Set II, the experiment based on Set I loses more precision in the underlying values and hence more EER than Set II. Nevertheless, the loss in EER is not significant when we consider that the verification completes in near-real time, as discussed below.
VI-C Computational time and processing requirements
One of the challenges that hinders the adoption of FHE in real applications is its ability to perform computation in real time. As discussed in the literature review, FHE has reached an inflection point where real-time applications can be implemented fully using FHE schemes. The results of the proposed scheme also support this statement. For all three sets of CKKS parameters in Table II, we measured the time required to complete key generation, enrolment, verification and decryption. These results are presented in Table IV, which reports four experiments for each CKKS parameter set, totalling 12 experiments.
For each CKKS parameter set, the experiment was conducted on the high-end (C1) and medium-end (C2) laptops for two feature dimensions (d = 100 and d = 200). From Table IV, we observe that a significant amount of time is spent on generating the public and secret keys. However, this is a one-time effort and can be done offline. The other three operations affect real-time performance. The time required to encrypt a speech template (denoted Enrol) is between 11 ms and 55 ms. As expected, the most time-consuming operation is verification. With Set I CKKS parameters, verification completes within 1.3 seconds. With Set III, while the EER loss is limited, verification on the C1 laptop takes around 7 seconds, which may still be suitable for near-real-time applications. The most efficient operation is decryption, which requires between 1 ms and 12 ms across all 12 experiments.
If we consider a typical scenario where the user runs on medium-end hardware and the server on high-end hardware, the total delay due to the FHE scheme is roughly 11 ms + 1.2 s + 2 ms ≈ 1.3 s, which is suitable for many real-time applications such as mobile banking and healthcare monitoring.
When it comes to processing power, key generation, encryption and decryption do not require much CPU power (see the screenshot in Fig. 6). However, almost all of the available processing power is consumed by the verification. Since verification involves several vector dot products, it can be highly parallelised to exploit all the CPUs. As shown in Fig. 6, all 12 CPUs in C1 are used to complete the verification. Since C2 has only four CPUs, it is almost three times slower than C1 (see Table IV).
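The parallelisation argument can be illustrated with a toy worker pool dispatching independent dot products; the plaintext products here are stand-ins for the encrypted operations, and the job count simply mirrors the 12 CPUs of C1:

```python
from concurrent.futures import ThreadPoolExecutor

def dot(u, v):
    # Stand-in for one of the independent (encrypted) dot products.
    return sum(a * b for a, b in zip(u, v))

# Twelve independent jobs, one per available CPU on C1.
pairs = [([1.0] * 100, [2.0] * 100) for _ in range(12)]
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(lambda p: dot(*p), pairs))
```

Because the dot products share no state, throughput scales close to linearly with the number of workers, matching the roughly threefold slowdown observed when moving from 12 CPUs (C1) to 4 CPUs (C2).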
TABLE IV: Time in seconds for key generation (KG), enrolment (Enrol), decryption (Dec.) and verification (Veri.) on the high-end (C1) and medium-end (C2) machines. The two N = 16384 column groups use the inverse square root from iteration 1 and iteration 2, respectively.

          |           Set I            | N = 16384, inv. sqrt (iter. 1) | N = 16384, inv. sqrt (iter. 2)
          | Client             | Server| Client              | Server   | Client              | Server
    Dim.  | KG     Enrol  Dec. | Veri. | KG      Enrol  Dec. | Veri.    | KG      Enrol  Dec. | Veri.
C1  d=200 | 2.282  0.016  0.001| 1.247 | 12.775  0.052  0.007| 6.442    | 12.760  0.052  0.004| 6.858
C1  d=100 | 2.467  0.017  0.001| 0.684 | 12.875  0.052  0.004| 3.504    | 12.704  0.052  0.001| 3.557
C2  d=200 | 5.861  0.011  0.002| 4.305 | 35.708  0.043  0.012| 19.737   | 40.972  0.045  0.009| 19.875
C2  d=100 | 5.629  0.011  0.002| 2.156 | 36.060  0.042  0.009| 9.759    | 40.028  0.041  0.004| 9.855
VI-D Analysis of storage, memory and bandwidth requirements
In the proposed scheme, both the client and the server need storage, memory and communication bandwidth to exchange data. The client on the user device must keep the secret key and share the public key with the server, which retains it for verification. Moreover, the client shares the encrypted templates with the server during enrolment and verification; hence, the server needs additional storage to keep the encrypted templates. Fig. 5 shows the storage requirements for several of these components. The storage required for both the secret key and the templates is less than 5 MB for both polynomial degrees considered for the CKKS scheme, although the higher-order polynomial requires more storage. Since these polynomials contain several slots for the input vector, there is no difference in storage when the feature dimension increases from 100 to 1000. The dominant element requiring large storage is the public key (110 MB and 0.75 GB for the two polynomial degrees, respectively). Since this key must be communicated to the server, relatively high bandwidth is also needed during the enrolment process. The main reason is that these public keys contain several keys for the rescaling and rotation operations in the encrypted domain.
The usage of RAM is shown in Fig. 6. When the process starts, only about 20% of the total available memory is used, in contrast to the CPU usage. The efficiency of the proposed algorithms (mainly the use of FHE) is therefore not limited by the available memory, and medium-end devices with up to 4 GB RAM are sufficient to run the client.
VII Privacy and Security Analysis
This section analyses the privacy of the stored speech features followed by the security of the whole system.
VII-A Privacy Analysis
The aim of the proposed algorithm is to stop the server from learning the result of the inference. The proposed algorithm exploits the CKKS FHE scheme, where the public key is used for the encryption, rescaling and rotation operations while the secret key is used for decryption. The proposed scheme requires only the public keys to be sent to the server; the secret key never leaves the user’s device. Therefore, the server cannot obtain the inference results.
Another potential privacy vulnerability is the identity linkage attack: if users enrol their biometrics with multiple services, the service providers might collude and profile the users using the similarities of the speech features. However, this attack is not possible because CKKS encryption is probabilistic; the server cannot distinguish encrypted messages even if they contain the same message and are encrypted using the same keys [16]. As long as the users’ secret keys are protected, it is infeasible for rogue service providers to profile the users.
VII-B Security Analysis
While it is infeasible to decrypt a CKKS ciphertext without the secret key, there might be other ways the system can be compromised. For example, an attacker might steal the user device containing the secret keys, compromise the encrypted templates stored on the server, or obtain a speech recording of the user. In this section we investigate each of these scenarios and show how the proposed scheme mitigates the corresponding security vulnerabilities.
VII-B1 Compromised user device attacks
In this attack, the adversary has access to the user device and the CKKS parameters stored during enrolment, but does not have access to the user’s speech to generate a legitimate speech feature. Hence, the adversary tries to combine the parameters from the compromised user device with the features of other users, and then to verify against the compromised user’s enrolled template residing at the server. To evaluate this, 2 × 150 × 151 tests were conducted (300 test utterances from other users are combined with the parameters of the compromised user device, repeated for all users) and the corresponding decision scores were obtained. Essentially, the result of this experiment is already captured by the EER loss comparison in Fig. 4. The loss of EER means that the FAR curve in Fig. 2 is shifted, leading to the acceptance of 28 more false claims for every 1000 impostor attempts. However, this can be reduced by using a stricter verification threshold, which will in turn affect the FRR.
An adversary might also generate a completely random feature vector, or a patterned feature vector designed to maximise the score; for the patterned feature vector, we generated a vector of ones. These artificial feature vectors can be encrypted with the stolen credentials and used to attempt speaker verification. The result of this experiment is shown in Fig. 7. The FAR of these attacks is lower than that of the baseline approach; hence the adversary is worse off with these attacks.
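A toy plaintext sketch of why such artificial probes score poorly against an enrolled template; the random Gaussian vectors are stand-in features for illustration, not the paper's i-vectors:

```python
import random

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))

random.seed(42)  # fixed seed so the sketch is reproducible
enrolled = [random.gauss(0.0, 1.0) for _ in range(200)]      # stand-in template
random_probe = [random.gauss(0.0, 1.0) for _ in range(200)]  # random-vector attack
ones_probe = [1.0] * 200                                     # "patterned" all-ones attack

s_genuine = cosine(enrolled, enrolled)  # self-match scores 1.0
s_random = cosine(enrolled, random_probe)
s_ones = cosine(enrolled, ones_probe)
```

In high dimensions, a vector chosen independently of the template is nearly orthogonal to it, so both artificial probes score close to zero while a genuine match scores near one; this is consistent with the lower FAR observed for these attacks.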
VII-B2 Compromised server attacks
In this attack, the adversary has access to the enrolled encrypted data and the public keys of all users stored at the server. The adversary might attempt to modify the encrypted templates using FHE properties, or use those encrypted templates during the verification process. However, none of these attacks will succeed without the secret keys. Moreover, if there is a compromise, the users can re-enrol using a different set of public and secret keys. Since the compromised speech vectors are encrypted, they can be revoked (similar to passwords) even though the underlying data is biometric and unique to the user.
VII-B3 Compromised user voice attacks
In this attack, the attacker has access to the user’s voice recording but not to the parameters stored on the user device. The attacker generates a random public and secret key pair and tries to impersonate the user. The success of this attack is equivalent to breaking the CKKS FHE scheme; hence this attack is also infeasible.
VIII Conclusion
This paper presents a novel algorithm to process encrypted speech features using fully homomorphic encryption, suitable for real-time speaker verification systems. The proposed algorithm exploits fully homomorphic encryption for arithmetic of approximate numbers (the CKKS scheme) to achieve 128-bit security against classical and quantum computers. To measure the performance, a well-known speech corpus was used to conduct rigorous experiments. The end-to-end encrypted privacy-preserving scheme requires only 1.3 seconds to complete the verification in the FHE domain, and in terms of equal error rate the proposed scheme is off by only 2.8%. Privacy analysis shows that the proposed scheme mitigates privacy vulnerabilities such as tracking and profiling that exist in traditional systems. Moreover, the proposed scheme is secure, and the system cannot be exploited without access to the secret keys.
References
[1]
[2] J. Rusz et al., “Smartphone allows capture of speech abnormalities associated with high risk of developing Parkinson’s disease,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 26, no. 8, pp. 1495–1507, Aug. 2018.
[3] R. Xu et al., “A voice-based automated system for PTSD screening and monitoring,” in Proc. Medicine Meets Virtual Reality XII, pp. 552–558, 2012.
[4] Y. Yamada, K. Shinkawa, and K. Shimmei, “Atypical repetition in daily conversation on different days for detecting Alzheimer disease: Evaluation of phone-call data from a regular monitoring service,” JMIR Ment. Health, vol. 7, no. 1, art. no. e16790, Jan. 2020.
[5] D. Shibata, S. Wakamiya, K. Ito, M. Miyabe, A. Kinoshita, and E. Aramaki, “VocabChecker: Measuring language abilities for detecting early stage dementia,” in Proc. Int. Conf. Intelligent User Interfaces Companion, pp. 1–2, 2018.
[6] A. K. Jain and K. Nandakumar, “Biometric authentication: System security and user privacy,” Computer, vol. 45, no. 11, pp. 87–92, 2012.
[7] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, Jan. 1995.
[8] Z. Bai and X. L. Zhang, “Speaker recognition based on deep learning: An overview,” Neural Networks, vol. 140, pp. 65–99, 2021.
[9] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[10] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in IEEE Int’l Conf. Acoustics, Speech and Signal Processing (ICASSP), pp. 1695–1699, May 2014.
[11] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in INTERSPEECH, pp. 999–1003, 2017.
[12] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in IEEE Int’l Conf. Acoustics, Speech and Signal Processing, pp. 4052–4056, 2014.
[13] I. Newton, Methodus Fluxionum et Serierum Infinitarum, 1966.
[14] C. Gentry, “A fully homomorphic encryption scheme,” Ph.D. thesis, Stanford University, 2009. Available: https://crypto.stanford.edu/craig
[15] M. Ajtai, “Generating hard instances of lattice problems,” in Proc. 28th Annual ACM Symposium on Theory of Computing, pp. 99–108, July 1996.
[16] O. Regev, “On lattices, learning with errors, random linear codes, and cryptography,” Journal of the ACM, vol. 56, no. 6, art. 34, 2009.
[17] Y. Rahulamathavan, S. Dogan, X. Shi, R. Lu, M. Rajarajan, and A. Kondoz, “Scalar product lattice computation for efficient privacy-preserving systems,” IEEE Internet of Things Journal, vol. 8, no. 3, pp. 1417–1427, 2020.
[18] N. Dowlin et al., “CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy,” in Int’l Conf. Machine Learning, 2016.
[19] F. Bourse, M. Minelli, M. Minihold, and P. Paillier, “Fast homomorphic evaluation of deep discretized neural networks,” in Annual Int’l Cryptology Conf., pp. 483–512, Springer, Aug. 2018.
[20] F. Bergamaschi et al., “Homomorphic training of 30,000 logistic regression models,” in Int’l Conf. Applied Cryptography and Network Security, Springer, 2019.
[21] N. P. Smart and F. Vercauteren, “Fully homomorphic SIMD operations,” Designs, Codes and Cryptography, vol. 71, no. 1, pp. 57–81, 2014.
[22] J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic encryption for arithmetic of approximate numbers,” in ASIACRYPT’17, pp. 409–437, 2017.
[23] J. Garofolo et al., “TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1,” Web download, Philadelphia: Linguistic Data Consortium, 1993.
[24] Y. Rahulamathavan, K. R. Sutharsini, I. G. Ray, R. Lu, and M. Rajarajan, “Privacy-preserving i-vector-based speaker verification,” IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 496–506, 2019.
[25] A. Benaissa, B. Retiat, B. Cebere, and A. E. Belfedhal, “TenSEAL: A library for encrypted tensor operations using homomorphic encryption,” in Int’l Conf. Learning Representations, Workshop on Distributed and Private Machine Learning, 2021.
[26] P. Kornerup and J. M. Muller, “Choosing starting values for certain Newton–Raphson iterations,” Theoretical Computer Science, vol. 351, no. 1, pp. 101–110, 2006.
[27] A. Phipps, K. Ouazzane, and V. Vassilev, “Your password is music to my ears: cloud-based authentication using sound,” in 11th Int’l Conf. Cloud Computing, 2021.
[28] H. Isyanto, A. Arifin, and M. Suryanegara, “Design and implementation of IoT-based smart home voice commands for disabled people using Google Assistant,” in IEEE Int’l Conf. Smart Technology and Applications (ICoSTA), pp. 1–6, Feb. 2020.
[29] P. Smaragdis and M. V. S. Shashanka, “A framework for secure speech recognition,” in IEEE Int’l Conf. Acoustics, Speech and Signal Processing, vol. 4, pp. IV-969–IV-972, Apr. 2007.
[30] O. Goldreich, “Secure multi-party computation,” working draft, Sep. 1998. Available: http://www.wisdom.weizmann.ac.il/~oded/pp.html
[31] M. Pathak and B. Raj, “Privacy-preserving speaker verification and identification using Gaussian mixture models,” IEEE Trans. Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 397–406, Feb. 2013.

[32] Z. Erkin, M. Franz, J. Guajardo, S. Katzenbeisser, I. Lagendijk, and T. Toft, “Privacy-preserving face recognition,” in Proc. 9th Int’l Symposium on Privacy Enhancing Technologies (PETS ’09), pp. 235–253, 2009.
[33] Y. Rahulamathavan and M. Rajarajan, “Efficient privacy-preserving facial expression classification,” IEEE Trans. Dependable and Secure Computing, vol. 14, no. 3, pp. 326–338, 2015.
[34] Y. Rahulamathavan, R. Phan, S. Veluru, K. Cumanan, and M. Rajarajan, “Privacy-preserving multi-class support vector machine for outsourcing the data classification in cloud,” IEEE Trans. Dependable and Secure Computing, vol. 11, no. 5, pp. 467–479, Sept. 2014.
[35] Y. Rahulamathavan, S. Veluru, R. Phan, J. Chambers, and M. Rajarajan, “Privacy-preserving clinical decision support system using Gaussian kernel based classification,” IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 1, pp. 56–66, Jan. 2014.
[36] Y. Rahulamathavan, R. Phan, J. Chambers, and D. Parish, “Facial expression recognition in the encrypted domain based on local Fisher discriminant analysis,” IEEE Trans. Affective Computing, vol. 4, no. 1, pp. 83–92, Jan.–Mar. 2012.
[37] J. Portêlo, B. Raj, A. Abad, and I. Trancoso, “Privacy-preserving speaker verification using secure binary embeddings,” in 37th Int’l Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, pp. 1268–1272, 2014.
[38] O. Regev, “On lattices, learning with errors, random linear codes, and cryptography,” in Proc. 37th ACM Symposium on Theory of Computing (STOC), pp. 84–93, 2005.