Today there is a growing drive to bring privacy preservation to the realm of speech processing. Following new privacy regulation such as the European GDPR, technology to protect sensitive data, including voice data, is attracting the attention of researchers and industrial stakeholders alike. Perhaps the most compelling argument for preserving privacy in speech signals is that they carry inherently personal and private information. Examples include paralinguistic and extralinguistic attributes and characteristics, e.g., gender, age, language, dialect, accent, health status, general well-being and emotional state, as well as the biometric identity.
This paper concerns the protection of privacy for voice biometric applications, e.g. speaker recognition. Recent years have seen rapid progress in privacy-preserving speaker recognition, e.g. [2, 3]. The most recent contribution to the field reported the first i-vector-based solution using homomorphic encryption (HE). HE supports computation upon sensitive biometric voice data in the encrypted domain and is a popular tool for privacy preservation. However, the computational demands of HE are prohibitive. This is especially true in the case of speaker recognition systems that employ some form of cohort score normalisation. When operating in unconstrained environments, cohort score normalisation is key to performance and is a feature of any state-of-the-art solution. Unfortunately, cohort score normalisation only compounds the computational burden of encryption since it typically involves many thousands of biometric comparisons in the scoring of a single utterance. The scale of the computational demands is currently a bottleneck to privacy preservation for speaker recognition.
The work reported in this paper aims to overcome this bottleneck with an alternative, efficient approach to cohort score normalisation. Using an efficient approach to speaker modelling, we propose to replace the speaker representation used in cohort score normalisation with an alternative binary key (BK) representation. As a native binary representation, BKs are readily suited to efficient computation in the encrypted domain. The paper shows that the computational overhead of operating upon encrypted representations can then be reduced greatly, meaning that probabilistic linear discriminant analysis (PLDA) comparisons can, for the first time, be performed in the encrypted domain with realistic computational resources.
This paper is organised as follows. Section 2 describes the related work in privacy-preserving speaker recognition. Section 3 describes BK voice representations. Section 4 describes the proposed efficient cohort pruning scheme using BK representations and shows how it can be employed for privacy-preserving score normalisation. Section 5 presents an experimental validation. Conclusions are provided in Section 6.
2 Preliminaries and Related Work
There is an extensive body of literature concerning the preservation of privacy in biometrics. Unfortunately, most relates not to speaker recognition, but to other biometric characteristics, e.g. fingerprint, iris, and face recognition [6, 7]. Whatever the characteristic, the requirements for effective privacy preservation are the same. These are outlined in the ISO/IEC 24745 standard, which stipulates that biometric information must be unlinkable (data of protected databases are not relatable), irreversible (neither embeddings nor audio can be recreated from protected data), and renewable (when updating a privacy-preservation algorithm, no biometric voice data needs to be recaptured).
The conventional approach to meeting these requirements involves some form of encryption. Since the late 1980s, the cryptographic community has focused on secure computation [9, 10], specifically the evaluation of a function in ways that do not reveal any information about the inputs of the involved parties, except for the results. Secure computation mechanisms may be harnessed to retain the functionality of an application without compromising the privacy of the involved parties. The main techniques include homomorphic encryption (HE), which enables computations to be carried out on ciphertexts, and secure multi-party computation (SMPC), which allows interactive computations on data that is secretly shared (footnote 1) between the parties (footnote 2). In contrast to the plaintext domain (where one party is sufficient to carry out a computation), security in SMPC is established by splitting computations in a distributed system architecture, where each party computes only on secretly shared data.
Footnote 1: E.g., in the Boolean Goldreich-Micali-Wigderson (GMW) protocol for two parties that we will use in our work, an input bitstring $x$ can be secretly shared among the parties by sending a random bitstring $r$ of the same length to one party and sending $x \oplus r$ to the other party. Then, the GMW protocol can be executed to securely compute any functionality on $x$ using just the shares of $x$. The inputs stay hidden because neither $r$ nor $x \oplus r$ reveals any information about $x$.
Footnote 2: Depending on the use case, a party could be a client device, an authentication server, or a database/processing server.
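The Boolean secret sharing used by two-party GMW can be illustrated with a minimal sketch. It assumes XOR-based sharing of a bitstring held as a Python integer; the function names `share` and `reconstruct` are hypothetical and are not taken from any SMPC framework.

```python
import secrets


def share(x: int, n_bits: int) -> tuple[int, int]:
    """Split an n_bits-long bitstring x into two XOR shares (illustrative only)."""
    r = secrets.randbits(n_bits)  # uniformly random mask of the same length
    return r, x ^ r               # one share per party


def reconstruct(share_a: int, share_b: int) -> int:
    """Recombine both shares; a single share alone is uniformly random."""
    return share_a ^ share_b


x = 0b10110010
s0, s1 = share(x, 8)
assert reconstruct(s0, s1) == x
```

Each share on its own is a uniformly random bitstring, which is why a single party learns nothing about the input.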
Recent state-of-the-art implementations of secure computation protocols have been shown to be efficient solutions to privacy preservation in a wide variety of applications. Even so, different solutions offer different levels of computational complexity. SMPC protocols typically involve multiple rounds of interaction (communications between parties involved in the secure computation). While not necessarily requiring interaction, HE usually incurs a higher computational overhead. It follows that, while deployed secure computation techniques can be highly efficient and scalable, whether they are depends on the use case and the employed mechanisms.
Existing work on privacy-preserving speech processing explores privacy preservation in traditional Gaussian mixture model (GMM) and hidden Markov model (HMM) architectures. Typically, HE is used to hide biometric information, while scoring is sometimes performed using SMPC. The solution reported in [15, 16] preserves privacy in an HMM framework by storing the corresponding secret shares among multiple servers, a technique known as outsourced SMPC. Of course, software solutions are not the only approach to privacy preservation. Other work shows how privacy can also be preserved by using trusted execution environments such as the Intel SGX architecture.
Recently, an HE-based solution to privacy preservation in the form of the Paillier cryptosystem has been applied to state-of-the-art speaker recognition architectures including i-vector systems using PLDA. This work shows that a one-to-one PLDA comparison can be computed in a few hundred milliseconds, depending on whether the speaker model is also protected. Unfortunately, while the solution delivers privacy preservation with no degradation to computational precision, it does not scale well. Protection of a cohort score normalisation process which requires many thousands of comparisons is computationally prohibitive; a runtime in the order of minutes would be needed to process one reference-probe comparison involving 10 000 cohort comparisons, a representative number for today’s state-of-the-art techniques.
With cohort score normalisation being a feature of any state-of-the-art approach to speaker recognition, and with performance degradation being the cost of its omission, there is interest in devising computationally manageable solutions. No previous work has considered this problem; it is the goal of the research reported in this paper.
3 Binary Key Voice Representations
Binary voice representations have been reported previously in the context of privacy preservation. Cryptobiometric systems (extraction/binding of cryptographic keys from biometric data; footnote 3) based upon the binarisation (footnote 4) of GMM-based supervectors are reported in [20, 3]. The work in this paper uses an alternative, more elaborate approach based upon binary keys (BKs), originally proposed in [5, 21]. The BK approach takes a more speaker-discriminatory approach to modelling, much like the idea behind anchor models [22, 23]. The same versatile approach has been applied successfully to a number of related problems including emotion recognition, speaker change detection, speaker diarization [26, 27] and privacy preservation in the context of cancelable biometric systems (irreversible feature transforms). Full details of the implementation used in this paper can be found in .
Footnote 3: In contrast, HE uses cryptographic keys for encrypting and decrypting biometric data (biometrics in the encrypted domain).
Footnote 4: The term binarisation is potentially misleading. It refers to a higher-level binary representation (accepting some loss of precision) of digital speech data (which is itself already stored in binary form).
The extraction of BKs is performed using a so-called binary key background model (KBM). The KBM plays a role similar to that of a conventional universal background model (UBM) but, instead of representing the acoustic space in an expected sense, it is formed from the concatenation of a number of speaker-dependent models (learned using traditional UBM maximum a-posteriori adaptation). As a rough approximation, the role of the KBM anchors is similar in principle to that of a latent speaker subspace as in PLDA, hence the extraction of discriminant BKs.
The BK extraction process is illustrated in Figure 1. From acoustic features (a), KBM component likelihoods are computed (b). Similarly to i-vector extraction, in which component posteriors are pooled into zero-order statistics, the top-ranked likelihoods (which at the frame level correspond to the top-ranked component posteriors) are used to determine the most frequently activated components (c); again, this is a rough approximation. An even more compressed speaker representation (d) is obtained with the final BK representation, which simply indicates the elements with the highest pooled mean statistics.
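The extraction steps (b) to (d) can be sketched as follows. This is only an illustration under stated assumptions: the per-frame KBM component likelihoods are taken as given, the function name `extract_binary_key` and the parameters `top_components` and `top_bits` are hypothetical, and ties are broken by component index. It is not the implementation used in this paper.

```python
from collections import Counter


def extract_binary_key(frame_likelihoods, top_components, top_bits):
    """Toy BK extraction from per-frame KBM component likelihoods.

    frame_likelihoods: list of frames, each a list with one likelihood
    per KBM component (assumed to be precomputed).
    """
    n_components = len(frame_likelihoods[0])
    activations = Counter()
    for frame in frame_likelihoods:
        # frame level: keep the best-scoring components of this frame
        best = sorted(range(n_components), key=lambda c: frame[c], reverse=True)
        activations.update(best[:top_components])
    # sample level: set the bits of the most frequently activated components
    ranked = sorted(range(n_components), key=lambda c: activations[c], reverse=True)
    key = [0] * n_components
    for c in ranked[:top_bits]:
        key[c] = 1
    return key


frames = [
    [0.9, 0.1, 0.5, 0.2],
    [0.8, 0.3, 0.6, 0.1],
    [0.1, 0.2, 0.9, 0.7],
]
print(extract_binary_key(frames, top_components=2, top_bits=2))
```

In this toy input, components 0 and 2 are activated most often, so only their bits are set.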
The research hypothesis under investigation is that the loss in precision incurred by BK representations will be tolerable given their use only for cohort pruning: their use will cause only marginal degradation to the benefit of score normalisation while nonetheless facilitating privacy.
4 Privacy-Preserving Cohort Pruning
The contribution of this paper is an efficient, privacy-preserving approach to score normalisation. It is based upon cohort pruning using BK speaker representations that allow for efficient computation in the encrypted domain. Using HE-protected i-vectors for this purpose is too slow, whereas unprotected i-vectors are not unlinkable. The following describes the approach and shows how computation is performed using SMPC while preserving the privacy of both data subjects and cohort speakers.
4.1 Score normalisation
Score normalisation is a processing step of any state-of-the-art approach to automatic speaker verification (ASV). It is applied to remove nuisance bias and variation that would otherwise influence comparison scores in diverse environmental conditions. The general approach to normalisation is based upon a set of auxiliary scores resulting from comparisons between references, probes, and cohort data. A score $S$ is normalised to $S'$ according to $S' = (S - \mu) / \sigma$, where the mean $\mu$ and the standard deviation $\sigma$ are derived from (Gaussian distributed) scores of comparisons with cohort data.
In the case that comparisons involve reference data, this approach is referred to as zero normalisation (z-norm). In this case, cohort data characteristics are assumed to match those of the probe (which are fixed for one quality condition). Normalisation is then performed using the mean $\mu_z$ and standard deviation $\sigma_z$ derived from the set $\mathcal{S}_z$ of reference-cohort comparison scores.
Normalisation can also be applied using probe data. This is known as test normalisation (t-norm). Here, cohort data characteristics are assumed to match those of the reference, in which case the cohort data consists of reference representations. Normalisation is then performed using the mean $\mu_t$ and standard deviation $\sigma_t$ derived from the set $\mathcal{S}_t$ of cohort-probe comparison scores.
In practice, cohort scores are rarely Gaussian-distributed. Adaptive z-norm (az-norm) and adaptive t-norm (at-norm) are commonly applied instead in order to account for this discrepancy: normalisation is performed with only the top-$N$ scores of $\mathcal{S}_z$ and $\mathcal{S}_t$. For i-vectors, both (a)z-norm and (a)t-norm are usually combined in symmetric fashion, giving (a)s-norm, conventionally the average of the two normalised scores: $S' = \frac{1}{2}\left(\frac{S - \mu_z}{\sigma_z} + \frac{S - \mu_t}{\sigma_t}\right)$. The normalisation process too needs to preserve privacy.
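Adaptive symmetric normalisation can be illustrated in the plaintext domain with a short sketch. It rests on assumptions that are not fixed by the text: the function names are hypothetical, the population standard deviation is used, and the two normalised scores are averaged with weight 1/2, as is conventional for s-norm.

```python
from statistics import mean, pstdev


def adaptive_norm(score, cohort_scores, top_n):
    """Normalise score with the mean/std of the top_n highest cohort scores."""
    top = sorted(cohort_scores, reverse=True)[:top_n]
    return (score - mean(top)) / pstdev(top)


def as_norm(score, ref_cohort_scores, probe_cohort_scores, top_n):
    """Symmetric combination of adaptive z-norm and adaptive t-norm."""
    z = adaptive_norm(score, ref_cohort_scores, top_n)    # reference vs. cohort
    t = adaptive_norm(score, probe_cohort_scores, top_n)  # probe vs. cohort
    return 0.5 * (z + t)


ref_cohort = [1.0, 2.0, 3.0, 4.0, 5.0]
probe_cohort = [0.5, 1.5, 2.5, 3.5, 6.0]
print(as_norm(6.0, ref_cohort, probe_cohort, top_n=3))
```

Only the top-ranked cohort scores enter the statistics, which is what makes pruning the cohort before scoring attractive.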
4.2 Privacy preservation
Privacy-preserving score normalisation can be performed using reference, probe, and cohort embeddings, all processed in the encrypted domain via HE-based PLDA (HE-PLDA). The resulting encrypted scores (the reference-probe score and the two sets of cohort scores used for z-norm and t-norm), none of which reveal any sensitive information, can then be decrypted by an authentication server in order that normalised scores can be computed in the plaintext domain.
Computing demands were assessed with a Python implementation of HE-PLDA with two 400-dimensional embeddings, 64-bit floating point precision and a key size of 3072 bits (recommended by NIST in order to support adequate security given advances in computing power until 2030 and beyond) running on an Intel Core i9-7960X CPU with 128 GB of RAM. Computations require 320 ms per comparison when only subject data is encrypted (the setting targeted in this paper), and 973 109 ms per comparison when both subject data and PLDA model parameters are encrypted (the second architecture of the original HE-PLDA work). Since a cohort size exceeding a few thousand voice samples is not unusual, the privacy-preserving computation of all cohort comparison scores is computationally prohibitive.
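The scale of the problem follows from simple arithmetic, sketched below. The figure of 320 ms per comparison is the one reported above; the comparison counts are the 10 000 comparisons quoted in the related-work discussion and the cohort sizes of 11 640 samples and 3 812 speakers used in the experiments. The function name `he_plda_runtime` is hypothetical, and a purely sequential evaluation is assumed.

```python
SECONDS_PER_COMPARISON = 0.320  # reported cost when only subject data is encrypted


def he_plda_runtime(n_comparisons: int) -> float:
    """Sequential HE-PLDA runtime in seconds for n_comparisons comparisons."""
    return n_comparisons * SECONDS_PER_COMPARISON


for label, n in [("10 000 comparisons", 10_000),
                 ("11 640 cohort samples", 11_640),
                 ("3 812 cohort speakers", 3_812)]:
    seconds = he_plda_runtime(n)
    print(f"{label}: {seconds:.0f} s (about {seconds / 60:.0f} min)")
```

Under these assumptions, scoring against the full cohort takes on the order of tens of minutes per normalised score.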
4.3 Cohort pruning
The research hypothesis under investigation here is that the selection of the top-$N$ relevant cohort comparisons can be performed more efficiently by accepting some modest degradation to computational precision, while still preserving privacy. Instead of selecting the top-$N$ cohort comparisons using ASV scores, they are selected using measures of acoustic similarity derived from BK representations of reference, probe, and cohort samples. As illustrated in Figure 2, using BK representations, HE-PLDA can then be made efficient via secure bit-wise AND operations and top-$N$ cohort pruning.
More precisely, we employ the Boolean GMW protocol (cf. Section 2) for two parties to securely compute our proposed secure cohort pruning technique. Probe, reference, and cohort BKs are secretly shared between two servers that jointly and securely compute the top-$N$ pruning. This principle is called outsourced secure computation (cf. Section 2). Assuming non-colluding servers, this approach can tolerate the corruption of one server without any privacy leakage. The assumption of non-colluding servers can be seen as realistic, given that one server could be supplied by an independent provider. Since we use protocols with semi-honest security here, the secure pruning requires servers that honestly follow the protocol.
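How two servers can evaluate an AND on XOR-shared bitstrings without learning the inputs can be sketched with a precomputed multiplication triple, one common way of realising AND gates in GMW-style protocols. The sketch simulates both servers in a single process and lets a trusted dealer generate the triple; both are simplifying assumptions, and all function names are hypothetical rather than part of any framework.

```python
import secrets


def xor_share(value: int, n_bits: int) -> tuple[int, int]:
    """Split value into two XOR shares."""
    r = secrets.randbits(n_bits)
    return r, value ^ r


def make_triple(n_bits: int):
    """Dealer-generated shares of random x, y and of z = x AND y (assumption)."""
    x, y = secrets.randbits(n_bits), secrets.randbits(n_bits)
    return xor_share(x, n_bits), xor_share(y, n_bits), xor_share(x & y, n_bits)


def shared_and(a_sh, b_sh, n_bits: int) -> tuple[int, int]:
    """Return XOR shares of (a AND b) given XOR shares of a and b."""
    (x0, x1), (y0, y1), (z0, z1) = make_triple(n_bits)
    # each server masks its shares; only the masked values d and e are opened
    d = (a_sh[0] ^ x0) ^ (a_sh[1] ^ x1)  # d = a XOR x
    e = (b_sh[0] ^ y0) ^ (b_sh[1] ^ y1)  # e = b XOR y
    c0 = z0 ^ (d & y0) ^ (e & x0) ^ (d & e)  # server 0 adds the public term
    c1 = z1 ^ (d & y1) ^ (e & x1)
    return c0, c1


a, b = 0b1100, 0b1010
c0, c1 = shared_and(xor_share(a, 4), xor_share(b, 4), 4)
assert c0 ^ c1 == a & b
```

The opened values d and e are masked by the random triple, so they reveal nothing about the shared inputs.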
Using the GMW protocol, we can easily compute an AND between the secretly shared sample and all secretly shared cohort data. On the resulting shares, we then securely compute the Hamming weight and perform an optimised secure top-$N$ pruning. As a result, the identifiers of the top-$N$ embeddings are revealed and can be used for score normalisation. Apart from this information, nothing else is leaked about the sample and cohort voice data.
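The functionality that the secure protocol evaluates can be stated in the plaintext domain as a reference. The sketch below is not secure in itself: it operates on unprotected BKs held as Python integers, the names `hamming_weight` and `prune_cohort` are hypothetical, and ties are broken by cohort identifier, which is an assumption.

```python
def hamming_weight(bits: int) -> int:
    """Number of set bits in a bitstring held as an integer."""
    return bin(bits).count("1")


def prune_cohort(sample_bk: int, cohort_bks: dict, top_n: int) -> list:
    """Return identifiers of the top_n cohort BKs most similar to sample_bk.

    Similarity is the Hamming weight of the bit-wise AND, i.e. the number
    of bits set in both binary keys.
    """
    similarity = {cid: hamming_weight(sample_bk & bk) for cid, bk in cohort_bks.items()}
    return sorted(similarity, key=lambda cid: (-similarity[cid], cid))[:top_n]


cohort = {"a": 0b11110000, "b": 0b11000011, "c": 0b00001111, "d": 0b11100001}
print(prune_cohort(0b11110000, cohort, top_n=2))
```

In the protected setting, only the returned identifiers are revealed; the similarity values and the BKs remain secret-shared.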
5 Experimental Validation
Given the research objective to demonstrate improvements in computational efficiency, rather than improved performance, only brief details of the text-independent speaker recognition system are provided here. It is based on 400-dimensional i-vectors, extracted from conventional acoustic features using a time delay deep neural network (TDNN) for estimating UBM posteriors. The TDNN is trained using the Kaldi toolkit with SRE’04-08 and SWBD data (x-vectors are not used). The Python backend is based on PLDA with mean and length normalisation, and trained with SRE’04-08 data. The KBM used for BK extraction is learned in Matlab using a 2 048-component UBM trained with conventional acoustic features and anchors (10 female and 10 male). KBM optimisation is performed using a subset of the cohort set containing data from 71 speakers. For BK extraction, at feature level the top components are activated, while at sample level the top bits are set. The cohort set is a subset of the PLDA training set with 11 640 voice samples of 3 812 speakers. The proposed approach is evaluated using the 2010 NIST SRE common condition (CC) 5 task, particularly the core-core and core-10s conditions. In order to report on diverse data, we pooled the scores of both conditions (core-core/10s).
5.1 Recognition results
Results are reported in terms of three metrics, among them the minimum decision cost function (minDCF; effective prior 0.01) and the equal-error rate (EER). Table 1 shows that conventional as-norm gives the same or better performance than the baseline system (without any score normalisation). The proposed privacy-preserving as-norm solution gives slightly worse results than conventional as-norm in terms of minDCF even though, curiously, improvements are observed in terms of the other two metrics, including the EER. For minDCF, an improvement over the baseline is also observed but without reaching the performance of the unprotected as-norm. This result confirms our research hypothesis: privacy preservation incurs only a modest performance degradation (in the minDCF sense) and, encouragingly, also improves upon the baseline system without any score normalisation.
5.2 Proof of biometric information protection
The sample embeddings as well as the cohort embeddings used for PLDA comparisons are protected via the original privacy-preserving PLDA system. As such, if biometric information protection in the form of unlinkability, irreversibility, and renewability is given by the original system, then the embeddings are protected as well. The BKs of samples and cohorts are protected by the Boolean secret sharing of the GMW protocol between two servers (cf. Section 2). Because of the information-theoretic indistinguishability of any two secret shares, unlinkability and irreversibility are guaranteed. Due to the nature of secret sharing, the protected data is also renewable; secret shares can be re-randomised with a new random bitstring.
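Renewability through re-randomisation can be illustrated with a minimal sketch, assuming XOR-based Boolean shares held as Python integers. The function name `rerandomise` is hypothetical, and how the fresh random bitstring is agreed between the two servers is deliberately left out.

```python
import secrets


def rerandomise(share_a: int, share_b: int, n_bits: int) -> tuple[int, int]:
    """Refresh two XOR shares with a new random bitstring.

    The shared secret is unchanged because the fresh mask cancels out
    when both shares are XORed together.
    """
    fresh = secrets.randbits(n_bits)
    return share_a ^ fresh, share_b ^ fresh


secret = 0b1011
old_a, old_b = 0b0110, 0b1101  # old_a XOR old_b == secret
new_a, new_b = rerandomise(old_a, old_b, 4)
assert new_a ^ new_b == secret
```

No biometric data needs to be recaptured for this update, since only the masks change.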
5.3 Complexity analysis
We implemented our secure cohort pruning architecture using the state-of-the-art SMPC framework ABY. We ran our implementations on two machines with Intel Core i9-7960X CPUs and 128 GB of RAM. To simulate real-world network conditions of the involved servers, we restricted the connection between the servers to 1 Gbit/s bandwidth and 1 ms round trip time. Results are presented in Table 1. Note that these are the online runtimes and that some additional input-independent pre-computation is required; we account for the BK extraction time (28.3 s and 3.2 s for a core-core and for a core-10s probe, respectively; for core-core/10s, the average is 16.9 s; footnote 5). The largest runtime gains in real-world network conditions for privacy-preserving score normalisation are observed for small cohorts. In other words, rather than runtimes in the order of 50 minutes, only a few minutes are necessary. In the privacy-preserving cohort pruning and as-norm, the BK extraction takes 6.4-20.0% of the runtime and the GMW pruning takes 39.2-78.0% (their time share is lower for larger cohort sizes as the runtime share of HE-PLDA increases). For the GMW pruning, all privacy-preserving az/at-norm comparisons (using BKs against the entire cohort) are carried out in less than 157 s and 52 s, respectively. These times already include the sorting of the top-50 cohort indices for pruning the HE-PLDA cohort comparisons. To prune larger cohort sizes, the privacy-preserving sorting requires additional time, e.g. from top-50 to top-400 in az-norm, an additional 126 s are necessary.
Footnote 5: BKs are extracted with Matlab on a DELL R620 with two Intel Xeon E5-2630L CPUs and 128 GB of RAM.
|Baseline ( / minDCF / EER)||0.161 / 0.410 / 4.6|
|Pruned cohort size (top-N)||50||100||150||200||250||300||400|
|Runtime top-N HE-PLDA (necessary)||16 s||32 s||48 s||64 s||80 s||96 s||128 s|
|HE-PLDA (z-norm)||3 725 s (for all 11 640 reference-cohort comparisons)|
|GMW pruning (BK: 28 s)||157 s||177 s||198 s||220 s||247 s||269 s||283 s|
|HE-PLDA (t-norm)||1 220 s (for all 3 812 cohort-probe comparisons)|
|GMW pruning (BK: 17 s)||52 s||59 s||66 s||73 s||82 s||89 s||94 s|
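The runtime figures in the table can be related to one another with a back-of-the-envelope calculation. The sketch assumes that GMW pruning time and the remaining HE-PLDA time simply add up, that the first and last table columns correspond to top-50 and top-400, and that BK extraction time is left out; these are simplifying assumptions and not the paper's own accounting. The names below are hypothetical.

```python
# Runtimes in seconds, taken from the table above.
FULL_HE_PLDA_S = {"z-norm": 3725, "t-norm": 1220}
GMW_PRUNING_S = {"z-norm": {50: 157, 400: 283}, "t-norm": {50: 52, 400: 94}}
TOP_N_HE_PLDA_S = {50: 16, 400: 128}


def speedup(norm: str, top_n: int) -> float:
    """Ratio of full-cohort HE-PLDA time to pruning plus top-N HE-PLDA time."""
    pruned = GMW_PRUNING_S[norm][top_n] + TOP_N_HE_PLDA_S[top_n]
    return FULL_HE_PLDA_S[norm] / pruned


for norm in ("z-norm", "t-norm"):
    for top_n in (50, 400):
        print(f"{norm}, top-{top_n}: about {speedup(norm, top_n):.1f}x faster")
```

Under these assumptions, the relative gain is largest for the smallest pruned cohort and shrinks as more HE-PLDA comparisons are retained.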
6 Conclusions
This paper reports the first approach to computationally manageable (yet demanding) privacy-preserving speaker recognition with cohort score normalisation. Prior to this work, the latter was a computational bottleneck for PLDA with Paillier homomorphic encryption, with normalisation strategies that require many thousands of biometric comparisons being computationally prohibitive when performed in the encrypted domain. The set of cohort data used for score normalisation is pruned using a native binary speaker representation. The pruning is outsourced via secure multi-party computation, through which the top-$N$ cohort set is selected securely. Privacy-insensitive cohort scores can then be decrypted and treated in the usual way. The cohort list is revealed to the sites capturing the reference and the probe data, respectively. This could be used by a security (not privacy) adversary to mount hill-climbing attacks; if instead the top-$N$ lists remain in the province of the biometric service owner, these top-$N$ indices serve only the intended recognition purpose. Future work could investigate the use of Yao’s garbled circuits and arithmetic SMPC protocols for carrying out ranking and matrix operations in the protected domain (including PLDA), which demand less computation but more communication between servers.
This work was supported by the BMBF and the HMWK within CRISP, the DFG as part of project E4 within the CRC 1119 CROSSING and A.1 within RTG 2050, by Omilia – Conversational Intelligence, and by the Voice Personae and RESPECT projects, both funded by the French ANR.
- [1] European Council, “Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation),” April 2016.
- [2] J. Portêlo, B. Raj, A. Abad, and I. Trancoso, “Privacy-preserving speaker verification using garbled GMMs,” in Proc. European Signal Processing Conf. (EUSIPCO). IEEE, 2014, pp. 2070–2074.
- [3] M. Paulini, C. Rathgeb, A. Nautsch, H. Reichau, H. Reininger, and C. Busch, “Multi-bit allocation: Preparing voice biometrics for template protection,” in Proc. The Speaker and Language Recognition Workshop (Odyssey), 2016, pp. 291–296.
- [4] A. Nautsch, S. Isadskiy, J. Kolberg, M. Gomez-Barrero, and C. Busch, “Homomorphic encryption for speaker recognition: Protection of biometric templates and vendor model parameters,” in Proc. The Speaker and Language Recognition Workshop (Odyssey). ISCA, 2018, pp. 16–23.
- [5] X. Anguera and J.-F. Bonastre, “A novel speaker binary key derived from anchor models,” in Proc. Annual Conf. of the Intl. Speech Communication Association (INTERSPEECH). ISCA, 2010, pp. 2118–2121.
- [6] M. Blanton and P. Gasti, “Secure and efficient protocols for iris and fingerprint identification,” in Proc. European Symposium on Research in Computer Security (ESORICS). Springer, 2011, pp. 190–209.
- [7] J. Bringer, H. Chabanne, M. Favre, A. Patey, T. Schneider, and M. Zohner, “GSHADE: faster privacy-preserving distance computation and biometric identification,” in Proc. ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec). ACM, 2014, pp. 187–198.
- [8] ISO/IEC JTC1 SC27 Security Techniques, ISO/IEC 24745:2011. Information Technology - Security Techniques - Biometric Information Protection, International Organization for Standardization, 2011.
- [9] A. C. Yao, “Protocols for secure computations,” in Proc. Annual Symposium on Foundations of Computer Science (SFCS). IEEE, 1982, pp. 160–164.
- [10] O. Goldreich, S. Micali, and A. Wigderson, “How to play any mental game,” in Proc. ACM Symposium on Theory of Computing (STOC). ACM, 1987, pp. 218–229.
- [11] M. Hastings, B. Hemenway, D. Noble, and S. Zdancewic, “SoK: General-purpose compilers for secure multi-party computation,” in Proc. IEEE Symposium on Security and Privacy (S&P). IEEE, 2019, full version: https://marsella.github.io/static/mpcsok.pdf.
- [12] D. Demmler, T. Schneider, and M. Zohner, “ABY-A framework for efficient mixed-protocol secure two-party computation,” in Proc. Network and Distributed System Security Symposium (NDSS). The Internet Society, 2015.
- [13] P. Smaragdis and M. Shashanka, “A framework for secure speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 15, no. 4, pp. 1404–1413, 2007.
- [14] M. Pathak and B. Raj, “Privacy-preserving speaker verification and identification using Gaussian mixture models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 21, no. 2, pp. 397–406, 2013.
- [15] M. Aliasgari and M. Blanton, “Secure computation of hidden Markov models,” in Proc. Intl. Conf. on Security and Cryptography (SECRYPT). IEEE, 2013, pp. 1–12.
- [16] M. Aliasgari, M. Blanton, and F. Bayatbabolghani, “Secure computation of hidden Markov models and secure floating-point arithmetic in the malicious model,” Intl. Journal of Information Security, vol. 16, no. 6, pp. 577–601, 2017.
- [17] S. Kamara and M. Raykova, “Secure outsourced computation in a multi-tenant cloud,” in Proc. IBM Workshop on Cryptography and Security in Clouds, 2011, pp. 15–16.
- [18] F. Brasser, T. Frassetto, K. Riedhammer, A.-R. Sadeghi, T. Schneider, and C. Weinert, “VoiceGuard: Secure and private speech processing,” in Proc. Annual Conf. of the Intl. Speech Communication Association (INTERSPEECH). ISCA, 2018, pp. 1303–1307.
- [19] F. McKeen, I. Alexandrovich, A. Berenzon, C. V. Rozas, H. Shafi, V. Shanbhogue, and U. R. Savagaonkar, “Innovative instructions and software model for isolated execution,” in Proc. Workshop on Hardware and Architectural Support for Security and Privacy (HASP). ACM, 2013.
- [20] S. Billeb, C. Rathgeb, H. Reininger, K. Kasper, and C. Busch, “Biometric template protection for speaker recognition based on universal background models,” IET Biometrics, vol. 4, no. 2, pp. 116–126, 2015.
- [21] J.-F. Bonastre, P.-M. Bousquet, D. Matrouf, and X. Anguera, “Discriminant binary data representation for speaker recognition,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 5284–5287.
- [22] T. Merlin, J.-F. Bonastre, and C. Fredouille, “Non directly acoustic process for costless speaker recognition and indexation,” in Proc. Intl. Workshop on Intelligent Communication Technologies and Applications, vol. 29, 1999.
- [23] Y. Mami and D. Charlet, “Speaker identification by location in an optimal space of anchor models,” in Proc. Intl. Conf. on Spoken Language Processing (ICSLP), 2002.
- [24] J. Luque and X. Anguera, “On the modeling of natural vocal emotion expressions through binary key,” in Proc. European Signal Processing Conference (EUSIPCO). IEEE, 2014, pp. 1562–1566.
- [25] J. Patino, H. Delgado, and N. Evans, “Speaker change detection using binary key modelling with contextual information,” in Proc. Intl. Conf. on Statistical Language and Speech Processing (ICSLP). Springer, 2017, pp. 250–261.
- [26] H. Delgado, X. Anguera, C. Fredouille, and J. Serrano, “Fast single-and cross-show speaker diarization using binary key speaker modeling,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 12, pp. 2286–2297, 2015.
- [27] J. Patino, H. Delgado, and N. Evans, “The EURECOM submission to the first DIHARD challenge,” in Proc. Annual Conf. of the Intl. Speech Communication Association (INTERSPEECH). ISCA, 2018, pp. 2813–2817.
- [28] A. Mtibaa, D. Petrovska-Delacretaz, and A. B. Hamida, “Cancelable speaker verification system based on binary Gaussian mixtures,” in Proc. Advanced Technologies for Signal and Image Processing (ATSIP), 2018, pp. 1–6.
- [29] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 19, no. 4, pp. 788–798, 2011.
- [30] E. Barker, “NIST special publication 800–57 part 1, revision 4,” 2016.
- [31] J. Boyar and R. Peralta, “The exact multiplicative complexity of the Hamming weight function,” in Proc. Electronic Colloquium on Computational Complexity (ECCC), 2005.
- [32] K. Järvinen, H. Leppäkoski, E. S. Lohan, P. Richter, T. Schneider, O. Tkachenko, and Z. Yang, “PILOT: Practical privacy-preserving Indoor Localization using OuTsourcing,” in Proc. IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 2019, to appear. Preliminary version: https://encrypto.de/papers/JLLRSTY19.pdf.
- [33] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB.