The extensive use of private and personally identifiable information in modern statistical (and machine learning) applications can present an obstacle to individuals contributing their data to research. As just one example, when considering contribution to biobanks, Kaufman et al. (2009) reported that 90% of respondents had privacy concerns. Addressing these concerns is paramount if the participation rate in biomedical and genetic research is to be increased, especially for government and industry where public trust is lower (Kaufman et al., 2009). Indeed, industry is on the brink of embarking on biomedical applications on a scale never before witnessed via the impending wave of so-called ‘wearable devices’ such as smart watches, which present serious privacy concerns. Companies hope to market the ability to monitor and track vital health signs round the clock, perhaps fitting classification models to alert users to different health concerns of interest. However, such constrained devices will almost certainly leverage ‘cloud’ services, uploading reams of private health diagnostics to corporate servers. Herein, it is demonstrated how recent advances in cryptography allow individual privacy to be preserved, whilst still enabling researchers and industry to incorporate such data into statistical analyses.
Moreover, the current explosion in cloud computing platforms promises to enable researchers and businesses to divest themselves of complex in-house compute server setups, but requires one to vest all trust in the cloud provider maintaining confidentiality of the data.
One way to ensure trust in the scenarios above is through storage and disclosure of only secure, encrypted data. Encryption is a technique whereby data, termed a message in cryptography, is mathematically transformed using an encryption key to produce a cipher text. The cipher text can only easily be decrypted to reveal the original data if the corresponding decryption key is known. Therefore, a cipher text can be stored openly without compromising privacy so long as the decryption key is kept secret.
From a data science perspective, the problem with employing cryptographic methods to improve trust is that the data must at some point be decrypted for use in a statistical analysis. However, recent cryptography research in the areas of homomorphic and functional encryption is showing exciting potential to bypass this. An encryption scheme is said to be homomorphic if certain mathematical operations can be applied directly to the cipher text in such a way that decrypting the result renders the same answer as applying the function to the original unencrypted data.
The remarkable properties of homomorphic encryption schemes are not without limitations, which typically include slow evaluation and the fact that the set of functions which can be computed in cipher text space is very restricted. However, by understanding the constraints and restrictions it is hoped that statistics researchers can assist in the research effort, adapting statistical techniques to be amenable to homomorphic computation by making and quantifying reasonable approximations in those situations where a traditional approach cannot be implemented homomorphically.
There are reviews and introductions to homomorphic encryption aimed at different audiences and each with a different emphasis (Gentry, 2010; Vaikuntanathan, 2011; Sen, 2013; Silverberg, 2013). The aim of this paper is to provide statisticians and machine learners with sufficient background to become involved in developing methodology specifically crafted to homomorphic computation. As part of this effort we describe an accompanying high performance R package providing an easy to use reference implementation as a core contribution of this work. In a sister publication (Aslett, Esperança and Holmes, 2015) we present some novel statistical machine learning techniques developed to be amenable to fitting and prediction on encrypted data.
In Section 2 homomorphic encryption is introduced covering the salient features for statistical work without drifting too far into cryptography theory unnecessarily, although full references and resources are provided for further reading. Section 3 reviews the statistical techniques which have been successfully implemented in the cryptography literature and existing software implementations of homomorphic schemes. Section 4 describes a high-level easy to use software implementation available as an R package (Aslett, 2014).
2 Homomorphic encryption
This section presents an introduction to homomorphic encryption with an emphasis on details and limitations which are pertinent to applying statistics and machine learning methodology.
2.1 Background on encryption
An unencrypted number, $m \in M$, is referred to as a message, while the encrypted version, $c \in C$, is the cipher text, where $M$ and $C$ are the message space and cipher text space respectively. Typically $M = \mathbb{Z}$, the integers or similar, whilst $C$ will depend on the encryption algorithm being used. A given encryption scheme then utilises keys in order to map the message into a cipher text and to recover the message from a cipher text. There are two approaches: either there is a single secret key, or there are a public and secret key. In the single secret key scheme the same key is used to map messages to cipher texts and vice versa, so this key must be kept private at all times. Conversely, a scheme which also has a public key uses that key to map messages to cipher texts, but uses the secret key to map back: consequently the public key can be openly disclosed. Hereinafter, only public key schemes are considered.
Fundamentally encryption can be treated as simply a mapping which takes a message $m$ and a public key, $k_p$, and produces the cipher text, $c \leftarrow \text{Enc}(k_p, m)$. Notationally, $\leftarrow$ is used to signify assignment rather than equality, since encryption is not necessarily a function in the mathematical sense: any fixed inputs $m$ and $k_p$ will produce many different cipher texts. Indeed, this is a desirable property for public key encryption schemes, referred to as semantic security: a scheme is semantically secure if knowledge of $c$ for some $m$ has vanishingly small probability of revealing further information about any other encrypted message. Informally, this means repeated encryption of the same message will render different and seemingly unrelated cipher texts each time with high probability. Clearly, if encryption were a deterministic function of $m$ for fixed $k_p$, then any public key encryption scheme with a modestly sized message space could be trivially compromised, since anyone holding the public key could encrypt every possible message and compare the results with the cipher text of interest. Semantic security is achieved by introducing randomness into the cipher text which is sufficiently small not to interfere with correct decryption when in possession of the secret key, $k_s$, but, as will become apparent in the sequel, this essential feature imposes a handicap on all currently known homomorphic schemes.
Conversely, decryption is a function which renders the original message, $m = \text{Dec}(k_s, c)$. The crucial relation satisfied by any encryption scheme is therefore:
$$\text{Dec}(k_s, \text{Enc}(k_p, m)) = m \quad \forall\, m \in M$$
Consequently, the security of an encryption scheme is based on the hardness of recovering $m$ given knowledge of only $c$ and $k_p$. Some schemes are based on empirical hardness assumptions about particular problems, whilst others may rely on settings where the hardness can be rigorously proven.
This is a simplification of general cryptographic schemes, since some of the most important algorithms, such as the current industry standard Advanced Encryption Standard (AES) (Daemen and Rijmen, 2002), do not normally operate value-by-value but rather on blocks of binary data. However, it encompasses the class of algorithms to be discussed in what follows.
2.2 Homomorphic encryption
The term homomorphic encryption describes a class of encryption algorithms which satisfy the homomorphic property: that is certain operations, such as addition, can be carried out on cipher texts directly so that upon decryption the same answer is obtained as operating on the original messages. In simple terms, were one to encrypt the numbers 2 and 3 separately and ‘add’ the cipher texts, then decryption of the result would yield 5. This is a special property not enjoyed by standard encryption schemes where decrypting the sum of two cipher texts would generally render nonsense.
More precisely, an encryption scheme is said to be homomorphic for some operations $\circ$ acting in message space (such as addition) if there are corresponding operations $\diamond$ acting in cipher text space satisfying the property:
$$\text{Dec}(k_s, \text{Enc}(k_p, m_1) \diamond \text{Enc}(k_p, m_2)) = m_1 \circ m_2 \quad \forall\, m_1, m_2 \in M$$
For example, the simple scheme in Gentry (2010) describes a method where $\circ \in \{+, \times\}$ and correspondingly $\diamond \in \{+, \times\}$, though there is no restriction that the operations must correspond in all schemes. For example, Paillier encryption (Paillier, 1999) is homomorphic only for addition, with $\circ = +$ but where $\diamond = \times$.
Note this is not a group homomorphism in the mathematical sense, since the property does not commute when starting instead from cipher texts, due to semantic security. That is, because the same message encrypts to different cipher texts with high probability, in general:
$$\text{Enc}(k_p, m_1 \circ m_2) \ne \text{Enc}(k_p, m_1) \diamond \text{Enc}(k_p, m_2)$$
Moreover, generally . Another consequence of semantic security is that operations performed on the cipher text may increase the noise level, so that only a limited number of operations can be consecutively performed before the noise must be reduced.
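The additive homomorphism of Paillier encryption mentioned above can be illustrated with a minimal sketch. The primes, function names and the choice of generator $n + 1$ below are assumptions made purely for illustration; the key is far too small to be secure and this is not the implementation of any particular library.

```python
import math
import random

def paillier_keygen(p=101, q=113):
    """Toy key generation from two tiny primes (insecure, illustration only)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)   # secret exponent
    mu = pow(lam, -1, n)           # valid because the generator is fixed to n + 1
    return n, (lam, mu)

def paillier_encrypt(n, m, rng):
    """Enc(m) = (1 + n)^m * r^n mod n^2 with fresh randomness r each call."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def paillier_decrypt(n, secret, c):
    lam, mu = secret
    ell = (pow(c, lam, n * n) - 1) // n   # L(x) = (x - 1) / n
    return (ell * mu) % n

rng = random.Random(1)
n, secret = paillier_keygen()
c2 = paillier_encrypt(n, 2, rng)
c3 = paillier_encrypt(n, 3, rng)

# 'Adding' messages corresponds to multiplying cipher texts modulo n^2.
c_sum = (c2 * c3) % (n * n)
print(paillier_decrypt(n, secret, c_sum))

# Semantic security: repeated encryptions of the same message differ.
repeats = {paillier_encrypt(n, 2, rng) for _ in range(5)}
print(len(repeats) > 1)
```

Decryption of the product of the two cipher texts recovers the sum of the messages, while the randomness in each encryption means the same message does not map to a single cipher text.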
The possibility of homomorphic encryption was proposed by Rivest, Adleman and Dertouzos (1978) and many schemes that supported either multiplication (such as RSA (Rivest, Shamir and Adleman, 1978), ElGamal (ElGamal, 1985), etc) or addition (such as Goldwasser-Micali (Goldwasser and Micali, 1982), Paillier (Paillier, 1999), etc) were found. However, in many of these the number of times one could add or multiply was limited and a scheme supporting both operations simultaneously was elusive (Boneh et al. (2005) came closest, allowing unlimited additions and a single multiplication). It was not until 2009 that the three decade old problem was solved in seminal work by Gentry (2009), where he showed addition, multiplication and control of the noise growth were all possible. This sparked a cascade of work on fully homomorphic schemes: that is, those where a theoretically unlimited number of addition and multiplication operations are possible. This modern era of homomorphic encryption is briefly summarised in Appendix A.
The advent of a scheme capable of evaluating both addition and multiplication a (theoretically) arbitrary number of times led to a surge of optimism, since then any polynomial can be computed and so the output of any suitably smooth function could in principle be arbitrarily closely approximated. Moreover, if $M = \{0, 1\}$ then addition corresponds to logical XOR, and multiplication corresponds to logical AND, which is sufficient to construct arbitrary binary circuits so that, in principle, anything which can be evaluated by a computer can be represented by an algorithm which will run on homomorphically encrypted data. However, caution is needed here regarding practicality: performing just a 32-bit integer addition using a simple ripple-carry adder design involves 32 full adders, each requiring 3 XORs, 2 ANDs and an OR (itself equivalent to 2 XORs and 1 AND), giving 256 fundamental operations just to add two integers, an avenue it will become clear is impractical with current homomorphic schemes.
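The operation count for the ripple-carry adder can be checked with a short sketch working on plain bits. The class and function names are hypothetical, and the sketch deliberately recomputes the intermediate XOR in the carry (rather than reusing it) so that each full adder uses 3 XORs, 2 ANDs and an OR as in the count above.

```python
class GateCounter:
    """Counts the two operations available on binary messages."""

    def __init__(self):
        self.xors = 0
        self.ands = 0

    def xor(self, a, b):
        # Corresponds to homomorphic addition modulo 2.
        self.xors += 1
        return (a + b) % 2

    def and_(self, a, b):
        # Corresponds to homomorphic multiplication modulo 2.
        self.ands += 1
        return (a * b) % 2

    def or_(self, a, b):
        # a OR b = (a XOR b) XOR (a AND b): 2 XORs and 1 AND.
        return self.xor(self.xor(a, b), self.and_(a, b))


def full_adder(g, a, b, carry_in):
    total = g.xor(g.xor(a, b), carry_in)                        # 2 XORs
    carry = g.or_(g.and_(a, b), g.and_(carry_in, g.xor(a, b)))  # 1 XOR, 2 ANDs, 1 OR
    return total, carry


def ripple_carry_add(g, x, y, bits=32):
    """Add two unsigned integers bit by bit, discarding the final carry."""
    carry, result = 0, 0
    for i in range(bits):
        s, carry = full_adder(g, (x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result


gates = GateCounter()
answer = ripple_carry_add(gates, 123456789, 987654321)
print(answer, gates.xors, gates.ands, gates.xors + gates.ands)
```

Of the 256 operations, the 96 ANDs are the costly ones, since they correspond to homomorphic multiplications.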
2.3 The scheme of Fan and Vercauteren (2012)
To make these ideas more concrete the particular scheme of Fan and Vercauteren (2012) (hereinafter FandV) will now be described. A high performance, easy to use implementation of the same is a contribution of this technical report as discussed in Section 4.
FandV is a fully homomorphic scheme where the message space accommodates representation of large subsets of $\mathbb{Z}$ (not just binary messages), and a cipher text is a pair of large polynomials. Its security is based on the hardness of the ring Learning With Errors (LWE) problem (Lyubashevsky et al., 2010) which is connected to classical cryptography hardness results (such theory would be a diversion: for a short description see Appendix B).
To simplify the presentation for a statistics audience, some minor simplifying restrictions are made to the original scheme as will be explained. The reader may safely skip to Section 2.4 if the following mathematical details of this example encryption scheme are not of interest.
$\mathbb{Z}_q$ is the set of integers $(-q/2, q/2]$ and $[a]_q$ denotes the unique integer in $\mathbb{Z}_q$ which is equal to $a \bmod q$. $\mathbb{Z}[x]$ and $\mathbb{Z}_q[x]$ denote polynomials whose coefficients belong to $\mathbb{Z}$ and $\mathbb{Z}_q$ respectively. Then, for a fixed value $d$, the primary objects of interest in the scheme are the polynomial rings $R = \mathbb{Z}[x]/\Phi_{2^d}(x)$ and $R_q = \mathbb{Z}_q[x]/\Phi_{2^d}(x)$, where $\Phi_{2^d}(x) = x^{2^{d-1}} + 1$ is the $2^d$-th cyclotomic polynomial (see footnote 1). The restriction to $2^d$-th cyclotomic polynomials here is for the convenience of their form, the computational efficiencies of reducing a polynomial modulo this form, and for the simplicity of generating random polynomials modulo this form which satisfy ring LWE hardness results (although theoretically FandV can be modulo any monic irreducible polynomial).
Footnote 1: In simple terms, $\Phi_n(x)$, the $n$-th cyclotomic polynomial, is the polynomial which: divides $x^n - 1$; does not divide $x^k - 1$ for any $k < n$; has integer coefficients; and cannot be factorised. For example, $\Phi_4(x) = x^2 + 1$ because $x^4 - 1 = (x^2 + 1)(x^2 - 1)$, but it does not divide $x^3 - 1$ or $x^2 - 1$, it has integer coefficients and it cannot be factorised.
To distinguish polynomials, they will be underscored if not written in functional form, $\underline{a} \equiv a(x)$. Polynomial multiplication will be emphasised, $\underline{a} \cdot \underline{b}$, and all such multiplication takes place within the ring $R$. $[\underline{a}]_q$ indicates the centred reduction above applied to each coefficient of $\underline{a}$ individually, so that $[\underline{a}]_q \in R_q$.
The randomness to be introduced for semantic security comes via the bounded discrete Gaussian distribution, defined to be the probability mass function proportional to $\exp(-x^2 / 2\sigma^2)$ over the integers from $-B$ to $B$, where the bound $B$ is typically a multiple of $\sigma$. For the special choice of polynomial modulo above, the corresponding multivariate distribution denoted $\chi$ on $R$ then involves simply generating each coefficient of $\underline{e} \in R$ from a bounded discrete Gaussian distribution. This simple sampling procedure arises due to the modulo $\Phi_{2^d}(x)$, which ensures that the coefficients are all independent after modular reduction. Reducing modulo an arbitrary monic irreducible polynomial can introduce dependencies between coefficients, which then cease to satisfy the assumptions underlying the hardness results of ring-LWE (Lyubashevsky et al., 2010), leading to more complex sampling procedures.
If $\underline{a}$ is a uniform random draw from $R_q$ this is denoted $\underline{a} \sim R_q$, or correspondingly if $\underline{e}$ is a draw from the multivariate bounded discrete Gaussian induced on $R$, $\chi$, this is denoted $\underline{e} \sim \chi$.
2.3.2 The encryption scheme
The message space of this scheme is the polynomial ring $R_t$. Thus any integer message $m$ must be converted to a polynomial representation $\underline{m} \in R_t$. In principle, if $m$ is small enough that $m \in \mathbb{Z}_t$, then the degree zero polynomial is sufficient. However, there are reasons which will become apparent that this is undesirable even when $m$ is small enough (or $t$ is large enough).
A better approach is to take an integer $m$ to be encrypted, write it in standard $n$-bit binary representation, $m = \sum_{i=0}^{n-1} b_i 2^i$, and then simply construct $m(x) = \sum_{i=0}^{n-1} b_i x^i$ where $b_i \in \{0, 1\}$. Recovery of the original message after decryption is then simply evaluation of $m(2)$, because homomorphic addition and multiplication operations will correspond to operations on the polynomials preserving the end result. This representation is assumed here and is used automatically in the software contribution of Section 4.
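The binary polynomial representation can be sketched in a few lines. The function names are hypothetical and the sketch ignores the modular reductions of the real scheme (it works with plain integer coefficient lists), so it only illustrates why evaluating at $x = 2$ recovers the right answer after polynomial addition and multiplication.

```python
def encode(m, n_bits=8):
    """Coefficients b_0, ..., b_{n-1} of the binary expansion of m >= 0."""
    return [(m >> i) & 1 for i in range(n_bits)]

def poly_add(a, b):
    """Coefficient-wise sum of two polynomials given as coefficient lists."""
    size = max(len(a), len(b))
    a = a + [0] * (size - len(a))
    b = b + [0] * (size - len(b))
    return [x + y for x, y in zip(a, b)]

def poly_mul(a, b):
    """Ordinary polynomial product (no reduction modulo a cyclotomic)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def decode(poly):
    """Evaluate the polynomial at x = 2 to recover the integer."""
    return sum(coef * 2**i for i, coef in enumerate(poly))

p13, p7 = encode(13), encode(7)
print(decode(poly_add(p13, p7)), decode(poly_mul(p13, p7)))
print(max(poly_add(p13, p7)))   # coefficients stay small even though the value grows
```

The coefficients of the sum never exceed 2 here, whereas the integer it represents is 20; it is the coefficient size, not the represented value, that must stay within the message coefficient range.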
The cipher text space is the Cartesian product of two polynomial rings $R_q \times R_q$, where $q > t$. As will be seen, the message polynomial is essentially embedded in the most significant bits of the first polynomial in the cipher text $c$, with the random noise growing from the least significant bits. Once the noise grows under repeated operations and reaches the most significant bits the message is lost.
The parameters of the scheme are: $d$, determining the degree of both the polynomial rings $R_t$ and $R_q$; $t$ and $q$, determining the coefficient sets of the polynomial rings $R_t$ and $R_q$; and $\sigma$, determining the magnitude of the randomness used for semantic security.
An example of values which ensure good security would be ( degree polynomials), , , (Fan and Vercauteren, 2012). The software contribution of Section 4 provides functions to help select these parameters automatically based on lower bounds of security and computability they provide.
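Key Generation: The secret key, $\underline{k_s}$, is simply a uniform random draw from $R_2$ (i.e. sample a binary vector for the polynomial coefficients).
The public key, $k_p$, is a vector containing two polynomials:
$$k_p = (\underline{p_0}, \underline{p_1}) = \left( [-(\underline{a} \cdot \underline{k_s} + \underline{e})]_q, \; \underline{a} \right)$$
where $\underline{a} \sim R_q$ and $\underline{e} \sim \chi$. Note $\underline{k_s}$ is hard to extract from $k_p$ precisely due to ring LWE hardness results (see Appendix B).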
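Encryption, $\text{Enc}(k_p, m)$: An integer message $m$ is first represented as $\underline{m} \in R_t$ as described above. Encryption then renders a cipher text which is a vector containing two polynomials:
$$c = (\underline{c_0}, \underline{c_1}) = \left( [\underline{p_0} \cdot \underline{u} + \underline{e_1} + \Delta \cdot \underline{m}]_q, \; [\underline{p_1} \cdot \underline{u} + \underline{e_2}]_q \right)$$
where $\underline{u}, \underline{e_1}, \underline{e_2} \sim \chi$ and $\Delta = \lfloor q/t \rfloor$, with $(\underline{p_0}, \underline{p_1})$ the two polynomials of the public key $k_p$.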
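Decryption, $\text{Dec}(k_s, c)$: Decryption of a cipher text $c = (\underline{c_0}, \underline{c_1})$ is by evaluating:
$$\underline{\tilde{m}} = \left[ \left\lfloor \frac{t \, [\underline{c_0} + \underline{c_1} \cdot \underline{k_s}]_q}{q} \right\rceil \right]_t$$
so that $m = \tilde{m}(2)$.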
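Addition, $+$: Addition in message space is achieved in cipher text space by standard vector and polynomial addition with modulo reduction:
$$c + c' = \left( [\underline{c_0} + \underline{c'_0}]_q, \; [\underline{c_1} + \underline{c'_1}]_q \right)$$
It is an easy and enlightening exercise to verify by hand that $\text{Dec}(k_s, c + c')$ renders $\underline{m} + \underline{m'}$.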
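Multiplication, $\times$: Multiplication in message space produces a more complex operation in cipher text space which increases the length of the cipher text vector:
$$c \times c' = \left( \left[ \left\lfloor \frac{t \, (\underline{c_0} \cdot \underline{c'_0})}{q} \right\rceil \right]_q, \; \left[ \left\lfloor \frac{t \, (\underline{c_0} \cdot \underline{c'_1} + \underline{c_1} \cdot \underline{c'_0})}{q} \right\rceil \right]_q, \; \left[ \left\lfloor \frac{t \, (\underline{c_1} \cdot \underline{c'_1})}{q} \right\rceil \right]_q \right)$$
Although it is still possible to recover the message from one of these larger cipher texts $(\underline{c_0}, \underline{c_1}, \underline{c_2})$ by modifying the decryption function to be $\left[ \left\lfloor t \, [\underline{c_0} + \underline{c_1} \cdot \underline{k_s} + \underline{c_2} \cdot \underline{k_s} \cdot \underline{k_s}]_q / q \right\rceil \right]_t$, it is preferable to perform a ‘relinearisation’ procedure which compacts the cipher text to a vector of two polynomials again and reverts to the original decryption procedure. Thus in practice multiplication is a two step procedure: cipher text multiplication followed by relinearisation. Description of relinearisation is beyond the scope of this review, but full details are in Fan and Vercauteren (2012) and it is seamlessly implemented in the software contribution described in Section 4.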
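To make the key generation, encryption, decryption, addition and (unrelinearised) multiplication steps concrete, the following is a minimal toy sketch following the standard form of the Fan and Vercauteren (2012) scheme. Everything here is an assumption for illustration: the parameter values are tiny and insecure, the small random polynomials are drawn uniformly from $[-B, B]$ as a stand-in for the bounded discrete Gaussian, the function names are hypothetical, and this is not the code of the R package described in Section 4.

```python
import random

N = 16           # ring is Z[x]/(x^N + 1): toy size, far too small to be secure
Q = 2**60        # cipher text coefficient modulus (assumed toy value)
T = 2**6         # message coefficient modulus (assumed toy value)
DELTA = Q // T   # scaling that places the message in the most significant bits
B = 3            # bound on the small random coefficients

def centred(a, mod):
    """Centred reduction of each coefficient into (-mod/2, mod/2]."""
    out = []
    for x in a:
        r = x % mod
        out.append(r - mod if r > mod // 2 else r)
    return out

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scale(a, k):
    return [k * x for x in a]

def mul(a, b):
    """Polynomial product reduced modulo x^N + 1 (so x^N = -1)."""
    out = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < N:
                out[i + j] += x * y
            else:
                out[i + j - N] -= x * y
    return out

def small(rng):
    """Stand-in for the bounded discrete Gaussian: uniform on [-B, B]."""
    return [rng.randint(-B, B) for _ in range(N)]

def keygen(rng):
    s = [rng.randint(0, 1) for _ in range(N)]       # binary secret key
    a = [rng.randrange(Q) for _ in range(N)]        # uniform polynomial
    p0 = centred(scale(add(mul(a, s), small(rng)), -1), Q)
    return s, (p0, centred(a, Q))

def encrypt(pk, m, rng):
    p0, p1 = pk
    u, e1, e2 = small(rng), small(rng), small(rng)
    msg = [(m >> i) & 1 for i in range(N)]          # binary polynomial encoding
    c0 = centred(add(add(mul(p0, u), e1), scale(msg, DELTA)), Q)
    c1 = centred(add(mul(p1, u), e2), Q)
    return [c0, c1]

def round_scale(a):
    """Round T * a / Q to the nearest integer, coefficient by coefficient."""
    return [(2 * T * x + Q) // (2 * Q) for x in a]

def decrypt(s, c):
    acc, s_power = [0] * N, [1] + [0] * (N - 1)
    for part in c:                   # c0 + c1*s (+ c2*s^2 after a product)
        acc = add(acc, mul(part, s_power))
        s_power = mul(s_power, s)
    poly = centred(round_scale(centred(acc, Q)), T)
    return sum(coef * 2**i for i, coef in enumerate(poly))   # evaluate at x = 2

def he_add(c, d):
    return [centred(add(x, y), Q) for x, y in zip(c, d)]

def he_mul(c, d):
    """Unrelinearised product: the result has three polynomials."""
    parts = [mul(c[0], d[0]),
             add(mul(c[0], d[1]), mul(c[1], d[0])),
             mul(c[1], d[1])]
    return [centred(round_scale(p), Q) for p in parts]

rng = random.Random(0)
secret, public = keygen(rng)
c13, c7 = encrypt(public, 13, rng), encrypt(public, 7, rng)
print(decrypt(secret, c13),
      decrypt(secret, he_add(c13, c7)),
      decrypt(secret, he_mul(c13, c7)))
```

The generic loop in `decrypt` handles both the usual two-polynomial cipher text and the three-polynomial result of a multiplication, mirroring the modified decryption function described above; a practical implementation would relinearise instead.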
2.3.3 A practical note
Above, a binary polynomial representation of integers was proposed as being preferable to a scalar (zero degree polynomial) representation (i.e. a natural number), even when the message is small enough that $m \in \mathbb{Z}_t$, the reason for which should now be clearer.
Consider the addition operation with the example parameters given above, recall that each coefficient of must lie in the range to after computation in order to decrypt correctly, and note that the addition operation results in direct addition of coefficients in the polynomial representations. Now, bearing these points in mind, if then addition will only render the correct answer so long as the overall final result also remains in the range to . However, with a binary representation the largest coefficient of any term in will be , so that at least additions (possibly more) can be performed and still guaranteed to decrypt correctly, furthermore allowing the final result, , to be much larger than . Not only is this more additions, but more importantly the binary representation allows a general hard bound for how many additions can be performed while still guaranteeing the correct value is decrypted, without knowledge of the messages.
2.4 Some limitations
At this juncture it is important to temper any building excitement. Although Gentry (2009) theoretically provided an exemplar for how fully homomorphic schemes could be constructed, the extraordinary theoretical possibilities are constrained by practical limitations. These crucial limitations mean that it is not simply a matter of taking any algorithm and converting it to run on encrypted data, so that many statistical algorithms are in fact beyond the computational reach of existing homomorphic schemes.
The limitations discussed now are in general common to all current homomorphic schemes to a varying degree, though specific homomorphic encryption algorithms may have their own additional constraints. In each case, the limitation will be highlighted in the context of the scheme described in Section 2.3.
2.4.1 Message space
There are currently no schemes which will directly encrypt arbitrary values in . Indeed, the most common message space is simply binary, , with this being of particular appeal to theoretical cryptographers because it corresponds to construction of arbitrary Boolean circuits and allows all the results in computational complexity theory to be applied to determine computability. However, from a practical standpoint this is not presently a very feasible avenue.
However, there are schemes which have an expanded message space, such as , or for some integer . These schemes generally correspond to integer rings or fields (for prime ) where ordinary rules of arithmetic can be assumed when results are bounded by . In many schemes which support expanded message spaces, increasing will impact the capabilities of the scheme (decreasing security, computation speed, computational depth or all these).
A method which can be used to increase the size of the message space is via the Chinese Remainder Theorem as a means of representing a large integer.
Chinese Remainder Theorem (Knuth, 1997, p.270) Let $n_1, \dots, n_r$ be pairwise coprime positive integers. Let $N = \prod_{j=1}^{r} n_j$ and let $a, u_1, \dots, u_r \in \mathbb{Z}$. Then there is exactly one integer $u$ that satisfies the conditions:
$$a \le u < a + N \quad \text{and} \quad u \equiv u_j \pmod{n_j} \quad \forall\, j \in \{1, \dots, r\}$$
Thus, an integer message $m$ with $0 \le m < N$ can be uniquely represented by the collection of smaller integers $m_j = m \bmod n_j$, called the residues. More formally, $\mathbb{Z}/N\mathbb{Z} \cong \mathbb{Z}/n_1\mathbb{Z} \times \dots \times \mathbb{Z}/n_r\mathbb{Z}$. So, if each $n_j$ is chosen small enough that the scheme can encrypt the corresponding residue, then much larger message spaces can be achieved by encrypting the collection of residues. The process is reversible so that the value $m$ can be recovered given $(m_1, \dots, m_r)$ (Knuth, 1997, p.274). Such a representation is called a residue number system (Garner, 1959) and has the additional advantage that addition and multiplication operations (the only ones which can be performed homomorphically anyway) are embarrassingly parallel: performing the same operation according to the modular arithmetic of each residue will result in a residue representation of the corresponding result of operating on the large integers.
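A residue number system can be sketched as follows. The moduli and function names are assumptions chosen for illustration; in an encrypted setting each residue would be a separate cipher text and the component-wise operations would be carried out homomorphically.

```python
from math import prod

MODULI = [97, 101, 103]   # pairwise coprime (distinct primes); product 1,009,091

def to_residues(x):
    """Represent x by its residues modulo each modulus."""
    return [x % n for n in MODULI]

def from_residues(residues):
    """Recover the unique integer in [0, prod(MODULI)) with these residues."""
    big_n = prod(MODULI)
    total = 0
    for r, n in zip(residues, MODULI):
        partial = big_n // n
        total += r * partial * pow(partial, -1, n)   # modular inverse (Python 3.8+)
    return total % big_n

def rns_add(a, b):
    return [(x + y) % n for x, y, n in zip(a, b, MODULI)]

def rns_mul(a, b):
    return [(x * y) % n for x, y, n in zip(a, b, MODULI)]

a, b = to_residues(1234), to_residues(567)
print(from_residues(rns_add(a, b)), from_residues(rns_mul(a, b)))
```

Each residue is handled independently, which is what makes the operations embarrassingly parallel; the results are only correct while the true answer remains below the product of the moduli.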
Related, and more common in the homomorphic encryption literature, is the reverse usage of the polynomial version of the Chinese Remainder Theorem, which enables combining multiple messages into a single polynomial representation (that is, a single message polynomial now holds multiple plain text messages before encryption), so that an operation on the single cipher text operates on all the messages simultaneously, in a manner akin to Single Instruction Multiple Data (SIMD) instructions on a CPU (Smart and Vercauteren, 2014). This of course reduces rather than increases the possible range of individual messages which can be encrypted.
Even if using the Chinese remainder theorem to represent larger values, the issue remains of how to handle statistical data, which is commonly not binary or integer. There are at least two approaches: the first is common throughout the literature, whereby any real value is approximated by some rational number, with numerator and denominator encrypted separately and propagated through using the usual rules of arithmetic for fractions. The second is a logarithmic representation developed by Franz et al. (2010), in which division is possible but where addition and subtraction become substantially more complex to implement.
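The first approach, carrying a separate numerator and denominator through the computation, needs only integer addition and multiplication. A minimal sketch follows; the function names and the fixed scale of 1000 are assumptions for illustration, and plain integers stand in for encrypted values.

```python
from fractions import Fraction

def to_fraction(x, scale=1000):
    """Approximate a real value by an integer numerator over a fixed denominator."""
    return (round(x * scale), scale)

def frac_add(a, b):
    """(n1/d1) + (n2/d2) using only integer multiplication and addition."""
    (n1, d1), (n2, d2) = a, b
    return (n1 * d2 + n2 * d1, d1 * d2)

def frac_mul(a, b):
    """(n1/d1) * (n2/d2) using only integer multiplication."""
    (n1, d1), (n2, d2) = a, b
    return (n1 * n2, d1 * d2)

x, y = to_fraction(0.25), to_fraction(1.5)
total, product = frac_add(x, y), frac_mul(x, y)

# The division happens only at the very end, after decryption by the data owner.
print(Fraction(*total), Fraction(*product))
```

Note that numerators and denominators are never reduced to lowest terms along the way, since that would require division and comparison; they therefore grow with every operation, which consumes message space.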
The FandV scheme has an unusual message space, being a polynomial ring. For the example parameter values given above, this means that when using the binary representation of integer values, the integers can in principle be very large (over ). As such, the limitation in message space size may seem less acute than in other homomorphic schemes (especially binary ones), but the practical issue raised in §2.3.3 means that it may still be advantageous to use a residue number system representation if there will be a lot of addition.
In the follow on to this review (Aslett, Esperança and Holmes, 2015), two other approaches are proposed: one where data is effectively quantile binned in a binary indicator fashion, which is shown to effectively enable simple comparison operations; and another discretisation of real values which is appropriate for linear modelling.
2.4.2 Cipher text size
Once the value to be encrypted has been appropriately represented such that only elements of $M$ need to be encrypted, there is the additional issue of a substantial inflation in the size of the message after encryption, often by several orders of magnitude.
As a concrete example, the usual representation of an integer in a computer requires 4 bytes of memory. If such a message is encrypted under the scheme presented in Section 2.3, then using the example parameters will result in cipher texts occupying 65,536 bytes (4096 coefficients, each a 128-bit integer). Consequently, a 1MB data set will occupy nearly 16.4GB encrypted.
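The arithmetic behind these figures can be reproduced directly. The sketch assumes, as the text states, 4096 coefficients of 128 bits per cipher text and 4-byte plain integers, and additionally assumes 1MB means $10^6$ bytes; the variable names are illustrative only.

```python
COEFFICIENTS = 4096            # coefficients per cipher text, as stated above
BITS_PER_COEFFICIENT = 128     # each coefficient is a 128-bit integer
PLAIN_INT_BYTES = 4            # usual size of an unencrypted integer

cipher_bytes = COEFFICIENTS * BITS_PER_COEFFICIENT // 8
expansion = cipher_bytes // PLAIN_INT_BYTES

def encrypted_size(plain_bytes):
    """Bytes needed to store a data set of 4-byte integers once encrypted."""
    return (plain_bytes // PLAIN_INT_BYTES) * cipher_bytes

print(cipher_bytes, expansion, encrypted_size(10**6) / 10**9)
```

Each encrypted integer is 16,384 times larger than its unencrypted counterpart, so a $10^6$-byte data set of 250,000 integers grows to about 16.4GB.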
One mitigating proposal (Naehrig et al., 2011) is to initially encrypt values using a non-homomorphic, size efficient encryption algorithm such as AES, and to encrypt the AES decryption key with a homomorphic scheme. The decryption circuit for AES can then be executed homomorphically, rendering a homomorphic encryption of the original message. This would mean that communication and long term storage of encrypted values could be space efficient, with expanded homomorphic cipher texts generated by effectively ‘recrypting’ from this compact format when computation is required. AES is an industry standard, but required 36 hours to execute homomorphically (Gentry et al., 2012) (for 56 AES blocks, corresponding to 896 bytes of data), although a more recent lightweight cipher named SIMON can be recrypted homomorphically in around 12 minutes (Lepoint and Naehrig, 2014). However, these approaches operated on binary messages, so the resulting recryption is to a binary scheme with the attendant issues already discussed.
2.4.3 Computational cost
Elements of cipher text space are not only larger in memory (with an associated additional computational cost to process), but will typically also be more complex spaces. For example, in Section 2.3 the cipher text space is the ring of polynomials modulo a cyclotomic polynomial, with coefficients from a large integer ring (e.g. 128-bit integers). Consequently, arithmetic operations are substantially more costly than standard arithmetic: there is large polynomial arithmetic involving coefficients which are too large to fit in standard 32-bit or 64-bit integers, with the additional overhead of modulo operations on both the coefficients and polynomial.
Most current schemes can achieve reasonable speeds for additions, but are very constrained in speed of multiplications. The optimised scheme implemented in the R package HomomorphicEncryption (Aslett, 2014) achieves thousands of additions per second, and about 50 multiplications per second. This is mitigated as far as possible by transparently implementing full CPU parallelism.
If all the operations involved can be performed in a single instruction multiple data (SIMD) fashion then the polynomial Chinese remainder theorem alluded to above can be used when representing the messages as a polynomial prior to encryption. In this way a single cipher text operation actually operates in a SIMD manner on many messages for the same computational cost (Smart and Vercauteren, 2014). Naturally, there is a limit to how many messages can be packed into a single cipher text in this way.
2.4.4 Division and comparison operators
At present there are no homomorphic schemes capable of natively supporting division operations, only addition and multiplication. An additional serious constraint is the inability to have any conditional code flow: comparison operators such as tests of equality and inequality cannot be performed on the encrypted data. Consequently, many algorithms appear out of reach without substantial redevelopment.
2.4.5 Depth of operations
The final limitation relates to the number of operations which can be applied. As explained in the discussion on semantic security, there is randomness injected into the cipher text in these encryption schemes. When operations are performed, the noise tends to accumulate (exactly how being scheme dependent): for example, in many schemes multiplication operations result in direct multiplication of the noise components leading in the naïve case to potentially exponential increases in the magnitude of the noise over many operations. Once the noise exceeds a certain threshold then decryption will render the incorrect message.
It is important to be clear that it is not usually the total number of multiplications which is limited, but rather the depth (i.e. the maximum degree of the evaluated polynomial). For example, has multiplicative depth 2, whereas has multiplicative depth 1 . Exactly what depth a scheme can achieve will depend on the scheme itself and usually on the parameters chosen, which commonly involves a tradeoff of speed, security or memory requirements against depth of operations.
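The distinction between the number of multiplications and their depth can be sketched with a small expression tree. Here depth is taken to mean the longest chain of nested multiplications, which is one common convention and an assumption of this sketch; the tuple representation and function names are hypothetical.

```python
def depth(expr):
    """Longest chain of nested multiplications in an expression tree.

    A variable is a string; an operation is a tuple (op, left, right)
    with op equal to "+" or "*".
    """
    if isinstance(expr, str):
        return 0
    op, left, right = expr
    below = max(depth(left), depth(right))
    return below + 1 if op == "*" else below

def count_mults(expr):
    """Total number of multiplications, regardless of nesting."""
    if isinstance(expr, str):
        return 0
    op, left, right = expr
    return count_mults(left) + count_mults(right) + (1 if op == "*" else 0)

chained = ("*", ("*", "a", "b"), "c")               # (a*b)*c
summed = ("+", ("*", "a", "b"), ("*", "c", "d"))    # a*b + c*d
balanced = ("*", ("*", "a", "b"), ("*", "c", "d"))  # (a*b)*(c*d)
sequential = ("*", ("*", ("*", "a", "b"), "c"), "d")  # ((a*b)*c)*d

for e in (chained, summed, balanced, sequential):
    print(count_mults(e), depth(e))
```

The second expression uses as many multiplications as the first but only half the depth, and the last two compute the same product at different depths, so the order in which operations are arranged matters when working close to the limit of a scheme.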
In principle, one of the breakthrough aspects of Gentry’s (2009) work was the ability to bootstrap (entirely unrelated to the statistics term) a cipher text: an operation which resets the noise to that of a freshly encrypted message. However, most bootstrapping routines are very complex to implement, extremely slow to execute, or both. As a result, it is almost universal in the applied cryptography literature to set the parameters of the scheme under consideration to be such that the necessary depth of operations can be performed without a bootstrapping step being required. The software contribution of Section 4 provides functions to help automatically select the parameters based on lower bounds in the literature for the depth of multiplications required.
To date the small number of applied cryptography papers have largely taken existing statistical techniques which can be made to directly fit within these constraints and demonstrated any minor refactoring of the algorithms that is necessary, but leave them fundamentally unaltered (some examples are reviewed in Section 3). However, statisticians and machine learners are well placed to develop principled approximations to current statistical and machine learning techniques, or entirely new techniques, where the constraints of homomorphic encryption are considered at all stages of model and algorithm development, and where uncertainties and errors introduced can be studied. Some initial contributions in this direction are presented in Aslett, Esperança and Holmes (2015).
2.5 Usage scenarios
The most obvious usage scenario is to outsource long-term storage and computation of sensitive data to a third party cloud provider. Here the ‘client’ (the owner of the data) encrypts everything prior to uploading to the ‘server’ (at the cloud provider’s data centre). Due to some of the limitations discussed above, this scenario is perhaps currently only suitable in a restricted set of situations where the added computational costs and inflated data size are not prohibitive. With homomorphic schemes improving all the time the boundary where this is a practical usage scenario will shift over time.
However, with the explosion of extremely compute, memory and battery constrained devices such as smart watches and glasses it may be that scenarios where additional server side memory and compute costs are a worthwhile trade-off are substantially broader. This is especially true given the biomedical focus of many of these recent devices which collect a lot of sensitive health data: collection of this on constrained client devices and handoff to a cryptographically secure server storage area which is capable of encrypted statistical analysis is an attractive proposition for both users and manufacturers.
An additional scenario is one in which it is desirable to be able to perform statistical analyses without the data being visible to anyone at all. To be concrete, consider a research institute requiring patient data for analysis: the research institute could widely distribute their public key to enable patients to securely donate their sensitive personal data. This data would be encrypted and sent directly to the cloud provider who would have a contractual obligation to only allow the research institute access to the results of pre-approved functions run on that data, not to the raw encrypted data itself. Peer review would be important for pre-approving certain functions to be homomorphically executed to ensure that the original data is not indirectly leaked. An interesting effect here may be increased statistical power (despite homomorphic approximations) due to the greater sample sizes which could result from increased participation because of the privacy guarantees.
There is at least one further usage scenario: that is, where there is confidential data on which a confidential algorithm must be run. In this situation, a client may encrypt their data to give to the developer of the algorithm and receive the results of the algorithm without either party compromising data or algorithm. In this situation, the constraints of homomorphic encryption are merely an opportunity cost because there may be no other way to achieve the same goal.
3 Current Methods
There are two aspects which, from the perspective of a statistician, are important to review: prior work on encrypted statistics algorithms and existing software implementations for making use of homomorphic encryption schemes.
In this section, both aspects are surveyed before the software tools documented in this paper are covered in Section 4.
3.1 Encrypted statistics
In recent years, some work has emerged on statistical methods for homomorphically encrypted data.
Graepel et al. (2012) proposed algorithms for binary classification, namely secure versions of the Linear Means and Fisher’s Linear Discriminant classifiers. The algorithms are rewritten in such a way that divisions are avoided but the original score function (needed for classification) is computable up to a constant. Because some operations have no counterpart in the encryption framework (like division and comparison), some of the computation is done offline by the client after decrypting results returned by the cloud. For instance, in binary classification with Linear Means, the class label is computed in this way as the sign of a score function. To represent real numbers as integers, the authors propose a rescaling approach which approximates real numbers with rational numbers (integer numerator and denominator) and then clears denominators by multiplying all numbers by an appropriate factor and rounding the result to the nearest integer. Approximation accuracy can be controlled in this way.
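As a minimal sketch of the rescaling idea, the following plain R code (no encryption involved) encodes real values as integers and recovers an approximate score afterwards. The helper names encode_fixed and decode_fixed and the scale factor of 1000 are illustrative assumptions and are not taken from Graepel et al. (2012).

```r
# Illustrative fixed-point encoding: scale, then round to the nearest integer.
# 'scale' controls the approximation accuracy (hypothetical helper names).
encode_fixed <- function(x, scale = 1000) round(x * scale)
decode_fixed <- function(z, scale = 1000, depth = 1) z / scale^depth

x <- c(0.125, -2.5, 3.14159)   # real-valued covariates
w <- c(1.5, 0.25, -0.75)       # real-valued weights of a linear score

zx <- encode_fixed(x)          # integers that could then be encrypted
zw <- encode_fixed(w)

# A product of two encoded values carries the scale factor twice,
# so the client divides by scale^2 after decryption.
score_int <- sum(zx * zw)
score     <- decode_fixed(score_int, depth = 2)

exact <- sum(x * w)
abs(score - exact)             # approximation error introduced by rounding
sign(score_int)                # the class label needs only the sign
```

Since only the sign of the score is needed for the class label, the final rescaling could be skipped altogether. A larger scale factor tightens the approximation at the cost of larger integers.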
Wu and Haven (2012) considered large-scale statistical analysis, namely the computation of mean and covariance in a multivariate scenario, using the same technique of returning separate encrypted numerators and denominators. Additionally, they mention the possibility of implementing (and indeed implement) low-dimension linear regression ($\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$) by using Cramer’s rule to invert the matrix $X^{\top}X$. Because Cramer’s rule also involves a division by the determinant of $X^{\top}X$, the computation cannot be completely performed homomorphically and must be finished offline by the client, who assembles the division factors post-decryption. Apart from the computational issues caused by division, there are additional problems here, the most important being the complexity of Cramer’s rule: both the multiplicative depth and the number of multiplications needed to compute the determinant grow with the dimension of the problem. Allied to this comes the computation of the adjoint matrix, having similarly substantial computational complexity. The restriction is two-fold: firstly, in the multiplicative depth of operations; and secondly, in the computational costs of these operations. Whereas the second restriction implies possible intractability of high-dimensional linear regression, the first restriction affects correctness of decryption and so should be regarded as more serious.
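The split between division-free computation and a final offline division can be sketched for a two-dimensional regression in plain R. This is an unencrypted illustration under assumed toy data; the variable names are hypothetical and the code is not the implementation of the cited work.

```r
# Integer-valued toy data (integers are what would be encrypted in practice).
X <- cbind(1, c(1, 2, 3, 4))       # intercept and one covariate
y <- c(3, 5, 7, 10)

A <- t(X) %*% X                    # 2 x 2 matrix to be inverted
b <- t(X) %*% y

# Steps using only additions and multiplications (server side in principle).
det_A <- A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1]
adj_A <- matrix(c(A[2, 2], -A[2, 1], -A[1, 2], A[1, 1]), nrow = 2)
num   <- adj_A %*% b               # numerators of the two coefficients

# The division happens only after decryption (client side).
beta <- num / det_A
beta
```

Even in this small case the determinant multiplies together entries which are themselves sums of products of the data, which is where the multiplicative depth accumulates.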
Lauter et al. (2014) observed that it is possible to analyse genomic data in a privacy-preserving framework and provide some examples of algorithms in statistical genetics which are implementable under the restrictions of homomorphic encryption, including the Cochran–Armitage trend test, the expectation–maximisation algorithm and measures of goodness-of-fit and linkage disequilibrium. The main issue in implementing these methods under the homomorphic encryption framework is that divisions are not possible. The solution proposed is to write the statistics in terms of the two factors involved in a division (dividend/numerator and divisor/denominator), compute these homomorphically and send them back to the client, who decrypts each factor and performs the division offline. For complex problems where divisions cannot be grouped (by combining dividends and divisors), there will be a higher number of cipher texts being passed to the client, which increases communication costs and, more importantly, may compromise privacy since more information is contained in less processed cipher texts.
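To illustrate how divisions may be grouped, consider the sample variance, which can be written as a single ratio of quantities involving only additions and multiplications. The sketch below uses plain R on unencrypted integers purely to show the algebra. In an encrypted setting the numerator would be computed on cipher texts, and the variable names are illustrative assumptions.

```r
x <- c(4L, 8L, 15L, 16L, 23L, 42L)      # integer data, as encrypted in practice
n <- length(x)

# Server side in principle: additions and multiplications only.
var_num <- n * sum(x * x) - sum(x)^2    # numerator
var_den <- n * (n - 1)                  # denominator (the sample size is public here)

# Client side: a single division after decryption.
var_hat <- var_num / var_den
var_hat
```

Only one numerator needs to be returned and decrypted here, rather than separate cipher texts for the mean and the sum of squares.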
Another class of privacy-preserving statistical methods has been proposed for predictive purposes: an algorithm is trained offline (say, a regression model) and the corresponding predictive model (the parameters in the regression model, $\beta$) encrypted. For prediction tasks, covariates are encrypted and sent to the server, where computations take place (e.g., the computation of the regression model predictor, $x^{\top}\beta$) and the results are then returned to the client for decryption (and potentially further transformation, as would be the case for generalised linear models). Examples of these include logistic regression (Bos et al., 2013), among others.
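The division of labour in such a prediction task can be sketched in plain R without any encryption. The coefficient values, the scale factor of 100 and the variable names below are illustrative assumptions: the server-side step uses only additions and multiplications on integers, while the non-polynomial link function is applied by the client.

```r
# Model trained offline; coefficients scaled by 100 and rounded to integers.
beta_int <- c(-150L, 20L, 35L)
x_int    <- c(1L, 3L, 2L)          # intercept term and two integer covariates

# Server side in principle: an inner product on (encrypted) integers.
eta_int <- sum(beta_int * x_int)

# Client side after decryption: undo the scaling, then apply the logistic link.
eta  <- eta_int / 100
prob <- 1 / (1 + exp(-eta))
prob
```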
Crucially, in all these current methods, existing algorithms are simply refactored to run homomorphically rather than developing novel approaches to approximate otherwise currently intractable statistical techniques.
As will be clear from Section 2.3, many homomorphic schemes can be non-trivial to implement. Some public implementations are releases of software which was written for a specific paper, whilst there are a small number of libraries or packages enabling reuse. Most of these libraries or packages provide interfaces in low-level languages such as C/C++. A very compact single C file library implementing Gentry (2010) is ‘libfhe’ (Minar, 2010). This implementation is based on a binary scheme, but has routines to allow encryption of integers by base-2 decomposition, encrypting each binary digit separately and then implementing binary adder arithmetic (so that even addition will involve cipher text multiplications). There is no bootstrapping implementation and at time of writing there have been no apparent updates since 2010.
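The reason that addition of binary-decomposed integers requires multiplications can be seen from a ripple-carry adder, sketched below in plain R on unencrypted bits. Here bitwXor and bitwAnd stand in for homomorphic addition and multiplication of encrypted bits; the helper functions are illustrative assumptions and are not part of libfhe.

```r
# Bits are stored least-significant first.
to_bits   <- function(n, width) as.integer((n %/% 2^(0:(width - 1))) %% 2)
from_bits <- function(b) sum(b * 2^(seq_along(b) - 1))

# Full adder: every carry needs AND gates, i.e. multiplications.
full_adder <- function(a, b, carry_in) {
  s         <- bitwXor(bitwXor(a, b), carry_in)
  carry_out <- bitwXor(bitwAnd(a, b), bitwAnd(carry_in, bitwXor(a, b)))
  list(sum = s, carry = carry_out)
}

# Ripple-carry addition of two equal-length bit vectors.
add_bits <- function(a_bits, b_bits) {
  carry <- 0L
  out   <- integer(length(a_bits) + 1)
  for (i in seq_along(a_bits)) {
    fa     <- full_adder(a_bits[i], b_bits[i], carry)
    out[i] <- fa$sum
    carry  <- fa$carry
  }
  out[length(a_bits) + 1] <- carry   # final carry becomes the top bit
  out
}

from_bits(add_bits(to_bits(42L, 8), to_bits(34L, 8)))
```

Because each carry depends on the previous one, the multiplicative depth of a single integer addition grows with the bit width, which is costly when the depth of multiplications is limited.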
‘Scarab’ (Perl et al., 2011) is another low-level C library, implementing instead another integer cipher text space scheme by Smart and Vercauteren (2010). This implementation allows only encryption of a binary message, although as well as providing addition (XOR) and multiplication (AND), there are full and half adders provided offering carry in and carry out or just carry out, respectively. A bootstrapping routine is also provided. There have not been additional updates in some time.
Another low-level implementation, ‘HElib’ (Halevi and Shoup, 2014b), provides a C++ library implementing Brakerski et al. (2012), one of the early second generation of schemes (see Appendix A). It incorporates some very useful optimisations, including the work of Smart and Vercauteren (2014), which enables single-instruction multiple-data (SIMD) parallelism by packing multiple values in a single cipher text. This is under active development at the time of writing and appears to be the most comprehensive implementation of a modern scheme currently available. Details of the algorithms used are available in preprint (Halevi and Shoup, 2014a).
Finally, there was a recent comparison of two schemes, Fan and Vercauteren (2012) and Bos et al. (2013), in Lepoint and Naehrig (2014) which provided the C++ software used (Lepoint, 2014). Although not in the explicit form of a library it could be possible to transform this into a C++ library for the two schemes.
4 HomomorphicEncryption R package
For statistics researchers to be able to use homomorphic encryption techniques, an easy to use yet high performance library in a high level language which is popular in the community is necessary. An R language (R Core Team, 2014) package providing such an implementation is a contribution of our work.
The HomomorphicEncryption R package (Aslett, 2014) provides an easy to use interface to begin developing and testing statistical methods in a homomorphic environment. The package has been developed to be extensible, so that as new schemes are researched by cryptographers they can be made available for use by statistics researchers with minimal additional effort. The package has a small number of generic functions for which different cryptographic backends can be used. The underlying implementation is mostly in high performance C and C++ (Eddelbuettel et al., 2011), with many of the operations set up to utilise multi-core parallelism via multithreading (Allaire et al., 2014) without requiring any end-user intervention.
The first generic cryptographic function is pars. The first argument to this function designates which cryptographic backend to use, and the function allows the user to override any of the default parameters of that scheme (see Section 2.3). Related to this, there is the alternative method of specifying parameters via the function parsHelp. This allows users to instead specify a desired minimal security level in bits and a minimal depth of multiplications required, and then computes values for the scheme parameters which will satisfy these requirements with high probability, by automatically optimising established bounds from the literature (Lepoint and Naehrig, 2014; Lindner and Peikert, 2011).
The second generic cryptographic function is keygen, whose sole argument is a parameter object as returned by pars or parsHelp. keygen then generates a list containing public ($pk) and private ($sk) keys, along with any scheme dependent keys (such as relinearisation keys in the case of Fan and Vercauteren (2012)), which correspond to the homomorphic scheme designated by the parameter object. At this point, the parameter object is absorbed into the keys so that it doesn’t need to be used for any other functions.
The third generic cryptographic function is enc. This requires simply the public key (as returned in the $pk list element from keygen) and the integer message to be encrypted. It then returns a cipher text encrypted under the scheme to which the public key corresponds. Crucially, the ease of use begins to become very apparent here, with enc overloaded to enable encryption of not just individual integers, but also vectors and matrices of integers defined in R. The structure of the vectors and matrices is preserved and the encryption process is fully multithreaded across all available CPU cores automatically.
The final generic cryptographic function is dec. Similarly, this requires simply the private key, as returned in the $sk list element from keygen, and the (scalar/vector/matrix) cipher text to be decrypted. It then returns the original message. Note that the structure of vector or matrix cipher texts is correctly preserved throughout.
The real simplicity becomes evident when manipulating the cipher texts. All the standard arithmetic functions (+, -, *) work as expected, implementing for example the cyclotomic polynomial ring algebra of the FandV scheme transparently. Moreover, vectors can be formed in the usual R manner using c (or extracted from the diagonal of matrix cipher texts with diag), element wise arithmetic can be performed on those vectors (with automatic multithreaded parallelism) and there is support for all the standard vector functions, such as length, sum, prod and %*% for inner products, just as one would conventionally use with unencrypted vectors in R. Indeed, such functionality extends to matrices, with formation of diagonal matrices via diag from cipher text vectors, element wise arithmetic and full matrix multiplication using the usual %*% R operator (again, automatically fully parallelised). Matrices also support the usual matrix functions (dim, length, t, etc). The package automatically dispatches these operations to the correct backend cryptographic routines to perform the corresponding cipher text space operations transparently, returning cipher text result objects which can be used in further operations or decrypted.
The following is the simplest possible instructive example. Examining the contents of k, c1, etc will show the encryption detail:
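    library(HomomorphicEncryption)
    p <- pars("FandV")              # default parameters for the FandV backend
    k <- keygen(p)                  # list with public ($pk) and private ($sk) keys
    c1 <- enc(k$pk, c(42, 34))      # encrypt two integer vectors
    c2 <- enc(k$pk, c(7, 5))
    cres1 <- c1 + c2                # element wise addition of cipher texts
    cres2 <- c1 * c2                # element wise multiplication of cipher texts
    cres1
    dec(k$sk, cres1)
    dec(k$sk, cres2)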
Note that indexing into vectors and matrices as provided by R via the usual [ ] notation is fully supported, including assignment.
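As a further sketch, the matrix functionality described in this section might be exercised as follows. Only functions named in this section are used (pars, keygen, enc, dec, %*%, diag, sum and indexing), but the form of the printed output, and whether the default parameters suffice for a given computation, are assumptions which should be checked against the package documentation. The encrypted part is guarded so that the snippet still runs where the package is not installed.

```r
# Plain R reference computation, for comparison after decryption.
m <- matrix(1:4, nrow = 2)
expected <- m %*% m

# The encrypted version only runs if the package is available.
if (requireNamespace("HomomorphicEncryption", quietly = TRUE)) {
  library(HomomorphicEncryption)
  p <- pars("FandV")
  k <- keygen(p)

  cm   <- enc(k$pk, m)               # encrypted matrix, structure preserved
  cres <- cm %*% cm                  # matrix product on cipher texts
  print(dec(k$sk, cres))             # should agree with 'expected'

  print(dec(k$sk, cm[1, 2]))         # indexing into an encrypted matrix
  print(dec(k$sk, sum(diag(cm))))    # trace, via diag and sum on cipher texts
}
```

The matrix product is kept to a single level of multiplication so that the example stays within a modest multiplicative depth.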
We hope this provides a distinctly easy-to-use software implementation in arguably the most popular high level language in use among data scientists today, including automatic help for encryption scheme parameter selection to aid non-cryptographers. Moreover, given the computational burden of homomorphic schemes, the transparent multithreaded parallelism automatically across all CPU cores in all available scenarios (encryption, decryption and arithmetic with vectors/matrices) enables focus to be on the subject matter questions.
At present, the scheme of Fan and Vercauteren (2012) (described in Section 2.3) has been implemented, making use of FLINT (Hart, 2010) for certain polynomial operations and GMP (Granlund and the GMP development team, 2012) for high performance arbitrary precision arithmetic. Backends for further homomorphic encryption schemes may be added in the future.
[Table: columns for scalar operations, vector operations and matrix operations]
This technical report has provided a review of homomorphic encryption with a focus on issues which are pertinent to statisticians and machine learners. It also introduces the HomomorphicEncryption R package and demonstrates the ease of getting started experimenting with homomorphic encryption.
The practical limitations of homomorphic encryption schemes mean that existing techniques cannot always be directly translated into a corresponding secure algorithm. This presents an opportunity for the statistics and machine learning community to engage with research in privacy preserving methods by developing new methods which are tailored to homomorphic computation and which work within the constraints described in Section 2.4, with the sister paper to this review (Aslett, Esperança and Holmes, 2015) being an initial contribution in this direction.
The authors would like to thank the EPSRC and LSI-DTC for support. Louis Aslett and Chris Holmes were supported by the i-like project (EPSRC grant reference number EP/K014463/1). Pedro Esperança was supported by the Life Sciences Interface Doctoral Training Centre doctoral studentship (EPSRC grant reference number EP/F500394/1).
- Allaire et al. (2014) Allaire, J. J., François, R., Intel Inc. and Geelnard, M. (2014), RcppParallel: Parallel programming tools for Rcpp. R package version 4.3.3.
- Aslett (2014) Aslett, L. J. M. (2014), HomomorphicEncryption: Fully Homomorphic Encryption. R package version 0.2.
- Aslett et al. (2015) Aslett, L. J. M., Esperança, P. M. and Holmes, C. C. (2015), Encrypted statistical machine learning: new privacy preserving methods, Technical report, University of Oxford.
- Boneh et al. (2005) Boneh, D., Goh, E. and Nissim, K. (2005), Evaluating 2-dnf formulas on ciphertexts, in ‘Theory of cryptography’, Springer, pp. 325–341.
- Bos et al. (2013) Bos, J. W., Lauter, K., Loftus, J. and Naehrig, M. (2013), Improved security for a ring-based fully homomorphic encryption scheme, in ‘Cryptography and Coding’, Lecture Notes in Computer Science, Springer Berlin Heidelberg, pp. 45–64.
- Brakerski (2012) Brakerski, Z. (2012), Fully homomorphic encryption without modulus switching from classical gapsvp, in ‘Advances in Cryptology–CRYPTO 2012’, Springer, pp. 868–886.
- Brakerski et al. (2012) Brakerski, Z., Gentry, C. and Vaikuntanathan, V. (2012), (leveled) fully homomorphic encryption without bootstrapping, in ‘Proceedings of the 3rd Innovations in Theoretical Computer Science Conference’, ACM, pp. 309–325.
- Brakerski and Vaikuntanathan (2011a) Brakerski, Z. and Vaikuntanathan, V. (2011a), Efficient fully homomorphic encryption from (standard) lwe, in ‘2011 IEEE 52nd Annual Symposium on Foundations of Computer Science’, IEEE, pp. 97–106.
- Brakerski and Vaikuntanathan (2011b) Brakerski, Z. and Vaikuntanathan, V. (2011b), Fully homomorphic encryption from ring-lwe and security for key dependent messages, in ‘Advances in Cryptology–CRYPTO 2011’, Springer, pp. 505–524.
- Brakerski and Vaikuntanathan (2014) Brakerski, Z. and Vaikuntanathan, V. (2014), Lattice-based fhe as secure as pke, in ‘Proceedings of the 5th conference on Innovations in theoretical computer science’, ACM, pp. 1–12.
- Daemen and Rijmen (2002) Daemen, J. and Rijmen, V. (2002), The design of Rijndael: AES-the advanced encryption standard, Springer.
- Eddelbuettel et al. (2011) Eddelbuettel, D., François, R., Allaire, J. J., Chambers, J., Bates, D. and Ushey, K. (2011), ‘Rcpp: Seamless R and C++ integration’, Journal of Statistical Software 40(8), 1–18.
- ElGamal (1985) ElGamal, T. (1985), A public key cryptosystem and a signature scheme based on discrete logarithms, in ‘Advances in Cryptology’, Springer, pp. 10–18.
- Fan and Vercauteren (2012) Fan, J. and Vercauteren, F. (2012), ‘Somewhat practical fully homomorphic encryption’, IACR Cryptology ePrint Archive .
- Franz et al. (2010) Franz, M., Deiseroth, B., Hamacher, K., Jha, S., Katzenbeisser, S. and Schröder, H. (2010), Secure computations on non-integer values, in ‘Information Forensics and Security (WIFS), 2010 IEEE International Workshop on’, IEEE, pp. 1–6.
- Garner (1959) Garner, H. L. (1959), ‘The residue number system’, IEEE Transactions on Electronic Computers EC-8(2), 140–147.
- Gentry (2009) Gentry, C. (2009), A fully homomorphic encryption scheme, PhD thesis, Stanford University.
- Gentry (2010) Gentry, C. (2010), ‘Computing arbitrary functions of encrypted data’, Communications of the ACM 53(3), 97–105.
- Gentry and Halevi (2011) Gentry, C. and Halevi, S. (2011), Implementing gentry’s fully-homomorphic encryption scheme, in ‘Advances in Cryptology–EUROCRYPT 2011’, Springer, pp. 129–148.
- Gentry et al. (2012) Gentry, C., Halevi, S. and Smart, N. P. (2012), Homomorphic evaluation of the AES circuit, in ‘Advances in Cryptology–CRYPTO 2012’, Springer, pp. 850–867.
- Gentry et al. (2013) Gentry, C., Sahai, A. and Waters, B. (2013), Homomorphic encryption from learning with errors: Conceptually-simpler, asymptotically-faster, attribute-based, in ‘Advances in Cryptology–CRYPTO 2013’, Springer, pp. 75–92.
- Goldwasser and Micali (1982) Goldwasser, S. and Micali, S. (1982), Probabilistic encryption & how to play mental poker keeping secret all partial information, in ‘Proceedings of the fourteenth annual ACM symposium on Theory of computing’, ACM, pp. 365–377.
- Graepel et al. (2012) Graepel, T., Lauter, K. and Naehrig, M. (2012), ML Confidential: Machine learning on encrypted data, in T. Kwon, M.-K. Lee and D. Kwon, eds, ‘Information Security and Cryptology (ICISC 2012)’, Vol. 7839 of Lecture Notes in Computer Science, Springer, pp. 1–21.
- Granlund and the GMP development team (2012) Granlund, T. and the GMP development team (2012), GNU MP: The GNU Multiple Precision Arithmetic Library.
- Halevi and Shoup (2014a) Halevi, S. and Shoup, V. (2014a), ‘Algorithms in HElib’, IACR Cryptology ePrint Archive .
- Halevi and Shoup (2014b) Halevi, S. and Shoup, V. (2014b), ‘Helib’, https://github.com/shaih/HElib.
- Hart (2010) Hart, W. B. (2010), Fast library for number theory: An introduction, in ‘Proceedings of the Third International Congress on Mathematical Software’, ICMS’10, Springer-Verlag, Berlin, Heidelberg, pp. 88–91. http://flintlib.org.
- Kaufman et al. (2009) Kaufman, D. J., Murphy-Bollinger, J., Scott, J. and Hudson, K. L. (2009), ‘Public opinion about the importance of privacy in biobank research’, American Journal of Human Genetics 85(5), 643–654.
- Knuth (1997) Knuth, D. E. (1997), The Art of Computer Programming, Volume 2: Seminumerical Algorithms, 3rd edn, Addison-Wesley.
- Lauter et al. (2014) Lauter, K., López-Alt, A. and Naehrig, M. (2014), ‘Private computation on encrypted genomic data’, Microsoft Research, technical report MSR-TR-2014-93.
- Lauter et al. (2011) Lauter, K., Naehrig, M. and Vaikuntanathan, V. (2011), Can homomorphic encryption be practical?, in ‘Proceedings of the 3rd ACM workshop on Cloud computing security workshop’, ACM, pp. 113–124.
- Lepoint (2014) Lepoint, T. (2014), ‘A proof-of-concept implementation of the homomorphic evaluation of SIMON using FV and YASHE leveled homomorphic cryptosystems’, https://github.com/tlepoint/homomorphic-simon.
- Lepoint and Naehrig (2014) Lepoint, T. and Naehrig, M. (2014), A comparison of the homomorphic encryption schemes FV and YASHE, in ‘Progress in Cryptology–AFRICACRYPT 2014’, Springer, pp. 318–335.
- Lindner and Peikert (2011) Lindner, R. and Peikert, C. (2011), Better key sizes (and attacks) for LWE-based encryption, in ‘Topics in Cryptology–CT-RSA 2011’, Springer, pp. 319–339.
- Lyubashevsky et al. (2010) Lyubashevsky, V., Peikert, C. and Regev, O. (2010), On ideal lattices and learning with errors over rings, in ‘Proceedings of the 29th Annual international conference on Theory and Applications of Cryptographic Techniques’, Springer-Verlag.
- Minar (2010) Minar, J. (2010), ‘libfhe’, https://github.com/rdancer/fhe/tree/master/libfhe.
- Naehrig et al. (2011) Naehrig, M., Lauter, K. and Vaikuntanathan, V. (2011), Can homomorphic encryption be practical?, in ‘Proceedings of the 3rd ACM workshop on Cloud computing security workshop’, ACM, pp. 113–124.
- Paillier (1999) Paillier, P. (1999), Public-key cryptosystems based on composite degree residuosity classes, in ‘Advances in Cryptology - EUROCRYPT’99’, Springer, pp. 223–238.
- Pathak et al. (2011) Pathak, M., Rane, S., Sun, W. and Raj, B. (2011), Privacy preserving probabilistic inference with hidden Markov models, in ‘Procedings of the IEEE, ICASSP 2011’, pp. 5868–5871.
- Perl et al. (2011) Perl, H., Brenner, M. and Smith, M. (2011), ‘Scarab library’, https://hcrypt.com/scarab-library/.
- R Core Team (2014) R Core Team (2014), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
- Regev (2009) Regev, O. (2009), ‘On lattices, learning with errors, random linear codes, and cryptography’, Journal of the ACM (JACM) 56(6), 34.
- Rivest, Adleman and Dertouzos (1978) Rivest, R. L., Adleman, L. and Dertouzos, M. L. (1978), ‘On data banks and privacy homomorphisms’, Foundations of Secure Computation 4(11), 169–180.
- Rivest, Shamir and Adleman (1978) Rivest, R. L., Shamir, A. and Adleman, L. (1978), ‘A method for obtaining digital signatures and public-key cryptosystems’, Communications of the ACM 21(2), 120–126.
- Sen (2013) Sen, J. (2013), Homomorphic encryption: Theory & application, in J. Sen, ed., ‘Theory and Practice of Cryptography and Network Security Protocols and Technologies’, InTech.
- Silverberg (2013) Silverberg, A. (2013), ‘Fully homomorphic encryption for mathematicians’, Women in Numbers 2: Research Directions in Number Theory 606, 111.
- Smart and Vercauteren (2010) Smart, N. P. and Vercauteren, F. (2010), Fully homomorphic encryption with relatively small key and ciphertext sizes, in ‘Public Key Cryptography–PKC 2010’, Springer, pp. 420–443.
- Smart and Vercauteren (2014) Smart, N. P. and Vercauteren, F. (2014), ‘Fully homomorphic SIMD operations’, Designs, codes and cryptography 71(1), 57–81.
- Stehlé and Steinfeld (2010) Stehlé, D. and Steinfeld, R. (2010), Faster fully homomorphic encryption, in ‘Advances in Cryptology-ASIACRYPT 2010’, Springer, pp. 377–394.
- Vaikuntanathan (2011) Vaikuntanathan, V. (2011), Computing blindfolded: New developments in fully homomorphic encryption, in ‘Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on’, IEEE, pp. 5–16.
- van Dijk et al. (2010) van Dijk, M., Gentry, C., Halevi, S. and Vaikuntanathan, V. (2010), Fully homomorphic encryption over the integers, in ‘Advances in Cryptology–EUROCRYPT 2010’, Springer, pp. 24–43.
- Wu and Haven (2012) Wu, D. and Haven, J. (2012), ‘Using homomorphic encryption for large scale statistical analysis’.
Appendix A Modern homomorphic schemes
The groundbreaking work by Gentry (2009) set the stage for the modern era of homomorphic schemes where both addition and multiplication to a (theoretically) arbitrary depth are possible. In a nutshell, Gentry constructed a scheme based on ideal lattices over a polynomial ring which could perform sufficient homomorphic operations to evaluate a so-called ‘squashed’ version of its own decryption algorithm: thus, given an encrypted version of a hint about the secret key, evaluating the decryption homomorphically results in a ‘fresh’ cipher text where the noise level is reset.
This quickly spawned many other schemes which invoked these techniques. Two conceptually much simpler schemes using the technique and based on large integer cipher texts were developed in van Dijk et al. (2010) and Smart and Vercauteren (2010). Stehlé and Steinfeld (2010) directly improved on Gentry (2009) making evaluation of operations less complex. Brakerski and Vaikuntanathan (2011b) used the Gentry approach removing some untested security assumptions which had been made. These works were in a sense the ‘first generation’ of modern schemes.
Brakerski and Vaikuntanathan (2011a) triggered a second generation of schemes based on the “learning with errors” (LWE) problem (Regev, 2009) which did not rely on the poorly understood hardness assumptions of ideal lattices or ‘squashing’ of the decryption circuit to achieve full homomorphism. Moreover, it ensured that the size of the public key was independent of the depth of operations to be performed: implementations of Gentry’s original scheme required up to 2.3 gigabyte public keys (Gentry and Halevi, 2011)! This second generation of schemes includes Brakerski et al. (2012) which introduced ‘leveled’ schemes, where noise grows linearly; Brakerski (2012) which introduced scale-invariance reducing the number of keys that must be stored; Fan and Vercauteren (2012) which provided a practical scheme, porting scale invariance to the Brakerski et al. (2012) scheme and setting it in a ring-LWE context (Lyubashevsky et al., 2010); Gentry et al. (2013) which introduced a highly novel LWE approach where cipher texts are matrices and operations follow standard matrix arithmetic; and Brakerski and Vaikuntanathan (2014) where they focus on matching security levels of non-homomorphic schemes, among others.
Appendix B Ring Learning With Errors (LWE)
The ring LWE hardness result underlies the homomorphic encryption scheme reviewed in Section 2.3. It is a ring based extension of the original LWE result due to Regev (2009). For the interested reader this appendix provides a short simplified explanation of the problem the security of the scheme relies upon. The notation here follows that of Section 2.3.
The original LWE problem requires reconstruction of a secret vector $s \in \mathbb{Z}_q^n$, for some integers $n$ and $q$, when only in possession of a collection of approximate random linear equations. First, imagine forming the results of many linear equations, $b_i = \langle a_i, s \rangle$, by choosing uniformly random vectors $a_i \in \mathbb{Z}_q^n$. Then, given realisations of $(a_i, b_i)$ it is a simple matter of solving a system of linear equations to recover $s$.
However, consider the approximate version of this problem: given a uniformly random vector $a_i \in \mathbb{Z}_q^n$, form instead the perturbed inner products $b_i = \langle a_i, s \rangle + e_i$, where $e_i$ is a scalar discrete random Gaussian draw. Then, given many realisations of $(a_i, b_i)$, the objective is to solve for the secret vector $s$. For appropriate choices of the error this can be shown to be an exceptionally hard problem: certainly as hard as traditional worst-case lattice problems which have been well studied.
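The contrast between the exact and perturbed problems can be seen in a toy instance, sketched below in plain R. All values (a modulus of 11, a two-dimensional secret, three equations and the particular errors) are hand-picked assumptions chosen for reproducibility rather than drawn at random, and are far too small to offer any security.

```r
q <- 11
s <- c(3, 7)                                  # secret vector
A <- rbind(c(1, 2), c(5, 3), c(4, 9))         # rows play the role of random vectors
e <- c(1, -1, 0)                              # small errors (hand-picked here)

b_exact <- as.vector(A %*% s) %% q            # exact inner products mod q
b_noisy <- as.vector(A %*% s + e) %% q        # perturbed inner products mod q

# Brute force over all q^2 candidate secrets: which satisfy every equation?
cand <- as.matrix(expand.grid(0:(q - 1), 0:(q - 1)))
consistent <- function(b) {
  ok <- apply(cand, 1, function(v) all(as.vector(A %*% v) %% q == b))
  unname(cand[ok, , drop = FALSE])
}

consistent(b_exact)   # a single row: the secret (3, 7)
consistent(b_noisy)   # no rows: exact solving fails once errors are added
```

At this toy size the secret could still be recovered by also searching over small errors. The hardness described above arises only for cryptographically sized dimension and modulus.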
Ring LWE (Lyubashevsky et al., 2010) ports the same results to the more complex polynomial ring setting, but the formulation is essentially unchanged in that it is now simply solution of a system of perturbed linear equations in an algebraic ring.
Notice that the public key in Section 2.3 is precisely the ring LWE problem: the public key contains a masked version of the secret key, with the security of doing this based on the difficulty of recovering it due to the ring LWE problem hardness.