1 Introduction
The success of machine learning is often built on data, and it is often desirable to integrate data from multiple sources for better learning results. However, the unrestricted exchange of sensitive data may threaten users' privacy and is often prohibited by law or business practice. How to protect privacy while still allowing the integration of multiple data sources is a problem that demands prompt solutions.
Privacy-preserving computation is a well-studied topic. People have built theoretical protocols under different efficiency and security assumptions. However, we are yet to see these frameworks become practical enough to support real-world machine learning algorithms. We believe there are two reasons: 1) most existing computation engines are poorly optimized for machine learning tasks, which usually require massive real-number matrix operations; and 2) there is no programming interface friendly enough for machine learning programmers to implement common algorithms.
A computation engine is based on privacy-preserving computation techniques, which fall into two main categories: randomization and cryptography [2]. The former protects privacy by introducing uncertainty. Recent works on differential privacy [31], such as [76, 18, 72, 8], are the latest representatives. They are often efficient. However, in a scenario where many sources contribute data over many iterations, one usually has to add so much noise that the accuracy becomes far from useful. There are some noiseless models (e.g. [9]), but they rely on strong assumptions about the data distribution.
Cryptography-based privacy-preserving computation, or secure multi-party computation (SMC), allows players to collectively compute a function without revealing private information beyond the final output. SMC builds on various cryptographic primitives such as garbled circuits (GC) [81], homomorphic encryption (HE) [34], and secret sharing (SS) [71]. Theoretically, fully homomorphic encryption (FHE) can compute any given function. However, even the latest implementations [35, 20] are impractical in terms of performance. GC-based approaches, such as [78, 79, 59, 17], yield constant-round protocols and are efficient for specific Boolean circuits (such as AES and comparison). However, GC usually incurs a large amount of bandwidth for complex arithmetic circuits (such as machine learning algorithms) with large inputs. Secret sharing, on the other hand, requires the participants to interact for every gate, and the round complexity is proportional to the depth of the circuit. For a small circuit and a high-latency network, GC outperforms secret sharing, especially when low latency is the central concern. However, secret sharing usually requires less bandwidth (and thus achieves higher throughput) for arithmetic circuits and provides practical performance, especially in a low-latency network such as a LAN [4, 79]. Thus, while deciding which SMC scheme to use, one should take several factors into account: network latency, the type and size of the circuit, the core performance concern, etc. For example, for machine learning tasks, which usually involve a large number of real-number arithmetic operations such as multiplication, throughput becomes the bottleneck in a low-latency network, and one can use secret sharing as the main scheme. One can also integrate different schemes to leverage their respective advantages and build a more practical SMC system.
Another important issue hindering SMC's adoption is programmability, especially for "big data" applications. Existing SMC solutions often ignore the core requirements of machine learning applications. They either require considerable expertise in cryptography to understand the cost of each operation, or use special programming languages with steep learning curves [70, 39, 25, 60, 10, 56]. Some useful solutions, such as [77], though providing rich interfaces for SMC, mainly focus on basic SMC operations, including not only basic arithmetic but also low-level cryptographic tools such as oblivious transfer [69]. In contrast, machine learning programmers use frameworks like PyTorch [63] and Scikit-learn [65], with built-in support for high-level data types like real numbers, vectors and matrices, as well as non-linear functions such as the logistic function and ReLU. It is almost impossible for machine learning programmers to rebuild and optimize all the often taken-for-granted primitives of a modern machine learning package in an SMC language. On the other hand, it is also costly for SMC experts to rewrite all the machine learning algorithm packages. Thus, it is essential to design an SMC frontend that is friendly to the machine learning community, which nowadays means Python with NumPy. In fact, the popular machine learning frameworks above all use Python frontends and provide NumPy-style array operations to ease machine learning programming.
In this paper, we focus on the client/server model and assume that there is a stable and low-latency network among the servers. Based on this, we introduce PrivPy, an efficient and easy-to-program framework for privacy-preserving collaborative computation. The PrivPy frontend provides familiar Python interfaces with native NumPy array support and common machine learning functions. The PrivPy computation engine uses secret sharing as the main scheme and integrates garbled circuits with secret sharing. We also design new protocols to accelerate real-number operations.
With PrivPy, we do not aim to provide a theoretically elegant solution or to replace low-level cryptographic tools. Instead, we focus on how to build a practical programming system that hides the cryptography from programmers and provides intuitive interfaces for arithmetic, especially for machine learning algorithms. Our design philosophy is analogous to that of a database: hiding the variety of data access methods under intuitive programming interfaces. User programs pass through an automatic code-level optimizer into the computation engine. Concretely, we make the following contributions:
1) Python programming interface with high-level data types. We provide a clean Python language integration with privacy-enabled common operations and high-level primitives, including broadcasting, which manipulates arrays of different shapes, and the ndarray methods, two NumPy [75] features widely used to implement machine learning algorithms. With these primitives, developers can port complex machine learning algorithms onto PrivPy with minimal effort.
2) Automatic code checking and optimization. Our frontend helps programmers avoid "performance pitfalls" by checking the code and rewriting it automatically to improve performance.
3) Decoupling the programming frontend from the computation backends. We introduce a general private operator layer that allows the same language interface to support multiple computation backends, enabling trade-offs between performance and security assumptions. Our current implementation supports both the SPDZ backend and our own computation engine.
4) Efficient real-number operations. We design and implement two protocols with native support for real numbers using four semi-honest servers: one for multiplication and the other for comparison, both with provable accuracy and privacy as well as high efficiency.
5) Validation on large-scale machine learning tasks. We demonstrate the practicality of our system on machine learning algorithms, such as logistic regression and convolutional neural networks (CNNs), over real-world datasets (including a 5,000-by-1-million matrix) and application scenarios, including both model training and inference. CNN inference on an image takes only about 1 second even with complex models. To our knowledge, this is the first practical modern CNN implementation using a noise-free privacy-preserving method.
2 Related Work
There are also many SMC frameworks using secret sharing. Some are algorithm-specific, such as [36, 64, 29, 30]. Implementations for general-purpose arithmetic include VIFF [24], SEPIA [74], SPDZ [25] and Sharemind [11]. Recent work provides optimizations for these frameworks [6, 45, 4]. For example, [4] performs integer/bit multiplication in one round with optimal communication cost using three semi-honest servers. While natively supporting efficient computation on integers, these approaches do not provide built-in support for real-number operations. To address this, most of them decompose each shared integer into its bit-level field elements and use secure bit-level operations, such as bit decomposition and shifting [23, 16, 50, 44], to simulate fixed/floating-point operations, which reduces the throughput by a factor proportional to the bit length. SecureML [60] provides built-in support for real numbers, but requires interactive precomputation to generate random multiplication triplets, which relies heavily on cryptographic tools and is much more expensive than the online computation. In comparison, the PrivPy computation engine avoids both the secure bit-level operations and the precomputation.
We emphasize that programmability is as important as the computation engine. SecureML [60], though providing native support for real numbers, does not provide any language frontend. TASTY [39] and ABY [27] provide interfaces for programmers to convert between different schemes. However, they only expose low-level interfaces, and programmers must decide by themselves which cryptographic tools to choose and when to convert between them, making the learning curve steep. L1 [70] is an intermediate language for SMC and supports basic operations, but it is a domain-specific language and does not provide high-level primitives to ease the array/matrix operations frequently used in machine learning algorithms. [26] and [12] suffer from similar problems. PICCO [85] supports additive secret sharing and provides customized C-like interfaces, but the interfaces are not intuitive and only support simple array operations. Also, according to their report, the performance is not practical enough for large-scale arithmetic tasks. KSS [49] and ObliVM [56] suffer from the same issues. PrivPy, on the other hand, stays compatible with Python and provides high-level primitives (e.g. broadcasting) with automatic code checking and optimization, imposing virtually no learning curve on application programmers and making it possible to implement machine learning algorithms conveniently in a privacy-preserving setting.
3 System Design
3.1 Problem formulation
We can formulate the problem that PrivPy solves as the following SMC problem: there are n clients C_1, ..., C_n. Each C_i has a set of private data x_i as its input. The goal is to use the union of all the x_i's to compute some function f(x_1, ..., x_n). The x_i's can be records collected independently by the C_i's, which then use them to jointly train a model. x_i can even be a secret model held by C_i; in this situation, C_i can perform inference on others' private data without revealing its model.
For the computation, we have the following requirements: 1) Privacy: during the computation, no information other than the output f(x_1, ..., x_n) is revealed to the participants. 2) Precision: the output should be (almost) the same as in the cleartext version. 3) Generality: f can be a combination of any numerical operations, including both calculation and comparison. 4) Efficiency: f can be evaluated fast enough. 5) Scalability: the solution should scale to support a large number of participants and data items.
3.2 System architecture overview
Fig. 1 shows an overview of PrivPy architecture, which has two main components: the language frontend and the computation engine backend. The frontend runs at the client side, providing programming interfaces and code optimizations. The backend runs on the servers, performing the privacypreserving computation. We decouple the frontend and backend so that we can leverage multiple implementations for private computation.
3.2.1 Programming language frontend
The frontend provides convenient Python APIs. A PrivPy program is a valid Python program with NumPystyle data type definitions. Fig. 2 shows a PrivPy program that computes the logistic function using the Euler method [73]. The public variable start is the initial point, and iter_cnt is the number of iterations.
Basic semantics. Unlike many domain-specific frontends [39, 70, 11], which require programmers to have knowledge of cryptography and to use customized languages, the program itself (lines 2-9) is a plain Python program, which can run in a raw Python environment with cleartext input. The user only needs to add two things to make it privacy-preserving in PrivPy:
Declaring the private variables. Line 1 declares a private variable x as the input from a client.
Getting results back. The function reveal() in line 10 allows clients to recover the cleartext of the private variable result.
Programmers not familiar with cryptography, such as machine learning programmers, can thus implement algorithms with minimal effort.
Supporting both scalar and array data types. PrivPy supports scalars as well as arrays of any shape. Supporting array operations is essential for writing and optimizing machine learning algorithms, which rely heavily on arrays. When the ss method is invoked, PrivPy detects the type and shape of x automatically. If x is an array, the program returns an array of the same shape, containing the result of the function applied to every element of x. Following the NumPy [75] semantics, we also provide broadcasting, which allows operations between a scalar and an array, as well as between arrays of different shapes, two widely used idioms. That is why the logistic function in Fig. 2 works correctly even when x is a private array. As far as we know, existing SMC frontends, such as [11, 70, 85, 25], do not support such elegant programs. For example, PICCO [85] only supports operations on arrays of equal shape.
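To make this concrete, here is a cleartext sketch of the kind of program described above. The function body follows the Euler-method approach attributed to Fig. 2; the exact PrivPy calls (variable declaration and reveal()) are omitted, and NumPy stands in for the private array type, so the sketch runs as plain Python. The boundary point and iteration count are illustrative assumptions.

```python
import numpy as np

def logistic(x, start=-8.0, iter_cnt=512):
    # Euler's method for y' = y * (1 - y): integrating from `start` to x
    # approximates the logistic function 1 / (1 + exp(-x))
    step = (x - start) / iter_cnt
    y = 1.0 / (1.0 + np.exp(-start))   # boundary value at `start`
    for _ in range(iter_cnt):
        y = y + step * y * (1.0 - y)
    return y

# broadcasting makes the same code work for scalars and arrays alike
scalar_result = logistic(0.0)                        # close to 0.5
array_result = logistic(np.array([-6.0, 0.0, 6.0]))  # element-wise, shape (3,)
```

Because the body uses only additions and multiplications (plus one public boundary constant), every step maps onto basic private operations when x is a private value.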
Parsing and optimizing automatically. The interpreter of our frontend parses the user's program into basic privacy-preserving operations supported by the backend, and the optimizer automatically rewrites the program to improve efficiency (see Section 6 for details). This optimization helps programmers avoid performance "pitfalls" in the SMC setting.
3.2.2 Computation engines
We realize that applications have very distinct requirements regarding security assumptions as well as performance. Thus, we decouple the language frontend from the backend computation engines. We can support any computation engine as long as it supports basic operations such as addition, multiplication and comparison. As the frontend only uses these interfaces and does not modify the underlying protocols, security is guaranteed as long as the computation engine provides secure and composable interfaces. We can easily port the language frontend to reuse the language tools and even the machine learning algorithms. Our current prototype supports a legacy computation engine as well as our own. Of course, there are many kinds of engines based on different assumptions and providing different levels of security, and we will support more in the future.
Active security with the SPDZ engine. We support SPDZ by adding a very thin wrapper that handles the communication with our language frontend. The advantage of SPDZ is that it provides active security by generating and keeping Message Authentication Codes (MACs) of private data. However, SPDZ uses inefficient secure bit-level operations for real numbers. Also, SPDZ requires time- and memory-consuming compilation, and even fails to compile complex algorithms (we show this in Section 7). Thus, for applications that do not require active security, we provide our own efficient computation engine.
High-performance passive security with the PrivPy engine. We have developed a computation engine combining a secret sharing scheme with garbled circuits. Our model provides passive security like Sharemind [11]. However, our computation protocols provide better performance for real-number arithmetic. Moreover, we optimize the computation engine for the batch operations that are essential in large-scale data mining problems. We detail the engine in the next two sections.
4 The PrivPy computation engine
4.1 Threat Model
Our computation engine uses four semi-honest servers for computation, and we focus on the client/server model in this paper, as [4, 53, 30, 11, 56] do. Moving the computation from the clients to these servers effectively reduces the number of parties and thus improves system performance and scalability. Clients send secretly shared data to the servers, and the servers then perform privacy-preserving computation on these shares.
Our engine is based on two assumptions: 1) all servers are semi-honest, i.e. they follow the protocol and do not collude with each other, but they are curious about the users' privacy and will mine as much information as possible; and 2) all communication channels are secure, so adversaries cannot see or modify anything in these channels. Note that a "server" here mainly refers to an administrative domain; it can be a cluster of machines.
The assumption of non-colluding cloud servers is rational and common, and many well-known frameworks are based on this kind of assumption. For example, P4P [30] and PEM [53] use semi-honest servers to enable additive computation, while Sharemind [11] and [4] use three semi-honest servers to enable general computation. Our computation engine introduces one more semi-honest server to enable much more efficient privacy-preserving general computation for arithmetic. In fact, as there is a growing number of independent and competing cloud providers, it is feasible to find a small number of non-colluding computing servers.
Some frameworks can tolerate one or more malicious adversaries, i.e. they can detect deviations from the protocol, by maintaining authenticated information for each private variable [25, 33, 78, 79] or by performing redundant computation [59, 17]. However, this comes at a cost. In this paper, we focus on building a practical SMC framework for machine learning tasks, and aim to strike the right trade-off between security assumptions and efficiency. Thus, we choose the assumption of semi-honest servers, which is rational and feasible. We leave extensions for detecting malicious adversaries to future work.
4.2 Computation engine architecture
We use the client/server model: each client breaks its private variable x into two shares, x1 and x2, and sends them to the servers shown in Fig. 3. S1 and S2 are the primary servers, which work as both secret share storage and computation engines. S1 only touches the x1's, while S2 only handles the x2's. To implement the operations, we adopt two assistant servers, Sa and Sb, which only provide computation but no storage.
Our engine includes two subsystems. The secret sharing storage (SS store) subsystem, which resides on S1 and S2, provides (temporary) storage for shares of private inputs and intermediate results. The private operation (PO) subsystem runs on all four servers and provides an execution environment for private operations. The servers read shares from the SS store, execute a PO, and write the shares of the result back to the SS store.
4.3 The PrivPy workflow
With our computation engine, we present the PrivPy workflow.
Step 1: Program preparation. Once ready, the program is parsed into basic operations and optimized by the optimizer. Then it is passed to the backend for computation.
Step 2: Pooling secret shares. Each client computes the secret shares of its own private variables and sends the resulting shares to the SS store on the servers.
Step 3: Executing the POs on servers. After receiving all expected shares of secrets, the servers start the private computation without any client involvement.
Step 4: Revealing the final results to the clients. When the program invokes the reveal() method, the clients are notified to fetch the result shares from the SS store and finally recover the cleartext result.
5 PO Protocols
5.1 Mathematical preliminaries
We first review some mathematical preliminaries used in PrivPy for readers not familiar with the area.
Additive secret sharing. PrivPy uses the classic additive secret sharing scheme. Assume n is a positive integer and let Z_{2^n} denote the additive group of integers modulo 2^n. Given x1 and x2 in Z_{2^n}, we call them two shares of x as long as x1 + x2 ≡ x (mod 2^n). We then define the sharing function S(x) = (x − r mod 2^n, r), where r is a randomly picked integer in Z_{2^n}. Apparently, S(x) generates two shares of x. We can see that S is additively homomorphic in Z_{2^n}, i.e. the element-wise sum of S(x) and S(y) yields shares of x + y. It is easy to see that both shares are uniformly random in Z_{2^n}, and in PrivPy the two non-colluding servers S1 and S2 manage these shares independently. Thus, neither server learns anything about the private variable x.
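A minimal cleartext simulation of this scheme illustrates both the sharing function and its additive homomorphism (the 64-bit share width is an assumption for the example):

```python
import secrets

N = 64
MOD = 1 << N  # the modulus 2^n

def share(x):
    # S(x): two additive shares of x, each uniformly random in Z_{2^n}
    r = secrets.randbelow(MOD)
    return ((x - r) % MOD, r)

def reveal(x1, x2):
    return (x1 + x2) % MOD

xs = share(123)
ys = share(456)
assert reveal(*xs) == 123
# additive homomorphism: each server adds its shares locally, no interaction
zs = ((xs[0] + ys[0]) % MOD, (xs[1] + ys[1]) % MOD)
assert reveal(*zs) == 579
```

Each share in isolation is a uniform random element of Z_{2^n}, which is why a single server learns nothing about x.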
Supporting real numbers. Note that the function S above is defined on elements of Z_{2^n} only. To map real numbers into Z_{2^n}, we use a typical discretization approach. Given a real number x with |x| < B, where B is the bound of the input with 2^d · B < 2^{n−1}, we denote by x̃ the corresponding integer in Z_{2^n}:

x̃ = ⌊x · 2^d⌋ mod 2^n, (1)

where 2^d is the scaling factor, which implies that the precision of this representation of real numbers is 2^{−d}. Finally, we define S̃(x) = S(x̃) as the secret sharing function of a real number x. Ignoring the precision loss, S̃ remains additively homomorphic.
Our representation is close to the fixed-point representation [16]. We do not use a floating-point representation because of its inefficiency for basic operations (e.g. secure addition with floating-point representations requires aligning the radix points, which is time-consuming) [44, 3]. Meanwhile, fixed-point representations have proven practical even for complex data mining tasks [58, 40]. Note that negative numbers are mapped into the upper half of Z_{2^n}. Assuming all inputs and (accumulated) intermediate results stay within the bound (−B, B), there will never be any unexpected sign flipping caused by overflowing the modulus. We will show later that combining this frequently used mapping scheme with our PO design yields much more efficient POs than existing approaches.
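As an illustration of this fixed-point mapping (the parameter values n = 64 and d = 20 are assumptions for the example, not necessarily the system's choices):

```python
N, D = 64, 20
MOD = 1 << N
SCALE = 1 << D  # scaling factor 2^d

def encode(x):
    # map a real x to Z_{2^n}; negative values land in the upper half of the ring
    return int(x * SCALE) % MOD

def decode(u):
    # invert the mapping, interpreting the upper half of Z_{2^n} as negative
    if u >= MOD // 2:
        u -= MOD
    return u / SCALE

assert abs(decode(encode(3.14159)) - 3.14159) < 2 ** -D
assert abs(decode(encode(-2.5)) + 2.5) < 2 ** -D
```

The round trip loses at most 2^(-d) of precision, matching the representation precision stated above.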
Getting rid of wrap-around. Given the above secret sharing and real-number representation schemes, we can get rid of the wrap-around problem [23, 46, 16], so that we can apply arithmetic operations directly on the shares. First, we define an indicator function φ that converts an integer in Z_{2^n} to its signed representation: if u < 2^{n−1}, then φ(u) = u; else φ(u) = u − 2^n. We then introduce two theorems (we leave the formal proofs to our technical report [54]).

Theorem 1.

Given x1, x2 ∈ Z_{2^n} with x1 + x2 ≡ x̃ (mod 2^n), if |φ(x1) + φ(x2)| < 2^{n−1}, then φ(x1) + φ(x2) = φ(x̃).

Theorem 2.

Given a private real number x with |x| < B and shares (x1, x2) = S̃(x), denoting the event φ(x1) + φ(x2) ≠ φ(x̃) as FAIL, we have Pr[FAIL] ≤ 2^d · B / 2^{n−1}.

Combining Theorem 1 and Theorem 2, we can see that if 2^d · B / 2^{n−1} is small enough, the probability of avoiding FAIL is extremely high. With reasonable parameters (n sufficiently larger than d plus the bit length of B), Pr[FAIL] is negligible, and in practice it does not cause any failure even for complex algorithms. With this conclusion, we will show that we can improve the performance of multiplication and comparison on real numbers.

5.2 The addition PO
Based on the definition of S, to compute the secret shares of a sum, the servers S1 and S2 only need to independently add up the shares they hold.
5.3 The multiplication PO
Real-number multiplication is a basic operation in machine learning algorithms. For example, neural networks frequently use matrix multiplication for network propagation. Such algorithms usually involve massive numbers of real-number multiplications, so the throughput of real-number multiplication is vital.
Our solution avoids the secure bit-level operations of integer/bit-based schemes (e.g. [11, 25]) and the precomputation of [60]. We believe that the performance gain justifies the cost of finding one or two more semi-honest servers. Protocol 1 shows our real-number multiplication protocol. An important detail is that the mapping scheme scales its input by the factor 2^d, so the product of two encoded values carries the factor 2^{2d}. Thus we need to adjust the result by dividing out the extra factor 2^d.
Analysis. (Correctness). Ignoring the rounding effect, and using the conclusions of the theorems above, one can verify that the intermediate values the servers exchange are shares of the cross terms of x̃ · ỹ, and that the modular sum of the final shares equals the encoding of x · y. In other words, the protocol outputs valid shares of the product.
(Security). We leave the detailed proof to the Appendix. Informally speaking, only two steps involve communication, and the numbers transferred between the servers are all uniformly distributed integers generated by the secret sharing function. Thus, no information is leaked to the servers.
(Complexity). Note that the random-integer sharing steps can be optimized by letting the two servers involved use the same pseudorandom generator with the same seed, avoiding the transfer of these random integers. With this optimization, the protocol requires only a constant number of sharing invocations per multiplication.
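The arithmetic underlying the protocol can be checked in a cleartext simulation. The identity x·y = x1·y1 + x1·y2 + x2·y1 + x2·y2 (mod 2^n) lets the primary servers compute the local terms themselves, while the cross terms are computed with the assistants' help and re-shared. This sketch is not the paper's exact Protocol 1 (the masking of the cross-term inputs and the fixed-point 2^d rescaling are omitted; integer shares are used for simplicity):

```python
import secrets

MOD = 1 << 64

def share(x):
    r = secrets.randbelow(MOD)
    return ((x - r) % MOD, r)

x, y = 37, 53
x1, x2 = share(x)
y1, y2 = share(y)

# local terms: S1 computes x1*y1 and S2 computes x2*y2 without interaction
t1 = (x1 * y1) % MOD
t2 = (x2 * y2) % MOD

# cross terms x1*y2 and x2*y1: in the real protocol these are computed by the
# assistant servers on masked inputs; here we only show the re-sharing step
c1 = share((x1 * y2) % MOD)
c2 = share((x2 * y1) % MOD)

z1 = (t1 + c1[0] + c2[0]) % MOD  # S1's share of the product
z2 = (t2 + c1[1] + c2[1]) % MOD  # S2's share of the product
assert (z1 + z2) % MOD == (x * y) % MOD
```

For encoded real numbers, each product term additionally carries the factor 2^{2d}, which is why the protocol must divide the result by 2^d as noted above.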
5.4 The comparison PO
Comparison is common in many basic components, such as the ReLU function and k-means, and mostly works together with other operations. The (improved) garbled circuit (GC) [81] is one of the most communication-efficient secure comparison protocols [21, 47]; the measurements in [27] also demonstrate this. Therefore, unlike some existing secret-sharing-based frameworks (e.g. [25, 11]) which use expensive secure bit-level operations to implement comparison, PrivPy integrates GC with the secret sharing scheme to obtain more efficient comparison.
There are also previous approaches, such as [27, 32], that integrate GC with secret sharing. They perform comparison on the arithmetic shares as follows: 1) first, they run addition circuits on the shares to obtain Yao-sharing representations of the operands; 2) then they evaluate the comparison circuit on these Yao shares; 3) finally, they convert the comparison result (in Yao sharing) back to arithmetic sharing. Our comparison protocol runs GC directly on the arithmetic shares, thus avoiding the addition circuits, and with the two assistant servers we can efficiently convert the comparison result back to arithmetic shares. As the communication cost of the garbled circuits dominates the performance of comparison (especially batch comparison), and the addition circuit is more costly than the comparison circuit [48], eliminating the addition circuits improves the throughput substantially. Protocol 2 details the process.
To implement GC, we use the free-XOR construction [48, 47]. Besides, we use the half-gates trick [84] to reduce the size of garbled AND gates. We can further optimize GC performance by moving the circuit generation and transfer to the setup phase, as [47] does.
Analysis. (Correctness). It is obvious that the values computed in the first step are shares of x − y. The purpose of the next step is to convert these shares to regular signed integers so that GC can compare them. As Theorem 2 establishes, the signed values of the two shares sum to the signed encoding of x − y with overwhelming probability, and this sum clearly has the same sign as x − y. Thus, we can perform GC directly on the two signed values to get the comparison result, without the conversion used in [27].
In a later step, the comparison result is masked by two random bits generated independently by the two garbling servers, as in [47]. After receiving the masked result and the shares of the mask bits, S1 and S2 recover the shares of the original comparison result, flipping the bit when the mask is set. The final values they hold are the shares of the result.
(Security). As with Protocol 1, we can define a functionality in an ideal model and construct simulators to show that each server only sees independent random numbers or bits in Protocol 2. We omit the details here.
Informally speaking, during the entire process, S1 and S2 only generate and receive secret shares, while the assistant servers perform the comparison between themselves, and the comparison result is masked by the random bits. Each server thus only sees independent random integers or garbled bits.
(Complexity). Apart from the communication triggered by GC, there are 8 sharing invocations. In GC, following [47], we move the circuit generation and transfer to the setup phase. Thus, in the online phase of GC, the garbler only needs to send a number of bits proportional to the product of the share bit length and the GC security parameter; in this paper we use a standard choice of security parameter.
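The key observation, that the sign of φ(d1) + φ(d2) equals the sign of x − y with overwhelming probability, can be checked in a cleartext simulation (the garbled-circuit evaluation itself is replaced by a direct comparison here, and the share width is an assumed 64 bits):

```python
import secrets

N = 64
MOD = 1 << N
HALF = MOD >> 1

def phi(u):
    # signed representative of u in [-2^(n-1), 2^(n-1))
    return u - MOD if u >= HALF else u

def share(x):
    r = secrets.randbelow(MOD)
    return ((x - r) % MOD, r)

def less_than(xs, ys):
    # each primary server subtracts its shares locally to get shares of x - y;
    # the GC step (simulated in cleartext here) only tests the sign of
    # phi(d1) + phi(d2), which matches the sign of x - y except with
    # negligible wrap-around probability (Theorem 2)
    d1 = (xs[0] - ys[0]) % MOD
    d2 = (xs[1] - ys[1]) % MOD
    return phi(d1) + phi(d2) < 0

assert less_than(share(5), share(9))
assert not less_than(share(9), share(5))
```

Because the subtraction happens share-locally, only the final sign test needs the garbled circuit, which is what makes skipping the addition circuits possible.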
5.5 Derived POs
We can compose multiple basic POs into more complex derived POs commonly used in machine learning algorithms. For example, we use the Newton-Raphson method [82] to implement division, and the Euler method [73] to implement the logistic function. We also implement other common math functions, such as sqrt, log, exp and max_pooling, using similar numerical methods. We omit the details due to space limitations.
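As an example of such a derived PO, division can be reduced to additions and multiplications via Newton-Raphson iteration for the reciprocal. The initial value and iteration count below are illustrative assumptions; a real implementation must choose them to guarantee convergence over the expected input range.

```python
def reciprocal(a, iters=20, y0=0.01):
    # Newton-Raphson for f(y) = 1/y - a: the update y <- y * (2 - a * y)
    # uses only additions and multiplications, i.e. only basic POs;
    # converges for any positive a with y0 in (0, 2/a)
    y = y0
    for _ in range(iters):
        y = y * (2.0 - a * y)
    return y

def divide(p, q):
    # division as multiplication by the reciprocal
    return p * reciprocal(q)
```

Convergence is quadratic once the iterate is in range, so a fixed, data-independent iteration count (required in the oblivious setting, where the loop bound cannot depend on private data) suffices.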
5.6 POs for performance optimization
We provide the following three sets of POs whose functionality is already covered by the basic POs, but whose specialized versions significantly improve performance in certain cases. Programmers can use these POs directly.
Array POs. Batching is a common optimization in SMC frameworks [11, 25, 85]: it batches up independent data transfers among the servers and thus reduces the fixed per-message overhead. Array POs natively support batching. As many machine learning algorithms heavily utilize array operations, this optimization reduces communication rounds and can improve performance significantly.
Multiplication by public variables. When an operation involves both public and private variables, we can optimize performance since the public values need no protection. Multiplication benefits from this the most: S1 and S2 only need to multiply their shares by the public variables locally, and no communication between servers is necessary.
Dot and outer product. Dot and outer products of arrays are frequently used in machine learning algorithms. For example, logistic regression and neural networks use the dot product for forward propagation, and the outer product is often used for calculating gradients. Implementing them with for-loops incurs many duplicated transfers, as each element is multiplied by several other elements in a multi-dimensional setting. We thus provide built-in optimized dot and outer products. Specifically, for two private arrays (which can have any dimensions, as long as their shapes match after broadcasting, as in NumPy), we compute the dot product from the partial dot products of the shares, analogously to the multiplication protocol: S1 and S2 compute the partial products of their own shares locally, while Sa and Sb handle the cross terms. Security can be proved in the same way as for the multiplication protocol. This optimization significantly reduces communication cost: a for-loop dot product over two arrays triggers one invocation of Protocol 1 per element pair, with communication proportional to the number of multiplications, while the optimized version only communicates the aggregated partial results. In comparison, well-known SMC frontends, such as PICCO and SPDZ, do not provide such built-in optimizations.
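The saving comes from aggregating locally before communicating. It rests on the identity ⟨x, y⟩ = ⟨x1, y1⟩ + ⟨x1, y2⟩ + ⟨x2, y1⟩ + ⟨x2, y2⟩ (mod 2^n): each party sums its per-element products into a single partial value, so only the partial sums need to be combined instead of one message per element. A cleartext simulation:

```python
import secrets

MOD = 1 << 64

def share_vec(v):
    # element-wise additive sharing of a vector
    r = [secrets.randbelow(MOD) for _ in v]
    return [(a - b) % MOD for a, b in zip(v, r)], r

x = [1, 2, 3]
y = [4, 5, 6]
x1, x2 = share_vec(x)
y1, y2 = share_vec(y)

# each party aggregates its partial dot product locally before communicating
p11 = sum(a * b for a, b in zip(x1, y1)) % MOD
p22 = sum(a * b for a, b in zip(x2, y2)) % MOD
p12 = sum(a * b for a, b in zip(x1, y2)) % MOD
p21 = sum(a * b for a, b in zip(x2, y1)) % MOD

assert (p11 + p22 + p12 + p21) % MOD == sum(a * b for a, b in zip(x, y)) % MOD
```

As in the multiplication sketch, the masking and re-sharing of the cross-term inputs is omitted; the point is that the communicated data shrinks from one value per element to one value per partial sum.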
6 Python Frontend and Optimizations
As discussed in Section 3, PrivPy provides a programming interface that is compatible with plain Python code. In this section, we focus on the implementation and optimization of the programming interfaces. Our goal is to provide intuitive interfaces and automatic optimizations that avoid steep learning curves and let programmers focus on the machine learning algorithm itself.
6.1 Python Interfaces
Private array types. The private array class in PrivPy encapsulates arrays of any shape. Users only need to pass an array to the constructor, which automatically detects its shape. Like the array type in NumPy [75], our private array supports broadcasting, i.e. PrivPy can handle arithmetic operations on arrays of different shapes by "broadcasting" the smaller array. We also implement the ndarray methods of NumPy; our technical report [54] lists the methods we have implemented. Broadcasting and the ndarray methods are very useful for implementing common machine learning algorithms, which usually handle arrays of different shapes.
Operator overloading. We overload operators for the private data classes, so that standard operators such as +, - and * work on both private and public data, or a combination of the two. The implementation of these overloaded operators chooses the right POs based on the operand types.
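A minimal sketch of this dispatch, with hypothetical names (SInt, add_pp, add_ps) standing in for PrivPy's internal POs; the "shares" here are plain integers for illustration only:

```python
def add_pp(a, b):
    # PO for private + private: each server adds its shares locally.
    return a + b

def add_ps(a, b):
    # PO for private + public: the public constant is folded in.
    return a + b

class SInt:
    """Toy private integer; 'share' is a stand-in for real shares."""
    def __init__(self, share):
        self.share = share

    def __add__(self, other):
        if isinstance(other, SInt):                 # private + private
            return SInt(add_pp(self.share, other.share))
        return SInt(add_ps(self.share, other))      # private + public
    __radd__ = __add__   # also handles public + private
```

The same pattern extends to the other overloaded operators: each checks the operand type and routes to the matching PO.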
Support for large arrays. In this “big data” age, real-world machine learning tasks usually take large arrays as inputs. However, mapping the data onto secret shares unavoidably increases the data size. Thus, real-world datasets that fit in memory in cleartext may fail to load in the private version. For example, the matrix we use in our experiments requires over 200 GB of memory when mapped into a 256-bit long integer space. It is hard for application programmers to design new algorithms to handle the memory limit. Thus we provide a LargeArray class that transparently uses disk as the backing storage for arrays too large to fit in memory.
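A disk-backed array of this kind can be sketched with a memory-mapped file; LargeArray's real implementation (chunked I/O, share management) is more involved, and the class below is only illustrative:

```python
import numpy as np
import tempfile

class LargeArray:
    """Illustrative disk-backed array: the data lives in a
    memory-mapped file, so arrays larger than RAM can still be
    indexed with the usual NumPy syntax."""
    def __init__(self, shape, dtype=np.int64):
        f = tempfile.NamedTemporaryFile(suffix=".shares", delete=False)
        self._mm = np.memmap(f.name, dtype=dtype, mode="w+", shape=shape)

    def __getitem__(self, idx):
        return self._mm[idx]

    def __setitem__(self, idx, value):
        self._mm[idx] = value
```

Because indexing is delegated to the memmap, code written against in-memory arrays keeps working when the array is switched to disk-backed storage.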
Array operation examples. In addition to Fig. 2, Fig. 4 shows an extra example of matrix factorization, which decomposes a large private matrix into two latent matrices. Both demonstrate ndarray methods in PrivPy. Users can implement the algorithms in plain Python, then just replace the NumPy package with the PrivPy package and add private variable declarations. In fact, by replacing all privpy with numpy, lines 2–13 of Fig. 4 run directly in a raw Python environment with cleartext inputs.
6.2 Code analysis and optimization
Compared to computation on cleartext, private operations have very different costs, and many familiar programming constructs may lead to bad performance, creating “performance pitfalls”. Thus, we provide aggressive code analysis and rewriting to help avoid these pitfalls. For example, it is fine to write an elementwise multiplication of two vectors as a for-loop in a plain Python program.
However, in PrivPy this is a typical anti-pattern that causes performance overhead, due to the many scalar multiplication POs involved, compared to a single array PO (Section 5.6).
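For instance, the following loop (written here over NumPy-style arrays) is natural in cleartext Python but would trigger one multiplication PO per element, while the vectorized form below it costs a single array PO:

```python
import numpy as np

def mul_loop(x, y):
    # Anti-pattern: each scalar x[i] * y[i] would be a separate
    # multiplication PO (one communication round each) in PrivPy.
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] * y[i]
    return z

def mul_vector(x, y):
    # What the optimizer rewrites the loop into: one batched array PO.
    return x * y
```

Both functions compute the same result; only the number of POs (and hence rounds of server communication) differs.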
To solve the problem, we build a source code analyzer and optimizer based on Python’s abstract syntax tree (AST) package [61]. Before the servers execute the user code, our analyzer scans the AST and rewrites anti-patterns into more efficient forms. In this paper, we implement three such optimizations:
For-loop vectorization. Vectorization [80] is a well-known compiler optimization. This analyzer rewrites the above for-loop into a single vector operation (e.g., z = x * y).
Common factor extraction. We convert expressions of the pattern a*x1 + a*x2 + ... + a*xn to a*(x1 + x2 + ... + xn). In this way, we reduce the number of multiplications from n to 1, saving significant communication time.
Common expression vectorization. Programmers often write vector expressions explicitly, like a1*b1 + a2*b2 + a3*b3, especially for short vectors. The optimizer extracts the two vectors a and b, and rewrites the expression into a vector dot product a · b.
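The common-factor rule can be sketched with Python's ast module. This toy NodeTransformer handles only the two-term pattern a*b + a*c; the real optimizer covers longer sums and the other patterns above:

```python
import ast

class CommonFactor(ast.NodeTransformer):
    """Rewrite `a*b + a*c` into `a*(b + c)` (two-term sketch only)."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # transform children first
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.BinOp)
                and isinstance(node.left.op, ast.Mult)
                and isinstance(node.right, ast.BinOp)
                and isinstance(node.right.op, ast.Mult)
                and isinstance(node.left.left, ast.Name)
                and isinstance(node.right.left, ast.Name)
                and node.left.left.id == node.right.left.id):
            # a*b + a*c  ->  a * (b + c): one multiplication instead of two
            factored = ast.BinOp(
                node.left.left, ast.Mult(),
                ast.BinOp(node.left.right, ast.Add(), node.right.right))
            return ast.copy_location(factored, node)
        return node

tree = ast.parse("z = a*b + a*c")
new = ast.fix_missing_locations(CommonFactor().visit(tree))
print(ast.unparse(new))  # z = a * (b + c)
```

The transformed tree can then be compiled and executed in place of the user's original statement.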
6.3 Rejecting unsupported statements
We allow users to write legal Python code that we cannot run correctly, such as branches with private conditions (indeed, most SMC tools do not support private conditions [85, 56], or only support limited scenarios [85, 83]). In order to minimize users’ surprises at runtime, we perform AST-level static checking to either rewrite or reject unsupported statements at the initialization phase. For example, for an expression containing private variables, if it is a simple case like res = a if cond else b, we automatically rewrite it to res = b + cond * (a - b). In more complex cases, we prompt the user at the initialization phase to ask whether they want to reveal the condition value to the servers. If so, we automatically rewrite the code to add a reveal procedure; otherwise, we terminate with an error.
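The rewrite relies on the secret condition being a 0/1 value, so selection becomes a linear expression that needs only one secure multiplication and no branching:

```python
def oblivious_select(cond, a, b):
    # cond is a secret bit (0 or 1): the expression evaluates to a
    # when cond == 1 and to b when cond == 0, without revealing cond.
    return b + cond * (a - b)
```

Because the control flow is identical on both paths, the servers learn nothing about which branch was taken.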
7 Evaluation
7.1 Experiment setup
Testbed. We use four servers for the experiments, each of which is a KVM-based virtual machine running in a private OpenStack environment. Each server has 8 virtual CPU cores (based on Ivy Bridge Xeon running at 2.0 GHz), 64 GB RAM and a 1 GbE network connection.
PrivPy implementation. We implement the main program of PrivPy in Python. We hand-code the GC-based comparison in C++ with the Crypto++ library, compile the C++ code using gcc -O3, and wrap it into Python code. We use SSL with 1024-bit keys to protect all communications. We measure that the round-trip time of sending a 10-byte message with SSL is about 0.1 ms.
Parameter setting. In the following benchmarks, we fix the bit length of the field elements to 256 for simplicity, and set the scaling factor that determines the fixed-point precision accordingly. This is enough for most common applications. We repeat each experiment 100 times and report the average values.
7.2 Microbenchmarks
7.2.1 Basic operations
To demonstrate the performance of basic operations in PrivPy, we evaluate the fundamental POs, as well as the basic secret sharing process.
Fundamental POs. The fundamental POs are addition, multiplication and comparison. We compare two versions of each PO based on the operand types: two private variables, and one private variable with one public variable. Table 1 shows the results, and we have the following observations:
a) Multiplication and comparison are slower than addition, due to communication.
b) Multiplication with a public operand is faster than the two-private version, due to the multiply-by-public optimization.
c) Addition with a public number is slower than the normal version. This is because the servers must map public variables into the share space during the computation. We are developing a variable analysis tool to automatically identify POs involving public variables, so that the client can preprocess them.





Table 1: Latency (ms) of the fundamental POs.

               add       mult     cmp
  two private  1.83e-3   0.3      0.87
  with public  9.07e-3   9.0e-3   0.89
Derived POs with numerical methods. The derived POs in Section 5.5 approximate nonlinear operations using iterative numerical methods. We evaluate the relative error and execution time under different numbers of iterations.
For division, we use the default initial value and evaluate the division expression; for the logistic function, we use the default starting value. Figure 5 plots the relative error and execution time for different numbers of iterations and different values of the input. We can see that the per-iteration time is reasonably short and the algorithm converges fast.
Note that, different from cleartext algorithms, the servers do not see the outcome of each iteration and thus cannot tell whether the result has converged. Therefore, we need to set a conservative iteration limit as a tradeoff between result accuracy and computation time. For all our experiments, we use 50 iterations, and it takes about 30 ms to compute a division or logistic function.
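Although this section does not spell out the iteration, Newton–Raphson is the standard numerical method for such a division PO; a cleartext sketch with a fixed iteration count (since the servers cannot observe convergence) might look like:

```python
def reciprocal(d, x0, iters=50):
    # Newton-Raphson for 1/d: x <- x * (2 - d*x). Converges
    # quadratically when the initial value x0 lies in (0, 2/d).
    # The iteration count is fixed because the servers never see
    # intermediate secret values and cannot test for convergence.
    x = x0
    for _ in range(iters):
        x = x * (2 - d * x)
    return x
```

Every operation in the loop body is an addition or multiplication, so the whole iteration maps directly onto the fundamental POs. The initial value choice here is only a sketch.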
Client–server interaction overhead. We evaluate the client-side time consumption (including computation and communication) of the secret sharing process and the result recovery process. Figure 6 shows that even with 1000 clients and 1000-dimension vectors, it takes less than 0.6 seconds for the servers to collect/reveal all the vectors from/to all the clients.
7.2.2 Effectiveness of optimizations
Now we show the improvement with batch operations and the code optimizer.
Batched array operations. We evaluate the effectiveness of batching using two common operations: elementwise multiplication and elementwise comparison on vectors. Figure 7 shows the results. Both the batched multiplication and the batched comparison are substantially faster than their unbatched counterparts.
Dot and outer product. We evaluate the dot product on two square matrices and the outer product on two vectors, varying the total number of elements and measuring the time consumption. Both optimized operations avoid invoking the multiplication protocol once per output element. As Fig. 8 shows, the speedup grows with the input size, especially for the outer product.
Code optimizations using AST. We evaluate common factor extraction and expression vectorization. As these handwritten anti-patterns are usually small, we range the expression size from 2 to 10 terms. Figure 9 shows that the optimizations lead to clear performance improvements for five-term expressions in both cases.
7.2.3 Comparison with existing approaches
We compare the performance of PrivPy with existing approaches. We only choose systems that are open source and support at least both additions and multiplications. The goal of this comparison is to show the performance gap between different underlying protocols for common arithmetic operations over real numbers.
HElib [35] is an implementation of (leveled) FHE, with a parameter controlling the depth of the circuits. We choose a small value for this parameter, allowing only shallow circuits. Even so, HElib is less efficient than PrivPy.
Obliv-C [83] is a highly optimized garbled circuit implementation. To support real numbers, we convert each real number to a large integer by multiplying it with a scaling factor, following the sample code [38].
P4P + HE. P4P [30] is a practical additive-secret-sharing-based system. Following [5], we add a partial-HE (i.e., 1024-bit Paillier [62]) based multiplication to P4P.
SPDZ [25]. As described in Section 3, we port SPDZ into the PrivPy framework. We evaluate both the native SPDZ (“Raw SPDZ” in the following discussion) and SPDZ with the PrivPy frontend (“SPDZ + PrivPy”). In this paper, SPDZ runs on two servers for two-party computation.
We compare the performance of addition, multiplication and comparison. Table 2 presents the latency of basic scalar operations, and Table 3 shows the throughput. Note that not all of the frameworks above support multicore CPUs like PrivPy does; to evaluate their throughput, we run multiple independent processes of these frameworks and add up the throughput. As different frameworks perform different processing during startup, to avoid the effect of unrelated factors, we also ignore their startup time, including the time for compilation, program loading and precomputation, though it is non-negligible. Even so, PrivPy still performs much better. Our key observations include:
1) For addition, PrivPy, P4P and SPDZ have similar performance, as they are all secret-sharing-based. HElib and Obliv-C need to evaluate encrypted or garbled circuits and handle carry bits for secure addition, and are thus slower than the secret-sharing-based tools.
2) For multiplication, as Table 2 shows, secret-sharing-based tools such as SPDZ and PrivPy show a large improvement over Obliv-C for a single multiplication, by using secret sharing servers instead of HE or garbled circuits. Moreover, as Table 3 shows, the multiplication throughput of PrivPy is far higher than the others, thanks to our efficient multiplication PO.
3) SPDZ uses expensive secure bit-level operations for comparison, while PrivPy and Obliv-C use GC, which is more communication-efficient. The main reason PrivPy is slower than Obliv-C for a single comparison is that PrivPy compares 256-bit integers while Obliv-C works only on 64-bit ones. However, thanks to our optimization, the comparison throughput of PrivPy is higher than that of both Obliv-C and SPDZ.
4) SPDZ + PrivPy has similar performance to raw SPDZ, showing the minimal cost of our frontend porting.
Note that although SPDZ provides active security by generating and keeping MACs of private data, which may introduce extra cost, this cost mainly resides in the precomputation phase: updating MACs in the online phase can be done efficiently with simple operations and no extra communication, and thus has little impact on overall performance. As we evaluate the performance of the online phase and ignore the time for precomputation, the big performance gap between SPDZ and PrivPy presented above still comes from the computation protocol itself, as we analyzed in Section 5. That is, even without the MACs, the online performance of SPDZ would not improve much, and PrivPy would still perform much better than SPDZ.
Another thing to note is that in the above evaluation we do not compare with Sharemind [11], which is, as far as we know, the state-of-the-art SMC framework with passive security, as it is a closed-source commercial product to which we do not have access. However, according to the report of [28], PrivPy performs considerably better than Sharemind in the throughput of fixed-point multiplication, even though their experiments ran on faster servers. Another notable system is SecureML [60], which provides built-in support for real numbers like PrivPy, but does not provide any language frontend. According to the report in [60], the overall multiplication throughput of PrivPy is much higher than that of SecureML, as SecureML requires precomputation to generate multiplication triplets.
Table 2: Latency (ms) of basic scalar operations.

          HElib    Obliv-C  P4P+HE   Raw SPDZ  SPDZ+PrivPy  PrivPy
  add     4.0e-2   1.3e-2   1.8e-3   1.85e-3   1.86e-3      1.8e-3
  mult    31       1.6      1.8      0.348     0.344        0.3
  cmp     —        0.1      —        1.35      1.35         0.87
[Table 3: Throughput (ops/s) of basic operations for HElib, Obliv-C, P4P+HE, Raw SPDZ, SPDZ+PrivPy and PrivPy; only two entries survive: 2,583,158 and 150,125.]
7.3 Performance in real algorithms












Table 4: Average per-instance time (ms) in one iteration under different batch sizes; parenthesized values are the SPDZ + PrivPy times, with “(–)” marking cases SPDZ could not handle (column headers lost).

  1     80 (60)     70 (50)    190 (–)   150 (120)   150 (60)    770 (–)    20 (1006)   10.7 (240)   820 (–)
  10    11 (13)     9 (7)      79 (–)    23 (70)     16 (19)     318 (–)    12.8 (226)  3.0 (64.5)   460 (–)
  100   3.7 (10.7)  2.3 (4.5)  62.3 (–)  12.2 (–)    6.5 (14.8)  268 (–)    12.4 (–)    3.0 (56.1)   429.4 (–)
  1000  2.5 (–)     1.39 (–)   74.4 (–)  9.37 (–)    4.69 (–)    271.9 (–)  11.4 (–)    2.7 (–)      439.8 (–)
PrivPy supports real learning algorithms on large-scale datasets. Here we evaluate both model training and inference. For training, we use three real datasets and three algorithms. For inference, we evaluate a traditional feed-forward neural network and a convolutional neural network. As our frontend supports both the PrivPy engine and the SPDZ engine, we run the same code on both PrivPy and PrivPy + SPDZ. It takes a long time to precompile the code to run on SPDZ (e.g., 16 minutes to compile logistic regression on 1000 instances of the Adult dataset), and compilation sometimes even crashes; PrivPy does not suffer from this problem. Again, we ignore the compilation and precomputation time in the evaluation. Note that, by comparing with SPDZ, we are not aiming to simply compare performance, as we have done that in the previous section. Instead, we mainly want to show that an end-to-end implementation with active security (e.g., SPDZ) comes at a high price, and that in real-world applications one should make the right tradeoff and choose a practical solution like PrivPy.
7.3.1 Model training on secret datasets
We use three real-world datasets. We treat the records in these datasets as private and train models on them.
1) Adult [55] contains records of personal information about individuals, each with a mixture of numeric and categorical dimensions.
2) CreditCard [22] consists of credit card transactions, each described by a set of numeric features.
3) Movielens [37] contains 1 million movie ratings from thousands of users. We encode it into a large user-by-movie rating matrix. As the matrix is too large to fit into memory, we treat it as a disk-backed LargeArray.
Moreover, we evaluate the following three algorithms.
Logistic regression (LR) [36]. We train logistic regression using Stochastic Gradient Descent (SGD), which computes the gradients of the weights in each iteration.
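One SGD iteration maps directly onto the POs described earlier (dot products, elementwise operations, and the logistic function); in cleartext NumPy the step is:

```python
import numpy as np

def sgd_step(w, X, y, lr=0.1):
    # Forward pass: one dot-product PO plus the logistic-function PO.
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    # Gradient: another dot-product PO and elementwise operations.
    grad = X.T @ (p - y) / len(y)
    return w - lr * grad
```

In PrivPy, replacing NumPy with the privpy package makes the same code run on secret-shared X, y and w, with np.exp(-z)-based logistic replaced by the logistic PO.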
k-means [64]. k-means is a method for unsupervised clustering, which updates the centroids and the cluster assignments of the instances in each iteration. In all k-means evaluations, we set the number of clusters to a fixed small value.
Matrix factorization (MF) [8]. MF decomposes a large matrix into two smaller latent matrices for efficient prediction, performing several matrix multiplications in each iteration. In this paper, we decompose each matrix into two low-rank latent matrices.
Table 5: Number of POs in one iteration (matrix and vector dimensions lost). For logistic regression on each dataset: elementwise vector additions, 3 elementwise vector multiplications, 2 multiply-by-public operations, 1 matrix dot product, and 1 logistic function on a vector. For each neural network: 2 matrix dot products, 2 ReLU evaluations, and elementwise comparisons on a matrix.
Table 4 summarizes the average time consumed per instance in an iteration with different batch sizes. The key observations are: 1) batch operations bring per-instance performance improvements in all algorithms; 2) SPDZ fails to handle the larger-scale cases, as its precompilation module runs out of memory and crashes; and 3) PrivPy uses LargeArray to handle the largest Movielens dataset, and the program runs correctly.
To verify the accuracy, we compute the Relative Root Mean Squared Error (RRMSE) of the resulting model parameters between PrivPy and the cleartext version after each iteration, for the Adult and CreditCard datasets. Unlike [60], which suffers from large precision loss due to simplified versions of activation functions, PrivPy supports direct approximations of these functions (Section 5.5) and the precision loss is negligible. Figure 10 shows that the RRMSE remains small even after many iterations. We verify that the computation error is negligible and that the prediction accuracy on the two datasets is the same as the cleartext version.

7.3.2 Neural network (NN) inference
We use the MNIST dataset [51] of labeled handwritten digits [19], with 28 × 28 pixels per image. We use three example neural networks for handwritten digit recognition to evaluate the inference performance, treating both the model and the data as private. Note that we do not present results for the neural networks on SPDZ, as SPDZ fails to compile any of these cases.
Feed-forward neural network. The network consists of a 784-dimension input layer, two 625-dimension hidden layers and a 10-dimension output layer. Finally, we pass the output vector to an argmin function to get the output. The activation function is ReLU.
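In cleartext NumPy, inference in this network is three matrix products and two ReLUs. The random weights below are placeholders; in PrivPy both the weights and the image are secret-shared, and each line maps onto the array POs described earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((784, 625)) * 0.01   # input -> hidden 1
W2 = rng.standard_normal((625, 625)) * 0.01   # hidden 1 -> hidden 2
W3 = rng.standard_normal((625, 10)) * 0.01    # hidden 2 -> output

def relu(x):
    # In PrivPy this is a batch of secure comparisons + multiplications.
    return np.maximum(x, 0)

def infer(image):               # image: length-784 vector
    h1 = relu(image @ W1)       # each @ is one optimized dot-product PO
    h2 = relu(h1 @ W2)
    out = h2 @ W3
    return int(np.argmin(out))  # the evaluated networks use argmin
```

Running the same code with private arrays only changes the declaration of the weights and the image, not the inference logic.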
Convolutional neural network (CNN). We use the well-known LeNet-5 [52] model to demonstrate CNN. LeNet-5 has a 784-dimension input layer, 3 convolutional layers with 5 × 5 kernels, 2 sum-pooling layers, 3 sigmoid layers, 1 dot product layer and 1 Radial Basis Function layer. LeNet-5 also performs an argmin function on a 10-dimension vector to get the output. This is a rather heavy computation, involving a large number of POs including multiplications and comparisons.

CNN + batch normalization (BN). Based on the LeNet-5 model, we add a batch normalization [41] layer to each sigmoid layer, i.e., 3 BN layers in total. BN mainly introduces additional secure multiplications to the computation.

Table 6 shows the average time to infer an image. Batching still brings significant speedup for all models. Even with a complex model such as the CNN, it takes only 1.1 seconds to process a single image, and about 0.1 seconds per image on average when processing images in batches. This is acceptable considering the privacy guarantee. We verify that the classification results are the same as the cleartext version. To the best of our knowledge, this is the first practical implementation of a real convolutional neural network using a noise-free privacy-preserving method.






Table 6: Average time (seconds) to infer one image.

  Batch size   Feed-forward   CNN (LeNet-5)   CNN + BN
  1            1.48           1.57            1.58
  10           0.31           0.23            0.32
  100          0.06           0.1             0.17
  1000         0.04           0.1             0.16
8 Conclusion and Future Work
Over thirty years of SMC literature provide an ocean of protocols and systems, many of which work well on certain aspects of performance, security or ease of programming. We believe it is time to integrate these techniques into an application-driven and coherent system for machine learning tasks. PrivPy is a framework with a top-down design. At the top, it provides familiar Python programming interfaces with essential data types like real numbers and arrays, and uses a code optimizer and checker to avoid common mistakes. In the middle, using secret shares as both the storage and communication intermediary, we build a composable PO system that decouples the frontend from the backend. At the low level, we design new protocols that improve real-number and array computation speed. PrivPy shows great potential: it handles large datasets (1M-by-5K) and complex algorithms (CNN) fast, with minimal program porting effort.
PrivPy opens up many future directions. First, we are improving the PrivPy backend to provide active security while preserving high efficiency. Second, we would like to port existing machine learning libraries to our frontend. Third, although we focus on SMC in this work, we will introduce randomization to protect the final results [66, 53]. Last but not least, we will add fault tolerance mechanisms to the servers.
A.1 Proof of Theorem 1
[Table 7: Notations. The symbol column was lost; the surviving descriptions are: the big prime that determines the field; the additive group of integers modulo that prime; the bound of numbers in the computation; the scaling factor.]
For convenience, Table 7 summarizes the notations we use throughout the paper. First we consider the case . As , it is impossible that and are both negative. Thus, there are three possibilities: 1) and ; 2) and ; 3) and . In case 1), both and fall into , and (or will be negative). Then we have . In case 2), falls into and falls into . In case 3), falls into and falls into . In either case of 2) and 3), . Meanwhile, in this case, to ensure for , we have . Thus still holds.
Similarly, for the case , there are three possibilities: 1) and ; 2) and ; 3) and . In case 1), both and fall into , and (as ). As for , we have . In either case of 2) and 3), (or will be positive). Thus .
A.2 Proof of Theorem 2
Before we start the proof, we introduce the following three lemmas.
Lemma 1.
Given , if and , or and , then .
Proof If and , and . Similarly, if and , and . In either case, . Given the ranges of and , we know .
First consider the case , it must be that (otherwise will be shares of a negative number). This means that is just . Therefore .
Then suppose , to represent a negative number, it must be that (otherwise will be shares of a positive number). In this case , we still have .
Lemma 2.
For a private variable , given , we have 1) if and , then ; 2) if and , then .
Proof As , if and , then and . Thus we have .
On the other hand, if and , then and . Thus we have .
Lemma 3.
For a private variable , given , we have 1) if and , then ; 2) if and , then .
Proof As , if and , and , then . Thus we have .
On the other hand, if and , then and . Thus we have .
Now let us return to the proof of Theorem 2. From Lemma 1, Lemma 2 and Lemma 3, we can see that, is equivalent to and for , or and for . Thus can be calculated as follows:
First consider the case . Since and are random shares and , there are three possibilities: 1) ; 2) ; and 3) . In the second case, and will fall into as . This happens with probability . In case 3), will never fall into . To see this, notice that