PrivPy: Enabling Scalable and General Privacy-Preserving Computation

01/30/2018 ∙ by Yi Li, et al.

We introduce PrivPy, a practical privacy-preserving collaborative computation framework. PrivPy provides an easy-to-use and highly compatible Python programming front-end which supports high-level array operations and multiple secure computation engines, allowing trade-offs between security assumptions and performance. We also design and implement a new secret-sharing-based computation engine with highly efficient protocols for private arithmetics over real numbers: a fast secure multiplication protocol, a garbled-circuit-based secure comparison protocol, and optimized array/matrix operations that are essential for big data applications. PrivPy provides provable privacy and supports general computation. We demonstrate the scalability of PrivPy using machine learning models (e.g. logistic regression and convolutional neural networks) and real-world datasets (including a 5000-by-1-million private matrix).


1 Introduction

The success of machine learning is often built on data, and it is often desirable to integrate data from multiple sources for better learning results. However, the unrestricted exchange of sensitive data may threaten users’ privacy and is often prohibited by laws or business practices. Protecting privacy while allowing the integration of multiple data sources thus demands practical solutions.

Privacy-preserving computation is a well-studied topic. People have built theoretical protocols with different efficiency and security assumptions. However, we are yet to see these frameworks become practical enough to support real-world machine learning algorithms. We believe there are two reasons: 1) most existing computation engines are not optimized for machine learning tasks, which usually require massive real-number matrix operations; and 2) there are no programming interfaces friendly enough for machine learning programmers to implement common algorithms.

A computation engine is based on privacy-preserving computation techniques, which can be divided into two main categories: randomization and cryptography [2]. The former protects privacy by introducing uncertainty. Recent works on differential privacy [31], such as [76, 18, 72, 8], are the latest representatives. They are often efficient. However, in a scenario where many sources are contributing data for many iterations, one usually must add too much noise, making the accuracy far from useful. There are some noiseless models (e.g. [9]), but they rely on strong assumptions about the data distribution.

Cryptography-based privacy-preserving computation, or secure multi-party computation (SMC), allows players to collectively compute a function without revealing private information except for the final output. SMC often uses various cryptographic primitives such as garbled circuits (GC) [81], homomorphic encryption (HE) [34], or secret sharing (SS) [71]. Theoretically, fully homomorphic encryption (FHE) can compute any given function. However, even the latest implementations [35, 20] are impractical with respect to performance. GC-based approaches, such as [78, 79, 59, 17], yield constant-round protocols and are efficient for specific boolean circuits (such as AES and comparison). However, GC usually incurs a large amount of bandwidth for complex arithmetic circuits (such as machine learning algorithms) with large inputs. Secret sharing, on the other hand, requires the participants to interact for every gate, and the round complexity is proportional to the depth of the circuit. For a small circuit and a high-latency network, GC outperforms secret sharing, especially when low latency is the central concern. However, secret sharing usually requires lower bandwidth (and thus achieves higher throughput) for arithmetic circuits and provides practical performance, especially in a low-latency network such as a LAN [4, 79]. Thus, when deciding which SMC scheme to use, one should take several factors into account: network latency, the type and size of the circuit, and the core performance concern. For example, machine learning tasks usually involve a large number of real-number arithmetic operations such as multiplication, so throughput becomes the bottleneck in a low-latency network, and one can use secret sharing as the main scheme. Of course, one can also integrate different schemes to leverage their respective advantages and build a more practical SMC system.

Another important issue hindering SMC’s adoption is programmability, especially for “big data” applications. Existing SMC solutions often ignore the core requirements of machine learning applications. They either require considerable expertise in cryptography to understand the cost of each operation, or use special programming languages with high learning curves [70, 39, 25, 60, 10, 56]. Some useful solutions, such as [77], though providing rich interfaces for SMC, mainly focus on basic SMC operations, including not only basic arithmetics but also low-level cryptography tools such as oblivious transfer [69]. In contrast, machine learning programmers use frameworks like PyTorch [63], Tensorflow [1] and Scikit-learn [65] with built-in support for high-level data types like real numbers, vectors and matrices, as well as non-linear functions such as the logistic function and ReLU. It is almost impossible for machine learning programmers to rebuild and optimize all these often taken-for-granted primitives of a modern machine learning package in an SMC language. On the other hand, it is also costly for SMC experts to rewrite all the machine learning algorithm packages. Thus, it is essential to design an SMC front-end that is friendly to the machine learning community, which today means Python with NumPy. Indeed, the popular machine learning frameworks above all use Python front-ends and provide NumPy-style array operations to ease machine learning programming.

In this paper, we focus on the client/server model and assume that there is a stable and low-latency network among the servers. Based on this, we introduce PrivPy, an efficient and easy-to-program framework for privacy-preserving collaborative computation. The PrivPy front-end provides familiar Python interfaces with native NumPy array type support and common functions used in machine learning. The PrivPy computation engine uses secret sharing as the main scheme and integrates garbled circuits into it. We also design new protocols to accelerate real-number operations.

With PrivPy, we do not aim to provide a theoretically elegant solution or to overturn low-level cryptographic tools. Instead, we focus on how to build a practical programming system that hides the cryptography from programmers and provides intuitive interfaces for arithmetics, especially for machine learning algorithms. Our design philosophy is analogous to a database's: hiding the variety of data access methods under intuitive programming interfaces. User programs go through an automatic code-level optimizer into the computation engine. Concretely, we make the following contributions:

1) Python programming interface with high-level data types. We provide a very clean Python language integration with privacy-enabled common operations and high-level primitives, including broadcasting, which manipulates arrays of different shapes, and the ndarray methods (two NumPy [75] features widely used to implement machine learning algorithms), so that developers can port complex machine learning algorithms onto PrivPy with minimal effort.

2) Automatic code check and optimization. Our front-end will help the programmers avoid “performance pitfalls”, by checking the code and optimizing it automatically to improve performance.

3) Decoupling the programming front-end from the computation back-ends. We introduce a general private operator layer that allows the same language interface to support multiple computation back-ends, enabling trade-offs among different performance and security assumptions. Our current implementation supports both the SPDZ back-end and our own computation engine.

4) Efficient real-number operations. We design and implement two efficient protocols with native support for real numbers using four semi-honest servers: one for multiplication and the other for comparison, both with provable accuracy and privacy.

5) Validation on large-scale machine learning tasks. We demonstrate the practicality of our system on machine learning algorithms, such as logistic regression and convolutional neural networks (CNNs), with real-world datasets (including a 5000-by-1-million matrix) and application scenarios including model training and inference. CNN inference on an image takes only about 1 second even with complex models. To our knowledge, this is the first practical modern CNN implementation using a noise-free privacy-preserving method.

2 Related Work

There are also many SMC frameworks using secret sharing. Some are algorithm-specific, such as [36, 64, 29, 30]. Implementations for general-purpose arithmetics include VIFF [24], SEPIA [74], SPDZ [25] and Sharemind [11]. Recent work provides optimizations for these frameworks [6, 45, 4]. For example, [4] performs integer/bit multiplication with 1 round and optimal communication cost using three semi-honest servers. While natively supporting efficient computation for integers, these approaches do not provide built-in support for real-number operations. To work around this, most of them parse each shared integer into field elements and use secure bit-level operations, such as bit decomposition and shifting [23, 16, 50, 44], to simulate fixed/floating-point operations, which sharply reduces throughput. SecureML [60] provides built-in support for real numbers, but requires interactive precomputation to generate random multiplication triplets, which relies heavily on cryptographic tools and is much more expensive than the online computation. In comparison, the PrivPy computation engine avoids both the secure bit-level operations and the precomputation.

We emphasize that programmability is as important as the computation engine. SecureML [60], though providing native support for real numbers, does not provide any language front-end. TASTY [39] and ABY [27] provide interfaces for programmers to convert between different schemes. However, they only expose low-level interfaces, and the programmers must decide by themselves which cryptographic tools to choose and when to convert between them, making the learning curve steep. L1 [70] is an intermediate language for SMC and supports basic operations. But L1 is a domain-specific language and does not provide high-level primitives to ease the array/matrix operations frequently used in machine learning algorithms. [26] and [12] suffer from similar problems. PICCO [85] supports additive secret sharing and provides customized C-like interfaces. But the interfaces are not intuitive enough and only support simple array operations. Also, according to their report, the performance is not practical enough for large-scale arithmetical tasks. KSS [49] and ObliVM [56] suffer from the same issues. PrivPy, on the other hand, stays compatible with Python and provides high-level primitives (e.g. broadcasting) with automatic code checking and optimization, imposing no learning curve on application programmers and making it convenient to implement machine learning algorithms in a privacy-preserving setting.

3 System Design

3.1 Problem formulation

We can formulate the problem that PrivPy solves as the following SMC problem: there are n clients C_1, ..., C_n. Each client C_i has a set of private data D_i as its input. The goal is to use the union of all the D_i's to compute some function f(D_1, ..., D_n). The D_i's can be records collected independently by the clients, which they use to jointly train a model. D_i can even be a secret model held by C_i; in this situation, C_i can perform inference on others' private data without revealing its model.

For the computation, we have the following requirements: 1) Privacy: during the computation, no information, other than the output f(D_1, ..., D_n), is revealed to the participants. 2) Precision: the output should be (almost) the same as the cleartext version. 3) Generality: f can be a combination of any numerical operations, including both calculation and comparison. 4) Efficiency: f can be evaluated fast enough. 5) Scalability: the solution should scale to support a large number of participants and data items.

3.2 System architecture overview

Fig. 1 shows an overview of PrivPy architecture, which has two main components: the language front-end and the computation engine back-end. The front-end runs at the client side, providing programming interfaces and code optimizations. The back-end runs on the servers, performing the privacy-preserving computation. We decouple the front-end and back-end so that we can leverage multiple implementations for private computation.

Figure 1: The overview of PrivPy architecture.

3.2.1 Programming language front-end

The front-end provides convenient Python APIs. A PrivPy program is a valid Python program with NumPy-style data type definitions. Fig. 2 shows a PrivPy program that computes the logistic function using the Euler method [73]. The public variable start is the initial point, and iter_cnt is the number of iterations.

1x = privpy.ss(clientID)
2def logistic(x, start, iter_cnt):
3  result = 1.0 / (1 + math.exp(-start))
4  deltaX = (x - start) / iter_cnt
5  for i in range(iter_cnt):
6    derivate = result * (1 - result)
7    result += deltaX * derivate
8  return result
9result = logistic(x, 0, 100) # main()
10result.reveal()
Figure 2: Example PrivPy code: logistic function.

Basic semantics. Unlike many domain-specific front-ends [39, 70, 11], which require the programmers to have knowledge about cryptography and use customized languages, the program itself (lines 2-9) is a plain Python program, which can run in a raw Python environment with cleartext input. The user only needs to add two things to make it privacy-preserving in PrivPy:

Declaring the private variables. Line 1 declares a private variable x as the input from the client identified by clientID.

Getting results back. The function reveal() in line 10 allows clients to recover the cleartext of the private variable result.

Programmers not familiar with cryptography, such as machine learning programmers, can thus implement algorithms with minimal effort.

Supporting both scalar and array data types. PrivPy supports scalars as well as arrays of any shape. Supporting array operations is essential for writing and optimizing machine learning algorithms, which rely heavily on arrays. When invoking the ss method, PrivPy detects the type and the shape of x automatically. If x is an array, the program returns an array of the same shape, containing the result of the function applied to every element of x. Following the NumPy [75] semantics, we also provide broadcasting, which allows operations between a scalar and an array, as well as between arrays of different shapes, two widely used idioms. That is why the logistic function in Fig. 2 works correctly even when x is a private array. As far as we know, existing SMC front-ends, such as [11, 70, 85, 25], do not support such elegant programs. For example, PICCO [85] only supports operations on arrays of equal shape.
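The broadcasting behavior PrivPy mirrors can be seen in plain NumPy. The sketch below is a cleartext stand-in (not PrivPy code): it is the hypothetical plaintext version of the Fig. 2 program, and the same function body works unchanged for a scalar or an array of any shape because NumPy broadcasts the scalar constants elementwise.

```python
import numpy as np

def logistic(x, start, iter_cnt):
    # Cleartext version of Fig. 2: Euler-method approximation of 1/(1+e^-x).
    result = 1.0 / (1.0 + np.exp(-start))
    deltaX = (x - start) / iter_cnt
    for _ in range(iter_cnt):
        derivate = result * (1.0 - result)
        result = result + deltaX * derivate
    return result

# The same code runs on a scalar and on a 2x2 array: the operations
# between `result` (scalar or array) and constants broadcast elementwise.
scalar_out = logistic(1.0, 0.0, 100)
array_out = logistic(np.array([[1.0, -1.0], [0.0, 2.0]]), 0.0, 100)
```

This is exactly the semantics PrivPy preserves when x is a private array instead of a NumPy one.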

Parsing and optimizing automatically. Given the program written by the user, the interpreter of our front-end parses it into basic privacy-preserving operations supported by the back-end, and the optimizer automatically rewrites the program to improve efficiency (see Section 6 for details). This optimization helps programmers avoid performance “pitfalls” in SMC situations.

3.2.2 Computation engines

We realize that applications have very distinct requirements on security assumptions as well as performance. Thus, we decided to decouple the language front-end from the back-end computation engines. We can support any computation engine as long as it supports basic operations such as addition, multiplication and comparison. And as the front-end only uses these interfaces and does not modify the underlying protocols, security is guaranteed as long as the computation engine provides secure and composable interfaces. We can easily port the language front-end to reuse the language tools and even machine learning algorithms. Our current prototype supports a legacy computation engine as well as our own. Of course, there are many kinds of engines based on different assumptions that provide different levels of security, and we will support more in the future.

Active security with the SPDZ engine. We support SPDZ by adding a very thin wrapper to handle the communications with our language front-end. The advantage of SPDZ is that it provides active security by generating and keeping Message Authentication Codes (MACs) of private data. However, SPDZ uses inefficient secure bit-level operations for real numbers. Also, SPDZ needs time- and memory-consuming compilation and may even fail to compile complex algorithms (we will show this in Section 7). Thus, for applications that do not require active security, we provide our own efficient computation engine.

High-performance passive security with PrivPy engine. We have developed a computation engine combining secret sharing scheme and garbled circuits. Our model provides passive security like Sharemind [11]. However, our computation protocols provide better performance for real-number arithmetics. Moreover, we optimize the computation engine for batch operations that are essential in large-scale data mining problems. We detail the engine in the next two sections.

4 The PrivPy computation engine

4.1 Threat Model

Our computation engine uses four semi-honest servers for computation, and we focus on the client/server model in this paper, as [4, 53, 30, 11, 56] do. Moving the computation from clients to these servers effectively reduces the number of parties and thus improves system performance and scalability. Clients send secretly shared data to the servers, and the servers then perform privacy-preserving computation on these shares.

Our engine is based on two assumptions: 1) all the servers are semi-honest, which means all servers follow the protocol and do not collude with each other, but they are curious about the users’ privacy and will mine as much information as possible; and 2) all communication channels are secure, so that adversaries cannot see/modify anything in these channels. Note that a server here mainly refers to a domain, and it can be a cluster of machines.

The assumption of non-colluding cloud servers is rational and common, and many well-known frameworks rely on this kind of assumption. For example, P4P [30] and PEM [53] use semi-honest servers to enable additive computation, while Sharemind [11] and [4] use three semi-honest servers to enable general computation. Our computation engine introduces one more semi-honest server to enable much more efficient privacy-preserving general computation for arithmetics. In fact, as there is a growing number of independent and competing cloud computing providers, it is feasible to find a small number of non-colluding computing servers.

Some frameworks can tolerate one or more malicious adversaries, which means they can detect deviations from the protocol, by maintaining authenticated information for each private variable [25, 33, 78, 79] or performing redundant computation [59, 17]. However, this comes at a cost. In this paper, we focus on building a practical SMC framework for machine learning tasks, and aim to make the right trade-off between security assumptions and efficiency. Thus, we choose the assumption of semi-honest servers, which is rational and feasible. We leave extensions for detecting malicious adversaries to future work.

4.2 Computation engine architecture

We use the client/server model: each client breaks its private variable x into two shares, x1 and x2, and sends the shares to the servers shown in Fig. 3. S1 and S2 are the primary servers, which work as both secret-share storage and computation engines. S1 only touches the x1's, while S2 only handles the x2's. To implement operations, we adopt two assistant servers, Sa and Sb, which only provide computation but no storage.

Our engine includes two subsystems. The secret sharing storage (SS store) subsystem, which resides on S1 and S2, provides (temporary) storage of shares of private inputs and intermediate results. The private operation (PO) subsystem runs on all four servers and provides an execution environment for private operations. The servers read shares from the SS store, execute a PO, and write the shares of the result back to the SS store.

Figure 3: The overview of PrivPy computation engine.

4.3 The PrivPy workflow

With our computation engine, we present the PrivPy workflow.

Step 1: Program preparation. Once ready, the program is parsed into basic operations and optimized by the optimizer. Then it is passed to the back-end for computation.

Step 2: Pooling secret shares. Each client computes the secret shares for private variables from itself, and sends the resulting shares to the SS store on the servers.

Step 3: Executing the POs on servers. After receiving all expected shares of secrets, the servers start the private computation without any client involvement.

Step 4: Revealing the final results to the clients. When the server invokes the reveal() method, the clients are notified to find the result shares in the SS store, and finally recover the cleartext result.

5 PO Protocols

5.1 Mathematical preliminaries

We first review some mathematical preliminaries used in PrivPy for readers not familiar with the area.

Additive secret sharing. PrivPy uses the classic additive secret sharing scheme. Assume x is a non-negative integer and N is a large integer. Let Z_N be the additive group of integers modulo N. Given x1 and x2 in Z_N, we call them two shares of x as long as (x1 + x2) mod N = x. We then define the function share(x) = (r, (x - r) mod N), where r is a randomly picked integer in Z_N. Apparently, share generates two shares of x. We can see that share is additively homomorphic in Z_N, i.e. the componentwise sum of the shares of x and y yields the shares of x + y. It is easy to see that both shares are uniformly random in Z_N, and in PrivPy, the two non-colluding servers S1 and S2 manage these shares independently. Thus, neither server learns anything about the private variable x.
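A minimal sketch of this sharing scheme in plain Python (illustrative only: the names share/reconstruct and the modulus N = 2^64 are our assumptions, not PrivPy's API):

```python
import secrets

N = 2 ** 64  # size of the additive group Z_N (assumed for illustration)

def share(x):
    # Split x into two shares, each uniformly random in Z_N on its own.
    x1 = secrets.randbelow(N)
    x2 = (x - x1) % N
    return x1, x2

def reconstruct(x1, x2):
    return (x1 + x2) % N

# Additive homomorphism: summing shares componentwise shares the sum.
a1, a2 = share(7)
b1, b2 = share(35)
assert reconstruct(a1, a2) == 7
assert reconstruct((a1 + b1) % N, (a2 + b2) % N) == 42
```

Because each share is uniformly random on its own, a server holding only one share of each pair learns nothing about x.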

Supporting real numbers. Note that the function share above is defined on non-negative integers only. To map real numbers to Z_N, we use a typical discretization approach. Given a real number x with |x| ≤ B, where B is the bound of the input, we denote x̃ as the corresponding integer in Z_N:

x̃ = round(x · 2^d) mod N (1)

where 2^d is the scaling factor, which implies that the precision of this representation of real numbers is 2^(-d). Finally, we can define share(x̃) as the secret sharing function of a real number x. Ignoring the precision loss, the scheme remains additively homomorphic.

Actually, our representation is close to the fixed-point representation [16]. We do not use a floating-point representation because of its inefficiency for basic operations (e.g. to perform secure addition with a floating-point representation, we need to align the radix points, which is time-consuming) [44, 3]. Meanwhile, fixed-point representations have been proven practical enough even for complex data mining tasks [58, 40]. Note that negative numbers are mapped into the upper half of Z_N. Assuming all inputs and (accumulated) intermediate results stay within the bound, there will never be any unexpected sign flipping caused by overflowing Z_N. We will show later that combining this frequently used mapping scheme with our PO design yields much more efficient POs than existing approaches.
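The mapping and its inverse can be sketched in a few lines (a cleartext illustration; the parameters N = 2^64 and d = 20 are our assumptions, chosen only to make the example concrete):

```python
N = 2 ** 64   # group size (assumption for illustration)
D = 20        # scaling exponent: scaling factor 2**D, precision 2**-D

def encode(r):
    # Map a real number to Z_N; negatives land in the upper half of Z_N.
    return round(r * (1 << D)) % N

def decode(u):
    # Interpret the upper half of Z_N as negative values, then unscale.
    signed = u - N if u >= N // 2 else u
    return signed / (1 << D)
```

Up to rounding, the encoding stays additively homomorphic: for instance, decode((encode(1.5) + encode(-2.25)) % N) gives -0.75.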

Getting rid of wrap-around. Given the above secret sharing and real-number representation schemes, we can get rid of the wrap-around problem [23, 46, 16], so that we can apply arithmetical operations directly on the shares. First, we define an indicator function φ that converts an integer u in Z_N to the signed representation: if u < N/2, then φ(u) = u; otherwise, φ(u) = u - N. We then introduce two theorems (we leave the formal proofs to our technical report [54]).

Theorem 1.

Given , if , then we have .

Theorem 2.

Given a private real number , denoting the event as FAIL, we have .

Combining Theorem 1 and Theorem 2, we can see that, as long as the input bound is small relative to N, the probability that FAIL does not occur is extremely high. For typical parameter choices, the failure probability is negligible; in practice, it does not cause any failure even for complex algorithms. With this conclusion, we will show that we can improve the performance of multiplication and comparison of real numbers.

5.2 The addition PO

Based on the definition of share, to compute the secret shares of a sum, the servers S1 and S2 only need to independently add up the shares they hold.

5.3 The multiplication PO

Real-number multiplication is a basic operation for machine learning algorithms. For example, neural networks frequently use matrix multiplications for network propagation. Such algorithms usually involve massive real-number multiplications, and the throughput of real-number multiplication is vital for these algorithms.

Our solution avoids the secure bit-level operations in integer/bit-based schemes (e.g. [11, 25]) and the precomputation in [60]. We believe that the performance gain is worth the cost of finding one or two more semi-honest servers. Protocol 1 shows our real-number multiplication protocol. An important detail is that the mapping scheme scales its input by the scaling factor, so each product term carries the scaling factor twice. Thus we need to adjust the result by dividing out the extra factor.

Input: and : and
Output: : shares of
Steps:
  1. generates two random integers

    and , and sends them to .

  2. sets and .

    sets and .

  3. sends to and sends to .

    sends to and sends to .

  4. calculates .

    calculates .

    calculates .

    calculates .

  5. sets and sets .

  6. sends to and sends to .

    sends to and sends to .

  7. sets .

    sets .

Protocol 1 Multiplication PO protocol.

Analysis. (Correctness). We can verify that and in step are the shares of and respectively.

Ignoring the rounding effect, and with the conclusion of the above theorems, we have and . Thus , which means the modular sum of and produced in step is , i.e. . Steps to calculate this sum and result shares are and produced in step .

(Security). We leave the detailed proof in Appendix .3. Informally speaking, only steps 3 and 6 involve communication, and the numbers transferred between the servers are all uniformly distributed integers generated by the secret sharing function. Thus no information is leaked to the servers.

(Complexity). Note that the random-integer sharing in step 1 can be optimized by letting the two servers use the same random number generator with the same seed, so that we can avoid transferring these random integers. After this optimization, the remaining communication consists of the share transfers in steps 3 and 6.
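The scaling adjustment mentioned above can be sanity-checked in the clear: multiplying two fixed-point encodings yields an extra factor of 2^d that must be divided out. This sketch reuses the hypothetical encode/decode mapping from Section 5.1 (our assumed parameters N = 2^64, d = 20) and, for simplicity, only exercises non-negative operands; the full protocol handles the modular cross terms via Theorems 1 and 2.

```python
N = 2 ** 64
D = 20

def encode(r):
    return round(r * (1 << D)) % N

def decode(u):
    signed = u - N if u >= N // 2 else u
    return signed / (1 << D)

# Each product of encodings carries the factor 2^(2D); one rescale
# by 2^D restores the fixed-point encoding of the true product.
x, y = 1.5, 2.5
raw = (encode(x) * encode(y)) >> D   # divide out the extra 2^D
assert decode(raw % N) == x * y      # recovers 3.75
```

In the protocol, this division happens on the shares after the cross terms are combined, so no party ever sees the plaintext product.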

5.4 The comparison PO

Comparison is common in many basic components, such as the ReLU function and k-means, and mostly works together with other operations. The (improved) Garbled Circuit (GC) [81] is one of the most communication-efficient secure comparison protocols [21, 47]; the measurements in [27] also demonstrate this. Therefore, unlike some existing secret-sharing-based frameworks (e.g. [25, 11]) which use expensive secure bit-level operations to implement comparison, PrivPy integrates GC with the secret sharing scheme to obtain a more efficient comparison.

There are also some previous approaches, such as [27, 32], that integrate GC with secret sharing. They perform comparison on shared values as follows: 1) first, they run two addition circuits on the shares to obtain the Yao's-circuit shares of the two operands; 2) then they evaluate the comparison circuit using the two Yao's-circuit shares as input; 3) finally, they convert the comparison result (represented using Yao's-circuit sharing) back to arithmetic sharing. Our comparison protocol performs GC directly on the arithmetic shares, thus avoiding the addition circuits. And with the two assistant servers, we can efficiently convert the comparison result back to arithmetic shares. As the communication cost of the garbled circuits dominates the performance of the comparison (especially for batch comparison), and the addition circuit is more costly than the comparison circuit [48], eliminating the addition circuits significantly improves the throughput. Protocol 2 details the process.

Input: and : and
Output: : shares of , where can be etc.
Steps:
  1. sets .

    sets .

  2. sends to and sends to .

    sends to and sends to .

  3. sets .

    sets .

  4. sets and sets .

  5. and generate random bits and respectively. They collaboratively compute using GC, where means XOR. is the output of GC.

  6. sends to and . sets . Then sends to and sends to .

  7. If is 0, sets and sets .

    If is 1, sets and

    sets , where is the share of 1.

Protocol 2 Comparison PO protocol.

To implement GC, we use the free-XOR construction [48, 47]. Besides, we use the half-gates trick [84] to reduce the size of garbled AND gates. We can further optimize GC performance by moving the circuit generation and transfer to the setup phase, as [47] does.

Analysis. (Correctness). It is obvious that and at step are shares of . The purpose of step is to convert integers to regular signed integers so that we can use GC to compare. This step produces . Note that as Theorem 2 establishes, we have . Clearly, has the same sign as , so . Thus, we can now perform GC directly on and to get the comparison result, without using the conversion in [27].

At step , the comparison result is masked by two random bits and generated independently by and , like [47] does. After receiving and the shares of (step ), and recover the shares of the original comparison result (step ). When , they perform to flip the bit. Finally, is the share of the result.

(Security). As in Protocol 1, we can define a functionality in an ideal model and construct simulators to show that each server only sees independent random numbers or bits in Protocol 2. We omit the details here.

Informally speaking, during the entire process, and only generate and receive secret shares. and perform the comparison between themselves, and the comparison result is masked by the random bits and . Each server thus only sees independent random integers or garbled bits.

(Complexity). Apart from the communication triggered by GC, there are 8 invocations: in step and in step . In GC, following [47], we move the circuit generation and transfer to the setup phase. Thus, in the online phase of GC, only needs to send bits to , where is the bit length of and is the security parameter of . In this paper, we set .
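The key observation behind the protocol, that comparing two private reals reduces to reading the sign of the shared difference under the signed interpretation, can be sketched in the clear. This is an illustration only: the names and the parameters (N = 2^64, d = 20) are our assumptions, and the final sign test stands in for what the servers would evaluate inside a garbled circuit on their shares.

```python
import secrets

N = 2 ** 64
D = 20

def encode(r):
    return round(r * (1 << D)) % N

def to_signed(u):
    # The indicator function: map Z_N onto signed representatives.
    return u - N if u >= N // 2 else u

def less_than(x, y):
    # Share the encoded difference x - y additively; the servers would
    # compare the signed sum of their shares inside a garbled circuit.
    d = (encode(x) - encode(y)) % N
    d1 = secrets.randbelow(N)
    d2 = (d - d1) % N
    return to_signed((d1 + d2) % N) < 0

assert less_than(1.25, 2.0) is True
assert less_than(3.0, -7.5) is False
```

The sign of the difference has the same value whether computed on the reals or on their Z_N encodings, provided the inputs stay within the bound of Theorem 2.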

5.5 Derived POs

We can compose multiple basic POs to form more complex derived POs commonly used in machine learning algorithms. For example, we can use the Newton-Raphson algorithm [82] to implement division, and to implement the logistic function we use the Euler method [73]. We also implement other common math functions, such as sqrt, log, exp and max_pooling, using similar numerical methods. We omit the details due to space limitations.
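As an illustration of why such derived POs compose from the basic ones, Newton-Raphson division uses only additions and multiplications, exactly the operations the basic POs provide. This cleartext sketch is ours (the initial guess and iteration count are assumptions, not the parameters PrivPy uses):

```python
def reciprocal(b, iters=20, x0=0.01):
    # Newton-Raphson on f(x) = 1/x - b gives the update x <- x * (2 - b*x),
    # which needs no division and converges quadratically for 0 < x0 < 2/b.
    x = x0
    for _ in range(iters):
        x = x * (2.0 - b * x)
    return x

def divide(a, b):
    # a / b as a multiplication by the iteratively computed reciprocal.
    return a * reciprocal(b)
```

Run on secret shares, every step is an addition PO or a multiplication PO, so the whole division inherits the privacy of the basic protocols.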

5.6 POs for performance optimization

We provide the following three sets of POs whose functionality is already covered by the basic POs, but the separate versions can significantly improve performance in certain cases. Programmers can use these POs directly.

Array POs. Batching is a commonly used optimization in SMC frameworks [11, 25, 85]: it batches up independent data transfers among the servers and thus reduces the fixed per-transfer overhead. Array POs natively support batching. As many machine learning algorithms heavily utilize array operations, this optimization reduces communication rounds and can improve performance significantly.

Multiply by public variables. When an operation involves both public and private variables, we can optimize performance by revealing the public variables. Multiplication benefits from this optimization the most: the servers only need to multiply their shares by the public variables locally, and no communication between servers is necessary.
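A toy illustration of why multiplying by a public value needs no communication, using two additive shares over a hypothetical 61-bit Mersenne prime (a stand-in for the paper's 256-bit field, and not PrivPy's actual sharing scheme):

```python
P = 2**61 - 1      # hypothetical prime modulus for this sketch
x = 1234567        # the secret value
s1 = 987654321     # one server's additive share (random in the real protocol)
s2 = (x - s1) % P  # the other server's share, so (s1 + s2) mod P == x
c = 42             # public multiplier known to both servers

# Each server scales its own share locally; the scaled shares still
# reconstruct to c*x, so no round of communication is required.
assert (c * s1 + c * s2) % P == (c * x) % P
print("local scaling preserves the secret")
```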

Dot and outer product. Dot and outer products of arrays are frequently used in common machine learning algorithms. For example, logistic regression and neural networks use the dot product for forward propagation, and the outer product is often used for calculating gradients. Implementing them with for-loops causes many duplicated transfers, as each element is multiplied by several other elements in the multi-dimensional case. We thus provide built-in optimized dot and outer products. Specifically, for two private arrays and ( and can have any dimensions, as long as their shapes match after broadcasting, as in Numpy), we can calculate the dot product as . The process is similar to the multiplication protocol: and calculate the former two terms locally, while and calculate the latter two terms. The security can be proved in the same way as for the multiplication protocol. This optimization significantly reduces communication cost: a for-loop for the dot product triggers a separate invocation of Protocol 1 for each element product, while the optimized version batches all of them into a constant number of transfers. In comparison, well-known SMC front-ends such as PICCO and SPDZ do not provide such built-in optimization.
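The share-level algebra behind the optimized dot product can be sketched as follows. With two additive shares per vector (again a toy stand-in for PrivPy's actual scheme; in the real protocol the cross terms are computed by different server pairs and masked before exchange), the inner product decomposes into four dot products that are each computed locally:

```python
import random

P = 2**61 - 1  # hypothetical prime modulus for this sketch

def share(vec):
    """Split a vector into two additive shares mod P."""
    s1 = [random.randrange(P) for _ in vec]
    s2 = [(v - a) % P for v, a in zip(vec, s1)]
    return s1, s2

def local_dot(u, v):
    """One server's local work: a plain dot product mod P."""
    return sum(a * b for a, b in zip(u, v)) % P

x, y = [3, 1, 4], [1, 5, 9]
x1, x2 = share(x)
y1, y2 = share(y)

# <x,y> = <x1,y1> + <x1,y2> + <x2,y1> + <x2,y2>: four local dot products,
# so the servers exchange O(1) scalars instead of O(n) per-element traffic.
result = (local_dot(x1, y1) + local_dot(x1, y2)
          + local_dot(x2, y1) + local_dot(x2, y2)) % P
print(result)  # 44  (= 3*1 + 1*5 + 4*9)
```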

6 Python Front-end and Optimizations

As discussed in Section 3, PrivPy provides a programming interface that is compatible with plain Python code. In this section, we focus on the implementation and optimization of the programming interfaces. Our goal is to provide intuitive interfaces and automatic optimizations that avoid steep learning curves and enable programmers to focus on the machine learning algorithm itself.

6.1 Python Interfaces

Private array types. The private array class in PrivPy encapsulates arrays of any shape. Users only need to pass a private array to the constructor, and the constructor automatically detects the shape. Like the array type in Numpy [75], our private array supports broadcasting, i.e. PrivPy can handle arithmetic operations on arrays of different shapes by “broadcasting” the smaller arrays. We also implement the ndarray methods in Numpy; our technical report [54] lists the ndarray methods we have implemented. Broadcasting and ndarray methods are very useful for implementing common machine learning algorithms, which usually handle arrays of different shapes.

Operator overloading. We overload operators for the private data classes, so standard arithmetic and comparison operators work on private data, public data, or a combination of the two. The implementation of these overloaded operators chooses the right POs based on the operand types.
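A minimal sketch of this dispatch pattern, using a toy two-share additive class (names and the modulus are hypothetical; real PrivPy POs hold the shares on separate servers, and a private-times-private product needs the interactive multiplication protocol):

```python
import random

P = 2**61 - 1  # hypothetical prime modulus for this sketch

class Priv:
    """Toy additively shared value: (s1 + s2) mod P reconstructs the secret."""
    def __init__(self, s1, s2):
        self.s1, self.s2 = s1 % P, s2 % P

    @classmethod
    def share(cls, x):
        r = random.randrange(P)
        return cls(r, x - r)

    def reveal(self):
        return (self.s1 + self.s2) % P

    def __add__(self, other):
        if isinstance(other, Priv):              # private + private: add shares
            return Priv(self.s1 + other.s1, self.s2 + other.s2)
        return Priv(self.s1 + other, self.s2)    # private + public: absorb locally

    def __mul__(self, other):
        if isinstance(other, Priv):              # would need the multiplication PO
            raise NotImplementedError("requires the interactive protocol")
        return Priv(self.s1 * other, self.s2 * other)  # multiply-by-public: local

x = Priv.share(7)
print(((x + 3) * 2).reveal())  # 20
```

The overloaded operators inspect `isinstance(other, Priv)` and pick the cheap local path whenever one operand is public, mirroring the PO selection described above.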

Support for large arrays. In this “big data” age, real-world machine learning tasks usually use large arrays as inputs. However, mapping the data onto secret shares unavoidably increases the data size. Thus, real-world datasets that fit in memory in cleartext may fail to load in the private version. For example, the matrix we use in our experiments requires over 200 GB of memory when mapped into a 256-bit integer space. It is hard for application programmers to design new algorithms to handle the memory limit. Thus we provide a LargeArray class that transparently uses disks as the backing storage for arrays too large to fit in memory.
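One plausible way to build such a disk-backed array (a sketch, not PrivPy's actual implementation) is a memory-mapped file, so only the touched pages reside in RAM:

```python
import os
import tempfile
import numpy as np

# Hypothetical LargeArray-style storage: shares live in a memory-mapped
# file on disk instead of in RAM.
path = os.path.join(tempfile.mkdtemp(), "shares.dat")
big = np.memmap(path, dtype=np.int64, mode="w+", shape=(1000, 1000))
big[0, :] = np.arange(1000)   # writes go through the page cache to disk
big.flush()

# Reopening the file recovers the data without loading it all into memory.
reopened = np.memmap(path, dtype=np.int64, mode="r", shape=(1000, 1000))
print(int(reopened[0, 999]))  # 999
```

Element and slice accesses keep the familiar ndarray interface, which is why such a class can stay transparent to user code.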

Array operation examples. In addition to Fig. 2, Fig. 4 shows an extra example of matrix factorization, which decomposes a large private matrix into two latent matrices. Both demonstrate ndarray methods in PrivPy. Users can implement the algorithms in plain Python, then just replace the Numpy package with the PrivPy package and add private variable declarations. In fact, by replacing all privpy with numpy, lines 2-13 of Fig. 4 can run directly in a plain Python environment with cleartext inputs.

x = ... # read data using ss()
factor, gamma, lamb, iter_cnt = initPublicParameters()
n, d = x.shape
P = privpy.random.random((n, factor))
Q = privpy.random.random((d, factor))
for _ in range(iter_cnt):
  e = x - privpy.dot(P, privpy.transpose(Q))
  P1 = privpy.reshape(privpy.repeat(P, d, axis=0), P.shape[:-1] + (d, P.shape[-1]))
  e1 = privpy.reshape(privpy.repeat(e, factor, axis=1), e.shape + (factor,))
  Q1 = privpy.reshape(privpy.tile(Q, (n, 1)), (n, d, factor))
  Q += privpy.sum(gamma * (e1 * P1 - lamb * Q1), axis=0) / n
  Q1 = privpy.reshape(privpy.tile(Q, (n, 1)), (n, d, factor))
  P += privpy.sum(gamma * (e1 * Q1 - lamb * P1), axis=1) / d
P.reveal(); Q.reveal()
Figure 4: Example PrivPy code: matrix factorization.

6.2 Code analysis and optimization

Compared to computation on cleartext data, private operations have very different costs, and many familiar programming constructs may lead to bad performance, creating “performance pitfalls”. Thus, we provide aggressive code analysis and rewriting to help avoid these pitfalls. For example, it is perfectly fine to write an element-wise multiplication of two vectors in a plain Python program:

for i in range(n): z[i] = x[i] * y[i]

However, this is a typical anti-pattern: it invokes one multiplication PO per element instead of a single array PO (Section 5.6), causing significant performance overhead.

To solve the problem, we build a source code analyzer and optimizer based on Python’s abstract syntax tree (AST) package [61]. Before the servers execute the user code, our analyzer scans the AST and rewrites anti-patterns into more efficient forms. In this paper, we implement three such rewrites:

For-loops vectorization. Vectorization [80] is a well-known compiler optimization. This analyzer rewrites the above for-loop into vector form.

Common factor extraction. We convert sums of products that share a common private factor into a single product of that factor with the sum of the remaining operands. In this way, we reduce the number of multiplication POs to one, saving significant communication time.

Common expression vectorization. Programmers often write vector expressions explicitly, element by element, especially for short vectors. The optimizer extracts the two underlying vectors and rewrites the expression into a single vector dot product.
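To make the AST-rewriting idea concrete, here is a minimal sketch of a common-factor-extraction pass using Python's standard `ast` module. It is a deliberately simplified version of such a pass: it only handles the two-term pattern where both products share their left operand.

```python
import ast

class CommonFactorExtractor(ast.NodeTransformer):
    """Rewrite ``a*b + a*c`` into ``a*(b + c)``, halving the number of
    secure multiplications for this pattern (simplified illustration)."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite subexpressions first
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.BinOp)
                and isinstance(node.left.op, ast.Mult)
                and isinstance(node.right, ast.BinOp)
                and isinstance(node.right.op, ast.Mult)
                and ast.dump(node.left.left) == ast.dump(node.right.left)):
            return ast.BinOp(
                left=node.left.left,
                op=ast.Mult(),
                right=ast.BinOp(left=node.left.right, op=ast.Add(),
                                right=node.right.right))
        return node

tree = ast.parse("z = a * b + a * c")
tree = ast.fix_missing_locations(CommonFactorExtractor().visit(tree))
print(ast.unparse(tree))  # z = a * (b + c)
```

A real pass would also need to match factors in any operand position, handle longer sums, and verify that the factor is private (so that the saved multiplications are actually the expensive ones).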

6.3 Rejecting unsupported statements

We allow users to write legal Python code that we cannot run correctly, such as branches with private conditions (in fact, most SMC tools do not support private conditions [85, 56], or only support limited scenarios [85, 83]). In order to minimize surprises for users at runtime, we perform AST-level static checking to either rewrite or reject unsupported statements at the initialization phase. For example, for an expression containing private variables, if it is a simple case like res = a if cond else b, we automatically rewrite it to res = b + cond * (a - b). In more complex cases, we prompt the user at the initialization phase whether they want to reveal the condition value to the servers. If so, we automatically rewrite the code to add a reveal procedure; otherwise, we terminate with an error.
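The rewrite above works because the ternary can be expressed branch-free with arithmetic only, which runs on secret shares without revealing the condition bit:

```python
def oblivious_select(cond, a, b):
    """Branch-free form of ``a if cond else b`` for a 0/1 condition:
    cond = 1 yields a, cond = 0 yields b, using only +, -, *."""
    return b + cond * (a - b)

# Sanity check against the plain ternary on cleartext values.
for cond in (0, 1):
    for a, b in [(10, 20), (-3, 7)]:
        assert oblivious_select(cond, a, b) == (a if cond else b)
print("matches the ternary")
</antml>```

In the private setting, `cond` is the shared output of a comparison PO and the multiplication is one secure multiplication, so both branches are "evaluated" arithmetically and neither the condition nor the untaken branch is revealed.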

7 Evaluation

7.1 Experiment setup

Testbed. We use four servers for the experiments, each of which is a KVM-based virtual machine running in a private OpenStack environment. Each server has 8 virtual CPU cores (based on Ivy Bridge Xeon running at 2.0 GHz), 64 GB RAM and a 1 GbE network connection.

PrivPy implementation. We implement the main program of PrivPy with Python. We hand-code GC-based comparison in C++ with the Crypto++ library. We compile the C++ code using gcc -O3, and wrap it into Python code. We use SSL with 1024-bit key to protect all communications. We measure that the round-trip time of sending a 10-byte message with SSL is about 0.1 ms.

Parameter setting. In the following benchmarks, we fix the bit length of the field to 256 for simplicity, and set the scaling factor so that the fixed-point precision is sufficient for most common applications. We repeat each experiment 100 times and report the average values.
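The fixed-point encoding underlying these parameters can be sketched as follows. The paper fixes a 256-bit field; this sketch substitutes a hypothetical 61-bit Mersenne prime and a hypothetical scaling factor of 2^20 to keep the numbers small, with negatives mapped to the upper half of the field:

```python
P = 2**61 - 1  # hypothetical stand-in for the paper's 256-bit modulus
F = 2**20      # hypothetical scaling factor 2^f

def encode(x):
    """Map a real number into the field: scale, round, wrap negatives mod P."""
    return round(x * F) % P

def decode(z):
    """Interpret elements above P/2 as negatives, then undo the scaling."""
    if z > P // 2:
        z -= P
    return z / F

v = -3.141592
print(abs(decode(encode(v)) - v) < 1.0 / F)  # True: round-trip error < 2^-f
```

The round-trip error is bounded by half the scaling step, which is why the precision is determined directly by the choice of f.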

7.2 Microbenchmarks

7.2.1 Basic operations

To demonstrate the performance of basic operations in PrivPy, we evaluate the fundamental POs, as well as the basic secret sharing process.

Fundamental POs. The fundamental POs are addition, multiplication and comparison. We compare two versions of these POs based on the operand types: two private variables, and one private variable with one public variable. Table 1 shows the results, and we have the following observations:

a) Multiplication and comparison are slower than addition, due to communication.

b) Multiplication with a public operand is faster than the two-private version, due to the multiply-by-public optimization.

c) Comparison with a public number is slower than the normal version, because the servers must map the public variables into the field during the computation. We are developing a variable analysis tool to automatically identify POs involving public variables, so that the client can preprocess them.

               add        mult      cmp
both private   1.83e-3    0.3       0.87
one public     9.07e-3    9.0e-3    0.89
Table 1: Time for basic operations (in milliseconds).

Derived POs with numerical methods. Derived POs in Section 5.5 approximate non-linear operations using iterative numerical methods. We evaluate the relative error and execution time with different numbers of iterations.

For division, we use the default initial value and evaluate a fixed division expression. For the logistic function, we use the default starting value. Figure 5 plots the relative error and execution time for different numbers of iterations and different values of the input. We can see the per-iteration time is reasonably short and the algorithm converges fast.

Note that, unlike cleartext algorithms, the servers do not see the outcome of each iteration, so they cannot tell whether the result has converged. Thus, we need to set a conservative iteration limit as a trade-off between result accuracy and computation time. For all our experiments, we use 50 iterations, and it takes about 30 ms to compute a division or logistic function.

Figure 5: Time and accuracy after each iteration in derived POs with iterative numerical methods.

Client-server interaction overhead. We evaluate the client time consumption (including computation and communication) of the secret sharing process and the result recovery process. Figure 6 shows that even with 1000 clients and 1000-dimension vectors, it takes less than 0.6 seconds for the servers to collect/reveal all the vectors from/to all the clients.

Figure 6: Performance of and .

7.2.2 Effectiveness of optimizations

Now we show the improvements from batch operations and the code optimizer.

Batch array operations. We evaluate the effectiveness of batching using two common operations: element-wise multiplication and element-wise comparison on vectors. Figure 7 shows the result: both the batched multiplication and the batched comparison are substantially faster than their unbatched counterparts.

Figure 7: The optimization of doing operations in batch.

Dot and outer product. We evaluate the dot product on two square matrices and evaluate the outer product on two vectors. We vary the number of total elements and measure the time consumption. The optimized outer product needs multiplications, while the optimized dot product of two square matrices needs multiplications. As Fig. 8 shows, when , speed-up is for the outer product.

Figure 8: The optimization of dot and outer product.

Code optimizations using AST. We evaluate the common factor extraction and expression vectorization. As these hand-written anti-patterns are usually small, we vary the expression size from 2 to 10. Figure 9 shows that the optimizations lead to performance improvements for five-term expressions in both cases.

Figure 9: Code optimizer performance.

7.2.3 Comparison with existing approaches

We compare the performance of PrivPy with existing approaches. We only choose systems that are open source and support at least both additions and multiplications. The goal of this comparison is to show the performance gap between different underlying protocols for common arithmetic operations over real numbers.

HElib [35] is an implementation of (leveled) FHE with a parameter controlling the depth of circuits. We choose a small value that only allows shallow circuits. Even so, HElib is less efficient than PrivPy.

Obliv-C [83] is a highly optimized garbled circuit implementation. To support real numbers, we convert each real number to a large integer by multiplying by a scaling factor, following the sample code [38].

P4P + HE. P4P [30] is a practical additive-secret-sharing-based system. Following [5], we add a partial-HE-based multiplication (i.e. 1024-bit Paillier [62]) to P4P.

SPDZ [25]. As described in Section 3, we port SPDZ into the PrivPy framework. We evaluate both native SPDZ (“Raw SPDZ” in the following discussion) and SPDZ with the PrivPy front-end (“SPDZ + PrivPy”). In this paper, SPDZ runs on two servers for two-party computation.

We compare the performance of addition, multiplication and comparison. Table 2 presents the latency of basic scalar operations, and Table 3 shows the throughput. Note that not all the frameworks above support multi-core CPUs like PrivPy does; to evaluate their throughput, we run multiple independent processes of these frameworks and add up the throughput. As different frameworks perform different processing during startup, we also ignore startup time for all of them, including the time for compilation, program loading and pre-computation, though it is non-negligible. Even so, PrivPy still performs much better. Our key observations include:

1) For addition, PrivPy, P4P and SPDZ have similar performance, as they are all secret-sharing-based. HElib and Obliv-C need to evaluate encrypted or garbled circuits and handle carry bits for secure addition, and are thus slower than the secret-sharing-based tools.

2) For multiplication, as Table 2 shows, secret-sharing-based tools such as SPDZ and PrivPy show a clear improvement over Obliv-C for a single multiplication, by using secret-sharing servers instead of HE or garbled circuits. Moreover, as Table 3 shows, the multiplication throughput of PrivPy is far better than the others, thanks to our efficient multiplication PO.

3) SPDZ uses expensive secure bit-level operations for comparison, while PrivPy and Obliv-C use GC, which is more communication-efficient. The main reason PrivPy is slower than Obliv-C for a single comparison is that PrivPy compares 256-bit integers while Obliv-C works only on 64-bit ones. However, thanks to our optimization, the comparison throughput of PrivPy is higher than that of both Obliv-C and SPDZ.

4) SPDZ + PrivPy has similar performance to raw SPDZ, showing the minimal cost of our front-end porting.

Note that although SPDZ provides active security by generating and keeping MACs of private data, which may introduce extra cost, this cost mainly resides in the pre-computation phase; updating MACs in the online phase can be done efficiently with simple operations and involves no extra communication, and thus has little impact on overall performance. As we evaluate the performance of the online phase and ignore pre-computation time, the large performance gap between SPDZ and PrivPy presented above still comes from the computation protocol itself, as analyzed in Section 5: even without the MACs, the online performance of SPDZ would not improve much, and PrivPy would still perform much better.

Another thing to note is that the above evaluation does not compare with Sharemind [11], which is, as far as we know, the state-of-the-art SMC framework with passive security, because it is a closed-source commercial product to which we do not have access. However, according to the report of [28], PrivPy achieves considerably higher fixed-point multiplication throughput than Sharemind, even though their experiments ran on faster servers. Another notable system is SecureML [60], which provides built-in support for real numbers like PrivPy, but does not provide a language front-end. According to the report in [60], the overall multiplication throughput of PrivPy is also better than SecureML's, as SecureML requires precomputation to generate multiplication triplets.

       HElib    Obliv-C   P4P+HE   Raw SPDZ   SPDZ+PrivPy   PrivPy
add    4.0e-2   1.3e-2    1.8e-3   1.85e-3    1.86e-3       1.8e-3
mult   31       1.6       1.8      0.348      0.344         0.3
cmp    -        0.1       -        1.35       1.35          0.87
Table 2: Time consumed by single basic operations (in milliseconds).
       HElib   Obliv-C   P4P+HE   Raw SPDZ   SPDZ+PrivPy   PrivPy
mult   258     3,930     4,344    83,073     83,229        2,583,158
cmp    -       78,431    -        20,472     20,320        150,125
Table 3: Throughput of basic operations (ops/second).

7.3 Performance in real algorithms

batch  LR          LR             LR           k-means     k-means        k-means      MF           MF             MF
size   (Adult)     (Creditcard)   (Movielens)  (Adult)     (Creditcard)   (Movielens)  (Adult)      (Creditcard)   (Movielens)
1      80 (60)     70 (50)        190 (-)      150 (120)   150 (60)       770 (-)      20 (1006)    10.7 (240)     820 (-)
10     11 (13)     9 (7)          79 (-)       23 (70)     16 (19)        318 (-)      12.8 (226)   3.0 (64.5)     460 (-)
100    3.7 (10.7)  2.3 (4.5)      62.3 (-)     12.2 (-)    6.5 (14.8)     268 (-)      12.4 (-)     3.0 (56.1)     429.4 (-)
1000   2.5 (-)     1.39 (-)       74.4 (-)     9.37 (-)    4.69 (-)       271.9 (-)    11.4 (-)     2.7 (-)        439.8 (-)
Table 4: Average time taken per instance for model training (in milliseconds). The values in () are the results of the algorithms run on SPDZ with our front-end. (-) means the SPDZ compilation crashes.
Figure 10: RRMSE between private and cleartext result of the logistic regression parameter after each iteration.

PrivPy supports real learning algorithms on large-scale datasets. Here we evaluate both model training and inference. For training, we use three real datasets and three algorithms; for inference, we evaluate a traditional feedforward neural network and convolutional neural networks. As our front-end supports both the PrivPy engine and the SPDZ engine, we run the same code on both PrivPy and PrivPy + SPDZ. Pre-compiling the code to run on SPDZ takes a long time (e.g. 16 minutes to compile logistic regression on 1000 instances of the Adult dataset) and sometimes crashes; PrivPy does not suffer from this problem. Again, we ignore compilation and pre-computation time in the evaluation. Note that in comparing with SPDZ we are not aiming simply at performance, which we showed in the previous section. Rather, we want to show that an end-to-end implementation with active security (e.g. SPDZ) comes at a high price, and that real-world applications should make the right trade-off and choose a practical solution like PrivPy.

7.3.1 Model training on secret datasets

We use three real-world datasets. We treat the records in these datasets as private and train models on them.

1) Adult [55] contains records of information about individuals. There are dimensions per record.

2) CreditCard [22] consists of credit card transactions with numeric features each.

3) Movielens [37] contains 1 million movie ratings from thousands of users. We encode it to a matrix. As it is too large to fit into memory, we treat it as a disk-backed LargeArray.

Moreover, we evaluate the following three algorithms.

Logistic regression (LR) [36]. We train logistic regression using Stochastic Gradient Descent (SGD), which calculates the gradients of the weights in each iteration.

k-means [64]. k-means is a method for unsupervised clustering, which updates the centroids and the cluster assignments of the instances in each iteration. In all k-means evaluations, we fix the number of clusters.

Matrix factorization (MF) [8]. MF decomposes a large matrix into two smaller latent matrices for efficient prediction, performing several matrix multiplications in each iteration. In this paper, we decompose each matrix into two low-rank latent matrices.

Number of POs
Logistic
regression
element-wise vector additions, 3 element-wise vector multiply, 2 multiply-by-public, 1 dot product of and matrixes, and 1 logistic function for -dimension vector.
-means
additions, 3 element-wise vector multiply, 2 multiply-by-public, 1 dot product of and matrixes, and 1 logistic function for -dimension vector.
Matrix
factorization
additions, 3 element-wise vector multiply, 2 multiply-by-public, 1 dot product of and matrixes, and 1 logistic function for -dimension vector.
Feedforward
NN
1 dot products of and matrixes, 1 dot product of and matrices, 2 ReLU for matrix, element-wise comparisons for matrix.
CNN
1 dot products of and matrixes, 1 dot product of and matrices, 2 ReLU for matrix, element-wise comparisons for matrix.
CNN + BN
1 dot products of and matrixes, 1 dot product of and matrices, 2 ReLU for matrix, element-wise comparisons for matrix.
Table 5: Number of POs for each real-world case; the counts depend on the batch size and the dimension of each instance.

Table 4 summarizes the average time consumed per instance with different batch sizes in an iteration. The key observations are: 1) batch operations bring per-instance performance improvements in all algorithms; 2) SPDZ fails to handle the larger-scale cases, as its pre-compilation module runs out of memory and crashes; and 3) PrivPy uses LargeArray to handle the largest Movielens dataset, and the program runs correctly.

To verify the accuracy, we compute the Relative Root Mean Squared Error (RRMSE) of the resulting model parameters between PrivPy and the cleartext version after each iteration for the Adult and Creditcard datasets. Unlike [60], which suffers from large precision loss due to simplified versions of activation functions, PrivPy supports direct approximations of these functions (Section 5.5) and the precision loss is negligible. Figure 10 shows that the RRMSE remains small even after many iterations. We verify that the computation error is negligible and the prediction accuracy on the two datasets is the same as the cleartext version.

7.3.2 Neural network (NN) inference

We use the MNIST dataset [51] of labeled handwritten digits [19]. We use three example neural networks for handwritten digit recognition to evaluate the inference performance, treating both the model and the data as private. Note that we do not present the results of the neural networks on SPDZ, as SPDZ fails to compile any of these cases.

Feedforward neural network. The network consists of a 784-dimension input layer, two 625-dimension hidden layers and a 10-dimension output layer. Finally, we pass the output vector to an argmin function to get the output. The activation function is ReLU.

Convolutional neural network (CNN). We use the well-known LeNet-5 [52] model to demonstrate CNN. LeNet-5 has a 784-dimension input layer, 3 convolutional layers with 5x5 kernels, 2 sum-pooling layers, 3 sigmoid layers, 1 dot product layer and 1 Radial Basis Function layer. Finally, LeNet-5 performs an argmin function on a 10-dimension vector to get the output. This is a heavy computation, involving a large number of POs including multiplications and comparisons.

CNN + batch normalization (BN). Based on the LeNet-5 model, we add a batch normalization [41] layer to each sigmoid layer; thus we add 3 BN layers to the CNN model. BN mainly introduces some secure multiplications to the computation.

Table 6 shows the average time to infer an image. Batching up still brings significant speedup for all algorithms. Even with complex neural network models such as CNN, it takes only 1.1 seconds to process a single image, and about 0.1 seconds for an image on average when processing images in batch. This is acceptable considering the privacy guarantee. We verify that the classification result is the same as the cleartext version. To the best of our knowledge, this is the first practical implementation of a real convolutional neural network using a noise-free privacy-preserving method.

batch size   Feedforward NN   CNN    CNN + BN
1            1.48             1.57   1.58
10           0.31             0.23   0.32
100          0.06             0.1    0.17
1000         0.04             0.1    0.16
Table 6: Average time taken per instance for inference on the MNIST dataset using different batch sizes (in seconds).

8 Conclusion and Future Work

Over thirty years of SMC literature provide an ocean of protocols and systems, and many work well on certain aspects of performance, security or ease of programming. We believe it is time to integrate these techniques into an application-driven, coherent system for machine learning tasks. PrivPy is a framework with a top-down design. At the top, it provides familiar Python programming interfaces with essential data types like real numbers and arrays, and uses code optimizers/checkers to avoid common mistakes. In the middle, using secret shares as both the storage and communication intermediary, we build a composable PO system that helps decouple the front-end from the back-end. At the low level, we design new protocols that improve real-number and array computation speed. PrivPy shows great potential: it handles large datasets (1M-by-5K) and complex algorithms (CNN) fast, with minimal program porting effort.

PrivPy opens up many future directions. First, we are improving the PrivPy back-end to provide active security while preserving high efficiency. Second, we would like to port existing machine learning libraries to our front-end. Third, although we focus on SMC in this work, we will introduce randomization to protect the final results [66, 53]. Last but not least, we will add fault tolerance mechanisms to the servers.

.1 Proof of Theorem 1

the big prime that determines the field
the additive group of integers modulo the prime
the bound of numbers in the computation
the scaling factor
the secret sharing function splitting integers into shares
the integer in the field corresponding to a real number
the secret sharing result of a real number, equivalent to sharing its field representation
the reverse process that maps the secrets back to real numbers
the helper function that converts integers in the field to the signed representation
private variables
the shares of private variables
Table 7: The notations in this paper.

For convenience, Table 7 summarizes the notations we use throughout the paper. First we consider the case . As , it is impossible that and are both negative. Thus, there are three possibilities: 1) and ; 2) and ; 3) and . In case 1), both and fall into , and (or will be negative). Then we have . In case 2), falls into and falls into . In case 3), falls into and falls into . In either case of 2) and 3), . Meanwhile, in this case, to ensure for , we have . Thus still holds.

Similarly, for the case , there are three possibilities: 1) and ; 2) and ; 3) and . In case 1), both and fall into , and (as ). As for , we have . In either case of 2) and 3), (or will be positive). Thus .

.2 Proof of Theorem 2

Before we start the proof, we introduce the following three lemmas.

Lemma 1.

Given , if and , or and , then .

Proof If and , and . Similarly, if and , and . In either case, . Given the ranges of and , we know .

First consider the case , it must be that (otherwise will be shares of a negative number). This means that is just . Therefore .

Then suppose , to represent a negative number, it must be that (otherwise will be shares of a positive number). In this case , we still have .

Lemma 2.

For a private variable , given , we have 1) if and , then ; 2) if and , then .

Proof As , if and , then and . Thus we have .

On the other hand, if and , then and . Thus we have .

Lemma 3.

For a private variable , given , we have 1) if and , then ; 2) if and , then .

Proof As , if and , and , then . Thus we have .

On the other hand, if and , then and . Thus we have .

Now let us return to the proof of Theorem 2. From Lemmas 1, 2 and 3, we can see that is equivalent to and for , or and for . Thus can be calculated as follows:

First consider the case . Since and are random shares and , there are three possibilities: 1) ; 2) ; and 3) . In the second case, and will fall into as . This happens with probability . In case 3), will never fall into . To see this, notice that