1 Introduction
The advent of big data and striking recent progress in artificial intelligence are fueling the impending industrial automation revolution. In particular, Deep Learning (DL) —a method based on learning Deep Neural Networks (DNNs) —is demonstrating a breakthrough in accuracy. DL models outperform human cognition in a number of critical tasks such as speech and visual recognition, natural language processing, and medical data analysis. Given DL’s superior performance, several technology companies are now developing or already providing DL as a service. They train their DL models on a large amount of (often) proprietary data on their own servers; then, an inference API is provided to the users who can send their data to the server and receive the analysis results on their queries. The notable shortcoming of this remote inference service is that the inputs are revealed to the cloud server, breaching the privacy of sensitive user data.
Consider a DL model used in a medical task in which a health service provider holds the prediction model. Patients submit their plaintext medical information to the server, which then uses the sensitive data to provide a medical diagnosis based on inference obtained from its proprietary model. A naive solution to ensure patient privacy is to allow the patients to receive the DL model and run it on their own trusted platform. However, this solution is not practical in real-world scenarios because: (i) The DL model is considered an essential component of the service provider’s intellectual property (IP). Companies invest a significant amount of resources and funding to gather the massive datasets and train the DL models; hence, it is important to service providers not to reveal the DL model to ensure their profitability and competitive advantage. (ii) The DL model is known to reveal information about the underlying data used for training [59]. In the case of medical data, this reveals sensitive information about other patients, violating HIPAA and similar patient health privacy regulations.
Oblivious inference is the task of running the DL model on the client’s input without disclosing the input or the result to the server itself. Several solutions for oblivious inference have been proposed that utilize one or more cryptographic tools such as Homomorphic Encryption (HE) [14, 13], Garbled Circuits (GC) [61], the Goldreich-Micali-Wigderson (GMW) protocol [25], and Secret Sharing (SS). Each of these cryptographic tools offers its own characteristics and trade-offs. For example, one major drawback of HE is its computational complexity. HE has two main variants: Fully Homomorphic Encryption (FHE) [14] and Partially Homomorphic Encryption (PHE) [13, 43]. FHE allows computation on encrypted data but is computationally very expensive. PHE has less overhead but only supports a subset of functions or depth-bounded arithmetic circuits, and its computational complexity drastically increases with the circuit’s depth. Moreover, non-linear functionalities such as the ReLU activation function in DL cannot be supported.
GC, on the other hand, can support an arbitrary functionality while requiring only a constant round of interactions regardless of the depth of the computation. However, it has a high communication cost and a significant overhead for multiplication. More precisely, performing multiplication in GC has quadratic computation and communication complexity with respect to the bit-length of the input operands. It is well-known that the complexity of contemporary DL methodologies is dominated by matrix-vector multiplications. GMW needs less communication than GC but requires many rounds of interactions between the two parties. A standalone SS-based scheme provides a computationally inexpensive multiplication yet requires three or more independent (non-colluding) computing servers, which is a strong assumption. Mixed-protocol solutions have been proposed with the aim of utilizing the best characteristics of each of these protocols [49, 40, 38, 32]. They require secure conversion of secrets from one protocol to another in the middle of execution. Nevertheless, it has been shown that the cost of secret conversion is paid off in these hybrid solutions. Roughly speaking, the number of interactions between server and client (i.e., round complexity) in existing hybrid solutions is linear with respect to the depth of the DL model. Since depth is a major contributor to deep learning accuracy [58], scalability of the mixed-protocol solutions with respect to the number of layers remains an unsolved issue for more complex, many-layer networks.
This paper introduces Xonn, a novel end-to-end framework which provides a paradigm shift in the conceptual and practical realization of privacy-preserving inference on deep neural networks. The existing work has largely focused on the development of customized security protocols while using conventional fixed-point deep learning algorithms. Xonn, for the first time, suggests leveraging the concept of Binary Neural Networks (BNNs) in conjunction with the GC protocol. In BNNs, the weights and activations are restricted to binary (i.e., ±1) values, substituting the costly multiplications with simple XNOR operations during the inference phase. The XNOR operation is known to be free in the GC protocol [33]; therefore, performing oblivious inference on BNNs using GC results in the removal of costly multiplications. Using our approach, we show that oblivious inference on the standard DL benchmarks can be performed with minimal, if any, decrease in the prediction accuracy.
We emphasize that an effective solution for oblivious inference should take into account the deep learning algorithms and optimization methods that can tailor the DL model for the security protocol. Current DL models are designed to run on CPU/GPU platforms where many multiplications can be performed with high throughput, whereas bit-level operations are very inefficient. In the GC protocol, however, bit-level operations are inexpensive, but multiplications are rather costly. As such, we propose to train deep neural networks that involve many bit-level operations but no multiplications in the inference phase; using the idea of learning binary networks, we significantly reduce the number of gates required for the GC protocol.
We perform extensive evaluations on different datasets. We achieve lower inference latency than both the Gazelle [32] (the prior best solution) and MiniONN [38] frameworks. Xonn also outperforms DeepSecure [52] (the prior best GC-based framework) and CryptoNets [19], an HE-based framework. Moreover, our solution requires a constant round of interactions between the client and the server, which has a significant effect on the performance of oblivious inference in Internet settings. We highlight our contributions as follows:


Introduction of Xonn, the first framework for privacy-preserving DNN inference with a constant round complexity that does not need expensive matrix multiplications. Our solution is the first that can be scalably adapted to ensure security against malicious adversaries.

Proposing a novel conditional addition protocol based on Oblivious Transfer (OT) [44], which optimizes the costly computations for the network’s input layer. Our protocol is faster than GC and can be of independent interest. We also devise a novel network trimming algorithm to remove neurons from DNNs that minimally contribute to the inference accuracy, further reducing the GC complexity.

Designing a high-level API to readily automate fast adaptation of Xonn, such that users only input a high-level description of the neural network. We further facilitate the usage of our framework by designing a compiler that translates the network description from Keras to Xonn.

Proof-of-concept implementation of Xonn and evaluation on various standard deep learning benchmarks. To demonstrate the scalability of Xonn, we perform oblivious inference on deeper neural networks than previously considered in the oblivious inference literature.
2 Preliminaries
Throughout this paper, scalars are represented as lowercase letters, vectors as bold lowercase letters, matrices as capital letters, and tensors of more than two ways as bold capital letters. Brackets denote element selection and the colon symbol stands for all elements, e.g., selecting all values in a given row of a matrix.
2.1 Deep Neural Networks
The computational flow of a deep neural network is composed of multiple computational layers. The input to each layer is either a vector or a tensor. The output of each layer serves as the input of the next layer. The input of the first layer is the raw data and the output of the last layer represents the network’s prediction on the given data (i.e., the inference result). In an image classification task, for instance, the raw image serves as the input to the first layer and the output of the last layer is a vector whose elements represent the probability that the image belongs to each category. Below we describe the functionality of neural network layers.
Linear Layers: Linear operations in neural networks are performed in Fully-Connected (FC) and Convolution (CONV) layers. The vector dot product (VDP) between two vectors x and w is defined as follows:

VDP(x, w) = Σ_{i=1}^{n} x[i] · w[i]    (1)
Both CONV and FC layers repeat the VDP computation to generate outputs as we describe next. A fully connected layer takes a vector x and generates the output y using a linear transformation:

y = W · x + b    (2)

where W is the weight matrix and b is a bias vector. More precisely, the i-th output element is computed as y[i] = VDP(W[i, :], x) + b[i]. A convolution layer is another form of linear transformation that operates on images. The input of a CONV layer is represented as multiple rectangular channels (2D images) of the same size. The CONV layer maps the input image into an output image using a weight (kernel) tensor and a bias vector. Each output channel in a CONV layer is computed by sliding the kernel over the input, computing the dot product between the kernel and the windowed input, and adding the bias term to the result.
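The FC computation above can be sketched as repeated VDP operations. The following minimal NumPy snippet illustrates Eqs. (1) and (2); the function names and example values are illustrative, not part of the framework:

```python
import numpy as np

def vdp(x, w):
    # Vector dot product (Eq. 1): sum of element-wise products.
    return sum(xi * wi for xi, wi in zip(x, w))

def fully_connected(x, W, b):
    # Eq. 2: each output element is the VDP of one weight row
    # with the input, plus the corresponding bias term.
    return np.array([vdp(W[i], x) + b[i] for i in range(W.shape[0])])

x = np.array([1.0, 2.0, 3.0])
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])
b = np.array([0.0, 1.0])
y = fully_connected(x, W, b)   # equals W @ x + b
```

A CONV layer reuses the same VDP primitive, applied to each windowed position of the input.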
Non-linear Activations: The output of linear transformations (i.e., CONV and FC) is usually fed to an activation layer, which applies an element-wise non-linear transformation to the vector/tensor and generates an output with the same dimensionality. In this paper, we particularly utilize the Binary Activation (BA) function for hidden layers. BA maps the input operand to its sign value (i.e., +1 or −1).
Batch Normalization: A Batch Normalization (BN) layer is typically applied to the output of linear layers to normalize the results. If a BN layer is applied to the output of a CONV layer, it multiplies all of each channel’s elements by a scalar and adds a bias term to the resulting channel. If BN is applied to the output of an FC layer, it multiplies each element of the vector by a scalar and adds a bias term to the result.
Pooling: Pooling layers operate on the image channels outputted by the CONV layers. A pooling layer slides a window on the image channels and aggregates the elements within the window into a single output element. Max-pooling and Average-pooling are two of the most common pooling operations in neural networks. Typically, pooling layers reduce the image size but do not affect the number of channels.
2.2 Secret Sharing
A secret can be securely shared among two or multiple parties using Secret Sharing (SS) schemes. An SS scheme guarantees that each share does not reveal any information about the secret. The secret can be reconstructed using all (or a subset) of the shares. In Xonn, we use additive secret sharing in which a secret s is shared among two parties by sampling a random number r (an integer modulo 2^b, where b is the number of bits needed to describe the secret) as the first share and creating the second share as s − r mod 2^b. While neither share reveals any information about the secret s, together they can be used to reconstruct the secret as s = r + (s − r) mod 2^b. Suppose that two secrets are shared among two parties such that each party holds one share of each secret. Each party can locally add its shares of the two secrets to create a share of the sum, without communicating with the other party. This generalizes to an arbitrary (more than two) number of secrets as well. We utilize additive secret sharing in our Oblivious Conditional Addition (OCA) protocol (Section 3.3).
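The additive-sharing scheme above can be sketched in a few lines. This is a plaintext illustration only; the bit-width and function names are illustrative assumptions:

```python
import random

B = 16                 # bit-width of the ring (illustrative choice)
MOD = 1 << B

def share(secret):
    # Additive sharing: first share is random, second is (secret - r) mod 2^b.
    r = random.randrange(MOD)
    return r, (secret - r) % MOD

def reconstruct(s0, s1):
    return (s0 + s1) % MOD

# Each party adds its own shares of two secrets locally to obtain
# a share of the sum -- no communication is needed.
x, y = 1234, 4321
x0, x1 = share(x)
y0, y1 = share(y)
sum0 = (x0 + y0) % MOD     # held by party 1
sum1 = (x1 + y1) % MOD     # held by party 2
assert reconstruct(sum0, sum1) == (x + y) % MOD
```

The local-addition property is exactly what the OCA protocol of Section 3.3 exploits.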
2.3 Oblivious Transfer
One of the most crucial building blocks of secure computation protocols, e.g., GC, is the Oblivious Transfer (OT) protocol [44]. In OT, two parties are involved: a sender and a receiver. The sender holds n different messages, each with a specific bit-length, and the receiver holds the index i of the message that she wants to receive. At the end of the protocol, the receiver gets the i-th message with no additional knowledge about the other messages, and the sender learns nothing about the selection index. In GC, 1-out-of-2 OT is used (n = 2), in which case the selection index is only one bit. The initial realizations of OT required costly public-key encryptions for each run of the protocol. However, the OT Extension [30, 8, 7] technique enables performing many OTs using more efficient symmetric-key encryption in conjunction with a fixed number of base OTs that need public-key encryption. OT is used both in the OCA protocol as well as the Garbled Circuits protocol, which we discuss next.
2.4 Garbled Circuits
Yao’s Garbled Circuits [61], or GC in short, is one of the generic two-party secure computation protocols. In GC, the result of an arbitrary function on inputs from two parties can be computed without revealing each party’s input to the other. Before executing the protocol, the function has to be described as a Boolean circuit with two-input gates.
GC has three main phases: garbling, transferring data, and evaluation. In the first phase, only one party, the Garbler, is involved. The Garbler starts by assigning two randomly generated binary strings to each wire in the circuit. These binary strings are called labels and they represent the semantic values 0 and 1. For each gate in the circuit, the Garbler creates a four-row garbled table as follows. Each label of the output wire is encrypted using the input labels according to the truth table of the gate. For example, consider an AND gate with two input wires and one output wire. The last row of the garbled table is the encryption of the output wire’s 1-label using the 1-labels of the two input wires.
Once the garbling process is finished, the Garbler sends all of the garbled tables to the Evaluator. Moreover, he sends the labels that correspond to the input wires representing his inputs to the circuit. For example, if a given wire carries the Garbler’s first input bit and his input is 0, he sends that wire’s 0-label. The Evaluator acquires the labels corresponding to her input through 1-out-of-2 OT, where the Garbler is the sender with the two labels as his messages and the Evaluator’s selection bit is her input for that wire. Having all of the garbled tables and the labels of the input wires, the Evaluator can start decrypting the garbled tables one by one until reaching the final output bits. She then learns the plaintext result at the end of the GC protocol based on the output labels and their relationships to the semantic values, which are received from the Garbler.
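The garble-then-evaluate flow can be made concrete with a toy sketch of a single AND gate. This is only an illustration of the idea, not a secure implementation: real GC encrypts the table rows with proper symmetric encryption and applies optimizations such as point-and-permute, whereas here a hash-keyed lookup table stands in for the encrypted rows, and all names are hypothetical:

```python
import os, hashlib

def H(*keys):
    # Hash of concatenated wire labels, standing in for row encryption.
    h = hashlib.sha256()
    for k in keys:
        h.update(k)
    return h.digest()

def garble_and_gate():
    # One random 128-bit label per wire per semantic value (wires a, b, c).
    labels = {w: {0: os.urandom(16), 1: os.urandom(16)} for w in "abc"}
    # Four rows: the pair of input labels "unlocks" the matching output label.
    table = {}
    for va in (0, 1):
        for vb in (0, 1):
            table[H(labels["a"][va], labels["b"][vb])] = labels["c"][va & vb]
    return labels, table

def evaluate(table, la, lb):
    # The Evaluator holds only one label per input wire and opens one row.
    return table[H(la, lb)]

labels, table = garble_and_gate()
out = evaluate(table, labels["a"][1], labels["b"][1])
assert out == labels["c"][1]   # AND(1, 1) = 1, learned only as a label
```

The Evaluator learns an output label, not a plaintext bit; the mapping of output labels to semantic values is revealed by the Garbler at the end.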
3 The Xonn Framework
In this section, we explain how neural networks can be trained such that they incur a minimal cost during oblivious inference. The most computationally intensive operation in a neural network is matrix multiplication. In GC, each multiplication has a quadratic computation and communication cost with respect to the input bit-length. This is the major source of inefficiency in prior work [52]. We overcome this limitation by changing the learning process such that the trained neural network’s weights become binary. As a result, costly multiplication operations are replaced with XNOR gates, which are essentially free in GC. We describe the training process in Section 3.1. In Section 3.2, we explain the operations and their corresponding Boolean circuit designs that enable a very fast oblivious inference. In Section 4, we elaborate on the Xonn implementation.
3.1 Customized Network Binarization
Numerical optimization algorithms minimize a specific cost function associated with neural networks. It is well-known that neural network training is a non-convex optimization, meaning that there exist many locally-optimal parameter configurations that result in similar inference accuracies. Among these parameter settings, there exist solutions where both neural network parameters and activation units are restricted to take binary values (i.e., either +1 or −1); these solutions are known as Binary Neural Networks (BNNs) [18].
One major shortcoming of BNNs is their (often) low inference accuracy. In the machine learning community, several methods have been proposed to modify BNN functionality for accuracy enhancement [46, 24, 34]. These methods are devised for plaintext execution of BNNs and are not efficient for oblivious inference with GC. We emphasize that, when modifying BNNs for accuracy enhancement, one should also take into account the implications on the corresponding GC circuit. With this in mind, we propose to modify the number of channels and neurons in CONV and FC layers, respectively. Increasing the number of channels/neurons leads to a higher accuracy but it also increases the complexity of the corresponding GC circuit. As a result, Xonn provides a trade-off between the accuracy and the communication/runtime of the oblivious inference. This trade-off enables cloud servers to customize the complexity of the GC protocol to optimally match the computation and communication requirements of the clients. To customize the BNN, Xonn configures the per-layer number of neurons in two steps:

Linear Scaling: Prior to training, we scale the number of channels/neurons in all BNN layers by the same scaling factor. Then, we train the scaled BNN architecture.

Network Trimming: Once the (uniformly) scaled network is trained, a post-processing algorithm removes redundant channels/neurons from each hidden layer to reduce the GC cost while maintaining the inference accuracy.
Figure 1 illustrates the BNN customization method for an example baseline network with four hidden layers. Network trimming (pruning) consists of two steps, namely, Feature Ranking and Iterative Pruning which we describe next.
Feature Ranking: In order to perform network trimming, one needs to sort the channels/neurons of each layer based on their contribution to the inference accuracy. In conventional neural networks, simple ranking methods sort features based on the absolute value of the neurons/channels [26]. In BNNs, however, the weights/features are either +1 or −1 and the absolute value is not informative. To overcome this issue, we utilize the first-order Taylor approximation of neural networks and sort the features based on the magnitude of the gradient values [41]. Intuitively, the gradient with respect to a certain feature determines its importance; a high (absolute) gradient indicates that removing the neuron has a destructive effect on the inference accuracy. Inspired by this notion, we develop a feature ranking method described in Algorithm 1.
Iterative Pruning: We devise a step-by-step algorithm for model pruning which is summarized in Algorithm 2. At each step, the algorithm selects one of the BNN layers and removes the features with the lowest importance (line 17). The selected layer and the number of pruned neurons maximize the following reward (line 15):
reward = (c_pre − c_post) / exp(a_pre − a_post)    (3)
where c_pre and c_post are the GC complexity of the BNN before and after pruning, whereas a_pre and a_post denote the corresponding validation accuracies. The numerator of this reward encourages a higher reduction in the GC cost while the denominator penalizes accuracy loss. Once the layer is pruned, the BNN is fine-tuned to recover the accuracy (line 18). The pruning process stops once the accuracy drops below a predefined threshold.
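The selection step of the pruning loop can be sketched as follows. All numbers and layer names below are hypothetical, and the reward is written under the assumption that it takes the form (cost reduction) / exp(accuracy loss), consistent with the description above:

```python
import math

# Hypothetical per-candidate statistics: (layer, pruned_count, gc_cost, accuracy).
# gc_cost is in non-XOR gates and accuracy is in [0, 1]; values are illustrative.
candidates = [
    ("conv2", 8,  9.0e6, 0.970),
    ("conv3", 16, 7.5e6, 0.955),
    ("fc1",   32, 8.2e6, 0.968),
]
c_pre, a_pre = 1.0e7, 0.972   # cost and accuracy before this pruning step

def reward(c_post, a_post):
    # Numerator rewards GC-cost reduction; the exponential denominator
    # penalizes accuracy loss.
    return (c_pre - c_post) / math.exp(a_pre - a_post)

# Pick the (layer, pruned_count) pair maximizing the reward, then fine-tune.
best = max(candidates, key=lambda c: reward(c[2], c[3]))
```

In this toy example the large cost reduction of "conv3" outweighs its slightly larger accuracy drop, so it is selected for pruning.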
3.2 Oblivious Inference
BNNs are trained such that the weights and activations are binarized, i.e., they can only have two possible values: +1 or −1. This property allows BNN layers to be rendered using simplified arithmetic. In this section, we describe the functionality of different layer types in BNNs and their Boolean circuit translations. Below, we explain each layer type.
Binary Linear Layer: Most of the computational complexity of neural networks is due to the linear operations in CONV and FC layers. As we discuss in Section 2.1, linear operations are realized using vector dot product (VDP). In BNNs, VDP operations can be implemented using simplified circuits. We categorize the VDP operations of this work into two classes: (i) Integer-VDP where only one of the vectors is binarized and the other has integer elements and (ii) Binary-VDP where both vectors have binary (±1) values.
Integer-VDP: For the first layer of the neural network, the server has no control over the input data, which is not necessarily binarized. The server can only train binary weights and use them for oblivious inference. Consider an input vector with integer (possibly fixed-point) elements and a weight vector with binary values. Since the elements of the binary vector can only take +1 or −1, the Integer-VDP can be rendered using additions and subtractions. In particular, the binary weights can be used in a selection circuit that decides whether the pertinent integer input should be added to or subtracted from the VDP result.
Binary-VDP: Consider a dot product between two binary (±1) vectors. If we encode each ±1 element with one bit (i.e., −1 → 0 and +1 → 1), we obtain two binary bit-vectors. It has been shown that the dot product of the two vectors can be efficiently computed using an XNOR-PopCount operation [18]. Figure 2 depicts this equivalence for an example VDP. First, element-wise XNOR operations are performed between the two binary encodings. Next, the number of set bits p is counted, and the output is computed as 2p − n, where n is the vector length.
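The XNOR-PopCount identity can be checked directly. The vectors below are illustrative; the snippet verifies that 2p − n matches the plain dot product of the ±1 vectors:

```python
n = 8
v1 = [1, -1, -1, 1, 1, 1, -1, 1]
v2 = [1, 1, -1, -1, 1, -1, -1, 1]

# Encode -1 -> 0 and +1 -> 1.
b1 = [(x + 1) // 2 for x in v1]
b2 = [(x + 1) // 2 for x in v2]

# Element-wise XNOR: 1 exactly where the two signs agree.
xnor = [1 - (a ^ b) for a, b in zip(b1, b2)]
p = sum(xnor)                      # popcount of set bits

dot_via_xnor = 2 * p - n
dot_direct = sum(a * b for a, b in zip(v1, v2))
assert dot_via_xnor == dot_direct
```

Intuitively, each agreement contributes +1 and each disagreement −1 to the dot product, so the result is p − (n − p) = 2p − n.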
Binary Activation Function: A Binary Activation (BA) function takes an input and maps it to its sign value (i.e., +1 or −1 based on the sign of the input). This functionality can simply be implemented by extracting the most significant bit of the input.
Binary Batch Normalization: In BNNs, it is often useful to normalize a feature using a Batch Normalization (BN) layer before applying the binary activation function. More specifically, a BN layer followed by a BA computes the sign of γ · x + β, which is equal to the sign of x + β/γ since γ is a positive value. The combination of the two layers (BN+BA) is therefore realized by a single comparison between x and −β/γ.
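The BN+BA equivalence can be verified numerically. The snippet assumes the inference-time BN layer reduces to y = γ · x + β with γ > 0; the parameter values are illustrative:

```python
gamma, beta = 0.5, 0.2          # illustrative BN parameters, gamma > 0
threshold = -beta / gamma       # the single comparison point

for x in [-3.0, -0.4, -0.39, 0.0, 2.0]:
    bn_ba = 1 if gamma * x + beta >= 0 else -1   # BN followed by BA
    cmp   = 1 if x >= threshold else -1          # one comparison instead
    assert bn_ba == cmp
```

Because γ is positive, dividing by it never flips the sign, so the two computations agree for every input.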
Binary Max-Pooling: Assuming the inputs to the max-pooling layers are binarized, taking the maximum in a window is equivalent to performing logical OR over the binary encodings, as depicted in Figure 3. Note that average-pooling layers are usually not used in BNNs since the average of multiple binary elements is no longer a binary value.
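The max-pooling-as-OR equivalence can be checked in a few lines; the window values and the −1 → 0 / +1 → 1 encoding below are illustrative:

```python
window = [-1, -1, 1, -1]                 # binarized inputs in a pooling window
bits = [(x + 1) // 2 for x in window]    # encode -1 -> 0, +1 -> 1

max_direct = max(window)
or_bit = 0
for b in bits:                           # logical OR over the encodings
    or_bit |= b
max_via_or = 2 * or_bit - 1              # decode back to +1 / -1
assert max_via_or == max_direct
```

The maximum is +1 exactly when at least one input is +1, which is exactly the OR of the bit encodings.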
Figure 4 demonstrates the Boolean circuit for Binary-VDP followed by BN and BA. The number of non-XOR gates for Binary-VDP is equal to the number of gates required to render the tree-adder structure in Figure 4. Similarly, Figure 5 shows the Integer-VDP counterpart. In the first level of the tree-adder of Integer-VDP (Figure 5), the binary weights determine whether the integer input should be added to or subtracted from the final result within the “Select” circuit. The next levels of the tree-adder compute the result of the Integer-VDP using “Adder” blocks. The combination of BN and BA is implemented using a single comparator. Compared to Binary-VDP, Integer-VDP has a high garbling cost, which is linear with respect to the number of bits. To mitigate this problem, we propose an alternative solution based on Oblivious Transfer (OT) in Section 3.3.
3.3 Oblivious Conditional Addition Protocol
In Xonn, all of the activation values as well as the neural network weights are binary. However, the input to the neural network is provided by the user and is not necessarily binary. The first layer of a typical neural network comprises either an FC or a CONV layer, both of which are evaluated using oblivious Integer-VDP. On the one side, the user provides her input as non-binary (integer) values. On the other side, the network parameters are binary values representing +1 and −1. We now demonstrate how Integer-VDP can be described as an OT problem. Let us denote the user’s input as a vector x of n integers, each described with b bits. The server holds a vector of n binary values denoted by w. The result of Integer-VDP is a number y that can be described with b' > b bits. Figure 6 summarizes the steps in the OCA protocol. The first step is to bit-extend the elements of x from b bits to b' bits. In other words, if x is a vector of signed integer/fixed-point numbers, the most significant bit should be repeated (b' − b)-many times; otherwise, it has to be zero-padded in the (b' − b) most significant bits. We denote the bit-extended vector by x'. The second step is to create the two’s complement vector of x', called −x'. The client also creates a vector of n (b'-bit) randomly generated numbers, denoted as r. She computes the element-wise vector subtractions x' − r and −x' − r. These two vectors form n pairs of messages that are used as inputs to n 1-out-of-2 OTs. More precisely, x' − r is the list of first messages and −x' − r is the list of second messages. The server’s list of selection bits is derived from its binary weight vector w. After the n OTs are finished, the server holds a list of transferred numbers t, where each element is either x'[i] − r[i] or −x'[i] − r[i] depending on the corresponding weight bit. Finally, the client computes the sum of her random numbers and the server computes the sum of the transferred numbers. By OT’s definition, the receiver (server) gets only one of the two messages from the sender. That is, based on each selection bit (a binary weight), the receiver gets an additive share of either the sender’s number or its two’s complement. Upon adding all of the received numbers, the receiver computes an additive share of the Integer-VDP result. Now, even though the sender does not know which messages were selected by the receiver, she can add all of the randomly generated numbers r[i], which together form the other additive share of the Integer-VDP result. Since all numbers are described in the two’s complement format, subtractions are equivalent to the addition of the two’s complement values, which are created by the sender at the beginning of OCA. Moreover, as the values are accumulated, the bit-length of the final Integer-VDP result grows accordingly. This is supported due to the bit-extension process at the beginning of the protocol. In other words, all additions are performed in a larger ring such that the result does not overflow.
Note that all numbers belong to a ring of integers modulo a power of two and, by definition, a ring is closed under addition; therefore, the client and the server hold true additive shares of the Integer-VDP result. We have described the OCA protocol for one Integer-VDP computation. As we outlined in Section 3.2, all linear operations in the first layer of the DL model (either FC or CONV) can be formulated as a series of Integer-VDPs.
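The correctness of OCA can be checked with a plaintext simulation. The snippet below models the 1-out-of-2 OT as an idealized functionality (no cryptography), and the bit-widths, vector values, and weight-to-selection-bit encoding are illustrative assumptions:

```python
import random

B_IN, B_OUT = 8, 16            # input bit-width and extended accumulator width
MOD = 1 << B_OUT

def ot(m0, m1, s):
    # Idealized 1-out-of-2 OT functionality: receiver learns only the
    # message selected by bit s; the sender learns nothing about s.
    return m1 if s else m0

x = [12, 250, 7, 99]           # client's integers (fit in B_IN bits)
w = [+1, -1, -1, +1]           # server's binary weights
sel = [0 if wi == +1 else 1 for wi in w]   # assumed selection-bit encoding

# Client: random masks, then the two message lists x' - r and -x' - r.
r = [random.randrange(MOD) for _ in x]
m0 = [(xi - ri) % MOD for xi, ri in zip(x, r)]      # share of +x_i
m1 = [(-xi - ri) % MOD for xi, ri in zip(x, r)]     # share of -x_i (two's complement)

# Server: one OT per element, selected by its weight bit, then local sum.
server_share = sum(ot(a, b, s) for a, b, s in zip(m0, m1, sel)) % MOD
client_share = sum(r) % MOD

vdp = sum(xi * wi for xi, wi in zip(x, w))
assert (server_share + client_share) % MOD == vdp % MOD
```

The two local sums are additive shares of the Integer-VDP result in the larger ring, matching the argument above.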
In traditional OT, a public-key encryption is needed for each OT invocation, which can be computationally expensive. Thanks to the Oblivious Transfer Extension technique [30, 8, 7], one can perform many OTs using symmetric-key encryption and only a fixed number of public-key operations.
Required Modification to the Next Layer. So far, we have shown how to perform Integer-VDP using OT. However, we need to add an “addition” layer to reconstruct the true value of the Integer-VDP result from its additive shares before further processing it. The overhead of this layer, as well as the OT computations, are discussed next. Note that OCA is used only for the first layer and it does not change the overall constant round complexity of Xonn since it is performed only once regardless of the number of layers in the DL model.
Comparison to Integer-VDP in GC. Table 1 shows the computation and communication costs for two approaches: (i) computing the first layer in GC and (ii) utilizing OCA. OCA removes the GC cost of the first layer in Xonn. However, it adds the overhead of a set of OTs and the GC costs associated with the new ADD layer.

                    GC        OCA
                              OT          ADD Layer
Comp. (AES ops)     {2, 4}    {1, 2}      {2, 4}
Comm. (bit)
3.4 Security of Xonn
We consider the Honest-but-Curious (HbC) adversary model, consistent with all of the state-of-the-art solutions for oblivious inference [52, 40, 38, 49, 15, 32]. In this model, neither of the involved parties is trusted but both are assumed to follow the protocol. Neither the server nor the client can infer any information about the other party’s input from the entire protocol transcript. Xonn relies solely on the GC and OT protocols, both of which are proven to be secure in the HbC adversary model in [36] and [44], respectively. Utilizing binary neural networks does not affect the GC and OT protocols in any way. More precisely, we have changed the function that is evaluated in GC such that it is more efficient for the GC protocol: drastically reducing the number of AND gates and using XOR gates instead. Our novel Oblivious Conditional Addition (OCA) protocol (Section 3.3) is also based on the OT protocol. The sender creates a list of message pairs and inputs them to the OT protocol. Each message is an additive share of the sender’s private data, from which the secret data cannot be reconstructed. The receiver inputs a list of selection bits to the OT. By OT’s definition, the receiver learns nothing about the unselected messages and the sender does not learn the selection bits.
During the past few years, several attacks have been proposed that extract some information about the DL model by querying the server many times [59, 23, 55]. It has been shown that some of these attacks can be effective in the black-box setting where the client only receives the prediction results and does not have access to the model. Therefore, considering the definition of oblivious inference, these types of attacks are out of the scope of oblivious inference frameworks. However, in Appendix B, we show how these attacks can be thwarted by adding a simple layer at the end of the neural network, which adds a negligible overhead.
Security Against Malicious Adversaries. The HbC adversary model is the standard security model in the literature. However, there are more powerful security models such as security against covert and malicious adversaries. In the malicious security model, the adversary (either the client or the server) can deviate from the protocol at any time with the goal of learning more about the input of the other party. One of the main distinctions between Xonn and the state-of-the-art solutions is that Xonn can be automatically adapted to the malicious security model using cut-and-choose techniques [37, 29, 35]. These methods take a GC protocol secure in the HbC model and readily extend it to the malicious security model. This modification increases the overhead but enables a higher security level. To the best of our knowledge, there is no practical solution to extend the customized mixed-protocol frameworks [38, 49, 15, 32] to the malicious security model. Our GC-based solution is more efficient than the mixed-protocol solutions and can be upgraded to malicious security at the same time.
4 The Xonn Implementation
In this section, we elaborate on the garbling/evaluation implementation of Xonn. All of the optimizations and techniques proposed in this section do not change the security or correctness in any way and only enable the framework’s scalability for large network architectures.
We design a new GC framework with the following design principles in mind: (i) Efficiency: Xonn is designed to have minimal data movement and a low cache-miss rate. (ii) Scalability: oblivious inference inevitably requires significantly higher memory usage compared to plaintext evaluation of neural networks. High memory usage is one critical shortcoming of state-of-the-art secure computation frameworks. As we show in our experimental results, Xonn is designed to scale to very deep neural networks that have higher accuracy compared to the networks considered in prior art. (iii) Modularity: our framework enables users to create the Boolean description of different layers separately. This allows the hardware synthesis tool to generate more optimized circuits, as we discuss in Section 4.1. (iv) Ease-of-use: Xonn provides a very simple API that requires only a few lines of neural network description. Moreover, we have created a compiler that takes a Keras description and automatically creates the network description for the Xonn API.
Xonn is written in C++ and supports all major GC optimizations proposed to date. Since the introduction of GC, many optimizations have been proposed to reduce the computation and communication complexity of the protocol. Bellare et al. [9] have provided a way to perform garbling using efficient fixed-key AES encryption. Our implementation benefits from this optimization by using Intel AES-NI instructions. The row-reduction technique [42] reduces the number of rows in each garbled table from four to three. The half-gates technique [62] further reduces the number of rows from three to two. One of the most influential optimizations for the GC protocol is the free-XOR technique [33], which makes XOR, XNOR, and NOT gates almost free of cost. Our implementation of Oblivious Transfer (OT) is based on libOTe [50].
4.1 Modular Circuit Synthesis and Garbling
In Xonn, each layer is described as multiple invocations of a base circuit. For instance, linear layers (CONV and FC) are described by a VDP circuit. MaxPool is described by an OR circuit whose number of inputs equals the window size of the MaxPool layer. BA/BN layers are described using a comparison (CMP) circuit. This approach significantly reduces the memory footprint: we only create and store the base circuits, and the connections between invocations of different base circuits are handled at the software level.
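As a plaintext sketch of what these base circuits compute (the function names below are ours, not Xonn's), binary values {-1,+1} are encoded as bits {0,1}, so bit b encodes the value 2b-1:

```python
# Plaintext sketches of the three base circuits: VDP for linear layers,
# CMP for binary activation, OR for MaxPool. Function names are illustrative.

def vdp(x_bits, w_bits):
    # XNOR replaces multiplication of {-1,+1} values encoded as {0,1};
    # summing the XNOR outputs is a popcount.
    return sum(1 - (x ^ w) for x, w in zip(x_bits, w_bits))

def cmp(popcount, threshold):
    # Binary activation: output bit 1 (i.e., +1) iff popcount >= threshold.
    return 1 if popcount >= threshold else 0

def maxpool_or(window_bits):
    # Max over {-1,+1} activations encoded as bits is a single OR tree.
    out = 0
    for b in window_bits:
        out |= b
    return out
```

Note that 2*vdp(x, w) - n recovers the signed dot product of the underlying {-1,+1} vectors, which is why a popcount plus one comparison suffices for a binary linear layer.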
We create the Boolean circuits using the TinyGarble [57] hardware synthesis approach. TinyGarble's technology libraries are optimized for GC and produce circuits with a low number of non-XOR gates. Note that the Boolean circuit description of a contemporary neural network comprises millions to billions of Boolean gates, and synthesis tools cannot support circuits of this size. However, due to Xonn's modular design, each base circuit can be synthesized separately. Thus, the bottleneck shifts from the synthesis tool's maximum number of gates to the system's memory. As such, Xonn effectively scales to any neural network complexity regardless of the limitations of the synthesis tool, as long as enough memory (i.e., RAM) is available. Later in this section, we discuss how to increase the scalability further by dynamically managing the allocated memory.
Pipelined GC Engine. In Xonn, computation and communication are pipelined. For instance, consider a CONV layer followed by an activation layer. We garble/evaluate these layers by multiple invocations of the VDP and CMP circuits (one invocation per output neuron) as illustrated in Figure 7. Upon finishing the garbling process of layer ℓ-1, the Garbler starts garbling layer ℓ and creates the random labels for the output wires of layer ℓ. He also creates the random labels associated with his own input (i.e., the weight parameters) to layer ℓ. Given the input and output labels, the Garbler generates the garbled tables and sends each one to the Evaluator as soon as it is ready. He also sends one of the two labels for each of his input bits. At the same time, the Evaluator has computed the output labels of layer ℓ-1. She receives the garbled tables as well as the Garbler's selected input labels, decrypts the tables, and stores the output labels of layer ℓ.
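The pipelining idea can be sketched as a bounded producer/consumer queue, with the Garbler streaming each garbled table to the Evaluator as soon as it is ready. This is a toy illustration; garble_one and evaluate_one stand in for the actual garbling and evaluation routines:

```python
import threading, queue

def pipelined_gc(num_invocations, garble_one, evaluate_one, capacity=4):
    # Toy pipeline: the Garbler streams garbled tables to the Evaluator,
    # who processes each one as soon as it arrives. Names are illustrative.
    channel = queue.Queue(maxsize=capacity)  # bounded, like a network socket
    results = []

    def garbler():
        for i in range(num_invocations):
            channel.put(garble_one(i))   # send each table as soon as it is ready
        channel.put(None)                # end-of-stream marker

    def evaluator():
        while (table := channel.get()) is not None:
            results.append(evaluate_one(table))

    g = threading.Thread(target=garbler)
    e = threading.Thread(target=evaluator)
    g.start(); e.start(); g.join(); e.join()
    return results
```

The bounded queue models the key property of the design: neither party ever has to buffer the whole circuit, only a small window of in-flight tables.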
Dynamic Memory Management. We design the framework such that the memory allocated for the labels is released as soon as it is no longer needed, reducing the memory usage significantly. For example, without our dynamic memory management, the Garbler would have to keep the labels and garbled tables of the entire BC1 network in memory during garbling (see Section 7 for the network description). In contrast, in our framework, the size of the memory allocation never exceeds 2 GB and stays below 0.5 GB for most of the layers.
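A toy model of this strategy, under the assumption (ours, for illustration) that garbling layer i only requires the labels of layers i and i+1 to be live:

```python
def peak_label_memory(layer_sizes, free_when_done=True):
    # Toy model of label memory: garbling layer i needs the labels of
    # layers i and i+1 live; with dynamic management, layer i's labels
    # are freed as soon as layer i+1 has been garbled.
    live, peak = 0, 0
    for i in range(len(layer_sizes) - 1):
        live += layer_sizes[i + 1]          # allocate next layer's labels
        if i == 0:
            live += layer_sizes[0]          # input labels
        peak = max(peak, live)
        if free_when_done:
            live -= layer_sizes[i]          # release labels no longer needed
    return peak
```

With freeing enabled, the peak is set by the two largest adjacent layers rather than by the sum over the whole network, which mirrors why the allocation stays bounded per layer.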
4.2 Application Programming Interface (API)
Xonn provides a simplified and easy-to-use API for oblivious inference. The framework accepts a high-level description of the network, the parameters of each layer, and the input structure. It automatically computes the number of invocations and the interconnections between all of the base circuits. Figure 8 shows the complete network description that a user needs to write for a sample network architecture (the BM3 architecture, see Section 7). All of the required circuits are automatically generated using the TinyGarble [57] synthesis libraries. It is worth mentioning that, for the task of oblivious inference, our API is much simpler than the recent high-level EzPC framework [15]. For example, the number of lines of code required to describe the BM1, BM2, and BM3 network architectures (see Section 7) in EzPC is 78, 88, and 154, respectively. In contrast, they can be described in only 6, 6, and 10 lines of code in our framework.
Keras to Xonn Translation. To further facilitate the adoption of Xonn, we have created a compiler that translates the description of a neural network in Keras [16] to the Xonn format. The compiler creates the .xonn file and puts the network parameters into the required format (HEX string) to be read by the framework during the execution of the GC protocol. All of the parameter adjustments are also performed automatically by the compiler.
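A hypothetical sketch of such a translation pass; the actual .xonn format and layer keywords are not reproduced here, so the output syntax below is made up for illustration:

```python
# Hypothetical Keras-to-Xonn translation sketch. The real .xonn syntax is
# not specified in this section, so the emitted keywords are invented.

def to_xonn_description(keras_like_layers):
    lines = []
    for layer in keras_like_layers:
        kind = layer["type"]
        if kind == "Conv2D":
            lines.append(f"CONV filters:{layer['filters']} kernel:{layer['kernel']}")
        elif kind == "MaxPooling2D":
            lines.append(f"MP window:{layer['pool']}")
        elif kind == "Dense":
            lines.append(f"FC units:{layer['units']}")
        else:
            raise ValueError(f"unsupported layer: {kind}")
    return "\n".join(lines)
```

The point of such a pass is that each supported Keras layer maps one-to-one onto a base-circuit invocation pattern, which is why a whole architecture fits in a handful of description lines.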
5 Related Work
CryptoNets [19] is one of the early solutions that proposed the use of Leveled Homomorphic Encryption (LHE) for oblivious inference. LHE is a variant of partially homomorphic encryption that enables the evaluation of depth-bounded arithmetic circuits. DeepSecure [52] is a privacy-preserving DL framework that relies on the GC protocol. CryptoDL [27] improves upon CryptoNets [19] and proposes a more efficient approximation of the non-linear functions using low-degree polynomials. Their solution is based on LHE and uses mean-pooling in place of the max-pooling layer. Chou et al. propose to utilize the sparsity within the DL model to accelerate the inference [17].
SecureML [40] is a privacy-preserving machine learning framework based on homomorphic encryption, GC, and secret sharing. SecureML also uses customized activation functions and supports privacy-preserving training in addition to inference. Two non-colluding servers are used to train the DL model: each client XOR-shares her input and sends one share to each server. MiniONN [38] is a mixed-protocol framework for oblivious inference. The underlying cryptographic protocols are HE, GC, and secret sharing.
Chameleon [49] is a more recent mixed-protocol framework for machine learning, i.e., Support Vector Machines (SVMs) as well as DNNs. The authors propose to perform low-depth non-linear functions using the Goldreich-Micali-Wigderson (GMW) protocol [25], high-depth functions using the GC protocol, and linear operations using additive secret sharing. Moreover, they propose to use correlated randomness to compute the linear operations more efficiently. EzPC [15] is a secure computation framework that enables users to write high-level programs and translates them into a protocol-based description of both Boolean and Arithmetic circuits. The back-end cryptographic engine is based on the ABY framework. Shokri and Shmatikov [54] proposed a solution for privacy-preserving collaborative deep learning where the training data is distributed among many parties. Their approach, which is based on differential privacy, enables clients to train their local models on their own training data and update the central model's parameters held by a central server. However, it has been shown that a malicious client can learn significant information about the other clients' private data [28]. Google [11] has recently introduced a new approach for securely aggregating parameter updates from multiple users. However, none of these approaches [54, 11] studies the oblivious inference problem. An overview of related frameworks is provided in [48, 47].
Frameworks such as ABY3 [39] and SecureNN [60] have different computation models and rely on three (or four) parties during the oblivious inference. In contrast, Xonn does not require an additional server for the computation. In the E2DM framework [31], the model owner can encrypt and outsource the model to an untrusted server to perform oblivious inference. Concurrently with and independently of our work, Sanyal et al. study the binarization of neural networks in the context of oblivious inference in TAPAS [53]. They report an inference latency of 147 seconds on the MNIST dataset with 98.6% prediction accuracy using a custom CNN architecture. However, as we show in Section 7 (BM3 benchmark), Xonn outperforms TAPAS by close to three orders of magnitude.
Gazelle [32] is previously the most efficient oblivious inference framework. It is a mixed-protocol approach based on additive HE and GC. In Gazelle, convolution operations are performed using the packing property of HE: many numbers are packed inside a single ciphertext for faster convolutions. In Section 6, we briefly discuss one of the essential requirements that the Gazelle protocol has to satisfy in order to be secure, namely, circuit privacy.
High-Level Comparison. In contrast to prior work, we propose a DL-secure computation co-design approach. To the best of our knowledge, DeepSecure [52] is the only other solution that preprocesses the data and network before the secure computation protocol. However, its preprocessing step is unrelated to the underlying cryptographic protocol and merely compacts the network and data. Moreover, in this mode, some information about the network parameters and the structure of the data is revealed. Compared to mixed-protocol solutions, Xonn not only provides a more efficient solution but also maintains constant round complexity regardless of the number of layers in the neural network model. It has been shown that round complexity is one of the important criteria in designing secure computation protocols [10], since performance can degrade significantly in Internet settings where the network latency is high. Another important advantage of our solution is the ability to upgrade to security against malicious adversaries using cut-and-choose techniques [37, 29, 35]. As we show in Section 7, Xonn outperforms all previous solutions in inference latency. Table 2 summarizes a high-level comparison between state-of-the-art oblivious inference frameworks.
Framework          Crypto. Protocol   C  D  I  U  S
CryptoNets [19]    HE                 ✓  ✗  ✓  ✗  ✗
DeepSecure [52]    GC                 ✓  ✓  ✓  ✓  ✓
SecureML [40]      HE, GC, SS         ✗  ✗  ✗  ✗  ✗
MiniONN [38]       HE, GC, SS         ✗  ✗  ✓  ✗  ✓
Chameleon [49]     GC, GMW, SS        ✗  ✗  ✗  ✗  ✓
EzPC [15]          GC, SS             ✗  ✗  ✓  ✗  ✓
Gazelle [32]       HE, GC, SS         ✗  ✗  ✓  ✗  ✓
Xonn (this work)   GC, SS             ✓  ✓  ✓  ✓  ✓
6 Circuit Privacy
In Gazelle [32], for each linear layer, the protocol starts with a vector x that is additively secret-shared between the client (x1) and the server (x2), with x = x1 + x2. The protocol outputs secret shares of the vector y = W·x, where W is a matrix known to the server but not to the client. The protocol proceeds as follows: (i) The client generates a pair of public and secret keys (pk, sk) of an additively homomorphic encryption scheme HE. (ii) The client sends HE.Enc(pk, x1) to the server. The server adds its share (x2) to the ciphertext and recovers an encryption of x: HE.Enc(pk, x1 + x2). (iii) The server homomorphically evaluates the multiplication with W and obtains an encryption of W·x. (iv) The server secret-shares y by sampling a random vector r and returns the ciphertext c = HE.Enc(pk, W·x − r) to the client. The client decrypts c using the private key sk and obtains its share y1 = W·x − r, while the server keeps y2 = r.
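The share flow of steps (i)-(iv) can be traced with a mock additively homomorphic scheme. The stub below is NOT an encryption scheme; it only mirrors the interface so the arithmetic of the protocol can be checked:

```python
import random

Q = 2**31 - 1  # toy modulus for the mock scheme

class MockAHE:
    # Stand-in for an additively homomorphic scheme (insecure by design):
    # "ciphertexts" are just vectors mod Q, so the homomorphisms are trivial.
    def enc(self, v):
        return [x % Q for x in v]
    def dec(self, c):
        return [x % Q for x in c]
    def add(self, c, v):                     # ciphertext + plaintext vector
        return [(a + b) % Q for a, b in zip(c, v)]
    def matmul(self, W, c):                  # homomorphic matrix-vector product
        return [sum(w * x for w, x in zip(row, c)) % Q for row in W]

def gazelle_linear(W, x1, x2):
    he = MockAHE()
    c = he.enc(x1)                           # (ii) client sends Enc(x1)
    c = he.add(c, x2)                        #      server adds its share x2
    c = he.matmul(W, c)                      # (iii) server computes Enc(W x)
    r = [random.randrange(Q) for _ in W]     # (iv) server samples mask r
    c = he.add(c, [-ri for ri in r])         #      returns Enc(W x - r)
    y1 = he.dec(c)                           # client's share: W x - r
    y2 = r                                   # server's share: r
    return y1, y2
```

Recombining the two output shares modulo Q yields W·x, which is the invariant the real protocol maintains under actual encryption.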
Gazelle uses the Brakerski-Fan-Vercauteren (BFV) scheme [12, 22]. However, the vanilla BFV scheme does not provide circuit privacy. At a high level, the circuit privacy requirement states that the returned ciphertext should not reveal any information about the server's private inputs (i.e., W and r) to the client other than the underlying plaintext W·x − r. Otherwise, some information about the server's model is leaked. Gazelle proposes two methods to provide circuit privacy, but neither is incorporated in their implementation. Hence, we need to scale up their performance numbers for a fair comparison.
The first method is to let the client and server engage in a two-party secure decryption protocol, where the client's input is the secret key and the server's input is the ciphertext. However, this method adds communication and needs extra rounds of interaction. A more widely used approach is noise flooding. Roughly speaking, the server adds a large noise term to the ciphertext before returning it to the client. The noise is big enough to drown any extra information contained in the ciphertext, yet still small enough that the ciphertext decrypts to the same plaintext.
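A simplified numeric illustration of noise flooding, using a BFV-like encoding c = Δ·m + e (mod q) without any actual lattice encryption (the parameters and helper names below are illustrative):

```python
import random

# Simplified BFV-like encoding: c = Delta*m + e (mod q). This omits the
# actual lattice encryption entirely; parameters are for illustration only.
q, t = 2**60, 2**10
Delta = q // t

def encode(m, noise):
    return (Delta * m + noise) % q

def decode(c):
    return round((c % q) / Delta) % t

def flood(c, bound):
    # The server adds a large uniform noise term. As long as the total
    # noise stays below Delta/2, the ciphertext still decodes to the same m.
    return (c + random.randrange(-bound, bound + 1)) % q

m = 123
c = encode(m, noise=5)             # small evaluation noise (may leak info)
c_flooded = flood(c, bound=2**40)  # 2**40 << Delta/2 = 2**49: decoding holds
```

The flooding noise (up to 2^40) statistically drowns the original noise term (here, 5) while staying far below the decoding threshold Δ/2, so correctness is preserved.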
For a concrete instantiation of Gazelle with noise flooding, one needs to triple the size of the ciphertext modulus from 60 bits to 180 bits and increase the ring dimension from 2048 to 8192; the (amortized) complexity of homomorphic operations in the BFV scheme grows accordingly with both parameters. Therefore, adding noise flooding would result in a 3-3.6 times slowdown for the HE component of Gazelle. To give concrete examples, we consider two networks used for benchmarking in Gazelle: the MNIST-D and CIFAR-10 networks. For the MNIST-D network, homomorphic encryption takes 55% of the online time and 22% of the total time; for CIFAR-10, the corresponding figures are 35% and 10%, respectively (these percentages were obtained through private communication with the authors). Therefore, we estimate that the total time for MNIST-D will grow from 0.81 s to 1.16-1.27 s (network BM3 in this paper). For the CIFAR-10 network, the total time will grow from 12.9 s to 15.48-16.25 s.
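These estimates follow from scaling only the HE fraction of the runtime by the 3-3.6x flooding slowdown, which can be checked directly:

```python
# Reproducing the estimates above: only the HE fraction of the total
# runtime is slowed down by noise flooding; the rest is unchanged.
def flooded_total(total_s, he_fraction, slowdown):
    return total_s * ((1 - he_fraction) + he_fraction * slowdown)

mnist_lo = flooded_total(0.81, 0.22, 3.0)   # ~1.16 s
mnist_hi = flooded_total(0.81, 0.22, 3.6)   # ~1.27 s
cifar_lo = flooded_total(12.9, 0.10, 3.0)   # ~15.48 s
cifar_hi = flooded_total(12.9, 0.10, 3.6)   # ~16.25 s
```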
7 Experimental Results
We evaluate Xonn on the MNIST and CIFAR-10 datasets, which are two popular classification benchmarks used in prior work. In addition, we provide four healthcare datasets to illustrate the applicability of Xonn in real-world scenarios. For training Xonn, we use Keras [16] with the TensorFlow backend [5]. The source code of Xonn is compiled with GCC 5.5.0 using the -O3 optimization flag. All Boolean circuits are synthesized using Synopsys Design Compiler 2015. Evaluations are performed on (Ubuntu 16.04 LTS) machines with an Intel Core i7-7700k and GB of RAM. The experimental setup is comparable to (but has less computational power than) that of the prior art [32]. Consistent with prior frameworks, we evaluate the benchmarks in the LAN setting.

7.1 Evaluation on MNIST
There are mainly three network architectures that prior works have implemented for the MNIST dataset. We convert these reference networks into their binary counterparts and train them using the standard BNN training algorithm [18]. Table 3 summarizes the architectures for the MNIST dataset.
Arch.  Previous Papers               Description
BM1    SecureML [40], MiniONN [38]   3 FC
BM2                                  1 CONV, 2 FC
BM3    MiniONN [38], EzPC [15]       2 CONV, 2 MP, 2 FC
Analysis of Network Scaling: Recall that the classification accuracy of Xonn is controlled by scaling the number of neurons in all layers by a factor s (Section 3.1). Figure 8(a) depicts the inference accuracy for different scaling factors (more details in Table 11 in Appendix A.2). As we increase the scaling factor, the accuracy of the network increases. This accuracy improvement comes at the cost of higher computational complexity of the (scaled) network; as a result, increasing the scaling factor leads to a higher runtime. Figure 8(b) depicts the runtime of the different BNN architectures as a function of the scaling factor s. Note that the runtime grows (almost) quadratically with the scaling factor due to the quadratic increase in the number of operations in the neural network. However, for the smaller networks, the overall runtime is dominated by the constant initialization cost of the OT protocol.
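The quadratic growth can be seen in a toy cost model (ours, for illustration) where every layer width is multiplied by the scaling factor s, so each fully-connected layer's operation count grows by s^2:

```python
def scaled_fc_ops(layer_widths, s):
    # Toy model: every layer width is multiplied by s, so each FC layer's
    # multiply-accumulate count (in_width * out_width) grows by s^2.
    widths = [int(w * s) for w in layer_widths]
    return sum(a * b for a, b in zip(widths, widths[1:]))
```

For example, doubling s quadruples the operation count of every FC layer, which matches the (almost) quadratic runtime curve once the constant OT setup cost is amortized.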
GC Cost and the Effect of OCA: The communication cost of GC is the key contributor to the overall runtime of Xonn. Here, we analyze the effect of the scaling factor on the total message size. Figure 10 shows the communication cost of GC for the BM1 and BM2 network architectures. As can be seen, the message size increases with the scaling factor. We also observe that the OCA protocol drastically reduces the message size. This is due to the fact that the first layers of the BM1 and BM2 models account for a large portion of the overall computation; hence, improving the first layer with OCA has a drastic effect on the overall communication.
Comparison to Prior Art: We emphasize that, unlike previous work, the accuracy of Xonn can be customized by tuning the scaling factor (s). Furthermore, our channel/neuron pruning step (Algorithm 2) can reduce the GC cost in a post-processing phase. To provide a fair comparison between Xonn and prior art, we choose a proper scaling factor and trim the pertinent scaled BNN such that it achieves the same accuracy as the previous work. Table 4 compares Xonn with previous work in terms of accuracy, latency, and communication cost (a.k.a. message size). The last column shows the scaling factor (s) used to increase the width of the hidden layers of the BNN. Note that the scaled network is further trimmed using Algorithm 2.
In Xonn, the runtime of the oblivious transfer includes a constant cost for initiating the protocol and then grows linearly with the size of the garbled tables. As a result, in very small architectures such as BM1, our solution is slightly slower than the best previous work since the constant setup cost dominates the total runtime. However, for the BM3 network, which has higher complexity than BM1 and BM2, Xonn achieves a more prominent advantage over prior art. In summary, our solution achieves up to 7.7× faster inference (average of 3.4×) compared to Gazelle [32]. Compared to MiniONN [38], Xonn has up to 62× lower latency (average of 26×); see Table 4. Compared to EzPC [15], our framework is up to 34× faster. Xonn achieves 37.5×, 1859×, 60×, and 14× better latency compared to SecureML [40], CryptoNets [19], DeepSecure [52], and Chameleon [49], respectively.
Arch.  Framework    Runtime (s)  Comm. (MB)  Acc. (%)  s
BM1    SecureML     4.88         -           93.1      -
       MiniONN      1.04         15.8        97.6      -
       EzPC         0.7          76          97.6      -
       Gazelle      0.09         0.5         97.6      -
       Xonn         0.13         4.29        97.6      1.75
BM2    CryptoNets   297.5        372.2       98.95     -
       DeepSecure   9.67         791         98.95     -
       MiniONN      1.28         47.6        98.95     -
       Chameleon    2.24         10.5        99.0      -
       EzPC         0.6          70          99.0      -
       Gazelle      0.29         8.0         99.0      -
       Xonn         0.16         38.28       98.64     4.00
BM3    MiniONN      9.32         657.5       99.0      -
       EzPC         5.1          501         99.0      -
       Gazelle      1.16         70          99.0      -
       Xonn         0.15         32.13       99.0      2.00
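The speedup figures quoted above can be re-derived from the runtimes in Table 4, for example:

```python
# Speedups quoted in the text, derived from the per-architecture runtimes
# (seconds) reported in Table 4.
runtimes = {
    "BM1": {"Gazelle": 0.09, "MiniONN": 1.04, "Xonn": 0.13},
    "BM2": {"Gazelle": 0.29, "MiniONN": 1.28, "Xonn": 0.16},
    "BM3": {"Gazelle": 1.16, "MiniONN": 9.32, "Xonn": 0.15},
}

def speedups(baseline):
    # Ratio of the baseline framework's runtime to Xonn's, per architecture.
    return [runtimes[a][baseline] / runtimes[a]["Xonn"] for a in runtimes]

gazelle = speedups("Gazelle")   # max ~7.7x, mean ~3.4x
minionn = speedups("MiniONN")   # max ~62x,  mean ~26x
```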
7.2 Evaluation on CIFAR-10
In Table 5, we summarize the network architectures that we use for the CIFAR-10 dataset. In this table, BC1 is the binarized version of the architecture proposed by MiniONN. To evaluate the scalability of our framework to larger networks, we also binarize the Fitnet [51] architectures, denoted as BC2-BC5. We also evaluate Xonn on the popular VGG16 network architecture (BC6). Detailed architecture descriptions are available in Appendix A.2, Table 13.
Arch.  Previous Papers  Description
BC1    MiniONN [38]     7 CONV, 2 MP, 1 FC
BC2    Fitnet [51]      9 CONV, 3 MP, 1 FC
BC3    Fitnet [51]      9 CONV, 3 MP, 1 FC
BC4    Fitnet [51]      11 CONV, 3 MP, 1 FC
BC5    Fitnet [51]      17 CONV, 3 MP, 1 FC
BC6    VGG16 [56]       13 CONV, 5 MP, 3 FC
Analysis of Network Scaling: Similar to the analysis on the MNIST dataset, we show that the accuracy of our binary models for CIFAR-10 can be tuned based on the scaling factor that determines the number of neurons in each layer. Figure 10(a) depicts the accuracy of the BNNs with different scaling factors. As can be seen, increasing the scaling factor enhances the classification accuracy of the BNN. The runtime also increases with the scaling factor, as shown in Figure 10(b) (more details in Table 12, Appendix A.2).
Comparison to Prior Art: We scale the BC2 network with a factor of s = 3, then prune it using Algorithm 2. Details of the pruning steps are available in Table 10 in Appendix A.1. The resulting network is compared against prior art in Table 6. As can be seen, our solution achieves 2.7×, 45.9×, 9.1×, and 94× lower latency compared to Gazelle, EzPC, Chameleon, and MiniONN, respectively.
Framework  Runtime (s)  Comm. (MB)  Acc. (%)  s
MiniONN    544          9272        81.61     -
Chameleon  52.67        2650        81.61     -
EzPC       265.6        40683       81.61     -
Gazelle    15.48        1236        81.61     -
Xonn       5.79         2599        81.85     3.00
7.3 Evaluation on Medical Datasets
One of the most important applications of oblivious inference is medical data analysis. Recent advances in deep learning greatly benefit many complex diagnosis tasks that previously required exhaustive manual inspection by human experts [21, 20, 6, 45]. To showcase the applicability of oblivious inference in real-world medical applications, we provide several benchmarks for publicly available healthcare datasets, summarized in Table 7. We split the datasets into training and validation portions as indicated in the last two columns of Table 7. All datasets except Malaria Infection are normalized to have zero mean and unit standard deviation per feature. The images of the Malaria Infection dataset are resized before processing. The normalized datasets are quantized to 3 decimal digits. Detailed architectures are available in Appendix A.2, Table 13. We report the validation accuracy along with the inference time and message size in Table 8.

Task                   Arch.  Description  # of Samples (Tr. / Val.)
Breast Cancer [1]      BH1    3 FC         453 / 113
Diabetes [4]           BH2    3 FC         615 / 153
Liver Disease [2]      BH3    3 FC         467 / 116
Malaria Infection [3]  BH4                 24804 / 2756
Arch.  Runtime (ms)  Comm. (MB)  Acc. (%)
BH1    82            0.35        97.35
BH2    75            0.16        80.39
BH3    81            0.3         80.17
BH4    482           120.75      95.03
8 Conclusion
We introduce Xonn, a novel framework to automatically train and use deep neural networks for the task of oblivious inference. Xonn utilizes Yao's Garbled Circuits (GC) protocol and relies on binarizing the DL models in order to translate costly matrix multiplications into XNOR operations that are free in the GC protocol. Compared to Gazelle [32], the prior best solution, Xonn achieves up to 7.7× lower latency. Moreover, in contrast to Gazelle, which requires one round of interaction for each layer, our solution needs a constant number of interaction rounds regardless of the number of layers. Maintaining constant round complexity is an important requirement in Internet settings, as typical network latency can significantly degrade the performance of oblivious inference. Moreover, since our solution relies on the GC protocol, it can provide much stronger security guarantees, such as security against malicious adversaries using standard cut-and-choose protocols. Xonn's high-level API enables clients to utilize the framework with a minimal number of lines of code. To further facilitate the adoption of our framework, we design a compiler to translate a neural network description in Keras format to that of Xonn.
Acknowledgements
We would like to thank the anonymous reviewers for their insightful comments.
References
 [1] Breast Cancer Wisconsin, accessed on 01/20/2019. https://www.kaggle.com/uciml/breast-cancer-wisconsin-data, 2019.
 [2] Indian Liver Patient Records, accessed on 01/20/2019. https://www.kaggle.com/uciml/indian-liver-patient-records, 2019.
 [3] Malaria Cell Images, accessed on 01/20/2019. https://www.kaggle.com/iarunava/cell-images-for-detecting-malaria, 2019.
 [4] Pima Indians Diabetes, accessed on 01/20/2019. https://www.kaggle.com/uciml/pima-indians-diabetes-database, 2019.
 [5] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Operating Systems Design and Implementation (OSDI), 2016.
 [6] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831, 2015.
 [7] Gilad Asharov, Yehuda Lindell, Thomas Schneider, and Michael Zohner. More efficient oblivious transfer and extensions for faster secure computation. In ACM CCS, 2013.
 [8] Donald Beaver. Correlated pseudorandomness and the complexity of private computations. In STOC, 1996.
 [9] Mihir Bellare, Viet Tung Hoang, Sriram Keelveedhi, and Phillip Rogaway. Efficient garbling from a fixed-key blockcipher. In IEEE S&P, 2013.
 [10] Aner Ben-Efraim, Yehuda Lindell, and Eran Omri. Optimizing semi-honest secure multiparty computation for the internet. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 578–590. ACM, 2016.
 [11] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacypreserving machine learning. In ACM CCS, 2017.
 [12] Zvika Brakerski. Fully homomorphic encryption without modulus switching from classical GapSVP. In Advances in Cryptology - CRYPTO 2012, pages 868–886. Springer, 2012.
 [13] Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. (leveled) fully homomorphic encryption without bootstrapping. ACM Transactions on Computation Theory (TOCT), 6(3):13, 2014.
 [14] Zvika Brakerski and Vinod Vaikuntanathan. Efficient fully homomorphic encryption from (standard) LWE. SIAM Journal on Computing, 43(2):831–871, 2014.
 [15] Nishanth Chandran, Divya Gupta, Aseem Rastogi, Rahul Sharma, and Shardul Tripathi. EzPC: Programmable, efficient, and scalable secure two-party computation. IACR Cryptology ePrint Archive, 2017/1109, 2017.
 [16] François Chollet et al. Keras. https://keras.io, 2015.
 [17] Edward Chou, Josh Beal, Daniel Levy, Serena Yeung, Albert Haque, and Li Fei-Fei. Faster CryptoNets: Leveraging sparsity for real-world encrypted inference. arXiv preprint arXiv:1811.09953, 2018.
 [18] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [19] Nathan Dowlin, Ran GiladBachrach, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In ICML, 2016.
 [20] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologistlevel classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.
 [21] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24, 2019.
 [22] Junfeng Fan and Frederik Vercauteren. Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive, 2012:144, 2012.
 [23] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In ACM CCS. ACM, 2015.
 [24] Mohammad Ghasemzadeh, Mohammad Samragh, and Farinaz Koushanfar. ReBNet: Residual binarized neural network. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 57–64. IEEE, 2018.

 [25] Oded Goldreich, Silvio Micali, and Avi Wigderson. How to play any mental game. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pages 218–229. ACM, 1987.
 [26] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 [27] Ehsan Hesamifard, Hassan Takabi, Mehdi Ghasemi, and Rebecca N Wright. Privacy-preserving machine learning as a service. Proceedings on Privacy Enhancing Technologies, 2018(3):123–142, 2018.
 [28] Briland Hitaj, Giuseppe Ateniese, and Fernando PérezCruz. Deep models under the GAN: information leakage from collaborative deep learning. In ACM CCS, 2017.
 [29] Yan Huang, Jonathan Katz, and David Evans. Efficient secure two-party computation using symmetric cut-and-choose. In Advances in Cryptology - CRYPTO 2013, pages 18–35. Springer, 2013.
 [30] Yuval Ishai, Joe Kilian, Kobbi Nissim, and Erez Petrank. Extending oblivious transfers efficiently. In Annual International Cryptology Conference, pages 145–161. Springer, 2003.
 [31] Xiaoqian Jiang, Miran Kim, Kristin Lauter, and Yongsoo Song. Secure outsourced matrix computation and application to neural networks. In ACM CCS, 2018.
 [32] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. GAZELLE: A low latency framework for secure neural network inference. USENIX Security, 2018.
 [33] Vladimir Kolesnikov and Thomas Schneider. Improved garbled circuit: Free XOR gates and applications. In ICALP, 2008.

 [34] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 345–353, 2017.
 [35] Yehuda Lindell. Fast cut-and-choose-based protocols for malicious and covert adversaries. Journal of Cryptology, 29(2):456–490, 2016.
 [36] Yehuda Lindell and Benny Pinkas. A proof of security of Yao's protocol for two-party computation. Journal of Cryptology, 22(2):161–188, 2009.
 [37] Yehuda Lindell and Benny Pinkas. Secure two-party computation via cut-and-choose oblivious transfer. Journal of Cryptology, 25(4):680–722, 2012.
 [38] Jian Liu, Mika Juuti, Yao Lu, and N. Asokan. Oblivious neural network predictions via MiniONN transformations. In ACM CCS, 2017.
 [39] Payman Mohassel and Peter Rindal. ABY3: a mixed protocol framework for machine learning. In ACM CCS, 2018.
 [40] Payman Mohassel and Yupeng Zhang. SecureML: A system for scalable privacy-preserving machine learning. In IEEE S&P, 2017.
 [41] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
 [42] Moni Naor, Benny Pinkas, and Reuban Sumner. Privacy preserving auctions and mechanism design. In ACM Conference on Electronic Commerce, 1999.
 [43] Pascal Paillier. Public-key cryptosystems based on composite degree residuosity classes. In International Conference on the Theory and Applications of Cryptographic Techniques, pages 223–238. Springer, 1999.
 [44] Michael O Rabin. How to exchange secrets with oblivious transfer. IACR Cryptology ePrint Archive, 2005:187, 2005.
 [45] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1(1):18, 2018.

 [46] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [47] M Sadegh Riazi and Farinaz Koushanfar. Privacy-preserving deep learning and inference. In Proceedings of the International Conference on Computer-Aided Design, page 18. ACM, 2018.
 [48] M Sadegh Riazi, Bita Darvish Rouhani, and Farinaz Koushanfar. Deep learning on private data. IEEE Security and Privacy (S&P) Magazine., 2019.
 [49] M Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M Songhori, Thomas Schneider, and Farinaz Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In ASIACCS’18, 2018.
 [50] Peter Rindal. libOTe: an efficient, portable, and easy to use Oblivious Transfer library. https://github.com/osu-crypto/libOTe, 2018.
 [51] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
 [52] Bita Darvish Rouhani, M Sadegh Riazi, and Farinaz Koushanfar. DeepSecure: Scalable provably-secure deep learning. DAC, 2018.
 [53] Amartya Sanyal, Matt Kusner, Adria Gascon, and Varun Kanade. TAPAS: Tricks to accelerate (encrypted) prediction as a service. In International Conference on Machine Learning, pages 4497–4506, 2018.
 [54] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In ACM CCS, 2015.
 [55] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In S&P. IEEE, 2017.
 [56] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [57] Ebrahim M Songhori, Siam U Hussain, Ahmad-Reza Sadeghi, Thomas Schneider, and Farinaz Koushanfar. TinyGarble: Highly compressed and scalable sequential garbled circuits. In IEEE S&P, 2015.
 [58] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. CVPR, 2015.
 [59] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security, 2016.
 [60] Sameer Wagh, Divya Gupta, and Nishanth Chandran. SecureNN: Efficient and private neural network training, 2018.
 [61] Andrew Yao. How to generate and exchange secrets. In FOCS, 1986.
 [62] Samee Zahur, Mike Rosulek, and David Evans. Two halves make a whole. In EUROCRYPT, 2015.
Appendix A Experimental Details
A.1 Network Trimming Examples
Network  Property    Trimming Step                      Change
                     initial  step 1  step 2  step 3

         Acc. (%)    97.63    97.59   97.28   97.02    −0.61%
         Comm. (MB)  4.95     4.29    3.81    3.32     1.49× less
         Lat. (ms)   158      131     114     102      1.54× faster

         Acc. (%)    98.64    98.44   98.37   98.13    −0.51%
         Comm. (MB)  38.28    28.63   24.33   15.76    2.42× less
         Lat. (ms)   158      144     134     104      1.51× faster

         Acc. (%)    99.22    99.11   98.96   99.00    −0.22%
         Comm. (MB)  56.08    42.51   37.34   32.13    1.75× less
         Lat. (ms)   190      165     157     146      1.3× faster
Property    Trimming Step                      Change
            initial  step 1  step 2  step 3

Acc. (%)    82.40    82.39   82.41   81.85    −0.55%
Comm. (GB)  3.38     3.05    2.76    2.60     1.30× less
Lat. (s)    7.59     6.87    6.23    5.79     1.31× faster
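The trimming steps above repeatedly remove neurons to cut communication and latency at a small accuracy cost. A minimal sketch of one such step, assuming magnitude-based (L1-norm) pruning of a fully connected layer's output neurons; the toy weights and the 50% keep ratio are illustrative, not values from the paper:

```python
def l1_norm(row):
    """L1 norm of one output neuron's weight row."""
    return sum(abs(w) for w in row)

def trim_fc_layer(weights, keep_ratio):
    """Keep the keep_ratio fraction of output neurons with the
    largest L1 weight norms; the rest are pruned away."""
    ranked = sorted(range(len(weights)),
                    key=lambda i: l1_norm(weights[i]), reverse=True)
    keep = sorted(ranked[: max(1, int(len(weights) * keep_ratio))])
    return [weights[i] for i in keep]

# Toy layer: 4 output neurons, 3 inputs each.
layer = [[0.9, -0.8, 0.7],    # strong neuron
         [0.01, 0.02, 0.0],   # weak neuron, pruned
         [-0.5, 0.6, -0.4],
         [0.03, -0.01, 0.02]]
trimmed = trim_fc_layer(layer, keep_ratio=0.5)
print(len(trimmed))  # 2 neurons survive
```

A smaller output dimension shrinks both the next layer's weight matrix and the per-inference communication; in practice each trimming step is followed by fine-tuning to recover accuracy.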
A.2 Accuracy, Runtime, and Communication
Runtime and communication reports are available in Table 11 and Table 12 for MNIST and CIFAR10 benchmarks, respectively. The corresponding neural network architectures are provided in Table 13. Entries corresponding to a communication of more than GB are estimated using numerical runtime models.
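The architectures in Table 13 pair each linear layer with batch normalization followed by a binary activation (BN + BA). At inference time this pair collapses to a single per-neuron comparison: for gamma > 0, sign(gamma·(x − mu)/sigma + beta) equals sign(x − tau) with tau = mu − beta·sigma/gamma. A small sketch checking this equivalence; all constants are illustrative:

```python
def bn_then_sign(x, gamma, beta, mu, sigma):
    """Batch normalization followed by a sign activation (+1 / -1)."""
    return 1 if gamma * (x - mu) / sigma + beta >= 0 else -1

def folded_sign(x, gamma, beta, mu, sigma):
    """Equivalent single threshold comparison (assumes gamma > 0)."""
    tau = mu - beta * sigma / gamma
    return 1 if x >= tau else -1

gamma, beta, mu, sigma = 1.3, 0.2, 0.5, 0.9
for x in (-2.0, -0.5, 0.0, 0.34, 0.35, 1.0, 2.0):
    assert bn_then_sign(x, gamma, beta, mu, sigma) == \
           folded_sign(x, gamma, beta, mu, sigma)
print("BN + BA matches the folded comparison")
```

Folding the four BN parameters into one threshold per neuron is what keeps the non-linearity cheap inside a secure-computation protocol: a comparison is far less expensive than evaluating the full normalization.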
Arch.  s  Acc. (%)  Comm. (MB)  Lat. (s) 

BM1  1  97.10  2.57  0.12 
1.5  97.56  4.09  0.13  
2  97.82  5.87  0.13  
3  98.10  10.22  0.14  
4  98.34  15.62  0.15  
BM2  1  97.25  2.90  0.10 
1.5  97.93  5.55  0.12  
2  98.28  10.09  0.14  
3  98.56  21.90  0.18  
4  98.64  38.30  0.23  
BM3  1  98.54  17.59  0.17 
1.5  98.93  36.72  0.22  
2  99.13  62.77  0.3  
3  99.26  135.88  0.52  
4  99.35  236.78  0.81 
Arch.  s  Acc. (%)  Comm. (MB)  Lat. (s) 

BC1  1  0.72  1.26  3.96 
1.5  0.77  2.82  8.59  
2  0.80  4.98  15.07  
3  0.83  11.15  33.49  
BC2  1  0.67  0.39  1.37 
1.5  0.73  0.86  2.78  
2  0.78  1.53  4.75  
3  0.82  3.40  10.35  
BC3  1  0.77  1.35  4.23 
1.5  0.81  3.00  9.17  
2  0.83  5.32  16.09  
3  0.86  11.89  35.77  
BC4  1  0.82  4.66  14.12 
1.5  0.85  10.41  31.33  
2  0.87  18.45  55.38  
3  0.88  41.37  123.94  
BC5  1  0.81  5.54  16.78 
1.5  0.85  12.40  37.29  
2  0.86  21.98  65.94  
3  0.88  49.30  147.66  
BC6  1  0.67  0.65  2.15 
1.5  0.74  1.46  4.55  
2  0.78  2.58  7.91  
3  0.80  5.77  17.44 
BM1
1  FC [input: , output: ] + BN + BA
2  FC [input: , output: ] + BN + BA
3  FC [input: , output: ] + BN + Softmax
BM2
1  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
2  FC [input: , output: ] + BN + BA
3  FC [input: , output: ] + BN + Softmax
BM3
1  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
2  MP [input: , window: , output: ]
3  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
4  MP [input: , window: , output: ]
5  FC [input: , output: ] + BN + BA
6  FC [input: , output: ] + BN + Softmax
BC1
1  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
2  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
3  MP [input: , window: , output: ]
4  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
5  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
6  MP [input: , window: , output: ]
7  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
8  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
9  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
10  FC [input: , output: ] + BN + Softmax
BC2
1  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
2  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
3  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
4  MP [input: , window: , output: ]
5  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
6  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
7  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
8  MP [input: , window: , output: ]
9  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
10  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
11  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
12  MP [input: , window: , output: ]
13  FC [input: , output: ] + BN + Softmax
BC3
1  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
2  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
3  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
4  MP [input: , window: , output: ]
5  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
6  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
7  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
8  MP [input: , window: , output: ]
9  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
10  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
11  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
12  MP [input: , window: , output: ]
13  FC [input: , output: ] + BN + Softmax
BC4
1  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
2  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
3  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
4  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
5  MP [input: , window: , output: ]
6  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
7  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
8  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
9  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
10  MP [input: , window: , output: ]
11  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
12  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
13  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
14  MP [input: , window: , output: ]
15  FC [input: , output: ] + BN + Softmax
BC5
1  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
2  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
3  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
4  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
5  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
6  MP [input: , window: , output: ]
7  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
8  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
9  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
10  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
11  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
12  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
13  MP [input: , window: , output: ]
14  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
15  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
16  CONV [input: , window: , stride: , kernels: , output: ] + BN + BA
17  CONV [input: