Deep neural networks are revolutionizing many applications, but practical use may be slowed down by privacy concerns. As an illustrative example, let us consider a hospital that wishes to use an external diagnosis service for its medical images (e.g. MRI scans). In some cases the hospital would be prevented from sharing the medical data of its patients for privacy reasons. On the other hand, the diagnosis company may not be willing to share its model with the hospital to safeguard its intellectual property. Such privacy conflicts could prevent hospitals from using neural network services for improving healthcare. The ability to evaluate neural network models on private data will allow the use of neural network services in privacy-sensitive applications.
The privacy challenge has attracted significant research in the cryptography community. Cryptographic tools were developed to convert any computation to secure computation, i.e. computation where the data of each involved party is guaranteed to reveal no information to the other parties. The deep learning setting consists of two parties, one providing the data and the other providing the neural network model. Secure computation is significantly slower than non-secure computation and requires much higher networking bandwidth. Recently, various approaches were proposed for secure computation of neural networks [bae2018security, tanuwidjaja2019survey]. Due to the efficiency limitations of secure computation, these approaches are limited to simple architectures, decreasing their accuracy and applicability.
Instead of using existing architectures and optimizing the cryptographic protocols, we take a complementary approach. We propose to design new neural network architectures that are crypto-oriented. For example, while non-linear functions such as ReLU are very costly to evaluate in privacy-preserving computations, they are almost free in plain computations. This suggests that the design of crypto-oriented architectures needs to differ from that of non-crypto-oriented architectures.
We propose three design principles for crypto-oriented architectures, derived from empirical observations on multiple state-of-the-art architectures:
Principle 1: Partial activation layers. Non-linear activations such as ReLU are very expensive for secure computation. We propose to split each layer into two branches, applying the non-linear activation on one branch only, significantly reducing the required resources.
Principle 2: Activation layers selection. We propose to eliminate activation layers whose removal makes no significant impact on accuracy.
Principle 3: Alternative non-linearities. Many commonly used non-linearities have alternative variants with similar expressiveness. We propose to select variants having lower resource requirements, e.g. avoiding max pooling and ReLU6.
Motivated by our design principles, we present new crypto-oriented architectures based on three popular (non-crypto-oriented) efficient neural network architectures - MobileNetV2 [sandler2018mobilenetv2], ShuffleNetV2 [ma2018shufflenet] and SqueezeNet [iandola2016squeezenet]. Our new architectures are significantly more efficient than their non-crypto-oriented counterparts, with a reasonable loss of accuracy.
2.1 Privacy-Preserving Machine Learning
Research on privacy-preserving machine learning has so far focused on two main challenges: privacy-preserving training and privacy-preserving inference. Privacy-preserving training [shokri2015privacy, abadi2016deep, bonawitz2017practical] aims to enable neural networks to be trained on private data. This happens, for example, when private training data arrives from different sources, and data privacy must be protected from all other parties.
In this work we address the challenge of privacy-preserving inference. A pre-trained neural network is provided, and the goal is to transform the network to process (possibly interactively) encrypted data. The network’s output should also be encrypted, and only the data owner can decode it. This enables users with private data, such as medical records, to rely on the services of a model provider.
Existing privacy-preserving inference methods [bae2018security, tanuwidjaja2019survey] rely on three cryptographic approaches, developed by the cryptography community in the context of secure computation: homomorphic encryption, garbled circuits, and secret sharing. A neural network of depth can be represented as a composition of layers:
where is the layer of the network, and is the input to the network. Using the above cryptographic tools, each layer can be transformed into a privacy-preserving layer such that given the encoding of a private input the output of:
is encrypted as well, and can be decoded only by the owner of to compute .
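To make the layer-by-layer view concrete, here is a minimal (non-secure) sketch of evaluating a network as a composition of layers; the toy layer functions are hypothetical placeholders, and in a privacy-preserving evaluation each would be replaced by its secure counterpart operating on encoded values:

```python
from functools import reduce

# toy layers; in a secure pipeline each f_i would be transformed into a
# privacy-preserving layer that consumes and produces encoded values
layers = [
    lambda x: 2 * x,       # affine layer (cheap to compute securely)
    lambda x: x + 3,       # affine layer
    lambda x: max(x, 0),   # non-linearity (expensive to compute securely)
]

def network(x, layers):
    """Evaluate the composition f_L(...f_2(f_1(x))...), layer by layer."""
    return reduce(lambda acc, f: f(acc), layers, x)

assert network(5, layers) == 13   # 5 -> 10 -> 13 -> 13
```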
The challenge of practicality.
Despite the extensive research within the cryptography community towards more practical secure computation protocols, the above approaches are practical mainly for simple computations. In particular, homomorphic encryption and secret sharing are most suitable for layers that correspond to affine functions (or to polynomials of small constant degree). Non-affine layers (e.g. ReLU or max pooling) lead to significant overhead, both in computation and in communication. Garbled circuits can be efficient for layers corresponding to functions that can be represented by small Boolean circuits, but they require interaction between the parties for computing every layer, which may be undesirable in many scenarios.
Homomorphic encryption [gentry2009fully, brakerski2014leveled] allows computing an arbitrary function on an encrypted input, without decryption or knowledge of the private key. In other words, for every function and encrypted input it is possible to compute an encryption of without knowing the secret key that was used to encrypt . Gilad-Bachrach et al. [gilad2016cryptonets] relied on homomorphic encryption in their CryptoNets system, replacing the ReLU activation layers with a square activation. However, this approach significantly increased the overall inference time. [hesamifard2017cryptodl, chabanne2017privacy, sanyal2018tapas, chou2018faster, bourse2018fast] also propose optimization methods using homomorphic encryption.
In the context of layer-by-layer transformations, garbled circuits [yao1986generate] can be roughly viewed as a one-time variant of homomorphic encryption [rouhani2018deepsecure, juvekar2018gazelle, riazi2019xonn]. For two parties, A and B, where A holds a function (corresponding to a single layer of the network) and B holds an input , the function is transformed by into a garbled circuit that computes on a single encoded input. B will encode its input , and then one of the parties will be able to compute an encoding of from which can be retrieved.
Secret sharing schemes [Shamir79, Beimel11] provide the ability to share a secret between two or more parties. The secret can be reconstructed by combining the shares of any “authorized” subset of the parties (e.g., all parties or any subset of at least a certain size). The shares of any “unauthorized” subset reveal no information about the secret. As discussed above, secret sharing schemes enable privacy-preserving evaluation of neural networks in a layer-by-layer fashion, where the parties use their shares of all values in one layer to compute shares of all values in the next layer [mohassel2017secureml, liu2017oblivious, riazi2018chameleon, wagh2019securenn].
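As an illustration of this approach, the following is a minimal sketch of two-party additive secret sharing over a ring (the modulus and helper names are our own choices, not those of any particular framework). Note how an affine operation, here addition, is evaluated share-wise with no interaction between the parties:

```python
import random

MOD = 2 ** 64  # illustrative ring size; real protocols fix a ring or field

def share(x):
    """Additively share integer x as (s0, s1) with x = (s0 + s1) mod MOD."""
    s0 = random.randrange(MOD)
    s1 = (x - s0) % MOD
    return s0, s1

def reconstruct(s0, s1):
    return (s0 + s1) % MOD

# Affine layers can be evaluated locally on shares: each party adds its own
# shares, and the result reconstructs to the sum of the secrets.
a0, a1 = share(20)
b0, b1 = share(22)
c0, c1 = (a0 + b0) % MOD, (a1 + b1) % MOD  # no communication needed
assert reconstruct(c0, c1) == 42
```

Non-affine operations such as ReLU are exactly the ones that cannot be computed locally on shares, which is why they require interactive protocols.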
2.2 Efficient Neural Network Architecture Design
Real-world tasks require both accuracy and efficiency, sometimes under different constraints, e.g. hardware. This has led to a significant body of work on designing deep neural network architectures that optimally trade off accuracy and efficiency. SqueezeNet [iandola2016squeezenet], an early approach, reduced the number of model parameters by replacing the commonly used convolution filters with filters and using squeeze and expand modules. Recent works shifted the focus from reducing parameters to minimizing the number of operations. MobileNetV1 [howard2017mobilenets] utilizes depthwise separable convolutions to reduce model complexity and improve efficiency. MobileNetV2 [sandler2018mobilenetv2] further improved this approach by introducing the inverted residual with linear bottleneck block. ShuffleNetV1 [zhang2018shufflenet] relies on pointwise group convolutions to reduce complexity and proposes the channel shuffle operation to help information flow across feature channels. ShuffleNetV2 [ma2018shufflenet] proposed guidelines for the design of efficient deep neural network architectures and suggested an improvement over the ShuffleNetV1 architecture.
2.3 Efficiency Metrics
The efficiency of standard neural networks is typically measured in FLOPs (floating-point operations). Privacy-preserving neural networks require different metrics, due to the interactivity introduced by cryptographic protocols. The main efficiency measures for such protocols are their overall communication volume (communication complexity) and the number of rounds of interaction between the parties (round complexity) [yao1982protocols, yao1986generate, goldreich1987play, beaver1990round, ishai2000randomizing, franklin1992communication, kushilevitz1997communication, kushelvitz1992privacy, goldwasser1997multi].
3 Designing Crypto-Oriented Networks
Our goal is to design neural networks that can be computed efficiently in a secure manner, providing privacy-preserving inference mechanisms. We propose three principles for designing crypto-oriented neural network architectures. These principles exploit the trade-offs that come with the complexity of the cryptographic techniques enabling privacy-preserving inference.
In non-secure computation the cost of affine operations, like addition or multiplication, is almost the same as the cost of non-linearities such as maximum or ReLU. As typical neural networks consist of many more additions and multiplications than non-linearities, the cost of non-linearities is negligible [cong2014minimizing, hunsberger2016training]. Efficient network designs therefore try to limit the number and size of network layers, without taking the number of non-linearities into account.
As explained in Sec. 2, the situation is different for privacy-preserving neural networks, as secure computation of non-linearities is much more expensive. Homomorphic encryption methods approximate the ReLU activation with polynomials, and higher polynomial degrees are needed for better accuracy. This comes at a larger computational complexity. While garbled circuits and secret sharing methods present lighter-weight protocols, they have high communication and round complexities. As a result, the number of non-linearities is an important consideration in the design of efficient privacy preserving networks. Different architectures are therefore optimal in the non-secure and privacy-preserving cases.
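A rough operation count makes this concrete. The sketch below (with an arbitrarily chosen, hypothetical layer shape) compares the number of multiply-accumulate operations in a convolution with the number of ReLU evaluations on its output; in plain inference the multiply-accumulates dominate, while in secure inference each ReLU triggers an interactive protocol and the much smaller ReLU count dominates the cost:

```python
def conv_costs(h, w, c_in, c_out, k):
    """Multiply-accumulate count for a stride-1, 'same'-padded convolution,
    versus the number of ReLU evaluations on its output feature map."""
    macs = h * w * c_in * c_out * k * k
    relus = h * w * c_out
    return macs, relus

macs, relus = conv_costs(32, 32, 64, 64, 3)  # hypothetical layer shape
# The ratio macs / relus = c_in * k * k = 576 here: in plain inference the
# convolution dominates; in secure inference the (far fewer) ReLUs do.
assert macs // relus == 64 * 9
```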
Fig. 1 illustrates the remarkable difference between the two scenarios, i.e. secure and non-secure inference. We evaluate the inference runtime of three popular architectures - SqueezeNet, ShuffleNetV2 and MobileNetV2. We can see that in the secure case, the removal of all activations results in more than a runtime reduction, while in the non-secure case the reduction is negligible - around . This highlights that the number of non-linearities must be taken into account in crypto-oriented neural architecture design.
To obtain an analytic understanding of the relative cost of evaluating non-linearities vs. convolutions in privacy-preserving networks, let us consider the analytic cost for a particular protocol, SecureNN [wagh2019securenn]. For a convolution layer with -bit input of size , kernel size and output channels, the round and communication complexities are given by
In comparison, the ReLU protocol has round and communication complexities of:
where denotes the field size - each -bit number is secret shared as a vector of shares, each being a value between and (inclusive).
Consider the toy example of a small neural network with input of size , with a convolution layer with kernel size and output channels, followed by a ReLU activation layer. When considering -bit numbers and field size (following SecureNN), the convolution layer will require rounds and communication, while the ReLU layer will require rounds and communication – x more rounds and x more communication.
In the above, ReLU is only used as an illustration. Our principle applies identically to all other non-linear activation layers, such as LeakyReLU, ELU and SELU, although the exact numerical trade-offs may differ slightly.
3.1 Principle 1: Partial Activation Layers
In order to reduce the number of non-linear operations used, we propose a partial activation layer, illustrated in Fig. 2. Partial activation splits the channels into two branches with different ratios, similarly to the channel split operation suggested in ShuffleNetV2 [ma2018shufflenet]. The non-linear activation is applied on one branch only, and the two branches are then concatenated. By using partial activation we can reduce the number of non-linear operations while keeping the non-linearity of the model. Our experiments show that this operation yields attractive accuracy-efficiency trade-offs, depending on the fraction of non-linear channels.
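A minimal sketch of a partial activation layer, using NumPy for illustration (the function and parameter names are our own):

```python
import numpy as np

def partial_activation(x, ratio=0.5):
    """Apply ReLU to the first `ratio` fraction of channels only.

    x: feature map of shape (channels, height, width).
    In a secure evaluation only the non-linear branch incurs the expensive
    activation protocol; the identity branch is free.
    """
    split = int(x.shape[0] * ratio)
    nonlinear = np.maximum(x[:split], 0)  # ReLU branch
    linear = x[split:]                    # identity branch
    return np.concatenate([nonlinear, linear], axis=0)

x = np.array([[[-1.0, 2.0]], [[-3.0, 4.0]]])  # 2 channels of shape (1, 2)
y = partial_activation(x, ratio=0.5)
assert (y[0] == [[0.0, 2.0]]).all()   # first channel rectified
assert (y[1] == [[-3.0, 4.0]]).all()  # second channel untouched
```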
3.2 Principle 2: Activation Layers Selection
Beyond reducing the number of non-linearities per layer, it is beneficial to simply remove activations in locations where they do not improve the network accuracy. Dong et al. [dong2017eraserelu] and Zhao et al. [zhao2017training] have studied the effect of erasing some ReLU layers and have shown that this sometimes even improves accuracy. Sandler et al. [sandler2018mobilenetv2] also explored the importance of linear layers and incorporated this notion into the bottleneck residual block. While we cannot remove all non-linear layers, we can minimize their use. Our second principle is to carefully evaluate which non-linear layers are necessary and remove the redundant ones.
3.3 Principle 3: Alternative Non-Linearities
Secure computation of non-linear layers is costly, but the cost of different non-linearities varies significantly. We investigated the cost of several commonly used non-linearities and propose more crypto-oriented alternatives.
Previous empirical results show that replacing max pooling with average pooling has minimal effect on network accuracy and (non-secure) inference runtime. Many recent neural networks use both pooling methods, or replace some of them with strided convolutions, which are a computationally efficient alternative [ioffe2015batch, he2016deep, szegedy2016rethinking, szegedy2017inception, chollet2017xception, huang2017densely, sandler2018mobilenetv2, tan2019efficientnet]. In secure inference of neural networks, max and average pooling have very different costs. While max pooling is a non-linear operation which requires computing a complicated protocol, average pooling can be performed simply by summation and multiplication by a constant scalar. For example, in the SecureNN [wagh2019securenn] protocol, the max pooling layer has a round complexity of:
where is the kernel area, and a communication complexity of
where is the number of bits representing the input numbers and denotes the field size - each -bit number is secret shared as a vector of shares, each being a value between 0 and (inclusive). Consider a pooling layer with input image of size and pooling kernel size of . Max pooling would require rounds and MB communication, whereas average pooling can be computed locally by each party, i.e. with no communication required.
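The linearity of average pooling can be demonstrated directly on additive shares. In the sketch below (illustrative ring size and helper names of our own choosing), each party applies sum pooling to its own share locally, and the reconstruction equals pooling the plaintext; the remaining division by the kernel area is a multiplication by a public constant, which real protocols handle with fixed-point arithmetic:

```python
import numpy as np

MOD = 2 ** 32  # small illustrative ring

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(4, 4), dtype=np.uint64)   # private "image"
s0 = rng.integers(0, MOD, size=x.shape, dtype=np.uint64)
s1 = (x - s0) % MOD                                      # additive shares of x

def sum_pool(s, k=2):
    """k x k sum pooling: purely additive, so each party can apply it
    to its own share with no communication."""
    return sum(s[i::k, j::k] for i in range(k) for j in range(k)) % MOD

# each party pools locally; reconstruction matches pooling the plaintext
pooled = (sum_pool(s0) + sum_pool(s1)) % MOD
assert (pooled == sum_pool(x % MOD)).all()
```

By contrast, max pooling requires secure comparisons inside every pooling window, hence the interactive protocol and its round and communication costs.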
Many variants of the ReLU activation function have been proposed with the objective of improving the training procedure. One common variant is the ReLU6 activation [krizhevsky2010convolutional], which is defined as ReLU6(x) = min(max(x, 0), 6).
This activation function is used in several recent efficient architectures, including MobileNetV2 [sandler2018mobilenetv2]. As mentioned in Sec. 2, comparisons are difficult to compute in a secure manner. Therefore, the cost of ReLU6, which consists of two comparisons, is double the cost of the standard ReLU activation. We provide a protocol for the secure computation of ReLU6 and a corresponding analysis in Sec. 6.
We suggest replacing all non-linear pooling layers, particularly max-pooling, and avoiding the use of expensive ReLU variants, such as ReLU6, due to their high cost in secure computation and minimal effect on performance.
In this section, we conduct a sequence of experiments demonstrating the effectiveness of our three design principles for crypto-oriented architectures. Our principles find architectures with better trade-offs between efficiency and accuracy in the privacy-preserving regime than standard architectures.
Efficiency evaluation metric.
The fundamental complexity measures for secure computation are the communication and round complexities, as they capture the structure of the interactive protocol. The runtime of a protocol is hardware- and implementation-specific, both having large variability. In this work we focus on the communication and round complexities, and provide the runtime only in two extreme cases: removing all activations (Fig. 1), and using all of our proposed optimizations (Table 1).
We focused on the case of privacy-preserving inference and assumed the existence of trained models. For this reason, during experiments we trained the different networks in the clear and “encrypted” them to measure accuracy, runtime, round complexity and communication complexity on private data. We used the tf-encrypted framework [tfencrypted] to convert trained neural networks to privacy-preserving neural networks. This implementation is based on secure multi-party computation and uses the SPDZ [damgaard2012multiparty, damgaard2013practical] and SecureNN [wagh2019securenn] protocols as backends. For runtime measurements we used an independent server for each party in the computation, each with 30 CPUs and 50GB RAM.
Due to limited resources we evaluated on the CIFAR-10 dataset [krizhevsky2009learning]. Experiments were conducted on three popular efficient architectures - SqueezeNet [iandola2016squeezenet], ShuffleNetV2 [ma2018shufflenet] and MobileNetV2 [sandler2018mobilenetv2], which were downscaled for the CIFAR-10 dataset. For more details on the downscaling we refer the reader to Sec. A.1.
We train our models using the stochastic gradient descent (SGD) optimizer with Nesterov accelerated gradient and momentum of . We use a cosine learning rate schedule which starts from 0.1 (0.04 for SqueezeNet) and decays to 0. In every experiment we trained from scratch five times and report the average result.
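The cosine schedule mentioned above can be sketched as follows (a standard formulation; the exact schedule used in our experiments may differ in details such as step granularity):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1):
    """Cosine learning-rate schedule decaying from base_lr to 0
    (base_lr=0.04 for SqueezeNet in the setup described above)."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

assert cosine_lr(0, 100) == 0.1            # starts at base_lr
assert abs(cosine_lr(100, 100)) < 1e-12    # decays to ~0
assert abs(cosine_lr(50, 100) - 0.05) < 1e-12  # halfway point
```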
4.1 Principle 1: Partial Activation Layers
We experimented with different partial activation ratios between the number of channels in the non-linear branch and the total number of channels. Results are presented in Fig. 3. A ratio of appears to be a good trade-off between efficiency and accuracy. Note that round complexity was omitted from this comparison, as we assume that element-wise non-linearities can all be computed in parallel, i.e. each round of interaction during the secure computation consists of the communication of all element-wise non-linearities in the layer. Under this reasonable assumption, the round complexity of each layer is constant regardless of the number of non-linearities in it.
Scaling down network width.
The reduction of non-linearities across layers can also be achieved by simply scaling down the architecture’s width, i.e. reducing the number of channels in each layer (equivalent to dropping the no-activation branch). We compared scaling down against partial activation with the same ratio of remaining channels. As can be seen in Fig. 3, scaling down the width is inferior to using partial activation at the original width, demonstrating the importance of both branches in the partial activation layer. Note that as we enlarge the non-linear branch in the partial activation layer, or reduce the number of removed channels in the downscaling, the difference from the original model decreases, and with it the accuracy loss.
4.2 Principle 2: Activation Layers Selection
We evaluated the effectiveness of removing activation layers from each of the three architecture blocks, where each block has two activation layers. Our experiment exhaustively evaluates the effects of removing one or both layers. The results presented in Table 2 clearly demonstrate that one activation layer in each block can be removed completely with an acceptable loss of accuracy.
Activation removal and partial activation.
In order to further minimize the use of non-linearities, we investigated the combination of our novel partial activation layer, discussed in our first principle, with selective removal of activation layers. We evaluated the removal of one activation layer while replacing the other with a -partial activation layer. Results are presented in Table 2 and demonstrate that we were able to significantly improve the communication complexity of the secure inference of SqueezeNet, ShuffleNetV2 and MobileNetV2 by , and , respectively, with reasonable change in accuracy. The round complexity was considerably improved as well with , and improvement for SqueezeNet, ShuffleNetV2 and MobileNetV2, respectively.
4.3 Principle 3: Alternative Non-Linearities
We evaluated the effect of max pooling versus average pooling. SqueezeNet consists of multiple max pooling layers and a global average pooling layer. In the max pooling experiment we replaced the global average pooling layer with global max pooling, while in the average pooling experiment we replaced all max pooling layers with average pooling. MobileNetV2 and ShuffleNetV2 use strided convolutions for dimensionality reduction; to better emphasize the effect of the different pooling methods, we removed the strides and replaced them with pooling layers. Results are presented in Fig. 4. We can see that average pooling is much more efficient, while not significantly affecting accuracy compared to max pooling.
We investigated the effect of using ReLU6 versus ReLU activations. MobileNetV2 was designed with ReLU6, so we simply replaced it with ReLU. ShuffleNetV2 and SqueezeNet use the ReLU activation, which we replaced with ReLU6. Results are presented in Fig. 4. The choice of non-linearity has minimal effect on accuracy, while ReLU is twice as efficient as ReLU6.
4.4 Crypto-Oriented Neural Architectures
We use our three principles to design state-of-the-art crypto-oriented neural network architectures, based on regular state-of-the-art architectures. Specifically, we present crypto-oriented versions of the building blocks in SqueezeNet [iandola2016squeezenet], ShuffleNetV2 [ma2018shufflenet] and MobileNetV2 [sandler2018mobilenetv2]. For illustrative purposes, we describe in detail the application of our principles on the inverted residual with linear bottleneck blocks from MobileNetV2, illustrated in Fig. 6. A more detailed description of the applications of our principles for the other blocks is presented in Sec. A.2. Final results are presented in Table 1.
Partial activation layers: In order to reduce the number of non-linear evaluations, we replaced all activation layers in the inverted residual with linear bottleneck block with -partial activation layers. This results in an improvement of in communication complexity.
Activation layer selection: After careful evaluation, we removed the first activation layer completely (i.e. the depthwise convolution is the only non-linear layer). This reduces the communication complexity by and the round complexity by . Combining this change with the former, i.e. replacing the remaining activation layer with a partial activation, results in an additional improvement of in communication complexity. Overall, by applying the changes motivated by these two design principles we reduce the communication and round complexity by and , respectively.
Alternative non-linearities: As discussed in Sec. 3.3, the ReLU6 variant costs twice as much as the ReLU activation. Therefore, we replace the ReLU6 activation function with the ReLU function. This change produces an improvement of in communication complexity and in round complexity.
The above modifications, motivated by each of our design principles yield an improvement of in communication complexity and in round complexity.
Due to the slow inference runtime of secure neural networks, we measured accuracy in the non-secure setting. As the SecureNN framework, on which we base our analysis, applies no approximations during inference, there should be no significant difference between the secure and non-secure accuracy. In order to verify this assumption, we measured the secure inference accuracy on a subset of experiments, using the tf-encrypted library [tfencrypted], and compared it to the non-secure accuracy of the same model. The results, presented in Sec. A.3, show a minor loss of accuracy.
Comparison to other frameworks.
Our analysis focused on the SecureNN protocol [wagh2019securenn]. We stress that the same idea applies to other frameworks. For example, consider the Gazelle framework [juvekar2018gazelle], which proposed a hybrid between homomorphic encryption for linear layers and garbled circuits for non-linear layers. According to the benchmarks provided by the authors, a convolution layer with input of size , kernel size and output channels, i.e. output neurons, will take ms and require no communication. In comparison, a ReLU layer with neurons, less than a third of the output of the aforementioned convolution layer, will take ms and MB.
It should be noted that there are frameworks for which secure computation of a convolution layer is more expensive than secure computation of a ReLU layer. An example is the DeepSecure framework [rouhani2018deepsecure], which only uses garbled circuits. The optimal architectures might change slightly in this case, but the core idea of our work, i.e. the need for designing crypto-oriented architectures, remains highly relevant.
In order to minimize the accuracy reduction, we tried to gain more expressiveness by increasing the number of channels with no activation. As discussed in Sec. 3, based on the analysis of the SecureNN [wagh2019securenn] framework (and others, as discussed above), secure computation of convolution layers is cheaper than that of activation layers, so we can add channels when removing non-linearities. Results are detailed in Sec. A.4, and show a minor increase in accuracy at the cost of slightly increased communication. The difference was not significant enough to be included in our final crypto-oriented architectures.
6 Analysis of ReLU Alternatives
In order to analyse the cost of secure computation of ReLU6, we modify the ReLU protocol suggested in [wagh2019securenn] and provide a protocol for the secure computation of ReLU6. ReLU6 can be decomposed into a combination of Heaviside step functions ():
The protocol is described in Algorithm 1, and is a step-by-step secure evaluation of Eq. (10) via secret sharing. We denote the model provider by and the data owner by . represents the crypto-producer, a third-party “assistant” that provides randomness. and are the two secret shares of over . and are the secure protocols presented in [wagh2019securenn] for computing and matrix multiplication (for scalar multiplication, we use MatMul with matrices), respectively. For more details we refer the reader to [wagh2019securenn]. The proof of this protocol follows directly via the security of the two underlying protocols and the fact that at each point in time the parties learn only secret shares of the current state of the computation. The round and communication complexities of ReLU6 under this protocol are specified in Eq. (12)–(13).
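Although Eq. (10) is not reproduced here, one valid decomposition of ReLU6 into Heaviside steps, which can be checked numerically, is ReLU6(x) = x·H(x) − (x − 6)·H(x − 6) (this rewriting is ours and may differ in form from the paper's Eq. (10)). It uses two step evaluations, consistent with ReLU6 costing two secure comparisons:

```python
import numpy as np

def heaviside(x):
    # step function: 1 for x >= 0, else 0 (the convention at 0 does not
    # affect ReLU6, since both terms vanish there)
    return (x >= 0).astype(float)

def relu6_via_steps(x):
    """ReLU6(x) = x*H(x) - (x - 6)*H(x - 6): each Heaviside evaluation
    corresponds to one secure comparison, hence twice the cost of ReLU."""
    return x * heaviside(x) - (x - 6) * heaviside(x - 6)

x = np.array([-2.0, 0.0, 3.0, 6.0, 10.0])
assert (relu6_via_steps(x) == np.minimum(np.maximum(x, 0), 6)).all()
```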
Another popular non-linearity, often used as an alternative to ReLU, is the LeakyReLU activation. This activation function can also be written as:
We modified the ReLU protocol suggested in [wagh2019securenn] to obtain a protocol which is a step-by-step secure evaluation of Eq. (15). A description of the protocol is provided in Sec. A.5. Secure computation of LeakyReLU differs from that of ReLU only by a constant scalar multiplication, and therefore has the same communication and round complexity. LeakyReLU can thus be used instead of ReLU at no additional cost.
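A plain (non-secure) sketch of the identity underlying this observation: LeakyReLU can be rewritten as a single ReLU plus public-constant scalar operations, so no extra comparison is needed (this rewriting is ours and is not necessarily the form of Eq. (15)):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def leaky_relu_via_relu(x, alpha=0.01):
    """LeakyReLU(x) = alpha*x + (1 - alpha)*ReLU(x): beyond one ReLU, only
    multiplications by public constants are needed, so the secure round and
    communication complexities match those of plain ReLU."""
    return alpha * x + (1 - alpha) * relu(x)

x = np.array([-3.0, -1.0, 0.0, 2.0])
expected = np.where(x >= 0, x, 0.01 * x)
assert np.allclose(leaky_relu_via_relu(x), expected)
```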
Algorithm 1 :
Input: hold and , respectively.
Output: get ReLU6 and ReLU6.
Common Randomness: hold random shares of 0 over , denoted by and resp.
1. Run with having input and learn and , resp.
2. Call with having input and learn and , resp.
3. Run with having input and learn and , resp.
4. Call with having input and learn and , resp.
5. For , outputs .
We addressed efficiency challenges in privacy-preserving neural network inference. Motivated by the unique properties of secure computation, we proposed three design principles for crypto-oriented neural network architectures: partial activation layers, activation layer selection and alternative non-linearities. By applying our design principles to three state-of-the-art architectures (SqueezeNet, ShuffleNetV2 and MobileNetV2) we achieved significant improvements on all of them. On MobileNetV2, for example, we achieved an improvement of in communication complexity, in round complexity and in secure inference runtime, with only a reasonable loss in accuracy.
This research has been supported by the Israel ministry of Science and Technology, by the Israel Science foundation, and by the European Union’s Horizon 2020 Framework Program (H2020) via an ERC Grant (Grant No. 714253).
a.1 CIFAR-10 Downscaling
Experiments were conducted on three popular efficient architectures - SqueezeNet [iandola2016squeezenet], ShuffleNetV2 [ma2018shufflenet] and MobileNetV2 [sandler2018mobilenetv2]. These architectures were designed for the ImageNet dataset [deng2009imagenet], a large-scale dataset. Due to limited resources we evaluated on the CIFAR-10 dataset [krizhevsky2009learning], which required downscaling the architectures accordingly. We detail the modifications applied to each architecture:
SqueezeNet: Changed the kernel size of the first convolution layer from to and reduced the stride from to . In each pooling layer, we replaced the kernel with a kernel. In addition, we removed the dropout layer and added batch normalization layers after every convolution layer.
ShuffleNetV2: Reduced the stride in the first convolution layer from to . In addition, we removed the first pooling layer.
MobileNetV2: Reduced the stride in the first convolution layer and in the first inverted residual block from to . In addition, we increased the weight decay from to .
a.2 Other Crypto-Oriented Neural Architectures
We present our crypto-oriented version of the building blocks in SqueezeNet [iandola2016squeezenet] and ShuffleNetV2 [ma2018shufflenet].
Partial activation layers:
In order to reduce the number of non-linearities we replaced the activation layers in the Fire module (i.e. the SqueezeNet building block) with -partial activation layers. This reduces the communication complexity by .
Activation layers removal:
We removed the first activation layer in the Fire module, i.e. from the squeeze phase of the block. This results in an improvement of in communication complexity and in round complexity. Combining this change with the former, i.e. replacing the remaining activations with -partial activation layers, further improves the communication complexity by . Overall, by applying both changes we reduce communication and round complexity by and , respectively.
Alternative non-linearities: We replaced each max pooling layer with an average pooling layer, as max pooling is very expensive to compute in a secure manner. This improved the communication complexity by and the round complexity by .
Our final crypto-oriented version of the SqueezeNet architecture improves over its non-crypto-oriented counterpart by in communication complexity and in round complexity. This comes at the cost of a reasonable accuracy loss of . Our crypto-oriented Fire module block is presented in Fig. 7.
Partial activation layers:
We replaced all activation layers in the ShuffleNetV2 unit by -partial activation layers. This improves the communication complexity by .
Activation layers removal:
We removed the second activation layer from the ShuffleNetV2 unit. This results in an improvement of in communication complexity and in round complexity. By further reducing the number of non-linearities and replacing the remaining activation layer with a -partial activation layer, we get an additional improvement of in communication complexity.
We did not change the non-linearities in the architecture, as the ShuffleNetV2 unit does not use expensive variants such as ReLU6. In addition, the max pooling layer in the original design of the architecture was removed in our downscaling process, detailed above.
Applying the aforementioned modifications, motivated by our design principles, improves communication and round complexity by and , respectively. This optimization incurs a small accuracy loss of . Our crypto-oriented ShuffleNetV2 unit is presented in Fig. 8.
A.3 Encrypted Accuracy
In our experiments, we measured accuracy in the non-secure setting, due to the slow inference time of secure neural networks. Our experiments were conducted using the tf-encrypted library [tfencrypted], which is based on the SPDZ [damgaard2012multiparty, damgaard2013practical] and SecureNN [wagh2019securenn] protocols. As this implementation does not apply any approximations during inference, we do not expect a significant difference between the secure and non-secure accuracy measurements. To verify this assumption, we evaluated the secure accuracy of our final crypto-oriented architectures and compared the results against the non-secure accuracy. As in all of our experiments, each configuration was evaluated five times, and we report the average results. The results presented in Table 3 show that the accuracy difference is indeed negligible.
A.4 Double Channels Results
As mentioned in the discussion section, we tried to reduce the accuracy loss resulting from the minimization of activations, i.e. the removal of activation layers and the replacement of the remaining layers with partial activation layers, described in Table 4. We evaluated the effect of increasing the number of channels in layers with no activations by a factor of two. The goal is to amplify the model's expressiveness without adding further non-linearities. As mentioned in the paper and based on the analysis of the SecureNN framework [wagh2019securenn], the added cost of increasing the convolution channels is less than the cost of the removed activations, enabling us to "compensate" for the removal of activations with more convolutional channels. As can be seen in the results, detailed in Table 4, increasing the number of channels had a minor effect on accuracy while increasing the costs. The benefit of increasing the channels was not significant enough to be included in our final crypto-oriented architectures.
A.5 LeakyReLU Protocol
We present a protocol, based on the ReLU protocol from [wagh2019securenn], for the secure computation of the LeakyReLU activation function. For a slope parameter $0 < \alpha < 1$, the LeakyReLU activation is defined as:
\[
\mathrm{LeakyReLU}(x) = \begin{cases} x & x \geq 0 \\ \alpha x & x < 0 \end{cases}
\]
This activation function can also be written as:
\[
\mathrm{LeakyReLU}(x) = \alpha x + (1 - \alpha)\,\mathrm{ReLU}(x)
\]
Note that the secure computation of LeakyReLU differs from the secure computation of ReLU, provided in [wagh2019securenn], only by a constant scalar multiplication, and therefore has the same communication and round complexity. This suggests that the LeakyReLU activation function can be used instead of ReLU at no additional cost.
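The identity underlying this equivalence, $\mathrm{LeakyReLU}(x) = \alpha x + (1-\alpha)\mathrm{ReLU}(x)$ for slope $0 < \alpha < 1$, can be checked numerically. The following is a plain, non-secure NumPy sketch; the slope value 0.2 is an arbitrary choice for illustration:

```python
import numpy as np

alpha = 0.2  # illustrative slope; any 0 < alpha < 1 works
x = np.linspace(-5, 5, 101)

relu = np.maximum(x, 0)
# Direct piecewise definition: x for x >= 0, alpha * x for x < 0.
leaky_direct = np.where(x >= 0, x, alpha * x)
# Rewritten form: ReLU plus one scalar multiplication and addition,
# which are cheap (local) operations on secret shares.
leaky_rewritten = alpha * x + (1 - alpha) * relu
```

Since the rewritten form adds only constant-scalar operations on top of ReLU, a secure LeakyReLU inherits the ReLU protocol's communication and round complexity.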
Algorithm 2. :
Input: hold and , respectively.
Output: get LeakyReLU and LeakyReLU.
Common Randomness: hold random shares of 0 over , denoted by and resp.
1. run with having input and learn and , resp.
2. call with having input and learn and , resp.
3. For , outputs .