Crypto-Oriented Neural Architecture Design

11/27/2019 · by Avital Shafran, et al.

As neural networks revolutionize many applications, significant privacy concerns emerge. Owners of private data wish to use remote neural network services while ensuring that their data cannot be interpreted by others. Service providers wish to keep their model private to safeguard its intellectual property. Such privacy conflicts may slow down the adoption of neural networks in sensitive domains such as healthcare. Privacy issues have been addressed in the cryptography community in the context of secure computation. However, secure computation protocols have known performance issues; for example, the runtime of secure inference in deep neural networks is three orders of magnitude longer than that of non-secure inference. Therefore, much research effort has been devoted to optimizing cryptographic protocols for secure inference. We take a complementary approach, and provide design principles for crypto-oriented neural network architectures that reduce the runtime of secure inference. The principles are evaluated on three state-of-the-art architectures: SqueezeNet, ShuffleNetV2, and MobileNetV2. Our novel method significantly improves the efficiency of secure inference on common evaluation metrics.

1 Introduction

Deep neural networks are revolutionizing many applications, but practical use may be slowed down by privacy concerns. As an illustrative example, let us consider a hospital that wishes to use an external diagnosis service for its medical images (e.g. MRI scans). In some cases the hospital would be prevented from sharing the medical data of its patients for privacy reasons. On the other hand, the diagnosis company may not be willing to share its model with the hospital to safeguard its intellectual property. Such privacy conflicts could prevent hospitals from using neural network services for improving healthcare. The ability to evaluate neural network models on private data will allow the use of neural network services in privacy-sensitive applications.

The privacy challenge has attracted significant research in the cryptography community. Cryptographic tools were developed to convert any computation into secure computation, i.e. computation in which the data of each involved party is guaranteed to reveal no information to the other parties. The deep learning setting consists of two parties, one providing the data and the other providing the neural network model. Secure computation is significantly slower than non-secure computation and requires much higher networking bandwidth. Recently, various approaches were proposed for secure computation of neural networks [bae2018security, tanuwidjaja2019survey]. Due to the efficiency limitations of secure computation, these approaches are limited to simple architectures, decreasing their accuracy and applicability.

Instead of using existing architectures and optimizing the cryptographic protocols, we take a complementary approach: we propose to design new neural network architectures that are crypto-oriented. For example, while non-linear functions such as ReLU are very costly to evaluate in privacy-preserving computations, they are almost free in plain computations. This suggests that the design of crypto-oriented architectures needs to differ from that of non-crypto-oriented architectures.

Our Contribution.

We propose three design principles for crypto-oriented architectures, derived from empirical observations on multiple state-of-the-art architectures:

  • Principle 1: Partial activation layers. Non-linear activations such as ReLU are very expensive in secure computation. We propose to split each layer into two branches and apply the non-linear activation to one branch only, significantly reducing the required resources.

  • Principle 2: Activation layers selection. We propose to eliminate activation layers whose removal makes no significant impact on accuracy.

  • Principle 3: Alternative non-linearities.

    Many commonly used non-linearities have alternative variants with similar expressiveness. We propose to select variants having lower resource requirements, e.g. avoid max pooling and ReLU6.

Motivated by our design principles, we present new crypto-oriented architectures based on three popular (non-crypto-oriented) efficient neural network architectures - MobileNetV2 [sandler2018mobilenetv2], ShuffleNetV2 [ma2018shufflenet] and SqueezeNet [iandola2016squeezenet]. Our new architectures are significantly more efficient than their non-crypto-oriented counterparts, with a reasonable loss of accuracy.

Model          Accuracy   Comm. (MB)   Rounds   Runtime (sec)
Squeeze-orig     92.49       326.41      393        14.59
Squeeze-ours     91.87       148.8       232         9.03
Shuffle-orig     92.6        310.89      486        24.37
Shuffle-ours     92.5        156.89      294        14.88
Mobile-orig      94.49      1925.42      808        41.01
Mobile-ours      93.44       402.6       296        17.11
Table 1: Comparison of performance on secure CIFAR-10 classification between known networks (SqueezeNet, ShuffleNetV2, and MobileNetV2) and our proposed crypto-oriented modifications. A substantial increase in efficiency is obtained with a reasonable reduction in accuracy.

2 Background

2.1 Privacy-Preserving Machine Learning

Research on privacy-preserving machine learning has so far focused on two main challenges: privacy-preserving training and privacy-preserving inference. Privacy-preserving training [shokri2015privacy, abadi2016deep, bonawitz2017practical] aims at enabling neural networks to be trained on private data. This happens, for example, when private training data arrives from different sources, and data privacy must be protected from all other parties.

In this work we address the challenge of privacy-preserving inference. A pre-trained neural network is provided, and the goal is to transform the network to process (possibly interactively) encrypted data. The network’s output should also be encrypted, and only the data owner can decode it. This enables users with private data, such as medical records, to rely on the services of a model provider.

Existing privacy-preserving inference methods [bae2018security, tanuwidjaja2019survey] rely on three cryptographic approaches, developed by the cryptography community in the context of secure computation: homomorphic encryption, garbled circuits, and secret sharing. A neural network N of depth d can be represented as a composition of d layers:

N(x) = L_d ∘ L_{d-1} ∘ ... ∘ L_1(x),   (1)

where L_i is the i-th layer of the network and x is its input. Using the above cryptographic tools, each layer L_i can be transformed into a privacy-preserving layer L̃_i such that, given an encoding x̃ of a private input x, the output of

Ñ(x̃) = L̃_d ∘ L̃_{d-1} ∘ ... ∘ L̃_1(x̃)   (2)

is encrypted as well, and can be decoded only by the owner of x to recover N(x).

The challenge of practicality.

Despite the extensive research within the cryptography community towards more practical secure computation protocols, the above approaches are practical mainly for simple computations. In particular, homomorphic encryption and secret sharing are most suitable for layers that correspond to affine functions (or to polynomials of small constant degrees). Non-affine layers (e.g. ReLU or Max Pooling) lead to significant overhead, both in computation and in communication. Garbled circuits can be efficient for layers corresponding to functions that can be represented via small Boolean circuits, but interaction between the parties for computing every layer is required. This may be undesirable in many scenarios.

Homomorphic encryption.

Homomorphic encryption [gentry2009fully, brakerski2014leveled] allows computing an arbitrary function over an encrypted input, without decryption or knowledge of the private key. In other words, for every function f and encrypted input x̃ it is possible to compute an encryption of f(x) without knowing the secret key that was used to encrypt x. Gilad-Bachrach et al. [gilad2016cryptonets] relied on homomorphic encryption in their CryptoNets system, replacing the ReLU activation layers with a square activation. However, this approach significantly increased the overall inference time. [hesamifard2017cryptodl, chabanne2017privacy, sanyal2018tapas, chou2018faster, bourse2018fast] also propose optimization methods using homomorphic encryption.
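
As a toy illustration (plain NumPy, no encryption involved), the sketch below contrasts ReLU with a square activation of the kind used by CryptoNets: the square is a degree-2 polynomial, i.e. only additions and multiplications, which homomorphic schemes handle natively, whereas ReLU requires a comparison.

```python
import numpy as np

def relu(x):
    # Comparison-based: expensive to evaluate under homomorphic encryption.
    return np.maximum(x, 0.0)

def square_activation(x):
    # Degree-2 polynomial: only multiplications and additions, HE-friendly.
    return x * x

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))               # [0.   0.   0.   1.5  3. ]
print(square_activation(x))  # [4.   0.25 0.   2.25 9.  ]
```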

Garbled circuits.

In the context of layer-by-layer transformations, garbled circuits [yao1986generate] can be roughly viewed as a one-time variant of homomorphic encryption [rouhani2018deepsecure, juvekar2018gazelle, riazi2019xonn]. Consider two parties, A and B, where A holds a function f (corresponding to a single layer of the network) and B holds an input x. The function f is transformed by A into a garbled circuit that computes f on a single encoded input. B encodes its input x, and then one of the parties can compute an encoding of f(x) from which f(x) can be retrieved.

Secret sharing.

Secret sharing schemes [Shamir79, Beimel11] provide the ability to share a secret between two or more parties. The secret can be reconstructed by combining the shares of any "authorized" subset of the parties (e.g., all parties or any subset of at least a certain size). The shares of any "unauthorized" subset do not reveal any information about the secret. As discussed above, secret sharing schemes enable privacy-preserving evaluation of neural networks in a layer-by-layer fashion, where the parties use their shares of all values on a given layer to compute shares of all values on the next layer [mohassel2017secureml, liu2017oblivious, riazi2018chameleon, wagh2019securenn].
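
To make the layer-by-layer evaluation on shares concrete, below is a minimal plaintext simulation of two-party additive secret sharing (the modulus and the toy weights are arbitrary choices for illustration, not parameters of any of the cited protocols). It shows why an affine layer can be evaluated locally on shares, while a non-linearity cannot.

```python
import numpy as np

MOD = 2 ** 32  # illustrative ring size (an assumption for this sketch)

def share(x, rng):
    """Split an integer vector x into two additive shares modulo MOD."""
    s0 = rng.integers(0, MOD, size=x.shape, dtype=np.uint64)
    s1 = (x - s0) % MOD
    return s0, s1

def reconstruct(s0, s1):
    return (s0 + s1) % MOD

rng = np.random.default_rng(0)
x = np.array([3, 10, 25], dtype=np.uint64)
W = np.array([[1, 2, 0], [0, 1, 4], [5, 0, 1]], dtype=np.uint64)  # public weights

x0, x1 = share(x, rng)
# Each party applies the affine layer to its own share locally -- no interaction.
y0, y1 = (W @ x0) % MOD, (W @ x1) % MOD
assert np.array_equal(reconstruct(y0, y1), (W @ x) % MOD)
# A non-linearity such as ReLU cannot be applied share-wise; it needs an interactive protocol.
```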

2.2 Efficient Neural Network Architecture Design

Real-world tasks require both accuracy and efficiency, sometimes under additional constraints, e.g. hardware. This has led to much work on designing deep neural network architectures that optimally trade off accuracy and efficiency. SqueezeNet [iandola2016squeezenet], an early approach, reduced the number of model parameters by replacing the commonly used 3x3 convolution filters with 1x1 filters and by using squeeze and expand modules. Recent works shifted the focus from reducing parameters to minimizing the number of operations. MobileNetV1 [howard2017mobilenets] utilizes depthwise separable convolutions to reduce model complexity and improve efficiency. MobileNetV2 [sandler2018mobilenetv2] further improved this approach by introducing the inverted residual with linear bottleneck block. ShuffleNetV1 [zhang2018shufflenet] relies on pointwise group convolutions to reduce complexity and proposed the channel shuffle operation to help information flow across feature channels. ShuffleNetV2 [ma2018shufflenet] proposed guidelines for the design of efficient deep neural network architectures and suggested an improvement over the ShuffleNetV1 architecture.

2.3 Efficiency Metrics.

The efficiency of standard neural networks is typically measured in FLOPs (floating-point operations). Privacy-preserving neural networks require different metrics, due to the interactivity introduced by the cryptographic protocols. The main measures of efficiency for such protocols are their overall communication volume (communication complexity) and the number of rounds of interaction between the parties (round complexity) [yao1982protocols, yao1986generate, goldreich1987play, beaver1990round, ishai2000randomizing, franklin1992communication, kushilevitz1997communication, kushelvitz1992privacy, goldwasser1997multi].

3 Designing Crypto-Oriented Networks

Figure 1: Removing all activation layers has a negligible effect on non-secure inference but drastically reduces the complexity of secure inference. Comparison of inference runtime for SqueezeNet, ShuffleNetV2 and MobileNetV2. (a) Secure inference. (b) Non-secure inference. Blue: original network. Orange: activations removed.

Our goal is to design neural networks that can be computed efficiently in a secure manner, providing privacy-preserving inference. We propose three principles for designing crypto-oriented neural network architectures. These principles exploit the trade-offs introduced by the cryptographic techniques that enable privacy-preserving inference.

In non-secure computations the cost of affine operations like addition or multiplication is almost the same as the cost of non-linearities such as maximum or ReLU. As typical neural networks consist of many more additions and multiplications than non-linearities, the cost of non-linearities is negligible [cong2014minimizing, hunsberger2016training]. Efficient network designs therefore try to limit the number and size of network layers, not taking into account the number of non-linearities.

As explained in Sec. 2, the situation is different for privacy-preserving neural networks, as secure computation of non-linearities is much more expensive. Homomorphic encryption methods approximate the ReLU activation with polynomials, where higher polynomial degrees are needed for better accuracy, at the cost of higher computational complexity. While garbled circuits and secret sharing methods offer lighter-weight protocols, they have high communication and round complexities. As a result, the number of non-linearities is an important consideration in the design of efficient privacy-preserving networks. Different architectures are therefore optimal in the non-secure and privacy-preserving cases.

Fig. 1 illustrates the remarkable difference between the two scenarios, i.e. secure and non-secure inference. We evaluate the inference runtime of three popular architectures - SqueezeNet, ShuffleNetV2 and MobileNetV2. In the secure case, removing all activations drastically reduces the runtime, while in the non-secure case the reduction is negligible. This highlights that the number of non-linearities must be taken into account in crypto-oriented neural architecture design.

To obtain an analytic understanding of the relative cost of non-linearity vs. convolution evaluation in privacy-preserving networks, let us consider the cost under a particular protocol, SecureNN [wagh2019securenn]. In SecureNN, a convolution layer is computed with a small, constant number of interaction rounds, and its communication is proportional to the sizes of the input, the kernel and the output (in ℓ-bit words). The ReLU protocol, in contrast, requires considerably more rounds, and its communication grows with both the bit-width ℓ of the inputs and the field size p: each ℓ-bit number is secret shared as a vector of ℓ shares, each a value between 0 and p-1 (inclusive). We refer the reader to [wagh2019securenn] for the exact expressions.

Consider the toy example of a small neural network consisting of a single convolution layer followed by a ReLU activation layer. Under SecureNN, the ReLU layer dominates the cost of the network, requiring many times more rounds and more communication than the convolution layer.

In the above, ReLU is only used as an illustration. Our principle applies identically to all other non-linear activation layers, such as Leaky-ReLU, ELU and SELU, although the exact numerical trade-offs may differ slightly.

3.1 Principle 1: Partial Activation Layers

Figure 2: In partial activation layers, the channels are split into two branches and the non-linear activation is applied to only one of them.

In order to reduce the number of non-linear operations, we propose a partial activation layer, illustrated in Fig. 2. Partial activation splits the channels into two branches with a configurable ratio, similarly to the channel split operation suggested in ShuffleNetV2 [ma2018shufflenet]. The non-linear activation is applied to one branch only, and the two branches are then concatenated. Partial activation reduces the number of non-linear operations while keeping the model non-linear. Our experiments show that this operation results in attractive accuracy-efficiency trade-offs, depending on the fraction of channels passed through the non-linearity.
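
A minimal Keras-style sketch of such a layer (the class name and the default split ratio are our own, for illustration only):

```python
import tensorflow as tf

class PartialReLU(tf.keras.layers.Layer):
    """Apply ReLU to a fraction of the channels only (illustrative sketch)."""

    def __init__(self, ratio=0.5, **kwargs):
        super().__init__(**kwargs)
        self.ratio = ratio  # fraction of channels passed through the non-linearity

    def call(self, x):
        k = int(round(x.shape[-1] * self.ratio))
        activated, linear = x[..., :k], x[..., k:]   # channel split, as in ShuffleNetV2
        return tf.concat([tf.nn.relu(activated), linear], axis=-1)

# Usage: y = PartialReLU(ratio=0.5)(conv_output)
```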

3.2 Principle 2: Activation Layers Selection

Beyond reducing the number of non-linearities per layer, it is beneficial to simply remove activations in locations where they do not improve the network accuracy. Dong et al. [dong2017eraserelu] and Zhao et al. [zhao2017training] have studied the effect of erasing some ReLU layers and have shown that this sometimes even improves accuracy. Sandler et al. [sandler2018mobilenetv2] also explored the importance of linear layers and incorporated this notion into the bottleneck residual block. While we cannot remove all non-linear layers, we can minimize their use. Our second principle is to carefully evaluate which non-linear layers are necessary and remove the redundant ones.

3.3 Principle 3: Alternative Non-Linearities

Secure computation of non-linear layers is costly, but the cost of different non-linearities varies significantly. We investigated the cost of several commonly used non-linearities and propose more crypto-oriented alternatives.

Pooling:

Previous empirical results show that replacing max pooling with average pooling has minimal effect on network accuracy and on (non-secure) inference runtime. Many recent neural networks use both pooling methods, or replace some of them with strided convolutions, a computationally efficient alternative for downsampling [ioffe2015batch, he2016deep, szegedy2016rethinking, szegedy2017inception, chollet2017xception, huang2017densely, sandler2018mobilenetv2, tan2019efficientnet]. In secure inference of neural networks, however, max and average pooling have very different costs. Max pooling is a non-linear operation that requires an expensive interactive protocol: in SecureNN [wagh2019securenn], for example, both the round and the communication complexities of the max pooling protocol grow with the kernel area, and the communication also grows with the bit-width ℓ of the inputs and the field size p (each ℓ-bit number is secret shared as a vector of ℓ shares, each a value between 0 and p-1, inclusive). Average pooling, by contrast, consists only of summation and multiplication by a constant scalar, so it can be computed locally by each party with no communication required.
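
The fact that average pooling is an affine map, and can therefore be applied by each party locally to its own shares, is easy to see in code; a plaintext sketch assuming non-overlapping 2x2 windows:

```python
import numpy as np

def avg_pool_2x2(x):
    """Average pooling is purely linear: a summation and a constant scale.
    Under secret sharing, each party can apply it to its shares locally."""
    h, w = x.shape
    return 0.25 * (x[0:h:2, 0:w:2] + x[1:h:2, 0:w:2] + x[0:h:2, 1:w:2] + x[1:h:2, 1:w:2])

def max_pool_2x2(x):
    """Max pooling needs comparisons, which require an interactive secure protocol."""
    h, w = x.shape
    windows = np.stack([x[0:h:2, 0:w:2], x[1:h:2, 0:w:2], x[0:h:2, 1:w:2], x[1:h:2, 1:w:2]])
    return windows.max(axis=0)

x = np.arange(16, dtype=float).reshape(4, 4)
print(avg_pool_2x2(x))  # [[ 2.5  4.5] [10.5 12.5]]
print(max_pool_2x2(x))  # [[ 5.  7.] [13. 15.]]
```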

ReLU.

Many variants of the ReLU activation function have been proposed with the objective of improving the training procedure. One common variant is the ReLU6 activation [krizhevsky2010convolutional], defined as:

ReLU6(x) = min(max(x, 0), 6).   (9)

This activation function is used in several recent efficient architectures, including MobileNetV2 [sandler2018mobilenetv2]. As mentioned in Sec. 2, comparisons are expensive to compute in a secure manner. Therefore the cost of ReLU6, which consists of two comparisons, is double the cost of the standard ReLU activation. We provide a protocol for the secure computation of ReLU6 and a corresponding analysis in Sec. 6.
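
For reference, ReLU6 clamps the activation from both sides, so its evaluation involves two comparisons where ReLU needs only one; a short NumPy illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)                   # one comparison per element

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)  # two comparisons per element

x = np.array([-3.0, 2.0, 7.5])
print(relu(x), relu6(x))  # [0.  2.  7.5] [0. 2. 6.]
```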

We suggest replacing non-linear pooling layers, in particular max pooling, with average pooling, and avoiding expensive ReLU variants such as ReLU6, due to their high cost in secure computation and minimal effect on accuracy.

4 Experiments

Figure 3: Comparison between different partial activation ratios on SqueezeNet, ShuffleNetV2, and MobileNetV2, in terms of accuracy (a) and communication complexity (b). An intermediate ratio presents a good balance between accuracy and efficiency. (c)-(e): Comparison between partial activation and reducing the network width at the same ratio (i.e. removing the no-activation branch) for SqueezeNet (c), ShuffleNetV2 (d) and MobileNetV2 (e). The results demonstrate that both branches contribute significantly to the accuracy.

In this section, we conduct a sequence of experiments demonstrating the effectiveness of our three design principles for crypto-oriented architectures. Our principles find architectures with better trade-offs between efficiency and accuracy in the privacy-preserving regime than standard architectures.

Efficiency evaluation metric.

The fundamental complexity measures for secure computation are the communication and round complexities, as they represent the structure of the interactive protocol. The runtime of a protocol is hardware and implementation specific, and both introduce large variability. In this work we therefore focus on the communication and round complexities, and report runtime only for two extreme cases: removing all activations (Fig. 1) and applying all of our proposed optimizations (Table 1).

Implementation details.

We focused on the case of privacy-preserving inference and assumed the existence of trained models. For this reason, during experiments we trained the different networks in the clear and "encrypted" them to measure accuracy, runtime, round complexity and communication complexity on private data. We use the tf-encrypted framework [tfencrypted] to convert trained neural networks into privacy-preserving neural networks. This implementation is based on secure multi-party computation and uses the SPDZ [damgaard2012multiparty, damgaard2013practical] and SecureNN [wagh2019securenn] protocols as its backend. For runtime measurements we used an independent server for each party in the computation, each with 30 CPUs and 50 GB of RAM.

Due to limited resources we evaluated on the CIFAR-10 dataset [krizhevsky2009learning]. Experiments were conducted on three popular efficient architectures - SqueezeNet [iandola2016squeezenet], ShuffleNetV2 [ma2018shufflenet] and MobileNetV2 [sandler2018mobilenetv2], which were downscaled for the CIFAR-10 dataset. For more details on the downscaling we refer the reader to Sec. A.1.

Training details.

We train our models using the stochastic gradient descent (SGD) optimizer with Nesterov accelerated gradient and momentum. We use a cosine learning rate schedule that starts at 0.1 (0.04 for SqueezeNet) and decays to 0. In every experiment we trained from scratch five times and report the average result.
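
A hypothetical Keras training configuration matching this description (the momentum value and the number of decay steps are placeholders, not values taken from this paper):

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,  # 0.04 for SqueezeNet
    decay_steps=100_000,        # placeholder: total number of training steps
    alpha=0.0,                  # decay all the way to 0
)
optimizer = tf.keras.optimizers.SGD(
    learning_rate=schedule,
    momentum=0.9,               # placeholder value, not specified here
    nesterov=True,
)
```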

4.1 Principle 1: Partial Activation Layers

Partial activation.

We experimented with different partial activation ratios, i.e. the ratio between the number of channels in the non-linear branch and the total number of channels. Results are presented in Fig. 3; an intermediate ratio appears to be a good trade-off between efficiency and accuracy. Note that round complexity is omitted from this comparison, as we assume that element-wise non-linearities can all be computed in parallel, i.e. each round of interaction during the secure computation carries the communication for all element-wise non-linearities in the layer. Under this reasonable assumption, the round complexity of a layer is constant regardless of the number of non-linearities in it.

Scaling down network width.

The reduction of non-linearities across layers can also be achieved by simply scaling down the architecture's width, i.e. reducing the number of channels in each layer (equivalent to dropping the no-activation branch). We compared the performance of scaling down the width with that of partial activation at the same ratio of remaining channels. As can be seen in Fig. 3, scaling down the width is inferior to using partial activation with the original width, demonstrating the importance of both branches in the partial activation layer. Note that as we enlarge the non-linear branch of the partial activation layer, or reduce the number of removed channels when scaling down, the difference from the original model decreases, resulting in a smaller accuracy loss.

4.2 Principle 2: Activation Layers Selection

Model                  Accuracy   Comm. (MB)   Rounds
Squeeze-1st              90.54      188.56       233
Squeeze-2nd              93.15      309.18       313
Squeeze-1st-partial      90.4       179.95       233
Squeeze-2nd-partial      92.66      240.26       313
Squeeze-none             86.95      171.33       153
Squeeze-orig             92.49      326.41       393
Shuffle-1st              92.83      218.28       294
Shuffle-2nd              92.9       188.12       324
Shuffle-1st-partial      92.5       156.89       294
Shuffle-2nd-partial      92.19      141.82       324
Shuffle-none             83.26       95.51       134
Shuffle-orig             92.6       310.89       484
Mobile-1st               93.92     1167.28       466
Mobile-2nd               94.13     1002.1        486
Mobile-1st-partial       93.66      705.62       466
Mobile-2nd-partial       93.28      623.03       486
Mobile-none              78.45      243.96       146
Mobile-orig              94.49     1925.42       806
Table 2: Comparison between removing different activation layers, and the effect of replacing the remaining activation with a partial activation layer, on the SqueezeNet, ShuffleNetV2 and MobileNetV2 blocks. Squeeze-1st denotes the removal of all but the first activation layer in the SqueezeNet block, and Squeeze-1st-partial additionally replaces that remaining activation with a partial activation layer (labels for the other architectures are analogous). Removing one activation layer while using partial activation on the other one gives roughly equivalent results for the two activation layers.
Activation removal.

We evaluated the effectiveness of removing activation layers from each of the three architecture blocks, where each block has two activation layers. Our experiment exhaustively evaluates the effects of removing one or both layers. The results presented in Table 2 clearly demonstrate that one activation layer in each block can be removed completely with only a reasonable loss of accuracy.

Activation removal and partial activation.

In order to further minimize the use of non-linearities, we investigated the combination of our novel partial activation layer, introduced in our first principle, with the selective removal of activation layers. We evaluated the removal of one activation layer while replacing the other with a partial activation layer. Results are presented in Table 2 and demonstrate that we were able to significantly improve the communication complexity of secure inference for SqueezeNet, ShuffleNetV2 and MobileNetV2, with a reasonable change in accuracy. The round complexity was considerably improved as well for all three architectures.

4.3 Principle 3: Alternative Non-Linearities

Figure 4: Comparison between max pooling (blue) and average pooling (orange) on SqueezeNet, ShuffleNetV2 and MobileNetV2, in terms of accuracy (a), communication complexity (b) and round complexity (c). Average pooling has roughly the same accuracy, but is much more efficient.
Figure 5: Comparison between the ReLU (blue) and ReLU6 (orange) activation functions on SqueezeNet, ShuffleNetV2 and MobileNetV2, in terms of accuracy (a), communication complexity (b) and round complexity (c). Accuracy is similar but ReLU is more efficient than ReLU6.
Pooling.

We evaluated the effect of using max pooling versus average pooling. SqueezeNet consists of multiple max pooling layers and a global average pooling layer. In the max pooling experiment we replaced the global average pooling layer with global max pooling, while in the average pooling experiment we replaced all max pooling layers with average pooling. MobileNetV2 and ShuffleNetV2 use strided convolutions for dimensionality reduction; in order to better emphasize the effect of the different pooling methods, we removed the strides and replaced them with pooling layers. Results are presented in Fig. 4. Average pooling is much more efficient than max pooling while not affecting accuracy significantly.

ReLU6.

We investigated the effect of using ReLU6 versus ReLU activations. MobileNetV2 was designed with ReLU6, so we simply replaced it with ReLU. ShuffleNetV2 and SqueezeNet use the ReLU activation, which we replaced with the ReLU6 activation. Results are presented in Fig. 5. The choice of non-linearity has minimal effect on accuracy, while ReLU is twice as efficient as ReLU6.

4.4 Crypto-Oriented Neural Architectures

We use our three principles to design state-of-the-art crypto-oriented neural network architectures, based on regular state-of-the-art architectures. Specifically, we present crypto-oriented versions of the building blocks in SqueezeNet [iandola2016squeezenet], ShuffleNetV2 [ma2018shufflenet] and MobileNetV2 [sandler2018mobilenetv2]. For illustrative purposes, we describe in detail the application of our principles on the inverted residual with linear bottleneck blocks from MobileNetV2, illustrated in Fig. 6. A more detailed description of the applications of our principles for the other blocks is presented in Sec. A.2. Final results are presented in Table 1.

Partial activation layers: In order to reduce the number of non-linear evaluations, we replaced all activation layers in the inverted residual with linear bottleneck block with partial activation layers. This yields a substantial improvement in communication complexity.

Activation layer selection: After careful evaluation, we removed the first activation layer completely (i.e. only the activation following the depthwise convolution remains). This reduces both the communication and the round complexity. Combining this change with the former, i.e. additionally replacing the remaining activation layer with a partial activation layer, yields a further improvement in communication complexity. Overall, the changes motivated by these two design principles substantially reduce both the communication and round complexity.

Alternative non-linearities: As discussed in Sec. 3.3, the ReLU6 variant costs twice as much as the ReLU activation. Therefore, we replace the ReLU6 activation function with ReLU. This change produces a further improvement in both communication and round complexity.

Together, the above modifications, motivated by each of our design principles, yield a large improvement in both the communication and the round complexity (see Table 1).

Figure 6: Crypto-oriented inverted residual with linear bottleneck block. By applying our three design principles, i.e. removing the first activation layer, replacing the second activation layer with a partial activation layer, and using the ReLU activation instead of the ReLU6 variant, we achieve a significant improvement in the communication and round complexity of the MobileNetV2 architecture.
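
The sketch below shows one possible Keras rendering of this crypto-oriented block, under our reading of the modifications above; the expansion factor, split ratio and use of batch normalization are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import tensorflow as tf

def partial_relu(x, ratio=0.5):
    # Principle 1: ReLU on a fraction of the channels only (illustrative helper).
    k = int(round(x.shape[-1] * ratio))
    return tf.concat([tf.nn.relu(x[..., :k]), x[..., k:]], axis=-1)

def crypto_inverted_residual(x, out_channels, expansion=6, stride=1):
    """Sketch of a crypto-oriented MobileNetV2 block: the expansion 1x1 conv is left
    linear (Principle 2), the depthwise conv is followed by a partial ReLU rather
    than ReLU6 (Principles 1 and 3), and the projection stays a linear bottleneck."""
    in_channels = x.shape[-1]
    h = tf.keras.layers.Conv2D(in_channels * expansion, 1, use_bias=False)(x)
    h = tf.keras.layers.BatchNormalization()(h)   # no activation here (Principle 2)
    h = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = tf.keras.layers.BatchNormalization()(h)
    h = partial_relu(h, ratio=0.5)                # Principles 1 and 3
    h = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(h)
    h = tf.keras.layers.BatchNormalization()(h)   # linear bottleneck, as in MobileNetV2
    if stride == 1 and in_channels == out_channels:
        h = tf.keras.layers.Add()([x, h])         # residual connection
    return h
```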

5 Discussion

Accuracy measurements.

Due to the slow inference runtime of secure neural networks, we measured accuracy in the non-secure setting. As the SecureNN framework, on which we base our analysis, applies no approximations during inference, there should be no significant difference between the secure and non-secure accuracy. In order to verify this assumption, we measured the secure inference accuracy on a subset of experiments, using the tf-encrypted library [tfencrypted], and compared it to the non-secure accuracy of the same models. The results, presented in Sec. A.3, show a negligible difference in accuracy.

Comparison to other frameworks.

Our analysis focused on the SecureNN protocol [wagh2019securenn]. We stress that the same idea applies to other frameworks. For example, consider the Gazelle framework [juvekar2018gazelle], which proposed a hybrid between homomorphic encryption for linear layers and garbled circuits for non-linear layers. According to the benchmarks provided by the authors, a convolution layer requires no communication at all, whereas a ReLU layer with less than a third as many neurons as the output of that convolution layer takes longer to evaluate and requires communication on the order of megabytes.

It should be noted that there are frameworks for which secure computation of a convolution layer is more expensive than secure computation of a ReLU layer. An example is the DeepSecure framework [rouhani2018deepsecure], which uses only garbled circuits. The optimal architectures might change slightly in this case, but the core idea of our work, i.e. the need for designing crypto-oriented architectures, remains highly relevant.

Increasing channels.

In order to minimize the accuracy reduction, we tried to gain more expressiveness by increasing the number of channels in layers with no activation. As discussed in Sec. 3, based on the analysis of the SecureNN framework [wagh2019securenn] (and of other frameworks, as discussed above), secure computation of convolution layers is cheaper than that of activation layers, so we can afford additional channels when removing non-linearities. Results are detailed in Sec. A.4, and show a minor increase in accuracy while slightly increasing communication. The difference was not significant enough to be included in our final crypto-oriented architectures.

6 Analysis of ReLU Alternatives

ReLU6.

In order to analyze the cost of the secure computation of ReLU6, we modify the ReLU protocol suggested in [wagh2019securenn] and provide a protocol for the secure computation of ReLU6. ReLU6 can be decomposed into a combination of Heaviside step functions H:

ReLU6(x) = H(x) · x − H(x − 6) · (x − 6),   (10)
H(x) = 1 if x ≥ 0, and 0 otherwise.   (11)

The protocol is described in Algorithm 1, and is a step-by-step secure evaluation of Eq. (10) via secret sharing. We denote the model provider by P_0 and the data owner by P_1; P_2 represents the crypto-producer, a third-party "assistant" that provides randomness. ⟨x⟩_0 and ⟨x⟩_1 denote the two secret shares of a value x. DReLU and MatMul are the secure protocols presented in [wagh2019securenn] for computing the Heaviside step function and matrix multiplication (for scalar multiplication, we use MatMul with 1x1 matrices), respectively. For more details we refer the reader to [wagh2019securenn]. The security of this protocol follows directly from the security of the two underlying protocols and the fact that at each point in time the parties learn only secret shares of the current state of the computation. Since the protocol invokes DReLU and MatMul twice each, instead of once as in the ReLU protocol, the secure computation of ReLU6 requires twice the cost of ReLU, both in rounds and in communication.
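
A plaintext NumPy check that the Heaviside-based form above agrees with the clamped definition of ReLU6 (a numerical sanity check only, not a secure implementation):

```python
import numpy as np

def heaviside(x):
    # H(x) = 1 for x >= 0, else 0 (the quantity computed securely by DReLU).
    return (x >= 0).astype(x.dtype)

def relu6_via_heaviside(x):
    # ReLU6(x) = H(x) * x - H(x - 6) * (x - 6): two step functions, two products.
    return heaviside(x) * x - heaviside(x - 6) * (x - 6)

x = np.linspace(-10, 10, 81)
assert np.allclose(relu6_via_heaviside(x), np.clip(x, 0, 6))
```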

LeakyReLU.

Another popular non-linearity, often used as an alternative to ReLU, is the LeakyReLU activation:

LeakyReLU(x) = x if x ≥ 0, and αx otherwise,   (14)

for a fixed slope 0 < α < 1. This activation function can also be written in terms of the Heaviside step function H as:

LeakyReLU(x) = (α + (1 − α) · H(x)) · x.   (15)

We modified the ReLU protocol suggested in [wagh2019securenn] to obtain a protocol that is a step-by-step secure evaluation of Eq. (15). A description of the protocol is provided in Sec. A.5. The secure computation of LeakyReLU differs from the secure computation of ReLU only by a constant scalar multiplication, and therefore has the same communication and round complexity. LeakyReLU can thus be used instead of ReLU with no additional cost.

Algorithm 1: Secure ReLU6
Input: P_0, P_1 hold shares ⟨a⟩_0 and ⟨a⟩_1, respectively.
Output: P_0, P_1 get ⟨ReLU6(a)⟩_0 and ⟨ReLU6(a)⟩_1.
Common Randomness: P_0, P_1 hold random shares of 0, denoted u_0 and u_1 resp.
1. P_0, P_1, P_2 run DReLU with P_0, P_1 having inputs ⟨a⟩_0 and ⟨a⟩_1, and learn ⟨H(a)⟩_0 and ⟨H(a)⟩_1, resp.
2. P_0, P_1, P_2 call MatMul with P_0, P_1 having inputs (⟨H(a)⟩_0, ⟨a⟩_0) and (⟨H(a)⟩_1, ⟨a⟩_1), and learn ⟨c⟩_0 and ⟨c⟩_1, resp.
3. P_0, P_1, P_2 run DReLU with P_0, P_1 having inputs ⟨a − 6⟩_0 and ⟨a − 6⟩_1, and learn ⟨H(a − 6)⟩_0 and ⟨H(a − 6)⟩_1, resp.
4. P_0, P_1, P_2 call MatMul with P_0, P_1 having inputs (⟨H(a − 6)⟩_0, ⟨a − 6⟩_0) and (⟨H(a − 6)⟩_1, ⟨a − 6⟩_1), and learn ⟨d⟩_0 and ⟨d⟩_1, resp.
5. For j ∈ {0, 1}, P_j outputs ⟨c⟩_j − ⟨d⟩_j + u_j.

7 Conclusion

We addressed efficiency challenges in privacy-preserving neural network inference. Motivated by the unique properties of secure computation, we proposed three design principles for crypto-oriented neural network architectures: partial activation layers, activation layer selection and alternative non-linearities. By applying our design principles to three state-of-the-art architectures (SqueezeNet, ShuffleNetV2 and MobileNetV2) we achieved significant improvements on all of them. On MobileNetV2, for example, we substantially reduced the communication complexity, the round complexity and the secure inference runtime (see Table 1), with only a reasonable loss in accuracy.

Acknowledgments

This research has been supported by the Israel Ministry of Science and Technology, by the Israel Science Foundation, and by the European Union's Horizon 2020 Framework Program (H2020) via an ERC Grant (Grant No. 714253).

References

Appendix

A.1 CIFAR-10 Downscaling

Experiments were conducted on three popular efficient architectures - SqueezeNet [iandola2016squeezenet], ShuffleNetV2 [ma2018shufflenet] and MobileNetV2 [sandler2018mobilenetv2]. These architectures were designed for the ImageNet dataset [deng2009imagenet], a large-scale dataset. Due to limited resources we evaluated on the CIFAR-10 dataset [krizhevsky2009learning], which required downscaling the architectures accordingly. We detail the modifications applied to each architecture below:

SqueezeNet.

We reduced the kernel size and the stride of the first convolution layer, and used a smaller kernel in each pooling layer. In addition, we removed the dropout layer and added batch normalization layers after every convolution layer.

ShuffleNetV2.

We reduced the stride of the first convolution layer. In addition, we removed the first pooling layer.

MobileNetV2.

We reduced the stride of the first convolution layer and of the first inverted residual block. In addition, we increased the weight decay.

A.2 Other Crypto-Oriented Neural Architectures

We present our crypto-oriented version of the building blocks in SqueezeNet [iandola2016squeezenet] and ShuffleNetV2 [ma2018shufflenet].

A.2.1 SqueezeNet

Figure 7: Crypto-oriented Fire module. By applying our design principles, i.e. removing the first activation layer and replacing the second activation layer with a partial activation layer, we achieve a significant improvement in the communication and round complexity of the SqueezeNet architecture, with a reasonable accuracy loss.
Partial activation layers:

In order to reduce the number of non-linearities, we replaced the activation layers in the Fire module (i.e. the SqueezeNet building block) with partial activation layers. This reduces the communication complexity.

Activation layers removal:

We removed the first activation layer in the Fire module, i.e. the one in the squeeze phase of the block. This improves both the communication and the round complexity. Combining this change with the former, i.e. replacing the remaining activation layer with a partial activation layer, further improves the communication complexity. Overall, applying both changes considerably reduces the communication and round complexity.

Alternative non-linearities:

We replaced each max pooling layer with an average pooling layer, as max pooling is very expensive to compute in a secure manner. This further improves both the communication and the round complexity.

Our final crypto-oriented version of the SqueezeNet architecture improves over its non-crypto-oriented counterpart in both communication and round complexity (see Table 1), at the cost of a reasonable accuracy loss. Our crypto-oriented Fire module is presented in Fig. 7.

A.2.2 ShuffleNetV2

Figure 8: Crypto-oriented ShuffleNetV2 unit. By applying our design principles, i.e. removing the second activation layer and replacing the first activation layer with a partial activation layer, we achieve a significant improvement in the communication and round complexity of the ShuffleNetV2 architecture with only a small loss of accuracy.
Partial activation layers:

We replaced all activation layers in the ShuffleNetV2 unit with partial activation layers. This improves the communication complexity.

Activation layers removal:

We removed the second activation layer from the ShuffleNetV2 unit. This improves both the communication and the round complexity. Further replacing the remaining activation layer with a partial activation layer yields an additional improvement in communication complexity.

Alternative non-linearities:

We did not change the non-linearities in this architecture, as the ShuffleNetV2 unit does not use expensive variants such as ReLU6. In addition, the max pooling layer present in the original design was removed in our downscaling process, as detailed above.

Applying the aforementioned modifications, motivated by our design principles, yields a substantial improvement in both communication and round complexity (see Table 1), with only a small loss of accuracy. Our crypto-oriented ShuffleNetV2 unit is presented in Fig. 8.

A.3 Encrypted Accuracy

In our experiments, we measured accuracy in the non-secure setting, due to the slow inference time of secure neural networks. Our experiments were conducted using the tf-encrypted library [tfencrypted] which is based on the SPDZ [damgaard2012multiparty, damgaard2013practical] and SecureNN [wagh2019securenn] protocols. As this implementation does not apply any approximations on the inference, we do not expect there to be a significant difference between the secure and non-secure accuracy measurements. In order to verify this assumption, we evaluated the secure accuracy on our final crypto-oriented architectures and compared the results against the non-secure accuracy. As done in all of our experiments, each experiment was conducted five times, and we report the average results. The results presented in Table 3 show that the accuracy difference is indeed negligible.

Model             Secure Accuracy   Non-Secure Accuracy
CO-SqueezeNet          91.88               91.89
CO-ShuffleNetV2        92.49               92.51
CO-MobileNetV2         93.46               93.44
Table 3: Comparison of secure and non-secure accuracies on our crypto-oriented architectures. We denote by CO-SqueezeNet our crypto-oriented variant of the SqueezeNet architecture. There is a negligible difference in accuracy between the secure and non-secure setting.

A.4 Double Channels Results

Model             Accuracy   Comm. (MB)   Rounds
Sq-1st              90.4       179.95       233
Sq-1st-double       90.66      241.75       233
Sq-2nd              92.66      240.26       313
Sq-2nd-double       92.98      256.74       313
Sq-orig             92.49      326.41       393
Sh-1st              92.5       156.89       294
Sh-1st-double       92.81      308.41       294
Sh-2nd              92.19      141.82       324
Sh-2nd-double       92.46      188.84       324
Sh-orig             92.6       310.89       484
Mb-1st              93.66      705.62       466
Mb-1st-double       94.15     1397.7        466
Mb-2nd              93.28      623.03       486
Mb-2nd-double       94.15     1213.51       486
Mb-orig             94.49     1925.42       806
Table 4: Comparison of the effect of increasing the number of channels in convolution layers without activations. In this experiment we removed different activation layers and replaced the remaining one with a partial activation layer, on the SqueezeNet (Sq), ShuffleNetV2 (Sh) and MobileNetV2 (Mb) blocks. By Sq-1st-double we denote: (i) the removal of all but the first activation layer, (ii) its replacement with a partial activation layer, and (iii) doubling the number of channels in the no-activation convolution layer (labels for the other variants are analogous). The results show a minor effect on accuracy while increasing the communication complexity.

As mentioned in the discussion section, we tried to reduce the accuracy loss resulting from the minimization of activations, i.e. the removal of activation layers and the replacement of the remaining layers with partial activation layers, described in Table 4. We evaluated the effect of doubling the number of channels in layers with no activations. The goal is to amplify the model's expressiveness without adding further non-linearities. As mentioned in the paper, and based on the analysis of the SecureNN framework [wagh2019securenn], the added cost of increasing the convolution channels is less than the cost of the removed activations, which enables us to "compensate" for the removal of activations with more convolutional channels. As can be seen in the results detailed in Table 4, increasing the number of channels had a minor effect on accuracy while increasing the costs. The benefit was not significant enough to be included in our final crypto-oriented architectures.

A.5 LeakyReLU Protocol

We present a protocol, based on the ReLU protocol from [wagh2019securenn], for the secure computation of the LeakyReLU activation function. The LeakyReLU activation is defined as:

LeakyReLU(x) = x if x ≥ 0, and αx otherwise,   (16)

for a fixed slope 0 < α < 1. This activation function can also be written in terms of the Heaviside step function H as:

LeakyReLU(x) = (α + (1 − α) · H(x)) · x.   (17)

Note that the secure computation of LeakyReLU differs from the secure computation of ReLU, provided in [wagh2019securenn], by only a constant scalar multiplication, and therefore has the same communication and round complexity. This suggests that the LeakyReLU activation function can be used instead of ReLU with no additional cost.

Algorithm 2: Secure LeakyReLU
Input: P_0, P_1 hold shares ⟨a⟩_0 and ⟨a⟩_1, respectively.
Output: P_0, P_1 get ⟨LeakyReLU(a)⟩_0 and ⟨LeakyReLU(a)⟩_1.
Common Randomness: P_0, P_1 hold random shares of 0, denoted u_0 and u_1 resp.
1. P_0, P_1, P_2 run DReLU with P_0, P_1 having inputs ⟨a⟩_0 and ⟨a⟩_1, and learn ⟨H(a)⟩_0 and ⟨H(a)⟩_1, resp. Each party then locally computes its share ⟨β⟩_j of β = α + (1 − α) · H(a), since operations with public constants require no interaction.
2. P_0, P_1, P_2 call MatMul with P_0, P_1 having inputs (⟨β⟩_0, ⟨a⟩_0) and (⟨β⟩_1, ⟨a⟩_1), and learn ⟨c⟩_0 and ⟨c⟩_1, resp.
3. For j ∈ {0, 1}, P_j outputs ⟨c⟩_j + u_j.