The growing adoption of Machine Learning as a Service (MLaaS) (hunt2018chiron) has given rise to privacy concerns about clients’ personal data and the intellectual property (i.e., trained models) of service providers. To address these concerns, techniques such as differential privacy (dwork2006differential; abadi2016deep), federated learning (konevcny2016federated; bonawitz2019towards), secure enclaves (costan2016intel; tramer2018slalom), homomorphic encryption (HE) (gentry2009fully), and multiparty computation (MPC) (shamir1979share) aim to prevent both the server from accessing the client’s sensitive data and the client from learning the server’s model. One area of study within privacy-preserving machine learning (PPML), private inference (PI), attempts to perform inference directly on encrypted data using either HE (gilad2016cryptonets; mohassel2017secureml; sanyal2018tapas) or MPC-based techniques such as Secret Sharing (SS) (mohassel2018aby3; riazi2018chameleon; riazi2019xonn; rouhani2018deepsecure; rachuri2019trident; patra2020blaze; chaudhari2019astra; chandran2019ezpc). Common PI protocols employ HE/SS for processing linear operations (e.g., convolutions and fully connected layers) and garbled circuits (GCs) for nonlinear operations (e.g., ReLU and maxpool) (liu2017oblivious; juvekar2018gazelle; mishra2020delphi; rathee2020cryptflow2; SAFENET; jha2021deepreduce).
Garbled circuits are a major source of inefficiency when performing PI for the following reasons: (1) in PI, unlike plaintext inference, ReLU garbled circuits dominate the runtime and can be orders of magnitude more costly than linear layers computed with SS (cryptonas; mishra2020delphi); (2) a single ReLU operation using garbled circuits requires 17.5 KB of data storage and communication, and a single inference on state-of-the-art DNNs (such as ResNet50 (he2016deep)) requires millions of ReLU computations, which leads to hundreds of GiB of data storage and communication (rathee2020cryptflow2). These inefficiencies also exist for variants of ReLU such as leaky ReLU (maas2013rectifier), parametric ReLU (he2015delving), RReLU (xu2015empirical), CReLU (shang2016understanding), and the recently proposed DY-ReLU (chen2020dynamic). Furthermore, the storage and latency costs of GCs are exacerbated when they are used to compute more expressive and complex activation functions such as ELU (clevert2015fast), SELU (klambauer2017self), Swish (ramachandran2017searching), GELU (hendrycks2016gaussian), and Mish (misra2019mish).
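To make the scale of the second point concrete, the following back-of-the-envelope sketch multiplies the quoted per-ReLU cost by a ReLU count. The count of ten million is a hypothetical round figure chosen only for illustration, and reading "KB" as 1024 bytes is likewise an assumption.

```python
# Back-of-the-envelope garbled-circuit footprint for the ReLUs of one inference.
KIB = 1024          # assumption: "KB" in the text is read as 1024 bytes
GIB = 1024 ** 3

BYTES_PER_RELU = 17.5 * KIB             # per-ReLU cost quoted in the text
HYPOTHETICAL_RELU_COUNT = 10_000_000    # illustrative figure, not from the text


def gc_footprint_gib(num_relus, bytes_per_relu=BYTES_PER_RELU):
    """Total garbled-circuit storage/communication, in GiB, for num_relus ReLUs."""
    return num_relus * bytes_per_relu / GIB


if __name__ == "__main__":
    total = gc_footprint_gib(HYPOTHETICAL_RELU_COUNT)
    print(f"{HYPOTHETICAL_RELU_COUNT:,} ReLUs -> {total:.1f} GiB")
```

Because the cost is linear in the number of ReLUs, any reduction in ReLU count translates directly into a proportional reduction in storage and communication.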
|Lookup Table (thaine2019efficient)||2||Full||Y||N||N||N|
|Polyfit (chabanne2017privacy)||2, 4, 6||Full||Y||N||N||N|
|CryptoDL (hesamifard2017cryptodl)||2, 3||Full||Y||Y||N||N|
|SAFENet (SAFENET)||2, 3||Partial||N||Y||Y||N|
The aforementioned challenges and inefficiencies of nonlinear computations using garbled circuits have driven researchers to design alternative activation functions that are cheaper to compute under PI. In particular, polynomial functions, which require only simple addition and multiplication, eliminate the need for garbled circuits and have become the de facto solution for replacing ReLUs in neural networks. In fact, replacing all ReLUs with $x^2$ (denoted Quad here) can reduce online latency and communication dramatically, by up to 2843× and 256×, respectively (mishra2020delphi).
Table 1 summarizes prior work using polynomial activation functions for PI. The partial/full distinction indicates whether the solution replaces some or all ReLU activations with polynomials. We find that prior work can be classified into three categories: full replacement using small datasets/models (e.g., MNIST (mnist), CIFAR-10 (cifar)) (gilad2016cryptonets; thaine2019efficient; chabanne2017privacy; mohassel2017secureml; fastercryptonets; hesamifard2017cryptodl; badawi2018towards), partial replacement on mid-sized models (e.g., CIFAR-100) (mishra2020delphi; SAFENET), and full replacement on large models using very high-degree approximations (lee2021precise). Each of these solutions significantly advanced our understanding of the problem and the capabilities of PI. However, none has demonstrated full replacement on large datasets/models using low-degree polynomials, which we believe is the ideal solution.
In this paper, we set out to replace all ReLUs with low-degree polynomials. Specifically, we test two drop-and-replace solutions (Taylor Approximation and Polynomial Regression Approximation) and develop two novel replace-and-retrain strategies (QuaIL and QuaIL+ApproxMinMax) on a wide range of networks and datasets. Our contributions can be summarized as follows:
We propose Quadratic Imitation Learning (QuaIL), a training setup inspired by dynamic programming to gradually build neural networks with only polynomial activations and introduce ApproxMinMaxNorm, a normalization strategy that bounds pre-activation values during training and approximately bounds pre-activation values during inference.
We implement and release Sisyphus, a set of methods for wholesale ReLU replacement that range from simple drop-and-replace solutions to replace-and-retrain strategies.
We develop and rigorously evaluate four substitution strategies using the Sisyphus framework and perform an in-depth analysis of their efficacy for deep networks. Crucially, we show that the instabilities of performing both inference and training with polynomial activation functions become more prominent in deeper neural networks and may not be observed in shallower networks.
As we increase the complexity of the replacement strategy, we steadily progress towards training deeper, more accurate, PI-friendly networks using only low-degree polynomial activations. Despite our best efforts, we fall short of matching baseline ReLU network performance due to the escaping activation problem: in all solutions, forward-pass activation values inevitably escape the well-behaved range of the polynomial activation function, leading to either exploding values (NaNs) or poorly behaved approximations.
Looking beyond QuaIL+ApproxMinMax (QuaIL+AMM), it may be tempting to evaluate additional solutions. One way to overcome the escaping activation problem in QuaIL would be to bound the range. However, this requires a max function, and if we had one, we could simply use it to compute ReLU in our networks. Recent work proposed the Pade Activation Unit (PAU), a rational function of two low-degree polynomials that performs well on complex datasets (Molina2020Pade). Unfortunately, the division operation required by PAUs is not natively supported by the cryptographic primitives of HE/SS and is known to be challenging to implement. Another recent work has proposed approximating ReLU and max-pooling using very high-degree (e.g., 29) polynomials and reports competitive accuracy for ImageNet (lee2021precise). However, high-degree polynomials can be difficult to evaluate using cryptographic solutions, as they introduce significant additional computation in both SS and HE as well as noise growth in HE. Thus, we name this paper and our framework Sisyphus: each time a promising solution was evaluated, we encountered a fundamental limitation that brought us back to square one.
We test the Sisyphus framework on the MNIST, CIFAR-10, CIFAR-100, and TinyImageNet (yao2015tiny) datasets and test each substitution strategy over a wide variety of networks: AlexNet (alexnet), VGG-11/16 (vgg), ResNet18, MobileNetV1 (howard2017mobilenets), and ResNet32 (he2016deep). We develop and test our framework using PyTorch (pytorch) (1.8.1+CUDA11.1); for performing Bayesian Optimization during Polynomial Regression Approximation, we utilize GPyTorch (gpytorch) and BoTorch (botorch). All code for this paper is available online (see https://github.com/sisyphus-project/sisyphus-ppml).
3. Solutions and Results
In this section we present the solutions evaluated for replacing all ReLUs with polynomial activation functions, including Taylor series approximation, polynomial regression, QuaIL, and QuaIL+AMM.
3.1.1. Taylor Series Approximation
Key Idea: A simple approach to approximating ReLU as a polynomial is to use the Taylor approximation. The Taylor approximation estimates a differentiable function, $f(x)$, as a polynomial centered around a chosen point $a$. This approximation is constructed using high-order derivatives, and in the case of ReLU, all high-order derivative terms in the Taylor approximation vanish because the second derivative of ReLU is zero everywhere it is defined, resulting in a simple first-order approximation:

$$\mathrm{ReLU}(x) \approx \mathrm{ReLU}(a) + \mathrm{ReLU}'(a)\,(x - a)$$
Setup: First, a baseline ReLU model is trained. We then replace all ReLUs in the trained networks with the Taylor approximation and measure the test accuracy for the network’s respective dataset.
Results: As evident in Table 2, the test accuracy deteriorates significantly for all networks except for the two-layer MLP, which sees a dip in test accuracy from 97.98% (using ReLU) to 86.28% (using the Taylor approximation) on MNIST. Given that the Taylor approximation for ReLU is a simple linear function, we expect deeper networks to perform poorly when using the approximation as an activation function.
Takeaway: Using the Taylor approximation of ReLU collapses each network to a linear model, which restricts the network from representing the non-linear mappings required to achieve a high predictive performance on deeper networks and complex datasets.
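The collapse described in this takeaway can be checked numerically. The sketch below is a minimal pure-Python illustration, not the evaluation code used for Table 2: it assumes bias-free fully connected layers and a hypothetical linear activation with slope 0.5, and shows that the stacked network computes exactly the same function as a single folded weight matrix.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def forward(layers, x, slope=0.5):
    """Run x through bias-free layers with a linear activation between them.

    The slope is a hypothetical stand-in for a first-order approximation of ReLU.
    No activation is applied after the final layer.
    """
    for i, W in enumerate(layers):
        x = matvec(W, x)
        if i < len(layers) - 1:
            x = [slope * v for v in x]
    return x


def collapse(layers, slope=0.5):
    """Fold all layers and linear activations into one equivalent matrix."""
    M = layers[0]
    for W in layers[1:]:
        M = matmul(W, [[slope * v for v in row] for row in M])
    return M


if __name__ == "__main__":
    layers = [
        [[1, 2], [3, 4], [5, 6]],   # 2 -> 3
        [[1, 0, -1], [2, 1, 0]],    # 3 -> 2
        [[1, 1]],                   # 2 -> 1
    ]
    x = [1.0, -2.0]
    print(forward(layers, x), matvec(collapse(layers), x))
```

Adding biases or a constant offset to the activation changes the result from a linear map to an affine one, but the network still has no more expressive power than a single layer.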
3.1.2. Polynomial Regression Approximation
Key Idea: A natural extension of the Taylor approximation is to approximate ReLU using a polynomial over a range rather than at a single point. The polynomial fit to a function has the form $p(x) = \sum_{i=0}^{n} c_i x^i$, where $c_i \in \mathbb{R}$ and $n$ is the order of the polynomial. Polynomial regression can be employed to fit a polynomial function to any non-linear function by minimizing the mean squared error between the approximation and the target function over a range $[-R, R]$ and order $n$. For example, if the target function is ReLU, optimal coefficients can be found by minimizing

$$\int_{-R}^{R} \Big( \mathrm{ReLU}(x) - \sum_{i=0}^{n} c_i x^i \Big)^2 \, dx. \qquad (1)$$
Setup: To find the coefficients $c_0, \dots, c_n$ that minimize Equation 1, we first discretize the integral at a fixed granularity. The polynomial fit depends heavily on the order of the polynomial ($n$) and the range ($R$) over which Equation 1 is minimized. To this end, we employ Bayesian Optimization (BayesOpt) to efficiently select effective values for $n$ and $R$ (bayesopt). To accommodate a variety of polynomials, both the range and the integer-valued order of the polynomial are allowed to vary within fixed bounds.
Given a setting of $(n, R)$, a ReLU approximation is found using polynomial regression. All ReLUs in the original trained network are then replaced with the approximation and we measure the training accuracy. BayesOpt uses this accuracy to iteratively update its probabilistic model and find well-performing values of $n$ and $R$. We run BayesOpt for 50 iterations (10 random values to seed the probabilistic model and 40 optimized values) for each network and dataset. Finally, we replace all ReLUs in each network with the most accurate polynomial fit and measure the test accuracy.
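The regression step for a single setting of order and range can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions rather than the released implementation: the symmetric range, the order of 2, the half-range of 5, and the grid step of 0.01 are all hypothetical choices, and the BayesOpt outer loop is omitted. The discretized objective is solved through the least-squares normal equations.

```python
def relu(x):
    return max(0.0, x)


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_relu_polynomial(order, half_range, step):
    """Least-squares polynomial fit to ReLU on a uniform grid over [-half_range, half_range].

    Returns coefficients [c_0, ..., c_order] in increasing-power order.
    """
    num_points = int(round(2 * half_range / step)) + 1
    xs = [-half_range + j * step for j in range(num_points)]
    ys = [relu(x) for x in xs]
    m = order + 1
    # Normal equations: (sum_j x_j^(i+k)) c_k = sum_j y_j x_j^i
    power_sums = [sum(x ** k for x in xs) for k in range(2 * order + 1)]
    A = [[power_sums[i + k] for k in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    return solve_linear(A, b)


def poly_eval(coeffs, x):
    """Evaluate the polynomial at x with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


if __name__ == "__main__":
    coeffs = fit_relu_polynomial(order=2, half_range=5.0, step=0.01)
    print("coefficients:", coeffs)
    print("p(6.0) =", poly_eval(coeffs, 6.0), "vs ReLU(6.0) =", relu(6.0))
```

The fit is only meaningful inside the range it was computed on; evaluating the polynomial outside that range, as in the last line, is where the approximation starts to drift away from ReLU.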
Result: Table 2 displays the test accuracy for evaluating each network using the polynomial activation function produced by BayesOpt. Evaluating networks using non-linear polynomials introduces unbounded forward activations that compound exponentially with network depth, which we call escaping activations. Especially for deeper networks, it is possible to generate forward activation values that overflow their floating point representations, which results in a NaN. We consider output logits that contain NaN values to be incorrect predictions. For this reason, two accuracies are presented in some rows of the polynomial regression experiments. Accuracies in parentheses represent the test accuracy when only considering inputs that do not overflow in forward activation values. Using polynomial regression, we are able to progress to a high accuracy on LeNet, a five-layer network.
Takeaway: Simply replacing all ReLUs with accurate polynomial approximations that are both low-degree and non-linear fails to work for most deeper networks due to the escaping activation problem in which forward activation values grow exponentially, leading to instability in inference.
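A scalar toy model gives a feel for how quickly values escape. The sketch below is illustrative only: it assumes a single activation value passed through repeated squaring with a hypothetical weight of 1, ignoring the mixing performed by real layers. A value slightly above 1 overflows double precision within a dozen layers, while a value slightly below 1 shrinks towards zero.

```python
import math


def quad(x):
    # Plain multiplication returns inf on overflow instead of raising an error.
    return x * x


def forward_magnitudes(x0, depth, weight=1.0):
    """Track one activation value through `depth` layers of quad(weight * x).

    The weight is a hypothetical scalar standing in for a layer's linear map.
    """
    values = [x0]
    x = x0
    for _ in range(depth):
        x = quad(weight * x)
        values.append(x)
    return values


def is_invalid(value):
    """True when a value has overflowed (inf) or become undefined (NaN)."""
    return math.isinf(value) or math.isnan(value)


if __name__ == "__main__":
    for start in (0.9, 1.5):
        trace = forward_magnitudes(start, depth=12)
        first_bad = next((i for i, v in enumerate(trace) if is_invalid(v)), None)
        print(f"start={start}: final={trace[-1]}, first invalid layer={first_bad}")
```

Once an infinity appears, later arithmetic such as subtracting two overflowed logits yields NaN, which is why overflowing inputs are counted as incorrect predictions.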
3.2.1. Quadratic Imitation Learning (QuaIL)
Key Idea: The escaping activation problem encountered when using the polynomial regression strategy was directly caused by the compounding use of polynomials in deeper networks. Specifically, after each pass through the polynomial activation, the output intermediate representation values began to grow exponentially. After several layers (several passes through the polynomial activations), the intermediate representation values escaped the well-behaved regions of the polynomials and resulted in exploding values (NaNs). The escaping nature of intermediate representations motivates moving from simple drop-and-replace strategies to replace-and-retrain strategies, which mitigate the escaping activation problem. Rather than attempting to train a network with polynomial activations end-to-end by minimizing the loss between ground truth and predictions, Quadratic Imitation Learning (QuaIL) iteratively builds and trains a neural network with polynomial activations by mimicking the intermediate representation values of a trained ReLU network. Similar to dynamic programming, QuaIL attempts to first solve a sub-problem by mimicking intermediate representation values of a well-behaved network before adding additional layers to a network using polynomial activations. In QuaIL, the polynomial activation function is set to $x^2$ (Quad).
Setup: Figure 1 depicts the QuaIL training process. First, a ReLU baseline network is trained using standard supervised learning techniques (Fig. 1.1). Then, the first layer of the ReLU network is duplicated and the layer’s ReLUs are replaced with Quad. Here, the Quad network is trained by minimizing the Mean Square Error (M.S.E.) between the first-layer intermediate representations of both networks (Fig. 1.2). In this way, the single-layer Quad network learns to predict similar first-layer representations as the ReLU network. After training converges to a low M.S.E. between the two intermediate representations, the Quad network’s first-layer weights are frozen and the second layer of the ReLU network is cloned and stacked onto the Quad network. Again, ReLU is replaced by Quad for the second layer. Similar to the first layer, the Quad network now minimizes the M.S.E. between the second-layer representations of both networks. This process is repeated until the final layer of the ReLU network has been added to the Quad network and the error between the final representations is minimized (Fig. 1.3).
At this stage, the Quad network is trained using standard supervised learning while gradually unfreezing shallower layers. In the image classification setting, the Cross Entropy (C.E.) loss is minimized between ground truth labels and predictions (Fig. 1.4-1.5).
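The layer-wise imitation stage of this procedure can be sketched on a deliberately tiny example. The code below is a toy under stated assumptions and not the Sisyphus implementation: every "layer" is a single scalar weight and bias, the teacher weights, learning rate, and step count are hypothetical, and the final cross-entropy fine-tuning with gradual unfreezing is left out. Each student layer starts as a clone of the teacher layer with ReLU swapped for Quad, is trained to match the teacher's representation at that depth, and is then frozen.

```python
def relu(z):
    return max(0.0, z)


def quad(z):
    return z * z


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_layer(inputs, targets, w, b, lr=0.02, steps=2000):
    """Fit quad(w * u + b) to targets with plain gradient descent on the M.S.E."""
    n = len(inputs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for u, t in zip(inputs, targets):
            z = w * u + b
            dz = 2.0 * (quad(z) - t) * 2.0 * z  # d(loss)/dz for one sample
            grad_w += dz * u
            grad_b += dz
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def quail_layerwise(teacher, xs):
    """Build a Quad student one layer at a time by imitating a ReLU teacher.

    teacher: list of (weight, bias) pairs for scalar ReLU layers.
    Returns the student layers and, per stage, (mse_before, mse_after).
    """
    student, history = [], []
    teacher_repr = list(xs)
    student_repr = list(xs)
    for w_t, b_t in teacher:
        teacher_repr = [relu(w_t * h + b_t) for h in teacher_repr]
        # Clone the teacher layer and replace ReLU with Quad.
        before = mse([quad(w_t * u + b_t) for u in student_repr], teacher_repr)
        w_s, b_s = train_layer(student_repr, teacher_repr, w_t, b_t)
        # Earlier student layers stay frozen: only (w_s, b_s) was updated.
        student_repr = [quad(w_s * u + b_s) for u in student_repr]
        history.append((before, mse(student_repr, teacher_repr)))
        student.append((w_s, b_s))
    return student, history


if __name__ == "__main__":
    xs = [i / 50 for i in range(51)]
    teacher = [(0.8, 0.1), (1.1, -0.05)]  # hypothetical teacher parameters
    _, history = quail_layerwise(teacher, xs)
    for stage, (before, after) in enumerate(history, start=1):
        print(f"stage {stage}: M.S.E. {before:.5f} -> {after:.5f}")
```

In this toy, each stage only sees inputs that the already-frozen student layers produce, which mirrors how the Quad network's later layers must cope with whatever residual error earlier layers leave behind.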
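| Dataset-Network | ReLU | Taylor | Polynomial Regression | QuaIL | QuaIL+AMM |
|---|---|---|---|---|---|
| C10-VGG16 | 92.78 | 16.01 | 13.31 (87.57) | — | 82.25 (82.63) |
| C10-ResNet32 | 91.72 | 26.96 | 90.48 (90.62) | — | 56.93 (71.81) |
| C100-VGG16 | 71.44 | 1.74 | 0.94 (52.81) | — | 54.56 (55.03) |
| Tiny-VGG16 | 58.88 | 0.45 | 0.56 (3.07) | — | 45.76 (46.47) |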
Result: QuaIL further extends our progress of building deep, PI-friendly networks to AlexNet and VGG11 on the CIFAR-10, CIFAR-100, and TinyImageNet datasets. Each of these networks, which use only the Quad activation function, is built up iteratively to limit the effect of escaping activations. However, QuaIL fails to generalize to even deeper networks, as even a small difference in the intermediate representations at earlier stages of the deeper networks propagates forward, leading to escaping activations and causing training to diverge (denoted as — in Table 2).
Takeaway: QuaIL allows us to iteratively build deep (up to 11-layer) networks with only the Quad activation function but fails to mitigate unstable intermediate representations for even deeper networks and thus still suffers from escaping activations. For example, a ResNet-18 network trained using the QuaIL setup experiences exploding gradients due to escaping activations in later intermediate representations and is unable to converge during training.
3.2.2. Approximate MinMax Normalization
Key Idea: The escaping activation problem encountered during QuaIL illustrates the need to bound pre-activation values in order to train networks using low-degree polynomial activation functions, especially for deeper neural networks. To do this, we developed Approximate Min-Max Normalization (ApproxMinMaxNorm), which places upper and lower constraints on pre-activation values during training by performing a dimension-wise Min-Max normalization with two scaling parameters.
During the training phase, approximations of the minimums and maximums are calculated and stored using a weighted moving average (with a fixed smoothing factor) of the true minimums and maximums. When performing inference, these stored approximations are then used to perform approximate normalization.
Setup: ApproxMinMaxNorm is combined with the QuaIL training procedure; when building the Quad network, ReLU is replaced by an ApproxMinMaxNorm layer immediately followed by Quad.
Result: We observe stable training for all networks and datasets using QuaIL+AMM. However, at inference time, when using the approximated values of the minimums and maximums for each layer, we again detect the escaping activation problem, albeit to a lesser degree when compared to the drop-and-replace polynomial regression strategy.
Takeaway: ApproxMinMaxNorm prevents the escaping activation problem at training time by explicitly bounding the pre-activation values to polynomial activations. However, the escaping activation problem returns during inference due to approximate minimum and maximum calculations. Thus a true maximum function is required at test time to guarantee bounds on pre-activation values.
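The gap between training-time and inference-time behaviour can be illustrated with a small sketch. The exact ApproxMinMaxNorm formula is not reproduced here, so the code below assumes a standard min-max form, `shift + scale * (x - min) / (max - min)`, with hypothetical `scale`, `shift`, and `momentum` values, and operates on plain Python lists rather than tensors. In training mode it normalizes with the true batch extremes and updates running estimates; in inference mode it uses only the stored estimates.

```python
class ApproxMinMaxNorm:
    """Per-dimension min-max normalization with moving-average min/max estimates.

    A sketch under assumptions: the formula, `momentum`, `scale`, and `shift`
    are illustrative choices, not values taken from the text.
    """

    def __init__(self, num_features, momentum=0.1, scale=1.0, shift=0.0, eps=1e-8):
        self.momentum = momentum
        self.scale = scale
        self.shift = shift
        self.eps = eps
        self.running_min = [0.0] * num_features
        self.running_max = [1.0] * num_features
        self.training = True

    def __call__(self, batch):
        """Normalize a batch given as a list of equal-length feature vectors."""
        if self.training:
            columns = list(zip(*batch))
            mins = [min(c) for c in columns]
            maxs = [max(c) for c in columns]
            m = self.momentum
            self.running_min = [(1 - m) * r + m * v for r, v in zip(self.running_min, mins)]
            self.running_max = [(1 - m) * r + m * v for r, v in zip(self.running_max, maxs)]
        else:
            mins, maxs = self.running_min, self.running_max
        return [
            [self.shift + self.scale * (x - lo) / (hi - lo + self.eps)
             for x, lo, hi in zip(row, mins, maxs)]
            for row in batch
        ]


if __name__ == "__main__":
    norm = ApproxMinMaxNorm(num_features=2)
    print("training:", norm([[0.0, -1.0], [2.0, 1.0], [4.0, 3.0]]))
    norm.training = False
    print("inference:", norm([[10.0, 5.0]]))
```

During training every output lies inside the target interval because the true batch extremes are used. At inference, any input beyond the stored estimates lands outside that interval, and a subsequent Quad activation then amplifies the excess.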
The desirable properties of PI-friendly ReLU substitutions are: low multiplicative depth, stability over a sufficiently large range of activation values, and competitive performance when compared to networks with ReLUs. The Quad activation function has been considered a promising solution as its multiplicative depth is one and it exhibits stability for simple models and datasets, e.g., MNIST (gilad2016cryptonets). However, for deeper networks and larger datasets, the desired stability range of pre-activation values increases significantly (lee2021precise), and using Quad in this extended range results in imprecise approximations of ReLU and poor accuracy (chabanne2017privacy). Consequently, higher-degree polynomials are used for more accurate approximation, but they suffer from a higher multiplicative depth that results in additional computation (in SS/HE) and noise growth (HE), thus limiting their efficacy in practical settings.
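The relationship between polynomial degree and multiplicative depth can be made concrete with a short sketch. It assumes the usual counting argument that each level of sequential multiplication can at most double the degree, so reaching degree d needs at least ceil(log2(d)) levels; how a particular HE or SS scheme charges for multiplications by plaintext coefficients is not modeled here.

```python
def multiplicative_depth(degree):
    """Minimum number of sequential multiplication levels needed to reach x**degree.

    Equals ceil(log2(degree)), computed with integer arithmetic to avoid
    floating-point rounding at exact powers of two.
    """
    if degree < 1:
        raise ValueError("degree must be a positive integer")
    return (degree - 1).bit_length()


if __name__ == "__main__":
    for degree in (2, 3, 4, 6, 29):
        print(f"degree {degree}: depth {multiplicative_depth(degree)}")
```

Under this counting, Quad sits at depth one, while a degree-29 approximation needs several sequential levels per activation layer, and those levels accumulate across every layer of a deep network.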
To help mitigate these issues, we devised QuaIL, where each layer in the Quad network learns to mimic intermediate representations of a trained, all-ReLU network. QuaIL worked well for AlexNet and VGG11, on which polynomial regression under-performed; however, it did not scale to even deeper networks. To understand why, we dug deeper and found the issue still to be escaping activations. That is, some intermediate representation values still began to grow unbounded.
To mitigate escaping activations at training time, we bounded the pre-activation values using an ApproxMinMax normalization strategy, which achieved reasonable accuracies for all the networks except MobileNetV1 and ResNet32. However, since the maximum and minimum values were approximated at inference time, the approximation error grew in deeper layers and some activations began to explode (shown in Figure 2 as QuaIL+AMM). For a better understanding, we replaced the approximated min and max with the true min and max during inference (termed QuaIL+MM in Figure 2) and observed that the intermediate representation values were now similar to those of the all-ReLU baseline networks.
Fundamentally, the efficacy of using low-degree polynomials for deeper networks on complex datasets boils down to bounding input values to the polynomial activations in order to mitigate escaping activations, which requires exact calculations of both the minimum and maximum. However, the issue of calculating exact minimums and maximums brings us back full circle to the problem we were trying to solve: removing all ReLUs (which are defined using a maximum) to prevent the usage of garbled circuits in PI. We hope the insights gained from Sisyphus aid the PPML community in being mindful when using low-degree polynomial activations in PI-friendly networks.
This work was supported in part by the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA. This research was also developed with funding from the Defense Advanced Research Projects Agency (DARPA), under the Data Protection in Virtual Environments (DPRIVE) program, contract HR0011-21-9-0003. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.