MaskedNet: A Pathway for Secure Inference against Power Side-Channel Attacks

10/29/2019 · Anuj Dubey et al.

Differential Power Analysis (DPA) has been an active area of research for the past two decades, studying attacks that extract secret information from cryptographic implementations through power measurements, as well as their defenses. Unfortunately, the research on power side-channels has so far predominantly focused on analyzing implementations of ciphers such as AES, DES, RSA, and recently post-quantum cryptography primitives (e.g., lattices). Meanwhile, machine-learning, and in particular deep-learning, applications are becoming ubiquitous, with several scenarios where the Machine Learning Models are Intellectual Properties requiring confidentiality. The problem of extending side-channel analysis to Machine Learning Model extraction is largely unexplored. This paper extends the DPA framework to neural-network classifiers. First, it shows DPA attacks on classifiers that can extract the secret model parameters such as weights and biases of a neural network. Second, it proposes the first countermeasures against these attacks by adapting masking to neural-network inference. The resulting design uses novel masked components such as masked adder trees for fully-connected layers and masked Rectifier Linear Units for activation functions. On a SAKURA-X FPGA board, experiments show both the insecurity of an unprotected design and the security of our proposed protected design.


I Introduction

Footnote: This paper has been accepted with shepherding for HOST'20.

Since the seminal work on Differential Power Analysis (DPA) [kocher99], there has been an extensive amount of research on power side-channel analysis of cryptographic systems. Such research efforts typically focus on new ways to break into various implementations of cryptographic algorithms and on countermeasures to mitigate those attacks. While cryptography, where the assets are secret keys, is obviously an important target driving this research, it is not the only scenario where asset confidentiality is needed.

In fact, Machine Learning (ML) is a critical new target with several motivating scenarios to keep the internal ML model secret. ML models are considered trade secrets, e.g., in Machine-Learning-as-a-Service applications, because training ML models is difficult and because the model coefficients, such as the weights and biases of a neural network, embed private information. If leaked, the model, including its weights, biases, and hyper-parameters, can violate data privacy and intellectual property rights. Moreover, knowing the details of the ML classifier makes it more susceptible to adversarial ML attacks [Dalvi], and especially to test-time evasion attacks [ML-poisoning1, ML-poisoning2]. Finally, ML has also been touted to replace cryptographic primitives [kanter02]; under this scenario, learning the ML classifier details would be equivalent to extracting secrets from cryptographic implementations.

In this work, we extend the side-channel analysis framework to ML models. Specifically, we apply power-based side-channel attacks on a hardware implementation of a neural network and propose the first side-channel countermeasure. Fig. 1 shows the vulnerability of a Binarized Neural Network (BNN), an efficient network for IoT/edge devices with binary weights and activation values [Courbariaux]. Following the DPA methodology [brier04], the adversary makes hypotheses on 4 bits of the secret weight. For all 16 possible weight values, the adversary computes the corresponding power activity of an intermediate computation, which depends on the known input and the secret weight. This is repeated multiple times using random, known inputs. The correlation plots between the calculated power activities for the 16 guesses and the obtained power measurements reveal the value of the secret weights.

The figure shows that at the exact time instance where the targeted computation occurs, there is significant information leakage between the power measurements and the correct key guess. To extract other weights, the process can simply be repeated, reusing the same power measurements. Hence, implementations of ML Intellectual Properties are as susceptible to side-channel attacks as ciphers.

Fig. 1: Motivation of this work: DPA of the BNN hardware with 100k measurements. The green plot is the correlation trace for the correct 4-bit weight guess, which crosses the 99.99% confidence threshold, revealing the significant leak. The blue plot is for the 2's complement of the correct guess, which is an expected false positive of the targeted signed multiplication. The other 14 guesses do not show correlation.

Given this vulnerability, the primary objective of this paper is to propose a sound countermeasure for neural network inference against power-based side-channel attacks. A neural network inference is typically a sequence of repeated linear and non-linear operations, similar in essence to cryptographic algorithms, but it has unique computations such as row-reduction (i.e., weighted summation) operations and activation functions. Unlike the attack scenario, the defense exhibits challenges due to the presence of these operations in neural networks, which introduce an additional and subtle type of leak. To address the vulnerability, we propose a novel countermeasure that primarily uses the concepts of message blinding and secret sharing. This countermeasure style is called masking [coron00], an algorithm-level defense that can produce secure designs independent of the implementation technology [nikova06].

The main contributions of the paper include the following:

  • We demonstrate new attacks that can extract the secret weights of a BNN in a highly-parallelized hardware implementation.

  • We formulate and implement the first power-based side-channel countermeasures for neural networks by adapting masking to the case of neural networks. This process reveals new challenges and solutions to mask unique neural network computations that do not occur in the cryptographic domain.

  • We validate both the insecurity of the baseline design and the security of the masked design using power measurements of actual FPGA hardware, and quantify the overheads of the proposed countermeasure.

We note that while there is prior work on theoretical attacks [ML-extraction-theory1, ML-extraction-theory2, ML-extraction-theory3, ML-extraction-theory4, ML-extraction-theory5] and digital side-channels [hua18, ML-extraction-SCA2, ML-extraction-SCA3, ML-extraction-SCA4] of neural networks, their physical side-channels are largely unexplored. Such research is needed because physical side-channels are orthogonal to these threats, fundamental to Complementary Metal Oxide Semiconductor (CMOS) technology, and, as we have learned from research on cryptographic implementations, require extensive countermeasures. So far, there is only a single white paper recently published on model extraction via physical side-channels [batina18]. (After we submitted our work to HOST'20, that paper was published at USENIX'19 [batina19].) That work does not study mitigation techniques and focuses on 8-bit/32-bit microcontrollers; we, by contrast, analyze attacks on parallelized hardware accelerators and investigate the first countermeasures.

The rest of the paper is organized as follows. Section II explains the threat model that we consider for this work and discusses its relation to prior efforts. Section III gives the preliminary information related to neural networks, binarization, and the hardware design. Section IV analyzes the attack, where we describe the setup and the power model used for DPA in detail. Section V provides the architecture of the protected implementation of the complete neural network. Section VI validates the resilience of the masked designs using empirical side-channel tests. Section VII discusses orthogonal aspects and Section VIII concludes the paper.

II Threat Model and Relation to Prior Work

This work follows the typical DPA threat model [Kocher2011]. The adversary has physical access to the target device or has a remote monitor [schellenberg18, zhao18, ramesh18] and obtains power measurements while the device processes secret information. We also follow Kerckhoffs's principle; hence, the security of the system is not based on the secrecy of the software or hardware design. This includes the details of the neural network algorithm and its hardware implementation, such as the data flow, parallelization, and pipelining. In practice, those details are typically public, but what remains a secret is the model parameters obtained via training. If the implementation details are unknown, the engineering aspects of locating certain operations within a complex system have already been addressed in prior work, both in the context of physical [balasch15, eisenbarth10] and digital side-channels [zhang12, inci16], covering both hardware and software realizations. Likewise, reverse engineering logic functions from a bitstream [note08, benz12] is independent of our application and generic to any given system.

Fig. 2: The adversary’s goal is to extract the secret model parameters on the IoT/edge device during inference using side-channels.

Fig. 2 outlines the system we consider. The adversary in our model targets the ML inference with the objective of learning the secret model parameters. This differs from attacks on the training set [shokri17] or the data privacy problem during inference [wei18]. We assume the training phase is trusted, but the obtained model is then deployed to operate in an untrusted environment. Our attack is similar to a known-plaintext (input) attack and does not require knowing the inference output or confidence score, making it more powerful than theoretical ML extraction attacks [ML-extraction-theory1, ML-extraction-theory2, ML-extraction-theory3, ML-extraction-theory4, ML-extraction-theory5].

Since edge/IoT devices are the primary target of DPA attacks (due to easy physical access), we focus on BNNs, which are suitable for such constrained devices [Courbariaux]. A BNN also allows realizing the entire neural network on the FPGA without any external memory access. Therefore, memory-access-pattern side-channel attacks on the network [hua18] cannot be mounted. We furthermore consider digitally-hardened accelerator designs that execute in constant time and constant flow with no shared resources, disabling timing-based or other control-flow identification attacks [ML-extraction-SCA2, ML-extraction-SCA3, ML-extraction-SCA4]. This makes the attacks we consider more potent than prior work.

III BNN and the Target Implementation

The following subsections give a brief introduction to BNN and discuss the details of the target hardware implementation.

III-A Neural Network Classifiers

Neural networks consist of layers of neurons that take in an input vector and ultimately make a decision, e.g., for a classification problem. Neurons at each layer may have a different use, implementing linear or non-linear functions to make decisions, applying filters to process the data, or selecting specific patterns. Inspired by the human nervous system, neurons transmit information from one layer to the other, typically in a feed-forward manner.

Fig. 3 shows a simple network with two fully-connected hidden layers and depicts the function of a neuron. In a feed-forward neural network, each neuron takes the results from its connections in the previous layer, computes a weighted summation (row-reduction), adds a certain bias, and finally applies a non-linear transformation to compute its activation output. The resulting activation value is then used by the next layer's connected neurons in sequence. The connections between neurons can be strong, weak, or non-existent; the strength of a connection is called its weight, which is a critical parameter of the network. The entire neural network model can be represented with these parameters and with hyperparameters, which are high-level parameters such as the number or type of the layers.

Fig. 3: A simple neural network and a single neuron’s function.

Neural networks have two phases: training and inference. During training, the network self-tunes its parameters for the specific classification problem at hand. This is achieved by feeding pre-classified inputs to the network together with their classification results, and by allowing the network to converge into acceptable parameters (based on some conditions) that can compute correct output values. During inference, the network uses those parameters to classify new (unlabeled) inputs.

III-B BNNs

A BNN works with binary weights and activation values. This is our starting point, as the implementations of such networks have similarities with the implementations of block ciphers, except for the row-reduction step. A BNN reduces the memory size and converts a floating-point multiplication into a single-bit XNOR operation in the inference [XNORNet]. Therefore, such networks are suitable for constrained IoT nodes where some detection accuracy can be traded for efficiency. There are several variants of this low-cost approach to build neural networks with reasonably high accuracy [XNORNet, FINN, Courbariaux].

Fig. 4: Overview of the unprotected neural network inference. The adder tree first computes the weighted sum of input pixels. The activation function then binarizes the sum used by the next layer. Finally, the output layer returns the classification result by computing a maximum of last layer activations.

A neural network consists of units called neurons; equation (1) describes a typical BNN neuron:

$a_j = f\Big(\sum_i w_{ij}\, a_i^{\mathrm{prev}} + b_j\Big)$    (1)

where $a_j$ is the activation value, $w_{ij}$ is the weight, $a_i^{\mathrm{prev}}$ is the activation of the previous layer, and $b_j$ is the bias value for the node; the weights and activations are stored as binary values 0 and 1, respectively representing the actual values of -1 and +1. The function $f$ is the non-linear activation function (2), which in our case is defined as follows:

$f(x) = +1$ if $x \geq 0$, and $-1$ otherwise    (2)

Equations (1) and (2) show that the computation involves a summation of weighted products with binary weights, a bias offset, and an eventual binarization.
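To make equations (1) and (2) concrete, the following minimal Python sketch (our notation, purely behavioral and not the hardware implementation) evaluates one binarized neuron: weights and previous-layer activations stored as bits 0/1 represent the values -1/+1, the bias is an integer, and the sign function binarizes the weighted sum.

```python
def to_signed(bit):
    """Map the stored bit 0/1 to the represented value -1/+1."""
    return 1 if bit == 1 else -1

def binarized_neuron(weight_bits, act_bits, bias):
    """Evaluate Eq. (1) and (2) for a single BNN neuron.

    weight_bits, act_bits : lists of 0/1 bits representing -1/+1 values
    bias                  : integer bias (batch-normalization-free formulation)
    Returns the binary activation bit (1 for +1, 0 for -1).
    """
    weighted_sum = sum(to_signed(w) * to_signed(a)
                       for w, a in zip(weight_bits, act_bits)) + bias
    return 1 if weighted_sum >= 0 else 0   # sign activation, Eq. (2)

# Example: weights (+1,+1,-1,+1), all inputs +1, bias -1 -> sum = 1 -> activation +1
print(binarized_neuron([1, 1, 0, 1], [1, 1, 1, 1], -1))  # prints 1
```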

We built the hardware for this BNN inference, which performs Modified National Institute of Standards and Technology (MNIST) classification (handwritten digit recognition from 28-by-28-pixel images) using 3 feed-forward, fully-connected hidden layers of 1024 neurons. The implementation computes up to 1024 additions (i.e., the entire activation value of a neuron) in parallel.

III-C Unprotected Hardware Design

Fig. 4 illustrates the sequence of operations in the neural network. There are two main steps in one fully-connected layer: (1) calculating the weighted sums using an adder tree, and (2) applying the non-linear activation function f. To classify the image in the output layer, the hardware sends out the node number with the maximum sum; hence, there is no binarization in this layer. Likewise, there is no binarization of the input, which is an 8-bit integer between 0 and 255, representing a pixel value of the 28-by-28-pixel MNIST input images.

The input image pixels are first buffered or negated, based on whether the weight is 1 or 0 respectively, and then fed to the adder tree in Fig. 5. We have implemented a fully-pipelined adder tree of depth 10, as the hardware needs up to 1024 parallel additions to compute the sum. The activation function binarizes the final sum. After storing all the activation values of the first layer, the hardware proceeds to the second layer and repeats the same process using the 1024 activation values obtained from the first layer. The output layer, which consists of 10 nodes, computes the confidence of the image being each of the digits 0 to 9. The node index with the maximum confidence score becomes the output of the neural network.

There is a single adder tree in the hardware design, reused for each layer's computation, similar to a prior architecture [nurvitadhi16]. Hence, the hardware has a throughput of approximately 3000 cycles per image classification. The reuse is not directly feasible because the adder tree (Fig. 5) can only support 784 inputs (of 8 bits), but it receives 1024 outputs from each of the hidden layers. To circumvent this problem, an extra piece of logic converts the 1024 1-bit outputs from each hidden layer into 512 2-bit outputs using LUTs. These LUTs take the weights and activation values as input and produce the corresponding 512 2-bit sums, which is within the limits of the adder tree. Adding the bias and applying batch normalization are integrated into the adder tree computations. We adopt the batch-normalization-free approach [BNFree]; hence, the bias values are integers, unlike the weights, which are binary.
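A minimal Python model of this reuse trick is sketched below (illustrative only; the function names are ours, and the exact 2-bit encoding used by the hardware is abstracted away): pairs of 1-bit weighted activations are pre-summed so that 1024 hidden-layer outputs fit the 784-input adder tree, without changing the final sum.

```python
import random

def xnor(a, w):
    """BNN 'multiplication': XNOR of an activation bit and a weight bit (0/1 encode -1/+1)."""
    return 1 - (a ^ w)

def pair_weighted_activations(act_bits, weight_bits):
    """Fold 1024 1-bit weighted activations into 512 small signed pairwise sums
    (-2, 0, or +2); the real design packs these into narrow adder-tree inputs."""
    signed = [1 if xnor(a, w) else -1 for a, w in zip(act_bits, weight_bits)]
    return [signed[i] + signed[i + 1] for i in range(0, len(signed), 2)]

def adder_tree(values):
    """Behavioral model of the pipelined adder tree (one level per cycle);
    the input count is assumed to be a power of two."""
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

acts = [random.randint(0, 1) for _ in range(1024)]
wts  = [random.randint(0, 1) for _ in range(1024)]
# Pre-summing pairs halves the input count (1024 -> 512) without changing the result.
assert adder_tree(pair_weighted_activations(acts, wts)) == \
       sum(1 if xnor(a, w) else -1 for a, w in zip(acts, wts))
```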

Fig. 5: Adder tree used in the hardware implementation. The figure shows the scenario where the 2nd-stage registers (red) are targeted for DPA. This results in 16 possible key guesses corresponding to the 4 input pixels involved in the computation of each second-stage register, grouped by the dotted blue line.

IV An Example of DPA on BNN Hardware

This section describes the attack we performed on the BNN hardware implementation, which is able to extract the secret weights.

To carry out a power-based side-channel attack on an FPGA implementation, the adversary has to primarily focus on the switching activity of the registers, as they have a significant power consumption compared to combinational logic (especially in FPGAs). The pipeline registers of the adder tree store the intermediate summations of the products of weights and input pixels. Therefore, the value in these registers is directly correlated to the secret, i.e., the model weights in our case. Fig. 5 shows that, for two input pixels x1 and x2, there are only 4 possible values that can be loaded in the output register of stage 1: -x1-x2, -x1+x2, x1-x2, and x1+x2, corresponding to weight values of (0,0), (0,1), (1,0), and (1,1) respectively. The number of possible weight combinations depends on the targeted stage of the adder tree and grows exponentially with its depth. Next, we highlight the attack on the second stage.

Fig. 6: Pearson correlation coefficient versus time and number of traces for DPA on the weights. The lower plot shows a high correlation peak at the time of the target computation for the correct weight guess, denoted in green. The upper plot shows that approximately 40k traces are needed to cross the 99.99% confidence threshold for the correct guess. The confidence intervals are shown with dotted lines. The blue plot denotes the 2's complement of the correct weight guess.

Since the adder tree is pipelined, we need to create a model based on the Hamming distance between the previous-cycle and current-cycle summations in each register, given as:

$\mathrm{HD}(v_{\mathrm{prev}}, v_{\mathrm{curr}}) = \mathrm{HW}(v_{\mathrm{prev}} \oplus v_{\mathrm{curr}})$    (3)

where HW(x) denotes the Hamming weight of a value x. For this purpose, we developed a cycle-accurate Hamming-distance simulator, which takes the input pixels, makes hypotheses on the possible weights, computes the value in each register at every cycle, and finds the Hamming distances.
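The Python sketch below mirrors this flow for the second-stage registers of Fig. 5. It is a simplified, behavioral stand-in for our in-house simulator: the trace array, the pixel grouping, the assumed register width, and the assumption that the register's previous-cycle content is already known (e.g., a reset value or an already-recovered node) are illustrative choices, not details of the measured hardware.

```python
import numpy as np

def hw(x):
    """Hamming weight of a non-negative integer."""
    return bin(x).count("1")

def stage2_value(pixels, w4, width=11):
    """Hypothesized stage-2 register content for one group of 4 input pixels.
    w4 is the 4-bit weight guess: bit i = 1 keeps pixel i, bit i = 0 negates it."""
    s = sum(p if (w4 >> i) & 1 else -p for i, p in enumerate(pixels))
    return s & ((1 << width) - 1)          # register width is an assumption of this sketch

def cpa_on_weights(traces, pixels, prev_values):
    """traces      : (n_meas, n_samples) array of power measurements (hypothetical data)
    pixels      : (n_meas, 4) known input pixels feeding the target register
    prev_values : (n_meas,) previous-cycle register contents, assumed known
    Returns (best 4-bit weight guess, its peak absolute correlation)."""
    best = (None, 0.0)
    for guess in range(16):
        # Hamming-distance hypothesis of Eq. (3) for every measurement
        hyp = np.array([hw(int(pv) ^ stage2_value(px, guess))
                        for px, pv in zip(pixels, prev_values)], dtype=float)
        # Pearson correlation of the hypothesis against every time sample
        hyp_c = hyp - hyp.mean()
        tr_c  = traces - traces.mean(axis=0)
        rho   = hyp_c @ tr_c / (np.linalg.norm(hyp_c) * np.linalg.norm(tr_c, axis=0))
        peak  = float(np.max(np.abs(rho)))
        if peak > best[1]:
            best = (guess, peak)
    return best
```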

The hardware computes the first node activation of the hidden layer in the first cycle, using the weights corresponding to that node, and keeps computing the next node activations in subsequent cycles using the weights corresponding to each node. This means each register in the adder tree first loads the sum corresponding to the first node's weights, then the next, and so on. Therefore, the attacker first extracts the weights of the first node, and then uses those values to subsequently attack the next layer weights. The bias is added after computing the final sum in the 10th stage, before being sent to the activation function. Therefore, the adversary can attack this addition operation by creating a hypothesis for the bias. Another way to extract the bias is by attacking the activation function, since the sign of its output correlates with the bias.

Fig. 6 illustrates the result of the attack. There is a strong correlation between the correct key guess and the power measurements, which crosses the 99.99% confidence threshold after 45k measurements. The false positive leak is due to signed multiplication and is caused by the additive inverse of the correct key, which is expected and thus does not affect the attack. Using this approach, the attacker can successively extract the value of weights and biases for all the nodes in all the layers, starting from the first layer.

V Side-Channel Countermeasures

This section presents our novel countermeasure against side-channel attacks. The development of the countermeasure highlights unique challenges that arise for masking neural networks and describes the implementation of the entire neural network inference.

Masking works by making all intermediate computations independent of the secret key; i.e., rather than preventing the leak, it encumbers the adversary's ability to correlate the leak with the secret value. The advantage of masking is that it is implementation agnostic: it can be applied to any given circuit style (FPGA or ASIC) without manipulating back-end tools, but it requires algorithmic tuning, especially to mask unique non-linear computations.

The main idea of masking is to split the inputs of all key-dependent computations into two randomized shares: a one-time random mask and a one-time randomized masked value. These shares are then independently processed and are reconstituted at the final step when the final output is generated. This effectively thwarts first-order side-channel attacks probing a single intermediate computation. Higher-order attacks probing multiple computations [meserges2000] (i.e., masks and masked computations) can be further mitigated by splitting the inputs of key-dependent operations into more shares [akkar2003]. Our implementation is designed to be first-order secure but can likewise be extended to resist higher-order attacks.

Fig. 4 highlights that a typical neural network inference can be split into 3 distinct types of operations: (1) the adder tree computations, (2) the activation function, and (3) the output-layer max function. We need to mask all of these functions to make the network secure. All of these functions are unique to neural network inference. Hence, we need to construct novel masked architectures for them using the lessons learned from cryptographic side-channel research. We explain our approach in a bottom-up fashion by describing the masking of individual components first, and then presenting the entire hardware architecture.

V-A Masking the Adder Tree

Using the approach in Fig. 5, the adversary can attack any stage of the adder tree to extract the secret weights. Therefore, the countermeasure needs to break the correlation between the summations generated at each stage and the secret weights. We use the technique of message blinding to mask the input of the adder tree.

Blinding is a technique where the inputs are randomized before being sent to the target unit for computation. This prevents the adversary from knowing the actual inputs being processed, which is usually the basis for known-plaintext power-based side-channel attacks. Fig. 7 shows our approach, which uses this concept by splitting each of the 784 input pixels x_i into two arithmetic shares x_i - r_i and r_i, where each r_i is a unique 8-bit random number. These two shares are individually independent of the input pixel value, as r_i is a fresh random number never reused again for the same node. The adder tree can operate on each share branch individually due to additive homomorphism: it generates the two final summations, one per branch, such that their combination (i.e., addition) gives the original sum. Since the adder tree is reused for all layers, the hardware simply repeats the masking process for subsequent hidden layers using fresh randomness for each layer.
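A minimal Python sketch of this splitting and recombination follows (function names are ours; the adder tree is reduced to a plain sum, and weights are modeled as buffer/negate as in the unprotected design):

```python
import random

def split_into_shares(pixels, bits=8):
    """Split each pixel x into arithmetic shares (x - r, r) with a fresh 8-bit random r."""
    masks  = [random.randrange(1 << bits) for _ in pixels]
    masked = [x - r for x, r in zip(pixels, masks)]      # plain signed difference, no modulus
    return masked, masks

def weighted_tree_sum(values, weight_bits):
    """Stand-in for one adder-tree pass: negate (weight 0) or keep (weight 1), then sum."""
    return sum(v if w else -v for v, w in zip(values, weight_bits))

pixels  = [random.randrange(256) for _ in range(784)]
weights = [random.randint(0, 1) for _ in range(784)]
masked, masks = split_into_shares(pixels)

# Additive homomorphism: processing the two branches separately and adding the two
# final summations recovers exactly the unmasked weighted sum.
branch1 = weighted_tree_sum(masked, weights)
branch2 = weighted_tree_sum(masks,  weights)
assert branch1 + branch2 == weighted_tree_sum(pixels, weights)
```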

Fig. 7: Masking of the adder tree. Each input pixel, depicted in orange, is split into two arithmetic shares, depicted by the green and blue nodes, using unique random numbers (r_i). The masked adder tree computes each branch one at a time.

V-A1 A Unique and Fundamental Challenge for Arithmetic Masking of Neural Networks

The extension of arithmetic masking to the adder tree is unfortunately non-trivial due to differences in the fundamental assumptions. Arithmetic masking aims at decorrelating a variable x by splitting it into two statistically independent shares: (x - r) mod 2^k and r. The modulo operation in (x - r) mod 2^k is natural in cryptographic operations, since most of the primitives are based on finite fields. In a neural network, however, subtraction means computing the actual difference without any modulus. This introduces the notion of sign in numbers, which is absent in modulo arithmetic, and is the root cause of the problem. Next, we discuss this with an example.

Consider two uniformly distributed 8-bit unsigned numbers x and r. In a modulo subtraction, the result will be (x - r) mod 256, which is again an 8-bit unsigned number lying between 0 and 255. In an actual subtraction, however, the result will be x - r, which is a 9-bit signed number with the MSB being the sign bit.

Scenario             Positive   Negative
x < 128, r < 128     50%        50%
x ≥ 128, r < 128     100%       0%
x < 128, r ≥ 128     0%         100%
x ≥ 128, r ≥ 128     50%        50%
TABLE I: Probability of the masked share (x - r) being positive or negative

Table I lists the four possible scenarios of arithmetic masking based on the magnitudes of the two unsigned 8-bit values. In a perfect masking scheme, the probability of the masked share x - r being either positive or negative should be 50%, irrespective of the magnitude of the input x. Let us consider the case x ≥ 128, which has a probability of 50%. If r < 128, which also has a 50% probability, the masked share is always positive. Else, if r ≥ 128, the masked share can be positive or negative with equal probability due to the uniform distribution. Therefore, given x ≥ 128, the probability of the arithmetic share x - r being positive is 75% and being negative is 25%. Table I also lists the other case, x < 128, which results in a similar correlation between x and the sign of x - r. This shows a clear information leak through the sign bit of the arithmetic shares.

The discussed vulnerability does not occur in modulo arithmetic, as there is no sign bit; the modulo operation wraps the result around if it is out of bounds, to obey the closure property. Evaluating the correlation of r instead of x - r yields similar results. Likewise, shifting the range of r based on x, to uniformly distribute x - r between -128 and 127, would not resolve the problem and further introduces a bias in both shares.
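A small Monte Carlo check (our sketch, not part of the design) reproduces the bias of Table I: the sign of the masked share leaks whether the secret value is in the upper half of its range.

```python
import random

def sign_bias(trials=200_000):
    """Estimate P(x - r >= 0) conditioned on whether the secret x is 'small' or 'large'."""
    pos = {False: 0, True: 0}
    cnt = {False: 0, True: 0}
    for _ in range(trials):
        x = random.randrange(256)            # secret 8-bit value
        r = random.randrange(256)            # uniform 8-bit mask, fresh every time
        big = x >= 128
        cnt[big] += 1
        pos[big] += (x - r) >= 0             # sign of the masked share, no modular wrap-around
    return {("x >= 128" if k else "x < 128"): pos[k] / cnt[k] for k in pos}

# Roughly {'x < 128': 0.25, 'x >= 128': 0.75}: the sign of the share leaks whether x is large,
# in line with Table I; with a modular reduction both probabilities would be close to 0.5.
print(sign_bias())
```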

V-A2 Addressing the Vulnerability with Hiding

The arithmetic masking scheme needs to be augmented in the context of neural networks to decorrelate the sign bit from the input. We use hiding to address this problem, and we use it only for the sign-bit computation. Hiding techniques aim at constant power consumption, irrespective of the inputs, which makes it harder for an attacker to correlate the intermediate variables. Power-equalized building blocks, such as those built with Wave Dynamic Differential Logic (WDDL) [WDDL], consume constant power and can mitigate the vulnerability.

Fig. 8: Differential NAND gate

The differential part of WDDL circuits helps keep the power consumption constant throughout the operation by generating the complementary output of each gate along with the original output. Fig. 8 gives an example of a differential NAND gate. Differential logic makes it difficult for an attacker to distinguish between a 0-to-1 and a 1-to-0 transition; however, an attacker can still distinguish between a 0-to-0 and a 0-to-1 transition, or between a 1-to-1 and a 1-to-0 transition. Therefore, differential logic alone is still susceptible to side-channel leakage, as the power activity is correlated with the input switching pattern. This is handled using dynamic logic, where all gates are pre-charged to 0 before the actual computation. This makes the circuit's switching activity independent of the input switching pattern. The WDDL OR gate and NOT gate can be constructed similarly, and these 3 gates can universally form arbitrary functions.
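The toy Python model below (ours, for intuition only) counts output-rail toggles of a differential NAND gate with and without a precharge phase; only the precharged, dynamic version makes the switching activity independent of the input pattern.

```python
def differential_nand(a, b):
    """True and complementary output rails of the differential NAND of Fig. 8."""
    return 1 - (a & b), a & b

def transitions(inputs, precharge):
    """Count output-rail toggles over a sequence of (a, b) evaluations.
    With precharge, both rails are driven back to 0 before every evaluation (dynamic logic)."""
    count, state = 0, (0, 0)
    for a, b in inputs:
        if precharge:
            count += sum(s != 0 for s in state)          # discharge (precharge) phase
            state = (0, 0)
        new = differential_nand(a, b)
        count += sum(s != n for s, n in zip(state, new)) # evaluation phase
        state = new
    return count

same = [(1, 1)] * 3                 # output pair stays (0, 1) every cycle
flip = [(1, 1), (0, 0), (1, 1)]     # output pair alternates between (0, 1) and (1, 0)
print(transitions(same, False), transitions(flip, False))   # 1 vs 5: leaks the input pattern
print(transitions(same, True),  transitions(flip, True))    # 5 vs 5: constant switching activity
```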

We use the WDDL gates to solve our problem of sign-bit leakage by modifying the adders to compute the sign bit in WDDL style. Below is the derivation of the sign bit when two 8-bit signed numbers A and B, represented as $a_7 a_6 \ldots a_0$ and $b_7 b_6 \ldots b_0$, are added to give a 9-bit signed sum represented by $s_8 s_7 \ldots s_0$:

$s_8 s_7 \ldots s_0 = A + B$    (4)

After sign-extending A and B,

$s_8 s_7 \ldots s_0 = a_7 a_7 a_6 \ldots a_0 + b_7 b_7 b_6 \ldots b_0$    (5)

Performing a regular addition on the eight bits of A and B, and generating a carry $c_8$ into the sign position, the equation of $s_8$ becomes

$s_8 = a_7 \oplus b_7 \oplus c_8$    (6)

Expanding the above expression in terms of AND, OR and NOT operators results in:

$s_8 = \bar{a}_7\bar{b}_7 c_8 + \bar{a}_7 b_7 \bar{c}_8 + a_7\bar{b}_7\bar{c}_8 + a_7 b_7 c_8$    (7)

Representing the expression only in terms of NAND, so that we can replace all the NANDs by WDDL NAND gates, reveals:

$s_8 = \overline{\;\overline{\bar{a}_7\bar{b}_7 c_8}\cdot\overline{\bar{a}_7 b_7 \bar{c}_8}\cdot\overline{a_7\bar{b}_7\bar{c}_8}\cdot\overline{a_7 b_7 c_8}\;}$    (8)
Fig. 9: Circuit diagram of the proposed adder with the MSB computed in WDDL style as described in Eq. (4)-(8). Each of the 784 arithmetic shares (x_i - r_i and r_i) is fed as input to these adders, which constitute the secure adder tree. All the bits except the MSB go via a regular adder. The MSBs of the two operands, along with the generated carries, are fed to the Differential MSB Logic block, which computes the MSB and its complement by replacing the NAND gates in Eq. (8) with the WDDL gates shown in Fig. 8. The pipeline registers in the tree are replaced by SDDL registers. The NOR gates generate the pre-charge wave at the start of the logic cones.

Fig. 9 depicts the circuit diagram for the above implementation. The WDDL technique is applied to the MSB computation by replacing each NAND function in Eq. (8) with the WDDL NAND gates shown in Fig. 8. The pipeline registers of the adder tree are replaced by Simple Dynamic Differential Logic (SDDL) registers [WDDL]. Each WDDL adder outputs the actual sum and the complement of its MSB, which go as inputs to the WDDL adder in the next stage of the pipelined tree. Therefore, we construct a resilient adder tree mitigating the leakage in the sign bit.
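As a sanity check of the derivation (which is our reconstruction of Eq. (4)-(8)), the Python snippet below exhaustively verifies that the NAND-only expression of Eq. (8) reproduces the sign bit of the 9-bit sum for every pair of 8-bit signed operands; it models functionality only, not the power behavior of the WDDL gates.

```python
def nand(*bits):
    """Multi-input NAND (each WDDL NAND of Fig. 8 also provides the complementary rail)."""
    out = 1
    for b in bits:
        out &= b
    return 1 - out

def msb_wddl(a, b):
    """Sign bit s8 computed as in Eq. (4)-(8) from 8-bit signed operands a and b."""
    a7, b7 = (a >> 7) & 1, (b >> 7) & 1
    c8 = (((a & 0xFF) + (b & 0xFF)) >> 8) & 1            # carry out of the regular 8-bit adder
    na7, nb7, nc8 = 1 - a7, 1 - b7, 1 - c8                # complements (free in differential logic)
    # Eq. (8): s8 = NAND( NAND(~a7,~b7,c8), NAND(~a7,b7,~c8), NAND(a7,~b7,~c8), NAND(a7,b7,c8) )
    return nand(nand(na7, nb7, c8), nand(na7, b7, nc8),
                nand(a7, nb7, nc8), nand(a7, b7, c8))

# Exhaustive check against the sign of the true 9-bit sum
for a in range(-128, 128):
    for b in range(-128, 128):
        assert msb_wddl(a, b) == (1 if (a + b) < 0 else 0)
```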

V-B Masking the Activation Function

Neural networks use a non-linear activation function, such as a Rectified Linear Unit (ReLU), to help model a complicated response that varies non-linearly with its explanatory variables. Since the activation values in a BNN are binary, a binary sign function (Eq. 2) is a good choice: it generates +1 if the weighted sum is positive and -1 if the sum is negative. In the unmasked implementation, the sign function receives the weighted sum of the 784 original input pixels, whereas in the masked implementation it receives the two weighted sums corresponding to the masked shares. It therefore has to compute the sign of the sum of the two shares without actually adding them. Using the fact that the sign only depends on the MSB of the final sum, we propose a novel masked sign function that sequentially computes and propagates masked carry bits in a ripple-carry fashion.

Fig. 10 shows the details of our proposed masked sign function hardware. This circuit generates the first masked carry using a look-up table (LUT) that takes in the LSBs of both shares and a random bit (r_0) to ensure the randomization of the intermediate state, similar in style to prior works on masked LUT designs [reparaz15]. The LUT computes the masked carry function with the random input and generates two outputs: one is the bypass of the random value (r_0) and the other is the masked output (c xor r_0), where c is the carry output. The entire function for each output fits into a single LUT to reduce the effects of glitches [nikova06]. We furthermore take the usual care in masking and store each LUT output in a flip-flop. These choices are validated empirically in Section VI.

The outputs of an LUT are sent to the next LUT in the chain, and the next masked carry is computed accordingly. From the second LUT onward, each LUT also takes the masked carry and the mask value generated in the prior step. The mask output is simply the fresh random input passed through, like a forward bypass, because that mask value is also needed for the next computation. This way, the circuit processes all the bits of the shares and finally outputs the last carry bit, which decides the sign of the sum. Each LUT computation is masked by a fresh random number. More efficient designs may also be possible using a masked Kogge-Stone adder (e.g., by modifying the ones described in [Schneider_arithmeticaddition]).
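A behavioral Python sketch of this masked binarizer follows. It is a functional model of Fig. 10, not the RTL: the function names, the 19-bit width parameter, and the flat loop (instead of the pipelined LUT chain) are our illustrative choices, and the single-function LUT mapping is only mimicked by keeping the unmasking inside one helper.

```python
import random

def masked_carry_lut(a, b, masked_c, r_prev, r_fresh):
    """One carry LUT of Fig. 10: takes one bit of each share plus the masked carry-in and
    its mask, and emits (masked carry-out, bypassed fresh mask). The unmasking of the
    incoming carry happens only inside this single LUT, mirroring the glitch-aware mapping."""
    c = masked_c ^ r_prev                        # recover the carry-in inside the LUT
    c_out = (a & b) | (a & c) | (b & c)          # carry of a + b + c_in
    return c_out ^ r_fresh, r_fresh

def masked_sign(s1, s2, width=19):
    """Binarize sign(s1 + s2) without ever adding the two arithmetic shares in the clear.
    Returns 1 for a non-negative sum and 0 for a negative sum (the BNN activation bit)."""
    bit = lambda v, i: (v >> i) & 1              # two's-complement bit extraction
    r = random.getrandbits(1)
    masked_c = (bit(s1, 0) & bit(s2, 0)) ^ r     # first (3-input) LUT: no carry-in
    for i in range(1, width - 1):
        masked_c, r = masked_carry_lut(bit(s1, i), bit(s2, i),
                                       masked_c, r, random.getrandbits(1))
    # Final LUT: the sign bit is MSB(s1) xor MSB(s2) xor carry-in; it is unmasked and
    # binarized only at this last step.
    sign = bit(s1, width - 1) ^ bit(s2, width - 1) ^ masked_c ^ r
    return 1 - sign

# Functional check against a direct (insecure) recombination of the shares
for _ in range(1000):
    total = random.randrange(-2**17, 2**17)      # the secret weighted sum
    mask  = random.randrange(-2**17, 2**17)      # arithmetic mask
    assert masked_sign(total - mask, mask) == (1 if total >= 0 else 0)
```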

Fig. 10: Hardware design of the masked binarizer. It comprises a chain of LUTs (lut0-lut18), denoted in orange, computing the carry in a ripple-carry fashion. Each LUT is masked by a fresh random number (r_i). The whole design is fully pipelined to maintain the original throughput by adding flip-flops (in green) at each stage.

Fig. 10 illustrates that the first LUT is a 3-bit input 2-bit output LUT because there is no carry-in for LSB, and all the subsequent LUTs have 5-bit inputs and 2-bit outputs since they also need previous stage outputs as their inputs. After the final carry is generated, which is also the sign bit of the sum, the hardware can binarize the sum to 1 or 0 based on whether the sign bit is 0 or 1 respectively. This is the functionality of the final LUT, which is different from the usual masked carry generators in the earlier stages.

The circuit has 19 LUTs in series; each LUT output is registered for timing and for side-channel resilience against glitches. This design, however, adds a latency of 19 cycles to compute each activation value, decreasing the original throughput. Therefore, instead of streaming each of the 19 bits to the top row of LUTs sequentially in Fig. 10, the entire 19-bit sum is registered in the first stage, and each bit is used sequentially throughout the 19 cycles. This avoids the 19-cycle wait time for consecutive sums and brings the throughput back to 1 activation per cycle.

Fig. 11: Different components of the fully masked BNN. The weights (1-4) correspond to the ROMs containing weights for each layer.

V-C Boolean to Arithmetic Share Conversion

Each layer generates 1024 pairs of Boolean shares, which requires two changes in the hardware. First, the adder tree supports 784 inputs and therefore cannot directly process 1024 shares. Second, the activation values are in the form of two Boolean shares, while the masking of the adder tree requires arithmetic shares, as discussed in Section V-A. Using the same strategy as in the unmasked design, the hardware adds the 1024 1-bit shares pairwise to produce 512 2-bit shares before sending them to the adder tree. To perform the Boolean-to-arithmetic conversion, for each pair of weighted activations with signed sum a_i, the hardware needs to generate two arithmetic shares S1_i and S2_i such that

S1_i = a_i - r_i    (9)
S2_i = r_i    (10)

where r_i is a fresh 2-bit signed random number.

Using masked LUTs, the hardware performs the signed addition of the 1024 shares into 512 shares, and it also produces the arithmetic shares. The LUTs take in two consecutive activation values, already multiplied by the corresponding weights, and a 2-bit signed random number to generate the arithmetic shares. Since multiplication in a binary neural network translates to an XNOR operation [XNORNet], the hardware XNORs the activation value with its corresponding weight before sending it to the LUT. Since the activation value is in the form of two Boolean shares, the hardware only performs the XNOR on one of the shares, as formulated below:

$a_i \odot w_i = \overline{a_i \oplus w_i}$    (11)
$a_i = a^{1}_i \oplus a^{2}_i$    (12)
$a_i \odot w_i = \overline{a^{1}_i \oplus w_i} \oplus a^{2}_i$    (13)

where $\odot$ denotes XNOR and $a^{1}_i$, $a^{2}_i$ are the two Boolean shares of the activation $a_i$. There are a total of five inputs to each LUT: two shares that are not XNORed, two shares that are XNORed, and a 2-bit signed random number. If the actual sum of the two consecutive nodes is a_i, then the LUT outputs r_i and a_i - r_i. The mask r_i ranges from -2 to +1 since it is a 2-bit signed number, and the weighted sum of two nodes ranges from -2 to +2; from this we can see that a_i - r_i can range from -3 to +4 and hence should be 4 bits wide. We have 512 of these conversion LUTs that convert the 1024 pairs of Boolean shares into 512 pairs of arithmetic shares, and from there we can repeat the same adder-tree masking that was described. The arithmetic shares will have a leakage in the MSB as discussed in Section V-A1, but since we reuse the WDDL-style adder tree in our design for all layers, this is addressed for all subsequent layers.
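A behavioral Python sketch of one conversion LUT is shown below (our illustration; the real design realizes this as a single masked LUT and never exposes the intermediate recombination on a wire):

```python
import random

def xnor(x, y):
    return 1 - (x ^ y)

def to_pm1(bit):
    return 1 if bit else -1

def bool2arith_lut(sh1_a, sh2_a, sh1_b, sh2_b, r):
    """One conversion LUT: two consecutive weight-multiplied activations arrive as Boolean
    share pairs; r is a 2-bit signed random number in [-2, 1]. Returns the arithmetic
    shares (a - r, r) of their signed sum a. Recombination happens only inside the LUT."""
    a = to_pm1(sh1_a ^ sh2_a) + to_pm1(sh1_b ^ sh2_b)
    return a - r, r

# End-to-end check for one pair of consecutive nodes
act_a, act_b = random.randint(0, 1), random.randint(0, 1)      # true activation bits
w_a, w_b     = random.randint(0, 1), random.randint(0, 1)      # secret weight bits
m_a, m_b     = random.getrandbits(1), random.getrandbits(1)    # Boolean masks of the activations
# Eq. (11)-(13): the weight is XNORed into only ONE of the two Boolean shares
sh1_a, sh2_a = xnor(act_a ^ m_a, w_a), m_a
sh1_b, sh2_b = xnor(act_b ^ m_b, w_b), m_b
r = random.randrange(-2, 2)                                     # 2-bit signed random number
s1, s2 = bool2arith_lut(sh1_a, sh2_a, sh1_b, sh2_b, r)
assert s1 + s2 == to_pm1(xnor(act_a, w_a)) + to_pm1(xnor(act_b, w_b))
```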

V-D Output Layer

In the output layer, for an unmasked design, the index of the node with the maximum confidence score is generated as the output of the neural network. In the masked case, however, the confidence values are split into two arithmetic shares, which must not be recombined. Equations (14)-(16) formulate the masking operations of the output layer. Basically, we check if the sum of two numbers is greater than the sum of another two numbers, without looking at the terms of the same sum at the same time. Therefore, instead of adding the two shares of each confidence value and comparing them, we subtract one share of a confidence value from a share of the other confidence value. In this way, we still solve the inequality, but only look at shares of two different confidence scores.

$c_1 > c_2 \iff (S^{1}_1 + S^{2}_1) > (S^{1}_2 + S^{2}_2)$    (14)
$\iff (S^{1}_1 + S^{2}_1) - (S^{1}_2 + S^{2}_2) > 0$    (15)
$\iff (S^{1}_1 - S^{2}_2) + (S^{2}_1 - S^{1}_2) > 0$    (16)

where $c_1$ and $c_2$ are two confidence scores and $(S^{1}_j, S^{2}_j)$ are the arithmetic shares of $c_j$.

This simplifies the original problem to the previous problem of finding the sign of the sum of two numbers without combining them. Hence, in the final layer computation, the hardware reuses the masked carry generator explained in Section V-B.
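The Python sketch below illustrates this cross-subtraction. The sign_of_sum helper stands in for the masked carry generator of Section V-B, and the running-maximum loop and function names are our illustration rather than the exact hardware schedule.

```python
import random

def sign_of_sum(u, v):
    """Stand-in for the masked carry generator of Section V-B: returns 1 iff u + v >= 0.
    The hardware computes this without ever adding the two operands in the clear."""
    return 1 if u + v >= 0 else 0

def masked_greater_equal(c1_shares, c2_shares):
    """Compare shared confidences c1 = a1 + b1 and c2 = a2 + b2 as in Eq. (14)-(16):
    shares of the same score are never combined; instead we cross-subtract them."""
    a1, b1 = c1_shares
    a2, b2 = c2_shares
    return sign_of_sum(a1 - b2, b1 - a2)         # sign of (c1 - c2)

def masked_argmax(shared_scores):
    """Running maximum over the output nodes using only share-wise comparisons."""
    best = 0
    for i in range(1, len(shared_scores)):
        if masked_greater_equal(shared_scores[i], shared_scores[best]):
            best = i
    return best

scores = random.sample(range(-500, 500), 10)     # distinct confidence scores of the 10 nodes
shared = []
for s in scores:
    r = random.randrange(-500, 500)
    shared.append((s - r, r))                    # arithmetic shares per output node
assert masked_argmax(shared) == scores.index(max(scores))
```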

V-E The Entire Inference Engine

Fig. 11 demonstrates all components in the masked neural network. The secure network splits the adder tree computations into two phases, governed by the operand select signal of the MUX that feeds the adder tree. In the first phase, the hardware accumulates the summations in a buffer, and in the next phase it starts the masked carry generator by feeding it both summations. Since each layer uses the previous layer's results, they are stored in separate memories, depicted as L1 share, L2 share1, and so on. Once all the activations of a layer are computed, they go through a series of XNORs and bool2arith LUTs. The bool2arith LUTs convert the Boolean shares to arithmetic shares and also reduce the number of inputs from 1024 to 512 to fit in the adder tree. The index of the maximum activation value at the final layer is sent out as the output of the inference.

Fig. 12: Side-channel evaluation tests. First-order analysis on the unmasked design (PRNG off) shows that it leaks information while the masked design (PRNG on) is secure. A second-order analysis on the masked design using centered squared product, leaks information as expected, as the design only provides first-order security for now.

VI Leakage and Overhead Evaluation

This section describes the measurement setup for our experiments, the evaluation methodology used to validate the security of the unmasked and masked implementations, and the corresponding results.

VI-A Hardware Setup

Our evaluation setup used the SAKURA-X board [sakurax], which includes a Xilinx Kintex-7 (XC7K160T-1FBGC) FPGA for processing and enables measuring the voltage drop on a 50 Ω shunt resistance while making use of the on-board amplifiers to measure the FPGA power consumption. The clock frequency of the design was 24 MHz. We used a Picoscope 3206D oscilloscope to take measurements with the sampling frequency set to 250 MHz. To amplify the output of the Kintex-7 FPGA, we used a low-noise amplifier provided by Riscure (HD24248) along with the current probe setup.

VI-B Side-Channel Test Methodology

Our leakage evaluation methodology is built on prior test efforts for cryptographic implementations [reparaz-mask-16, balasch15, reparaz15]. We performed DPA on the 3 main operations of the inference engine stated before, viz. the adder tree, the activation function, and the output-layer max function. Pseudo-Random Number Generators (PRNGs) produced the random numbers required for masking; any cryptographically-secure PRNG can be employed to this end. We first show first-order DPA weight-recovery attacks on the masked implementation with the PRNG disabled. With the PRNG off, the design's security is equivalent to that of an unmasked design. We illustrate that such a design leaked information for all three operations, which ensured that our measurement setup and recovery code were sound. Next, we turned on the PRNG and performed the same attacks, which failed for all the operations. We further performed a second-order attack to show that we used a sufficient number of traces in the first-order analysis. The power model was based on the Hamming distance of registers, generated using our in-house HD simulator for the neural network, and the tests used the Pearson correlation coefficient to compare the measurement data with the hypotheses.

VI-C Attacks with PRNG off

The PRNG generates the arithmetic shares for the adder tree and feeds the masked LUTs of the activation function and the Boolean-to-arithmetic converters. Turning the PRNG off unmasks all of these operations, making a first-order attack successful at all of these points. Fig. 12 shows the mean power plot on top for orientation, followed below by the 3 attack plots with the PRNG disabled. We attacked the second stage of the adder tree in the second plot, and the first node activation of the first hidden layer and the first output node of the output layer in the next two plots. In all the plots, we can observe a distinct correlation peak for the targeted variable corresponding to the correct weight and bias values. Fig. 13 shows the details of the successful attack. This validates our claim on the vulnerability of the baseline, unprotected design.

VI-D First-Order Attacks with PRNG on

We used the same attack vectors as in the PRNG-off case, but with the PRNG turned on this time. This arms all the countermeasures we have implemented for each operation. The bottom three plots of Fig. 12 show that the distinct peaks seen in the unmasked plots no longer show up with the random numbers added to the design. Fig. 13 shows the evolution of the correlation coefficient as the number of traces increases. The first-order attack, which succeeded with around 200 traces when the PRNG was off, remains unsuccessful even at 100k traces with the PRNG on. This validates our claim on the security of the masked, protected design.

Fig. 13: Evolution of the Pearson coefficient at the point of leak with the number of traces for first-order attacks when the PRNGs are disabled (left), when the PRNGs are enabled (middle), and for second-order attacks with the PRNGs enabled (right). The first-order attack with the PRNGs disabled is successful at around 200 traces but is unsuccessful with the PRNGs enabled even at 100k traces, which shows that the design is masked successfully. The second-order attack becomes successful at around 3.7k traces, which confirms that a sufficient number of traces was used in the first-order analysis.

VI-E Second-Order Attacks with PRNG on

To demonstrate that we used a sufficient number of traces in the first-order attack, we also performed a second-order DPA on the activation function. Again, we used the same attack vectors as in the first-order analysis, but applied the attack on centered-squared traces this time. We observed a distinct correlation peak of value -0.06 at the correct time position, making the attack successful this time. Fig. 13 shows the evolution of the correlation coefficient for the second-order attack; the attack becomes successful at around 3.7k traces. This confirms that 100k traces are sufficient for the first-order analysis, as the second-order attack already succeeds at roughly 3.7k traces.

VI-F Masking Overheads

Table II summarizes the area and latency overheads of masking in our case. As expected, due to the sequential processing of the two-share masked implementation, the latency of the inference is approximately doubled, from 3192 cycles for the baseline to 7248 cycles for the masked design. This is due to splitting the computations into two sequential phases, one per share. Table II also compares the area utilization of the unmasked and masked implementations in terms of the various blocks present in the FPGA. The increase in the number of LUTs, flip-flops, and BRAMs in the design is approximately 2.7x, 1.7x, and 1.3x, respectively. The significant increase in the number of LUTs is mainly due to the masked LUTs used to mask the activation function and to convert the Boolean shares of each layer to arithmetic shares. Regarding the increase in flip-flop and BRAM utilization, there are numerous additional storage structures in the masked implementation. The randomness needed to mask the various parts of the inference engine is buffered at the start of the operation. Furthermore, the arithmetic masks are buffered in the first phase, to be sent together to the masked activation function later. Each layer also stores twice the number of activations, in the form of two Boolean shares.

Design Type LUT/FF BRAM/DSP Cycles
Unmasked 20296/18733 81/0 3192
Masked 55508/33290 111/0 7248
TABLE II: Area and Latency Comparison of unmasked vs. masked implementations

VII Discussions

This section discusses orthogonal aspects and comments on how they can complement our proposed effort.

VII-A Masking the Sign Bit

We have addressed the leakage in the sign bit of the arithmetic share generation of the adder tree through hiding. This is the only part of our hardware design that is not masked and hence is potentially vulnerable without proper back-end modifications. We highlight this issue as an open problem, which can arguably be addressed through extensions of gate-level masking. But such an implementation would incur significant overheads on top of what we already show. More research is needed to design efficient masking components for neural-network-specific computations and to investigate the security against more advanced (e.g., horizontal) side-channel attacks.

VII-B Comparison of Theoretical Attacks, Digital Side-Channels, and Physical Side-Channels

We argue that a direct comparison of the effectiveness of physical side-channels to that of digital side-channels and theoretical attacks (in terms of number of queries) is currently unfair, due to the immaturity of the model-extraction field and the different target algorithms. The analysis and countering of theoretical attacks improve drastically over time. This has already occurred in cryptography: algorithms have won [kocher-talk]. Indeed, there has been no major breakthrough in the cryptanalysis of the encryption standards widely used today, but side-channel attacks are still commonplace and are even of growing importance. While digital side-channels are imminent, they are relatively easier to address in application-specific hardware accelerators/IPs that enforce constant-time and constant-flow behavior (as opposed to general-purpose architectures that execute software). For example, the hardware designs we provide in this work have no digital side-channel leaks. Physical side-channels, by contrast, are still observed in such hardware due to their data-dependent, low-level nature, and therefore require more involved mitigation techniques.

VII-C Scaling to Other Neural Networks

The objective of this paper is to provide the first proof-of-concept of both power side-channel attacks on and defenses of neural networks in hardware. To this end, we have designed a neural network that encompasses all the basic features of a binarized neural network, such as binarized weights and activations and the commonly used sign activation function. When extended to other neural networks, the proposed defenses will roughly scale linearly with the node count, layer count, and bit-precision (size) of the neurons. The scaling of the attack depends on the level of parallelization in hardware. Any algorithm, independent of its complexity, can be attacked with physical side-channels. In fact, in a sequential design, increasing the weight size (e.g., moving from one bit to 8 bits or floating point) may improve the attack because there is more signal to correlate.

VIII Conclusion and Future Work

Physical side-channel leaks in neural networks call for a new line of research within side-channel analysis because they open up a new avenue for designing countermeasures tailored to deep-learning inference engines. This work provides a pathway. It articulates the first effort in mitigating side-channel leaks in neural networks. It primarily applies masking-style techniques and demonstrates the new challenges that arise from the unique topological and arithmetic needs of neural networks, including the formulation of an open problem on signed arithmetic share generation.

Applying masking is a delicate task. Even after two decades of research, there is still an increasing number of publications on masking the well-established AES standard (e.g., three papers at CHES 2019 [DeMeyer19, sugawara19, cassiers19]). Given the variety in neural networks with no existing standard and the apparent struggle for masking, there is a critical need to heavily invest into securing deep learning frameworks.

An immediate extension of this work is adapting masking to more complex networks (such as 8-bit fixed-point or floating-point networks) and analyzing the feasibility of attacks on their parallelized hardware implementations. The work presented in this paper can also be extended to design higher-order masking schemes for a neural network.

References