1. Introduction
Physical side-channel attacks pose a major threat to the security of cryptographic devices. Attacks like Differential Power Analysis (DPA) (C:KocJafJun99) can extract secret keys by exploiting the inherent correlation between the secret-key-dependent data being processed and the Complementary Metal Oxide Semiconductor (CMOS) power consumption (das2019stellar). DPA has been shown to be effective against many cryptographic implementations over the last two decades (mangard2008power; TCHES:PSKH18; chen2015differential). Until recently, these attacks were confined to cryptographic schemes. Lately, however, Machine Learning (ML) applications have been shown to be vulnerable to physical side-channel attacks (batina2018csi; dubey2019maskednet; yudeepem), where an adversary aims to reverse engineer the ML model. Indeed, these models are lucrative targets: they are costly to develop and hence become valuable IPs for the companies that own them (strubell2019energy). Knowledge of the model parameters also makes it easy to fool the model using adversarial learning, which is a serious problem if the model performs a critical task like fraud/spam detection (advlearn05).
Unfortunately, most of the existing work on the physical side-channel analysis of ML accelerators has focused only on attacks, not defenses. To date, there are three publications focusing specifically on the power/EM side-channel leakage of ML models. The first two discuss some countermeasures, like shuffling and masking, but do not implement any (batina2018csi; yudeepem). The third implements a hybrid of masking- and hiding-based countermeasures and exposes the vulnerability in the arithmetic masking of integers due to leakage in the sign bit (dubey2019maskednet).
Masking uses secure multi-party computation to split the secrets into random shares and to decorrelate the statistical relation of secret-dependent computations to side-channels. Although similar work on cryptographic hardware has been fully masked (BGK05), the earlier work on neural network hardware was only partially masked for cost-effectiveness (dubey2019maskednet), while the leakage in the sign bit was hidden. Such solutions may work well for a regular IP where reasonable security at low cost is sufficient. However, full masking is a better alternative for IPs deployed in critical applications (like defense) that require stronger defenses against side-channel attacks.
In this work, we propose the design of the first fully-masked neural network accelerator resistant against power-based side-channel attacks. We construct novel masked primitives for the various linear and non-linear operations in a neural network using gate-level Boolean masking and masked lookup tables (LUTs). We analyze neural-network-specific computations, like the weighted summations in fully-connected layers, and build specialized masked adders and multiplexers to perform those operations in a secure way. We also design novel hardware that finds the greatest integer out of a set of integers in a masked fashion, which is needed in the output layer to find the node with the highest confidence score.
We target an area-optimized Binarized Neural Network (BNN) in our work because BNNs are a preferred choice for edge-based neural network inference (umuroglu2017finn; rastegari2016xnor). We optimize the hardware design to reduce the impact of masking on the performance. Specifically, we build an innovative adder-accumulator architecture that provides a throughput of one addition per cycle even with a multi-cycle masked adder with feedback. We maximize the number of balanced datapaths in the masking elements by adding registers at every stage to synchronize the arrival of signals and reduce the effects of glitches (CHES:ManPraOsw05). We build the masked design in a modular fashion, starting from smaller blocks like Trichina's AND gates to finally build larger structures like the 20-bit masked full adder. We have pipelined the full design to maintain a high throughput. Finally, we implement both the baseline unmasked and the proposed first-order secure masked neural network design on an FPGA. We use the standard TVLA methodology (becker2013test) to evaluate the first-order security of the design and demonstrate no leakage up to 2M traces. The latency of the masked implementation is merely 3.5% higher than the latency of the unmasked implementation. The area of the masked design is 5.9× that of the unmasked design. Our goal in this paper is to provide the first fully-masked design, where we propose certain optimizations and a practical evaluation of security. We also discuss potential further optimizations and extensions of masking for hardware design and security refinements.
2. Threat Model
We adopt the standard DPA threat model in which an adversary has direct physical access to the target device running inference (batina2018csi; dubey2019maskednet; kocher2011introduction), or can obtain power measurements remotely (zhao2018fpga) when the device executes neural network computations. The adversary can control the inputs and observe the corresponding outputs from the device, as in chosen-plaintext attacks.
Figure 1 shows our threat model where the training phase is trusted but the trained model is then deployed to an inference engine that operates in an untrusted environment. The adversary is after the trained model parameters (e.g., weights and biases of a neural network)—input data privacy is out of scope (wei2018know).
We assume that the trained ML model is stored in a protected memory and that standard techniques are used to securely transfer it (i.e., bus snooping attacks are out of scope) (mlIPprot). The adversary, therefore, has gray-box access to the device, i.e., it knows all the design details up to the level of each individual logic gate but does not know the trained ML model. We restrict the secret variables to just the parameters and not the hyperparameters, such as the number of neurons, following earlier work (juuti2019prada; tramer2016stealing; dubey2019maskednet). In fact, an adversary will still not be able to clone the model with just the hyperparameters if it does not possess the required compute power or training dataset. This is analogous to the scenario in cryptography where an adversary, even after knowing the implementation of a cipher, cannot break it without the correct key.

We target a hardware implementation of the neural network, not software. The design fully fits on the FPGA. Therefore, it does not involve any off-chip memory access and executes with constant flow in constant time. These attributes make the design resilient to any type of digital (memory, timing, access-pattern, etc.) side-channel attack. However, physical side-channels like power and EM emanations still exist; we address the power-based side-channel leakages in our work. Other implementation attacks on neural networks, such as fault attacks (breier2018practical; breier2020sniff), are out of scope.
3. Background and Related Work
This section presents related work on the privacy of ML applications, the current state of side-channel defenses, preliminaries on BNNs, and our BNN hardware design.
3.1. ML Model Extraction
Recent developments in the field of ML point to several motivating scenarios that demand asset confidentiality. Firstly, training is a computationally-intensive process and hence requires the model provider to invest in high-performance compute resources (e.g., a GPU cluster). The model provider might also need to buy a labeled dataset for training or to label an unstructured dataset. Therefore, knowledge of either the parameters or the hyperparameters can provide an unfair business advantage to the user of the model, which is why the ML model should be private. Theoretical model extraction analyzes the query-response pairs obtained by repeatedly querying an unknown ML model to steal the parameters (jagielski2019high; oh2019towards; RST19; carlini2020cryptanalytic). This type of attack is similar to the class of theoretical cryptanalysis in the cryptography literature. Digital side-channels, by contrast, exploit the leakage of secret-data-dependent intermediate computations, like access patterns or timing in the neural network computations, to steal the parameters (yan2018cache; duddu2018stealing; dong2019floating; hu2019neural); these can usually be mitigated by making the secret computations constant-flow and constant-time. Physical side-channels target leaks in physical properties, like the CMOS power draw or electromagnetic emanations, that still exist in a constant-flow/constant-time algorithm's implementation (hua2018reverse; batina2018csi; wei2018know; xiang2020open; dubey2019maskednet). Mitigating physical side-channels is thus harder than mitigating digital side-channels in hardware accelerator design, and it has been extensively studied in the cryptography community.
3.2. Side-Channel Defenses
Researchers have proposed numerous countermeasures against DPA. These countermeasures can be broadly classified as either hiding-based or masking-based. The former aims to make the power consumption constant throughout the computation by using power-balancing techniques (tiri2004logic; nassar2010bcdl; yu2007secure). The latter splits the sensitive variable into multiple statistically independent shares to ensure that the power consumption is independent of the sensitive variable throughout the computation (CHES:AkkGir01; CHES:GolTym02; BGK05; OMPR05; TKL05; C:IshSahWag03). The security provided by hiding-based schemes hinges upon the precision of the back-end design tools to create a near-perfect power-equalized circuit by balancing the load capacitances across the leakage-prone paths. This is not a trivial task, and prior literature shows how a well-balanced dual-rail-based defense is still vulnerable to localized EM attacks (immler2017your). By contrast, masking transforms the algorithm itself to work in a secure way by never evaluating the secret variables directly, keeping the security mostly independent of the back-end design and making it a favorable choice over hiding.

3.3. Neural Network Classifiers
Neural network algorithms learn how to perform a certain task. In the learning phase, the user sends a set of inputs and expected outputs to the machine (a.k.a., training), which helps it to approximate (or learn) the function mapping the inputoutput pairs. The learned function can then be used by the machine to generate outputs for unknown inputs (a.k.a., inference).
A neural network consists of units called neurons (or nodes) and these neurons are usually grouped into layers. The neurons in each layer can be connected to the neurons in the previous and next layers. Each connection has a weight associated with it, which is computed as part of the training process. The neurons in a neural network work in a feedforward fashion passing information from one layer to the next.
The weights and biases can be initialized to be random values or a carefully chosen set before training (tlearning). These weights and biases are the critical parameters that our countermeasure aims to protect. During training, a set of inputs along with the corresponding labels are fed to the network. The network computes the error between the actual outputs and the labels and tunes the weights and biases to reduce it, converging to a state where the accuracy is acceptable.
3.4. Binarized Neural Networks
The weights and biases of a neural network are typically floating-point numbers. However, the high area, storage costs, and power demands of floating-point hardware do not fare well with the requirements of resource-constrained edge devices. Fortunately, Binarized Neural Networks (BNNs) (courbariaux2016binarized), with their low hardware cost and power needs, fit very well in this use case while providing a reasonable accuracy. BNNs restrict the weights and activations to binary values (+1 and -1), which can easily be represented in hardware by a single bit. This significantly reduces the storage costs for the weights from floating-point values to binary values. The XNOR-POPCOUNT operation, implemented using XNOR gates, replaces the large floating-point multipliers, resulting in a huge area and performance gain (rastegari2016xnor).

Figure 2 depicts the neuron computation in a fully-connected BNN. The neuron in the first hidden layer multiplies the input values with their respective binarized weights. The generated products are added to the bias, and the result is fed to the activation function, which is a sign function that binarizes the non-negative and negative inputs to +1 and -1, respectively. Hence, the activations in the subsequent layer are also binarized.
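As a concrete sketch of this computation, consider a minimal Python model of one BNN hidden-layer neuron (the function name and the bit encoding, 1 for +1 and 0 for -1, are ours, for illustration only):

```python
def bnn_neuron(activations, weights, bias):
    """One BNN neuron via the XNOR-POPCOUNT trick.

    activations, weights: lists of bits, where 1 encodes +1 and 0 encodes -1.
    Returns the binarized activation bit (1 for +1, 0 for -1).
    """
    n = len(activations)
    # XNOR counts the positions where activation and weight agree.
    popcount = sum(1 for a, w in zip(activations, weights) if a == w)
    # Each agreement contributes +1 and each disagreement -1 to the
    # signed sum, so sum = popcount - (n - popcount) = 2*popcount - n.
    summation = 2 * popcount - n + bias
    # Sign activation: non-negative -> +1 (bit 1), negative -> -1 (bit 0).
    return 1 if summation >= 0 else 0
```

The `2 * popcount - n` identity is why a POPCOUNT plus a constant subtraction can replace the signed multiply-accumulate.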
3.5. Our Baseline BNN Hardware Design
We consider a BNN having an input layer of 784 nodes, 3 hidden layers of 1010 nodes each, and an output layer of 10 nodes. The 784 input nodes denote the 784 pixel values in the 28x28 grayscale images of the Modified National Institute of Standards and Technology (MNIST) database, and the 10 output nodes represent the 10 classes of handwritten digits (courbariaux2016binarized; umuroglu2017finn; rastegari2016xnor).
3.5.1. Weighted Summations
We choose to use a single adder in the design and sequentialize all the additions in the algorithm to reduce the area costs. Figure 3 shows our baseline BNN design. The computation starts from the input-layer pixel values stored in the Pixel Memory. For each node of the first hidden layer, the hardware multiplies the 784 input pixel values with their weights one by one and accumulates the sum of these products. The final summation is added to the bias, reusing the adder with a multiplexed input, and fed to the activation function. The hardware uses XNOR and POPCOUNT operations to perform the weighted summations in the hidden layers (the POPCOUNT operation also involves an additional step of subtracting the number of nodes (1010) from the final sum, which can be done as part of the bias-addition step). The final-layer summations are sent to the output logic.
In the input layer computations, the hardware multiplies an 8-bit unsigned input pixel value with its corresponding weight. The weight values are binarized to either 0 or 1 (representing a -1 or +1, respectively). Figure 4 shows the realization of this multiplication with a multiplexer that takes the pixel value and its 2's complement as the data inputs and the weight as the select line. The 8-bit unsigned pixel value, when multiplied by -1, needs to be sign-extended to 9 bits, resulting in a 9-bit multiplexer.
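This mux-based multiplication can be modeled in a few lines (a minimal sketch; the 9-bit wrap-around models the sign extension, and the function name is ours):

```python
def mul_by_binary_weight(pixel, weight_bit):
    """Multiply an 8-bit unsigned pixel by a binarized weight.

    weight_bit = 1 encodes +1, weight_bit = 0 encodes -1.
    Returns the product as a 9-bit two's-complement value,
    mimicking the 9-bit multiplexer output.
    """
    if weight_bit:
        return pixel                 # +1 * pixel, MSB stays 0
    return (-pixel) & 0x1FF          # 2's complement over 9 bits
```

For example, a pixel of 5 with weight bit 0 yields the 9-bit pattern for -5.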
3.5.2. Activation Function
The activation function binarizes the non-negative and negative summations to +1 and -1, respectively, for each node of the hidden layer. In hardware, this is implemented using a simple NOT gate that takes the MSB of the summation as its input.
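Since the activation reduces to inspecting the MSB, a minimal model is the following (assuming the 20-bit two's-complement summation width used by our adder; the function name is illustrative):

```python
def sign_activation(summation, nbits=20):
    """Binarize a two's-complement summation: non-negative -> +1
    (encoded as bit 1), negative -> -1 (encoded as bit 0).
    This is just the inverted MSB, i.e., a single NOT gate."""
    msb = (summation >> (nbits - 1)) & 1
    return msb ^ 1
```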
3.5.3. Output Layer
The summations in the output layer represent the confidence score of each output class for the provided image. Therefore, the final classification result is the class with the maximum confidence score. Figure 3 shows the hardware for computing the classification result. As the adder generates the output-layer summations, they are sent to the output logic block, which performs a rolling update of the max register if the newly received sum is greater than the previously computed max. In parallel, the hardware also stores the index of the current max node. The index stored after the final update is sent out as the final output of the neural network. The hardware takes 2.8M cycles to finish one inference.
4. Fully Masking the Neural Network
This section discusses the hardware design and implementation of all components in the masked neural network. Prior work on masking of neural networks shows that arithmetic masking alone cannot mask integer addition due to a leakage in the sign bit (dubey2019maskednet). Hence, we apply gate-level Boolean masking to perform integer addition in a secure fashion. We express the entire computation of the neural network as a sequence of AND and XOR operations and apply gate-level masking to the resulting expression. XORs, being linear, do not require any additional masking, and AND gates are replaced with secure, Trichina-style AND gates (TKL05). Furthermore, we design specialized circuits for the BNN's unique components, namely the masked multiplexer and the masked output layer.
4.1. Notations
We first explain the notation used in equations and figures. Any variable without a subscript or superscript represents an N-bit number. We use a subscript to refer to a single bit of the N-bit number; for example, x_i refers to the i-th bit of x. A superscript refers to one of the secret shares of a variable. To refer to a particular share of a particular bit of an N-bit number, we use both the subscript and the superscript; for example, x_i^2 refers to the second Boolean share of the i-th bit of x. If a variable only has a superscript (say x^1), we are referring to its full N-bit Boolean share; N can also be equal to 1, in which case x^1 is simply a bit. r denotes a fresh, random bit.
4.2. Why Trichina’s Masking Style?
Among the closely related masking styles (reparaz162), we chose to implement Trichina's method due to its simplicity and implementation efficiency. Figure 5 (left) shows the basic structure and functionality of the Trichina gate, which implements a masked AND operation, c = a AND b, on single-bit inputs. Each input (a and b) is split into two shares (a^1 and a^2 s.t. a = a^1 XOR a^2, and b^1 and b^2 s.t. b = b^1 XOR b^2). The partial products of these shares are sequentially accumulated with a chain of XOR gates initiated with a fresh random bit (r). A single AND operation thus uses 3 random bits. The technique ensures that the output is the Boolean-masked output of the original AND function, i.e., c^1 XOR c^2 = a AND b, while all the intermediate computations are randomized.
Unfortunately, the straightforward adoption of Trichina's AND gate can lead to information leakage due to glitches (ICICS:NikRecRij06). For instance, in Figure 5 (left), if the products a^1 AND b^1 and a^1 AND b^2 reach the input of the second XOR gate before the random mask r reaches the input of the first XOR gate, the output at the XOR gate will temporarily evaluate (glitch) to (a^1 AND b^1) XOR (a^1 AND b^2) = a^1 AND (b^1 XOR b^2) = a^1 AND b, which leads to the secret value b being unmasked. Therefore, we opted for an extension of the Trichina AND gate that adds flip-flops to synchronise the arrival of inputs at the XOR gates (see Figure 5, right). The only XOR gate without a flip-flop at its input is the leftmost XOR gate in the path of r, which is not a problem because a glitching output at this gate does not combine two shares of the same variable. Similar techniques have been used in the past (regGlitch). Masking styles like Threshold gates (CHES:ManPraOsw05; maskedcmosleak; SAC:TirSch06) may be considered for even stronger security guarantees, but they add further area-performance-randomness overhead.
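A behavioral Python model of the (unsynchronized) Trichina AND gate follows; glitches only exist in hardware, so this model captures the share arithmetic alone (names are ours, for illustration):

```python
import secrets

def trichina_and(a1, a2, b1, b2):
    """Masked AND of a = a1^a2 and b = b1^b2 (all single bits).

    Returns shares (c1, c2) with c1 ^ c2 == (a & b). The
    parenthesization mirrors the XOR chain seeded with the
    fresh random bit r; c2 is simply r itself.
    """
    r = secrets.randbits(1)
    c1 = (((r ^ (a1 & b1)) ^ (a1 & b2)) ^ (a2 & b1)) ^ (a2 & b2)
    return c1, r
```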
4.3. Masked Adder
We adopt the ripple-carry style of implementation for the adder. It is formed using N 1-bit full adders, where the carry-out from each adder is the carry-in for the next adder in the chain, starting from the LSB. The ripple-carry configuration therefore eases the parameterization and modular design of the Boolean-masked adders.
4.3.1. Design of a Masked Full Adder
A 1-bit full adder takes as input two operand bits and a carry-in and outputs the sum and the carry-out, which are a function of the two operands and the carry-in. If the input operand bits are denoted by a and b and the carry-in bit by c_in, then the Boolean equations of the sum s and the carry-out c_out can be described as follows:

(1) s = a XOR b XOR c_in

(2) c_out = (a AND b) XOR (b AND c_in) XOR (a AND c_in)
Figure 6 shows the regular, 1bit full adder (on the left), and the resulting masked adder with Trichina’s AND gates (on the right). In the rest of the subsection, we will discuss the derivation of the masked full adder equations.
The first step is to split the secret variables (a, b, and c_in) into Boolean shares. The hardware samples a fresh, random mask from a uniform distribution and XORs it with the original variable. If we represent the random masks as r_a, r_b, and r_c, then the Boolean shares of a, b, and c_in can be generated as follows:

(3) a^1 = a XOR r_a, a^2 = r_a; b^1 = b XOR r_b, b^2 = r_b; c_in^1 = c_in XOR r_c, c_in^2 = r_c
A masking scheme always works on the two shares independently without ever combining them at any point in the operation. Combining the shares at any point will reconstruct the secret and create a sidechannel leak at that point.
The function of sum-generation is linear, making it easy to directly and independently compute the Boolean shares s^1 and s^2 of s:

s^1 = a^1 XOR b^1 XOR c_in^1, s^2 = a^2 XOR b^2 XOR c_in^2

where s = s^1 XOR s^2.
Unlike sum-generation, carry-generation is a non-linear operation due to the presence of AND operators. Hence, the hardware cannot directly and independently compute the Boolean shares c_out^1 and c_out^2 of c_out. We use Trichina's construction, explained in subsection 4.2, to mask carry-generation. The hardware uses three Trichina AND gates to mask the three AND operations in equation (2) using three random masks. This generates two Boolean shares from each Trichina AND operation. At this point, the expression is linear again, and therefore the hardware can redistribute the terms, similar to the masking of the sum operation.
In the following equations, we use TG(x, y) to represent the masked product of x and y implemented via Trichina's AND gate, as illustrated in the following equation:

TG(x, y) -> (p^1, p^2), where p^1 XOR p^2 = x AND y

and p^1 and p^2 are the two Boolean shares of the product. Replacing each AND operation in equation (2) with TG, we can write

(4) TG(a, b) -> (p^1, p^2)

(5) TG(b, c_in) -> (q^1, q^2)

(6) TG(a, c_in) -> (t^1, t^2)

where p^1, p^2, q^1, q^2, t^1, and t^2 are the output shares from each Trichina gate. From equations (2), (4), (5), and (6) we get

c_out = p^1 XOR p^2 XOR q^1 XOR q^2 XOR t^1 XOR t^2

Rearranging the terms, c_out can be written as a combination of two Boolean shares c_out^1 and c_out^2

c_out = c_out^1 XOR c_out^2

where

c_out^1 = p^1 XOR q^1 XOR t^1, c_out^2 = p^2 XOR q^2 XOR t^2
Therefore, we create a masked full adder that takes in the Boolean shares of the two bits to be added along with a carry-in and gives out the Boolean shares of the sum and carry-out.
4.3.2. The Modular Design of the Pipelined N-bit Full Adder
The masked full adders can be chained together to create an N-bit masked adder that adds two masked N-bit numbers. Figure 7 (top) shows how to construct a 4-bit masked adder as an example. We pipeline the N-bit adder to yield a throughput of one addition per cycle by adding registers between the full adders corresponding to each bit (see Figure 7, bottom).
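Putting the pieces together, the masked full adder and the N-bit ripple-carry chain can be sketched behaviorally as follows (a functional model only; the real design inserts pipeline registers between stages, and all names are ours):

```python
import secrets

def _tg(a1, a2, b1, b2):
    """Trichina AND gate: Boolean shares of (a1^a2) & (b1^b2)."""
    r = secrets.randbits(1)
    c1 = (((r ^ (a1 & b1)) ^ (a1 & b2)) ^ (a2 & b1)) ^ (a2 & b2)
    return c1, r

def masked_full_adder(a, b, cin):
    """a, b, cin are (share1, share2) bit pairs.
    Returns ((s1, s2), (cout1, cout2))."""
    a1, a2 = a
    b1, b2 = b
    ci1, ci2 = cin
    # Sum is linear: process the two shares independently.
    s = (a1 ^ b1 ^ ci1, a2 ^ b2 ^ ci2)
    # Carry needs three masked ANDs: a&b, b&cin, a&cin.
    p1, p2 = _tg(a1, a2, b1, b2)
    q1, q2 = _tg(b1, b2, ci1, ci2)
    t1, t2 = _tg(a1, a2, ci1, ci2)
    return s, (p1 ^ q1 ^ t1, p2 ^ q2 ^ t2)

def masked_ripple_add(x_shares, y_shares, nbits):
    """Chain nbits masked full adders, LSB first.
    x_shares, y_shares: (share1, share2) of nbits-wide integers."""
    x1, x2 = x_shares
    y1, y2 = y_shares
    carry = (0, 0)  # initial carry-in of 0, trivially shared
    z1 = z2 = 0
    for i in range(nbits):
        a = ((x1 >> i) & 1, (x2 >> i) & 1)
        b = ((y1 >> i) & 1, (y2 >> i) & 1)
        (s1, s2), carry = masked_full_adder(a, b, carry)
        z1 |= s1 << i
        z2 |= s2 << i
    return z1, z2
```

Unmasking the result (XOR of the two output shares) recovers the ordinary sum.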
4.4. Masking of Activation Function
The activation function is a NOT gate operating on the MSB of the summation (Section 3.5.2). Since NOT is linear over XOR (NOT x = x XOR 1), masking it is trivial: the hardware inverts only one of the two Boolean shares of the MSB and leaves the other untouched, requiring no fresh randomness.
4.5. Masked Multiplexer
A 9-bit multiplexer is internally a set of nine parallel 1-bit multiplexers. We implement the masked 1-bit multiplexer using a 4-input, 2-output masked lookup table (LUT). Figure 9 shows the masked LUT, which takes the original inputs and an additional fresh random mask as inputs, and outputs the random mask (simply bypassed) along with the correct output XORed with the random mask. We assume that each LUT operation is atomic. Since the output function is 4-input, 2-output, it can be mapped onto a single LUT of the target FPGA (ugSpartan6). The small number of inputs also obviates the need for precautions like building a carefully balanced tree of smaller-input LUTs (CHES:RRVV15). Advanced masking constructions can be used to implement this function for a stronger security guarantee. As suggested in another work (CHES:RRVV15), masked lookups can also be implemented using ROMs if the target is an ASIC, since ROMs are immutable and small in size. Thus, the (Boolean) masked output from the LUTs ensures that the secret intermediate variable (the multiplexed input pixel) always remains masked.
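A behavioral model of the masked 1-bit multiplexer LUT follows (a sketch under the atomic-LUT assumption; names are ours):

```python
import secrets

def masked_mux_bit(d0, d1, sel):
    """1-bit multiplexer whose output is returned as Boolean shares.

    Modeled after a 4-input/2-output LUT: inputs (d0, d1, sel, r),
    outputs (mux ^ r, r), so the selected value never appears
    unmasked at the output.
    """
    r = secrets.randbits(1)
    out = d1 if sel else d0
    return out ^ r, r
```

XORing the two outputs recovers the plain multiplexer result.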
4.6. Masking the Output Layer
The hardware stores the 10 output layer summations in the form of Boolean shares. To determine the classification result, it needs to find the maximum value among the 10 masked output nodes. Specifically, it needs to compare two signed values expressed as Boolean shares. We transform the problem of masked comparison to masked subtraction.
Figure 10 shows the hardware design of the masked output layer. The hardware subtracts each output node value from the current maximum and swaps the current maximum (old max shares) with the node value (new max shares) if the MSB is 1, using a masked multiplexer. An MSB of 1 signifies that the difference is negative and hence the new sum is greater than the current max. Instead of building a new masked subtractor, we reuse the existing masked adder to also function as a subtractor through a flag, which is set while computing the max. In parallel, the hardware uses one more masked-multiplexer-based update circuit to update the Boolean shares of the index corresponding to the current max node (not shown in the figure). This prevents known-ciphertext attacks, the ciphertext being the classification result in our case. Finally, the Masked Output Logic computes the classification result in the form of (Boolean) shares of the index of the node with the maximum confidence score.
Subtraction is essentially the addition of one number with the 2's complement of another. The 2's complement is computed by taking the bitwise 1's complement and adding 1. The bitwise 1's complement is implemented as an XOR with 1, and the addition of 1 is implemented by setting the initial carry-in to 1. Since this only requires additional XOR gates, which are linear, nothing changes with respect to the masking of the new adder-subtractor circuit.
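Ignoring the sharing, the arithmetic of the max update can be sketched as follows (a plain-integer model of the adder-as-subtractor reuse; the 20-bit width matches our adder, and the function names are ours):

```python
def sub_via_adder(a, b, nbits=20):
    """Compute a - b as a + (~b) + 1 on an nbits-wide adder.
    Flipping b is an XOR with all-ones (linear); the +1 comes
    from setting the initial carry-in."""
    mask = (1 << nbits) - 1
    return (a + (b ^ mask) + 1) & mask

def update_max(current_max, candidate, nbits=20):
    """Rolling max update: keep candidate iff (current_max - candidate)
    is negative, i.e., its MSB is 1."""
    diff = sub_via_adder(current_max, candidate, nbits)
    msb = (diff >> (nbits - 1)) & 1
    return candidate if msb else current_max
```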
4.7. Scheduling of Operations
We optimize the scheduling in such a way that the hardware maintains a throughput of one addition per cycle. The latency of the masked 20-bit adder is 100 cycles. Therefore, the result from the adder is only available 101 cycles (one additional cycle is needed for the accumulator register) after it samples the inputs. The hardware cannot feed the next input in the sequence until the previous sum is available because of the data dependency between successive accumulations. This incurs a stall of 101 cycles per addition, leading to a total of 784 x 101 cycles for each input-layer node computation; that is a 101x performance drop over the unmasked implementation with a regular adder.
We solve the problem by finding useful work for the adder during the stalls that is independent of the summation in flight. We observe that computing the weighted summation of one node is completely independent of the next node's computation. The hardware utilizes this independence to improve the throughput by starting the next node's computation while the result for the first node arrives. Similarly, up to 101 nodes can be computed concurrently using the same adder, achieving the exact same throughput as the baseline design. This comes at the expense of additional registers for storing 101 summations (see Figure 11; the register file also has demultiplexing and multiplexing logic to update and consume the correct accumulated sum in sequence, which is not shown for simplicity, and this is also why we use 1010 neurons, a multiple of 101, in the hidden layers) plus some control logic, but a throughput gain of 101x is worthwhile. The optimization only works if the number of next-layer nodes is greater than, and a multiple of, 101. This restriction prevents optimizing the output layer (of 10 nodes) and contributes to the 3.5% increase in the latency of the masked design.
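The cycle-count argument can be captured in a small model (an illustration of the scheduling reasoning, not the RTL; names and the drain-cost accounting are ours):

```python
def node_cycles(inputs_per_node, n_parallel_nodes, adder_latency=101):
    """Cycle counts for computing n_parallel_nodes weighted summations.

    stalled: one node at a time, each addition waits out the full
             adder latency before the next can issue.
    interleaved: with n_parallel_nodes >= adder_latency independent
             accumulations in flight, the pipelined adder retires one
             addition per cycle; the pipeline is drained once at the end.
    """
    stalled = inputs_per_node * adder_latency * n_parallel_nodes
    interleaved = inputs_per_node * n_parallel_nodes + adder_latency
    return stalled, interleaved
```

For the input layer (784 inputs, 101 concurrent nodes), the model shows the roughly 101x gain claimed above.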
5. Results
In this section, we describe the hardware setup used to implement the neural network and capture power measurements, the leakage assessment methodology that we follow to evaluate the security of the proposed design, and the hardware implementation results.
5.1. Hardware Setup
We implement the neural network in Verilog and use Xilinx ISE 14.7 for synthesis and bitstream generation. We use the DONT_TOUCH attribute in the code and disable the options like LUT combining, register reordering, etc. in the tool to prevent any type of optimization in the masked components.
Our side-channel evaluation platform is the SAKURA-G FPGA board (sakurag). It hosts a Xilinx Spartan-6 (XC6SLX75-2CSG484C) as the main FPGA that executes the neural network inference. An on-board amplifier amplifies the voltage drop across a 1 Ohm shunt resistor on the power supply line. We use a Picoscope 3206D (picoscope) oscilloscope to capture the measurements from the dedicated SMA output port of the board. The design is clocked at 24 MHz and the sampling frequency of the oscilloscope is 125 MHz. A higher sampling frequency leads to the challenges that we discuss in Section 6.2. However, to ensure a sound evaluation, we perform first- and second-order t-tests on a smaller unit of the design at a much higher precision: we conduct the experiment at a design frequency of 1.5 MHz and a sampling frequency of 500 MHz, which translates to 333 sample points per clock cycle.
We use Riscure’s Inspector SCA (inspector) software to communicate with the board and initiate a capture on the oscilloscope. By default, the Inspector software does not support SAKURAG board communication. Hence, we develop our own custom modules in the software to automate the setup. The modules implement the FPGA communication protocol and perform the register reads and writes on the FPGA to start the neural network inference and capture the classification result.
5.2. Leakage Evaluation
We perform the leakage assessment of the proposed design using non-specific fixed-vs-random t-tests, which are a common and generic way of assessing the side-channel vulnerability of a given implementation (becker2013test). A t-score lying within the threshold range of +/-4.5 implies that the power traces do not leak any information about the data being processed, with up to 99.99% confidence. The measurement and evaluation are quite involved, and we refer the reader to Section 6.2 for further details. We demonstrate security up to 2M traces, which is much greater than the first-order security of the currently best-known defense, which leaks at 40k traces (dubey2019maskednet).
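For reference, the per-sample statistic of the fixed-vs-random test is Welch's t; a minimal implementation (ours, for illustration) over the trace values at one sample point is:

```python
import math

def welch_t(fixed, random_):
    """Welch's t-statistic between the fixed-input and random-input
    trace populations at one sample point. In a non-specific TVLA
    test, |t| > 4.5 at any point flags first-order leakage."""
    n1, n2 = len(fixed), len(random_)
    m1 = sum(fixed) / n1
    m2 = sum(random_) / n2
    v1 = sum((x - m1) ** 2 for x in fixed) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in random_) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
```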
Pseudo Random Number Generators (PRNGs) produce the fresh, random masks required for masking. We choose TRIVIUM (prng) as the PRNG, which is a hardware-implementation-friendly PRNG specified as an International Standard under ISO/IEC 29192-3, but any cryptographically-secure PRNG can be employed. TRIVIUM generates up to 2^64 bits of output from an 80-bit key; hence, the PRNG has to be reseeded before the output space is exhausted.
5.2.1. First-order Tests
We first perform the first-order t-test on the design with the PRNGs disabled, which is equivalent to an unmasked (baseline/unprotected) design. Figure 12 (left) shows the result of this experiment, where we clearly observe significant leakage since the t-scores are greater than the threshold of 4.5 for the entire execution. Then, we perform the same test, but with the PRNGs switched on this time, which is equivalent to the masked design. Figure 12 (right) shows the results for this case, where we observe that the t-scores never cross the threshold of 4.5 except in the initial phase.
The initial-phase leakages are due to the input correlations during the input layer computations. The hardware loads a new input pixel every 101 cycles and feeds it to the masked multiplexer. The secret variable is the weight, which is never exposed because the masked multiplexer randomises the output using a fresh, random mask.
5.2.2. High-Precision First- and Second-order Tests
We performed a univariate second-order t-test on the fully masked design (schneider2016leakage), but 1M traces were not sufficient to reveal leakage. Due to the extremely lengthy measurement and evaluation times, it was infeasible to continue the test for more traces. Therefore, we perform first- and second-order evaluations on the isolated, synchronized Trichina AND gate, which is one of the main building blocks of the full design. We reduce the design frequency to 1.5 MHz to increase the accuracy of the measurement and prevent any aliasing between clock cycles. The SNR for a single gate was not sufficient to observe leakage even at 10M traces; hence, we amplify the SNR by instantiating 32 independent instances of the Trichina AND gate in the design, driven by the same inputs. We present the results for this experiment in Figure 13, which shows no leakage in the first-order t-test but significant leakage in the second-order t-test at 500k traces. Thus, by ensuring success in the second-order t-tests, we validate the correctness of our measurement setup and of the first-order masking implementation.
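The univariate second-order test runs the same t-test on mean-free, squared traces; a minimal preprocessing sketch (ours, following the center-and-square approach of Schneider and Moradi) is:

```python
def center_and_square(traces):
    """Second-order preprocessing: for each sample point (column),
    subtract the mean across traces and square the result. The
    ordinary Welch t-test is then run on the preprocessed traces.

    traces: list of equal-length lists of floats (one list per trace).
    """
    n = len(traces)
    n_samples = len(traces[0])
    means = [sum(t[j] for t in traces) / n for j in range(n_samples)]
    return [[(t[j] - means[j]) ** 2 for j in range(n_samples)]
            for t in traces]
```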
Metric  | Unmasked (LUT/FF/BRAM) | Masked (LUT/FF/BRAM) | Change
Area    | 1833/1125/163          | 9833/7624/163        | 5.3×/6.8×/1×
Latency |                        |                      | 1.04×

Design Blocks           | Unmasked (LUT/FF/BRAM) | Masked (LUT/FF/BRAM) | Fraction (%)
Adder                   | 10/0/0                 | 954/1050/0           | 12/16/–
PRNGs                   | 0/0/0                  | 1125/1314/0          | 14/20/–
Output Layer            | 7/16/0                 | 32/22/0              | 0.3/0.09/–
Throughput Optimization | 0/20/0                 | 5337/4040/0          | 66/62/–
ROMs                    | 411/1009/159           | 672/1009/159         | 4/–/–
RWMs                    | 0/0/4                  | 0/0/4                | –
Misc                    | 486/108/0              | 1233/2605/0          | 9/38/–

"–" denotes no change between the unmasked and masked designs.
5.3. Masking Overheads
Table 1 shows that the impact of masking on the performance is 1.04×, and on the number of LUTs and FFs is 5.3× and 6.8×, respectively. We also summarize the area contribution of each design component in Table 2. The fourth column indicates the fraction of the total increase in area (i.e., 8000 LUTs and 6499 FFs) that each component contributes. Most of the area increase is due to the throughput optimization logic, i.e., the register file accumulator logic described in subsection 4.7. The masked adder contributes 12% and 16% to the overall increase in LUTs and FFs, respectively. The increase due to the output layer logic is minimal. ROMs refer to the read-only memories storing the weight and bias values, where the increase is minimal. (The slight increase in the number of LUTs is because one of the memories is implemented using LUTs, which may be redistributed even for the same memory size.) RWMs refer to the read-write memories storing the layer activations, which also show no increase, as the masked version stores two bits (the Boolean shares) instead of one per activation, accommodated in the same BRAM tile.
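The fraction column can be reproduced from the raw LUT/FF counts in the tables; a small sketch using three of the blocks makes the arithmetic explicit (values differing from the published column by one percentage point are rounding artifacts):

```python
# Totals from Table 1: LUTs 1833 -> 9833 (+8000), FFs 1125 -> 7624 (+6499).
total_lut_inc = 9833 - 1833
total_ff_inc = 7624 - 1125

# (unmasked LUT/FF, masked LUT/FF) per design block, from Table 2.
blocks = {
    "Adder":                   ((10, 0),  (954, 1050)),
    "PRNGs":                   ((0, 0),   (1125, 1314)),
    "Throughput Optimization": ((0, 20),  (5337, 4040)),
}

for name, ((ul, uf), (ml, mf)) in blocks.items():
    lut_pct = 100 * (ml - ul) / total_lut_inc
    ff_pct = 100 * (mf - uf) / total_ff_inc
    print(f"{name}: {lut_pct:.0f}% of LUT increase, {ff_pct:.0f}% of FF increase")
```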
We compare the area-delay product (ADP) of our proposed design, BoMaNet, to that of MaskedNet (dubey2019maskednet), where area is defined as the sum of the numbers of LUTs and FFs, and delay as the latency in cycles. The ADP of MaskedNet is approximately 100× lower than that of our design. This is expected, since MaskedNet was designed for cost-effectiveness using hiding and partial masking, whereas in BoMaNet every operation is masked at the gate level to improve side-channel security. Similar overheads were observed in previous works on Boolean masking of AES (maskOverhead).
6. Discussions
6.1. Proof-of-Concept vs. Optimizations
The solution we propose utilizes simple yet effective techniques to mask an inference engine, but there is certainly scope for improvement, both in the hardware design and in the security countermeasures. In this section, we discuss possible optimizations and extensions of our work, as well as alternative approaches taken in the field of privacy for ML.
6.1.1. Design Optimizations
The ripple-carry adder used in this work can be replaced with advanced adder architectures like carry-lookahead (cla), carry-skip (csa), or Kogge-Stone (koggestone). These architectures commonly contain an additional logic block that precomputes the generate and propagate bits; therefore, additional randomness will be needed to mask the nonlinear generate expression. All of these adders have more combinational logic than the ripple-carry adder, which may make it harder to avoid glitches. To that end, prior work on TI-based secure versions of ripple-carry and Kogge-Stone adders can be extended (bomaadder15). Another potential optimization is the use of other masking styles, such as domain-oriented masking (gross2016domain) or manual techniques (manualGlitch), to reduce the area and randomness overheads.
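To make the gate-level cost of the ripple-carry structure concrete, a Boolean-masked adder can be modeled in software: the sum bit is linear in the shares and therefore free to mask, while each carry-out needs two masked AND gadgets and two fresh random bits. The following is a functional sketch under those assumptions (a software model illustrating the share arithmetic, not any cited hardware design):

```python
import random

def trichina_and(a0, a1, b0, b1, r):
    """First-order masked AND: z0 ^ z1 = (a0^a1) & (b0^b1)."""
    return r, (((r ^ (a0 & b0)) ^ (a0 & b1)) ^ (a1 & b0)) ^ (a1 & b1)

def masked_full_adder(a0, a1, b0, b1, c0, c1, rand):
    """One bit of a Boolean-masked ripple-carry adder.
    sum = a^b^c is linear (no randomness); the carry
    cout = (a&b) ^ ((a^b)&c) needs two masked ANDs and two fresh bits."""
    s0, s1 = a0 ^ b0 ^ c0, a1 ^ b1 ^ c1            # sum shares (linear)
    p0, p1 = a0 ^ b0, a1 ^ b1                      # propagate shares
    g0, g1 = trichina_and(a0, a1, b0, b1, rand())  # generate = a&b (masked)
    t0, t1 = trichina_and(p0, p1, c0, c1, rand())  # (a^b)&c (masked)
    return s0, s1, g0 ^ t0, g1 ^ t1                # carry-out shares

def masked_add(x, y, nbits=8):
    """Add two integers on Boolean shares, bit by bit (ripple carry)."""
    rand = lambda: random.getrandbits(1)
    mx, my = random.getrandbits(nbits), random.getrandbits(nbits)
    x0, x1, y0, y1 = mx, x ^ mx, my, y ^ my        # random share splitting
    c0 = c1 = z0 = z1 = 0
    for i in range(nbits):
        s0, s1, c0, c1 = masked_full_adder(
            (x0 >> i) & 1, (x1 >> i) & 1,
            (y0 >> i) & 1, (y1 >> i) & 1, c0, c1, rand)
        z0 |= s0 << i
        z1 |= s1 << i
    return z0 ^ z1  # shares recombined here only to check the result

print(masked_add(100, 55))  # (100 + 55) mod 256 = 155
```

In hardware, the recombination step of course never happens; the shares stay separate, and each full-adder stage is registered to suppress glitches, as described earlier.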
6.1.2. Limitations
We reduce glitch-related vulnerabilities by inserting registers at each stage, which is a low-cost, practical solution. Other works have proposed stronger countermeasures, at the cost of higher performance and area overheads (ICICS:NikRecRij06; gross2016domain). The quest for stronger and more efficient countermeasures is never-ending; the masking of AES is still being explored 20 years after the initial work (CHES:AkkGir01), driven by the advent of more efficient or secure masking schemes (d+1shares) and more potent attacks (moos2017static; TKL05).
Our solution is first-order secure, but there is scope for constructing higher-order masked versions. Higher-order security, however, is a delicate task; Moos et al. recently showed that straightforward extensions of masking schemes to higher orders suffer from both local and compositional flaws (TCHES:MMSS19), and masking extensions addressing these issues were proposed in another recent work (cassiershardware).
This is the first work on fully-masked neural networks, and we foresee follow-ups, much as the cryptographic community has experienced with AES masking even after 20 years of intensive study.
6.2. Measurement Challenges
We faced some unique challenges that are not generally seen in symmetric-key cryptographic evaluations. Inference is a lengthy operation, especially for an area-optimized design: the inference latency of our design is roughly 3 million cycles. At a design frequency of 24MHz, the execution time translates to 122ms per inference. If the oscilloscope samples at 125MHz (a sample interval of 8ns), the number of sample points to be captured per power trace is roughly 15 million. This significantly slows down the capture of power traces. In our case, capturing 2 million power traces took one week, which means that capturing 100 million traces, as in the AES evaluation of (d+1shares), would take roughly a year. Performing TVLA on such large trace sets (~28TB in our case) also takes a significant amount of time: it took 3 days to obtain one t-score plot during our evaluations on a high-end PC (Intel Core i9-9900K, 64GB RAM). One way to avoid this problem is to look at a small subset of representative traces of the computation (DLRSA); we instead conduct a comprehensive evaluation of our design.
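These figures can be sanity-checked with back-of-the-envelope arithmetic. Note that feeding in the rounded 3-million-cycle figure yields slightly larger numbers (125ms, ~31TB at an assumed one byte per sample) than the exact-cycle-count values quoted above, but the same order of magnitude:

```python
# Rounded inputs from the text: ~3e6 cycles/inference, 24 MHz design clock,
# 125 MHz oscilloscope sampling rate (8 ns sample interval).
cycles = 3e6
f_design = 24e6   # Hz
f_sample = 125e6  # Hz

exec_time = cycles / f_design   # seconds per inference
samples = exec_time * f_sample  # sample points per power trace
print(f"{exec_time * 1e3:.0f} ms per inference, {samples / 1e6:.1f} M samples per trace")

# Assuming one byte per sample (an assumption, not stated in the text),
# 2 M traces occupy roughly:
storage_tb = samples * 2e6 / 1e12
print(f"~{storage_tb:.0f} TB for 2 M traces")
```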
6.3. Theoretical vs. Side-Channel Attacks
Theoretical model extraction, in which a proxy model is trained on a synthetically generated dataset using the predictions of the unknown victim model, is an active area of research (jagielski2019high; carlini2020cryptanalytic). These attacks mostly assume black-box access to the model and successfully extract the model parameters after a certain number of queries (carlini2020cryptanalytic). By contrast, physical side-channel attacks require only a few thousand queries to steal all of the parameters (dubey2019maskednet). This is partly due to the fact that physical side-channel attacks can extract information about intermediate computations even in a black-box setting. Physical side-channel attacks also do not require the generation of a synthetic dataset, unlike most theoretical attacks.
6.4. Orthogonal ML Defenses
There has been some work on defending ML models against the stealing of inputs and parameters using other techniques, such as Homomorphic Encryption (HE) and Secure Multi-Party Computation (SMPC) (gazelle18; xonn19; delphi20), watermarking (rouhani2018deepsigns; adi2018turning), and Trusted Execution Environments (TEEs) (preventingnn18; tramer2018slalom; mlcapsule18). The survey by Isakov et al. and the NIST draft are good references for a more exhaustive list (mlNIST; isakov2019survey). The computational demands of HE may not be suitable for edge computing, and current SMPC defenses predominantly target cloud-based ML frameworks, not the edge. We propose masking, which is an extension of SMPC in hardware, and we believe it is a promising direction for ML side-channel defenses, as it has been for cryptographic applications. Watermarking techniques are punitive methods that cannot prevent physical side-channel attacks. TEEs are subject to ever-evolving microarchitectural attacks and are typically not available on edge/IoT nodes.
7. Conclusions and Future Outlook
Physical side-channel analysis of neural networks is a new, promising direction in hardware security, where attacks are evolving rapidly compared to defenses. This work proposed the first fully-masked neural network, demonstrated its security with up to 2M traces, and quantified the overheads of the countermeasure. We addressed the key challenge of masking arithmetic shares of integer addition (dubey2019maskednet) through Boolean masking. We furthermore presented ideas on how to mask the unique linear and nonlinear computations of a fully-connected neural network that do not exist in cryptographic applications.
The large variety of neural network architectures, in terms of the level of quantization, the types of layer operations (e.g., convolution, max-pool, softmax), and the types of activation functions (e.g., ReLU, sigmoid, tanh), presents a large design space for neural network side-channel defenses. This paper focused on BNNs, as they are a good starting point. The ideas presented in this work serve as a benchmark for analyzing the vulnerabilities in neural network computations and for constructing more robust and efficient countermeasures.