Physical side-channel attacks pose a major threat to the security of cryptographic devices. Attacks like Differential Power Analysis (DPA) (C:KocJafJun99) can extract secret keys by exploiting the inherent correlation between the secret-key-dependent data being processed and the Complementary Metal Oxide Semiconductor (CMOS) power consumption (das2019stellar). DPA has been shown to be effective against many cryptographic implementations in the last two decades (mangard2008power; TCHES:PSKH18; chen2015differential). Until recently, these attacks were confined to cryptographic schemes. Lately, however, Machine Learning (ML) applications have also been shown to be vulnerable to physical side-channel attacks (batina2018csi; dubey2019maskednet; yudeepem), where an adversary aims to reverse engineer the ML model. Indeed, these models are lucrative targets as they are costly to develop and hence become valuable IP for companies (strubell2019energy). Knowledge of the model parameters also makes it easier to fool the model using adversarial learning, which is a serious problem if the model performs a critical task like fraud/spam detection (advlearn05).
Unfortunately, most of the existing work on the physical side-channel analysis of ML accelerators has focused only on attacks, not defenses. To date, there are three publications focusing specifically on the power/EM side-channel leakage of ML models. The first two discuss some countermeasures like shuffling and masking but do not implement any (batina2018csi; yudeepem). The third one implements a hybrid of masking and hiding based countermeasures and exposes the vulnerability in the arithmetic masking of integers due to the leakage in the sign-bit (dubey2019maskednet).
Masking uses secure multi-party computation to split the secrets into random shares and to decorrelate the statistical relation of secret-dependent computations to side-channels. Although similar work on cryptographic hardware has been fully masked (BGK05), the earlier work on neural network hardware was partially masked for cost-effectiveness (dubey2019maskednet) while the leakage in the sign bit is hidden. Such solutions may work well for a regular IP where reasonable security at low-cost is sufficient. However, full masking is a better alternative for IPs deployed in critical applications (like defense) requiring stronger defenses against side-channel attacks.
In this work, we propose the design of the first fully-masked neural network accelerator resistant against power-based side-channel attacks. We construct novel masked primitives for the various linear and non-linear operations in a neural network using gate-level Boolean masking and masked look-up tables (LUT). We analyze neural network-specific computations like weighted summations in fully-connected layers and build specialized masked adders and multiplexers to perform those operations in a secure way. We also design a novel hardware that finds the greatest integer out of a set of integers in a masked fashion, which is needed in the output layer to find the node with the highest confidence score.
We target an area-optimized Binarized Neural Network (BNN) in our work because BNNs are preferred for edge-based neural network inference (umuroglu2017finn; rastegari2016xnor). We optimize the hardware design to reduce the impact of masking on the performance. Specifically, we build an innovative adder-accumulator architecture that provides a throughput of one addition per cycle even with a multi-cycle masked adder with feedback. We maximize the number of balanced data-paths in the masking elements by adding registers at every stage, to synchronize the arrival of signals and reduce the effects of glitches (CHES:ManPraOsw05). We build the masked design in a modular fashion, starting from smaller blocks like Trichina’s AND gates and finally building larger structures like the 20-bit masked full adder. We have pipelined the full design to maintain a high throughput.
Finally, we implement both the baseline unmasked and the proposed first-order secure masked neural network design on an FPGA. We use the standard TVLA methodology (becker2013test) to evaluate the first-order security of the design and demonstrate no leakage up to 2M traces. The latency of the masked implementation is merely 3.5% higher than the latency of the unmasked implementation. The area of the masked design is 5.9× that of the unmasked design. Our goal in this paper is to provide the first fully-masked design where we propose certain optimizations and a practical evaluation of security. We also discuss potential further optimizations and extensions of masking for hardware design and security refinements.
2. Threat Model
We adopt the standard DPA threat model in which an adversary has direct physical access to the target device running inference (batina2018csi; dubey2019maskednet; kocher2011introduction), or can obtain power measurements remotely (zhao2018fpga) when the device executes neural network computations. The adversary can control the inputs and observe the corresponding outputs from the device as in chosen-plaintext attacks.
Figure 1 shows our threat model where the training phase is trusted but the trained model is then deployed to an inference engine that operates in an untrusted environment. The adversary is after the trained model parameters (e.g., weights and biases of a neural network)—input data privacy is out of scope (wei2018know).
We assume that the trained ML model is stored in a protected memory and that standard techniques are used to securely transfer it (i.e., bus snooping attacks are out of scope) (mlIPprot). The adversary, therefore, has gray-box access to the device, i.e., it knows all the design details up to the level of each individual logic gate but does not know the trained ML model. We restrict the secret variables to just the parameters and not the hyperparameters such as the number of neurons, following earlier work (juuti2019prada; tramer2016stealing; dubey2019maskednet). In fact, an adversary will still not be able to clone the model with just the hyperparameters if it does not possess the required compute power or training dataset. This is analogous to the scenario in cryptography where an adversary, even after knowing the implementation of a cipher, cannot break it without the correct key.
We target a hardware implementation of the neural network, not software. The design fully fits on the FPGA. Therefore, it does not involve any off-chip memory access and executes with constant flow in constant time. These attributes make the design resilient to any type of digital (memory, timing, access-pattern, etc.) side-channel attack. However, physical side-channels like power and EM emanations still exist; we address the power-based side-channel leakages in our work. Other implementation attacks on neural networks, such as fault attacks (breier2018practical; breier2020sniff), are out of scope.
3. Background and Related Work
This section presents related work on the privacy of ML applications, the current state of side-channel defenses, preliminaries on BNNs, and our BNN hardware design.
3.1. ML Model Extraction
Recent developments in the field of ML point to several motivating scenarios that demand asset confidentiality. Firstly, training is a computationally-intensive process and hence requires the model provider to invest in high-performance compute resources (e.g., a GPU cluster). The model provider might also need to invest money to buy a labeled dataset for training or to label an unstructured dataset. Therefore, knowledge about either the parameters or hyperparameters can provide an unfair business advantage to the user of the model, which is why the ML model should be private. Theoretical model extraction analyzes the query-response pairs obtained by repeatedly querying an unknown ML model to steal the parameters (jagielski2019high; oh2019towards; RST19; carlini2020cryptanalytic). This type of attack is similar to the class of theoretical cryptanalysis in the cryptography literature. Digital side-channels, by contrast, exploit the leakage of secret-data dependent intermediate computations like access-patterns or timing in the neural network computations to steal the parameters (yan2018cache; duddu2018stealing; dong2019floating; hu2019neural), which can usually be mitigated by making the secret computations constant-flow and constant-time. Physical side-channels target the leakage in physical properties like CMOS power draw or electromagnetic emanations that will still exist in a constant-flow/constant-time algorithm’s implementation (hua2018reverse; batina2018csi; wei2018know; xiang2020open; dubey2019maskednet). Mitigating physical side-channels is thus harder than mitigating digital side-channels in hardware accelerator design, and the problem has been extensively studied in the cryptography community.
3.2. Side-Channel Defenses
Researchers have proposed numerous countermeasures against DPA. These countermeasures can be broadly classified as either hiding-based or masking-based. The former aims to make the power consumption constant throughout the computation by using power-balancing techniques (tiri2004logic; nassar2010bcdl; yu2007secure). The latter splits the sensitive variable into multiple statistically independent shares to ensure that the power consumption is independent of the sensitive variable throughout the computation (CHES:AkkGir01; CHES:GolTym02; BGK05; OMPR05; TKL05; C:IshSahWag03). The security provided by hiding-based schemes hinges upon the precision of the back-end design tools to create a near-perfect power-equalized circuit by balancing the load capacitances across the leakage-prone paths. This is not a trivial task, and prior literature shows how a well-balanced dual-rail based defense is still vulnerable to localized EM attacks (immler2017your). By contrast, masking transforms the algorithm itself to work in a secure way by never evaluating the secret variables directly, keeping the security mostly independent of back-end design and making it a favorable choice over hiding.
3.3. Neural Network Classifiers
Neural network algorithms learn how to perform a certain task. In the learning phase, the user sends a set of inputs and expected outputs to the machine (a.k.a., training), which helps it to approximate (or learn) the function mapping the input-output pairs. The learned function can then be used by the machine to generate outputs for unknown inputs (a.k.a., inference).
A neural network consists of units called neurons (or nodes) and these neurons are usually grouped into layers. The neurons in each layer can be connected to the neurons in the previous and next layers. Each connection has a weight associated with it, which is computed as part of the training process. The neurons in a neural network work in a feed-forward fashion passing information from one layer to the next.
The weights and biases can be initialized to be random values or a carefully chosen set before training (tlearning). These weights and biases are the critical parameters that our countermeasure aims to protect. During training, a set of inputs along with the corresponding labels are fed to the network. The network computes the error between the actual outputs and the labels and tunes the weights and biases to reduce it, converging to a state where the accuracy is acceptable.
3.4. Binarized Neural Networks
The weights and biases of a neural network are typically floating-point numbers. However, the high area, storage costs, and power demands of floating-point hardware do not fare well with the requirements of resource-constrained edge devices. Fortunately, Binarized Neural Networks (BNNs) (courbariaux2016binarized), with their low hardware cost and power needs, fit very well in this use-case while providing a reasonable accuracy. BNNs restrict the weights and activations to binary values (+1 and -1), which can easily be represented in hardware by a single bit. This significantly reduces the storage costs for the weights from floating-point values to binary values. The XNOR-POPCOUNT operation implemented using XNOR gates replaces the large floating-point multipliers, resulting in a huge area and performance gain (rastegari2016xnor).
Figure 2 depicts the neuron computation in a fully-connected BNN. The neuron in the first hidden layer multiplies the input values with their respective binarized weights. The generated products are added to the bias, and the result is fed to the activation function, which is a sign function that binarizes the non-negative and negative inputs to +1 and -1, respectively. Hence, the activations in the subsequent layer are also binarized.
3.5. Our Baseline BNN Hardware Design
We consider a BNN having an input layer of 784 nodes, 3 hidden layers of 1010 nodes each, and an output layer of 10 nodes. The 784 input nodes denote the 784 pixel values in the 28×28 grayscale images of the Modified National Institute of Standards and Technology (MNIST) database, and the 10 output nodes represent the 10 output classes of the handwritten numerical digits (courbariaux2016binarized; umuroglu2017finn; rastegari2016xnor).
3.5.1. Weighted Summations
We choose to use a single adder in the design and sequentialize all the additions in the algorithm to reduce the area costs. Figure 3 shows our baseline BNN design. The computation starts from the input layer pixel values stored in the Pixel Memory. For each node of the first hidden layer, the hardware multiplies the 784 input pixel values by their weights one by one and accumulates the sum of these products. The final summation is added to the bias, reusing the adder with a multiplexed input, and fed to the activation function. The hardware uses XNOR and POPCOUNT operations to perform weighted summations in the hidden layers. The POPCOUNT operation also involves an additional step of subtracting the number of nodes (1010) from the final sum, which can be done as part of the bias addition step. The final layer summations are sent to the output logic.
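The relation between the ±1 weighted summation and the XNOR-POPCOUNT form can be checked with a minimal Python sketch. The function names are hypothetical, the 1-bit encoding (0 for -1, 1 for +1) follows the text, and the factor of two applied to the popcount is one common way to write the correction step; it is not taken from the hardware description.

```python
import random

def signed_dot(acts, weights):
    """Reference dot product over values in {-1, +1}."""
    return sum(a * w for a, w in zip(acts, weights))

def xnor_popcount_dot(act_bits, weight_bits):
    """Same dot product on the 1-bit encoding (0 -> -1, 1 -> +1)."""
    n = len(act_bits)
    # XNOR is 1 exactly when the two bits agree; POPCOUNT counts those ones.
    popcount = sum(1 for a, w in zip(act_bits, weight_bits) if a == w)
    # Correction step: subtract the number of nodes from the doubled count.
    return 2 * popcount - n

def to_signed(bits):
    """Map the 1-bit encoding back to {-1, +1}."""
    return [1 if b else -1 for b in bits]

random.seed(0)
N = 1010  # hidden-layer width used in the text
act_bits = [random.getrandbits(1) for _ in range(N)]
weight_bits = [random.getrandbits(1) for _ in range(N)]
assert xnor_popcount_dot(act_bits, weight_bits) == signed_dot(
    to_signed(act_bits), to_signed(weight_bits)
)
```

Because the subtraction of the node count is a constant, it can be folded into the bias, which is consistent with performing it during the bias addition step.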
In the input layer computations, the hardware multiplies an 8-bit unsigned input pixel value with its corresponding weight. The weight values are binarized to either 0 or 1 (representing a -1 or +1, respectively). Figure 4 shows the realization of this multiplication with a multiplexer that takes in the pixel value and its 2’s complement as the data inputs and the weight (±1) as the select line. The 8-bit unsigned pixel value, when multiplied by ±1, needs to be sign-extended to 9 bits, resulting in a 9-bit multiplexer.
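A minimal Python sketch of the multiplexer-based multiplication follows. The helper names are hypothetical; the sketch only illustrates why nine bits are needed once the product can be negative.

```python
def mux_multiply(pixel, weight_bit):
    """Select the pixel (weight +1) or its 2's complement (weight -1) as a 9-bit word."""
    assert 0 <= pixel <= 255 and weight_bit in (0, 1)
    mask9 = 0x1FF
    plus = pixel & mask9       # zero-extended: the 9th bit (sign) is 0
    minus = (-pixel) & mask9   # 9-bit 2's complement of the pixel
    return plus if weight_bit else minus

def from_signed9(word):
    """Interpret a 9-bit word as a signed integer."""
    return word - 512 if word & 0x100 else word

# Every pixel value round-trips correctly for both weights.
for p in range(256):
    assert from_signed9(mux_multiply(p, 1)) == p
    assert from_signed9(mux_multiply(p, 0)) == -p
```

With only eight bits, a product such as -255 could not be represented in 2's complement, which is why the data path widens to nine bits.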
3.5.2. Activation Function
The activation function binarizes the non-negative and negative summations to +1 and -1, respectively, for each node of the hidden layer. In hardware, this is implemented using a simple NOT gate that takes the MSB of the summation as its input.
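As a small illustration, the sign activation can be sketched in Python as an inversion of the MSB of a 2's complement word. The function name is hypothetical and the 20-bit width is an assumption borrowed from the adder width mentioned elsewhere in the paper.

```python
def sign_activation_bit(summation, width=20):
    """Return the 1-bit activation (1 -> +1, 0 -> -1) for a signed summation."""
    word = summation & ((1 << width) - 1)  # 2's complement encoding on `width` bits
    msb = (word >> (width - 1)) & 1        # 1 for negative values, 0 otherwise
    return msb ^ 1                         # NOT gate on the MSB

assert sign_activation_bit(0) == 1     # zero counts as non-negative
assert sign_activation_bit(-1) == 0
```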
3.5.3. Output Layer
The summations in the output layer represent the confidence score of each output class for the provided image. Therefore, the final classification result is the class having the maximum confidence score. Figure 3 shows the hardware for computing the classification result. As the adder generates output layer summations, they are sent to the output logic block that performs a rolling update of the max register if the newly received sum is greater than the previously computed max. In parallel, the hardware also stores the index of the current max node. The index stored after the final update is sent out as the final output of the neural network. The hardware takes 2.8M cycles to finish one inference.
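The rolling update of the output logic can be sketched in a few lines of Python. The function name is hypothetical, and initializing the max register with the first summation is an assumption, since the text does not state the initial value.

```python
def classify(sums):
    """Rolling max over the output-layer summations; returns the winning index."""
    max_val, max_idx = sums[0], 0          # assumed initialization
    for idx in range(1, len(sums)):
        if sums[idx] > max_val:            # update only on a strictly greater sum
            max_val, max_idx = sums[idx], idx
    return max_idx

scores = [3, -2, 7, 7, 1, 0, -9, 4, 6, 2]  # ten made-up confidence scores
winner = classify(scores)
```

Because the comparison is strict, ties keep the earlier node, as in the description where the register changes only when the new sum is greater.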
4. Fully Masking the Neural Network
This section discusses the hardware design and implementation of all components in the masked neural network. Prior work on masking of neural networks shows that arithmetic masking alone cannot mask integer addition due to a leakage in the sign-bit (dubey2019maskednet). Hence, we apply gate-level Boolean masking to perform integer addition in a secure fashion. We express the entire computation of the neural network as a sequence of AND and XOR operations and apply gate-level masking on the resulting expression. XORs, being linear, do not require any additional masking, and AND gates are replaced with secure, Trichina style AND gates (TKL05). Furthermore, we design specialized circuits for BNN’s unique components like Masked Multiplexer and Masked Output Layer.
We first explain the notation used in the equations and figures. Any variable without a subscript or superscript represents an N-bit number. We use the subscript to refer to a single bit of the N-bit number. For example, $a_i$ refers to the $i$-th bit of $a$. The superscript in masking refers to the different secret shares of a variable. To refer to a particular share of a particular bit of an N-bit number, we use both the subscript and the superscript. For example, $a_i^1$ refers to the second Boolean share of the $i$-th bit of $a$. If a variable only has the superscript (say $a^0$), we are referring to its full N-bit Boolean share; N can also be equal to 1, in which case $a$ is simply a bit. $r$ (or $r_i$) denotes a fresh, random bit.
4.2. Why Trichina’s Masking Style?
Among the closely related masking styles (reparaz16-2), we chose to implement Trichina’s method due to its simplicity and implementation efficiency. Figure 5 (left) shows the basic structure and functionality of Trichina’s gate, which implements a 2-bit, masked AND operation $c = a \cdot b$. Each input ($a$ and $b$) is split into two shares ($a^0$ and $a^1$ s.t. $a = a^0 \oplus a^1$, and $b^0$ and $b^1$ s.t. $b = b^0 \oplus b^1$). These shares are sequentially processed with a chain of AND gates initiated with a fresh random bit ($r$). A single AND operation thus uses 3 random bits. The technique ensures that the output is the Boolean masked output of the original AND function, i.e., $c^0 \oplus c^1 = a \cdot b$, while all the intermediate computations are randomized.
Unfortunately, the straightforward adoption of Trichina’s AND gate can lead to information leakage due to glitches (ICICS:NikRecRij06). For instance, in Figure 5 (left), if the products $a^0 b^0$ and $a^0 b^1$ reach the input of the second XOR gate before the random mask $r$ reaches the input of the first XOR gate, the output of the XOR gate will temporarily evaluate (glitch) to $a^0 (b^0 \oplus b^1)$, which leads to the secret value $b$ being unmasked. Therefore, we opted for an extension of Trichina’s AND gate that adds flip-flops to synchronize the arrival of inputs at the XOR gates (see Figure 5, right). The only XOR gate not having a flip-flop at its input is the leftmost XOR gate, which is not a problem because a glitching output at this gate does not combine two shares of the same variable. Similar techniques have been used in the past (regGlitch). Masking styles like Threshold gates (CHES:ManPraOsw05; maskedcmosleak; SAC:TirSch06) may be considered for even stronger security guarantees, but they will add further area, performance, and randomness overhead.
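The functional behavior of Trichina’s AND gate and the glitch hazard can be illustrated with a minimal Python sketch. The share names, the ordering of the products in the chain, and the function names are assumptions made for illustration. The sketch models logic values only and says nothing about timing or power.

```python
from itertools import product

def trichina_and(a0, a1, b0, b1, r):
    """Return shares (c0, c1) such that c0 ^ c1 == (a0 ^ a1) & (b0 ^ b1)."""
    t = r ^ (a0 & b0)   # first XOR: the random bit absorbs the first product
    t ^= a0 & b1        # second XOR
    t ^= a1 & b0        # third XOR
    t ^= a1 & b1        # fourth XOR
    return r, t         # one share is the random bit, the other the chain output

def glitch_value(a0, b0, b1):
    """Transient at the second XOR if both products arrive before r (r line assumed 0)."""
    return (a0 & b0) ^ (a0 & b1)   # equals a0 & (b0 ^ b1): b appears unmasked

# Exhaustive check of correctness and of the glitch identity.
for a0, a1, b0, b1, r in product((0, 1), repeat=5):
    c0, c1 = trichina_and(a0, a1, b0, b1, r)
    assert (c0 ^ c1) == ((a0 ^ a1) & (b0 ^ b1))
    assert glitch_value(a0, b0, b1) == (a0 & (b0 ^ b1))
```

The second assertion shows why the arrival order matters: without the random bit in place, the partial result depends on the recombined value of one operand.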
4.3. Masked Adder
We adopt the ripple-carry style of implementation for the adder. It is formed using N 1-bit full adders where the carry-out from each adder is the carry-in for the next adder in the chain, starting from LSB. Therefore, ripple-carry configuration eases parameterization and modular design of the Boolean masked adders.
4.3.1. Design of a Masked Full Adder
A 1-bit full adder takes as input two operands and a carry-in and outputs the sum and the carry, which are a function of the two operands and the carry-in. If the input operand bits are denoted by $a$ and $b$ and the carry-in bit by $c$, then the Boolean equations of the sum $S$ and the carry $C$ can be described as follows:

$$S = a \oplus b \oplus c \tag{1}$$

$$C = a \cdot b \oplus b \cdot c \oplus c \cdot a \tag{2}$$

Figure 6 shows the regular 1-bit full adder (on the left) and the resulting masked adder with Trichina’s AND gates (on the right). In the rest of the subsection, we discuss the derivation of the masked full adder equations.

The first step is to split the secret variables ($a$, $b$ and $c$) into Boolean shares. The hardware samples a fresh, random mask from a uniform distribution and performs an XOR with the original variable. If we represent the random masks as $a^0$, $b^0$ and $c^0$, then the masked values $a^1$, $b^1$ and $c^1$ can be generated as follows:

$$a^1 = a \oplus a^0, \qquad b^1 = b \oplus b^0, \qquad c^1 = c \oplus c^0$$

A masking scheme always works on the two shares independently without ever combining them at any point in the operation. Combining the shares at any point will reconstruct the secret and create a side-channel leak at that point.

The function of sum-generation is linear, making it easy to directly and independently compute the Boolean shares of $S$:

$$S = (a^0 \oplus b^0 \oplus c^0) \oplus (a^1 \oplus b^1 \oplus c^1) = S^0 \oplus S^1$$

Unlike the sum-generation, carry-generation is a non-linear operation due to the presence of an AND operator. Hence, the hardware cannot directly and independently compute the Boolean shares $C^0$ and $C^1$ of $C$. We use Trichina’s construction explained in subsection 4.2 to mask carry-generation.

The hardware uses three Trichina’s AND gates to mask the three AND operations in equation (2) using three random masks. This generates two Boolean shares from each Trichina AND operation. At this point, the expression is linear again, and therefore, the hardware can redistribute the terms, similar to the masking of the sum operation.

In the following equations, we use $TG(x, y)$ to represent the product $x \cdot y$ implemented via Trichina’s AND gate, as illustrated in the following equation:

$$TG(x, y) = (xy)^0 \oplus (xy)^1 = x \cdot y$$

where $(xy)^0$ and $(xy)^1$ are the two Boolean shares of the product. Replacing each AND operation in equation (2) with TG, we can write

$$C = TG(a, b) \oplus TG(b, c) \oplus TG(c, a)$$

which can also be written as a combination of two Boolean shares $C^0$ and $C^1$:

$$C^0 = (ab)^0 \oplus (bc)^0 \oplus (ca)^0, \qquad C^1 = (ab)^1 \oplus (bc)^1 \oplus (ca)^1$$

Therefore, we create a masked full adder that takes in the Boolean shares of the two bits to be added along with a carry-in and gives out the Boolean shares of the sum and carry-out.
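The masked full adder derived above can be checked exhaustively with a short Python sketch. The function names and the representation of each shared bit as a pair of shares are assumptions made for illustration. The model is purely functional and does not capture registers or glitches.

```python
from itertools import product

def tg(x0, x1, y0, y1, r):
    """Trichina AND on shared bits; returns the two shares of x & y."""
    t = r ^ (x0 & y0) ^ (x0 & y1) ^ (x1 & y0) ^ (x1 & y1)
    return r, t

def masked_full_adder(a, b, c, r):
    """a, b, c are (share0, share1) pairs; r holds three fresh random bits."""
    # Sum is linear: each output share uses only the matching input shares.
    s = (a[0] ^ b[0] ^ c[0], a[1] ^ b[1] ^ c[1])
    # Carry needs three masked ANDs, then a share-wise XOR.
    ab = tg(*a, *b, r[0])
    bc = tg(*b, *c, r[1])
    ca = tg(*c, *a, r[2])
    cout = (ab[0] ^ bc[0] ^ ca[0], ab[1] ^ bc[1] ^ ca[1])
    return s, cout

# Check all 2^9 combinations of input shares and random bits.
for bits in product((0, 1), repeat=9):
    a, b, c, r = bits[0:2], bits[2:4], bits[4:6], bits[6:9]
    s, cout = masked_full_adder(a, b, c, r)
    total = (a[0] ^ a[1]) + (b[0] ^ b[1]) + (c[0] ^ c[1])
    assert (s[0] ^ s[1]) == (total & 1)
    assert (cout[0] ^ cout[1]) == (total >> 1)
```

Note that the shares are only recombined inside the assertions, which play the role of a test bench; the adder itself never XORs two shares of the same variable.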
4.3.2. The Modular Design of Pipelined N-bit Full Adder
The masked full adders can be chained together to create an N-bit masked adder that can add two masked N-bit numbers. Figure 7 (top) shows how to construct a 4-bit masked adder as an example. We pipeline the N-bit adder to yield a throughput of one addition per cycle by adding registers between the full adders corresponding to each bit (see Figure 7 (bottom)).
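A functional model of the N-bit masked ripple-carry adder can be sketched in Python as follows. The helper names are hypothetical, Python's `secrets` module stands in for the hardware random number generator, and the pipeline registers are not modeled, so the sketch shows only the chaining of carries between bit positions.

```python
import secrets

def trichina_and(x0, x1, y0, y1):
    """Masked AND of two shared bits with a fresh random bit."""
    r = secrets.randbits(1)
    t = r ^ (x0 & y0)
    t ^= x0 & y1
    t ^= x1 & y0
    t ^= x1 & y1
    return r, t

def share(value, width):
    """Split a width-bit value into two Boolean shares."""
    mask = secrets.randbits(width)
    return mask, value ^ mask

def masked_add(a, b, width):
    """Add two shared width-bit numbers; returns the shares of (a + b) mod 2^width."""
    s0 = s1 = 0
    c0 = c1 = 0  # the initial carry-in is the public constant 0
    for i in range(width):
        a0, a1 = (a[0] >> i) & 1, (a[1] >> i) & 1
        b0, b1 = (b[0] >> i) & 1, (b[1] >> i) & 1
        s0 |= (a0 ^ b0 ^ c0) << i
        s1 |= (a1 ^ b1 ^ c1) << i
        ab = trichina_and(a0, a1, b0, b1)
        bc = trichina_and(b0, b1, c0, c1)
        ca = trichina_and(c0, c1, a0, a1)
        c0 = ab[0] ^ bc[0] ^ ca[0]  # carry-out shares feed the next bit
        c1 = ab[1] ^ bc[1] ^ ca[1]
    return s0, s1

WIDTH = 20
x, y = 12345, 54321
s0, s1 = masked_add(share(x, WIDTH), share(y, WIDTH), WIDTH)
assert (s0 ^ s1) == (x + y) % (1 << WIDTH)
```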
4.4. Masking of Activation Function
4.5. Masked Multiplexer
A 9-bit multiplexer is internally a set of nine parallel 1-bit multiplexers. We implement the masked 1-bit multiplexer using a 4-input, 2-output masked look-up table (LUT). Figure 9 shows the masked LUT, which takes in the three original multiplexer inputs and an additional fresh random mask, and produces two outputs: the random mask, which is simply bypassed, and the correct multiplexer output XORed with the random mask. We assume that each LUT operation is atomic. Since the output functions are 4-input, 2-output, they can be mapped onto the same LUT of the target FPGA (ugSpartan6). The small number of inputs also obviates the need for precautions like building a carefully balanced tree of smaller-input LUTs (CHES:RRVV15). Advanced masking constructions can be used to implement this function for a stronger security guarantee. As suggested in another work (CHES:RRVV15), masked look-ups can also be implemented using ROMs if the target is an ASIC, since ROMs are immutable and small in size. Thus, the (Boolean) masked output from the LUTs ensures that the secret intermediate variable (multiplexed input pixel) always remains masked.
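The masked look-up can be sketched in Python as a precomputed 16-entry table. The input ordering, the convention that a select value of 1 picks the second data input, and the names are assumptions made for illustration. The atomicity of the LUT evaluation is a hardware assumption that this functional model cannot capture.

```python
from itertools import product

def build_masked_mux_lut():
    """Precompute (d0, d1, sel, r) -> (r, mux_out ^ r) for all 16 input patterns."""
    lut = {}
    for d0, d1, sel, r in product((0, 1), repeat=4):
        out = d1 if sel else d0          # plain 2:1 multiplexer
        lut[(d0, d1, sel, r)] = (r, out ^ r)
    return lut

MASKED_MUX_LUT = build_masked_mux_lut()

def masked_mux(d0, d1, sel, r):
    """Single table look-up; the unmasked multiplexer output never appears as a value."""
    return MASKED_MUX_LUT[(d0, d1, sel, r)]

# The two outputs are Boolean shares of the multiplexer output.
for d0, d1, sel, r in product((0, 1), repeat=4):
    share0, share1 = masked_mux(d0, d1, sel, r)
    assert (share0 ^ share1) == (d1 if sel else d0)
```

A 9-bit masked multiplexer would apply this table once per bit position, each with its own fresh random bit.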
4.6. Masking the Output Layer
The hardware stores the 10 output layer summations in the form of Boolean shares. To determine the classification result, it needs to find the maximum value among the 10 masked output nodes. Specifically, it needs to compare two signed values expressed as Boolean shares. We transform the problem of masked comparison to masked subtraction.
Figure 10 shows the hardware design of the masked output layer. The hardware subtracts each output node value from the current maximum and swaps the current maximum (old max shares) with the node value (new max shares) if the MSB is 1 using a masked multiplexer. An MSB of 1 signifies that the difference is negative and hence the new sum is greater than the latest max. Instead of building a new masked subtractor, we reuse the existing masked adder to also function as a subtractor through a flag, which is set while computing max. In parallel, the hardware uses one more masked multiplexer-based update-circuit to update the Boolean shares of the index corresponding to the current max node (not shown in the Figure). This is to prevent known-ciphertext attacks, ciphertext being the classification result in our case. Finally, the Masked Output Logic computes the classification result in the form of (Boolean) shares of the node’s index having the maximum confidence score.
Subtraction is essentially adding a number with the 2’s complement of another number. 2’s complement is computed by taking bitwise 1’s complement and adding 1 to it. A bitwise 1’s complement is implemented as an XOR operation with 1 and the addition of 1 is implemented by setting the initial carry-in to be equal to 1. Since this only requires additional XOR gates, which is a linear operator, nothing changes with respect to the masking of the new adder-subtractor circuit.
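The reduction of comparison to subtraction can be illustrated with a minimal Python sketch. The 20-bit width and the function names are assumptions. The sketch uses plain integers for the arithmetic and only demonstrates, in `complement_shares`, that the 1’s complement of a shared value needs an XOR on a single share.

```python
WIDTH = 20
MASK = (1 << WIDTH) - 1

def subtract(a, b):
    """a - b computed as a + (~b) + 1 on WIDTH-bit 2's complement words."""
    return (a + (b ^ MASK) + 1) & MASK   # XOR with all-ones, carry-in of 1

def is_negative(word):
    """MSB of the difference: 1 means the result is negative."""
    return (word >> (WIDTH - 1)) & 1

def complement_shares(b0, b1):
    """1's complement of a shared value: XOR all-ones into one share only."""
    return b0 ^ MASK, b1

def encode(value):
    """Encode a signed integer as a WIDTH-bit 2's complement word."""
    return value & MASK

# max - node is negative exactly when the node exceeds the current max.
assert is_negative(subtract(encode(5), encode(9))) == 1
assert is_negative(subtract(encode(9), encode(5))) == 0
assert is_negative(subtract(encode(-3), encode(-7))) == 0

b0, b1 = 0x12345, 0x0F0F0
c0, c1 = complement_shares(b0, b1)
assert (c0 ^ c1) == ((b0 ^ b1) ^ MASK)
```

Since the complement touches one share and the carry-in is a public constant, the non-linear part of the circuit is unchanged from the masked adder.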
4.7. Scheduling of Operations
We optimize the scheduling in such a way that the hardware maintains a throughput of 1 addition per cycle. The latency of the masked 20-bit adder is 100 cycles. Therefore, the result from the adder will only be available 101 cycles after it samples the inputs (an additional cycle is needed for the accumulator register). The hardware cannot feed the next input in the sequence until the previous sum is available because of the data dependency between the accumulated sum and the next accumulated sum. This incurs a stall of 101 cycles for every addition, which multiplies the number of cycles needed for each node computation and causes a large performance drop over the unmasked implementation with a regular adder.
We solve the problem by finding useful work for the adder, independent of the summation in flight, during the stalls. We observe that computing the weighted summation of one node is completely independent of the next node’s computation. The hardware utilizes this independence to improve the throughput by starting the next node computation while the result for the first node is still in flight. Similarly, up to 101 nodes can be computed concurrently using the same adder, achieving the exact same throughput as the baseline design. This comes at the expense of additional registers (see Figure 11; the register file also has demultiplexing and multiplexing logic to update and consume the correct accumulated sum in sequence, which is not shown for simplicity) for storing 101 summations, plus some control logic, but a throughput gain of 784 (or 1010 in hidden layers) is worthwhile. The need to store 101 summations is why we use 1010 neurons, which is a multiple of 101, in the hidden layers. The optimization only works if the number of next-layer nodes is at least 101 and a multiple of 101. This prevents the optimization from being applied to the output layer (of 10 nodes) and contributes to the 3.5% increase in the latency of the masked design.
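The effect of interleaving can be sketched with a small cycle-count model in Python. The function names are hypothetical, and the model is a simplification that ignores the bias addition, the activation, and dependencies between layers. It only checks that a round-robin over 101 nodes never has to wait for a sum that is still in flight.

```python
LATENCY = 101  # adder latency plus the accumulator register, from the text

def stalled_cycles(num_inputs, num_nodes):
    """One node at a time: every addition waits for the previous sum."""
    return num_nodes * num_inputs * LATENCY

def interleaved_cycles(num_inputs, num_nodes):
    """Round-robin over groups of LATENCY nodes, one addition issued per cycle."""
    assert num_nodes % LATENCY == 0
    ready_at = [0] * LATENCY  # cycle at which each slot's running sum is available
    cycle = 0
    for _ in range(num_nodes // LATENCY):
        for _ in range(num_inputs):
            for slot in range(LATENCY):
                assert ready_at[slot] <= cycle  # the data dependency is satisfied
                ready_at[slot] = cycle + LATENCY
                cycle += 1
    return cycle + LATENCY  # drain the adder pipeline for the last sums

naive = stalled_cycles(784, 1010)
pipelined = interleaved_cycles(784, 1010)
speedup = naive / pipelined
```

In this model the internal assertion never fires, which reflects the observation that a given node's next addition is issued exactly when its previous sum becomes available.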
In this section, we describe the hardware setup used to implement the neural network and capture power measurements, the leakage assessment methodology that we follow to evaluate the security of the proposed design, and the hardware implementation results.
5.1. Hardware Setup
We implement the neural network in Verilog and use Xilinx ISE 14.7 for synthesis and bitstream generation. We use the DONT_TOUCH attribute in the code and disable the options like LUT combining, register reordering, etc. in the tool to prevent any type of optimization in the masked components.
Our side-channel evaluation platform is the SAKURA-G FPGA board (sakurag). It hosts a Xilinx Spartan-6 (XC6SLX75-2CSG484C) as the main FPGA that executes the neural network inference. An on-board amplifier amplifies the voltage drop across a 1 Ω shunt resistor on the power supply line. We use a Picoscope 3206D (picoscope) as the oscilloscope to capture the measurements from the dedicated SMA output port of the board. The design is clocked at 24MHz and the sampling frequency of the oscilloscope is 125MHz. A higher sampling frequency leads to the challenges that we discuss in Section 6.2. However, to ensure a sound evaluation, we perform first and second-order t-tests on a smaller unit of the design at a much higher precision: we conduct the experiment at a design frequency of 1.5MHz and a sampling frequency of 500MHz, which translates to 333 sample points per clock cycle.
We use Riscure’s Inspector SCA (inspector) software to communicate with the board and initiate a capture on the oscilloscope. By default, the Inspector software does not support SAKURA-G board communication. Hence, we develop our own custom modules in the software to automate the setup. The modules implement the FPGA communication protocol and perform the register reads and writes on the FPGA to start the neural network inference and capture the classification result.
5.2. Leakage Evaluation
We perform the leakage assessment of the proposed design using non-specific fixed vs. random t-tests, which are a common and generic way of assessing the side-channel vulnerability of a given implementation (becker2013test). A t-score lying within the threshold range of ±4.5 implies that the power traces do not leak any information about the data being processed, with up to 99.99% confidence. The measurement and evaluation are quite involved, and we refer the reader to Section 6.2 for further details. We demonstrate the security up to 2M traces, which is much greater than the first-order security of the currently best-known defense, which leaks at 40k traces (dubey2019maskednet).
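The fixed vs. random t-test can be sketched in Python using Welch's t-statistic computed independently at every sample point. The function names and the toy traces are made up for illustration; a real evaluation would process millions of measured traces, typically with incremental statistics rather than in-memory lists.

```python
import math
import statistics

def welch_t(x, y):
    """Welch's t-statistic between two groups of measurements."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(
        vx / len(x) + vy / len(y)
    )

def tvla(fixed_traces, random_traces, threshold=4.5):
    """Return the per-sample t-scores and whether any of them exceeds the threshold."""
    columns = zip(zip(*fixed_traces), zip(*random_traces))  # transpose to sample points
    t_scores = [welch_t(f, r) for f, r in columns]
    return t_scores, any(abs(t) > threshold for t in t_scores)

# Toy traces with two sample points; only the second depends on the input class.
fixed = [[(-1) ** i, 10 + (-1) ** i] for i in range(100)]
rand = [[(-1) ** i, (-1) ** i] for i in range(100)]
t_scores, leaks = tvla(fixed, rand)
```

In this toy data set the first sample point has identical statistics in both groups, while the second has a shifted mean and is flagged as leaking.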
Pseudo Random Number Generators (PRNGs) produce the fresh, random masks required for masking. We choose TRIVIUM (prng) as the PRNG, which is a hardware-friendly PRNG specified as an International Standard under ISO/IEC 29192-3, but any cryptographically-secure PRNG can be employed. TRIVIUM generates up to $2^{64}$ bits of output from an 80-bit key; hence, the PRNG has to be re-seeded before the output space is exhausted.
5.2.1. First-order tests
We first perform the first-order t-test on the design with the PRNGs disabled, which is equivalent to an unmasked (baseline/unprotected) design. Figure 12 (left) shows the result for this experiment, where we clearly observe significant leakages since the t-scores are greater than the threshold of 4.5 for the entire execution. Then, we perform the same test with the PRNGs switched on, which is equivalent to a masked design. Figure 12 (right) shows the results for this case, where we observe that the t-scores never cross the threshold of 4.5 except in the initial phase.
The initial phase leakages are due to the input correlations during input layer computations. The hardware loads the input pixel after every 101 cycles and feeds it to the masked multiplexer. The secret variable is the weight, which is never exposed because the masked multiplexer randomises the output using a fresh, random mask.
5.2.2. High Precision First and Second-order tests
We performed a univariate second-order t-test on the fully masked design (schneider2016leakage), but 1M traces were not sufficient to reveal the leakages. Due to the extremely lengthy measurement and evaluation times, it was infeasible to continue the test with more traces. Therefore, we perform first and second-order evaluation on the isolated synchronized Trichina’s AND gate, which is one of the main building blocks of the full design. We reduce the design frequency to 1.5MHz to increase the accuracy of the measurement and prevent any aliasing between clock cycles. The SNR for a single gate was not sufficient to see leakage even at 10M traces; hence, we amplify the SNR by instantiating 32 independent instances of the Trichina’s AND gate in the design, driven by the same inputs. We present the results for this experiment in Figure 13, which shows no leakage in the first-order t-test but significant leakages in the second-order t-tests for 500k traces. Thus, by detecting the expected leakage in the second-order t-tests, we validate the correctness of our measurement setup and the first-order masking implementation.
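The difference between the first and second-order tests can be illustrated with a toy Python simulation. The leakage model (the sum of the two share bits plus Gaussian noise), the comparison of two fixed secret values rather than fixed vs. random inputs, and all names are assumptions made for illustration; they are not a model of the measured device.

```python
import math
import random
import statistics

def welch_t(x, y):
    """Welch's t-statistic between two groups of measurements."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(
        vx / len(x) + vy / len(y)
    )

def centered_square(samples):
    """Second-order preprocessing: mean-free squared samples."""
    mu = statistics.fmean(samples)
    return [(s - mu) ** 2 for s in samples]

def masked_leakage(secret_bit, rng, noise=0.5):
    """Toy leakage of a two-share Boolean masking of one bit."""
    mask = rng.getrandbits(1)
    return mask + (secret_bit ^ mask) + rng.gauss(0.0, noise)

def run_tests(n=5000, seed=0):
    rng = random.Random(seed)
    group0 = [masked_leakage(0, rng) for _ in range(n)]
    group1 = [masked_leakage(1, rng) for _ in range(n)]
    t_first = welch_t(group0, group1)
    t_second = welch_t(centered_square(group0), centered_square(group1))
    return t_first, t_second

t_first, t_second = run_tests()
```

In this model both groups have the same mean, so the first-order statistic stays small, while their variances differ, which the second-order statistic picks up. This mirrors the qualitative outcome reported for the isolated gate.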
|Area||1833/1125/163||9833/7624/163||5.3 / 6.8 / 1|
"-" denotes no change in area between the unmasked and masked designs.
5.3. Masking Overheads
Table 1 shows that the performance overhead of masking is 1.04×, and that the overheads in the number of LUTs and FFs are 5.3× and 6.7×, respectively. We also summarize the area contribution of each design component in Table 2. The fourth column indicates what fraction of the total increase in area (i.e., 8000 LUTs and 6499 FFs) each component contributes. Most of the area increase is due to the throughput optimization logic, that is, the register file accumulator logic described in Subsection 4.7. The masked adder contributes 12% and 16% to the overall increase in LUTs and FFs, respectively. The increase due to the output layer logic is minimal. ROMs refer to the read-only memories storing the weights and bias values, where the increase is minimal (the slight increase in the number of LUTs arises because one of the memories is implemented using LUTs, which might redistribute even for the same memory size). RWMs refer to the read-write memories storing the layer activations, which also do not show any increase, as the masked version stores two bits (the Boolean shares) instead of one for each activation and these fit in the same BRAM tile.
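The area figures quoted above follow directly from the area row of Table 1, as the short calculation below shows. It assumes that the three slash-separated values in that row are LUTs, FFs, and BRAMs, in that order; the dictionary names are illustrative.

```python
# Resource counts taken from the area row of Table 1.
# Assumption: the three values are LUTs / FFs / BRAMs, in that order.
unmasked = {"LUT": 1833, "FF": 1125, "BRAM": 163}
masked = {"LUT": 9833, "FF": 7624, "BRAM": 163}

increase = {k: masked[k] - unmasked[k] for k in unmasked}  # absolute growth
ratio = {k: masked[k] / unmasked[k] for k in unmasked}     # multiplicative overhead

for k in unmasked:
    print(f"{k}: +{increase[k]} ({ratio[k]:.2f}x)")
```

The absolute increases are 8000 LUTs and 6499 FFs, and the ratios come out to roughly 5.36 and 6.78, which agrees with the reported overheads up to rounding.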
We compare the area-delay product (ADP) of our proposed design, BoMaNet, to that of MaskedNet (dubey2019maskednet), where area is defined as the sum of the number of LUTs and FFs, and delay is defined as the latency in number of cycles. The ADP of MaskedNet is approximately 100× lower than that of our design. This is expected since MaskedNet was designed for cost-effectiveness using hiding and partial masking, whereas in BoMaNet every operation is masked at the gate level to improve side-channel security. Similar overheads were observed in previous works on Boolean masking of AES (maskOverhead).
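As a worked instance of the ADP definition above, the sketch below multiplies the LUT and FF total by the latency in cycles. The resource counts are the masked-design values from Table 1. The latency is the rounded figure of roughly 3 million cycles quoted in Section 6.2, so the result is only an order-of-magnitude estimate and not the value reported in the paper. The function name is illustrative.

```python
def area_delay_product(luts, ffs, latency_cycles):
    """ADP as defined in the text: (LUTs + FFs) multiplied by latency in cycles."""
    return (luts + ffs) * latency_cycles


# Masked-design resources from Table 1; the latency is the rounded
# "roughly 3 million cycles" figure, hence only an estimate.
bomanet_adp_estimate = area_delay_product(9833, 7624, 3_000_000)
print(f"approximate ADP: {bomanet_adp_estimate:.2e}")
```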
6.1. Proof-of-Concept vs. Optimizations
The solution we propose uses simple yet effective techniques to mask an inference engine. There is, however, scope for improvement in both the hardware design and the security countermeasures. In this section, we discuss some possible optimizations and extensions of our work, as well as alternative approaches taken in the field of privacy for ML.
6.1.1. Design Optimizations
The ripple-carry adder used in this work can be replaced with advanced adder architectures like carry-lookahead (cla), carry-skip (csa), or Kogge-Stone (koggestone). These architectures commonly possess an additional logic block that pre-computes the generate and propagate bits. Therefore, additional randomness will be needed to mask the non-linear generate expression. All these adders have more combinational logic than the ripple-carry adder, which may make it harder to avoid glitches. To that end, prior work on TI-based secure versions of ripple-carry and Kogge-Stone adders can be extended (boma-adder15). Another potential optimization is the use of other masking styles like DoM (gross2016domain) or manual techniques (manualGlitch) to reduce the area and randomness overheads.
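To make the role of the generate and propagate bits concrete, the following sketch contrasts an unmasked ripple-carry adder with an unmasked Kogge-Stone adder operating on plain integers. It is only a functional illustration: no masking is applied, and the function names and bit widths are chosen here for the example. The generate term is an AND of the operands, which is the non-linear part that would require fresh randomness in a masked version, while the propagate term is an XOR and can be computed share-wise.

```python
def ripple_carry_add(a, b, width):
    """Bit-serial addition: the carry ripples through one full adder per bit."""
    carry, total = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total  # result modulo 2**width (final carry dropped)


def kogge_stone_add(a, b, width):
    """Parallel-prefix addition built on generate/propagate signals."""
    mask = (1 << width) - 1
    g = a & b   # generate: non-linear (AND)
    p = a ^ b   # propagate: linear (XOR)
    p_bits = p  # keep the per-bit propagate for the final sum
    shift = 1
    while shift < width:
        # Combine each group with the group `shift` positions below it.
        g = (g | (p & (g << shift))) & mask
        p = (p & (p << shift)) & mask
        shift <<= 1
    carries = (g << 1) & mask  # carry into bit i is the group generate of bits below i
    return p_bits ^ carries    # result modulo 2**width


print(ripple_carry_add(13, 7, 8), kogge_stone_add(13, 7, 8))
```

The prefix network needs only about log2(width) combining stages instead of one stage per bit, but each stage contains additional AND operations, which is where the extra randomness and the glitch concerns mentioned above would arise in a masked design.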
We reduce glitch-related vulnerabilities using registers at each stage, which is a low-cost, practical solution. Other works have proposed stronger countermeasures, at the cost of higher performance and area overheads (ICICS:NikRecRij06; gross2016domain). The quest for stronger and more efficient countermeasures is never-ending; masking of AES is still being explored, even 20 years after the initial work (CHES:AkkGir01), due to the advent of more efficient or secure masking schemes (d+1shares) and more potent attacks (moos2017static; TKL05).
Our solution is first-order secure, but there is scope for constructing higher-order masked versions. However, achieving higher-order security is a delicate task: Moos et al. recently showed that a straightforward extension of masking schemes to higher orders suffers from both local and compositional flaws (TCHES:MMSS19), and masking extensions were proposed in another recent work (cassiershardware).
This is the first work on fully-masked neural networks, and we foresee follow-up work, much as research on AES masking continues even after 20 years of intensive study.
6.2. Measurement Challenges
We faced some unique challenges that are not generally seen in symmetric-key cryptographic evaluations. Inference is a lengthy operation, especially for an area-optimized design: the inference latency of our design is roughly 3 million cycles. For a design frequency of 24MHz, this translates to an execution time of 122ms per inference. If the oscilloscope samples at 125MHz (sample interval of 8ns), the number of sample points to be captured per power trace is about 15 million. This significantly slows down the capturing of power traces. In our case, capturing 2 million power traces took one week, which means that capturing 100 million traces, as in the AES evaluation of (d+1shares), would take roughly a year. Performing TVLA on such large traces (approximately 28TB, in our case) also takes a significant amount of time: it took 3 days to get one t-score plot during our evaluations on a high-end PC (Intel Core i9-9900K, 64GB RAM). One possibility to avoid this problem is to look at a small subset of representative traces of the computation (DLRSA), but we instead conduct a comprehensive evaluation of our design.
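The figures in this paragraph can be cross-checked with a few lines of arithmetic. The clock frequency, execution time, sampling rate, and capture rate are taken from the text; the storage estimate additionally assumes one byte per sample, which the text does not state.

```python
clock_hz = 24e6          # design frequency
exec_time_s = 122e-3     # execution time per inference
sample_rate_hz = 125e6   # oscilloscope sampling rate (8 ns sample interval)

cycles = exec_time_s * clock_hz                   # about 2.9 million cycles
samples_per_trace = exec_time_s * sample_rate_hz  # about 15 million samples

traces_per_week = 2e6
weeks_for_100m = 100e6 / traces_per_week          # 50 weeks, i.e. roughly a year

# Assumption: one byte per sample (the sample width is not given in the text).
dataset_bytes = 2e6 * samples_per_trace
print(f"{cycles:.3g} cycles, {samples_per_trace:.3g} samples per trace, "
      f"{weeks_for_100m:.0f} weeks, {dataset_bytes / 2**40:.1f} TiB")
```

Under the one-byte assumption, 2 million traces occupy a little under 28 TiB, which is in line with the trace volume reported above.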
6.3. Theoretical vs Side-Channel Attacks
Theoretical model extraction, in which a proxy model is trained on a synthetically generated dataset using the predictions of the unknown victim model, is an active area of research (jagielski2019high; carlini2020cryptanalytic). These attacks mostly assume black-box access to the model and successfully extract the model parameters after a certain number of queries. This number ranges typically in the order of (carlini2020cryptanalytic). By contrast, physical side-channel attacks require only a few thousand queries to successfully steal all the parameters (dubey2019maskednet). This is partly due to the fact that physical side-channel attacks can extract information about intermediate computations even in a black-box setting. Physical side-channel attacks also do not require the generation of a synthetic dataset, unlike most theoretical attacks.
6.4. Orthogonal ML Defences
There has been some work on defending ML models against the stealing of inputs and parameters using other techniques such as Homomorphic Encryption (HE) and Secure Multi-Party Computation (SMPC) (gazelle18; xonn19; delphi20), Watermarking (rouhani2018deepsigns; adi2018turning), and Trusted Execution Environments (TEE) (preventingnn18; tramer2018slalom; mlcapsule18). The survey by Isakov et al. and the draft by NIST are good references for a more exhaustive list (mlNIST; isakov2019survey). The computational needs of HE might not be suitable for edge computing. The current SMPC defenses predominantly target a cloud-based ML framework, not the edge. We propose masking, which is an extension of SMPC to hardware, and we believe that it is a promising direction for ML side-channel defenses, as it has been for cryptographic applications. Watermarking techniques are punitive methods that cannot prevent physical side-channel attacks. TEEs are subject to ever-evolving microarchitectural attacks and are typically not available in edge/IoT nodes.
7. Conclusions and Future Outlook
Physical side-channel analysis of neural networks is a new, promising direction in hardware security, where attacks are evolving more rapidly than defenses. This work proposed the first fully-masked neural network, demonstrated its security with up to 2M traces, and quantified the overheads of a potential countermeasure. We have addressed the key challenge of masking arithmetic shares of integer addition (dubey2019maskednet) through Boolean masking. Furthermore, we presented ideas on how to mask the unique linear and non-linear computations of a fully-connected neural network that do not exist in cryptographic applications.
The large variety in neural network architectures in terms of the level of quantization, the types of layer operations (e.g., Convolution, Maxpool, Softmax), and the types of activation functions (e.g., ReLU, Sigmoid, Tanh) presents a large design space for neural network side-channel defenses. This paper focused on BNNs as they are a good starting point. The ideas presented in this work serve as a benchmark to analyze the vulnerabilities that exist in neural network computations and to construct more robust and efficient countermeasures.