CSI Neural Network: Using Side-channels to Recover Your Artificial Neural Network Information

10/22/2018 ∙ by Lejla Batina, et al. ∙ 0

Machine learning has become mainstream across industries, and numerous examples have demonstrated its validity for security applications. In this work, we investigate how to reverse engineer a neural network by using only power side-channel information. To this end, we consider a multilayer perceptron as the machine learning architecture of choice and assume a non-invasive, eavesdropping attacker capable of measuring only passive side-channel leakages like power consumption, electromagnetic radiation, and reaction time. We conduct all experiments on real data and common neural net architectures in order to properly assess the applicability and extendability of those attacks. Practical results are shown on an ARM Cortex-M3 microcontroller. Our experiments show that the side-channel attacker is capable of obtaining the following information: the activation functions used in the architecture, the number of layers and neurons in the layers, the number of output classes, and the weights in the neural network. Thus, the attacker can effectively reverse engineer the network using side-channel information. Next, we show that once the attacker knows the neural network architecture, he/she could also recover the inputs to the network with only a single-shot measurement. Finally, we discuss several mitigations one could use to thwart such attacks.


I Introduction

Machine learning, and more recently deep learning, has become hard to ignore for research in distinct areas, such as image recognition [1], robotics [2], natural language processing [3], and also security [4, 5], mainly due to its unquestionable practicality and effectiveness. The ever-increasing computational capabilities of today's computers and the huge amounts of data available are resulting in much more complex machine learning architectures than were envisioned before. As an example, the AlexNet architecture, consisting of 8 layers, was the best performing algorithm in the ILSVRC2012 image classification task (http://www.image-net.org/challenges/LSVRC/2012/). In 2015, the best performing architecture for the same task was ResNet, consisting of 152 layers [6]. This trend is not expected to stagnate any time soon, so it is prime time to consider machine learning from a novel perspective and in new use cases.

In this work, we focus on a widely used family of machine learning algorithms: neural networks. With the increasing number of design strategies and elements to use, fine-tuning the hyperparameters of these algorithms is emerging as one of the main challenges. Across distinct industries, we are witnessing an increase in strategies that treat models as intellectual property (IP). Basically, when optimized networks are of commercial interest, their details are kept undisclosed. For example, EMVCo (formed by MasterCard and Visa to manage specifications for payment systems and to facilitate worldwide interoperability) nowadays requires deep learning techniques for security evaluations [7]. The consequence is that security labs generate (and use) neural networks for the evaluation of products on the one hand, and on the other hand treat them as IP, exclusively for their customers.

There are also other reasons for keeping the neural network architectures secret. Often, these pre-trained models might provide additional information regarding the training data, which can be very sensitive. For example, if the model is trained based on a medical record of a patient [8], confidential information could be encoded into the network during the training phase. Also, machine learning models that are used for guiding medical treatments are often based on a patient’s genotype making this extremely sensitive from the privacy perspective [9]. Even if we disregard privacy issues, obtaining useful information from neural network architectures can lead to acquiring trade secrets from competition, which can lead to competitive products without violating intellectual property rights [10]. Hence, determining the layout of the network with trained weights is a desirable target for the attacker. One could ask the following question: why would an attacker want to reverse engineer the neural network architecture instead of just training the same network on its own? There are several reasons that are complicating this approach. First, the attacker might not have access to the same training set in order to train his own neural network. Second, as the architectures have become more complex, there are more parameters to tune and it could be extremely difficult for the attacker to pinpoint the same values for the parameters as in the architecture of interest.

Our main question relates to the feasibility of reverse engineering such architectures. Although binary analysis can already give useful information about the network, in practical cases, binary readback could be disabled by, e.g., blocking JTAG access [11]. However, exploiting side-channel leakages remains a viable option. Side-channel analyses have been widely studied in the information security and cryptanalysis community, due to their potentially devastating impact on otherwise (theoretically) secure algorithms. Concretely, it has been observed that different physical leakages from devices on which cryptography is implemented, such as timing delay, power consumption, and electromagnetic (EM) emanation during the computation, depend on the processed internal state and thus on the data. By statistically combining this physical leakage of the specific internal state with hypotheses on the data being manipulated, it is possible to recover the intermediate state being processed by the device.

In this study, our aim is to highlight the potential vulnerabilities of standard (perhaps still naive from the security perspective) implementations of neural networks. At the same time, we are unaware of any neural network implementation in the public domain that includes side-channel protection. For this reason, we do not just point to the problem but also suggest some means of protection of neural networks against side-channel attacks. Here, we consider some of the basic building blocks of neural networks: the number of hidden layers, the basic multiplication operation, and the activation functions. Assuming that the multiplications are performed on one known and one unknown operand and by observing e.g., power consumption as the leakage, additional information about the output of the multiplication becomes available. In this case, different hypotheses of the possible values can be correlated with the leakage to recover the unknown input up to a certain precision. We show that for our target implementation, the value of an unknown input to the multiplication could be estimated with up to 0.01 precision.

The complex structure of an activation function often leads to conditional branching due to the necessary exponentiation and division operations. This conditional branching introduces input-dependent timing differences, resulting in different timing behavior for different activation functions and thereby allowing function identification. Moreover, simply by observing the side-channel signatures, it is possible to deduce the number of nodes and also the number of layers in the network. By using the usual divide-and-conquer approach for side-channel analysis, the information at each layer can be recovered, and the recovered information can be used as input for recovering the subsequent layers. Consequently, in this work, we show it is possible to recover the layout of unknown networks by exploiting the side-channel information.

To the best of our knowledge, this kind of observation has never been used before in this context, at least not for leveraging (power/EM) side-channel leakages with reverse engineering of the neural network architecture as the main goal. We position our results in the following sections of this work.

The motivation for our work comes from ever more pervasive use of neural networks in security-critical applications and the fact that the architectures are becoming proprietary knowledge for the security evaluation industry. Thus, reverse engineering a neural net has become a new target for the adversaries and we need a better understanding of the vulnerability to side-channel leakages in those cases to be able to protect the users’ rights and data.

I-A Related Work

There are many papers considering machine learning and, more recently, deep learning for improving the effectiveness of side-channel attacks. For instance, a number of works have compared the effectiveness of classical profiled side-channel attacks against various machine learning techniques [12, 13, 14]. Lately, several works explored the power of deep learning in the context of side-channel analysis [15]. However, that line of work places a machine learning classifier in the role of the side-channel distinguisher, i.e., the selection function that typically leads to, e.g., key recovery.

On the other hand, using side-channel analysis in order to attack machine learning architectures has been much less investigated. Shokri et al. investigate the leakage of sensitive information from machine learning models about individual data records on which they were trained [16]. They show that such models are vulnerable to membership inference attacks and they also evaluate some mitigation strategies. Song et al. show how a machine learning model from a malicious machine learning provider can be used to obtain information about the training set of a model [17]. Hua et al. were the first to reverse engineer two convolutional neural networks, namely AlexNet and SqueezeNet, through memory and timing side-channel leaks [18]. The authors measure the side-channel through an artificially introduced hardware Trojan. They also need access to the original training data for part of the attack, which might not always be available. Lastly, in order to obtain the weights of the neural networks, they attack a very specific operation, i.e., zero pruning [19], which to an extent is more common for ReLU. Wei et al. have also performed an attack on an FPGA-based convolutional neural network accelerator [20]. They recovered the input image from the collected power consumption traces. The proposed attack exploits a specific design choice, i.e., the line buffer in a convolution layer of a CNN. In a nutshell, both previous reverse engineering efforts using side-channel information were performed on very special design choices for neural networks and had specific goals for the attacks.

Ohrimenko et al. used a secure implementation of MapReduce jobs and analyzed intermediate traffic between reducers and mappers [21]. They showed how an adversary observing the runs of typical jobs can infer precise information about the inputs. Xu et al. introduced controlled-channel attacks, a type of side-channel attack allowing an untrusted operating system to extract large amounts of sensitive information from protected applications [22]. Ohrimenko et al. discussed how machine learning algorithms can be exploited by various side-channels [23]. Consequently, they propose data-oblivious machine learning algorithms that prevent exploitation of side channels induced by memory, disk, and network accesses. Still, they note that side-channel attacks based on power and timing analysis are outside the scope of their research.

Orthogonally to those works, we explore the problem of reverse engineering of neural networks from a more generic perspective and in a grey to black-box setting. To be specific, the closest previous works to ours have reverse engineered neural networks by using cache attacks which work on distinct CPUs and are basically micro-architectural attacks (although using timing side-channel). Our approach utilizes power side-channel on small embedded devices and it is supported by practical results obtained on a real-world architecture.

I-B Contribution and Organization

The main contributions of this paper are:

  1. We describe a full reverse engineering of neural network parameters based on side-channel analysis. A combination of side-channel leakages is used to recover key parameters i.e., activation function, pre-trained weights, number of hidden layers and neurons in each layer. The proposed attack does not need any information on the (sensitive) training data as that information is often not even available to the attacker. We emphasize that, for our attack to work, we require only the knowledge of some inputs/outputs and side-channel measurements, which is a standard assumption for side-channel attacks.

  2. All the proposed attacks are practically implemented and demonstrated on two distinct microcontrollers (i.e. 8-bit AVR and 32-bit ARM), allowing full reverse engineering of the network architecture.

  3. Further, a single trace input recovery attack has been proposed, which recovers a dataset when applied on the initial layers. This implies that the attacker can recover all the inputs tested with a known neural network, recovering each input from a single measurement. Such attacks can put user’s sensitive data at great risk.

  4. We highlight some interesting aspects of side-channel attacks when dealing with real numbers, unlike in everyday cryptography. For example, we show that even a side-channel attack that failed can provide sensitive information about the target due to precision error.

  5. Finally, we propose a number of mitigation techniques that will render side-channel attacks more difficult.

We emphasize again that the simplicity of our attack is its strongest point, as it minimizes the assumption on an adversary. This makes the underlying problem even more serious as the attack does not require any pre-processing, chosen-plaintext messages, etc.

The rest of this paper is organized as follows. In Section II, we give details about specific machine learning algorithms we consider and side-channel analysis techniques we use. Section III gives results on reverse engineering of various elements of neural networks and Section IV on input recovery attack. Section V demonstrates the feasibility of attack on modern 32-bit ARM microcontrollers. In Section VI, we briefly discuss possible countermeasures one could apply to make our attacks more difficult. Finally, in Section VII, we conclude the paper and discuss potential future research directions.

II Background

In this section, we give details about artificial neural networks and their building blocks. Next, we discuss the concepts of side-channel analysis and several types of attacks we use in this paper.

II-A Artificial Neural Networks

Artificial neural networks (ANNs) is an umbrella notion for all computer systems loosely inspired by biological neural networks. Such systems are able to “learn” from examples, which makes them a strong (and very popular) paradigm in the machine learning domain. Any ANN is built from a number of nodes called artificial neurons. The nodes are connected in order to transmit a signal. Usually, in an ANN, the signal at the connection between artificial neurons is a real number and the output of each neuron is calculated as a nonlinear function of the sum of its inputs. Neurons and connections have weights that are adjusted as the learning progresses. Those weights are used to increase or decrease the strength of a signal at a connection. In the rest of this paper, we use the notions of an artificial neural network, neural network, and network interchangeably.

A very simple type of neural network is called a perceptron. A perceptron is a linear binary classifier applied to the feature vector. Each vector component has an associated weight $w_i$ and each perceptron has a threshold value $\theta$. The output of a perceptron equals “1” if the dot product of the feature vector and the weight vector is larger than zero and “-1” otherwise. A perceptron classifier works only for data that are linearly separable, i.e., if there is some hyperplane that separates all the positive points from all the negative points [24]. We depict a model of an artificial neuron in Figure 1. In the case of the perceptron, the activation function is the step function.
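The decision rule of a perceptron can be illustrated with a minimal Python sketch. The function name, the example weights, and the way the threshold is subtracted as an offset before the comparison are assumptions made for illustration; they are not taken from the paper.

```python
def perceptron_output(features, weights, threshold=0.0):
    """Step-function neuron: +1 if the weighted sum exceeds the threshold, else -1."""
    # Dot product of the feature vector and the weight vector.
    weighted_sum = sum(x * w for x, w in zip(features, weights))
    # Assumption: the threshold acts as an offset, so the test is "sum - threshold > 0".
    return 1 if weighted_sum - threshold > 0 else -1


# Hypothetical weights that separate points by the sign of (x0 - x1).
example_weights = [1.0, -1.0]
positive = perceptron_output([2.0, 0.5], example_weights)
negative = perceptron_output([0.5, 2.0], example_weights)
```

With these illustrative weights, the separating hyperplane is the line x0 = x1, so a point lying exactly on it is assigned to the “-1” class.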

Fig. 1: Depiction of an artificial neuron.

By adding more layers to a perceptron, we arrive at the multilayer perceptron algorithm. A multilayer perceptron (MLP) is a feed-forward neural network that maps sets of inputs onto sets of appropriate outputs. It consists of multiple layers of nodes in a directed graph, where each layer is fully connected to the next one. Consequently, each node in one layer connects with a certain weight $w$ to every node in the following layer. The multilayer perceptron algorithm consists of at least three layers: one input layer, one output layer, and one hidden layer. Those layers must consist of nonlinearly activating nodes [25].

We depict a model of a multilayer perceptron in Figure 2. Note, if there is more than one hidden layer, then it can be considered a deep learning architecture. At the same time, if the activation function for a neuron is the step function, it is easy to show that any number of layers can be reduced to two layers (one input and one output layer). Differing from the linear perceptron, an MLP can distinguish data that are not linearly separable. To train the network, the backpropagation algorithm is used, which is a generalization of the least mean squares algorithm in the linear perceptron. Backpropagation is used by the gradient descent optimization algorithm to adjust the weights of neurons by calculating the gradient of the loss function [24].

Fig. 2: Multilayer perceptron.
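The fully connected, layer-by-layer structure of an MLP can be sketched as a short forward pass in Python. This is only an illustration: the helper names, the bias terms, the toy weights, and the choice of tanh for every layer are assumptions, not the networks evaluated in the paper.

```python
import math


def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer; weights[j][i] links input i to neuron j."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(pre_activation))
    return outputs


def mlp_forward(inputs, layers):
    """Feed the signal through each (weights, biases, activation) layer in turn."""
    signal = list(inputs)
    for weights, biases, activation in layers:
        signal = dense_layer(signal, weights, biases, activation)
    return signal


# Hypothetical toy network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
toy_layers = [
    ([[0.5, -1.0], [1.5, 0.25]], [0.0, 0.1], math.tanh),
    ([[1.0, -1.0]], [0.0], math.tanh),
]
toy_output = mlp_forward([0.2, -0.4], toy_layers)
```

Each neuron performs one multiplication per incoming connection followed by one activation call, which is the operation sequence a side-channel observer would see repeated for every node.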

An activation function of a node is a function defining the output of a node given an input or set of inputs, see Eq. (1). In order for ANN to be able to calculate nontrivial functions using a small number of nodes, we need to use nonlinear activation functions.

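$$ y = \mathrm{Activation}\Big(\sum (\mathrm{weight} \cdot \mathrm{input}) + \mathrm{bias}\Big) \tag{1} $$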

In this paper, we consider the logistic (sigmoid) function, tanh function, softmax function, and Rectified Linear Unit function. The logistic function is a nonlinear function giving smooth and continuously differentiable results [26]. The range of the sigmoid function is $(0, 1)$, which means that all the values going to the next neuron will have the same sign.

$$ f(x) = \frac{1}{1 + e^{-x}} \tag{2} $$

The tanh function is a scaled version of the logistic function, where the main difference is that it is symmetric about the origin. The tanh function ranges in $(-1, 1)$.

$$ f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{3} $$

The softmax function is a generalization of the sigmoid function able to map values into multiple outputs (e.g., classes). The softmax function is ideally used in the output layer of the classifier in order to obtain the probabilities defining a class for each input [27].

$$ f(x)_j = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}, \quad j = 1, \ldots, K \tag{4} $$

The Rectified Linear Unit (ReLU) is a nonlinear function that differs from the previous activation functions in that it does not activate all the neurons at the same time [28]. By activating only a subset of neurons at any time, we make the network sparse and easier to compute. Consequently, such properties make ReLU probably the most widely used activation function in ANNs today.

$$ f(x) = \max(0, x) \tag{5} $$
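For reference, the four activation functions can be written directly in Python. This is a plain sketch of the textbook definitions and not the C implementation measured in the paper; subtracting the maximum inside softmax is an added numerical-stability assumption, and the sigmoid and tanh forms are not hardened against overflow for inputs of very large magnitude.

```python
import math


def sigmoid(x):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent written as a rescaled logistic function."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def softmax(values):
    """Map a list of scores to probabilities that sum to 1."""
    shift = max(values)  # assumption: shift for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def relu(x):
    """Rectified Linear Unit: pass positive inputs, clamp the rest to 0."""
    return max(0.0, x)
```

Note that ReLU needs only a comparison, sigmoid and tanh need one exponentiation and one division, and softmax needs one exponentiation per output neuron, which is why their running times differ so visibly.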

II-B Side-channel Analysis

Side-channel Analysis (SCA) exploits weaknesses on the implementation level [29]. More specifically, all computations running on a certain platform result in unintentional physical leakages. Those leakages are a sort of physical signatures from the reaction time, power consumption, and EM emanations released while the device was manipulating data. SCA exploits those physical signatures aiming at the key (secret data) recovery. In its basic form, SCA was proposed to perform key recovery attacks on implementation of cryptography [30, 31]. One advantage of SCA over traditional cryptanalysis is that SCA can apply a divide-and-conquer approach. Thus, instead of testing and recovering the full key at once, SCA can be used to recover small parts of the key independently, exponentially reducing the attack complexity.

However, the scope of SCA is much wider. For example, SCA was recently used to demonstrate IP theft from 3D printers [32]. Based on the analysis technique different variants of SCA are known. In the following, we recall a few analysis techniques used later in the paper. Although the following terms suggest power analysis, these techniques apply to other side-channels as well.

II-B1 Simple Power Analysis (SPA)

Simple power analysis, as the name suggests, is the most basic form of SCA [31]. It targets information from a sensitive computation which can be recovered from a single or a few traces.

As a common example, SPA can be used against a straightforward implementation of the RSA algorithm. Namely, the RSA exponentiation is composed of a sequence of square and multiply operations which depend on the secret key bits (a multiply follows a square only when the secret bit is 1; otherwise only the square is executed). As square and multiply have distinct physical signatures, the adversary can directly read out the key bits from, e.g., a power trace on a digital oscilloscope. Similar attacks have been applied to secret-key algorithms like AES [33], but then targeting the key schedule. In this work, we apply SPA to reverse engineer the architecture of the neural network.
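The RSA example can be made concrete with a small Python simulation. It assumes an idealized attacker who can perfectly label each operation in a trace as a square ("S") or a multiply ("M"); the function names and the toy parameters are illustrative only.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation that also logs the operation sequence."""
    result = 1
    operations = []
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus  # always square
        operations.append("S")
        if bit == "1":
            result = (result * base) % modulus  # multiply only for a 1 bit
            operations.append("M")
    return result, operations


def recover_exponent(operations):
    """Read the exponent bits back: 'S' followed by 'M' is a 1, a lone 'S' is a 0."""
    bits = []
    i = 0
    while i < len(operations):
        if i + 1 < len(operations) and operations[i + 1] == "M":
            bits.append("1")
            i += 2
        else:
            bits.append("0")
            i += 1
    return int("".join(bits), 2)


# Toy parameters, far too small for real RSA.
value, observed_ops = square_and_multiply(7, 0b101101, 1009)
leaked_exponent = recover_exponent(observed_ops)
```

A single operation sequence is enough to recover the whole exponent here, which is what distinguishes SPA from statistical attacks that need many traces.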

II-B2 Differential Power Analysis (DPA)

DPA is an advanced form of SCA, which applies statistical techniques to recover secrets from physical signatures when SPA is not possible. The attack normally tests for dependencies between the actual physical signature (or measurements) and a hypothetical physical signature, i.e., predictions on intermediate data. The hypothetical signature is based on a leakage model and a key hypothesis. With the divide-and-conquer approach, parts of the secret key (e.g., one byte) can be tested independently, allowing an exhaustive search over the key hypotheses. The knowledge of the leakage model comes from the adversary’s intuition and expertise. Some commonly used leakage models for representative devices are the Hamming weight for microcontrollers and the Hamming distance for FPGA, ASIC, and GPU [34, 35] platforms.

As the measurements can be noisy, the adversary often needs many measurements, sometimes millions. Next, statistical tests like correlation [36] are applied to identify correct key hypothesis from other wrong hypotheses. As we show later in the paper, DPA is used to recover secret weights from a pre-trained network.

II-B3 Horizontal Power Analysis (HPA)

HPA is another sort of side-channel attack using power as the source of leakage [37]. While DPA recovers the secret key statistically over multiple measurements, HPA is a single-trace attack exploiting several elementary operations in a single computation. The idea behind it is that identical data being manipulated, even in different computation steps, will have the same power signatures and can be recovered by, e.g., pattern recognition techniques. HPA can be used against protected implementations, for example with exponent blinding, where an adversary is limited to only one measurement. In this paper, we use HPA to perform an input recovery attack for a known network, where we show the technique to be effective for medium to large sized networks.

III Side-Channel Based Reverse Engineering of Neural Networks

As already discussed, side-channel leakages have been frequently used for cryptanalysis, in particular for key recovery attacks in cryptography and for the reverse engineering of cryptographic algorithms. In this work, we demonstrate the first application of SCA for reverse engineering of neural networks, with practical measurements on embedded platforms.

III-A Threat Model

The two main goals of this paper are to recover the neural network architecture and its inputs using only side-channel information.
Scenario. We choose to work with MLP since 1) it is a commonly used machine learning algorithm in modern applications, see e.g., [38, 39, 40, 41]; 2) it consists of fully connected layers, which also occur in other architectures like convolutional neural networks or recurrent neural networks; and 3) the layers are all identical, which makes it more difficult for SCA and could consequently be considered the worst-case scenario. We choose our attack to be as generic as possible while discarding common assumptions, which would make the attack easier but also more limited in scope. For instance, we make no assumption on the type of inputs or their source, as we work with real numbers. If the inputs are in the form of integers (like the MNIST database), the attack becomes easier, since we would not need to recover mantissa bytes and deal with precision. We also assume that the implementation of the machine learning algorithm does not include any side-channel countermeasures. Currently, to the best of our knowledge, no public implementation of ANN deploys side-channel countermeasures.

Attacker’s capability. We consider a passive attacker who is only capable of acquiring measurements of the device while operating “normally” and not interfering with its operations. We consider two settings:

  1. Attacker does not know the architecture of the used network but can feed random (or known) inputs to the architecture

    An adequate use case would be when the attacker legally acquires a copy of the network in a black box setting and aims at recovering its internal details, for IP theft. The attacker can query the device with random/chosen inputs and perform side-channel measurements while processing the data. The goal for this setting is to reverse engineer the following information about neural network architecture: number of layers, number of outputs, activation functions, weights in the network.

  2. Attacker knows the architecture but does not know the inputs to it

    A suitable use case is where a secret dataset is tested with a public MLP network. The input can correspond to sensitive data such as medical records of patients. The goal for this setting is to obtain the inputs (the data to be classified) to the network and we achieve this with a single measurement only.

III-B Experimental Setup

Here we describe the attack methodology, which is first validated on an Atmel ATmega328P. Later, we also demonstrate the proposed methodology on an ARM Cortex-M3. The side-channel measurements are collected during the execution of the classification and they are captured using the Lecroy WaveRunner 610zi oscilloscope. The oscilloscope measurements are synchronized with the operations by common handshaking signals like the start and stop of computation. To further improve the quality of measurements, we opened the chip package mechanically (see Figure 6(a)). An RF-U 5-2 near-field electromagnetic (EM) probe from Langer is used to collect the measurements (see Figure 6(b)). Note that EM measurements also allow observing the timing of all the operations, and thus the setup allows for timing side-channel analysis as well. The setup is depicted in Figure 6(c).

Our choice of the target platform is motivated by:

  • Atmel ATmega328P: This processor allows for high quality measurements. We are able to achieve a high signal-to-noise ratio (SNR) measurements, allowing us to focus on developing the methodology.

  • ARM Cortex-M3: A modern 32-bit microcontroller architecture with multiple stages of pipeline, on chip co-processors, low SNR measurements, and wide application. We show that the developed methodology is indeed versatile across targets with a relevant update of measurement capability.

For different platforms, the leakage model could change, but this would not limit our approach and methodology. In fact, those leakage models are well known for other common platforms like FPGA [34] and GPU [35]. Moreover, as for ARM Cortex-M3, low SNR of the measurement might force the adversary to increase the number of measurements and apply signal pre-processing techniques, but the principles of the analysis remain valid.

(a) Target 8-bit microcontroller mechanically decapsulated
(b) Langer RF-U 5-2 Near-field Electromagnetic passive Probe
(c) The complete measurement setup
Fig. 6: Experimental Setup

As already stated above, the exploited leakage model of the target device is the Hamming weight (HW) model. A microcontroller loads sensitive data onto a data bus to perform the indicated instructions. This data bus is pre-charged to all '0's before every instruction. Note that the pre-charging of the data bus is a natural behavior of the microcontroller and not a vulnerability introduced by the attacker. Thus, the power consumption (or EM radiation) is modeled as the number of bits equal to '1' in the loaded data. In other words, the power consumption of loading data $x$ is:

$$ HW(x) = \sum_{i=1}^{n} x_i \tag{6} $$

where $x_i$ represents the $i$-th bit of $x$. In our case, it is the secret pre-trained weight which is regularly loaded from memory for processing and results in the HW leakage. To conduct the side-channel analysis, we perform the divide-and-conquer approach, where we target each operation separately. The full recovery process is described in Section III-F.
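As a small illustration of the Hamming weight model, the following Python sketch counts the '1' bits of a value and, under the assumption that a weight is stored as a 32-bit IEEE 754 single-precision number and loaded one byte at a time on an 8-bit bus, lists the modeled leakage of each byte. The helper names and the big-endian byte order are illustrative choices, not details from the paper.

```python
import struct


def hamming_weight(value, width=8):
    """Number of bits equal to 1 in the low `width` bits of an integer."""
    return bin(value & ((1 << width) - 1)).count("1")


def float_byte_leakage(weight):
    """Modeled leakage per byte of a float32 weight (assumed big-endian order)."""
    return [hamming_weight(byte) for byte in struct.pack(">f", weight)]


# 1.0 is encoded as 0x3F800000, i.e. the bytes 0x3F, 0x80, 0x00, 0x00.
leakage_of_one = float_byte_leakage(1.0)
```

Under this model, two weights whose bytes have the same number of set bits are indistinguishable from a single load, which is why statistical attacks combine many measurements with different known inputs.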

Several pre-trained networks are implemented on the board. The training phase is conducted offline, and the trained network is then implemented in C language and compiled on the microcontroller. In our experiments, we consider multilayer perceptron architectures consisting of a different number of layers and nodes in those layers. Note that, with our approach, there is no limit in the number of layers or nodes we can attack, as the attack scales linearly with the size of the network. The methodology is developed to demonstrate that the key parameters of the network, namely the weights and activation functions can be reverse engineered. Further experiments are conducted on deep neural networks with three hidden layers. We emphasize that the method we use can be applied to larger networks as well.

III-C Reverse Engineering the Activation Function

We remind the reader that nonlinear activation functions are necessary in order to represent nonlinear functions with a small number of nodes in a network. As such, they are elements used in virtually any neural network architecture today [1, 6]. If the attacker is able to deduce the information on the type of used activation functions, he/she can use that knowledge together with information about input values to deduce the behavior of the whole network.

Fig. 7: Observing pattern and timing of multiplication and activation function

We analyze the side-channel leakage from different activation functions. We consider the most commonly used activation functions, namely ReLU, sigmoid, tanh, and softmax [26, 28]. The timing behavior can be observed directly on the EM trace. For instance, as shown in a later figure, a multiplication is followed by the activation function, each with an individual signature. For a similar architecture, we test different variants with each activation function. We collect EM traces and measure the timing of the activation function computation from the measurements. The measurements are taken while the network is processing random inputs drawn from a bounded range. A set of EM measurements is captured for each activation function. As shown in Figure 7, the timing behavior of each of the four tested activation functions has a distinct signature, allowing easy characterization.

(a) ReLU
(b) Sigmoid
(c) Tanh
(d) Softmax
Fig. 12: Timing behavior for different activation functions

Different inputs result in different processing times. Moreover, the timing behavior for the same inputs varies largely depending on the activation function. For example, we can observe that ReLU requires the shortest amount of time, due to its simplicity (see Figure 12(a)). On the other hand, tanh and sigmoid might have similar timing delays, but with a different pattern with respect to the input (see Figure 12(b) and Figure 12(c)), where tanh is more symmetrical in pattern compared to sigmoid, for both positive and negative inputs. We can observe that the softmax function requires the most processing time, since it requires the exponentiation operation, which also depends on the number of neurons in the output layer. As neural network algorithms are often optimized for performance, the presence of such timing side-channels is often ignored. A function such as tanh or sigmoid requires the computation of $e^x$ and a division, and it is known that such functions are difficult to implement in constant time. In addition, constant-time implementations might lead to a substantial performance degradation. Other activation functions can be characterized similarly. Finally, Table I presents the minimum, maximum, and mean computation time for each activation function over the captured measurements. While ReLU is fastest, the timing of each function stands out individually, thus allowing a straightforward recovery.

Activation Function    Minimum    Maximum    Mean
ReLU                   5,879      6,069      5,975
Sigmoid                152,155    222,102    189,144
Tanh                   51,909     210,663    184,864
Softmax                724,366    877,194    813,712
TABLE I: Minimum, Maximum, and Mean computation time (in ) for different activation functions
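
The profiling idea can be illustrated with a minimal Python sketch. It assumes the attacker stores the statistics of Table I as a profile and compares the minimum, maximum, and mean of freshly measured activation timings against it. The name `classify_activation` and the relative-distance metric are assumptions made for illustration; the text only states that the timing signatures are distinct enough to be told apart.

```python
# Timing profile per activation function: (minimum, maximum, mean),
# copied from Table I (same unit as in the table).
TIMING_PROFILE = {
    "ReLU":    (5_879, 6_069, 5_975),
    "Sigmoid": (152_155, 222_102, 189_144),
    "Tanh":    (51_909, 210_663, 184_864),
    "Softmax": (724_366, 877_194, 813_712),
}


def classify_activation(timings):
    """Return the profiled function closest to the observed timings.

    The observed (min, max, mean) is compared with each profile entry
    using a sum of relative differences (an illustrative metric).
    """
    observed = (min(timings), max(timings), sum(timings) / len(timings))

    def distance(name):
        profile = TIMING_PROFILE[name]
        return sum(abs(o - p) / p for o, p in zip(observed, profile))

    return min(TIMING_PROFILE, key=distance)


print(classify_activation([5900, 6000, 6050]))
print(classify_activation([60000, 200000, 190000, 205000]))
```

Because sigmoid and tanh have close mean timings, the sketch also uses the minimum and maximum. Comparing full timing patterns over the input range, as in Figure 12, would likely be more robust.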

III-D Reverse Engineering of the Multiplication Operation

Fig. 16: Correlation of different weight candidates on the multiplication operation, for weight = 2.43: (a) first mantissa byte, (b) second mantissa byte, (c) third mantissa byte.
Fig. 19: Correlation comparison between correct and incorrect mantissa of the weights: (a) weight = 1.635, (b) weight = 0.890.

A well-trained network can be of significant value. What distinguishes a well-trained network from a poorly trained one, for a given architecture, are the weights. Fine-tuned weights improve the accuracy of the network, which is of both commercial and academic interest. In the following, we demonstrate a way to recover those weights by using SCA.

For the recovery of the weights, we use Correlation Power Analysis (CPA), i.e., a variant of DPA that uses Pearson’s correlation as the statistical test. CPA targets the multiplication of a known input with a secret weight. Using the HW model, the adversary predicts the activity of the multiplication output for every hypothesis of the weight. The attack then computes the Pearson correlation coefficient between the side-channel measurement and this prediction, for every hypothesis of the weight. Given enough measurements, the correct value of the weight results in a higher correlation and thus stands out from all the wrong hypotheses. Although the attack concept remains the same as for an attack on cryptographic ciphers, the actual attack used here is quite different. While cryptographic operations are always performed on fixed-length integers, in ANN we are dealing with real numbers.

We start by analyzing the way the compiler handles floating-point operations for our target. The generated assembly is shown in Table II, which confirms the usage of an IEEE 754 compatible representation as stated above. Knowledge of the representation allows one to better estimate the leakage behavior. Since the target device is an 8-bit microcontroller, the representation follows the 32-bit pattern $(b_{31} \ldots b_0)$, which is stored in 4 registers. The 32 bits consist of 1 sign bit ($b_{31}$), 8 biased exponent bits ($b_{30} \ldots b_{23}$), and 23 mantissa (fractional) bits ($b_{22} \ldots b_0$). It can be formulated as:

$$(-1)^{b_{31}} \times 2^{(b_{30} \ldots b_{23})_2 - 127} \times (1.b_{22} \ldots b_0)_2$$

For example, the value can be expressed as . The measurement is taken when the computed result is stored back to memory, so the leakage follows the HW model of the stored value. Since the 32-bit value is split into individual 8-bit registers, each byte of the representation is recovered individually. Hence, recovering this representation is enough to recover an estimate of the real-number value.
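
As a worked illustration of this representation, the following Python sketch splits a single-precision value into sign, biased exponent, and mantissa, and lists the Hamming weight of each stored byte. It uses the four constant bytes that appear in Table II as test data. The helper names are hypothetical, and the standard `struct` module stands in for the compiler's floating-point library.

```python
import struct


def float32_fields(value):
    """Return (sign, biased exponent, 23-bit mantissa) of value as a float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def float32_bytes(value):
    """Return the four bytes of the float32 pattern, most significant first."""
    return list(struct.pack(">f", value))


def hamming_weight(byte):
    """Number of set bits, i.e. the value predicted by the HW leakage model."""
    return bin(byte).count("1")


def rebuild(sign, exponent, mantissa):
    """Evaluate (-1)^sign * 2^(exponent - 127) * 1.mantissa (normal numbers only)."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)


# Constant from Table II: the bytes 0x3D, 0x0A, 0x17, 0x40 are loaded
# least significant byte first.
(constant,) = struct.unpack("<f", bytes([0x3D, 0x0A, 0x17, 0x40]))
sign, exponent, mantissa = float32_fields(constant)
print(round(constant, 6), sign, exponent, hex(mantissa))
print([hamming_weight(b) for b in float32_bytes(constant)])
```

Each of the four per-byte Hamming weights corresponds to one register being written, which is why the recovery can proceed one byte at a time.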

#     Instruction        Comment
11a   ldd r22, Y+1       0x01
11c   ldd r23, Y+2       0x02
11e   ldd r24, Y+3       0x03
120   ldd r25, Y+4       0x04
122   ldi r18, 0x3D      61
124   ldi r19, 0x0A      10
126   ldi r20, 0x17      23
128   ldi r21, 0x40      64
12a   call 0xa0a         multiplication
12e   std Y+1, r22       0x01
130   std Y+2, r23       0x02
132   std Y+3, r24       0x03
134   std Y+4, r25       0x04
TABLE II: Code snippet of the generated assembly for multiplication by a constant (loaded byte by byte as 0x3D0A1740, least significant byte first, i.e., 0x40170A3D or approximately 2.36 in IEEE 754 representation). The multiplication itself is not shown here, but from the register assignment, our leakage model assumption holds.

To implement the attack, two different approaches can be considered. The first approach is to build the hypothesis on the weight directly. For this experiment, we target the result of the multiplication of known input values and an unknown weight. For every input, we assume different possibilities for the weight value. We then perform the multiplication and estimate the IEEE 754 binary representation of the output. To deal with the growing number of possible candidates for the unknown weight, we assume that the weight is bounded in a range whose limit is a parameter chosen by the adversary. The number of possible candidates is then determined by this range and by the precision used when dealing with floating-point numbers.

Then, we perform the recovery of the 23-bit mantissa of the weight. The sign and exponent could be recovered separately. Thus, we observe the leakage of 3 registers, and based on the best CPA results for each register, we can reconstruct the mantissa. Note that the recovered mantissa does not directly determine the weight, but together with the sign and exponent it yields the unique weight value. The traces are measured when the microcontroller multiplies the secret weight with uniformly random values between -1 and 1, to emulate normalized input values. We fix the bound of the weight range and, to reduce the number of possible candidates, assume that each floating-point value has a precision of 2 decimal places. Since we are dealing with the mantissa only, we then need to check only the non-negative weight candidates in that range, which further reduces the number of possible candidates.
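
The first approach can be sketched in Python on simulated data. This is not the measurement code behind the reported results. It assumes a weight bound of 5, a candidate step of 0.01, Gaussian noise, and, for brevity, a leakage equal to the Hamming weight of the whole 32-bit product rather than one register at a time. All function names are hypothetical.

```python
import math
import random
import struct


def hw_float32(value):
    """Hamming weight of the 32-bit IEEE 754 pattern of value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bin(bits).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def simulate_traces(inputs, secret_weight, noise_sigma, rng):
    """One leakage sample per input: HW of the stored product plus noise."""
    return [hw_float32(x * secret_weight) + rng.gauss(0.0, noise_sigma)
            for x in inputs]


def cpa_recover_weight(inputs, traces, bound=5, steps_per_unit=100):
    """Try every candidate 0.01, 0.02, ..., bound; return (best, correlation)."""
    best_candidate, best_rho = None, -1.0
    for i in range(1, bound * steps_per_unit + 1):
        candidate = i / steps_per_unit
        predicted = [hw_float32(x * candidate) for x in inputs]
        rho = abs(pearson(predicted, traces))
        if rho > best_rho:
            best_candidate, best_rho = candidate, rho
    return best_candidate, best_rho


rng = random.Random(1)                      # fixed seed for repeatability
inputs = [rng.uniform(-1.0, 1.0) for _ in range(400)]
traces = simulate_traces(inputs, 2.43, noise_sigma=1.0, rng=rng)
weight, rho = cpa_recover_weight(inputs, traces)
print(weight, round(rho, 3))
```

On real traces the correlation would be computed per time sample and per byte, and candidates whose mantissas coincide (for example, a weight and its double) would need the exponent recovery step to be separated.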

In Figure 16, we show the result of the correlation for each byte with the measured traces. The horizontal axis shows the time of execution and the vertical axis the correlation. The experiments were conducted on 1 000 traces for each case. In the figure, the black plot denotes the correlation of the “correct” mantissa weight (2.43), whereas the red plots are from all other weight candidates in the range described earlier. Since we are only attacking the mantissa in this phase, several weight candidates might have similar correlation peaks. After the recovery of the mantissa, the sign bit and exponent can be recovered similarly, which narrows down the list of candidates to a unique weight. Another observation is that the correlation value is not very high and is scattered across different clock cycles. This is because the measurements are noisy and, since the operation is not constant-time, the interesting time samples are distributed across multiple clock cycles. Nevertheless, the side-channel leakage can be exploited to recover the weight up to a certain precision. Multivariate side-channel analysis [42] can be considered if distributed samples hinder recovery.

We emphasize that attacking real numbers as in the case of weights of ANN can be simpler than attacking cryptographic implementations. This is because cryptography works on fixed length integers and exact values must be recovered. When attacking real numbers, small precision errors due to rounding off the intermediate values still result in useful information.

To deal with more precise values, we can target the mantissa multiplication operation directly. In this case, the search space can either cover all possible values of the mantissa (hence, more computational resources are required) or focus only on the most significant bits of the mantissa (fewer candidates, but also lower precision). Since the 7 most significant bits of the mantissa are processed in the same register, we can target only those bits, setting the rest to 0. Thus, our search space is now $2^7$ candidates. The mantissa multiplication can be performed by multiplying the two significands, each with its implicit leading 1, then taking the 23 most significant bits after the leading 1 and normalizing (updating the exponent if the result overflows) if necessary.
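
A minimal sketch of this mantissa multiplication, and of the 128 hypotheses for the 7 most significant mantissa bits, could look as follows. The function names are hypothetical, and the product is truncated rather than rounded as IEEE 754 would do, which is a simplification.

```python
import struct


def mantissa_of(value):
    """23-bit fraction field of value rounded to IEEE 754 single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bits & 0x7FFFFF


def multiply_mantissas(frac_x, frac_w):
    """Multiply the significands 1.frac_x and 1.frac_w.

    Returns (fraction, carry): the 23 bits after the leading 1 of the
    product (truncated) and 1 if the exponent has to be incremented.
    """
    product = ((1 << 23) | frac_x) * ((1 << 23) | frac_w)   # in [2^46, 2^48)
    carry = product >> 47                                   # 1 if product >= 2.0
    fraction = (product >> (23 + carry)) & 0x7FFFFF
    return fraction, carry


def top_bits_hypotheses(frac_x):
    """Predicted top 7 mantissa bits of the product for all 2^7 candidates."""
    table = {}
    for msb7 in range(1 << 7):
        frac_w = msb7 << 16            # 7 most significant bits, the rest set to 0
        fraction, _ = multiply_mantissas(frac_x, frac_w)
        table[msb7] = fraction >> 16
    return table


# 1.5 * 1.5 = 2.25 = 1.125 * 2^1: fraction 0.125 and an exponent increment.
print(multiply_mantissas(mantissa_of(1.5), mantissa_of(1.5)))
print(len(top_bits_hypotheses(mantissa_of(0.75))))
```

The Hamming weight of each predicted value would then be correlated with the traces, as in the byte-wise attack described earlier.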

In Figure 19, we show the result of the correlation between the HW of the first 7 bits of the mantissa of the weight and the traces. Except for Figure 19(b), the results show that the correct mantissa can be recovered. The most interesting result is shown in Figure 19(b), which at first glance looks like a failure of the attack. Here, the target value of the mantissa is 1100011110…10, while the attack recovers 1100100000…00. Considering the sign and exponent, the attack recovers 0.890625 instead of 0.89, i.e., a precision error at the fourth place after the decimal point. Thus, in both cases, we have shown that we can recover the weights from the SCA leakage.
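
The arithmetic behind the 0.89 example can be checked with a few lines of Python. The sketch assumes the sign and exponent are already known (0.89 = 1.78 × 2^-1, so the biased exponent is 126) and enumerates the 128 values that a 7-bit mantissa can represent. The helper name is hypothetical.

```python
def weight_from_fields(sign, biased_exponent, mantissa_msb7):
    """Value of a float32 whose mantissa keeps only its 7 most significant bits."""
    significand = 1 + mantissa_msb7 / 2 ** 7
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 127)


# All 2^7 weights representable with sign 0 and biased exponent 126.
candidates = {m: weight_from_fields(0, 126, m) for m in range(2 ** 7)}
closest = min(candidates, key=lambda m: abs(candidates[m] - 0.89))
print(format(closest, "07b"), candidates[closest])   # prints: 1100100 0.890625
```

The closest 7-bit candidate is 1100100, not the truncated prefix 1100011, which matches the value reported for Figure 19(b).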

Fig. 22: Recovery of the weight: (a) first byte recovery (sign and 7-bit exponent), (b) second byte recovery (lsb exponent and mantissa).

Lastly, in Figure 22, we show the composite recovery of 2 bytes of the weight representation i.e., a low precision setting where we recover sign, exponent and most significant part of mantissa. Again, the targeted (correct) weight can be easily distinguished from the other candidates. Hence, once all the necessary information has been recovered, the weight can be reconstructed accordingly.

III-E Reverse Engineering the Number of Neurons and Layers

After the recovery of the weights and the activation functions, we use SCA in this step to determine the structure of the network. Mainly, we are interested in whether we can recover the number of hidden layers and the number of neurons in each layer. To perform the reverse engineering of the network structure, we first use SPA. SPA is the simplest form of SCA, which allows information recovery from a single trace (or a few traces) with methods as simple as visual inspection. The analysis is performed on three networks with different layouts.

Fig. 26: SPA on hidden layers: (a) one hidden layer with 6 neurons, (b) 2 hidden layers (6 and 5 neurons each), (c) 3 hidden layers (6, 5, 5 neurons each).

The first analyzed network is an MLP with one hidden layer of 6 neurons. The EM trace corresponding to the processing of a randomly chosen input is shown in Figure 26(a). By looking at the EM trace, the number of neurons can be easily counted. The observability arises from the fact that the multiplication operation and the activation function (in this case, the sigmoid function) have completely different leakage signatures. Similarly, the structures of deeper networks are shown in Figure 26(b) and Figure 26(c). The recovery of the output layer then provides information on the number of output classes. However, distinguishing different layers might be difficult, since the leakage pattern depends only on the multiplication and the activation function, which are usually present in most of the layers. We observe minor features that allow identification of layer boundaries, but only with low confidence. Hence, we develop a different approach based on CPA to identify layer boundaries.

The experiments follow similar methodology as in the previous experiments. To determine if the targeted neuron is in the same layer as previously attacked neurons, or in the next layer, we perform a weight recovery using two sets of data.

Let us assume that we are targeting the first hidden layer (the same approach can be applied to other layers as well). Assume that the input is a vector of a given length. For the targeted neuron in the hidden layer, perform the weight recovery under 2 different hypotheses. For the first hypothesis, assume that the targeted neuron is in the first hidden layer, and perform the weight recovery individually using each element of the input vector. For the second hypothesis, assume that the targeted neuron is in the next hidden layer (the second hidden layer), and perform the weight recovery individually using the output of each previously recovered neuron. For each hypothesis, record the maximum (absolute) correlation value, and compare both. Since the correlation depends on both inputs to the multiplication operation, the incorrect hypothesis results in a lower correlation value. Thus, this can be used to identify layer boundaries.
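
The two-hypothesis comparison can be sketched on simulated data as follows. The sketch assumes a single, already recovered first-layer neuron with a sigmoid activation, a target multiplication that actually belongs to the second layer, a coarse weight grid, and a Hamming-weight leakage of the 32-bit product with Gaussian noise. All names and parameter values are illustrative assumptions.

```python
import math
import random
import struct


def hw_float32(value):
    """Hamming weight of the 32-bit IEEE 754 pattern of value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bin(bits).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def max_correlation(operand_columns, traces, weight_grid):
    """Largest |correlation| over all operand columns and weight candidates."""
    best = 0.0
    for column in operand_columns:
        for w in weight_grid:
            predicted = [hw_float32(v * w) for v in column]
            best = max(best, abs(pearson(predicted, traces)))
    return best


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(7)
n_traces, n_inputs = 300, 3
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
          for _ in range(n_traces)]

# Output of a first-layer neuron whose weights were recovered earlier.
w_first = [1.3, -0.7, 2.1]
y_first = [sigmoid(sum(x * w for x, w in zip(row, w_first))) for row in inputs]

# The targeted multiplication really sits in the second layer: y_first * 1.7.
traces = [hw_float32(y * 1.7) + rng.gauss(0.0, 0.5) for y in y_first]

weight_grid = [i / 10 for i in range(1, 31)]            # 0.1, 0.2, ..., 3.0
input_columns = [[row[i] for row in inputs] for i in range(n_inputs)]
rho_same_layer = max_correlation(input_columns, traces, weight_grid)
rho_next_layer = max_correlation([y_first], traces, weight_grid)
print("next layer" if rho_next_layer > rho_same_layer else "same layer")
```

The hypothesis that feeds the correct operand into the multiplication reaches a clearly higher maximum correlation, which is the criterion used to place the layer boundary.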

Fig. 27: Methodology to reverse engineer the target neural network

III-F Recovery of the Full Network Layout

The combination of the previously developed individual techniques can thereafter result in a full reverse engineering of the network. The full network recovery is performed layer by layer, and for each layer, the weights for each neuron have to be recovered one at a time. Let us consider a network consisting of several layers, the first being the input layer and the last being the output layer. The reverse engineering is performed with the following steps:

  1. The first step is to recover the weight of each connection between the input layer and the first hidden layer. Since the dimension of the input layer is known, the CPA can be performed once per input element (i.e., as many times as the size of the input layer). The correlation is computed for every hypothesis on the bits of the weight (the IEEE 754 representation normally has 32 bits, but to simplify, 16 bits can be used, with lower precision for the mantissa). After the weights have been recovered, the output of the sum of multiplications can be calculated. This information provides us with the input to the activation function.

  2. In order to determine the output of the sum of the multiplications, the number of neurons in the layer must be known. This can be recovered by the combination of the SPA and DPA techniques described in the previous subsection (two CPA runs for each weight candidate, so the total number of CPA runs doubles), in parallel with the weight recovery. When all the weights of the first hidden layer are recovered, the following steps are executed.

  3. Using the same set of traces, timing patterns for different inputs to the activation function can be built, similar to Figure 12. The timing patterns or the average timing can then be compared with the profile of each function to determine the activation function (the comparison can be based on simple statistical tools like correlation, a distance metric, etc.). Afterward, the output of the activation function can be computed, which provides the input to the next layer.

  4. The same steps are repeated in the subsequent layers until the structure of the full network is recovered.

The whole procedure is depicted in Figure 27. In general, it can be seen that the attack scales linearly with the size of the network. Moreover, the same set of traces can be reused for various steps of the attack and attacking different layers, thus reducing measurement effort.

IV Single Trace Input Recovery Attack on MLP

Fig. 28: Illustration of a recovery of multiple measurements from a single measurement processing several elementary operations sequentially
Fig. 31: Input recovery attack on the initial layer: (a) first byte recovery (sign and 7-bit exponent), (b) second byte recovery (lsb exponent and mantissa).

In the previous section, the methodology to reverse engineer a neural network has been described and practically demonstrated. In this section, we consider an alternate scenario, where an unknown or secret input is fed to a known network. By known network, we mean that the architecture and weights are either public or known to the adversary (e.g., recovered by reverse engineering). Generally, it can be extremely complex to recover the input by observing outputs from a known network. It involves several classifications in order to solve a system of equations, while some of the functions might not be invertible, e.g., ReLU. When considering theoretical attacks, the system of equations can soon become unmanageable as the architecture of the network becomes complex.

The proposed attack targets the multiplication operation in the first hidden layer. It is exactly the opposite of the previous weight recovery attack, as the weights are known while the input is unknown. However, there is a strong limitation with this attack. As the input changes from one measurement to another, information learned from one measurement cannot be used with another measurement, preventing any statistical analysis across measurements. In this case, the adversary is forced to exploit all the information from a single measurement. Thus, to exploit information within a single measurement, we use HPA. The weights in the first hidden layer are all multiplied with the same input, one after the other. Drawing an analogy with SCA on cryptography, several known plaintexts (weights in the case of MLP) are processed for a single unknown key (the input here). The only difference is that all the processing is done in different parts of a single trace. An input recovery attack was proposed in [20], which requires multiple traces targeting a line buffer, which is an optimization-oriented design choice. In contrast, our proposed attack targets the generic multiplication in a single-trace setting.

We measured the EM trace to perform an input recovery attack. The multiplications in the first hidden layer, corresponding to different weights (or neurons), were isolated. An illustrative example is shown in Figure 28, where traces corresponding to 4 weights are extracted from a single trace. Thus, a single trace is cut into smaller traces, each one corresponding to one multiplication with an associated weight. Next, the value of the input is statistically inferred by applying a standard DPA on the smaller traces. The results are shown in Figure 31 for different bytes of the same input. The black curve shows the correlation of the correct input, while all wrong inputs are represented in red. The attack needs 20 or more multiplications to reliably recover the input. This means that in the current setting, the proposed attack works very well on medium to large sized networks, with at least 40 neurons in the first hidden layer (which is no issue in modern architectures used today).
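
The horizontal attack can be sketched on simulated data in Python. The sketch assumes 40 known first-layer weights, a single leakage value per isolated multiplication (the Hamming weight of the 32-bit product plus Gaussian noise), and input candidates on a 0.01 grid in the range -1 to 1. These choices and all names are assumptions made for illustration.

```python
import math
import random
import struct


def hw_float32(value):
    """Hamming weight of the 32-bit IEEE 754 pattern of value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bin(bits).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def recover_input(weights, segment_leakage, steps_per_unit=100):
    """Correlate predictions with the segments of ONE trace.

    For every input candidate in [-1, 1], the HW of candidate * weight is
    predicted for each known weight and compared with the leakage of the
    corresponding segment. Returns (best candidate, its correlation).
    """
    best_candidate, best_rho = None, -2.0
    for i in range(-steps_per_unit, steps_per_unit + 1):
        candidate = i / steps_per_unit
        predicted = [hw_float32(candidate * w) for w in weights]
        rho = pearson(predicted, segment_leakage)
        if rho > best_rho:
            best_candidate, best_rho = candidate, rho
    return best_candidate, best_rho


rng = random.Random(3)
weights = [rng.uniform(-2.0, 2.0) for _ in range(40)]   # known weights
secret_input = 0.37
segments = [hw_float32(secret_input * w) + rng.gauss(0.0, 0.25) for w in weights]
recovered, rho = recover_input(weights, segments)
print(recovered, round(rho, 3))
```

Here the known weights play the role of the varying known data, so the number of neurons in the first layer bounds the number of samples available to the correlation.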

V Experimental Validation on ARM Cortex-M3

A methodology to reverse engineer sensitive parameters of a neural network and to recover its inputs was proposed in the previous sections. The attack was practically validated on an 8-bit AVR (Atmel ATmega328P). In this section, we extend the presented attack to a 32-bit ARM microcontroller. ARM microcontrollers form a fair share of the current market, with huge dominance in mobile applications, and are also seeing rapid adoption in markets like IoT, automotive, and virtual and augmented reality.

Our target platform is the widely available Arduino Due development board, which contains an Atmel SAM3X8E ARM Cortex-M3 CPU with a 3-stage pipeline, operating at 84 MHz. The measurement setup is similar to that of the previous experiments (LeCroy WaveRunner 610zi oscilloscope and RF-U 5-2 near-field EM probe from Langer). The point of measurement was determined using benchmarking code running AES encryption. After capturing the measurements for the target neural network, one can perform the reverse engineering.

Fig. 35: Timing behavior for different activation functions: (a) ReLU, (b) sigmoid, (c) tanh.

The timing behavior of the various activation functions is shown in Figure 35. The results, though different from the previous experiments on AVR, have unique timing signatures, allowing identification of each activation function. The activity of a single neuron, which uses sigmoid as its activation function, is shown in Figure 36 (the multiplication and the activation are separated by a vertical red line).

A known input attack is mounted on the multiplication to recover the secret weight. One practical consideration in attacking multiplication is that different compilers will compile it differently for different targets. Modern microcontrollers also have dedicated floating-point units for handling operations like multiplication of real numbers. To avoid depending on how the multiplication itself is implemented, we target the output of the multiplication. In other words, we target the point when the multiplication with the secret weight is completed and the resulting product is written to general-purpose registers or memory. Figure 37 shows the success of the attack in recovering the secret weight with known input. As stated before, side-channel measurements on a modern 32-bit ARM Cortex-M3 may have lower SNR, thus making the attack slightly harder. Nevertheless, the attack is shown to be practical even on ARM, given more measurements. In our setup, getting the extra measurements takes less than a minute. Similarly, the setup and number of measurements can be adapted for other targets like FPGA, GPU, etc.

Fig. 36: Observing pattern and timing of multiplication and activation function
Fig. 37: Correlation comparison between correct and incorrect mantissa for weight=

Finally, the full network layout is recovered. The activity of a full network with 3 hidden layers composed of 6, 5, and 5 neurons each is shown in Figure 38. All the neurons are observable by visual inspection. The layer boundaries (shown by solid red lines) can be determined by attacking the multiplication operation and following the approach discussed in Section III-F.

Fig. 38: SPA on hidden layers with 3 hidden layers (6,5,5 neurons each)

VI Mitigations

As demonstrated above, various side-channel attacks can be applied to reverse engineer certain components of a pre-trained network. To mitigate such a recovery, several countermeasures can be deployed:

  1. Hidden layers of an MLP must be executed in sequence, but the multiplication operations in individual neurons within a layer can be executed independently. This makes shuffling [43], a well-studied side-channel countermeasure, applicable. It involves shuffling/permuting the order of execution of independent sub-operations. For example, given a sequence of independent sub-operations and a random permutation, the sub-operations are executed in the permuted order instead of the original one. In this case, we propose to shuffle the order of multiplications of individual neurons within a hidden layer during every classification step. Shuffling modifies the time window of operations from one execution to another, mitigating a classical DPA attack.

  2. Protection against weight recovery, as well as against the single-trace input recovery, can benefit from the application of masking countermeasures [44, 42]. Masking is another widely studied side-channel countermeasure that is even accompanied by a formal proof of security. It involves mixing sensitive computations with random numbers to remove the dependencies between the actual data and the side-channel signature, thus preventing the attack. Every operation on sensitive inputs is transformed into a masked operation on masked inputs, where the masks are drawn uniformly at random and the masked function applies a mask at the output of the original function. If each neuron is individually masked with an independently drawn, uniformly random mask for every iteration, the proposed attacks can be prevented. However, this might result in a substantial performance penalty.

  3. The proposed attack on activation functions is possible due to their non-constant timing behavior. Most of the considered activation functions perform an exponentiation operation. The implementation of constant-time exponentiation has been widely studied in the domain of public key cryptography [45]. These well-studied ideas can be adapted to implement constant-time activation function processing.

Clearly, all those countermeasures come with an area and performance cost. In particular, shuffling and masking require a true random number generator, which is typically very expensive in terms of area and performance. Similarly, constant-time implementations of exponentiation [46] also come with a performance penalty. Thus, the optimal choice of protection mechanism should be made after a systematic resource and performance evaluation study.
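
The shuffling countermeasure from the first item can be sketched as follows. The sketch permutes the order in which the neurons of one layer are evaluated, which leaves the result unchanged. It assumes a sigmoid activation, and `random.SystemRandom` stands in for the true random number generator a device would provide. The function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weight_rows):
    """Reference evaluation: neurons are processed in their natural order."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, row)))
            for row in weight_rows]


def layer_forward_shuffled(inputs, weight_rows, rng):
    """Same outputs, but the neurons are evaluated in a fresh random order."""
    order = list(range(len(weight_rows)))
    rng.shuffle(order)                       # new permutation on every call
    outputs = [0.0] * len(weight_rows)
    for j in order:
        outputs[j] = sigmoid(sum(x * w for x, w in zip(inputs, weight_rows[j])))
    return outputs


x = [0.2, -0.5, 0.9]
W = [[1.0, 2.0, 3.0], [0.5, -0.5, 0.25], [-1.0, 0.0, 1.0], [2.0, 2.0, 2.0]]
rng = random.SystemRandom()                  # OS-provided randomness
print(layer_forward_shuffled(x, W, rng) == layer_forward(x, W))
```

Only the neuron order is permuted here. Permuting the multiplications inside a neuron as well would also change the floating-point summation order, so the outputs could differ in the last bits.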

VII Further Discussions and Conclusions

Neural networks are a widely used family of machine learning algorithms due to their versatility across domains. Their effectiveness depends on the chosen architecture and fine-tuned parameters along with the trained weights, which can be proprietary information. In this work, we practically demonstrate reverse engineering of a neural network using side-channel analysis techniques. Practical attacks are performed on measured data corresponding to chosen networks. To make our setting more general, we do not assume any specific form of the input data (except that inputs are real values).

We conclude that using an appropriate combination of SPA and DPA techniques, all sensitive parameters of the network can be recovered. Moreover, a powerful HPA method is used to recover secret inputs from a known network in a single-shot side-channel analysis. The proposed methodology is practically demonstrated on two different microcontrollers, a classic 8-bit AVR and a modern 32-bit ARM Cortex-M3. As shown, the attack on modern devices is slightly harder to mount due to the lower SNR of the side-channel measurements, but it is still practical. In the presented experiments, the attack required extra measurements and correspondingly more measurement time. Overall, the attack methodology scales linearly with the size of the network.

Multilayer perceptron architectures are widely used but arguably not the most common choice in state-of-the-art applications. Modern deep learning techniques like convolutional neural networks or recurrent neural networks recently took over and, judging by the results, they will remain the preferred methods in the coming years. Yet, even those networks use the same activation functions we consider here, as well as fully connected layers (the difference is that they also have other types of layers). Since we are able to differentiate between layers of the same type in an architecture, we expect the difference to be even more pronounced when comparing with other layer types.

When considering the weight vectors, we consider here the case where each node has a separate weight. Convolutional neural networks can also share those weights to lower the degree of the problem. The same technique we use here to obtain the independent weights can be used to obtain the shared weights (at worst, with multiple unnecessary calculations for those shared weights).

The proposed attacks are both generic in nature and more powerful than the two previous works in this direction. Finally, suggestions on countermeasures are provided to help designers mitigate such threats. However, the proposed countermeasures are borrowed mainly from the side-channel literature and can incur huge overheads. Nevertheless, we believe that they could motivate further research on optimized and effective countermeasures for neural networks. Besides continuing to work on countermeasures, we see the exploration of other types of layers, like convolution layers or max pooling layers, as the main future research goal.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12.   USA: Curran Associates Inc., 2012, pp. 1097–1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257
  • [2] J. Kober and J. Peters, Reinforcement Learning in Robotics: A Survey.   Berlin, Germany: Springer, 2012, vol. 12, pp. 579–610.
  • [3] P. Teufl, U. Payer, and G. Lackner, “From nlp (natural language processing) to mlp (machine language processing),” in Computer Network Security, I. Kotenko and V. Skormin, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 256–269.
  • [4] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song, “Neural network-based graph embedding for cross-platform binary code similarity detection,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’17.   New York, NY, USA: ACM, 2017, pp. 363–376. [Online]. Available: http://doi.acm.org/10.1145/3133956.3134018
  • [5] M. Kučera, P. Tsankov, T. Gehr, M. Guarnieri, and M. Vechev, “Synthesis of probabilistic privacy enforcement,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’17.   New York, NY, USA: ACM, 2017, pp. 391–408. [Online]. Available: http://doi.acm.org/10.1145/3133956.3134079
  • [6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
  • [7] Riscure, “https://www.riscure.com/blog/automated-neural-network-construction-genetic-algorithm/,” 2018. [Online]. Available: www.riscure.com
  • [8] N. Dowlin, R. Gilad-Bachrach, K. Laine, K. Lauter, M. Naehrig, and J. Wernsing, “Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy,” in Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ser. ICML’16.   JMLR.org, 2016, pp. 201–210. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045390.3045413
  • [9] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart., “Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing,” in USENIX Security, 2014, pp. 17–32.
  • [10] G. Ateniese, L. V. Mancini, A. Spognardi, A. Villani, D. Vitali, and G. Felici, “Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers,” Int. J. Secur. Netw., vol. 10, no. 3, pp. 137–150, Sep. 2015. [Online]. Available: http://dx.doi.org/10.1504/IJSN.2015.071829
  • [11] A. Khan, G. Goodhue, P. Shrivastava, B. Van Der Veer, R. Varney, and P. Nagaraj, “Embedded memory protection,” Nov. 22 2011, US Patent 8,065,512.
  • [12] L. Lerman, R. Poussier, G. Bontempi, O. Markowitch, and F.-X. Standaert, “Template attacks vs. machine learning revisited (and the curse of dimensionality in side-channel analysis),” in International Workshop on Constructive Side-Channel Analysis and Secure Design.   Springer, 2015, pp. 20–33.
  • [13] D. Jap, M. Stöttinger, and S. Bhasin, “Support vector regression: exploiting machine learning techniques for leakage modeling,” in Proceedings of the Fourth Workshop on Hardware and Architectural Support for Security and Privacy.   ACM, 2015, p. 2.
  • [14] S. Picek, A. Heuser, A. Jovic, S. A. Ludwig, S. Guilley, D. Jakobovic, and N. Mentens, “Side-channel analysis and machine learning: A practical perspective,” in Neural Networks (IJCNN), 2017 International Joint Conference on.   IEEE, 2017, pp. 4095–4102.
  • [15] H. Maghrebi, T. Portigliatti, and E. Prouff, “Breaking cryptographic implementations using deep learning techniques,” in International Conference on Security, Privacy, and Applied Cryptography Engineering.   Springer, 2016, pp. 3–26.
  • [16] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in 2017 IEEE Symposium on Security and Privacy (SP), May 2017, pp. 3–18.
  • [17] C. Song, T. Ristenpart, and V. Shmatikov, “Machine learning models that remember too much,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’17.   New York, NY, USA: ACM, 2017, pp. 587–601. [Online]. Available: http://doi.acm.org/10.1145/3133956.3134077
  • [18] W. Hua, Z. Zhang, and G. E. Suh, “Reverse engineering convolutional neural networks through side-channel information leaks,” 2018, preprint.
  • [19] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), June 2017, pp. 27–40.
  • [20] L. Wei, Y. Liu, B. Luo, Y. Li, and Q. Xu, “I know what you see: Power side-channel attack on convolutional neural network accelerators,” CoRR, vol. abs/1803.05847, 2018. [Online]. Available: http://arxiv.org/abs/1803.05847
  • [21] O. Ohrimenko, M. Costa, C. Fournet, C. Gkantsidis, M. Kohlweiss, and D. Sharma, “Observing and preventing leakage in mapreduce,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’15.   New York, NY, USA: ACM, 2015, pp. 1570–1581. [Online]. Available: http://doi.acm.org/10.1145/2810103.2813695
  • [22] Y. Xu, W. Cui, and M. Peinado, “Controlled-channel attacks: Deterministic side channels for untrusted operating systems,” in Proceedings of the 2015 IEEE Symposium on Security and Privacy, ser. SP ’15.   Washington, DC, USA: IEEE Computer Society, 2015, pp. 640–656. [Online]. Available: https://doi.org/10.1109/SP.2015.45
  • [23] O. Ohrimenko, F. Schuster, C. Fournet, A. Mehta, S. Nowozin, K. Vaswani, and M. Costa, “Oblivious multi-party machine learning on trusted processors,” in Proceedings of the 25th USENIX Conference on Security Symposium, ser. SEC’16.   Berkeley, CA, USA: USENIX Association, 2016, pp. 619–636. [Online]. Available: http://dl.acm.org/citation.cfm?id=3241094.3241143
  • [24] T. M. Mitchell, Machine Learning, 1st ed.   New York, NY, USA: McGraw-Hill, Inc., 1997.
  • [25] R. Collobert and S. Bengio, “Links Between Perceptrons, MLPs and SVMs,” in Proceedings of the Twenty-first International Conference on Machine Learning, ser. ICML ’04.   New York, NY, USA: ACM, 2004, pp. 23–. [Online]. Available: http://doi.acm.org/10.1145/1015330.1015415
  • [26] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed.   Upper Saddle River, NJ, USA: Prentice Hall PTR, 1998.
  • [27] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics).   Berlin, Heidelberg: Springer-Verlag, 2006.
  • [28] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning, ser. ICML’10.   USA: Omnipress, 2010, pp. 807–814. [Online]. Available: http://dl.acm.org/citation.cfm?id=3104322.3104425
  • [29] S. Mangard, E. Oswald, and T. Popp, Power Analysis Attacks: Revealing the Secrets of Smart Cards.   Springer, December 2006, ISBN 0-387-30857-1, http://www.dpabook.org/.
  • [30] P. C. Kocher, “Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems,” in Annual International Cryptology Conference.   Springer, 1996, pp. 104–113.
  • [31] P. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” in Annual International Cryptology Conference.   Springer, 1999, pp. 388–397.
  • [32] M. A. Al Faruque, S. R. Chhetri, A. Canedo, and J. Wan, “Acoustic side-channel attacks on additive manufacturing systems,” in Proceedings of the 7th International Conference on Cyber-Physical Systems.   IEEE Press, 2016, p. 19.
  • [33] S. Mangard, “A simple power-analysis (SPA) attack on implementations of the AES key expansion,” in International Conference on Information Security and Cryptology.   Springer, 2002, pp. 343–358.
  • [34] S. Bhasin, S. Guilley, A. Heuser, and J.-L. Danger, “From cryptography to hardware: analyzing and protecting embedded Xilinx BRAM for cryptographic applications,” Journal of Cryptographic Engineering, vol. 3, no. 4, pp. 213–225, 2013.
  • [35] C. Luo, Y. Fei, P. Luo, S. Mukherjee, and D. Kaeli, “Side-channel power analysis of a GPU AES implementation,” in Computer Design (ICCD), 2015 33rd IEEE International Conference on.   IEEE, 2015, pp. 281–288.
  • [36] E. Brier, C. Clavier, and F. Olivier, “Correlation power analysis with a leakage model,” in International Workshop on Cryptographic Hardware and Embedded Systems.   Springer, 2004, pp. 16–29.
  • [37] C. Clavier, B. Feix, G. Gagnerot, M. Roussellet, and V. Verneuil, “Horizontal correlation analysis on exponentiation,” in International Conference on Information and Communications Security.   Springer, 2010, pp. 46–61.
  • [38] A. Heuser, S. Picek, S. Guilley, and N. Mentens, “Lightweight ciphers and their side-channel resilience,” IEEE Transactions on Computers, pp. 1–1, 2017.
  • [39] R. Gilmore, N. Hanley, and M. O’Neill, “Neural network based attack on a masked implementation of AES,” in 2015 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), May 2015, pp. 106–111.
  • [40] P. Naraei, A. Abhari, and A. Sadeghian, “Application of multilayer perceptron neural networks and support vector machines in classification of healthcare data,” in 2016 Future Technologies Conference (FTC), Dec 2016, pp. 848–852.
  • [41] P. Thomas and M.-C. Suhner, “A new multilayer perceptron pruning algorithm for classification and regression applications,” Neural Processing Letters, vol. 42, no. 2, pp. 437–458, Oct 2015. [Online]. Available: https://doi.org/10.1007/s11063-014-9366-5
  • [42] E. Prouff and M. Rivain, “Masking against side-channel attacks: A formal security proof,” in Annual International Conference on the Theory and Applications of Cryptographic Techniques.   Springer, 2013, pp. 142–159.
  • [43] N. Veyrat-Charvillon, M. Medwed, S. Kerckhof, and F.-X. Standaert, “Shuffling against side-channel attacks: A comprehensive study with cautionary note,” in International Conference on the Theory and Application of Cryptology and Information Security.   Springer, 2012, pp. 740–757.
  • [44] J.-S. Coron and L. Goubin, “On Boolean and arithmetic masking against differential power analysis,” in International Workshop on Cryptographic Hardware and Embedded Systems.   Springer, 2000, pp. 231–237.
  • [45] G. Hachez and J.-J. Quisquater, “Montgomery exponentiation with no final subtractions: Improved results,” in International Workshop on Cryptographic Hardware and Embedded Systems.   Springer, 2000, pp. 293–301.
  • [46] A. Al Hasib and A. A. M. M. Haque, “A comparative study of the performance and security issues of AES and RSA cryptography,” in Convergence and Hybrid Information Technology, 2008. ICCIT’08. Third International Conference on, vol. 2.   IEEE, 2008, pp. 505–510.