I Introduction
Deep neural networks (DNNs) are widely used in many safety- and security-sensitive artificial intelligence (AI) applications such as biometric authentication, autonomous driving, and financial fraud analysis. Consequently, their security is a serious concern and requires urgent research attention.
Recent research has shown that the security and privacy of DNN models can be compromised. In [1], attackers deceive the DNN system into making misclassifications by adding small perturbations to the original images. In [2], malicious parties are able to recover private images that are used to train a face recognition system by analyzing the outputs of the DNN model. However, the success of these attacks requires full knowledge of the DNN model, which is not usually available in real-world applications, as model parameters are the most valuable assets for deep learning tasks and are always kept confidential. In addition, a variety of privacy-leakage prevention techniques [3, 4, 5] have emerged to mitigate these attacks.
As illustrated in Fig. 1, apart from the privacy leakage from DNN models, information about real-world inputs can also be leaked at the inference stage. In privacy-sensitive systems, direct access to the input data of the inference engine is often strictly limited. For instance, encrypted medical images are provided to the DNN inference engine and are decrypted inside the inference engine for privacy protection. Under such circumstances, we are concerned with whether attackers can use side-channel information (e.g., power consumption) to retrieve private data.
Dedicated hardware DNN accelerators are expected to gain mainstream adoption in the foreseeable future due to their high computation efficiency [7]. In this paper, we present a power side-channel attack on an FPGA-based convolutional neural network (CNN) accelerator. The accelerator is used to perform image classification, and we aim to recover its input image from collected power traces. We set the attack target on the hardware component executing the convolution in the first layer of the CNN, whose computation is directly related to the input image.
To the best of our knowledge, the proposed side channel attack is the first one that exploits the privacy leakage in neural accelerators. Particularly, unlike previous attacks, our proposed attack does not assume the prerequisite knowledge of model parameters and model outputs. The main contributions of our work include:

Power side channel is often quite noisy and the collected power trace contains distortions brought by various circuit components. We present a novel power extraction technique to precisely recover the power consumption for each clock cycle.

We develop novel algorithms to retrieve each pixel value of the input image. To be specific, as the convolution operation only relates to a limited number of pixels, we develop algorithms to infer the values of pixels either from power traces directly or from a prebuilt powerpixel template. Finally, the image can be reconstructed by piecing all inferred pixels together.
The remainder of this paper is organized as follows. Section II introduces the background knowledge, and the threat model follows in Section III. Next, we give an overview of the proposed attack flow for two attack scenarios in Section IV. We introduce how to accurately estimate the power from the noisy power side channel in Section V. The details of the two attack scenarios are then introduced in Section VI and Section VII, respectively. We discuss limitations and countermeasures in Section VIII. Finally, we introduce related work in Section IX and conclude this work in Section X.
II Background
In this section, we first review the concept of the convolutional neural network (CNN), then introduce the architecture of typical CNN accelerators, and finally discuss the basics of power side-channel leakage.
II-A Convolutional Neural Network
A convolutional neural network (CNN) [6] is a neural network architecture used for image applications. It is constructed as a pipeline of sequentially connected layers and may consist of four types of computation: convolution, pooling, normalization, and fully connected. The structure of the network, such as the total number of layers and the type of computation in each layer, is determined by designers prior to the training stage. Then the parameters in each layer, namely the weights, are acquired through dedicated training algorithms. In the inference stage, with the structure and weights ascertained, the CNN can make predictions on input images. In particular, the input of the first layer of a CNN is the image itself, and the computation in the first layer is usually convolution.
As our focus is on the convolution layer of the CNN, we briefly introduce its details here and illustrate the calculation in Fig. 2 (a). The input to the convolution layer is an image of size $X \times Y \times M$ (width, height, and number of channels), and we call each 2D pixel array a feature map. For each input feature map, to calculate the pixel values of the output feature map, a kernel (or filter) of size $K_x \times K_y$ is applied to construct a convolutional window for each input pixel, capturing its neighbors. We then get an output feature map by sliding the convolutional window in steps of $S_x$ and $S_y$ along the two directions of the input feature map. We can represent the convolution operation formally with the following formula:
$$O^{j}_{x,y} = f\left(\sum_{i=1}^{M}\left(\sum_{u=0}^{K_x-1}\sum_{v=0}^{K_y-1} \omega^{i,j}_{u,v}\, I^{i}_{xS_x+u,\;yS_y+v} + b^{i,j}\right)\right) \qquad (1)$$
where $O^{j}_{x,y}$ is the pixel value at position $(x, y)$ in the $j$-th output feature map, $I^{i}$ is the $i$-th input feature map, $\omega^{i,j}$ and $b^{i,j}$ are the kernel and bias value between the $i$-th input feature map and the $j$-th output feature map, respectively, and $f$ is a nonlinear activation function.
II-B CNN Accelerator Design
An accelerator is usually used in the inference stage to boost computational efficiency on a number of low-power platforms. The accelerators are usually implemented in dedicated hardware, such as FPGAs and ASICs, and a number of designs are available [9, conti2015ultra, 10] in both academia and industry. The general architecture of these accelerators is similar, as shown in Fig. 2 (b), wherein typically five components are involved: DMA, controller, data buffers, parameter buffers, and compute units. DMA is used for data transmission with the main processors, while the controller is responsible for coordinating computation tasks among components. The parameter buffers store the weights used in the CNN model and must be ready prior to any inference operation. The data buffers store the input feature maps for every layer and cache the output feature maps from the compute units to be used in the next layer. Compute units contain dedicated hardware to accelerate different operations in the neural network, e.g., convolution and pooling.
Specifically, as the target of our attack is the convolutional layer in the CNN, we present the detailed design for the convolution operation in the compute unit. The line buffer [11] is an efficient hardware structure for implementing convolution and has been adopted by a number of CNN accelerators [9, 10]. Fig. 2 (c) shows the structure of a line buffer that executes 2D convolution with a 3×3 kernel. There are three line registers, as we need to cache the pixel values in the three most recent rows of the image to compute the convolution with a kernel of size 3×3. The length of each line is equal to the row size of the input image. The convolution is carried out by a set of dedicated hardware multipliers and adders. At each cycle, one pixel is put into the line buffer and one convolution is computed. The intermediate result is passed through a nonlinear function to generate one output value per cycle. If the input image contains several channels (e.g., three channels for an RGB image), multiple instances of the line buffer are synthesized for parallel processing. When all input pixels have been processed by the line buffer, one output feature map is finished and stored in the data buffer for processing in the next round. This procedure is repeated with different kernels until all output feature maps are generated. As we can see from the operation of the line buffer, at each cycle the output depends only on a limited number of input pixels (those inside the convolution window), which serves as the foundation for efficiently launching our proposed attack.
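The streaming behaviour of the line buffer can be illustrated with a short software sketch. It is only a behavioural model under stated assumptions, not the RTL of [8]: the kernel is square, the stride is 1, bias and activation are omitted, and the three line registers are flattened into one shift register. The name `line_buffer_conv` is hypothetical.

```python
from collections import deque

def line_buffer_conv(image, kernel):
    """Stream an image through a line buffer, one pixel per cycle.

    Returns the raw convolution sums (no bias, no activation) in the order
    they would leave the compute unit.
    """
    k = len(kernel)            # square k x k kernel (assumption)
    width = len(image[0])      # line size = row length of the input image
    # The buffer keeps the newest (k - 1) full rows plus k pixels, which is
    # exactly the span covered by one k x k convolution window.
    buf = deque(maxlen=(k - 1) * width + k)
    outputs = []
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            buf.append(pixel)                  # one pixel enters per cycle
            if r >= k - 1 and c >= k - 1:      # window lies fully inside the image
                acc = 0
                for a in range(k):
                    for b in range(k):
                        acc += kernel[a][b] * buf[a * width + b]
                outputs.append(acc)
    return outputs

if __name__ == "__main__":
    image = [[4 * r + c for c in range(4)] for r in range(4)]   # toy 4 x 4 input
    center_tap = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    print(line_buffer_conv(image, center_tap))
```

Each output depends only on the k×k pixels currently inside the buffer window, which is the locality property the attack relies on.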
In this paper, we follow the implementation proposed by Zhao et al. [8], who implement an accelerator for a compressed version of CNN [12] on FPGA. The convolution unit in their proposed architecture is based on the line buffer. In their neural accelerator, the parameters and activations inside the network model are limited to either +1 or -1, so that the weights of the compressed network can be stored entirely inside the RAM of the FPGA.
II-C Basics on Power Side Channel
Power Constitution and Measurement: The power consumption of circuits can be divided into two categories: static and dynamic. Static power consumption arises from the leakage current of transistors and is typically very low. Dynamic power consumption comes from the internal transitions of transistors, which are closely related to the input data, and it usually dominates the total power consumption. To measure the power consumption, a 1 Ω resistor is placed on the power supply line and the voltage drop across it is measured using a high-resolution oscilloscope.





Configuration | Convolution unit | Total
IC: 1, LS: 28, KS: 3 x 3 | 0.57 mW | 0.67 mW
IC: 1, LS: 32, KS: 3 x 3 | 0.64 mW | 0.79 mW
IC: 1, LS: 28, KS: 5 x 5 | 1.25 mW | 1.51 mW
IC: 3, LS: 28, KS: 3 x 3 | 1.78 mW | 2.07 mW
IC = Image Channel, LS = Line Size, KS = Kernel Size
Power Consumption of Line Buffer: As the line buffer is the main attack target, we estimated the power consumption of the convolution unit and the total power consumption with Xilinx XPower Analyzer, a software power emulator for FPGAs. We implemented the line buffer in RTL with various configurations, and the results are shown in Table I, wherein the convolution unit dominates the total power consumption. To be specific, we implemented four common configurations of the line buffer: three of them have a single input channel, with a line size of 28 or 32 and a kernel size of 3×3 or 5×5. The last configuration has three input channels, and its line size and filter size are identical to those in the first row. According to Table I, the power consumption of the convolution unit increases significantly with the kernel size and the number of input channels, because in both cases more pixels are involved in the convolution unit. A change in line size does not affect the power consumed by the convolution unit much. Whatever the configuration, the convolution unit accounts for more than 80% of the total power consumption. Therefore, we can regard the measured power as a coarse-grained estimate of the power of the convolution unit.
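The "more than 80%" claim can be checked directly from the Table I figures. The short script below only redoes that arithmetic; the assignment of the smaller value in each row to the convolution unit and the larger one to the total is inferred from the surrounding text.

```python
# (configuration, convolution-unit power in mW, total power in mW), from Table I
table_i = [
    ("IC 1, LS 28, KS 3x3", 0.57, 0.67),
    ("IC 1, LS 32, KS 3x3", 0.64, 0.79),
    ("IC 1, LS 28, KS 5x5", 1.25, 1.51),
    ("IC 3, LS 28, KS 3x3", 1.78, 2.07),
]

def conv_share(conv_mw, total_mw):
    """Fraction of the total power drawn by the convolution unit."""
    return conv_mw / total_mw

shares = {name: conv_share(conv, total) for name, conv, total in table_i}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The smallest share is about 81% (line size 32) and the largest about 86% (three input channels).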
III Threat Model
Scenario: We consider adversaries who come from the DNN accelerator design team or are insiders in companies hosting DNN infrastructures. At the inference stage, DNN model designers usually deploy their trained models to a machine-learning service operator, such as BigML [13] or Microsoft Azure [14], which possesses dedicated DNN accelerators for the inference operation. They may also put their models on computing platforms (e.g., Qualcomm’s Snapdragon 835 [15]) with DNN-accelerating hardware. In many privacy-sensitive applications, the inputs to the DNN accelerators are often protected with strict access control policies or strong encryption schemes. Thus, it is not easy for attackers to obtain the inputs directly. However, the side channels, especially the power side channel, are exposed unprotected to malicious insiders at the machine-learning service host and in the DNN accelerator design team. They can access the power side-channel output via implanted Trojans or measurement circuits when the accelerator is actually running with real-world users’ inputs.
Capability: First, we assume attackers know the structure of the neural network and the input image size, but not the detailed parameters of the network. To be specific, for the targeted first convolution layer, the attackers need to know its filter size, number of input feature maps, and number of output feature maps. We consider this assumption practical because many image-related tasks adopt existing neural network architectures (e.g., VGG16/19 and ResNet) whose structure is fixed and public. Second, adversaries can acquire high-resolution power traces of the DNN accelerator, either by oscilloscope measurement or through a power-monitoring Trojan. Third, according to their ability to freely launch inference operations, we further divide the adversaries into two categories: passive and active. A passive adversary can only eavesdrop on the power consumption when an input is processed by the DNN accelerator at the inference stage. An active adversary has the extra ability to profile the relationship between power and input pixels by freely launching inference operations with arbitrary inputs on the targeted accelerator. The profiling phase can only be carried out prior to any actual computation on the user’s private data.
IV Overview
The primary goal of this paper is to recover the input image from power traces of the targeted CNN accelerator. We choose the convolution in the first layer as the attack target for two reasons. First, it directly processes the input image, so the power obtained is closely related to the input. Second, the inherent characteristic of convolution, which performs computation on a small group of pixels at a time, reduces the effort needed to infer the pixel values.
To evaluate the proposed attack, we implement a CNN accelerator [8] in a Xilinx Spartan-6 LX75 FPGA chip [16] on the SAKURA-G board [17]. This board is designed for the purpose of evaluating power side-channel attacks. We set up a Tektronix MDO3034 oscilloscope [18], with a sampling frequency of 2.5 GHz, to acquire the power trace from the FPGA board.
We propose separate attack methods for passive and active adversaries. The whole attack flow is illustrated in Fig. 3. In the first step, we collect the power traces of the FPGA when it performs the convolution with different kernels. Then we adopt an extraction algorithm to filter out noise and obtain the real power consumption; the details are given in Section V. After the power extraction stage, passive adversaries try to locate pixels belonging to the image background from the extracted power, which reveals the silhouette of foreground objects. The details of this attack are given in Section VI. Active adversaries, before the actual attack, build a “power template” using the power measured with different kernels and input images. The power template exploits the relation between power consumption and pixel values and can generate a set of pixel candidates when queried with the power consumption in actual attacks. The final step for active adversaries is to recover the image by selecting the best pixel candidate from the generated set. Section VII introduces the power template attack in detail.
We conducted experiments with images from the MNIST dataset [19], a dataset for handwritten digit recognition. We try to recover the image with both background detection and the power template, as shown in Fig. 4. For the two recovery methods, we select correctly classified images with the same input image so that we can compare the quality of the recovered images directly. Images recovered by background detection retain the general shape of the original image, while images recovered with the power template keep more details and are visually more similar to the original image.
V Power Extraction
Ideally, the power collected from the oscilloscope is periodic and its period is the same as that of the clock signal, as the internal activity is triggered by the clock pulse. The power trace in one period should reflect the total power consumed in that cycle. However, this assumption does not hold due to noise and distortion in the power collection procedure. Some of the noise sources can be modeled as a capacitor-based filter system that blends the power consumption of neighboring cycles and thus makes the raw power trace inaccurate for pixel inference. In this section, we present an efficient method to extract the real power consumption from the noisy and distorted power traces.
V-A Interference Sources
We illustrate three critical components on the power measurement path in Fig. 5. Driven by clock pulses, CMOS transistors in the FPGA used for computation become active and draw current from power supply. The current is delivered through the power distribution network which leads to a voltage drop in the power measurement circuit. The voltage drop is then captured by the oscilloscope’s probe placed on the power supply line and recorded as the power trace.
All three components incur certain kinds of interference on the measured power signals. The noise in the measurement of oscilloscope is white noise, which mainly comes from environmental fluctuations.
The adopted FPGA board [20] offers two options for power measurement: we can directly measure either the raw voltage on the resistor or the signal amplified through an amplifier. Using the amplified signal is crucial for the success of our attack, as the raw voltage on the resistor is only several millivolts, which is around the same level as the noise. However, the amplifier circuit is only able to amplify the AC components of the power traces, which results in the voltage dropping below zero at the end of the power traces, as illustrated in Fig. 6 (a). The drop not only induces inaccuracies when we recover the power for each cycle but also hinders correct curve fitting in later steps. We analyze the frequency response of the power measurement circuit with the simulator NI Multisim [21] and find that the whole circuit behaves like a high-pass filter with a cutoff frequency of around 250 Hz.
The exact effect of the FPGA’s power distribution network on power signals is hard to model, as we do not know the design details, but we assume it can be regarded as an RC filter. This is because a power distribution network is often tree-shaped and implemented with metal wires. The distributed wire resistance and the inter-wire capacitance can be regarded as an RC filter in a lumped model.
V-B Extraction Methods
For the noise in the oscilloscope measurement, we use a low-pass filter to eliminate it. For the distortion from the RC filters, though techniques that directly reverse the distortion effect exist, they are very sensitive to small deviations [22] in the original signal. Thus, they are not applicable to power traces from noisy channels. We propose to solve the problem by analyzing the approximate effect of the RC filters with two dedicated methods: DC component restoration, and power alignment with curve fitting.
Low-Pass Filter: For the noise induced in the oscilloscope measurement, a low-pass filter is sufficient: it filters out most high-frequency noise, and the low-frequency noise is small compared to the useful signal. We apply a filter with a cutoff frequency of 60 MHz to the acquired power traces, and the result is shown in Fig. 6 (b).
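As a minimal illustration of this filtering step, the sketch below applies a first-order (single-pole) IIR low-pass filter with a 60 MHz cutoff to a trace sampled at 2.5 GHz. The filter type and order are assumptions made for this example; the text does not state which low-pass design was actually used, and `low_pass` is a hypothetical helper name.

```python
import math

def low_pass(trace, sample_rate_hz, cutoff_hz):
    """First-order (single-pole) IIR low-pass filter."""
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)             # smoothing factor in (0, 1)
    filtered = []
    previous = trace[0] if trace else 0.0
    for sample in trace:
        previous = previous + alpha * (sample - previous)
        filtered.append(previous)
    return filtered

if __name__ == "__main__":
    # Noise at the Nyquist frequency (alternating +1 / -1) is strongly attenuated.
    noisy = [1.0 if n % 2 == 0 else -1.0 for n in range(400)]
    smooth = low_pass(noisy, sample_rate_hz=2.5e9, cutoff_hz=60e6)
    print(abs(smooth[-1]))
```

A higher-order filter would give a sharper roll-off; the single pole is used here only to keep the example short.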
DC Component Restoration: For the distortion induced by the power measurement circuit, we propose to recover the DC component. From the simulation result of NI Multisim, the cutoff frequency of the equivalent high-pass filter (250 Hz) is far lower than the accelerator’s working spectrum (more than 15 kHz, as the total running time is around 70 µs). So only the DC component is filtered out by the power measurement circuit. To recover it, we obtain the discrete-time impulse response of the power measurement circuit via simulation as follows:
$$h[n] = \delta[n] - \frac{T}{\tau}\, e^{-nT/\tau} \qquad (2)$$
wherein $T$ stands for the sampling interval, which is 0.4 ns in our case, and $\tau$ represents the time constant, which is the reciprocal of the angular cutoff frequency $\omega_c$. So we propose to recover the original power trace by reversing the effect of the power measurement circuit, which can be modelled as $P_m = P_r * h$, using the following formula:
$$P_r[n] \approx P_m[n] + \frac{T}{\tau} \sum_{k=0}^{n-1} P_m[k] \qquad (3)$$
wherein $P_m$ represents the power samples collected, while $P_r$ stands for the sample points in the recovered trace.
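The restoration step can be sketched in a few lines, assuming the measurement circuit behaves as a first-order RC high-pass filter with time constant τ = 1/(2π·f_c). Under that assumption, the inverse of the filter is the measured signal plus its running sum scaled by T/τ. The helper `rc_highpass` is a forward model included only so the example can be run end to end; both function names are hypothetical, and the demo uses exaggerated parameters (1 ms sampling, 1 Hz cutoff) so that the droop is visible in a short trace.

```python
import math

def rc_highpass(signal, sample_interval, cutoff_hz):
    """Forward model for the demo: discrete first-order RC high-pass filter."""
    tau = 1.0 / (2.0 * math.pi * cutoff_hz)
    a = tau / (tau + sample_interval)
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(a * (out[-1] + signal[n] - signal[n - 1]))
    return out

def restore_dc(measured, sample_interval, cutoff_hz):
    """Undo the high-pass droop: add the scaled running sum of past samples."""
    tau = 1.0 / (2.0 * math.pi * cutoff_hz)
    gain = sample_interval / tau
    restored = []
    running_sum = 0.0
    for sample in measured:
        restored.append(sample + gain * running_sum)
        running_sum += sample
    return restored

if __name__ == "__main__":
    step = [1.0] * 500                                   # constant (DC) input
    measured = rc_highpass(step, sample_interval=1e-3, cutoff_hz=1.0)
    restored = restore_dc(measured, sample_interval=1e-3, cutoff_hz=1.0)
    print(measured[-1], restored[-1])                    # drooped vs. restored level
```

The restoration is a first-order approximation, so a small residual error proportional to T/τ remains.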
Power Alignment and Curve Fitting: Though the FPGA’s power distribution network is also RC-filter-like, it is hard to approximate it with a simple low-pass or high-pass filter, as its frequency response overlaps the spectrum of the power traces. Instead, based on this RC filter assumption, we further assume that the power trace acquired in each cycle is similar to a capacitor’s charging and discharging curve. Then we use curve fitting tools to obtain the exact power consumption in one cycle. In the first step, we need to align the power trace with the clock signal. A template signal, representing a typical power trace in one clock cycle as shown in Fig. 6 (c), is carefully chosen by hand from the filtered power trace, and we calculate the Pearson correlation coefficient between the template signal and the trace at each sample point of the original power trace. We choose the points with maximum coefficients as alignment points. The aligned power trace is shown in Fig. 6 (d).
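The alignment step can be sketched as a sliding Pearson correlation followed by picking local maxima. The minimum-correlation cutoff `min_corr` and the function names are assumptions introduced for this example; the text only states that the points with maximum coefficients are selected.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                      # flat window: correlation undefined
    return cov / math.sqrt(var_x * var_y)

def find_alignment_points(trace, template, min_corr=0.9):
    """Return sample indices where the template correlates best with the trace."""
    m = len(template)
    corr = [pearson(trace[i:i + m], template) for i in range(len(trace) - m + 1)]
    points = []
    for i, c in enumerate(corr):
        left = corr[i - 1] if i > 0 else -1.0
        right = corr[i + 1] if i + 1 < len(corr) else -1.0
        if c >= min_corr and c >= left and c > right:   # local maximum
            points.append(i)
    return points

if __name__ == "__main__":
    cycle = [0.0, 1.0, 0.6, 0.35, 0.2, 0.1, 0.05, 0.0]   # toy one-cycle shape
    trace = [s * v for s in (1.0, 0.5, 2.0, 1.5) for v in cycle]
    print(find_alignment_points(trace, cycle[:6]))
```

Because the Pearson coefficient is invariant to scaling, cycles with different power levels still align to the same template.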
The power signals in every cycle rise sharply at first and then gradually descend, which comes from the charging and discharging of the equivalent capacitor in the power distribution network. Thus, we fit the power curve with the capacitor’s charging and discharging formulas as follows:
$$V_{\mathrm{charge}}(t) = V_0\left(1 - e^{-t/\tau_p}\right), \qquad V_{\mathrm{discharge}}(t) = V_0\, e^{-t/\tau_p} \qquad (4)$$
in which $V_0$ represents the final voltage at the charging stage and the initial voltage for the discharging phase. The product of the equivalent resistance and capacitance of the power distribution network, also known as the RC time constant, is represented by $\tau_p$. The whole power extraction algorithm is listed in Algorithm 1, and we illustrate the procedure in Fig. 6 (e). The algorithm runs cycle by cycle: for each cycle, we estimate the optimal $V_0$ and $\tau_p$ from the power trace using a curve fitting function and calculate the trailing power that spills into subsequent cycles. The final power for the current cycle is the sum of the power in this cycle and the trailing power. The trailing power is then subtracted from the following power traces. The computation continues until all aligned cycles are processed. The solid red line in Fig. 6 (e) shows the optimal curve we find, while the dashed red line shows the trailing power for each cycle.
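Since Algorithm 1 is not reproduced in the text, the following is only a simplified sketch of the cycle-wise procedure it describes. It makes several assumptions: only the discharging branch (after the per-cycle peak) is fitted, the fit is a log-linear least-squares fit rather than a general curve fitting routine, and the per-cycle "power" is taken as the sum of samples. The names `fit_decay` and `extract_cycle_power` are hypothetical.

```python
import math

def fit_decay(samples):
    """Fit v[n] = v0 * exp(-n / tau) by least squares on log(v)."""
    points = [(n, math.log(v)) for n, v in enumerate(samples) if v > 0]
    if len(points) < 2:
        return 0.0, 1.0                       # nothing to fit: no trailing power
    n_mean = sum(n for n, _ in points) / len(points)
    y_mean = sum(y for _, y in points) / len(points)
    sxx = sum((n - n_mean) ** 2 for n, _ in points)
    sxy = sum((n - n_mean) * (y - y_mean) for n, y in points)
    slope = sxy / sxx
    if slope >= 0:
        return 0.0, 1.0                       # not decaying: assume no tail
    v0 = math.exp(y_mean - slope * n_mean)
    return v0, -1.0 / slope

def extract_cycle_power(trace, starts, cycle_len):
    """Per-cycle power with each cycle's exponential tail reassigned to it."""
    residual = list(trace)
    powers = []
    for start in starts:
        segment = residual[start:start + cycle_len]
        peak = max(range(len(segment)), key=segment.__getitem__)
        v0, tau = fit_decay(segment[peak:])
        trailing = 0.0
        for k in range(start + cycle_len, len(residual)):
            tail = v0 * math.exp(-(k - start - peak) / tau)
            residual[k] -= tail               # remove this cycle's tail from later cycles
            trailing += tail
        powers.append(sum(segment) + trailing)
    return powers

if __name__ == "__main__":
    amps, tau, cycle_len = [1.0, 2.0, 0.5], 5.0, 8
    trace = [sum(a * math.exp(-(k - cycle_len * i) / tau)
                 for i, a in enumerate(amps) if k >= cycle_len * i)
             for k in range(cycle_len * len(amps))]
    print(extract_cycle_power(trace, [0, 8, 16], cycle_len))
```

On this synthetic trace, where each cycle contributes a pure exponential decay that overlaps the following cycles, the extracted values are proportional to the amplitudes of the individual cycles rather than to the blended raw samples.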
VI Background Detection
In this section, we first discuss the intuition behind our background detection attack. Then we introduce the threshold-based attack method, and finally we evaluate it with the MNIST dataset [19].
VI-A Intuitions
For passive adversaries, the intuition for attacking the DNN accelerator comes from the power model: the power consumption is determined by the internal activities, especially those in the convolution unit, which accounts for the largest portion of the power consumption. If the data inside the convolution unit remain unchanged between cycles, few internal transitions are induced, and the power consumption is therefore small. Based on this insight, by observing the magnitude of the power consumption in each cycle, passive adversaries can determine whether the related pixels share similar values. Such similar pixels most probably belong to the pure background of the image. As a result, the silhouette of the foreground object is naturally revealed by locating all pixels belonging to the background, and the privacy of the user’s information may be infringed through the adversaries’ visual inspection.
Though many real-world images have a messy background, some types of privacy-sensitive images happen to contain a pure background, such as medical images from ultrasonography or radiography. If the adversaries are able to recover the shape of the foreground object, they may be able to identify the organ being scanned and thus infer the health condition of a particular patient.
VI-B Attack Method
The basic idea of the attack is to find a threshold that distinguishes cycles processing background pixels based on the magnitude of their power consumption. However, deciding the exact threshold is not a trivial task, as we cannot observe a clear gap in the distribution of the power consumed in each cycle, as shown in the histogram in Fig. 7 (a). We assume the power consumed in cycles processing foreground pixels is evenly distributed across a large range, while the power consumption of the remaining cycles aggregates in the bins of smaller values. So we expect to observe a peak in cycle counts at small per-cycle power consumption. In this case, we decide the threshold by finding the maximal decrease in cycle count:
$$P_{th} = \arg\max_{p}\,\bigl(N(p) - N(p + s)\bigr) \qquad (5)$$
wherein $N(\cdot)$ is the function returning the cycle count for a particular power consumption and $s$ is the bin size.
After the threshold is determined, we filter out all cycles whose power consumption is above the threshold. Then we locate the pixels corresponding to the remaining cycles. These pixels are regarded as background pixels, and we thus obtain a black-and-white image for further examination and analysis.
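The threshold rule can be sketched as a histogram scan that looks for the largest drop in cycle count between two adjacent bins. Two details are assumptions of this example rather than statements from the text: the threshold is placed at the boundary between the two bins, and each cycle is mapped to a single background flag (the mapping from cycles to pixel positions is omitted). The function names and the toy power values are hypothetical.

```python
def select_threshold(cycle_powers, bin_size):
    """Pick the bin boundary where the cycle count drops the most."""
    counts = {}
    for p in cycle_powers:
        b = int(p // bin_size)                     # histogram bin index
        counts[b] = counts.get(b, 0) + 1
    best_bin, best_drop = None, float("-inf")
    for b in range(min(counts), max(counts) + 1):
        drop = counts.get(b, 0) - counts.get(b + 1, 0)
        if drop > best_drop:
            best_bin, best_drop = b, drop
    return (best_bin + 1) * bin_size               # boundary between the two bins

def background_mask(cycle_powers, threshold):
    """True for cycles treated as processing background pixels only."""
    return [p <= threshold for p in cycle_powers]

if __name__ == "__main__":
    # Toy per-cycle powers: many low-power cycles, a few spread-out high ones.
    powers = [0.1, 0.2, 0.2, 0.3, 0.4, 0.45, 0.7, 1.2, 1.8, 2.4]
    threshold = select_threshold(powers, bin_size=0.5)
    print(threshold, background_mask(powers, threshold))
```

The bin size controls how noisy the histogram is; with a bin that is too narrow, the largest drop may come from a random fluctuation rather than from the background peak.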
VI-C Evaluation
Experiment Setup: We performed our attack on the CNN accelerator used to classify the digits in the MNIST dataset. The size of the images in the MNIST dataset is 28×28. The images have a clear black background, which satisfies the prerequisite of our background detection method. For the CNN accelerator [8], we set the line size of the line buffer to 28, the number of input channels to 1, and the kernel size to 3×3 or 5×5. We adopted two models for the experiment, with their details shown in Table II. The only difference between the two models is the kernel size, as it directly determines the number of pixels involved in the convolution unit and affects the granularity of the recovered image.
Property | Model 1 | Model 2
No. of layers | 4 | 4
Accuracy on testing sets | 99.42% | 99.27%
Type of 1st layer | Convolution | Convolution
Kernel size in 1st layer | 3 x 3 | 5 x 5
No. of kernels in 1st layer | 64 | 64
We synthesized the CNN accelerator design for the FPGA and loaded the model parameters into the accelerator before the inference stage. We randomly chose 500 images from the MNIST testing set to evaluate our attack method. Both models contain 64 different kernels in the first layer, and for each kernel, we recorded the power trace when the accelerator performed the convolution in the first layer. As our algorithm recovers the pixel values on a per-cycle basis, we need to precisely identify the power trace fragment for the convolution in the first layer. It is trivial to locate the start point on the power trace, and the length of the fragment can be determined from the total number of clock cycles needed to finish the convolution computation.
Evaluation Metric: We evaluate the quality of recovered images with two metrics: pixel-level accuracy and recognition accuracy. Pixel-level accuracy evaluates the precision of our attack algorithm and is defined as follows:
$$\alpha = \frac{1}{|I|}\sum_{p \in I} \mathbb{1}\left[\hat{m}_p = m_p\right] \qquad (6)$$
in which $p$ represents a pixel in the targeted image $I$, $\hat{m}_p$ is the background marker (whether the pixel belongs to the background) predicted by our algorithm, and $m_p$ represents the golden marker. For MNIST images, we regard all pixels with value 0 (i.e., pure black pixels) as background pixels. We also evaluate the cognitive quality of the recovered image with recognition accuracy. We feed every recovered image to a high-accuracy MNIST classification model and compare the prediction with its golden label. In the following experiment, we use a multilayer perceptron network [12] with an accuracy of 99.2% as a golden reference to evaluate the cognitive quality.
Choices of Threshold and Kernel: We show a histogram of the power consumed in each cycle in Fig. 7 (a). In the figure, we draw the histograms for the power computed with two different kernels from model 1, and they manifest similar trends: as the power consumed per cycle increases, the cycle count rises at first and then descends sharply at the value of 0.5. After that, the cycle count gradually decreases and finally reaches 0. Based on the threshold selection criterion, we choose the threshold at 0.5 for this image.
To demonstrate the importance of the threshold choice in the attack, we recover the silhouette images using various threshold values from 0.1 to 3.0 with a step size of 0.1 and illustrate the two metrics in Fig. 7 (b). The pixel-level accuracy is drawn with solid lines, while the recognition accuracy is drawn with dot-dash lines. We observe that as the threshold increases, the pixel-level accuracy first increases to its peak value of around 85.6% for both kernels and then gradually decreases to 83.3%. The recognition accuracy for the two kernels follows a similar trend: it first rises to its peak, but then drops significantly as the threshold increases. For kernel 1, it reaches its peak value of 81.6% at threshold 0.5, while for kernel 2, it reaches its peak of 81.8% at threshold 0.3.
As the threshold approaches its optimal value, more and more background pixels are correctly identified, so we observe an increase in both pixel-level accuracy and recognition accuracy. After the threshold exceeds the optimal value, the pixel-level accuracy drops only a little, while the recognition accuracy falls remarkably. This is because there are fewer foreground pixels than background pixels. When the threshold increases, the number of misclassified pixels is limited, so the pixel-level accuracy is not affected much. However, as foreground pixels are the key components of digit strokes, their misclassification leads to a significant loss in recognition accuracy. According to these two curves, the optimal threshold lies in the range [0.3, 0.5], which is the region between the two green dashed lines in Fig. 7 (b). This is consistent with the threshold selection criterion we raised in the last subsection. Another conclusion from the experiment is that both accuracy metrics are hardly affected by the choice of kernel. Therefore, adversaries can acquire acceptable images by attacking the power trace of only one kernel.
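The pixel-level accuracy metric and the threshold sweep from 0.1 to 3.0 can be sketched as follows. This is a simplified illustration: it assumes one extracted power value per pixel, whereas in the attack each cycle's power relates to all pixels in the convolution window, and the toy power values and golden markers are made up for the example. The function names are hypothetical.

```python
def pixel_accuracy(predicted_bg, golden_bg):
    """Fraction of pixels whose predicted background marker matches the golden one."""
    matches = sum(1 for p, g in zip(predicted_bg, golden_bg) if p == g)
    return matches / len(golden_bg)

def sweep_thresholds(cycle_powers, golden_bg, thresholds):
    """Return (threshold, pixel-level accuracy) pairs for each candidate threshold."""
    results = []
    for t in thresholds:
        predicted = [p <= t for p in cycle_powers]   # low power -> background
        results.append((t, pixel_accuracy(predicted, golden_bg)))
    return results

# Thresholds 0.1, 0.2, ..., 3.0 as in the experiment
thresholds = [round(0.1 * i, 1) for i in range(1, 31)]

if __name__ == "__main__":
    powers = [0.05, 0.2, 0.4, 0.9, 1.5, 2.2]             # toy data (hypothetical)
    golden = [True, True, True, False, False, False]      # True = background
    results = sweep_thresholds(powers, golden, thresholds)
    print(max(results, key=lambda r: r[1]))
```

Recognition accuracy is not covered by this sketch, since it requires feeding the recovered images to a trained classifier.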
Kernel Size: Kernel size can be a significant factor affecting both the pixel-level and the recognition accuracy. The average pixel-level accuracy for model 1 (3×3 kernel) is 86.2%, while the accuracy for model 2 (5×5 kernel) is around 74.6%. The recognition accuracy is shown in Fig. 8: on average, it is 81.6% for images recovered from power acquired with model 1 and 64.6% with model 2. The accuracy degradation comes from the information loss when the kernel size increases. This is because our algorithm is only able to find cycles that deal with background pixels via thresholding, which requires all the pixels inside the convolution unit to be identical. In other words, if the convolution unit contains a single non-background pixel, all the background pixels in it may be misidentified as foreground pixels by our proposed algorithm. This effect is similar to the morphological dilation operation [23] in digital image processing, which widens the shape of foreground objects. The recovered image looks “fatter” than the original image. Key structures smaller than the kernel size are more likely to disappear, resulting in the degradation of recognition accuracy.
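The dilation effect can be reproduced with a small sketch that applies the window-level rule described above: a k×k window counts as background only if every pixel inside it is background. The function name and the toy image are hypothetical; stride 1 and no padding are assumed.

```python
def detect_foreground_windows(image, k, background=0):
    """Mark each k x k window as foreground (1) unless all its pixels are background."""
    height, width = len(image), len(image[0])
    result = []
    for r in range(height - k + 1):
        row = []
        for c in range(width - k + 1):
            all_bg = all(image[r + a][c + b] == background
                         for a in range(k) for b in range(k))
            row.append(0 if all_bg else 1)
        result.append(row)
    return result

if __name__ == "__main__":
    # 9 x 9 black image with a single foreground pixel in the middle
    image = [[0] * 9 for _ in range(9)]
    image[4][4] = 255
    for k in (3, 5):
        marked = detect_foreground_windows(image, k)
        print(k, sum(sum(row) for row in marked))
```

A single foreground pixel is spread over 9 window positions with a 3×3 kernel and over 25 positions with a 5×5 kernel, which matches the observation that recovered images look "fatter" as the kernel grows.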
Another finding is that the recognition accuracy varies across digits, especially for model 2. Fig. 8 shows the recognition accuracy categorized by digit class. The accuracy of classifying the original images is almost the same and nearly perfect for all classes. The recognition rates of digits 3, 7, and 9 are below average for images recovered with model 1. Meanwhile, the accuracy for digits 1, 3, 4, 7, and 9 drops significantly for model 2. We attribute the discrepancy among digits to the inherent structure of the digits and the equivalent dilation effect in the recovered images. For example, the image of digit 1 recovered from model 2 is much “fatter” than that from model 1 due to the dilation effect, so the classification network is more likely to misclassify it as digit 8, causing a low recognition rate.
We also investigate the classification results of recovered images for each digit with both models. The results are drawn in the classification map shown in Fig. 9. Each cell in the map represents the portion of images with golden class $i$ that are predicted as class $j$, wherein the portion is illustrated by the darkness of the cell. For both models, the darkest cells all lie on the diagonal of the map, which means the classification network predicts correctly in most cases. We observe that the recognition accuracy of digit 8 is quite high (around 90%) for model 2 in Fig. 8. However, its precision is not. In Fig. 9 (b), the cells in the column of inferred digit 8 are darker than the other cells in the same row, except for genuine class 8. So an image inferred as digit 8 has a larger probability of actually being another digit because of this low precision. This is because the inherent shape of 8 is larger than that of other digits, and the classification network is more inclined to classify a “fatter” image, which is caused by the dilation effect, as digit 8.
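The distinction between per-class recognition accuracy (recall) and precision can be made concrete with a classification map stored as a confusion matrix, rows indexed by golden class and columns by predicted class. The 3-class matrix below is made-up toy data chosen only to show a class with high recall and low precision; it is not taken from Fig. 9, and the function names are hypothetical.

```python
def recall(confusion, cls):
    """Share of images of golden class `cls` that are predicted as `cls`."""
    row_total = sum(confusion[cls])
    return confusion[cls][cls] / row_total if row_total else 0.0

def precision(confusion, cls):
    """Share of images predicted as `cls` whose golden class really is `cls`."""
    col_total = sum(row[cls] for row in confusion)
    return confusion[cls][cls] / col_total if col_total else 0.0

# Toy confusion matrix (hypothetical counts): rows = golden, columns = predicted.
# The last class absorbs many images from the other classes.
confusion = [
    [6, 0, 4],
    [1, 7, 2],
    [0, 1, 9],
]

if __name__ == "__main__":
    print(recall(confusion, 2), precision(confusion, 2))
```

In this toy matrix the last class is recognized 90% of the time, yet only 60% of the images assigned to it truly belong to it, which mirrors the behaviour reported for digit 8 with model 2.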
To conclude, the kernel size affects both the pixellevel accuracy and recognition accuracy due to their equivalent dilation effect induced by kernels. The recognition accuracy of different digits also varies because of their inherent structure.
Complexity: The attack method only attacks one power trace. As the power extraction and background detection procedure are cyclebased, the time complexity is proportional to the total number of cycles to compute the convolution which is determined by the image size , where the and is the length of the image in two dimensions. The total time used is short in practice. It takes around 6s to obtain one power trace and 5.7s for power extraction. For actual image reconstruction it only takes 0.01s.
VII Image Reconstruction via Power Template
In this section, we propose an attack method for active adversaries that recovers the details of the images used in the inference process. Instead of predicting background markers, we try to obtain the value of each pixel. The section is organized similarly to Section VI, with three subsections: intuition, attack method and evaluation.
VII-A Intuitions
The search space for recovering pixel values is prohibitively large even if we consider only the pixels in a small local region. Suppose the targeted model uses a 3×3 kernel for the first convolution layer; the number of pixels involved in the convolution in one cycle is then 12 (see the analysis in Section VII-B). A pixel typically takes a value from 0 to 255, so the total number of combinations for the pixels involved is around $256^{12}$. Iterating over all combinations of pixel values by brute force is therefore impractical. Thus, for active adversaries, we propose to reduce the search space significantly by building a “power template”. As active adversaries are able to profile the relationship between power consumption and arbitrary input images, the pre-built power template can efficiently predict pixel values during the actual attack using the knowledge acquired at the profiling stage.
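As a quick check of the scale involved, the brute-force search space for 12 pixels with 256 possible values each can be computed directly. The helper name `search_space` is only illustrative.

```python
def search_space(num_pixels, levels=256):
    """Number of distinct value assignments for num_pixels pixels."""
    return levels ** num_pixels

n = search_space(12)      # 12 related pixels for a 3 x 3 kernel
print(n)                  # exact integer, equal to 2**96
print(f"{float(n):.2e}")  # order of magnitude
```

The result is on the order of 10^28 combinations for a single cycle, which is why exhaustive search is set aside in favour of a profiled template.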
As illustrated in Section II-C, the power consumed in each cycle is determined by the data inside the convolution unit, which comprise pixel values and kernel parameters. Typically, the same inputs are convolved with different kernels in a convolutional layer. For a specific region of pixels processed by the convolution unit, we can therefore regard the power consumption acquired with different kernels as a unique feature from which to infer the pixel values. Based on this intuition, we build a “power template” that stores the mapping from power consumption to pixel values, so that adversaries can produce a set of possible pixel values from a vector of power consumption retrieved at attack time. Finally, after acquiring many candidate values for each pixel from multiple power vectors, we apply an image reconstruction algorithm to select the best candidate. As one pixel is processed in multiple cycles, the goal of the selection is to find candidates in these cycles that predict similar values for this pixel.
VII-B Attack Method
In this subsection, we introduce the detailed steps to recover the pixel values of the input image. First, we discuss how to build the power template at the profiling stage. Then, with the power extracted for different kernels at attack time, we show how to get candidate pixel values from the power template. Finally, we present an algorithm to reconstruct the image from these candidates.
Power Template Building: The power template stores the mapping between pixel values and their corresponding power consumption when convolved with different kernels. In the profiling phase, for each input image, we collect multiple power traces from the FPGA loaded with different kernels and obtain the power consumption at each cycle using the power extraction method in Section V.
The power consumed in each cycle is determined by the state transitions of the convolution unit. The kernel remains constant between cycles, while the pixels are shifted within a row. So the number of related pixels in one cycle is $k(k+1)$ when the kernel size is $k \times k$. For example, suppose that at cycle $i-1$ the pixels in columns $c$ to $c+k-1$ of the current $k$ rows are inside the convolution unit, while at cycle $i$ the pixels in columns $c+1$ to $c+k$ are processed. The power consumed in cycle $i$ is induced by the change in the convolution unit, so all pixels in columns $c$ to $c+k$ determine the power consumption. We denote this region as $R_i$ and the pixel values in this region as $X_i$. These pixel values are named the related pixels for the $i$th cycle. Further, we denote the power consumption in the $i$th cycle when the image is computed with the $j$th kernel as $p_{i,j}$.
So for each cycle, we obtain the related pixels and the power collected with different kernels, namely the power feature vector, represented as $P_i = (p_{i,1}, p_{i,2}, \ldots, p_{i,K})$, where $K$ is the number of kernels used. For one input image at the profiling stage, the power template is constructed by adding the pairs of related pixels and corresponding power feature vector for all cycles, i.e., $T = \{(X_i, P_i)\}$ over all cycles $i$. The final power template is the union of the power templates constructed from every input image.
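The template-building step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the related pixels of a cycle are taken to be the k rows by k+1 columns spanned by two consecutive window positions in a row (which gives the 12 pixels quoted for a 3×3 kernel), row transitions are ignored, and the per-cycle power features are assumed to be already extracted into an array. The names `related_regions` and `build_template` are hypothetical.

```python
import numpy as np

def related_regions(image, k):
    """Yield (cycle_index, pixels) for every horizontal shift of a k x k window.

    pixels has shape (k, k + 1): the union of two consecutive window positions.
    """
    h, w = image.shape
    cycle = 0
    for r in range(h - k + 1):
        for c in range(w - k):
            yield cycle, image[r:r + k, c:c + k + 1].copy()
            cycle += 1

def build_template(images, power, k):
    """Collect (related_pixels, power_feature_vector) pairs.

    images: list of 2-D arrays used for profiling.
    power:  power[n][cycle] is the length-K vector of per-kernel power values
            measured for image n (assumed to be extracted beforehand).
    """
    template = []
    for img, traces in zip(images, power):
        for cycle, pixels in related_regions(img, k):
            template.append((pixels, np.asarray(traces[cycle], dtype=float)))
    return template

# Toy usage with random stand-ins for images and measured power.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(6, 6)) for _ in range(2)]
power = rng.random((2, 12, 9))   # 2 images, 12 cycles each, 9 kernels
template = build_template(images, power, k=3)
print(len(template), template[0][0].shape, template[0][1].shape)
```

Storing the pairs in a flat list keeps the sketch simple; a real template of tens of megabytes would likely need an indexed structure for faster lookup.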
Candidates Generation: Based on the assumption that similar pixels processed in the convolution unit generate similar power feature vectors, the straightforward way to obtain pixel candidates is to find the pixel values in the power template whose corresponding power feature vector is closest to that extracted during the attack. However, this method easily fails due to the limited number of samples enrolled in the power template. Hence, we propose to divide the power feature vector into several groups and search for each of them in the power template separately. After obtaining the pixel candidates for each group, we take their intersection to generate the final candidate set for image reconstruction.
To be specific, for a given cycle $i$, we acquire a power feature vector $P_i = (p_{i,1}, \ldots, p_{i,K})$ from the measured power traces and separate it into $m$ groups of the same size, $P_i = (G_{i,1}, \ldots, G_{i,m})$. For each entry $(X', P')$ in the power template $T$, we do the same separation, i.e., $P' = (G'_1, \ldots, G'_m)$. For each group $G_{i,g}$, we search the same group of power feature vectors in the power template and return the related pixels if the distance between the two groups is within a threshold $d$. The candidate set, consisting of all returned related pixels, is given by
$$S_{i,g} = \{\, X' \mid (X', P') \in T,\ D(G_{i,g}, G'_g) \le d \,\},$$
where the distance metric $D(G_{i,g}, G'_g)$ is computed over the power features $p_{i,j}$ and $p'_j$ with $j \in I_g$.
Here $I_g$ represents the kernel indexes of the power features assigned to the $g$th group. The final candidate set for the cycle is given by the intersection of the candidate sets from all groups, i.e., $S_i = \bigcap_{g=1}^{m} S_{i,g}$.
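The group-wise search and intersection can be sketched as below. The Euclidean distance used here is an assumption made for illustration, since the exact metric is not reproduced in this text; the function names and the toy template are likewise hypothetical.

```python
import numpy as np

def group_candidates(template, p, idx, d):
    """Related pixels of every template entry whose power features at the
    kernel indexes idx lie within distance d of the measured features."""
    return {
        pixels
        for pixels, q in template
        if np.linalg.norm(p[idx] - q[idx]) <= d  # assumed metric: Euclidean
    }

def final_candidates(template, p, groups, d):
    """Intersect the per-group candidate sets."""
    sets = [group_candidates(template, p, idx, d) for idx in groups]
    return sorted(set.intersection(*sets))

# Toy template: (related pixels as a tuple, power feature vector with K = 4).
template = [
    ((0, 0),     np.array([1.0, 1.0, 1.0, 1.0])),
    ((10, 20),   np.array([1.0, 1.1, 5.0, 5.0])),
    ((200, 100), np.array([4.0, 4.0, 1.0, 1.05])),
]
measured = np.array([1.0, 1.05, 1.0, 1.0])
groups = [[0, 1], [2, 3]]   # two groups of size 2

result = final_candidates(template, measured, groups, d=0.2)
print(result)
```

In this toy case the first group matches two entries and the second group matches two different ones, and only the entry common to both survives the intersection.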
Image Reconstruction Algorithm: After obtaining the candidate sets for all cycles, we have many possible values for each pixel and the target is to find the closest one to the actual value.
As noted, the same pixel is processed in different cycles, so the candidates selected in these cycles should agree on the value of that pixel. We use this as the criterion to find the optimal selection. Suppose that for a pixel at position $(x, y)$, there are $t$ cycles processing this pixel: $c_1, \ldots, c_t$. The candidate sets for these cycles are $S_{c_1}, \ldots, S_{c_t}$. From each candidate set $S_{c_l}$, we choose a candidate with a selector $s_{c_l}$, and we denote the pixel value at position $(x, y)$ in the chosen candidate as $v^{(x,y)}_{c_l}$. The variance of the pixel values at position $(x, y)$ selected from the different candidate sets should be small. The objective of our image reconstruction algorithm is to find a selector vector $\mathbf{s}$ such that the selected candidates minimize the sum of the variances of all pixels in the image. It can be represented as follows:
$$\min_{\mathbf{s}} \sum_{(x,y)} \operatorname{Var}\!\left(v^{(x,y)}_{c_1}, \ldots, v^{(x,y)}_{c_t}\right) \tag{7}$$
After the selector vector is determined, for each cycle, we get only one candidate. But for each pixel, we get multiple candidates from cycles processing it. To get the final value of the pixel, we take the average of these candidates.
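A minimal sketch of the variance objective and the final averaging step is given below. It assumes each selected candidate is a small patch of pixel values placed at a known top-left position, and it uses the population variance; both the data layout and the function names are illustrative assumptions.

```python
import numpy as np

def stack_votes(selected, shape):
    """Per pixel: sum, sum of squares and count of the proposed values.

    selected: list of (row, col, patch), one selected candidate per cycle.
    """
    s = np.zeros(shape)
    s2 = np.zeros(shape)
    n = np.zeros(shape)
    for r, c, patch in selected:
        h, w = patch.shape
        s[r:r + h, c:c + w] += patch
        s2[r:r + h, c:c + w] += patch.astype(float) ** 2
        n[r:r + h, c:c + w] += 1
    return s, s2, n

def total_variance(selected, shape):
    """Sum over all covered pixels of the variance of their proposed values."""
    s, s2, n = stack_votes(selected, shape)
    m = n > 0
    return float(np.sum(s2[m] / n[m] - (s[m] / n[m]) ** 2))

def average_image(selected, shape):
    """Final pixel value: average of the values proposed by the selected candidates."""
    s, _, n = stack_votes(selected, shape)
    return np.divide(s, n, out=np.zeros(shape), where=n > 0)

# Two overlapping 1 x 2 patches on a 1 x 3 image.
selected = [(0, 0, np.array([[10, 20]])), (0, 1, np.array([[22, 30]]))]
print(total_variance(selected, (1, 3)))
print(average_image(selected, (1, 3)))
```

Only the middle pixel is covered twice here (values 20 and 22), so it alone contributes to the variance, and its recovered value is their average.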
This optimization problem is not easy to solve, so we present a greedy heuristic, shown in Algorithm 2. In Algorithm 2, we start with a random candidate set and, for each candidate in the set, initialize an empty image with the pixels of the candidate (Lines 6–8); the other pixels are left undecided. Then we greedily search the unprocessed candidate sets and find the candidate that overlaps the current image to the largest extent (Line 10) and has the smallest distance to the overlapped pixels in the current image (Line 11). The image is then updated with the candidate and its index is recorded (Lines 12–13). This process is repeated until all candidate sets are processed, at which point a selector with one entry per candidate set has been generated. We calculate the variance defined in Eq. 7 for the selector (Lines 15–16). After all candidates in the original set are processed, we find the selector with the minimal variance and return it as the final result (Lines 18–19).
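The greedy procedure described above can be sketched roughly as follows. This is a simplified reading of the prose, not a transcription of Algorithm 2: the starting candidate set is passed in rather than drawn at random, every candidate set is assumed to be non-empty, the distance on overlapped pixels is taken to be a sum of absolute differences, and ties are broken by list order. All names are hypothetical.

```python
import numpy as np

def selection_variance(cand_sets, selector, shape):
    """Sum over pixels of the variance of the values proposed by the selected candidates."""
    s = np.zeros(shape)
    s2 = np.zeros(shape)
    n = np.zeros(shape)
    for i, j in selector.items():
        r, c, patches = cand_sets[i]
        p = patches[j]
        h, w = p.shape
        s[r:r + h, c:c + w] += p
        s2[r:r + h, c:c + w] += p ** 2
        n[r:r + h, c:c + w] += 1
    m = n > 0
    return float(np.sum(s2[m] / n[m] - (s[m] / n[m]) ** 2))

def greedy_select(cand_sets, shape, seed_set=0):
    """cand_sets[i] = (row, col, [patch, ...]); returns (selector, variance)."""
    best_var, best_selector = None, None
    r0, c0, seeds = cand_sets[seed_set]
    for seed_idx, seed in enumerate(seeds):
        image = np.full(shape, np.nan)          # NaN marks undecided pixels
        h, w = seed.shape
        image[r0:r0 + h, c0:c0 + w] = seed
        selector = {seed_set: seed_idx}
        todo = [i for i in range(len(cand_sets)) if i != seed_set]
        while todo:
            def overlap(i):
                r, c, patches = cand_sets[i]
                ph, pw = patches[0].shape
                return np.count_nonzero(~np.isnan(image[r:r + ph, c:c + pw]))
            i = max(todo, key=overlap)          # set overlapping the image most
            r, c, patches = cand_sets[i]
            ph, pw = patches[0].shape
            window = image[r:r + ph, c:c + pw]  # view into image
            known = ~np.isnan(window)
            dists = [np.abs(p[known] - window[known]).sum() for p in patches]
            j = int(np.argmin(dists))           # closest candidate on the overlap
            window[~known] = patches[j][~known] # fill only undecided pixels
            selector[i] = j
            todo.remove(i)
        var = selection_variance(cand_sets, selector, shape)
        if best_var is None or var < best_var:
            best_var, best_selector = var, selector
    return best_selector, best_var

# Toy 1 x 4 image with 1 x 2 patches at columns 0, 1 and 2.
cand_sets = [
    (0, 0, [np.array([[10., 20.]]), np.array([[90., 90.]])]),
    (0, 1, [np.array([[80., 10.]]), np.array([[21., 30.]])]),
    (0, 2, [np.array([[29., 40.]]), np.array([[0., 0.]])]),
]
selector, variance = greedy_select(cand_sets, (1, 4))
print(selector, variance)
```

In the toy example the first seed leads to a mutually consistent chain of candidates with a much smaller total variance than the second seed, so its selector is returned.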
VII-C Evaluation
Experiment Setup: We followed the same experiment setup as in Section VI-C, except that we used 300 images to build the power template and the remaining 200 digit images to evaluate the attack method. The pixel values of the MNIST digits are in the range [0, 255]. We chose to build the power template with power traces collected with 9 different kernels instead of all 64 kernels, because this already provides enough precision to recover the input image.
Evaluation Metric: We use the same evaluation metrics as in the background detection, except that the pixel-level accuracy is redefined in terms of pixel values instead of background markers (Eq. 8): it measures the distance between each pixel value in the recovered image and the corresponding pixel value in the original image.
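One plausible form of such a pixel-level distance is the mean absolute difference over all pixels. This choice is an assumption made for illustration (a root-mean-square distance would fit the description equally well), and the function name is hypothetical.

```python
import numpy as np

def pixel_level_distance(recovered, original):
    """Mean absolute difference between recovered and original pixel values.

    Assumed form of the metric; the source only states that it compares
    pixel values of the recovered image with those of the original image.
    """
    recovered = np.asarray(recovered, dtype=float)
    original = np.asarray(original, dtype=float)
    return float(np.mean(np.abs(recovered - original)))

d = pixel_level_distance([[0, 10], [20, 30]], [[0, 12], [20, 26]])
print(d)
```

Casting to float first avoids wrap-around when the inputs are unsigned 8-bit images.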
Candidates Generated: To evaluate the effectiveness of grouping power vectors in the power template, we list the statistics of the candidates returned by the power template in Table III. As we collected power traces with 9 different kernels, the length of the power feature vector for each cycle is 9. In the experiment, we divided the power features into 4 groups of size 2 (using the first 8 features) and into 3 groups of size 3, respectively. For each distance threshold, we report the number of candidates returned by the power template for one group. We also calculate the distance between each candidate and the genuine related pixels and report the minimal distance, which serves as the quality metric of the returned candidate set: the smaller, the better. For the final candidate set, i.e., the intersection of the candidate sets from all groups, we likewise report the number of candidates in the set and the minimal distance. Table III shows the averages of these numbers over all cycles.
From the table, the number of candidates grows as the threshold increases, because a larger search space is included. The average minimal distance also decreases when more candidates are included. For smaller thresholds, such as 0.1 and 0.2, the number of candidates returned is small, and for many cycles we are not able to find a match inside the template. Thus, a smaller threshold may lead to lower precision in finding the related pixels. For both group sizes investigated, the intersection achieves a significant reduction in the size of the final candidate set, while maintaining a similar capability to recover precise pixels (reflected by the small changes in the minimal distance) at medium or large thresholds.
In all, grouping the power feature vectors and intersecting the candidate sets from the power template is effective in reducing the number of pixel candidates for each cycle while maintaining the accuracy.
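Table III: Statistics of the candidates returned by the power template, averaged over all cycles.
Threshold | GroupSize = 2: candidates per group | min. distance per group | final candidates | final min. distance | GroupSize = 3: candidates per group | min. distance per group | final candidates | final min. distance
0.1 | 767 | 57 | 107 | 190 | 325 | 153 | 48 | 155
0.2 | 1448 | 45 | 351 | 116 | 787 | 90 | 170 | 102
0.5 | 3847 | 33 | 1086 | 90 | 2447 | 48 | 715 | 68
1.0 | 9457 | 26 | 2223 | 67 | 5890 | 34 | 1571 | 56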
Image Quality: Based on the experimental results in Table III, we proceed with image reconstruction using group size 3, as its final candidate set is smaller than that of group size 2. We also set the threshold to 1.0 to maintain a high-accuracy candidate set for further reconstruction. For the remaining 200 images used for evaluation, we recover them from the candidate sets returned by the power template using Algorithm 2. We also generate images without this algorithm for comparison: without Algorithm 2, the value of a particular pixel is given by the average of all possible values for this pixel in the returned candidates. The average pixel-level distance, defined in Eq. 8, is 1.65 for images generated with Algorithm 2 and 2.98 for images generated without it. On average, both are quite close to the genuine image considering that the pixel value range is 0 to 255. This is because the candidates generated from the power template are already close to the genuine pixels.
However, as illustrated in Fig. 10, the recognition accuracy is much higher with Algorithm 2. The recognition accuracy of images recovered for model 1 (3×3 kernel) with the algorithm is 89.8%, while it drops to 15% if we take the average of all the pixel candidates. The same drop occurs for the images recovered from power traces collected with model 2 (5×5 kernel), from 79% to 10%. Although the images recovered without the proposed algorithm achieve relatively good pixel-level accuracy, their low recognition accuracy results from the inability to reconstruct the structure of the digits at some critical points, especially at the edges of the digits. In contrast, Algorithm 2 considers the consistency of the related pixels recovered across cycles, so it is able to filter out most unrelated pixels.
Finally, enlarging the kernel size incurs some degradation in the recognition accuracy, from 89.8% to 79%, as more pixels are involved in one cycle and it is therefore harder to distinguish the genuine pixels.
Complexity: We analyze the complexity in three phases. The time complexity of building the power template is proportional to the product of the total number of images enrolled, the number of cycles needed to generate one feature map and the number of kernels. The memory complexity of power template building is proportional to the product of the total number of entries in the power template and the entry size, where each entry holds the related pixels and their power feature vector. In candidate generation, the time complexity is proportional to the size of the power template and the size of the returned candidate sets. Finally, for the image reconstruction algorithm, the most time-consuming part is its main loop (Lines 3–17), so its complexity is determined by the total number of candidate sets (equal to the number of cycles needed to generate an output feature map) and the average size of a candidate set. The methods in all three phases can be implemented efficiently, and we report their running times as follows: it takes 215.6s to build the power template from 300 images and 157.2s to generate candidates for all cycles when recovering one image. The image reconstruction algorithm takes around 43.2s to finish. The size of the power template built from the 300 enrolled images is around 44MB.
VIII Discussion and Future Work
In this section, we first discuss the applicability of our proposed attack and the attack target of the background detection method. We then discuss countermeasures and future work.
Applicability: Although we evaluate our power side-channel attack on an accelerator implemented on an FPGA, the actual attack target is the line buffer structure, where we exploit the power consumed as the convolutional window slides over the input image. Thus our attack is applicable to any design adopting a line buffer to execute the convolution operation. Although the line buffer is not suitable for DNN systems on CPUs or GPUs, it is popular among a variety of FPGA- or ASIC-based neural network accelerators [9, 10]. Considering the promising applications of neural accelerators, the proposed attack is a severe threat to their security.
Attack Target of Background Detection: Firstly, the background detection method proposed in Section VI is not guaranteed to find all pure-background pixels because its recovery granularity is limited by the kernel size. Thus, the method can fail to recover images with a cluttered background. Secondly, the threshold used in background detection is determined by the sharp drop in the number of cycles as a function of the power consumed per cycle. We may not be able to observe this drop if the background pixels are far fewer than the foreground pixels. To summarize, the background detection method can recover images that contain a pure and relatively large background region.
Countermeasures: The most straightforward way to counter a side-channel attacker is to add noise to the power side channel, but this does not give strong guarantees of privacy protection, as the noise can still be cancelled out using its distribution. For performance and security reasons, countermeasures against power side-channel attacks are mainly implemented in two ways: random masking and random scheduling. Random masking breaks the correlation between the power consumption and the sensitive data by masking the intermediate results with a random number. For instance, before the convolution, a random number is added to each pixel value used for computation. Then, after the convolution result is obtained, the sum of these random numbers weighted by the convolution kernel is subtracted from it. Random scheduling is effective against active attackers who utilize the power from multiple kernels. If the convolution computations for the kernels are executed in a random order rather than sequentially, active adversaries will not be able to build an accurate power feature vector and can fail to produce a recognizable image.
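The masking arithmetic relies on the linearity of convolution, which the following sketch checks numerically. It is a functional illustration only: it shows that adding a random mask before the convolution and subtracting the kernel-weighted mask afterwards restores the original result, and it says nothing about how much leakage remains in a hardware implementation. The helper `conv2d_valid` and all variable names are hypothetical.

```python
import numpy as np

def conv2d_valid(x, w):
    """Sliding-window weighted sum with a square kernel (no padding, stride 1)."""
    k = w.shape[0]
    h, wd = x.shape
    out = np.zeros((h - k + 1, wd - k + 1))
    for r in range(h - k + 1):
        for c in range(wd - k + 1):
            out[r, c] = np.sum(x[r:r + k, c:c + k] * w)
    return out

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(6, 6)).astype(float)
kernel = rng.standard_normal((3, 3))

# Random masking: a fresh random number is added to every pixel.
mask = rng.integers(0, 256, size=image.shape).astype(float)
masked_out = conv2d_valid(image + mask, kernel)  # computed on masked pixels
correction = conv2d_valid(mask, kernel)          # kernel-weighted sum of the masks
unmasked = masked_out - correction

# Random scheduling: process the kernels in a fresh random order on each run.
order = rng.permutation(9)

print(np.allclose(unmasked, conv2d_valid(image, kernel)), order)
```

With a shuffled kernel order, the position of a power sample in the trace no longer identifies the kernel that produced it, which is what prevents an accurate power feature vector from being assembled.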
Future Work: Our proposed attack algorithm currently profiles the power consumption with images drawn from the same sample set as the attack targets. These images inherently resemble each other, so we can achieve high recognition accuracy with relatively low overhead. If the attackers target images with multiple input channels, or if they cannot obtain profiling images with the same distribution as the attack targets, more data need to be enrolled to achieve acceptable results. Thus, it is essential to handle the performance problem incurred by complex image recovery tasks and by the limited ability to obtain data from a similar distribution. We may resort to the following techniques to tackle the problem: we can use PCA to compress the power feature vectors and related pixel values to reduce the size of the pre-built power template, and use an SVM or a random forest to choose candidates in actual attacks. We plan to incorporate them in our future work and validate our method on more complicated datasets, such as CIFAR-10 or even ImageNet.
IX Related Work
Neural Network Privacy: In [24], the authors successfully correlated the dosage of a certain medicine with a specific patient’s genotype using a model from pharmacogenetics. On a face recognition system, they also managed to reconstruct, from the neural network model, the face images of users enrolled in the training stage [2]. Shokri et al. [25] presented a membership inference attack that decides whether a particular data record belongs to a model's training set with only black-box access to the model. Tramèr et al. [26] demonstrated a model extraction attack that exploits the relationship between queries and confidence values on different machine learning models, such as DNNs and logistic regression.
Power Side-channel Attack: Power side-channel leakage can be exploited to recover the secret keys of cryptographic devices. By analyzing the differences among multiple power traces with diverse inputs, attackers are able to uncover the secret key in widely used symmetric encryption standards, such as DES [27] and AES [28]. Eisenbarth et al. [29] and Msgna et al. [30] showed that the type of instruction executed by a processor can be recovered via the power side channel using a hidden Markov model.
X Conclusion
In this paper, we demonstrate the first power side-channel attack on an FPGA-based convolutional neural network accelerator. Its input image is successfully recovered using the power traces measured during the inference operation. In the attack, we first filter out the noise and distortions introduced by the power measurement process. We then consider two attack scenarios for adversaries with different abilities and propose two methods, background detection and power template, to recover the input image at different granularities. We demonstrate the practicality of our proposed attack on an accelerator executing a classification task for handwritten digits from the MNIST dataset, and the experimental results show that we achieve high recognition accuracy.
References
 [1] N. Papernot, et al. The limitations of deep learning in adversarial settings. In Proc. of IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387, 2016.
 [2] M. Fredrikson, et al. Model inversion attacks that exploit confidence information and basic countermeasures. In Proc. of ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1322–1333, 2015.
 [3] M. Abadi, et al. Deep learning with differential privacy. In Proc. of ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 308–318, 2016.
 [4] P. Mohassel and Y. Zhang. SecureML: A system for scalable privacy-preserving machine learning. In Proc. of IEEE Symposium on Security and Privacy (SP), pages 19–38, 2017.
 [5] N. Papernot, et al. Distillation as a defense to adversarial perturbations against deep neural networks. In Proc. of IEEE Symposium on Security and Privacy (SP), pages 582–597, 2016.
 [6] Y. LeCun, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
 [7] V. Sze, et al. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint arXiv:1703.09039, 2017.
 [8] R. Zhao, et al. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In Proc. of the International Symposium on Field-Programmable Gate Arrays (FPGA), pages 15–24, 2017.
 [9] J. Qiu, et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proc. of the International Symposium on Field-Programmable Gate Arrays (FPGA), pages 26–35, 2016.
 [10] C. Zhang and V. Prasanna. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system. In Proc. of the International Symposium on Field-Programmable Gate Arrays (FPGA), pages 35–44, 2017.
 [11] B. Bosi, et al. Reconfigurable pipelined 2D convolvers for fast digital signal processing. IEEE Transactions on VLSI Systems, 7(3):299–308, 1999.
 [12] I. Hubara, et al. Binarized neural networks. In Annual Conference on Neural Information Processing Systems 2016, pages 4107–4115, 2016.
 [13] BigML. https://www.bigml.com/, 2017.
 [14] Microsoft Azure machine learning. https://azure.microsoft.com/enus/services/machinelearning/, 2017.
 [15] Artificial intelligence tech in Snapdragon 835. https://www.qualcomm.com/news/onq/2017/04/13/artificialintelligencetechsnapdragon835personalizedexperiencescreated, 2017.
 [16] Xilinx Spartan-6 FPGA family. https://www.xilinx.com/products/silicondevices/fpga/spartan6.html, 2017.
 [17] SAKURA-G. http://satoh.cs.uec.ac.jp/SAKURA/hardware/SAKURAG.html, 2017.
 [18] MDO3000 mixed domain oscilloscope. http://www.tek.com/oscilloscope/mdo3000mixeddomainoscilloscope, 2017.
 [19] Y. LeCun, et al. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 2001–.
 [20] SAKURA: Side-channel attack user reference architecture – specification. http://satoh.cs.uec.ac.jp/SAKURA/hardware/SAKURAG_Spec_Ver1.0_English.pdf, 2017.
 [21] NI Multisim. http://www.ni.com/multisim, 2017.
 [22] Tom O’Haver. A pragmatic introduction to signal processing, 1997.
 [23] R. C. Gonzalez and R. E. Woods. Digital image processing. Pearson Education, 2002.
 [24] M. Fredrikson, et al. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In Proc. of the 23rd USENIX Security Symposium, pages 17–32, 2014.
 [25] R. Shokri, et al. Membership inference attacks against machine learning models. In Proc. of IEEE Symposium on Security and Privacy (SP), pages 3–18, 2017.
 [26] F. Tramèr, et al. Stealing machine learning models via prediction APIs. In 25th USENIX Security Symposium, pages 601–618, 2016.
 [27] P. C. Kocher, et al. Differential power analysis. In Proc. of Annual International Cryptology Conference (CRYPTO), pages 388–397, 1999.
 [28] E. Brier, et al. Correlation power analysis with a leakage model. In Proc. of Cryptographic Hardware and Embedded Systems (CHES), pages 16–29, 2004.
 [29] T. Eisenbarth, et al. Building a side channel based disassembler. Trans. Computational Science, 10:78–99, 2010.
 [30] M. Msgna, et al. The B-side of side channel leakage: Control flow security in embedded systems. In Proc. of International Conference on Security and Privacy in Communication Networks, pages 288–304, 2013.