Recently, deep neural networks (DNNs) have been the focus of research and have been widely used in many artificial intelligence (AI) related areas, such as image classification, object detection, video recognition, and natural language processing. Many DNNs are deployed on embedded devices, e.g., robots, self-driving cars, and smartphones, or used to solve specific industrial tasks such as pearl classification [2, 3], fault diagnosis [4, 5, 6], and soft sensors [7, 8]. With the miniaturization of DNNs and the development of AI chips [9, 10], DNNs on embedded hardware are becoming increasingly common and, meanwhile, vulnerable to various attacks. In this work, we demonstrate, for the first time, a powerful side-channel attack (SCA) based technique against embedded DNN devices, which has the potential to reveal critical internal information of DNNs.
Recently, it has been found that DNNs are vulnerable to adversarial attacks in the form of tiny perturbations. In other words, by adding small controllable noise to the input, we can mislead the network into generating incorrect results. This poses a security problem for the application of DNNs, e.g., attacks on road signs may bring a safety hazard to autonomous driving; attacks on face attributes [13, 14] may invalidate many face recognition applications; and attacks on robot vision may challenge the application of robots. The mainstream attack models include black-box and white-box attacks. The difference lies in the structural knowledge of the network. The black-box attack assumes no prior knowledge of the neural network model [17, 18], while the white-box attack relies upon complete structural information, including both the network architecture and the parameter values [19, 20, 21]. The white-box attack is much more powerful and resilient than the black-box attack, but at the cost of feasibility, since most embedded devices are black boxes to the attacker. To resolve this tradeoff and significantly improve the attack performance, it is necessary to obtain additional neural network information from the embedded hardware.
SCA is a powerful tool for obtaining hardware information [22, 23]. It takes advantage of side-channel signals, such as power consumption, computing time, and electromagnetic radiation, to reveal hidden information inside the embedded hardware. Traditionally, it is used to extract secret keys during the encryption/decryption process. However, since the side-channel information produced during DNN computation is strongly correlated with the network structure and its parameters, we envision that SCA can be applied to embedded AI devices to reveal their network architectures and even the corresponding parameters. In other words, we intend to use SCA to open the black box of DNNs, which can facilitate adversarial attacks by transforming a black-box attack into an at least partially white-box, or gray-box, attack. As a Chinese aphorism says, “the law of defending a city is born from siege; if you want to defend well, you must attack well”. In that spirit, we hope this study of SCA on DNNs can provide useful insights for researchers to propose defensive strategies in the future to better protect AI devices.
In this work, we propose an SCA-based technique to explore the network structure and parameter properties, which might be used for subsequent attacks. First, we collect the power consumption traces of the device. Then, we use machine learning techniques to identify the structure and parameter values. Our technique is based on the assumption that the AI device employs existing architectures and pre-trained parameters, which is appropriate for many real-world applications [25, 26, 27]. Specifically, we make the following contributions.
- We are the first to employ the SCA method to identify the DNN structures in embedded devices.
- We are also the first to use SCA to estimate the sparsity of DNN parameters. Based on the sparsity features, we derive the pre-trained parameter values.
- We validate the effectiveness of our techniques on a real-world embedded platform.
To perform SCA, we use power measurements taken during DNN processing. The experimental results show that our technique can achieve an attack success rate of more than 96.5%.
The remainder of this paper is organized as follows. Section II introduces the related work. Section III describes the basic knowledge of DNN structures and parameters. Section IV presents the underlying theories and details of our technique. Section V shows the experimental setup and results. Section VI concludes the paper.
II Related Work
The current work is related to deep neural networks, embedded AI hardware, and side channel attacks, which will be briefly reviewed.
II-A Deep neural networks
AlexNet is popular for its success in the 2012 ImageNet competition. GoogleNet significantly increases the depth of DNNs. ResNet beats human experts in image recognition. VGGNet and RCNN are widely used for their breakthroughs in object detection. There are also networks specific to mobile applications [9, 10]. Currently, most engineers design their AI products based on existing architectures. Therefore, by identifying the popular existing architectures, we expect to be able to break a large portion of such AI products.
II-B Embedded AI hardware
Embedded AI devices have recently become increasingly popular, and many AI-specific embedded products have been announced. Cambricon developed a series of embedded AI hardware based on the Diannao architecture [36, 37]. These can be used as AI processing units and handle jobs such as face recognition and object detection. Intel proposed the Movidius Myriad X Visual Processing Unit (VPU), a deep-learning-based visual AI core to accelerate applications such as drones, smart cameras, and VR/AR helmets. Huawei released the Ascend series of AI processing units, which are designed for a full range of scenarios. Nvidia and Facebook announced plans for DNN processing units. Google also designed dedicated hardware, the Tensor Processing Unit (TPU), for its AI services.
In this work, we develop a Raspberry Pi-based experimental platform. As an Arm Cortex-based system, the Raspberry Pi shares the common architecture of many existing devices. Therefore, the experiments performed on this system can be easily migrated to other similar systems.
II-C Side-channel attack
Side-channel attack (SCA) is a very powerful tool for attacking encrypted systems. Traditionally, the encryption process is considered a perfect black box. However, in real-world applications, information can leak. Initially, SCA focused on differential power analysis and timing attacks. Later, more side-channel information sources and attack methods were developed. Yarom et al. proposed a cache-based SCA to extract private encryption keys. Genkin et al. successfully extracted full 4096-bit RSA keys using the acoustic information of a computer during the decryption process. By cloning the USIM cards, Liu et al. recovered the encryption key and other information contained in 3G/4G USIM cards. Defense against SCA is also well studied.
Naturally, this powerful attack method can be applied to reveal DNN architectures or related information. Duddu et al. used timing side channels to infer the depth of the network. Batina et al. showed that side-channel attacks can roughly obtain information on the activation functions, the number of network layers, the number of neurons, the number of output categories, and the weights in the neural network. Another closely related work obtains the input image by analyzing the power trace of the first convolution layer. So far, the current work is the first attempt to reveal the internal DNN architectures of embedded devices using power SCA.
In this section, we introduce the basics of DNN and the corresponding parameters.
III-A DNN architectures
The mainstream DNN architectures typically share some common and critical components. For example, in visual applications, convolutional layers are usually used to extract features, while fully connected layers are used for classification. In this paper, we mainly focus on computer-vision-related DNN models. Fig. 1 shows the typical components, including convolutional layers, pooling layers, fully connected layers, and activation functions. The convolutional layers use various convolution kernels to filter the images. The pooling layers reduce the layer dimensions. The fully connected layers consist of fully connected neurons. The activation function is typically a non-linear function applied to the output of each neuron.
One critical observation is that different components incur different computational costs. Therefore, different architectures have different power consumption patterns, which makes DNN architectures vulnerable to SCAs.
III-B Parameters and sparsity
Parameters, i.e., neuron weights and biases, define a DNN model. Typically, a DNN model is first trained using back propagation. During the training phase, the parameters are continuously updated. Then, in the inference phase, the trained parameters are used to perform various classification operations. Training a DNN from scratch requires substantial computational resources and time [45, 46, 30]. Therefore, in real-world applications, people usually derive the DNN from pre-trained parameters of existing models.
Another problem for embedded AI applications is that the computational resources on embedded platforms are very limited. To address this problem, various parameter pruning techniques have been proposed. Han et al. deeply compressed neural networks by pruning, trained quantization, and Huffman coding, and proposed an energy-efficient inference engine, which performs inference on the compressed network model and accelerates vector multiplication through weight sharing. Yu et al. proposed the Neuron Importance Score Propagation algorithm to better reduce redundant connections. Kang et al. proposed a pruning scheme for convolutional neural networks (CNNs) running on accelerators. Lin et al. proposed a global and dynamic pruning scheme to prune redundant filters for CNN acceleration. The basic idea of parameter pruning is to set some unimportant weights to zero, so that the computational cost is reduced while the performance of the whole network is retained. Parameter sparsity is defined as the proportion of zero-valued parameters. For different DNN models with the same architecture, the power consumption can vary significantly because of different parameter sparsity. In that case, beyond the DNN architecture, it is possible that the device power traces can be used to reveal the actual weights of a large portion of pre-trained neurons.
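As a minimal illustration of the sparsity definition above, the following Python sketch prunes a list of weights by magnitude and measures the proportion of zero-valued parameters. The function names `prune_by_magnitude` and `sparsity`, the threshold, and the example weights are assumptions made for illustration; the sketch does not reproduce any of the cited pruning schemes.

```python
from typing import List


def prune_by_magnitude(weights: List[float], threshold: float) -> List[float]:
    """Set every weight whose absolute value is below `threshold` to zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def sparsity(weights: List[float]) -> float:
    """Proportion of zero-valued parameters (0.0 for an empty list)."""
    if not weights:
        return 0.0
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical weights of one layer before and after pruning.
weights = [0.8, -0.02, 0.0, 0.31, -0.005, -0.6, 0.01, 0.12]
pruned = prune_by_magnitude(weights, threshold=0.05)
print(sparsity(weights), sparsity(pruned))
```

A higher threshold zeroes more weights, so the same architecture can end up with very different sparsity depending on how it was pruned.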
IV Side channel power based technique
In this section, we present our technique in detail. Specifically, we first develop power consumption theories and models for various DNN layers. Then, we discuss the impact of parameter sparsity on DNN power consumption. Finally, we describe our SCA-based technique.
IV-A DNN power models
First, we build power computation models for each kind of neural network layer.
IV-A1 Convolutional layer
As shown in Fig. 1 (a), a convolutional layer typically consists of many filters and is used to extract features at different levels of abstraction. Specifically, the convolutional operation can be described by

$$y = \sum_{c=1}^{C} \sum_{i=1}^{K} \sum_{j=1}^{K} w_{c,i,j}\, x_{c,i,j} + b \tag{1}$$

where $y$ represents the output of a single convolutional filter, $x$ represents the input value, $C$ represents the number of input channels, $K$ represents the filter kernel size, and $w$ and $b$ represent the parameter weights and bias, respectively. Therefore, according to Eq. (1), the total operation number of a single convolutional filter is calculated by

$$O_{f} = O_{m} + O_{a}, \qquad O_{m} = C K^{2}, \qquad O_{a} = C K^{2} \tag{2}$$

where $O_{m}$ is the number of multiplication operations and $O_{a}$ is the number of addition operations ($C K^{2} - 1$ additions to sum the products plus one addition for the bias). Assuming that the input size is $I \times I$, the filter stride is $S$, and the number of convolutional kernels is $N$, the total operation number of the convolutional layer is calculated by

$$O_{conv} = N \left( \frac{I - K}{S} + 1 \right)^{2} \left( O_{m} + O_{a} \right) \tag{3}$$

Thus, the power consumption of a single convolutional layer can be derived as

$$P_{conv} = N \left( \frac{I - K}{S} + 1 \right)^{2} \left( P_{m} O_{m} + P_{a} O_{a} \right) \tag{4}$$

where $P_{m}$ and $P_{a}$ are the average multiplication and addition operation power consumptions, respectively, of the device. It is observed that the power consumption of the convolutional layer is strongly correlated with the input and kernel sizes, i.e., the convolutional layer architecture.
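The convolutional-layer model can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the convolution is unpadded on a square input, each output needs as many additions as multiplications (the sums of products plus one bias addition), and the function names, the example dimensions (chosen to resemble the first layer of AlexNet), and the per-operation power values are hypothetical.

```python
from typing import Tuple


def conv_output_size(input_size: int, kernel_size: int, stride: int) -> int:
    """Spatial output width of an unpadded convolution on a square input."""
    return (input_size - kernel_size) // stride + 1


def conv_layer_ops(input_size: int, kernel_size: int, stride: int,
                   in_channels: int, num_filters: int) -> Tuple[int, int]:
    """Return (multiplications, additions) for one convolutional layer."""
    positions = conv_output_size(input_size, kernel_size, stride) ** 2
    mults_per_output = in_channels * kernel_size ** 2
    # (mults - 1) additions to sum the products, plus one for the bias.
    adds_per_output = mults_per_output
    total_outputs = positions * num_filters
    return total_outputs * mults_per_output, total_outputs * adds_per_output


def conv_layer_power(ops: Tuple[int, int], p_mul: float, p_add: float) -> float:
    """Weight the operation counts by assumed per-operation power values."""
    mults, adds = ops
    return p_mul * mults + p_add * adds


# Hypothetical layer: 227x227x3 input, 96 filters of size 11x11, stride 4.
ops = conv_layer_ops(227, 11, 4, in_channels=3, num_filters=96)
print(ops, conv_layer_power(ops, p_mul=2.0, p_add=1.0))
```

Changing the kernel size, stride, or filter count changes the operation totals directly, which is the dependence on architecture that the attack exploits.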
IV-A2 Pooling layer
To reduce the growing feature dimensions without hurting the performance, there is typically a pooling layer between two consecutive convolutional layers, as shown in Fig. 1 (b). Without loss of generality, we employ the maximum pooling method here. Thus, the total number of operations for the pooling layer is

$$O_{pool} = \left( \frac{I - K}{S} + 1 \right)^{2} \left( K^{2} - 1 \right) \tag{5}$$

where, reusing the notation for simplicity, $I$ is the input size, $S$ is the pooling stride, and $K$ is the pooling window size. Therefore, the power consumption of a pooling layer is calculated by

$$P_{pool} = P_{c}\, O_{pool} \tag{6}$$

where $P_{c}$ represents the average comparison operation power consumption of the device. Eq. (6) indicates that the pooling layer power consumption is also largely determined by the DNN architecture.
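A small Python sketch of the pooling count follows. It assumes unpadded max pooling on a square input and counts one comparison fewer than the number of values in each window, since finding the maximum of n values takes n - 1 comparisons. The `channels` argument and the example sizes are illustrative assumptions that do not come from the text.

```python
def max_pool_ops(input_size: int, window: int, stride: int, channels: int = 1) -> int:
    """Number of comparisons performed by one max pooling layer."""
    out = (input_size - window) // stride + 1
    comparisons_per_window = window * window - 1
    return out * out * comparisons_per_window * channels


def max_pool_power(ops: int, p_cmp: float) -> float:
    """Scale the comparison count by an assumed per-comparison power value."""
    return p_cmp * ops


# Hypothetical layer: 55x55 feature maps, 3x3 window, stride 2.
single_map = max_pool_ops(55, 3, 2)
all_maps = max_pool_ops(55, 3, 2, channels=96)
print(single_map, all_maps, max_pool_power(all_maps, p_cmp=0.5))
```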
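IV-A3 Fully connected layer
After the convolutional and pooling layers, DNNs typically use fully connected layers, as shown in Fig. 1 (c), to process the extracted features. The fully connected part consists of several layers of fully connected neurons. The operation number of a single fully connected layer can be calculated by

$$O_{fc} = O_{m} + O_{a} = N_{in} N_{out} + N_{in} N_{out} \tag{7}$$

where $N_{in}$ is the number of input neurons and $N_{out}$ is the number of output neurons. Thus, the power of a fully connected layer can be derived as

$$P_{fc} = P_{m} N_{in} N_{out} + P_{a} N_{in} N_{out} \tag{8}$$

with $P_{m}$ and $P_{a}$ denoting the average multiplication and addition operation power consumptions of the device.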
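For a fully connected layer the count is a plain product of the layer widths, as the short Python sketch below shows. It assumes one multiplication and one addition per weight (the additions include the bias), and the function names and per-operation power values are hypothetical.

```python
from typing import Tuple


def fc_layer_ops(n_in: int, n_out: int) -> Tuple[int, int]:
    """Return (multiplications, additions) for one fully connected layer."""
    mults = n_in * n_out
    # Per output neuron: (n_in - 1) additions for the sum plus one for the bias.
    adds = n_in * n_out
    return mults, adds


def fc_layer_power(n_in: int, n_out: int, p_mul: float, p_add: float) -> float:
    """Weight the operation counts by assumed per-operation power values."""
    mults, adds = fc_layer_ops(n_in, n_out)
    return p_mul * mults + p_add * adds


# Hypothetical classifier head: 4096 inputs, 1000 output classes.
print(fc_layer_ops(4096, 1000), fc_layer_power(4096, 1000, p_mul=2.0, p_add=1.0))
```

Fine-tuning that only changes the number of output classes therefore changes the cost of the last layer in proportion to the new width.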
IV-A4 Activation function
The value of each neuron is passed through an activation function, as shown in Fig. 1 (d). There are many different types of activation functions, such as ReLU, tanh, sigmoid, and softmax. The total number of operations for the activation function is calculated by

$$O_{act} = \alpha N_{act} \tag{9}$$

where $N_{act}$ is the number of neurons to which the activation function is applied and $\alpha$ is the operational coefficient, which is determined by the specific type of the activation function. Therefore, the power consumption of the activation function can be derived as

$$P_{act} = P_{\alpha}\, \alpha N_{act} \tag{10}$$

where $P_{\alpha}$ is the power consumed by one operation in the activation function. In general, the power consumption of the activation function is linear in the number of inputs and relatively small compared with that of the other layers.
IV-A5 Overall power consumption
So far, we have built the power consumption models for the major components of modern DNNs. There are many other special operations. However, their computational cost is usually negligible. Therefore, in this work, we construct our general power model based on Eq. (4), (6), (8), and (10). Fig. 2 shows a simple example of our model. This simple DNN consists of five convolutional layers, three pooling layers, three fully connected layers, and common activation functions. The operation of each layer can be represented by our corresponding power model.
It is observed that the variation of DNN architecture can have significant impact on the power consumptions. Thus, theoretically we can identify the specific architecture by analyzing and classifying the operational power traces.
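To make this observation concrete, the Python sketch below sums per-layer terms for two toy architectures that differ only in the number of convolutional filters. Everything in it is an assumption for illustration: the `OpPower` values, the tuple-based layer format, the unpadded square inputs, and an activation with operational coefficient 1 after every convolutional and fully connected layer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OpPower:
    """Hypothetical average power per operation type (arbitrary units)."""
    mul: float = 1.0
    add: float = 1.0
    cmp: float = 1.0
    act: float = 1.0


def out_size(size: int, window: int, stride: int) -> int:
    """Output width of an unpadded sliding window over a square input."""
    return (size - window) // stride + 1


def model_power(p: OpPower, size: int, channels: int, layers: List[Tuple]) -> float:
    """Sum per-layer power terms for a simple feed-forward layer list.

    Assumed layer format:
      ("conv", kernel, stride, filters), ("pool", window, stride), ("fc", outputs)
    """
    total = 0.0
    features = size * size * channels  # values entering the next layer
    for layer in layers:
        kind = layer[0]
        if kind == "conv":
            _, kernel, stride, filters = layer
            size = out_size(size, kernel, stride)
            outputs = size * size * filters
            per_output = channels * kernel * kernel  # multiplications = additions
            total += (p.mul + p.add) * outputs * per_output + p.act * outputs
            channels, features = filters, outputs
        elif kind == "pool":
            _, window, stride = layer
            size = out_size(size, window, stride)
            outputs = size * size * channels
            total += p.cmp * outputs * (window * window - 1)
            features = outputs
        elif kind == "fc":
            _, n_out = layer
            total += (p.mul + p.add) * features * n_out + p.act * n_out
            features = n_out
        else:
            raise ValueError(f"unknown layer kind: {kind}")
    return total


narrow = [("conv", 3, 1, 8), ("pool", 2, 2), ("fc", 10)]
wide = [("conv", 3, 1, 16), ("pool", 2, 2), ("fc", 10)]
unit = OpPower()
print(model_power(unit, 8, 1, narrow), model_power(unit, 8, 1, wide))
```

Doubling the filter count roughly doubles the modelled total here, which is the kind of gap a classifier can pick up from measured traces.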
IV-B Parameter sparsity model
In the previous section, we have generalized the models to identify the DNN architectures. To further improve the efficiency of adversarial attacks on DNNs, it is desirable to derive the actual parameters, such as the weights and biases, of each neuron. However, unlike the architecture, different parameter values cause only insignificant power variation, unless the value is zero. It is observed that many AI models use at least partially pre-trained parameters. Moreover, some advanced AI chips have hardware-level parameter pruning to improve the computational efficiency while maintaining the performance. Combining the pre-training and parameter pruning techniques, it is possible to further identify the actual parameter values.
It is assumed that the pre-training and pruning techniques for various models are known to the attackers. Thus, during the operation of the AI device, different pre-trained parameter sets with various pruning methods can generate unique sparsity in the neurons, which can lead to distinguishable power traces. Parameter pruning mainly occurs in the convolutional and fully connected layers. The power consumption of the convolutional layer and that of the fully connected layer are thus changed to

$$\tilde{P}_{conv} = \rho_{conv}\, P_{conv} \tag{11}$$

$$\tilde{P}_{fc} = \rho_{fc}\, P_{fc} \tag{12}$$

where $\rho_{conv}$ and $\rho_{fc}$ are the parameter sparsity coefficients of the convolutional layer and the fully connected layer, respectively, and $P_{conv}$ and $P_{fc}$ are the unpruned power consumptions given by Eq. (4) and Eq. (8). When no pruning operation is applied, $\rho_{conv}$ and $\rho_{fc}$ are simply set to 1. Different pre-trained parameters combined with different parameter pruning techniques can lead to significant power variations. Thus, this observation can be used to identify the actual parameter values of the DNN model in the AI device, which makes it even more vulnerable.
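The effect of the sparsity coefficients can be sketched as a simple scaling of the convolutional and fully connected terms. In this minimal Python sketch, the function name `pruned_power`, the split of a model's power into three parts, and all numeric values are assumptions for illustration; a coefficient of 1 means no pruning, as stated above.

```python
def pruned_power(p_conv: float, p_fc: float, p_other: float,
                 rho_conv: float = 1.0, rho_fc: float = 1.0) -> float:
    """Scale the conv and fc power terms by their sparsity coefficients.

    `p_other` stands for layers that pruning leaves unchanged (an assumption
    here: pooling and activation).
    """
    for rho in (rho_conv, rho_fc):
        if not 0.0 <= rho <= 1.0:
            raise ValueError("sparsity coefficient must lie in [0, 1]")
    return rho_conv * p_conv + rho_fc * p_fc + p_other


# Hypothetical unpruned model: 600 units in conv, 300 in fc, 100 elsewhere.
dense = pruned_power(600.0, 300.0, 100.0)
levels = [pruned_power(600.0, 300.0, 100.0, rho, rho) for rho in (1.0, 0.8, 0.6, 0.4)]
print(dense, levels)
```

Each coefficient setting yields a different total for the same architecture, so sparsity levels become separable classes as long as the gaps exceed the measurement noise.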
IV-C General power model
Based on the above DNN architecture and parameter power models, we build our general power model to classify the specific architecture and the corresponding parameter values. It includes training and testing phases.
The framework of our SCA on DNN models is shown in Fig. 3. We continuously collect voltage and current data while the AI device is running a model, and calculate the power and its features to form a power-feature data set. We then randomly choose part of the data to train a classifier, and feed the rest of the power-feature data into the trained classifier to obtain the model structure and the parameter sparsity. More specifically, for each model, we sample pairs of current and voltage data (the number of pairs differs across DNNs, since they take different amounts of time to process a single image) and then calculate the average, median, and standard deviation of the power as the power features. For each model, we repeat the above sampling and data processing many times to form a power-feature data set. Finally, we design a classifier and use the data set for training. To further evaluate the parameters, we set different sparsity levels for each model, so the final labels include both the model structure and the parameter sparsity.
Our SCA is summarized as Algorithm 1, where we use the classifier, the voltage array, and the current array as the inputs, and then use the classifier to generate the outputs.
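The inference step described above can be sketched as follows. This is not the authors' Algorithm 1, which is not reproduced in this text, but a minimal Python sketch under stated assumptions: the classifier is any callable that maps a feature tuple to a (structure, sparsity) label, the sample standard deviation is used, and `toy_classifier` with its threshold and labels is purely hypothetical.

```python
import statistics
from typing import Callable, Sequence, Tuple

Features = Tuple[float, float, float]
Label = Tuple[str, float]


def power_features(voltage: Sequence[float], current: Sequence[float]) -> Features:
    """Mean, median, and standard deviation of the instantaneous power V * I."""
    if len(voltage) != len(current) or len(voltage) < 2:
        raise ValueError("need two equally long arrays with at least two samples")
    power = [v * i for v, i in zip(voltage, current)]
    return statistics.mean(power), statistics.median(power), statistics.stdev(power)


def identify(classifier: Callable[[Features], Label],
             voltage: Sequence[float], current: Sequence[float]) -> Label:
    """Turn one voltage/current trace into features and classify it."""
    return classifier(power_features(voltage, current))


def toy_classifier(features: Features) -> Label:
    """Hypothetical stand-in for a trained classifier: threshold on mean power."""
    mean_power = features[0]
    return ("large_net", 1.0) if mean_power > 3.0 else ("small_net", 1.0)


voltage = [5.0, 5.0, 5.0, 5.0]
current = [0.4, 0.6, 0.5, 0.5]
print(power_features(voltage, current), identify(toy_classifier, voltage, current))
```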
In this section, we describe the experiments to validate the effectiveness of our techniques.
V-A Experiment setup
To extract valid power data, we use an external data acquisition card running at 400. Fig. 4 shows an example of our data collection setup.
In this example, we use Alexnet for image classification and continuously input 24 images (an epoch). Processing the DNN requires a large amount of computing resources, so the device power increases significantly. We can observe 24 peaks, corresponding to the 24 images processed. Note that there is a start phase and an end phase when the device runs the DNN model. Therefore, we remove those low-power phases and take the middle part of the data (identified by sequence number in the figure) as the input. Because the instantaneous power fluctuates during inference, we group the inference of every five images together, take all the power data in the group, and calculate their mean, median, and variance as the power-feature data of the model.
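The trimming and grouping steps can be sketched in Python as below. The sketch makes several assumptions that the text does not specify: idle phases are detected with a fixed power threshold, the trace has already been segmented into one list of samples per image, population variance is used, and images left over after the last full group of five are dropped.

```python
import statistics
from typing import List, Sequence, Tuple


def trim_idle(power: Sequence[float], threshold: float) -> List[float]:
    """Keep the samples between the first and last one above `threshold`."""
    active = [idx for idx, p in enumerate(power) if p > threshold]
    if not active:
        return []
    return list(power[active[0]:active[-1] + 1])


def group_features(per_image: Sequence[Sequence[float]],
                   group_size: int = 5) -> List[Tuple[float, float, float]]:
    """Mean, median, and variance of the power over each group of images."""
    features = []
    for start in range(0, len(per_image) - group_size + 1, group_size):
        samples = [p for image in per_image[start:start + group_size] for p in image]
        features.append((statistics.mean(samples),
                         statistics.median(samples),
                         statistics.pvariance(samples)))
    return features


# Hypothetical trace with idle samples (1.0) around the active region.
print(trim_idle([1.0, 1.0, 3.0, 4.0, 1.0, 5.0, 1.0, 1.0], threshold=2.0))
# Hypothetical epoch of 24 images with two power samples each.
print(len(group_features([[1.0, 3.0]] * 24)))
```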
We implement six common DNN models on the Raspberry Pi and test them with the same set of images. After data acquisition and processing, we obtain a power-feature data set, divided into 6 categories, as shown in Fig. 5.
V-B Architecture identification
In this work, we use machine learning techniques to identify the DNN architectures. First, we randomly divide the power-feature data set into a training set and a testing set with a ratio of 4:1. Here, we simply employ the widely used SVM classifier. The results on the test set are shown in Fig. 6, where the red bars are the architecture identification accuracies. The average classification accuracy reaches 96.50%. It should be noted that we do not intend to compare the performance of different machine learning algorithms, since this work focuses on proposing the overall SCA framework and SVM already achieves quite high accuracy. With more advanced algorithms, the experimental results could be further improved.
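The evaluation protocol can be sketched with the Python standard library alone. Note the substitution: the experiments above use an SVM, whereas this sketch swaps in a nearest-centroid classifier so that it has no external dependencies. The synthetic features, the class names `net_a` and `net_b`, and the seeds are assumptions for illustration and are not measured data.

```python
import random
from collections import defaultdict


def split_4_to_1(samples, seed=0):
    """Shuffle and split into training and testing sets with a 4:1 ratio."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


def fit_centroids(train):
    """Average the feature vectors of each label (stand-in for SVM training)."""
    sums, counts = {}, defaultdict(int)
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, value in enumerate(features):
            acc[k] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(features, centroids[lab])))


def accuracy(centroids, test):
    return sum(predict(centroids, f) == label for f, label in test) / len(test)


# Synthetic (mean, median, spread) features for two hypothetical networks.
rng = random.Random(1)
centers = {"net_a": (2.0, 2.0, 0.3), "net_b": (5.0, 5.1, 0.6)}
data = [(tuple(c + rng.uniform(-0.2, 0.2) for c in center), label)
        for label, center in centers.items() for _ in range(50)]
train, test = split_4_to_1(data)
centroids = fit_centroids(train)
print(len(train), len(test), accuracy(centroids, test))
```

The two synthetic classes are separated by much more than their noise, so the example only demonstrates the split-train-test flow and says nothing about accuracy on real power traces.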
| Architecture | Fine-tuning method | Original setting |
|---|---|---|
| Alexnet | Change the number of the last three layers | 4096*4096*1000 |
| InceptionV3 | Change the number of the output layer | 1000 |
| Resnet50 | Change the number of the output layer and the different building blocks | 1000,(3,4,6,3) |
| Resnet101 | Change the number of the output layer and the different building blocks | 1000,(3,4,23,3) |
| MobilenetV1 | Change the number of the output layer | 1000 |
| MobilenetV2 | Change the number of the output layer | 1000 |
In the previous experiment setup, we assume that the pre-trained DNNs are deployed exactly unchanged. However, this may not be true in real-world applications. In general, many DNN models can be divided into two parts. The first part performs feature extraction and includes, e.g., the convolutional and pooling layers. The second part performs classification and includes, e.g., the fully connected layers. In real-world applications, the feature extraction part typically undergoes little or no change, whereas the classification part is usually fine-tuned for the actual application.
Therefore, to further explore the performance of our technique with changing classification layers, we repeat the experiment with varying fully connected layer hyper-parameters. We employ three fine-tuning methods for each original architecture, as shown in Table I, with each architecture containing four entries. The first one is the original structure and the others are the fine-tuned ones. The power traces from all the fine-tuned architectures are mixed with the original ones. Then the architecture identification process is repeated on the new data set. The classification results for the fine-tuned models are also shown in Fig. 6 (blue bars). The average performance is slightly reduced to 95.17%, which demonstrates that our SCA is quite robust even with fine-tuning of the classification layers.
V-C Parameter evaluation
The evaluation of the model parameters is also important. For example, the traditional white-box attack requires full knowledge of the DNN model, including the parameter values. However, estimating the model parameters is usually quite challenging. In this work, we assume that pre-trained architectures and partial model parameters could be employed in certain real-world applications. Moreover, due to techniques such as parameter tuning, different models could have varying sparsity, i.e., the ratio of zeros. We thus use the model sparsity as a feature to infer the model parameters.
Specifically, we employ dropout in all the convolutional and fully connected layers and set different dropout scales to derive various levels of parameter sparsity. The dropout scale is set to 1.0, 0.8, 0.6, and 0.4, respectively. In this experiment, we select four DNN models. The classification results of the same model with varying parameter sparsity are shown in Table II. The average identification rate is about 76.38%. Alexnet is relatively hard to identify, since it is relatively small and consumes less power. In general, for the large networks, the accuracy is well above 82%.
We have performed model architecture and model parameter estimation. However, these two processes may affect each other. For example, the varying model sparsity may lead to wrong identification of the model architecture. Therefore, we need to check the general model identification results which include both varying architecture and varying parameters.
Thus, by mixing the architecture and parameter sparsity variations, we have a total of 16 different categories in the data set. The experimental results are shown in Fig. 7. The confusion matrix demonstrates that, even with varying parameter sparsity, we can still obtain the structure of a DNN model with relatively high accuracy, i.e., more than 95%. The accuracy of parameter sparsity recognition under this fine-grained classification task is 75.88%, which is slightly lower but generally acceptable.
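How the two accuracies relate under joint labels can be sketched in Python as below. The architecture names are placeholders, the sparsity levels mirror the dropout scales used in this section, and the predictions are made up for illustration; none of the numbers are experimental results.

```python
from itertools import product
from typing import List, Tuple

Label = Tuple[str, float]

architectures = ["arch_a", "arch_b", "arch_c", "arch_d"]  # placeholder names
sparsity_levels = [1.0, 0.8, 0.6, 0.4]
categories: List[Label] = list(product(architectures, sparsity_levels))


def accuracies(true: List[Label], pred: List[Label]) -> Tuple[float, float]:
    """Return (architecture accuracy, joint architecture-and-sparsity accuracy)."""
    if len(true) != len(pred) or not true:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(true)
    arch_hits = sum(t[0] == p[0] for t, p in zip(true, pred))
    joint_hits = sum(t == p for t, p in zip(true, pred))
    return arch_hits / n, joint_hits / n


# One sample per category, with three made-up mistakes.
pred = list(categories)
pred[1] = ("arch_a", 0.6)   # right architecture, wrong sparsity
pred[6] = ("arch_b", 1.0)   # right architecture, wrong sparsity
pred[15] = ("arch_c", 0.4)  # wrong architecture
print(len(categories), accuracies(categories, pred))
```

A prediction with the wrong sparsity still counts toward architecture accuracy, which is why the architecture figure can stay high while the joint figure drops.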
In this paper, we propose a side-channel attack (SCA) method to reveal the internal structure and model parameters of DNN models. We design a Raspberry Pi-based platform to derive the power signature of embedded AI devices, and then use machine learning algorithms to identify the specific DNN architectures. Moreover, we vary the parameter sparsity to model the pre-training of DNNs. In general, our technique can identify both the architecture and the model parameters with quite high accuracy, indicating that close attention should be paid to the security of many AI applications.
In the future, we will first improve the experimental platform by considering more diverse DNN architectures and parameters; then, we will use more advanced machine learning algorithms to identify DNN models more precisely; finally, we will try to propose defensive strategies to protect the model information from SCA.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
-  Q. Xuan, B. Fang, Y. Liu, J. Wang, J. Zhang, Y. Zheng, and G. Bao, “Automatic pearl classification machine based on a multistream convolutional neural network,” IEEE Transactions on Industrial Electronics, vol. 65, no. 8, pp. 6538–6547, 2018.
-  Q. Xuan, Z. Chen, Y. Liu, H. Huang, G. Bao, and D. Zhang, “Multiview generative adversarial network and its application in pearl classification,” IEEE Transactions on Industrial Electronics, vol. 66, no. 10, pp. 8244–8252, Oct. 2019, DOI 10.1109/TIE.2018.2885684.
-  L. Wen, X. Li, L. Gao, and Y. Zhang, “A new convolutional neural network-based data-driven fault diagnosis method,” IEEE Transactions on Industrial Electronics, vol. 65, no. 7, pp. 5990–5998, Jul. 2018, DOI 10.1109/TIE.2017.2774777.
-  Y. Liu, J.-Y. Wu, K. Liu, H.-L. Wen, Y. Yao, S. Sfarra, and C. Zhao, “Independent component thermography for non-destructive testing of defects in polymer composites,” Measurement Science and Technology, vol. 30, no. 4, p. 044006, Mar. 2019, DOI 10.1088/1361-6501/ab02db. [Online]. Available: https://doi.org/10.1088%2F1361-6501%2Fab02db
-  J. Chen, Y. Yang, K. Hu, Q. Xuan, Y. Liu, and C. Yang, “Multiview transfer learning for software defect prediction,” IEEE Access, vol. 7, pp. 8901–8916, 2019.
-  Y. Liu, Y. Fan, and J. Chen, “Flame images for oxygen content prediction of combustion systems using dbn,” Energy & Fuels, vol. 31, no. 8, pp. 8776–8783, 2017.
-  Y. Liu, C. Yang, Z. Gao, and Y. Yao, “Ensemble deep kernel learning with application to quality prediction in industrial polymerization processes,” Chemometrics and Intelligent Laboratory Systems, vol. 174, pp. 15–21, 2018.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” arXiv preprint arXiv:1801.04381, 2018.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
-  I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song, “Robust physical-world attacks on machine learning models,” CoRR, vol. abs/1707.08945, 2017. [Online]. Available: http://arxiv.org/abs/1707.08945
-  A. Rozsa, M. Günther, E. M. Rudd, and T. E. Boult, “Facial attributes: Accuracy and adversarial robustness,” Pattern Recognition Letters, 2017.
-  V. Mirjalili and A. Ross, “Soft biometric privacy: Retaining biometric utility of face images while perturbing gender,” in 2017 IEEE International joint conference on biometrics (IJCB), pp. 564–573. IEEE, 2017.
-  M. Melis, A. Demontis, B. Biggio, G. Brown, G. Fumera, and F. Roli, “Is deep learning safe for robot vision? adversarial examples against the icub humanoid,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 751–759, 2017.
-  N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,” arXiv preprint arXiv:1801.00553, 2018.
-  J. Su, D. V. Vargas, and S. Kouichi, “One pixel attack for fooling deep neural networks,” arXiv preprint arXiv:1710.08864, 2017.
-  F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, “The space of transferable adversarial examples,” arXiv preprint arXiv:1704.03453, 2017.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
-  A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” arXiv preprint arXiv:1607.02533, 2016.
-  J. Chen, H. Zheng, H. Xiong, and M. Su, “Finefool: Fine object contour attack via attention,” CoRR, vol. abs/1812.01713, 2018. [Online]. Available: http://arxiv.org/abs/1812.01713
-  P. C. Kocher, “Timing attacks on implementations of diffie-hellman, rsa, dss, and other systems,” in Annual International Cryptology Conference, pp. 104–113. Springer, 1996.
-  P. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” in Annual International Cryptology Conference, pp. 388–397. Springer, 1999.
-  Y. Zhou and D. Feng, “Side-channel attacks: Ten years after its publication and the impacts on cryptographic module security testing.” IACR Cryptology ePrint Archive, vol. 2005, p. 388, 2005.
-  A. Singla, L. Yuan, and T. Ebrahimi, “Food/non-food image classification and food categorization using pre-trained googlenet model,” in Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management, pp. 3–11. ACM, 2016.
-  M. Schwarz, H. Schulz, and S. Behnke, “Rgb-d object recognition and pose estimation based on pre-trained convolutional neural network features,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on, pp. 1329–1335. IEEE, 2015.
-  Keras, “Using pre-trained models,” https://cran.rstudio.com/web/packages/keras/vignettes/applications.html, accessed November 18, 2018.
-  Y. Bengio et al., “Learning deep architectures for ai,” Foundations and trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
-  J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
-  Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., “Dadiannao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622. IEEE Computer Society, 2014.
-  S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-x: An accelerator for sparse neural networks,” in The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 20. IEEE Press, 2016.
-  Y. Yarom and K. Falkner, “Flush+ reload: A high resolution, low noise, l3 cache side-channel attack.” in USENIX Security Symposium, vol. 1, pp. 22–25, 2014.
-  D. Genkin, A. Shamir, and E. Tromer, “Rsa key extraction via low-bandwidth acoustic cryptanalysis,” in International cryptology conference, pp. 444–461. Springer, 2014.
-  J. Liu, Y. Yu, F. Standaert, Z. Guo, D. Gu, W. Sun, Y. Ge, and X. Xie, “Small tweaks do not help: Differential power analysis of milenage implementations in 3g/4g usim cards,” pp. 468–480, 2015.
-  K. Tiri and I. Verbauwhede, “A vlsi design flow for secure side-channel attack resistant ics,” in Proceedings of the conference on Design, Automation and Test in Europe-Volume 3, pp. 58–63. IEEE Computer Society, 2005.
-  V. Duddu, D. Samanta, D. V. Rao, and V. E. Balas, “Stealing neural networks via timing side channels,” CoRR, vol. abs/1812.11720, 2018. [Online]. Available: http://arxiv.org/abs/1812.11720
-  L. Batina, S. Bhasin, D. Jap, and S. Picek, “CSI neural network: Using side-channels to recover your artificial neural network information,” CoRR, vol. abs/1810.09076, 2018. [Online]. Available: http://arxiv.org/abs/1810.09076
-  L. Wei, Y. Liu, B. Luo, Y. Li, and Q. Xu, “I know what you see: Power side-channel attack on convolutional neural network accelerators,” arXiv preprint arXiv:1803.05847, 2018.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Advances in neural information processing systems, pp. 2377–2385, 2015.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning, pp. 448–456, 2015.
-  S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
-  S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 243–254. IEEE, 2016.
-  R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis, “Nisp: Pruning networks using neuron importance score propagation,” arXiv preprint arXiv:1711.05908, 2017.
-  H.-J. Kang, “Accelerator-aware pruning for convolutional neural networks,” arXiv preprint arXiv:1804.09862, 2018.
-  S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang, “Accelerating convolutional networks via global & dynamic filter pruning.” in IJCAI, pp. 2425–2432, 2018.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, vol. abs/1207.0580, 2012. [Online]. Available: http://arxiv.org/abs/1207.0580