1 Introduction
Neural networks (NNs) Schmidhuber15 ; HTF01
are layered computational networks that try to imitate the function of neurons in a human brain during object recognition, decision making, and other cognitive tasks. They are one of the most widely used machine learning techniques due to their good performance in practice. Certain variants of neural networks have been shown to be more suitable for particular learning applications. For instance, deep convolutional neural networks (CNNs) JKR09 ; KSH12 were found to be effective at recognizing and classifying images. Recurrent neural networks (RNNs) GLF09 ; HAF14 provide stronger performance on sequence prediction tasks, e.g., speech or text recognition. A neural network is determined by the connections between the neurons and the weights/biases associated with them, which are trained using the backpropagation algorithm RB93 ; Nielsen89 . In order to fit highly nonlinear functions and thus achieve a high rate of correctness in practice, neural networks usually contain many layers, each containing a large number of weights. The most power- and time-consuming computation in a neural network is the matrix-vector multiplication between the weights and the incoming signals. In power-restricted applications, such as an inference engine in an embedded system, the size of the neural network is effectively limited by the power consumption of the computations. One attractive method of lowering this power consumption is neuromorphic computing ZADFBG10 , where analog signals passed through memory-cell arrays are sensed to accomplish matrix-vector multiplications. Here weights are programmed as conductances (reciprocals of resistances) of memory cells in a two-dimensional array, such as resistive RAM (ReRAM), phase-change RAM (PCM), or NAND flash MME13 . According to Ohm's law, if input signals are presented as voltages to this layer of memory cells, the output current is the matrix-vector multiplication of the weights and the input signals. The massive parallelism of this operation also promises significant savings in latency over sequential digital computations.
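The analog computation described above can be illustrated with a minimal sketch (an idealized, noise-free model of a crossbar; the array dimensions and values are arbitrary):

```python
import numpy as np

# Idealized memory-cell crossbar: weights are stored as conductances,
# inputs are applied as voltages, and the column currents realize the
# matrix-vector product in one analog step (Ohm's law plus Kirchhoff's
# current law). This is a noise-free model; real cells suffer from
# programming and sensing noise, which is the subject of this paper.

def crossbar_matvec(weights, inputs):
    """Ideal crossbar: currents = conductances @ voltages."""
    conductances = weights   # weights programmed as cell conductances
    voltages = inputs        # input signals presented as voltages
    currents = conductances @ voltages
    return currents

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)

# In this idealized model the analog result matches the digital
# matrix-vector product exactly.
assert np.allclose(crossbar_matvec(W, x), W @ x)
```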
One of the problems with neuromorphic computing is the limited precision of the computations: the programming of memory cells, as well as the measurement of the analog outputs, inevitably suffers from analog noise, which harms the performance of deep learning systems. Some researchers have proposed to correct the errors between the A/D and D/A converters of each layer FWI18 ; JRRR17 , which induces extra overhead in latency, space, and power. The existing noisy-inference literature for neuromorphic computing is mostly in the device/circuit domain, typically focusing on redesigning memory devices/circuits to reduce the noise power. To our knowledge, however, the robustness of neural networks against analog and noisy computations inside the network has not been studied in detail from an algorithmic point of view. The first goal of this paper is to present methods and observations that mitigate the harmful effects of inference noise.
In the course of this study, we find that noisy inference can also help defend against black-box adversarial attacks on neural networks, resistance to which is becoming crucial for a wide range of applications, such as deep learning systems deployed in financial institutions and autonomous driving. Adversarial attacks manipulate the inputs to a neural network and generate adversarial examples that are barely perceptibly different from a natural image but lead to incorrect predictions Szegedy2013IntriguingPO . White-box attacks fall into three categories.

Single-step attacks, e.g., the fast gradient sign method (FGSM) GoodfellowExplaining14 ,
$x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x J(x, y))$,
where $x$ and $x'$ are the natural and adversarial examples, respectively, $\epsilon$ is the upper bound on the $\ell_\infty$ distance between $x$ and $x'$, $J(x, y)$ is the loss function for an input $x$ being predicted to have label $y$, and $\nabla_x$ is the derivative operator with respect to the input $x$.
Iterative methods, e.g., iterative FGSM KurakinAdversarialPhysical16 ,
$x^{t+1} = \mathrm{Clip}_{x,\epsilon}\!\left( x^{t} + \alpha \cdot \mathrm{sign}(\nabla_x J(x^{t}, y)) \right)$,
where $\alpha$ is the learning rate of the adversary and $\mathrm{Clip}_{x,\epsilon}$ clips its output to lie within the $\epsilon$-cube around $x$, such that the $\ell_\infty$ distance between $x'$ and $x$ is always at most $\epsilon$.

Optimization-based attacks, e.g., CarliniEvaluatingRobustness16 ,
$\min_{x'} \; \|x' - x\|_p + c \cdot f(x')$,
where $f$ is a surrogate loss that is small when $x'$ is misclassified and $c$ balances the norm distortion and the loss function.
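The iterative attack above can be sketched as follows (a toy numpy illustration using a stand-in quadratic loss in place of a network's loss; `eps`, `alpha`, and the clipping mirror the description in the text, not any particular implementation):

```python
import numpy as np

def clip_to_cube(x_adv, x_nat, eps):
    """Project x_adv back into the eps-cube around x_nat, then into [0, 1]."""
    return np.clip(np.clip(x_adv, x_nat - eps, x_nat + eps), 0.0, 1.0)

def iterative_fgsm(x_nat, grad_fn, eps=0.3, alpha=0.05, iters=20):
    """Iterative FGSM: repeatedly step along the sign of the loss gradient,
    clipping after every step so the l_inf constraint always holds."""
    x_adv = x_nat.copy()
    for _ in range(iters):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascend the loss
        x_adv = clip_to_cube(x_adv, x_nat, eps)          # stay within eps
    return x_adv

# Toy stand-in loss J(x) = 0.5 * ||x - t||^2, with gradient x - t.
target = np.full(5, 0.9)
grad = lambda x: x - target
x = np.full(5, 0.2)
x_adv = iterative_fgsm(x, grad)

# The adversarial example never leaves the eps-cube around x.
assert np.max(np.abs(x_adv - x)) <= 0.3 + 1e-9
```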
Black-box attacks are possible based on the transferability PapernotTransferability16 of adversarial examples: examples generated from one neural network often lead to incorrect predictions by another, independently trained neural network.
Several defensive mechanisms, such as distillation PapernotDistill15 , feature squeezing XuFeatureSqueezing17 , and adversary detection Feinman2017DetectingAS , have been proposed against such attacks. madry2018towards models the process as a min-max problem and builds a universal framework for attack and defense schemes, aiming to provide guaranteed performance against first-order attacks, i.e., attacks based solely on derivatives. They also suggest that adversarial training with projected gradient descent (PGD) provides the strongest robustness against first-order attacks. Based on their study, we further investigate noisy inference of adversarially trained neural networks. Experimental results show improved robustness against black-box adversarial attacks and indicate that adversarial examples are more sensitive to inference noise than natural examples.
The contribution of this paper is summarized as follows.
1. We model the analog noise of neuromorphic circuits as additive and multiplicative Gaussian noise. The impact of noisy inference on accuracy is shown to be severe for three neural network architectures (a regular CNN, a DenseNet HLW16 , and an LSTM-based RNN) on three datasets (MNIST, CIFAR-10, and MNIST stroke sequences JongMnistStroke2016 ); e.g., when the noise power equals the signal power, the accuracy is as low as , respectively.
2. We observe that the performance of noisy inference can be greatly improved by injecting noise into all layers of the neural network during training. Noise-injected training has been used as a regularization tool for better model generalization, but its application to noisy inference has not been studied in detail. We provide a quantitative measurement of the impact of inference noise on noiselessly trained and noise-injected trained neural networks, where the power of the training and inference noise might not match. The performance of noisy inference with low-to-medium noise power is improved to be almost as good as noiseless inference. For large noise power (equal to the signal power), the accuracy is increased to for the three datasets, respectively.
3. We further improve the performance of noisy inference under large noise power by proposing a voting mechanism. The accuracy is further increased to for the three datasets when the noise power equals the signal power. In addition, we observe that with noise-injected training and the proposed voting mechanism combined, noisy inference can give higher accuracy than noiseless inference for LSTM-based recurrent neural networks, which is counterintuitive since it is often believed that noise during inference is harmful rather than helpful to accuracy.
4. A further study on adversarially trained neural networks for MNIST and CIFAR-10 shows that noisy inference improves the robustness of deep neural networks against black-box attacks. The accuracy on adversarial examples is improved by and (in absolute values) for MNIST and CIFAR-10 when validated on a separately and adversarially trained CNN and DenseNet, respectively.
2 Preliminaries
A neural network contains input neurons, hidden neurons, and output neurons. It can be viewed as a function $f$ where the input $x$ is an $n$-dimensional vector and the output $y = f(x)$ is an $m$-dimensional vector. In this paper, we focus on classification problems where the output is usually normalized such that $\sum_i y_i = 1$ and $y_i$ can be viewed as the probability of some input $x$ being categorized as the $i$-th class. The normalization is often done by the softmax function, which maps an arbitrary $m$-dimensional vector $z$ into a normalized vector $\mathrm{softmax}(z)$ as $\mathrm{softmax}(z)_i = e^{z_i} / \sum_{j} e^{z_j}$. For top-$k$ decision problems, we return the $k$ categories with the largest outputs $y_i$. In particular, for hard decision problems where $k = 1$, the classification result is then $\arg\max_i y_i$. Each layer of the network computes

$\mathbf{y} = \sigma(W\mathbf{x} + \mathbf{b})$, (1)

where $W$ is the weight matrix, $\mathbf{b}$ is the bias vector, and $\sigma$ is an element-wise activation function that is usually nonlinear, e.g., tanh, sigmoid, the rectified linear unit (ReLU) HSMDS00 , or leaky ReLU MHN13 . Both $W$ and $\mathbf{b}$ are trainable parameters. In this paper, two noise models are assumed: an additive Gaussian noise model $\mathbf{n}_A \sim \mathcal{N}(\mathbf{0}, \sigma_A^2 I)$ and a multiplicative Gaussian noise model $\mathbf{n}_M \sim \mathcal{N}(\mathbf{1}, \sigma_M^2 I)$, injected in the forward pass after each matrix-vector multiplication. Then Eq. (1) becomes
$\mathbf{y} = \sigma(W\mathbf{x} + \mathbf{n}_A + \mathbf{b})$ (2)

for additive Gaussian noise, or

$\mathbf{y} = \sigma((W\mathbf{x}) \circ \mathbf{n}_M + \mathbf{b})$ (3)

for multiplicative Gaussian noise, where $\circ$ denotes element-wise multiplication. Additive Gaussian noise models procedures of neuromorphic computing where the noise power is independent of the signals, such as signal sensing, memory reading, and random electrical perturbations of circuits. For multiplicative noise, on the other hand, we can show that multiplying a signal $z$ by a unit-mean Gaussian random variable $n_M$ is equivalent to adding to the signal a zero-mean Gaussian random variable $n'$ whose standard deviation is proportional to the magnitude of the signal, as

$z \cdot n_M = z + z(n_M - 1) = z + n'$, (4)

where $n' \sim \mathcal{N}(0, \sigma_M^2 z^2)$. Therefore, multiplicative noise models procedures where the noise power is proportional to the signal power, such as memory programming and computations.
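This equivalence between multiplicative noise and signal-proportional additive noise can be checked empirically (a quick simulation; the values of `sigma` and `z` are arbitrary):

```python
import numpy as np

# Multiplying a signal z by n ~ N(1, sigma^2) equals adding n' ~ N(0, (sigma*|z|)^2):
# both constructions have mean z and standard deviation sigma * |z|.

rng = np.random.default_rng(1)
sigma, z, trials = 0.5, 3.0, 1_000_000

mult = z * rng.normal(loc=1.0, scale=sigma, size=trials)                 # z * n
add_equiv = z + rng.normal(loc=0.0, scale=sigma * abs(z), size=trials)   # z + n'

# Empirical means and standard deviations agree up to sampling error.
assert abs(mult.mean() - z) < 0.01
assert abs(mult.std() - sigma * abs(z)) < 0.01
assert abs(add_equiv.std() - sigma * abs(z)) < 0.01
```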
For both the additive and multiplicative Gaussian noise models, we denote by $\sigma_{\mathrm{train}}$ and $\sigma_{\mathrm{test}}$ the standard deviation of the noise during training and validation (inference), respectively. We write $\sigma_{A,\mathrm{train}}$, $\sigma_{A,\mathrm{test}}$, $\sigma_{M,\mathrm{train}}$, $\sigma_{M,\mathrm{test}}$ when a specific additive or multiplicative noise model is referred to. Note that $\sigma_{\mathrm{train}}$ is usually set to $0$ in conventional training; nonetheless, random noise is sometimes injected ($\sigma_{\mathrm{train}} > 0$) during training to provide better generalization. Conventional deep learning architectures assume $\sigma_{\mathrm{test}} = 0$, since digital circuits are assumed to have no errors during computations.
For both noise models, the signal-to-noise power ratio (SNR) is a defining parameter that measures the strength of the noise relative to the signal; it is defined as the ratio between the power of the signal and that of the noise. It is usually expressed in dB, where SNR(dB) $= 10 \log_{10}(\mathrm{SNR})$. For example, if the signal and noise have the same power, SNR $= 1$ (or $0$ dB). For multiplicative noise models, we have a constant SNR $= 1/\sigma_M^2$; on the other hand, SNRs for additive noise models with fixed $\sigma_A$'s depend highly on the signal power, and a comparison is fair only if the signal power is invariant. Therefore, we mainly use multiplicative noise models in our experiments, except in cases where we can normalize the signals to have constant power.
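The SNR bookkeeping used throughout can be sketched as follows (the `sigma_m` values are chosen to match examples appearing in the text):

```python
import math

# SNR = P_signal / P_noise; SNR(dB) = 10 * log10(SNR).
# For the multiplicative model, SNR = 1 / sigma_M^2, independent of signal power.

def snr_db(snr):
    """Convert an SNR power ratio to decibels."""
    return 10.0 * math.log10(snr)

# Noise power equal to signal power: sigma_M = 1, SNR = 1, i.e. 0 dB.
assert snr_db(1.0 / 1.0 ** 2) == 0.0

# sigma_M = 0.4 gives SNR = 1 / 0.16 = 6.25, about 7.96 dB.
assert abs(snr_db(1.0 / 0.4 ** 2) - 7.96) < 0.01
```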
3 Robustness of NNs against Noisy Inference
In this section we explore the robustness of neural networks against noisy computations as modeled in Eq. (2) and Eq. (3). We show the robustness of different neural network architectures for different datasets against noisy inference and provide two techniques that improve this robustness, namely noise-injected training and voting.
3.1 Datasets and NN architectures
Three datasets are used in our experiments: MNIST images (60,000 for training and 10,000 for validation), CIFAR-10 images (50,000 for training, 10,000 for validation, no data augmentation applied), and the MNIST stroke sequences JongMnistStroke2016 extracted from the MNIST images.
For MNIST images, we use a 6-layer convolutional neural network described in Table 1. The noiselessly trained model has a prediction accuracy of 99.5% for noiseless inference. For multiplicative Gaussian noise models, the parameters in the batch normalization layers are trainable. However, for additive Gaussian noise models, we use batch normalization layers with fixed parameters (as opposed to trainable mean and variance) to minimize signal power variations, for the reason mentioned in Section 2.

Table 1:
Layer | Output shape
Conv. 2D + Batch Norm + Gaussian Noise + ReLU |
Conv. 2D + Batch Norm + Gaussian Noise + ReLU |
Max-pooling |
Conv. 2D + Batch Norm + Gaussian Noise + ReLU |
Conv. 2D + Batch Norm + Gaussian Noise + ReLU |
Max-pooling |
Fully connected + Batch Norm + Gaussian Noise + ReLU |
Fully connected + Gaussian Noise |
For CIFAR-10, we use a densely connected convolutional neural network (DenseNet) HLW16 . DenseNet is an enhanced version of ResNet HeResNet15 in which the feature maps of all previous layers are presented as input to later convolutional layers. The depth of the DenseNet is 40 and the growth rate is 12. All parameters in the convolutional, fully-connected, and batch normalization layers are trainable, since fixed batch normalization parameters cannot provide satisfactory accuracy even when inference is noiseless. The total number of trainable parameters of the DenseNet is around one million. The noiselessly trained model has a prediction accuracy of 92.5% with noiseless inference, which is slightly below the reported value () in HeResNet15 . We inject multiplicative Gaussian noise after each matrix-vector multiplication.
For the MNIST stroke sequences extracted by JongMnistStroke2016 , we use an LSTM-based recurrent neural network whose cells correspond to the first two-dimensional coordinates of the pen-points of each written digit. If the total number of pen-points is less than , the sequence is padded. Since the pen-points are two-dimensional, the input dimension to each LSTM cell is 2. The number of hidden units in each cell is , and the last LSTM cell is connected to an output layer of 10 neurons for classification. The noiselessly trained model has an accuracy of around 94.8% with noiseless inference. The drop in accuracy from MNIST images to MNIST stroke sequences may be due to the facts that we have to truncate or pad pen-points to fit the LSTM cells and that gray-level information is lost when converting images to stroke sequences. There are four matrix-vector multiplications in each LSTM cell (see the four yellow boxes in the LSTM cells in Fig. 1) and one between the LSTM cell and the output layer, after which multiplicative noise is injected.

3.2 Noisy Inference by Noiseless and Noise-Injected Training
In this section, we show the impact of noisy inference on accuracy for noiselessly trained and noise-injected trained models. Noise-injected training applies noise layers during training; the information in the forward pass is thus changed, which in turn influences the weight updates during the backward pass. We train models with different $\sigma_{\mathrm{train}}$'s and use different $\sigma_{\mathrm{test}}$'s to test the accuracy of all trained models. That is, the noise power during training does not necessarily match the noise power during inference. We will show that such noise-injected training uniformly improves accuracy for all $\sigma_{\mathrm{test}}$'s.
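A minimal sketch of such a noise-injected forward pass (a single dense layer with multiplicative noise after the matrix-vector multiplication; the layer sizes and `sigma` values here are illustrative, not the paper's exact models):

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_dense(x, W, b, sigma):
    """Dense layer with multiplicative Gaussian noise injected right after
    the matrix-vector multiplication, followed by ReLU (cf. Eq. (3))."""
    pre = W @ x
    if sigma > 0:
        pre = pre * rng.normal(1.0, sigma, size=pre.shape)  # n_M ~ N(1, sigma^2)
    return np.maximum(pre + b, 0.0)                         # ReLU

W = rng.normal(size=(16, 8))
b = np.zeros(16)
x = rng.normal(size=8)

# sigma_train and sigma_test need not match; training with sigma > 0 changes
# the forward pass and hence the gradients/weight updates in backpropagation.
out_clean = noisy_dense(x, W, b, sigma=0.0)
out_noisy = noisy_dense(x, W, b, sigma=0.4)
assert out_clean.shape == out_noisy.shape == (16,)
```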
3.2.1 The CNN for MNIST images
For MNIST images with the 6-layer CNN architecture, we train for 100 epochs using stochastic mini-batch training with the Adam optimizer and batch size 32, for each $\sigma_{\mathrm{train}}$ from $0$ to $1.0$ with a step size of $0.1$. The accuracy of the noiselessly trained model with noiseless inference is around 99.5%. The trained models are then tested on the validation set with $\sigma_{\mathrm{test}}$ from $0$ to $1.0$ with a step size of $0.1$. We present the results for a subset of $\sigma$'s in Figure 2. The validation accuracy for each $(\sigma_{\mathrm{train}}, \sigma_{\mathrm{test}})$ pair is the average of independent runs, and we observe that all results are highly concentrated around their average (within less than variation). This phenomenon is also confirmed for the neural network models for the other datasets; thus we use the average accuracy as a measurement of robustness against noisy inference. Also note that the $y$-axis represents the error rate ($1 -$ accuracy) on a logarithmic scale. Table 2 and Table 3 summarize the results. The first row lists the inference noise levels from 0.0 to 1.0 with a step size of 0.2; the second row shows the accuracy of the noiselessly trained models; the third row shows the best accuracy for the $\sigma_{\mathrm{test}}$ in that column, which is achieved by noise-injected training with the $\sigma_{\mathrm{train}}$ shown in the fourth row.

Table 2 (multiplicative noise):
$\sigma_{\mathrm{test}}$ | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
$\sigma_{\mathrm{train}}$ = 0 | 99.5% | 99.3% | 93.5% | 65.7% | 38.8% | 21.1%
Best accu. | 99.6% | 99.5% | 99.3% | 94.6% | 86.1% | 77.7%
By $\sigma_{\mathrm{train}}$ | 0.2 | 0.5 | 0.5 | 0.7 | 1.0 | 1.0

Table 3 (additive noise):
$\sigma_{\mathrm{test}}$ | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
$\sigma_{\mathrm{train}}$ = 0 | 99.5% | 99.3% | 95.8% | 80.5% | 58.8% | 38.7%
Best accu. | 99.6% | 99.5% | 99.4% | 99.2% | 99.0% | 98.4%
By $\sigma_{\mathrm{train}}$ | 0.4 | 0.4 | 0.4 | 0.6 | 0.9 | 1.0
1. Conventional noiseless training ($\sigma_{\mathrm{train}} = 0$) does not provide satisfactory performance against noisy inference, even when the noise power is low to medium (second row in Table 2 and Table 3). For example, when , the accuracy reduces by and for the multiplicative and additive noise models, respectively.
2. If the inference noise power is known, training with properly injected noise can greatly improve noisy inference (third and fourth rows in Table 2 and Table 3) when the noise power is low to medium (e.g., the accuracy decreases by less than when and by less than when ). Note that $\sigma_M$ is the ratio between the standard deviation of the noise and the magnitude of the signal, and tolerating $\sigma_{M,\mathrm{test}} = 0.4$ (SNR = 7.96 dB) is already a good relaxation of the requirements for neuromorphic circuits.
3. If the inference noise power is unknown, noise-injected training with a specified $\sigma_{\mathrm{train}}$ also provides consistent accuracy improvements for low-to-medium inference noise power (Figure 2). For example, the accuracy decreases by less than when for all inference noise levels in both the multiplicative and additive noise models.
We further investigate the learning speed and the weights under noise-injected training. The result for the multiplicative noise models is presented in Figure 3, which shows the learning curve (validation accuracy and loss) for the first stochastic mini-batches of size . It can be observed that higher noise power in training generally results in a lower training speed (with the learning rate fixed to ). Table 4 shows the expected norm of the weights for 11 trained CNN models with different $\sigma_{M,\mathrm{train}}$. It can be seen that the magnitudes of the weights for $\sigma_{M,\mathrm{train}} > 0$ are similar to one another and slightly greater than those of the noiselessly trained model.

Table 4:
$\sigma_{M,\mathrm{train}}$ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
Expected weight norm | 0.049 | 0.054 | 0.054 | 0.059 | 0.060 | 0.060 | 0.058 | 0.056 | 0.060 | 0.058 | 0.059
In order to understand the difference between the models obtained from noiseless and noise-injected training, we choose two 6-layer CNN models, one noiselessly trained and one noise-injected trained, and run multiplicative noisy inference with $\sigma_{M,\mathrm{test}}$ from to , times on the same image (which has label “6”), and then plot the distributions of the 10 outputs after the softmax layer, which can be thought of as the probabilities of that image being classified as the corresponding digits. Figure 4 and Figure 5 show the evolution of the output distributions with increasing $\sigma_{M,\mathrm{test}}$. Note that the upper right part of each of the six plots indicates a higher probability of having large output values; therefore, the neural network is likely to predict the image as the labels/curves that appear in that area. It can be seen that the noiselessly trained model (Figure 4) starts to be confused between digit “6” (pink) and “8” (dark yellow) when ; on the other hand, the noise-injected trained model constantly favors the prediction of “6”.

3.2.2 A Depth-40 DenseNet for CIFAR-10
We train on the non-augmented CIFAR-10 dataset for 300 epochs using the momentum optimizer with mini-batches of size 32. We inject multiplicative Gaussian noise right after each matrix-vector multiplication. Figure 6 shows the average accuracy for pairs of $(\sigma_{M,\mathrm{train}}, \sigma_{M,\mathrm{test}})$, and Table 5 summarizes the results.
As with the 6-layer CNN for MNIST images, we can conclude from Table 5 and Figure 6 for the multiplicative noise models that noise-injected training provides a large robustness gain against noisy inference; e.g., the accuracy decreases by less than (see Table 5) if $\sigma_{M,\mathrm{test}} \le 0.4$ (SNR = 7.96 dB).

Table 5:
$\sigma_{\mathrm{test}}$ | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
$\sigma_{\mathrm{train}}$ = 0 | 92.5% | 91.9% | 88.5% | 78.7% | 60.4% | 29.9%
Best accu. | 92.8% | 92.4% | 91.5% | 85.6% | 75.7% | 66.9%
By $\sigma_{\mathrm{train}}$ | 0.1 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
3.2.3 LSTMs for MNIST stroke sequences
This section provides results on classifying the MNIST stroke sequences. We train the LSTM-based RNNs for 100 epochs using stochastic mini-batches of size 32 with the Adam optimizer, where multiplicative noise is injected after the four matrix-vector multiplications within the LSTM cells and one for the final output layer. Figure 7 shows the accuracy for different $(\sigma_{M,\mathrm{train}}, \sigma_{M,\mathrm{test}})$ pairs and Table 6 summarizes the results. The accuracy for noise-injected training decreases by less than when , providing much greater robustness against noisy inference than noiseless training.

Table 6:
$\sigma_{\mathrm{test}}$ | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
$\sigma_{\mathrm{train}}$ = 0 | 94.8% | 88.0% | 49.8% | 23.6% | 17.0% | 15.5%
Best accu. | 95.6% | 95.7% | 94.3% | 87.3% | 77.0% | 66.9%
By $\sigma_{\mathrm{train}}$ | 0.3 | 0.3 | 0.4 | 0.6 | 0.8 | 1.0
3.3 Noisy Inference with Voting
Inspired by the comparison of Figure 4 and Figure 5, we find many instances where the correct prediction is favored in the histogram but is nevertheless not predicted correctly when $\sigma_{\mathrm{test}}$ is large, e.g., . The reason is that the prediction is probabilistic: if there is a small overlap in the histogram, e.g., , between the favored correct label and a wrong label, the accuracy will be reduced by around . Observing that the correct prediction appears more often over multiple runs of independent noisy inference, we propose a voting mechanism that magnifies the favored label based on the law of large numbers: we collect a number of predictions from noisy inference and declare the prediction with the most votes (also known as the mode) the final prediction. Figure 8, Figure 9, and Figure 10 show a detailed comparison of inference with voting (b) and without voting (a), where the three (a) subfigures are replots of Figure 2(a), Figure 6, and Figure 7 for comparison purposes. Table 7, Table 8, and Table 9 summarize the results, where the second row repeats the best accuracy for noise-injected training and the third row shows the best accuracy achieved by 20 votes of noisy inference using neural network models trained with $\sigma_{\mathrm{train}}$ equal to the value in the corresponding column of the fourth row. We can make a few observations based on Figure 8, Figure 9, Figure 10, Table 7, Table 8, and Table 9.
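The voting mechanism can be sketched as follows (with a deterministic stand-in for the noisy classifier; the labels and the 60% single-run accuracy are illustrative only):

```python
from collections import Counter
from itertools import cycle

def vote(predict_once, num_votes=20):
    """Run `num_votes` independent noisy inferences and return the mode."""
    preds = [predict_once() for _ in range(num_votes)]
    return Counter(preds).most_common(1)[0][0]

# Deterministic stand-in for a noisy classifier that predicts the correct
# label "6" on 60% of runs and a confusable wrong label "8" otherwise:
# a single inference is wrong 40% of the time, but the majority over
# 20 votes recovers the favored label.
stream = cycle([6, 8, 6, 6, 8])
noisy_predict = lambda: next(stream)

assert vote(noisy_predict, num_votes=20) == 6  # 12 votes for "6", 8 for "8"
```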
1. While the accuracy in the low-to-medium-noise (i.e., high SNR) regime is similar, voting further improves the accuracy of all three neural networks on the three datasets when the noise power is large ($\sigma_{\mathrm{test}}$ close to ). For example, when $\sigma_{M,\mathrm{test}} = 1$ (SNR = 0 dB), the accuracy is improved by more than , from to for MNIST images, CIFAR-10, and MNIST stroke sequences, respectively. Compared horizontally, the accuracy remains almost the same ( vs. ) for MNIST images when $\sigma_{M,\mathrm{test}} = 1$ (SNR = 1 in value, or 0 dB); noise-injected training without voting achieves such high accuracy only when (SNR = 25 in value, or 14 dB), so a 14 dB SNR gain is realized by voting. Similarly, the tolerable $\sigma_{M,\mathrm{test}}$ increases from to (a 6 dB SNR gain) for the DenseNet on CIFAR-10 if an accuracy loss of is acceptable, and from to (a 3.5 dB SNR gain) for the LSTMs on MNIST stroke sequences if an accuracy loss of is acceptable. Such SNR gains further relax the requirements for neuromorphic circuit designs and enable them to have accuracy competitive with GPU/CPU-centered digital computations.
2. We observe in Table 9 that with voting and noise-injected training combined, noisy inference improves accuracy over noiseless inference for the LSTM-based recurrent neural networks, which is counterintuitive since it is often believed that noise during inference is harmful rather than helpful to performance. This improvement from noisy inference is also confirmed by independently training multiple LSTM models.
Table 7 (MNIST images):
$\sigma_{\mathrm{test}}$ | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
No voting | 99.6% | 99.5% | 99.3% | 94.6% | 86.1% | 77.7%
Best voting | 99.6% | 99.6% | 99.6% | 99.6% | 99.6% | 99.5%
By $\sigma_{\mathrm{train}}$ | 0.2 | 0.5 | 0.5 | 0.5 | 0.8 | 0.9

Table 8 (CIFAR-10):
$\sigma_{\mathrm{test}}$ | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
No voting | 92.8% | 92.4% | 91.5% | 85.6% | 75.7% | 66.9%
Best voting | 92.8% | 92.7% | 92.7% | 92.8% | 91.3% | 89.1%
By $\sigma_{\mathrm{train}}$ | 0.1 | 0.1 | 0.2 | 0.3 | 0.6 | 0.8

Table 9 (MNIST stroke sequences):
$\sigma_{\mathrm{test}}$ | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
No voting | 95.6% | 95.7% | 94.3% | 87.3% | 77.0% | 66.9%
Best voting | 95.6% | 96.1% | 95.9% | 94.6% | 92.7% | 89.6%
By $\sigma_{\mathrm{train}}$ | 0.3 | 0.3 | 0.4 | 0.6 | 0.7 | 0.8
Having observed the benefits of the voting mechanism, we now discuss how to mitigate its weakness in practical applications, since latency and power consumption increase when a prediction is made from multiple runs of noisy inference. Fig. 11 shows the trade-off between the number of votes and accuracy for MNIST images and CIFAR-10. It can be seen that the accuracy improves quickly as the number of votes increases from to , in particular for large noise power; beyond votes, the accuracy curve becomes almost flat. Fast convergence of the voting is particularly important for neuromorphic-circuit-based deep neural networks to keep the latency and power overhead small.
The latency overhead can be reduced in two ways. If chip area is not a bottleneck, multiple inferences can be made in parallel on multiple copies of the neuromorphic-circuit-based neural network, resulting in zero latency overhead. Otherwise, the overhead can be minimized by pipelining multiple inferences on a single neuromorphic-circuit-based neural network. Each layer of a neural network can be implemented as one memory array, and instead of waiting for one inference to finish before the next one starts, inputs can be fed into the neuromorphic computing system at each time stamp so that all layers form a pipeline. Assume the number of layers is $L$, each layer contributes latency $t$, and the number of votes is $V$; then the overall latency with pipelining is $(L + V - 1)t$ (instead of $LVt$ without pipelining). Compared to the latency $Lt$ of a single inference, the relative overhead is $(V-1)/L$, which is small for large $L$, i.e., for the deeper neural networks that are becoming widely used in practice.
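Under the pipelining model just described (one input entering per time stamp), the latency accounting works out as in this sketch (assuming $L$ layers of per-layer latency $t$ and $V$ votes; the numeric values are illustrative):

```python
# Latency accounting for voted inference on an L-stage pipeline.

def latency_sequential(L, t, V):
    """V full inferences run back to back, no pipelining."""
    return L * V * t

def latency_pipelined(L, t, V):
    """One new inference enters the L-stage pipeline per time stamp:
    the first result appears after L*t, then one more every t."""
    return (L + V - 1) * t

L, t, V = 40, 1.0, 20            # e.g. a depth-40 network with 20 votes
assert latency_sequential(L, t, V) == 800.0
assert latency_pipelined(L, t, V) == 59.0

# Relative overhead versus a single inference (L*t) is (V - 1) / L,
# which shrinks as the network gets deeper.
assert (latency_pipelined(L, t, V) - L * t) / (L * t) == (V - 1) / L
```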
If a single inference costs power $P$, then the power consumption of $V$ votes is $VP$, which cannot be reduced by parallel computing or pipelining. However, $P$ depends on the memory programming/sensing voltage, which is positively related to the SNR, i.e., the inverse of $\sigma_M^2$. Therefore, a smaller per-inference power consumption can be obtained at the cost of a larger $\sigma_M$. The trade-off between power consumption and accuracy, connected through the SNR and $\sigma_M$, is complicated and beyond the scope of this paper.
4 Noisy Inference as a Defensive Mechanism against Black-box Adversarial Attacks
We show in this section that noisy inference can enhance adversarially trained neural networks against black-box adversarial attacks. Noisy inference provides stochastic gradients, which have been shown in CarliniObfuscated18 to provide a false sense of security against white-box attacks. However, we observe that the transferability of adversarial examples from one neural network to another is weaker when the network used for prediction runs noisy inference. We focus on attacks where all pixel values of an adversarial example lie within an $\epsilon$-cube around a natural one. Our experiments are set up as follows.
1. We use two datasets, MNIST images and CIFAR-10, where all pixel values are normalized to lie between $0$ and $1$.
2. The adversarial attacks are iterative fast gradient sign methods (FGSM) with projected gradient descent (PGD) and multiple restarts from random images within the $\epsilon$-cube of a natural image. This has been shown to be among the strongest first-order attacks madry2018towards . The detailed parameters of the attacks are listed in Table 10.

Table 10:
 | MNIST | CIFAR-10
$\epsilon$ | 0.3 |
Adv. learn. rate | 0.05 |
Iterations | 20 | 7
No. restarts | 20 | 10
3. Two neural networks are noiselessly and adversarially trained madry2018towards , where each mini-batch of size 32 consists of equal numbers of natural and adversarial samples. One neural network is used to generate adversarial examples and the other is used to validate the accuracy on those adversarial examples. For MNIST, the generating network is a 4-layer CNN, as in madry2018towards , and the validating network is the 6-layer CNN presented in Table 1; both are adversarially trained for 100 epochs. For CIFAR-10, the generating network is the depth-40 DenseNet with growth rate 12 and the validating network is a depth-100 DenseNet with growth rate 12; both are adversarially trained for 300 epochs.
Table 11 shows the average validation accuracy for MNIST and CIFAR-10 under black-box attacks with 20 runs of noisy inference. It can be observed that the robustness of the adversarially trained neural networks (the 6-layer CNN and the depth-100 DenseNet) against the attacks is further enhanced from to and from to , respectively. The and improvements are achieved by multiplicative noisy inference with and , respectively.

Table 11:
$\sigma_{M,\mathrm{test}}$ | MNIST | CIFAR-10
 | 97.40% | 66.52%
 | 97.79% | 67.65%
 | 97.90% | 67.43%
 | 97.71% | 66.57%
 | 97.00% | 63.39%
5 Conclusion
In this paper, we propose training with injected noise and inference with voting, which imbue neural networks with much greater resilience to imperfect computations during inference and have potential applications in deep learning with ultra-low power consumption. Three neural network architectures show remarkable accuracy improvements with these two methods. Under strong noise ( ), the two methods improve the accuracy from to for the three datasets, respectively. For low-to-medium noise power ( ), noise-injected training by itself improves the accuracy from to for the three datasets at no cost in latency or power consumption.
A further study of black-box attacks against neural networks shows and enhancements for MNIST and CIFAR-10, respectively, brought by noisy inference on adversarially trained neural networks.
References
 (1) J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117, published online 2014; based on TR arXiv:1404.7828 [cs.NE]. doi:10.1016/j.neunet.2014.09.003.
 (2) T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, New York, 2001.

 (3) K. Jarrett, K. Kavukcuoglu, M. Ranzato, What is the best multi-stage architecture for object recognition?, in: IEEE Int. Conf. Computer Vision, Kyoto, Japan, 2009, pp. 2146–2153.
 (4) A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105.
 (5) A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber, A novel connectionist system for improved unconstrained handwriting recognition, IEEE Trans. Pattern Analysis and Machine Intelligence 31 (5) (2009) 855–868.
 (6) H. Sak, A. Senior, F. Beaufays, Long short-term memory recurrent neural network architectures for large scale acoustic modeling.
 (7) M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the RPROP algorithm, in: IEEE Int. Conf. Neural Networks, San Francisco, CA, USA, 1993, pp. 586–591.
 (8) R. Hecht-Nielsen, Theory of the backpropagation neural network, in: Int. Joint Conf. on Neural Networks (IJCNN), Washington, DC, USA, 1989, pp. 593–605.
 (9) W. Zhao, G. Agnus, V. Derycke, A. Filoramo, J. Bourgoin, C. Gamrat, Nanotube devices based crossbar architecture: toward neuromorphic computing, Nanotechnology 21 (17) (2010).
 (10) R. Micheloni, A. Marelli, K. Eshghi, Inside Solid State Drives (SSDs), Springer, New York, 2013.
 (11) B. Feinberg, S. Wang, E. Ipek, Making memristive neural network accelerators reliable (Feb. 2018).
 (12) S. Jain, A. Ranjan, K. Roy, A. Raghunathan, Computing in memory with spin-transfer torque magnetic RAM, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 23 (3) (2017) 470–483.
 (13) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, R. Fergus, Intriguing properties of neural networks, CoRR abs/1312.6199.
 (14) I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, CoRR abs/1412.6572. URL http://arxiv.org/abs/1412.6572
 (15) A. Kurakin, I. J. Goodfellow, S. Bengio, Adversarial examples in the physical world, CoRR abs/1607.02533. URL http://arxiv.org/abs/1607.02533
 (16) N. Carlini, D. A. Wagner, Towards evaluating the robustness of neural networks, CoRR abs/1608.04644. URL http://arxiv.org/abs/1608.04644
 (17) N. Papernot, P. D. McDaniel, I. J. Goodfellow, Transferability in machine learning: from phenomena to black-box attacks using adversarial samples, CoRR abs/1605.07277. URL http://arxiv.org/abs/1605.07277
 (18) N. Papernot, P. D. McDaniel, X. Wu, S. Jha, A. Swami, Distillation as a defense to adversarial perturbations against deep neural networks, CoRR abs/1511.04508. URL http://arxiv.org/abs/1511.04508
 (19) W. Xu, D. Evans, Y. Qi, Feature squeezing: Detecting adversarial examples in deep neural networks, CoRR abs/1704.01155. URL http://arxiv.org/abs/1704.01155
 (20) R. Feinman, R. R. Curtin, S. Shintre, A. B. Gardner, Detecting adversarial samples from artifacts, CoRR abs/1703.00410.
 (21) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, in: International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb
 (22) G. Huang, Z. Liu, K. Q. Weinberger, Densely connected convolutional networks, CoRR abs/1608.06993. URL http://arxiv.org/abs/1608.06993
 (23) E. D. de Jong, MNIST digits stroke sequence data, https://github.com/edwindejong/mnistdigitsstrokesequencedata/wiki/MNISTdigitsstrokesequencedata (2016).
 (24) R. Hahnloser, R. Sarpeshkar, M. Mahowald, R. J. Douglas, H. S. Seung, Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit, Nature 405 (2000) 947–951.
 (25) A. L. Maas, A. Y. Hannun, A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models (2013).
 (26) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, CoRR abs/1512.03385. URL http://arxiv.org/abs/1512.03385
 (27) A. Athalye, N. Carlini, D. A. Wagner, Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples, CoRR abs/1802.00420. URL http://arxiv.org/abs/1802.00420