Noisy Computations during Inference: Harmful or Helpful?

11/26/2018 ∙ by Minghai Qin, et al.

We study two aspects of noisy computations during inference. The first aspect is how to mitigate their side effects for naturally trained deep learning systems. One of the motivations for looking into this problem is to reduce the high power cost of conventional computing of neural networks through the use of analog neuromorphic circuits. Traditional GPU/CPU-centered deep learning architectures exhibit bottlenecks in power-restricted applications (e.g., embedded systems). The use of specialized neuromorphic circuits, where analog signals passed through memory-cell arrays are sensed to accomplish matrix-vector multiplications, promises large power savings and speed gains but brings with it the problems of limited precision of computations and unavoidable analog noise. We manage to improve inference accuracy from as low as 21.1% to 99.5% for MNIST images and 89.6% for MNIST stroke sequences (with the signal-to-noise power ratio being 0 dB) by noise-injected training and a voting method. This observation promises neural networks that are insensitive to inference noise, which reduces the quality requirements on neuromorphic circuits and is crucial for their practical usage. The second aspect is how to utilize noisy inference as a defensive architecture against black-box adversarial attacks. By injecting proper noise into the signals of the neural networks during inference, the robustness of adversarially trained neural networks against black-box attacks is further enhanced by 0.5% and 1.13% for MNIST and CIFAR10, respectively.

1 Introduction

Neural networks (NNs) Schmidhuber15 ; HTF01 are layered computational networks that try to imitate the function of neurons in a human brain during object recognition, decision making, and other cognitive tasks. They are one of the most widely used machine learning techniques due to their good performance in practice. Certain variants of neural networks have been shown to be more suitable for particular learning applications. For instance, deep convolutional neural networks (CNNs) JKR09 ; KSH12 were found to be effective at recognizing and classifying images, while recurrent neural networks (RNNs) GLF09 ; HAF14 provide stronger performance at sequence prediction, e.g., speech or text recognition. A neural network is determined by the connections between its neurons and the weights/biases associated with them, which are trained using the back-propagation algorithm RB93 ; Nielsen89 .

In order to fit highly non-linear functions and thus to achieve a high rate of correctness in practice, neural networks usually contain many layers, each containing a large number of weights. The most power- and time-consuming computation of a neural network is the matrix-vector multiplication between the weights and the incoming signals. In power-restricted applications, such as an inference engine in an embedded system, the size of the neural network is effectively limited by the power consumption of the computations. One attractive method of lowering this power consumption is neuromorphic computing ZADFBG10 , where analog signals passed through memory-cell arrays are sensed to accomplish matrix-vector multiplications. Here weights are programmed as conductances (the reciprocal of resistance) of memory cells in a two-dimensional array, such as resistive RAM (ReRAM), phase-change RAM (PCM), or NAND flash MME13 . According to Ohm's law, if input signals are presented as voltages to this layer of memory cells, the output current is the matrix-vector multiplication of the weights and the input signals. The massive parallelism of this operation also promises significant savings in latency over sequential digital computations.
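To make the analog matrix-vector multiplication concrete, the following is a minimal NumPy sketch of the crossbar idea described above; the conductance values and array sizes are purely illustrative, and practical designs additionally handle signed weights (e.g., with pairs of cells) and quantization, which are omitted here.

```python
import numpy as np

# Illustrative crossbar: each weight is stored as a cell conductance (siemens).
# Applying the input signals as voltages and summing the resulting currents on
# the output lines yields the matrix-vector product in a single analog step
# (Ohm's law plus current summation).
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 8))   # 4x8 array of cell conductances
v = rng.uniform(0.0, 0.2, size=8)          # input signals encoded as voltages

i_out = G @ v                               # sensed output currents = W x
print(i_out)                                # one current per output neuron
```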

One of the problems with neuromorphic computing is the limited precision of the computations, since the programming of memory cells, as well as the measurement of the analog outputs, inevitably suffers from analog noise, which harms the performance of deep learning systems. Some researchers have proposed to correct the errors with A/D and D/A converters at each layer FWI18 ; JRRR17 , which induces extra overhead in latency, area, and power. The existing literature on noisy inference for neuromorphic computing is mostly in the device/circuit domain, typically focusing on redesigning memory devices/circuits to reduce the noise power. To our knowledge, however, the robustness of neural networks against analog, noisy computations inside the network has not been studied in detail from an algorithmic point of view. The first goal of this paper is to present methods and observations that mitigate the harmful effects of inference noise.

During this study, we find that noisy inference can also be helpful for defending against black-box adversarial attacks on neural networks, resistance to which is becoming crucial for a wide range of applications, such as deep learning systems deployed in financial institutions and autonomous driving. Adversarial attacks manipulate the inputs to a neural network and generate adversarial examples that have barely perceptible differences from a natural image but lead to incorrect predictions Szegedy2013IntriguingPO . White-box attacks fall into three categories.

  • Single-step attacks, e.g., the fast gradient sign method (FGSM) GoodfellowExplaining14 ,

    x_adv = x + ε · sign(∇_x L(x, y)),

    where x and x_adv are the natural and adversarial examples, respectively, ε is the upper bound on the distance between x and x_adv, L(x, y) is the loss function for an input x being predicted to have label y, and ∇_x is the derivative operator with respect to the input x.

  • Iterative methods, e.g., iterative FGSM KurakinAdversarialPhysical16 ,

    x_adv^(t+1) = Clip_{x,ε}( x_adv^(t) + α · sign(∇_x L(x_adv^(t), y)) ),

    where α is the learning rate of the adversary and Clip_{x,ε} is a function that clips its argument to lie within the ε-cube of x, so that the distance between x and x_adv is always less than or equal to ε.

  • Optimization-based attacks, e.g., CarliniEvaluatingRobustness16 ,

    minimize over δ:  ||δ||_p + c · f(x + δ),

    where f is a surrogate loss that is small when x + δ is misclassified and c balances the norm distortion and the loss term.

Black-box attacks are possible based on the transferability PapernotTransferability16 of adversarial examples: adversarial examples generated from one neural network often lead to incorrect predictions by another, independently trained neural network.

Several defensive mechanisms, such as distillation PapernotDistill15 , feature squeezing XuFeatureSqueezing17 , and adversary detection Feinman2017DetectingAS , have been proposed against such attacks. madry2018towards models the process as a min-max problem and builds a universal framework for attacking and defensive schemes, aiming to provide guaranteed performance against first-order attacks, i.e., attacks based solely on derivatives. They also suggested that adversarial training with projected gradient descent (PGD) provides the strongest robustness against first-order attacks. Based on their study, we further investigate noisy inference of adversarially trained neural networks. Experimental results show improved robustness against black-box adversarial attacks and indicate that adversarial examples are more sensitive to inference noise than natural examples.

The contribution of this paper is summarized as follows.

1. We model the analog noise of neuromorphic circuits as additive and multiplicative Gaussian noise. The impact of noisy inference on accuracy is shown to be severe for three neural network architectures (a regular CNN, a DenseNet HLW16 , and an LSTM-based RNN) on three datasets (MNIST, CIFAR10, and MNIST stroke sequences JongMnistStroke2016 ); e.g., when the noise power equals the signal power, the accuracy drops to as low as 21.1%, 29.9%, and 15.5%, respectively.

2. We observe that the performance of noisy inference can be greatly improved by injecting noise into all layers of the neural network during training. Noise-injected training has been used as a regularization tool for better model generalization, but its application to noisy inference has not been studied in detail. We provide a quantitative measurement of the impact that inference noise has on noiselessly trained and noise-injected trained neural networks, where the power of the training and inference noise might not match. The performance of noisy inference with low-to-medium noise power is improved to almost as good as noiseless inference. For large noise power (equal to the signal power), the accuracy is increased to 77.7%, 66.9%, and 66.9% for the three datasets, respectively.

3. We further improve the performance of noisy inference by proposing a voting mechanism for large noise power. The accuracy is further increased to 99.5%, 89.1%, and 89.6% for the three datasets when the noise power equals the signal power. In addition, we observe that with noise-injected training and the proposed voting mechanism combined, noisy inference can give higher accuracy than noiseless inference for LSTM-based recurrent neural networks, which is counterintuitive since it is often believed that noise during inference would be harmful rather than helpful to the accuracy.

4. A further study on adversarially trained neural networks for MNIST and CIFAR10 shows that noisy inference improves the robustness of deep neural networks against black-box attacks. The accuracy on adversarial examples is improved by 0.5% and 1.13% (in absolute values) for MNIST and CIFAR10 when validated on a separately and adversarially trained CNN and DenseNet, respectively.

2 Preliminaries

A neural network contains input neurons, hidden neurons, and output neurons. It can be viewed as a function f: R^n → R^m, where the input x is an n-dimensional vector and the output y is an m-dimensional vector. In this paper, we focus on classification problems where the output is usually normalized such that Σ_i y_i = 1 and each y_i ∈ [0, 1] can be viewed as the probability for some input x to be categorized as the i-th class. The normalization is often done by the softmax function, which maps an arbitrary m-dimensional vector z into a normalized vector softmax(z) with softmax(z)_i = e^{z_i} / Σ_j e^{z_j}. For top-k decision problems, we return the k categories with the largest outputs y_i. In particular, for hard decision problems where k = 1, the classification result is argmax_i y_i.

A feed-forward neural network that contains k layers (excluding the softmax output layer) can be expressed as a concatenation of functions such that f = g_k ∘ ⋯ ∘ g_2 ∘ g_1. The i-th layer satisfies x^{(i)} = g_i(x^{(i−1)}), i = 1, …, k, where x^{(0)} = x is the input. The output of the last layer is then fed into the softmax function. The function g_i is usually defined as

x^{(i)} = \phi\big(W^{(i)} x^{(i-1)} + b^{(i)}\big),    (1)

where W^{(i)} is the weight matrix, b^{(i)} is the bias vector, and φ(·) is an element-wise activation function that is usually nonlinear, e.g., tanh, sigmoid, the rectified linear unit (ReLU) HSMDS00 , or leaky ReLU MHN13 . Both W^{(i)} and b^{(i)} are trainable parameters.

In this paper, two noise models are assumed, an additive Gaussian noise model and a multiplicative Gaussian noise model, applied in the forward pass after each matrix-vector multiplication. Eq. (1) then becomes

x^{(i)} = \phi\big(W^{(i)} x^{(i-1)} + n + b^{(i)}\big), \quad n \sim \mathcal{N}(0, \sigma^2 I),    (2)

for additive Gaussian noise, or

x^{(i)} = \phi\big((W^{(i)} x^{(i-1)}) \circ r + b^{(i)}\big), \quad r \sim \mathcal{N}(1, \sigma^2 I),    (3)

for multiplicative Gaussian noise, where ∘ denotes element-wise multiplication. Additive Gaussian noise models procedures of neuromorphic computing where the noise power is independent of the signals, such as signal sensing, memory reading, and some random electrical perturbations of circuits. For multiplicative noise, on the other hand, we can show that multiplying a signal s by a unit-mean Gaussian random variable r is equivalent to adding a zero-mean Gaussian random variable to the signal whose standard deviation is proportional to the magnitude of the signal, as

s \cdot r = s \cdot (1 + n) = s + s \cdot n,    (4)

where n ~ N(0, σ²) and hence s·n ~ N(0, s²σ²). Therefore, multiplicative noise models the procedures where the noise power is proportional to the signal power, such as memory programming and computations.

For both the additive and multiplicative Gaussian noise models, we denote by σ_tr and σ_v the standard deviation of the noise during training and validation (inference), respectively, and write σ_tr^A, σ_v^A, σ_tr^M, σ_v^M when a specific additive or multiplicative noise model is referred to. Note that σ_tr is usually set to 0 in conventional training; nonetheless, random noise is sometimes injected (σ_tr > 0) during training to provide better generalization. Conventional deep learning architectures assume σ_v = 0 since digital circuits are assumed to have no errors during computations.

For both noise models, the signal-to-noise power ratio (SNR) is a defining parameter that measures the strength of the noise relative to the signal; it is defined as the ratio between the power of the signal and that of the noise. It is usually expressed in dB, where SNR(dB) = 10·log10(P_signal / P_noise). For example, if the signal and the noise have the same power, SNR = 1 (or 0 dB). For multiplicative noise models, the SNR is constant, SNR = 1/σ_v²; on the other hand, the SNR of an additive noise model with a fixed σ_v depends highly on the signal power, so a comparison is fair only if the signal power is invariant. Therefore, we mainly use multiplicative noise models in our experiments, except for cases where we can normalize the signals to have constant power.
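To summarize the preliminaries, the following NumPy sketch injects the two noise models of Eq. (2) and Eq. (3) right after a matrix-vector multiplication and checks the SNR relation for the multiplicative model; the layer sizes and σ values are illustrative.

```python
import numpy as np

def noisy_layer(W, x, b, sigma, mode="mult", rng=None):
    """One forward layer with Gaussian noise injected right after W @ x."""
    rng = rng or np.random.default_rng()
    z = W @ x
    if mode == "add":                        # Eq. (2): additive N(0, sigma^2)
        z = z + rng.normal(0.0, sigma, size=z.shape)
    else:                                    # Eq. (3): multiplicative N(1, sigma^2)
        z = z * rng.normal(1.0, sigma, size=z.shape)
    return np.maximum(z + b, 0.0)            # ReLU activation

# For the multiplicative model the SNR is 1 / sigma^2, e.g. 0 dB at sigma = 1:
for sigma in (0.2, 0.4, 1.0):
    print(sigma, 10 * np.log10(1.0 / sigma ** 2))   # ~14 dB, ~8 dB, 0 dB
```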

3 Robustness of NNs against Noisy Inference

In this section we explore the robustness of neural networks against noisy computations as modeled in Eq. (2) and Eq. (3). We show the robustness of different neural network architectures for different datasets against noisy inference and provide two techniques that improve this robustness, namely noise-injected training and voting.

3.1 Datasets and NN architectures

Three datasets are used in our experiments: MNIST images (60,000 for training and 10,000 for validation), CIFAR10 images (50,000 for training, 10,000 for validation, no data augmentation applied), and the MNIST stroke sequences JongMnistStroke2016 extracted from each MNIST image.

For MNIST images, we use a 6-layer convolutional neural network described in Table 1. The noiselessly trained model has a prediction accuracy of 99.5% for noiseless inference. For multiplicative Gaussian noise models, the parameters in the batch normalization layers are trainable. However, for additive Gaussian noise models, we use batch normalization layers with fixed parameters (as opposed to trainable mean and variance) to minimize the signal power variations, for the reason mentioned in Section 2.

Layer Output shape
Conv. 2D - Batch Norm
Gaussian Noise - ReLU
Conv. 2D - Batch Norm
Gaussian Noise - ReLU
Maxpooling
Conv. 2D - Batch Norm
Gaussian Noise - ReLU
Conv. 2D - Batch Norm
Gaussian Noise - ReLU
Maxpooling
Fully connected - Batch Norm
Gaussian Noise - ReLU
Fully connected - Gaussian Noise
Table 1: A 6-layer CNN architecture for MNIST

For CIFAR10, we use a densely connected convolutional neural network (DenseNet) HLW16 . DenseNet is an enhanced version of ResNet HeResNet15 where the feature maps of all previous layers are presented as input to later convolutional layers. The depth of the DenseNet is 40 and the growth rate is 12. All parameters in the convolutional, fully connected, and batch normalization layers are trainable, since fixed batch normalization parameters cannot provide satisfactory accuracy even when inference is noiseless. The total number of trainable parameters of the DenseNet is around one million. The noiselessly trained model has a prediction accuracy of 92.5% with noiseless inference, which is slightly below the reported value in HeResNet15 . We inject multiplicative Gaussian noise after each matrix-vector multiplication.

For the MNIST stroke sequences extracted by JongMnistStroke2016 , we use an LSTM-based recurrent neural network whose cells correspond to the two-dimensional coordinates of the first pen-points of each written digit. If the total number of pen-points is smaller than the number of cells, the sequence is padded with zeros. Since the pen-points are two dimensional, the input dimension of each LSTM cell is 2. Each cell has a fixed number of hidden units, and the last LSTM cell is connected to an output layer of 10 neurons for classification. The noiselessly trained model has an accuracy of around 94.8% with noiseless inference. The drop in accuracy from MNIST images to MNIST stroke sequences may be because we have to truncate or pad the pen-point sequences to fit the LSTM cells, and because gray-level information is lost when converting images to stroke sequences. There are four matrix-vector multiplications in each LSTM cell (see the four yellow boxes in the LSTM cells in Fig. 1) and one between the LSTM cell and the output layer, where multiplicative noise is injected.

Figure 1: Example of LSTM cells.
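For concreteness, here is a rough PyTorch sketch of injecting multiplicative noise into the gate matrix-vector multiplications of an LSTM cell as described above. The hidden size is an arbitrary placeholder (not the value used in our experiments), the four gate multiplications are fused into one linear layer (mathematically equivalent), and the noise is applied to the full pre-activation for brevity.

```python
import torch
import torch.nn as nn

class NoisyLSTMCell(nn.Module):
    """LSTM cell with multiplicative Gaussian noise injected after the gate
    matrix-vector multiplications (a sketch; sizes are illustrative)."""
    def __init__(self, input_size=2, hidden_size=128, sigma=0.4):
        super().__init__()
        self.sigma = sigma
        # one fused linear layer computing the four gate pre-activations
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, state):
        h, c = state
        z = self.gates(torch.cat([x, h], dim=-1))
        z = z * (1.0 + self.sigma * torch.randn_like(z))   # multiplicative noise
        i, f, g, o = z.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```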

3.2 Noisy Inference by Noiseless and Noise-Injected Training

In this section, we show the impact of noisy inference on accuracy for noiselessly trained and noise-injected trained models. Noise-injected training applies noise layers during training; the information in the forward pass is therefore perturbed, which in turn influences the weight updates in the backward pass. We train models with different σ_tr's and test the accuracy of every trained model with different σ_v's; that is, the noise power during training does not necessarily match the noise power during inference. We will show that such noise-injected training uniformly improves accuracy over all σ_v's.
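A minimal PyTorch-style sketch of such a noise layer is given below; σ_tr is used in the forward pass during training and σ_v during inference, and the surrounding layer sizes are illustrative rather than the exact architecture of Table 1.

```python
import torch
import torch.nn as nn

class MultGaussianNoise(nn.Module):
    """Unit-mean multiplicative Gaussian noise after a matrix-vector multiply:
    sigma_tr is applied during training, sigma_v during inference."""
    def __init__(self, sigma_tr=0.4, sigma_v=0.4):
        super().__init__()
        self.sigma_tr, self.sigma_v = sigma_tr, sigma_v

    def forward(self, z):
        sigma = self.sigma_tr if self.training else self.sigma_v
        if sigma == 0.0:
            return z
        return z * (1.0 + sigma * torch.randn_like(z))

# Example block in the spirit of Table 1: conv -> batch norm -> noise -> ReLU.
block = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    MultGaussianNoise(sigma_tr=0.4, sigma_v=0.4),
    nn.ReLU(),
)
```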

3.2.1 The CNN for MNIST images

For MNIST images with the 6-layer CNN architecture, we train for 100 epochs using stochastic mini-batch training with the Adam optimizer and a batch size of 32, for each σ_tr from 0 to 1.0 with a step size of 0.1. The accuracy of the noiselessly trained model with noiseless inference is around 99.5%. The trained models are then tested on the validation set with σ_v from 0 to 1.0. We present the results for a subset of the (σ_tr, σ_v) pairs in Figure 2. The validation accuracy for each pair is the average of multiple independent runs, and we observe that all results are highly concentrated around their average. This phenomenon is also confirmed for the neural network models of the other datasets, so we use the average accuracy as a measurement of robustness against noisy inference. Also note that the y-axis represents the error rate (1 − accuracy) on a logarithmic scale. Table 2 and Table 3 summarize the results. The first row lists the inference noise levels σ_v from 0 to 1.0 with a step size of 0.2; the second row shows the accuracy of the noiselessly trained model; the third row shows the best accuracy for the σ_v in that column, which is achieved by noise-injected training with the σ_tr in the fourth row.

σ_v 0.0 0.2 0.4 0.6 0.8 1.0
σ_tr = 0 99.5% 99.3% 93.5% 65.7% 38.8% 21.1%
Best accu. 99.6% 99.5% 99.3% 94.6% 86.1% 77.7%
By σ_tr 0.2 0.5 0.5 0.7 1.0 1.0
Table 2: Accuracy of noisy inference for multiplicative noise models for MNIST images.
σ_v 0.0 0.2 0.4 0.6 0.8 1.0
σ_tr = 0 99.5% 99.3% 95.8% 80.5% 58.8% 38.7%
Best 99.6% 99.5% 99.4% 99.2% 99.0% 98.4%
By σ_tr 0.4 0.4 0.4 0.6 0.9 1.0
Table 3: Accuracy of noisy inference for additive noise models for MNIST images.
Figure 2: Validation accuracy of the 6-layer CNN with Gaussian noise for different (σ_tr, σ_v) pairs. (a) Multiplicative noise model. (b) Additive noise model.

There are some observations that we can make from Table 2, Table 3, and Figure 2.

1. Conventional noiseless training (σ_tr = 0) does not provide satisfactory performance against noisy inference, even when the noise power is low to medium (second row in Table 2 and Table 3). For example, when σ_v = 0.4, the accuracy drops from 99.5% to 93.5% and 95.8% for the multiplicative and additive noise models, respectively.

2. If the inference noise power is known, training with properly injected noise greatly improves noisy inference (third and fourth rows in Table 2 and Table 3) when the noise power is low to medium (e.g., the accuracy decreases by less than 0.3% when σ_v ≤ 0.4 and by less than 5% when σ_v ≤ 0.6). Note that for the multiplicative model σ_v is the ratio between the standard deviation of the noise and the magnitude of the signal, so tolerating σ_v = 0.4 (SNR = 7.96 dB) is already a significant relaxation of the requirements on neuromorphic circuits.

3. If the inference noise power is unknown, noise-injected training with a fixed σ_tr also provides a consistent accuracy improvement for low-to-medium inference noise power (Figure 2). For example, a model trained with a moderate σ_tr keeps the accuracy decrease small for all low-to-medium inference noise levels in both the multiplicative and additive noise models.

We further investigate the learning speed and the weights of noise-injected training. The result for multiplicative noise models is presented in Figure 3, where the learning curve (validation accuracy and loss) over the first stochastic mini-batches of training is shown. It can be observed that higher noise power during training generally results in slower learning (with the learning rate fixed). Table 4 shows the expected norm of the weights for the 11 trained CNN models with different σ_tr. It can be seen that the magnitudes of the weights for σ_tr > 0 are similar to each other and slightly greater than those of the noiselessly trained model.

Figure 3: Learning curves of noise-injected training for σ_tr = 0 to 1.0.
σ_tr 0 0.1 0.2 0.3 0.4 0.5
Norm 0.049 0.054 0.054 0.059 0.060 0.060
σ_tr 0.6 0.7 0.8 0.9 1.0
Norm 0.058 0.056 0.060 0.058 0.059
Table 4: Expected norm of all weights of the 11 models trained with different σ_tr.

In order to understand the difference between the models obtained by noiseless and noise-injected training, we choose two 6-layer CNN models, one noiselessly trained (σ_tr = 0) and one noise-injected trained, run multiplicative noisy inference many times on the same image (which has label "6") for increasing σ_v, and plot the distributions of the 10 outputs after the softmax layer, which can be thought of as the probabilities of that image being classified as the corresponding digits. Figure 4 and Figure 5 show the evolution of the output distributions with increasing σ_v. Note that the upper-right part of each of the six plots indicates a higher probability of having large output values, so the neural network is likely to predict the image as the labels/curves that appear in that area. It can be seen that the noiselessly trained model (Figure 4) starts to confuse digit "6" (pink) with "8" (dark yellow) as σ_v grows; on the other hand, the noise-injected trained model consistently favors the prediction of "6".

Figure 4: Distribution of the 10 softmax outputs for MNIST. Model based on conventional training (σ_tr = 0); inference with increasing validation noise σ_v.
Figure 5: Distribution of the 10 softmax outputs for MNIST. Model based on noise-injected training; inference with increasing validation noise σ_v.

3.2.2 A Depth-40 DenseNet for CIFAR10

We train on the non-augmented CIFAR10 dataset for 300 epochs using a momentum optimizer with mini-batches of size 32. Multiplicative Gaussian noise is injected right after each matrix-vector multiplication. Figure 6 shows the average accuracy for (σ_tr, σ_v) pairs, and Table 5 summarizes the results.

Similarly to the 6-layer CNN for MNIST images, we can conclude from Table 5 and Figure 6 that, for multiplicative noise models, noise-injected training provides a large robustness gain against noisy inference; e.g., the accuracy decreases by less than 1.5% (see Table 5) if σ_v ≤ 0.4 (SNR = 7.96 dB).

σ_v 0.0 0.2 0.4 0.6 0.8 1.0
σ_tr = 0 92.5% 91.9% 88.5% 78.7% 60.4% 29.9%
Best 92.8% 92.4% 91.5% 85.6% 75.7% 66.9%
By σ_tr 0.1 0.2 0.4 0.6 0.8 1.0
Table 5: Accuracy of noisy inference for multiplicative noise models for CIFAR10.
Figure 6: Validation accuracy of the depth-40 DenseNet with Gaussian noise for different (σ_tr, σ_v) pairs.

3.2.3 LSTMs for MNIST stroke sequences

This section provides results on classifying the MNIST stroke sequences. We train LSTM-based RNNs for 100 epochs using stochastic mini-batches of size 32 with the Adam optimizer, where multiplicative noise is injected after the four matrix-vector multiplications within each LSTM cell and after the one for the final output layer. Figure 7 shows the accuracy for different (σ_tr, σ_v) pairs, and Table 6 summarizes the results. The accuracy for noise-injected training decreases by less than 1.5% when σ_v ≤ 0.4, providing much greater robustness against noisy inference than noiseless training.

σ_v 0.0 0.2 0.4 0.6 0.8 1.0
σ_tr = 0 94.8% 88.0% 49.8% 23.6% 17.0% 15.5%
Best accu. 95.6% 95.7% 94.3% 87.3% 77.0% 66.9%
By σ_tr 0.3 0.3 0.4 0.6 0.8 1.0
Table 6: Accuracy of noisy inference for multiplicative noise models for MNIST stroke sequences.
Figure 7: Validation accuracy of LSTM-based RNNs for different (σ_tr, σ_v) pairs.

3.3 Noisy Inference with Voting

Inspired by the comparison of Figure 4 and Figure 5, we find many instances where the correct prediction is favored in the histogram but is nevertheless not predicted correctly when σ_v is large. The reason is that the prediction is probabilistic: if there is a small overlap between the output distributions of the favored correct label and a wrong label, a corresponding fraction of the predictions will be wrong. Observing that the correct prediction appears more often over multiple runs of independent noisy inference, we propose a voting mechanism that magnifies the favored label based on the law of large numbers: we collect a number of predictions from noisy inference and declare the prediction with the most votes (also known as the mode) to be the final prediction. Figure 8, Figure 9, and Figure 10 show a detailed comparison of inference with voting (b) and without voting (a), where the three (a) sub-figures are replots of Figure 2(a), Figure 6, and Figure 7 for comparison purposes. Table 7, Table 8, and Table 9 summarize the results, where the second row repeats the best accuracy for noise-injected training and the third row gives the best accuracy achieved by 20 votes of noisy inference using the models trained with the σ_tr given in the corresponding column of the fourth row.
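A minimal sketch of the voting mechanism: run several independent noisy inferences and return the most frequent label (the mode). The `noisy_predict` callable is a stand-in for a forward pass through the noisy network.

```python
import numpy as np

def vote_predict(noisy_predict, x, votes=20):
    """Majority vote over `votes` independent noisy inferences of one input."""
    preds = np.array([noisy_predict(x) for _ in range(votes)])
    labels, counts = np.unique(preds, return_counts=True)
    return labels[np.argmax(counts)]
```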

We can make a few observations based on Figure 8, Figure 9, Figure 10, Table 7, Table 8, and Table 9.

1. While the accuracy in the low-to-medium σ_v (i.e., high SNR) regime is similar, voting further improves the accuracy of all three neural networks on the three datasets when the noise power is large (σ_v close to 1). For example, when σ_v = 1.0 (SNR = 0 dB), the accuracy is improved by more than 20%, from 77.7%, 66.9%, and 66.9% to 99.5%, 89.1%, and 89.6% for MNIST images, CIFAR10, and MNIST stroke sequences, respectively. Compared horizontally, the accuracy for MNIST images stays almost the same (99.5% vs 99.6%) when σ_v = 1.0 (SNR = 1 in value, or 0 dB). Noise-injected training without voting has such high accuracy only when σ_v ≤ 0.2 (SNR = 25 in value, or 14 dB), so a 14 dB SNR gain is realized by voting. Similarly, the tolerable σ_v increases from 0.4 to 0.8 (a 6 dB SNR gain) for the DenseNet on CIFAR10 if an accuracy loss of about 1.5% is acceptable, and from 0.4 to 0.6 (a 3.5 dB SNR gain) for the LSTMs on MNIST stroke sequences under a similar accuracy loss. Such SNR gains further relax the requirements on neuromorphic circuit designs and enable them to reach accuracy competitive with GPU/CPU-centered digital computations.

2. We observe in Table 6 and Table 9 that with noise-injected training, and even more so with voting, noisy inference improves accuracy over noiseless inference for LSTM-based recurrent neural networks, which is counterintuitive since it is often believed that noise during inference is harmful rather than helpful to performance. The improvement from noisy inference is also confirmed by independently training multiple LSTM models.

σ_v 0.0 0.2 0.4 0.6 0.8 1.0
No voting 99.6% 99.5% 99.3% 94.6% 86.1% 77.7%
Best voting 99.6% 99.6% 99.6% 99.6% 99.6% 99.5%
By σ_tr 0.2 0.5 0.5 0.5 0.8 0.9
Table 7: Accuracy of noisy inference by voting for multiplicative noise models for MNIST images.
σ_v 0.0 0.2 0.4 0.6 0.8 1.0
No voting 92.8% 92.4% 91.5% 85.6% 75.7% 66.9%
Best voting 92.8% 92.7% 92.7% 92.8% 91.3% 89.1%
By σ_tr 0.1 0.1 0.2 0.3 0.6 0.8
Table 8: Accuracy of noisy inference by voting for multiplicative noise models for CIFAR10.
σ_v 0.0 0.2 0.4 0.6 0.8 1.0
No voting 95.6% 95.7% 94.3% 87.3% 77.0% 66.9%
Best voting 95.6% 96.1% 95.9% 94.6% 92.7% 89.6%
By σ_tr 0.3 0.3 0.4 0.6 0.7 0.8
Table 9: Accuracy of noisy inference by voting for multiplicative noise models for MNIST stroke sequences.
Figure 8: Validation accuracy of the 6-layer CNN for MNIST with multiplicative Gaussian noise for different (σ_tr, σ_v) pairs. (a) One single inference. (b) Inference by voting with 20 runs.
Figure 9: Validation accuracy of the depth-40 DenseNet for CIFAR10 with multiplicative Gaussian noise for different (σ_tr, σ_v) pairs. (a) One single inference. (b) Inference by voting with 20 runs.
Figure 10: Validation accuracy of the LSTM-based RNN for MNIST stroke sequences with multiplicative Gaussian noise for different (σ_tr, σ_v) pairs. (a) One single inference. (b) Inference by voting with 20 runs.

Having observed the benefits of the voting mechanism, we now discuss how to mitigate its weakness in practical applications, since latency and power consumption increase when a prediction is made from multiple runs of noisy inference. Fig. 11 shows the trade-off between the number of votes and the accuracy for MNIST images and CIFAR10, respectively. It can be seen that the accuracy improves quickly as the number of votes increases from one to a handful, in particular for large noise power, and the accuracy curve becomes almost flat beyond that. Fast convergence of the voting is particularly important for neuromorphic-circuit-based deep neural networks to keep the latency and power overhead small.

Figure 11: Trade-offs between number of votes and the accuracy of noisy inference for MNIST images (a) and CIFAR10 (b).

The latency overhead can be reduced in two ways. If chip area is not a bottleneck, then multiple inferences can be made in parallel from multiple copies of the neuromorphic-circuit-based neural network, resulting in zero latency overhead. Otherwise, the overhead can be minimized by pipelining multiple inferences on a single neuromorphic-circuit-based neural network. Each layer of a neural network can be implemented as one memory array, and instead of waiting for one inference to finish before the next one starts, images can be fed into the neuromorphic computing system at every time step so that all layers form a pipeline. Assume the number of layers is L, each layer contributes latency t, and the number of votes is v; then the overall latency with pipelining is (L + v − 1)·t (instead of L·v·t without pipelining). Compared to the latency of L·t for a single inference, the overhead factor (L + v − 1)/L is only slightly greater than 1 for large L, i.e., for the deeper neural networks that are becoming widely used in practice.
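A quick numerical check of the pipelining argument above (the numbers of layers and votes below are arbitrary examples):

```python
L, t, v = 40, 1.0, 20                       # layers, per-layer latency, votes
single    = L * t                           # one inference
serial    = L * v * t                       # v votes, one after another
pipelined = (L + v - 1) * t                 # v votes streamed through the pipeline
print(serial / single, pipelined / single)  # 20.0 vs 1.475 -> close to 1 for deep nets
```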

If a single inference costs power P, then the power consumption of v votes is v·P, which cannot be reduced by parallel computing or pipelining. However, P scales roughly with the square of the memory programming/sensing voltage, and that voltage is positively related to the SNR, which for multiplicative noise is the inverse of σ_v². Therefore, a smaller operating voltage brings smaller power consumption per inference at the cost of a larger σ_v. The trade-off between power consumption and accuracy, connected through the operating voltage and σ_v, is complicated and beyond the scope of this paper.

4 Noisy Inference as a Defensive Mechanism for Black-box Adversarial Attacks

We show in this section that noisy inference can enhance adversarially trained neural networks against black-box adversarial attacks. Noisy inference provides stochastic gradients, which has been shown in CarliniObfuscated18 to provide a false sense of security against white-box attacks. However, we observe that the transferability of adversarial examples from one neural network to another becomes weaker if the network used for prediction performs noisy inference. We focus on attacks where all pixel values of an adversarial example lie within an ε-cube around a natural example. Our experiments are set up as follows.

1. We use two datasets, MNIST images and CIFAR10, with all pixel values normalized to between 0 and 1.

2. The adversarial attacks are iterative fast gradient sign methods (FGSM) with projected gradient descent (PGD) and multiple restarts from a random image within the ε-cube of a natural image, which has been shown to be among the strongest attacks based on first-order information madry2018towards . The detailed parameters of the attacks are listed in Table 10, and a minimal sketch of the attack is given after this list.

MNIST CIFAR10
ε 0.3
Adv. learn. rate 0.05
Iterations 20 7
No. restarts 20 10
Table 10: Parameters of the adversarial attacks (iterative FGSM).

3. Two neural networks are noiselessly and adversarially trained madry2018towards , where each mini-batch of size 32 consists of equal numbers of natural and adversarial samples. One neural network is used to generate the adversarial examples and the other is used to validate the accuracy on those adversarial examples. For MNIST, the generating network is a 4-layer CNN as in madry2018towards and the validating network is the 6-layer CNN presented in Table 1; both are adversarially trained for 100 epochs. For CIFAR10, the generating network is the depth-40 DenseNet with growth rate 12 and the validating network is a depth-100 DenseNet with growth rate 12; both are adversarially trained for 300 epochs.
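For reference, here is a rough PyTorch sketch of the iterative FGSM / PGD attack with random restarts mentioned in item 2 above; the helper name and the device handling are our own, and it uses only the generating (source) model, never the gradients of the validating model.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.3, alpha=0.05, iters=20, restarts=20):
    """Iterative FGSM with projected gradient descent inside the eps-cube of x."""
    best_adv = x.clone()
    best_loss = torch.full((x.size(0),), -float("inf"), device=x.device)
    for _ in range(restarts):
        # random start inside the eps-cube, clipped to the valid pixel range [0, 1]
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
        for _ in range(iters):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()                  # FGSM step
                x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
        with torch.no_grad():
            per_example = F.cross_entropy(model(x_adv), y, reduction="none")
            better = per_example > best_loss                         # keep worst-case restart
            best_adv[better] = x_adv[better]
            best_loss[better] = per_example[better]
    return best_adv
```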

Table 11 shows the average validation accuracy for MNIST and CIFAR10 under black-box attacks with 20 runs of noisy inference. It can be observed that the robustness of the adversarially trained networks (the 6-layer CNN and the depth-100 DenseNet) against the attacks is further enhanced from 97.40% to 97.90% and from 66.52% to 67.65%, respectively. These 0.5% and 1.13% improvements are achieved by multiplicative noisy inference at moderate noise levels σ_v.

MNIST CIFAR10
97.40% 66.52%
97.79% 67.65%
97.90% 67.43%
97.71% 66.57%
97.00% 63.39%
Table 11: Accuracy of black-box attacks on MNIST and CIFAR10 for different inference noise levels .

5 Conclusion

In this paper, we propose training with injected noise and inference with voting, which imbue neural networks with much greater resilience to imperfect computations during inference and have potential applications in deep learning with ultra-low power consumption. Three examples of neural network architectures show remarkable accuracy improvements from these two methods. With strong noise (σ_v = 1.0, i.e., SNR = 0 dB), the two methods combined improve accuracy from 21.1%, 29.9%, and 15.5% to 99.5%, 89.1%, and 89.6% for the three datasets, respectively. With low-to-medium noise power (e.g., σ_v = 0.4), noise-injected training by itself improves accuracy from 93.5%, 88.5%, and 49.8% to 99.3%, 91.5%, and 94.3% for the three datasets, at no cost in latency or power consumption.

A further study of black-box attacks against neural networks shows 0.5% and 1.13% enhancements for MNIST and CIFAR10, brought by noisy inference on adversarially trained neural networks.

References