1. Introduction
Deep Neural Networks (DNNs) have achieved superhuman performance in various perception tasks and have become one of the most popular solutions for these applications. Thus, there is an obvious trend toward deploying DNNs on edge devices such as automobiles, smartphones, and smart sensors. However, implementing computationally intensive DNNs directly on edge devices is a significant challenge due to the limited computation resources and constrained power budgets of these devices. Moreover, most DNN accelerator designs are confined to a design space where the researchers only consider conventional von Neumann architectures (e.g., GPUs, mobile CPUs, or FPGAs) as candidate platforms. In von Neumann architectures, data movement inevitably becomes the bottleneck for system efficiency, due to the well-known memory wall: the computational unit must fetch and store data from the memory hierarchy.
Emerging device-based Compute-in-Memory (CiM) neural accelerators (ielmini2018memory) offer a great opportunity to break the memory wall through special architectural advantages. CiM architectures reduce data movement via in-situ weight data access (sze2017efficient). Highly efficient emerging devices (e.g., RRAMs, STT-RAMs, and FeFETs) can be devised to offer higher energy efficiency and higher memory density than traditional MOSFET-based designs (shafiee2016isaac). However, such accelerators suffer greatly from design limitations. Non-ideal manufacturing processes induce uncertainties on emerging devices. These uncertainties, such as device-to-device (D2D) variations, thermal noise, and retention limitations, cause value changes: the weights in the actually deployed accelerator may differ from the desired weight values trained offline in data centers. This weight value change leads to performance degradation in actual accelerator implementations. As an illustration, we train four models, a multi-layer perceptron (MLP) and LeNet for MNIST, ResNet-56 for CIFAR-10, and VGG-19 for ImageNet, to state-of-the-art accuracy and deploy them on CiM simulation tools (feinberg2018making). As shown in Fig. 1, an accuracy degradation of close to 10% is observed in each model implementation.

The device uncertainty-induced performance degradation has been studied from different perspectives, including device-level observations (zhao2017investigation), architecture-level analysis (jiang2020device), and behavioral-level explorations (yan2020single). Finding suitable pairs of DNN models and hardware designs that together offer both desirable hardware reliability and high inference accuracy requires great effort.
Neural Architecture Search (NAS) (zoph2016neural; zoph2017learning; zeng2020towards) is one of the most successful efforts to address this issue. NAS liberates human labor from endlessly exploring optimal hand-crafted DNN models by automatically identifying neural architectures that can offer the desired performance from a predetermined search space. Co-exploration of neural architecture and hardware design (jiang2019accuracy; jiang2020standing; jiang2020hardware) pushes this concept further by incorporating hardware design specifications into NAS search spaces, so as to offer neural architecture-hardware design pairs that are accurate, efficient, and robust against hardware uncertainties.
In this work, we adopt a statistical analysis perspective to study the effect of device uncertainties on the performance of DNNs. We model the emerging device uncertainty as a whole as Gaussian noise on the weights and thoroughly investigate the behavior of different DNN models under such uncertainties. We conduct a Monte-Carlo simulation-based analysis of the statistical behavior of the models under the influence of device uncertainties. We then abstract our analysis results to support NAS applications. The detailed contributions of this work are:

We propose a Monte-Carlo simulation-based experimental flow to measure the device uncertainty-induced perturbations to DNN models.

We then thoroughly investigate the behaviors of different DNN models under such perturbations and show that the value changes of their output vectors follow a Gaussian distribution.

To alleviate this effect, we propose UAE, a device uncertainty-aware NAS framework, to search for architectures that are more robust to device uncertainties.
Experimental results show that UAE offers 2.49% higher accuracy than NACIM (jiang2020device) with only 1.2x the search time. By further increasing search complexity, UAE reaches 6.39% higher accuracy than NACIM with 2.5x the search time.
2. Background
2.1. CiM DNN Accelerators
Researchers have proposed different crossbar-based CiM DNN accelerator architectures (shafiee2016isaac; chi2016prime) for efficient DNN inference. We assume an ISAAC-like architecture (shafiee2016isaac), in which the system is organized into multiple tiles, with crossbar arrays as the heart of each tile. The crossbar not only stores the synaptic weights but also performs dot-product computations. Each crossbar is dedicated to processing a set of neurons in a given DNN layer. The outputs of that layer are fed to other crossbars that are dedicated to processing the next layer. The computation in the crossbar is performed in the analog domain, and ADCs and DACs are used to convert signals between the analog-domain dot-product computation and the other digital-domain operations needed in DNN computation.
The crossbar is the key component of CiM DNN accelerators. As shown in Fig. 2, a crossbar can be considered a processing element for matrix-vector multiplication: matrix values (i.e., NN weights) are stored at the cross point of each vertical and horizontal line with resistive emerging devices such as RRAMs and FeFETs, and each vector value is propagated through horizontal data lines. In this work, we assume an RRAM-based design. The calculation in the crossbar is performed in the analog domain, but additional peripheral digital circuits are needed for other key NN operations (e.g., non-linear activation), so DACs and ADCs are adopted between different components.
Device-level limitations confine the application of crossbars. The precision of the ADCs and DACs limits the precision of the DNN activations for each layer, and the non-ideal characteristics of emerging devices impose noise on the weights of deployed DNNs.
2.2. Device Variation
In this work, we assume an RRAM-based crossbar design. RRAM devices suffer various types of faults due to manufacturing and run-time non-idealities. Noise sources that are directly relevant to crossbar-based designs include thermal noise, shot noise, random telegraph noise (RTN), and programming errors (feinberg2018making). When the circuitry is used for inference, programming errors due to device-to-device variations can be the dominant error source.
Write-and-verify (alibart2012high; niu2012low; xu2013understanding) is a simple, accurate, and widely used programming scheme for RRAMs. The key operation is to iteratively apply a series of short pulses and check the difference between the current and target resistance, converging progressively on the target resistance. When deploying accelerators for neural network inference, this time-consuming process is tolerable because, once programmed, no further modifications to the resistance are needed during the entire life span of the accelerator. Although this scheme pulls the D2D variation-induced error down to less than 1%, a significant accuracy drop can still be observed under the conditions shown in Fig. 1.
2.3. Neural Architecture Search
Neural Architecture Search (NAS) has achieved state-of-the-art performance in various perceptual tasks, such as image classification (yang2020co_1; yang2020co_2), inference security (bian2020nass), and image segmentation (yan2020ms). NAS is becoming increasingly successful mainly because it liberates human design labor by automatically identifying high-performance neural architectures. Co-exploration of neural architecture and hardware design (jiang2019accuracy; jiang2020standing; jiang2020hardware) pushes this concept further by incorporating hardware design specifications into NAS search spaces, so as to offer neural architecture-hardware design pairs that are accurate, efficient, and robust against hardware uncertainties.
Formally speaking, NAS deals with the following problem: given a perceptual task $T$, a human-defined design space $S$, and a set of figures of merit (FOM) $F$, find the neural architecture in $S$ that offers optimal performance (in terms of the FOM in $F$) on task $T$.
A typical reinforcement learning (RL)-based NAS framework that solves this problem, such as the one proposed in (zoph2016neural), is composed of three key components: a controller, a trainer, and an evaluator. In one iteration (named an episode) of RL-based NAS, (1) the controller generates a neural architecture from the design space; (2) the trainer builds the generated neural architecture into a DNN model, named the child network, and trains the child network on a held-out training dataset; (3) the evaluator collects the figures of merit (FOM), e.g., the test accuracy of the trained child network on the test dataset, its latency, and/or its energy consumption; and (4) the controller uses a user-defined reward function to calculate a reward from the FOM collected by the evaluator and uses this reward to update itself so that it can predict neural architectures with higher FOM.

This iterative method terminates under two circumstances: (1) the controller repeatedly predicts the same child network; or (2) the number of predicted architectures exceeds a predefined threshold (episode limit). The child network that offers the highest reward among all generated neural architectures is presented as the search result. The chosen neural architecture is then retrained on the training dataset for a longer time to offer optimal performance.
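The episode loop can be sketched as follows. The controller, trainer, and evaluator here are trivial stand-ins (random sampling and a synthetic figure of merit) intended purely to illustrate the control flow, not the actual framework's API:

```python
import random

random.seed(0)

# a tiny stand-in search space (real spaces are far larger and layer-wise)
SEARCH_SPACE = {"channels": [24, 36, 48, 64], "filter": [1, 3, 5, 7]}

def controller_sample():
    # stand-in for the LSTM controller: predict one architecture
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def train_and_evaluate(arch):
    # stand-in for trainer + evaluator: a synthetic figure of merit
    return 1.0 / (1 + abs(arch["channels"] - 48) + abs(arch["filter"] - 3))

best_arch, best_reward = None, float("-inf")
for episode in range(20):                 # episode limit
    arch = controller_sample()            # (1) controller predicts a child network
    reward = train_and_evaluate(arch)     # (2)+(3) train it and collect the FOM
    # (4) a real controller would update its policy with this reward here
    if reward > best_reward:
        best_arch, best_reward = arch, reward
```

The architecture with the highest reward seen across episodes is reported as the search result, mirroring the termination behavior described above.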
More recently, differentiable NAS (liu2018darts; wu2019fbnet; li2020edd) has achieved state-of-the-art performance with a much-reduced search time by transforming the search process into training an over-parameterized neural network. However, these approaches suffer from limited flexibility. More specifically, in the field of research considered in this paper, differentiable NAS struggles to handle large search spaces with multiple hardware design parameters and complex designs where the number of channels varies per layer. Thus, in this work, we adopt RL-based NAS as our search algorithm.
3. Uncertainty Modeling
3.1. Uncertainty Model
In this work, we model device uncertainties as a whole and use a Gaussian distribution to represent them (jiang2020device; feinberg2018making). We set the mean of the uncertainty distribution to zero and its variance to 0.04, and the uncertainty of each device is independently distributed. These values follow (zhao2017investigation), where the uncertainties are measured from actual physical devices. For easier reference in the rest of this paper, the uncertainty model is depicted as:
$\hat{W} = W + N, \quad N \sim \mathcal{N}(0, 0.04) \qquad (1)$

where $N$ is a Gaussian variable that is independent and identically distributed over each individual element of the weights, and $W$ and $\hat{W}$ are the expected weights trained in the data center and the actual weights deployed on the accelerators, respectively.
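As a minimal sketch of this model (NumPy; the weight shape and the single global generator are arbitrary illustrative choices), sampling one deployed-weight instance might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_weights(w, std=0.2):
    """Sample one deployed-weight instance per Eq. 1: w_hat = w + N,
    with N ~ N(0, std^2) i.i.d. per device (variance 0.04 -> std 0.2)."""
    noise = rng.normal(loc=0.0, scale=std, size=w.shape)
    return w + noise

w = rng.normal(size=(64, 64))   # hypothetical trained weight matrix
w_hat = perturb_weights(w)
```

Each call draws a fresh noise tensor, so repeated calls simulate different deployed instances of the same trained model.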
3.2. Effects on DNN Outputs
In this work, we focus on the impact of device uncertainty on classification tasks, for two reasons: (1) most emerging device-based DNN accelerators target classification tasks, so analyzing the effect of device uncertainties on these tasks helps the majority of researchers improve their work; and (2) DNNs for classification tasks are typically composed of convolution layers and fully connected layers, which are also the basic components of DNNs targeting other applications, so the effect of device uncertainties on these two components is essential for all types of DNNs.
We start by understanding the effect of device uncertainties on the output of a DNN model. Formally speaking, a DNN model $M$ can be defined as a combination of its neural architecture and its trained weights. Thus, the inference process of a DNN model with input $I$ that generates an output $O$ can be defined as:

$O = f(A, W, I) \qquad (2)$

where $A$ is the neural architecture of $M$, $W$ is its weights, $I$ is the input vector, and $O$ is the output vector.
During training, $O$ is passed through a loss function, where a version of $O$ after SoftMax is compared with the ground-truth classification label to generate a loss for the back-propagation process. During inference, the final predicted class can be calculated by $\arg\max(O)$, the index of the item in $O$ that has the maximum value.

Although in inference the classification result is the final outcome of a DNN model, the output vector $O$ serves as a better representative of the behavior of the model. The classification result is only the index of the maximum value of $O$ and is thus only a simplified, discrete proxy of $O$. The continuous, multi-dimensional vector $O$ contains more information than the classification result. In order to understand how uncertainties in the weights affect the network, it is of crucial importance to understand how they affect $O$.
As defined in Eq. 1 and Eq. 2, a deployed neural network under the effect of device uncertainties can be depicted as:

$\hat{O} = f(A, W + N, I) \qquad (3)$

where $W$ is the trained weight value of the neural network to be deployed, $N$ is one sample from the noise distribution, and $\hat{O}$ is the affected output.
We analyze the distributional behavior of the effect of device uncertainties on the output vector of a DNN model. To conduct this analysis, we (1) train a DNN model to convergence and collect its trained weights $W$; (2) fix one input image $I$ and collect its output on the trained weights $W$, which we denote as the original output $O$; (3) sample many different instances of noise $N$, feed them to Eq. 3, and collect the corresponding output vectors $\hat{O}$; and finally (4) calculate the output change using Eq. 4:

$\Delta O = \hat{O} - O \qquad (4)$

where the output change $\Delta O$ is the element-wise subtraction of the perturbed output and the original output.
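The four-step flow can be sketched end to end with a toy stand-in for the network (NumPy; the single linear layer and all shapes are illustrative assumptions, not a model from our experiments):

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(w, x):
    # toy stand-in for f(A, W, I): a single linear layer
    return w @ x

# step (1): "train" a model -- here, just fix some weights W
w_trained = rng.normal(size=(10, 32))
# step (2): fix one input image I and record the original output O
x = rng.normal(size=32)
o_orig = forward(w_trained, x)

# step (3): sample many noise instances and collect perturbed outputs (Eq. 3)
k, std = 10_000, 0.2
delta_o = np.empty((k, 10))
for i in range(k):
    noise = rng.normal(scale=std, size=w_trained.shape)
    o_hat = forward(w_trained + noise, x)
    # step (4): output change (Eq. 4)
    delta_o[i] = o_hat - o_orig
```

Each row of `delta_o` is one Monte-Carlo sample of the output change for the same input image, which is exactly the data the distributional analysis below operates on.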
3.3. Experimental Results
In order to get a first glance at the statistical behavior of the output change, following the workflow introduced in Sect. 3.2, we train a LeNet model for the MNIST dataset (lecun1998gradient) to state-of-the-art accuracy. We then randomly choose one input image from the test dataset and sample 10k different instances of noise. With this setup, we gather 10k different output change vectors.
The output change $\Delta O$ is a vector of length 10, with each element representing the change in confidence of classifying the input image into one particular category. Because a high-dimensional vector is not a good subject for analytical study and visualization, we analyze each element of these vectors separately: we study the statistical behavior of each element across the different vectors and gather 10 instances of distribution data.

Surprisingly, each element of the output change follows a Gaussian distribution. To visualize this finding, we plot the histogram of the distribution of each element of the output change vector together with the Gaussian distribution that fits it. The visualization for the first element, shown in Fig. 3, demonstrates that the first element of the output change vector is a Gaussian variable.
To verify this observation, we test various networks on various datasets. With the MNIST dataset, we analyze both LeNet and 2-layer multi-layer perceptrons (MLPs) with ReLU and Sigmoid activations. With the CIFAR-10 dataset (krizhevsky2009learning), we test a conventional floating-point CNN, a quantized CNN, and two ResNets, ResNet-56 and ResNet-110. We also train these models with 3 different initializations to get different trained weights.

We evaluate how well these output change vectors fit Gaussian variables by two widely used standards: mean square error (MSE) and the Chi-square ($\chi^2$) test. MSE can be described as:

$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (p_i - q_i)^2 \qquad (5)$

and the $\chi^2$ test can be depicted as:

$\chi^2 = \sum_{i=1}^{n} \frac{(p_i - q_i)^2}{q_i} \qquad (6)$

where $p_i$ and $q_i$ are the observed (output change $\Delta O$) and estimated (Gaussian) probability values of the $i$-th bin, and $n$ is a user-defined binning granularity, chosen so that it is precise enough given a total of 10k instances of data.

Model         Dataset     χ² (×10⁻²)   MSE (×10⁻⁶)

MLP-ReLU      MNIST       5.22         3.20
MLP-Sigmoid   MNIST       5.81         2.20
LeNet         MNIST       4.59         2.67
Float-Conv    CIFAR-10    7.01         3.03
Fixed-Conv    CIFAR-10    6.79         2.74
ResNet-56     CIFAR-10    4.56         1.79
ResNet-110    CIFAR-10    4.81         2.01
The evaluation results are shown in Table 1. Note that we test 3 different initializations for each model, and for each model $\Delta O$ is a vector of length 10; the results shown in Table 1 are averaged over them. For every model tested, the $\chi^2$ test results are all below 0.1 and the MSE values are all below $10^{-5}$, which indicates that the output changes fit Gaussian distributions well. Moreover, neither error increases when the model is extremely shallow (e.g., the 2-layer MLP) or very deep (ResNet-110), so this observation generalizes across different DNN models.
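A hypothetical implementation of both fit metrics over binned samples might look like the following; the bin count (the paper's user-defined granularity $n$) is set to an assumed value of 100 here, and the Gaussian is fitted by sample mean and standard deviation:

```python
import math
import numpy as np

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def fit_metrics(samples, n_bins=100):
    """Bin the samples and compare observed bin probabilities p_i against
    the probabilities q_i of a Gaussian fitted by sample mean/std."""
    samples = np.asarray(samples)
    mu, sigma = samples.mean(), samples.std()
    counts, edges = np.histogram(samples, bins=n_bins)
    p = counts / counts.sum()                           # observed probabilities
    q = np.array([gaussian_cdf(edges[i + 1], mu, sigma)
                  - gaussian_cdf(edges[i], mu, sigma) for i in range(n_bins)])
    mse = np.mean((p - q) ** 2)                         # Eq. 5
    chi2 = np.sum((p - q) ** 2 / np.maximum(q, 1e-12))  # Eq. 6
    return mse, chi2

rng = np.random.default_rng(2)
mse, chi2 = fit_metrics(rng.normal(size=10_000))  # truly Gaussian samples fit well
```

Running this on genuinely Gaussian samples, as above, gives small values of both metrics, which is the behavior Table 1 reports for the measured output changes.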
The study of each of these models supports the previous observation that the values of their output change vectors follow Gaussian distributions. Based on these studies, we can claim that, with any independent and identically distributed Gaussian noise on the weights, the output vector of the same input image follows a multi-dimensional Gaussian distribution.¹

¹ Note that the elements of the output are deeply correlated, not independent, over different samples of noise.
This is a very strong claim but is not counterintuitive. The output of the first convolution layer is the summation of products of deterministic inputs and Gaussian-distributed weights, and is thus a summation of Gaussian variables. The summation of Gaussian variables is also a Gaussian variable, so the output of the first layer is a Gaussian variable. After activation, the input of the second layer is a transformed Gaussian variable, and after propagating through this layer, with a sufficient number of operands, the accumulated variable can be approximated by a Gaussian variable. Thus, although the final output may not strictly be a Gaussian variable, a Gaussian approximation can be observed.
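This reasoning can be checked numerically. The sketch below (a toy two-layer ReLU network with assumed shapes, not a model from our experiments) pushes one fixed input through the network under many i.i.d. Gaussian weight-noise samples, then checks that the first output element has near-zero skewness and excess kurtosis, as a Gaussian would:

```python
import numpy as np

rng = np.random.default_rng(3)

# a toy two-layer network with fixed ("trained") weights and one fixed input
x = rng.normal(size=64)
w1 = rng.normal(size=(64, 64)) / 8.0
w2 = rng.normal(size=(10, 64)) / 8.0

samples = np.empty(5000)
for i in range(5000):
    n1 = rng.normal(scale=0.2, size=w1.shape)  # i.i.d. Gaussian weight noise
    n2 = rng.normal(scale=0.2, size=w2.shape)
    h = np.maximum((w1 + n1) @ x, 0.0)         # layer 1 + ReLU
    samples[i] = ((w2 + n2) @ h)[0]            # first element of the output

# standardize and inspect the third/fourth moments of the output samples
z = (samples - samples.mean()) / samples.std()
skew = np.mean(z**3)
excess_kurt = np.mean(z**4) - 3.0
```

Despite the non-linearity between the layers, the accumulated output is close to Gaussian, matching the approximation argument above.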
4. Uncertainty Aware Search
4.1. Methodology
In addition to understanding the effect of device uncertainties, we propose a remedy that reduces this effect by adopting NAS.
In this work, we propose Uncertainty Aware sEarch (UAE), a more comprehensive uncertainty-aware NAS framework for better exploration of neural architectures on CiM NN accelerators built from non-ideal emerging devices.
Similar to the state-of-the-art reinforcement-learning-based NAS framework NACIM (jiang2020device), UAE works iteratively, and in each iteration: (1) an LSTM-based controller learns and generates neural architectures; (2) an uncertainty-aware trainer trains each generated neural architecture to get a model to deploy; (3) an uncertainty-aware evaluator evaluates the actual performance of the deployed model; and (4) the evaluated performance is used as a reward to update the controller so that it generates neural architectures with higher rewards.
The detailed implementations of the uncertainty-aware trainer and evaluator are described below.
4.2. UncertaintyAware Training & Evaluation
We adopt an uncertainty-aware training scheme similar to the one used in NACIM (jiang2020device). The training process is organized like traditional DNN training: in each iteration, a subset (batch) of the training data is used to train the model, and after the whole training dataset has been used, one epoch of training is finished and another is started. The trainer trains the model for multiple epochs to obtain a trained model.
The uncertainty-aware training augments the training process of each batch to learn a DNN model that is more robust against device variations. In each training batch, before feeding the input into the model, the trainer (1) saves the original weights $W$ of the model; (2) samples a noise instance from the uncertainty distribution and adds it to the weights to form a perturbed $\hat{W}$; (3) performs forward inference and back-propagation on the perturbed model and collects the gradient of each weight; and (4) loads the saved $W$ back into the model and updates $W$ with the collected gradients via stochastic gradient descent.
Uncertainty-aware training simulates the process of training DNNs directly on CiM-based accelerators. Experiments in NACIM show that uncertainty-aware training learns DNN models that are robust against device uncertainties.
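One batch of the four-step procedure can be sketched with a toy linear model and plain SGD (NumPy; the quadratic loss, shapes, and learning rate are illustrative assumptions, not the actual training setup):

```python
import numpy as np

rng = np.random.default_rng(4)

def grad_mse(w, x, y):
    # gradient of 0.5 * ||w @ x - y||^2 with respect to w
    return np.outer(w @ x - y, x)

w = rng.normal(size=(10, 32))                    # current model weights
x, y = rng.normal(size=32), rng.normal(size=10)  # one (toy) training batch
lr, noise_std = 0.01, 0.2

w_saved = w.copy()                                  # (1) save original weights
noise = rng.normal(scale=noise_std, size=w.shape)
w_perturbed = w + noise                             # (2) inject sampled device noise
g = grad_mse(w_perturbed, x, y)                     # (3) backprop through perturbed model
w = w_saved - lr * g                                # (4) restore W, then apply SGD update
```

The key design point is that the gradient is computed on the perturbed weights but applied to the saved clean weights, so the learned $W$ becomes robust to the noise rather than absorbing any single noise sample.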
The uncertainty-aware evaluation is performed similarly to the training process. Before evaluation, the evaluator samples an instance of noise from the uncertainty distribution and adds it to the trained weights $W$ to get a perturbed $\hat{W}$. The evaluator then evaluates the classification accuracy of the perturbed model on a test dataset. This process is performed multiple times, and the resulting accuracy values are gathered. The evaluator then reports one distributional property (e.g., mean, maximum value, 95% minimum value) of the accuracy data as the reward. The distributional property to be used is specified by the user.
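The evaluator's multi-sample procedure can be sketched as follows; the accuracy function and all names here are hypothetical placeholders (a synthetic score standing in for deploying the perturbed model on a test set), and the two distributional properties shown are the mean and the 95% minimum:

```python
import numpy as np

rng = np.random.default_rng(5)

def accuracy_under_noise(w_trained, noise_std=0.2):
    # stand-in for deploying (W + noise) and measuring test accuracy;
    # we fake a score that degrades with the sampled noise magnitude
    noise = rng.normal(scale=noise_std, size=w_trained.shape)
    return max(0.0, 0.85 - np.abs(noise).mean())

w_trained = np.zeros((10, 32))
k = 100                                        # number of noise instances sampled
accs = np.array([accuracy_under_noise(w_trained) for _ in range(k)])

reward_mean = accs.mean()                      # average-case behavior
reward_95 = np.percentile(accs, 5)             # 95% of runs do at least this well
```

The mean reward favors architectures that do well on average, while the 95% minimum reward favors architectures whose worst-case deployments remain acceptable.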
4.3. Experimental Results
We demonstrate the effectiveness of UAE by searching for an optimal quantized CNN for CIFAR-10. The fixed design parameters and the hyper-parameters included in the search space are shown in Table 2. For the device uncertainty specifications, we assume an ISAAC-like (shafiee2016isaac) neural accelerator architecture and a four-bit RRAM device, whose behavioral model is extracted from (zhao2017investigation). The search process is conducted on a GPU server with an Nvidia GTX 1080 Ti accelerator.
Hyper-parameter        Value choices
Dataset                CIFAR-10
Type                   Quantized CNN
# of Conv layers       6
# of FC layers         2
FC hidden size         1024
# of channels          (24, 36, 48, 64)
Filter height/width    (1, 3, 5, 7)
# of integer bits      (0, 1, 2, 3)
# of fraction bits     (0, 1, 2, 3, 4, 5, 6)
As described in Sect. 4.2, there are two major search parameters: the number of noise instances sampled for each architecture, and the distributional property used to form the accuracy data collected by the evaluator into a reward. We test two different sample counts, 5 and 100, for reasons that will be explained below. We also use two different distributional properties: one is the mean value of all accuracy data (mean) and the other is the 95% minimum of the accuracy data (95). The mean value indicates how a model performs under the effect of device uncertainty in average circumstances, and the 95% minimum shows the model's behavior in worst-case scenarios.
We compare different specifications of UAE against two baseline methods: QuantNAS (Lu2019Neural), a state-of-the-art NAS framework that searches for optimal quantized CNNs, and NACIM (jiang2020device), another uncertainty-aware search framework for CiM-based accelerators. In each experiment, the NAS controller searches for 2000 different architectures (episodes) and the trainer trains each generated architecture for 15 epochs.
The DNN models finally presented by each search framework are also evaluated by their mean and 95% minimum accuracy, with accuracy data collected over 10k Monte-Carlo simulations. The experimental results are shown in Table 3.²

² Because the data for QuantNAS and NACIM are collected from published work, we do not have the 95% minimum accuracy results for them.
Method                    # noise samples   w/o noise   mean     95%      Time (h)
QuantNAS (Lu2019Neural)          0           84.92%      8.48%   N/A        53
NACIM (jiang2020device)          1           73.88%     73.45%   N/A        98
UAE-M                            5           77.48%     75.94%   75.55%    118
UAE-M                          100           82.99%     79.84%   77.82%    255
UAE-95                         100           80.64%     78.39%   77.98%    255
Experimental results show that, without uncertainty-aware training, QuantNAS can identify an optimal DNN model that offers close to 85% test accuracy, but it struggles to find neural architectures that are robust to device uncertainties: the test accuracy of the DNN model identified by QuantNAS drops to 8.5%, worse than random guessing (10%). With the help of uncertainty-aware training, NACIM can identify DNN models that are robust against the impact of device uncertainties. However, as NACIM evaluates the performance of each searched architecture only once, the randomness of the device uncertainty hinders its performance, resulting in a model with only 73.45% average accuracy. UAE, on the contrary, finds DNN models that are both accurate and robust against the impact of device uncertainties. When collecting only 5 samples in uncertainty-aware evaluation, UAE achieves 2.49% higher accuracy than NACIM with a time overhead of only 20%. When collecting 100 samples, UAE achieves 6.39% higher accuracy than NACIM with a search time overhead of 2.5x. Though further increasing the number of samples is possible, the search time overhead would become too large to endure. The adoption of the 95% minimum standard is also effective: UAE-95 offers 0.16% higher worst-case accuracy than its UAE-M counterpart, at the cost of 1.45% lower average accuracy.
5. Conclusions and Future Works
In this work, we propose a Monte-Carlo simulation-based experimental flow to measure the device uncertainty-induced perturbations to DNN models. We thoroughly investigate the behaviors of different DNN models under such perturbations and show that the value changes of their output vectors follow a Gaussian distribution. We also propose UAE, a device uncertainty-aware NAS framework that identifies DNN models that are both accurate and robust against device uncertainty-induced perturbations. Based on the observations made on the impact of device uncertainties on DNN models, possible future directions include a formal mathematical proof of the analyzed statistical behaviors and a time-efficient estimation method for the impact of device uncertainties.
Acknowledgment
This work is supported in part by the National Science Foundation under grant CNS-1919167.