1 Introduction
Deep neural networks (DNNs) are widely used today in many applications, such as image classification and object detection. High throughput and high energy efficiency are two of the pressing demands on DNNs. However, current computation mechanisms are not the best choices for DNNs in terms of efficiency. As DNN models become increasingly complex, the computation process takes a lot of time and energy [12]. In traditional von Neumann architectures, memory is separated from computation. To compute an output feature map, we need to read the input feature map and weights from the memory units, send them to the computation module to compute, and then write the results back to the memory units. During the whole process, the system spends a large portion of its energy and time on data movement [35]. This is made worse by advances in process technology, which make the relative distances involved even longer and thus more costly.
Emerging memory technologies (EMT), including PCRAM, STT-RAM, and FeRAM [18], promise better density and energy efficiency, especially since many are nonvolatile and well suited to read-mostly applications. In-memory deep learning using EMT cells, especially in analog mode, has already demonstrated an order of magnitude better performance density and two orders of magnitude better energy efficiency than traditional deep learning on the MNIST dataset [36] [2] [17]. As in-memory deep learning integrates computation with the memory operations [26], the computation results can be read directly from the memory modules using a single instruction. This is different from traditional deep learning, where the memory operations are executed separately from the computation operations. Computing where data is stored reduces the need to move large amounts of data around frequently. Especially as technology scales and on-chip distances become longer, we can expect substantial savings of time and energy if emerging memory technology is used in the in-memory deep learning paradigm.

A big challenge of in-memory computing is the instability of EMT cells [24] [16]. Unlike traditional memory technology, where the data are stored in stable memory cells, data stored in analog-mode EMT cells may fluctuate, and different values can be output. For instance, suppose we store the weight $w$ in an EMT cell. When we read it back, the output may become $w + \Delta w$ instead of $w$. Here $\Delta w$ is a fluctuating amplitude of that memory cell. Because of this instability, in-memory deep learning using EMT, especially in an analog manner, may make incorrect classifications [6]. This can severely limit its application in the real world.
Another challenge of in-memory deep learning is the ineffectiveness of traditional energy-reduction techniques, such as pruning and quantization. Pruning reduces energy consumption by decreasing the number of operations [35] [9]. However, the energy of in-memory deep learning is roughly proportional to the weight values. Pruning usually removes the weights with smaller absolute values, which contribute only a small portion of the energy consumption. Quantization reduces energy consumption by decreasing the complexity of operations using low-precision data [32] [33]. Unfortunately, in EMT cells, low-precision data consumes almost the same amount of energy as high-precision data. Therefore, traditional pruning and quantization techniques are less effective for in-memory deep learning.
In this paper, we solve these problems by proposing three techniques. They can effectively recover the model accuracy lost in in-memory deep learning and minimize energy consumption. Our innovations are:

Device-enhanced Dataset: We propose augmenting the standard training dataset with device information. In this way, the models also learn the fluctuation patterns of the memory cells. This extra information can also help the optimizer avoid overfitting during training and thus improve the model’s accuracy.

Energy Regularization: We optimize the energy coefficients of memory cells and the model parameters by adding a new term to the loss function. This new term represents the energy consumption of the model. The optimizers can automatically search for the optimal coefficients and parameters, improving the model accuracy and energy efficiency at the same time.

Low-fluctuation Decomposition: We decompose the computation into several time steps. This is customized for in-memory deep learning and can compensate for the fluctuation of memory cells. This decomposed computation mechanism can improve both model accuracy and energy efficiency.
We prove the effectiveness of all these techniques both theoretically and experimentally. They can substantially improve the model accuracy and energy efficiency of in-memory deep learning models. We organize the rest of the paper as follows: Section 2 reviews related work; Section 3 presents background knowledge; Section 4 introduces our proposed optimization methods; Section 5 shows the experimental results; Section 6 concludes.
2 Related Work
The first category of work to improve the accuracy of in-memory deep learning models is binarized encoding. The information in each memory cell is digitized into one bit instead of being stored as a full-precision number. In other words, the data stored in each memory cell is either 0 or 1 [7]. Theoretically, one-bit data is more robust than a high-precision number at the same level of fluctuation. Several previous works used the one-bit design to compute binarized neural networks. Sun et al. used the single-bit cell to execute the XNOR operations in XNOR-Net [29]. Chen et al. [4] used the single-bit cell to execute the basic operations in binary DNNs. Tang et al. proposed a customized binary network for the single-bit cell [30]. However, both XNOR-Net and the binary neural network suffer a large accuracy drop compared with the full-precision model [25]. Recently, new progress in this direction is to store a high-precision weight using a group of single-bit memory cells. For example, Zhu et al. [39] used a group of single-bit memory cells to store a multi-bit weight. Such a method can increase the model accuracy because it increases the effective precision of the weights. Compared with the traditional design, it uses more memory cells.

The second category of work to improve the accuracy of in-memory deep learning models is weight scaling. Theoretically, we can reduce the amplitude of weight fluctuation by scaling up the weight values stored in the memory cell [20]. After computation, we scale the result down using the same scaling factor. In the literature, many works have found other physical ways to reduce the amplitude of weight fluctuation. For example, He et al. found that we can reduce the fluctuation amplitude by lowering the operation frequency [10]. Chai et al. found a material that has a lower fluctuation amplitude than other types of material [3]. However, these methods demand strict physical conditions. Compared with them, weight scaling is a more general method that can reduce the fluctuation amplitude of memory cells in most conditions [28]. However, Choi et al. found that although a memory cell using scaled weights shows a smaller fluctuation amplitude, it also consumes more energy [5]. Ielmini et al. modeled the relationship between the scaling factor, the fluctuation amplitude, and the energy consumption [11], which can help us find the optimal scaling factor for in-memory computation.
The third category of work to improve the accuracy of in-memory deep learning models is fluctuation compensation. To alleviate the instability, these works first read the memory cell multiple times and record statistics such as the mean and standard deviation [21]. Afterward, they calibrate either the model parameters or the model output directly based on those statistics [27]. This method is also widely used during the memory-cell programming stage. For example, Joshi et al. compensated the programming fluctuation by tuning the batch normalization layer parameters [14]. Alternatively, Zhang et al. compensated the programming fluctuation by offsetting the weight values [37]. These methods are effective in a static device environment. If we face a dynamic environment, a more general way is needed. One popular approach is to run many equivalent models in parallel and then average their results. Joksas et al. did this by applying committee machine theory to in-memory computing devices [13]. Wan et al. optimized this process by running a single model on the same device and reading the memory cells multiple times [31]. This method can average out the weight fluctuation and produce a more stable result.

3 Preliminaries
EMT-based in-memory deep learning can be very efficient [36] because of its analog operation. Fig. 1(a) shows the difference between the traditional and EMT memory cells. When we read a weight $w$ from a traditional memory cell, the input to the corresponding memory cell is $1$, meaning that the read request to that memory cell is enabled. Afterward, the memory cell returns $w$ as the output. An EMT cell for in-memory deep learning is quite different. When we read the weight $w$, the input to the memory cell is a variable $x$ instead of the fixed data $1$. The memory cell then returns $xw$ directly, the product of the input signal $x$ and the stored weight $w$. In other words, the EMT cell integrates the multiplication operation into the read operation.
Analog EMT is more efficient than traditional memory not only in multiplication operations but also in addition operations [19]. To better explain this, we show how traditional cells and EMT cells execute the multiply-accumulate (MAC) operation in Fig. 1(b) and Fig. 1(c), respectively. For traditional memory cells, we first read each weight from the corresponding memory cell. Afterward, the output is multiplied by the activation using a multiplier. Finally, we sum all the products from the multipliers together, either sequentially with a single adder or in parallel with a tree of adders. To achieve the same computation using EMT-based in-memory computing, we just need to connect the output of each memory cell to the same port. Physically, the sum of all the memory cell outputs can be obtained from that port directly. This is also referred to as a current sum.
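The contrast above can be sketched numerically. The following toy model is our own (the function names are hypothetical); it treats the analog current sum as a plain summation of per-cell products, which is what Kirchhoff's current law provides for free at the shared port:

```python
import numpy as np

def digital_mac(weights, activations):
    """Traditional flow: read each weight, multiply in a separate unit,
    then reduce the partial products with a sequential adder (or adder tree)."""
    products = [w * a for w, a in zip(weights, activations)]
    total = 0.0
    for p in products:
        total += p
    return total

def analog_current_sum(weights, activations):
    """EMT flow: each cell outputs x*w onto a shared line; the column port
    observes the sum of all cell outputs directly (the "current sum")."""
    cell_outputs = np.asarray(weights) * np.asarray(activations)
    return cell_outputs.sum()

w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]
assert digital_mac(w, x) == analog_current_sum(w, x)  # same math, different physics
```

The two functions compute the same MAC result; the point is that the analog version needs no explicit multipliers or adders, only the wiring of cell outputs to a common port.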
3.1 Challenges of Analog In-memory Deep Learning
The energy consumption of an EMT cell is much less than that of traditional memory cells when executing the MAC operation. There are important differences, however. In traditional memory cells, energy consumption is not related to the stored weight value. In analog EMT, it is proportional to the weight value [34], as shown in Fig. 2(a). We use a parameter $\alpha$ to denote this energy coefficient. This parameter is tunable, and we can use it to optimize energy consumption. Theoretically, a small coefficient $\alpha$ can help us improve the energy efficiency of models.
A big challenge in using emerging memory technology is that, regardless of the actual technology, the memory cells do not output stable results [24]. Physically, each memory cell has multiple states, and it changes its state randomly over time. Whenever we read the memory cell, it can be in any of the states. At the $k$-th state, the weight value read from the memory cell is $w_k(w, \alpha)$, where $w$ is the pre-stored weight and $\alpha$ is the energy coefficient. Given an input $x$, the output data becomes $x \cdot w_k$ instead of $x \cdot w$. In Fig. 2(b), we show an example of a memory cell with two states. In this example, each state has a 50% probability of showing. If the stored weight is $w$ and the energy coefficient is $\alpha$, the output result can be either $x \cdot w_1$ or $x \cdot w_2$ depending on the state, where $x$ is the input activation. The bottom three subgraphs in Fig. 2(b) correspond to memory cells with three different energy coefficients.

The fluctuation shown in Fig. 2(b) is a simplified example. In practical EMT cells, the number of fluctuation states and the probability of each state are more complicated. There are many works in the literature studying these fluctuations [8]. For deep learning, such a phenomenon can cause a non-negligible accuracy drop. It is the biggest challenge limiting the application of in-memory computing. As we can see from Fig. 2(b), the fluctuation amplitude, defined as the average distance among the state values $w_k$, is related to the energy coefficient $\alpha$. Theoretically, a higher $\alpha$ will result in less fluctuation of the weight and thus higher model accuracy, but it also means higher energy consumption of the memory cells. This trade-off is the focus of this paper.
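The accuracy/energy trade-off can be illustrated with a toy two-state cell. Everything here is an assumed placeholder (in particular the amplitude model `0.1 / alpha`, which simply encodes "higher coefficient, smaller fluctuation"); it is not a physical device model:

```python
import random

def read_cell(w, alpha, rng):
    """Sample one read of a hypothetical two-state EMT cell storing weight w.
    Placeholder amplitude model: fluctuation shrinks as alpha grows."""
    amplitude = 0.1 / alpha
    state = rng.choice([+1, -1])      # each of the two states shows with 50% probability
    return w + state * amplitude

rng = random.Random(0)
reads_low  = [read_cell(1.0, alpha=0.5, rng=rng) for _ in range(10000)]
reads_high = [read_cell(1.0, alpha=4.0, rng=rng) for _ in range(10000)]

spread = lambda xs: max(xs) - min(xs)
assert spread(reads_high) < spread(reads_low)   # larger alpha, tighter reads (but more energy)
```

A larger coefficient tightens the read distribution, which is exactly the accuracy side of the trade-off; the energy side is that each read then costs more.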
3.2 Incompatibility of Traditional Training Method
Fig. 3 shows the standard training process in deep learning. The loss function takes the dataset and the model and generates a measure of the distance between the current parameter values and their optimal values. The optimizer uses gradient descent to reduce this distance by updating the parameters. This process iterates for many epochs until the optimizer finds an optimal set of parameters. We define $X$ as the image data and $Y$ as the label data. Here $\mathcal{X}$ and $\mathcal{Y}$ denote the spaces for images $X$ and labels $Y$. For simplicity, we express a one-layer neural network model in Equation (1), where the weight $W$ and the bias $B$ are both trainable parameters.

$$Y = f(X) = WX + B \qquad (1)$$
Let us define the function class $\mathcal{F}$ as the search space of the function $f$. Let $Z$ be the combination of $X$ and $Y$, and $\mathcal{Z}$ be the combination of $\mathcal{X}$ and $\mathcal{Y}$. Define $\mathcal{D}$ as the unknown distribution of the data $Z$ on the space $\mathcal{Z}$. Given the above definitions, the loss function can be expressed as $\ell(f, Z)$. It is a mapping from the combination of $\mathcal{F}$ and $\mathcal{Z}$ to the real number space $\mathbb{R}$. Theoretically, the training process of the neural network is to find the optimal function $f^*$ from the function class $\mathcal{F}$ that minimizes the risk $R(f)$, i.e., the expectation of the loss function. We express it in Equation (2).

$$f^* = \arg\min_{f \in \mathcal{F}} R(f), \quad R(f) = \mathbb{E}_{Z \sim \mathcal{D}}[\ell(f, Z)] \qquad (2)$$
The difficulty in solving this problem is that the distribution $\mathcal{D}$ is unknown. What we have is just the dataset $\{Z_1, \dots, Z_n\}$, which consists of independent and identically distributed (i.i.d.) samples from the distribution $\mathcal{D}$. Alternatively, the traditional training process of the neural network is to find the optimal model $\hat{f}$ from the function class $\mathcal{F}$ that minimizes the empirical risk $\hat{R}(f)$, i.e., the average of the loss function on the sampled dataset. We express it in Equation (3).

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}(f), \quad \hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f, Z_i) \qquad (3)$$
The distance between the risk and the empirical risk is called the generalization error $\epsilon$, expressed in Equation (4). The crucial problem of the training process is to make sure that the generalization error can be bounded.

$$\epsilon = \left| R(\hat{f}) - \hat{R}(\hat{f}) \right| \qquad (4)$$
Researchers have since solved this problem for traditional deep learning models. Many theoretical studies have shown that neural network optimizers, such as stochastic gradient descent (SGD) [1] or adaptive moment estimation (Adam) [38], can efficiently find the optimized function $\hat{f}$. However, for in-memory deep learning models, because of the fluctuation of the weight matrix $W$, the sampled function becomes $\tilde{f}$ and the distance of risks becomes $\tilde{\epsilon}$, as expressed in Equation (5). Therefore, the traditional training method no longer works for this problem. We need a new training method that is suitable for in-memory deep learning models.

$$\tilde{\epsilon} = \left| R(\hat{f}) - \hat{R}(\tilde{f}) \right| \qquad (5)$$
4 Optimizing for In-memory Deep Learning
Our training method for in-memory computing can effectively improve the model accuracy together with the energy efficiency. Fig. 4 shows an overview. Essentially, we propose three optimization techniques and integrate them into the training process. The first technique, the device-enhanced dataset, integrates device information as additional data into the dataset, making the model more robust to the fluctuations of the device. The second technique, energy regularization, adds a new regularization term to the loss function so that the optimizer reduces the energy consumption automatically. The third technique, low-fluctuation decomposition, decomposes the computation involved into several time steps. This decomposition achieves high model accuracy and energy efficiency. We shall now give the mathematical basis of these three techniques.
4.1 Device-Enhanced Dataset
Our first optimization technique enhances the dataset with device information. In addition to the regular image data $X$ and the label data $Y$, our enhanced dataset has another source of data, the fluctuation data $S$, which reflects the random behavior of the memory cells. Fig. 5 shows a visualized example of $X$, $Y$, and $S$, with four training samples (Fig. 5(a)) to (Fig. 5(d)). They can be classified as either letter A or letter B. Images in the same class can have different variants. Take the letter A, for instance. It can be in any font, either normal (Fig. 5(a)) or italic (Fig. 5(c)). No matter what the variant is, its pixels must follow $\mathcal{D}_A$, the distribution for class A. After training, we can accurately classify any image that belongs to class A as long as its pixels follow the distribution $\mathcal{D}_A$ (Fig. 5(e)). The same holds for images of letter B, whose pixels follow the distribution $\mathcal{D}_B$. The fluctuation data $S$ reflects the random states of the memory cells. In our visualized example, the pixels indicate the states of the memory cells. The fluctuation patterns also follow a certain distribution $\mathcal{D}_S$, which can be learned during training. Using this enhanced dataset, the model can make correct predictions for in-memory deep learning because it is now aware of the fluctuations (Fig. 5(e)) to (Fig. 5(f)).

As we integrate the fluctuation data into the dataset, the model will not overfit during training. In Fig. 6, we visualize an example of training using only the dataset, without considering the device information. All pixels in the data are in the center of the matrix, indicating the absence of device information (Fig. 6(a)) to (Fig. 6(b)). During training, the model overfits this static data because it does not have any variants. As we can see from Fig. 6(c), the distribution learned by the model is only a straight line (orange), which is different from the real fluctuation of the memory cells (purple). Therefore, the model will misclassify the images. On the other hand, if we include the fluctuation data $S$, overfitting can be avoided. As we can see from Fig. 6(e) and Fig. 6(f), the fluctuation of the memory cells (purple) follows the learned distribution (orange), so the model can accurately classify the images.
We developed a method to integrate the fluctuation data into the training process. To simplify the problem, we first decompose the computation of the neural network model (Equation (1)) into several subtasks. For example, each element $Y_{ij}$ in the output matrix can be computed independently using Equation (6). Here the vector $\mathbf{w}_i$ is the $i$-th row of the weight matrix $W$, the vector $\mathbf{x}_j$ is the $j$-th column of the input matrix $X$, and $b_{ij}$ is the element of the bias matrix $B$ at the $i$-th row and $j$-th column.

$$Y_{ij} = \mathbf{w}_i \mathbf{x}_j + b_{ij} \qquad (6)$$
As we showed in Fig. 2, the weight we read from the EMT cell is unpredictable. Physically, the memory cell changes its state randomly, and the exact output value depends on the state of the memory cell at the moment we read it. We denote by $w$ the $m$-th element in the vector $\mathbf{w}_i$. $\hat{w}$ is the sampled data when we read $w$ from the memory cell and multiply it with the input vector $\mathbf{x}_j$. Mathematically, we can express $\hat{w}$ as a polynomial, as shown in Equation (7).

$$\hat{w} = \sum_{k} s_k \cdot g_k(w, \alpha) \qquad (7)$$
We use $g_k(w, \alpha)$ to denote the weight retrieved when the memory cell is at the $k$-th state. It can be considered a function of the pre-stored weight $w$ and the energy coefficient $\alpha$. At any moment, each memory cell can only be in one state. Hence, we use a one-hot encoded vector $[s_1, s_2, \dots, s_K]$ to indicate the state of the memory cell when $\hat{w}$ is sampled. The value of $s_k$ is shown in Equation (8). For given indexes $i$, $j$, and $m$, if the corresponding memory cell is at the $k$-th state, only $s_k$ equals $1$ and all the other coefficients equal $0$.

$$s_k = \begin{cases} 1, & \text{if the memory cell is at the } k\text{-th state} \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$
As we can see from Equation (7), the sampled weight from the memory cell consists of two parts:

The deterministic parameter $g_k(w, \alpha)$ is a function giving the value returned from a memory cell storing weight $w$ when the cell is in the $k$-th state. We collect these values into the matrix $G$, shown in Equation (9). $G$ can be considered a function of the weight vector $\mathbf{w}_i$ and the energy coefficient $\alpha$.

The stochastic parameter $s_k$ is a random coefficient indicating whether the memory cell storing weight $w$ is at the $k$-th state when it is sampled and multiplied with the input vector $\mathbf{x}_j$. We collect these coefficients into the matrix $S$, shown in Equation (10). $S$ can be considered part of the fluctuation data.

$$G = \begin{bmatrix} g_1(w_1, \alpha) & \cdots & g_1(w_M, \alpha) \\ \vdots & \ddots & \vdots \\ g_K(w_1, \alpha) & \cdots & g_K(w_M, \alpha) \end{bmatrix} \qquad (9)$$

$$S = \begin{bmatrix} s_{11} & \cdots & s_{1M} \\ \vdots & \ddots & \vdots \\ s_{K1} & \cdots & s_{KM} \end{bmatrix} \qquad (10)$$
We can now integrate the fluctuation data into the training process for in-memory deep learning. Each element in the output matrix can be calculated using Equation (11). $\hat{Y}_{ij}$ is a function of both the deterministic and stochastic parameters. Here $\hat{\mathbf{w}}_i = \mathbf{u}^{\top}(G \odot S)$ refers to the sampled weight vector read from the memory cells. For simplicity, we assume the bias $b_{ij}$ is a deterministic parameter. In some cases, the bias also fluctuates; we can use the same method to separate the deterministic and stochastic parameters for $b_{ij}$.

$$\hat{Y}_{ij} = \hat{\mathbf{w}}_i\, \mathbf{x}_j + b_{ij}, \quad \hat{\mathbf{w}}_i = \mathbf{u}^{\top}(G \odot S) \qquad (11)$$

The operator $\odot$ between $G$ and $S$ is the Hadamard product, i.e., the element-wise product. The unit vector $\mathbf{u}$ is expressed in Equation (12). We use it to sum up each entire column of the target matrix.

$$\mathbf{u} = [1, 1, \dots, 1]^{\top} \qquad (12)$$
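Equations (7) through (12), in the notation as we have reconstructed it, can be sketched as follows. The per-state readout model `g` (symmetric offsets shrinking with alpha) is purely an assumed placeholder; the structure to note is the Hadamard product of the deterministic matrix $G$ with the one-hot stochastic matrix $S$, summed per column by the unit vector:

```python
import numpy as np

def g(w, alpha, n_states):
    """Placeholder deterministic per-state readouts g_k(w, alpha)."""
    offsets = np.linspace(-1, 1, n_states) * 0.1 / alpha
    return w + offsets                                   # shape (n_states,)

def sample_weights(w_vec, alpha, rng, n_states=2):
    """One sampled weight vector w-hat = u^T (G ∘ S)."""
    n = len(w_vec)
    G = np.stack([g(w, alpha, n_states) for w in w_vec], axis=1)  # (states, cells)
    S = np.eye(n_states)[rng.integers(0, n_states, size=n)].T     # one-hot per column
    u = np.ones(n_states)
    return u @ (G * S)        # Hadamard product, then column sums

rng = np.random.default_rng(1)
w_vec = np.array([1.0, -0.5, 2.0])
w_hat = sample_weights(w_vec, alpha=1.0, rng=rng)
assert w_hat.shape == (3,)
# Each sampled weight is one of its cell's two state values, w ± 0.1 here.
assert all(abs(abs(w_hat[i] - w) - 0.1) < 1e-9 for i, w in enumerate(w_vec))
```

During training, $G$ stays differentiable in $w$ and $\alpha$ while $S$ is resampled per read, which is what lets a standard optimizer see the fluctuation statistics.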
4.2 Energy Regularization
Our second optimization technique adds an energy regularization term to the loss function during training. From Equation (11) we can infer that the loss function of the model is a function of the weights $\mathbf{w}$ and the energy coefficient $\alpha$. The target of this optimization technique is to find the optimal energy coefficient that improves both the model accuracy and the energy efficiency. This is not an easy task. We prefer a smaller $\alpha$ for higher energy efficiency. However, as we saw in Fig. 2, a smaller $\alpha$ causes a larger fluctuation amplitude of the weights, which results in accuracy loss. On the other hand, if we choose a larger coefficient $\alpha$, the model accuracy is less affected by the weight fluctuation, but the energy consumption becomes larger.
Our new loss function is expressed in Equation (13). The first term $L_0$ is the original loss function of the model, and the second term represents the energy consumption of the model. $\lambda$ is a hyperparameter indicating the significance of the energy regularization term. $n_i$ is a constant indicating the number of read operations from the memory cell storing weight $w_i$. The overall loss function can be considered a function of $\mathbf{w}$ ($w_i$ is the $i$-th element of $\mathbf{w}$) and $\alpha$, which are both trainable parameters. We can use any popular optimizer (such as SGD [1] or Adam [38]) to search for the optimal weights $\mathbf{w}$ and energy coefficient $\alpha$.

$$L = L_0(\mathbf{w}, \alpha) + \lambda \sum_{i} n_i\, \alpha\, |w_i| \qquad (13)$$
During training, gradient descent minimizes the loss function $L$. After optimization, both the weight magnitudes and $\alpha$ become smaller. We show this process in Fig. 7. With the help of the energy regularization term, we can improve both the model accuracy and the energy efficiency simultaneously.
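A minimal numeric sketch of the regularized loss, under our reconstructed form of Equation (13) (reads × coefficient × |weight|, summed over cells); the exact energy expression is an assumption consistent with "energy is roughly proportional to the weight values":

```python
import numpy as np

def energy_term(w, alpha, n_reads):
    """Assumed energy model: per-cell reads * energy coefficient * |weight|."""
    return np.sum(n_reads * alpha * np.abs(w))

def total_loss(task_loss, w, alpha, n_reads, lam=1e-3):
    """Equation (13) as reconstructed: task loss + lambda * energy term."""
    return task_loss + lam * energy_term(w, alpha, n_reads)

w       = np.array([0.5, -2.0, 1.0])
alpha   = np.array([1.0,  1.0, 2.0])   # trainable energy coefficients
n_reads = np.array([10,   10,  10])    # reads per cell (constants)

# Shrinking either the weights or the coefficients lowers the energy term,
# so gradient descent on this loss trades accuracy against energy.
assert energy_term(w, alpha, n_reads) == 45.0
assert abs(total_loss(0.3, w, alpha, n_reads, lam=0.01) - 0.75) < 1e-9
```

In an actual training loop both `w` and `alpha` would be autograd parameters; the point of the sketch is only that the energy term is differentiable in both, so one optimizer handles both.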
4.3 Low-fluctuation Decomposition
The third optimization technique decomposes the computation process into multiple time steps. We can visualize the computation in Fig. 8. The input activation $x$ and the weight $w$ equal the lengths of the horizontal bar and the vertical bar, respectively. The computation result obtained from the memory cell equals the area of the rectangle whose two edges have the same lengths as the horizontal bar and the vertical bar. In the example of the original computing (Fig. 8(a)), the lengths of the horizontal bar and the vertical bar are seven ($x = 7$) and one ($w = 1$), respectively. The area of the output rectangle is thus seven ($xw = 7$).
Theoretically, we can express any input as a polynomial, as shown in Equation (14). Here the fraction bit $b_t$ is binary data, which equals either $0$ or $1$, and $\beta_t$ is the scaling factor for that term. For example, if the input equals seven ($x = 7$), we can decompose it into three parts: $4$, $2$, and $1$.

$$x = \sum_{t} \beta_t\, b_t \qquad (14)$$
In our low-fluctuation decomposed computation mechanism (Fig. 8(b)), we read each memory cell over multiple time steps instead of once. As Equation (15) shows, at each time step we only input the fraction bit $b_t$ to the memory cell, obtain the value of $b_t \hat{w}^{(t)}$, and then scale the output from the memory cell by the factor $\beta_t$. Finally, we sum up all the results from the time steps. In this example, we use three steps to process the input seven ($x = 7$). The scaling factors of the three time steps are $4$, $2$, and $1$, respectively. As the weight is one ($w = 1$), the final accumulated result is seven, the same result as the original computing mechanism.

$$\hat{y} = \sum_{t} \beta_t\, b_t\, \hat{w}^{(t)} \qquad (15)$$
As the name indicates, our low-fluctuation decomposition can alleviate the fluctuations of the memory cell effectively. We can explain this using Fig. 8, where we show the fluctuation amplitudes as yellow blocks. The solid block and the hollow block denote positive and negative fluctuation amplitudes, respectively. As we can see from Fig. 8(b), using the decomposed computation mechanism, the negative fluctuation amplitude (hollow block) at the third time step can partially cancel the positive fluctuation amplitude (solid block) at the second time step. Statistically, the accumulated fluctuation of the decomposed computation mechanism has a lower standard deviation than that of the original computation mechanism.
We can mathematically compare their standard deviations. Equation (16) shows the standard deviation of the original computation mechanism, where $y = x\hat{w}$ is the original output and $\sigma_w$ is the standard deviation of the weight when we read it from the memory cell. Equation (17) shows the standard deviation of our low-fluctuation decomposed computation mechanism, where $\hat{y}$ is the new output and $\hat{w}^{(t)}$ is the weight sampled from the memory cell at the $t$-th time step. Since the reads of the memory cell can be considered independent events, the variances of the time steps add up.

$$\sigma(y) = \sigma(x\hat{w}) = x\,\sigma_w \qquad (16)$$

$$\sigma(\hat{y}) = \sigma\Big(\sum_t \beta_t\, b_t\, \hat{w}^{(t)}\Big) = \sigma_w \sqrt{\sum_t \beta_t^2\, b_t^2} \qquad (17)$$

We can infer from Equations (16) and (17) that our decomposed computation result has a lower standard deviation than the original computation result, since $\sqrt{\sum_t \beta_t^2 b_t^2} \le \sum_t \beta_t b_t = x$, leading to higher model accuracy (Equation (18)).

$$\sigma(\hat{y}) \le \sigma(y) \qquad (18)$$
Our low-fluctuation decomposition can also improve energy efficiency. To show this, we express the energy consumption of the two computation mechanisms in Equation (19), where the energy of a read is proportional to the product of its input and the stored weight. From the equation we can infer that our decomposed computation mechanism consumes less energy than the original one (Equation (20)), because $\sum_t b_t \le \sum_t \beta_t b_t = x$.

$$E = \alpha\,|w|\,x, \qquad \hat{E} = \alpha\,|w| \sum_t b_t \qquad (19)$$

$$\hat{E} \le E \qquad (20)$$
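The variance argument above can be checked with a small simulation. The noise model (an independent zero-mean Gaussian per read) is our own assumption; the decomposition itself follows Equations (14) and (15) with binary place values:

```python
import numpy as np

def read(w, sigma, rng):
    """One fluctuating cell read (assumed zero-mean Gaussian noise)."""
    return w + rng.normal(0.0, sigma)

def original(x, w, sigma, rng):
    """Single read with full input x: std = |x| * sigma, per Equation (16)."""
    return x * read(w, sigma, rng)

def decomposed(x, w, sigma, rng, bits=3):
    """Feed the binary digits of x over time steps, scale, accumulate."""
    total = 0.0
    for t in range(bits):                  # x = sum_t b_t * 2^t
        b_t = (x >> t) & 1
        if b_t:                            # only non-zero bits cost a read
            total += (2 ** t) * read(w, sigma, rng)
    return total

rng = np.random.default_rng(0)
orig = [original(7, 1.0, 0.1, rng) for _ in range(20000)]
deco = [decomposed(7, 1.0, 0.1, rng) for _ in range(20000)]

assert abs(np.mean(orig) - 7) < 0.05 and abs(np.mean(deco) - 7) < 0.05
# std(orig) ≈ 7*sigma = 0.7; std(deco) ≈ sigma*sqrt(16+4+1) ≈ 0.46, per Eq. (17)
assert np.std(deco) < np.std(orig)
```

Both mechanisms are unbiased around $xw = 7$, but the independent per-step reads partially cancel, matching the inequality in Equation (18).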
4.4 Convergence of the Training Method
We shall now mathematically prove the convergence of our training method. Equation (21) shows the basic relationship between the image data and the label data in a traditional deep learning network. The output matrix $Y$ is a function of the input matrix $X$. Here $\mathcal{X}$ and $\mathcal{Y}$ denote the spaces of $X$ and $Y$, respectively.

$$Y = f(X), \quad f: \mathcal{X} \to \mathcal{Y} \qquad (21)$$
For in-memory deep learning applications, the computation becomes unpredictable because of the weight fluctuation. As can be seen from Equation (11), the output is a function of both the input and the fluctuation data. To generalize this, the output matrix $Y$ can be defined as a function of the input matrix $X$ and the fluctuation data $S$, as shown in Equation (22). Here $\mathcal{S}$ denotes the space of $S$.

$$Y = f(X, S), \quad f: \mathcal{X} \times \mathcal{S} \to \mathcal{Y} \qquad (22)$$
We define a new data point $Z'$ as the combination of the images $X$, the labels $Y$, and the fluctuations $S$. Here the space $\mathcal{Z}'$ is the combination of $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{S}$. Since $Z$ is the combination of $X$ and $Y$, we can express the space $\mathcal{Z}'$ as the combination of the spaces $\mathcal{Z}$ and $\mathcal{S}$. We show these relationships in Equations (23) and (24).

$$Z' = (X, Y, S) = (Z, S) \qquad (23)$$

$$\mathcal{Z}' = \mathcal{X} \times \mathcal{Y} \times \mathcal{S} = \mathcal{Z} \times \mathcal{S} \qquad (24)$$
Given the fact that $Z$ follows the distribution $\mathcal{D}$, while $S$ follows the distribution $\mathcal{D}_S$, we can infer that $Z'$ (the combination of $Z$ and $S$) must follow a distribution $\mathcal{D}'$ (the combination of $\mathcal{D}$ and $\mathcal{D}_S$). We show this in Equation (25).

$$Z' \sim \mathcal{D}' = \mathcal{D} \times \mathcal{D}_S \qquad (25)$$
Finally, our proposed training process for in-memory deep learning models can be summarized as follows: given a function $f$ in the space $\mathcal{F}$ and a loss function $\ell(f, Z')$, we would like to find the $f^*$ that minimizes the risk, i.e., the expectation of the loss function (Equation (26)).

$$f^* = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{Z' \sim \mathcal{D}'}[\ell(f, Z')] \qquad (26)$$
Thus, we convert the convergence problem of the new training process for in-memory computing into the convergence problem of regular neural networks (see Section 3.2). Finding the optimal function is well studied, and we can use various existing optimizers, such as SGD or Adam, to find the optimal function $f^*$.
5 Experiments
We trained the models on the PyTorch platform and evaluated the energy consumption on an in-memory deep learning simulation platform [15] [34]. To accelerate the training process, we start each experiment from a well-trained model with full-precision weights [22] and then fine-tune the model by applying our proposed optimizations. During fine-tuning, we quantize both the activations and the weights. To form the device-enhanced dataset, we fetch the images and labels from the regular datasets (CIFAR10 and ImageNet) as the data X and the data Y, respectively. The fluctuation data S are obtained from state-of-the-art device models [11]. We use a workstation with an Nvidia 2080 Ti graphics card to train the models. Each experiment finishes in about an hour for the CIFAR10 dataset and about a day for ImageNet.
We propose three solutions, denoted 'A', 'A+B', and 'A+B+C'. As shown in Fig. 4, the notations A, B, and C stand for the device-enhanced dataset, energy regularization, and low-fluctuation decomposition, respectively. Solution A uses only the first technique; solution A+B applies the first two; solution A+B+C combines all three. We evaluate popular models including VGG16, ResNet18/34, and MobileNet. VGG16 is a regular deep neural network with only 3x3 convolution kernels. ResNet18/34 are popular models that achieve competitive accuracy by adding residual links between layers. MobileNet is a small model that achieves high efficiency thanks to its special depthwise layers. We compare our work with three state-of-the-art solutions: binarized encoding [39], weight scaling [11], and fluctuation compensation [31], as described in detail in Section 2.
5.1 Ablation Study of Proposed Techniques
Models optimized by our proposed methods have much higher accuracy than models trained by the traditional optimizer. In Fig. 9, we show the accuracy achieved by solutions A, A+B, and A+B+C under different energy budgets. As a reference, we also give the model accuracy trained by the traditional optimizer. As we can see from the figure, at a 16 J energy budget, the accuracy of solution A+B+C is very close to the baseline accuracy (shown as dashed lines in the top subfigures). On the other hand, the traditional optimizer exhibits relatively low accuracy due to its unawareness of the memory fluctuation.
We can see that when the energy budget is decreased, models trained using the traditional optimizer show a dramatic decrease in accuracy. On the contrary, our solution A+B+C achieves high model accuracy even as we reduce the energy budget. Even solutions A and A+B alone are enough to maintain relatively high accuracy, outperformed only by solution A+B+C. This observation further proves the effectiveness of our three proposed techniques for in-memory computing.
We can also see that under 16 J energy consumption, the ResNet18 trained by the traditional optimizer shows much lower accuracy than the VGG16. By using our solution A+B+C, ResNet18 fully recovers its accuracy and thus outperforms VGG16. This experiment also shows that MobileNet is not suitable for in-memory deep learning. Under the same energy budget, MobileNet shows lower accuracy than VGG16 and ResNet18. We attribute this to its depthwise layers. When we compute a regular convolution layer, the system reads hundreds of memory cells at once. However, to process a depthwise layer, it reads only nine memory cells at once. Therefore, a large portion of the energy is consumed in the peripheral circuits, causing a significant energy overhead.
CIFAR10 results. In each table, the three column groups report Energy (J), #Cells, and Delay (S) under a 0%, 1%, and 2% accuracy drop, respectively.

| VGG16 (93.6% Acc.) | Ref. | Energy (J) | #Cells | Delay (S) | Energy (J) | #Cells | Delay (S) | Energy (J) | #Cells | Delay (S) |
|---|---|---|---|---|---|---|---|---|---|---|
| Binarized Encoding | [39] | 378 | 74M | 2.8 | 135 | 74M | 2.8 | 94 | 74M | 2.8 |
| Weight Scaling | [11] | 444 | 15M | 2.8 | 78 | 15M | 2.8 | 49 | 15M | 2.8 |
| Fluctuation Compensation | [31] | 1091 | 15M | 14 | 157 | 15M | 14 | 82 | 15M | 14 |
| Ours (A+B) |  | 36 | 15M | 2.8 | 16 | 15M | 2.8 | 11 | 15M | 2.8 |
| Ours (A+B+C) |  | 4.1 | 15M | 14 | 1.0 | 15M | 14 | 0.5 | 15M | 14 |

| ResNet18 (95.2% Acc.) | Ref. | Energy (J) | #Cells | Delay (S) | Energy (J) | #Cells | Delay (S) | Energy (J) | #Cells | Delay (S) |
|---|---|---|---|---|---|---|---|---|---|---|
| Binarized Encoding | [39] | 876 | 56M | 6.8 | 389 | 56M | 6.8 | 286 | 56M | 6.8 |
| Weight Scaling | [11] | 1127 | 11M | 6.8 | 209 | 11M | 6.8 | 158 | 11M | 6.8 |
| Fluctuation Compensation | [31] | 2217 | 11M | 34 | 474 | 11M | 34 | 347 | 11M | 34 |
| Ours (A+B) |  | 83 | 11M | 6.8 | 22 | 11M | 6.8 | 10 | 11M | 6.8 |
| Ours (A+B+C) |  | 6.9 | 11M | 34 | 1.1 | 11M | 34 | 0.7 | 11M | 34 |

| MobileNet (91.3% Acc.) | Ref. | Energy (J) | #Cells | Delay (S) | Energy (J) | #Cells | Delay (S) | Energy (J) | #Cells | Delay (S) |
|---|---|---|---|---|---|---|---|---|---|---|
| Binarized Encoding | [39] | 392 | 16M | 4.6 | 81 | 16M | 4.6 | 62 | 16M | 4.6 |
| Weight Scaling | [11] | 232 | 3.2M | 4.6 | 57 | 3.2M | 4.6 | 42 | 3.2M | 4.6 |
| Fluctuation Compensation | [31] | 659 | 3.2M | 23 | 126 | 3.2M | 23 | 91 | 3.2M | 23 |
| Ours (A+B) |  | 75 | 3.2M | 4.6 | 23 | 3.2M | 4.6 | 13 | 3.2M | 4.6 |
| Ours (A+B+C) |  | 12.2 | 3.2M | 23 | 1.8 | 3.2M | 23 | 1.3 | 3.2M | 23 |
TABLE II: Comparison with the state-of-the-art on ImageNet. Each entry gives Energy (J) / #Cells / Delay (s) at the stated accuracy drop; residual accuracy drops of the state-of-the-art at the 0% target are shown in parentheses.

ResNet18 (69.8% Acc.)
Method                   | Ref. | 0% drop                 | 1% drop           | 2% drop
Binarized Encoding       | [39] | 23k (0.4%) / 58M / 151  | 2338 / 58M / 151  | 1336 / 58M / 151
Weight Scaling           | [11] | 21k (0.3%) / 12M / 151  | 3544 / 12M / 151  | 1933 / 12M / 151
Fluctuation Compensation | [31] | 71k (0.3%) / 12M / 756  | 8505 / 12M / 756  | 4725 / 12M / 756
Ours (A+B)               |      | 1951 / 12M / 151        | 897 / 12M / 151   | 659 / 12M / 151
Ours (A+B+C)             |      | 142 / 12M / 756         | 71 / 12M / 756    | 54 / 12M / 756

ResNet34 (73.3% Acc.)
Method                   | Ref. | 0% drop                 | 1% drop           | 2% drop
Binarized Encoding       | [39] | 28k (0.2%) / 109M / 207 | 2844 / 109M / 207 | 1778 / 109M / 207
Weight Scaling           | [11] | 25k (0.1%) / 22M / 207  | 3302 / 22M / 207  | 2154 / 22M / 207
Fluctuation Compensation | [31] | 83k (0.1%) / 22M / 1033 | 7990 / 22M / 1033 | 4857 / 22M / 1033
Ours (A+B)               |      | 2496 / 22M / 207        | 1044 / 22M / 207  | 729 / 22M / 207
Ours (A+B+C)             |      | 168 / 22M / 1033        | 90 / 22M / 1033   | 62 / 22M / 1033
5.2 Robustness to Different Devices
Today, academia and industry have developed various types of EMT cells with different levels of fluctuation intensity. Hence, it is necessary to verify the robustness of our solutions under any level of fluctuation intensity. In Fig. 10, we test our solutions and the state-of-the-art under three intensity levels [23]: weak, normal, and strong. The experiment is conducted on the ImageNet dataset using two ResNet models. All solutions, including the state-of-the-art, are free to tune the energy coefficient. We compare the energy consumption when each model achieves its maximum accuracy. Note that on the ImageNet dataset, our solutions can achieve the same accuracy as the baseline model running on a GPU, whereas the state-of-the-art cannot.
The results demonstrate the robustness of our solutions: at any level of fluctuation intensity, they deliver almost the same energy reduction. When the fluctuation intensity increases, both our solutions and the state-of-the-art prefer a higher energy coefficient to maximize model accuracy, resulting in higher energy consumption. Nevertheless, our solutions still outperform the state-of-the-art: solutions A+B and A+B+C show one and two orders of magnitude of energy reduction, respectively. From this point onwards, we assume the fluctuation intensity to be normal; results under the other intensity levels follow a similar trend.
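For readers who want to reproduce the qualitative trend, the sketch below models a fluctuating analog read as a multiplicative Gaussian perturbation, with one standard deviation per intensity level. It is our own illustration, not the simulator used in the experiments, and the sigma values are hypothetical placeholders rather than measured device parameters:

```python
import numpy as np

# Minimal sketch (our illustration, not the paper's simulator): an analog
# read of a stored weight w returns w * (1 + delta), where delta is
# zero-mean Gaussian noise whose standard deviation encodes the cell's
# fluctuation intensity. The sigma values below are hypothetical.
INTENSITY = {"weak": 0.05, "normal": 0.10, "strong": 0.20}

def noisy_read(weights, level, rng):
    """Simulate one analog read of a weight array at a given intensity level."""
    sigma = INTENSITY[level]
    return weights * (1.0 + rng.normal(0.0, sigma, size=weights.shape))

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
for level in INTENSITY:
    deviation = noisy_read(w, level, rng) - w
    print(f"{level:>6}: read deviation std ~ {deviation.std():.3f}")
```

Substituting `noisy_read` for every weight access in an accuracy evaluation reproduces the qualitative behavior in Fig. 10: the stronger the intensity, the larger the read deviation a solution must tolerate.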
5.3 Verification of the Optimization Solutions
In Fig. 11, we verify our optimization solutions by testing two ResNet models on the ImageNet dataset. The dashed line in the figure shows the baseline accuracy, defined as the highest accuracy we can achieve on a GPU. We also list the accuracy of the state-of-the-art for comparison. Among all solutions, our solution A+B+C achieves the highest top-1 and top-5 accuracies, matching the baseline. Solution A+B also achieves higher accuracy than the state-of-the-art, with only a small accuracy loss under a smaller energy budget. By contrast, models optimized by the state-of-the-art suffer significant accuracy losses: at least 0.9% and 0.8% top-1 accuracy loss for ResNet18 and ResNet34, respectively. We can also see that ResNet18 on ImageNet consumes more energy than the same model on CIFAR-10, because ImageNet has a larger image size.
5.4 Holistic Comparison with the State-of-the-Art
Our proposed solutions improve not only energy consumption but also hardware cost and latency. In Table I and Table II, we give a holistic comparison of our solutions with the state-of-the-art on energy consumption, number of cells, and latency under the same accuracy loss, testing various models on the CIFAR-10 and ImageNet datasets. From the tables, our method achieves the lowest energy consumption, the fewest cells, and the shortest latency. Specifically, solution A+B shows one order of magnitude improvement in energy consumption over the state-of-the-art, and solution A+B+C achieves two orders of magnitude. One limitation of solution A+B+C is its longer latency, because the low-fluctuation decomposed computation takes time to accumulate its results. The trade-off between energy consumption and latency depends on the specific application, so we list the results of both solution A+B and solution A+B+C; developers can choose between them based on their demands.
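The latency cost of accumulation can be illustrated with a generic averaging scheme, a stand-in for the low-fluctuation decomposition rather than its exact construction: averaging k independent noisy reads shrinks the fluctuation's standard deviation by a factor of sqrt(k), at the price of k read cycles.

```python
import numpy as np

# Generic averaging sketch (a stand-in for the low-fluctuation decomposition,
# not its exact construction): averaging k independent noisy reads of the
# same weight reduces the fluctuation's standard deviation by sqrt(k), but
# costs k read cycles, which is why accumulation lengthens the latency.
def accumulated_read(weight, sigma, k, rng):
    """Average k independent multiplicative-noise reads of one weight."""
    reads = weight * (1.0 + rng.normal(0.0, sigma, size=k))
    return reads.mean()

rng = np.random.default_rng(1)
w, sigma = 0.8, 0.1
single = np.std([accumulated_read(w, sigma, 1, rng) for _ in range(10_000)])
five = np.std([accumulated_read(w, sigma, 5, rng) for _ in range(10_000)])
print(f"std of 1 read : {single:.4f}")   # ~ w * sigma = 0.08
print(f"std of 5 reads: {five:.4f}")     # ~ 0.08 / sqrt(5) ≈ 0.036
```

Under this generic model, a 5x longer delay corresponds to accumulating roughly five partial results per read; the actual decomposition used by solution C differs in its details.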
Another observation is that our solutions are the only ones that fully recover the accuracy loss on the ImageNet dataset. As can be seen from the third column of Table II, none of the state-of-the-art solutions attains zero accuracy loss (their actual accuracy drops are given in parentheses). In a neural network, some parameters are vital to achieving high model accuracy, yet the state-of-the-art usually applies one general optimization rule to all model parameters. After optimization, those important parameters still have relatively large fluctuations, which constrains the recovery of model accuracy. Unlike the state-of-the-art, our solutions automatically identify those significant parameters and minimize their fluctuations.
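As a rough illustration of sensitivity-aware protection (a generic gradient-magnitude heuristic, not our actual identification algorithm), one can rank weights by the magnitude of their loss gradient and assign a tighter fluctuation budget only to the most sensitive fraction, instead of one uniform rule for all weights:

```python
import numpy as np

# Rough illustration of sensitivity-aware protection (a generic
# gradient-magnitude heuristic, not the paper's identification algorithm):
# weights with a large loss gradient are treated as "important" and receive
# a tighter (hypothetical) fluctuation budget than the rest.
def allocate_sigma(grad, sigma_low=0.02, sigma_high=0.10, protect_frac=0.1):
    """Per-weight fluctuation budget: protect the top `protect_frac` of weights."""
    k = max(1, int(len(grad) * protect_frac))
    threshold = np.sort(np.abs(grad))[-k]  # k-th largest gradient magnitude
    return np.where(np.abs(grad) >= threshold, sigma_low, sigma_high)

rng = np.random.default_rng(2)
grads = rng.standard_normal(1000)          # stand-in for d(loss)/d(w)
sigma = allocate_sigma(grads)
print(f"protected weights: {(sigma == 0.02).sum()} of {sigma.size}")
```

A uniform rule would leave the high-gradient weights with sigma_high, which is the situation the text describes for the state-of-the-art.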
6 Conclusion
In-memory deep learning has a promising future in the AI industry because of its high energy efficiency compared with traditional deep learning, even more so if the potential of emerging memory technology (EMT), especially in analog computing mode, is exploited. Unfortunately, one of the major limitations of EMT is the intrinsic instability of its cells, which can cause a significant loss in accuracy, while falling back on a digital mode of operation erodes the potential gains. In this work, we propose three optimization techniques that can fully recover the accuracy of EMT-based analog in-memory deep learning models while minimizing their energy consumption: the device-enhanced dataset, energy regularization, and low-fluctuation decomposition. Based on the experimental results, we offer two solutions: developers can apply either the first two optimization techniques or all three to the target model. Both solutions achieve higher accuracy than their state-of-the-art counterparts. The first shows at least one order of magnitude improvement in energy efficiency with the lowest hardware cost and latency; the second improves energy efficiency by another order of magnitude at the cost of higher latency.
References
 [1] (2012) Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pp. 421–436. Cited by: §3.2, §4.2.
 [2] (2019) A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations. Nature Electronics 2 (7), pp. 290–299. Cited by: §1.
 [3] (2018) Impact of RTN on pattern recognition accuracy of RRAM-based synaptic neural network. IEEE Electron Device Letters 39 (11), pp. 1652–1655. Cited by: §2.
 [4] (2018) A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors. In 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 494–496. Cited by: §2.
 [5] (2014) Random telegraph noise and resistance switching analysis of oxide-based resistive memory. Nanoscale 6 (1), pp. 400–404. Cited by: §2.
 [6] (2020) Exploring the impact of random telegraph noise-induced accuracy loss on resistive RAM-based deep neural network. IEEE Transactions on Electron Devices 67 (8), pp. 3335–3340. Cited by: §1.
 [7] (2014) Differential 1T2M memristor memory cell for single/multi-bit RRAM modules. In 2014 6th Computer Science and Electronic Engineering Conference (CEEC), pp. 69–72. Cited by: §2.
 [8] (2018) Signal and noise extraction from analog memory elements for neuromorphic computing. Nature Communications 9 (1), pp. 1–8. Cited by: §3.1.
 [9] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings. Cited by: §1.
 [10] (2019) Noise injection adaption: end-to-end ReRAM crossbar non-ideal effect adaption for neural network mapping. In Proceedings of the 56th Annual Design Automation Conference 2019, pp. 1–6. Cited by: §2.
 [11] (2010) Resistance-dependent amplitude of random telegraph-signal noise in resistive switching memories. Applied Physics Letters 96 (5), pp. 053503. Cited by: §2, TABLE I, TABLE II, §5, §5.
 [12] (2015) A performance and power analysis. NVidia Whitepaper, Nov. Cited by: §1.
 [13] (2020) Committee machines—a universal method to deal with non-idealities in memristor-based neural networks. Nature Communications 11 (1), pp. 1–10. Cited by: §2.
 [14] (2020) Accurate deep neural network inference using computational phase-change memory. Nature Communications 11 (1), pp. 1–13. Cited by: §2.
 [15] (2019) A system-level simulator for RRAM-based neuromorphic computing chips. ACM Transactions on Architecture and Code Optimization (TACO) 15 (4), pp. 1–24. Cited by: §5.
 [16] (2018) An FPGA-based hardware emulator for neuromorphic chip with RRAM. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39 (2), pp. 438–450. Cited by: §1.
 [17] (2021) NCNet: efficient neuromorphic computing using aggregated subnets on a crossbar-based architecture with nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. Cited by: §1.
 [18] (2014) Overview of emerging nonvolatile memory technologies. Nanoscale Research Letters 9, pp. 1–33. Cited by: §1.
 [19] (2021) Inmemory computing with resistive memory circuits: status and outlook. Electronics 10 (9), pp. 1063. Cited by: §3.
 [20] (2019) Optimizing weight mapping and data flow for convolutional neural networks on RRAM-based processing-in-memory architecture. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. Cited by: §2.
 [21] (2015) Statistical analysis of random telegraph noise in HfO2-based RRAM devices in LRS. Solid-State Electronics 113, pp. 132–137. Cited by: §2.
 [22] PyTorch model library. Note: pytorch.org/vision/stable/models.html. Accessed: 2021-10-23. Cited by: §5.
 [23] (2013) Microscopic origin of random telegraph noise fluctuations in aggressively scaled RRAM and its impact on read disturb variability. In 2013 IEEE International Reliability Physics Symposium (IRPS), pp. 5E–3. Cited by: §5.2.
 [24] (2013) RTN insight to filamentary instability and disturb immunity in ultra-low power switching HfOx and AlOx RRAM. In 2013 Symposium on VLSI Technology, pp. T164–T165. Cited by: §1, §3.1.
 [25] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §2.
 [26] (2020) Memory devices and applications for in-memory computing. Nature Nanotechnology 15 (7), pp. 529–544. Cited by: §1.
 [27] (2020) Two-step write–verify scheme and impact of the read noise in multilevel RRAM-based inference engine. Semiconductor Science and Technology 35 (11), pp. 115026. Cited by: §2.
 [28] (2020) Optimizing the energy consumption of spiking neural networks for neuromorphic applications. Frontiers in Neuroscience 14, pp. 662. Cited by: §2.
 [29] (2018) XNOR-RRAM: a scalable and parallel resistive synaptic architecture for binary neural networks. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1423–1428. Cited by: §2.
 [30] (2017) Binary convolutional neural network on RRAM. In 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 782–787. Cited by: §2.
 [31] (2020) A voltage-mode sensing scheme with differential-row weight mapping for energy-efficient RRAM-based in-memory computing. In 2020 IEEE Symposium on VLSI Technology, pp. 1–2. Cited by: §2, TABLE I, TABLE II, §5.
 [32] (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8612–8620. Cited by: §1.
 [33] (2021) Evolutionary multi-objective model compression for deep neural networks. IEEE Computational Intelligence Magazine 16 (3), pp. 10–21. Cited by: §1.
 [34] (2020) NCPower: power modelling for NVM-based neuromorphic chip. In International Conference on Neuromorphic Systems 2020, pp. 1–7. Cited by: §3.1, §5.
 [35] (2017) Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687–5695. Cited by: §1, §1.
 [36] (2020) Fully hardware-implemented memristor convolutional neural network. Nature 577 (7792), pp. 641–646. Cited by: §1, §3.
 [37] (2020) Reliable and robust RRAM-based neuromorphic computing. In Proceedings of the 2020 on Great Lakes Symposium on VLSI, pp. 33–38. Cited by: §2.
 [38] (2018) Improved ADAM optimizer for deep neural networks. In 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), pp. 1–2. Cited by: §3.2, §4.2.
 [39] (2019) A configurable multi-precision CNN computing framework based on single bit RRAM. In 2019 56th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. Cited by: §2, TABLE I, TABLE II, §5.