I Introduction
Deep learning [1] has become the core driving force of artificial intelligence (AI). Applications such as image recognition, game playing, self-driving cars, and AI assistants are all made possible by deep learning. At the core of deep learning lie artificial neural networks (ANNs)
[2]. ANNs are trained using large sets of data to approximate a function that explains the given data. Training is done using backpropagation [3], in which the weights of the neural network are updated based on the gradient descent update rule. The majority of the operations in training ANNs are matrix multiplications. Graphics processing units (GPUs) and tensor processing units (TPUs) are specialized digital hardware designed to speed up these matrix multiplications. With faster computation cores, the bottleneck is currently in memory systems and data transfer
[4]. Moreover, training ANNs for a typical real-world application requires hundreds of years of GPU time [5], leading to high energy costs.

In-memory computing [6] is an emerging paradigm in which data transfer is minimized by storing data and performing computation in the same place. Crossbar arrays with nonvolatile memory have been shown to use lower energy while also reaping the benefits of in-memory computation. Unfortunately, most devices struggle with precision, and hence the resulting performance of the system is not on par with that of their digital counterparts.
Gokmen and Vlasov [7] proposed a hypothetical resistive processing unit (RPU) that can be used to accelerate ANNs while being more energy-efficient than GPUs, with a negligible loss in accuracy. A crossbar architecture with a stochastic weight update rule allows matrix multiplication to be performed in O(1) time. Linearity in the weight update of the cross-point device and a high number of conductance levels were shown to be necessary to ensure good accuracy.
Various approaches with nanoscale emerging memories like PCM [8] and RRAM [9] have shown insufficient linearity to enable the RPU as the sole memory. Recently, traditional charge trap flash memory has shown promising linearity [10, 11]. However, its performance in the RPU framework has not been explored.
In this paper, we present a charge trap flash device that can act as a cross-point device in the RPU framework. We experimentally show a high number of conductance levels and approximately linear updates by choosing an appropriate pulse width and voltage for the weight update. Through simulations, we show that this indeed leads to good accuracy when tested on the MNIST, CIFAR-10, and CIFAR-100 datasets. In addition to supervised learning problems, we also successfully train a reinforcement learning agent on the Mountain Car environment.
II Related Work
Matrix-vector multiplication and the vector-vector outer product form the bulk of the operations in training neural networks. The RPU [7] speeds up this computation using stochastic multiplication and hypothetical devices with linear weight updates.
Electronic synapses that have been proposed, such as nanoscale memristive synapses, may not have the gradual learning required for the RPU. A phase-change memory (PCM) based synapse has a gradual positive conductance change but an abrupt negative conductance change, which requires novel synapse circuit design with enhanced controller complexity, as well as a dual precision approach. Successful methods supplement weight storage in low-precision but compact PCM with high-precision but area-inefficient CMOS based memory to achieve high performance [12, 13, 14, 6].

With resistive random-access memories (RRAMs), multiple devices are required to obtain a sufficiently gradual weight change to enable software-equivalent learning [15, 16]. Additionally, RRAM (HfO_{2}/PCMO/NbO_{2}) and PCM based memories have additional process complexity and cost to be integrated with CMOS [17].
Floating-gate devices have been explored extensively as analog memory for neural networks [18]. However, horizontal floating-gate flash memory has been replaced by vertical charge trap flash memory, with storage in silicon nitride traps, for advanced technology nodes [19].
In contrast with memristors, a silicon-oxide-nitride-oxide-silicon (SONOS) based charge trap flash memory has a significantly more gradual conductance change, with conductance saturation after 100 pulses [10]. This may be compared to 20 pulses for PCM [8], or 20 pulses for PCMO based RRAM [9]. The maximum conductance change was between 5-20% of the conductance range, and the noise was around 5%-10% of the range. A dual precision approach, in which one flash cell has a 1x factor and another an 8x factor to define the weight, was required to obtain software-level accuracy on MNIST. The weight updates also required varying pulse voltage and time, which would incur additional circuit costs.
Recently, a similar charge trap flash device has been programmed by quantum tunneling to show extremely gradual programming with 1,000-10,000 levels, a 10-100x improvement over the literature [10]. The maximum conductance change per spike is controlled to 1% of the range, while the noise is 0.1% of the range. However, linearity is not available over the entire range, which is essential for RPU applications. An important question is whether, by reducing the range of conductance used, a smaller but more linear range can be found, which would enable a software-equivalent RPU despite the experimentally measured noise.
III Background

III-A Artificial Neural Networks
Artificial neural networks are based on the principle of the multilayer perceptron [20, Chapter 6]. Each layer of neurons performs a weighted linear combination of its inputs, applies a nonlinear function, and passes the output to the next layer. Mathematically, given an input vector $x$ and a weight matrix $W$, a fully connected layer outputs

$$y = f(Wx) \quad (1)$$

where $f$ is some nonlinear function called the activation. This operation is repeated for all layers, giving the output $\hat{y}$.
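As an illustration of Equation 1 and the backpropagated gradients discussed next, the per-layer computations can be sketched in NumPy. This is a minimal sketch: the function names, the sigmoid activation, and the output error signal are illustrative choices, not the paper's code.

```python
import numpy as np

def forward(W, x):
    """Equation 1: z = W x, y = f(z), with an illustrative sigmoid activation."""
    z = W @ x
    y = 1.0 / (1.0 + np.exp(-z))
    return z, y

def backward(W_next, delta_next, z):
    """Equation 2: delta = (W_next^T delta_next), elementwise product with f'(z)."""
    y = 1.0 / (1.0 + np.exp(-z))
    f_prime = y * (1.0 - y)                      # sigmoid gradient
    return (W_next.T @ delta_next) * f_prime

def weight_gradient(delta, x):
    """Equation 3: dL/dW = delta x^T (a vector-vector outer product)."""
    return np.outer(delta, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((2, 3))
z1, y1 = forward(W1, x)
z2, y2 = forward(W2, y1)
delta2 = y2 - np.array([1.0, 0.0])               # output-layer error signal
delta1 = backward(W2, delta2, z1)
grad_W1 = weight_gradient(delta1, x)             # used in the gradient descent update
```

Each training step then subtracts a step size times `grad_W1` from `W1`, which is exactly the operation the RPU accelerates in hardware.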
In machine learning, neural networks are used to approximate the function between the input data and a target. Gradient descent is used to minimize a loss function $L$ between the output of the neural network ($\hat{y}$) and the true target ($y$). The gradients are calculated efficiently using backpropagation [3].

Backpropagation uses the chain rule to propagate the gradients to the lower layers, given the gradients of the higher layers. Let $z^{(l)} = W^{(l)} x^{(l)}$ denote the pre-activation of layer $l$ with input $x^{(l)}$, and let $\delta^{(l)} = \partial L / \partial z^{(l)}$. Then,

$$\delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot f'(z^{(l)}) \quad (2)$$

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \, (x^{(l)})^T \quad (3)$$

where $f'$ denotes the gradients of the activation functions and $\odot$ is the Hadamard (elementwise) product. Equations 1, 2, and 3, along with the gradient descent update, form the core of training a neural network.

III-B Resistive Processing Unit
Resistive processing units (RPUs) [7] attempt to speed up the computation of the matrix-vector multiplications (Equations 1, 2) and the vector-vector outer product (Equation 3). For efficient hardware implementation, devices are arranged in a crossbar architecture, with the device conductance at each cross point representing a weight.
First, Ohm’s law, combined with Kirchhoff’s current law, enables the multiply-accumulate operation naturally in hardware. During the forward pass (Equation 1), applying voltages proportional to the input $x$ to the rows makes the currents at the columns equal to $Wx$, the pre-activation output of the layer. Similarly, during the backward pass (Equation 2), applying voltages proportional to $\delta$ to the columns makes the currents at the rows equal to $W^T \delta$, which is required for backpropagating the gradient.
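The row/column multiply-accumulate described above can be sketched numerically; the conductance and voltage values below are illustrative, not from the paper.

```python
import numpy as np

G = np.array([[1e-6, 2e-6],
              [3e-6, 4e-6]])        # device conductances (siemens), one per weight
v_in = np.array([0.1, 0.2])         # row voltages, proportional to the input x

# Kirchhoff's current law: each column current is sum_i G[i, j] * v_in[i],
# i.e. the analog crossbar computes the matrix-vector product G^T v "for free".
i_cols = G.T @ v_in

# Backward pass: drive the columns with voltages proportional to delta and
# read the row currents, giving the transposed product G @ v.
v_back = np.array([0.05, 0.15])
i_rows = G @ v_back
```

The same physical array thus implements both $Wx$ and $W^T \delta$, simply by swapping which side is driven and which side is read.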
Second, the weight update is performed by a simple stochastic AND operation directly on the nonvolatile memory elements. The outer product (Equation 3) is calculated using stochastic multiplication. Two pulse trains, with probabilities of a high-voltage pulse proportional to $x_i$ and $\delta_j$ respectively, are generated and passed through the rows and columns respectively. The voltage levels are set such that the resistive device updates its weight by $\Delta w_{min}$ when the pulses coincide, and there is no change when the pulses do not coincide. Since the expected number of coincidences is proportional to $x_i \delta_j$, the total weight update is proportional to the gradient in expectation. Figure 1 shows an example of pulse trains and the resulting update.

The crossbar architecture and the stochastic weight update make the RPU more energy- and area-efficient than high-precision digital multiplication blocks [7].
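The stochastic update can be sketched as follows; the pulse-train length, minimum step size, and input probabilities are illustrative values, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_update(x_i, delta_j, n_pulses=10, dw_min=0.001):
    """One weight's update: count coincidences of two Bernoulli pulse trains."""
    row_pulses = rng.random(n_pulses) < x_i      # P(pulse) proportional to x_i
    col_pulses = rng.random(n_pulses) < delta_j  # P(pulse) proportional to delta_j
    coincidences = np.sum(row_pulses & col_pulses)
    return coincidences * dw_min                 # the weight changes only on coincidence

# The expected update is n_pulses * dw_min * x_i * delta_j, i.e. proportional
# to the outer-product term x_i * delta_j of Equation 3.
updates = [stochastic_update(0.5, 0.4) for _ in range(20000)]
expected = 10 * 0.001 * 0.5 * 0.4
```

Averaging many such updates recovers the gradient in expectation, which is why identical stochastic pulses can replace an analog multiply.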
IV Flash Synapse

IV-A Experimental Device
We use a CTF capacitor (Figure 2), fabricated as described by Sandhya et al. [21]. The device is fabricated on an n-Si substrate with 4 nm thermal SiO_{2} as the tunnel oxide, 6 nm LPCVD Si_{3}N_{4} as the charge trap layer (CTL), 12 nm MOCVD Al_{2}O_{3} as the blocking oxide, and n+ polysilicon, on a 12-inch substrate using an Applied Materials cluster tool. Aluminum is used as the back contact. A self-aligned B implant and anneal provides a source of minority carriers for fast programming, as shown in Figure 2a.
IV-B Working as a Synapse
The program/erase operation is based on FN tunneling. When a positive pulse is applied to the gate, electrons from the channel tunnel through the 4 nm tunnel oxide to be trapped in the CTL, i.e., programming (Figure 2b). To erase, a negative pulse is applied to the gate. Electrons are ejected from the CTL by tunneling through the tunnel oxide (Figure 2c).
Programming and erasing result in a threshold voltage shift ($\Delta V_T$). The threshold voltage ($V_T$) is translated to a drain current ($I_D$), which indicates the synaptic conductance ($G$), as follows:

$$\Delta V_T = V_T^{\text{final}} - V_T^{\text{initial}} \quad (4)$$

$$I_D = \alpha \, (V_G - V_T) \quad (5)$$

$$G = \beta \, I_D \quad (6)$$

where $\alpha$ and $\beta$ are proportionality constants [22]. Erasing ($\Delta V_T < 0$) results in potentiation ($\Delta G > 0$), while programming ($\Delta V_T > 0$) results in depression ($\Delta G < 0$). Henceforth, we use $G$ and $V_T$ interchangeably, since each is simply a scaled version of the other. An approximately linear and gradual change of conductance with pulse number can be designed by pulse-width modulation [11].
IV-C Experimental Data

IV-C1 Curve Fitting of Device Updates
We experimentally determine the pulse amplitude and pulse width that give an approximately linear weight change. Figure 3(a) shows the experimental data of $V_T$ vs pulse number for LTD (using a pulse of +12.5 V and 0.85 ms width) and LTP (using a pulse of -12.5 V and 15 ms width). The scatter points are the observed data and the solid lines are the corresponding curve fits.
[Figure 3: (a) $V_T$ vs pulse number for LTD and LTP, with curve fits. (b) $dV_T/dn$ vs $V_T$, showing that the $V_T$ shift is non-uniform. (c) Repeated measurements (6 times) of (a), used to estimate the noise as a fraction of the mean $dV_T/dn$ vs $V_T$; the experimental noise is 30-40% for LTP and 10% for LTD.]

The curves were fit to the data by minimizing the mean squared error over the curve-fit variables. The expression for $dV_T/dn$ was then obtained by differentiating the fit, giving
(7) 
We define the positive update as the change in $V_T$ per programming pulse (using the LTD data) and the negative update as the change in $V_T$ per erase pulse (using the LTP data). Figure 3(b) shows the variation of $dV_T/dn$ with $V_T$. The curve fits for LTD and LTP imply that
(8)  
(9) 
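The fitting step can be illustrated with synthetic data. The paper's exact fit function is not reproduced here; as a stand-in we assume an exponential saturation $V_T(n) = V_\infty + (V_0 - V_\infty) e^{-n/\tau}$, for which $dV_T/dn$ is linear in $V_T$, the kind of relation the curve fits above describe. All parameter values are illustrative.

```python
import numpy as np

V0, V_inf, tau = 2.0, 6.0, 40.0                  # illustrative fit parameters
n = np.arange(200)
V_T = V_inf + (V0 - V_inf) * np.exp(-n / tau)    # synthetic saturating V_T(n)

dV = np.diff(V_T)                                # per-pulse change, approximates dV_T/dn
slope, intercept = np.polyfit(V_T[:-1], dV, 1)   # linear fit of dV_T/dn vs V_T
# slope is approximately -1/tau: under this model the update shrinks linearly
# as V_T approaches its saturation value V_inf.
```

Fitting the measured $V_T(n)$ this way and differentiating the fit yields the update-vs-state curve needed by the simulations that follow.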
IV-C2 Characterization of Device Noise
To find the noise in the updates, the LTP and LTD experiments were repeated six times on the same device to characterize the variation within a device. For each experiment, a curve was fit and the corresponding $dV_T/dn$ was found. Then, for each $V_T$, the standard deviation of the evaluations of all six fits was found. Figure 3(c) shows the standard deviation as a percentage of the mean vs $V_T$ for LTD and LTP. This standard deviation is a measure of variation over time within a flash device, interpreted as noise. To simplify the simulations, the noise was set to a single conservative constant for all $V_T$ in our experiments.

IV-D CTF in an RPU Array
IV-D1 Simulating Device Updates
The conductance of a CTF device is always positive, but the weights can be negative. Thus, two devices are required to represent both positive and negative weights. Mathematically, the weight is

$$w_{ij} = k \, (g^+_{ij} - g^-_{ij}) \quad (10)$$
The scaling constant $k$ is used to control the range of device conductance. In hardware, the two CTF devices are arranged as shown in Figure 4a. Applying voltages to the gates of the devices generates currents at the drain and source respectively. These currents are added to implement Equation 10.
$\Delta g_{min}$ is not constant, since it is a function of the current device conductances and of whether the update is positive or negative. The update is also noisy. Accommodating all these modifications, the positive and negative updates are given by

$$g^+ \leftarrow g^+ + \Delta g^+(g^+) \, (1 + \epsilon) \quad (11)$$

$$g^- \leftarrow g^- + \Delta g^-(g^-) \, (1 + \epsilon) \quad (12)$$

where $\epsilon$ is the noise.
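The two-device weight and its noisy, state-dependent updates can be sketched as follows. The specific update model (a step that shrinks linearly as the conductance approaches its maximum, with multiplicative Gaussian noise) is an illustrative assumption, not the paper's fitted device model.

```python
import numpy as np

rng = np.random.default_rng(2)

class FlashWeight:
    """One weight stored as two conductances, w = k * (g_plus - g_minus)."""

    def __init__(self, k=1.0, g_plus=0.5, g_minus=0.5, noise_frac=0.1):
        self.k = k
        self.g_plus = g_plus
        self.g_minus = g_minus
        self.noise_frac = noise_frac

    def _dg(self, g):
        # Assumed state-dependent step that shrinks as g approaches 1,
        # with multiplicative Gaussian noise (illustrative model).
        return 0.01 * (1.0 - g) * (1.0 + self.noise_frac * rng.standard_normal())

    def update(self, positive):
        if positive:
            self.g_plus += self._dg(self.g_plus)    # potentiation
        else:
            self.g_minus += self._dg(self.g_minus)  # depression

    @property
    def value(self):
        return self.k * (self.g_plus - self.g_minus)

w = FlashWeight()
for _ in range(50):
    w.update(positive=True)   # repeated positive updates drive w above zero
```

Simulating an array of such weights, with the update rule calibrated to the measured $dV_T/dn$ curves, is the basis of the experiments in Section V.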
IV-D2 Controlling Linearity and Noise
Since the required range of weights depends only on the dataset and the step size, $k$ controls the range of $g$ used, and hence the noise, linearity, and number of levels available. For example, Gokmen and Vlasov [7] characterized the range of weights required when training on the MNIST dataset; based on Equation 10, a correspondingly scaled conductance range on each device is sufficient to represent this weight range. Hence, a higher $k$ implies a lower required range of $g$, which can be observed in Figure 5(a).
Constraining $g$ to a lower range improves linearity (Figure 5(a)). It also allows us to stay in the region with low noise, leading to a lower maximum standard deviation as a fraction of the mean (Figure 5(b)). As a tradeoff, however, the number of levels available before the conductance goes out of range is reduced (Figure 5(b)). In Section V, we show the effect of this tradeoff on the performance of the system. In addition to the range, the center point of the conductance range is optimized by trial and error to improve linearity.
IV-D3 Circuit Design Considerations
Performing an addition or subtraction of pulse trains is easier from a hardware perspective than an AND operation [7]. To perform a positive update, two positive-polarity pulse trains can be added such that a positive voltage pulse results at the coincidences. The polarities can be reversed to perform a negative update. Since $x$ and $\delta$ are applied to the two ends of the crossbar, the polarity of each pulse train must depend independently on the corresponding $x_i$ or $\delta_j$, and not on the product $x_i \delta_j$. The input $x$ can be assumed to be positive, since inputs are generally normalized between 0 and 1, and the common nonlinear activation functions used in a neural network, such as sigmoid or ReLU, only output positive values.
Two possible update cycles with these constraints and the corresponding pulse polarities are shown in Figure 6. We always use the positive cycle in our experiments.
Weight update in hardware for CTF devices is done by applying the voltage at the gate, with the source and drain connected to ground (Figure 4b).
As proposed by Gokmen and Vlasov [7], nonlinear activation functions and their gradients can be implemented using external circuitry. For the special case of the ReLU activation, this external circuitry can be simplified: ReLU simply passes positive inputs forward and blocks negative inputs. The gradient is, hence, 1 for positive inputs and 0 for negative inputs.
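The ReLU special case can be stated in two lines (a sketch; the names are illustrative):

```python
import numpy as np

def relu(z):
    """Forward: pass positive inputs, block negative inputs."""
    return np.maximum(z, 0.0)

def relu_grad(z):
    """Gradient: 1 for positive inputs, 0 otherwise (a simple 0/1 mask)."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 1.5])
```

Because the gradient is a 0/1 mask rather than an analog value, the backward-pass circuitry only needs to gate the backpropagated signal on or off.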
V Experiments and Results
To test the performance of a neural network with the flash synapse as the cross-point device, we performed three experiments: supervised classification of digits in the MNIST dataset [23], supervised classification of images in the CIFAR datasets [24], and reinforcement learning in the Mountain Car environment [25]. All neural network operations were performed by simulating the CTF devices as described in Section IV-D. As a baseline in all the experiments, we performed the neural network training using exact floating point operations.
Table I lists the hyperparameters used in the experiments. A combination of manual tuning and grid search was used to find them. Hyperparameters related to the CTF device and the RPU were kept constant across all experiments.
Hyperparameter | Value
Update step size | MNIST: 0.01; CIFAR: 0.1; Mountain Car: 0.00625
Initial weights | Kaiming uniform [26]
Weight scaling factor (k) |
Initial device conductance (g+, g-) |
Pulse train length | 10
Input scaling factor |
V-A MNIST
The MNIST dataset consists of 60,000 training and 10,000 test images of 10 handwritten digits, each of size 28x28 pixels.
A fully connected neural network with 2 hidden layers, consisting of 256 and 128 neurons respectively, was used for classification. The neural network was trained for 10 epochs. Experiments were repeated 10 times with different random seeds, and the train accuracy was recorded after every 5,000 images. The test accuracy was also recorded after every 5,000 training images by performing classification on the complete test set. Two sets of experiments were performed, with the noise standard deviation being 10% of the mean in one and 100% of the mean in the other.

Figure 7(a) shows the learning curves with 10% noise and 100% noise, compared with that of the baseline. The curves are averaged over the 10 runs, and one standard error is shaded. The final accuracies with the flash device at 10% and 100% noise are both close to the final accuracy of the baseline.

V-A1 Effect of Weight Scaling Factor (k) on Performance
As described in Section IV-D, changing $k$ leads to a tradeoff between linearity, noise, and the number of pulses available. To study its effect on the performance, we vary $k$ and measure the test and train accuracies.
Figure 7(b) shows the variation of train and test accuracies for different values of $k$ at a noise level of 10%. The highest train accuracy was obtained at an intermediate value of $k$, with the corresponding test accuracy being the peak performance of 97.9% reported in Table II.
Higher values of $k$ use a lower range of device conductances, which reduces the precision of the system, since $\Delta g_{min}$ and the noise are unchanged. Lower values of $k$ use a larger range of device conductances; since the conductance change becomes more nonlinear at either extreme, the performance declines.
V-A2 Noise Analysis
In the above subsections, we showed plots for the flash device at noise levels of 10% and 100% of the mean. To further study the effect of noise on the performance, we ran the MNIST experiments with the noise level varying from 0% to 500% and measured the test accuracy after 3 epochs.
Figure 7(c) shows the accuracy as a function of noise, averaged over 4 runs. The accuracy degrades only slightly up to 100% noise and drops further at 500% noise. As shown in Section IV-C, 100% noise is well above the noise measured experimentally in the flash device, and hence the accuracy at 100% noise acts as a lower bound on the obtainable accuracy.
V-B CIFAR
The CIFAR datasets consist of 50,000 training and 10,000 test images of real-world objects. Each image is colored and 32x32 pixels in size. CIFAR-10 consists of 10 classes of images, while CIFAR-100 consists of 100 classes of images.
Since convolutional neural networks (CNNs) are generally used for classification on these datasets, we follow the methodology of Ambrogio et al. [14] to compare our device with the baseline. A pretrained CNN, specifically ResNet-50 [27] pretrained on the ImageNet [28] dataset, is used for feature extraction. The CIFAR images were resized, normalized, and passed through the pretrained network, and the activations of the last hidden layer were taken as features.

Once the features were extracted, a neural network with no hidden layers was trained to classify the images based on the features. The neural network was trained for 10 epochs. As in the MNIST experiments, the CIFAR experiments were repeated 10 times while recording the test and train accuracies.
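The classifier stage can be sketched as follows. Feature extraction (CIFAR images passed through a pretrained ResNet-50) is assumed to have been done elsewhere; random features stand in for it here, and the sample count, learning rate, and step count are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_features, n_classes = 64, 2048, 10   # 2048 = ResNet-50 feature width
X = rng.standard_normal((n_samples, n_features))  # stand-in for extracted features
labels = rng.integers(0, n_classes, n_samples)

W = np.zeros((n_classes, n_features))             # single layer, no hidden units
lr = 0.01
losses = []
for _ in range(5):
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)             # softmax probabilities
    losses.append(-np.log(p[np.arange(n_samples), labels]).mean())
    grad_logits = p
    grad_logits[np.arange(n_samples), labels] -= 1.0
    W -= lr * (grad_logits.T @ X) / n_samples     # gradient descent step
```

In the actual experiments, the `W -= ...` step is replaced by the simulated stochastic pulse-train update on the CTF devices.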
Figure 8(a) shows the learning curves with 10% noise and 100% noise for the CIFAR-10 dataset; the final accuracies with the flash device are close to the final accuracy of the baseline. Figure 8(b) shows the same for the CIFAR-100 dataset, where the final accuracies with 10% and 100% noise are again close to the baseline.
Authors | Precision | Programming | Devices per Weight | MNIST Accuracy | Applications Demonstrated
Ambrogio et al. [14] | Dual precision: high-precision, volatile DRAM + low-precision, nonvolatile PCM | Analog pulse V and time | 2 PCM + DRAM | 97.95% | Supervised learning (MNIST, CIFAR-10, CIFAR-100)
Nandakumar et al. [29] | Dual precision: high-precision, volatile CMOS + low-precision, nonvolatile PCM | Analog pulse V and time | 2 PCM + SRAM | 97.40% | Supervised learning (MNIST)
Agarwal et al. [10] | Single precision | Analog pulse V and time | 2 SONOS flash | 97.6% | Supervised learning (file types, MNIST)
Agarwal et al. [10] | Dual precision: high- & low-precision CTF by relative weight | Analog pulse V and time | 4 SONOS flash | 98% | Supervised learning (file types, MNIST)
Nandakumar et al. [8] | Single precision | Stochastic identical pulse train | 2 PCM | 83% | Supervised learning (MNIST)
Babu et al. [9] | Single precision | Stochastic identical pulse train | 2 PCMO | 88.1% | Supervised learning (MNIST)
This work | Single precision | Stochastic identical pulse train | 2 CTF | 97.9% | Supervised learning (MNIST, CIFAR-10, CIFAR-100); reinforcement learning (Mountain Car)
V-C Mountain Car
Mountain Car is a control problem in which the agent should drive a car to the top of a mountain. The agent observes its current horizontal position (a real number between -1.2 and 0.6) and its velocity (a real number between -0.07 and 0.07). The goal is to reach the position 0.5, which corresponds to the top of the peak. The agent can move forward, move backward, or do nothing. Since the agent cannot accelerate enough to reach the peak by just moving forward, it needs to move back and forth to build enough momentum before being able to reach the peak [25]. The agent gets a reward of -1 at every time step until it reaches the goal, and hence it needs to reach the goal as quickly as possible.
We used tile coding [30, p. 217] to extract features from the observations and used a neural network with no hidden layers on top of it to predict the state-action values (Q-values) for each action. Mathematically, the network provided an approximation of $Q(s, a)$ for each state $s$ and action $a$. The weights were updated using the Q-learning [31] update:

$$w \leftarrow w + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \nabla_w Q(s, a) \quad (13)$$

where $s$ is the current state, $a$ is the action chosen, $r$ is the reward obtained, $s'$ is the next state, $\alpha$ is the step size, and $\gamma$ is the discount factor. The gradient calculation and weight update in Equation 13 were performed by simulating the flash device.
Action selection was done using an epsilon-greedy strategy. Hash-based tile coding software by Sutton [32] was used for feature extraction, with 8 equally sized tiles per dimension and 16 tilings.
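The update of Equation 13 with epsilon-greedy action selection can be sketched as follows. A one-hot state encoding stands in for the hash-based tile coding, and the state/action counts and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions = 5, 3
alpha, gamma, epsilon = 0.1, 0.99, 0.1
W = np.zeros((n_actions, n_states))      # one weight row per action

def features(s):
    phi = np.zeros(n_states)
    phi[s] = 1.0                         # one-hot stand-in for tile coding
    return phi

def q_values(s):
    return W @ features(s)               # linear Q(s, a) for all actions

def select_action(s):
    if rng.random() < epsilon:           # explore with probability epsilon
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values(s)))   # otherwise act greedily

def q_update(s, a, r, s_next):
    # Equation 13: w += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) * grad
    td_error = r + gamma * np.max(q_values(s_next)) - q_values(s)[a]
    W[a] += alpha * td_error * features(s)

# One illustrative transition with reward -1, as in Mountain Car.
a0 = select_action(0)
q_update(s=0, a=1, r=-1.0, s_next=2)
```

With linear function approximation, the gradient of $Q(s, a)$ with respect to $w$ is just the feature vector, so the update is again an outer-product-style operation that the simulated flash array can perform.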
The agent was trained for 500 episodes, with each episode being terminated either on reaching the goal or after 1000 steps. The experiment was repeated 100 times and the total reward obtained from each episode was recorded.
Figure 8(c) shows the total reward per episode as a function of the number of episodes, with 10% noise and 100% noise. The floating point baseline obtains a reward of around -143 per episode (i.e., it takes around 143 steps to complete an episode). With the flash device, the rewards at 10% noise and 100% noise are close to this baseline.
VI Discussion
We have shown that the CTF device works as a replacement for floating point updates in various applications. In all the experiments, the performance of our device was close to that of the floating point baseline. It was also fairly robust to the experimentally measured noise of 10-40% in the updates, which is crucial for analog computing.
Classification on the MNIST dataset showed that a multilayer neural network can be trained using the CTF device. Classification on the CIFAR-100 dataset showed that, even in the regime of a large number of classes and relatively little data, the performance is on par with that of floating point updates. Training an agent on the Mountain Car environment showed that our method is not restricted to the supervised learning setting, but can also be used in other settings that use neural networks.
Table II compares the various current approaches. Among approaches to in-memory computing, precision enhancement of low-precision but compact nanoscale memory, like phase change memory (PCM), with high-precision but area-inefficient CMOS memory enables high performance on the MNIST dataset [14, 29]. Further, single-precision approaches with RPU-based stochastic identical-pulse weight updates show degraded performance on MNIST: 83% for PCM [8] and 88% for PCMO-based RRAM [9]. Agarwal et al. [10] have shown a single-precision approach based on SONOS flash memory, with analog pulse control of voltage and time, that records 97.6% on MNIST. This technology is based on a NOR-flash-like programming scheme using the high-current/power technique of channel hot electrons (CHE). Enhancing precision by a dual-precision technique, with more flash devices per weight and control circuitry to enable a periodic carry, improves the MNIST performance to 98%.
In comparison, our flash memory is programmed with the low-current/power/energy FN tunneling technique. A stochastic-pulse-train-based RPU is demonstrated, eschewing the need for variable pulses with analog voltage levels and pulse-time controls. The low rate of conductance change and high linearity produce a peak performance of 97.9%, which is robust to experimentally measured noise levels. Further, our method produces excellent performance on various ANN applications, namely classification on the CIFAR-10 and CIFAR-100 datasets and reinforcement learning on the Mountain Car environment, demonstrating excellent generalization.
VII Conclusions
In this paper, we proposed a charge trap flash device in an RPU architecture to accelerate deep neural networks while maintaining software-level accuracy. The resistive processing unit speeds up the vector-matrix and vector-vector multiplication operations that are ubiquitous in the backpropagation algorithm used to train deep neural networks. We engineered the magnitude and the width of the pulse used to update the weights of the flash device. The updates were shown to be linear, gradual, and symmetric, which is necessary for good performance.
We then simulated the device to train neural networks on the MNIST, CIFAR-10, and CIFAR-100 datasets. In each case, the accuracy of the system was close to the floating point baseline, showing excellent generalization. The system was also robust to noise in the weight updates, with less than a 1% drop in accuracy when the simulated noise was 10x the experimentally observed value. We also demonstrated the generality of the method by applying it to reinforcement learning on the Mountain Car environment, where the performance of our system again matched the software baseline. The implementation is benchmarked against state-of-the-art demonstrations and shows best-in-class performance, indicating a promising hardware option for in-memory computing.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
 [2] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, p. 386, 1958.
 [3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by backpropagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.
 [4] M. Bauer, H. Cook, and B. Khailany, “CudaDMA: Optimizing GPU memory bandwidth via warp specialization,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011, p. 12.
 [5] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard et al., “DeepStack: Expert-level artificial intelligence in heads-up no-limit poker,” Science, vol. 356, no. 6337, pp. 508–513, 2017.
 [6] M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma et al., “Mixedprecision inmemory computing,” Nature Electronics, vol. 1, no. 4, p. 246, 2018.
 [7] T. Gokmen and Y. Vlasov, “Acceleration of deep neural network training with resistive crosspoint devices: Design considerations,” Frontiers in Neuroscience, vol. 10, p. 333, 2016.
 [8] S. Nandakumar, M. Le Gallo, I. Boybat, B. Rajendran, A. Sebastian, and E. Eleftheriou, “A phasechange memory model for neuromorphic computing,” Journal of Applied Physics, vol. 124, no. 15, 2018.
 [9] A. V. Babu, S. Lashkare, U. Ganguly, and B. Rajendran, “Stochastic learning in deep neural networks based on nanoscale PCMO device characteristics,” Neurocomputing, vol. 321, pp. 227–236, 2018.
 [10] S. Agarwal, D. Garland, J. Niroula, R. B. Jacobs-Gedrim, A. Hsia, M. S. Van Heukelom et al., “Using floating-gate memory to train ideal accuracy neural networks,” IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, vol. 5, no. 1, pp. 52–57, 2019.
 [11] S. Shrivastava, T. Chavan, and U. Ganguly, “Ultralow energy charge trap flash based synapse enabled by parasitic leakage mitigation,” arXiv preprint arXiv:1902.09417, 2019.
 [12] M. Suri, O. Bichler, D. Querlioz, O. Cueto, L. Perniola, V. Sousa et al., “Phase change memory as synapse for ultradense neuromorphic systems: Application to complex visual pattern extraction,” in International Electron Devices Meeting. IEEE, 2011, pp. 4–4.
 [13] O. Bichler, M. Suri, D. Querlioz, D. Vuillaume, B. DeSalvo, and C. Gamrat, “Visual pattern extraction using energyefficient “2PCM synapse” neuromorphic architecture,” IEEE Transactions on Electron Devices, vol. 59, no. 8, pp. 2206–2214, 2012.
 [14] S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. di Nolfo et al., “Equivalentaccuracy accelerated neuralnetwork training using analogue memory,” Nature, vol. 558, no. 7708, p. 60, 2018.
 [15] I. Boybat, M. Le Gallo, S. Nandakumar, T. Moraitis, T. Parnell, T. Tuma et al., “Neuromorphic computing with multimemristive synapses,” Nature Communications, vol. 9, no. 1, p. 2514, 2018.
 [16] A. Shukla, S. Prasad, S. Lashkare, and U. Ganguly, “A case for multiple and parallel RRAMs as synaptic model for training SNNs,” in International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–8.
 [17] V. Sze, Y.H. Chen, T.J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
 [18] O. Fujita and Y. Amemiya, “A floating-gate analog memory device for neural networks,” IEEE Transactions on Electron Devices, vol. 40, no. 11, pp. 2029–2035, 1993.
 [19] D. Kang, W. Jeong, C. Kim, D. Kim, Y. Cho, K. Kang et al., “256 Gb 3 b/cell V-NAND flash memory with 48 stacked WL layers,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 210–217, 2017.
 [20] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
 [21] C. Sandhya, U. Ganguly, N. Chattar, C. Olsen, S. M. Seutter, L. Date et al., “Effect of SiN on performance and reliability of charge trap flash (CTF) under Fowler–Nordheim tunneling program/erase operation,” IEEE Electron Device Letters, vol. 30, no. 2, pp. 171–173, 2008.
 [22] Y. Taur and T. H. Ning, Fundamentals of Modern VLSI Devices. Cambridge University Press, 2013.
 [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [24] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
 [25] A. W. Moore, “Efficient memory-based learning for robot control,” Ph.D. dissertation, University of Cambridge, 1990.

 [26] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
 [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [28] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
 [29] S. R. Nandakumar, M. L. Gallo, I. Boybat, B. Rajendran, A. Sebastian, and E. Eleftheriou, “Mixedprecision architecture based on computational memory for training deep neural networks,” in IEEE International Symposium on Circuits and Systems, ISCAS, 2018, pp. 1–5.
 [30] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT press, 2018.
 [31] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3–4, pp. 279–292, 1992.
 [32] R. S. Sutton, “Tile coding software – reference manual,” http://incompleteideas.net/tiles/tiles3.html, 2017, accessed: 20190713.