Adaptive Precision CNN Accelerator Using Radix-X Parallel Connected Memristor Crossbars

Neural processor development is reducing our reliance on remote server access to process deep learning operations in an increasingly edge-driven world. By employing in-memory processing, parallelization techniques, and algorithm-hardware co-design, memristor crossbar arrays are known to efficiently compute large-scale matrix-vector multiplications. However, state-of-the-art implementations of negative weights require duplicative column wires, and high-precision weights built from single-bit memristors further distribute computations across the array. These constraints dramatically increase chip area and resistive losses, which lead to increased power consumption and reduced accuracy. In this paper, we develop an adaptive precision method by varying the number of memristors at each crosspoint. We also present a weight mapping algorithm designed for implementation on our crossbar array. We describe this novel algorithm-hardware solution as the radix-X Convolutional Neural Network Crossbar Array, and demonstrate how to efficiently represent negative weights using a single column line rather than doubling the number of columns. Using both simulation and experimental results, we verify that our radix-5 CNN array achieves a validation accuracy of 90.5% on the CIFAR-10 dataset, a 4.5% improvement over binarized neural networks, while simultaneously reducing crossbar area by 46% by removing the need for duplicate columns to represent signed weights.


I Introduction

Machine learning algorithms have become ubiquitous in the modern world, and are crucial in enabling computer systems which automatically update and improve with experience. This has opened up new frontiers in data analysis techniques. Deep learning refers to the use of a multi-layered neural network where the sequence of layers between the input and output perform feature identification at various hierarchies, as inspired by an approximation of the neuronal connections within the brain [1, 2, 3, 4]. A popular deep learning algorithm for structured data is the convolutional neural network (CNN), which is well suited for vision-based processing due to its high performance in feature recognition and object detection in images [5].

One of the challenges associated with machine learning stems from dimensionality issues, where algorithms with more features in higher dimensional spaces become difficult to interpret. When a learning algorithm does not work, the simplest path to success is often to feed the machine more data. This leads to scalability issues, where we have more data but lack the processing power to compute new inferences. Near real-time prediction with sufficient accuracy is required for portable devices and edge sensors, which must implement ambient-assisted technologies within a constrained power budget.

This challenge was initially addressed by shifting computations over to graphics processing units (GPUs), as GPU architectures consist of many small cores that parallelize the processing of data. Calculations of similar form are carried out simultaneously, thus maximizing the throughput of all threads, which boosts performance while reducing the bottleneck when paired with a CPU. However, when dealing with algorithms that must call a significant number of parameters from memory (e.g., 138 million parameters in the VGG-16 CNN [6]), these parameters must be accessed from and stored in data memory via a shared bus with restrictive data transfer rates. This issue is referred to as the von Neumann bottleneck.

More recently, application specific Neural Processing Units (NPUs) were deployed in mobile devices for real time operation without the need for server connections to perform deep learning operations [7, 8, 9, 10]. NPUs are optimized for power and area efficiency for matrix-vector multiplication (MVM) without the need for ‘cloud-based’ processing. However, this approach still relies on conventional CMOS technology where process scaling is bound to performance degradation (retention, cycling and reliability), and memory and processing are physically delocalized. This has given rise to the exploration of beyond-CMOS architectures for artificial neural network (ANN) and CNN applications.

Researchers have offered a variety of hardware solutions that implement memristors into neuromorphic processors [11, 12, 13, 14, 15, 16, 17, 18]. The memristor is a two-terminal nanoscale device which serves as non-volatile memory and also doubles as a resistor. That is, memory and computation based on the linear form of Ohm’s Law exist within the same device. Memristors are scaled into a dense crossbar structure as an area-efficient means to parallelize multiply-and-accumulate (MAC) functions, where high-speed computation is achieved through the column-wise parallelism of arrays. However, problems such as memory leakage, variability and device sensitivity make it challenging to reliably store multi-bit and analog data [19, 20, 21, 22]. The work in [23] demonstrates the storage of over 64 conductance states per memristor, though the difference between simulated and experimental efficiency is an order of magnitude in TOPS/W, speculated to be a result of the slow write times needed to ensure precise conductance control and noise mitigation.

To combat the limitation of multi-bit and analog state memristors, hardware implementations of memristive binarized neural network (BNN) [24, 25] models, which restrict each device to two states, have been proposed. Where weights are limited to single-bit resistances, the lower precision results in decreased classification accuracy. Other methods to achieve multi-bit weights are through binarized encoding schemes with column-wise distribution, or via frequency modulation by encoding weight information in the time-domain of the driving voltage [26]. In all cases, either chip area or timing is compromised due to additional columns and the need for more complex CMOS driving circuitry. The representation of negative weights with positive conductances requires double the number of columns, with outputs passed through a differential amplifier [23].

In this paper, we propose a novel solution derived from nanoelectronics to overcome the above limitations of conventional crossbar architectures. This is done by introducing parallel-connected memristors at each crosspoint junction on a crossbar, by either splitting larger memristors and insulating the smaller counterparts from one another, or laying out multiple masks per crosspoint. This means we are able to process radix-X weights (i.e., higher bit precision at each junction), and formalize a hardware mapping approach that significantly reduces circuit area utilization by representing both negative and positive weights without the need to distribute computations across column wires. Furthermore, this approach significantly reduces exposure to line losses.

The main contributions of this paper are:

  1. Radix-X CNN: here we introduce a CNN implementation using radix-X weights, where the weights and activation values are mapped to the range of the radix numeral system, or ‘X’. We develop a straightforward algorithm based on regularization, and provide both pseudo-code and our Python implementation. We test the accuracy of our radix-X CNN by training it on the CIFAR-10 dataset, and comparing it with several prominent models. Intuitively, this can be thought of as targeting the algorithmic component in the algorithm-hardware co-design methodology.

  2. Parallel-connected memristors at each crossbar junction for radix-X weight representation: hardware implementation of radix-X CNN. We show improved stability, reliability and decreased area consumption by using our proposed parallel-connected memristor architecture for storage of radix-X CNN weights. This focuses on the hardware aspect of the co-design methodology.

  3. Negative weights representation: implementation of negative weights is a significant overhead in crossbar arrays. Conventional methods use twice the area of crossbars to address this problem. Here, we demonstrate how our radix-X CNN significantly reduces the circuit area by using a single crossbar reference column for both negative and positive weight representation, rather than doubling the number of column wires.

The above contributions are quantified by showing how our proposed radix-X CNN hardware achieves a validation accuracy of 90.5% on the CIFAR-10 dataset when X = 5, and a 4.5% improvement over conventional low precision weights (namely, BNNs). Importantly, we reduce chip area by 46% over conventional state-of-the-art arrays by condensing the number of required column wires to represent negative weights down to a single reference line.

This paper is organized as follows: section II introduces the concepts that drive the technology of the radix-X CNN approach in a memristor crossbar. Section III describes our radix-X CNN learning algorithm with pseudo-code provided, and section IV demonstrates how it is implemented using a parallel-connected memristive crossbar array for representation of radix-X weights, and proposes a solution for negative and multi-bit weight representation. Section V shows our simulation results by running a classification example on the CIFAR-10 dataset, and section VI presents the nanofabrication techniques employed in the development of our crossbar array, with accompanied experimental results of a simple convolutional kernel with a Sobel filter containing both positive and negative elements applied to an input image. Section VII provides a discussion of some of the design trade-offs of the hardware implemented radix-5 CNN, with concluding remarks given in section VIII.

II Background

II-A Resistive Switching in Memristors

The reconfigurability of conductance in a memristor is leveraged in neuromorphic computing to represent updatable weight values. Resistive switching has been demonstrated in metal-oxide devices, with Ta2O5 [27, 28], HfO2 [29] and TiO2 [30, 31] being among the most recognized. Under the influence of an applied electric field, a conductive filament made up of oxygen vacancies can be formed which creates a pathway for electrons to flow through [32]. The formation of the filament corresponds to a low resistance, and the rupture of the filament breaks the conductive pathway resulting in a high resistance.

Fig. 3: Memristor characterization. (a) Physical representation depicting the TiO2 and TiO_2-x layers. (b) V-I characteristics of the model illustrating the LRS and HRS. The resistance ratio of this memristor is 100 when a sinusoidal voltage with frequency = 1 kHz is applied.

Under a forward bias, the memristor switches to a low resistance state (LRS). When a reversal of the bias is applied, it switches to a high resistance state (HRS). Fig. 3(a) illustrates the physical structure of a memristor formed by TiO2 and oxygen deficient TiO_2-x layers sandwiched between two metal electrodes. Fig. 3(b) illustrates the resultant V-I curve under a sinusoidal driving voltage, causing the device to switch between two resistance states.

To achieve analog or multi-bit states, the width of the filament must be precisely modulated, which is challenging in practice. It often requires the use of lower write voltages applied across longer durations, which super-exponentially increase the time of write cycles [33]. Therefore, many realizations of crossbar arrays employ conservative design techniques and treat metal-oxide memory cells as single-bit storage [34]. Multi-bit weights are often implemented using multiple memristors, distributed across multiple column wires.

II-B Convolutional Neural Networks

Fig. 4: Generalized CNN model with parameters labeled including number of layers, kernel size, and channel depth. Various CNN models can be created by altering these parameters and layer structures.

A generic structure of a CNN is depicted in Fig. 4 [35, 36]. Its high performance in image classification is enabled by retaining some spatial dependencies (i.e., taking consideration of the location of pixels relative to neighboring pixels). This is achieved by treating the image as a matrix rather than vectorizing it in a fully-connected neural network. As higher-level features are extracted, the channel depth increases, which results in a much larger number of MVMs (computational equivalent of a MAC operation) for a given number of inputs.

II-C Neural Network Using Memristor Crossbar Arrays

The key to memristor crossbar arrays being capable of neural network acceleration is that MVMs are the dominant process in CNNs. By parallelizing a large number of MACs across column wires using weights that have been stored in the form of conductance values, we are able to optimize the hardware mapping of neural network architectures.

Fig. 7: The artificial neuron. (a) Architecture. (b) Mapping to a single column within a memristor crossbar architecture.

Figs. 7(a) and (b) depict the mapping of the neuron model to a circuit. The inputs of the neural network x_1 to x_n are linearly mapped to the input voltages V_1 to V_n of the crossbar, and the weights w_1 to w_n are linearly mapped to the conductances G_1 to G_n of the memristors. By using the virtual ground of an inverting amplifier to hold each column wire at the reference node (detailed in section IV), the current drawn by each memristor can be calculated using Ohm’s Law, and then summed along the column wire in accordance with Kirchhoff’s Current Law. Equations (1) and (2) mathematically describe this process:

y = \sum_{i=1}^{n} w_i x_i    (1)

I = \sum_{i=1}^{n} G_i V_i    (2)

where y is the pre-activation output of the artificial neuron, n corresponds to the number of inputs to the neuron, and I is the total current through a column wire. A vectorized implementation of Fig. 7(b) is defined by (3). When the number of columns is increased to a value m in an array, (2) can be extended to the MVM in (4):

I = \mathbf{G}^{T} \mathbf{V}, \quad \mathbf{G}, \mathbf{V} \in \mathbb{R}^{n}    (3)

\mathbf{I} = \mathbf{G}^{T} \mathbf{V}, \quad \mathbf{G} \in \mathbb{R}^{n \times m}, \ \mathbf{V} \in \mathbb{R}^{n}, \ \mathbf{I} \in \mathbb{R}^{m}    (4)

The conductance weights in a single column in the crossbar array correspond to a single channel in a CNN kernel. One can implement deep-channel kernels in parallel by distributing these across column wires. The voltage corresponding to the image data is applied at the input terminals of the crossbar (i.e., at the row wires), where the convolution operation is performed.
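As a concrete illustration of this mapping, the short Python sketch below (our own, with hypothetical array dimensions, idealized devices, and line resistance ignored) computes the column currents of a crossbar as the MVM of a conductance matrix with an input voltage vector:

import numpy as np

# Hypothetical 4x3 crossbar: 4 row (input) wires and 3 column (output) wires.
# G[i, j] is the conductance (siemens) of the crosspoint joining row i to column j.
G = np.array([[1e-3, 2e-3, 0.0],
              [2e-3, 1e-3, 1e-3],
              [0.0,  1e-3, 2e-3],
              [1e-3, 0.0,  2e-3]])

# Input voltages applied to the row wires (volts).
V = np.array([0.1, 0.2, 0.0, 0.1])

# Ohm's Law per device and Kirchhoff's Current Law per column wire:
# each column current is the dot product of the input voltages with that
# column's conductances, i.e., the matrix-vector product I = G^T V.
I = G.T @ V
print(I)  # one accumulated current per column (amperes)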

III Radix-X CNN Algorithm

The conventional methods of working around single-bit weight restrictions in memristor crossbar arrays are either algorithmic, by using BNNs, or hardware-based, by distributing computations via binarized encoding across columns. As mentioned, the former compromises accuracy and the latter expands chip area and power consumption. In the past, BNNs have been implemented either at the weight level, at the activation level (akin to the classic perceptron [37]), or both in unison. The work in [24, 38, 39] implements a binarized activation that can adopt both positive and negative values:

x^{b} = \operatorname{sign}(x) = \begin{cases} +1, & x \geq 0 \\ -1, & \text{otherwise} \end{cases}    (5)

Although this bounding approach is convenient for digitized implementation, there is a degradation of inference accuracy as a result of high-precision compression [40]. This may be counteracted by using more learning parameters with an increased number of training epochs, but this offsets the advantages of parallelization. In light of these limitations, we propose a novel approach based on a radix-X weight representation, and present our method for algorithm and hardware co-design to realize it on a memristor crossbar array.

The radix of a digital numeral system refers to the number of unique digits used to represent values in a positional numeral system, including the digit zero. If X is the radix of a numeral system, then in the context of a neural network, radix-X refers to the complete set of values that are assignable as a weight or activation value. For example, where X = 5, the weights and activations of a radix-5 CNN can take on any one of 5 values. We present an algorithm that normalizes a high-precision pre-trained weight matrix into a radix-X weight matrix; for radix-5, each weight takes one of the values in the set {-2, -1, 0, 1, 2}. By employing the ReLU activation function, we ensure the outputs can also be represented within the limits of the radix-X numeral system, as one of the values in the set {0, 1, 2, 3, 4}. In the most generalized case for radix-X, the weights must first be normalized according to the following pseudo-code:

1: function NormalizedTensor(x, weights): Return normalized weights given input of radix-x and pre-trained weights
2:     range ← max(weights) − min(weights)
3:     weights ← (weights − min(weights)) / range × (x − 1)
4:     return weights − (x − 1)/2
5: function QuantizedTensor(weights): Return quantized weights given input of normalized weights
6:     for element in weights do:
7:         if fractional part of element ≥ 0.5 then round element up
8:         else round element down
9: In main function:
10: return QuantizedTensor(NormalizedTensor(x, weights))
Algorithm 1 Convert pre-trained weights into radix-X

An integer input x is the radix of the numeral system, and a matrix or tensor of pre-trained weights, plainly denoted weights, are both passed into the function NormalizedTensor, which returns a normalized set of weights where the minimum element is -(X-1)/2 and the maximum element is (X-1)/2. The output is passed as the argument of QuantizedTensor, which quantizes all floating point decimals into integers. For accessibility, we have provided a link to the GitHub repository containing our Python 3 implementation of the above pseudo-code [42]. The Python code also includes the radix-X activation function.
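For reference, a minimal NumPy sketch of the two functions is given below. It follows the pseudo-code and the description above (minimum element mapped to -(X-1)/2, maximum to (X-1)/2, then rounding to the nearest integer level); the variable names are ours and the sketch need not match the repository implementation exactly.

import numpy as np

def normalized_tensor(x, weights):
    # Linearly rescale pre-trained weights to the range [-(x-1)/2, (x-1)/2].
    w = np.asarray(weights, dtype=float)
    w_min, w_max = w.min(), w.max()
    return (w - w_min) / (w_max - w_min) * (x - 1) - (x - 1) / 2

def quantized_tensor(weights):
    # Round to the nearest integer level (fractional part >= 0.5 rounds up).
    return np.floor(np.asarray(weights) + 0.5).astype(int)

# Example: convert a pre-trained kernel to radix-5 weights in {-2, -1, 0, 1, 2}.
pretrained = np.array([[-0.8, 0.1], [0.35, 0.8]])
print(quantized_tensor(normalized_tensor(5, pretrained)))  # [[-2 0] [1 2]]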

Where X = 5, the mathematical equivalent of the above algorithm for a radix-5 CNN is:

\mathrm{range} = w_{max} - w_{min}    (6)

W_5 = \operatorname{round}\left( \frac{W_{10} - w_{min}}{\mathrm{range}} \times 4 \right) - 2    (7)

Y = \min\left( \max(\mathrm{pixel}, 0),\ 4 \right)    (8)
Fig. 10: Kernel mapping platform. (a) Process of converting a weight used in convolution to five limited values: the kernel values in a layer are normalized then quantized to {-2, -1, 0, 1, 2}. (b) Conversion is based on a bounded ReLU activation, where all values less than 0 are mapped to zero.

Fig. 13: Conversion plot of (a) weights and (b) activations.

where range refers to the range of weights prior to normalization, and w_max and w_min are the maximum and minimum weights respectively. In (7), W_5 is the equivalent set of weights in radix-5, and W_10 refers to the weights in the base-10 system. In (8), ‘pixel’ is the input data convolved with a kernel before activation, which, when passed through the activation, gives an output Y bound to one of five integer values. Figs. 10 and 13 illustrate the process.
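As a worked example of (6)-(8) with illustrative numbers of our own: for a layer whose pre-trained weights span w_min = -0.8 and w_max = 0.8, (6) gives range = 1.6, and a pre-trained weight of W_10 = 0.35 maps through (7) to W_5 = \operatorname{round}\left(\frac{0.35 + 0.8}{1.6} \times 4\right) - 2 = 3 - 2 = 1. A pre-activation value of pixel = 6 is then clipped by the bounded activation of (8) to Y = 4.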

Fig. 14: The process of training radix-X CNN models. The gradients of the cost function are obtained through forward- and back-propagation using the radix-X converted weights W_X. The ADAM optimizer updates the real-valued weights W based on those in the previous cycle. This updated W is converted to a radix-X weight W_X and used as a parameter to decrement the cost function again. The real-valued weight W must be saved during training.

When training data is passed through the network, the neuron output and weights are converted to radix-X values using (6)-(8). Then, the classification result is obtained through forward propagation. The cost function for the output is obtained, and the slope of the cost function with respect to W_X is calculated using backward propagation. We compute the real-valued weights using the ADAM optimizer [41], which is used to calculate and store W. This feedback process is represented in Fig. 14, and while it bears many similarities to conventional backpropagation, we will show how it can be harnessed at the system level using parallel-connected memristive junctions in a crossbar array in the following sections.
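A minimal PyTorch-style sketch of one training step following this scheme is given below. The names and the exact update pattern are our own assumptions made for illustration; refer to the repository [42] for the authors’ implementation.

import torch

def radix_quantize(w, x=5):
    # Map real-valued weights W to radix-x levels W_X (here, {-2,...,2} for x=5).
    w_min, w_max = w.min(), w.max()
    return torch.round((w - w_min) / (w_max - w_min) * (x - 1)) - (x - 1) / 2

model = torch.nn.Linear(8, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
inputs, targets = torch.randn(16, 8), torch.randint(0, 4, (16,))

# One training step: the forward/backward pass uses the radix-X weights W_X,
# while ADAM updates the stored real-valued weights W.
optimizer.zero_grad()
w_real = model.weight.data.clone()           # save the real-valued weights W
model.weight.data = radix_quantize(w_real)   # substitute the radix-X weights W_X
loss = torch.nn.functional.cross_entropy(model(inputs), targets)
loss.backward()                              # gradients evaluated at W_X
model.weight.data = w_real                   # restore W before the optimizer step
optimizer.step()                             # ADAM update applied to W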

IV Radix-X CNN Accelerator Circuit

We designed and fabricated an application specific reconfigurable crossbar array, intended precisely for the implementation of our radix-X CNN Accelerator. Here, we will describe the operating principle of our design, how to achieve multi-bit and negative weights at a single crosspoint, and then detail the nanofabrication techniques used in its development.

IV-A Multi-bit Weights

As the resistance precision of the memristor for storing information is limited, and the impact of writing variation increases with the number of resistance states [20, 21, 22], we seek to circumvent this issue by introducing parallel-connected memristors at each crosspoint in the array. Each of these memristors is still only used to store binarized weights, but by forming and severing connections to the memristor electrodes, we introduce additional bits per crosspoint, despite our conservative design approach.

Fig. 15: Parallel-connected crosspoint for X = 5. Each intersection of the crossbar array implements a memristor physically sub-divided into four constituent components. The equivalent circuit is 4 parallel-connected memristors that can represent five weight values. The sub-division process is only limited by increased variation in devices of decreasing width.

Fig. 15 illustrates this concept at a high level. The crosspoint of a single junction has (X-1) parallel-connected memristors. In the diagram shown, we have chosen X = 5 (i.e., radix-5), which requires four parallel memristors per column-row wire intersection.

Four 1-bit memristors are placed in a quad-parallel structure at each metal crosspoint. Between 0 and 4 of these memristors are connected to the top metal and pre-programmed to either an HRS or LRS. That is, these memristors are used as read-only memory to ensure high reliability and to avoid write variability.
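The equivalent resistance seen at a crosspoint is simply that of its connected memristors in parallel. The short Python sketch below (assuming identical, idealized devices with a hypothetical resistance R_M) lists the five levels available in radix-5:

R_M = 10e3  # assumed single-memristor resistance in ohms; illustrative value only

for n_connected in range(5):  # 0M (open junction) through 4M
    if n_connected == 0:
        print("0M: open circuit, conductance 0 S")
    else:
        # n identical memristors in parallel: R_eq = R_M / n, G_eq = n / R_M
        print(f"{n_connected}M: R_eq = {R_M / n_connected:.0f} ohms, "
              f"G_eq = {n_connected / R_M:.1e} S")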

Fig. 18: Parallel-connected memristor crosspoint structure. (a) Symbolic resistance, mask layout, number of connections, and our shorthand notation for each possible type of parallel-connected crosspoint (i.e., 1M to 4M). The mask layout demonstrates that the various parallel connections are hardwired at the time of fabrication, which removes the need for independent access to each memristor within a crosspoint. (b) Experimental results obtained by measuring the V-I curves of our fabricated crossbar array at a frequency of 10 Hz.

Fig. 22: Negative weights in crossbar arrays. (a) Conventional horizontal mapping: array M+ stores positive weights and M- stores negative weights. If the weight stored at a crosspoint is positive, the memristor in M+ is activated; if negative, the memristor in M- is activated. (b) Vertical mirroring. (c) A 1T1M architecture is required to activate or deactivate a memristor.

As shown in Fig. 18, five resistance values can be obtained depending on the number of activated parallel-connected memristors. The proposed parallel-connected memristor demonstrates how a set of radix-5 CNN weights can be implemented using five discrete resistance states.

IV-B Negative Weights

Existing studies have implemented the hardware described in Fig. 22 to represent negative weights, which requires twice the number of columns. In our proposed method, each of the radix-5 weights {-2, -1, 0, 1, 2} are mapped to one of five available memristor configurations. This is depicted in Table I. However, in each of the 5 configurations, the equivalent resistance at a crosspoint is still non-negative. We will demonstrate how to remove the need for duplicative columns by mapping negative weights into positive conductances.

W_5 (mapping values)    -2      -1       0        1        2
Configuration           open    1M       2M       3M       4M
Conductance G           0       1/R_M    2/R_M    3/R_M    4/R_M
Mapping the five crosspoint configurations to five conductances to represent all weights of the radix-5 CNN without duplicative columns. W_5 = the weights of the pre-trained radix-5 CNN model; G = equivalent conductance of a crosspoint, determined by the number of parallel-connected memristors between row and column wires.
TABLE I: Mapping of W_5 to Conductance

First, all radix-X weights are positively shifted by the magnitude of the minimum weight. This translates the minimum weight to 0. Next, each level-shifted weight is divided by the resistance of a single memristor to calculate the equivalent conductance. For example, in radix-5, the minimum weight is -2. Where W_5 = -1, a level shift of +2 gives +1, and the equivalent resistance can be found by dividing R_M by this value. Table I shows that only one memristor (1M) should be connected between the row and column wires to attain R_M. For W_5 = 0, the equivalent resistance will be R_M/2; Table I indicates that two memristors are connected in parallel. The equivalent conductance is given by:

G_i = \frac{W_{X,i} + (X-1)/2}{R_M}    (9)

Substituting (9) into (2) gives the following equation for the column current for n rows in radix-5:

I = \sum_{i=1}^{n} \frac{W_{5,i} + 2}{R_M} V_i    (10)
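The sketch below (hypothetical values, radix-5) maps one column of signed weights to parallel-memristor counts as per Table I and evaluates the column current of (10):

import numpy as np

R_M = 10e3                                # assumed single-memristor resistance (ohms)
w5 = np.array([-2, -1, 0, 1, 2])          # radix-5 weights of one column
v  = np.array([0.1, 0.2, 0.0, 0.1, 0.2])  # input voltages on the row wires (volts)

n_memristors = w5 + 2                     # level shift: {-2,...,2} -> {0,...,4} parallel devices
G = n_memristors / R_M                    # equivalent crosspoint conductances, as in (9)
I_col = np.dot(G, v)                      # column current, as in (10)
print(n_memristors, I_col)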
Fig. 27: A simple example showing the issues that occur when linearly mapping negative weights to positive conductances in an ANN without a level-shift. All values are color coordinated.

However, this is an insufficient representation of output current. To see why, consider Figs. 27(a) and (c), which are radix-5 ANNs consisting of 3 unique inputs, and Figs. 27(b) and (d), which are the crossbar array equivalents using our parallel-connected structure. The equivalent conductances are derived from Table I and (9). The current through the first column in Fig. 27(b) is calculated using (10):

(11)

and for the first column of Fig. 27(d):

(12)

Although the outputs of the two ANNs in Figs. 27(a) and (c) are identical, the read-out currents from Figs. 27(b) and (d) are different. This is a result of element-wise level shifts of weights causing subsequent mismatch.

To counter the level-shift, we must design an adaptive reference line to be subtracted from the signal columns. To do this, we note that the minimum column current in Fig. 27(b) corresponds to the ANN output of Y = 0. If we subtract this minimum current from each column current, the resulting set of column currents has a 1:1 correspondence to the ANN outputs. For Fig. 27(c), subtracting the minimum current from the column currents of Fig. 27(d) likewise recovers currents that match the ANN outputs. In both cases, the solution is to subtract the current corresponding to the ANN output of ‘0’ from all column signals.

In a radix-5 crossbar array, we create our own zero-weight reference column by having two memristors in parallel at each row (2M in Table I). This corresponds to a radix-5 weight of 0 for an entire column. (This is generalizable beyond radix-5 to radix-X, where the zero-weight conductance can be calculated by substituting W_X = 0 into (9). It is this ability to generalize that enables our algorithm to have an adaptive precision: radix-5 is simply a test case for demonstration.) The output current of the reference line can be calculated by substituting W_5 = 0 into (10):

I_{ref} = \frac{2}{R_M} \sum_{i=1}^{n} V_i    (13)

This is generalized to any radix-X numeral system by substituting W_X = 0 into (9), and the result into (2):

I_{ref} = \frac{X-1}{2 R_M} \sum_{i=1}^{n} V_i    (14)

The reference current is dependent on the input voltages, and therefore cannot be implemented using a constant current. This was demonstrated by example in Fig. 27. The reference current is converted into a voltage using an op-amp, and subtracted from all signal voltages with an array of differential amplifiers.

Fig. 28: The structure of the proposed circuit is able to represent negative weights with the addition of a single column. The linear shift of the reference signal is applied by subtracting V_ref from the output of each column. V_ref is the output of the inverting amplifier on the reference line, V_col is the output of the inverting amplifier on a signal column, and V_out is the output voltage of a single column, which corresponds to an artificial neuron. The relevant equation number for each signal is shown in the figure.

The hardware-level implementation of the level-shift is shown in Fig. 28, with the reference line highlighted in red. The inverting amplifiers are used to fix all columns at virtual ground. To find the potential at the output of the inverting amplifier on the reference line, note that I_ref from (13) passes through the negative feedback resistor R_f:

V_{ref} = -R_f I_{ref} = -\frac{2 R_f}{R_M} \sum_{i=1}^{n} V_i    (15)

Similarly, for the inverting amplifier output of the signal columns:

V_{col} = -R_f I = -\frac{R_f}{R_M} \sum_{i=1}^{n} \left( W_{5,i} + 2 \right) V_i    (16)

Given all resistors of the differential amplifier are equivalent, the output stage of the crossbar array is a subtractor with V_ref from (15) passed into the positive terminal, and V_col from (16) into the negative terminal:

V_{out} = V_{ref} - V_{col} = \frac{R_f}{R_M} \sum_{i=1}^{n} W_{5,i} V_i    (17)

The final result of (17) shows how the ‘+2’ linear shift is removed by V_ref, thus ensuring a correct representation of negatively weighted MVMs following the demonstration in Fig. 27.
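A numerical sketch of this reference scheme (reusing the hypothetical values from the sketch above, with an assumed feedback resistance R_f) confirms that the differential stage of (17) recovers the signed dot product:

import numpy as np

R_M, R_f = 10e3, 10e3                     # assumed memristor and feedback resistances (ohms)
w5 = np.array([-2, -1, 0, 1, 2])          # radix-5 weights of one signal column
v  = np.array([0.1, 0.2, 0.0, 0.1, 0.2])  # input voltages (volts)

I_col = np.dot((w5 + 2) / R_M, v)         # signal column current, as in (10)
I_ref = np.dot(np.full(5, 2.0 / R_M), v)  # reference column current (2M per row), as in (13)

V_col = -R_f * I_col                      # inverting amplifier on the signal column, as in (16)
V_ref = -R_f * I_ref                      # inverting amplifier on the reference line, as in (15)
V_out = V_ref - V_col                     # differential (subtractor) stage, as in (17)

print(V_out, R_f / R_M * np.dot(w5, v))   # both equal the signed MVM scaled by R_f / R_M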

The relationship between a neural network input x_i and the input voltage V_i in the circuit is given as

V_i = k x_i    (18)

where k is the scaling factor of the input voltage. Substituting (18) and (1) into (17) obtains:

V_{out} = \frac{k R_f}{R_M} \sum_{i=1}^{n} W_{5,i} x_i = \frac{k R_f}{R_M} y    (19)

This verifies that the output voltage of our radix-X CNN accelerator is simply scaled by k R_f / R_M, and concludes that we are able to represent multi-bit negative weights with parallel-connected memristors without duplicative columns.

V Simulation Results

Fig. 29: The architecture of the CNN used. Radix-5 weight and activation blocks are marked in red.
Fig. 30: Validation accuracy of a real-valued CNN, a BNN, and the proposed radix-5 CNN. While the training performance of the full-precision CNN and the radix-5 CNN shows only a slight difference, the use of a BNN results in a noticeable drop in performance.

We conducted a simulation of the radix-X CNN accelerator described above with all memristors being used as read-only memory, and peripheral circuitry in the SK Hynix 180nm CMOS process. The characteristics of the simulated memristor are based on our own Al/TiO2/TiO_x/Al crossbar array, the details of which are provided in the next section. The relevant features for our feed-forward simulation of a pre-trained network are the LRS and HRS resistances of the device. As all parallel configurations are fixed on our crossbar, there was no need to consider switching time characteristics and programming variations. The relatively large width of our metal lines (20 μm) meant low line resistance, and so line losses were negligible. When scaling the metal lines down and the number of rows and columns up, this assumption will need to be adapted accordingly. The final idealization made was assuming negligible device-to-device variation, which was accounted for in experimentation. The peripheral resistances and the scaling factor k were chosen to ensure read voltages did not exceed the switching threshold.

                   Training Accuracy           Validation   Area*
                   10 epochs    500 epochs     Accuracy
Real-valued CNN    92%          99%            91.5%        8400
BNN                88%          99%            86.0%        8400
Radix-5 CNN        92%          99%            90.5%        4600

*SK Hynix 180nm CMOS process. Area is based on the layout in the BEOL before the pad level, where the CNN and BNN implementations require differential pairs for signed weight representation, as in Fig. 22.
TABLE II: Comparison of Memristor-Based CNNs for CIFAR-10

The architecture of our radix-5 CNN is shown in Fig. 29. We evaluated the validation accuracy for three implementations: a high precision 16-bit CNN, a BNN, and the proposed radix-5 CNN. Fig. 30 shows the classification accuracy during training on the CIFAR-10 dataset, where the high precision CNN and radix-5 CNN showed a difference in accuracy of approximately 0.8%. This is a 5.3% improvement over BNNs, which is to be expected given the higher base value used, but for a substantial decrease in chip area. A more detailed comparison is summarized in Table II.

Fig. 33: (a) Simulation of the proposed architecture. (b) Output of each node for various inputs.

As shown in Fig. 33, the behavior of a simple neural network for the proposed radix-5 CNN is fully implemented and simulated. Analyzing the simulation results in Fig. 33(b) shows that the output of the neuron for the first input pulse interval is verified with (2):

(20)

Therefore,

(21)

In the same manner as (20) and (21), the outputs from the second and third input pulses are obtained. The results of our simulation agree with our mathematical derivations in section IV.

VI Experimental Results

VI-A Nanofabrication

Fig. 34: Al/TiO2/TiO_x/Al memristor device imaged using a focused ion beam (FIB) analyzer.

We fabricated a proof-of-concept parallel-connected crossbar array in-house to demonstrate the feasibility of the proposed memristor-based radix-5 CNN method. This was achieved with a sandwich structure composed of Al/TiO2/TiO_x/Al layers. A 200-nm-thick Al layer was deposited as the bottom electrode on a glass wafer. Standard photolithography was conducted to produce 20-μm-wide Al lines. During the microfabrication process, the wafers were irradiated using a mask alignment system for 100 s and then developed at 296 K for 120 s. The Al channel was then defined by wet etching (H3PO4:HNO3:CH3COOH:H2O = 80 ml : 5 ml : 5 ml : 10 ml), removing any Al outside of the channel regions at an etching rate of 300 nm/min. A 5-nm-thick TiO2 thin film and a 15-nm-thick TiO_x thin film were formed by atomic layer deposition (ALD) and magnetron sputtering. Subsequently, another 200-nm-thick Al layer was sputtered as the top electrode, followed by standard photolithography to create windows. Fig. 34 shows a cross-sectional image of a single memristor taken with a focused ion beam (FIB) analyzer.

VI-B Image Processing

Fig. 37: 2D convolution performed on 100 images from the MNIST dataset with a radix-5 Sobel filter on the crossbar array in Fig. 34. (a) Before processing. (b) After processing.

Fig. 42: A scaled-up inspection of hardware-based 2D convolution. (a) Before processing digit ‘4’. (b) After processing digit ‘4’. (c) Before processing digit ‘8’. (d) After processing digit ‘8’.

We performed image convolution on 100 images of handwritten digits from the MNIST dataset, each 28 × 28 pixels in dimension [43], and passed them through a Sobel filter, which is typically used in edge detection algorithms. The Sobel operator takes the form of a 3 × 3 matrix in radix-5 form:

\begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}    (22)

The rationale is that, if the crossbar is capable of performing MVMs, then by extension, classification tasks using a CNN will also be possible on larger arrays. The image is processed using similar parameters to those in the simulations, where input pixels are linearly mapped from a null input for a black pixel up to the maximum read voltage for a white pixel. As per Table I, a kernel element of ‘-2’ is implemented as an open junction at a crosspoint, and an element of ‘2’ is mapped to four parallel-connected memristors. The maximum current drawn from a single memristor and the critical column current were measured under the test case of MNIST images passed through the edge detection filter. This column current is relatively small when compared to similar arrays based on conduction via oxygen vacancies, but this is a result of having a small-scale array rather than low read voltages. The output voltages were then linearly mapped back into output images. Qualitatively, we successfully generated a near perfect 2D convolution with a stride of 1 and no zero-padding, as can be seen in Fig. 37, and a scaled-up sample in Fig. 42. The small-scale prototyped nature of our array meant that for a 3 × 3 kernel, each pixel required 3 read cycles where 4 output pixels could be pipelined across columns, and convolving an image required a total of 21 read cycles.
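For comparison with the hardware output, the equivalent software convolution can be reproduced in a few lines. The sketch below is our own; the Sobel variant shown and the pixel scaling are illustrative assumptions (stride 1, no zero-padding, as above):

import numpy as np

# A radix-5 Sobel kernel (vertical-edge variant chosen for illustration).
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

def convolve2d_valid(image, kernel):
    # 2D convolution with a stride of 1 and no zero-padding ('valid' output size).
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# A synthetic 28x28 image standing in for an MNIST digit; pixel values in [0, 1],
# where 0 corresponds to a null input voltage and 1 to the maximum read voltage.
image = np.random.rand(28, 28)
edges = convolve2d_valid(image, kernel)
print(edges.shape)  # (26, 26)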

VII Discussion

Implementing BNNs on memristor crossbars is a common technique used to enhance the robustness of crossbar arrays in light of analog write variability. Our proposed technique follows this conservative design methodology, where the radix-X CNN accelerator uses single-bit memristors. Rather than using binarized encoding across multiple columns, we instead modulate the number of memristors at crosspoints between row and column lines (i.e., a 1TXM cell), and have thus proposed a new crossbar architecture and co-developed an algorithm specifically suited to adapt to the number of memristors per cell. The first trade-off to consider is the number of additional memristors per cell, as compared with additional columns, to improve precision and implement negative weights. This analysis is process dependent, and in our array where the metal lines occupy a width of 20 microns, the minimum width of a single memristor is of sub-micron pitch (and of a few nanometers in more advanced processes [44, 45]). For single-bit memristors in conventional binarized crossbars, the closest equivalent comparison to radix-5 is the use of 2-bit weights, which requires a total of 4 columns (2 for positive weights, and 2 for the differential pair). We are able to implement the above scheme in 2 columns, with a 20% improvement in precision using radix-5 over 2-bit representations. The alternative option for column reduction is to use analog weights, which remains a developing but promising field of research. The limiting factor is where the radix of the numeral system becomes larger, resulting in an increasing number of parallel-connected memristors per cell, and an associated reduction in equivalent resistance. Larger metal lines and more vias are needed to cope with the increased current. While our array had no issues with the critical column current (due to the wide metal lines used in our process, which had ample current capacity; see Fig. 18(b)), this will become an increasingly important trade-off when optimizing for higher values of X in radix-X. The effect of decreasing equivalent resistance can be partially mitigated by reducing the read voltage, where state-of-the-art crossbar arrays have demonstrated far lower read currents [33].

The second trade-off is with respect to pipelining. Given that parallel-connections are fixed at the time of fabrication, the radix-X crossbar will typically be optimized for specific conductance matrices. In general, this will be advantageous only for kernels containing a particular set of elements. The benefit to reduced reconfigurability is that write-variability is no longer an issue, and endurance is also prolonged due to the application of only read pulses.

VIII Conclusion

We have proposed a crossbar array with multiple metal-oxide thin film switches at each crosspoint, and a co-designed algorithm tailored for this inference accelerator that converts a set of pre-trained weights into values based on a user-selected precision. We conducted CNN classification on the CIFAR-10 dataset using a large-scale simulation, and performed experimental validation of convolutional image processing on a subset of the MNIST dataset using a small-scale crossbar array. We demonstrated that we could achieve multi-bit and negative weights while reducing area by 46% compared with conventional differential pairs of columns, all whilst including an adaptive precision mechanism within our array. What has been proposed is not an exhaustive use of this array. For example, future work includes the use of transistor switches to reconfigure the number of memristors at each crosspoint to enable a higher degree of reconfigurability. Alternatively, as research on multi-bit memristors matures and the number of reliable memristance values increases, these will be key enablers for achieving higher precision by extending the range of possible base values usable for a given crossbar dimension in radix-X.

References

  • [1] H. Yanagisawa, T. Yamashita, and H. Watanabe, “A study on object detection method from manga images using CNN,” Int. Workshop on Advanced Image Technology (IWAIT), pp. 1–4, IEEE, Jan. 2018.
  • [2] B. Khagi, C. G. Lee, and G. R. Kwon, “Alzheimer’s disease classification from brain MRI based on transfer learning from CNN,” Biomedical Engineering Int. Conf. (BMEiCON), pp. 1–4, IEEE, Nov. 2018.
  • [3] D. Ushizima, C. Yang, S. Venkatakrishnan, F. Araujo, R. Silva, H. Tang, J. V. Mascarenhas, A. Hexemer, D. Parkinson, and J. Sethian, “Convolutional neural networks at the interface of physical and digital data,” 2016 IEEE Applied Imagery Pattern Recognition Workshop, pp. 1–12, Oct. 2016.
  • [4] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition: A convolutional neural-network approach,” IEEE Trans. Neural Networks, vol. 8, no. 1, pp. 98–113, Jan. 1997.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 770–778, Jun. 2016.
  • [6] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” Sep. 2014, arXiv preprint arXiv: 1409.1556.
  • [7] K. Kiningham, M. Graczyk and A. Ramkumar, “Design and Analysis of a Hardware CNN Accelerator,” Small, vol. 27, no. 6, Jun. 2016.
  • [8] S. Hong, I. Lee, and Y. Park, “NN compactor: Minimizing memory and logic resources for small neural networks,” IEEE 2018 Design, Automation and Test in Europe Conf. and Exhibition (DATE), pp. 581–584, Mar. 2018.
  • [9] C. F. Chen, G. G. Lee, V. Sritapan, and C. Y. Lin, “Deep convolutional neural network on iOS mobile devices,” IEEE Int. Workshop on Signal Proc. Systems (SiPS), pp. 130–135, Oct. 2016.
  • [10] J. Wang, B. Cao, P. Yu, L. Sun, W. Bao, and X. Zhu, “Deep learning towards mobile applications,” IEEE Int. Conf. on Distributed Computing Systems (ICDCS), pp. 1385–1393, Jul. 2018.
  • [11] Y. Zhang, X. Wang, and E. G. Friedman, “Memristor-based circuit design for multilayer neural networks,” IEEE Trans. Circuits and Systems I: Regular Papers, vol. 65, no. 2, pp. 677–686, Feb. 2018.
  • [12] G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler, K. Virwani, M. Ishii, P. Narayanan, A. Fumarola, L. L. Sanches, I. Boybat, M. L. Gallo, K. Moon, J. Woo, H. Hwang, and Y. Leblebici, “Neuromorphic computing using non-volatile memory,” Advances in Physics: X, vol. 2, no. 1, pp. 89–124, Jan. 2017.
  • [13] C. Yang, H. Kim, S. Adhikari, and L. Chua, “A circuit-based neural network with hybrid learning of backpropagation and random weight change algorithms,” Sensors, vol. 17, no. 1, pp. 16, Dec. 2017.
  • [14] C. Li, D. Belkin, Y. Li, P. Yan, M. Hu, N. Ge, H. Jiang, E. Montgomery, P. Lin, Z. Wang, W. Song, J. P. Strachan, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, “Efficient and self-adaptive in-situ learning in multilayer memristor neural networks,” Nature Communications, vol. 9, no. 1, pp. 2385, Jun. 2018.
  • [15] C. Liu, Q. Yang, C. Zhang, C. Jiang, Q. Wu, and H. H. Li, “A memristor-based neuromorphic engine with a current sensing scheme for artificial neural network applications,” IEEE Asia and South Pacific Design Automation Conf. (ASP-DAC), pp. 647–652, Jan. 2017.
  • [16] Y. Zhao, B. Li, and G. Shi, “A current-feedback method for programming memristor array in bidirectional associative memory,” IEEE Int. Symp. Intelligent Signal Processing and Commun. Systems (ISPACS), pp. 747–751, Nov. 2017.
  • [17] J. K. Eshraghian, K. Cho, C. Zheng, M. Nam, H. H. C. Iu, W. Lei, and K. Eshraghian, “Neuromorphic Vision Hybrid RRAM-CMOS Architecture,” IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 26, no. 12, pp. 2816–2829, Dec. 2018.
  • [18] M. Hu, H. Li, Y. Chen, Q. Wu, G. S. Rose, and R. W. Linderman, “Memristor crossbar-based neuromorphic computing system: A case study,” IEEE Trans. Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1864–1878, Oct. 2014.
  • [19] J. K. Eshraghian, H. H. C. Iu, T. Fernando, D. Yu, and Z. Li, “Modelling and characterization of dynamic behavior of coupled memristor circuits,” 2016 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 690–693, May 2016.
  • [20] L. Ni, Y. Wang, H. Yu, W. Yang, C. Weng, and J. Zhao, “An energy-efficient matrix multiplication accelerator by distributed in-memory computing on binary RRAM crossbar,” IEEE Asia and South Pacific Design Automation Conf. (ASP-DAC), pp. 280–285, Jan. 2016.
  • [21] S. Stathopoulos, A. Khiat, M. Trapatseli, S. Cortese, A. Serb, I. Valov, and T. Prodromakis, “Multibit memory operation of metal-oxide bi-layer memristors,” Scientific Reports, vol. 7, no. 1, p. 17532, Dec. 2017.
  • [22] T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang, “Binary convolutional neural network on RRAM,” IEEE Asia and South Pacific Design Automation Conf. (ASP-DAC), pp. 782–787, Jan. 2017.
  • [23] C. Li, M. Hu, Y. Li, H. Jiang, N. Ge, E. Montgomery, J. Zhang, W. Song, N. Davila, C. E. Graves, Z. Li, J. P. Strachan, P. Lin, Z. Wang, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, “Analogue signal and image processing with large memristor crossbars,” Nature Electronics, vol. 1, no. 1, pp. 52–59, Jan. 2018.
  • [24] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” Feb. 2016, arXiv preprint arXiv:1602.02830.
  • [25] O. Krestinskaya and A. P. James, “Binary Weighted Memristive Analog Deep Neural Network for Near-Sensor Edge Processing,” 2018 IEEE 18th International Conference on Nanotechnology (IEEE-NANO), Jul. 2018.
  • [26] J. K. Eshraghian, S. M. Kang, S. Baek, G. Orchard, H. H. C. Iu, and W. Lei, “Analog weights in ReRAM DNN Accelerators,” 2019 IEEE Int. Conf. on Artificial Intelligence Circuits and Systems (AICAS), Mar. 2019.
  • [27] M. -J. Lee, C. B. Lee, D. Lee, S. R. Lee, M. Chang, J. H. Hur, Y. Kim, C. Kim, D. H. Seo, S. Seo, U. Chung, I. Yoo, and K. Kim, “A fast, high-endurance and scalable non-volatile memory device made from asymmetric Ta2O_5−x/TaO_2−x bilayer structures,” Nature Materials, vol. 10, pp. 625–630, Jul. 2011.
  • [28] A. C. Torrezan, J. P. Strachan, G. Medeiros-Ribeiro, and R. S. Williams, “Sub-nanosecond switching of a tantalum oxide memristor,” Nanotechnology, vol. 22, no. 48, p. 485203, Nov. 2011.
  • [29] B. J. Murdoch, D. G. McCulloch, R. Ganesan, D. R. McKenzie, M. M. M. Bilek, and J. G. Partridge, “Memristor and selector devices fabricated from HfO_2−xN_x,” Applied Physics Letters, vol. 108, p. 143504, Apr. 2016.
  • [30] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The missing memristor found,” Nature, vol. 453, pp. 80–83, May 2008.
  • [31] J. J. Yang, M. D. Pickett, X. Li, D. A. A. Ohlberg, D. R. Stewart, and R. S. Williams, “Memristive switching mechanism for metal/oxide/metal nanodevices,” Nature Nanotechnology, vol. 3, pp. 429–433, Jun. 2008.
  • [32] D. Kwon, K. M. Kim, J. H. Jang, J. M. Jeon, M. H. Lee, G. H. Kim, X. Li, G. Park, B. Lee, S. Han, M. Kim, and C. S. Hwang, “Atomic structure of conducting nanofilaments in TiO2 resistive switching memory,” Nature Nanotechnology, vol. 5, pp. 148–153, Jan. 2010.
  • [33] E. J. Fuller, S. T. Keene, A. Melianas, Z. Wang, S. Agarwal, Y. Li, Y. Tuchman, C. D. James, M. J. Marinella, J. J. Yang, A. Salleo, and A. A. Talin, “Parallel programming of an ionic floating-gate memory array for scalable neuromorphic computing,” Science, vol. 364, no. 6440, pp. 570–574, May 2019.
  • [34] J. K. Eshraghian, K. R. Cho, H. H. C. Iu, T. Fernando, N. Iannella, S. M. Kang, and K. Eshraghian, “Maximization of Crossbar Array Memory Using Fundamental Memristor Theory,” IEEE Trans. on Circuits and Syst. II: Express Briefs, vol. 64, no. 12, pp. 1402–1406, Dec. 2017.
  • [35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, Dec. 1989.
  • [36] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, no. 4, pp. 193–202, Apr. 1980.
  • [37] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, pp. 386–408, Nov. 1958.
  • [38] M. Courbariaux, Y. Bengio, and J. P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.
  • [39] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” European Conf. on Computer Vision, pp. 525–542, Oct. 2016.
  • [40] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” Dec. 2016, arXiv preprint arXiv:1612.01064.
  • [41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Dec. 2014, arXiv preprint arXiv:1412.6980.
  • [42] J. K. Eshraghian and J. Lee, mrRadix, (2019), GitHub repository, https://github.com/jeshraghian/mrRadix
  • [43] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. of the IEEE, vol. 86, no. 11, pp. 2278–2323, Nov. 1998.
  • [44] S. Pi, C. Li, H. Jiang, W. Xia, H. Xin, J. J. Yang, and Q. Xia, “Memristor crossbar arrays with 6-nm half-pitch and 2-nm critical dimension,” Nature Nanotechnology, vol. 14, pp. 35–39, Jan. 2019.
  • [45] X. Zhu, S. H. Lee, and W. D. Lu, “Nanoionic resistive-switching devices,” Advanced Electronic Materials, p. 1900184, May 2019.