The perceptron, introduced by Rosenblatt in 1958 rosenblatt1958perceptron, was one of the first models for supervised learning. In a perceptron, the inputs are linearly combined with coefficients given by the weights, together with a bias, to form the input to the neuron (see Fig. 1). This weighted sum is then fed into a non-linear function with a binary output. The goal of the perceptron is thus to find a set of weights that correctly assigns inputs to one of two predetermined binary classes. The correct weights for this task are found by an iterative training process, for instance the delta rule widrow1960adaptive. However, the perceptron is only capable of learning linearly separable patterns, as was shown in 1969 by Minsky and Papert minsky2017perceptrons.
These limitations triggered a search for more capable models, which eventually resulted in the proposal of the multilayer perceptron. Such a network can be seen as several layers of perceptrons connected to each other by synapses (see Fig. 2). This structure ensures that the multilayer perceptron does not suffer from the same limitations as Rosenblatt’s perceptron. In fact, the Universal Approximation Theorem cybenko1989approximation states that a multilayer perceptron with at least one hidden layer of neurons and with suitably chosen activation functions can approximate any continuous function to arbitrary accuracy.
There are various methods to train a neural network such as the multilayer perceptron. One of the most widespread is the backpropagation algorithm, a generalization of the original delta rule rumelhart1986learning.
Artificial neural networks such as the multilayer perceptron have proven extremely useful in solving a wide variety of problems rowley1998neural; devlin2014fast; ercal1994neural, but they have thus far mostly been implemented on digital computers. This means that we are not profiting from some of the advantages that these networks could have over traditional computing paradigms, such as very low energy consumption and massive parallelization jain1996artificial. Keeping these advantages is, of course, of utmost interest, and this could be done if a physical neural network were used instead of a simulation on a digital computer. In order to construct such a network, a suitable building block must be found, with the memristor being a good candidate.
Beyond these energy considerations, since MLPs are universal function approximators, our proposal of MLPs based only on memristors implies that memristive circuits can approximate any smooth function to arbitrary accuracy.
The memristor was first introduced in 1971 as a two-terminal device that behaves as a resistor with memory chua1971memristor. The three known elementary circuit elements, namely the resistor, the capacitor and the inductor, can be defined by the relation they establish between two of the four fundamental circuit variables: the current $I$, the voltage $V$, the charge $q$ and the flux-linkage $\varphi$. There are six possible combinations of these four variables, five of which lead to widely-known relations: three from the circuit elements mentioned above, and two given by $\mathrm{d}q = I\,\mathrm{d}t$ and $\mathrm{d}\varphi = V\,\mathrm{d}t$. This means that only the relation between $\varphi$ and $q$ remains to be defined: the memristor provides this missing relation. Despite having been predicted in 1971 using this argument, it was not until 2008 that the existence of memristors was demonstrated at HP Labs strukov2008missing, which led to a new boom in memristor-related research prodromakis2010review. In particular, there have been proposals of how memristors could be used in Hebbian learning systems soudry2013hebbian; cantley2011hebbian; he2014enabling, in the simulation of fluid-like integro-differential equations barrios2018analog, in the construction of digital quantum computers pershin2012neuromorphic and in the implementation of non-volatile memories ho2009nonvolatile.
The pinched current-voltage hysteresis loop inherent to memristors endows them with intrinsic memory capabilities, leading to the belief that they might be used as a building block in neural computing architectures traversa2015universal; pershin2010experimental; yang2013memristive. Furthermore, the relatively small dimension of memristors, the fact that they can be laid out in a very dense manner and their non-volatile nature may lead to highly parallel, energy efficient neuromorphic hardware strachan2011measuring; jeong2016memristors; taha2013exploring; indiveri2013integration.
The possibility of using memristors as synapses in neural networks has been extensively studied. The wealth of proposals in this field can be broadly split into two groups: one related to spike-timing-dependent plasticity (STDP) and spiking neural networks (SNN) mostafa2015implementation; thomas2013memristor; ebong2012cmos; afifi2009implementation; querlioz2011simulation, and the other to more traditional neural network models soudry2015memristor; hasan2014enabling; bayat2017memristor; negrov2017approximate; emelyanov2016first; wang2013memristive; yakopcic2013energy; demin2015hardware; duan2015memristor; prezioso2015training; wu2012synchronization; wen2018general; adhikari2012memristor. The first group has a more biological focus, with its main goal being the reproduction of effects occurring in natural neural networks, rather than algorithmic improvements. In fact, the convergence of STDP-based learning is not guaranteed for general inputs soudry2015memristor. The second group is more oriented towards neuromorphic computing and is composed of two major architectures, one based on memristor crossbars and another on memristor arrays.
Despite all these results, and to the best of our knowledge, all existing proposals use memristors exclusively as synapses, with the networks’ neurons being implemented by some other device. The main goal of this paper is thus to introduce a memristor-based perceptron, i.e., a single-layer perceptron (SLP) in which both synapses and neurons are built from memristors. It will be generalized to a memristor-based multilayer perceptron (MLP) and we will also introduce learning rules for both perceptrons, based on the delta rule for the SLP, and on the backpropagation algorithm for the MLP.
Recently, the universality of memristors has been studied for Boolean functions lehtonen2010two and as a memcomputing equivalent of a Universal Turing Machine (Universal Memcomputing Machine traversa2015universal). However, to the best of our knowledge, it has not yet been shown that the memristor is a universal function approximator. This result will come as a consequence of the introduction of the above-mentioned memristor-based MLP.
II The memristor as a dynamical system
In general, a current-controlled memristor is a dynamical system whose evolution is described by the following pair of equations chua1971memristor
$$V = M(\vec{\mu}, I)\, I, \tag{1a}$$
$$\dot{\vec{\mu}} = \vec{f}(\vec{\mu}, I). \tag{1b}$$
The first one is Ohm’s law and relates the voltage output $V$ of the memristor with the current input $I$ through the memristance $M(\vec{\mu}, I)$, which is a scalar function depending both on $I$ and on the set of the memristor’s internal variables $\vec{\mu}$. This dependence of the memristance on the internal variables induces the memristor’s output dependence on past inputs, i.e., this is the mechanism that endows the memristor with memory. The second equation describes the time-evolution of the memristor’s internal variables by relating their time derivative, $\dot{\vec{\mu}}$, to an $n$-dimensional vector function $\vec{f}(\vec{\mu}, I)$, depending on both previous values of the internal variables and the input of the memristor.
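As an illustration of how such a dynamical system acquires memory, the following minimal sketch integrates a memristor of this general form with a forward-Euler scheme. The toy memristance and state-update functions (`toy_memristance`, `toy_f`), the single internal variable and the sinusoidal drive are assumptions made purely for illustration; they are not the device models used in this paper.

```python
import numpy as np

def simulate_memristor(current, dt, mu0, memristance, f):
    """Forward-Euler integration of V = M(mu, I) * I and d(mu)/dt = f(mu, I)."""
    mu = np.array(mu0, dtype=float)
    voltages = []
    for i_t in current:
        voltages.append(memristance(mu, i_t) * i_t)  # state-dependent Ohm's law
        mu = mu + dt * f(mu, i_t)                    # evolve the internal variables
    return np.array(voltages), mu

# Hypothetical toy device with a single internal variable (illustration only).
def toy_memristance(mu, i):
    return 1.0 + mu[0]

def toy_f(mu, i):
    return np.array([i])  # the internal variable accumulates the input current

t = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
dt = t[1] - t[0]
current = np.sin(t)
voltages, mu_final = simulate_memristor(current, dt, [0.0], toy_memristance, toy_f)
```

In this toy example the same input current applied at two different times produces different output voltages, because the internal variable has changed in between, while the voltage still vanishes whenever the current does.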
II.1 Memristor-based Single-Layer Perceptron
The delta rule adjusts each weight in proportion to the error, i.e., the difference between the target output and the actual output.
Our goal is to implement a perceptron and an adaptation of the delta rule to train it using only a memristor. To this end, we use the memristor’s internal variables to store the SLP’s weights and the learning rate. Equation (1b) allows us to control the evolution of the memristor’s internal variables and implement a learning rule. If, for example, we want to implement an SLP with two inputs, we need a memristor with four internal variables: two to store the weights of the connections between the inputs and the SLP, a third to store the SLP’s bias weight and a fourth for the learning rate.
Let us then consider a memristor with these four internal state variables. It could be difficult to externally control multiple internal variables. However, a possible solution is to use several memristors that meet the chosen requirements, each with a single externally controlled internal variable.
In order to understand the form of the functions that govern the evolution of the internal variables, we must remember that we expect different behaviours from the perceptron depending on the stage of the algorithm. In the forward propagation stage, the weights must remain constant to obtain the output for a given input. In this phase the internal variables must not change. On the other hand, in the backpropagation stage, we want to update the perceptron’s weights by changing the internal variables. However, it may happen that the update is different for each of the weights, so we need to be able to change only one of the internal variables without affecting the others.
There are thus three possible scenarios in the backpropagation stage, one for each weight: in each of them we want to update one of the three weights (the two input weights or the bias weight) while the other two do not change. To reconcile this with the fact that a memristor takes only one input, we propose the use of threshold-based functions, as well as a bias current, for the evolution of the internal variables.
These update functions combine an activation function with Heaviside functions. Each internal variable has its own threshold, and a further parameter determines the width of the threshold window, i.e., the range of current values for which the internal variables are updated. The first term of the update function can only be non-zero if the input current is positive, whereas the second term can only be non-zero if the input current is negative, allowing us to both increase and decrease the values of the internal variables. If the three thresholds are sufficiently different from each other and from zero, we can reach the correct behaviour by choosing the memristor’s input appropriately. The thresholds and the window-width parameter are thus hyperparameters that must be calibrated for each problem. In the aforementioned construction, in which our memristor with multiple internal variables is built as an equivalent memristor, we can also use an external current or voltage control to keep the internal variable fixed. In fact, this is how it is usually addressed experimentally yang2013memristive; xia2009memristor; yu2015dynamic; budhathoki2013composite. Therefore, we can assume that this construction is possible. It is important to note that, in an experimental implementation, this threshold system does not need to be based on the input currents’ intensities. It can, for instance, be based on the use of signals of different frequencies for each of the internal variables or on the encoding of the signals meant for each of the internal variables in AC voltage signals.
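The threshold mechanism described above can be sketched as follows. This is not the update function of this paper, which also involves an activation function and a bias current: the rectangular window built from two Heaviside functions, the `gain` parameter and the numerical thresholds are all assumptions chosen to show how a single input current can address one internal variable at a time and set the direction of its change through its sign.

```python
import numpy as np

def heaviside(x):
    """Heaviside step function, taken as 1 at x = 0."""
    return np.where(x >= 0, 1.0, 0.0)

def window(current, thresholds, width):
    """1 where threshold <= current < threshold + width, 0 elsewhere."""
    return heaviside(current - thresholds) - heaviside(current - thresholds - width)

def state_rates(current, thresholds, width, gain=1.0):
    """Rate of change of each internal variable for a given input current.

    Hypothetical form: variable i changes only when |current| lies in the
    window [thresholds[i], thresholds[i] + width); the sign of the current
    selects whether the variable increases or decreases.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    increase = window(current, thresholds, width)    # non-zero only for positive currents
    decrease = window(-current, thresholds, width)   # non-zero only for negative currents
    return gain * (increase - decrease)

thresholds = [1.0, 2.0, 3.0]  # assumed values, well separated from each other and from zero
rates_forward = state_rates(0.2, thresholds, width=0.5)       # below all thresholds
rates_second_up = state_rates(2.2, thresholds, width=0.5)     # addresses the second variable
rates_third_down = state_rates(-3.1, thresholds, width=0.5)   # decreases the third variable
```

With these assumed values, a small current leaves every internal variable unchanged, as required in the forward propagation stage, while currents inside one window modify only the corresponding variable.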
We are now ready to present a learning algorithm for our SLP based on the delta rule, which is described in Algorithm 1. To generalize this procedure to an arbitrary number of inputs $N$, it suffices to use a memristor with $N+2$ internal variables ($N$ input weights, the bias weight and the learning rate) and to adapt Algorithm 1 accordingly.
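For reference, a conventional software version of delta-rule training for a two-input SLP can be sketched as below. It is a plain digital illustration and not Algorithm 1 itself: the step activation, the zero initial weights, the learning rate of 0.5 and the number of epochs are assumptions, and the memristor's threshold-based updates are replaced by direct assignments.

```python
import numpy as np

def step_output(weights, bias, x):
    """Binary perceptron output: 1 if the weighted sum exceeds zero, else 0."""
    return 1.0 if weights @ x + bias > 0 else 0.0

def train_slp(inputs, targets, learning_rate=0.5, epochs=100, seed=0):
    """Delta-rule training; the samples are shuffled at every epoch."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(inputs.shape[1])  # assumed zero initialization
    bias = 0.0
    for _ in range(epochs):
        for idx in rng.permutation(len(inputs)):
            error = targets[idx] - step_output(weights, bias, inputs[idx])
            weights += learning_rate * error * inputs[idx]  # update proportional to error and input
            bias += learning_rate * error                   # the bias sees a constant input of 1
    return weights, bias

def accuracy(weights, bias, inputs, targets):
    outputs = np.array([step_output(weights, bias, x) for x in inputs])
    return float(np.mean(outputs == targets))

inputs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
gates = {
    "OR": np.array([0.0, 1.0, 1.0, 1.0]),
    "AND": np.array([0.0, 0.0, 0.0, 1.0]),
    "XOR": np.array([0.0, 1.0, 1.0, 0.0]),
}
results = {name: accuracy(*train_slp(inputs, t), inputs, t) for name, t in gates.items()}
```

Under these assumptions the OR and AND gates are classified perfectly after training, whereas no choice of weights can classify all four XOR inputs correctly, since XOR is not linearly separable.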
II.2 Memristor-based Multilayer Perceptron
In this model, memristors are used to emulate both the connections and the nodes of an MLP. In principle, the nodes could be emulated by non-linear resistors, but using memristors allows us to take advantage of their internal variable to implement a bias weight, which in some cases proves fundamental for a successful network training.
The equations describing the evolution of the memristor at each node in this model are the same as in the seminal HP Labs paper strukov2008missing. We have chosen the experimentally tested set
Here, $R_{\mathrm{ON}}$ and $R_{\mathrm{OFF}}$ are, respectively, the doped and undoped resistances of the memristor, $D$ and $\mu_V$ are physical memristor parameters, namely the thickness of its semiconductor film and its average ion mobility, and the remaining parameter is a threshold current playing the same role as the thresholds in the model for the memristor-based SLP introduced above. Equation (10) can be simplified, since $R_{\mathrm{ON}} \ll R_{\mathrm{OFF}}$. If, for instance, we impose a constant current input on the memristor for a fixed time, the output voltage is a non-linear function of that current.
It is then possible to implement non-linear activation functions starting from Equation (10), which is an important condition for the universality of neural networks hornik1991approximation.
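To make the origin of this non-linearity concrete, the sketch below evaluates the standard linear-drift HP Labs model under a constant current. It is only an approximation of the set of equations used here: the threshold current is omitted, the doped-region width is simply clipped to the film thickness, and the parameter values are illustrative assumptions rather than the values used in our simulations.

```python
def hp_output_voltage(i0, duration, w0, r_on, r_off, d, mobility):
    """Output voltage of a linear-drift HP memristor after a constant current i0.

    The doped-region width w drifts at a rate mobility * r_on / d * i0 and is
    clipped to the physical range [0, d]; the threshold current is not modelled.
    """
    w = w0 + mobility * r_on / d * i0 * duration
    w = min(max(w, 0.0), d)
    memristance = r_on * w / d + r_off * (1.0 - w / d)
    return memristance * i0

# Illustrative (assumed) device parameters in SI units.
params = dict(duration=0.01, w0=5e-9, r_on=100.0, r_off=16e3, d=10e-9, mobility=1e-14)
v_single = hp_output_voltage(1e-4, **params)
v_double = hp_output_voltage(2e-4, **params)
```

Doubling the input current does not double the output voltage, because the current also changes the memristance during the time it is applied. This is the kind of input-output non-linearity that a node memristor can exploit as an activation function.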
Looking now at synaptic memristors, their evolution is described by
In synaptic memristors, the internal variable is used to store the weight of the respective connection, whereas in node memristors the internal variable is used to store the node’s bias weight.
As explained before, the node memristors are chosen to operate in a non-linear regime, which allows us to implement non-linear activation functions. On the other hand, we choose a linear regime for synaptic memristors, which allows us to emulate the multiplication of weights by signals.
It must be mentioned that Equation (11) is only valid while the internal variable stays within a bounded, non-negative interval. If we were to store the network weights in the internal variables using only a rescaling constant, then the weights would all have the same sign. Although convergence of the standard backpropagation algorithm is still possible in this case dickey1993optical, it is usually slower and more difficult, so it is convenient to redefine the variable strukov2008missing so that the interval of the internal variable in which Equation (11) is valid contains both positive and negative values. Using a rescaling constant, the network weights can then take either sign.
The new learning algorithm is an adaptation of the backpropagation algorithm, chosen due to its widespread use and robustness. In our case, the activation function of the neurons is the function that relates the output of a node memristor with its input, as seen in Equation (10). The local gradients of the output layer and hidden layer neurons are respectively given by:
$$\delta_k = (t_k - y_k)\, g'(v_k), \tag{16}$$
$$\delta_j = g'(v_j) \sum_k \delta_k w_{kj}. \tag{17}$$
In Equation (16), $t_k$ denotes the target output for neuron $k$ in the output layer and $y_k$ its actual output. In Equations (16) and (17), $g'(v)$ is the derivative of the neuron’s activation function with respect to the input $v$ to the neuron. Finally, in Equation (17), the sum is taken over the gradients of all neurons $k$ in the layer to the right of the neuron $j$ that are connected to it by weights $w_{kj}$. The update to the bias weight of a node memristor is given by:
$$\Delta b_j = \eta\, \delta_j, \tag{18}$$
where $\eta$ is the learning rate. The connection weight is updated using $\Delta w_{kj} = \eta\, \delta_k y_j$, where $\delta_k$ is the local gradient of the neuron to the right of the connection, and $y_j$ is the output of the neuron to the left of the connection.
We now have all the necessary elements to adapt the backpropagation algorithm for our memristor-based MLP, as described in Algorithm 2.
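The local gradients and weight updates described above can be sketched in conventional software for a network with two inputs, two hidden neurons and one output. This is not Algorithm 2: a logistic sigmoid is assumed in place of the node memristor's activation function, a squared-error cost with a factor of one half is assumed, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(params, x):
    """Forward pass; returns the hidden-layer and output-layer activations."""
    w_hidden, b_hidden, w_out, b_out = params
    y_hidden = sigmoid(w_hidden @ x + b_hidden)
    y_out = sigmoid(w_out @ y_hidden + b_out)
    return y_hidden, y_out

def backprop_updates(params, x, target, learning_rate):
    """Weight and bias updates from the local gradients of each layer."""
    w_hidden, b_hidden, w_out, b_out = params
    y_hidden, y_out = forward(params, x)
    # Output layer: error times the derivative of the (assumed sigmoid) activation.
    delta_out = (target - y_out) * y_out * (1.0 - y_out)
    # Hidden layer: derivative times the weighted sum of the gradients to the right.
    delta_hidden = y_hidden * (1.0 - y_hidden) * (w_out.T @ delta_out)
    return (learning_rate * np.outer(delta_hidden, x),      # input-to-hidden weights
            learning_rate * delta_hidden,                   # hidden bias weights
            learning_rate * np.outer(delta_out, y_hidden),  # hidden-to-output weights
            learning_rate * delta_out)                      # output bias weight

def cost(params, x, target):
    """Assumed cost: half the squared difference between target and output."""
    return 0.5 * np.sum((target - forward(params, x)[1]) ** 2)

rng = np.random.default_rng(0)
params = [rng.normal(size=(2, 2)), rng.normal(size=2),
          rng.normal(size=(1, 2)), rng.normal(size=1)]
x, target = np.array([1.0, 0.0]), np.array([1.0])
updates = backprop_updates(params, x, target, learning_rate=1.0)
```

With a unit learning rate, each update equals minus the derivative of the assumed cost with respect to the corresponding parameter, which can be checked by finite differences.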
III Simulation results
To validate our SLP and MLP, we tested their performance on three logic gates: OR, AND and XOR. The first two are simple problems which should be successfully learnt by both the SLP and the MLP, whereas only the MLP should be able to learn the XOR gate, due to Minsky-Papert’s theorem.
The Glorot weight initialization scheme glorot2010understanding was used for all simulations, as it has been shown to bring faster convergence in some problems when compared to other initialization schemes. In this scheme the weights are drawn from a zero-mean distribution and scaled by a factor that depends on $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$, the numbers of neurons in the previous and following layers, respectively. The data sets used contain randomly generated labeled elements, which were shuffled for each epoch, and the cost function is:
where $t$ is the target output and $y$ the actual output.
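The Glorot initialization mentioned above can be sketched as follows. The uniform variant with limit $\sqrt{6/(n_{\mathrm{in}}+n_{\mathrm{out}})}$ shown here is the commonly used form and is an assumption; the exact distribution and scaling used in our simulations may differ, and the function name is hypothetical.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Weights drawn uniformly from [-limit, limit], with limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
w_hidden = glorot_uniform(n_in=2, n_out=2, rng=rng)  # two inputs, two hidden neurons
w_out = glorot_uniform(n_in=2, n_out=1, rng=rng)     # two hidden neurons, one output
```

Scaling the initial weights by the layer sizes in this way keeps the typical magnitude of the signals comparable from layer to layer at the start of training.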
III.1 Single-Layer Perceptron Simulation Results
For the SLP, a single learning rate, set by trial and error, was used for all tested gates. The metric we used to evaluate the evolution of the network’s performance on a given problem was its total error over an epoch, given by Equation (20), where the sum is taken over all elements in the training set. In Fig. 3, the evolution of the total error over epochs, averaged over different realizations of the starting weights, is plotted.
We observe that our SLP successfully learns the OR and AND gates, with the total error falling to zero, as expected from an SLP. However, the total error of our SLP for the XOR gate does not go to zero, which means that it is not able to learn this gate, in accordance with Minsky-Papert’s theorem.
III.2 Multilayer Perceptron Simulation Results
The structure of the network was chosen following walczak1999heuristic. There, a network with one hidden layer of two neurons is recommended for the case of two inputs and one output. As noted in walczak1999heuristic, networks with only one hidden layer are capable of approximating any function, although in some problems adding extra hidden layers improves the performance. However, the results obtained by employing only one hidden layer are satisfactory, so there is no need for a more complex network structure. There is also the matter of how many neurons must be employed in the hidden layer. In this case, there is a trade-off between speed of training and accuracy. A network with more neurons in the hidden layer has more free parameters, so it will be able to output a more accurate fit, but at the cost of a longer training time. A rule of thumb for choosing the number of neurons in the hidden layer is to start with an amount that is between the number of inputs and the number of outputs and adjust according to the results obtained. This leads to two neurons for the hidden layer and, as was the case for the number of hidden layers, the results obtained using two neurons in the hidden layer are sufficiently accurate, so there was no need to try other structures. The learning rates, which we chose through trial and error, were set separately for the OR and AND gates and for the XOR gate. In Fig. 4, the evolution of the total error over epochs, averaged over different realizations of the starting weights, is plotted.
As was the case for our SLP, our MLP successfully learns the OR and AND gates. In fact, it is able to learn them faster than our SLP, which is a consequence of the larger number of free parameters. Additionally, it is able to learn the XOR gate, indicating that it behaves as well as a regular MLP.
In summary, both memristor-based perceptrons behave as expected. Our SLP is able to learn the OR and AND gates, but not the XOR gate, so it is limited to solving linearly separable problems, just as any other single-layer neural network. However, our MLP is not subject to such a limitation and it is able to learn all three gates.
III.3 Receiver Operating Characteristic Curves
As another measure of the perceptrons’ performance, we show in Fig. 5 the receiver operating characteristic (ROC) curves obtained with the trained perceptrons. The curves shown were obtained using an SLP trained for the OR gate, an SLP trained for the XOR gate and an MLP trained for the XOR gate, with three classification thresholds for each. Again, we see that the SLP is capable of learning the OR gate but not XOR, since it correctly classifies the inputs for OR every time, but its performance is equivalent to random guessing for XOR. We can also see that the MLP is capable of learning the XOR gate, since it correctly classifies its inputs every time. The learning rates used in training were those used for the SLP on both gates and for the MLP on the XOR gate, as explained in the previous subsections.
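For readers unfamiliar with ROC curves, the following sketch shows how a single point of such a curve is obtained from a network's continuous outputs and a classification threshold. The example scores and the three thresholds are invented for illustration and are not the values behind Fig. 5.

```python
import numpy as np

def roc_point(scores, labels, threshold):
    """False-positive and true-positive rates when classifying scores >= threshold as 1."""
    predicted = scores >= threshold
    positives = labels == 1
    tpr = np.sum(predicted & positives) / np.sum(positives)
    fpr = np.sum(predicted & ~positives) / np.sum(~positives)
    return float(fpr), float(tpr)

xor_labels = np.array([0, 1, 1, 0])
good_scores = np.array([0.1, 0.9, 0.8, 0.2])          # hypothetical outputs of a network that learnt XOR
uninformative_scores = np.array([0.5, 0.5, 0.5, 0.5])  # hypothetical outputs carrying no information

thresholds = [0.25, 0.5, 0.75]  # assumed thresholds
good_curve = [roc_point(good_scores, xor_labels, th) for th in thresholds]
chance_curve = [roc_point(uninformative_scores, xor_labels, th) for th in thresholds]
```

A classifier that has learnt the gate sits at the top-left corner of the ROC plane for every threshold, whereas uninformative outputs only produce points on the diagonal, which corresponds to random guessing.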
In this paper, we introduced models for single and multilayer perceptrons based exclusively on memristors. We provided learning algorithms for both, based on the delta rule and on the backpropagation algorithm, respectively. Using a threshold-based system, our models are able to use the internal variables of memristors to store and update the perceptron’s weights. We also ran simulations of both models, which revealed that they behaved as expected, and in accordance with Minsky-Papert’s theorem. Our memristor-based perceptrons have the same capabilities as regular perceptrons, thus showing the feasibility and power of a neural network based exclusively on memristors.
To the best of our knowledge, our neural networks are the first ones in which memristors are used as both the neurons and the synapses. Due to the Universal Approximation Theorem for multilayer perceptrons, this implies that memristors are universal function approximators, i.e., they can approximate any smooth function to arbitrary accuracy, which is a novel result in their characterization as devices for computation.
Our models also pave the way for novel neural network architectures and algorithms based on memristors. As previously discussed, such networks could show advantages in terms of energy optimization, allow for higher synaptic densities and open up possibilities for other learning systems to be adapted to a memristor-based paradigm, both in the classical and quantum learning realms. In particular, it would be interesting to try to extend these models to the quantum computing paradigm, using a recently proposed quantum memristor pfeiffer2016quantum, and its implementation in quantum technologies, such as superconducting circuits salmilehto2017quantum or quantum photonics sanz2017quantum.