# Activation Function

## What is an Activation Function?

An activation function is a function applied to the output of a node in an artificial neural network. In its classic form it outputs a small value for small inputs, and a larger value once its input exceeds a threshold: if the input is large enough, the activation function "fires", otherwise it does nothing. In other words, an activation function acts like a gate that checks whether an incoming value is greater than a critical number.

Activation functions are useful because they add non-linearities into neural networks, allowing the neural networks to learn powerful operations. If the activation functions were to be removed from a feedforward neural network, the entire network could be re-factored to a simple linear operation or matrix transformation on its input, and it would no longer be capable of performing complex tasks such as image recognition.

Well-known activation functions used in data science include the rectified linear unit (ReLU) function, and the family of sigmoid functions such as the logistic sigmoid function, the hyperbolic tangent, and the arctangent function.

Two commonly used activation functions: the rectified linear unit (ReLU) and the logistic sigmoid function. The ReLU has a hard cutoff at 0 where its behavior changes, while the sigmoid exhibits a gradual change. The ReLU is exactly 0 for negative x and the sigmoid tends to 0 as x becomes large and negative; the sigmoid tends to 1 for large positive x, while the ReLU grows without bound.

Activation functions in computer science are inspired by the action potential in neuroscience. If the electrical potential between a neuron's interior and exterior exceeds a threshold value, the neuron undergoes a chain reaction, called an action potential, which allows it to 'fire' and transmit a signal to neighboring neurons. The resultant sequence of activations, called a 'spike train', enables sensory neurons to transmit feeling from the fingers to the brain, and allows motor neurons to transmit instructions from the brain to the limbs.

## Formulae for some Activation Functions

### ReLU Function Formula

There are a number of widely used activation functions in deep learning today. One of the simplest is the rectified linear unit, or ReLU function, which is a piecewise linear function that outputs zero if its input is negative, and directly outputs the input otherwise:

$$f(x) = \max(0, x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}$$

Mathematical definition of the ReLU function.

Graph of the ReLU function, showing its flat gradient for negative x.

### ReLU Function Derivative

It is also instructive to calculate the gradient of the ReLU function, which is mathematically undefined at x = 0 but which is still extremely useful in neural networks.

$$f'(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x > 0 \end{cases}$$

The derivative of the ReLU function. In practice the derivative at x = 0 can be set to either 0 or 1.

The zero derivative for negative x can give rise to problems when training a neural network, since a neuron can become 'trapped' in the zero region and backpropagation will never change its weights.
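The ReLU, its derivative, and the 'trapped' neuron can be illustrated with a minimal Python sketch. The function names, the `at_zero` parameter and the numeric values are assumptions chosen for illustration, not part of any particular library.

```python
def relu(x: float) -> float:
    """Rectified linear unit: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)


def relu_grad(x: float, at_zero: float = 0.0) -> float:
    """Derivative of the ReLU.

    The derivative is undefined at x = 0, so the caller picks a convention
    through `at_zero` (an illustrative parameter; 0 and 1 are both used).
    """
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero


# A single neuron computing relu(w * x_in), with hypothetical values.
w, x_in, learning_rate = -2.0, 1.5, 0.1
pre_activation = w * x_in                   # negative, so in the flat region
grad_w = relu_grad(pre_activation) * x_in   # chain rule: d relu(w * x_in) / dw
w_updated = w - learning_rate * grad_w      # gradient step leaves w unchanged
```

Because `grad_w` is zero whenever the pre-activation is negative, the gradient step cannot move `w`, which is the trapping behavior described above.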

### PReLU Function, a Variation on the ReLU

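Because of the zero gradient problem faced by the ReLU, it is common to use an adjusted ReLU function called the parametric rectified linear unit, or PReLU, which replaces the flat region with a small slope α:

$$f(x) = \begin{cases} \alpha x & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}$$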

This has the advantage that the gradient is nonzero at all points (except 0 where it is undefined).

The PReLU function with α set to 0.1 avoids the zero gradient problem.
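As a rough sketch, the PReLU and its gradient could be written as follows in Python. The default `alpha=0.1` mirrors the figure above; the function names, and the choice to return 1 for the gradient at x = 0, are assumptions made for illustration.

```python
def prelu(x: float, alpha: float = 0.1) -> float:
    """Parametric ReLU: slope 1 for x >= 0, slope alpha for x < 0."""
    return x if x >= 0 else alpha * x


def prelu_grad(x: float, alpha: float = 0.1) -> float:
    """Gradient of the PReLU with respect to x.

    Undefined at x = 0; this sketch returns 1 there by convention.
    """
    return 1.0 if x >= 0 else alpha


# Unlike the ReLU, negative inputs still produce a nonzero gradient.
negative_side_gradient = prelu_grad(-3.0)   # equals alpha
```

In practice α is either fixed to a small constant or treated as a parameter that is learned along with the weights.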

### Logistic Sigmoid Function Formula

Another well-known activation function is the logistic sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Mathematical definition of the logistic sigmoid function.

The logistic sigmoid function has the useful property that its gradient is defined everywhere, and that its output is conveniently between 0 and 1 for all x. The logistic sigmoid function is easier to work with mathematically, but the exponential functions make it computationally intensive to compute in practice and so simpler functions such as ReLU are often preferred.

Graph showing the characteristic S-shape of the logistic sigmoid function

### Logistic Sigmoid Function Derivative

The derivative of the logistic sigmoid function can be expressed in terms of the function itself:

$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$

The derivative of the logistic sigmoid function.

Graph of the derivative of the logistic sigmoid function

The derivative of the logistic sigmoid function is nonzero at all points, which is an advantage for use in the backpropagation algorithm. However, the gradient becomes very small for large absolute x, giving rise to the vanishing gradient problem.

The sigmoid also contains an exponential, which makes it more expensive to evaluate than a piecewise linear function such as the ReLU. The backpropagation algorithm requires the derivative of every operation in a neural network to be calculated, so this cost is paid repeatedly during training. Together with the vanishing gradient, this makes the sigmoid less well suited to the hidden layers of deep neural networks in practice.
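The sigmoid and its derivative can be sketched in a few lines of Python using only the standard `math` module. The function names are illustrative, and the two-branch form of `sigmoid` is one common way to avoid overflow for very negative inputs, not the only one.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)            # safe: x is negative, so exp(x) <= 1
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative expressed through the function itself: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
gradients = {x: sigmoid_grad(x) for x in (0.0, 2.0, 5.0, 10.0)}
```

Inspecting `gradients` shows the derivative falling from its maximum at x = 0 towards zero as x increases, which is the vanishing gradient behavior described above.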

## Activation Function vs Action Potential

Although the idea of an activation function is directly inspired by the action potential in a biological neural network, there are few similarities between the two mechanisms.

Biological action potentials do not return continuous values. A biological neuron is either triggered or it is not, rather like the threshold step function. There is no possibility for a continuously varying output, as is normally required in order for backpropagation to be possible.

Furthermore, an action potential is a function in time. When the voltage across a neuron passes a threshold value, the neuron will trigger, and deliver a characteristic curve of output voltage to its neighbors over the next few milliseconds, before returning to its rest state.

The shape of a typical action potential over time as a signal passes a point on a cell membrane

Learning in the human brain runs asynchronously, with each neuron operating independently from the rest, and being only connected to its immediate neighbors. On the other hand, the "neurons" in an artificial neural network are often connected to thousands or millions of other neurons, and are executed by the processor in a defined order.

Although biological neurons cannot return continuous values, because of the added time dimension, information is transmitted in the firing frequency and firing mode of a neuron.

## Applications of the Activation Function

### Activation Function in Artificial Neural Networks

Activation functions are an essential part of an artificial neural network. They enable a neural network to be built by stacking layers on top of each other, glued together with activation functions. Usually, each layer consists of a function that multiplies the input by a weight and adds a bias, followed by an activation function and then the next weight and bias.

A simple feedforward neural network with activation functions following each weight and bias operation.

Each node and activation function pair outputs a value of the form

$$y = g(Wx + b)$$

where g is the activation function, x is the input to the node, W is the weight at that node, and b is the bias. The activation function g could be any of the activation functions listed so far. In fact, g could be nearly any nonlinear function.

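Ignoring vector values for simplicity, a neural network with two hidden layers can then be written as

$$y = g\bigl(W_2 \, g(W_1 x + b_1) + b_2\bigr)$$

where the subscripts 1 and 2 label the weight and bias of each layer.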

Given that g could be any nonlinear function, there is no way to simplify this expression further. The expression represents the simplest form of the computational graph of the entire neural network.

A simple neural network like this is capable of learning many complicated logical or arithmetic functions, such as the XOR function.

However, let us imagine removing the activation functions from the network.

The network formula can now be simplified to a simple linear equation in x:

$$y = W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2)$$

Therefore, a neural network with no activation functions is equivalent to a linear regression model, and can perform no operation more complex than linear modeling.
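The collapse to a single linear equation can be checked numerically with a small sketch. It uses scalar weights for simplicity; the parameter values and the names `network` and `collapsed` are hypothetical and chosen only for illustration.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def network(x, w1, b1, w2, b2, g=None):
    """Two stacked weight-and-bias layers; g is applied after each if given."""
    act = g if g is not None else (lambda v: v)
    hidden = act(w1 * x + b1)
    return act(w2 * hidden + b2)


def collapsed(x, w1, b1, w2, b2):
    """One linear equation equivalent to the network without activations."""
    return (w2 * w1) * x + (w2 * b1 + b2)


params = dict(w1=0.5, b1=-1.0, w2=3.0, b2=2.0)   # hypothetical values
xs = [0.0, 2.0, 4.0]                             # equally spaced inputs

linear_out = [network(x, **params) for x in xs]          # no activation
collapsed_out = [collapsed(x, **params) for x in xs]     # single linear map
relu_out = [network(x, g=relu, **params) for x in xs]    # with ReLU
```

Here `linear_out` and `collapsed_out` agree at every input, while `relu_out` does not lie on a straight line: for equally spaced inputs, a linear function would satisfy `out[0] + out[2] == 2 * out[1]`, and the ReLU network does not.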

For this reason, virtually all modern neural networks use some kind of nonlinear activation function.

### Activation Function in the Single Layer Perceptron

Taking the concept of the activation function to first principles, a single neuron in a neural network followed by an activation function can behave as a logic gate.

Let us take the threshold step function as our activation function:

Mathematical definition of the threshold step function, one of the simplest possible activation functions

The threshold step function may have been the first activation function, introduced by Frank Rosenblatt while he was modeling biological neurons in 1962.

And let us define a single layer neural network, also called a single layer perceptron, as:

Formula and computational graph of a simple single-layer perceptron with two inputs.

For binary inputs 0 and 1, this neural network can reproduce the behavior of the OR function. Its truth table is as follows:

| Input 1 | Input 2 | Output |
|---------|---------|--------|
| 0       | 0       | 0      |
| 0       | 1       | 1      |
| 1       | 0       | 1      |
| 1       | 1       | 1      |

Truth table of the OR function

By adding weights and biases, it is quite easy to construct a perceptron to calculate the OR, AND or NAND functions. This discovery was one of the first important events in the development of artificial neural networks, and gave early AI researchers a clue that weights, biases and activation functions could be combined in a way inspired by biological neural networks and could give rise to a kind of artificial intelligence.
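As an illustration of how weights and biases select the gate, here is a minimal perceptron sketch in Python. The convention that the step function fires when its input is at least 0, and the specific weights and biases in `GATES`, are assumptions chosen for this example; many other values give the same truth tables.

```python
from itertools import product


def step(z: float) -> int:
    """Threshold step function: fires (1) when z >= 0, else 0 (assumed convention)."""
    return 1 if z >= 0 else 0


def perceptron(x1: int, x2: int, w1: float, w2: float, b: float) -> int:
    """Single-layer perceptron with two inputs and a step activation."""
    return step(w1 * x1 + w2 * x2 + b)


# Hand-picked (w1, w2, b) for each gate; illustrative values only.
GATES = {
    "OR": (1.0, 1.0, -0.5),
    "AND": (1.0, 1.0, -1.5),
    "NAND": (-1.0, -1.0, 1.5),
}


def truth_table(name: str) -> list:
    """Outputs for inputs (0,0), (0,1), (1,0), (1,1), in that order."""
    w1, w2, b = GATES[name]
    return [perceptron(a, c, w1, w2, b) for a, c in product((0, 1), repeat=2)]
```

Calling `truth_table("OR")`, `truth_table("AND")` and `truth_table("NAND")` reproduces the three gates from a single neuron, with only the weights and bias changing between them.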

However, there is an important limitation, which was famously highlighted by Minsky and Papert in 1969: there is no way to get a single layer perceptron to implement the XOR function, no matter what weights and biases are chosen. The XOR function returns 1 if the two inputs are different, and 0 if they are equal.

| Input 1 | Input 2 | Output |
|---------|---------|--------|
| 0       | 0       | 0      |
| 0       | 1       | 1      |
| 1       | 0       | 1      |
| 1       | 1       | 0      |

Truth table of the XOR function.

To achieve more complex operations such as the XOR function, at least two layers are needed. In fact, a neural network of just two layers, provided it contains a nonlinear activation function and enough hidden neurons, is able to approximate any continuous function to arbitrary accuracy, not just the XOR. It took two decades for the field of neural networks to progress from this sticking point. This shows how the activation function is powerful, but its true potential can be realized only when it is used as the glue in a multi-layer neural network.
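One way to see that two layers suffice for XOR is to build it by hand from step-activated neurons. The sketch below is only an illustration: the weights are chosen manually rather than learned, the step function is assumed to fire when its input is at least 0, and the decomposition XOR = AND(OR, NAND) is one of several possible constructions.

```python
def step(z: float) -> int:
    """Threshold step function: fires (1) when z >= 0, else 0 (assumed convention)."""
    return 1 if z >= 0 else 0


def neuron(x1: int, x2: int, w1: float, w2: float, b: float) -> int:
    """One weight-and-bias node followed by the step activation."""
    return step(w1 * x1 + w2 * x2 + b)


def xor_network(x1: int, x2: int) -> int:
    """Two-layer network computing XOR with hand-picked weights."""
    # Hidden layer: two neurons acting as OR and NAND gates.
    h_or = neuron(x1, x2, 1.0, 1.0, -0.5)
    h_nand = neuron(x1, x2, -1.0, -1.0, 1.5)
    # Output layer: one neuron acting as an AND gate on the hidden values.
    return neuron(h_or, h_nand, 1.0, 1.0, -1.5)


xor_table = [xor_network(a, b) for a in (0, 1) for b in (0, 1)]
```

The resulting `xor_table` lists the outputs for inputs (0,0), (0,1), (1,0), (1,1) and matches the XOR truth table, something no choice of weights achieves with a single layer.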

## Activation Function History

Around the turn of the 20th Century, the Spanish scientist Santiago Ramón y Cajal, often called the father of neuroscience, discovered that brain cells are joined in an interconnected network, and theorized about how signals are passed across the system. Ramón y Cajal shared the 1906 Nobel Prize in Physiology or Medicine with Camillo Golgi for this discovery.

One of Ramón y Cajal's famous hand drawings of a cell from a cat's visual cortex, showing a biological neural network. Image is in the public domain.

In the 1950s, the two British physiologists and biophysicists Alan Hodgkin and Andrew Huxley conducted a study of the giant axons in the neurons of the squid. The squid is convenient for neuroscience research because its axons are visible to the naked eye. The researchers measured the voltage across neurons, and were able to measure the voltages needed to excite a neuron and induce it to transmit a signal to its neighbors. The voltage spike that a neuron produces once this critical voltage is exceeded is called the action potential. Later, together with the Australian neurophysiologist John Eccles, they were awarded the 1963 Nobel Prize in Physiology or Medicine for their work.

The squid Loligo forbesii, whose neurons enabled the discovery of the action potential. Image is in the public domain.

In 1959, a team led by the American cognitive scientist Jerome Ysroael Lettvin and including Warren McCulloch and Walter Pitts published a famous paper called What the frog's eye tells the frog's brain, where they detailed how the frog's visual system is made up of a network of interconnected neurons with threshold values, allowing it to implement a wide array of logical or arithmetic functions.

By 1962, the American psychologist Frank Rosenblatt had developed the first perceptron, using the threshold step function as an activation function, and taking inspiration from various papers by McCulloch and Pitts where they produced truth tables of biological neurons. Activation functions had now crossed over from biology to computational neuroscience.

In 1969, the American Marvin Minsky and the South African Seymour Papert published a book titled Perceptrons: An Introduction to Computational Geometry, which attacked the limitations of the perceptron (in its then form as a single-layer network). They pointed out that in particular the perceptron was incapable of modeling the XOR function. Their scepticism helped to usher in the so-called "AI Winter": two decades of sparse AI research funding opportunities which lasted until the 1980s when multi-layer neural networks regained their popularity.

Perceptrons, the controversial 1969 book by Minsky and Papert which some feel was to blame for the 'AI Winter'. Image from MIT Press.

At center stage of the resurgence of deep learning was of course the activation function. By now, Rosenblatt's threshold step function had fallen out of favor and was replaced by variations on the sigmoid and ReLU functions.

Today, activation functions are an important component of the deep learning systems which are dominating analytics and decision making across industries. They have enabled neural networks to expand beyond the narrow horizons of AND, OR and NOT functions and have enabled researchers to achieve ever increasing predictive power by stacking layers upon layers of neurons, glued together with activation functions, and trained using the backpropagation algorithm.

## References

Hodgkin and Huxley, Action potentials recorded from inside a nerve fibre, Nature (1939)

Lettvin, Maturana, McCulloch, Pitts, What the frog's eye tells the frog's brain (1959)

Ramón y Cajal, Manual de Anatomia Patológica General (Handbook of general Anatomical Pathology) (1890)

Minsky and Papert, Perceptrons: An Introduction to Computational Geometry (1969)

McCulloch and Pitts, A Logical Calculus of Ideas Immanent in Nervous Activity (1943)