
Variational Neural Networks: Every Layer and Neuron Can Be Unique

10/14/2018
by Yiwei Li, et al.

The choice of activation function can significantly influence the performance of neural networks, yet there is a lack of guiding principles for selecting it. We try to address this issue by introducing variational neural networks, where the activation function is represented as a linear combination of candidate functions, and an optimal activation is obtained via minimization of a loss function using the gradient descent method. The gradient formulae for the loss function with respect to these expansion coefficients are central to the implementation of the gradient descent algorithm, and here we derive them.



I Introduction

In conventional artificial neural networks (ANNs), backward propagation updates the weights of the entire network to minimize the loss function LeCun et al. (1998). The activation function in each hidden layer is determined before training the network and fixed during the training process. Various activation functions have been proposed, such as the sigmoid, hyperbolic tangent, ReLU, etc. Goodfellow et al. (2016). There may also be customized activation functions for specific use cases. Empirically, ReLU is the default choice when building deep neural networks for computer vision and speech recognition LeCun et al. (2015).

The choice of activation function is rather arbitrary. During the early development stage, practitioners of neural networks tried to simulate genuine neurons in humans and preferred activation functions that saturate when the input value is large. Moreover, it was long held to be self-evident that activation functions should be differentiable everywhere, and thus the sigmoid and hyperbolic tangent functions were widely used. The realization that ReLU, which is neither saturating nor everywhere differentiable, could also serve as an activation function, with even better performance than the sigmoid or hyperbolic tangent, removed the shackles on people's imagination and significantly promoted the development of neural networks Glorot et al. (2011). However, up to now, the choice of activation functions is still ad hoc, and no rigorous proof exists demonstrating that ReLU or any other novel activation function is superior to the conventional sigmoid and hyperbolic tangent functions. The advantage of one activation function over another is established from mere experience, and the lack of theoretical or algorithmic justification for this choice is one of our concerns.

Here, in this article, we propose a more systematic method that enables a neural network to find the optimal activation function automatically. To do so, we first select a set of candidate eigen functions and then assume that the optimal activation function can be represented as a linear combination of these candidates, with combination coefficients yet to be determined. We next define a loss function and minimize it with respect to the combination coefficients, together with the weight matrices and biases used in conventional artificial neural networks. The activation function, now a linear combination of basis eigen functions, can be unique for each hidden layer (even the output layer) or for each neuron in each hidden layer (even the output layer). Since the activation functions in our system are determined from a variational principle, we name our neural network the variational neural network (VNN). In this article, we derive the formulae for the gradients of the loss function with respect to the expansion coefficients. A programmatic implementation of this algorithm is still under way. We find that in our variational neural network the back-propagation method still applies, and thus no major modification of conventional neural network programs is needed.
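To make the idea concrete, the following is a minimal sketch of such a trainable activation in PyTorch; it is not the authors' implementation, and the module name VariationalActivation, the sine/cosine basis, and all sizes are illustrative assumptions. Automatic differentiation supplies the coefficient gradients that are derived analytically in the sections below, so the coefficients are simply trained alongside the weights and biases.

```python
import torch
import torch.nn as nn

class VariationalActivation(nn.Module):
    """Activation f(x) = sum_n c_n * phi_n(x) with trainable coefficients c_n.

    Sketch only: the basis {phi_n} is {sin(n x), cos(n x)} truncated at M terms,
    one illustrative choice of eigen functions.
    """
    def __init__(self, num_terms: int = 4):
        super().__init__()
        # One coefficient per basis function; small random initialization (illustrative choice).
        self.coeffs = nn.Parameter(0.1 * torch.randn(2 * num_terms))
        self.register_buffer("n", torch.arange(1, num_terms + 1, dtype=torch.float32))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # phi has shape (..., 2M): [sin(x), ..., sin(Mx), cos(x), ..., cos(Mx)]
        nx = x.unsqueeze(-1) * self.n
        phi = torch.cat([torch.sin(nx), torch.cos(nx)], dim=-1)
        return phi @ self.coeffs

# The expansion coefficients are optimized together with the weights and biases:
model = nn.Sequential(nn.Linear(10, 32), VariationalActivation(), nn.Linear(32, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = nn.CrossEntropyLoss()(model(torch.randn(8, 10)), torch.randint(0, 3, (8,)))
loss.backward()   # gradients w.r.t. the expansion coefficients come from autograd
optimizer.step()
```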

The organization of this paper is as follows. In section II, we derive the gradient formulae of the loss function with respect to the expansion coefficients. In that section, the activation functions are represented as a linear combination of candidate eigen functions, and we impose the restriction that neurons residing in the same layer share the same activation function. In section III, we relax this restriction so that each neuron has its own unique activation function; the gradient formulae for the loss function with respect to the expansion coefficients are derived in a similar manner. We conclude in section IV.

II Variation of Hidden Layer Activation Functions

Suppose $L$ is the loss function, $g$ is the activation function in the output layer, $w^{(l)}_{jk}$ is the weight on the connection between the $k$-th neuron in hidden layer $l-1$ and the $j$-th neuron in hidden layer $l$, $w^{(o)}_{ij}$ is the weight on the connection between the $j$-th neuron in the last hidden layer and the $i$-th neuron in the output layer, and $f^{(l)}$ is the activation function in hidden layer $l$. We write $f^{(l)}(x) = \sum_{n=1}^{\infty} c^{(l)}_{n}\,\phi_{n}(x)$, where $\{\phi_{n}\}$ is a set of eigen functions (for example, it can be $\{\sin(nx)\}$ and $\{\cos(nx)\}$). In practice, we cut off the summation at a large number $M$: $f^{(l)}(x) \approx \sum_{n=1}^{M} c^{(l)}_{n}\,\phi_{n}(x)$.
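As a small numerical illustration of the truncated expansion (a sketch only: the sine/cosine basis and the coefficient values below are made up for demonstration):

```python
import numpy as np

M = 3                                                # cutoff of the expansion
c = np.array([0.8, -0.1, 0.05, 0.3, 0.0, -0.02])     # illustrative coefficients c_n (2M of them)

def phi(x):
    """Basis {sin(nx)} and {cos(nx)} for n = 1..M, stacked along the last axis."""
    n = np.arange(1, M + 1)
    return np.concatenate([np.sin(np.outer(x, n)), np.cos(np.outer(x, n))], axis=-1)

def f(x):
    """Truncated activation f(x) ~ sum_n c_n phi_n(x)."""
    return phi(np.atleast_1d(x)) @ c

print(f(np.array([-1.0, 0.0, 2.5])))                 # activation evaluated at a few inputs
```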

Now, during backward propagation, the network will update not only the weights on each connection between neurons but also the weights of the eigen functions in the activation function. The neuron connection weight update is the same as before. The eigen function weight update for the last hidden layer is as follows (suppose we have $N$ hidden layers in total),

$$c^{(N)}_{n} \leftarrow c^{(N)}_{n} - \eta\,\frac{\partial L}{\partial c^{(N)}_{n}}, \qquad (1)$$

where $\eta$ is the learning rate. Using the chain rule, we get (supposing the loss function has the form $L = \sum_{i} \ell\big(y_{i}, a^{(o)}_{i}\big)$, where $y_{i}$ is the $i$-th element of the label and $a^{(o)}_{i} = g\big(z^{(o)}_{i}\big)$ is the $i$-th output)

$$\frac{\partial L}{\partial c^{(N)}_{n}} = \sum_{i} \frac{\partial \ell}{\partial a^{(o)}_{i}}\; g'\big(z^{(o)}_{i}\big)\; \frac{\partial z^{(o)}_{i}}{\partial c^{(N)}_{n}}, \qquad (2)$$

where $\partial \ell/\partial a^{(o)}_{i}$ is determined by the form of the loss function and $g'\big(z^{(o)}_{i}\big)$ by the form of the activation function in the output layer, both of which are straightforward to compute, and $z^{(o)}_{i}$ is the input to the $i$-th neuron of the output layer. Suppose there are $K$ neurons in the last hidden layer; then

$$z^{(o)}_{i} = \sum_{j=1}^{K} w^{(o)}_{ij}\, f^{(N)}\big(z^{(N)}_{j}\big) + b^{(o)}_{i} = \sum_{j=1}^{K} w^{(o)}_{ij} \sum_{n=1}^{M} c^{(N)}_{n}\,\phi_{n}\big(z^{(N)}_{j}\big) + b^{(o)}_{i}, \qquad (3)$$

where $z^{(N)}_{j}$ is the input to the $j$-th neuron of the last hidden layer and $b^{(o)}_{i}$ is the bias of the $i$-th output neuron.

The last part of Eq. (2) is

$$\frac{\partial z^{(o)}_{i}}{\partial c^{(N)}_{n}} = \sum_{j=1}^{K} w^{(o)}_{ij}\,\phi_{n}\big(z^{(N)}_{j}\big). \qquad (4)$$

Therefore, the eigen function weight update for the last hidden layer is

$$c^{(N)}_{n} \leftarrow c^{(N)}_{n} - \eta \sum_{i} \frac{\partial \ell}{\partial a^{(o)}_{i}}\; g'\big(z^{(o)}_{i}\big) \sum_{j=1}^{K} w^{(o)}_{ij}\,\phi_{n}\big(z^{(N)}_{j}\big). \qquad (5)$$

If we transform Eq. (5) into matrix form, then

$$c^{(N)} \leftarrow c^{(N)} - \eta\,\Phi^{(N)\mathrm{T}}\, W^{(o)\mathrm{T}}\, \delta^{(o)}, \qquad (6)$$

where $\delta^{(o)}_{i} = \frac{\partial \ell}{\partial a^{(o)}_{i}}\, g'\big(z^{(o)}_{i}\big)$, $\Phi^{(N)}$ is a $K \times M$ matrix with elements $\big[\Phi^{(N)}\big]_{jn} = \phi_{n}\big(z^{(N)}_{j}\big)$, $\mathrm{T}$ is the transpose operation, and $c^{(N)} = \big(c^{(N)}_{1}, \ldots, c^{(N)}_{M}\big)^{\mathrm{T}}$.
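As a sanity check on Eqs. (5) and (6), the following NumPy sketch (with random stand-in values and a sine basis; not the authors' code) verifies that the matrix form reproduces the explicit summation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_hid, M = 3, 5, 4                  # output neurons, last-hidden-layer neurons, basis cutoff

delta_o = rng.normal(size=n_out)           # delta^(o)_i = (dl/da_i) * g'(z^(o)_i), assumed given
W_o     = rng.normal(size=(n_out, n_hid))  # output weights w^(o)_{ij}
z_N     = rng.normal(size=n_hid)           # inputs to the last hidden layer's neurons

n   = np.arange(1, M + 1)
Phi = np.sin(np.outer(z_N, n))             # [Phi^(N)]_{jn} = phi_n(z^(N)_j), sine basis as an example

grad_matrix = Phi.T @ W_o.T @ delta_o      # matrix form, Eq. (6)

grad_sum = np.array([                      # explicit summation form, Eq. (5)
    sum(delta_o[i] * W_o[i, j] * Phi[j, m]
        for i in range(n_out) for j in range(n_hid))
    for m in range(M)
])

assert np.allclose(grad_matrix, grad_sum)
```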

Similarly, the eigen function weight update for the second last hidden layer is

$$c^{(N-1)}_{n} \leftarrow c^{(N-1)}_{n} - \eta\,\frac{\partial L}{\partial c^{(N-1)}_{n}}, \qquad (7)$$

where

$$\frac{\partial L}{\partial c^{(N-1)}_{n}} = \sum_{i} \delta^{(o)}_{i} \sum_{j} w^{(o)}_{ij}\, f^{(N)\prime}\big(z^{(N)}_{j}\big) \sum_{k} w^{(N)}_{jk}\,\phi_{n}\big(z^{(N-1)}_{k}\big). \qquad (8)$$

Therefore, the eigen function weight update is

$$c^{(N-1)}_{n} \leftarrow c^{(N-1)}_{n} - \eta \sum_{i} \delta^{(o)}_{i} \sum_{j} w^{(o)}_{ij}\, f^{(N)\prime}\big(z^{(N)}_{j}\big) \sum_{k} w^{(N)}_{jk}\,\phi_{n}\big(z^{(N-1)}_{k}\big). \qquad (9)$$

If we transform Eq. (9) into matrix form, then

$$c^{(N-1)} \leftarrow c^{(N-1)} - \eta\,\Phi^{(N-1)\mathrm{T}}\, W^{(N)\mathrm{T}}\, \delta^{(N)}, \qquad (10)$$

where $\delta^{(N)}_{j} = f^{(N)\prime}\big(z^{(N)}_{j}\big) \sum_{i} w^{(o)}_{ij}\,\delta^{(o)}_{i}$ and $\big[\Phi^{(N-1)}\big]_{kn} = \phi_{n}\big(z^{(N-1)}_{k}\big)$.

It is easy to demonstrate that the general formula for the gradient of the eigen function weights of hidden layer $l$ is

$$\frac{\partial L}{\partial c^{(l)}} = \Phi^{(l)\mathrm{T}}\, W^{(l+1)\mathrm{T}}\, \delta^{(l+1)}, \qquad (11)$$

where $\delta^{(l)}_{j} = f^{(l)\prime}\big(z^{(l)}_{j}\big) \sum_{i} w^{(l+1)}_{ij}\,\delta^{(l+1)}_{i}$, with the conventions $\delta^{(N+1)} \equiv \delta^{(o)}$ and $W^{(N+1)} \equiv W^{(o)}$, and $\big[\Phi^{(l)}\big]_{jn} = \phi_{n}\big(z^{(l)}_{j}\big)$.

Therefore, the entire update of the eigen function weights for hidden layer $l$ is

$$c^{(l)} \leftarrow c^{(l)} - \eta\,\Phi^{(l)\mathrm{T}}\, W^{(l+1)\mathrm{T}}\, \delta^{(l+1)}, \qquad (12)$$

where $c^{(l)} = \big(c^{(l)}_{1}, \ldots, c^{(l)}_{M}\big)^{\mathrm{T}}$.
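The layer-by-layer recursion of Eqs. (11) and (12) can be sketched as a short backward loop. This is an illustration under the notation above, not the authors' code: the sine basis and layer widths are arbitrary, $\delta^{(o)}$, the pre-activations $z^{(l)}$, and the weights are stand-ins for quantities produced by a forward pass, and the ordinary weight and bias updates are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
widths = [6, 5, 4]                      # widths of the N = 3 hidden layers (illustrative)
out_dim, M, eta = 3, 4, 0.01            # output size, basis cutoff, learning rate
n = np.arange(1, M + 1)

Phi  = lambda z: np.sin(np.outer(z, n))        # [Phi^(l)]_{jn} = phi_n(z^(l)_j), sine basis
dPhi = lambda z: n * np.cos(np.outer(z, n))    # phi_n'(z^(l)_j), its derivative

# Stand-ins for quantities coming from a forward pass / initialization.
z = [rng.normal(size=w) for w in widths]                 # pre-activations z^(l) of each hidden layer
c = [0.1 * rng.normal(size=M) for _ in widths]           # expansion coefficients c^(l)
W = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
W.append(rng.normal(size=(out_dim, widths[-1])))         # W[l] leaves hidden layer l; the last entry is W^(o)
delta = rng.normal(size=out_dim)                         # delta^(o), from the loss and output activation

# Backward sweep over hidden layers: Eq. (11) gradient, Eq. (12) update, then the delta recursion.
for l in reversed(range(len(widths))):
    grad_c  = Phi(z[l]).T @ W[l].T @ delta               # Eq. (11)
    f_prime = dPhi(z[l]) @ c[l]                          # f^(l)'(z_j) = sum_n c^(l)_n phi_n'(z_j)
    c[l]    = c[l] - eta * grad_c                        # Eq. (12)
    delta   = f_prime * (W[l].T @ delta)                 # delta^(l) from delta^(l+1)
```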

In general, we can also treat the activation function of the output layer as a summation of eigen functions with distinct weights. In this case, we can regard $g$ as a scaling function (e.g., softmax), and the form of the formulae above remains the same.

III Variation of Hidden Node Activation Functions

Theoretically, there is no constraint that all the nodes in the same hidden layer must have exactly the same activation function. So we can generalize our method such that each node in each hidden layer has its own unique activation function $f^{(l)}_{j}$ (the activation function of the $j$-th node in the $l$-th hidden layer), where $f^{(l)}_{j}(x) = \sum_{n=1}^{M} c^{(l)}_{jn}\,\phi_{n}(x)$. Then, the update of the $n$-th eigen function weight of the $j$-th neuron in the last hidden layer is

$$c^{(N)}_{jn} \leftarrow c^{(N)}_{jn} - \eta\,\frac{\partial L}{\partial c^{(N)}_{jn}}, \qquad (13)$$

where

$$\frac{\partial L}{\partial c^{(N)}_{jn}} = \sum_{i} \delta^{(o)}_{i}\,\frac{\partial z^{(o)}_{i}}{\partial c^{(N)}_{jn}}, \qquad z^{(o)}_{i} = \sum_{j=1}^{K} w^{(o)}_{ij} \sum_{n=1}^{M} c^{(N)}_{jn}\,\phi_{n}\big(z^{(N)}_{j}\big) + b^{(o)}_{i}.$$

Then, we get

$$\frac{\partial L}{\partial c^{(N)}_{jn}} = \phi_{n}\big(z^{(N)}_{j}\big) \sum_{i} \delta^{(o)}_{i}\, w^{(o)}_{ij}. \qquad (14)$$

So the update of all the eigen function weights in the last hidden layer is

$$C^{(N)} \leftarrow C^{(N)} - \eta\,\big(W^{(o)\mathrm{T}}\,\delta^{(o)}\big) \odot \Phi^{(N)}, \qquad (15)$$

where $C^{(N)}$ is the $K \times M$ matrix with elements $\big[C^{(N)}\big]_{jn} = c^{(N)}_{jn}$, $\big[\Phi^{(N)}\big]_{jn} = \phi_{n}\big(z^{(N)}_{j}\big)$, and the operation $\odot$ is the element-wise multiplication between each column of $\Phi^{(N)}$ and the vector $W^{(o)\mathrm{T}}\,\delta^{(o)}$.
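In code, the column-wise element-wise product in Eq. (15) is a single broadcast. The sketch below (random stand-in values, sine basis, illustrative sizes; not the authors' code) also checks one entry against the scalar formula of Eq. (14):

```python
import numpy as np

rng = np.random.default_rng(2)
n_out, n_hid, M = 3, 5, 4                  # output neurons, last-hidden neurons, basis cutoff

delta_o = rng.normal(size=n_out)           # delta^(o)
W_o     = rng.normal(size=(n_out, n_hid))  # output weights w^(o)_{ij}
Phi     = np.sin(np.outer(rng.normal(size=n_hid), np.arange(1, M + 1)))   # [Phi^(N)]_{jn}

# Eq. (15): each column of Phi^(N) is multiplied element-wise by the vector W^(o)T delta^(o).
grad_C = (W_o.T @ delta_o)[:, None] * Phi  # K x M matrix of dL/dc^(N)_{jn}

# Spot-check one entry against Eq. (14): dL/dc_{jn} = phi_n(z_j) * sum_i w^(o)_{ij} delta^(o)_i.
j, m = 2, 1
assert np.isclose(grad_C[j, m], Phi[j, m] * np.dot(W_o[:, j], delta_o))
```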

Similarly, the update of the $n$-th eigen function weight of the $k$-th neuron in the second last hidden layer is

$$c^{(N-1)}_{kn} \leftarrow c^{(N-1)}_{kn} - \eta\,\frac{\partial L}{\partial c^{(N-1)}_{kn}}, \qquad (16)$$

where

$$\frac{\partial L}{\partial c^{(N-1)}_{kn}} = \sum_{j} \delta^{(N)}_{j}\,\frac{\partial z^{(N)}_{j}}{\partial c^{(N-1)}_{kn}}, \qquad \delta^{(N)}_{j} = f^{(N)\prime}_{j}\big(z^{(N)}_{j}\big) \sum_{i} w^{(o)}_{ij}\,\delta^{(o)}_{i}. \qquad (17)$$

Then, we get

$$\frac{\partial L}{\partial c^{(N-1)}_{kn}} = \phi_{n}\big(z^{(N-1)}_{k}\big) \sum_{j} \delta^{(N)}_{j}\, w^{(N)}_{jk}. \qquad (18)$$

Then, the update of all the eigen function weights in the second last hidden layer is

$$C^{(N-1)} \leftarrow C^{(N-1)} - \eta\,\big(W^{(N)\mathrm{T}}\,\delta^{(N)}\big) \odot \Phi^{(N-1)}, \qquad (19)$$

where $\big[C^{(N-1)}\big]_{kn} = c^{(N-1)}_{kn}$ and $\big[\Phi^{(N-1)}\big]_{kn} = \phi_{n}\big(z^{(N-1)}_{k}\big)$.

Therefore, the update of all the eigen function weights in hidden layer $l$ is

$$C^{(l)} \leftarrow C^{(l)} - \eta\,\big(W^{(l+1)\mathrm{T}}\,\delta^{(l+1)}\big) \odot \Phi^{(l)}. \qquad (20)$$

In general, we can also treat the activation functions of the neurons in the output layer as summations of eigen functions with distinct weights, and the form of the formulae remains the same. However, there is a problem: taking classification as an instance, the outputs of the neurons in the output layer should be probabilities or probability-like values. If we use a different activation function for each output neuron, it is very difficult to tell what the outcomes of the output layer mean. Thus, here, we choose not to vary the activation functions of the output layer.

IV Conclusion and Outlook

In this article, we have proposed a method that allows each layer, and even each neuron, in a neural network to have its own activation function. The activation functions are represented as a linear combination of basis eigen functions, and we train the network by minimizing a loss function with respect to these expansion coefficients together with the conventional weight matrices and biases. After training, we obtain not only the optimal weights and biases between neurons in adjacent layers but also the optimal activation functions. Our ongoing work will focus on building a concrete model and testing the performance of variational neural networks against conventional neural networks.

References

  • LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Proceedings of the IEEE 86, 2278 (1998).
  • Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, Cambridge, 2016).
  • LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
  • Glorot et al. (2011) X. Glorot, A. Bordes, and Y. Bengio, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011), pp. 315–323.