Estimating Multiplicative Relations in Neural Networks

10/28/2020
by Bhaavan Goel, et al.

The universal approximation theorem suggests that a shallow neural network can approximate any function. The input to the neurons at each layer is a weighted sum of the previous layer's neurons, to which an activation is applied. These activation functions perform very well when the output is a linear combination of the input data. When trying to learn a function that involves products of the input data, neural networks tend to overfit the data to approximate the function. In this paper we use properties of logarithmic functions to propose a pair of activation functions that can translate products into linear expressions and learn using backpropagation. We generalize this approach to some complex arithmetic functions and test the accuracy on a distribution disjoint from the training set.


I Introduction

The quality of a model is typically measured by its ability to generalize from a training set to previously unseen data from a similar distribution. Models based on a neural network architecture with linear activation functions provide good accuracy on test data for functions that are a linear combination of the inputs, described by:

$$y = \sum_{i=1}^{n} w_i x_i + b$$

To estimate non-linear outputs, the activation functions can be carefully set to some non-linear functions. These non-linear activation functions have to be differentiable in the relevant domain for backpropagation learning [1] to work. Generally, such non-linear activation functions can approximate a wide range of practical problems and work very well on previously unseen real-life data. However, we focus on a very specific problem: approximating a product function:

$$y = \prod_{i=1}^{n} x_i$$

Multiplication cannot be expressed as a linear combination of the input variables, so it becomes necessary to consider another approach to accommodate product functions within existing neural networks. In this paper we will:

  1. introduce a customized logarithm-exponential activation pair that can learn multiplication

  2. test the accuracy on data which is disjoint from the training set

  3. compare the results with other activation functions

The following section describes the data used in the experiments. Afterwards, we introduce our method and discuss the architecture, its training, and its relation to prior art. Throughout our discussion, we will refer to the natural logarithm as $\log$ for convenience.

II Data

We will generate uniformly distributed synthetic data, with a set of variables $x_1, x_2, \ldots, x_n$ as input and a single value $y$ as output. Each input variable is drawn from one range for training and from a disjoint range for testing. Using disjoint sets for training and testing gives a better picture of whether a model is over-fitting the training data. The histograms of the training and testing data are shown in Fig. 1 and Fig. 2 respectively. As a primary objective, we will try to approximate three functions:

and compare the results with various activation functions. Here $N$ represents the normalizing factor that keeps the scale of the output comparable to the inputs $x_i$. The normalizing factor $N$ will be kept variable to see how it impacts the accuracy of the model.
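For concreteness, a minimal sketch of such a synthetic data generator; the sample counts, input ranges, input size, and normalizing factor below are illustrative assumptions rather than the paper's exact values:

```python
import numpy as np

def make_product_data(num_samples, n_inputs, low, high, N=1.0, seed=0):
    """Uniformly distributed inputs; the target is their product scaled by 1/N."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(num_samples, n_inputs))
    y = np.prod(X, axis=1) / N
    return X, y

# Disjoint ranges for training and testing (illustrative values, not the paper's).
X_train, y_train = make_product_data(10000, 2, low=1.0, high=5.0, N=5.0)
X_test,  y_test  = make_product_data(2000,  2, low=5.0, high=10.0, N=5.0)
```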

Fig. 1: Training dataset
Fig. 2: Testing dataset

III Symmetric Logarithm-Exponential Activation Pair

The output of a product function scales very quickly compared to the individual inputs, which creates the problem of gradient explosion during backpropagation. In this paper, we leverage the well-known property of the logarithm function:

$$\log(a \cdot b) = \log(a) + \log(b)$$

We will set the logarithm as the activation of the first layer to capture multiplication through the addition of log values. Now, using the property of the exponential function:

$$e^{\,w_1 \log(x_1) + w_2 \log(x_2)} = x_1^{w_1} \cdot x_2^{w_2}$$

we get the initial inputs as a product, with their corresponding weights as exponents.

Backpropagation should update the weights $w_1$ and $w_2$ to 1 if multiplication is suitable for reducing the loss, and to -1 if division can reduce the loss.
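As a small numerical illustration (a hypothetical two-input example, not taken from the paper), composing the two properties turns a weighted sum of logs into a product or a quotient depending on the sign of the weights:

```python
import numpy as np

def log_linear_exp(x1, x2, w1, w2):
    # exp(w1*log(x1) + w2*log(x2)) = x1**w1 * x2**w2  (for positive inputs)
    return np.exp(w1 * np.log(x1) + w2 * np.log(x2))

print(log_linear_exp(3.0, 4.0, 1.0, 1.0))   # 12.0 -> multiplication (weights +1, +1)
print(log_linear_exp(3.0, 4.0, 1.0, -1.0))  # 0.75 -> division       (weights +1, -1)
```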

III-A Symmetric Logarithm Unit

Since $\log(x)$ is only defined for positive inputs, we need to define how the activation function behaves for zero and negative values. Keeping in mind the discussed property of $\log$, we define the activation such that it is:

  • defined over the entire real number space

  • differentiable in the whole domain

  • symmetric about the origin

The first two conditions allow backpropagation to work for all real inputs, and symmetry about the origin avoids bias shift. Thus we define the activation function for the first layer as:

$$f(x) = \begin{cases} \log(1 + x), & x \ge 0 \\ -\log(1 - x), & x < 0 \end{cases} \qquad (1)$$

The differentiability of function (1) follows from the differentiability of $\log(1+x)$ for $x \ge 0$ and of $-\log(1-x)$ for $x < 0$.

Fig. 3: Symmetric function defined in eq.1

Considering the critical point $x = 0$, the left and right derivatives agree, which shows that the function is differentiable for all $x \in \mathbb{R}$.

III-B Symmetric Exponential Unit

Similarly, based on the properties discussed for the Symmetric Logarithm Unit, we define the exponential activation function for the second layer, which is continuous and differentiable:

$$g(x) = \begin{cases} e^{x} - 1, & x \ge 0 \\ 1 - e^{-x}, & x < 0 \end{cases} \qquad (2)$$
Fig. 4: Symmetric exponential function defined in eq.2
Fig. 5: Architecture of shallow NN model, A1 and A2 represent activation functions
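A minimal TensorFlow sketch of the two units, assuming the piecewise forms reconstructed in eqs. 1 and 2; the function names are ours, and the paper's exact implementation may differ:

```python
import tensorflow as tf

def symmetric_log(x):
    # Eq. 1 (as reconstructed): sign(x) * log(1 + |x|).
    # Defined on all reals, differentiable, and symmetric about the origin.
    return tf.sign(x) * tf.math.log(1.0 + tf.abs(x))

def symmetric_exp(x):
    # Eq. 2 (as reconstructed): sign(x) * (exp(|x|) - 1), the inverse of symmetric_log.
    return tf.sign(x) * (tf.exp(tf.abs(x)) - 1.0)

# Sanity check: the pair composes to (approximately) the identity.
z = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])
print(symmetric_exp(symmetric_log(z)).numpy())  # ~[-2. -0.5  0.  0.5  2.]
```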

IV Experimental Setup

Keras [5] with the Google TensorFlow backend [6] was used to implement the deep learning algorithms in this study, with the aid of other scientific computing libraries: matplotlib [7], numpy [8], and Scikit-learn [9]. All experiments in this study were conducted on a laptop computer with an Intel Core (TM) i7-9750H CPU @ 2.60GHz, 16GB of DDR3 RAM, and an NVIDIA GeForce GTX 1650 4GB GPU.

V The Model

We will prepare a shallow feed-forward neural network containing 2 hidden layers and test various combinations of activation functions on our prepared dataset. The final output dense layer uses the linear activation, which is the default in Keras [5]. The abstract architecture of the model is described in Fig. 5 and the implemented Keras [5] architecture in Fig. 6. For comparison, we consider activation functions from {elu [10], hard sigmoid, linear, relu [12], selu [13], sigmoid, softmax, softplus, softsign, swish [14], tanh}. These 11 unique activation functions make a total of $11 \times 11 = 121$ activation pairs. We will test the score of our proposed activation pair against all possible pairs.

Fig. 6: Keras model
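For reference, a minimal Keras sketch of this architecture, reusing the symmetric_log and symmetric_exp functions sketched in Section III; the hidden width of 8 units is an assumption, not the paper's reported size:

```python
from tensorflow import keras

def build_lg_model(n_inputs, hidden_units=8):
    # A1 = symmetric logarithm on the first hidden layer,
    # A2 = symmetric exponential on the second hidden layer,
    # followed by a Dense output layer with Keras' default linear activation.
    inputs = keras.Input(shape=(n_inputs,))
    h1 = keras.layers.Dense(hidden_units, activation=symmetric_log)(inputs)
    h2 = keras.layers.Dense(hidden_units, activation=symmetric_exp)(h1)
    outputs = keras.layers.Dense(1)(h2)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae")  # Adam + MAE, as in Section VI
    return model

model = build_lg_model(n_inputs=2)
model.summary()
```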
Fig. 7: Training loss over 100 epochs for different input sizes and normalization factors (panels a-f). Here "lg_model" is our proposed model.

VI Results

VI-A Product Functions

VI-A1 Training Loss

The model is trained to minimize the Mean Absolute Error (MAE) using the adaptive moment estimation (Adam) [11] optimizer in our experiments. We have trained the model for product functions with up to 4 input variables and different normalization factors. Normalization is essential to ensure that the product of the inputs does not increase unboundedly. The plots in Fig. 7 show the activation pairs (out of the 121 tested) with the minimum training error after 100 epochs, along with our proposed logarithm-exponential activation pair. In all cases our model not only achieves the minimum training loss but also converges in the minimum number of epochs. In cases where the output is not normalized to the scale of the input variables, other activation functions show much larger training loss than our model.
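Continuing the earlier sketches, the training call under these settings might look as follows; the batch size is an assumption, as the text does not specify it:

```python
history = model.fit(
    X_train, y_train,
    epochs=100,                          # matches the 100-epoch curves in Fig. 7
    batch_size=32,                       # assumed; not specified in the text
    validation_data=(X_test, y_test),
    verbose=0,
)
print("final training MAE:", history.history["loss"][-1])
```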

VI-A2 Testing Error

We will consider the usual MAE and the Mean Percentage Error (MPE), which is calculated as:

$$\mathrm{MPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{|y_i|} \qquad (3)$$
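A small helper for this metric, assuming the absolute-percentage form reconstructed in eq. 3:

```python
import numpy as np

def mean_percentage_error(y_true, y_pred):
    """MPE as in eq. 3 (assumed absolute form), reported in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Example usage: mean_percentage_error(y_test, model.predict(X_test).ravel())
```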

Table I shows that the proposed model has a much lower MPE on the test data compared to the other activation pairs. It also suggests that the activation pairs which showed lower training error in some cases were over-fitting the training data.

Expression | MPE (%, Our Model) | MAE (Our Model) | Next Best Activation Pair | MPE (%, Next Best)
…          | 15.453791          | 47284.32        | relu_linear               | 61.191017
…          | 7.801138           | 1565.0941       | relu_selu                 | 69.28548
…          | 16.815907          | 27887974        | elu_elu                   | 98.30871
…          | 24.82434           | 394495.03       | softplus_relu             | 93.41624
…          | 28.907835          | 3.4e+10         | softplus_selu             | 99.79
…          | 39.76962           | 34425464        | elu_linear                | 99.70
TABLE I: Testing results for different product functions

VI-B Complex Function

We will also show the effectiveness of our method for estimating more complex arithmetic functions. The function we try to estimate is:

(4)

VI-B1 Training Loss

The training loss for the function described by eq. 4 is also the lowest for our model, which again converges quickly. The training loss is depicted in Fig. 8.

Fig. 8: Training loss for the function in eq. 4. Here "lg_model" is our proposed model.

VI-B2 Testing Error

Without any architectural changes, the mean percentage error for the complex formula is 22.794909, which is much less than the next best 97.09597 achieved by the selu-linear activation pair.

VII Related Work

Durbin and Rumelhart, 1989 [3] achieve the effect of a product function by taking a weighted product of the inputs. This approach achieves the desired results but requires changes to the commonly used training loop, in which we generally take a weighted sum of the inputs. Georg Martius and Christoph H. Lampert [4] propose a cell that takes two inputs at a time and calculates their product; it also performs a weighted product of the inputs. NLRELU [2] introduces an activation similar to the activation function used in the first layer of our model. NLRELU has various advantages, but on its own it cannot estimate a product function very well.

VIII Conclusion

We presented and studied the behavior of a logarithmic activation paired with an exponential activation function for estimating product functions. For data with multiplicative relations between the inputs, our customized activation pair achieved the minimum error compared to other activation functions and avoided over-fitting the training data with a very shallow feed-forward neural network. We proved the differentiability of our functions, which allows end-to-end training using backpropagation; the pair can therefore be integrated within deep neural networks to estimate hidden product relations.

IX Acknowledgement

We thank Soham Pal and Shivam Gupta for continuous encouragement and feedback throughout this paper.

References