I Introduction
The quality of a model is typically measured by its ability to generalize from a training set to previously unseen data from the same distribution. Models based on a neural network architecture with linear activation functions provide good accuracy on test data for functions which are a linear combination of the inputs, described by:

$$y = \sum_{i=1}^{n} w_i x_i + b$$
To estimate non-linear outputs, the activation functions can be carefully set to some non-linear functions. These non-linear activation functions have to be differentiable in the relevant domain for backpropagation learning [1] to work. Generally, such non-linear activation functions can approximate a wide range of practical problems and work very well on previously unseen real-life data. However, we focus on a very specific problem: approximating a product function

$$y = \prod_{i=1}^{n} x_i$$
Multiplication cannot be expressed as a linear combination of variables, so another approach is needed to accommodate product functions within existing neural networks. In this paper we will:
- introduce a customized logarithm-exponential activation pair which could learn multiplication,
- test the accuracy on data which is disjoint with the training set, and
- compare the results with other activation functions.
The following section describes the data used in the experiments. Afterwards, we introduce our method and discuss the architecture, its training, and its relation to prior art. Throughout our discussion, we will refer to the natural logarithm $\log_e$ simply as $\log$ for convenience.
II Data
We will generate uniformly distributed synthetic data, with a set of variables $x_1, \dots, x_n$ as input and a single value $y$ as output. Each input variable is drawn from one range for training and from a disjoint range for testing. Using disjoint sets for training and testing gives a better picture of whether a model is over-fitting the training data. The histograms of the training and testing data are shown in Fig. 1 and Fig. 2, respectively. As a primary objective, we will try to approximate three functions,

$$y = \frac{x_1 x_2}{N}, \qquad y = \frac{x_1 x_2 x_3}{N}, \qquad y = \frac{x_1 x_2 x_3 x_4}{N},$$

and compare the results with various activation functions. Here $N$ represents the normalizing factor that keeps the scale of the output $y$ comparable to the inputs $x_i$. The normalizing factor $N$ will be kept variable to see how it impacts the accuracy of the model.
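As an illustration of this setup, a possible way to generate such data with numpy is sketched below; the concrete interval bounds, sample counts, and normalization value are placeholders, since the exact ranges used for training and testing are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_product_data(n_samples, n_vars, low, high, norm):
    """Uniform inputs in [low, high); target is their product divided by norm."""
    X = rng.uniform(low, high, size=(n_samples, n_vars))
    y = X.prod(axis=1) / norm
    return X, y

# Placeholder ranges: training and testing intervals are kept disjoint,
# matching the paper's setup, but the actual bounds are assumptions.
X_train, y_train = make_product_data(10_000, n_vars=2, low=0.0, high=1.0, norm=1.0)
X_test,  y_test  = make_product_data(2_000,  n_vars=2, low=1.0, high=2.0, norm=1.0)
```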


III Symmetric Logarithm-Exponential Activation Pair
The output of a product function scales very quickly compared to the individual inputs, which creates the problem of gradient explosion during backpropagation. In this paper, we try to leverage the well-known property of the logarithm function:

$$\log(a \cdot b) = \log(a) + \log(b)$$
We will set the logarithm as the activation of a layer to capture the sense of multiplication through addition of log values. Now, using the property of the exponential function

$$e^{\sum_i w_i \log x_i} = \prod_i x_i^{w_i},$$
we get the initial inputs as a product with their corresponding weights as exponents.
Backpropagation should update the weights $w_1$ and $w_2$ to 1 if multiplication is suitable for reducing the loss and to -1 if division can reduce the loss.
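For concreteness, with two positive inputs $x_1, x_2$ and weights $w_1, w_2$ applied to the logarithm outputs, the exponential layer yields

$$e^{\,w_1 \log x_1 + w_2 \log x_2} = x_1^{w_1} x_2^{w_2},$$

so $w_1 = w_2 = 1$ recovers the product $x_1 x_2$, while $w_1 = 1$, $w_2 = -1$ recovers the quotient $x_1 / x_2$.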
III-A Symmetric Logarithm Unit
Since $\log(x)$ is only defined for positive inputs, we need to define how the activation function behaves for 0 and negative values. Keeping in mind the discussed property of the logarithm, we will try to define the activation such that it is:
- defined over the whole real number space,
- differentiable in the whole domain, and
- symmetric about the origin.
The first two conditions allow backpropagation to work for all real inputs, and symmetry about the origin avoids bias shift. Thus we define the activation function for the first layer as:
$$f(x) = \begin{cases} \log(x+1), & x \ge 0 \\ -\log(1-x), & x < 0 \end{cases} \qquad (1)$$
The differentiability of function (1) follows from the differentiability of $\log(x+1)$ for $x > 0$ and of $-\log(1-x)$ for $x < 0$.

Considering the critical point $x = 0$ shows that the function is differentiable for all $x \in \mathbb{R}$.
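Explicitly, under the reconstruction of Eq. (1) above, the derivative would be

$$f'(x) = \begin{cases} \dfrac{1}{x+1}, & x \ge 0 \\[4pt] \dfrac{1}{1-x}, & x < 0, \end{cases}$$

and both one-sided derivatives at the critical point $x = 0$ equal 1, so backpropagation is well defined everywhere.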
III-B Symmetric Exponential Unit
Similarly, based on the properties discussed for the Symmetric Logarithm Unit, we will define the exponential activation function for the second layer, which is continuous and differentiable:
$$g(x) = \begin{cases} e^{x} - 1, & x \ge 0 \\ 1 - e^{-x}, & x < 0 \end{cases} \qquad (2)$$
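Assuming the reconstructed forms of Eqs. (1) and (2), the two units are mutual inverses, which is what allows the exponential layer to undo the logarithm layer:

$$g(f(x)) = \begin{cases} e^{\log(x+1)} - 1 = x, & x \ge 0 \\[4pt] 1 - e^{\log(1-x)} = x, & x < 0. \end{cases}$$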


IV Experimental Setup
Keras [5] with the Google TensorFlow backend [6] was used to implement the deep learning algorithms in this study, with the aid of other scientific computing libraries: matplotlib [7], numpy [8], and Scikit-learn [9]. All experiments in this study were conducted on a laptop computer with an Intel Core(TM) i7-9750H CPU @ 2.60GHz, 16GB of DDR3 RAM, and an NVIDIA GeForce GTX 1650 4GB GPU.

V The Model
We will prepare a shallow feed-forward neural network containing 2 hidden layers and test various combinations of activation functions on our prepared dataset. The final output dense layer will use the linear activation, which is the default for Keras [5]. The abstract architecture of the model is described in Fig. 5 and the implemented Keras [5] architecture in Fig. 6. For comparison, we will consider activation functions from {elu [10], hard sigmoid, linear, relu [12], selu [13], sigmoid, softmax, softplus, softsign, swish [14], and tanh}. These 11 unique activation functions make a total of 11 × 11 = 121 activation pairs. We will test the score of our proposed activation function against all the possible pairs.
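As a rough illustration of how such a model might be assembled, the sketch below builds a two-hidden-layer Keras model with the logarithm activation on the first layer and the exponential activation on the second. The activation definitions follow the reconstructions of Eqs. (1) and (2) above, and the layer widths (`width`, `n_vars`) are placeholders rather than the exact configuration of Fig. 6.

```python
import tensorflow as tf
from tensorflow import keras

# Symmetric logarithm unit (Eq. 1, as reconstructed): sign(x) * log(|x| + 1)
def sym_log(x):
    return tf.sign(x) * tf.math.log(tf.abs(x) + 1.0)

# Symmetric exponential unit (Eq. 2, as reconstructed): sign(x) * (exp(|x|) - 1)
def sym_exp(x):
    return tf.sign(x) * (tf.exp(tf.abs(x)) - 1.0)

n_vars = 2   # number of input variables (assumption: two-variable product)
width = 4    # hidden-layer width (assumption: not specified here)

model = keras.Sequential([
    keras.Input(shape=(n_vars,)),
    keras.layers.Dense(width, activation=sym_log),   # logarithm layer
    keras.layers.Dense(width, activation=sym_exp),   # exponential layer
    keras.layers.Dense(1),                            # output layer, default linear activation
])
model.summary()
```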
Fig. 7: Training loss over 100 epochs for different input sizes and normalization factors. Here "lg_model" is our proposed model.
VI Results
VI-A Product Functions
VI-A1 Training Loss
The model is trained to minimize the Mean Absolute Error (MAE) using the adaptive moment estimation (Adam) [11] optimizer for our experiments. We have trained the model for product functions with up to 4 input variables and different normalization factors. Normalization is essential to ensure that the product of the inputs does not grow unboundedly. The plots in Fig. 7 show the activation pairs (out of the 121 possible) with the minimum training error after 100 epochs, along with our proposed logarithm-exponential activation pair. In all cases our model not only achieves the minimum training loss but also converges in the fewest epochs. In cases where the output is not normalized to the scale of the input variables, other activation functions show a much larger training loss than our model.
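For reference, a training call consistent with this setup (Adam optimizer, MAE loss, 100 epochs) might look as follows, reusing the `model` and the synthetic data from the earlier sketches; the batch size is an assumption, as it is not reported here.

```python
# Continuing the earlier sketches: `model`, `X_train`, `y_train`, `X_test`, `y_test`.
model.compile(optimizer="adam", loss="mae")

history = model.fit(
    X_train, y_train,
    epochs=100,                        # matches the 100-epoch curves in Fig. 7
    batch_size=32,                     # assumption: batch size not reported
    validation_data=(X_test, y_test),
    verbose=0,
)
print("final training MAE:", history.history["loss"][-1])
```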
VI-A2 Testing Error
We will consider the usual MAE and the Mean Percentage Error (MPE), which is calculated as:
$$\text{MPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{y_i - \hat{y}_i}{y_i} \qquad (3)$$
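As a small helper, this metric, in the reconstructed form of Eq. (3) above, could be computed with numpy as follows.

```python
import numpy as np

def mean_percentage_error(y_true, y_pred):
    """Mean percentage error, following the reconstructed Eq. (3)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean((y_true - y_pred) / y_true)

# Example: predictions close to the true values give a small MPE.
print(mean_percentage_error([2.0, 4.0, 8.0], [1.9, 4.2, 7.6]))
```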
The table shows that the proposed model has a much lower MPE compared to other activation pairs on the test data. It also suggests that the activation pairs which showed a lower training error were over-fitting the training data.
| Expression | Our Model MPE (%) | Our Model MAE | Next Best Activation Pair | Next Best MPE (%) |
|---|---|---|---|---|
|  | 15.453791 | 47284.32 | relu_linear | 61.191017 |
|  | 7.801138 | 1565.0941 | relu_selu | 69.28548 |
|  | 16.815907 | 27887974 | elu_elu | 98.30871 |
|  | 24.82434 | 394495.03 | softplus_relu | 93.41624 |
|  | 28.907835 | 3.4e+10 | softplus_selu | 99.79 |
|  | 39.76962 | 34425464 | elu_linear | 99.70 |
VI-B Complex Function
We will also show the effectiveness of our method for estimating more complex arithmetic functions. The function we try to estimate is:
(4)
VI-B1 Training Loss
The training loss for the function described by Eq. (4) is also the lowest for our model, and the model again converges quickly. The training loss is depicted in Fig. 8.

VI-B2 Testing Error
Without any architectural changes, the mean percentage error for the complex formula is 22.794909, which is much lower than the next best value of 97.09597 for the selu-linear activation pair.
VII Related Work
Durbin and Rumelhart (1989) [3] realize product functions by taking a weighted product of the inputs. This approach achieves the desired results but requires changes to the commonly used training loop, in which we generally take a weighted sum of the inputs. Georg Martius and Christoph H. Lampert [4] propose a cell that takes two inputs at a time and calculates their product; it also performs a weighted product of the inputs. NLReLU [2] introduces an activation similar to the activation function used in the first layer of our model. NLReLU has various advantages, but on its own it cannot estimate a product function very well.
VIII Conclusion
We presented and studied the behavior of a logarithm activation paired with an exponential activation function for estimating product functions. For data with multiplicative relations between the inputs, our customized activation pair achieved the lowest error compared to other activation functions and also avoided over-fitting the training data, using a very shallow feed-forward neural network. We proved the differentiability of our activation functions, which allows end-to-end training using backpropagation, so they can be integrated within deep neural networks to estimate hidden product relations.
IX Acknowledgement
We thank Soham Pal and Shivam Gupta for continuous encouragement and feedback throughout this paper.
References
- [1] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
- [2] Y. Liu, J. Zhang, C. Gao, J. Qu and L. Ji, ”Natural-Logarithm-Rectified Activation Function in Convolutional Neural Networks,” 2019 IEEE 5th International Conference on Computer and Communications (ICCC), Chengdu, China, 2019, pp. 2000-2008, doi: 10.1109/ICCC47050.2019.9064398.
- [3] Durbin, R. M., and Rumelhart, D. E. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1(1), 133–142. https://doi.org/10.1162/neco.1989.1.1.133
- [4] Martius, G. and Lampert, C. H., "Extrapolation and learning equations," 2016.
- [5] François Chollet et al. 2015. Keras. https://github.com/keras-team/keras. (2015).
- [6] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org, http://tensorflow.org/
- [7] J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science and Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55
- [8] Stéfan van der Walt, S Chris Colbert, and Gael Varoquaux. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science and Engineering 13, 2 (2011), 22–30.
- [9] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
- [10] Clevert, Djork-Arné and Unterthiner, Thomas and Hochreiter, Sepp. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).
- [11] Kingma, Diederik and Ba, Jimmy. (2014). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations.
- [12] A. F. Agarap, “Deep Learning using Rectified Linear Units (ReLU).,” CoRR, vol. abs/1803.08375, 2018.
- [13] Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S., "Self-Normalizing Neural Networks," 2017.
- [14] Ramachandran, P., Zoph, B., and Le, Q. V., "Searching for Activation Functions," 2017.