Activation functions are not needed: the ratio net

05/14/2020 · Chi-Chun Zhou, et al.

The function approximator that maps features to labels is an important component of a deep neural network for classification tasks. The main difficulty in designing such an approximator is nonlinearity, which is usually handled with a nonlinear activation function or a nonlinear kernel function, yielding classical networks such as the feed-forward neural network (MLP) and the radial basis function network (RBF). Although classical networks such as the MLP are robust in most classification tasks, they are not the most efficient: they use a large number of parameters and take a long time to train, and the choice of activation function has a non-negligible influence on both the effectiveness and the efficiency of the network. In this paper, we propose a new network that efficiently finds the function mapping features to labels. Instead of a nonlinear activation function, the proposed network uses a fractional form to handle the nonlinearity; for convenience, we name it the ratio net. We compare the effectiveness and efficiency of the ratio net with those of the MLP and the RBF on the MNIST database of handwritten digits and on the IMDb dataset, a binary sentiment analysis dataset. The results show that the ratio net outperforms both the MLP and the RBF.


1 Introduction

A deep neural network for classification tasks consists of two components: 1) feature extractors, such as convolutional neural networks [9, 8], recurrent neural networks [6, 12], and the transformer [15], which extract features from the raw data, and 2) a function approximator that finds the function mapping features to labels. Both components are important for the effectiveness and efficiency of the network in classification tasks.

In designing the function approximator, one has to make sure, on the one hand, that the target function is within the search range, or equivalently, that the network has the property of universal approximation [16, 7, 11, 13]. On the other hand, one should also consider the efficiency of the candidate network. Nonlinearity is the main difficulty in designing the function approximator; it is usually handled with a nonlinear activation function [11, 5, 17, 14] or a nonlinear kernel function [13, 2], yielding classical networks such as the feed-forward neural network (MLP) [16, 7, 11] and the radial basis function network (RBF) [13, 2]. Over the past decades, many studies have proved that these networks possess the property of universal approximation [16, 7, 11, 13].

Although classical networks such as the MLP are robust in most classification tasks, they are not the most efficient [18]: they use a large number of parameters and take a long time to train. Additionally, the choice of activation function has a non-negligible influence on the effectiveness and efficiency of the network. New, more efficient networks are therefore needed.

In this paper, we propose a new network that efficiently finds the function mapping features to labels. Instead of a nonlinear activation function, the proposed network uses a fractional form to handle the nonlinearity; for convenience, we name it the ratio net. The ratio net is inspired by previous work [18], where we found that the Pade approximant, built on fractional forms, is highly efficient in searching for a target function. We compare the effectiveness and efficiency of the ratio net with those of classical networks such as the MLP and the RBF on the MNIST database of handwritten digits [10, 3] and on the IMDb dataset [4], a binary sentiment analysis dataset. The results show that the ratio net outperforms both the MLP and the RBF. The source code for the present paper is available on GitHub.

The work is organized as follows: in Sec. 2, we present the structure of the ratio net. In Sec. 3, we compare the effectiveness and efficiency of the ratio net with those of classical networks such as the MLP and the RBF on the MNIST database of handwritten digits and on the IMDb dataset. Conclusions and outlooks are given in Sec. 4.

2 The structure of the ratio net

Nonlinearity is the main difficulty in designing the function approximator. One common approach is based on a nonlinear activation function, yielding classical networks such as the MLP [11, 5, 17]; another is based on a nonlinear kernel function, yielding classical networks such as the RBF [13, 2]. Over the past decades, these networks have been proved to possess the property of universal approximation [16, 7, 11, 13, 2]. In this section, we show that there is another way to handle the difficulty caused by nonlinearity: the ratio net, which is built on fractional forms. We now introduce its structure.

The structure of the ratio net.

The function approximator gives a function that maps a vector in $\mathbb{R}^{n}$ to another vector in $\mathbb{R}^{m}$. The structures of classical networks such as the MLP and the RBF are

$$y_{j}=\sum_{i}v_{ji}\,\sigma\Big(\sum_{k}w_{ik}x_{k}+b_{i}\Big)$$

and

$$y_{j}=\sum_{i}w_{ji}\,\varphi\left(\left\Vert \mathbf{x}-\mathbf{c}_{i}\right\Vert \right),$$

respectively, where $y_{j}$ is the $j$th component of the output, $\sigma$ is the activation function, and $\varphi$ is the kernel function; $v_{ji}$, $w_{ik}$, $b_{i}$, and the centers $\mathbf{c}_{i}$ are parameters. Usually, $\sigma$ is chosen from the sigmoid function, the hyperbolic tangent (tanh) function, and the rectified linear unit (relu) function, while $\varphi$ can be a Gauss function. Examples of the two structures are shown in Figs. 1 and 2.

Figure 1: An example of the structure of the MLP.
Figure 2: An example of the structure of the RBF.
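
For concreteness, here is a minimal NumPy sketch of the two forward passes above; the layer sizes and the Gaussian width are illustrative assumptions, not values from the paper.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: y = V tanh(W x + b); tanh supplies the nonlinearity."""
    h = np.tanh(W1 @ x + b1)          # hidden layer with activation function
    return W2 @ h + b2                # linear output layer

def rbf_forward(x, centers, gamma, W):
    """RBF network: y_j = sum_i W_ji exp(-gamma ||x - c_i||^2); Gauss kernels."""
    phi = np.exp(-gamma * np.sum((centers - x) ** 2, axis=1))
    return W @ phi

# illustrative sizes: 4-dimensional input, 8 hidden units/centers, 3 outputs
rng = np.random.default_rng(0)
x = rng.normal(size=4)
y_mlp = mlp_forward(x, rng.normal(size=(8, 4)), rng.normal(size=8),
                    rng.normal(size=(3, 8)), rng.normal(size=3))
y_rbf = rbf_forward(x, rng.normal(size=(8, 4)), gamma=1.0, W=rng.normal(size=(3, 8)))
```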

Here, we propose the ratio net. Its structure is a ratio of two multivariate polynomials,

$$y_{j}=\frac{\sum_{\left|\alpha\right|\le p}a_{j\alpha}\,\mathbf{x}^{\alpha}}{\sum_{\left|\alpha\right|\le q}b_{j\alpha}\,\mathbf{x}^{\alpha}},\qquad\mathbf{x}^{\alpha}=x_{1}^{\alpha_{1}}\cdots x_{n}^{\alpha_{n}},\qquad(2.1)$$

where the coefficients $a_{j\alpha}$ and $b_{j\alpha}$ are parameters and $\alpha$ is a multi-index. In Eq. (2.1), instead of a nonlinear activation function or a nonlinear kernel function, the nonlinearity is handled by the fractional form, as shown in Fig. 3.

Figure 3: An example of the structure of the ratio net.
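
With $p=q=1$, Eq. (2.1) reduces to a linear-over-linear form. Below is a minimal NumPy sketch of that first-order instance, assuming the form reconstructed above; the small denominator guard is our own numerical precaution rather than part of the paper.

```python
import numpy as np

def ratio_net_forward(x, A, a0, B, b0, eps=1e-6):
    """First-order ratio net: y_j = (a_j0 + A_j . x) / (b_j0 + B_j . x).

    No activation function is used; the nonlinearity comes from the division.
    """
    num = A @ x + a0
    den = B @ x + b0
    den = np.where(np.abs(den) < eps, eps, den)   # guard against division by ~0
    return num / den

rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # 4-dimensional input, illustrative
A, B = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
a0, b0 = rng.normal(size=3), np.ones(3)       # b0 = 1 keeps the denominator near 1
print(ratio_net_forward(x, A, a0, B, b0))     # 3 outputs, nonlinear in x
```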

The property of universal approximation of the ratio net: a brief discussion. In designing the function approximator, one has to make sure that the target function mapping features to labels is within the search range; that is, the network should have the property of universal approximation. Although a rigorous proof of the universal approximation property of the ratio net is not given here, we argue that the ratio net inherits this property from the Pade approximant.

The conventional Pade approximant is given as [1]

$$\left[m/n\right]\left(x\right)=\frac{\sum_{j=0}^{m}a_{j}x^{j}}{1+\sum_{k=1}^{n}b_{k}x^{k}}.\qquad(2.2)$$

The Pade approximant has a Maclaurin expansion which agrees with the power series of the target function $f$ up to order $m+n$ [1], that is,

$$f\left(x\right)-\left[m/n\right]\left(x\right)=O\left(x^{m+n+1}\right).\qquad(2.3)$$

For example, with a polynomial of sufficiently high order, one can draw an elephant. The Pade approximant is thus capable of approximating various kinds of functions.
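
As a concrete instance of Eqs. (2.2) and (2.3): the [1/1] Pade approximant of $e^{x}$ is $(1+x/2)/(1-x/2)$, whose Maclaurin expansion agrees with that of $e^{x}$ up to order $m+n=2$. A quick numerical check:

```python
import math

def pade_11_exp(x):
    """[1/1] Pade approximant of exp(x); agrees with its Maclaurin series to order 2."""
    return (1 + x / 2) / (1 - x / 2)

def taylor2_exp(x):
    """Second-order Maclaurin polynomial of exp(x), built from the same data."""
    return 1 + x + x ** 2 / 2

for x in (0.1, 0.5, 1.0):
    print(f"x={x}: exp={math.exp(x):.6f}  "
          f"pade[1/1]={pade_11_exp(x):.6f}  taylor2={taylor2_exp(x):.6f}")
```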

The conventional Pade approximant gives functions that map a one-dimensional variable to another one-dimensional variable. The ratio net proposed in this paper is a generalization of the Pade approximant and gives functions that map a vector in $\mathbb{R}^{n}$ to another vector in $\mathbb{R}^{m}$. The property of universal approximation of the ratio net is thus inherited from the Pade approximant.

3 The classification task on the MNIST database and the IMDb dataset

In this section, to show that the ratio net is capable of finding the function that maps features to labels efficiently, we compare the effectiveness and efficiency of the ratio net with those of classical networks such as the MLP and the RBF on the MNIST database of handwritten digits and on the IMDb dataset. The results show that the ratio net outperforms both the MLP and the RBF.

3.1 The task on the MNIST database

The MNIST database of handwritten digits [10, 3] is a famous open dataset for image classification tasks, containing 60,000 training images and 10,000 test images. In this section, we use a convolutional auto-encoder network to extract features: from each $28\times28$ pixel image we obtain a fixed-length feature vector. Then an MLP, an RBF, and a ratio net with different structures are used to find the function that maps this feature vector to the label, which takes 10 different values.
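
To make the pipeline concrete, here is a hedged sketch of a ratio-net classification head on top of the extracted features, using the first-order form from Sec. 2; the 16-dimensional feature size and the parameter initializations are our own illustrative assumptions, not the settings of the experiments.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ratio_head(feat, A, a0, B, b0, eps=1e-6):
    """Map an auto-encoder feature vector to a distribution over the 10 digits."""
    den = B @ feat + b0
    den = np.where(np.abs(den) < eps, eps, den)   # guard against division by ~0
    return softmax((A @ feat + a0) / den)

rng = np.random.default_rng(0)
feat = rng.normal(size=16)                        # assumed feature dimension
A, B = rng.normal(size=(10, 16)), rng.normal(size=(10, 16))
a0, b0 = np.zeros(10), np.ones(10)
probs = ratio_head(feat, A, a0, B, b0)
print(probs.sum())                                # 1.0: a valid class distribution
```

In a real run these parameters would be trained with a cross-entropy loss, exactly as for an MLP head.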

The result is given as follows:

Structure | Number of parameters | Accuracy

Where "the ratio net:" denotes a ratio net with the structure

(3.1)

and so on. "MLP:" denotes a two-layered MLP with neurons in each layer and the activation function the hyperbolic tangent function and so on.

The efficiency of each method is shown in Fig. 4, which indicates that the ratio net converges faster than and outperforms the classical networks such as the MLP and the RBF.

Figure 4: The accuracy versus the training steps of different methods in the classification task on the MNIST database of handwritten digits.

3.2 The task on the IMDb dataset

The IMDb dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb), each labeled as positive or negative; it is a famous open dataset for NLP classification tasks. In this section, we use the TextCNN model to extract features: from each sample in the IMDb dataset we obtain a fixed-length feature vector. Then an MLP and a ratio net with different structures are used to find the function that maps this feature vector to the binary label.
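
For concreteness, here is a minimal NumPy sketch of TextCNN-style feature extraction with a random embedding table, as in the "TextCNN with random embedding" entry reported below; the vocabulary size, embedding dimension, and window sizes are illustrative assumptions, not the experimental settings.

```python
import numpy as np

def textcnn_features(token_ids, emb, filters):
    """TextCNN-style features: convolve embeddings, relu, then max-pool over time."""
    X = emb[token_ids]                             # (T, d) embedded sentence
    feats = []
    for W in filters:                              # W: (k, d, f) kernel
        k = W.shape[0]
        conv = np.stack([np.einsum('kd,kdf->f', X[t:t + k], W)
                         for t in range(X.shape[0] - k + 1)])
        feats.append(np.maximum(conv, 0).max(axis=0))   # max-over-time pooling
    return np.concatenate(feats)                   # fixed-length feature vector

rng = np.random.default_rng(0)
emb = rng.normal(size=(5000, 32))                  # random embedding table
filters = [rng.normal(size=(k, 32, 8)) for k in (3, 4, 5)]
feat = textcnn_features(rng.integers(0, 5000, size=40), emb, filters)
print(feat.shape)                                  # (24,): 3 window sizes x 8 channels
```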

The result is given as follows.

Structure | Number of parameters | Accuracy
TextCNN with random embedding

The efficiency of each method is shown in Fig. 5.

Figure 5: The accuracy versus the steps of different methods in the classification task on the IMDb dataset.

4 Conclusions and outlooks

In this paper, we propose a new network that efficiently finds the function mapping features to labels. Instead of a nonlinear activation function, the proposed network uses a fractional form to handle the nonlinearity; for convenience, we name it the ratio net. The ratio net is inspired by previous work [18], where we found that the Pade approximant is highly efficient in searching for a target function. We compared the effectiveness and efficiency of the ratio net with those of classical networks such as the MLP and the RBF on the MNIST database of handwritten digits and on the IMDb dataset. The results show that the ratio net outperforms both the MLP and the RBF.

The ratio net can replace the MLP in various kinds of classification tasks and thus improve both effectiveness and efficiency.

5 Acknowledgments

We are very indebted to Prof. Wu-Sheng Dai for his enlightenment and encouragement.

References

  • [1] G. A. Baker Jr. and P. Graves-Morris (1996) Pade approximants. Encyclopedia of Mathematics and its Applications, Vol. 59, Cambridge University Press. Cited by: §2.
  • [2] S. Chen, C. F. Cowan, and P. M. Grant (1991) Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on neural networks 2 (2), pp. 302–309. Cited by: §1, §2.
  • [3] L. Deng (2012) The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29 (6), pp. 141–142. Cited by: §1, §3.1.
  • [4] S. Dooms, T. De Pessemier, and L. Martens (2013) Movietweetings: a movie rating dataset collected from twitter. In Workshop on Crowdsourcing and human computation for recommender systems, CrowdRec at RecSys, Vol. 2013, pp. 43. Cited by: §1.
  • [5] X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323. Cited by: §1, §2.
  • [6] A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649. Cited by: §1.
  • [7] K. Hornik, M. Stinchcombe, and H. White (1990) Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks 3 (5), pp. 551–560. Cited by: §1, §2.
  • [8] B. Hu, Z. Lu, H. Li, and Q. Chen (2014) Convolutional neural network architectures for matching natural language sentences. In Advances in neural information processing systems, pp. 2042–2050. Cited by: §1.
  • [9] Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §1.
  • [10] Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §1, §3.1.
  • [11] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks 6 (6), pp. 861–867. Cited by: §1, §2.
  • [12] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association, Cited by: §1.
  • [13] J. Park and I. W. Sandberg (1991) Universal approximation using radial-basis-function networks. Neural computation 3 (2), pp. 246–257. Cited by: §1, §2.
  • [14] S. Sonoda and N. Murata (2017) Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis 43 (2), pp. 233–268. Cited by: §1.
  • [15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
  • [16] H. White (1990) Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural networks 3 (5), pp. 535–549. Cited by: §1, §2.
  • [17] B. Xu, N. Wang, T. Chen, and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. Cited by: §1, §2.
  • [18] C. Zhou and Y. Liu (2020) The pade approximant based network for variational problems. arXiv preprint arXiv:2004.00711. Cited by: §1, §1, §4.