The DNN surrogate model is constructed by recursive composition of linear transformations and nonlinear activation functions. To ensure good performance, it is essential to choose activation functions suitable for specific applications. In practice, the Rectified Linear Unit (ReLU) is one of the most popular activation functions owing to its simplicity and efficiency. A drawback of ReLU is the presence of vanishing gradients in the training process, known as the dying ReLU problem. Several relatively new activation functions have been proposed to overcome this problem, e.g., the simple Leaky ReLU, as well as the Piecewise Linear Unit (PLU) , Softplus , Exponential Linear Unit (ELU) , Scaled Exponential Linear Unit (SELU) , and Gaussian Error Linear Unit (GELU) .
Although the aforementioned activation functions are shown to be competitive in benchmark tests, they are still fixed nonlinear functions. In a DNN structure, it is often hard to determine a priori the optimal activation function for a specific application. In this paper, we generalize ReLU and introduce arbitrary trainable matrix activation functions. The effectiveness of the proposed method is validated using function approximation examples and well-known benchmark datasets such as MNIST and CIFAR-10. There are a few classical works on adaptive tuning of parameters in the training process, e.g., the parametric ReLU . However, our adaptive matrix activation functions are shown to be competitive and more robust in those experiments.
We consider the general learning process based on a given training set , where the inputs and outputs are implicitly related via an unknown target function with . The ReLU activation function is the piecewise linear function $\max\{x, 0\}$.
In the literature, this activation acts component-wise on an input vector. In a DNN, let be the number of layers and
denote the number of neurons at the -th layer for with and . Let denote the tuple of admissible weight matrices and the tuple of admissible bias vectors. The ReLU DNN approximation to at the -th layer is recursively defined as
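As an illustration (not the authors' code), the recursive definition above can be sketched in NumPy; the shapes of the weight matrices and bias vectors here are illustrative assumptions, since the concrete layer widths are not fixed by the text:

```python
import numpy as np

def relu(x):
    # ReLU acts component-wise: max(x, 0)
    return np.maximum(x, 0.0)

def dnn_forward(x, weights, biases):
    """Recursively defined ReLU DNN: the input is passed through
    affine maps W @ h + b followed by the activation, with the
    final layer left affine (a common convention, assumed here)."""
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        h = relu(z) if l < len(weights) - 1 else z  # no activation on the output layer
    return h
```

For example, with weights of shapes (5, 3) and (2, 5), an input in R^3 is mapped to an output in R^2.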
The traditional training process for such a DNN is to find optimal , , (and thus optimal ) such that
In other words, best fits the data with respect to the discrete norm within the function class . In practice, the sum-of-squares norm in could be replaced with more convenient norms.
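The training objective above can be made concrete with a small sketch; a minimal NumPy version of the discrete sum-of-squares loss over a training set (the function name and data layout are illustrative assumptions):

```python
import numpy as np

def sum_of_squares_loss(f, xs, ys):
    """Discrete least-squares loss: the sum over all training pairs
    (x_j, y_j) of the squared Euclidean distance |f(x_j) - y_j|^2."""
    return sum(float(np.sum((f(x) - y) ** 2)) for x, y in zip(xs, ys))
```

Training then amounts to minimizing this quantity over the weights and biases, e.g., by stochastic gradient descent.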
2. Trainable matrix activation function
Having a closer look at ReLU, we observe that the activation with could be written as a matrix-vector multiplication , where is a diagonal matrix-valued function mapping from to with entries taking values in the discrete set . This is a simple but quite useful observation. There is no reason to restrict to this set, and we thus look for a larger set of values over which the diagonal entries of run or are sampled. With a slight abuse of notation, our new DNN approximation to is computed using the following recurrence relation
Here each is diagonal and is of the form
where is a nonlinear function to be determined. Since piecewise constant functions can approximate a continuous function to arbitrarily high accuracy, we specify with as
where is a positive integer and and are constants. We may suppress the indices in , , , and write them as , , , when those quantities are uniform across layers and neurons. If , , , then the DNN (2.1) is exactly the ReLU DNN. If , , and is a fixed small negative number, (2.1) reduces to the DNN based on Leaky ReLU. If , , , , , then actually represents a discontinuous activation function.
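As a concrete illustration (a sketch, not the authors' implementation), the diagonal TMAF can be realized in NumPy with a piecewise-constant slope function: each entry of the input selects a trainable slope according to which interval it falls into, and multiplying by the diagonal matrix reduces to entrywise scaling. The breakpoint/slope parameterization below is an assumption consistent with (2.3):

```python
import numpy as np

def tmaf_diag_apply(x, breakpoints, slopes):
    """Diagonal TMAF: y = D(x) x, where D(x) = diag(alpha(x_1), ..., alpha(x_n))
    and alpha is piecewise constant on the intervals cut out by `breakpoints`.
    `slopes` holds len(breakpoints) + 1 trainable values, one per interval."""
    idx = np.searchsorted(breakpoints, x)  # interval index of each coordinate
    return slopes[idx] * x                 # diag(...) @ x is entrywise scaling

# ReLU is the special case breakpoints = [0], slopes = [0, 1];
# Leaky ReLU replaces the 0 slope with a small negative/positive constant.
```

In a trainable setting, `slopes` would be the parameters updated by SGD alongside the weights and biases.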
In our case, we shall fix some parameters from and and let the rest vary in the training process. The resulting DNN may use different activation functions for different neurons and layers, adapted to the target function . Since ReLU and Leaky ReLU are included in our DNN as special cases, the proposed DNN is expected to behave no worse than the traditional ones in practice. In the following, we call (2.1) with the activation function in (2.3) with trainable parameters a DNN based on the “trainable matrix activation function (TMAF)”.
Starting from the diagonal activation , we can go one step further to construct more general activation matrices. First we note that could be viewed as a nonlinear operator , where
There seems to be no convincing reason to consider only diagonal operators. Hence we go a step further and consider a trainable nonlinear activation operator determined by more general matrices, e.g., the following tri-diagonal operator
The diagonal is given in (2.3), while the off-diagonals , are piecewise constant functions in the -th coordinate of , defined in a fashion similar to . Theoretically speaking, even a trainable full matrix activation is possible, despite the potentially huge training cost. In summary, the corresponding DNN based on trainable nonlinear activation operators reads
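A minimal NumPy sketch of one possible instantiation of the tri-diagonal operator (the exact dependence of the off-diagonal entries on the coordinates is elided in the text, so the version below, where each entry is piecewise constant in a nearby coordinate, is an illustrative assumption):

```python
import numpy as np

def piecewise_const(x, breakpoints, values):
    # values has len(breakpoints) + 1 entries, one per interval
    return values[np.searchsorted(breakpoints, x)]

def tmaf_tridiag_apply(x, bp, diag_vals, lower_vals, upper_vals):
    """Tri-diagonal TMAF: y = T(x) x, where T(x) has piecewise-constant
    diagonal, subdiagonal, and superdiagonal entries; applying T(x)
    costs only O(n) scalar multiplications."""
    d = piecewise_const(x, bp, diag_vals)
    lo = piecewise_const(x[1:], bp, lower_vals)   # subdiagonal entries
    up = piecewise_const(x[:-1], bp, upper_vals)  # superdiagonal entries
    y = d * x
    y[1:] += lo * x[:-1]
    y[:-1] += up * x[1:]
    return y
```

Setting the off-diagonal values to zero recovers the diagonal TMAF, consistent with the remark that the diagonal case is a special instance.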
The evaluation of and
is cheap because it requires only scalar multiplications and comparisons. When calling a general-purpose package such as PyTorch in the training process, it is observed that the computational time of these activations is comparable to that of the classical ReLU.
Our observation also applies to activation functions other than ReLU. For example, we may rescale to obtain for constants varying layer by layer and neuron by neuron. Then are used to form a matrix activation function and a TMAF DNN, where are trained according to the given data and adapted to the target function . This observation may be useful for specific applications.
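The rescaling idea can be sketched as follows; since the base activation is not specified in the text, tanh is used here purely as an example, and the per-neuron scale parameters `a` and `b` are illustrative names:

```python
import numpy as np

def rescaled_activation(x, a, b, base=np.tanh):
    """Per-neuron rescaling of a base activation (tanh as an example):
    each output is a_i * base(b_i * x_i), with a and b trainable
    and allowed to vary layer by layer and neuron by neuron."""
    return a * base(b * x)
```

With `a = b = 1` this reduces to the unmodified base activation, mirroring how TMAF reduces to ReLU for a particular parameter choice.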
3. Numerical results
In this section, we demonstrate the feasibility and efficiency of TMAF by comparing it with traditional ReLU-type activation functions. In principle, all parameters in (2.3) could be trained, but for simplicity we fix the intervals in (2.3) and only let the function values
vary in the following. In each experiment, we use the same neural network structure, the same learning rates, stochastic gradient descent (SGD) optimization, and the same number NE of epochs (SGD iterations). In particular, the learning rate 1e-4 is used for epochs to and 1e-5 is used for epochs to . We provide two sets of numerical examples:
Function approximations by TMAF network and ReLU-type networks;
Classification problems for the MNIST and CIFAR-10 datasets solved by TMAF and ReLU networks.
For the first class of examples we use the -loss function as defined in (1.2). For the classification problems we consider the cross entropy, which is widely used as a loss function in classification models. The cross entropy is defined using a training set consisting of images, each with pixels. Thus, we have a matrix whose columns correspond to images with pixels each. Each image belongs to a fixed class from the set of image classes , where . The network maps to , and each column of is the network evaluation at the corresponding column of . More precisely,
The cross entropy loss function then is defined by
To evaluate the loss function at a given image , we first evaluate the network at with the given . We then define the class of and the loss at as follows:
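As an illustration of the evaluation step described above (a sketch, not the authors' code), the per-image cross-entropy loss and the predicted class can be computed from the network output as follows; the max-subtraction is a standard numerical stabilization added here as an assumption:

```python
import numpy as np

def cross_entropy_loss(logits, label):
    """Cross entropy for one image: -log(softmax(logits)[label]),
    where `logits` is the network output for that image."""
    z = logits - logits.max()                 # stabilize before exponentiating
    log_probs = z - np.log(np.exp(z).sum())   # log softmax
    return -log_probs[label]

def predict_class(logits):
    # the predicted class is the index of the largest network output
    return int(np.argmax(logits))
```

Summing this loss over the columns of the output matrix gives the cross-entropy loss over the whole training set.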
3.1. Approximation of a smooth function
As our first example, we use neural networks to approximate
The training datasets are 20000 input-output data pairs, where the input data are randomly sampled from the hypercube . The networks (1.1) and (2.1) have single or double hidden layers with neurons per layer. For TMAF in (2.2), the function in (2.3) uses intervals , , , . The approximation results are shown in Table 3.1 and Figures 1–3. It is observed that TMAF is the most accurate activation approach. Moreover, the parametric ReLU does not approximate well, as seen in the panel (b) plots.
|Single hidden layer||Two hidden layers|
3.2. Approximation of an oscillatory function
The next example concerns approximating the following function, which has both high- and low-frequency components;
see Figure 4 for an illustration. In fact, the function in (3.1) is notoriously difficult for numerical methods in scientific computing to capture. The training datasets are 20000 input-output data pairs, where the input data are randomly sampled from the interval . We test the diagonal TMAF (2.2), where the function (2.3) uses intervals , , with . We also consider the tri-diagonal TMAF (2.4), where is the same as for the diagonal TMAF, and are all piecewise constants based on intervals , , and , , with , , respectively. Numerical results can be found in Figures 4, 5 and Table 3.2.
For this challenging problem, we note that the diagonal TMAF and tri-diagonal TMAF produce high-quality approximations, while ReLU and parametric ReLU are not able to approximate the highly oscillatory function within reasonable accuracy. It is observed from Figure 5 that ReLU actually approximates only the low-frequency part of (3.1). To capture the high frequency, ReLU clearly has to use more neurons and thus many more weight and bias parameters. On the other hand, increasing the number of intervals in TMAF only leads to a few more training parameters.
3.3. Classification of MNIST and CIFAR-10 datasets
We now test TMAF by classifying images in the MNIST and CIFAR-10 datasets. For TMAF in (2.2), the function (2.3) uses intervals , , with .
For the MNIST set, we implement single- and double-hidden-layer fully connected networks (1.1) and (2.1) with neurons per layer (except at the first layer ), and ReLU or diagonal TMAF (2.2) activation. Numerical results are shown in the accompanying figures and Table 3.3. We note that the TMAF with a single hidden layer ensures higher evaluation accuracy than ReLU, see Table 3.3.
For the CIFAR-10 dataset, we use the ResNet18 network structure with the layers and numbers of neurons provided by . The activation functions are still ReLU and the diagonal TMAF (2.2). Numerical results are presented in the accompanying figures and Table 3.3. The parameters given in  are already well tuned for ReLU. Nevertheless, TMAF still produces smaller errors in the training process and returns better classification results in the evaluation stage.
It is possible to further improve the performance of TMAF on these benchmark datasets. The key point is to select suitable intervals in (2.3) to optimize the performance. A simple strategy is to let the intervals in (2.3) vary and be adjusted during the training process, which will be investigated in our future research.
|Network|ReLU|TMAF|
|MNIST (1 hidden layer)|86.1%|92.1%|
|MNIST (2 hidden layers)|91.8%|92.2%|
-  Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios Protopapadakis. Deep learning for computer vision: A brief review. Computational Intelligence and Neuroscience, 2018:7068349, February 2018.
-  Daniel W. Otter, Julian R. Medina, and Jugal K. Kalita. A survey of the usages of deep learning in natural language processing. arXiv preprint, arXiv:1807.10854, 2018.
-  Lu Lu, Yeonjong Shin, Yanhui Su, and George E. Karniadakis. Dying ReLU and initialization: Theory and numerical examples. arXiv preprint, arXiv:1903.06733, 2019.
-  Andrei Nicolae. PLU: the piecewise linear unit activation function. arXiv preprint, arXiv:1809.09534, 2018.
-  Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey J. Gordon, David B. Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, volume 15 of JMLR Proceedings, pages 315–323. JMLR.org, 2011.
-  Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
-  Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 971–980, 2017.
-  Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. arXiv preprint, arXiv:1606.08415, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint, arXiv:1502.01852, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint, arXiv:1512.03385, 2015.
-  Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.