Neural networks with trainable matrix activation functions

09/21/2021 ∙ by Yuwen Li, et al. ∙ Penn State University

The training process of neural networks usually optimizes the weights and bias parameters of linear transformations, while the nonlinear activation functions are pre-specified and fixed. This work develops a systematic approach to constructing matrix activation functions whose entries generalize ReLU. The activation is based on matrix-vector multiplications using only scalar multiplications and comparisons. The proposed activation functions depend on parameters that are trained along with the weights and bias vectors. Neural networks based on this approach are simple and efficient and are shown to be robust in numerical experiments.







1. Introduction

In recent decades, deep neural networks (DNNs) have achieved significant success in many fields such as computer vision and natural language processing [1, 2]. The DNN surrogate model is constructed by recursive composition of linear transformations and nonlinear activation functions. To ensure good performance, it is essential to choose activation functions suitable for specific applications. In practice, the Rectified Linear Unit (ReLU) is one of the most popular activation functions for its simplicity and efficiency. A drawback of ReLU is the presence of vanishing gradients in the training process, known as the dying ReLU problem [3]. Several relatively new activation approaches have been proposed to overcome this problem, e.g., the simple Leaky ReLU, the Piecewise Linear Unit (PLU) [4], Softplus [5], the Exponential Linear Unit (ELU) [6], the Scaled Exponential Linear Unit (SELU) [7], and the Gaussian Error Linear Unit (GELU) [8].
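For reference, several of the ReLU-type activations mentioned above can be written in a few lines of NumPy; the parameter defaults below (e.g. alpha = 0.01 for Leaky ReLU) are common illustrative choices, not values taken from the cited papers:

```python
import numpy as np

def relu(x):
    # ReLU: zero for negative inputs, identity for positive inputs
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: small nonzero slope alpha on the negative axis
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # ELU: smooth exponential branch for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softplus(x):
    # Softplus: smooth approximation of ReLU, log(1 + exp(x))
    return np.log1p(np.exp(x))
```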

Although the aforementioned activation functions are competitive in benchmark tests, they are still fixed nonlinear functions. In a DNN structure, it is often hard to determine a priori the optimal activation function for a specific application. In this paper, we generalize ReLU and introduce trainable matrix activation functions. The effectiveness of the proposed method is validated using function approximation examples and well-known benchmark datasets such as MNIST and CIFAR-10. There are a few classical works on adaptive tuning of activation parameters in the training process, e.g., the parametric ReLU [9]. Our adaptive matrix activation functions are shown to be competitive with and more robust than these approaches in our experiments.

1.1. Preliminaries

We consider the general learning process based on a given training set {(x_i, y_i)}_{i=1}^N, where the inputs x_i and outputs y_i are implicitly related via an unknown target function f with y_i = f(x_i). The ReLU activation function is a piecewise linear function given by

ReLU(t) = max(t, 0).

In the literature, ReLU acts component-wise on an input vector. In a DNN, let L be the number of layers and n_ℓ denote the number of neurons at the ℓ-th layer for 0 ≤ ℓ ≤ L, with n_0 the input dimension and n_L the output dimension. Let W = {W_ℓ}_{ℓ=1}^L denote the tuple of admissible weight matrices and b = {b_ℓ}_{ℓ=1}^L the tuple of admissible bias vectors. The ReLU DNN approximation to f at the ℓ-th layer is recursively defined as

z_0(x) = x,
z_ℓ(x) = ReLU(W_ℓ z_{ℓ-1}(x) + b_ℓ),  1 ≤ ℓ ≤ L − 1,
f_{W,b}(x) = W_L z_{L-1}(x) + b_L.   (1.1)

The traditional training process for such a DNN is to find optimal W, b (and thus an optimal f_{W,b}) such that

(W*, b*) = argmin_{W,b} Σ_{i=1}^N |f_{W,b}(x_i) − y_i|².   (1.2)

In other words, f_{W*,b*} best fits the data with respect to the discrete ℓ² norm within the function class {f_{W,b}}. In practice, the sum of squares norm in (1.2) could be replaced with more convenient norms.
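The recursion (1.1) and the loss (1.2) can be sketched in a few lines of NumPy; variable names and layer sizes below are our own, purely illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights, biases):
    """ReLU DNN forward pass as in (1.1): ReLU on all hidden layers,
    a plain affine map on the output layer."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = relu(W @ z + b)
    return weights[-1] @ z + biases[-1]

def l2_loss(weights, biases, xs, ys):
    """Discrete sum-of-squares misfit over the training set, as in (1.2)."""
    return sum(np.sum((forward(x, weights, biases) - y) ** 2)
               for x, y in zip(xs, ys))
```

Training then amounts to minimizing `l2_loss` over the weights and biases, e.g. by stochastic gradient descent.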

2. Trainable matrix activation function

Having a closer look at ReLU, we observe that the activation ReLU(W_ℓ z + b_ℓ) could be given as a matrix-vector multiplication D_ℓ(W_ℓ z + b_ℓ)(W_ℓ z + b_ℓ), where D_ℓ is a diagonal matrix-valued function with entries taking values from the discrete set {0, 1}. This is a simple but quite useful observation. There is no reason to restrict ourselves to {0, 1}, and we thus look for a larger set of values over which the diagonal entries of D_ℓ run or are sampled. With a slight abuse of notation, our new DNN approximation to f is calculated using the following recurrence relation:

z_0(x) = x,
z_ℓ(x) = D_ℓ(W_ℓ z_{ℓ-1}(x) + b_ℓ)(W_ℓ z_{ℓ-1}(x) + b_ℓ),  1 ≤ ℓ ≤ L − 1,
f_{W,b}(x) = W_L z_{L-1}(x) + b_L.   (2.1)
Here each D_ℓ is diagonal and is of the form

D_ℓ(x) = diag(α_{ℓ,1}(x_1), α_{ℓ,2}(x_2), …, α_{ℓ,n_ℓ}(x_{n_ℓ})),   (2.2)

where each α_{ℓ,i} is a nonlinear function to be determined. Since piecewise constant functions can approximate a continuous function to arbitrarily high accuracy, we specify α_{ℓ,i} as

α_{ℓ,i}(t) = a_{ℓ,i,k}  for t ∈ (t_{ℓ,i,k−1}, t_{ℓ,i,k}],  1 ≤ k ≤ m,   (2.3)

where m is a positive integer and the break points t_{ℓ,i,k} (with t_{ℓ,i,0} = −∞, t_{ℓ,i,m} = +∞) and the values a_{ℓ,i,k} are constants. We may suppress the indices ℓ, i and write these quantities as m, t_k, a_k when they are uniform across layers and neurons. If m = 2, t_1 = 0, a_1 = 0, a_2 = 1, then the DNN (2.1) is exactly the ReLU DNN. If m = 2, t_1 = 0, a_2 = 1, and a_1 is a fixed small negative number, then (2.1) reduces to the DNN based on Leaky ReLU. For more general choices of the parameters, t ↦ α_{ℓ,i}(t)t actually represents a discontinuous activation function.
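The piecewise-constant construction in (2.3) is straightforward to sketch in NumPy; function and variable names below are ours, and the special parameter choices noted above reproduce ReLU and Leaky ReLU:

```python
import numpy as np

def tmaf_scalar(t, breaks, values):
    """Piecewise-constant function as in (2.3): returns values[k] on the
    k-th interval cut out by the sorted break points `breaks`.
    Requires len(values) == len(breaks) + 1."""
    k = np.searchsorted(breaks, t, side='left')
    return np.asarray(values)[k]

def tmaf_activation(x, breaks, values):
    """Diagonal matrix activation D(x) x: each component x_i is scaled
    by the piecewise-constant factor evaluated at x_i."""
    return tmaf_scalar(x, breaks, values) * x
```

With `breaks=[0.0]` and `values=[0.0, 1.0]` this reproduces ReLU; `values=[0.01, 1.0]` gives a Leaky ReLU. Only the entries of `values` (and optionally `breaks`) need to be exposed to the optimizer as trainable parameters.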

In our case, we shall fix some parameters from {t_k} and {a_k} and let the rest vary in the training process. The resulting DNN may use different activation functions for different neurons and layers, adapted to the target function f. Since ReLU and Leaky ReLU are included in our DNN as special cases, the proposed DNN is expected to behave no worse than the traditional ones in practice. In the following, we call (2.1), with the activation function (2.3) carrying trainable parameters, a DNN based on the “trainable matrix activation function (TMAF)”.

Starting from the diagonal activation D_ℓ, we can go one step further and construct more general activation matrices. First we note that x ↦ D_ℓ(x)x could be viewed as a nonlinear operator from ℝ^{n_ℓ} to ℝ^{n_ℓ}. There seems to be no convincing reason to consider only diagonal operators. Hence we become more ambitious and consider a trainable nonlinear activation operator determined by more general matrices, e.g., the tri-diagonal operator

D_ℓ(x) = tridiag(β_{ℓ,i}(x_i), α_{ℓ,i}(x_i), γ_{ℓ,i}(x_i)),   (2.4)

whose diagonal α_{ℓ,i} is given in (2.3), while the off-diagonals β_{ℓ,i}, γ_{ℓ,i} are piecewise constant functions of the corresponding coordinate of x, defined in a fashion similar to (2.3). Theoretically speaking, even a trainable full matrix activation is possible, despite the potentially huge training cost. In summary, the corresponding DNN based on trainable nonlinear activation operators reads as (2.1) with the diagonal D_ℓ replaced by its tri-diagonal (or more general) counterpart.


The evaluation of these matrix activations is cheap because it requires only scalar multiplications and comparisons. When calling a general-purpose package such as PyTorch in the training process, it is observed that the computational time of the diagonal and tri-diagonal TMAF is comparable to that of classical ReLU.
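A minimal sketch of the tri-diagonal operator (2.4) under our reading of it, with one piecewise-constant coefficient function per band (names, shared break points, and interval handling are our assumptions):

```python
import numpy as np

def pc(t, breaks, values):
    # piecewise-constant lookup, as in (2.3)
    return np.asarray(values)[np.searchsorted(breaks, t, side='left')]

def tridiag_activation(x, breaks, diag_vals, sub_vals, sup_vals):
    """y = D(x) x with D(x) tri-diagonal; each band entry is a
    piecewise-constant function of the corresponding coordinate of x."""
    alpha = pc(x, breaks, diag_vals)       # main diagonal entries
    beta = pc(x[:-1], breaks, sub_vals)    # sub-diagonal entries
    gamma = pc(x[1:], breaks, sup_vals)    # super-diagonal entries
    y = alpha * x
    y[1:] += beta * x[:-1]                 # contribution of x_{i-1} to y_i
    y[:-1] += gamma * x[1:]                # contribution of x_{i+1} to y_i
    return y
```

Setting all off-diagonal values to zero recovers the diagonal TMAF, so the tri-diagonal operator strictly generalizes it at the cost of only two extra coefficient tables per layer.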

Remark 2.1.

Our observation also applies to activation functions other than ReLU. For example, we may rescale a given activation function σ to obtain σ_{ℓ,i}(t) = c_{ℓ,i} σ(d_{ℓ,i} t) for constants c_{ℓ,i}, d_{ℓ,i} varying layer by layer and neuron by neuron. The σ_{ℓ,i} are then used to form a matrix activation function and a TMAF DNN, where the constants are trained according to the given data and adapted to the target function. This observation may be useful for specific applications.

3. Numerical results

In this section, we demonstrate the feasibility and efficiency of TMAF by comparing it with traditional ReLU-type activation functions. In principle, all parameters in (2.3) are allowed to be trained; for simplicity, in the following we fix the intervals in (2.3) and only let the function values {a_k} vary. In each experiment, we use the same neural network structure, the same learning rates, stochastic gradient descent (SGD) optimization, and the same number NE of epochs (SGD iterations). In particular, the learning rate 1e-4 is used for the earlier epochs and 1e-5 for the remaining epochs. We provide two sets of numerical examples:

  • Function approximations by TMAF network and ReLU-type networks;

  • Classification problems for MNIST and CIFAR-10 sets solved by TMAF and ReLU networks.

For the first class of examples we use the ℓ² loss function as defined in (1.2). For the classification problems we consider the cross entropy, which is widely used as a loss function in classification models. The cross entropy is defined using a training set consisting of N images, each with p pixels. Thus we have a matrix X ∈ ℝ^{p×N}, where each column corresponds to an image with p pixels. Each image belongs to a fixed class from a set of K image classes. The network maps each column of X to a vector of K scores, one per class. More precisely, for an image x with true class label c, let s(x) ∈ ℝ^K denote the vector of network outputs. The cross entropy loss at x is then defined by

ℓ(x, c) = −log( exp(s_c(x)) / Σ_{j=1}^K exp(s_j(x)) ).

To evaluate the loss function at a given image x, we first evaluate the network at x with the given weights and biases; the predicted class of x is argmax_j s_j(x), and the loss at x is ℓ(x, c).
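The classification loss just described amounts to a standard softmax cross entropy; the following NumPy sketch (a reconstruction under our assumptions, not the paper's verbatim formula) computes the per-image loss and the predicted class:

```python
import numpy as np

def cross_entropy(scores, label):
    """Cross-entropy loss for one image: `scores` are the raw network
    outputs (one per class), `label` is the index of the true class."""
    s = scores - np.max(scores)                 # shift for numerical stability
    log_probs = s - np.log(np.sum(np.exp(s)))   # log-softmax
    return -log_probs[label]

def predicted_class(scores):
    """Predicted class: the index of the largest output score."""
    return int(np.argmax(scores))
```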

3.1. Approximation of a smooth function

As our first example, we use neural networks to approximate a smooth target function. The training datasets are 20000 input-output data pairs where the input data are randomly sampled from a hypercube. The networks (1.1) and (2.1) have one or two hidden layers with a fixed number of neurons per layer. For TMAF in (2.2), the function in (2.3) uses a fixed set of intervals. The approximation results are shown in Table 3.1 and Figures 1-3. It is observed that TMAF is the most accurate activation approach. Moreover, the parametric ReLU does not approximate the target well; see Figure 2.

Approximation error

               Single hidden layer            Two hidden layers
               1      2      3      4        5      6      7      8
ReLU           0.089  0.34   0.39   0.41     0.14   0.21   0.25   0.31
TMAF           0.015  0.016  0.13   0.18     0.07   0.105  0.153  0.17

Table 3.1. Approximation errors for the smooth target function by neural networks
Figure 1. Training errors for the smooth target function, single hidden layer
(a) ReLU
(b) TMAF
Figure 2. Neural network approximations to the smooth target function, single hidden layer
Figure 3. Training errors for the smooth target function, two hidden layers.

3.2. Approximation of an oscillatory function

The next example is on approximating a function (3.1) having both high and low frequency components; see Figure 4 for an illustration. In fact, the function in (3.1) is notoriously difficult to capture by numerical methods in scientific computing. The training datasets are 20000 input-output data pairs where the input data are randomly sampled from an interval. We test the diagonal TMAF (2.2), where the function (2.3) uses a fixed set of intervals. We also consider the tri-diagonal TMAF (2.4), where the diagonal is the same as in the diagonal TMAF and the off-diagonals are piecewise constants based on similar fixed sets of intervals. Numerical results can be found in Figures 4 and 5 and Table 3.2.

For this challenging problem, we note that the diagonal TMAF and tri-diagonal TMAF produce high-quality approximations, while ReLU and parametric ReLU are not able to approximate the highly oscillatory function with reasonable accuracy. It is observed from Figure 5 that ReLU actually approximates only the low frequency part of (3.1). To capture the high frequency, ReLU clearly has to use more neurons and thus many more weight and bias parameters. On the other hand, increasing the number of intervals in TMAF only leads to a few more training parameters.

(a) Exact oscillating function
(b) Training loss comparison
Figure 4. The oscillatory target function and training loss comparison
(a) ReLU approximation
(b) TMAF approximation
Figure 5. Approximations to the oscillatory function by neural networks
ReLU            0.97
Diag-TMAF       0.033
Tri-diag TMAF   0.029
Table 3.2. Error comparison for the oscillatory function

3.3. Classification of the MNIST and CIFAR-10 data sets

We now test TMAF by classifying images in the MNIST and CIFAR-10 data sets. For TMAF in (2.2), the function (2.3) uses a fixed set of intervals.

For the MNIST set, we implement single and double hidden layer fully connected networks (1.1) and (2.1) with a fixed number of neurons per layer (except at the first layer), activated by ReLU or the diagonal TMAF (2.2). Numerical results are shown in Figures 6 and 7 and Table 3.3. We note that TMAF with a single hidden layer ensures higher evaluation accuracy than ReLU; see Table 3.3.

(a) Training loss comparison
(b) Classification accuracy
Figure 6. MNIST: Single hidden layer
(a) Training loss
(b) Classification accuracy
Figure 7. MNIST: Two hidden layers

For the CIFAR-10 dataset, we use the ResNet18 network structure, with the layers and numbers of neurons provided by [10]. The activation functions are again ReLU and the diagonal TMAF (2.2). Numerical results are presented in Figure 8 and Table 3.3. The parameters given in [11] are already well tuned with respect to ReLU. Nevertheless, TMAF still produces smaller errors in the training process and returns better classification results in the evaluation stage.

It is possible to further improve the performance of TMAF on these benchmark datasets. The key point is to select suitable intervals in (2.3) to optimize the performance. A simple strategy is to let those intervals vary and be adjusted in the training process, which will be investigated in our future research.

(a) Training loss
(b) Classification accuracy
Figure 8. Comparison between ReLU and TMAF for CIFAR-10
Dataset                      ReLU     TMAF
MNIST (1 hidden layer)       86.1%    92.1%
MNIST (2 hidden layers)      91.8%    92.2%
CIFAR-10 (ResNet18)          92.8%    93.2%
Table 3.3. Evaluation accuracy for the MNIST and CIFAR-10