I Introduction
In a deep neural network, the activation function plays a critical role. The sigmoid function, defined by
σ(x) = 1 / (1 + exp(−x)),
was classically used as the activation function. Currently, the Rectified Linear Unit (ReLU) [1], defined by
ReLU(x) = max(0, x),
is widely used. Many activation functions have been proposed to replace the ReLU in the literature [2, 3, 4, 5, 6, 7, 8, 9, 10]. Among these, the sigmoid gating approach is considered promising. The Sigmoid-weighted Linear Unit (SiL) [3], defined by
f(x) = x σ(x),
has been proposed for reinforcement learning as an approximation of the ReLU. The so-called Swish unit [2] is defined by
f(x) = x σ(βx),
where β is a trainable parameter. The SiL can be considered a special case of the Swish with β = 1, and the Swish approaches the ReLU as β grows large. The Swish unit has been shown to outperform other activation functions in object recognition tasks. Sigmoid gating is also a key component of the powerful long short-term memory (LSTM)
[11]. Researchers have mainly focused on activation functions for a scalar input; for a vector input, the scalar activation function is applied element-wise. Recently, sigmoid gating approaches for vector inputs have been proposed [10, 12, 13]. In this paper, we propose a sigmoid gating approach as an activation function for a vector input. The proposed activation unit is called the weighted sigmoid gate unit (WiG). The WiG consists of the multiplication of the input and a weighted sigmoid gate, as shown in Fig. 2. The SiL [3] and the Swish [2] can be considered special cases of the proposed WiG.
In the literature [2], previously proposed activation functions have been evaluated only on object recognition tasks. In this paper, we evaluate the proposed WiG on both an object recognition task and an image restoration task. Experiments demonstrate that the proposed WiG outperforms existing activation functions in both tasks.
II Weighted Sigmoid Gate Unit (WiG)
The proposed weighted sigmoid gate unit (WiG) is expressed by the element-wise product of the input and a weighted sigmoid gate as

f(x) = x ⊙ σ(Wx + b), (1)

where x is the N-dimensional input vector, W is an N × N weighting matrix, b is the N-dimensional bias vector, σ(·) represents the element-wise sigmoid function, and ⊙ represents the element-wise product. Figure 2 shows a block diagram of the proposed WiG, where the bias component is omitted for simplicity. In a deep neural network, the activation function usually follows a weighting unit, and the WiG can be used as that activation function. The combination of the weighting unit and the WiG shown in Fig. 2 can be expressed as
f(Ux) = (Ux) ⊙ σ(WUx) = (Ux) ⊙ σ(Vx), (2)

where U is the weighting matrix, V = WU, and the bias components are omitted for simplicity. Equation (2) can be implemented as shown in Fig. 2. This network is similar to that in [10]. In terms of computational complexity, the two networks are the same. However, if we consider parallel computation, the network of Fig. 2 is computationally efficient. For training, the network of Fig. 2 might be better, because the statistical properties of the matrices U and V are expected to differ.
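Under the notation above, a minimal NumPy sketch of the WiG forward pass of Eq. (1) and of the pre-multiplied form of Eq. (2) might look like the following (the variable names are illustrative, not the paper's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wig(x, W, b):
    # Eq. (1): element-wise product of the input and the weighted sigmoid gate.
    return x * sigmoid(W @ x + b)

rng = np.random.default_rng(0)
n = 3
x = rng.standard_normal(n)
U = rng.standard_normal((n, n))   # preceding weighting unit
W = rng.standard_normal((n, n))   # gate weighting matrix

# Eq. (2): applying the WiG after the weighting unit U ...
y1 = wig(U @ x, W, np.zeros(n))
# ... equals gating Ux with the pre-multiplied matrix V = WU.
V = W @ U
y2 = (U @ x) * sigmoid(V @ x)
assert np.allclose(y1, y2)
```

The two forms compute the same output; the choice between them only affects how the two matrix products are scheduled.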
The derivative of the WiG in Eq. (1) can be expressed by

∂f/∂x = diag(σ(Wx + b)) + diag(x ⊙ σ(Wx + b) ⊙ (1 − σ(Wx + b))) W, (3)

which follows from the product rule and the identity σ′(z) = σ(z)(1 − σ(z)).
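This derivative of Eq. (1) can be checked numerically. The sketch below (the Jacobian expression follows from the product rule and σ′(z) = σ(z)(1 − σ(z))) compares the analytic Jacobian with central finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wig(x, W, b):
    # WiG of Eq. (1): f(x) = x ⊙ σ(Wx + b)
    return x * sigmoid(W @ x + b)

def wig_jacobian(x, W, b):
    # Analytic Jacobian: diag(σ) + diag(x ⊙ σ ⊙ (1 − σ)) W
    s = sigmoid(W @ x + b)
    return np.diag(s) + np.diag(x * s * (1.0 - s)) @ W

rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)
W = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Central finite differences, column by column.
eps = 1e-6
J_num = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    J_num[:, j] = (wig(x + e, W, b) - wig(x - e, W, b)) / (2 * eps)

assert np.allclose(wig_jacobian(x, W, b), J_num, atol=1e-6)
```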
For a convolutional neural network, the proposed WiG can be expressed by

f(x) = x ⊙ σ(w ∗ x + b), (4)

where x is the feature map, w is the convolutional kernel, b is the bias map, and ∗ represents the convolution operator.
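A minimal single-channel sketch of the convolutional WiG of Eq. (4) follows; the naive convolution and the function names are illustrative, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, k):
    # Naive single-channel "same" convolution with zero padding
    # (implemented as cross-correlation, as in deep-learning frameworks).
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def wig_conv(x, k, b=0.0):
    # Eq. (4): x ⊙ σ(w ∗ x + b) for a single feature map.
    return x * sigmoid(conv2d_same(x, k) + b)

# A delta kernel with a large gain turns the gate into an approximate
# step function, so the convolutional WiG approximates the ReLU.
k = np.zeros((3, 3))
k[1, 1] = 10.0
x = np.array([[2.0, -2.0], [1.0, -1.0]])
assert np.allclose(wig_conv(x, k), np.maximum(x, 0.0), atol=1e-3)
```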
II-A Simple Example
Here, we consider the simple scalar-input case of the WiG, namely

f(x) = x σ(wx + b). (5)

Figure 4 plots the activation functions and derivatives of the WiG in Eq. (5) for different gains w, where the bias b is set to zero. The Swish [2] is a special case of the proposed WiG when the bias equals zero. If the gain is 0, the proposed WiG is identical to a linear activation. As the gain becomes a large positive value, the WiG approaches the ReLU function. As the gain becomes a large negative value, the WiG approaches the negative ReLU, defined by min(x, 0). It has been reported that the combination of the positive and negative ReLUs is effective in improving network performance [14, 15].
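These limiting behaviors of the scalar WiG can be checked numerically; a small sketch:

```python
import numpy as np

def wig_scalar(x, w, b=0.0):
    # Eq. (5): f(x) = x · σ(wx + b)
    return x / (1.0 + np.exp(-(w * x + b)))

xs = np.linspace(-3.0, 3.0, 7)

# Gain 0: the gate is constantly σ(0) = 0.5, i.e. a linear activation.
assert np.allclose(wig_scalar(xs, 0.0), 0.5 * xs)
# Large positive gain: the WiG approximates the ReLU max(x, 0).
assert np.allclose(wig_scalar(xs, 50.0), np.maximum(xs, 0.0), atol=1e-3)
# Large negative gain: the WiG approximates the negative ReLU min(x, 0).
assert np.allclose(wig_scalar(xs, -50.0), np.minimum(xs, 0.0), atol=1e-3)
```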
II-B Initialization
A neural network is a highly nonlinear model, so parameter initialization is very important for the optimization. Weighting parameters are usually initialized with a zero-mean Gaussian distribution with small variance, and bias parameters are set to zero. However, we apply a different initialization to the WiG parameters W and b in Eq. (1). We initialize the weighting parameter W with a scaled identity matrix γI and the bias parameter b with zero, so that the WiG reduces to the Swish [2], where γ is a scale parameter and I is the identity matrix. If the scale parameter γ is a large value, the WiG is initialized as an approximation of the ReLU. This initialization with a large scale parameter is very useful for transfer learning from a network trained with the widely used ReLU. When we train a network with the WiG from scratch, the scale parameter is set to one; then, the WiG is initialized as the SiL
[3].

II-C Sparseness Constraint
Weight decay, a regularization technique for the weights, is usually applied during the optimization. It can be considered as adding the constraint

λ ||w||₂²

to the loss function, where λ is a constraint parameter, w is the weighting parameter, and || · ||₂ represents the L2 norm. We introduce a sparseness constraint for the WiG. The sparseness constraint of the WiG can be formulated as the L1 norm of the gate:

λ ||σ(Wx + b)||₁, (6)

where λ is the constraint parameter. The output of the sigmoid function can be considered a mask, and Eq. (6) is a sparseness constraint on that mask. It also enforces sparseness of the output of the WiG.
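The constraint of Eq. (6) can be added to any loss as a penalty term; a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_sparseness(x, W, b, lam):
    # Eq. (6): λ ||σ(Wx + b)||_1, the L1 norm of the gate mask.
    # The sigmoid output is non-negative, so the L1 norm is a plain sum.
    return lam * np.sum(sigmoid(W @ x + b))

# With W = 0 and b = 0 every gate is exactly 0.5,
# so the penalty equals λ · 0.5 · n.
n, lam = 4, 0.1
x = np.ones(n)
penalty = gate_sparseness(x, np.zeros((n, n)), np.zeros(n), lam)
assert np.isclose(penalty, lam * 0.5 * n)
```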
III Experiments
We experimentally demonstrate the effectiveness of the proposed WiG activation function by comparing it with existing activation functions: the ReLU [1], the selu [5], the ELU [8], the softplus [4], the Leaky ReLU [6], the SiL [3], the PReLU [7], and the Swish [2]. Several of these activation functions require parameters; for those, we used the default parameters from the original papers. Related works [2, 3] evaluated only on an object recognition task, while we evaluate the proposed WiG on both the object recognition task and the image restoration task.^1

^1 The reproduction code is available at http://www.ok.sc.e.titech.ac.jp/~mtanaka/proj/WiG/.
TABLE I: Validation accuracy for cifar10 and cifar100 with different activation functions.

Activation      | cifar10 | cifar100
ReLU [1]        | 0.927   | 0.653
selu [5]        | 0.899   | 0.572
ELU [8]         | 0.903   | 0.550
softplus [4]    | 0.908   | 0.598
Leaky ReLU [6]  | 0.918   | 0.673
SiL [3]         | 0.919   | 0.638
PReLU [7]       | 0.935   | 0.678
Swish [2]       | 0.935   | 0.689
WiG (Pro.)      | 0.949   | 0.742
III-A Object Recognition Task
Here, we evaluate the activation functions with the cifar datasets [16]. For the object recognition task, we build a VGG-like network [17] as shown in Fig. 5, where we use spatial dropout [18] and convolution pooling [19] instead of max pooling. The dropout rate is increased for deeper layers close to the output, following [20]. We built networks with the nine activation functions mentioned above. Each network was trained for 1,200 epochs with a mini-batch size of 32, applying geometric and photometric data augmentation. We used Adamax [21] for the optimization, and the loss function is the categorical cross entropy.

Figure 6 shows the training cross entropy for each epoch. The comparison demonstrates that the network with the proposed WiG learns rapidly and achieves a lower training cross entropy. The validation accuracies for cifar10 and cifar100 with different activation functions are summarized in Table I, where bold font represents the best performance and underline represents accuracy higher than that of the ReLU. The simple ReLU [1] already performs well, but the proposed WiG demonstrates the best performance on both cifar10 and cifar100. The improvement of the proposed WiG is large, especially on the cifar100 dataset.
IiiB Image Restoration Task
We also compare the activation functions on an image restoration task. Among the various image restoration tasks, denoising is known as an important and essential one [22]. Therefore, we evaluate the activation functions on an image denoising task.
TABLE II: PSNR comparison for different noise levels.

Noise Level     | 5     | 10    | 15    | 20    | 25    | 30
ReLU [1]        | 37.13 | 33.86 | 32.02 | 30.76 | 29.77 | 28.96
selu [5]        | 36.89 | 33.68 | 31.88 | 30.60 | 29.60 | 28.79
ELU [8]         | 37.07 | 33.63 | 31.86 | 30.60 | 29.65 | 28.86
softplus [4]    | 36.41 | 33.42 | 31.67 | 30.45 | 29.50 | 28.69
Leaky ReLU [6]  | 37.07 | 33.81 | 31.97 | 30.68 | 29.69 | 28.89
Swish [2]       | 37.13 | 33.70 | 31.91 | 30.69 | 29.72 | 28.93
WiG (Pro.)      | 37.29 | 34.00 | 32.16 | 30.88 | 29.90 | 29.10
TABLE III: SSIM comparison for different noise levels.

Noise Level     | 5      | 10     | 15     | 20     | 25     | 30
ReLU [1]        | 0.9383 | 0.8971 | 0.8646 | 0.8378 | 0.8142 | 0.7932
selu [5]        | 0.9355 | 0.8929 | 0.8607 | 0.8328 | 0.8084 | 0.7855
ELU [8]         | 0.9362 | 0.8910 | 0.8601 | 0.8331 | 0.8094 | 0.7888
softplus [4]    | 0.9337 | 0.8878 | 0.8547 | 0.8274 | 0.8041 | 0.7822
Leaky ReLU [6]  | 0.9380 | 0.8959 | 0.8628 | 0.8350 | 0.8104 | 0.7888
Swish [2]       | 0.9377 | 0.8928 | 0.8621 | 0.8370 | 0.8137 | 0.7928
WiG (Pro.)      | 0.9390 | 0.8993 | 0.8679 | 0.8412 | 0.8188 | 0.7981
We built image denoising networks with different activation functions, following [23, 24, 25]. Dilated convolutions [26] and skip connections [27] are used, as shown in Fig. 7. The training datasets were Yang91 [28], General100 [29], and Urban100 [30]. The training patch size was . The mini-batch size was 256. The input images for training were generated by adding noise to the ground-truth images, where the standard deviation was randomly set between 0 and 55. The optimizer was Adamax [21]. We trained 80,000 mini-batches for each activation function.
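The noisy training inputs described above can be synthesized as follows; this is a sketch under the stated noise model (the helper name and the 8-bit intensity scale are assumptions, as the paper does not detail its data pipeline):

```python
import numpy as np

def make_noisy_pair(clean, rng):
    # Draw a noise standard deviation uniformly from [0, 55] and add
    # white Gaussian noise to the ground-truth patch.
    sigma = rng.uniform(0.0, 55.0)
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
    return noisy, clean, sigma

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 255.0, size=(64, 64))  # stand-in ground truth
noisy, target, sigma = make_noisy_pair(clean, rng)
assert noisy.shape == target.shape
assert 0.0 <= sigma <= 55.0
```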
We used seven standard images (Lena, Barbara, Boats, F.print, Man, Couple, and Hill) for the evaluation [31]. The PSNR and SSIM comparisons are summarized in Tables II and III, respectively, where bold font represents the best performance and underline represents performance higher than that of the ReLU. These comparisons demonstrate that the activation functions proposed in the literature [5, 8, 4, 6, 2] cannot outperform the simple ReLU [1] at every noise level in the denoising task. Only the proposed WiG activation function outperforms the ReLU in terms of both PSNR and SSIM.
IV Conclusion
We have proposed the weighted sigmoid gate unit (WiG) as an activation function for deep neural networks. The proposed WiG consists of the multiplication of the input and a weighted sigmoid gate. We have shown that the WiG includes the ReLU and several activation functions proposed in the literature as special cases. We evaluated the WiG on an object recognition task and an image restoration task, and the experimental comparisons demonstrate that the proposed WiG outperforms existing activation functions, including the widely used ReLU.
References

[1] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010, pp. 807–814.
[2] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017.
[3] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” arXiv preprint arXiv:1702.03118, 2017.
[4] V. Nair and G. Hinton, “Rectified linear units improve restricted Boltzmann machines,” International Conference on Machine Learning (ICML), 2017.
[5] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 972–981.
[6] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in International Conference on Machine Learning (ICML), vol. 30, no. 1, 2013, p. 3.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[8] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” International Conference on Learning Representations (ICLR), 2016.
[9] K. Konda, R. Memisevic, and D. Krueger, “Zero-bias autoencoders and the benefits of co-adapting features,” International Conference on Learning Representations (ICLR), 2015.
[10] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with PixelCNN decoders,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 4790–4798.
[11] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[12] Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, and R. R. Salakhutdinov, “On multiplicative integration with recurrent neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 2856–2864.
[13] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” arXiv preprint arXiv:1612.08083, 2016.
[14] J. Kim, O. Sangjun, Y. Kim, and M. Lee, “Convolutional neural network with biologically inspired retinal structure,” Procedia Computer Science, vol. 88, pp. 145–154, 2016.
[15] K. Uchida, M. Tanaka, and M. Okutomi, “Coupled convolution layer for convolutional neural network,” Neural Networks, vol. 105, pp. 197–205, 2018.
[16] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, Department of Computer Science, University of Toronto, 2009.
[17] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” International Conference on Learning Representations (ICLR), 2015.
[18] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object localization using convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 648–656.
[19] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” International Conference on Learning Representations (ICLR) Workshop Track, 2015.
[20] M. Ishii and A. Sato, “Layer-wise weight decay for deep neural networks,” Meeting on Image Recognition and Understanding (MIRU), 2016.
[21] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations (ICLR), 2015.
[22] X. Liu, M. Tanaka, and M. Okutomi, “Single-image noise level estimation for blind denoising,” IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 5226–5237, 2013.
[23] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[24] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3929–3938.
[25] K. Uchida, M. Tanaka, and M. Okutomi, “Non-blind image restoration based on convolutional neural network,” arXiv preprint arXiv:1809.03757, 2018.
[26] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” International Conference on Learning Representations (ICLR), 2016.
[27] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
[28] J. Yang, J. Wright, T. Huang, and Y. Ma, “Image super-resolution as sparse representation of raw image patches,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[29] C. Dong, C. Change Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in European Conference on Computer Vision (ECCV), 2016.
[30] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5197–5206.
[31] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising with block-matching and 3D filtering,” in Image Processing: Algorithms and Systems, Neural Networks, and Machine Learning, vol. 6064, International Society for Optics and Photonics, 2006, p. 606414.