Deep neural networks have become a powerful tool in machine learning and have achieved remarkable success in many computer vision and image processing tasks, including classification, semantic segmentation [2] and object detection [3, 4]. After the breakthrough result in the ImageNet classification challenge [1], many neural network architectures have been proposed and performance has improved year by year. In 2014, GoogLeNet [5] seemed to reach the performance ceiling of the traditional feedforward neural network structure, where units are connected only to those in the following layer. After that, new neural network structures were proposed. Examples include ResNet [6], DenseNet [7] and CliqueNet [8], where skip connections and dense connections are adopted to ease network training and further push the state of the art.
Although the new structures lead to a significant improvement over the traditional feedforward structure, designing effective neural network structures seems to require a profound understanding of practical neural networks and a substantial amount of trial and error. We therefore believe that the design of neural network structures needs unified guidance. This paper serves as a preliminary attempt towards this goal.
1.1 Related Work
There has been extensive work on neural network structure design. Genetic algorithm based approaches [9, 10] were proposed to find both architectures and weights in the early stage of neural network design. However, networks designed with genetic algorithms perform worse than hand-crafted ones [11]. Saxena and Verbeek [12] proposed a “Fabric” to sidestep the CNN model architecture selection problem, and it performs close to hand-crafted networks. Domhan et al. [13] used Bayesian optimization for network architecture selection and Bergstra et al. [14] used a meta-modeling approach to choose the type of layers and hyperparameters. Kwok and Yeung [15], Ma and Khorasani [16] and Cortes et al. [17] used an adaptive strategy that grows the network structure layer by layer from a small network based on some principles, e.g., Cortes et al. [17] minimized some loss value to balance the model complexity and the empirical risk minimization. Baker et al. [18] and Zoph and Le [19] used reinforcement learning to search the neural network architecture. All of these are basically heuristic search based approaches. It is difficult for them to produce effective neural networks if the computing power is insufficient or the search strategy is inefficient, as the search space is huge. A DNN structure designed via minimizing some loss value is only best for the given data. It may not generalize to other datasets if no regular structure exists in the network. Although some recently proposed methods that utilize a recurrent neural network and a reinforcement learning scheme also achieve impressive results [19, 20], they differ from ours in that they lack explicit guidance to indicate where the connections should appear.
In this paper, we design neural network structures based on inspiration from optimization algorithms. Our idea is motivated by recent work in the compressive sensing community. Traditional methods for compressive sensing formulate a well-defined optimization problem and employ iterative algorithms to solve it, e.g., the ISTA algorithm [21]. The iterative algorithms often need many iterations to converge and thus suffer from high computational complexity. Gregor and LeCun [22], Xin et al. [23], Kulkarni et al. [24], Zhang and Ghanem [25] and Yang et al. [26] developed a series of neural network based methods for compressive sensing. Their main idea is to train a non-linear feedforward neural network with a fixed depth. At each layer, a linear transformation is applied to the input and then a nonlinear transformation follows. In traditional optimization based compressive sensing, the linear transformation is fixed. In contrast, in neural network based compressive sensing, the linear transformation is learnable, so each layer has a different linear transformation matrix. Neural network based compressive sensing often needs much less computation than the optimization based methods.
Since ISTA is arguably the most popular algorithm for compressive sensing, most of the existing neural network based methods [22, 23, 24] are inspired by ISTA and thus have a feedforward structure. A FISTA-net has also been proposed by adding a skip connection to the feedforward structure. However, all these networks are for image reconstruction, based on the compressive sensing model. A design methodology of deep neural networks for image recognition tasks is still lacking.
In this paper, we study the design of neural network structures for image recognition tasks (we only focus on the part before SoftMax, as SoftMax will be connected to all networks in order to produce label information). To make our network structure easy to generalize to other datasets, our methodology separates the structure design and the weight search, i.e., we do not consider the optimal weights in the structure design stage. The optimal weights will be searched via training after the structure design. Our methodology is inspired by optimization algorithms. Specifically, our contributions include:
For the standard feedforward neural network that shares the same linear transformation and nonlinear activation function at different layers, we prove that the propagation in the neural network is equivalent to using the gradient descent algorithm to minimize some function. As a comparison, the neural network based compressed sensing only studied the soft thresholding as the nonlinear activation function and the goal of designing the network is to solve the compressive sensing problem as accurately as possible.
Based on the above observation, we propose the hypothesis that a faster optimization algorithm may inspire a better neural network structure. In particular, we give the neural network structures inspired by the heavy ball algorithm and Nesterov’s accelerated gradient algorithm, which include ResNet and DenseNet as two special cases.
Numerical experiments on CIFAR-10, CIFAR-100 and ImageNet verify that the optimization algorithm inspired neural network structures outperform ResNet and DenseNet. These results suggest that our methodology is promising.
Our methodology is still preliminary. Although we have shown to some degree the connection between faster optimization algorithms and better deep neural networks, we have not yet revealed the connection between optimization algorithms and DNN structure design in a theoretically rigorous way. It is an analogy. However, an analogy is not necessarily unsound. For example, the DNN is inspired by the brain. It is also an analogy and has no strict connection to the brain either. However, no one can say that DNNs are insignificant or ineffective.
2 Reviews of Some Optimization Algorithms
In this section, we review the gradient descent (GD) algorithm [27], the heavy ball (HB) algorithm [28], Nesterov’s accelerated gradient descent (AGD) algorithm [29] and the Alternating Direction Method of Multipliers (ADMM) [30, 31] to solve the general optimization problem $\min_x f(x)$.
The gradient descent algorithm is one of the most popular algorithms in practice. It consists of the following iteration (for direct use in our network design, we fix the stepsize to 1, which can be obtained by scaling the objective function such that the Lipschitz constant of $\nabla f(x)$ is 1):
$$x_{k+1} = x_k - \nabla f(x_k). \tag{1}$$
The heavy ball algorithm is a variant of the gradient descent algorithm, where a momentum term is added after the gradient descent step:
$$x_{k+1} = x_k - \nabla f(x_k) + \beta (x_k - x_{k-1}). \tag{2}$$
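Nesterov’s accelerated gradient algorithm follows a similar idea to the heavy ball algorithm, but it uses the momentum in another way:
$$y_k = x_k + \frac{\theta_k (1 - \theta_{k-1})}{\theta_{k-1}} (x_k - x_{k-1}), \qquad x_{k+1} = y_k - \nabla f(y_k), \tag{3}$$
where $\theta_k$ is computed via $\frac{1 - \theta_k}{\theta_k^2} = \frac{1}{\theta_{k-1}^2}$ and $\theta_0 = 1$ (when $f(x)$ is $\mu$-strongly convex and its gradient is $L$-Lipschitz continuous, $\frac{\theta_k (1 - \theta_{k-1})}{\theta_{k-1}}$ is fixed at $\frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}$). When $f(x)$ is $\mu$-strongly convex (i.e., $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2$) and its gradient is $L$-Lipschitz continuous (i.e., $\|\nabla f(y) - \nabla f(x)\| \le L \|y - x\|$), the heavy ball algorithm and Nesterov’s accelerated gradient algorithm can find an $\epsilon$-accuracy solution in $O\left(\sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon}\right)$ iterations, while the gradient descent algorithm needs $O\left(\frac{L}{\mu} \log \frac{1}{\epsilon}\right)$ iterations. Iteration (3) has an equivalent form of
$$y_{k+1} = y_k - \sum_{j=0}^{k} h_{k+1}^j \nabla f(y_j), \tag{4}$$
where, writing $\beta_{k+1} = \frac{\theta_{k+1} (1 - \theta_k)}{\theta_k}$,
$$h_{k+1}^j = \begin{cases} \beta_{k+1} h_k^j, & j = 0, \dots, k-2, \\ \beta_{k+1} (h_k^{k-1} - 1), & j = k-1, \\ 1 + \beta_{k+1}, & j = k. \end{cases} \tag{5}$$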
ADMM and its linearized version can also be used to minimize $f(x)$ after reformulating the problem with an auxiliary variable and an equality constraint. Linearized ADMM consists of the following steps (for direct use in our network design, we fix the penalty parameter to 1):
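The difference in convergence speed reviewed in this section can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes a hypothetical strongly convex quadratic whose Hessian eigenvalues lie in $[\mu, 1]$ (so the unit stepsize is valid), and the momentum values `beta_hb` and `beta_agd` are choices made for this illustration only.

```python
import numpy as np

def make_quadratic(dim=20, mu=0.01, seed=0):
    """Build f(x) = 0.5 x^T A x - b^T x with eigenvalues of A in [mu, 1]."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    a = q @ np.diag(np.linspace(mu, 1.0, dim)) @ q.T
    b = rng.standard_normal(dim)
    x_star = np.linalg.solve(a, b)  # exact minimizer, used only to measure error
    return (lambda x: a @ x - b), x_star

def gradient_descent(grad, x0, iters):
    x = x0.copy()
    for _ in range(iters):
        x = x - grad(x)  # unit stepsize
    return x

def heavy_ball(grad, x0, iters, beta):
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x_next = x - grad(x) + beta * (x - x_prev)  # gradient step plus momentum
        x_prev, x = x, x_next
    return x

def nesterov(grad, x0, iters, beta):
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + beta * (x - x_prev)  # extrapolate first
        x_prev, x = x, y - grad(y)   # then take the gradient step at y
    return x

MU = 0.01
grad, x_star = make_quadratic(mu=MU)
x0 = np.zeros_like(x_star)
beta_hb = (1.0 - np.sqrt(MU)) ** 2                       # assumed choice for unit stepsize
beta_agd = (1.0 - np.sqrt(MU)) / (1.0 + np.sqrt(MU))     # fixed momentum with L = 1

errors = {
    "GD": np.linalg.norm(gradient_descent(grad, x0, 200) - x_star),
    "HB": np.linalg.norm(heavy_ball(grad, x0, 200, beta_hb) - x_star),
    "AGD": np.linalg.norm(nesterov(grad, x0, 200, beta_agd) - x_star),
}
for name, err in errors.items():
    print(f"{name}: distance to minimizer after 200 iterations = {err:.3e}")
```

On this kind of ill-conditioned problem the two momentum methods should end much closer to the minimizer than plain gradient descent after the same number of iterations, which is the behavior the complexity bounds describe.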
3 Modeling the Propagation in Feedforward Neural Network
In the standard feedforward neural network, the propagation from the first layer to the last layer can be expressed as:
$$x_{k+1} = \Phi(W_k x_k), \tag{7}$$
where $x_k$ is the output of the $k$-th layer, $\Phi$ is the activation function such as the sigmoid and ReLU, and $W_k$ is a linear transformation. As claimed in Section 1.3, we do not consider the optimal weights during the structure design stage. Thus, we fix the matrix $W_k$ as $W$ to simplify the analysis.
Suppose $W$ is a symmetric and positive definite matrix (this assumption is just for building the connection between network design and optimization algorithms; $W$ will be learnt from data once the structure of the network is fixed). Let $U = \sqrt{W}$. Then there exists a function $f(x)$ such that (7) is equivalent to minimizing $F(x) = f(Ux)$ using the following steps:
Define a new variable $z = Ux$,
Using (1) to minimize $f(z)$,
Recovering $x_0, x_1, \dots, x_k$ from $z_0, z_1, \dots, z_k$ via $x = U^{-1} z$.
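We can find $\Psi(z)$ such that $\Psi'(z) = \Phi(z)$ for the commonly used activation functions $\Phi(z)$. Then we have $\nabla_x \sum_i \Psi(U_i^T x) = U \Phi(Ux)$. So if we let
$$f(x) = \frac{\|x\|^2}{2} - \sum_i \Psi(U_i^T x), \tag{8}$$
where $U_i$ is the $i$-th column of $U$, then we have
$$\nabla f(x) = x - U \Phi(Ux), \qquad \text{so that (1) applied to } f(z) \text{ gives} \qquad z_{k+1} = z_k - \nabla f(z_k) = U \Phi(U z_k).$$
Now we define a new function
$$F(x) = f(Ux).$$
Let $z = Ux$ for variable substitution and minimize $f(z)$ to obtain a sequence $z_0, z_1, \dots, z_k$. Then we use this sequence to recover $x_0, x_1, \dots, x_k$ by $x = U^{-1} z$, which leads to
$$x_{k+1} = U^{-1} z_{k+1} = \Phi(U z_k) = \Phi(U^2 x_k) = \Phi(W x_k).$$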
We list the objective function for the commonly used activation functions in Table 1.
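| Activation function | Optimization objective $f(x)$ |
|---|---|
| Leaky ReLU | $\frac{\lVert x \rVert^2}{2} - \sum_i \Psi(U_i^T x)$, where $\Psi(z) = \frac{z^2}{2}$ if $z > 0$ and $\Psi(z) = \frac{\alpha z^2}{2}$ if $z \le 0$ |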
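The equivalence between feedforward propagation and gradient descent can be checked numerically. The sketch below is an illustration under stated assumptions: the matrix `w` is a randomly generated symmetric positive definite matrix, the sigmoid is used as the activation, and the helper names (`spd_sqrt`, `feedforward`, `gd_on_f`) are hypothetical. It runs unit-stepsize gradient descent on $f(z)$ using the gradient $z - U\Phi(Uz)$ and maps the iterates back through $x = U^{-1}z$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def spd_sqrt(w):
    """Symmetric square root U of a symmetric positive definite matrix W."""
    vals, vecs = np.linalg.eigh(w)
    return (vecs * np.sqrt(vals)) @ vecs.T

def feedforward(w, x0, depth, phi=sigmoid):
    """Propagation x_{k+1} = phi(W x_k) with the same W in every layer."""
    xs = [x0]
    for _ in range(depth):
        xs.append(phi(w @ xs[-1]))
    return xs

def gd_on_f(u, x0, depth, phi=sigmoid):
    """Gradient descent on f(z) with z = U x, mapped back via x = U^{-1} z."""
    z = u @ x0
    xs = [x0]
    for _ in range(depth):
        grad = z - u @ phi(u @ z)         # gradient of f at z
        z = z - grad                      # unit-stepsize gradient step
        xs.append(np.linalg.solve(u, z))  # recover x from z
    return xs

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
w = q @ np.diag(np.linspace(0.5, 1.5, 6)) @ q.T
w = 0.5 * (w + w.T)                       # remove round-off asymmetry
u = spd_sqrt(w)
x0 = rng.standard_normal(6)

layers = feedforward(w, x0, depth=5)
iterates = gd_on_f(u, x0, depth=5)
max_gap = max(np.max(np.abs(a - b)) for a, b in zip(layers, iterates))
print("largest difference between the two sequences:", max_gap)
```

The two sequences should agree up to floating-point round-off, layer by layer.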
4 From GD to Other Optimization Algorithms
As shown in Section 3, the propagation in the general feedforward neural network can be seen as using the gradient descent algorithm to minimize some function $F(x)$. In this section, we consider using other algorithms to minimize the same function $F(x)$.
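We first consider iteration (2) and follow the same three steps as in Section 3.
Variable substitution $z = Ux$.
Using (2) to minimize $f(z)$ defined in (8):
$$z_{k+1} = z_k - \nabla f(z_k) + \beta (z_k - z_{k-1}) = U \Phi(U z_k) + \beta (z_k - z_{k-1}).$$
Recovering $x$ from $z$ via $x = U^{-1} z$:
$$x_{k+1} = \Phi(W x_k) + \beta (x_k - x_{k-1}). \tag{11}$$
Now we consider iteration (3). Following the same three steps, we have:
$$x_{k+1} = \Phi\left(W \left(x_k + \beta_k (x_k - x_{k-1})\right)\right), \tag{12}$$
where $\beta_k = \frac{\theta_k (1 - \theta_{k-1})}{\theta_{k-1}}$. Then we use iteration (4) to minimize $F(x)$. Following the same three steps, we have
$$x_{k+1} = \sum_{j=0}^{k} h_{k+1}^j \Phi(W x_j) + x_k - \sum_{j=0}^{k} h_{k+1}^j x_j, \tag{13}$$
where $h_{k+1}^j$ are the coefficients of iteration (4).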
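The relation between the momentum form of Nesterov's method and a form that only accumulates past gradients can be checked numerically. The sketch below is an illustration under assumptions, not the authors' code: it assumes the gradient-accumulating form is written in terms of the extrapolated points, $y_{k+1} = y_k - \sum_j h_{k+1}^j \nabla f(y_j)$, with the recursion for $h_{k+1}^j$ derived directly from iteration (3). The helper names and the test quadratic are hypothetical.

```python
import numpy as np

def thetas(n):
    """theta_0 = 1 and (1 - theta_k) / theta_k^2 = 1 / theta_{k-1}^2."""
    th = [1.0]
    for _ in range(n):
        p = th[-1] ** 2
        th.append((-p + np.sqrt(p * p + 4.0 * p)) / 2.0)
    return th

def betas(n):
    """beta_k = theta_k (1 - theta_{k-1}) / theta_{k-1}, with beta_0 = 0."""
    th = thetas(n)
    return [0.0] + [th[k] * (1.0 - th[k - 1]) / th[k - 1] for k in range(1, n + 1)]

def agd_momentum_form(grad, x0, steps):
    """Extrapolated points y_0..y_steps produced by the momentum form."""
    beta = betas(steps)
    x_prev, x = x0, x0
    ys = []
    for k in range(steps + 1):
        y = x + beta[k] * (x - x_prev)
        ys.append(y)
        x_prev, x = x, y - grad(y)
    return ys

def agd_gradient_sum_form(grad, x0, steps):
    """Same points, computed only from weighted sums of past gradients."""
    beta = betas(steps)
    ys, grads, h = [x0], [], []
    for k in range(steps):
        grads.append(grad(ys[k]))
        b = beta[k + 1]
        new_h = [b * hj for hj in h]          # entries j <= k-2 are scaled by beta_{k+1}
        if k >= 1:
            new_h[k - 1] = b * (h[k - 1] - 1.0)  # entry j = k-1
        new_h.append(1.0 + b)                 # entry j = k
        h = new_h
        ys.append(ys[k] - sum(hj * g for hj, g in zip(h, grads)))
    return ys

rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
a = q @ np.diag(np.linspace(0.05, 1.0, 8)) @ q.T
b_vec = rng.standard_normal(8)
grad = lambda x: a @ x - b_vec            # gradient of a test quadratic
x0 = rng.standard_normal(8)

ys_momentum = agd_momentum_form(grad, x0, 15)
ys_sum = agd_gradient_sum_form(grad, x0, 15)
print("max gap:", max(np.max(np.abs(p - s)) for p, s in zip(ys_momentum, ys_sum)))
```

The two forms agree in exact arithmetic; in floating point they can drift apart slightly, which is relevant when the two corresponding network structures are compared.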
5 Hypothesis: Faster Optimization Algorithm May Imply Better Network
In this section, we consider the general representation learning task: Given a set of data points $\{(x_0^i, l_i)\}$, where $x_0^i$ is the $i$-th data point and $l_i$ is its label, we want to find a deep neural network which can learn the best feature $\mathbf{f}_i$ for each $x_0^i$, which exists in theory but is unknown in reality, such that $\mathbf{f}_i$ can perfectly predict $l_i$. For simplicity, we assume that $x_0^i$ and $\mathbf{f}_i$ have the same dimension. As clarified at the beginning of Section 1.3, in this paper we only consider the learning model from $x_0^i$ to $\mathbf{f}_i$ and do not consider the prediction model from $\mathbf{f}_i$ to $l_i$. We do not consider the optimal weights during the structure design stage and thus we study a simplified neural network model with the same linear transformation in different layers. Actually, this corresponds to recurrent neural networks [32]. In this section we use $\{x_0, \mathbf{f}\}$ instead of $\{x_0^i, \mathbf{f}_i\}$ for simplicity.
5.1 Same Linear Transformation in Different Layers
We first consider the simplified neural network model with the same linear transformation in different layers. As stated in Section 3, the propagation in the standard feedforward network can be seen as using the gradient descent algorithm to minimize some function $F(x)$. Assume that there exists a special function $F(x)$ with some parameter $U$ dependent on $\{x_0, \mathbf{f}\}$ such that $\mathbf{f} = \arg\min_x F(x)$. Now we use different algorithms to minimize this $F(x)$ and we want to find the minimizer of $F(x)$ via as few iterations as possible.
When we use the gradient descent algorithm to minimize this $F(x)$ with initializer $x_0$, the iterative procedure is equivalent to (7), which corresponds to the propagation in the feedforward neural network characterized by the parameter $U$ discussed above. Let $\hat{\mathbf{f}}$ be the output of this feedforward neural network. As is known, the gradient descent algorithm needs $O\left(\frac{L}{\mu} \log \frac{1}{\epsilon}\right)$ iterations to reach an $\epsilon$-accuracy solution, i.e., $\|\hat{\mathbf{f}} - \mathbf{f}\| \le \epsilon$. In other words, this feedforward neural network needs $O\left(\frac{L}{\mu} \log \frac{1}{\epsilon}\right)$ layers for an $\epsilon$-accuracy prediction.
When we use some faster algorithm to minimize this $F(x)$, e.g., the heavy ball algorithm and Nesterov’s accelerated gradient algorithm, their iterative procedures are equivalent to (11) and (13), respectively, and the new algorithms will need $O\left(\sqrt{\frac{L}{\mu}} \log \frac{1}{\epsilon}\right)$ iterations for $\|\hat{\mathbf{f}} - \mathbf{f}\| \le \epsilon$. That is, the networks corresponding to faster algorithms (also characterized by the same $U$ but with different structures) will need fewer layers than the feedforward neural network discussed above.
We define a network with fewer layers to reach the same approximation accuracy as a better network. As is known, training a deep neural network model is a nonconvex optimization problem and it is NP-hard to reach its global minima. The training process becomes more difficult when the network becomes deeper. So if we can find a network with fewer layers and no loss of approximation accuracy, it will make the training process much easier.
5.2 Different Linear Transformation in Different Layers
In the previous discussion, we require the linear transformation in different layers to be the same. This is not true except in recurrent neural networks and is only for theoretical explanation. Now we allow each layer to have a different transformation.
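Assume that we have a network that is inspired by using some optimization algorithm to minimize $F(x)$, where its $k$-th layer has the operation of (7), (11), (12), (13) or (14). Denote its final output as $\mathrm{Net}_U(x_0)$. Then for a network with finitely many layers, we have $\mathbf{f} \approx \mathrm{Net}_U(x_0)$.
Now we relax the parameter $U$ to be different in different layers. Then the output can be rewritten as $\mathrm{Net}_{U_1, \dots, U_n}(x_0)$. Now we can use the following model to learn the parameters with a fixed network structure:
$$\min_{U_1, \dots, U_n} \left\| \mathbf{f} - \mathrm{Net}_{U_1, \dots, U_n}(x_0) \right\|^2, \tag{15}$$
and denote $\hat{U}_1, \dots, \hat{U}_n$ as the solution. Then we have $\| \mathbf{f} - \mathrm{Net}_{\hat{U}_1, \dots, \hat{U}_n}(x_0) \|^2 \le \| \mathbf{f} - \mathrm{Net}_U(x_0) \|^2$, which means that different linear transformations in different layers will not make the network worse. In fact, model (15) is the general training model for a neural network with a fixed structure.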
5.3 Simulation Experiment
| depth | ADMM (14) | GD (7) | HB (11) | AGD (12) | AGD2 (13) |
In this section, we verify our hypothesis that the network structures inspired by faster algorithms may be better than the ones inspired by slower algorithms.
We compare five neural network structures, which are inspired by the gradient descent algorithm, the heavy ball algorithm, ADMM and two variants of Nesterov’s accelerated gradient algorithm, and which have the operation of (7), (11), (14), (12) and (13) at each layer, respectively. We fix the momentum coefficient $\beta$ in (11), and the parameters of (12) and (13) are exactly the same as those of their corresponding optimization algorithms. We use the sigmoid function for $\Phi$, and $W$ is a full-connection linear transformation. Then we use model (15) to train the parameters of each layer under the fixed network structures (the optimization algorithms presented in Section 2 are unrelated to the algorithm used to train model (15); they are only for designing network structures). We generate 10,000 random pairs of $(x_0, \mathbf{f})$ as the training data. Each $x_0$ and $\mathbf{f}$ has a dimension of 100. We use $x_0$ as the input of the network and use its output to fit $\mathbf{f}$. We report the Mean Squared Error (MSE) loss value of the five aforementioned models with different depths after training for 1,000 epochs.
Table 2 shows the experimental results. We can see that the HB, AGD and AGD2 inspired neural network structures perform better than the GD inspired network structure. This corresponds to the fact that the HB algorithm and the AGD algorithm have a better theoretical convergence rate than the GD algorithm. The ADMM inspired network performs the worst. In fact, although ADMM has been widely used in practice, it does not have a faster theoretical convergence rate than GD. We also observe that the MSEs of the GD, HB and AGD inspired networks do not always decrease when the depth increases. This means that deeper networks with GD, HB and AGD inspired structures are harder to train. However, the AGD2 inspired networks can still be efficiently trained with a larger depth when the GD, HB and AGD inspired structures fail. AGD2 may perform better because it has better numerical stability, although it is theoretically equivalent to AGD if there is no numerical error. This phenomenon is yet to be further explored.
6 Engineering Implementation
In the above section, we hypothesize and verify that faster optimization algorithm inspired neural network structure may need fewer layers without accuracy loss. In this section, we consider the practical implementations in engineering. Specifically, we consider the network structures inspired by algorithm iterations (7), (11), (12), (13) and (14).
We define the following three meta operations for practical implementation.
Relax $\Phi$ and $W$. We use $Wx$ as the linear transformation with full-connection in Section 4, which is the product of a matrix and a vector. We may relax it to the convolution operation, which is also a linear transformation. Moreover, different layers may have different weight matrices $W_k$, and $W_k$ may not be a square matrix, thus the dimensions of input and output may be different.
$\Phi$ is the nonlinear transformation defined by the activation function. It can also be relaxed to pooling and batch normalization (BN). Moreover, $\Phi(Wx)$ can be a composite of nonlinear activation, pooling, BN, convolution or full-connection linear transformation. Using different combinations of these operations, the network structure (7) covers many famous CNNs, e.g., LeNet [33] and VGG [34]. The activation function can also be different for different layers, e.g., be learnable [35].
In the following discussions, we replace $\Phi(Wx)$ with the operator $T(x)$ for more flexibility.
Adaptive coefficients. Inspired by (11), (12) and (13), we can design some practical neural network structures. However, the coefficients in these formulations may be impractical. So we keep the structure inspired by these formulations but allow the coefficients to have other values or even be learnable. Specifically, we rewrite (11) as
where the coefficients can be set as any constants, e.g., 0, which means that we drop the corresponding term. They can also be co-optimized with the training of the network’s weights.
We demonstrate the structures of (17) and (18) in Figure 1(b) and 1(c). From Figure 1(b) we can see that it first makes a combination of $x_k$ and $x_{k-1}$, and then the operation $T$ follows. In Figure 1(c), $x_{k+1}$ combines all of $x_0, \dots, x_k$ and $T(x_0), \dots, T(x_k)$.
It is known that ResNet adds an additional skip connection that bypasses the non-linear transformations with an identity transformation:
$$x_{k+1} = T(x_k) + x_k.$$
The structure of ResNet can be recovered from (16) by a suitable setting of its coefficients. With the analogous setting in (17), the structure is similar to that of ResNet. The difference is that (16) performs the operation $T$ before adding the skip connection while (17) does it after the skip connection.
DenseNet is an extension of ResNet, which connects each layer with all its following layers. Consequently, each layer receives the feature maps of all preceding layers as its input, and produces
$$x_{k+1} = T([x_0, x_1, \dots, x_k]),$$
where $[x_0, x_1, \dots, x_k]$ refers to the concatenating operation. (18) recovers the structure of DenseNet by a suitable setting of its coefficients.
We can also rewrite (14) as
Different from the previous network structures, the ADMM inspired network has two parallel paths, which makes the network wider. It also recovers the DMRNet [36], which can be expressed as
We list the optimization algorithms that relate to the commonly used existing network structures in Table 3.
Block based structure. (16), (17) and (18) are not applicable when the size of the feature maps changes, especially when down-sampling is used. To facilitate down-sampling, we can divide the network into multiple connected blocks. The formulations (16), (17) and (18) are used within each block. Such an operation is used in both ResNet and DenseNet.
| Algorithm | Network Structure | Transforming Setting |
|---|---|---|
| HB (2) | ResNet | in (16) |
| AGD (4) | DenseNet | in (18) |
| ADMM (6) | DMRNet | in (20) |
6.1 HB Inspired Network
In this section, we describe the practical implementation of the neural network structure inspired by the heavy ball algorithm (2), which is used in our experiments. Specifically, we implement the HB inspired network by setting directly in (2):
Here $T$ is a composite function including two weight layers. Following the residual structure in ResNet, the first layer is composed of three consecutive operations: convolution, batch normalization, and ReLU, while the second one performs only convolution and batch normalization. The feature maps are down-sampled at the first layer of each block by strided convolution.
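To make the layer update concrete, the following is a minimal sketch of a heavy-ball style forward pass. It rests on assumptions that are not taken from the text: the update is obtained by replacing $\nabla f(x_k)$ with $-T(x_k)$ in (2), giving $x_{k+1} = T(x_k) + x_k + \beta (x_k - x_{k-1})$; $T$ is modelled by two small dense layers instead of convolution and batch normalization; and all function names are hypothetical.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def make_transform(w1, w2):
    """Toy stand-in for T: two weight layers with a ReLU in between."""
    def transform(x):
        return w2 @ relu(w1 @ x)
    return transform

def hb_forward(x0, transforms, beta):
    """Assumed update: x_{k+1} = T(x_k) + x_k + beta * (x_k - x_{k-1})."""
    x_prev, x = x0, x0  # no momentum is available at the first layer
    for transform in transforms:
        x_next = transform(x) + x + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

def resnet_forward(x0, transforms):
    """Plain residual update: x_{k+1} = T(x_k) + x_k."""
    x = x0
    for transform in transforms:
        x = transform(x) + x
    return x

rng = np.random.default_rng(0)
transforms = [
    make_transform(0.1 * rng.standard_normal((8, 8)), 0.1 * rng.standard_normal((8, 8)))
    for _ in range(4)
]
x0 = rng.standard_normal(8)

out_resnet = resnet_forward(x0, transforms)
out_hb_zero = hb_forward(x0, transforms, beta=0.0)
out_hb = hb_forward(x0, transforms, beta=0.5)
print("beta = 0 matches the residual network:", np.allclose(out_hb_zero, out_resnet))
```

Under this assumed form, setting the momentum coefficient to zero drops the extra term and leaves the ordinary residual connection, while a nonzero coefficient adds one more skip path from the layer before.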
6.2 AGD Inspired Network
In line with the analysis above, we introduce the AGD inspired network of (18) as follows, which is easy to implement.
where $T$ is the composite function including batch normalization, ReLU and convolution, following DenseNet. Different from DenseNet, where all preceding layers ($x_0, x_1, \dots, x_k$) are concatenated first and then mapped by $T$ to produce $x_{k+1}$, the AGD inspired network makes each preceding layer produce its own output first, and then sums the outputs with weights $\alpha_{k+1}^j$. The weights of the first term, $\alpha_{k+1}^j$, are co-optimized with the network, and the weights of the third term, $h_{k+1}^j$, are calculated by formulation (5). The parameter $\beta$ is set to 0.1 in our experiments.
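The aggregation described above can be sketched as follows. This is only an illustration under assumptions: the layer update is taken to be $x_{k+1} = \sum_j \alpha_{k+1}^j T(x_j) + \beta (x_k - \sum_j h_{k+1}^j x_j)$, each $T(x_j)$ is computed once and reused by later layers, $T$ is a toy dense layer with ReLU, and the coefficient lists passed in are placeholders rather than trained or derived values.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def agd_style_forward(x0, transforms, alphas, hs, beta=0.1):
    """Assumed update, with alphas[k] and hs[k] holding k + 1 coefficients:
    x_{k+1} = sum_j alphas[k][j] * T_j(x_j) + beta * (x_k - sum_j hs[k][j] * x_j)
    """
    xs = [x0]
    outputs = []
    for k, transform in enumerate(transforms):
        outputs.append(transform(xs[k]))  # each layer produces its own output once
        first = sum(a * o for a, o in zip(alphas[k], outputs))
        third = sum(h * x for h, x in zip(hs[k], xs))
        xs.append(first + beta * (xs[k] - third))
    return xs

rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((5, 5)) for _ in range(3)]
transforms = [lambda x, w=w: relu(w @ x) for w in weights]
x0 = rng.standard_normal(5)

ones = [[1.0] * (k + 1) for k in range(3)]            # placeholder alpha values
uniform = [[1.0 / (k + 1)] * (k + 1) for k in range(3)]  # placeholder h values

dense_like = agd_style_forward(x0, transforms, ones, uniform, beta=0.0)
with_momentum = agd_style_forward(x0, transforms, ones, uniform, beta=0.1)
print("number of layer outputs:", len(with_momentum))
```

With `beta=0.0` the momentum term vanishes and each layer is a weighted sum of the outputs of all preceding layers, which mirrors dense connectivity with summation in place of concatenation.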
7.1 Datasets and Training Details
CIFAR. Both the CIFAR-10 and CIFAR-100 datasets consist of $32 \times 32$ colored natural images. The CIFAR-10 dataset has 60,000 images in 10 classes, while the CIFAR-100 dataset has 100 classes, each containing 600 images. Both are split into 50,000 training images and 10,000 testing images. For image preprocessing, we normalize the images by subtracting the mean and dividing by the standard deviation. Following common practice, we adopt a standard scheme for data augmentation. The images are padded by 4 pixels on each side, filled with 0, resulting in $40 \times 40$ images, and then a $32 \times 32$ crop is randomly sampled from each image or its horizontal flip.
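The preprocessing and augmentation steps described above can be sketched in NumPy. This is a minimal illustration, not the pipeline used in the experiments: it assumes images stored as height × width × channel arrays, per-channel statistics computed over the training set, and a flip probability of 0.5; the function names are hypothetical.

```python
import numpy as np

def normalize(images, mean, std):
    """Subtract the per-channel mean and divide by the per-channel std."""
    return (images - mean) / std

def pad_crop_flip(image, rng, pad=4):
    """Zero-pad by `pad` pixels per side, optionally flip, then crop to the original size."""
    height, width, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    if rng.random() < 0.5:
        padded = padded[:, ::-1, :]          # horizontal flip
    top = rng.integers(0, 2 * pad + 1)       # upper bound is exclusive
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + height, left:left + width, :]

rng = np.random.default_rng(0)
train = rng.random((16, 32, 32, 3))          # stand-in for real training images
mean = train.mean(axis=(0, 1, 2))
std = train.std(axis=(0, 1, 2))
train_normalized = normalize(train, mean, std)
augmented = pad_crop_flip(train_normalized[0], rng)
print(augmented.shape)
```

Normalizing before padding means the zero padding corresponds to the mean color of the dataset, which matches the order in which the steps are described.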
ImageNet. We also test the validity of our models on ImageNet, which contains 1.2 million training images, 50,000 validation images, and 100,000 test images with 1000 classes. We adopt standard data augmentation for the training set. A crop is randomly sampled from the images or their horizontal flips. The images are normalized by mean values and standard deviations. We report the single-crop error rate on the validation set.
Training Details. For fair comparison, we train our ResNet based models and DenseNet based models using the training strategies adopted in the DenseNet paper [7], including weight decay. We adopt the weight initialization method in [37], and use Xavier initialization [38] for the fully connected layer. For CIFAR, we train for 300 epochs in total with a batch size of 64. The learning rate is initially set to 0.1 and divided by 10 at 50% and 75% of the training procedure. For ImageNet, we train for 100 epochs and drop the learning rate at epochs 30, 60, and 90. The batch size is 256, split among 4 GPUs.
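The step schedules described above can be written as small helper functions. This is a sketch under assumptions: epochs are counted from 0, and for ImageNet the initial learning rate of 0.1 and the drop factor of 10 are assumed here, since the text only states the epochs at which the rate is dropped. The function names are hypothetical.

```python
def cifar_learning_rate(epoch, total_epochs=300, base_lr=0.1):
    """Divide the learning rate by 10 at 50% and at 75% of training."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr /= 10.0
    if epoch >= 0.75 * total_epochs:
        lr /= 10.0
    return lr

def imagenet_learning_rate(epoch, base_lr=0.1, drop_epochs=(30, 60, 90), factor=10.0):
    """Drop the learning rate at the listed epochs (base_lr and factor are assumptions)."""
    lr = base_lr
    for drop in drop_epochs:
        if epoch >= drop:
            lr /= factor
    return lr

for epoch in (0, 149, 150, 224, 225, 299):
    print(epoch, cifar_learning_rate(epoch))
```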
| HB-Net (16) ($n = 9$) | 10.17 | 38.52 | 5.46 | 26 |
| HB-Net (16) ($n = 18$) | 8.66 | 36.4 | 5.04 | 23.93 |
AGD-Net (18) ($k = 12$, $L = 40$)
AGD-Net (18) ($k = 12$, $L = 52$)
7.2 Comparison with the State of the Art
The experimental results on CIFAR are shown in Table 4, where the first two blocks are for ResNet based models and the last two blocks are for DenseNet based models. For ResNet based models, we conduct experiments with the parameter $n$ set to 9 and 18, corresponding to depths of 56 and 110. For DenseNet based models, we consider two cases where the growth rate $k$ is 12 and the depth $L$ equals 40 and 52, respectively. We do not consider a larger DenseNet model due to the memory constraint of our single GPU. We can see that our proposed AGD-Net and HB-Net perform better than their respective baselines. For DenseNet based models, when $k = 12$ and $L = 40$, our model improves on all datasets except for augmented CIFAR-100. When $k = 12$ and $L = 52$, the superiority of AGD-Net is clear on all datasets. Similarly, the superiority of HB-Net over ResNet increases as the model capacity grows from $n = 9$ to $n = 18$. Besides, as reported in the ResNet paper [6], when $n = 18$ the standard training strategy has difficulty converging and a warm-up is necessary for training. Our reimplementation of ResNet ($n = 18$) in Table 4 indeed requires repeating the experiments to get a converged result. For HB-Net, however, we can use exactly the same training strategy adopted by the other models and obtain a converged result by training only once. Therefore, the training procedure of HB-Net is more stable than that of the original ResNet when the network goes deeper.
As shown in Table 5, our proposed structures are also effective on the ImageNet dataset. Both HB-ResNet and AGD-Net perform better than their baselines. All the above experiments suggest that our design methodology is promising.
8 Conclusion and Future Work
In this paper, we use inspiration from optimization algorithms to design neural network structures. We propose the hypothesis that a faster algorithm may inspire us to design a better neural network. We prove that the propagation in the standard feedforward network with the same linear transformation in different layers is equivalent to minimizing some function using the gradient descent algorithm. Based on this observation, we replace the gradient descent algorithm with the faster heavy ball algorithm and Nesterov’s accelerated gradient algorithm to design better network structures, where ResNet and DenseNet are two special cases of our design.
Our methodology is still preliminary and not conclusive, as many engineering tricks may also greatly affect the performance of neural networks. Nonetheless, our methodology can serve as a starting point for network design. Practitioners can easily modify optimization algorithm inspired networks based on various insights and integrate various engineering tricks to produce even better results. Such a practice should be much easier than designing from scratch.
[1] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[2] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[3] R. Girshick. Fast R-CNN. In ICCV, 2015.
[4] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. PAMI, 39:1137–1149, 2016.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[8] Y. Yang, Z. Zhong, T. Shen, and Z. Lin. Convolutional neural networks with alternately updated clique. In CVPR, 2018.
[9] J. Schaffer, D. Whitley, and L. Eshelman. Combinations of genetic algorithms and neural networks: A survey of the state of the art. In International Workshop on Combinations of Genetic Algorithms and Neural Networks, 1992.
[10] H. Lam, F. Leung, and P. Tam. Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Trans. on Neural Networks, 14:79–88, 2003.
[11] P. Verbancsics and J. Harguess. Generative neuroevolution for deep learning. In arXiv:1312.5355, 2013.
[12] S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, 2016.
[13] T. Domhan, J. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, 2015.
[14] J. Bergstra, D. Yamins, and D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In ICML, 2013.
[15] T. Kwok and D. Yeung. Constructive algorithms for structure learning feedforward neural networks for regression problems. IEEE Trans. on Neural Networks, 8(3):630–645, 1997.
[16] L. Ma and K. Khorasani. A new strategy for adaptively constructing multilayer feedforward neural networks. Neurocomputing, 51:361–385, 2003.
[17] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. AdaNet: Adaptive structure learning of artificial neural networks. In ICML, 2017.
[18] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In arXiv:1611.02167, 2016.
[19] B. Zoph and Q. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
[20] Z. Zhong, J. Yan, W. Wu, J. Shao, and C. Liu. Practical block-wise neural network architecture generation. In CVPR, 2018.
[21] A. Beck and M. Teboulle. A fast iterative shrinkage thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
[22] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In ICML, 2010.
[23] B. Xin, Y. Wang, W. Gao, B. Wang, and D. Wipf. Maximal sparsity with deep networks? In NIPS, 2016.
[24] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. In CVPR, 2016.
[25] J. Zhang and B. Ghanem. ISTA-Net: Iterative shrinkage-thresholding algorithm inspired deep network for image compressive sensing. In arXiv:1706.01929, 2017.
[26] Y. Yang, J. Sun, H. Li, and Z. Xu. Deep ADMM-Net for compressive sensing MRI. In NIPS, 2016.
[27] D. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.
[28] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[29] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372–376, 1983.
[30] D. Gabay. Applications of the method of multipliers to variational inequalities. Studies in Mathematics and its Applications, 15:299–331, 1983.
[31] Z. Lin, R. Liu, and Z. Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In NIPS, 2011.
[32] P. Putzky and M. Welling. Recurrent inference machines for solving inverse problems. In arXiv:1706.04008, 2017.
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[35] X. Jin, C. Xu, J. Feng, Y. Wei, J. Xiong, and S. Yan. Deep learning with S-shaped rectified linear activation units. In AAAI, 2016.
[36] L. Zhao, J. Wang, X. Li, Z. Tu, and W. Zeng. Deep convolutional neural networks with merge-and-run mappings. In arXiv:1611.07718, 2017.
[37] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[38] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.