Deep learning is currently among the most popular highlights of artificial intelligence and has achieved great successes in speech recognition [1, 2], playing the game of go [3], and so on. Despite its successful applications and a history of nearly 40 years since Fukushima’s paper in 1982 [4], the underlying principle still remains unclear, so that deep learning is often referred to as a “black box”, which greatly hinders its development.
One of the main concerns is why deep neural networks are more powerful than shallow ones. The answer to this question is the key to understanding deep learning, for which there are mainly two kinds of existing results. The first kind is specific, explaining particular classes of functions realized by deep learning, such as [5, 6, 7, 8]. The second kind is more general, studying the expressive ability of deep layers compared with shallow ones, such as [9, 10, 11, 12, 13].
Recursively dividing regions and subregions with respect to layer depth is an interpretation of deep learning composed of ReLUs (rectified linear units) [14, 15]. The related content of this paper differs from [9, 10, 11] in two respects. First, we rigorously realize the region and subregion dividing by giving a thorough deep learning structure, and present a new interpretation of the superiority of deep layers over shallow ones. Second, the region-transmitting property through layers is proved rigorously and given in more formal descriptions due to its extreme importance.
Based on the discussion of region dividing, we further prove that ReLU deep learning can classify multi-category data points. We then turn to function approximation problems and demonstrate that ReLU deep learning can approximate an arbitrary continuous function on a closed set of n-dimensional space. Finally, some conclusions about ReLU deep learning are generalized to the case of sigmoid-unit deep learning.
2 The mechanism of 3-layer networks
The discussion of 3-layer networks is the basis of comparisons between shallow and deep networks. Moreover, there exist 3-layer subnetworks in deep learning, whose mechanism is the same as that of ordinary 3-layer networks.
We begin the discussion with a concrete example of two-category classification realized by a 3-layer network. It is well known that each ReLU corresponds to a hyperplane dividing the input space into two regions. In the case of a two-dimensional input space, a hyperplane reduces to a line.
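This correspondence between a ReLU and a hyperplane can be checked directly. The following sketch uses illustrative parameters (not taken from the paper's figures) for a single ReLU whose line is x2 = 1 in the plane:

```python
import numpy as np

# A single ReLU computes max(0, w·x + b); the hyperplane w·x + b = 0
# divides the input space into a nonzero-output side and a zero-output
# side. In 2-D the hyperplane is a line. Parameters are illustrative.
w = np.array([0.0, 1.0])   # normal of the line x2 = 1
b = -1.0

def relu_unit(x):
    return max(0.0, float(np.dot(w, x) + b))

out_above = relu_unit(np.array([0.0, 2.0]))  # on the "+" side of the line
out_below = relu_unit(np.array([0.0, 0.0]))  # on the "0" side of the line
```

A point above the line gives a positive output; a point below gives exactly zero, which is the "two regions" statement in numeric form.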
Hereafter, unless otherwise stated, when referring to classification by a hyperplane, data points lying exactly on the hyperplane are not taken into consideration. We will not distinguish between the terms “region dividing” and “data classification” in this paper. For simplicity, all the figures of neural networks omit the biases, which nevertheless exist.
Fig.1 (a) is a 3-layer network with three ReLUs in the hidden layer, denoted by 1, 2 and 3, corresponding to lines 1, 2, and 3 of Fig.1 (b), respectively. We denote the two sides of hyperplane i by “i-s”, where i is the index of the hyperplane and s expresses the output of the ReLU with respect to this hyperplane. “i-+” represents the side of hyperplane i where the ReLU output is greater than 0; the other side is denoted by “i-0”, where the ReLU output is zero. For instance, in Fig.1 (b), 1-+ is the side above line 1 because data in that half plane give positive ReLU outputs, while 1-0 represents the side below, producing zero outputs. The objective of the 3-layer network of Fig.1 (a) is to classify the data points of Fig.1 (b) into two categories: the output of the third layer should be 0 when the input sample belongs to the “o” category and 1 otherwise. Output 1 can be obtained by normalizing the nonzero output of the output ReLU.
In Fig.1 (b), we can add lines 1, 2, and 3 one by one for classification. First, line 1 is added, whereupon the “o” samples below line 1 are correctly classified. The samples above line 1 should be further classified by more lines, such as line 2 and line 3. However, when line 3 is added, the “o” samples below line 1 are simultaneously on the side 3-+, producing nonzero outputs; that is to say, subdividing the half plane above line 1 by line 3 turns the previously correct classification result below line 1 into a wrong one, so we may need to add other lines to eliminate the influence of line 3.
In fact, the final output expression of a 3-layer network (with a single output) is

y = λ_1 h_1 + λ_2 h_2 + … + λ_m h_m + b,   (1)

where h_i is the i-th ReLU output of the hidden layer, λ_i is its output weight, and b is the output bias. If λ_i ≠ 0 and h_i > 0, the i-th ReLU can influence the whole sum by its nonzero output; in geometric language, this means that the i-th hyperplane for region dividing will influence the half of the input space where this ReLU output is nonzero. The influenced region may include previously correctly divided regions, whose right results may be reversed. If the influence cannot be eliminated by adjusting the present hyperplanes, new hyperplanes should be added. This procedure may occur recursively; hence the number of hyperplanes needed in 3-layer networks may be extremely larger than what is really needed when we just want to divide the input space into separate regions without considering mutual influences. This is the general explanation of Fig.1, from which a conclusion follows:
In 3-layer networks, any newly added ReLU of the hidden layer will influence the half of the input space where the output of this ReLU is nonzero.
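The influence described by (1) can be demonstrated numerically. In this sketch (all parameters illustrative), appending a third hidden ReLU shifts the summed output at a point that was already handled by the first two units, because that point lies on the new unit's nonzero side:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def net_out(x, W, b, lam):
    # 3-layer network output: weighted sum of hidden ReLU outputs, as in (1)
    return float(lam @ relu(W @ x + b))

# Two hidden ReLUs (lines 1 and 2); all parameters are illustrative.
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
b = np.array([-1.0, -1.0])
lam = np.array([1.0, 1.0])

x = np.array([2.0, 3.0])          # a point the first two units already handle
before = net_out(x, W, b, lam)

# Append a third ReLU (line 3): x lies on its "+" side, so the sum at x
# shifts, even though line 3 was meant to fix a different region.
W3 = np.vstack([W, [1.0, 1.0]])
b3 = np.append(b, -2.0)
lam3 = np.append(lam, 1.0)
after = net_out(x, W3, b3, lam3)
```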
We shall show that this interference of hyperplanes with one another can be avoided in deep learning.
3 The transmitting of input-space regions through layers
In deep learning, the input space is only directly connected to the first hidden layer; how a region of the input space passes to subsequent layers is a key foundation of subregion dividing via a sequence of layers.
Pascanu et al. [9] used an “intermediate layer” to transmit an input-space region, which actually works by means of affine transforms; however, no general rigorous conclusions with proofs were presented. Although trivial in mathematics, due to its great importance we will give rigorous, detailed descriptions of both the conclusions and the proofs of this problem, as well as add some necessary prerequisites for establishing the results.
Lemma 1. Suppose that the input space is n-dimensional. The nonzero outputs of n ReLUs in the first hidden layer form a new space S. If the n × n weight matrix W between the input layer and the first hidden layer is nonsingular, then S is n-dimensional and is an affine transform of a region of the input space. The intersection of the nonzero-output areas of the n ReLUs is the region to be transformed.
Proof. We know that the nonzero output of a ReLU is w^T x + b for w^T x + b > 0. So a nonzero output vector h of the first hidden layer can be written as

h = Wx + b,   (2)

where x is a vector in a certain region of the input space, W is the weight matrix whose rows are the ReLU weight vectors, and b is the bias vector of the n ReLUs. (2) only combines the outputs of the n ReLUs into matrix form. Obviously, (2) is an affine transform, and if W is nonsingular, the dimension of S would be n. ∎
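A minimal numeric sketch of Lemma 1, with made-up W, b, and sampled region: on a region where every ReLU pre-activation is positive, the layer acts exactly as the affine map (2), and a nonsingular W keeps the image full-dimensional.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: non-parallel rows make W nonsingular.
W = np.array([[1.0, 1.0],
              [1.0, -1.0]])
b = np.array([0.5, 0.5])

# Sample a region on which both ReLU pre-activations are positive,
# so ReLU(Wx + b) == Wx + b there (no clipping).
X = rng.uniform([2.0, 0.0], [3.0, 1.0], size=(200, 2))
pre = X @ W.T + b
H = np.maximum(0.0, pre)          # first-hidden-layer outputs

rank_W = np.linalg.matrix_rank(W)
# The image of the region is still 2-dimensional (an affine copy):
rank_H = np.linalg.matrix_rank(H - H.mean(axis=0))
```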
The geometric meaning of Lemma 1: a nonsingular W in (2) implies non-parallel hyperplanes. Lemma 1 is equivalent to saying that if the hyperplanes with respect to the n ReLUs are not parallel to each other, the space S would be n-dimensional as well as an affine transform of a region of the input space.
Theorem 2. In deep learning with n-dimensional input, if each succeeding layer has n ReLUs with a nonsingular weight matrix, a certain region of the input space can be transmitted to subsequent layers one by one in the sense of affine transforms.
Proof. The first hidden layer of Lemma 1 can in turn be considered as a new input space. Doing this recursively, a certain region of the initial input space can be transmitted to succeeding layers one by one in the sense of affine transforms, as long as this region always lies in the nonzero parts of all the ReLUs in each layer. ∎
4 Region dividing without mutual interference
Section 2 described the mechanism of 3-layer networks: adding a new ReLU to the hidden layer influences half of the input space. Based on the results of Section 3, we now show that this disadvantage can be avoided in deep learning.
4.1 The two-dimensional case
We again begin with an example. Fig.2 corresponds to Fig.1 of Section 2. In Fig.1, to subdivide the region above 1-+, line 3 is added in the hidden layer; however, this operation disturbs the previously correct classification results. In deep learning, the mutual interference among lines can be avoided by adding a new layer to restrict the influence area of a line.
As shown in Fig.2 (b), line 1 is first selected to divide the data points into two separate parts in different regions; then we can always find a line 2 having the same classification effect as line 1. In fact, when line 2 in Fig.2 (b) rotates counterclockwise towards line 1, the region between 1-+ and 2-+ (or between 1-0 and 2-0) can be made as large as necessary, such that all the data points above line 1 (or below line 1) are encompassed by 1-+ and 2-+ (or by 1-0 and 2-0); this is the way of finding line 2. Thus, all the data points are either in the region between 1-+ and 2-+ (denoted by region-+) or in the region between 1-0 and 2-0 (denoted by region-0).
Since line 1 and line 2 are not parallel, by the remark of Lemma 1, the output space of ReLU 1 and ReLU 2, i.e., the space of the first hidden layer, is two-dimensional as well as an affine transform of region-+, while region-0 is excluded from this layer in terms of zero outputs. Affine transforms do not affect the linear-classification property of the data; so the linear classification of region-+ can be done in the space of the first hidden layer without influencing region-0, because the latter has been excluded.
Now, instead of adding ReLU 3 in the same layer as ReLU 1 and ReLU 2 as in Fig.1 (a), we add it in a new layer, called the second hidden layer, as shown in Fig.2 (a), to perform the classification in the space of the first hidden layer. Correspondingly, in Fig.2 (b), line 3 should be drawn in the space of the first hidden layer, which is an affine transform of region-+ of the input space; the illustration is nevertheless reasonable because the effect of the linear classification is equivalent.
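The two-hidden-layer construction above can be sketched numerically. All parameters here are made up: layer 1 (lines 1 and 2) transmits region-+ and maps region-0 to the origin; layer 2 (line 3) then subdivides region-+ while the excluded region stays at zero no matter what line 3 does.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Layer 1: ReLU 1 and ReLU 2 (lines 1 and 2); illustrative parameters.
W1 = np.array([[0.0, 1.0],    # line 1: x2 = 1
               [0.2, 1.0]])   # line 2: same split for these sample points
b1 = np.array([-1.0, -1.0])
# Layer 2: ReLU 3 (line 3), drawn in the layer-1 space.
w3 = np.array([0.0, 1.0])
b3 = -1.2

def layer1(x):
    return relu(W1 @ x + b1)

def out(x):
    return float(relu(w3 @ layer1(x) + b3))

region_plus = [np.array([0.0, 2.0]), np.array([2.0, 2.0])]
region_zero = [np.array([0.0, 0.0]), np.array([-2.0, 0.0])]

zero_outs = [out(x) for x in region_zero]   # excluded points: always 0
plus_outs = [out(x) for x in region_plus]   # line 3 splits these two
```

Points of region-0 reach layer 2 as the origin, so ReLU 3 cannot interfere with them; the two region-+ points are separated by line 3.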
Obviously, the principle and operation underlying this example are general in two-dimensional space. In what follows we directly generalize them to the n-dimensional case.
4.2 The n-dimensional case
Lemma 2. For a 3-layer network with n-dimensional input, the hidden layer can be designed to realize an arbitrary linearly separable two-category classification. One of the categories will be excluded by the hidden layer, while the other changes into its affine transform. Adding a new hidden layer can then divide a selected region of the input space in the sense of affine transforms without influencing the excluded region.
Proof. When the input space is n-dimensional, we need n hyperplanes (ReLUs) to construct the n-dimensional space of the hidden layer, each of which realizes the same two-category classification. The function of those n hyperplanes is similar to that of line 1 and line 2 in Fig.2 (b).
First choose hyperplane 1 to divide the input space into two regions containing the data points of category-0 and category-+, respectively; category-0 should be excluded, while category-+ may need to be subdivided. Then hyperplane 2, with the same classification effect as hyperplane 1, can be found by a method similar to the two-dimensional case. When hyperplane 2 rotates towards hyperplane 1 (counterclockwise or clockwise according to their relative positions), there exists an infinite number of hyperplanes between them, all of which classify the data with the same effect; choose n − 2 of them as the remaining hyperplanes to construct an n-dimensional coordinate system. Since the n selected hyperplanes are not parallel to each other, by the remark of Lemma 1, the nonzero outputs of the ReLUs with respect to those hyperplanes form an n-dimensional linear space, which is an affine transform of a region of the input space; the region giving zero outputs of the n ReLUs will be excluded.
The constructed hidden layer has thus excluded a region containing category-0 (region-0), as well as transmitted a region containing category-+ (region-+). By adding a new hidden layer, we can subdivide region-+ of the input space in the sense of affine transforms without influencing region-0. ∎
The purpose of selecting n non-parallel hyperplanes (ReLUs) is to construct an n-dimensional space that maintains the complete data structure of the n-dimensional input space in the sense of affine transforms. If the number of non-parallel hyperplanes is less than n, the output space will be a lower-dimensional subspace of the input space, which may lose information.
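The information loss with parallel hyperplanes can be made concrete. In this illustrative sketch, a weight matrix with parallel rows is singular, so two distinct inputs collapse onto the same layer output, while non-parallel rows keep them apart:

```python
import numpy as np

# Parallel hyperplane normals -> singular weight matrix -> the layer's
# output lies on a line and distinct inputs become indistinguishable.
# Matrices and points below are illustrative.
W_parallel = np.array([[1.0, 1.0],
                       [2.0, 2.0]])   # parallel rows, rank 1
W_general  = np.array([[1.0, 1.0],
                       [1.0, -1.0]])  # non-parallel rows, rank 2

x1 = np.array([3.0, 1.0])
x2 = np.array([1.0, 3.0])             # distinct inputs with equal x1 + x2

collapsed  = bool(np.array_equal(W_parallel @ x1, W_parallel @ x2))
kept_apart = not np.array_equal(W_general @ x1, W_general @ x2)
```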
Consider an arbitrary 2-layer subnetwork of a deep network with n-dimensional input, consisting of a previous layer and a current layer; let W be the weight matrix between the two layers, as in (2). Then we have:
Theorem 3. In deep learning, if the current layer has n ReLUs with a nonsingular weight matrix W, adding ReLUs in a new layer after the current layer can divide a certain region of the previous layer in the sense of affine transforms without influencing an excluded region. Similarly, adding new layers one by one can realize subregion dividing recursively; in each layer, data points that do not need to be subdivided can be put into the excluded region, so that the region dividing of succeeding layers has no impact on them.
Proof. The first part of the theorem is similar to Lemma 2. As long as W is nonsingular, even if the ReLUs of the current layer are not specially designed, the region-transmitting property still holds. The rest of the proof is the recursive application of the first part. ∎
Theorem 3 indicates the advantage of deep layers for this type of deep learning structure. To classify complex data points, the deeper the network, the finer the subdividing will be. Once the step of adding new layers stops, the last three layers perform the classification in an ultimate subregion via the mechanism of 3-layer networks.
5 The classification ability of deep learning
On the basis of the above discussions, the classification ability of deep learning can be derived.
Lemma 3. For a finite number of data points of two categories in n-dimensional space, deep learning can classify them in the manner of a decision tree.
The proof is constructive, by Lemma 2 and the theory of decision trees. First, we can always construct a decision tree realizing this two-category classification whose decision functions are linear classifiers. Second, there exists a deep learning structure equivalent to that decision tree, given by the following method.
Fig.3 (a) shows a four-level decision tree classifying two-dimensional data points, and Fig.3 (b) is its corresponding deep learning structure. Each layer of the deep network corresponds to a level of the decision tree, except that the deep network adds an output layer with one ReLU. Root node 1 has two ReLUs in the first layer because the input space is two-dimensional.
In each layer, for a node having two child nodes, construct 2n ReLUs in the next layer. The first n of them (left child) separate the data points into region-+ and region-0, designed according to the decision function of this node by the method of Lemma 2; data points in region-+ can be subclassified by succeeding layers of child nodes without influencing the excluded region-0. The other n ReLUs (right child) differ from the first group of ReLUs only in the signs of their parameters; they reverse the ReLU outputs of data points in region-+ and region-0, which instead makes region-0 the region to be subdivided. For example, in Fig.3, node 1 has two child nodes, so four ReLUs are needed in the next layer; two of them are for left child 2, while the other two are for right child 3. In the second layer, the weights and biases of the ReLUs of node 3 are opposite in sign to those of node 2, with the same absolute values.
For a leaf node, if the next layer is the last one, just connect its related ReLUs to the output ReLU, as with node 6 and node 7 of Fig.3. Otherwise, we should add one ReLU in each succeeding layer (except the last one) to transmit the classification result to the last layer, as with node 2 and node 5 in Fig.3; make sure that the weights and bias of the single ReLU of a leaf node in each layer maintain a nonzero output.
The weights and bias of the output-layer ReLU should be designed to distinguish between a left leaf node and a right leaf node. For instance, let the left leaf nodes and right leaf nodes of Fig.3 (a) correspond to zero output and nonzero output of the deep network of Fig.3 (b), respectively. The design is easy because, in the layer before the last one, when the output of one leaf node is nonzero, those of the other leaf nodes are mutually exclusively zero, due to the properties of decision trees. For example, in Fig.3 (b), when the output of ReLU 2 in the fourth layer (the one before the last) is nonzero, all the outputs of the other ReLUs of this layer are zero. So we may consider that only ReLU 2 exists in that layer, by which the weight between ReLU 2 and the output-layer ReLU can be designed without influencing the other leaf nodes. The bias of the output-layer ReLU can be set to zero because the weight itself is enough to produce the desired output. Since ReLU 2 corresponds to a left leaf node, obviously, when the bias is zero, a weight less than or equal to zero meets the need. The general case is similar. This completes the constructing process. ∎
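The sign-flipping trick for the right-child ReLUs can be sketched in isolation, with made-up parameters: negating the weights and bias of a node's decision function exchanges the roles of region-+ and region-0, so each point activates exactly one child branch.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# One decision node with decision function w·x + b; parameters illustrative.
w, b = np.array([1.0, -1.0]), 0.5

def left_unit(x):               # nonzero on the node's "+" region
    return float(relu(w @ x + b))

def right_unit(x):              # signs flipped: nonzero on the "0" region
    return float(relu(-w @ x - b))

x_plus = np.array([2.0, 1.0])   # w·x + b =  1.5
x_zero = np.array([1.0, 2.0])   # w·x + b = -0.5

# (left activation, right activation) for each point:
acts = [(left_unit(x), right_unit(x)) for x in (x_plus, x_zero)]
```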
Theorem 4. Deep learning can classify an arbitrary finite number of data points of multiple categories.
The proof reduces the multi-category classification to the two-category case of Lemma 3. Fig.4 is an example of three-category classification, in which each dotted rectangle classifies one of the three categories using the two-category method of Lemma 3. No matter how many categories are to be classified, we employ the two-category method to deal with each category separately and combine the parts into a whole deep learning structure. ∎
Bengio et al. [18] stated that decision trees do not easily generalize to variations of the training data, while forests do not have this limitation. By Theorem 4, deep learning can realize the function of forests, so its generalization ability can be assured.
6 The function approximation of deep learning
We now turn to the function approximation problem of deep learning by establishing the relationship between region dividing and piecewise-constant functions.
There exist general results on the function approximation ability of 3-layer sigmoid-unit networks, such as Hecht-Nielsen [19], Cybenko [20], and Hornik et al. [21]. Among them, Hecht-Nielsen’s proof is constructive. Until now, deep learning has had no similar conclusions; here we deal with this problem.
Lemma 4. Any piecewise-constant function of Haar wavelets with a finite number of building-block domains can be approximated by deep learning with arbitrary precision.
The proof is based on Lemma 2 and Theorem 3. First we prove the two-dimensional case. For a function f represented by Haar wavelets, defined on a closed set with finitely many building-block domains, we can always divide its domain into rectangles (or squares, and similarly hereafter) of different sizes and locations, each carrying a constant value of the function (possibly shared with other rectangles). The basic idea is to approximate the function by deep learning in each rectangle as precisely as possible. Because the number of rectangles is finite, if the approximation error for each rectangle is arbitrarily small, the deep learning approximation to the whole function will be arbitrarily precise. So we just need to prove the case of one rectangle.
First consider an isolated rectangle R, such as in Fig.5 (a); it can be separated via deep learning. For each side of R, such as the bottom one, we can always find two lines (ReLUs) dividing some rectangle domains of the function into two parts in two different regions, with one of the two lines parallel to the bottom side (such as line 1). R is in the region where the outputs of the two ReLUs are both nonzero; all the rectangles below line 1 should be excluded by line 1 and line 2, lying in the other, zero-output region. Note that the region between 1-+ and 2-0 or between 1-0 and 2-+ also gives nonzero outputs; this needs to be specially processed later, unlike the classification of discrete data points in Theorem 4.
After doing similar operations on the other three sides, all the rectangle domains except R are excluded. However, the intersection of the nonzero-output regions of the four separations is not R itself but the region of the plane excluding the four zero-output regions, which is a concave polygon (denoted by P) formed by eight lines, as in Fig.5 (a).
Note that only the separation of the bottom side of R, handled first, is done in the input space of the deep network; the separations of the three other sides should be done in the spaces of three succeeding layers, respectively, as shown in Fig.5 (c). The above operations remain reasonable because of the properties of affine transforms. For example, if the second hidden layer of Fig.5 (c) corresponds to the separation of the left side of R, then as long as we can find two lines for this side in the input space, as in Fig.5 (a), the corresponding two lines in the space of the first hidden layer can also be found, with the parallel and collinear properties invariant. By the architecture of Fig.5 (c), the effects of the four separations can be combined, and finally only the data points in polygon P give nonzero outputs. The rest of the proof will not mention this issue again.
In polygon P, let the output of the deep network be the value of the approximated function in rectangle R. We now show that a sequence of polygons P can converge to R by adjusting the parameters of the eight lines. Denote by R′ the outer rectangle formed by the four of the eight lines that are parallel to the four respective sides of R (such as line 1). For the separation of the bottom side of R, when line 2 rotates clockwise towards line 1 (which is parallel to the bottom side), the limit of line 2 is line 1; during the rotating process, the classification result of the rectangle domains remains unchanged, while the region between 1-+ and 2-0 or between 1-0 and 2-+ becomes smaller and smaller. Doing similar rotating operations for the three other sides, the concave polygon P can approximate R′ to any desired precision.
When the outer rectangle R′ shrinks to R, the polygon P constructed by the deep network can also approximate R with arbitrary precision; therefore, deep learning can approximate the function in R as precisely as desired.
Now consider the case of adjacent rectangles. We call two rectangles adjacent when two of their respective sides lie on the same line. In Fig.5 (b), the function has different constant values in the big rectangle R and the small rectangle r. As can be seen, the bottom side of R shares a line with the top side of r, so they are adjacent. R is to be separated and may have more than one adjacent rectangle; we illustrate only one of them, which is enough for the description of the proof.
As in the case of an isolated rectangle, a concave polygon P encompassing R can be constructed by the deep network. In polygon P, the output of the deep network is normalized to the value of the function f in R. As shown in Fig.5 (b), part of r is also separated into P, where the output of the deep network is not equal to the actual function value in r. This type of approximation error occurs in all the adjacent rectangles partly separated into polygon P, where the function value differs from that of R. So the region D of P outside R is the source of the approximation error of deep learning. Define the approximation error in D as

E_D = ∫_D |f(x) − g(x)| dx,   (3)

where g is the approximating function of the deep network and D is the subset of the domain on which the error is defined. Let

M = max_{x1, x2 ∈ Ω} |f(x1) − f(x2)|,   (4)

where Ω is the domain of f; M is the maximum variation of f, which always exists because f has only a finite number of function values. Since the value of g is also derived from f, it is obvious that

E_D ≤ M · S(D),   (5)

where S(D) is the area of region D. Because the area of P can be arbitrarily close to that of R, S(D) tends to zero as P → R; thus, E_D can be made as small as desired.
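The bound E_D ≤ M · S(D) can be checked on a toy configuration (all values illustrative): if D is a strip of thickness eps cut from the adjacent rectangle and mislabeled with R's constant value, the error shrinks linearly with eps.

```python
# Numerical sketch of E_D <= M * S(D) for a piecewise-constant f.
# Values below are illustrative, not from the paper's figures.
c_R, c_r = 3.0, 1.0        # constant values of f on R and on r
M = abs(c_R - c_r)         # maximum variation of f over these domains
side = 2.0                 # length of the side shared by R and r

def error(eps):
    area_D = side * eps    # D: strip of thickness eps inside r
    return M * area_D      # exact integral of |f - g| over D here

errs = [error(e) for e in (0.1, 0.01, 0.001)]
```

As the polygon tightens (eps decreasing), the error falls toward zero, matching the limit argument above.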
Fig.5 (c) is the deep learning structure constructed for Fig.5 (a) or Fig.5 (b). The first hidden layer corresponds to the region dividing by line 1 and line 2 with respect to the bottom side of a rectangle; the succeeding three layers handle the three other sides. The four rounds of region dividing must be done in successive distinct layers to ensure that their effects can be combined. The final output should be normalized to the function value.
The whole deep learning structure approximating the function can be obtained by combining the subnetworks of all rectangle domains, just as in Fig.4, each dotted-rectangle module representing a certain rectangle domain of the function. This completes the proof of the two-dimensional case.
The n-dimensional case can be proved similarly. Change the dimension of the lines and rectangles, and use 2n hidden layers instead of the four in Fig.5 (c), with each layer having n ReLUs. The rotating operations follow the proof of Lemma 2. For each side of a hyperrectangle, n hyperplanes for separation are constructed by the method of Lemma 2, with hyperplane 1 parallel to the side. Hyperplane 2 is added second, and the other hyperplanes are chosen between hyperplane 1 and hyperplane 2. So we just need to rotate hyperplane 2 as in the two-dimensional case, and then insert the other new hyperplanes between hyperplane 1 and the rotated hyperplane 2. The rest of the proof follows the two-dimensional case. ∎
Deep learning can approximate any continuous function on a closed set of n-dimensional space with arbitrary precision.
We know that Haar wavelets are capable of approximating continuous functions, while deep learning can approximate Haar wavelets as demonstrated in Lemma 4. This completes the proof. ∎
Lippmann [22] once gave a somewhat similar proof of the classification ability of 3-layer networks composed of threshold logic units (TLUs). Although he did not mention the function approximation problem, his region dividing by neural networks can accurately represent a Haar wavelet function. However, he only discussed the case of 3-layer networks with TLUs.
7 Several conclusions of sigmoid-unit deep learning
The sigmoid unit is also widely used in deep learning, although its training is relatively more difficult due to the saturation property of the sigmoid function in two directions. In this section, we give several conclusions on sigmoid-unit deep learning on the basis of the ReLU case.
Corollary 1. All the conclusions of this paper about ReLU deep learning still hold in the case of a modified ReLU

f(x) = a · max(0, x − t),   (6)

where a and t are real with a > 0.

Proof. (6) only changes the slope of the linear part and the position of the threshold on the x axis of a ReLU; however, as long as a neural unit has zero and linear outputs separated by a threshold, all the proofs related to the ReLU are applicable to the modified case of (6). ∎
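A quick sketch of the modified ReLU, with illustrative values a = 2 and t = 1.5: the unit is still zero below a single threshold and linear above it, which is all the ReLU proofs rely on.

```python
def relu(x):
    return max(0.0, x)

# Modified ReLU of (6): slope a above the threshold t, zero below it.
# a and t are illustrative choices.
a, t = 2.0, 1.5

def modified_relu(x):
    return a * relu(x - t)

vals = [modified_relu(x) for x in (0.0, 1.5, 2.0, 3.0)]
```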
Corollary 2. In sigmoid-unit deep learning, a certain region of the input space can be approximately transmitted to hidden layers to any desired precision in the sense of affine transforms.
Proof. The derivative of the sigmoid function s(x) = 1/(1 + e^{−x}) is s(x)(1 − s(x)), tending to 1/4 as x → 0; that is to say, s(x) is approximated arbitrarily well by the line 1/2 + x/4 when x is close enough to zero. Thus, a certain segment of the sigmoid function can be approximately considered as a line. Combining this with Theorem 2, the corollary holds. ∎
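The near-linearity of the sigmoid around zero is easy to verify numerically; the interval [−0.1, 0.1] below is an illustrative choice:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# The derivative s(x)(1 - s(x)) equals 1/4 at x = 0, so near zero the
# sigmoid is close to the line 1/2 + x/4.
deriv_at_0 = sigmoid(0.0) * (1.0 - sigmoid(0.0))

xs = [i / 1000.0 for i in range(-100, 101)]          # [-0.1, 0.1]
max_gap = max(abs(sigmoid(x) - (0.5 + x / 4.0)) for x in xs)
```

The maximum gap over this interval is on the order of 1e-5, i.e., the segment is effectively linear.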
In the classic paper on artificial neural networks [23], Hopfield also referred to the “linear central region” of the sigmoid function around zero and used this approximately linear property to transmit information between nonlinear neurons. The idea is similar; however, the details differ owing to the different application backgrounds.
Corollary 3. Sigmoid-unit deep learning can exclude a certain region of the input space or of a hidden-layer space with arbitrary precision, so that region dividing in other regions cannot influence it.
Proof. The sigmoid function tends to zero as x → −∞, approximately corresponding to the zero-output part of a ReLU. Selecting suitable parameters of the sigmoid units can exclude a certain region, as in the ReLU case, with arbitrary precision. ∎
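This exclusion can be sketched by scaling the sigmoid's input: increasing the gain k (values below illustrative) drives the output on the excluded side arbitrarily close to zero, mimicking the zero part of a ReLU.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# On the excluded side (here x = -1), a larger input gain k pushes the
# sigmoid output toward 0; k values are illustrative.
tail = [sigmoid(k * -1.0) for k in (1.0, 10.0, 100.0)]
```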
The above three corollaries suggest that sigmoid-unit deep learning can realize the functions of ReLU deep learning to some extent.
The “black-box” problem of deep learning is important both in theory and in engineering, and it puzzles many who use deep learning or take an interest in it. We hope that this paper will be helpful on this theme.
- [1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29 (2012).
- [2] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proceedings of the IEEE 86.11, 2278-2324 (1998).
- [3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Driessche, T. Graepel, D. Hassabis, Mastering the game of go without human knowledge. Nature 550.7676, 354 (2017).
- [4] K. Fukushima, S. Miyake, Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition 15.6, 455-469 (1982).
- [5] O. Delalleau, Y. Bengio, Shallow vs. deep sum-product networks. Advances in Neural Information Processing Systems, 666-674 (2011).
- [6] R. Eldan, O. Shamir, The power of depth for feedforward neural networks. Conference on Learning Theory, 907-940 (2016).
- [7] B. McCane, L. Szymanski, Deep networks are efficient for circular manifolds. 23rd International Conference on Pattern Recognition (ICPR), IEEE, 3464-3469 (2016).
- [8] D. Rolnick, M. Tegmark, The power of deeper networks for expressing natural functions. arXiv preprint arXiv:1705.05502v2 (2018).
- [9] R. Pascanu, G. Montúfar, Y. Bengio, On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098v5 (2014).
- [10] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, J. Sohl-Dickstein, On the expressive power of deep neural networks. Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2847-2854 (2017).
- [11] G. Montúfar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems, 2924-2932 (2014).
- [12] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, J. Sohl-Dickstein, Survey of expressivity in deep neural networks. arXiv preprint arXiv:1611.08083v1 (2016).
- [13] H. W. Lin, M. Tegmark, D. Rolnick, Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6): 1223-1247 (2017).
- [14] V. Nair, G. Hinton, Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807-814 (2010).
- [15] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315-323 (2011).
- [16] Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436-444 (2015).
- [17] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, 2001.
- [18] Y. Bengio, O. Delalleau, C. Simard, Decision trees do not generalize to new variations. Computational Intelligence, 26(4): 449-467 (2010).
- [19] R. Hecht-Nielsen, Theory of the backpropagation neural network. Neural Networks for Perception, Academic Press, 65-93 (1992).
- [20] G. Cybenko, Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4): 303-314 (1989).
- [21] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators. Neural Networks, 2(5): 359-366 (1989).
- [22] R. P. Lippmann, An introduction to computing with neural nets. IEEE ASSP Magazine, 4(2): 4-22 (1987).
- [23] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8): 2554-2558 (1982).
To memorialize my respected teacher, Prof. Guanggui Bao, who ever gave me great encouragement, instruction, and care in my research and life.