Artificial neural networks perform very well on classification problems. They are known to be able to linearly separate almost all input sets efficiently. However, it is not generally understood how artificial neural networks obtain this separation so efficiently, which makes it difficult to choose a suitable network for a particular dataset. Hence, it would be useful if, given a training dataset and a chosen activation function, one could analytically derive how many layers and nodes are necessary and sufficient for achieving linear separability on the training set. Some steps in this direction have already been taken.
In an2015 it has been shown for rectified linear activation functions that two hidden layers suffice for linearly separating any number of (finite) datasets (which also follows from universality) and that the number of nodes per layer can be determined using disjoint convex hull decompositions. yuan2003 have provided estimates for the number of nodes per layer in a two-layer network based on information entropy. fujita1998 has done the same based on statistics, by adding extra nodes one by one. Another approach, by kurkova1997, is to calculate how well a function can be approximated using a fixed number of nodes. Recently, a paper (ShwartzZiv2017) has appeared that uses the information plane and the information bottleneck to understand the inner workings of neural networks. baum1988 has shown that a single-layer network can approximate a random dichotomy with only units for an arbitrary set of points in general position in dimensions. He also makes the link to the Vapnik-Chervonenkis dimension of the network. In this work we do not use statistics to estimate the number of nodes, but rather simple algebra to obtain an absolute upper bound, in the spirit of an2015 and baum1988. In contrast to an2015, we will obtain this bound for multiple activation functions, and in contrast to baum1988, the bound will hold for arbitrary finite sets.
It is well-known that neural networks with one hidden layer
are universal approximators (e.g. hornik1989; arteaga2013, or more recently sonoda2017). However, even though we know there must exist a network that can linearly separate two arbitrary finite sets, we do not know which one it is. Choosing the wrong kind of network can lead to severe overfitting and reduced performance on the test set yuan2003. Therefore, it is useful to have an upper bound on the number of nodes, which can aid in choosing an appropriate network for a task. With this in mind, we aim to give a theoretical upper bound on the number of nodes of a network with two hidden layers that is easily computable for any finite input sets that need to be separated.
The rest of this work is organized as follows: In Section 2 we repeat some of the definitions from an2015 and give a direct extension of two of their theorems for which their proof does not need to be changed. In Section 3 we present our main theorem, which generalizes the two theorems from Section 2 to a larger class of activation functions. In Section 4 we add some corollaries and refer to an extension to multiple sets that is given in an2015; we also provide an algorithm to estimate the upper bound on the number of nodes. We show simulation results that support our claims in Section 5 and conclude with some final remarks in Section 6.
2 Achieving linear separability
We want to emphasize that the following definitions and theorems (Definition 1, Theorems 3 and 5 and Corollary 12) are due to an2015 and are repeated here for convenience. We took the liberty of adapting some of these definitions for clarity and of giving slightly stronger versions of their Theorems 4 and 5 in Theorems 3 and 5, which follow directly from the proof given by an2015.
Throughout the article, we will use the following notation and conventions: all sets of data points are finite. We use to denote a non-constant activation function that is always applied element-wise to its argument.
We define the convex hull of a set $S$ as the set of all convex combinations of the points in the set. In set notation: $\mathrm{conv}(S) = \left\{ \sum_i \lambda_i x_i : x_i \in S,\ \lambda_i \ge 0,\ \sum_i \lambda_i = 1 \right\}$.
We will now first define what is meant by a disjoint convex hull decomposition. $\mathbb{R}$ is the set of real numbers.
Let , , be disjoint, finite sets in . A decomposition of , with is called a disjoint convex hull decomposition if the unions of the convex hulls of ,
are still disjoint. I.e. for all : For an illustration see Figure 1B.
Since we are interested in finite sets, we can always define a disjoint convex hull decomposition (just take every point as a singleton, giving ). Such a decomposition is not unique. In practice we find decompositions with smaller ’s using the algorithm in Section 5. The following definition concerns two sets, but can easily be extended to multiple sets by applying it pairwise.
If , and are called linearly separable. If or , and are called convexly separable. If all disjoint convex hull decompositions of and satisfy , and are called convexly inseparable.
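To make these separability notions concrete, here is a small illustrative sketch (the function name and data are ours, not from the paper): the perceptron algorithm converges exactly when two finite sets are linearly separable, so it gives a simple computational test.

```python
import numpy as np

def linearly_separable(A, B, max_epochs=1000):
    """Heuristic test of linear separability: the perceptron algorithm
    converges iff the two finite sets are linearly separable
    (here capped at max_epochs passes over the data)."""
    X = np.vstack([A, B]).astype(float)
    y = np.array([1] * len(A) + [-1] * len(B))
    X = np.hstack([X, np.ones((len(X), 1))])  # absorb the bias term
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = False
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:   # misclassified or on the boundary
                w += yi * xi
                mistakes = True
        if not mistakes:             # a separating hyperplane was found
            return True
    return False                     # inconclusive: likely not separable

# Two clusters on opposite sides of x = 0 are linearly separable ...
A = np.array([[1.0, 0.0], [2.0, 1.0]])
B = np.array([[-1.0, 0.0], [-2.0, -1.0]])
print(linearly_separable(A, B))   # True

# ... whereas the XOR corners are not.
C = np.array([[0.0, 0.0], [1.0, 1.0]])
D = np.array([[0.0, 1.0], [1.0, 0.0]])
print(linearly_separable(C, D))   # False
```

Note that the XOR corners are convexly separable: the singleton decomposition already has disjoint convex hulls.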
We start by giving a generalization of Theorem 4 from an2015. Instead of considering a rectified linear classifier activation function, we consider the more general class of functions that are zero for non-positive arguments and strictly positive for positive arguments. We will call these functions semi-positive. Notice that such a function can behave arbitrarily on the positive half-line, as long as it remains positive there. This generalization is straightforward and the proofs do not need to be adapted, but they are given here for easy reference.
Let and be two convexly separable sets, with a finite number of points in . Say, and with such that for each . Let be linear classifiers of and such that for all
Let , and . Here is a semi-positive function that is applied component-wise. Then and are linearly separable. For this we need affine transformations.
For all we have that . Now, for an , there exists a such that . So, there exists a such that . Therefore, each has components greater than or equal to zero and at least one component that is strictly greater than zero. This means . We used transformations to create and . ∎
The initial sets that the network needs to separate are denoted by , see Figure 1A. After applying a linear classifier to the initial sets, these will be denoted by , such that after applying the transformation on all , we get , see Figure 1C. When we apply the activation function to elements in , we denote the resulting set by , shown in Figure 1D (the constant should be taken for now). This means that a neural network with a single hidden layer with nodes can transform into . The following theorem is a generalization of Theorem 5 from an2015. Again, this is straightforward and does not require any changes to the proof. The theorem will make use of the following lemma.
Two finite sets are linearly separable if and only if there exists a one-dimensional projection that maps the sets to linearly separable sets.
Suppose we have two sets that are linearly separable. Let be the hyperplane that separates the data. Project the data onto the axis orthogonal to the hyperplane. Under this projection, the hyperplane collapses into a point that lies at the threshold between the two separated sets. Conversely, if we have a one-dimensional projection of the two sets and a threshold , let be the hyperplane orthogonal to the projection axis containing . Then the sets will be linearly separated by . ∎
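The forward direction of the lemma can be illustrated numerically (a sketch with synthetic data and an assumed hyperplane, all names ours): points separated by the hyperplane {x : w·x = c} project along w into one-dimensional sets separated by the same threshold c.

```python
import numpy as np

rng = np.random.default_rng(0)
w, c = np.array([1.0, -2.0, 0.5]), 0.3   # assumed separating hyperplane w.x = c

# Build two random clouds and push them to opposite sides of the hyperplane.
A = rng.normal(size=(50, 3))
A += (1.0 + np.maximum(0.0, c - A @ w))[:, None] * w / (w @ w)  # force A to w.x > c
B = rng.normal(size=(50, 3))
B -= (1.0 + np.maximum(0.0, B @ w - c))[:, None] * w / (w @ w)  # force B to w.x < c

pA, pB = A @ w, B @ w            # the one-dimensional projections
print(pA.min() > c > pB.max())   # True: separated by the threshold c in 1-D
```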
Let and be finite and convexly inseparable. Let be linear classifiers of and such that for all
Let and . Let , and for . Also, let . Here is again semi-positive. Then and are disjoint, so and are convexly separable. For this we need nodes.
Define and . Notice that these sets are projections of and . Apply Theorem 3 on , and their images and under the transformation . Then we have
With Lemma 4, we then also have that
Since , we have . Therefore and are convexly separable. We needed linear transformations to separate a single part of from all parts of . So in total we need transformations to create and . ∎
3 A general upper bound
For a given and a fixed , let be increasing with a left asymptote to zero and . Then such that and .
Choose and such that and . Let . Then . Let , then . ∎
Lemma 6 puts a constraint on the speed with which the function increases (near ). We need . So if we move by , the value of the function will be multiplied by . Notice that this is a very rapidly growing function. If a function does not satisfy this constraint, we find that there is a minimum distance needed between the and . Lemma 6 also requires the function to have a left asymptote to zero; however, we can shift an activation function with a different left asymptote so that this holds, and then shift it back later using Corollary 13. This way the lemma, and therefore Theorems 7 and 8, hold for all commonly used activation functions. We will compute the distance for the sigmoid, hyperbolic tangent, rectified linear function and leaky rectified linear function in Corollary 16.
We will from now on define , where is the smallest distance between the convex hulls of two sets.
Let and be two convexly separable sets, with a finite number of points in . So and with such that for each . Let be increasing with a left asymptote to zero and define as in Equation 1, such that
For satisfying this inequality, let be linear classifiers of and such that for all
Let , and . Then and are linearly separable.
Choose, using Lemma 6, an and such that for all we have and for all we have . For all we have . So . Therefore, is contained in a positive hypercube . For all there is a such that . For we have that . So for at least one coordinate, and all other coordinates are larger than . Therefore, , see Figure 1D. The convex hulls of these two sets can be separated by the hyperplane . This is because the convex hull of is contained in the hypercube with edges and the convex hull of is bounded by the separating hyperplane. ∎
Let and be finite and convexly inseparable. Let be increasing with a left asymptote to zero, define as in Equation 1 such that . For satisfying this inequality, let be linear classifiers of and such that for all
Let and . Let , and for . Also, let . Then and are convexly separable.
So we see that both Theorems 3 and 5 can be generalized to increasing functions with a left asymptote to zero. We still need two layers with and nodes respectively. However, we also need a minimal separation between the convex hulls of the two sets (in Euclidean distance) after applying the first linear transform. We formalize this in the following theorem:
Given disjoint finite sets and with a disjoint convex hull decomposition with and sets in the partitions, and given an increasing activation function with a left asymptote to zero, we can linearly separate and using an artificial neural network with an input layer, a layer with hidden nodes, a layer with hidden nodes and an output layer.
We can assume and are convexly inseparable and have linear classifiers as in Theorem 5. The corresponding is always greater than zero, and scales with . Since we can scale such that . Then apply Theorem 8. We need affine transformations for separating the parts of and the parts of . Then is applied to all transformations. A neural network can do this by learning the weights and biases of the affine transformations and then applying . Now we have pairs of convexly separable sets, which can be linearly separated using Theorem 7. For each we need to find an affine plane that separates from . This means we have to learn affine transformations before applying , which can be done by a neural network with nodes. Now we have two linearly separable sets, which can be separated by using a linear classifier as the output layer, which proves the theorem. ∎
Note that this proof implies that we can separate and independent of the distance between and , so independent of . The learning algorithm should be able to scale the weights and biases such that the sets can be separated no matter how small was originally.
We can also prove a similar theorem for the leaky rectified linear activation function, which does not have a left asymptote to zero. However, we need to prove Lemma 10 first. The diameter of a set is defined as .
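For a finite set, the diameter used in Lemma 10 can be computed by brute force over all pairs of points (a sketch with made-up data, names ours):

```python
import numpy as np

def diameter(S):
    """Largest pairwise Euclidean distance within the finite set S."""
    diff = S[:, None, :] - S[None, :, :]          # all pairwise differences
    return float(np.sqrt((diff ** 2).sum(axis=2)).max())

S = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
print(diameter(S))   # 5.0, from the pair (0,0)-(3,4)
```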
Suppose , as in Equation 1, and for and for , where . Then is reached at .
We have four cases:
, then :
, then :
for increasing to zero.
, then :
which is an increasing function on the interval . Therefore the infimum will be at .
, then :
Since cases (a) and (d) are equal, and since , we see that the infimum is attained at the value . ∎
Now we are ready to prove the following theorem for leaky rectified linear functions. Because of Lemma 10 we can assume .
Let . Then we have . And we have . For all we know . Therefore for all we have and for all we have that and there exists a such that . See Figure 1E. Therefore the convex hulls of and are disjoint. ∎
We will not prove a version of Theorem 5 for the leaky rectified linear activation function, because it is straightforward and the proof is the same as that of Theorem 5. We can conclude that for leaky rectified linear activation functions, a network consisting of two layers with and nodes respectively can achieve linear separability. If is not large enough, there are two options: the network could learn to scale the weights and biases appropriately, or the function could be adjusted manually by increasing the fraction . In the next section we will explore some consequences of these results. We will also provide a way to calculate and .
4 Corollaries and a practical algorithm
We can generalize the results from Section 3 to any number of sets (Corollary 12) by building on the analogous result in Sections 3.4 and 3.5 of an2015. After stating this result, we will show that we can apply any translation to the function in the above theorems while retaining their validity (Corollary 13). Then we will provide a cheap way to estimate and in the disjoint convex hull decomposition (Algorithm 1), and we will calculate for the most commonly used activation functions (Corollary 16).
We can generalize the theorems still a little more by showing that they also hold for translated versions of an activation function that satisfies the constraints.
Define . Then and . Therefore the theorem holds for . Adding a constant to the linearly separable sets and does not affect their separability. So the theorem holds for .
Let be a translated version of . If we apply we see that we could just subtract from to get back to the original theorem. Therefore left and right translation of functions is allowed. ∎
We need a way to estimate and for arbitrary datasets. Since it is difficult to decompose the sets in a high dimensional space, we found a way to do it in a low dimensional space. This will allow for a rough upper bound on and but does not guarantee that the smallest disjoint convex hull decomposition can be found.
If we have a disjoint convex hull decomposition of a projection of our dataset, this partition will also form a disjoint convex hull decomposition of the original dataset.
Suppose and are -dimensional projections of and . Assume we have disjoint convex hull decompositions and such that . Now take such that and such that . Then we see that . Therefore . So we can conclude that . So also and are a disjoint convex hull decomposition. ∎
We can estimate the number of sets in the convex hull decomposition using Lemma 14 as follows: take a random projection of the datasets, preferably a one-dimensional projection, and find the disjoint convex hull decomposition of this projection alone. This is easy in one dimension, as it can be done by counting how often one switches from one set to the other when traversing the projection. This number is an upper bound for and , but a very coarse one, as we are using a random projection. It is therefore necessary to repeat this for many more random projections and take the minimum over and . We use this procedure in Algorithm 1 to find a reasonable estimate for and . When projecting onto the difference of the means of the two sets instead of a random direction, we may find a large portion of either set at the extreme ends, as in Figure 2; we exploit this in our algorithm. We can prove that this algorithm will actually give a disjoint convex hull decomposition.
Algorithm 1 gives a disjoint convex hull decomposition of the input sets and has complexity of order where is the size of the input sets.
Define and as the parts of and that are outside and . Call the overlap . Notice that and have disjoint convex hulls because their projections have disjoint convex hulls, with Lemma 4. Within the overlap we can again compute new and and we call the parts of and that are in and outside and , and . We call the overlap . Continue in this way to obtain for . Then and . For all we have that and . Therefore, and . Equivalently for , and . So then have disjoint convex hulls. When the while-loop terminates, we may still have a non-empty overlap. So suppose . Then using a random projection, we find a disjoint convex hull decomposition of . The convex hulls of these sets will all be contained within the convex hull of and therefore disjoint from the convex hulls of all previously found sets. Therefore, the algorithm gives a disjoint convex hull decomposition of the input sets.
The algorithm has complexity . The worst-case scenario for the while-loop contributes a factor , and computing the inner products also contributes a factor . It is fair to mention that the running time also depends on the dimension of the data and on the number of random projections used; both can be quite large, and the number of random projections needs to be significantly larger than the dimension to get good results. Also notice that adding count to and in lines 20 and 21 is naïve and can easily be improved. ∎
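The switch-counting step at the heart of the random-projection estimate can be sketched as follows (a simplified illustration of the idea, not a reimplementation of Algorithm 1; all names and data are ours):

```python
import numpy as np

def count_blocks(proj, labels):
    """Number of maximal same-label runs when the points are sorted
    along a 1-D projection: the number of parts this projection needs,
    i.e. an upper bound on l + m for the decomposition."""
    order = np.argsort(proj)
    runs = labels[order]
    return 1 + int(np.sum(runs[1:] != runs[:-1]))

def estimate_parts(A, B, n_proj=200, seed=0):
    """Rough upper bound on l + m, minimized over random projections."""
    rng = np.random.default_rng(seed)
    X = np.vstack([A, B])
    labels = np.array([0] * len(A) + [1] * len(B))
    best = len(X)                       # singletons always give a valid bound
    for _ in range(n_proj):
        w = rng.normal(size=X.shape[1])
        best = min(best, count_blocks(X @ w, labels))
    return best

A = np.array([[0.0, 0.0], [5.0, 0.0]])  # two clusters of one class ...
B = np.array([[2.5, 0.0]])              # ... with the other class in between
print(estimate_parts(A, B))             # 3: the decomposition {A1}, {B}, {A2}
```

Ties in the projected values can make this count over-optimistic, which is one reason the estimate is only a rough upper bound.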
For a dataset with a convex hull decomposition in sets, recall that the minimal distance between the and is , see Equation 1. For the sigmoid, the minimal needed for separation equals . For a shifted hyperbolic tangent, the minimal equals . For the ReLU, the minimal equals zero. For the leaky rectified linear activation function, . Or equivalently, if , then we are able to separate any two sets with a leaky rectified linear activation function.
Note that proving that the limit becomes smaller than implies that the infimum also becomes smaller than . For practical purposes we will use the limit in this proof.
The sigmoid function is written as $\sigma(x) = \frac{1}{1 + e^{-x}}$. If we calculate and then take the limit , we see that Equation 2 goes to . To get this smaller than , we need .
Hyperbolic tangent. We start by writing a shifted hyperbolic tangent out in terms of exponentials. If we calculate
Rectified linear function. We did not need any in the proof for the rectified linear function, so the minimal equals zero.
Leaky rectified linear activation function. With Lemma 10 we get:
where denotes the leaky rectified linear function. To get this smaller than we need . ∎
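The limit argument for the sigmoid can be checked numerically (our own illustration): far to the left the sigmoid behaves like $e^x$, so shifting the argument by $d$ multiplies the function value by a factor approaching $e^d$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Near the left asymptote, sigmoid(x) ~ e^x, so the growth factor that a
# shift by d buys, sigmoid(x + d) / sigmoid(x), converges to e^d.
d = 2.0
for x in (-5.0, -15.0, -30.0):
    print(sigmoid(x + d) / sigmoid(x))   # approaches e^2 from below
print(math.exp(d))                        # the limit, e^2 ≈ 7.389
```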
5 Experimental validation
We could validate the theory by showing that a network of the estimated size can in fact learn to classify the two training sets perfectly. For this we would need a proper estimate of and , but it is difficult to get a tight approximation. We would also need a perfect training framework, which of course does not exist. So, working with the tools we have, we show an estimate for and provided by the algorithm. It is a good estimate, but it can definitely be improved. We train the network using stochastic gradient descent for a large number of epochs. The loss converges to a number close to zero, but does not become zero; we believe this is caused by imperfect training.
We tested the ideas in Sections 3 and 4 empirically. We trained several networks with different sizes and activation functions on the first two classes (number classes 0 and 1) of the MNIST dataset (MNIST). We calculated the minimal distance between these two sets and found . This is a sufficient distance for any of the activation functions we used, which means the network is able to use weights close to one. Next we estimated and . For this dataset with more than 12000 data points, we found and . That would mean that a network with nodes in the first and nodes in the second layer would be sufficient to linearly separate the data in the two sets.
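The minimal distance between the two classes can be estimated as the smallest pairwise point distance (our own sketch with made-up points; note that this only upper-bounds the true distance between the convex hulls, whose exact computation requires a quadratic program):

```python
import numpy as np

def min_pairwise_distance(A, B):
    """Smallest Euclidean distance between any point of A and any point of B.
    This upper-bounds the distance between conv(A) and conv(B)."""
    diff = A[:, None, :] - B[None, :, :]          # all cross-set differences
    return float(np.sqrt((diff ** 2).sum(axis=2)).min())

A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 2.0]])
print(min_pairwise_distance(A, B))   # 3.0, from the pair (0,0)-(3,0)
```

For a dataset the size of MNIST this is a quadratic-cost computation, but it is done only once.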
Several hidden layer sizes were tested. All networks had a depth of three, hence four layers of nodes – an input layer with 784 nodes, two hidden layers with the sizes mentioned before, and an output layer with two nodes which acts as a classifier. Linear separability, as discussed in this paper, precisely means that this output layer can classify the input sets perfectly.
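The architecture just described can be sketched as a plain forward pass (illustrative widths only; in the theory the two hidden layers would have the sizes derived from the decomposition, and the actual experiments used Chainer):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def init_layer(n_in, n_out):
    return rng.normal(scale=n_in ** -0.5, size=(n_in, n_out)), np.zeros(n_out)

# Depth three: 784-dim input, two hidden layers, 2-node linear classifier.
sizes = [784, 32, 8, 2]                  # hidden widths are placeholders
params = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """input -> two ReLU hidden layers -> linear two-way output layer."""
    for W, b in params[:-1]:
        x = relu(x @ W + b)
    W, b = params[-1]
    return x @ W + b                     # the output layer stays linear

logits = forward(rng.normal(size=(5, 784)))
print(logits.shape)                      # (5, 2)
```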
We compared the ReLU, sigmoid, leaky ReLU and tanh networks trained for 150 epochs using stochastic gradient descent optimization. For the leaky ReLU the slope was set to the standard value of 0.2. We implemented the linear classifier multi-layer perceptron in the neural network framework Chainer v2.0 (Chainer). We regard the training capabilities of this framework as a black box sufficient for our simulation needs. The results are displayed in Figure 3.
Indeed, as expected, the network with the hidden layer sizes estimated from the proposed theoretical analysis (i.e. ) performs very well. We see clearly that the losses barely decrease for larger networks. The error is not yet zero for the predicted network, but this may be explained by imperfect training. The ReLU network performs poorly for the smallest network. This may be explained by the fact that the ReLU maps a lot of information to zero, even though it has the smallest of the tested activation functions. The sigmoid consistently has a larger loss than the other functions. This is not necessarily predicted by the theory, since the sigmoid’s is only a factor larger than the hyperbolic tangent’s . Also interesting is the very good performance of the leaky ReLU network. This could be caused by it not mapping a lot of information to zero, unlike the ReLU, as well as by having two options to compensate for the .
All activation functions seem to indicate that there exists a slightly smaller network that can achieve linear separability on the test set. A better algorithm for determining and can probably confirm this.
In conclusion we can say that in theory we are now able to find a network with two hidden layers that will perfectly solve any finite problem. In practice we see that the training error does not decrease to zero. We believe that this is caused by imperfect training.
The practical contribution of this article is heuristic. It is widely believed that deep neural networks need fewer nodes in total than shallow neural networks to solve the same problem. Our theory presents an upper bound on the number of nodes that a shallow neural network needs to solve a certain classification problem; a deep neural network will therefore not need more nodes. The theory gives neither an optimal architecture nor a minimum on the number of nodes. Still, it is useful to have an inkling of the correct network size for solving a certain problem.
Contrary to what an2015 claim, their theory does not show why ReLU networks have a superior performance. We extended their theory to all commonly used activation functions. Only the leaky rectified linear networks seem to be at a disadvantage, but test results show the opposite. We think the differences between the functions may be caused by the scaling that needs to be done during learning. The linear functions and also the hyperbolic tangent are very easy to scale. Tweaking the sigmoid to the best slope can be quite difficult.
Some issues that we have not addressed in this article are worth mentioning. For example, we cannot make any statements about the generalization performance of the networks. It is generally known that a network with too many parameters will not generalize well, so it is wise to use a network that is as small as possible, or even a bit smaller. This paper contributes an estimate of the number of nodes that is an absolute maximum: it should never be necessary to use more nodes than this estimate. We give an upper bound rather than a necessary number of nodes; a bound that is both necessary and sufficient would be optimal, but that is a much harder problem to solve.
Another problem is that we do not know what will happen if we use too few nodes. The number of nodes that we estimated will guarantee linear separability. If the number of nodes is too small to achieve linear separability, performance on the training set will be reduced, but it is difficult to say anything about performance on the test set. We also do not know what will happen to the number and distribution of nodes as we increase the number of layers. An extension of the theory to an arbitrary number of layers would be very interesting.
Furthermore, in the simulations we cannot guarantee that the learning algorithm achieves zero error, even though it is possible in theory. The reason is that the algorithm does not always find the absolute minimum. Therefore it is hard to judge from the results whether the predicted network size is performing as expected.
Even though we already find small and , more elaborate simulations could use another algorithm to find the convex hull decomposition. Random projections are cheap to use, but they will always find a pair , such that . (We found since no random projections were necessary.) This is a serious constraint, because the first layer of the network consists of nodes and will therefore always be very large if and are of similar size. An idea would be to use a method based on higher-dimensional projections. It is also not guaranteed that Algorithm 1 performs well on other input sets; a better algorithm might perform well on all types of input sets.
The results show strikingly good performance of the leaky ReLU activation; more research is needed to understand why this is the case. There is clearly more to the performance of a neural network than revealed in this article. Still, having an estimate of sufficient network sizes for certain activation functions is an important result. It would also be interesting to see the effect of the slope of the leaky ReLU and of the distance between the datasets on the performance of the network.
This paper provides a heuristic explanation of why ReLU and perhaps leaky ReLU networks are easier to train than tanh and sigmoid networks. We give an upper bound on the number of nodes needed to achieve linear separability on the training set for feedforward networks with two hidden layers. It is still unclear how this generalizes to more layers, which poses an interesting question for further research. Furthermore, our theory does not yet address convolutional networks; however, it does represent a foundation for exploring their superior performance in an extension of this work.