A basic question in a rigorous study of neural networks is a precise characterization of the class of functions representable by neural networks with a certain activation function. The question is of fundamental importance because neural network functions are a popular hypothesis class in machine learning and artificial intelligence. Every aspect of learning using neural networks benefits from a better understanding of the function class: from the statistical aspect of understanding the bias introduced in the learning procedure by using a particular neural hypothesis class, to the algorithmic question of training, i.e., finding the “best” function in the class that extrapolates the given sample of data points.
We wish to argue otherwise. Knowledge of the finer structure of the function class obtained by using a particular activation function can be exploited advantageously. For example, the choice of a certain activation function may lead to much smaller networks that achieve the same bias compared to the hypothesis class given by another activation function, even though the universal approximation theorems guarantee that asymptotically both activation functions achieve arbitrarily small bias. As another example, one can design targeted training algorithms for neural networks with a particular activation function if the structure of the function class is better understood, as opposed to using a generic algorithm like some variant of (stochastic) gradient descent. This has recently led to globally optimal empirical risk minimization algorithms for rectified linear unit (ReLU) neural networks with specific architectures [2, 3, 5, 6] that are very different in nature from conventional approaches like (stochastic) gradient descent; see also [8, 10, 11, 12, 16, 7, 9].
Recent results of this nature have been obtained with ReLU neural networks. Any neural network with ReLU activations clearly gives a piecewise linear function. Conversely, any piecewise linear function on $\mathbb{R}^n$ can be exactly represented with at most $\lceil \log_2(n+1) \rceil$ hidden layers [2], thus characterizing the function class representable using ReLU activations. However, it remains an open question whether $\lceil \log_2(n+1) \rceil$ hidden layers are indeed needed. It is conceivable that all piecewise linear functions can be represented by 2 or 3 hidden layers. It is believed this is not the case and there is a strict hierarchy starting from 1 hidden layer, all the way to $\lceil \log_2(n+1) \rceil$ hidden layers. It is known that there are functions representable using 2 hidden layers that cannot be represented with a single hidden layer, but even the 2 versus 3 hidden layer question remains open. Some partial progress on this question can be found in [13].
In this paper, we study the class of functions representable using threshold activations (also known as the Heaviside activation, unit step activation, and McCulloch-Pitts neurons). It is easy to see that any function represented by such a neural network is a piecewise constant function. We show that any piecewise constant function can be represented by such a neural network and, surprisingly, contrary to what is believed to be true for ReLU activations, there is always a neural network with at most 2 hidden layers that does the job. We also establish that there are functions that cannot be represented by a single hidden layer, and thus one cannot do better than 2 hidden layers in general. Our constructions also show that the size of the neural network is at most linear in the number of “pieces” of the function, giving a relatively efficient encoding compared to recent results for ReLU activations, which give a polynomial size network only in the case of fixed input dimension. Finally, we use these insights to design an algorithm that solves the empirical risk minimization (training) problem for these neural networks to global optimality, with running time polynomial in the size of the data sample, assuming the input dimension and the network architecture are fixed. To the best of our knowledge, this is the first globally optimal training algorithm for any family of neural networks that works for arbitrary architectures and has computational complexity that is polynomial in the number of data points.
Another important message that we hope to advertise with this paper is that ideas from polyhedral geometry and discrete mathematics can be powerful tools for investigating important and fundamental questions in neural network research.
We next introduce necessary definitions and notation, followed by a formal statement of our results.
Definitions and notation. A polyhedral complex $\mathcal{P}$ is a collection of polyhedra having the following properties:
For every $P, Q \in \mathcal{P}$, $P \cap Q$ is a face of $P$ and $Q$.
Every face of a polyhedron in $\mathcal{P}$ belongs to $\mathcal{P}$.
We denote by the dimension of a polyhedron and by the relative interior of . will denote the number of polyhedra in a polyhedral complex and is called the size of .
Definition 1 (Piecewise constant function).
We say that a function is piecewise constant if there exists a finite polyhedral complex that covers and is constant in the relative interior of each polyhedron in the complex. We use as a shorthand for the piecewise constant function from to :
Note that there may be multiple polyhedral complexes that correspond to a given piecewise constant function, with possibly different sizes. For example, the indicator function of the nonnegative orthant is a piecewise constant function but there are many different ways to break up the complement of the nonnegative orthant into polyhedral regions. We say that a polyhedral complex is compatible with a piecewise constant function if is constant in the relative interior of every polyhedron in .
The threshold activation function is a map from to given by the indicator of the positive reals, i.e., . By extending this to apply coordinatewise, we get a function for any , i.e., is 1 if and only if for . For any subset , will denote its indicator function, i.e., if and otherwise.
Definition 2 (Linear threshold deep neural network (DNN)).
For any number of hidden layers , input and output dimensions , , a linear threshold DNN is given by specifying a sequence of k natural numbers representing widths of the hidden layers, a set of affine transformations , . Such a linear threshold DNN is called a -layer DNN, and is said to have hidden layers. The function computed or represented by this DNN is:
The size of the DNN, or the number of neurons in the DNN, is . For natural numbers and a tuple , we use to denote the family of all possible linear threshold DNNs with input dimension , hidden layers with widths and output dimension . will denote the family of all linear threshold activation neural networks with hidden layers. We say that a neural network computes a subset if and only if it represents the indicator function of that subset.
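As a minimal sketch of Definition 2, the forward pass below alternates affine maps with the coordinatewise threshold and applies a final affine map without activation. The representation of each layer as a `(weights, bias)` pair of plain lists, and the names `threshold`, `affine`, and `forward`, are illustrative assumptions rather than notation from the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]


def threshold(v: Sequence[float]) -> Vector:
    """Coordinatewise indicator of the positive reals: 1.0 if t > 0, else 0.0."""
    return [1.0 if t > 0 else 0.0 for t in v]


def affine(weights: Matrix, bias: Vector, x: Sequence[float]) -> Vector:
    """Compute weights @ x + bias using plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(layers: List[Tuple[Matrix, Vector]], x: Sequence[float]) -> Vector:
    """Evaluate a linear threshold DNN.

    Assumption: every affine map except the last is followed by the
    threshold activation; the last affine map is the output layer.
    """
    *hidden, output = layers
    h = list(x)
    for weights, bias in hidden:
        h = threshold(affine(weights, bias, h))
    return affine(output[0], output[1], h)


# Example: one hidden layer with two neurons representing the indicator
# of the open interval (0, 1) as 1[x > 0] + 1[1 - x > 0] - 1.
interval_net = [
    ([[1.0], [-1.0]], [0.0, 1.0]),  # hidden layer: x and 1 - x
    ([[1.0, 1.0]], [-1.0]),         # output layer: sum of neurons minus 1
]
```

With this encoding, `forward(interval_net, [0.5])` evaluates to `[1.0]`, while inputs at or outside the endpoints evaluate to `[0.0]`.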
Any function expressed by a linear threshold neural network is a piecewise constant function (i.e. for all natural numbers ), because a composition of piecewise constant functions is piecewise constant. In this work we show that linear threshold neural networks with 2 hidden layers can compute any piecewise constant function, i.e. . We also prove that this is optimal. More formally,
Theorem 1.1.
For all natural numbers ,
Equivalently, any piecewise constant function can be computed with a linear threshold DNN with at most hidden layers. Moreover, the DNN needs at most neurons, where is any polyhedral complex compatible with .
Next, we show that the bound on the size of the neural network in Theorem 1.1 is in a sense best possible, up to constant factors.
Proposition 1.
There is a family of functions such that for any function in the family, any linear threshold DNN representing has size at least the size of the smallest polyhedral complex, i.e., one with minimum size, compatible with .
Finally, we present a new algorithm to perform exact empirical risk minimization (ERM) for linear threshold neural networks with fixed architecture, i.e., fixed and . Given data points , the ERM problem with hypothesis class is
where is a convex loss function.
Theorem 1.2.
For natural numbers and tuple , there exists an algorithm that computes the global optimum (2) with running time . Thus, the algorithm is polynomial in the size of the data sample, if are considered fixed constants.
The rest of the article is organized as follows. In Section 2 we present our representability results for the class of linear threshold neural networks, including a proof of Theorem 1.1. In Section 3 we prove the lower bound stated in Proposition 1 using the structure of breakpoints of piecewise constant functions. Our ERM algorithm and the proof of Theorem 1.2 are presented in Section 4 with intermediate results. We conclude with a short discussion and open problems in Section 5.
2 Representability results
The following lemma is clear from the definitions.
Lemma 1.
Let and let be a compatible polyhedral complex. Then, there exists a unique set of real numbers such that
Proposition 2.
, i.e., linear threshold neural networks with a single hidden layer can compute any piecewise constant function . Moreover, if and is a polyhedral complex of , then can be computed with neurons.
Let be a piecewise constant function. Then using Lemma 1 there exists a polyhedral complex whose union is and such that is constant on the relative interior of each of the polyhedra. In , nonempty polyhedra are either reduced to a point, or they are intervals of the form , , with , or itself. We first show that we can compute the indicator function of the interior of each of those intervals with at most two neurons. The interior of , or can obviously be computed by one neuron (e.g. with for ). The last cases (singletons and polyhedra of the form ) require a more elaborate construction. To compute the function , it is sufficient to implement a Dirac function, since where is the Dirac in , i.e., . can be computed by a linear combination of three neurons, since is equal to . Using a linear combination of the basis functions (polyhedra and faces), we can compute exactly . To show that neurons suffice, is computed with one shared neuron, and then other neurons are needed at most for one polyhedron using our construction. ∎
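The one-dimensional construction can be checked numerically. The sketch below uses one identity that fits the description above (three threshold neurons, one of them a shared constant neuron); since the exact formula is not legible in this copy, the identity $\delta_a(x) = 1 - \mathbb{1}_{x > a} - \mathbb{1}_{x < a}$ and the function names are assumptions made for illustration.

```python
def sigma(t: float) -> int:
    """Threshold activation: indicator of the positive reals."""
    return 1 if t > 0 else 0


def constant_one(x: float) -> int:
    """Shared neuron with zero weight and bias 1, so it always outputs 1."""
    return sigma(0.0 * x + 1.0)


def dirac(x: float, a: float) -> int:
    """Indicator of the singleton {a} as a linear combination of three neurons.

    Assumed identity: 1 - 1[x > a] - 1[x < a].
    """
    return constant_one(x) - sigma(x - a) - sigma(a - x)


def open_interval(x: float, a: float, b: float) -> int:
    """Indicator of (a, b) for a < b, built from two neurons and a Dirac at b.

    1[x > a] - 1[x > b] is the indicator of (a, b]; subtracting the Dirac
    at b removes the right endpoint.
    """
    return sigma(x - a) - sigma(x - b) - dirac(x, b)
```

Because the constant neuron is shared across all pieces, each additional singleton or interval costs only the neurons that depend on its endpoints, which is consistent with the counting argument at the end of the proof.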
We next show that starting with two dimensions, linear threshold DNNs with a single hidden layer cannot compute every possible piecewise constant function.
Let . Then cannot be represented by any linear threshold neural network with one hidden layer.
Consider any piecewise constant function on represented by a single hidden layer neural network, say with , and . We may suppose that for all , either or . This implies that the set of nondifferentiable points of is a union of lines in . However, the set of nondifferentiable points of consists of the sides of the cube, which form a union of finite length line segments. This shows that cannot be represented by a single hidden layer linear threshold DNN. ∎
We will now build towards a proof of Theorem 1.1 which states that hidden layers actually suffice to represent any piecewise constant function in .
Let be a polyhedron in given by the intersection of halfspaces. Then, can be computed with a two hidden layer neural network and neurons in total.
Let be a polyhedron, i.e. with and . Let us consider the neurons , and . Then for all , if and only if . Now, defining yields . can obviously be computed with a neuron. Therefore, one can compute with neurons in the first hidden layer and one neuron in the second, which proves the result. ∎
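A small sketch of this two-hidden-layer construction follows. It assumes the polyhedron is given as $\{x : Ax \le b\}$, that each first-layer neuron fires exactly when its inequality is violated, and that the single second-layer neuron uses bias $1/2$ to test whether no violation occurred. These constants and the function name are illustrative choices, since the exact formulas are not legible in this copy.

```python
from typing import List, Sequence


def sigma(t: float) -> int:
    """Threshold activation: indicator of the positive reals."""
    return 1 if t > 0 else 0


def polyhedron_indicator(A: List[List[float]], b: List[float],
                         x: Sequence[float]) -> int:
    """Indicator of the closed polyhedron {x : A x <= b}.

    First hidden layer: one neuron per inequality, equal to 1 exactly
    when that inequality is violated (a_i . x - b_i > 0).
    Second hidden layer: one neuron that outputs 1 exactly when the
    number of violations is 0 (assumed bias 1/2 on the negated sum).
    """
    violations = [
        sigma(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i)
        for row, b_i in zip(A, b)
    ]
    return sigma(0.5 - sum(violations))


# Example: the unit square [0, 1]^2 described by four halfspaces.
SQUARE_A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
SQUARE_B = [1.0, 0.0, 1.0, 0.0]
```

The network uses one neuron per halfspace plus one, and boundary points are included because a tight inequality does not trigger its first-layer neuron.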
Let be a polyhedron in . Then the indicator function of its relative interior can be computed with a two hidden layer neural network, using the indicator of and the indicators of its faces.
Let be a polyhedron. First, we always have . Therefore it is sufficient to prove that we can implement for any . Using the inclusion exclusion principle on indicator functions, suppose that the facets of are , then:
It should be noted that for any is either empty, or a face of , hence a polyhedron of dimension less than or equal to . Therefore, using Lemma 2, we can implement with a two hidden layer neural network with at most neurons, where is the number of halfspaces in an inequality description of . If is the number of faces of , then there are at most polyhedra to compute. ∎
Combining these results, we can now provide a:
Proof of Theorem 1.1.
Thanks to Lemma 1, in order to represent , it is sufficient to compute the indicator function of the relative interior of each polyhedron in one of its compatible polyhedral complexes . This can be achieved with just two hidden layers using Lemma 3. This establishes the equalities in the statement of the theorem. The strict containment is given by Proposition 3.
Let be the total number of halfspaces used in an inequality description of all the polyhedra in the polyhedral complex . Since all faces are included in the polyhedral complex, there exists an inequality description with . The factor 2 appears because for each facet of a full dimensional polyhedron in , one may need both directions of the inequality that describes this facet. Then the construction in the proofs of Lemmas 2 and 3 shows that one needs at most neurons in the first hidden layer and at most neurons in the second hidden layer.∎
3 Proof of Proposition 1
Even though it is possible to represent any piecewise constant function using only two hidden layers, one may wonder if there is some advantage of using more hidden layers, for instance to decrease the number of neurons to compute a target function. We are unable to settle this question in general. However, we show that the linear bound in Theorem 1.1 cannot be improved in general. More precisely, we prove Proposition 1 in this section.
We first introduce the notion of breakpoint for piecewise constant functions.
Let . We say that is a breakpoint of if and only if for all the ball centered at with radius contains a point such that .
For any piecewise constant function , the breakpoints of are breakpoints of .
Let be a piecewise constant function and a breakpoint of . Then for any , there exists such that either and , or and . In both cases, so by definition is a breakpoint of . ∎
For any single neuron with a linear threshold activation with inputs, the output is the indicator of an open halfspace, i.e., for some and . We say that is the hyperplane associated with this neuron. This concept is needed in the next proposition.
Proposition 4.
The set of breakpoints of a function represented by a linear threshold DNN with any number of hidden layers is included in the union of hyperplanes associated with the neurons in the first hidden layer.
We give a proof by induction on the number of hidden layers. Recall that if and are the weights and bias of a neuron in the first layer, the corresponding hyperplane is .
Base case: For , let be a one hidden layer DNN with neurons, i.e., there exist and closed half-spaces such that:
Let be the hyperplanes associated to each . Then we claim that does not have any breakpoint in where are the hyperplanes associated to . To prove this formally, let be a point of . Then for each , is either in or . This means that belongs to an intersection of open sets, say where or . First, is an open set so there exists such that the ball centered at with radius is included in . Furthermore, by definition, is constant on , hence is constant on and cannot be a breakpoint. This proves that the breakpoints of are in .
Induction step: Let us suppose the statement is true for all neural networks with hidden layers, and consider a neural network with hidden layers. It should be noted that the output of a neuron in the last hidden layer is of the form
where is the piecewise function represented by a neural network of depth . Lemma 4 states that the breakpoints of are breakpoints of , a DNN of depth . Using the induction assumption, the breakpoints of belong to the hyperplanes introduced in the first layer. Hence the breakpoints of , which is a linear combination of such neurons, are included in the hyperplanes of the first layer. ∎
Proof of Proposition 1.
Let us construct a family of functions in . Let us consider the sets , for , and . Note that . Let be such that . It is easy to see that is a piecewise constant function and that the breakpoints of form a set of hyperplanes with empty pairwise intersections. By Proposition 4, any linear threshold neural network must have these hyperplanes associated with neurons in the first hidden layer, and therefore we must have at least neurons in the first hidden layer. ∎
4 Globally optimal empirical risk minimization
In this section we present an algorithm that trains a linear threshold DNN with fixed architecture to global optimality. Let us recall the corresponding optimization problem. Given data points , find the function represented by a -hidden layer DNN with widths that solves:
where is a convex loss function. We first present the idea with -hidden layers and then adapt the method to an arbitrary number of layers. The first step consists in enumerating the hyperplane partitions of in the first layer. It is actually sufficient to look at a finite number of those. Then, we prove that the number of possible configurations of neurons in the second layer is finite given the neurons of the first layer. The algorithm enumerates these configurations, and the corresponding polyhedral regions on which the function is constant, by simply looking at possible combinations of the neurons of the first layer. In the last step, for each polyhedral complex, regression is carried out on the weights of the last layer with an additional convex constraint.
Let be any natural number. We say a collection of subsets of is linearly separable if there exist such that any subset is in if and only if . Define
i.e., denotes the set of all linearly separable collections of subsets of .
We note that given a collection of subsets of one can test if is linearly separable by checking if the optimum value of the following linear program is strictly positive:
In Algorithm 1 below, we will enumerate through all possible collections in (for different values of ). We assume this has been done a priori using the linear programs above and this enumeration can be done in time during the execution of Algorithm 1.
In , is linearly separable, but is not linearly separable because the set of inequalities , , and have no solution. Two examples of linearly separable collections in are given in Figure 3.
Let and , and consider a DNN of . Any neuron of this neural network computes some subset of . Suppose we fix the weights of the neural network up to the -th hidden layer. This fixes the sets computed by the neurons in this layer.
Then a neuron in the -th layer of computes (by adjusting the weights and bias of this neuron) if and only if there exists a linearly separable collection of subsets of such that:
Let , be the weights and bias of the neuron in the -th layer. By definition, the set represented by this neuron is
We define the collection . By definition, is a linearly separable collection. Now let us consider the set:
Let . Then there exists such that and therefore, . This means that , hence . Now, let . Then . Let then and , hence , and .
Conversely, let be a linearly separable collection of subsets of . By definition there exists and such that . These are then taken as the weights of the neuron in the -th hidden layer and its output is the function. ∎
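The correspondence used in this proof can be checked on a toy instance. The sketch below assumes the statement takes the form "the neuron computes the union, over all index sets $S$ in a linearly separable collection, of the cells $\bigcap_{j \in S} X_j \cap \bigcap_{j \notin S} X_j^c$", where $X_j$ are the sets computed by the previous layer; this form is inferred from the proof text, and all names below are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Set


def neuron_set(sets: List[Set[int]], a: Sequence[float], b: float,
               universe: Set[int]) -> Set[int]:
    """Points on which a threshold neuron fires, given the sets X_j
    computed by the neurons of the previous layer."""
    return {d for d in universe
            if sum(a_j for a_j, X in zip(a, sets) if d in X) + b > 0}


def separable_collection(a: Sequence[float], b: float) -> Set[FrozenSet[int]]:
    """All index sets S with sum_{j in S} a_j + b > 0 (brute force)."""
    p = len(a)
    return {frozenset(S)
            for r in range(p + 1)
            for S in combinations(range(p), r)
            if sum(a[j] for j in S) + b > 0}


def union_of_cells(sets: List[Set[int]], collection: Set[FrozenSet[int]],
                   universe: Set[int]) -> Set[int]:
    """Union over S in the collection of the cell where exactly the
    neurons indexed by S fire."""
    result: Set[int] = set()
    for S in collection:
        cell = set(universe)
        for j, X in enumerate(sets):
            cell &= X if j in S else (universe - X)
        result |= cell
    return result
```

For any choice of weights, `neuron_set(...)` and `union_of_cells(sets, separable_collection(a, b), universe)` should agree, which is the first direction of the argument; the enumeration over all index sets is exponential in the layer width and is only meant for small examples.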
4.2 Proof of Theorem 1.2
To solve (2), we need to consider only the values of a function in on the data points ; the values the function takes outside these finitely many points are not relevant to the problem. Consider a neural network with hidden layers and widths that implements a function in . The output of any neuron on these data points is in and thus each neuron can be thought of as picking out a subset of the set . Proposition 5 provides a way to enumerate these subsets of in a systematic manner.
For any finite subset , a subset of is said to be linearly separable if there exists , such that .
The following is a well-known result in combinatorial geometry [17].
For any finite subset , there are at most linearly separable subsets.
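To make the finiteness concrete, the sketch below enumerates the linearly separable subsets in the simplest case of points on the real line, where a subset is taken to be linearly separable if it equals $\{x : ax + b > 0\}$ restricted to the point set. This one-dimensional restriction and the function name are assumptions for illustration; the sketch does not reproduce the general bound.

```python
from typing import FrozenSet, Iterable, Set


def linearly_separable_subsets_1d(points: Iterable[float]) -> Set[FrozenSet[float]]:
    """All subsets of a finite set of reals cut out by a strict halfline.

    It suffices to try one threshold below the minimum, one between each
    pair of consecutive distinct points, and one above the maximum, each
    with both orientations of the inequality.
    """
    xs = sorted(set(points))
    if not xs:
        return {frozenset()}
    cuts = ([xs[0] - 1.0]
            + [(u + v) / 2.0 for u, v in zip(xs, xs[1:])]
            + [xs[-1] + 1.0])
    subsets: Set[FrozenSet[float]] = set()
    for t in cuts:
        subsets.add(frozenset(x for x in xs if x - t > 0))  # a = 1, b = -t
        subsets.add(frozenset(x for x in xs if t - x > 0))  # a = -1, b = t
    return subsets
```

For three distinct points this yields six subsets out of the eight possible ones: the middle point alone and the two outer points together cannot be cut out by a halfline.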
By considering the natural mapping between subsets of and , we also obtain the following corollary.
For any , there are at most linearly separable collections of subsets of . In other words, .
Proof of Theorem 1.2.
Algorithm 1 solves (2). The correctness comes from the observation that a recursive application of Proposition 5 shows that the sets computed by the algorithm are all possible subsets of computed by the neurons in the last hidden layer. The are simply the weights of the last layer that combine the indicator functions of these subsets to yield the function value of the neural network on each data point. The convex minimization problem in line 13 finds the optimal values, for this particular choice of subsets . Selecting the minimum over all these choices solves the problem.
We now show that the exponential dependence on the dimension in Theorem 1.2 is actually necessary unless P=NP. We consider the version of (2) with a single neuron and show that it is NP-hard with a direct reduction.
(NP-hardness). The One-Node-Linear-Threshold problem is NP-hard when the dimension is considered part of the input. This implies in particular that Problem 2 is NP-hard when is part of the input.
We use [14, Theorem 3.1], which shows that the following decision problem is NP-complete.
Given disjoint sets of positive and negative examples of and a bound , does there exist a separating hyperplane which leads to at most misclassifications?
MinDis(Halfspaces) is a special case of (2) with a single neuron: given data points and , compute that minimizes . ∎
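The hardness concerns growing dimension; in fixed dimension the problem can be solved by enumeration. As a hedged illustration, the sketch below solves the single-neuron problem by brute force for one-dimensional inputs with labels in $\{0, 1\}$, counting misclassifications as in the decision problem above. The 0/1 objective, the restriction to dimension one, and the function name are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def min_misclassifications_1d(xs: Sequence[float],
                              labels: Sequence[int]) -> Tuple[int, float, float]:
    """Return (errors, w, b) minimizing the number of points with
    1[w * x + b > 0] != label over all single threshold neurons on the line.

    Only finitely many thresholds need to be tried: one below the minimum,
    one between consecutive distinct points, and one above the maximum.
    """
    distinct: List[float] = sorted(set(xs))
    cuts = ([distinct[0] - 1.0]
            + [(u + v) / 2.0 for u, v in zip(distinct, distinct[1:])]
            + [distinct[-1] + 1.0])
    best = None
    for t in cuts:
        for w, b in ((1.0, -t), (-1.0, t)):
            errors = sum((1 if w * x + b > 0 else 0) != y
                         for x, y in zip(xs, labels))
            if best is None or errors < best[0]:
                best = (errors, w, b)
    return best
```

The loop runs over a number of candidate neurons that is linear in the number of data points, which mirrors, in the simplest setting, why the running time is polynomial in the sample size once the dimension is fixed.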
5 Open questions
We showed that neural networks with linear threshold activations can represent any piecewise constant function with at most two layers and with a linear number of neurons with respect to the size of any polyhedral complex compatible with . Furthermore, we provided a family of functions for which this linear dependence cannot be improved. However, it is possible that there are other families of functions where the behaviour is different: by increasing depth, one can represent these functions using an exponentially smaller number of neurons, compared to what is needed with two layers. For instance, in the case of ReLU activations, there exist functions for which depth brings an exponential gain in the size of the neural network [18, 2]. We think it is a very interesting open question to determine if such families of functions exist for linear threshold networks.
On the algorithmic side, we solve the empirical risk minimization problem to global optimality with running time that is polynomial in the size of the data sample, assuming that the input dimension and the architecture size are fixed constants. The running time is exponential in terms of these parameters (see Theorem 1.2). While the exponential dependence on the input dimension cannot be avoided unless P = NP (see Theorem 4.2), another very interesting open question is to determine if the exponential dependence on the architectural parameters is really needed, or if an algorithm can be designed that has complexity which is polynomial in both the data sample and the architecture parameters. A similar question is also open in the case of ReLU neural networks.
Both authors gratefully acknowledge support from Air Force Office of Scientific Research (AFOSR) grant FA95502010341. The second author gratefully acknowledges support from National Science Foundation (NSF) grant CCF2006587.
- [1] Anthony, M., Bartlett, P.L.: Neural network learning: Theoretical foundations. Cambridge University Press (1999)
- [2] Arora, R., Basu, A., Mianjy, P., Mukherjee, A.: Understanding deep neural networks with rectified linear units. In: International Conference on Learning Representations (2018)
- [3] Boob, D., Dey, S.S., Lan, G.: Complexity of training ReLU neural network. Discrete Optimization p. 100620 (2020)
- [4] Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2(4), 303–314 (1989)
- [5] Dey, S.S., Wang, G., Xie, Y.: Approximation algorithms for training one-node ReLU neural networks. IEEE Transactions on Signal Processing 68, 6696–6706 (2020)
- [6] Ergen, T., Pilanci, M.: Global optimality beyond two layers: Training deep ReLU networks via convex programs. In: International Conference on Machine Learning. pp. 2993–3003. PMLR (2021)
- [7] Froese, V., Hertrich, C., Niedermeier, R.: The computational complexity of ReLU network training parameterized by data dimensionality. arXiv preprint arXiv:2105.08675 (2021)
- [8] Goel, S., Kanade, V., Klivans, A., Thaler, J.: Reliably learning the ReLU in polynomial time. In: Conference on Learning Theory. pp. 1004–1042. PMLR (2017)
- [9] Goel, S., Klivans, A., Manurangsi, P., Reichman, D.: Tight hardness results for training depth-2 ReLU networks. arXiv preprint arXiv:2011.13550 (2020)
- [10] Goel, S., Klivans, A., Meka, R.: Learning one convolutional layer with overlapping patches. In: International Conference on Machine Learning. pp. 1783–1791. PMLR (2018)
- [11] Goel, S., Klivans, A.R.: Learning neural networks with two nonlinear layers in polynomial time. In: Conference on Learning Theory. pp. 1470–1499. PMLR (2019)
- [12] Goel, S., Klivans, A.R., Manurangsi, P., Reichman, D.: Tight hardness results for training depth-2 ReLU networks. In: 12th Innovations in Theoretical Computer Science Conference (ITCS ’21). LIPIcs, vol. 185, pp. 22:1–22:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)
- [13] Hertrich, C., Basu, A., Di Summa, M., Skutella, M.: Towards lower bounds on the depth of ReLU neural networks. To appear in NeurIPS 2021 (arXiv preprint arXiv:2105.14835) (2021)
- [14] Hoffgen, K.U., Simon, H.U., Vanhorn, K.S.: Robust trainability of single neurons. Journal of Computer and System Sciences 50(1), 114–125 (1995)
- [15] Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), 251–257 (1991)
- [16] Manurangsi, P., Reichman, D.: The computational complexity of training ReLU(s). arXiv preprint arXiv:1810.04207 (2018)
- [17] Matousek, J.: Lectures on discrete geometry, vol. 212. Springer Science & Business Media (2013)
- [18] Telgarsky, M.: Benefits of depth in neural networks. In: Conference on Learning Theory. pp. 1517–1539. PMLR (2016)