Deep systems are believed to play an important role in information processing of intelligent agents. A common hypothesis underlying this belief is that deep models can be exponentially more efficient at representing some functions than their shallow counterparts (see Bengio, 2009).
The argument is usually a compositional one. Higher layers in a deep model can re-use primitives constructed by the lower layers in order to build gradually more complex functions. For example, on a vision task, one would hope that the first layer learns Gabor filters capable of detecting edges of different orientations. These edges are then put together at the second layer to form part-of-object shapes. On higher layers, these part-of-object shapes are combined further to obtain detectors for more complex part-of-object shapes or objects. Such behaviour is empirically illustrated, for instance, in Zeiler and Fergus (2013) and Lee et al. (2009). On the other hand, a shallow model has to construct detectors of target objects based only on the detectors learnt by the first layer.
The representational power of computational systems with shallow and deep architectures has been studied intensively. A well-known result by Hajnal et al. (1993) derived lower complexity bounds for shallow threshold networks. Other works have explored the representational power of generative models based on Boltzmann machines (Montúfar et al., 2011; Martens et al., 2013; Sutskever and Hinton, 2008; Le Roux and Bengio, 2010; Montúfar and Ay, 2011), or have compared mixtures and products of experts models (Montúfar and Morton, 2012).
In addition to such inspections, a wealth of evidence for the validity of this hypothesis comes from deep models consistently outperforming shallow ones on a variety of tasks and datasets (see, e.g., Goodfellow et al., 2013; Hinton et al., 2012b, a). However, theoretical results on the representational power of deep models are limited, usually due to the composition of nonlinear functions in deep models, which makes mathematical analysis difficult. Up to now, theoretical results have focussed on circuit operations (neural net unit computations) that are substantially different from those being used in real state-of-the-art deep learning applications, such as logic gates (Håstad, 1986), linear + threshold units with non-negative weights (Håstad and Goldmann, 1991) or polynomials (Bengio and Delalleau, 2011). Bengio and Delalleau (2011) show that deep sum-product networks (Poon and Domingos, 2011) can use exponentially fewer nodes than shallow ones to express some families of polynomials.
The present note analyzes the representational power of deep MLPs with rectifier units. Rectifier units (Glorot et al., 2011; Nair and Hinton, 2010), and piecewise linearly activated units in general, such as the maxout unit (Goodfellow et al., 2013), are becoming popular choices in designing deep models, and most current state-of-the-art results involve using one of these activations (Goodfellow et al., 2013; Hinton et al., 2012b). Glorot et al. (2011) show that rectifier units have several properties that make the optimization problem easier than the more traditional case using smooth and bounded activations, such as tanh or sigmoid.
In this work we take advantage of the piecewise linear nature of the rectifier unit to mathematically analyze the behaviour of deep rectifier MLPs. Given that the model is a composition of piecewise linear functions, it is itself a piecewise linear function. We compare the flexibility of a deep model with that of a shallow model by counting the number of linear regions they define over the input space for a fixed number of hidden units. This is the number of pieces available to the model in order to approximate some arbitrary nonlinear function. For example, if we want to perfectly approximate some curved boundary between two classes, a rectifier MLP will have to use infinitely many linear regions. In practice we have a finite number of pieces, and if we assume that we can perfectly learn their optimal slopes, then the number of linear regions becomes a good proxy for how well the model approximates this boundary. In this sense, the number of linear regions is an upper bound for the flexibility of the model. In practice, the linear pieces are not independent and the model may not be able to learn the right slope for each linear region. Specifically, for deep models there is a correlation between regions, which results from the sharing of parameters between the functions that describe the output on each region.
This is by no means a negative observation. If all the linear regions of the deep model were independent of each other, by having many more linear regions, deep models would grossly overfit. The correlation of the linear regions of a deep model results in its ability to generalize, by allowing it to better represent only a small family of structured functions. These are functions that look complicated (e.g., a distribution with a huge number of modes) but that have an underlying structure that the network can ‘compress’ into its parameters. The number of regions, which indicates the number of variations that the network can represent, provides a measure of how well it can fit this family of structured functions (whose approximation potentially needs infinitely many linear regions).
We believe that this approach, based on counting the number of linear regions, is extensible to any other piecewise linear activation function and also to other architectures, including the maxout activation and convolutional networks with rectifier activations.
We know the maximal number of regions of linearity of functions computable by a shallow model with a fixed number of hidden units. This number is given by a well studied geometrical problem. The main insight of the present work is to provide a geometrical construction that describes the regions of linearity of functions computed by deep models. We show that in the asymptotic regime, these functions have many more linear regions than the ones computed by shallow models, for the same number of hidden units.
For the single layer case, each hidden unit divides the input space in two, whereby the boundary is given by a hyperplane. For all input values on one side of the hyperplane, the unit outputs a positive value. For all input values on the other side of the hyperplane, the unit outputs zero. Therefore, the question that we are asking is: Into how many regions do hyperplanes split space? This question is studied in geometry under the name of hyperplane arrangements, with classic results such as Zaslavsky’s theorem. Section 3 provides a quick introduction to the subject.
For the multilayer version of the model we rely on the following intuition. By using the rectifier nonlinearity, we identify multiple regions of the input space which are mapped by a given layer into an equivalent set of activations and thus represent equivalent inputs for the next layers. That is, a hidden layer can perform a kind of ‘or’ operation by reacting similarly to several different inputs. Any subsequent computation made on these activations is replicated on all equivalent inputs.
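The replication intuition can be illustrated on a toy one-dimensional network. The sketch below is not the construction used later in the paper; the weights, the helper names (`toy_network`, `count_linear_pieces`) and the evaluation grid are illustrative assumptions. The first layer computes rect(x) and rect(−x), whose sum is |x|, so the inputs x and −x are identified. The second layer then draws a three-piece zigzag as a function of |x|, and that zigzag reappears on both sides of the origin.

```python
def rect(s):
    """Rectifier: max(0, s)."""
    return max(0.0, s)


def toy_network(x):
    """Hypothetical 1-2-3-1 rectifier MLP with hand-picked weights."""
    # First hidden layer: two units with weights +1 and -1, zero bias.
    h1, h2 = rect(x), rect(-x)
    # The second layer reads the first layer only through h1 + h2 = |x|,
    # so x and -x produce identical second-layer activations.
    a = h1 + h2
    u0, u1, u2 = rect(a), rect(a - 1.0), rect(a - 2.0)
    # Linear output: slope +1, then -1, then +1 as a function of |x|.
    return u0 - 2.0 * u1 + 2.0 * u2


def count_linear_pieces(f, lo=-300, hi=300, scale=100.0):
    """Count maximal runs of constant slope of f on the grid i / scale.

    Assumes every breakpoint of f lies exactly on a grid point.
    """
    xs = [i / scale for i in range(lo, hi + 1)]
    slopes = [round((f(b) - f(a)) / (b - a), 6) for a, b in zip(xs, xs[1:])]
    return 1 + sum(1 for s, t in zip(slopes, slopes[1:]) if s != t)


if __name__ == "__main__":
    print(count_linear_pieces(toy_network))
```

On the interval [−3, 3] the three pieces defined on |x| are mirrored, giving six linear pieces in x.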
This paper is organized as follows. In Section 2 we provide definitions and basic observations about piecewise linear functions. In Section 3 we discuss rectifier networks with one single hidden layer and describe their properties in terms of hyperplane arrangements, which are fairly well known in the literature. In Section 4 we discuss deep rectifier networks and prove our main result, Theorem 1, which describes their complexity in terms of the number of regions of linearity of functions that they represent. Details about the asymptotic behaviour of the results derived in Sections 3 and 4 are given in Appendix A. In Section 5 we analyze a special type of deep rectifier MLP and show that even for a small number of hidden layers it can generate a large number of linear regions. In Section 6 we offer a discussion of the results.
We consider classes of functions (models) defined in the following way.
A rectifier feedforward network is a layered feedforward network, or multilayer perceptron (MLP), as shown in Fig. 1, with the following properties. Each hidden unit receives as inputs the real valued activations $x_1, \ldots, x_n$ of all units in the previous layer, computes the weighted sum
$$s = \sum_{i=1}^{n} w_i x_i + b,$$
and outputs the rectified value
$$\operatorname{rect}(s) = \max\{0, s\}.$$
The real parameters $w_1, \ldots, w_n$ are the input weights and $b$ is the bias of the unit. The output layer is a linear layer, that is, the units in the last layer compute a linear combination of their inputs and output it unrectified.
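The definition above can be written out directly. The following is a minimal pure-Python sketch, assuming weights are stored as lists of rows; the function names (`rectifier_unit`, `rectifier_layer`, `linear_layer`, `forward`) are illustrative and not part of the paper.

```python
def rect(s):
    """Rectified value max(0, s)."""
    return max(0.0, s)


def rectifier_unit(x, w, b):
    """One hidden unit: weighted sum of the inputs plus bias, then rectified."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return rect(s)


def rectifier_layer(x, W, b):
    """A layer of rectifier units; W holds one weight row per unit."""
    return [rectifier_unit(x, row, bi) for row, bi in zip(W, b)]


def linear_layer(x, W, b):
    """Linear output layer: the weighted sums are returned unrectified."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(x, hidden_layers, output_layer):
    """Evaluate a rectifier feedforward network.

    hidden_layers is a list of (W, b) pairs, output_layer a single (W, b).
    """
    h = list(x)
    for W, b in hidden_layers:
        h = rectifier_layer(h, W, b)
    W_out, b_out = output_layer
    return linear_layer(h, W_out, b_out)
```

For instance, `forward([1.0, -1.0], [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])], ([[1.0, 1.0]], [0.5]))` passes the input through an identity-weight rectifier layer, which zeroes the negative coordinate, and then through a single linear output unit.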
Given a vector of naturals $\mathbf{n} = (n_0, n_1, \ldots, n_L)$, we denote by $\mathcal{F}_{\mathbf{n}}$ the set of all functions $\mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ that can be computed by a rectifier feedforward network with $n_0$ inputs and $n_l$ units in layer $l$ for $l = 1, \ldots, L$. The elements of $\mathcal{F}_{\mathbf{n}}$ are continuous piecewise linear functions.
We denote by $\mathcal{R}(\mathbf{n})$ the maximum of the number of regions of linearity or response regions over all functions from $\mathcal{F}_{\mathbf{n}}$. For clarity, given a function $f\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$, a connected open subset $R \subseteq \mathbb{R}^{n_0}$ is called a region of linearity or linear region or response region of $f$ if the restriction $f|_R$ is a linear function and for any open set $\tilde{R} \supsetneq R$ the restriction $f|_{\tilde{R}}$ is not a linear function. In the next sections we will compute bounds on $\mathcal{R}(\mathbf{n})$ for different choices of $\mathbf{n}$. We are especially interested in the comparison of shallow networks with one single very wide hidden layer and deep networks with many narrow hidden layers.
In the remainder of this section we state three simple lemmas.
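The next lemma states that a piecewise linear function $f = (f_i)_{i \in [k]}$, where $[k] = \{1, \ldots, k\}$, has as many regions of linearity as there are distinct intersections of regions of linearity of the coordinates $f_i$.
Lemma 1. Consider a width $k$ layer of rectifier units. Let $\mathcal{R}^i = \{R^i_1, \ldots, R^i_{N_i}\}$ be the regions of linearity of the function $f_i\colon \mathbb{R}^{n_0} \to \mathbb{R}$ computed by the $i$-th unit, for all $i \in [k]$. Then the regions of linearity of the function $f = (f_i)_{i \in [k]}\colon \mathbb{R}^{n_0} \to \mathbb{R}^{k}$ computed by the rectifier layer are the elements of the set $\{R^1_{j_1} \cap \cdots \cap R^k_{j_k} : (j_1, \ldots, j_k) \in [N_1] \times \cdots \times [N_k]\}$.
Proof. A function is linear iff all its coordinates are. ∎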
In regard to the number of regions of linearity of the functions represented by rectifier networks, the number of output dimensions, i.e., the number of linear output units, is irrelevant. This is the statement of the next lemma.
Lemma 2. The number of (linear) output units of a rectifier feedforward network does not affect the maximal number of regions of linearity that it can realize.
Proof. Let $f\colon \mathbb{R}^{n_0} \to \mathbb{R}^{k}$ be the map of inputs to activations in the last hidden layer of a deep feedforward rectifier model. Let $h = g \circ f$ be the map of inputs to activations of the output units, given by composition of $f$ with the linear output layer, $g(z) = W^{(\mathrm{out})} z + b^{(\mathrm{out})}$. If the row span of $W^{(\mathrm{out})}$ is not orthogonal to any difference of gradients of neighbouring regions of linearity of $f$, then $h$ captures all discontinuities of $\nabla f$. In this case both functions $f$ and $h$ have the same number of regions of linearity.
If the number of regions of $f$ is finite, then the number of differences of gradients is finite and there is a vector outside the union of their orthogonal spaces. Hence a matrix $W^{(\mathrm{out})}$ with a single row (a single output unit) suffices to capture all transitions between different regions of linearity of $f$. ∎
Lemma 3. A layer of $n$ rectifier units with $n_0$ inputs can compute any function that can be computed by the composition of a linear layer with $n_0$ inputs and $n_0'$ outputs and a rectifier layer with $n_0'$ inputs and $n$ outputs, for any $n_0, n_0', n \in \mathbb{N}$.
Proof. A rectifier layer computes functions of the form $x \mapsto \operatorname{rect}(Wx + b)$, with $W \in \mathbb{R}^{n \times n_0}$ and $b \in \mathbb{R}^{n}$. The argument $Wx + b$ is an affine function of $x$. The claim follows from the fact that any composition of affine functions is an affine function. ∎
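The absorption argument can be checked numerically. The sketch below assumes an intermediary affine map z = Ux + c followed by a rectifier layer rect(Vz + d); the names `U`, `c`, `V`, `d` and the helper `absorb` are illustrative choices, not notation fixed by the lemma. It builds the single rectifier layer with weights VU and bias Vc + d and compares both computations on one input.

```python
def rect_vec(v):
    """Apply the rectifier coordinate-wise."""
    return [max(0.0, s) for s in v]


def affine(M, x, t):
    """Return M x + t for a matrix M (list of rows) and vectors x, t."""
    return [sum(m * xi for m, xi in zip(row, x)) + ti for row, ti in zip(M, t)]


def matmul(A, B):
    """Plain matrix product A B for lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def absorb(V, d, U, c):
    """Fold the affine map x -> U x + c into the rectifier layer z -> rect(V z + d).

    Returns (W, b) such that rect(W x + b) equals rect(V (U x + c) + d).
    """
    W = matmul(V, U)
    b = [sum(v * cj for v, cj in zip(row, c)) + di for row, di in zip(V, d)]
    return W, b


# Hypothetical example: 3 inputs, 2 intermediary linear units, 3 rectifier units.
U = [[1.0, 2.0, 0.0], [0.0, -1.0, 1.0]]
c = [0.5, -0.5]
V = [[1.0, 1.0], [2.0, -1.0], [0.0, 3.0]]
d = [0.0, 1.0, -1.0]
W, b = absorb(V, d, U, c)

x = [1.0, 2.0, 3.0]
two_stage = rect_vec(affine(V, affine(U, x, c), d))
one_stage = rect_vec(affine(W, x, b))
```

Both `two_stage` and `one_stage` hold the same activations, which is the content of the lemma for this particular choice of weights.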
3 One hidden layer
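Let us look at the number of response regions of a single hidden layer MLP with $n_0$ input units and $n_1$ hidden units. We first formulate the rectifier unit as follows:
$$\operatorname{rect}(s) = I(s) \cdot s,$$
where $I$ is the indicator function defined as:
$$I(s) = \begin{cases} 1, & \text{if } s > 0, \\ 0, & \text{otherwise.} \end{cases}$$
We can now write the single hidden layer MLP with $n_y$ outputs as the function $f\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_y}$;
$$f(x) = W^{(\mathrm{out})} \operatorname{diag}\big(I(W_{1:} x + b_1), \ldots, I(W_{n_1:} x + b_{n_1})\big)\,(W x + b) + b^{(\mathrm{out})}.$$
From this formulation it is clear that each unit $i$ in the hidden layer has two operational modes. One is when the unit takes value $0$ and one when it takes a non-zero value. The boundary between these two operational modes is given by the hyperplane $H_i$ consisting of all inputs $x \in \mathbb{R}^{n_0}$ with $W_{i:} x + b_i = 0$. Below this hyperplane, the activation of the unit is constant equal to zero, and above, it is linear with gradient equal to $W_{i:}$. It follows that the number of regions of linearity of a single layer MLP is equal to the number of regions formed by the set of hyperplanes $\{H_i\}_{i \in [n_1]}$.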
A finite set of hyperplanes in a common $n_0$-dimensional Euclidean space is called an $n_0$-dimensional hyperplane arrangement. A region of an arrangement $\mathcal{A}$ is a connected component of the complement of the union of the hyperplanes, i.e., a connected component of $\mathbb{R}^{n_0} \setminus \bigcup_{H \in \mathcal{A}} H$. To make this clearer, consider an arrangement $\mathcal{A}$ consisting of hyperplanes $H_i = \{x \in \mathbb{R}^{n_0} : W_{i:} x + b_i = 0\}$ for all $i \in [n_1]$, for some $W \in \mathbb{R}^{n_1 \times n_0}$ and some $b \in \mathbb{R}^{n_1}$. A region of $\mathcal{A}$ is a set of points of the form $R = \{x \in \mathbb{R}^{n_0} : \operatorname{sgn}(Wx + b) = s\}$ for some sign vector $s \in \{-, +\}^{n_1}$.
A region of an arrangement is relatively bounded if its intersection with the space spanned by the normals of the hyperplanes is bounded. We denote by $r(\mathcal{A})$ the number of regions and by $b(\mathcal{A})$ the number of relatively bounded regions of an arrangement $\mathcal{A}$. The essentialization of an arrangement $\mathcal{A} = \{H_i\}_i$ is the arrangement consisting of the hyperplanes $H_i \cap N$ for all $i$, defined in the span $N$ of the normals of the hyperplanes $H_i$. For example, the essentialization of an arrangement of two non-parallel planes in $\mathbb{R}^3$ is an arrangement of two lines in a plane.
How many regions are generated by an arrangement of $n_1$ hyperplanes in $\mathbb{R}^{n_0}$?
We will only need the special case of hyperplanes in general position, which realize the maximal possible number of regions. Formally, an $n_0$-dimensional arrangement $\mathcal{A}$ is in general position if for any subset $\{H_1, \ldots, H_p\} \subseteq \mathcal{A}$ the following holds. (1) If $p \le n_0$, then $\dim(H_1 \cap \cdots \cap H_p) = n_0 - p$. (2) If $p > n_0$, then $H_1 \cap \cdots \cap H_p = \emptyset$. An arrangement is in general position if the weights $W$, $b$ defining its hyperplanes are generic. This means that any arrangement can be perturbed by an arbitrarily small perturbation in such a way that the resulting arrangement is in general position.
For arrangements in general position, Zaslavsky’s theorem can be stated in the following way (see Stanley, 2004, Proposition 2.4).
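Proposition 1. Let $\mathcal{A}$ be an arrangement of $m$ hyperplanes in general position in $\mathbb{R}^{n_0}$. Then the number of regions $r(\mathcal{A})$ and the number of relatively bounded regions $b(\mathcal{A})$ are
$$r(\mathcal{A}) = \sum_{s=0}^{n_0} \binom{m}{s}, \qquad b(\mathcal{A}) = \binom{m-1}{n_0}.$$
In particular, the number of regions of a $2$-dimensional arrangement $\mathcal{A}_m$ of $m$ lines in general position is equal to
$$r(\mathcal{A}_m) = \binom{m}{2} + m + 1. \tag{4}$$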
For the purpose of illustration, we sketch a proof of eq. (4) using the sweep hyperplane method. We proceed by induction over the number of lines $m$.
Base case $m = 0$. It is obvious that in this case there is a single region, corresponding to the entire plane. Therefore, $r(\mathcal{A}_0) = 1$.
Induction step. Assume that for $m$ lines the number of regions is $r(\mathcal{A}_m) = \binom{m}{2} + m + 1$, and add a new line $L_{m+1}$ to the arrangement. Since we assumed the lines are in general position, $L_{m+1}$ intersects each of the existing lines $L_k$ at a different point. Fig. 2 depicts the situation for $m = 2$.
The $m$ intersection points split the line $L_{m+1}$ into $m + 1$ segments. Each of these segments cuts a region of $\mathcal{A}_m$ in two pieces. Therefore, by adding the line $L_{m+1}$ we get $m + 1$ new regions. In Fig. 2 the two intersection points result in three segments, each of which splits one of the existing regions in two. Hence
$$r(\mathcal{A}_{m+1}) = r(\mathcal{A}_m) + m + 1 = \frac{m(m-1)}{2} + m + 1 + m + 1 = \frac{m(m+1)}{2} + (m+1) + 1 = \binom{m+1}{2} + (m+1) + 1.$$
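The induction can be replayed numerically. The sketch below, with illustrative function names, builds the count for lines in the plane by adding one line at a time, each new line contributing as many regions as it has segments, and compares the result with the closed form and with the general binomial sum for hyperplanes in general position.

```python
from math import comb


def regions_general_position(m, n0):
    """Regions of m hyperplanes in general position in n0-dimensional space."""
    return sum(comb(m, s) for s in range(n0 + 1))


def regions_of_lines_by_sweep(m):
    """Regions of m lines in general position, counted as in the induction."""
    regions = 1  # base case: no lines, the whole plane is one region
    for existing in range(m):
        # The new line meets the `existing` lines in distinct points, so it
        # is cut into existing + 1 segments, each splitting one region.
        regions += existing + 1
    return regions


if __name__ == "__main__":
    for m in range(6):
        print(m, regions_of_lines_by_sweep(m), comb(m, 2) + m + 1)
```

The two columns printed for each `m` agree, and both coincide with `regions_general_position(m, 2)`.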
For the number of response regions of MLPs with one single hidden layer we obtain the following.
Proposition 2. The regions of linearity of a function computed by a single hidden layer model with $n_0$ inputs and $n_1$ hidden units are given by the regions of an arrangement of $n_1$ hyperplanes in $n_0$-dimensional space. The maximal number of regions of such an arrangement is $\sum_{j=0}^{n_0} \binom{n_1}{j}$.
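As a sanity check of this count, one can enumerate the activation patterns that a small hidden layer produces on a grid of inputs. The sketch below is a rough illustration under stated assumptions: the three hyperplanes (x = 0, y = 0 and x + y = 1.1), the grid, and the function names are all hypothetical choices, and grid sampling can only give a lower estimate because small regions may be missed.

```python
from math import comb


def activation_pattern(x, W, b):
    """Tuple of booleans: which hidden units are active at input x."""
    return tuple(
        sum(wi * xi for wi, xi in zip(row, x)) + bi > 0 for row, bi in zip(W, b)
    )


def count_patterns_on_grid(W, b, half_width=12, step=0.25):
    """Number of distinct activation patterns seen on a 2-D grid.

    Grid points are offset by step / 2 so that none lies on x = 0 or y = 0.
    """
    offset = step / 2
    patterns = set()
    for i in range(-half_width, half_width):
        for j in range(-half_width, half_width):
            x = (i * step + offset, j * step + offset)
            patterns.add(activation_pattern(x, W, b))
    return len(patterns)


def max_regions(n0, n1):
    """Binomial sum for n1 hyperplanes in general position in dimension n0."""
    return sum(comb(n1, j) for j in range(n0 + 1))


# Three lines in general position in the plane (hypothetical weights).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -1.1]

if __name__ == "__main__":
    print(count_patterns_on_grid(W, b), max_regions(2, 3))
```

For these three lines the grid is fine enough to hit every region, so the number of observed patterns matches the binomial sum; the eighth sign pattern (first two units inactive, third active) corresponds to an empty set of inputs.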
4 Multiple hidden layers
In order to show that a model with several hidden layers can be more expressive than a single hidden layer one with the same number of hidden units, we will need the next three propositions.
Proposition 3. Any arrangement can be scaled down and shifted such that all regions of the arrangement intersect the unit ball.
Proof. Let $\mathcal{A}$ be an arrangement and let $S$ be a ball of radius $r$ and center $c$. Let $d$ be the supremum of the distance from the origin to a point in a bounded region of the essentialization of the arrangement $\mathcal{A}$. Consider the map $\phi\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_0}$ defined by $\phi(x) = \frac{r}{2d}\, x + c$. Then $\mathcal{A}' = \phi(\mathcal{A})$ is an arrangement satisfying the claim. It is easy to see that any point with norm bounded by $d$ is mapped to a point inside the ball $S$. ∎
The proposition is illustrated in Fig. 3.
We need some additional notation in order to formulate the next proposition. Given a hyperplane $H_i = \{x : W_{i:} x + b_i = 0\}$, we consider the region $H_i^{+} = \{x : W_{i:} x + b_i > 0\}$, and the region $H_i^{-} = \{x : W_{i:} x + b_i < 0\}$. If we think about the corresponding rectifier unit, then $H_i^{+}$ is the region where the unit is active and $H_i^{-}$ is the region where the unit is dead.
Let $R$ be a region delimited by the hyperplanes $\{H_1, \ldots, H_{n_1}\}$. We denote by $R^{+} \subseteq \{1, \ldots, n_1\}$ the set of all hyperplane-indices $j$ with $R \subset H_j^{+}$. In other words, $R^{+}$ is the list of hidden units that are active (non-zero) in the input-space region $R$.
The following proposition describes the combinatorics of $2$-dimensional arrangements in general position. More precisely, the proposition describes the combinatorics of $n_0$-dimensional arrangements with $2$-dimensional essentialization in general position. Recall that the essentialization of an arrangement is the arrangement that it defines in the subspace spanned by the normals of its hyperplanes.
The proposition guarantees the existence of input weights and bias for a rectifier layer such that for any list of consecutive units, there is a region of inputs for which exactly the units from that list are active.
Proposition 4. For any $n_0, n_1 \in \mathbb{N}$, $n_0 \ge 2$, there exists an $n_0$-dimensional arrangement $\mathcal{A}$ of $n_1$ hyperplanes such that for any pair $a, b \in [n_1]$ with $a < b$, there is a region $R$ of $\mathcal{A}$ whose set of active units is $\{a, a+1, \ldots, b\}$.
We show that the hyperplanes of a $2$-dimensional arrangement in general position can be indexed in such a way that the claim of the proposition holds. For higher dimensional arrangements the statement follows trivially, applying the $2$-dimensional statement to the intersection of the arrangement with a $2$-subspace.
Proof of Proposition 4.
Consider first the case $n_0 = 2$. We define the first line $L_1$ of the arrangement to be the x-axis of the standard coordinate system. To define the second line $L_2$, we consider a circle $S_1$ of radius $r \in \mathbb{R}_{+}$ centered at the origin. We define $L_2$ to be the tangent of $S_1$ at an angle $\alpha_1$ to the y-axis, where $0 < \alpha_1 < \frac{\pi}{2}$. The top left panel of Fig. 4 depicts the situation. In the figure, $R_{\emptyset}$ corresponds to inputs for which no rectifier unit is active, $R_1$ corresponds to inputs where the first unit is active, $R_2$ to inputs where the second unit is active, and $R_{12}$ to inputs where both units are active. This arrangement has the claimed properties.
Now assume that there is an arrangement of $n$ lines with the claimed properties. To add an $(n+1)$-th line, we first consider the maximal distance $d_{\max}$ from the origin to the intersection of two lines $L_i \cap L_j$ with $1 \le i < j \le n$. We also consider the radius-$(d_{\max} + r)$ circle $S_n$ centered at the origin. The circle $S_n$ contains all intersections of any of the first $n$ lines. We now choose an angle $\alpha_n$ with $0 < \alpha_n < \alpha_{n-1}$ and define $L_{n+1}$ as the tangent of $S_n$ that forms an angle $\alpha_n$ with the y-axis. Fig. 4 depicts adding the third and fourth line to the arrangement.
After adding line $L_{n+1}$, we have that the arrangement
(1) is in general position.
(2) has regions $R_1, \ldots, R_{n+1} \subseteq \mathbb{R}^2$ whose sets of active units are $\{i, i+1, \ldots, n+1\}$ for all $i \in [n+1]$.
The regions of the arrangement are stable under perturbation of the angles and radii used to define the lines. Any slight perturbation of these parameters preserves the list of regions. Therefore, the arrangement is in general position.
The second property comes from the order in which $L_{n+1}$ intersects all previous lines. $L_{n+1}$ intersects the lines in the order in which they were added to the arrangement: $L_1, L_2, \ldots, L_n$. The intersection of $L_{n+1}$ and $L_i$, $B_{i,n+1} = L_{n+1} \cap L_i$, is above the lines $L_{i+1}, L_{i+2}, \ldots, L_n$, and hence the segment between the intersection with $L_i$ and with $L_{i+1}$ has to cut the region in which only units $i+1$ to $n$ are active.
The intersection order is ensured by the choice of angles $\alpha_i$ and the fact that the lines are tangent to the circles $S_i$. For any $i < j$ and $B_{ij} = L_i \cap L_j$ let $M_{ij}$ be the line parallel to the y-axis passing through $B_{ij}$. Each line $M_{ij}$ divides the space in two. Let $G_{ij}$ be the half-space to the right of $M_{ij}$. Within any half-space $G_{ij}$, the intersection $G_{ij} \cap L_i$ is above $G_{ij} \cap L_j$, because the angle of $L_i$ with the y-axis is larger than that of $L_j$ (this means $L_j$ has a steeper decrease). Since $L_{n+1}$ is tangent to the circle that contains all points $B_{ij}$, the line $L_{n+1}$ will intersect lines $L_i$ and $L_j$ in $G_{ij}$, and therefore it has to intersect $L_i$ first.
For $n_0 > 2$ we can consider an arrangement that is essentially $2$-dimensional and has the properties of the arrangement described above. To do this, we construct a $2$-dimensional arrangement in a $2$-subspace of $\mathbb{R}^{n_0}$ and then extend each of the lines $L_i$ of the arrangement to a hyperplane $H_i$ that crosses the $2$-subspace orthogonally along $L_i$. The resulting arrangement satisfies all claims of the proposition. ∎
The next proposition guarantees the existence of a collection of affine maps with shared bias, which map a collection of regions to a common output.
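Proposition 5. Consider two integers $n_0$ and $p$. Let $S$ denote the $n_0$-dimensional unit ball and let $R_1, \ldots, R_p \subseteq \mathbb{R}^{n_0}$ be some regions with non-empty interiors. Then there is a choice of weights $c \in \mathbb{R}^{n_0}$ and $U_1, \ldots, U_p \in \mathbb{R}^{n_0 \times n_0}$ for which $g_i(R_i) \supseteq S$ for all $i \in [p]$, where $g_i\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_0}$; $y \mapsto U_i y + c$.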
To see this, consider the following construction. For each region consider a ball of radius and center . For each , consider positive numbers such that for all . This can be done fixing equal to and solving the equation for all other numbers. Let be such that for any and . Scaling each region by transforms the center of to the same point for all . By the choice of , the minor radius of all transformed balls is larger than .
We can now set to be minus the common center of the scaled balls, to obtain the map:
These maps satisfy the claimed property, namely that the image of each region contains the unit ball. ∎
Before proceeding, we discuss an example illustrating how the previous propositions and lemmas are put together to prove our main result below, in Theorem 1.
Consider a rectifier MLP with $n_0 = 2$, such that the input space is $\mathbb{R}^2$, and assume that the network has only two hidden layers, each consisting of $n = 2n'$ units. Each unit in the first hidden layer defines a hyperplane in $\mathbb{R}^2$, namely the hyperplane that separates the inputs for which it is active from the inputs for which it is not active. Hence the first hidden layer defines an arrangement of $n$ hyperplanes in $\mathbb{R}^2$. By Proposition 4, this arrangement can be made such that it delimits regions of inputs $R_1$, …, $R_{n'} \subseteq \mathbb{R}^2$ with the following property. For each input in any given one of these regions, exactly one pair of units in the first hidden layer is active, and, furthermore, the pairs of units that are active on different regions are disjoint.
By the definition of rectifier units, each hidden unit computes a linear function within the half-space of inputs where it is active. In turn, the image of $R_i$ by the pair of units that is active in $R_i$ is a polyhedron in $\mathbb{R}^2$. For each region $R_i$, denote the corresponding polyhedron by $S_i$.
Recall that a rectifier layer computes a map of the form $x \mapsto \operatorname{rect}(Wx + b)$. Hence a rectifier layer with $n$ inputs and $n$ outputs can compute any composition $f' \circ g$ of an affine map $g\colon \mathbb{R}^{n} \to \mathbb{R}^{k}$ and a map $f'$ computed by a rectifier layer with $k$ inputs and $n$ outputs (Lemma 3).
Consider the map computed by the rectifier units in the second hidden layer, i.e., the map that takes activations from the first hidden layer and outputs activations from the second hidden layer. We think of this map as a composition $f' \circ g$ of an affine map $g\colon \mathbb{R}^{n} \to \mathbb{R}^{2}$ and a map $f'$ computed by a rectifier layer with $2$ inputs. The map $g$ can be interpreted as an intermediary layer consisting of two linear units, as illustrated in Fig. 5.
Within each input region $R_i$, only two units in the first hidden layer are active. Therefore, for each input region $R_i$, the output of the intermediary layer is an affine transformation of $S_i$. Furthermore, the weights of the intermediary layer can be chosen in such a way that the image of each $R_i$ contains the unit ball.
Now, $f'$ is the map computed by a rectifier layer with $2$ inputs and $n$ outputs. It is possible to define this map in such a way that it has $\mathcal{R}$ regions of linearity within the unit ball, where $\mathcal{R}$ is the number of regions of a $2$-dimensional arrangement of $n$ hyperplanes in general position.
We see that the entire network computes a function which has $\mathcal{R}$ regions of linearity within each one of the input regions $R_1, \ldots, R_{n'}$. Each input region $R_i$ is mapped by the concatenation of first and intermediate (notional) layer to a subset of $\mathbb{R}^2$ which contains the unit ball. Then, the second layer computes a function which partitions the unit ball into many pieces. The partition computed by the second layer gets replicated in each of the input regions $R_i$, resulting in a subdivision of the input space in exponentially many pieces (exponential in the number of network layers).
Now we are ready to state our main result on the number of response regions of rectifier deep feedforward networks:
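Theorem 1. A model with $n_0$ inputs and $k$ hidden layers of widths $n_1, n_2, \ldots, n_k$ can divide the input space in $\left(\prod_{i=1}^{k-1} \left\lfloor \frac{n_i}{n_0} \right\rfloor\right) \sum_{i=0}^{n_0} \binom{n_k}{i}$ or possibly more regions.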
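To get a feeling for the numbers, the following sketch compares the deep and shallow counts for the same total number of hidden units. It assumes the bound of Theorem 1 has the form (∏_{i<k} ⌊n_i/n_0⌋) · Σ_{i=0}^{n_0} C(n_k, i) and that a single hidden layer with n units attains at most Σ_{j=0}^{n_0} C(n, j) regions; the function names are illustrative.

```python
from math import comb, prod


def shallow_max_regions(n0, n_hidden):
    """Maximal number of regions of a single hidden layer with n_hidden units."""
    return sum(comb(n_hidden, j) for j in range(n0 + 1))


def deep_lower_bound(n0, widths):
    """Assumed form of the Theorem 1 bound for hidden layer widths n_1..n_k."""
    *lower, last = widths
    # Each of the first k-1 layers replicates the partition floor(n_i / n0) times.
    replication = prod(n // n0 for n in lower)
    return replication * sum(comb(last, j) for j in range(n0 + 1))


if __name__ == "__main__":
    # 32 hidden units in total, 2 inputs.
    print(deep_lower_bound(2, [8, 8, 8, 8]), shallow_max_regions(2, 32))
    # 16 hidden units in total, 2 inputs.
    print(deep_lower_bound(2, [4, 4, 4, 4]), shallow_max_regions(2, 16))
```

With four layers of width 8 the deep bound (2368) exceeds the shallow maximum for 32 units (529), whereas with four layers of width 4 the deep bound (88) stays below the shallow maximum for 16 units (137). This is consistent with the advantage of depth being stated for the asymptotic regime.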
Proof of Theorem 1.
Let the first hidden layer define an arrangement like the one from Proposition 4. Then there are $p = \lfloor n_1 / n_0 \rfloor$ input-space regions $R_i \subseteq \mathbb{R}^{n_0}$, $i \in [p]$, with the following property. For each input vector from the region $R_i$, exactly $n_0$ units from the first hidden layer are active. We denote this set of units by $I_i$. Furthermore, by Proposition 4, for inputs in distinct regions $R_i$, the corresponding sets of active units are disjoint; that is, $I_i \cap I_j = \emptyset$ for all $i, j \in [p]$, $i \neq j$.
To be more specific, for input vectors from $R_1$, exactly the first $n_0$ units of the first hidden layer are active, that is, for these input vectors the value of $h^{(1)}_j$ is non-zero if and only if $j \in I_1 = \{1, \ldots, n_0\}$. For input vectors from $R_2$, only the next $n_0$ units of the first hidden layer are active, that is, the units with index in $I_2 = \{n_0 + 1, \ldots, 2 n_0\}$, and so on.
Now we consider a ‘fictitious’ intermediary layer consisting of $n_0$ linear units between the first and second hidden layers. As this intermediary layer computes an affine function, it can be absorbed into the second hidden layer (see Lemma 3). We use it only for making the next arguments clearer.
The map taking activations from the first hidden layer to activations from the second hidden layer is , where .
We can write the input and bias weight matrices as and , where , , and , .
The weights and describe the affine function computed by the intermediary layer, . The weights and are the input and bias weights of the rectifier layer following the intermediary layer.
We now consider the sub-matrix of consisting of the columns of with indices , for all . Then , where is the sub-matrix of consisting of its last columns. In the sequel we set all entries of equal to zero.
The map is thus written as the sum