Deep learning has significantly advanced our ability to address a wide range of difficult machine learning and signal processing problems. Today’s machine learning landscape is dominated by deep (neural) networks (DNs), which are compositions of a large number of simple parameterized linear and nonlinear transformations. Deep networks perform surprisingly well in a host of applications; however, surprisingly little is known about why or how they work so well.
Balestriero and Baraniuk (2018a, b) connected a large class of DNs to a special kind of spline, which enables one to view and analyze the inner workings of a DN using tools from approximation theory and functional analysis. In particular, when the DN is constructed using convex and piecewise affine nonlinearities (such as ReLU, leaky-ReLU, max-pooling, etc.), its layers can be written as Max-Affine Spline Operators (MASOs). An important consequence for DNs is that each layer partitions its input space into a set of regions and then processes inputs via a simple affine transformation that changes from region to region. Understanding the geometry of the layer partition regions, and how they combine into a global input partition for the entire DN, is thus key to understanding the operation of DNs.
There has been only limited work on the geometry of deep networks. The originating MASO work of Balestriero and Baraniuk (2018a, b) focused on the analytical form of the region-dependent affine maps and empirical statistics on the partition. The work of Wang et al. (2019) empirically studied this partitioning, highlighting the fact that knowledge of the DN partitioning alone is sufficient to reach high performance. Other works have focused on the properties of the partitioning, such as upper bounding the number of regions Montufar et al. (2014); Raghu et al. (2017); Hanin and Rolnick (2019). An explicit characterization of the input space partitioning of one-hidden-layer DNs with ReLU activation has been proposed in Zhang et al. (2016) by means of tropical geometry.
In this paper, we adopt a computational and combinatorial geometry Pach and Agarwal (2011); Preparata and Shamos (2012) perspective of MASO-based DNs to derive the analytical form of the input-space partition of a DN unit, a DN layer, and an entire end-to-end DN. We demonstrate that each MASO DN layer partitions its input feature map space according to a power diagram (PD) (also known as a Laguerre–Voronoi diagram) Aurenhammer and Imai (1988) with an exponentially large number of regions. Furthermore, the composition of the several MASO layers comprising a DN effects a subdivision process that creates the overall DN input-space partition.
Our complete, analytical characterization of the input-space and feature map partition of MASO DNs opens up new avenues to study the geometrical mechanisms behind their operation.
We summarize our contributions, which apply to any DN employing piecewise affine and convex nonlinearities, including fully connected, convolutional, and residual architectures:
We study the computational and combinatorial geometric properties of the layer and DN partitioning (Section 5.2). In particular, a DN can infer the PD region to which any input belongs with a computational complexity that is asymptotically logarithmic in the number of regions.
In the classification setting, we derive an analytical formula for the DN’s decision boundary in terms of the DN input space partitioning (Section 6). The analytical formula enables us to characterize certain geometrical properties of the boundary.
Additional background information plus proofs of the main results are provided in several appendices.
2 Background on Deep Networks and Max-Affine Spline Operators
A deep network (DN) is an operator with parameters that maps an input signal to the output prediction . Current DNs can be written as a composition of intermediate layer mappings () with that transform an input feature map into the output feature map with the initializations and . The feature maps
DN layers can be constructed from a range of different linear and nonlinear operators. One important linear operator is the fully connected operator that performs an arbitrary affine transformation by multiplying its input by the dense matrix
and adding the arbitrary bias vector as in . Further examples are provided in Goodfellow et al. (2016). Given the collection of linear and nonlinear operators making up a DN, the following definition yields a single, unique layer decomposition.
A DN layer comprises a single nonlinear DN operator composed with any (if any) preceding linear operators that lie between it and the preceding nonlinear operator.
Work from Balestriero and Baraniuk (2018a, b) connects DN layers with max-affine spline operators (MASOs). A MASO is an operator that concatenates independent max-affine splines Magnani and Boyd (2009); Hannah and Dunson (2013), with each spline formed from affine mappings. The MASO parameters consist of the “slopes” and the “offsets/biases” . (Footnote 1: The three subscripts of the slopes tensor correspond to output , partition region , and input signal index . The two subscripts of the offsets/biases tensor correspond to output and partition region .) Given the input , a MASO produces the output via
where denotes the dimension of . The key background result for this paper is that a very large class of DNs are constructed from MASO layers.
For example, a layer made of a fully connected operator followed by a leaky-ReLU with leakiness has parameters for the slope parameter and for the bias. A DN comprising MASO layers is a continuous affine spline operator with an input space partition and a partition-region-dependent affine mapping. However, little is known analytically about the input-space partition.
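As a minimal sketch of this example (not the authors' code), the following pure-Python snippet evaluates a MASO layer as a per-unit maximum over affine maps and checks numerically that a fully connected operator followed by a leaky-ReLU produces the same output. The function names and toy weights are hypothetical, and we assume a leakiness `eta` in [0, 1) so that the leaky-ReLU can be written as max(z, eta * z).

```python
def affine(w, b, x):
    """Return <w, x> + b for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def maso_layer(slopes, offsets, x):
    """MASO output: for each unit k, the maximum over its R affine maps.

    slopes[k][r] is a slope vector and offsets[k][r] a scalar offset.
    """
    return [max(affine(w, b, x) for w, b in zip(A_k, B_k))
            for A_k, B_k in zip(slopes, offsets)]


def leaky_relu_as_maso(W, b, eta):
    """MASO parameters of a fully connected map followed by a leaky-ReLU.

    Assumes 0 <= eta < 1: each unit gets the two slopes (w_k, eta * w_k)
    and the two offsets (b_k, eta * b_k).
    """
    slopes = [[list(w_k), [eta * w for w in w_k]] for w_k in W]
    offsets = [[b_k, eta * b_k] for b_k in b]
    return slopes, offsets


# Hypothetical toy layer: 2 units acting on a 2-dimensional input.
W = [[1.0, -2.0], [0.5, 0.25]]
b = [0.5, -1.0]
eta = 0.1
x = [0.3, 0.9]

# Direct evaluation: affine map, then leaky-ReLU written as max(z, eta * z).
pre = [affine(w_k, b_k, x) for w_k, b_k in zip(W, b)]
direct = [max(z, eta * z) for z in pre]

# Evaluation through the max-affine spline form.
slopes, offsets = leaky_relu_as_maso(W, b, eta)
via_maso = maso_layer(slopes, offsets, x)

agree = all(abs(u - v) < 1e-12 for u, v in zip(direct, via_maso))
```

Setting `eta = 0` in this sketch gives the ReLU case, where the second affine map of every unit is the zero map.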
This paper characterizes the geometry of the MASO partitions of the input space and the feature map spaces . We proceed by first studying the geometry of a single layer (Section 4.2) and then the composition of layers that forms a complete DN (Section 5). Voronoi diagrams and their generalization, power diagrams, play a key role in our analysis, and we turn to these next.
3 Background on Voronoi and Power Diagrams
A power diagram (PD), also known as a Laguerre–Voronoi diagram Aurenhammer and Imai (1988), is a generalization of the classical Voronoi diagram.
A PD partitions a space into disjoint regions such that , where each cell is obtained via , with
The parameter is called the centroid, while is called the radius. The PD is a generalization of the Voronoi diagram (VD) by introduction of the external radius term to the distance, leading to the Laguerre distance Imai et al. (1985). See Figure 1 for two equivalent geometric interpretations of a PD.
In general, a PD is defined with nonnegative radii to provide additional geometric interpretations (see Appendix A). However, the PD is the same under global shifting as . Thus, we allow for arbitrary radius since it can always be shifted back to nonnegative by setting . For additional geometric insights on VDs and PDs see Preparata and Shamos (2012) and Appendix A.
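To make the definition concrete, here is a small sketch of power diagram cell assignment. It assumes that the Laguerre distance is the squared Euclidean distance to the centroid minus the radius term; the function names and the two-centroid toy configuration are illustrative only. The last call illustrates the invariance to a global shift of the radii discussed above.

```python
def laguerre_distance(x, centroid, radius):
    """Squared Euclidean distance to the centroid minus the radius term (assumed convention)."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, centroid)) - radius


def pd_cell(x, centroids, radii):
    """Index of the power diagram cell containing x (smallest Laguerre distance)."""
    dists = [laguerre_distance(x, c, r) for c, r in zip(centroids, radii)]
    return min(range(len(dists)), key=lambda k: dists[k])


centroids = [[0.0, 0.0], [2.0, 0.0]]
x = [0.9, 0.0]

voronoi_cell = pd_cell(x, centroids, [0.0, 0.0])   # equal radii: plain Voronoi diagram
weighted_cell = pd_cell(x, centroids, [0.0, 1.0])  # a larger radius expands cell 1
shifted_cell = pd_cell(x, centroids, [5.0, 6.0])   # same radii shifted by a constant
```

With equal radii the point falls in the cell of the nearest centroid; increasing the radius of the second centroid moves the boundary so that the same point changes cell, while adding the same constant to all radii leaves the assignment unchanged.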
4 Input Space Power Diagram of a MASO Layer
Like any spline, it is the interplay between the (affine) spline mappings and the input space partition that works the magic in a MASO DN. Indeed, the partition opens up new geometric avenues to study how a MASO-based DN clusters and organizes signals in a hierarchical fashion. However, little is known analytically about the input-space partition other than in the simplest case of a single unit with a constrained bias value Balestriero and Baraniuk (2018a, b).
We now embark on a programme to fully characterize the geometry of the input space partition of a MASO-based DN. We will proceed in three steps by studying the partition induced by i) one unit of a single DN layer (Section 4.1), ii) the combination of all units in a single layer (Section 4.2), iii) the composition of L layers that forms the complete DN (Section 5).
4.1 MAS Unit Power Diagram
A MASO layer combines max affine spline (MAS) units to produce the layer output given an input . Denote each MAS computation from (1) as
where is the projection of
onto the hyperplane parameterized by the slope and offset . By defining the following half-space consisting of the set of points above the hyperplane
we obtain the following geometric interpretation of the unit output.
The MAS unit maps its input space onto the boundary of the convex polytope , leading to
where we remind that is the image of .
To provide further intuition, we highlight the role of in terms of input space partitioning.
The vertical projection on the input space of the faces of the polytope from (5) defines the cells of a PD.
Since the MAS unit projects an input onto the polytope face given by (recall (2)) corresponding to
the collection of inputs having the same face allocation, defined as , constitutes the partition cell of the unit PD.
The MAS unit partitions its input space according to a PD with centroids given by , and (recall (2)).
The input space partitioning of a DN unit is composed of convex polytopes.
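The correspondence between the unit's face selection and a power diagram can be checked numerically. The sketch below assumes the Laguerre distance convention "squared Euclidean distance minus radius"; under that assumption, completing the square shows that maximizing ⟨a_r, x⟩ + b_r over r is the same as minimizing ‖x − a_r‖² − (2 b_r + ‖a_r‖²), so the centroids are the slopes and the radii are 2 b_r + ‖a_r‖². Sizes, names, and random parameters are hypothetical.

```python
import random


def mas_face(slopes, offsets, x):
    """Index r maximizing <a_r, x> + b_r (the selected polytope face) and all values."""
    vals = [sum(a * xi for a, xi in zip(a_r, x)) + b_r
            for a_r, b_r in zip(slopes, offsets)]
    return max(range(len(vals)), key=lambda r: vals[r]), vals


def unit_pd_cell(slopes, offsets, x):
    """Same selection written as a power diagram: centroid a_r, radius 2*b_r + ||a_r||^2."""
    dists = []
    for a_r, b_r in zip(slopes, offsets):
        radius = 2.0 * b_r + sum(a * a for a in a_r)
        dists.append(sum((xi - a) ** 2 for xi, a in zip(x, a_r)) - radius)
    return min(range(len(dists)), key=lambda r: dists[r])


rng = random.Random(0)
R, D = 4, 3  # hypothetical sizes: 4 affine maps on a 3-dimensional input
slopes = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(R)]
offsets = [rng.uniform(-1, 1) for _ in range(R)]

max_gap = 0.0
for _ in range(200):
    x = [rng.uniform(-3, 3) for _ in range(D)]
    _, vals = mas_face(slopes, offsets, x)
    r_pd = unit_pd_cell(slopes, offsets, x)
    # The PD cell must select an affine map attaining the maximum (up to rounding).
    max_gap = max(max_gap, max(vals) - vals[r_pd])
```

The check compares values rather than indices so that inputs lying numerically on a cell boundary do not produce spurious mismatches.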
4.2 MASO Layer Power Diagram
We study the layer case by studying the joint behavior of all the layer units. A MASO layer is a continuous, piecewise affine operator made by the concatenation of MAS units (recall (1)); we extend (3) to
and the per unit face index function (6) into the operator defined as
where is the component of the vector .
The layer operator maps its input space into the boundary of the dimensional convex polytope as
Similarly to Proposition 1, the polytope is bound to the layer input space partitioning.
The vertical projection on the input space of the faces of the polytope from Proposition 2 defines the cells of a PD.
The MASO layer projects an input onto the polytope face indexed by corresponding to
The collection of inputs having the same face allocation jointly across the units constitutes the partition cell of the layer PD.
A DN layer partitions its input space according to a PD with cells, centroids and radii (recall (2)).
The input space partitioning of a DN layer is composed of convex polytopes.
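The layer-level statement can be illustrated with a brute-force comparison on a tiny layer. The sketch below assumes that the layer PD cell associated with a joint code (r_1, ..., r_K) has centroid Σ_k A[k][r_k] and radius 2 Σ_k B[k][r_k] + ‖centroid‖², with the same Laguerre convention as before; these expressions, the function names, and the toy weights are assumptions made for illustration. It then checks that the K independent per-unit searches recover the cell found by searching all R^K joint codes.

```python
from itertools import product


def unit_codes(A, B, x):
    """Per-unit face indices: K independent argmax searches over R affine maps."""
    code = []
    for A_k, B_k in zip(A, B):
        vals = [sum(a * xi for a, xi in zip(a_r, x)) + b_r
                for a_r, b_r in zip(A_k, B_k)]
        code.append(max(range(len(vals)), key=lambda r: vals[r]))
    return tuple(code)


def layer_pd_cell(A, B, x):
    """Brute-force search over all R**K joint codes of the layer power diagram.

    Assumed parameters for a code r: centroid = sum_k A[k][r_k],
    radius = 2 * sum_k B[k][r_k] + ||centroid||^2.
    """
    K, R, D = len(A), len(A[0]), len(x)
    best_code, best_dist = None, float("inf")
    for code in product(range(R), repeat=K):
        centroid = [sum(A[k][code[k]][d] for k in range(K)) for d in range(D)]
        radius = (2.0 * sum(B[k][code[k]] for k in range(K))
                  + sum(c * c for c in centroid))
        dist = sum((xi - c) ** 2 for xi, c in zip(x, centroid)) - radius
        if dist < best_dist:
            best_code, best_dist = code, dist
    return best_code


# Hypothetical ReLU-like layer: K = 3 units, R = 2 affine maps each (one is the zero map).
A = [[[1.0, 0.0], [0.0, 0.0]],
     [[0.0, 1.0], [0.0, 0.0]],
     [[1.0, 1.0], [0.0, 0.0]]]
B = [[0.5, 0.0], [-0.25, 0.0], [-1.0, 0.0]]
x = [0.3, 0.9]

code_fast = unit_codes(A, B, x)      # evaluates K * R affine maps
code_brute = layer_pd_cell(A, B, x)  # evaluates R ** K candidate cells
```

In this toy case both searches return the same code; the per-unit search touches 6 affine maps, whereas the brute-force search visits 8 cells, a gap that grows exponentially with the number of units.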
4.3 Weight Constraints and Cell Shapes
We highlight the close relationship between the layer weights from (1), the layer polytope from Proposition 2, and the boundaries of the layer PD from Theorem 3. In particular, we show how one can alter or constrain the shape of the cells by constraining the weights of the layer.
Example 1: Constraining the layer weights to be such that for some integer , , and arbitrary constant leads to an input power diagram with cell boundaries parallel to the input space basis vectors (see Fig. 2). For instance, if the input space is the Euclidean space equipped with the canonical basis, this constraint translates into having PD boundaries parallel to the axes.
Example 2: Constraining the layer weights to be such that for some arbitrary constant leads to a layer-input power diagram with diagonal cell boundaries. (Footnote 2: Note that while in Example 1 each per-unit , per-cell weight was constrained to contain a single nonzero element such as for , Example 2 makes the weight vector filled with a single constant but varying signs such as .)
Changing the radius of a given cell shrinks or expands it relative to the other cells Aurenhammer (1987).
Updating the parameters of a single unit (slope or offset of the affine transform and/or the nonlinearity behavior) affects the centroids and radii of multiple regions.
The above result recovers weight sharing concepts and implicit bias/regularization. In fact, most regions are tied together in terms of learnable parameters. Modifying a single region while leaving everything else the same is not possible in general.
5 Input Space Power Diagram of a MASO Deep Network
We now consider the composition of multiple layers. The input space of layer is denoted as , with the DN input space.
5.1 Power Diagram Subdivision
We provide in this section the formula for deriving the input space partitioning of an -layer DN by means of a recursive scheme.
Recall that each layer defines its own polytope according to Proposition 2, each with domain .
The DN partitioning corresponds to a recursive subdivision where each per layer polytope subdivides the previously obtained partitioning, involving the representation of the considered layer polytope in the input space . This subdivision can be analytically derived from the following recursion scheme.
Initialization: The initialization consists of defining the part of the input space to consider .
Recursion step (): The first layer subdivides into a PD from Theorem 3 with parameters to obtain the layer 1 partitioning .
Recursion step (): The second layer subdivides each cell of . Consider a specific cell ; all inputs in this cell are projected to by the first layer via . (Footnote 3: Recall from (1) that are the affine parameters associated with cell .) The convex cell thus remains a convex cell in defined as the following affine transform of the cell
Since the first layer is linear on the cell , the slice of the polytope (recall (10)) having for domain , formally defined as
can thus be expressed w.r.t. .
The domain restricted polytope (13) can be expressed in the input space as
with the hyperplane with slope and bias .
From Lemma 2, induces an underlying PD on its domain that subdivides the cell into a PD denoted as leading to the centroids
and radii . The PD parameters thus combine the affine parameters of the considered cell with the second layer parameters . Repeating this subdivision process for all cells from forms the input space partitioning
Recursion step: Consider the situation at layer , knowing from the previous subdivision steps. Following the intuition from the previous step, layer subdivides each cell in to produce , leading to the up-to-layer- DN partitioning defined as
Each cell is subdivided into , a PD with domain and parameters
with forming .
The described recursion construction also provides a direct result on the shape of the entire DN input space partitioning cells.
The cells of the DN input space partitioning are convex polygons.
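The subdivision can be illustrated on a toy two-layer ReLU network. In the sketch below (hypothetical weights and function names, ReLU assumed as the nonlinearity), the per-layer activation codes identify the cell of the subdivided partition that contains an input, and freezing those codes yields the single affine map that the network applies on that cell. The hyperplanes of the second layer, pulled back through the frozen first layer, are what subdivide the first-layer cell.

```python
def matvec(M, v):
    """Matrix-vector product for plain Python lists."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def forward_with_codes(layers, x):
    """Run a ReLU network; return the output and the per-layer activation codes."""
    codes, h = [], list(x)
    for W, b in layers:
        pre = [p + bi for p, bi in zip(matvec(W, h), b)]
        codes.append([1 if p > 0 else 0 for p in pre])
        h = [max(p, 0.0) for p in pre]
    return h, codes


def region_affine_map(layers, codes, dim):
    """Compose the layers with their codes frozen: on that cell, f(x) = A x + c."""
    A = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    c = [0.0] * dim
    for (W, b), q in zip(layers, codes):
        # Apply W to the running affine map, then zero the rows of inactive units.
        WA = [[sum(W[i][k] * A[k][j] for k in range(len(A))) for j in range(dim)]
              for i in range(len(W))]
        Wc = [p + bi for p, bi in zip(matvec(W, c), b)]
        A = [[qi * a for a in row] for qi, row in zip(q, WA)]
        c = [qi * v for qi, v in zip(q, Wc)]
    return A, c


# Hypothetical two-layer ReLU network on a 2-dimensional input.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0]),
    ([[1.0, -1.0, 2.0], [-1.0, 0.5, 1.0]], [0.2, 0.1]),
]
x = [0.5, -0.3]
x_near = [0.6, -0.2]  # chosen to lie in the same cell as x

out, codes = forward_with_codes(layers, x)
out_near, codes_near = forward_with_codes(layers, x_near)
A, c = region_affine_map(layers, codes, len(x))
pred = [a + ci for a, ci in zip(matvec(A, x), c)]
pred_near = [a + ci for a, ci in zip(matvec(A, x_near), c)]
```

Both inputs share the same codes, so the same pair `(A, c)` reproduces the network output at both points; an input with a different code at either layer would fall in a different cell and use a different affine map.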
5.2 Combinatorial Geometry Properties
We highlight a key computational property of DNs contributing to their success. While the actual number of cells from a layer PD varies greatly depending on the parameters, the cell inference task always searches over the maximum number of cells as
The computational and memory complexity of this task is . While approximations exist Muja and Lowe (2009); Arya et al. (1998); Georgescu et al. (2003), we demonstrate how a MASO induced PD is constructed in such a way that it is parameter-, memory-, and computation-efficient.
A DN layer solves (18) with computational and memory complexity as opposed to for an arbitrary Power Diagram.
The entire DN then solves (18) iteratively for each layer.
An entire DN infers an input cell with computational and memory complexity as opposed to for an arbitrary hierarchy of Power Diagrams.
The above practical result is crucial to the ability of DN layers to perform extremely fine-grained input space partitioning without sacrificing computation time, especially at test time, where one needs only feed-forward computations of an input to obtain a prediction.
5.3 Centroid and Radius Computation
In practice, each partitioning contains too many cells to compute all of the centroids and radii. However, given a cell (resp. a point ) and an up to layer code (resp. ), the centroid and radius can be computed as follows:
where we remind that and from Theorem 5. Notice how centroids and biases of the current layer are mapped back to the input space via projection onto the tangent hyperplane with basis given by .
The centroids correspond to the backward pass of DNs and thus can be computed efficiently by backpropagation.
Note how the form in (20) corresponds to saliency maps. In particular, at a given layer, the centroid of the region to which belongs is obtained by summing all the per-unit saliency maps, that is, by adding all the unit contributions in the input space. We provide in Fig. 4 computed centroids for a trained ResNet on CIFAR10, for each PD subdivision; see the appendix for details on the model, performance and additional figures. The ability to retrieve saliency maps and the form of the centroid opens the door to further use of the centroids in many settings. For example, semi-supervised learning successfully leveraged the last layer centroid in Balestriero et al. (2018) by providing a loss upon them.
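The link between centroids and the backward pass can be sketched without an autodiff library. The snippet below assumes ReLU layers, for which the slope selected by each unit is either its weight row or zero; under that assumption, summing the selected slopes of a layer after mapping them back to the input space coincides with the gradient of the sum of that layer's outputs with respect to the input. Central finite differences stand in for backpropagation here, and all names and weights are hypothetical.

```python
def relu_layer(W, b, h):
    """Fully connected map followed by ReLU."""
    return [max(sum(w * hi for w, hi in zip(row, h)) + bi, 0.0)
            for row, bi in zip(W, b)]


def summed_output(layers, x):
    """Sum of the unit outputs of the last listed layer (a scalar function of x)."""
    h = list(x)
    for W, b in layers:
        h = relu_layer(W, b, h)
    return sum(h)


def centroid_by_finite_differences(layers, x, step=1e-4):
    """Central differences stand in for backpropagation in this dependency-free sketch."""
    grad = []
    for d in range(len(x)):
        hi = [xi + (step if i == d else 0.0) for i, xi in enumerate(x)]
        lo = [xi - (step if i == d else 0.0) for i, xi in enumerate(x)]
        grad.append((summed_output(layers, hi) - summed_output(layers, lo)) / (2 * step))
    return grad


def centroid_closed_form(layers, x):
    """Sum over active last-layer units of their slopes pulled back to the input space."""
    dim = len(x)
    A = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    h = list(x)
    for W, b in layers:
        pre = [sum(w * hi for w, hi in zip(row, h)) + bi for row, bi in zip(W, b)]
        WA = [[sum(W[i][k] * A[k][j] for k in range(len(A))) for j in range(dim)]
              for i in range(len(W))]
        # Keep the pulled-back slope of active units, zero out inactive ones.
        A = [[a if p > 0 else 0.0 for a in row] for p, row in zip(pre, WA)]
        h = [max(p, 0.0) for p in pre]
    return [sum(row[j] for row in A) for j in range(dim)]


# Hypothetical two-layer ReLU network on a 2-dimensional input.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0]),
    ([[1.0, -1.0, 2.0], [-1.0, 0.5, 1.0]], [0.2, 0.1]),
]
x = [0.5, -0.3]

mu_fd = centroid_by_finite_differences(layers, x)
mu_cf = centroid_closed_form(layers, x)
```

The two estimates agree as long as the finite-difference step keeps the perturbed inputs inside the same cell, since the network is affine there. In a framework with automatic differentiation, the same quantity would be obtained with a single backward pass.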
5.4 Empirical Region Characterization
We provide in Fig. 5 the distribution of distances from the dataset points to the nearest region boundaries of the input space partitioning for each layer (at the current subdivision step) and at different stages of the training procedure. Training only slightly impacts those distances, slightly increasing the number of inputs with a small distance to the nearest boundary. The main impact of learning instead resides in shaping the regions via learning of the weights. We also train a DN on various datasets and study how many inputs share the same input space partitioning region. We observed that, from initialization to the end of learning, there is never more than one image in a given region, and this with standard DNs providing near or state-of-the-art performance. Yet, drastically reducing the size of the DN allows more than one image per region when considering the first steps of the subdivision.
6 Geometry of a Deep Network Decision Boundary
The analysis of the input space partitioning was achieved by successively expressing the layer polytope in the input space. While Sec. 5 focused on the faces of the polytope, which define the cells in the input space, we now turn to the edges of the polytope, which define the cells’ boundaries. In particular, we demonstrate how a single unit at layer defines multiple cell boundaries in the input space, and use this finding to derive the analytical formula of the DN decision boundary in classification tasks. In this section we focus on DNs using nonlinearities such as ReLU, leaky-ReLU, and absolute value.
6.1 Partitioning Boundaries and Edges
In the case of nonlinearities, the polytope of unit contains a single edge. This edge can be expressed in some space , as a collection of continuous piecewise linear paths.
The edges of the polytope in some space are the collection of points defined as
Thus the edges correspond to the level curve of the unit in . Defining the polynomial
we obtain the following result where the boundaries of from Theorem 5 can be expressed in terms of the polytope edges and roots of the polynomial.
The polynomial (23) is of order , its roots correspond to the partitioning boundaries:
and the root order defines the dimension of the root (boundary, corner, …).
6.2 Decision Boundary and Curvature
In the case of classification, the last layer typically includes a softmax nonlinearity and is thus not a MASO layer. However, Balestriero and Baraniuk (2019) demonstrated that it can be expressed as a MASO layer without any change in the model and output. As a result, this layer introduces a last subdivision of the DN partitioning. We focus on binary classification for simplicity of notation; in this case, and a single last subdivision occurs. In particular, using the previous result we obtain the following.
The decision boundary of a DN with layers is the edge of the last layer polytope expressed in the input space from Def. 3 as
To provide insights, let us consider a layer DN denoted as and the binary classification task; we have as the DN induced decision boundary the following
with and . Studying the distribution of characterizes the structure of the decision boundary and thus highlights the interplay between layer parameters, layer topology, and the decision boundary. For example, the red line in Figure 3 demonstrates how the weights characterize the curvature and cut positions of the decision boundary.
We provide a direct application of the above finding by providing a curvature characterization of the decision boundary. First, we propose the following result stating that the form of and from (26) from one region to a neighbouring one only alters a single unit code at a given layer.
Any edge as defined in Def. 3 that reaches a region boundary must continue in the neighbouring region.
This comes directly from continuity of the involved operator. This demonstrates that the decision boundary as defined in (26) can have its curvature defined by comparing the form of the edges of adjacent regions.
The decision boundary curvature/angle between two adjacent regions and (Footnote 4: For clarity, we omit the subscripts.) is given by the following dihedral angle Kern and Bland (1938) between the adjacent hyperplanes as
The above is illustrated in the adjacent figure, with one of the angles highlighted with an arrow. The hyperplane offsets are irrelevant to the boundary curvature; consequently, the DN bias units are also irrelevant to the boundary curvature. Moreover, the norm of the gradient of the DN, and thus the Lipschitz constant, does not alone characterize the regularity of the decision boundary. In fact, the angle is invariant under scaling of the parameters. This indicates that measures based on input-output sensitivity do not alone characterize the curvature of the decision boundary.
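The invariances stated above can be checked with a short sketch. It uses the standard definition of the angle between two hyperplanes through their normal (slope) vectors, which is an assumption here about the exact convention of the angle formula; the function name and the example slopes are hypothetical. Offsets never enter the computation, and rescaling either slope by a positive constant leaves the angle unchanged.

```python
import math


def hyperplane_angle(a, a_prime):
    """Angle (radians) between hyperplanes with normals a and a_prime; offsets play no role."""
    dot = sum(u * v for u, v in zip(a, a_prime))
    norm = math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in a_prime))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


# Hypothetical slopes of the decision boundary in two adjacent regions.
a = [1.0, 0.0, 1.0]
a_prime = [0.0, 2.0, 2.0]

theta = hyperplane_angle(a, a_prime)
# Positive rescaling of either slope leaves the angle unchanged.
theta_scaled = hyperplane_angle([3.0 * u for u in a], [0.5 * v for v in a_prime])
```

For these slopes the cosine of the angle is 0.5, so the angle is π/3 before and after rescaling, even though the gradient norms differ by a factor of 3 and 0.5 respectively.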
Finally, we highlight an intuitive result that can be derived from the above. Recall that moving to a neighbouring region implies a change of a single code unit. Without loss of generality, let us denote the changed code index by . The other codes remain the same. When dealing with nonlinearities, this implies that changes from a to a for those two neighbouring regions. Let us denote by the case with a and by the case with a . With this notation, we can derive some special cases of the angle formula (27) for some DN topologies.
In a 2-layer DN with ReLU and orthogonal first layer weights, we have
From the above formula it is clear that reducing the norm of the weights alone does not impact the angles. However, we have the following result.
Regions of the input space in which the number of firing ReLUs is small will have greater decision boundary curvature than regions where most ReLUs fire simultaneously.
As a result, a ReLU-based DN will behave differently in different regions of the space, and the angle is always positive. Interestingly, the use of the absolute value, on the other hand, leads to the following.
In a 2-layer DN with absolute value and orthogonal first layer weights, we have
Now, not only are the angles between degrees and (as opposed to ReLU between and ), but the angles also do not depend on the state of the other absolute values but just on the norm of the weights of both layers.
We have extended the understanding of DNs by leveraging computational geometry to characterize how a DN partitions its input space via a multiscale Power Diagram subdivision. Our analytical formulation for the partitions induced by not only the entire DN but also each of its units and layers will open new avenues to study how to optimize DNs for certain tasks and classes of signals.
- Arya et al.  Sunil Arya, David M Mount, Nathan S Netanyahu, Ruth Silverman, and Angela Y Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.
- Aurenhammer  Franz Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1):78–96, 1987.
- Aurenhammer and Imai  Franz Aurenhammer and Hiroshi Imai. Geometric relations among voronoi diagrams. Geometriae Dedicata, 27(1):65–75, 1988.
- Balestriero and Baraniuk [2018a] R. Balestriero and R. Baraniuk. Mad max: Affine spline insights into deep learning. arXiv preprint arXiv:1805.06576, 2018a.
- Balestriero and Baraniuk [2018b] R. Balestriero and R. G. Baraniuk. A spline theory of deep networks. In Proc. Int. Conf. Mach. Learn., volume 80, pages 374–383, Jul. 2018b.
- Balestriero and Baraniuk  Randall Balestriero and Richard Baraniuk. From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Syxt2jC5FX.
- Balestriero et al.  Randall Balestriero, Hervé Glotin, and Richard Baraniuk. Semi-supervised learning enabled by multiscale deep neural network inversion. arXiv preprint arXiv:1802.10172, 2018.
- Georgescu et al.  Bogdan Georgescu, Ilan Shimshoni, and Peter Meer. Mean shift based clustering in high dimensions: A texture classification example. In ICCV, volume 3, page 456, 2003.
- Goodfellow et al.  I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning, volume 1. MIT Press, 2016. http://www.deeplearningbook.org.
- Hanin and Rolnick  Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. arXiv preprint arXiv:1901.09021, 2019.
- Hannah and Dunson  L. A. Hannah and D. B. Dunson. Multivariate convex regression with adaptive partitioning. J. Mach. Learn. Res., 14(1):3261–3294, 2013.
- Imai et al.  Hiroshi Imai, Masao Iri, and Kazuo Murota. Voronoi diagram in the laguerre geometry and its applications. SIAM Journal on Computing, 14(1):93–105, 1985.
- Johnson  Roger A Johnson. Advanced Euclidean Geometry: An Elementary Treatise on the Geometry of the Triangle and the Circle: Under the Editorship of John Wesley Young. Dover Publications, 1960.
- Kern and Bland  Willis Frederick Kern and James R Bland. Solid mensuration: with proofs. J. Wiley & Sons, Incorporated, 1938.
- Magnani and Boyd  A. Magnani and S. P. Boyd. Convex piecewise-linear fitting. Optim. Eng., 10(1):1–17, 2009.
- Montufar et al.  Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
- Muja and Lowe  Marius Muja and David G Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2(331-340):2, 2009.
- Pach and Agarwal  János Pach and Pankaj K Agarwal. Combinatorial geometry, volume 37. John Wiley & Sons, 2011.
- Preparata and Shamos  Franco P Preparata and Michael I Shamos. Computational geometry: an introduction. Springer Science & Business Media, 2012.
- Raghu et al.  Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2847–2854. JMLR. org, 2017.
- Wang et al.  Zichao Wang, Randall Balestriero, and Richard Baraniuk. A max-affine spline perspective of recurrent neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJej72AqF7.
- Zhang et al.  Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix A Additional Geometric Insights
The Laguerre distance corresponds to the length of the line segment that starts at and ends at the tangent to the hypersphere with center and radius (see Figure 1).
The hyperplanar boundary between two adjacent PD regions can be characterized in terms of the chordale of the corresponding hyperspheres Johnson (1960). Doing so for all adjacent boundaries fully characterizes those region boundaries in simple terms of hyperplane intersections Aurenhammer (1987).
A.1 Paraboloid Insights
A further characterization of the polytope boundary can be made by introducing the paraboloid defined as . Notice that the slope of the hyperplane is and its offset is . With the paraboloid defined as , we see that the hyperplane is tangent to the paraboloid at the point . We now highlight that the hyperplane and the paraboloid intersect at a unique point
The faces of are the tangent of at the points given by leading to
Concerning the case of arbitrary bias, we have the following insights. We can characterize the hypersphere as being the intersection of the hyperplanes with the paraboloid in the following result from Aurenhammer (1987).
Aurenhammer (1987): There is a bijective mapping between the hypersphere in the input domain and the intersection of the hyperplane in with the paraboloid .
In fact, the projection of the intersection between the hyperplane and the paraboloid onto the input space forms a circle whose radius corresponds to the shift of the hyperplane.
Appendix B Additional Figures
Appendix C Proofs
C.1 Proof of Lemma 1: Single Unit Projection
Follows from Johnson (1960), which demonstrates that the boundaries in the input space defining the regions of the unit PD are the vertical projections of the polytope () face intersections defined as for neighbouring faces and .
C.2 Proof of Lemma 2: Single Layer Projection
Follows from the previous section, in which it is demonstrated that the boundaries of a single unit PD are obtained by vertical projection of the polytope edges. In the layer case, the edges of correspond to all the points in the input space s.t. belongs to an edge of at least one of the polytopes making up . Since the layer PD has for boundaries the union of all the per-unit PD boundaries, it follows directly that the vertical projection of the edges of forms the layer PD boundaries.
C.3 Proof Complexity
Recalling Section 2, a MASO layer produces its output by first inferring in which cell lies in the layer PD partitioning from Theorem 3, and then affinely transforming the input via . The inference problem of determining in which power diagram cell an input falls
C.4 Proof of Theorem 2: Single Unit Power Diagram
Let us first consider the case of units.
The layer input space of the -MASO units at layer is a weighted Voronoi Diagram with a maximum of regions, centroids , and biases