On the Expressive Power of Deep Polynomial Neural Networks

We study deep neural networks with polynomial activations, particularly their expressive power. For a fixed architecture and activation degree, a polynomial neural network defines an algebraic map from weights to polynomials. The image of this map is the functional space associated with the network, and it is an irreducible algebraic variety upon taking closure. This paper proposes the dimension of this variety as a precise measure of the expressive power of polynomial neural networks. We obtain several theoretical results regarding this dimension as a function of architecture, including an exact formula for high activation degrees, as well as upper and lower bounds on layer widths in order for deep polynomial networks to fill the ambient functional space. We also present computational evidence that it is profitable in terms of expressiveness for layer widths to increase monotonically and then decrease monotonically. Finally, we link our study to favorable optimization properties when training weights, and we draw intriguing connections with tensor and polynomial decompositions.


1 Introduction

A fundamental problem in the theory of deep learning is to study the functional space of deep neural networks. A network can be modeled as a composition of elementary maps; however, the family of all functions that can be obtained in this way is extremely complex. Many recent papers paint an accurate picture for the case of shallow networks (e.g., using mean field theory [7; 25]) and of deep linear networks [2; 3; 20]. A similar investigation of deep nonlinear networks appears to be significantly more challenging, and to require very different tools.

In this paper, we consider a general model for deep polynomial neural networks, where the activation function is a polynomial exponentiation (r-th power). The advantage of this framework is that the functional space associated with a network architecture is algebraic, so we can use tools from algebraic geometry [17] for a precise investigation of deep neural networks. Indeed, for a fixed activation degree r and architecture d = (d_0, …, d_h) (expressed as a sequence of widths), the family of all networks with varying weights can be identified with an algebraic variety V_{d,r}, embedded in a finite-dimensional Euclidean space. In this setting, an algebraic variety can be thought of as a manifold that may have singularities.

In this paper, our main object of study is the dimension of V_{d,r} as a variety (in practice, as a manifold), which may be regarded as a precise measure of the architecture's expressiveness. Specifically, we prove that this dimension stabilizes when the activation degree is high, and we provide an exact dimension formula for this case (Theorem 14). We also investigate conditions under which V_{d,r} fills its ambient space. This question is important from the vantage point of optimization, since an architecture is "filling" if and only if it corresponds to a convex functional space (Proposition 6). In this direction, we prove a bottleneck property: if a width is not sufficiently large, the network can never fill the ambient space, regardless of the size of the other layers (Theorem 19).

In a broader sense, our work introduces a powerful language and suite of mathematical tools for studying the geometry of network architectures. Although this setting requires polynomial activations, it may be used as a testing ground for more general situations and, e.g., to verify rules of thumb rigorously. Finally, our results show that polynomial neural networks are intimately related to the theory of tensor decompositions [21]. In fact, representing a polynomial as a deep network corresponds to a type of decomposition of tensors which may be viewed as a composition of decompositions of a recently introduced sort [23]. Using this connection, we establish general non-trivial upper bounds on filling widths (Theorem 10). We believe that our work can serve as a step towards many interesting research challenges in developing the theoretical underpinnings of deep learning.

1.1 Related work

The study of the expressive power of neural networks dates back to seminal work on the universality of networks as function approximators [10; 19]. More recently, there has been research supporting the hypothesis of "depth efficiency", i.e., the fact that deep networks can approximate functions more efficiently than shallow networks [11; 24; 8; 9]. Our paper differs from this line of work in that we do not emphasize approximation properties, but rather the study of the functions that can be expressed exactly using a network.

Most of the aforementioned studies make strong hypotheses on the network architecture. In particular, [11; 24] focus on arithmetic circuits, or sum-product networks [27]. These are networks composed of units that compute either the product or a weighted sum of their inputs. In [8], the authors introduce a model of convolutional arithmetic circuits. This is a particular class of arithmetic circuits that includes networks with layers of 1D convolutions and product pooling. This model does not allow for non-linear activations (besides the product pooling), although the follow-up paper [9] extends some results to ReLU activations with sum pooling. Interestingly, these networks are related to Hierarchical Tucker (HT) decompositions of tensors.

The polynomial networks studied in this paper are not arithmetic circuits, but feedforward deep networks with polynomial r-th power activations. This is a vast generalization of a setting considered in several recent papers [29; 14; 28], which study shallow (two-layer) networks with quadratic activations (r = 2). These papers show that if the width of the intermediate layer is at least twice the input dimension, then the quadratic loss has no "bad" local minima. This result is in line with our Proposition 5, which implies that in this case the functional space is convex and fills the ambient space. We also point out that polynomial activations are required for the functional space of the network to span a finite-dimensional vector space [22; 29].

The polynomial networks considered in this paper do not correspond to HT tensor decompositions as in [8; 9]; rather, they are related to a different polynomial/tensor decomposition attracting very recent interest [16; 23]. These generalize usual decompositions, but their algorithmic and theoretical understanding is largely open. Neural networks motivate several questions in this vein.

Main contributions.

Our main contributions can be summarized as follows.

• We give a precise formulation of the expressiveness of polynomial networks in terms of the algebraic dimension of the functional space as an algebraic variety.

• We spell out the close, two-way relationship between polynomial networks and a particular family of decompositions of tensors.

• We prove several theoretical results on the functional space of polynomial networks. Notably, we give a formula for the dimension that holds for sufficiently high activation degrees (Theorem 14) and we prove a tight lower bound on the width of the layers for the network to be “filling” in the functional space (Theorem 19).

Notation.

We use Sym_d(R^n) to denote the space of homogeneous polynomials of degree d in n variables with coefficients in R. This set is a vector space over R of dimension binom(d + n − 1, d), spanned by all monomials of degree d in n variables. In practice, Sym_d(R^n) is isomorphic to R^{binom(d+n−1,d)}, and our networks will correspond to points in this high-dimensional space. The notation Sym_d expresses the fact that a homogeneous polynomial of degree d in n variables can always be identified with a symmetric tensor in (R^n)^{⊗d} that collects all of its coefficients.

2 Basic setup

A polynomial network is a function of the form

p_\theta(x) = W_h\,\rho_r\,W_{h-1}\,\rho_r\cdots\rho_r\,W_1\,x, \qquad W_i \in \mathbb{R}^{d_i \times d_{i-1}}, \tag{1}

where the activation ρ_r raises all elements of its input to the r-th power (elementwise). The parameters θ = (W_h, …, W_1) (with W_i ∈ R^{d_i × d_{i−1}}) are the network's weights, and the network's architecture is encoded by the sequence d = (d_0, …, d_h) (specifying the depth h and the widths d_i). Clearly, p_θ is a homogeneous polynomial mapping of degree r^{h−1}, i.e., p_θ ∈ Sym_{r^{h−1}}(R^{d_0})^{d_h}.
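As a concrete illustration, the map in (1) can be implemented in a few lines. This is a minimal sketch (not the authors' code; the helper name `poly_net` is ours), shown for the architecture d = (2, 2, 3) with r = 2 used later in Example 3:

```python
import numpy as np

def poly_net(weights, x, r):
    """Evaluate p_theta(x) = W_h rho_r W_{h-1} ... rho_r W_1 x, where rho_r
    raises entries to the r-th power elementwise, as in equation (1)."""
    out = weights[0] @ x          # W_1 x
    for W in weights[1:]:         # apply rho_r, then the next linear layer
        out = W @ (out ** r)
    return out

# Architecture d = (2, 2, 3), degree r = 2: the output is a triple of quadratics.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 2))   # d_1 x d_0
W2 = rng.standard_normal((3, 2))   # d_2 x d_1
x = rng.standard_normal(2)
p = poly_net([W1, W2], x, r=2)

# Homogeneity check: p_theta(t x) = t^{r^{h-1}} p_theta(x), here t^2 with h = 2.
assert np.allclose(poly_net([W1, W2], 2.0 * x, r=2), 4.0 * p)
```

The assertion checks the degree-r^{h−1} homogeneity claimed above for this architecture.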

For fixed degree r and architecture d, there exists an algebraic map

\Phi_{d,r} : \theta \mapsto p_\theta = \begin{bmatrix} p_\theta^{1} \\ \vdots \\ p_\theta^{d_h} \end{bmatrix}, \tag{2}

where each p_θ^i is a polynomial in Sym_{r^{h−1}}(R^{d_0}), i.e., a homogeneous polynomial of degree r^{h−1} in d_0 variables. The image of Φ_{d,r} is a set of vectors of polynomials, i.e., a subset of Sym_{r^{h−1}}(R^{d_0})^{d_h}, and it is the functional space represented by the network. In this paper, we consider the Zariski closure V_{d,r} of the functional space.¹ We refer to V_{d,r} as the functional variety of the network architecture, as it is in fact an irreducible algebraic variety. In particular, V_{d,r} can be studied using powerful machinery from algebraic geometry.

¹The Zariski closure of a set S is the smallest set containing S that can be described by polynomial equations.

Remark 1.

The functional variety V_{d,r} may be significantly larger than the actual functional space (the image of Φ_{d,r}), since the Zariski closure is typically larger than the closure with respect to the standard Euclidean topology. On the other hand, the dimensions of the two sets agree, and the variety V_{d,r} is usually "nicer" (it can be described by polynomial equations, whereas an exact implicit description of the functional space may require inequalities).

2.1 Examples

We present some examples that describe the functional variety in simple cases.

Example 2.

A linear network is a polynomial network with r = 1. In this case, the network map is simply matrix multiplication:

\theta = (W_h, W_{h-1}, \ldots, W_1) \mapsto p_\theta = W_h W_{h-1} \cdots W_1\, x. \tag{3}

The functional space is the set of d_h × d_0 matrices with rank at most d_min = min(d_0, …, d_h). This set is already characterized by polynomial equations, as the common zero set of all (d_min + 1) × (d_min + 1) minors, so the functional space coincides with the functional variety in this case. The dimension of this variety is d_min (d_0 + d_h − d_min).
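The rank bound and the dimension count above can be sanity-checked numerically. A minimal sketch under our own choice of widths (the example d = (4, 2, 5) is ours, not from the paper):

```python
import numpy as np

# A linear network (r = 1) with widths d = (4, 2, 5): the product W2 @ W1
# is a 5x4 matrix whose rank is at most d_min = min(4, 2, 5) = 2.
rng = np.random.default_rng(1)
W1 = rng.standard_normal((2, 4))   # d_1 x d_0
W2 = rng.standard_normal((5, 2))   # d_2 x d_1
M = W2 @ W1
assert np.linalg.matrix_rank(M) <= 2

# Dimension of the rank-<=k determinantal variety of d_h x d_0 matrices:
# k * (d_0 + d_h - k). Here k = 2, d_0 = 4, d_h = 5 gives 14.
k, d0, dh = 2, 4, 5
assert k * (d0 + dh - k) == 14
```

For generic weights the rank is exactly d_min, so the bound is attained.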

Example 3.

Consider d = (2, 2, 3) and r = 2. The input variables are x = (x_1, x_2), and the parameters are the weights

W_1 = \begin{bmatrix} w^{1}_{11} & w^{1}_{12} \\ w^{1}_{21} & w^{1}_{22} \end{bmatrix}, \qquad
W_2 = \begin{bmatrix} w^{2}_{11} & w^{2}_{12} \\ w^{2}_{21} & w^{2}_{22} \\ w^{2}_{31} & w^{2}_{32} \end{bmatrix}. \tag{4}

The network map is a triple of quadratic polynomials in x_1, x_2 that can be written as

W_2\,\rho_2\,W_1 x = \begin{bmatrix}
w^{2}_{11}(w^{1}_{11}x_1 + w^{1}_{12}x_2)^2 + w^{2}_{12}(w^{1}_{21}x_1 + w^{1}_{22}x_2)^2 \\
w^{2}_{21}(w^{1}_{11}x_1 + w^{1}_{12}x_2)^2 + w^{2}_{22}(w^{1}_{21}x_1 + w^{1}_{22}x_2)^2 \\
w^{2}_{31}(w^{1}_{11}x_1 + w^{1}_{12}x_2)^2 + w^{2}_{32}(w^{1}_{21}x_1 + w^{1}_{22}x_2)^2
\end{bmatrix}. \tag{5}

The map Φ_{d,2} in (2) takes θ = (W_2, W_1) (which has 10 parameters) to the three quadratics in x_1, x_2 displayed above. The quadratics have a total of 9 coefficients; however, these coefficients are not arbitrary, i.e., not all possible triples of polynomials occur in the functional space. Writing c^{(i)}_{jk} for the coefficient of x_j x_k in the i-th polynomial in (5) (with j ≤ k), it is a simple exercise to show that

\det \begin{bmatrix}
c^{(1)}_{11} & c^{(1)}_{12} & c^{(1)}_{22} \\
c^{(2)}_{11} & c^{(2)}_{12} & c^{(2)}_{22} \\
c^{(3)}_{11} & c^{(3)}_{12} & c^{(3)}_{22}
\end{bmatrix} = 0. \tag{6}

This cubic equation describes the functional variety V_{(2,2,3),2}, which is in this case an eight-dimensional subset (hypersurface) of Sym_2(R^2)^3 ≅ R^9.
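The determinant identity (6) can be verified numerically: each output quadratic in (5) is a combination of the same two squared linear forms, so the three coefficient vectors lie in a two-dimensional subspace. A small sketch (our own code, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.standard_normal((2, 2))
W2 = rng.standard_normal((3, 2))

# Coefficients c^(i)_{jk} of x1^2, x1*x2, x2^2 in the i-th output quadratic.
# The i-th output is sum_k W2[i,k] * (W1[k,0]*x1 + W1[k,1]*x2)^2, as in (5).
C = np.zeros((3, 3))
for i in range(3):
    for k in range(2):
        a, b = W1[k, 0], W1[k, 1]
        C[i] += W2[i, k] * np.array([a * a, 2 * a * b, b * b])

# The 3x3 coefficient matrix is singular for every choice of weights (eq. (6)).
assert abs(np.linalg.det(C)) < 1e-9
```

Since each row of C is a combination of the two fixed vectors coming from the rows of W_1, the determinant vanishes identically, for any random seed.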

2.2 Objectives

The main goal of this paper is to study the dimension of V_{d,r} as the network's architecture d and the activation degree r vary. This dimension may be considered a precise and intrinsic measure of the polynomial network's expressivity, quantifying the degrees of freedom of the functional space. For example, the dimension reflects the number of input/output pairs the network can interpolate, as each sample imposes one linear constraint on the variety V_{d,r}.

In general, the variety V_{d,r} lives in the ambient space Sym_{r^{h−1}}(R^{d_0})^{d_h}, which in turn depends only on the activation degree r, the network depth h, and the input/output dimensions d_0 and d_h. We are thus interested in the role of the intermediate widths d_1, …, d_{h−1} in the dimension of V_{d,r}.

Definition 4.

A network architecture d has a filling functional variety for the activation degree r if V_{d,r} = Sym_{r^{h−1}}(R^{d_0})^{d_h}.

It is important to note that if the functional variety is filling, then the actual functional space (before taking closure) is in general only thick, i.e., it has positive Lebesgue measure in the ambient space (see Remark 1). On the other hand, given an architecture with a thick functional space, we can find another architecture whose functional space is the whole ambient space.

Proposition 5 (Filling functional space).

Fix r and suppose that d = (d_0, d_1, …, d_h) has a filling functional variety V_{d,r}. Then the architecture (d_0, 2d_1, …, 2d_{h−1}, d_h) has a filling functional space, i.e., its functional space is the whole ambient space Sym_{r^{h−1}}(R^{d_0})^{d_h}.

In summary, while an architecture with a filling functional variety may not necessarily have a filling functional space, it is sufficient to double all the intermediate widths for this stronger condition to hold. As argued below, we expect architectures with thick/filling functional spaces to have more favorable properties in terms of optimization and training. On the other hand, non-filling architectures may lead to interesting functional spaces for capturing patterns in data. In fact, we show in Section 3.2 that non-filling architectures generalize families of low-rank tensors.

2.3 Connection to optimization

The following two results illustrate that thick/filling functional spaces are helpful for optimization.

Proposition 6.

If the closure of a set S ⊆ R^n is not convex, then there exists a convex function on R^n whose restriction to S has arbitrarily "bad" local minima (that is, there exist local minima whose value is arbitrarily larger than that of a global minimum).

Proposition 7.

If a functional space is not thick, then it is not convex.

These two facts show that if the functional space is not thick, we can always find a convex loss function and a data distribution that lead to a landscape with arbitrarily bad local minima. There is also an obvious weak converse, namely that if the functional space is filling, then any convex loss function will have a unique global minimum (although there may be "spurious" critical points that arise from the non-convex parameterization).

3 Architecture dimensions

In this section, we begin our study of the dimension of V_{d,r}. We describe the connection between polynomial networks and tensor decompositions for both shallow (Section 3.1) and deep (Section 3.2) networks, and we present some computational examples (Section 3.3).

3.1 Shallow networks and tensors

Polynomial networks with h = 2 are closely related to CP tensor decomposition [21]. Indeed, in the shallow case, the network map sends θ = (W_2, W_1) to:

W_2\,\rho_r\,W_1 x = \left( \sum_{i=1}^{d_1} W_2(:,i) \otimes W_1(i,:)^{\otimes r} \right) \cdot x^{\otimes r} =: \Phi(W_2, W_1) \cdot x^{\otimes r}. \tag{7}

Here Φ(W_2, W_1) ∈ R^{d_2} ⊗ Sym_r(R^{d_0}) is a partially symmetric tensor, expressed as a sum of d_1 partially symmetric rank-one terms, and ·x^{⊗r} denotes contraction of the last r indices. Thus the functional space is the set of tensors of partially symmetric rank at most d_1. Algorithms for low-rank CP decomposition could be applied to Φ(W_2, W_1) to recover W_1 and W_2. In particular, when d_2 = 1, we obtain a symmetric tensor. For this case, we have the following.

Lemma 8.

A shallow architecture (d_0, d_1, 1) is filling for the activation degree r if and only if every symmetric tensor in Sym_r(R^{d_0}) has symmetric rank at most d_1.

Furthermore, the celebrated Alexander-Hirschowitz theorem [1] from algebraic geometry provides the dimension of V_{d,r} for all shallow, single-output architectures.

Theorem 9 (Alexander-Hirschowitz).

If d = (d_0, d_1, 1), the dimension of V_{d,r} is given by min(d_0 d_1, binom(d_0 + r − 1, r)), except for the following cases:

• r = 2, 2 ≤ d_1 ≤ d_0 − 1,

• r = 3, d_0 = 5, d_1 = 7,

• r = 4, d_0 = 3, d_1 = 5,

• r = 4, d_0 = 4, d_1 = 9,

• r = 4, d_0 = 5, d_1 = 14.
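The tensor identity (7) can be checked numerically in the symmetric case d_2 = 1: the network output equals the contraction of the symmetric tensor with x^{⊗r}. A minimal sketch for r = 2 (our own code; the sizes d_0 = 3, d_1 = 2 are an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
d0, d1, r = 3, 2, 2
W1 = rng.standard_normal((d1, d0))
w2 = rng.standard_normal(d1)          # single output row (d_2 = 1)

# Symmetric tensor T = sum_i w2[i] * W1[i,:] tensor W1[i,:]  (eq. (7), r = 2):
# a sum of d_1 symmetric rank-one terms.
T = sum(w2[i] * np.multiply.outer(W1[i], W1[i]) for i in range(d1))

x = rng.standard_normal(d0)
# Network output w2 . rho_r(W1 x) equals the contraction T . x^{tensor r}.
net = w2 @ (W1 @ x) ** r
contraction = x @ T @ x
assert np.isclose(net, contraction)
```

This is exactly the statement that the shallow single-output functional space consists of symmetric tensors of rank at most d_1, specialized to quadratic forms.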

3.2 Deep networks and tensors

Deep polynomial networks also relate to a certain iterated tensor decomposition. We first note that the map Φ_{d,r} may be expressed via the so-called Khatri-Rao product from multilinear algebra. Indeed, Φ_{d,r} maps θ to:

\mathrm{SymRow}\!\left( W_h \left( \left( W_{h-1} \cdots \left( W_2 \left( W_1^{\bullet r} \right) \right)^{\bullet r} \cdots \right)^{\bullet r} \right) \right). \tag{8}

Here the Khatri-Rao power ∙r operates on rows: for W ∈ R^{a×b}, the power W^{∙r} ∈ R^{a×b^r} replaces each row w_i by its vectorized r-fold outer product, vec(w_i ⊗ ⋯ ⊗ w_i). Also in (8), SymRow denotes the (linear) operation of symmetrizing the rows, regarded as tensors.

Another viewpoint comes from using polynomials and inspecting the layers in reverse order. Writing p_θ^1, …, p_θ^{d_{h−1}} for the output polynomials at depth h − 1, the top output at depth h is:

w^{h}_{11}\,(p_\theta^{1})^{r} + w^{h}_{12}\,(p_\theta^{2})^{r} + \cdots + w^{h}_{1 d_{h-1}}\,(p_\theta^{d_{h-1}})^{r}. \tag{9}

This expresses a polynomial as a weighted sum of r-th powers of other (nonlinear) polynomials. Recently, a study of such decompositions has been initiated in the algebra community [23]. Such expressions extend usual tensor decompositions, since weighted sums of powers of homogeneous linear forms correspond to symmetric CP decompositions. Accounting for earlier layers, our neural network expresses each p_θ^i in (9) as a weighted sum of r-th powers of lower-degree polynomials at depth h − 2, and so forth. Iterating the main result of [16] on decompositions of type (9), we obtain the following bound on filling intermediate widths.

Theorem 10 (Bound on filling widths).

Suppose that d = (d_0, …, d_h) and r satisfy

d_{h-i} \;\ge\; \min\!\left( d_h \cdot r^{\,i d_0},\;\; \binom{r^{h-i} + d_0 - 1}{r^{h-i}} \right) \tag{10}

for each i = 1, …, h − 1. Then the functional variety V_{d,r} is filling.

3.3 Computational investigation of dimensions

We have written code² in the mathematical software SageMath [12] that computes the dimension of V_{d,r} for a general architecture d and activation degree r. Our approach is based on randomly selecting parameters θ and computing the rank of the Jacobian of Φ_{d,r} in (2). This method rests on the following lemma, which comes from the fact that the map Φ_{d,r} is algebraic.

²Available at https://github.com/mtrager/polynomial_networks.

Lemma 11.

For all θ, the rank of the Jacobian matrix of Φ_{d,r} at θ is at most the dimension of the variety V_{d,r}. Furthermore, equality holds for almost all θ (i.e., for a non-empty Zariski-open subset of the parameter space).

Thus, if the Jacobian is full rank at any θ, this witnesses a mathematical proof that V_{d,r} is filling. On the other hand, if the Jacobian is rank-deficient at a random θ, this indicates with "probability 1" that V_{d,r} is not filling. We have implemented two variations of this strategy, leveraging backpropagation:

1. Backpropagation over a polynomial ring. We defined a network class over a polynomial ring, taking as input a vector of variables x = (x_1, …, x_{d_0}). Performing automatic differentiation (backpropagation) of the output function yields polynomials corresponding to ∂p_θ/∂w, for any entry w of a weight matrix W_i. Extracting the coefficients of the monomials in x, we recover the entries of the Jacobian of Φ_{d,r}.

2. Backpropagation over a finite field. We defined a network class over a finite field F_p. After performing backpropagation at a sufficient number of random sample points x, we can recover the entries of the Jacobian of Φ_{d,r} by solving a linear system (this system is overdetermined, but it has an exact solution since we use exact finite field arithmetic). The computation over F_p provides the correct dimension over the rationals for almost all primes p.

The first algorithm is simpler and does not require interpolation, but it is generally slower. We present examples of some of our computations in Tables 1 and 2. Table 1 shows minimal architectures that are filling, as the depth varies. Here, "minimal" is with respect to the partial ordering comparing all widths. It is interesting to note that for deeper networks there is not a unique minimal filling architecture. Also conspicuous is that minimal filling widths are "unimodal": (weakly) increasing and then (weakly) decreasing. Arguably, this pattern conforms with common wisdom.

Conjecture 12 (Minimal filling widths are unimodal).

Fix r, h, d_0, and d_h. If d = (d_0, d_1, …, d_h) is a minimal filling architecture, then there is an index k such that d_0 ≤ d_1 ≤ … ≤ d_k and d_k ≥ d_{k+1} ≥ … ≥ d_h.

Table 2 shows examples of computed dimensions, for varying architectures and degrees. Notice that the dimension of an architecture stabilizes as the degree r increases.
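The Jacobian-rank strategy described above can also be sketched with floating-point arithmetic rather than exact polynomial or finite-field computation. The following is our own illustrative code (not the authors' SageMath implementation; the names `net_eval` and `variety_dim` are ours): it replaces coefficient extraction by evaluation at generic sample points, which preserves the Jacobian rank, and estimates derivatives by central differences.

```python
import numpy as np
from math import comb

def net_eval(weights, r, xs):
    """Evaluate the network at sample inputs xs (shape (m, d0)); for generic
    xs the evaluations determine the coefficients of p_theta, so the Jacobian
    rank of this map equals the Jacobian rank of Phi in (2)."""
    out = xs @ weights[0].T
    for W in weights[1:]:
        out = (out ** r) @ W.T
    return out.ravel()

def variety_dim(dims, r, seed=0, eps=1e-5):
    """Estimate dim V_{d,r} as the numerical rank of the Jacobian of
    theta -> p_theta at a random theta (Lemma 11)."""
    rng = np.random.default_rng(seed)
    ws = [rng.standard_normal((dims[i + 1], dims[i]))
          for i in range(len(dims) - 1)]
    deg = r ** (len(dims) - 2)
    m = dims[-1] * comb(dims[0] + deg - 1, deg) + 5  # > ambient dimension
    xs = rng.standard_normal((m, dims[0]))
    theta = np.concatenate([w.ravel() for w in ws])

    def unpack(t):
        out, i = [], 0
        for w in ws:
            out.append(t[i:i + w.size].reshape(w.shape))
            i += w.size
        return out

    cols = []
    for j in range(theta.size):          # central-difference Jacobian columns
        e = np.zeros_like(theta)
        e[j] = eps
        cols.append((net_eval(unpack(theta + e), r, xs)
                     - net_eval(unpack(theta - e), r, xs)) / (2 * eps))
    J = np.stack(cols, axis=1)
    s = np.linalg.svd(J, compute_uv=False)
    return int((s > 1e-6 * s[0]).sum())

# Example 3: architecture (2, 2, 3) with r = 2 gives a hypersurface of dim 8.
print(variety_dim((2, 2, 3), 2))
```

For the architecture of Example 3 this recovers dimension 8: the 10 parameters lose 2 degrees of freedom to the scaling symmetries of Lemma 13, matching the hypersurface in R^9.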

4 General results

This section presents general results on the dimension of V_{d,r}. We begin by pointing out symmetries in the network map Φ_{d,r} under suitable scalings and permutations.

Lemma 13 (Multi-homogeneity).

For all invertible diagonal matrices D_i and permutation matrices P_i (i = 1, …, h − 1), the map Φ_{d,r} returns the same output under the replacement:

\begin{aligned}
W_1 &\leftarrow P_1 D_1 W_1 \\
W_2 &\leftarrow P_2 D_2 W_2 D_1^{-r} P_1^{T} \\
W_3 &\leftarrow P_3 D_3 W_3 D_2^{-r} P_2^{T} \\
&\;\;\vdots \\
W_h &\leftarrow W_h D_{h-1}^{-r} P_{h-1}^{T}.
\end{aligned}

Thus the dimension of a generic fiber (pre-image) of Φ_{d,r} is at least Σ_{i=1}^{h−1} d_i.

Our next result deduces a general upper bound on the dimension of V_{d,r}. Conditional on a standalone conjecture in algebra, we prove that equality in the bound is achieved for all sufficiently high activation degrees r. An unconditional result is achieved by varying the activation degrees across layers.

Theorem 14 (Naive bound and equality for high activation degree).

If d = (d_0, …, d_h), then

\dim V_{d,r} \;\le\; \min\!\left( d_h + \sum_{i=1}^{h} (d_{i-1} - 1)\, d_i,\;\; d_h \binom{d_0 + r^{h-1} - 1}{r^{h-1}} \right). \tag{11}

Conditional on Conjecture 16, for fixed d = (d_0, …, d_h), there exists r̃ = r̃(d) such that whenever r ≥ r̃, equality holds in (11). Unconditionally, for fixed d, there exist infinitely many choices of layer-wise activation degrees for which the image of the corresponding network map has dimension equal to the right-hand side of (11).
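Both terms of the bound (11) are easy to evaluate for a given architecture. A minimal sketch (our own helper, named `naive_bound` for illustration):

```python
from math import comb

def naive_bound(dims, r):
    """Right-hand side of (11): the minimum of the parameter count (corrected
    for the generic fiber dimension of Lemma 13) and the ambient dimension."""
    h = len(dims) - 1
    params = dims[h] + sum((dims[i - 1] - 1) * dims[i] for i in range(1, h + 1))
    ambient = dims[h] * comb(dims[0] + r ** (h - 1) - 1, r ** (h - 1))
    return min(params, ambient)

# Architecture (2, 2, 3) with r = 2 (Example 3): the bound gives
# min(3 + 1*2 + 1*3, 3 * C(3, 2)) = min(8, 9) = 8, matching dim V = 8.
assert naive_bound((2, 2, 3), 2) == 8
```

For Example 3 the first term of (11) is the smaller one, and the bound is attained: the variety is the eight-dimensional hypersurface described by (6).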

Proposition 15.

Given integers n, d, and s, there exists r̃ with the following property: whenever p_1, …, p_s are homogeneous polynomials of degree d in n variables, no two of which are linearly dependent, the powers p_1^r, …, p_s^r are linearly independent if r ≥ r̃.

Conjecture 16.

In the setting of Proposition 15, r̃ may be taken to depend only on n and s.

Proposition 15 and Conjecture 16 are used in an induction on the depth h for the equality statements in Theorem 14. Our next result uses the iterative nature of neural networks to provide a recursive bound.

Proposition 17 (Recursive Bound).

For all d = (d_0, …, d_h) and 1 ≤ k ≤ h − 1, we have:

\dim V_{(d_0,\ldots,d_h),r} \;\le\; \dim V_{(d_0,\ldots,d_k),r} + \dim V_{(d_k,\ldots,d_h),r} - d_k. \tag{12}

Using the recursive bound, we can prove an interesting bottleneck property for polynomial networks.

Definition 18.

The width k̃ in layer i is an asymptotic bottleneck (for fixed r, d_0, and i) if there exists h̃ such that, for all h ≥ h̃ and all choices of the other widths, the architecture (d_0, …, d_{i−1}, k̃, d_{i+1}, …, d_h) is non-filling.

This expresses our finding that too narrow a layer can "choke" a polynomial network, so that there is no hope of filling the ambient space, regardless of how wide the other layers are or how deep the network is.

Theorem 19 (Bottlenecks).

If k̃ ≤ 2d_0 − 2, then k̃ is an asymptotic bottleneck. Moreover, conditional on Conjecture 2 in [26], k̃ ≥ 2d_0 is not an asymptotic bottleneck.

Proposition 17 affords a simple proof of a weaker version of the first statement. However, to obtain the full statement of Theorem 19, we seem to need more powerful tools from algebraic geometry.

5 Conclusion

We have studied the functional space of neural networks from a novel perspective. Deep polynomial networks furnish a framework for nonlinear networks to which the powerful mathematical machinery of algebraic geometry may be applied. In this respect, we believe polynomial networks can help us access a better understanding of deep nonlinear architectures, for which a precise theoretical analysis has been extremely difficult to obtain. Furthermore, polynomials can be used to approximate any continuous activation function over any compact support (Stone-Weierstrass theorem). For these reasons, developing a theory of deep polynomial networks is likely to pay dividends in building understanding of general neural networks.

In this paper, we have focused our attention on the dimension of the functional space of polynomial networks. The dimension is the first and most basic descriptor of an algebraic variety, and in this context it provides an exact measure of the expressive power of an architecture. Our novel theoretical results include a general formula for the dimension of the functional variety, attained for high activation degree, as well as a tight lower bound and nontrivial upper bounds on the widths of the layers in order for the functional variety to be filling. We have also demonstrated intriguing connections with tensor and polynomial decompositions, including some which appear in very recent literature in algebraic geometry.

The tools and concepts introduced in this work for fully connected feedforward polynomial networks can be applied in principle to more general algebraic network architectures. Variations of our algebraic model could include multiple polynomial activations (rather than just single exponentiations) or more complex connectivity patterns of the network (convolutions, skip connections, etc.). The functional varieties of these architectures could be studied in detail and compared. Another possible research direction is a geometric study of the functional varieties, beyond the simple dimension. For example, the degree or the Euclidean distance degree draisma_euclidean_2013 of these varieties could be used to bound the number of critical points of a loss function. Additionally, motivated by Section 3.2, we would like to develop computational methods for constructing a network architecture that represents an assigned polynomial mapping. Such algorithms might lead to “closed form” approaches for learning using polynomial networks (similar to SVD or tensor decomposition), as a provable counterpoint to gradient descent methods. Our research program might also shed light on the practical problem of choosing an appropriate architecture for a given application.

Acknowledgements

We thank Justin Chen, Amit Moscovich, Claudiu Raicu and Steven Sam for their help. JK was partially supported by the Simons Collaboration on Algorithms and Geometry. MT and JB were partially supported by the Alfred P. Sloan Foundation, NSF RI-1816753 and Samsung Electronics.

References

• [1] James Alexander and André Hirschowitz. Polynomial interpolation in several variables. Journal of Algebraic Geometry, 4(2):201–222, 1995.
• [2] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. In International Conference on Learning Representations, 2019.
• [3] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: implicit acceleration by overparameterization. In International Conference on Machine Learning, pages 244–253, 2018.
• [4] Pranav Bisht. On hitting sets for special depth-4 circuits. Master’s thesis, Indian Institute of Technology Kanpur, 2017.
• [5] Grigoriy Blekherman and Zach Teitler. On maximum, typical and generic ranks. Mathematische Annalen, 362(3-4):1021–1031, 2015.
• [6] Winfried Bruns and Jürgen Herzog. Cohen-Macaulay rings, volume 39 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 1993.
• [7] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3036–3046, 2018.
• [8] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: a tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
• [9] Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning, pages 955–963, 2016.
• [10] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
• [11] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674, 2011.
• [12] The Sage Developers. SageMath, the Sage Mathematics Software System (Version 8.0.0), 2017.
• [13] Jan Draisma, Emil Horobeţ, Giorgio Ottaviani, Bernd Sturmfels, and Rekha R. Thomas. The Euclidean distance degree of an algebraic variety. Foundations of Computational Mathematics, 16(1):99–149, 2016.
• [14] Simon S. Du and Jason D. Lee. On the power of over-parametrization in neural networks with quadratic activation. In International Conference on Machine Learning, pages 1329–1338, 2018.
• [15] David Eisenbud. Commutative algebra: with a view toward algebraic geometry, volume 150 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1995.
• [16] Ralf Fröberg, Giorgio Ottaviani, and Boris Shapiro. On the Waring problem for polynomial rings. Proceedings of the National Academy of Sciences, 109(15):5600–5602, 2012.
• [17] Joe Harris. Algebraic geometry: a first course, volume 133 of Graduate Texts in Mathematics. Springer-Verlag, New York, corrected 3rd print edition, 1995.
• [18] Robin Hartshorne. Algebraic geometry, volume 52 of Graduate Texts in Mathematics. Springer-Verlag, New York-Heidelberg, corrected 8th print edition, 1997.
• [19] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
• [20] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
• [21] J. M. Landsberg. Tensors: geometry and applications, volume 128 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2012.
• [22] Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
• [23] Samuel Lundqvist, Alessandro Oneto, Bruce Reznick, and Boris Shapiro. On generic and maximal k-ranks of binary forms. Journal of Pure and Applied Algebra, 223(5):2062–2079, 2019.
• [24] James Martens and Venkatesh Medabalimi. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717, 2014.
• [25] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):7665–7671, 2018.
• [26] Lisa Nicklasson. On the Hilbert series of ideals generated by generic forms. Communications in Algebra, 45(8):3390–3395, 2017.
• [27] Hoifung Poon and Pedro Domingos. Sum-product networks: a new deep architecture. arXiv preprint arXiv:1202.3732, 2012.
• [28] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019.
• [29] Luca Venturi, Afonso S. Bandeira, and Joan Bruna. Spurious valleys in two-layers neural network optimization landscapes. arXiv preprint arXiv:1802.06384, 2018.

Appendix A Technical proofs

Proposition 5 (restated).

Proof.

We mimic the proof of Theorem 1 in [5]. Since the functional space of (d_0, d_1, …, d_h) is thick, it contains a Euclidean open ball B (see Chevalley's theorem [18]). Given any point p in the ambient space, we may write p = λ(p_1 − p_2) for some scalar λ and some p_1, p_2 ∈ B. Thus, in the architecture (d_0, 2d_1, …, 2d_{h−1}, d_h), we may set the "top half" of the weights to represent p_1 and the "bottom half" to represent p_2; subtracting and scaling appropriately in the last layer, the network represents p. ∎

Proposition 6 (restated).

Proof.

We write S̄ for the closure of S. Let ℓ be a line that intersects S̄ in (at least) two disjoint closed intervals I_1 and I_2. Such a line always exists because S̄ is not convex. It is easy to construct a convex function on R^n whose restrictions to I_1 and I_2 have (arbitrarily) different minimal values: this amounts to constructing a convex function of one real variable with assigned minima on two disjoint closed intervals. ∎

Proposition 7 (restated).

Proof.

It is enough to argue that the functional space does not lie in a proper linear subspace (i.e., that its affine hull is the whole ambient space). Indeed, because a non-thick functional space has measure zero, this implies that it cannot coincide with its convex hull. To show the claim, we observe that the functional space always contains all vectors of polynomials whose entries are scalar multiples of ℓ^{r^{h−1}}, where ℓ is a linear form in d_0 variables (this follows by induction on h). Such vectors span the ambient space, as any homogeneous polynomial can be written as a linear combination of powers of linear forms. ∎

See 8

Proof.

This is clear as the network outputs . ∎

See 10

Proof.

It is equivalent to show that the network map with scalars extended to (i.e., allowing complex weights), denoted , has full-measure image. For this, we use induction on . The key input is Theorem 4 of [16], which states that generic homogeneous polynomials over of degree in variables can be written as a sum of many -th powers of degree polynomials over , when .

The base case is trivial. Thus assume and that the image has full measure for . If , then for generic , the entries of form a vector space basis of , so the image of is filling. On the other hand if , then the image of is full measure by [16] and the inductive hypothesis. ∎
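The filling criterion used here (the image of a sum-of-powers map has full measure) can be probed numerically by checking that the Jacobian of the map has full rank at a random point. The sketch below is our own toy instance, not one appearing in the text: it verifies that the map sending two binary quadratics to the sum of their squares is dominant onto the 5-dimensional space of binary quartics.

```python
import sympy as sp

x, y = sp.symbols('x y')
c = sp.symbols('c0:6')                      # coefficients of two binary quadratics
q1 = c[0]*x**2 + c[1]*x*y + c[2]*y**2
q2 = c[3]*x**2 + c[4]*x*y + c[5]*y**2
s = sp.expand(q1**2 + q2**2)                # sum of two squares: a binary quartic
monos = [x**4, x**3*y, x**2*y**2, x*y**3, y**4]
coeffs = [sp.Poly(s, x, y).coeff_monomial(m) for m in monos]
# Jacobian of the coefficient map with respect to the 6 parameters
J = sp.Matrix([[sp.diff(co, ci) for ci in c] for co in coeffs])
rank = J.subs(dict(zip(c, [1, 2, 3, 5, 7, 11]))).rank()
print(rank)   # 5 = dim of binary quartics, so the map is dominant
```

Full rank at a single point certifies dominance, since the generic rank of an algebraic map is at least its rank at any particular point.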

See 11

Proof.

We note that the entries of are polynomials in , thus the minors of are polynomials in , so has a Zariski-generic rank (the largest size of a minor that is a nonzero polynomial), which is also the maximum rank of . By basic algebraic geometry, this is the dimension of (see “generic submersiveness” of algebraic maps in characteristic 0 [18]). ∎
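This lemma can be checked concretely on a toy polynomial network. The sketch below (our own choice of architecture: widths (2, 2, 1) with squaring activation) builds the weight-to-coefficients map symbolically, forms its Jacobian, and evaluates the rank at a sample point; the resulting rank 3 matches the dimension of the space of binary quadratics, which the functional variety fills here.

```python
import sympy as sp

x, y = sp.symbols('x y')
w = sp.symbols('w0:6')                          # 6 weights of a (2, 2, 1) network
W1 = sp.Matrix([[w[0], w[1]], [w[2], w[3]]])    # first layer
W2 = sp.Matrix([[w[4], w[5]]])                  # second layer
hidden = (W1 * sp.Matrix([x, y])).applyfunc(lambda t: t**2)  # rho_2: entrywise square
out = sp.expand((W2 * hidden)[0])               # network output: a binary quadratic
coeffs = [sp.Poly(out, x, y).coeff_monomial(m) for m in (x**2, x*y, y**2)]
J = sp.Matrix([[sp.diff(co, wi) for wi in w] for co in coeffs])
rank = J.subs(dict(zip(w, [1, 2, 3, 5, 7, 11]))).rank()
print(rank)   # Zariski-generic Jacobian rank = dim of the functional variety
```

Evaluating at one sufficiently general point suffices, since the rank of the Jacobian at any point lower-bounds the generic rank.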

See 13

Proof.

This follows from the multi-homogeneity of the -th power activation, by direct substitution. ∎

See 14

Proof.

We know that the dimension of equals the dimension of the domain of minus the dimension of a generic fiber of (see generic freeness [15]). Thus by Lemma 13, . At the same time, the dimension of is at most that of its ambient space . Combining these produces the bound (10).

For the next statement, we temporarily assume Conjecture 16. We shall prove, by induction on , the stronger result that for the generic fibers of are precisely as described in Lemma 13 (and no more). The base case is trivial. Thus assume and that for the generic fiber is exactly as in Lemma 13, whenever . For the induction step, we let be a threshold that works in Conjecture 16 for and , and then we set . Now, with fixed generic weights , we consider any other weights satisfying

 $W_h \rho_r W_{h-1} \cdots \rho_r W_1 x \;=\; \widetilde{W}_h \rho_r \widetilde{W}_{h-1} \cdots \rho_r \widetilde{W}_1 x$ (13)

for . Write for the output of the LHS in (13) at depth , and similarly for the RHS. By genericity and , the polynomials are pairwise linearly independent. Comparing the top outputs at depth in (13), we get two decompositions of type (9):

 $w_{h11}\, p_{\theta_1}^{r} + \ldots + w_{h1d_{h-1}}\, p_{\theta_{d_{h-1}}}^{r} \;=\; \widetilde{w}_{h11}\, \widetilde{p}_{\theta_1}^{\,r} + \ldots + \widetilde{w}_{h1d_{h-1}}\, \widetilde{p}_{\theta_{d_{h-1}}}^{\,r}$ (14)

Since , by Conjecture 16 there must be two linearly dependent summands in (14). Permuting as necessary we may assume these are the first two terms on both sides. Scaling as necessary we may assume , and then subtract from (14) to get:

 $(w_{h11} - \widetilde{w}_{h11})\, p_{\theta_1}^{r} + \ldots + w_{h1d_{h-1}}\, p_{\theta_{d_{h-1}}}^{r} \;=\; \widetilde{w}_{h12}\, \widetilde{p}_{\theta_2}^{\,r} + \ldots + \widetilde{w}_{h1d_{h-1}}\, \widetilde{p}_{\theta_{d_{h-1}}}^{\,r}$ (15)

Invoking Conjecture 16 again, we may remove another summand from the RHS, and so on, until the RHS is 0. Then each individual summand on the LHS must be 0 too, by pairwise linear independence and Conjecture 16 once more. We have argued that (up to scales and permutation) it must hold that and . Comparing the other outputs at depth in (13) gives (up to scales and permutation). Thus by the inductive hypothesis, the fiber through is as in Lemma 13 and no more. This completes the induction.

For the unconditional result with differing degrees per layer, the argument runs along similar lines, but relies on Proposition 15 in place of Conjecture 16. For brevity, the details are omitted. ∎

See 15

Proof.

It is shown in [4] (via Wronskian and Vandermonde determinants) that for any particular , no two of which are linearly dependent, there exists such that if . The dependence on the particular can be removed as follows.

Let be the set of -tuples, no two entries of which are linearly dependent. Then is Zariski-open, described by the non-vanishing of minors. Further, let be the subset of -tuples whose -th powers are linearly independent, similarly Zariski-open. Consider the chain of inclusions . By [4], the union of this chain equals . Thus, by Noetherianity of affine varieties, there exists with  [15]. Now works. ∎
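The Wronskian/Vandermonde mechanism from [4] can be illustrated numerically in the binary case (a toy instance of our own choosing): for pairwise independent linear forms $x + t_i y$ with distinct $t_i$, the coefficient matrix of their $r$-th powers is a Vandermonde matrix with columns rescaled by binomial coefficients, hence of full rank as soon as $r + 1$ is at least the number of forms.

```python
import numpy as np
from math import comb

r = 4
ts = [0.0, 1.0, 2.0, -1.0, 0.5]   # 5 pairwise independent forms x + t*y
# row i holds the coefficients of (x + t_i y)^r in the basis x^r, x^(r-1) y, ..., y^r
M = np.array([[comb(r, k) * t**k for k in range(r + 1)] for t in ts])
rank = np.linalg.matrix_rank(M)
print(rank)   # full rank 5: the r-th powers are linearly independent
```

Column rescaling by the nonzero binomial coefficients does not change the rank, so linear independence reduces to the classical Vandermonde determinant for distinct $t_i$.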

See 17

Proof.

This bound encapsulates the bracketing:

 $(W_h \rho_r W_{h-1} \cdots W_{k+1})\; \rho_r\; (W_k \rho_r W_{k-1} \cdots W_1 x)$ (16)

More formally, the network map factors as:

 (17)

by first sending to the pair of bracketed terms in (16) and then the pair to the composite in (16). The closure of the image of the first map in (17) is . On the other hand, the second map in (17) has -dimensional generic fibers, by multiplying with a diagonal matrix . Combining these facts gives the result. ∎

See 19

Proof.

We first point out that Proposition 17 gives an elementary proof that is an asymptotic bottleneck. This is because, as grows, the ambient dimension grows like , while the RHS bound grows like , so if then cannot fill for .

To gain a factor of 2 in the bottleneck bound, we start by writing for the output polynomials at depth , that is, for . Fixing , we consider , a subalgebra of the Veronese ring . The key idea is to compare the Hilbert polynomials of and of  [6]. If the Hilbert polynomials differ in any non-constant terms, this means that the dimension of the degree- piece of minus that of diverges to as goes to . At the same time, however we vary the weights (keeping fixed), the output polynomials remain in the algebra . Additionally, for varying and , the possible -vectors of degree- polynomials in variables, , comprise a bounded-dimensional variety. The upshot is that if it must always be the case (based on ) that the Hilbert polynomials of and have a non-constant difference, then must be an asymptotic bottleneck. Thus it suffices to check that the Hilbert polynomial property holds for all if . To this end, we derive the following general result:
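The Hilbert-function gap between such a subalgebra and the full Veronese ring can be measured in a small instance (our own illustrative choice: three random quadratics in three variables). The degree-4 piece of the algebra they generate is spanned by their 6 pairwise products, whereas the full space of quartics in three variables has dimension 15; the sketch below computes both numbers with sympy.

```python
import sympy as sp
import itertools, random

x = sp.symbols('x0:3')
random.seed(0)
monos2 = [a * b for a, b in itertools.combinations_with_replacement(x, 2)]
# three random quadratic forms in three variables
forms = [sum(random.randint(1, 9) * m for m in monos2) for _ in range(3)]
# degree-4 piece of the subalgebra: spanned by pairwise products of the forms
prods = [sp.expand(f * g) for f, g in itertools.combinations_with_replacement(forms, 2)]
monos4 = [sp.prod(t) for t in itertools.combinations_with_replacement(x, 4)]
M = sp.Matrix([[sp.Poly(p, *x).coeff_monomial(m) for m in monos4] for p in prods])
rank, dim_full = M.rank(), len(monos4)
print(rank, dim_full)   # subalgebra piece is strictly smaller than the quartic space
```

As the degree grows, the dimension of the subalgebra's graded pieces is bounded by the number of monomials in the generators, while the Veronese pieces grow faster, which is exactly the non-constant difference of Hilbert polynomials exploited in the proof.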

Claim.

Given integers and , whenever are homogeneous polynomials of the same degree in variables, the algebra and the Veronese algebra have Hilbert polynomials with a non-constant difference.

Proof of claim.

First, it suffices to check the claim for generic . Second, the difference in Hilbert polynomials identifies with the Hilbert polynomial of the sheaf  [18]. Here () is the projective Veronese variety, the linear projection