"Are all layers created equal?" by Zhang et al. is a recent work that addresses how sensitive the parameters of an over-parameterized deep network are. Its experiments reveal heterogeneous characteristics across layers: bottom layers have a higher sensitivity than top layers. This is an exciting observation, since it is exactly what the geometry of quantum computation told us about deep networks a decade ago!
In our former work, inspired by the facts that deep networks are effective descriptors of our physical world and that deep networks share similar geometric structures with physical systems such as geometric mechanics, quantum computation, quantum many-body systems and even general relativity, we proposed a geometrization scheme to interpret deep networks and deep learning systems. The observation of Zhang et al. encouraged us to apply this scheme to over-parameterized deep networks in order to give a geometric description of such networks.
In the following parts of this paper, we will explore the similarities between deep networks and quantum computation systems. We will transfer the rich geometric structure of quantum mechanics and quantum computation systems to deep networks so that we have an intuitive geometric understanding of the basic properties of over-parameterized deep networks, including network complexity, generalization, convergence and the geometry formed by deep networks.
Geometrization of physics is the greatest and most successful idea in human history for understanding the rules of our physical world. But why can our world be geometrized? In the last decade, we saw a new trend of combining geometrization and quantum information processing to draw a completely new picture of our world. Basically, this is to regard our world, including spacetime, matter and the interactions among them, as emergent from a complex quantum deep network. From this point of view, our world is built from deep networks, and the geometric structure of the physical world emerges from the geometric structure of the underlying deep networks. So the geometrization of physics is essentially the geometrization of the underlying quantum deep networks. The success of the geometrization of physics indicates that geometrization is also the key to understanding deep networks.
The similarities between deep networks and physical systems, including both classical geometric mechanics and quantum computation systems, have been addressed in our former works. Here for simplicity we only give a brief recap of key points we have learned from the geometrization of quantum information processing that will be involved in this paper.
II-A Geometry of quantum information processing
It is well known that quantum mechanics has a rich geometric structure, and we believe quantum mechanics is the ultimate rule of our world. Quantum information processing, or quantum computation, which explores the complex structure of both quantum states and quantum state evolutions, is the ultimate tool to describe our world, and the rules of quantum information processing systems can be applied to all physical systems, including deep networks. So what do we already know about quantum information processing systems?
Gigantic quantum state space and the corner of physical states
For simplicity we use the most popular model of quantum information processing, i.e. a quantum state is described by an $n$-qubit system and the quantum information processing is described by a quantum circuit model. The quantum state space is huge, since the dimension of an $n$-qubit pure state system is $2^n$ and the number of distinguishable states is doubly exponential in $n$. Among all these states, only a tiny zero-measure subset, the corner of physical states, is physically realizable, since the states in this subset can be generated with polynomial complexity from a simple initial state such as the product state $|0\rangle^{\otimes n}$.
Quantum computational complexity
The concept of quantum computational complexity plays a key role not only in quantum computation but also in quantum gravity, the black hole information problem and quantum phase transitions. Basically, a quantum algorithm on an $n$-qubit system is a unitary transformation $U \in SU(2^n)$, and its computational complexity $C(U)$ is given by the geodesic distance between the identity operation $I$ and $U$, where the geodesic is defined on the Riemannian manifold of $SU(2^n)$. For more details on the geometry of quantum computation, please refer to Nielsen et al. and to Dowling and Nielsen. Accordingly, the state complexity of an $n$-qubit quantum state $|\psi\rangle$ is defined as the minimal complexity of all the quantum algorithms that can generate $|\psi\rangle$ from $|0\rangle^{\otimes n}$, i.e. $C(|\psi\rangle) = \min_{U:\, U|0\rangle^{\otimes n} = |\psi\rangle} C(U)$. Since the number of degrees of freedom (DOF) of a general $n$-qubit transformation is $4^n - 1$, its computational complexity is exponential in $n$. That is to say, a general $n$-qubit algorithm can only be achieved by a quantum circuit with exponentially many quantum gates, which is regarded as non-realizable. What we are interested in are the polynomial complexity algorithms, which can be used to prepare the corner of physical states from the product state $|0\rangle^{\otimes n}$.
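The parameter counting behind the exponential gate count can be made explicit. The following is only a rough sketch, assuming a circuit built from two-qubit gates (the gate set is our assumption for illustration; it is not fixed in the text). The group $SU(2^n)$ has $4^n - 1$ real parameters, while a single two-qubit gate in $SU(4)$ has at most $15$. A circuit with $G$ such gates therefore sweeps out a family of unitaries of dimension at most $15G$, so reaching a generic element of $SU(2^n)$ requires

$$15\,G \;\ge\; 4^n - 1 \quad\Longrightarrow\quad G \;\ge\; \frac{4^n - 1}{15}.$$

For $n = 10$ qubits this already gives $G \ge (4^{10} - 1)/15 = 69905$ gates, and the bound grows by a factor of $4$ with every additional qubit. Circuits with polynomially many gates can thus only reach a lower-dimensional subset of $SU(2^n)$.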
Quantum computational complexity and geometry
Quantum computational complexity has a rich geometric structure. Firstly, the quantum complexity is defined on the Riemannian structure of the manifold $SU(2^n)$. A natural question is then: what is the curvature of the Riemannian manifold of quantum computation? It has been shown that this manifold may have a non-positive curvature everywhere. That is to say, geodesics on this manifold are not stable and are sensitive to their initial momentum. Keen readers can immediately see that we now have a connection between quantum computation and the observation of Zhang et al. Secondly, the concept of quantum computational complexity builds a correspondence, or a duality, between quantum states and quantum algorithms. That is to say, given a quantum state $|\psi\rangle$, we have a corresponding optimal quantum algorithm to prepare it from an initial product state. If we take the quantum circuit of the algorithm as a network of quantum operations, then we have a duality between quantum states and quantum deep networks. This duality may play a key role in understanding the geometry of spacetime. In fact, the geometry of spacetime is just the geometry of the quantum deep network. The take-home message is that the dual quantum deep network of a quantum state is determined by a Riemannian geometry of the quantum transformation space, and a quantum deep network also generates a Riemannian geometry. So do we have two Riemannian structures? There are signs that, if we use the Fisher-Rao metric of the deep network, the two can be united and general relativity can be deduced from it.
Quantum mechanics and geometry
Finally, even if we consider the most classical quantum mechanics without the fancy concept of quantum complexity, we can still learn something that can be applied to understanding deep networks. The first observation is the geometry of quantum state space. It is well known that quantum mechanics shows a probabilistic property: in a projective measurement, the probability that the state falls into an eigenstate of the observable is determined by the distance between the initial state and the final state. Geometrically this means the probabilistic property of quantum mechanics is determined by the Riemannian structure of quantum mechanics. The second observation is the geometry of quantum evolution. A general quantum state evolution of an $n$-qubit system can be written as a sequence of unitary transformations $U = U_N U_{N-1} \cdots U_1$ with $U_i \in SU(2^n)$. Obviously this can be regarded as a linear deep network. How about the stability of this system? It has been shown that this system shows a chaotic property, which means a tiny perturbation of the first operation $U_1$ will lead to a huge change of the composite operation $U$.
We will see that all the aforementioned observations can help us to understand over-parameterized deep networks.
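The first observation, that the measurement probability is fixed by a distance on state space, can be checked numerically. The snippet below is a minimal sketch, assuming the distance in question is the Fubini-Study distance on projective state space (the text does not name the metric, so this choice is our assumption). Under that assumption the Born probability $|\langle\phi|\psi\rangle|^2$ equals $\cos^2$ of the distance. The function names are illustrative only.

```python
import numpy as np


def fubini_study_distance(psi, phi):
    """Fubini-Study distance arccos(|<psi|phi>|) between two (unnormalized) state vectors."""
    overlap = abs(np.vdot(psi, phi)) / (np.linalg.norm(psi) * np.linalg.norm(phi))
    # Clip guards against rounding slightly above 1 before taking arccos.
    return float(np.arccos(np.clip(overlap, 0.0, 1.0)))


def born_probability(psi, phi):
    """Probability of finding the state psi in the state phi in a projective measurement."""
    norm_sq = np.vdot(psi, psi).real * np.vdot(phi, phi).real
    return float(abs(np.vdot(phi, psi)) ** 2 / norm_sq)


# Two random 3-qubit states (dimension 2**3 = 8).
rng = np.random.default_rng(0)
psi = rng.standard_normal(8) + 1j * rng.standard_normal(8)
phi = rng.standard_normal(8) + 1j * rng.standard_normal(8)

d = fubini_study_distance(psi, phi)
p = born_probability(psi, phi)
print(f"distance = {d:.4f}, Born probability = {p:.4f}, cos^2(distance) = {np.cos(d) ** 2:.4f}")
```

The closer the final state is to the initial state in this metric, the larger the probability, which is the sense in which the probabilistic property has a geometric origin.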
III Geometrization of over-parameterized deep networks
III-A Over-parameterized deep networks
We first give a brief summary of the known facts and arguments about over-parameterized deep networks.
Over-parameterization
By over-parameterized deep networks, we usually mean that the number of network parameters is much larger than the number of training data. The over-parameterization is in both the width and the depth of deep networks. Existing works show that over-parameterization plays a key role in the network capacity, convergence, generalization and even the acceleration of the optimization. But exactly how over-parameterization affects the performance of deep networks is still not completely clear to us.
Local minima and convergence
It is obvious that over-parameterized networks have a large number of local minima. Soudry and Carmon showed that for over-parameterized deep networks, with a high probability, all the local minima are also global minima as long as the data are not degenerate. A similar argument by Du et al. and Allen-Zhu et al. told us that for sufficiently over-parameterized deep networks, gradient descent can reach local minima with a high probability from any initialization point of the network. Of course this is because the over-parameterization reshapes the loss landscape of deep networks. Can we have an intuitive geometric picture of this point?
Network complexity and generalization
Although all the local minima can fit the training data well, we know they are not equal, since they have different generalization capabilities, and we prefer to find a configuration with good generalization performance. Generally, the generalization of a network is related to the network complexity, and a lower network complexity means a better generalization performance. Wu, Zhu and E showed that the minima that generalize well have a larger volume of basin of attraction, so that they dominate over the poor ones. This is an interesting observation, and we will show that it is essentially an analogue of the probabilistic characteristics of quantum mechanics and that it has a geometric origin.
Loss landscape
Over-parameterization changes the loss landscape. Cooper claimed that the locus of global minima is usually not discrete but rather a continuous high-dimensional submanifold of the parameter space. But how the structure of this submanifold changes with the number of parameters is still an open problem.
Implicit acceleration by over-parameterization
Arora et al. claimed that over-parameterization, especially in the depth direction, works as an acceleration mechanism for the optimization of deep networks, and that this acceleration cannot be achieved by a regularization. We will show that this may be a misunderstanding of the role of over-parameterization.
Layers are not created equal
For a multilayer deep network, it is a natural question whether all the layers are equal. The recent work of Zhang et al. showed that layers have different sensitivities in fully connected networks, convolutional networks and residual networks alike. What is the geometry behind this observation? We will try to understand this point as an analogue of quantum information processing systems.
III-B Geometric picture of over-parameterized deep networks
The geometrization of deep networks has been explained in our former work, where we showed that deep networks share the same geometric structure as geometric mechanics and quantum computation systems. The key observation is that deep networks are curves connecting the identity transformation and the target transformation on the Riemannian manifold of data transformations. We will now see how over-parameterized deep networks can be understood in this geometrization framework.
Over-parameterization
What is the role of over-parameterization in deep networks? How can we determine whether a network is properly over-parameterized? In fact, we can understand over-parameterization by comparing it with quantum computation systems. In quantum computation we have a gigantic state space, and only a zero-measure subset, the corner of physical states, is physically realizable. The duality between quantum states and quantum algorithms shows that this is also true for quantum algorithms. Similarly, the space of possible functions between the input and output data of deep networks is also huge, and only a small subset of it is physically interesting for us, namely the subset of functions that have a polynomial computational complexity. So essentially, approximating a function by deep networks is to explore this subset. Compared with quantum computation systems, a universal shallow network is just a general unitary transformation $U \in SU(2^n)$, which needs an exponential complexity to describe a transformation of the data state space. A polynomial deep network is just a polynomial quantum circuit that only generates the corner of physical states. From this complexity point of view, deep networks are not really universal, since they only explore a subset of all possible transformations. In over-parameterized deep networks, increasing the width and depth of the networks can be understood as increasing the number of qubits and the length of the quantum circuit used to achieve a quantum algorithm. A key point is that, in order to achieve a quantum algorithm $U$, the complexity of the quantum circuit, which is roughly proportional to the depth of the quantum circuit, has to exceed the quantum complexity of $U$.
Local minima and convergence
How over-parameterization changes the distribution of local minima and the convergence is not very clear yet. If we compare deep networks with quantum mechanics, we can only say that the cost function of deep networks can be regarded as a frustration-free Hamiltonian and the global minima are ground states of this frustration-free Hamiltonian. This observation is closely related to the concepts of parent Hamiltonian and uncle Hamiltonian. But whether there is an exact correspondence between them is still under investigation.
Network complexity and generalization
The relationship between network complexity and generalization capability is straightforward. In our former work comparing deep networks with the image registration problem, we indicated that the network complexity can be understood as the deformation energy of a diffeomorphic image transformation. So a lower network complexity means a smooth, low-energy deformation. Obviously a smooth image transformation has a better generalization performance. The observation of Wu, Zhu and E that a solution with a better generalization has a higher probability of being found during optimization from a random initialization then has an exact correspondence in quantum mechanics. As mentioned in the first section, during a projective measurement, the probability that a final quantum state appears is related to its distance to the initial quantum state. That is to say, the probability is determined by the complexity of the quantum transformation that transforms the initial state into the final state. We see that this is exactly what happens in over-parameterized deep networks. Here a better generalization means a lower network complexity and a higher probability that this network configuration is found during optimization. Obviously we also have a relationship between the probability and the complexity. So we can claim that the probability that a deep network configuration is found by optimization is determined by the network complexity, which is geometrically the Riemannian distance between the transformation achieved by this network and the identity transformation. It is very interesting to see that classical deep networks show the same probabilistic property as quantum mechanics. For us it is even more interesting to check whether this observation can be used to understand quantum mechanics from a deep network point of view, because the measurement problem of quantum mechanics is still not fully understood. Can the commonly used decoherence picture of quantum measurement be formulated as a training process of deep networks?
Loss landscape
It is straightforward to see that an over-parameterized deep network has a locus of global minima that forms a high-dimensional submanifold of the parameter space. But we are not clear about the exact structure of this submanifold and how it changes with an increasing number of network parameters. For example, we have no idea whether this high-dimensional submanifold is connected or disconnected, or whether it even has a fractal-like structure. We strongly suspect that the locus of global minima has a fractal structure, since the network is nonlinear and the sensitivities of different layers are different, as will be further addressed in the following discussions.
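Implicit acceleration by over-parameterization
Can over-parameterization provide an implicit acceleration of the optimization, as claimed by Arora et al.? To clarify this, we first restate their argument, in which a linear neural network is considered as follows. $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathbb{R}^k$ are the input and output data spaces. An $N$-layer linear network is used to fit a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m} \subset \mathcal{X} \times \mathcal{Y}$, and a loss function $l(\hat{y}, y)$ is used, where $\hat{y}$ is the output of the network given the input $x$. The parameters of the depth-$N$ linear network are $W_1, \ldots, W_N$ and the end-to-end weight matrix is given by $W_e = W_N W_{N-1} \cdots W_1$, so that $L^N(W_1, \ldots, W_N) = L^1(W_e)$. The gradient descent based optimization of $W_e$ can then be written as
$$W_e^{(t+1)} = (1 - \eta\lambda N)\, W_e^{(t)} - \eta \sum_{j=1}^{N} \left[ W_e^{(t)} (W_e^{(t)})^\top \right]^{\frac{j-1}{N}} \frac{dL^1}{dW}\big(W_e^{(t)}\big) \left[ (W_e^{(t)})^\top W_e^{(t)} \right]^{\frac{N-j}{N}} \qquad (1)$$
where $\eta$ is the learning rate, $\lambda$ is the weight decay coefficient, and they assume that $W_{j+1}^\top W_{j+1} = W_j W_j^\top$ is fulfilled for the network. Arora et al. argued that the difference between the $N$-layer deep network and a 1-layer network is that the gradient is transformed by the two items $[W_e W_e^\top]^{\frac{j-1}{N}}$ and $[W_e^\top W_e]^{\frac{N-j}{N}}$. They interpreted the effect of over-parameterization (replacing the classic linear model by depth-$N$ linear networks) on gradient descent as follows: the deep network structure reshapes the gradient by changing both its amplitude and direction, which can be understood as introducing some forms of momentum and adaptive learning rate. They also claimed that this over-parameterization effect cannot be obtained by regularization.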
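The way depth reshapes the gradient can be seen even without the balancedness assumption. By the chain rule, one gradient step on the individual layers changes the end-to-end matrix, to first order in the learning rate $\eta$, by $-\eta \sum_j A_j A_j^\top G\, B_j^\top B_j$, where $G$ is the gradient of the loss with respect to $W_e$, $A_j = W_N \cdots W_{j+1}$ and $B_j = W_{j-1} \cdots W_1$. The sketch below checks this numerically. It is only an illustration under our own assumptions: a squared loss, no weight decay, small random layers, and hypothetical helper names (`chain`, `layerwise_step`); it is not the experimental setup of Arora et al.

```python
import numpy as np


def chain(mats, dim):
    """Return M_k ... M_1 for mats = [M_1, ..., M_k], or the identity of size dim if mats is empty."""
    P = np.eye(dim)
    for M in mats:
        P = M @ P
    return P


def end_to_end(weights):
    """End-to-end matrix W_e = W_N ... W_1, with weights[0] = W_1."""
    return chain(weights, weights[0].shape[1])


def layerwise_step(weights, X, Y, eta):
    """One gradient step on every layer for the loss ||W_e X - Y||_F^2 / (2m).

    Returns the updated layers and the first-order prediction of the change of W_e.
    """
    m = X.shape[1]
    G = (end_to_end(weights) @ X - Y) @ X.T / m   # dL/dW_e
    new_weights = []
    predicted = np.zeros_like(G)
    for j, W in enumerate(weights):
        A = chain(weights[j + 1:], W.shape[0])    # W_N ... W_{j+1}
        B = chain(weights[:j], weights[0].shape[1])  # W_{j-1} ... W_1
        new_weights.append(W - eta * A.T @ G @ B.T)  # chain rule: dL/dW_j = A^T G B^T
        predicted -= eta * (A @ A.T) @ G @ (B.T @ B)  # gradient sandwiched by two PSD factors
    return new_weights, predicted


rng = np.random.default_rng(0)
dims = [4, 5, 5, 3]  # input, two hidden widths, output (illustrative sizes)
weights = [0.3 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
X = rng.standard_normal((4, 20))
Y = rng.standard_normal((3, 20))

new_weights, predicted = layerwise_step(weights, X, Y, eta=1e-5)
actual = end_to_end(new_weights) - end_to_end(weights)
rel_err = np.linalg.norm(actual - predicted) / np.linalg.norm(predicted)
print(f"relative deviation from the first-order prediction: {rel_err:.2e}")
```

The two positive semi-definite factors $A_j A_j^\top$ and $B_j^\top B_j$ play the role of the two items that transform the gradient: they rescale and rotate $G$ depending on the current weights.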
Do we have a geometric description of this observation in our geometrization scheme? In fact, this can be directly observed by comparing deep networks with the diffeomorphic image registration problem, as in our former work. What is more, we can directly generalize the conclusion of Arora et al. to a general nonlinear deep network without any further assumptions on the network.
Diffeomorphic image registration can be abstracted as a map $G \times V \to V$, where $G$ is the group of image transformations and $V$ is the vector space of images. Large deformation diffeomorphic metric mapping (LDDMM) generates a deformation $g_1 \in G$ as the flow of a time-dependent vector field $u_t$, $t \in [0, 1]$, so that
$$\partial_t g_t = u_t \circ g_t, \qquad g_0 = \mathrm{id}. \qquad (2)$$
The diffeomorphic matching of two images $I_0$ and $I_1$ with LDDMM is to find a vector field $u_t$ that minimizes the cost function
$$E(u_t) = \frac{1}{2} \int_0^1 \|u_t\|_L^2 \, dt + \frac{1}{2\sigma^2} \|g_1 . I_0 - I_1\|_{L^2}^2.$$
Here the regularity term on $u_t$ is a kinetic energy term with a norm on the vector field defined as $\|u\|_L^2 = \langle Lu, u \rangle_{L^2}$. The operator $L$ is a positive self-adjoint differential operator. Obviously the norm $\|\cdot\|_L$ defines a Riemannian metric on the manifold of the diffeomorphic transformation group $G$. The second term is the difference between the transformed image $g_1 . I_0$ and the target image $I_1$.
A necessary condition to minimize the cost function is that the vector field should satisfy the Euler-Poincaré (E-P) equation
where , . The operator is defined as and is the momentum map.
In LDDMM framework, the curve satisfying the E-P equation is found by a gradient descent algorithm, while the gradient is given by with . A direct calculation in the LDDMM framework following  shows that the update of is given by
We can directly see that this is almost the same as the update rule of the end-to-end weight matrix given by (1). But here we are working with a nonlinear deep network, so we have a generalization of the linear network of Arora et al. In fact, their result can be regarded as a special case of LDDMM called stationary velocity fields (SVF), which is formulated on a Lie group instead of on a Riemannian manifold, and the two items that transform the gradient in (1) can be understood as an analogue of the Lie exponential used in the SVF framework.
LDDMM has a beautiful geometric picture, which is the same as that of geometric mechanics. How can we understand the effect of over-parameterization in this LDDMM framework? LDDMM formulates a smooth image transformation by a constrained curve described by (2). The gradient descent based update of the curve is essentially a constrained optimal control problem, as shown by Hart et al. and by Holm. So when we try to approximate a function by deep networks, the structure of the over-parameterized deep network essentially sets constraints on the possible solution space. The so-called acceleration effect of over-parameterization in Arora et al. is nothing but a natural result of the constrained optimal control formulation. Their conclusion that this acceleration cannot be obtained by regularization is also not exact, since the constraints in optimal control can be regarded as a kind of regularization in optimization problems. The only difference is that the regularization is set on the structure of the network.
Layers are not created equal
We have seen that in quantum computation, for both the general sequential unitary quantum evolution and the quantum circuit model, we observe the same sensitivity to initial values. That is to say, quantum information processing systems are playing with Riemannian manifolds with negative curvatures. If we compare these with the observation of Zhang et al., we find that the general quantum evolution system corresponds to fully connected networks and the quantum circuit model corresponds to convolutional networks. So we can say that the observed non-equality of layers is just a direct consequence of the principle of quantum computation systems. But one thing is still missing: the residual network. Zhang et al. observed that residual networks also show a non-equality of layers, but the pattern is different from that of fully connected and convolutional networks. Can we also find the correspondence of residual networks in quantum computation systems? Yes: since residual networks are just differential equations, they correspond to the fundamental rule of quantum mechanics, the Schrödinger equation. Since the finite time discretization of the Schrödinger equation is just the general sequential unitary quantum evolution, we believe the Schrödinger equation should have the same initial value sensitivity pattern. This means residual networks should have a pattern similar to that of fully connected and convolutional networks. This is different from the observed pattern of residual networks. How can we resolve this contradiction? If we believe that quantum mechanics is the ultimate rule of the world and that the main advantage of residual networks is to build a smoother manifold of transformations to approximate functions, then residual networks should be related to a smooth geometry, and there is no reason for some layers of residual networks to be more critical than other layers, as observed by Zhang et al. We assume this is due to artifacts of the non-uniform discretization used in residual networks and to noise during optimization. From another aspect, the redistribution of the sensitivity pattern of residual networks also indicates that the strong background negative curvature geometry of general deep networks is weakened in residual networks, so that the random perturbation effects survive. This is in fact evidence that residual networks are building and working on a flatter manifold than fully connected and convolutional networks.
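The statement that residual networks are differential equations can be made concrete: a forward Euler discretization of $\dot{x} = f(x, t)$ with step $h$ reads $x_{k+1} = x_k + h f(x_k, t_k)$, which has exactly the form of a residual block. The sketch below is illustrative only. It assumes a uniform step size and uses a fixed rotation field in place of trained layers (both are our assumptions; `residual_forward` is a hypothetical name), and it compares the output of the discretized "network" with the exact flow for different depths.

```python
import numpy as np


def residual_forward(x0, vector_field, num_layers, total_time=1.0):
    """Apply num_layers residual blocks x <- x + h * f(x, t), i.e. forward Euler steps."""
    h = total_time / num_layers
    x = np.asarray(x0, dtype=float)
    for k in range(num_layers):
        x = x + h * vector_field(x, k * h)  # identity shortcut plus a small update
    return x


# Norm-preserving linear flow: dx/dt = A x with A antisymmetric (a rotation generator).
A = np.array([[0.0, -1.0], [1.0, 0.0]])
rotation_field = lambda x, t: A @ x

x0 = np.array([1.0, 0.0])
exact = np.array([np.cos(1.0), np.sin(1.0)])  # exact flow at time t = 1

for depth in (10, 100, 1000):
    err = np.linalg.norm(residual_forward(x0, rotation_field, depth) - exact)
    print(f"depth {depth:5d}: deviation from the continuous flow = {err:.2e}")
```

With a uniform and sufficiently fine discretization the deviation shrinks as the depth grows, and every block contributes an update of the same size, so no block is singled out by the discretization itself. A non-uniform discretization would break this uniformity.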
Another problem is related to the spacetime structure. There is evidence that the geometry of spacetime is emergent from quantum information processing networks. Also, in our former work we indicated that in deep networks, if the Fisher-Rao metric is used to measure the network complexity, then the interaction between data and network structures is an analogue of the interaction between matter and spacetime geometry, i.e. general relativity. But if a general quantum deep network has a negative curvature, how can our universe have a flat (on a large scale) spacetime? Is the existence of our flat universe evidence that there exists a subset of deep networks that can form a flat Euclidean geometry? If such a corner of Euclidean deep networks exists, then all the layers will be created equal in such networks. Can this help us to find better network structures? In random matrix based analysis of deep networks, a special type of network configuration with the dynamical isometry property seems to fall into this subset. It has been shown that such networks hold some advantages over normal networks, such as a smooth information flow in both the forward and backward directions. In fact, geometrically the smooth information flow is just inertial movement in a flat spacetime, i.e. Newton's first law. Of course, just like the corner of physical states in quantum mechanics, the corner of Euclidean deep networks is also a measure-zero set. So we assume this subset may not form a universal data processing system, just as our universe may be a very special case in the so-called multiverse picture.
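The smooth information flow of such networks can be illustrated in the simplest setting. The sketch below assumes a deep linear network (our simplification; the nonlinear case needs the random matrix analysis mentioned above) and compares the singular values of the input-output Jacobian for orthogonal layers and for Gaussian layers of the same width. The helper names are hypothetical.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q


def jacobian_singular_values(layers):
    """Singular values of the input-output Jacobian W_N ... W_1 of a deep linear network."""
    J = np.eye(layers[0].shape[1])
    for W in layers:
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)


rng = np.random.default_rng(0)
width, depth = 16, 20  # illustrative sizes

orth_layers = [random_orthogonal(width, rng) for _ in range(depth)]
gauss_layers = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]

sv_orth = jacobian_singular_values(orth_layers)
sv_gauss = jacobian_singular_values(gauss_layers)

print(f"orthogonal layers: singular values in [{sv_orth.min():.6f}, {sv_orth.max():.6f}]")
print(f"Gaussian layers:   singular values in [{sv_gauss.min():.2e}, {sv_gauss.max():.2e}]")
```

With orthogonal layers every singular value stays at one, so signals and gradients keep their norm through all layers and no layer is more sensitive than another. With generic Gaussian layers the singular values spread over many orders of magnitude as the depth grows.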
Finally, the negative curvature will influence the loss landscape of deep networks. If a network configuration has a higher sensitivity at the bottom layers, it follows that the loss landscape is more sensitive to the bottom layers and more robust to the top layers. Accordingly, the locus of the global minima will have more valleys in the bottom layers, and the locus may have a fractal-like complex pattern with a stronger over-parameterization. Exactly how over-parameterization changes the loss landscape is still an open question.
Geometrization is not only the key idea of physics; it is also a framework for understanding deep networks. In this work we try to understand over-parameterized deep networks by geometrization. By establishing analogies between properties of over-parameterized deep networks and quantum information processing and diffeomorphic image registration systems, we found that they share similar geometric structures. Our key observations are: (1) Polynomial complexity over-parameterized deep networks only explore a corner of polynomial complexity functions, just as quantum computation systems only explore the corner of physical states in the gigantic quantum state space. The network structure sets constraints on the submanifold of functions that can be approximated by the network. (2) Over-parameterized deep networks may have a complex loss landscape, and local minima have different generalization capabilities. The generalization capability is determined by the network complexity, which is computed as the geodesic distance on a Riemannian manifold between the transformation represented by the network and the identity transformation. The probability that a certain configuration is obtained is determined by the complexity of the network. This is an analogue of the measurement problem in quantum mechanics, where the probability of the final state is determined by the distance between the initial state and the final state. (3) Over-parameterized deep networks have a geometry with a negative curvature, just as quantum computation systems have a Riemannian geometry with a negative curvature. All these observations suggest that deep networks are closely related to physics and that geometrization may provide a proper roadmap to interpret deep networks.
In this work we mainly explore the Riemannian structure of deep networks, for example the network complexity as the geodesic distance and the sensitivity of network parameters as Riemannian curvature. A natural question is whether other geometric structures in physics can help us to understand over-parameterized deep networks. For example, the symplectic structure of geometric mechanics plays a key role in the dynamics of classical mechanics. Can the dynamics of deep networks also be understood in a similar way? The fibre bundle structure is another key structure for understanding interactions in physics, and it also plays a key role in the geometry of quantum information processing, such as the geometry of mixed states and quantum entanglement. Can it be used to understand interactions between subnetworks in a composite system with multiple subnetworks? In our former work we mentioned that fibre bundles may be related to important network structures such as the attention mechanism, neural Turing machines and differentiable neural computers. There are signs that fibre bundles are also related to capsule networks and the recent quaternion neural networks. Exploring the possibility of understanding deep networks based on bundles will be our future work.
-  C. Zhang, S. Bengio, and Y. Singer. Are all layers created equal? arXiv:1902.01996v1, 2019.
-  X. Dong and L. Zhou. Geometrization of deep networks for the interpretability of deep learning systems. arxiv:1901.02354, 2019.
-  X. Dong, J. S. Wu, and L. Zhou. How deep learning works – the geometry of deep learning. arXiv:1710.10784, 2017.
-  Xian Hui Ge and Bin Wang. Quantum computational complexity, einstein’s equations and accelerated expansion of the universe. arXiv:1708.06811v2, 2018.
-  H. Heydari. Geometric formulation of quantum mechanics. arXiv:1503.00238, 2015.
-  Hiroaki Matsueda. Emergent general relativity from fisher information metric. arXiv:1310.1831v2, 2013.
-  M. R. Dowling and M. A. Nielsen. The geometry of quantum computation. Quantum Information and Computation, 8(10):861–899, 2008.
-  Leonard Susskind. The typical state paradox: diagnosing horizons with complexity. Fortschritte Der Physik, 64(1):84–91, 2016.
-  L. Susskind. Entanglement is not enough. arXiv:1411.0690v1, 2014.
-  L. Susskind and Y. Zhao. Switchbacks and the bridge to nowhere. arXiv:1408.2823v1, 2014.
-  M. A. Nielsen, M. R. Dowling, M. Gu, and A. C. Doherty. Quantum computation as geometry. Science, 311:1133, 2006.
-  Brian Swingle. Entanglement renormalization and holography. Physical Review D Particles and Fields, 86(6):–, 2009.
-  Brian Swingle. Constructing holographic spacetimes using entanglement renormalization. Physics, 2012.
-  X. Dong and L. Zhou. Spacetime as the optimal generative network of quantum states: a roadmap to qm=gr? arxiv:1804.07908, 2018.
-  Hiroaki Matsueda. Derivation of gravitational field equation from entanglement entropy. arXiv:1408.5589v2, 70, 2014.
-  Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arxiv:1605.08361v2, 2016.
-  Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arxiv:1810.02054v1, 2018.
-  Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv:1811.03962v2, 2018.
-  Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-rao metric, geometry, and complexity of neural networks. arxiv:1711.01530, 2017.
-  Lei Wu, Zhanxing Zhu, and Weinan E. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv:1706.10239v2, 2017.
-  Y. Cooper. The loss landscape of overparameterized neural networks. arxiv:1804.10200v1, 2018.
-  S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv:1802.06509v2, 2018.
-  Mirza Faisal Beg, Michael I. Miller, Alain Trouve, and Laurent Younes. Computing large deformation metric mappings via geodesic flows. 2004.
-  M. Bruveris, F. Gay-Balmaz, D. D. Holm, and T. S. Ratiu. The momentum map representation of images. Journal of Nonlinear Science, 21(1):115–150, 2011.
-  Martins Bruveris and Darryl D. Holm. Geometry of image registration: The diffeomorphism group and momentum maps. Fields Institute Communications, 73:19–56, 2013.
-  Darryl D. Holm, Tanya Schmah, and Cristina Stoica. Geometric mechanics and symmetry. Oxford University Press Oxford, (2):xvi+515, 2009.
-  G. L. Hart, C. Zach, and M. Niethammer. An optimal control approach for deformable registration. In , 2013.
-  Darryl D. Holm. Euler’s fluid equations: Optimal control vs optimization. Physics Letters A, 373(47):4354–4359, 2009.