1 Introduction
Recently, Deep Neural Networks (DNNs) have attracted enormous research attention, due to their prominent performance compared to various state-of-the-art approaches in pattern recognition, computer vision, and speech recognition [6, 20, 41]. Despite vast experimental evidence, the success of DNNs is still not theoretically understood. Among many unsolved puzzles, three fundamental challenges in DNN research demand solutions, namely, expressibility, optimisability, and generalisability. Although there has been significant progress in searching for answers using various theories, e.g. function approximation, the information bottleneck principle, sparse representation, statistical inference, and Riemannian geometry, a complete solution to the overall puzzle is still missing. In this work, we address the three grand challenges in the framework of Feedforward Neural Networks (FNNs) from the perspective of differential topology.
Although classic results have already proven that a shallow FNN with only one hidden layer having an unlimited number of units is a universal approximator of continuous functions on compact subsets of Euclidean space [14], recent extensive practice suggests that deep FNNs are more expressive than their shallow counterparts [5]. This observation has been confirmed by showing that there are functions that are expressible by a width-bounded deep FNN, but require exponentially many neurons in the hidden layers of a shallow FNN for a specified accuracy [7, 38]. The work in [30] further shows that the total number of neurons required to approximate natural classes of multivariate polynomials grows only linearly in deep FNNs, but grows exponentially in a two-layer FNN. Meanwhile, the width of DNNs has also been proven to be critical for the expressive power of ReLU networks [23]. Despite these rich results about expressibility of DNNs, the interplay or trade-off between depth and width for achieving good performance is not yet settled.
Training DNNs is conventionally considered to be difficult, mainly because the associated optimisation problem is highly nonconvex [37, 40]. The recently observed strong performance of gradient descent based algorithms has triggered enormous interest and effort in characterising the loss landscape and global optimality of DNNs [17, 27, 12, 32]. Most of these works assume exact fitting of a finite number of training samples with a sufficiently large DNN, and suggest that full rank weight matrices play a critical role in ensuring good performance of DNNs. Curiously, beyond these arguments from the optimisation perspective, the impact of requiring weight matrices to have full rank on the other two challenges is still not clear.
Arguably, generalisability is the most puzzling mystery of DNNs [19, 42]. Many recent efforts have been dedicated to explaining this phenomenon, such as deep kernel learning [3], information bottleneck [39, 31, 1], and classification bound analysis [34]. Many heuristic mechanisms have also been developed to enhance generalisability of DNNs, e.g. dropout regularisation [35] and norm-based control [25, 34]. So far, no single theory or practice provides affirmative conclusions about the generalisability of DNNs.
Most recently, there has been an increasing interest in analysing DNNs from geometric and topological perspectives, such as algebraic topology [36] and Riemannian geometry [13]. In particular, the work in [8] argues that geometric and topological properties of state-of-the-art DNNs are crucial for better understanding and building theories of DNNs. It is also worth noting that geometric and topological analysis is a classic methodology in the research of neural networks [24]. In this work, we follow this trend and employ the theory of differential topology to study the three challenges of DNNs.
2 Optimisation of DNNs: Full rank weights
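Let us denote by $L$ the number of layers in a DNN, and by $n_l$ the number of processing units in the $l$-th layer with $l = 1, \ldots, L$. By $l = 0$, we refer to the input layer. Hidden layers in DNNs can be modelled as the following parameterised nonlinear map
$$\Lambda_l \colon \mathbb{R}^{n_{l-1} \times n_l} \times \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}, \quad (W_l, \phi_{l-1}) \mapsto \sigma(W_l^{\top} \phi_{l-1} + b_l), \tag{1}$$
where $b_l \in \mathbb{R}^{n_l}$ is a bias that is treated as a constant in this work for the sake of simplicity in presentation, and $\sigma$ applies a unit nonlinear function entrywise to its input, e.g. Sigmoid, SoftPlus, and ReLU. In this work, we restrict activation functions to be smooth, monotonically increasing, and Lipschitz.
Now, let us denote by $\phi_0 \in \mathbb{R}^{n_0}$ the input. We can then define evaluations at all layers iteratively as $\phi_l := \Lambda_l(W_l, \phi_{l-1})$. By denoting the set of all parameter matrices in the DNN by $\mathcal{W} := \mathbb{R}^{n_0 \times n_1} \times \ldots \times \mathbb{R}^{n_{L-1} \times n_L}$, we compose all layerwise maps to define the overall DNN map as
$$F \colon \mathcal{W} \times \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}, \quad (\mathbf{W}, \phi_0) \mapsto \Lambda_L(W_L, \cdot) \circ \ldots \circ \Lambda_1(W_1, \phi_0). \tag{2}$$
Note that the last layer is commonly linear, i.e., the activation function in the last layer is the identity map. We define the set of parameterised maps specified by a given DNN architecture as
$$\mathcal{F}(n_0, \ldots, n_L) := \{ F(\mathbf{W}, \cdot) \colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_L} \mid \mathbf{W} \in \mathcal{W} \}, \tag{3}$$
where the tuple $(n_0, \ldots, n_L)$ specifies the architecture of the DNN, i.e., the number of units in each layer.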
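As an illustration of the composition of layerwise maps, the following minimal Python sketch evaluates a small FNN layer by layer. It assumes that each hidden layer computes an entrywise activation of $W^\top \phi + b$, with the weight stored as an (inputs × outputs) array. The names `layer` and `network`, the SoftPlus activation, and the 2-3-1 architecture with its weights are illustrative choices, not taken from the text.

```python
import math

def softplus(z):
    """Smooth, monotonically increasing, Lipschitz activation."""
    # Numerically stable form of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def identity(z):
    """Activation of the (linear) last layer."""
    return z

def layer(W, b, phi, act=softplus):
    """Layerwise map act(W^T phi + b); W is n_in x n_out (assumed layout)."""
    n_in, n_out = len(W), len(W[0])
    assert len(phi) == n_in and len(b) == n_out
    return [act(sum(W[i][j] * phi[i] for i in range(n_in)) + b[j])
            for j in range(n_out)]

def network(weights, biases, x):
    """Compose all layerwise maps; only the last layer is linear."""
    phi = x
    last = len(weights) - 1
    for k, (W, b) in enumerate(zip(weights, biases)):
        phi = layer(W, b, phi, act=identity if k == last else softplus)
    return phi

# Hypothetical 2-3-1 architecture with hand-picked weights.
weights = [[[1.0, 0.0, -1.0], [0.5, 1.0, 0.0]],   # 2 x 3
           [[1.0], [-1.0], [2.0]]]                # 3 x 1
biases = [[0.0, 0.0, 0.0], [0.0]]
output = network(weights, biases, [0.0, 0.0])
```

With zero input, every hidden unit outputs log 2, so the linear last layer returns (1 - 1 + 2) log 2 = log 4.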
Many machine learning tasks can be formulated as a problem of learning a task-specific ground truth map (task map for short) , where and denote an input space and an output space, respectively. The problem of interest is to approximate , given only a finite number of samples in either or . For supervised learning, given only a finite number of samples with , one can utilise a DNN to approximate the task map, via minimising an empirical total loss function that is defined as
(4)
where is a suitable error function that evaluates the estimate against the supervision . Clearly, given only a finite number of samples, the task map can hardly be learned exactly as the solution in . Nevertheless, exact learning of a finite number of samples is still of theoretical interest.
Definition 1 (Exact DNN approximator).
Given a DNN architecture , let be the task map. Given samples , a DNN map , which satisfies for all , is called an exact DNN approximator of with respect to the samples.
In order to ensure its attainability and uniqueness via an optimisation procedure, we adopt the following assumption as a practical principle of choosing the error function.
Assumption 1.
For a given , the error function is differentiable with respect to its first argument. Existence of global minima of is guaranteed, and is a global minimum of , if and only if the gradient of with respect to the first argument vanishes at , i.e., .
Remark 1.
Assumption 1 guarantees the existence of global minima of the error function . Since the summation in the empirical total loss is finite, the function value of has a finite lower bound. Furthermore, it also ensures that a global minimiser of the total loss function , if it exists, coincides with the exact learning of a finite set of samples. Popular choices of the error function [29], such as the classic squared loss, smooth approximations of the norm with , the Blake-Zisserman loss, and the Cauchy loss, satisfy this assumption.
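To make Assumption 1 concrete, the sketch below implements the squared loss and the Cauchy loss together with their gradients with respect to the first argument. The unit scale of the Cauchy loss and the function names are assumptions for illustration. In both cases the gradient is a positive multiple of the residual, so it vanishes only when the estimate equals the target, even though the Cauchy loss is not convex.

```python
import math

def squared_loss(y_hat, y):
    """Sum of squared residuals."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y))

def squared_loss_grad(y_hat, y):
    """Gradient of the squared loss with respect to y_hat."""
    return [2.0 * (a - b) for a, b in zip(y_hat, y)]

def cauchy_loss(y_hat, y, scale=1.0):
    """log(1 + r^2 / scale^2), with r the Euclidean residual norm."""
    return math.log1p(squared_loss(y_hat, y) / scale ** 2)

def cauchy_loss_grad(y_hat, y, scale=1.0):
    """Gradient of the Cauchy loss with respect to y_hat."""
    factor = 2.0 / (scale ** 2 + squared_loss(y_hat, y))
    return [factor * (a - b) for a, b in zip(y_hat, y)]

# Hypothetical target; the gradient vanishes exactly at the target.
target = [1.0, -2.0]
grad_at_target = cauchy_loss_grad(target, target)
grad_elsewhere = cauchy_loss_grad([0.0, 0.0], target)
```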
Let be the vector of the derivative of the activation function in the th layer, and we define a set of diagonal matrices as for all . We further construct a sequence of matrices as
(5)
for all with . Then, the Jacobian matrix of the DNN map with respect to the weight can be presented as
(6) 
where denotes the Kronecker product of matrices, and is the total number of variables in the DNN. Let us define
(7) 
and
(8) 
where denotes the gradient of with respect to its first argument. Then the critical point condition of the total loss function can be presented as the following parameterised equation system in
(9) 
Clearly, if there is no solution in for a given finite set of samples, then the empirical total loss function has no critical points. Since the error function is assumed to have global minima according to Assumption 1, i.e., the total loss function has a finite lower bound, there must be a finite accumulation point. On the other hand, if the trivial solution is reachable at some weights , then an exact DNN approximator is obtained, i.e., by Assumption 1. Furthermore, if the solution is even the only solution of the parameterised linear equation system for all , then any critical point of the loss function is a global minimum. Thus, we conclude the following theorem.
Theorem 1.
Given a DNN architecture , let the error function satisfy Assumption 1. If the rank of matrix as constructed in (7) is equal to for all , then
(i) if exact learning of finite samples is achievable, i.e., for all , then is a global minimum, and all critical points of are global minima;
(ii) if exact learning is unachievable, then the total loss has no critical point, i.e., the loss function is noncoercive [11].
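The rank condition does its work through a short linear algebra argument. As an assumed shorthand (these symbols are introduced here only for illustration), write $P$ for the matrix constructed in (7), with $m$ columns, and $\varepsilon$ for the stacked error gradients in (8), so that the critical point condition (9) reads $P\varepsilon = 0$:

```latex
\operatorname{rank}(P) = m
\;\Longrightarrow\; P^{\top} P \ \text{is invertible}
\;\Longrightarrow\; \varepsilon = (P^{\top} P)^{-1} P^{\top} (P \varepsilon) = 0 .
```

By Assumption 1, vanishing error gradients at every sample mean that each sample is fitted exactly, so a critical point can only exist where exact learning is attained.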
Remark 2.
Recent work [2] shows that overparameterisation in DNNs can accelerate optimisation in training DNNs. Such an observation can be explained by the results in Theorem 1, since for both exact and inexact learning, overparameterisation rules out both saddle points and suboptimal local minima. Note that it is still a challenge to fully identify conditions that ensure full rank of . Nevertheless, the analysis in [32] suggests that making all weight matrices have full rank is a practical strategy to ensure the condition required in Theorem 1. In the rest of this section, we show that DNNs with full rank weights are natural configurations in practice.
Let us extend the Frobenius norm of matrices to collections of matrices as for any
(10) 
It is simply the “entrywise” norm of collections of matrices, in the same sense as the norm of vectors. Without loss of generality, we assume that the weight has the largest rank deficiency, i.e., all weight matrices are singular. Then, for arbitrary , there always exists a full rank weight , so that
(11) 
Let us denote by the norm of vectors or the spectral norm of matrices. We can then apply a generalised mean value theorem of multivariate functions to the DNN map , where is treated as a constant, as
(12) 
where denotes the upper bound of the largest singular value of the Jacobian matrix of the network map with respect to the weight as computed in Eq. (6), i.e., the map is Lipschitz in weight . Straightforwardly, we conclude the following result from the relationship between norm and norm of vectors, i.e., for .
Proposition 1.
Given a DNN architecture , for any rank-deficient weight , there exists a full-rank weight , such that for arbitrary , the following inequality holds true for all
(13) 
Remark 3.
This proposition ensures the existence of a DNN with full rank weight matrices that approximates any weight configuration to arbitrary accuracy. In what follows, we show that, once full rank weights are required, the theory of differential topology is a natural theoretical framework for analysing the properties of DNNs, and we further investigate the other two challenges using instruments from differential topology.
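A small numerical illustration of this approximation property, under simplifying assumptions: a single bias-free Sigmoid layer with a hypothetical singular 2 × 2 weight, perturbed on its diagonal so that the Frobenius distance equals a prescribed ε. The diagonal shift is one convenient construction, not the only one; it yields full rank here because the shift is not the negative of an eigenvalue of the weight.

```python
import math

def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def frobenius_distance(A, B):
    """Entrywise Euclidean distance between two matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

def perturb_diagonal(W, eps):
    """Shift the diagonal so that the Frobenius distance to W equals eps."""
    delta = eps / math.sqrt(len(W))
    return [[w + (delta if i == j else 0.0) for j, w in enumerate(row)]
            for i, row in enumerate(W)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_layer(W, x):
    """sigmoid(W^T x) for a square weight matrix W (bias omitted)."""
    n = len(W)
    return [sigmoid(sum(W[i][j] * x[i] for i in range(n))) for j in range(n)]

W_singular = [[1.0, 2.0], [2.0, 4.0]]   # rank 1, determinant 0
eps = 1e-3
W_full = perturb_diagonal(W_singular, eps)

x = [0.3, -0.7]                          # hypothetical input
gap = max(abs(a - b) for a, b in zip(sigmoid_layer(W_singular, x),
                                     sigmoid_layer(W_full, x)))
```

Since the Sigmoid is Lipschitz with constant 1/4, the output gap is bounded by ε times the norm of the input, so shrinking ε makes the two layers indistinguishable at any desired accuracy.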
3 Expressiveness of DNNs: Width vs depth
Data studied in machine learning often share some low-dimensional structure. In this work, we endow the input space with a smooth manifold structure.
Assumption 2.
The input space is a dimensional compact differentiable manifold with .
Roughly speaking, a manifold is a topological space that can locally be continuously mapped to some vector space, where this map has a continuous inverse. Namely, given any point , where is an open neighbourhood around , there is an invertible map . These maps are called charts, and since charts are invertible, we can consider the change of two charts around any point in as a local map from the linear space into itself. If these maps are smooth for all points in , then is a smooth manifold. Trivially, the Euclidean space is by nature a smooth manifold. We refer to [21, 22] for details about manifolds.
3.1 Properties of layerwise maps
Now, let us consider the first layerwise map as constructed in Eq. (1), which is a smooth map of smooth manifolds. Then the differential of at evaluated in tangent direction is computed as
(14) 
Here, all diagonal entries of are always positive by choosing activation functions to be smooth and monotonically increasing. Since all weight matrices are assumed to have full rank, it is clear that the differential is a full rank linear map.
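The rank argument can be checked numerically for a single Sigmoid layer. The sketch below assumes the layer $x \mapsto \sigma(W^\top x + b)$ with $W$ stored as an (inputs × outputs) array, so that its Jacobian is $\mathrm{diag}(\sigma'(W^\top x + b))\, W^\top$. The weights, inputs, and helper names are illustrative.

```python
import math

def sigmoid_prime(z):
    """Derivative of the Sigmoid; strictly positive everywhere."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def layer_jacobian(W, b, x):
    """Jacobian of x -> sigmoid(W^T x + b); W is n_in x n_out (assumed layout).

    Row j equals sigmoid'(z_j) times the j-th column of W.
    """
    n_in, n_out = len(W), len(W[0])
    z = [sum(W[i][j] * x[i] for i in range(n_in)) + b[j] for j in range(n_out)]
    return [[sigmoid_prime(z[j]) * W[i][j] for i in range(n_in)]
            for j in range(n_out)]

def has_rank_two(M):
    """True if a 2 x n or n x 2 matrix has some nonzero 2 x 2 minor."""
    if len(M) != 2:
        M = [list(col) for col in zip(*M)]  # transpose to 2 x n
    n = len(M[0])
    return any(abs(M[0][p] * M[1][q] - M[0][q] * M[1][p]) > 1e-12
               for p in range(n) for q in range(p + 1, n))

# Contracting layer 3 -> 2 with a full rank weight: surjective differential.
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
J_down = layer_jacobian(W_down, [0.0, 0.0], [0.2, -0.4, 0.1])

# Expanding layer 2 -> 3 with a full rank weight: injective differential.
W_up = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
J_up = layer_jacobian(W_up, [0.0, 0.0, 0.0], [0.2, -0.4])
```

Because the diagonal factor is invertible, the Jacobian inherits the rank of the weight: a rank-deficient weight gives a rank-deficient differential at every input.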
Proposition 2 (Submersion layer).
Let be a layerwise map as constructed in Eq. (1). If and the weight matrix has full rank, then map is a submersion.
Corollary 1.
Given a DNN architecture , if , and all weight matrices have full rank, then is a set of submersions from to .
Remark 4.
For a given , we denote by the image of the DNN map on . Then, it is straightforward to claim that all points in are regular points. If , then the preimage with is a submanifold in . More interestingly, disconnected sets in can be mapped to a connected set in , since the map is surjective from to .
A recent work [28] claims that for a DNN architecture with and , every open and connected set has a preimage that is also open and connected. Such a statement appears to conflict with our conclusions above. A closer look reveals that the DNNs studied in [28] map to , i.e., , while machine learning tasks are commonly constrained to some subset , i.e., . Specifically, if is open and disconnected, then the image of , i.e., , can be open but connected. The connectivity of cannot be inferred from the connectivity of its image under a strictly surjective map.
Similarly, we have the following properties for an expanding DNN structure.
Proposition 3 (Immersion layer).
Let be a layerwise map as constructed in Eq. (1). If and the weight has full rank, then map is an immersion.
Corollary 2.
Given a DNN architecture , if , and all weight matrices have full rank, then is a set of immersions from to .
Remark 5.
By the construction of DNNs, any immersive DNN map is proper, i.e., inverse images of compact subsets are compact. Hence, immersive DNN maps are indeed embeddings. By the following theorem, topological properties of the data manifold are preserved under DNN embeddings.
Theorem 2.
Let be an embedding of smooth manifolds. Then, is a submanifold of .
Corollary 3.
Given a DNN architecture , if , and all weight matrices have full rank, then is a set of diffeomorphisms.
3.2 Expressibility by composition of smooth maps
In the previous subsection, we presented some basic properties of layerwise maps and simple DNN architectures. In this subsection, we investigate the expressibility of more sophisticated DNN architectures as compositions of smooth maps. We assume that the output space is a smooth submanifold of , i.e., .
Lemma 1.
Let be a continuous map of smooth manifolds. Given a surjective linear map , i.e., , there exists a continuous function , such that .
Proof.
Since is a surjective linear map, there exists a right inverse map , so that . Trivially, we have , and constructing the continuous function concludes the proof. ∎
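For a concrete instance of the inverse map used in this proof, suppose (as an assumption on notation) that the surjective linear map is represented by a matrix $P \in \mathbb{R}^{m \times n}$ of rank $m$, and write $f$ for the given continuous map. Then the Moore-Penrose pseudoinverse is one valid choice:

```latex
P^{\dagger} := P^{\top} (P P^{\top})^{-1}, \qquad
P P^{\dagger} = I_m, \qquad
f = P \circ (P^{\dagger} \circ f) .
```

The composite $P^{\dagger} \circ f$ is continuous because $P^{\dagger}$ is linear.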
Theorem 3.
Let be a continuous map of smooth manifolds. Given a surjective linear map with , and , if , then there exists a smooth embedding , so that the following inequality holds true for a chosen norm and all
(15) 
Proof.
According to Lemma 1, it is equivalent to showing that for a continuous function , there is a smooth embedding that satisfies
(16) 
Since is linear by construction, we have
(17) 
where is the largest singular value of the corresponding matrix representation. The weak Whitney Embedding Theorem [22] ensures that for any , if , then there exists a smooth embedding such that for all , we have
(18) 
The result follows from the relationship between different norms. ∎
Remark 6.
It is important to notice that the lower bound in the Whitney Embedding Theorem, i.e., , is not tight. Hence, it only suggests that, regardless of the depth, an agnostically safe width of DNNs to ensure good approximation is at least twice the dimension of the data manifold. For a width-bounded DNN with , there is no guarantee to approximate arbitrary functions on an arbitrary data manifold.
4 Generalisability of DNNs: Explicit vs implicit regularisation
Since the DNNs studied in this work are constructed as compositions of smooth maps, it is natural to bound their output using a generalised mean value theorem of multivariate functions [34]. Let be convex and open in . Given , we have
(19) 
where for all , with the Jacobian matrix of with respect to being computed as
(20) 
Although this is a natural choice of error bound, it is still insufficient to explain the so-called implicit regularisation mystery, i.e., DNNs trained without explicit regularisers still perform well enough [42].
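A mean value bound of this type can be checked numerically on a single Sigmoid layer. The sketch below is an illustration under stated assumptions: the layer is $x \mapsto \sigma(W^\top x)$ without bias, the Sigmoid slope is at most 1/4, and the Frobenius norm is used as a convenient upper bound on the spectral norm of the weight. The weights and sample points are random and hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, x):
    """sigmoid(W^T x); W is n_in x n_out (assumed layout, bias omitted)."""
    n_in, n_out = len(W), len(W[0])
    return [sigmoid(sum(W[i][j] * x[i] for i in range(n_in)))
            for j in range(n_out)]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))

def frobenius(W):
    """Frobenius norm; an upper bound on the spectral norm."""
    return math.sqrt(sum(w * w for row in W for w in row))

random.seed(0)
W = [[random.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]
# Jacobian norm <= max slope * ||W||_2 <= 0.25 * ||W||_F.
lipschitz_bound = 0.25 * frobenius(W)

violations = 0
for _ in range(1000):
    x1 = [random.uniform(-2.0, 2.0) for _ in range(3)]
    x2 = [random.uniform(-2.0, 2.0) for _ in range(3)]
    lhs = norm([a - b for a, b in zip(layer(W, x1), layer(W, x2))])
    rhs = lipschitz_bound * norm([a - b for a, b in zip(x1, x2)])
    if lhs > rhs + 1e-12:
        violations += 1
```

The bound holds for every sampled pair, but it says nothing by itself about how the Jacobian norm varies across the input space, which is the question addressed next.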
Since the spectral norm of matrices is a smooth function, it is conceptually easy to argue that DNNs should generalise well. Here, we investigate the change rate of the spectral norm of the Jacobian matrix under displacement in . Let us assume that the Jacobian matrix has a distinct largest singular value. Then, we can compute the directional derivative of the spectral norm of the Jacobian matrix of DNNs as
(21) 
where and are the left and right singular vectors associated to the largest singular value of the Jacobian matrix at . Let us denote by the vector of diagonal entries of . A tedious but straightforward computation leads to
(22) 
where with . Here, puts a vector into a diagonal matrix.
Remark 7.
In Eq. (22), the derivative of the spectral norm of the Jacobian matrix of DNNs is computed as an inner product of two vectors, where the vector is a constant for a given DNN, and the other is dependent on derivatives of the activation functions. Simple explicit regularisations, e.g. weight decay [18] and path-norm [26, 25], can be justified as minimising the entries of . Furthermore, by computing
(23) 
we observe that the matrix is simply a truncation of the Jacobian of the DNN with respect to the input . Minimising the Frobenius norm of the Jacobian matrix, known as the Jacobian regulariser in [34, 16], is indeed a more sophisticated explicit regularisation.
The second term is the Kronecker product of derivatives of the activation functions, which is often upper bounded by one, e.g. Sigmoid, SoftSign, and SoftPlus. Namely, the spectral norm of the Jacobian matrix of DNNs can only change slowly, hence DNNs without explicit regularisations shall generalise well. As a result, we argue that the slope of activation functions is an implicit regularisation for generalisability of DNNs.
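The claim that these slopes are bounded by one can be checked directly. The following sketch evaluates the derivatives of Sigmoid, SoftSign, and SoftPlus on a grid; the closed-form derivatives are standard, while the grid range [-20, 20] is an arbitrary choice made here.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def softsign_prime(z):
    # Derivative of z / (1 + |z|).
    return 1.0 / (1.0 + abs(z)) ** 2

def softplus_prime(z):
    # Derivative of log(1 + exp(z)) is the Sigmoid itself.
    return 1.0 / (1.0 + math.exp(-z))

grid = [k / 100.0 for k in range(-2000, 2001)]
max_slopes = {
    "sigmoid": max(sigmoid_prime(z) for z in grid),
    "softsign": max(softsign_prime(z) for z in grid),
    "softplus": max(softplus_prime(z) for z in grid),
}
```

On this grid the Sigmoid slope peaks at 1/4 and the SoftSign slope at 1, both at the origin, while the SoftPlus slope approaches 1 from below as the input grows.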
5 Architecture of DNNs: Representation learning
So far, the empirical success of DNNs has been mostly observed and studied in the scenario of representation learning, which aims to extract suitable representations of data to promote solutions to machine learning problems [4, 20]. In particular, DNNs are capable of automatically learning representations that are insensitive or invariant to nuisances, such as translations, rotations, and occlusions.
One potential theory to explain such a phenomenon is the information bottleneck (IB) principle [39]. The original idea holds that training of DNNs proceeds in two distinct phases, namely, an initial fitting phase and a subsequent compression phase. The trade-off between the two phases is guided by the IB principle. The work in [1] further argues that discarding task-irrelevant information is necessary for learning invariant representations that generalise well. However, a critical study [31] demonstrates that no evident causal connection can be found between compression and generalisation. More interestingly, an opposing view states that loss of information is not necessarily responsible for the generalisability of DNNs [15]. Therefore, the IB theory of DNNs still needs a careful and thorough investigation.
In this section, we propose to employ the quotient topology in the framework of differential topology, to model nuisance factors as equivalence relations in data. We refer to [21] for details about quotient topology.
Definition 2 (Nuisance as equivalence relation).
Let be a data manifold. A nuisance on is defined as an equivalence relation on .
In the framework of differential topology, insensitivity or invariance to nuisances in data for a specific learning task leads to the following assumption about the task map.
Assumption 3.
The task map is a surjective continuous map, i.e., is invariant with respect to some nuisance/equivalence relation .
Then, we can define the nuisance relation on by , if , and equivalence classes under on as , which is also the fibre of . The set of equivalence classes of is constructed as , and is endowed with the quotient topology via the canonical quotient map . Deep representation learning can be described as a process of constructing a suitable representation or feature space via , to enable a composition with . It can be visualised as the following commutative diagram
(24) 
In this model, we refer to as the representation map or feature map, and as the latent map.
Definition 3 (Sufficient representation).
Let be a data manifold, and be a task function. A feature map is sufficient for the task , if there exists a function , so that .
Obviously, there can be an infinite number of possible constructions of representations. In this work, we focus on two specific categories of representations, namely, information-lossless representation and invariant representation.
5.1 Information-lossless representation
The work in [15] constructs a cascade of invertible layers in DNNs, so that no information is discarded in the representations. It shows that loss of information is not a necessary condition to learn representations that generalise well. A similar observation is also made for invertible convolutional neural networks [9]. Instead of manually designing the invertible layers, we show that invertibility of layers in DNNs is a native property, when the architecture of layers is suitable.
Lemma 2.
Let be a map of smooth manifolds. Then can be decomposed as , where with is a smooth embedding.
Proof.
By the strong Whitney embedding theorem, every smooth manifold admits a smooth embedding into with . Then the image of the embedding feature map , denoted by , is a smooth submanifold of , see Theorem 2. The embedding induces a diffeomorphism between and , i.e.,
(25) 
By properties of smooth embeddings [10], there exists a smooth inverse , so that . Trivially, we have , and the proof is concluded by defining . ∎
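The decomposition in Lemma 2 can be illustrated on the unit circle. In the sketch below, which is a toy example rather than a DNN, the circle is described by an angle, the embedding into the plane is $(\cos\theta, \sin\theta)$, and the task map $\sin 2\theta$ is a hypothetical choice. The latent map is the task map composed with the inverse of the embedding on its image.

```python
import math

def embed(theta):
    """Smooth embedding of the circle (angle coordinate) into the plane."""
    return (math.cos(theta), math.sin(theta))

def embed_inverse(point):
    """Inverse of `embed` on its image, as an angle in [0, 2*pi)."""
    x, y = point
    return math.atan2(y, x) % (2.0 * math.pi)

def task(theta):
    """Hypothetical task map on the circle."""
    return math.sin(2.0 * theta)

def latent(point):
    """Latent map: the task map composed with the inverse embedding."""
    return task(embed_inverse(point))

# The composition latent(embed(.)) reproduces the task map on sample angles.
angles = [2.0 * math.pi * k / 12 for k in range(12)]
errors = [abs(latent(embed(t)) - task(t)) for t in angles]
```

No information about the angle is lost in the embedded representation, which is why the latent map can recover the task output exactly up to rounding.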
The relationship between the task map and the latent map can be described as follows.
Proposition 4.
Let and be smooth manifolds, and the task map admit a decomposition , where with is a smooth embedding. Then the task map is a quotient map, if and only if the latent map is a quotient map.
Proof.
Since the feature map is a diffeomorphism, is also a quotient map by definition. If is a quotient map, then the composition , composing two quotient maps, is also a quotient map.
Conversely, let us assume that is a quotient map. Since is surjective, so is surjective. Then it is equivalent to showing that a set is open in , if and only if the preimage is open in .
Suppose is open in . Then the set is open in since is a diffeomorphism. By the assumption that , i.e., , and that is a quotient map, it follows that is open in . Now, let us assume is open in . Clearly, the set is open in , since is a quotient map and is a diffeomorphism. The result follows from the fact that . ∎
5.2 Invariant representation
Obviously, the dimension of information-lossless representations can be large, hence the size of DNNs might explode. It is thus desirable to construct lower dimensional representations that can serve the same purpose. In this subsection, we adopt the framework proposed in [1] to develop geometric notions of invariant representations. Let us set the feature map to be the canonical quotient map, i.e., and .
Definition 4 (Invariant representation).
Let be a data manifold, and be a task map. A feature map is invariant for the task map , if is constant on all preimages with .
We then adapt the classic results about quotient maps [21] to our scenario of invariant representation learning.
Proposition 5.
Let and be smooth manifolds, and the task map satisfy Assumption 3. Then induces a unique bijective latent map such that . Furthermore, the task map is a quotient map, if and only if the latent map is a homeomorphism.
Since a homeomorphic latent map implies the minimal dimension of the feature space , we conclude the following result.
Corollary 4.
Let and be smooth manifolds, and the task map satisfy Assumption 3. If a feature map is both sufficient and minimal, then the task map is a quotient map.
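On a finite toy set, where topology plays no role, the factorisation of a task map through the canonical quotient map can be made concrete. The sketch below is purely illustrative: the points, the sign-based task, and the helper names `quotient_map` and `induced_latent_map` are hypothetical.

```python
def quotient_map(points, task):
    """Group points into fibres of `task`.

    Returns (pi, classes): pi maps each point to the index of its
    equivalence class; classes lists (task value, members) per class.
    """
    classes = []
    index = {}
    pi = {}
    for x in points:
        y = task(x)
        if y not in index:
            index[y] = len(classes)
            classes.append((y, []))
        classes[index[y]][1].append(x)
        pi[x] = index[y]
    return pi, classes

def induced_latent_map(classes):
    """Latent map on the quotient: class index -> common task value."""
    return {k: y for k, (y, _) in enumerate(classes)}

# Toy task: the sign is the label, the magnitude is a nuisance.
points = [-3, -1, 0, 2, 5]
task = lambda x: (x > 0) - (x < 0)

pi, classes = quotient_map(points, task)
latent = induced_latent_map(classes)
```

Here the feature map `pi` is invariant to the nuisance, the latent map is a bijection between the three equivalence classes and the three labels, and their composition reproduces the task map on every point.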
6 Experiments
In our experiments, all DNNs are trained in the batch learning setting. The classic backpropagation algorithm and the approximate Newton’s algorithm, proposed in [32], are used for training DNNs. Activation functions are all chosen to be Sigmoid. The error function is a smooth approximation of the norm as with .
6.1 Learning as diffeomorphism
In this experiment, we illustrate that the process of training DNNs is essentially deforming the data manifold diffeomorphically. The task is the four region classification benchmark [33]. In around the origin, there is a square area , and three concentric circles with radii , , and . The four regions/classes are interlocked and nonconvex, as shown in Figure 1(a).
We randomly draw samples in the box for training, and specify the corresponding output to be the th basis vector in . We deploy a four-layer DNN architecture .
We investigate the property of smoothly embedding the box/manifold into via the specified DNN. Since we cannot visualise a structure, we track the values in each dimension of the output along the diagonal (dashed) line (class transition , see Figure 1(a)). Figure 1(b) shows two curves of DNN outputs along the diagonal in all four dimensions, where the dashed output curve (before convergence) deforms smoothly to the final solid output curve (after convergence).
6.2 Implicit regularisation
In this subsection, we investigate the results derived in Section 4 about regularisation for generalisation. Since there has been extensive work on the effects of explicit regularisation, in this experiment we focus only on implicit regularisation.
The task is to learn a map from a unit circle to the two-petal rose , a.k.a. the figure eight curve. Note that the former is a smooth manifold, while the latter is not a manifold due to the intersection at the origin. We draw points equally placed on the circle, and perturb them with uniform noise in a square region . A three-layer DNN architecture is used. The parameterised Sigmoid function and its derivative are defined as
(26) 
In our experiment, we choose the constant , which controls the largest slope of the Sigmoid function. Figure 2 depicts the learned curve against the ground truth, and suggests that the performance of generalisation decreases with an increasing maximal slope . Clearly, large slopes of activation functions encourage overfitting.
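A minimal sketch of a slope-parameterised Sigmoid, assuming the form $\sigma_c(x) = 1/(1 + e^{-cx})$. This form and the symbol $c$ are assumptions made here for illustration, and the values of $c$ below are not the ones used in the experiment. Under this assumption the largest slope is attained at the origin and equals $c/4$.

```python
import math

def sigmoid_c(x, c):
    """Sigmoid with slope parameter c (assumed form 1 / (1 + exp(-c x)))."""
    return 1.0 / (1.0 + math.exp(-c * x))

def sigmoid_c_prime(x, c):
    """Derivative with respect to x: c * s * (1 - s)."""
    s = sigmoid_c(x, c)
    return c * s * (1.0 - s)

def central_difference(f, x, h=1e-6):
    """Numerical derivative, used only to sanity-check the closed form."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Largest slope (at the origin) for a few hypothetical values of c.
slopes = {c: sigmoid_c_prime(0.0, c) for c in (1.0, 4.0, 16.0)}
numeric = central_difference(lambda x: sigmoid_c(x, 4.0), 0.3)
```

Larger values of the parameter make the activation steeper around the origin, which is the mechanism linked to overfitting in the discussion above.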
6.3 Deep representation learning
In this experiment, we aim to investigate the findings in Section 5 about the DNN architecture in connection with generalisation. The task is to map a Swiss roll (input manifold) with an arbitrary orientation to a unit circle (output manifold). Specifically, we define the input manifold as
(27) 
where is an arbitrary orthogonal matrix. We randomly draw samples on the Swiss roll for training, and another samples for testing. We compare two DNN architectures, namely, one being a four-layer FNN and the other being a five-layer FNN . The former tends to learn information-lossless representations, while the latter places a bottleneck to capture invariant representations. The deep FNN has only one more neuron than the shallow one.
We apply the two trained FNNs on testing samples. Figure 3 shows the box plot of the norm of prediction errors. Clearly, the shallow FNN (left), which learns information-lossless representations, only slightly outperforms the deep FNN with a bottleneck (right), in terms of mean value, variance, and tail. However, we argue that such a difference is due to the difficulty in training the deep FNN with an extremely narrow bottleneck.
7 Conclusion
In this work, we provide a differential topological perspective on four challenging problems of learning with DNNs, namely, expressibility, optimisability, generalisability, and architecture. By modelling the dataset of interest as a smooth manifold, we consider DNNs as compositions of smooth maps of smooth manifolds. Our results suggest that differential topological instruments are native for understanding and analysing DNNs. We believe that a thorough investigation of the differential topological theory of DNNs will bring new knowledge and methodologies to the study of deep learning.
References
 [1] A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50):1–34, 2018.
 [2] S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In J. Dy and A. Krause, editors, Proceedings of the International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 244–253, 2018.
 [3] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. In Proceedings of the International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 541–549, 2018.
 [4] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1789–1828, 2013.
 [5] Y. Bengio and Y. LeCun. Scaling learning algorithms toward AI. In L. Bottou, O. Chapelle, D. Decoste, and J. Weston, editors, Large-Scale Kernel Machines, chapter 14, pages 321–358. MIT Press, 2007.
 [6] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, USA, 1996.
 [7] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In JMLR: Proceedings of Machine Learning Research: The 29th Annual Conference on Learning Theory, volume 49, pages 907–940, 2016.
 [8] A. Fawzi, S.-M. Moosavi-Dezfooli, P. Frossard, and S. Soatto. Empirical study of the topology and geometry of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3762–3770, 2018.

 [9] A. C. Gilbert, Y. Zhang, K. Lee, Y. T. Zhang, and H. Lee. Towards understanding the invertibility of convolutional neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pages 1703–1710, 2017.
 [10] V. Guillemin and A. Pollack. Differential Topology, volume 370. AMS Chelsea Publishing, 1974.
 [11] O. Güler. Foundations of Optimization. Springer, 2010.
 [12] B. D. Haeffele and R. Vidal. Global optimality in neural network training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7331–7339, 2017.
 [13] M. Hauser and A. Ray. Principles of Riemannian geometry in neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, 2017.
 [14] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
 [15] J.-H. Jacobsen, A. Smeulders, and E. Oyallon. i-RevNet: Deep invertible networks. In Proceedings of the International Conference on Learning Representations, 2018.
 [16] D. Jakubovitz and R. Giryes. Improving DNN robustness to adversarial attacks using Jacobian regularization. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, European Conference on Computer Vision (ECCV), pages 525–541, 2018.
 [17] K. Kawaguchi. Deep learning without poor local minima. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 586–594, 2016.
 [18] A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 950–957, 1992.
 [19] S. Lawrence, C. L. Giles, and A. C. Tsoi. Lessons in neural network training: Overfitting may be harder than expected. In Proceedings of the National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, AAAI’97/IAAI’97, pages 540–545, 1997.
 [20] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
 [21] J. M. Lee. Introduction to Topological Manifolds. Springer, 2nd edition, 2010.
 [22] J. M. Lee. Introduction to Smooth Manifolds. Springer, 2nd edition, 2013.
 [23] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6231–6239, 2017.
 [24] M. Minsky and S. A. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, reissue of the 1988 expanded edition, 2017.
 [25] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5947–5956, 2017.
 [26] B. Neyshabur, R. R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2422–2430, 2015.
 [27] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. In Proceedings of the International Conference on Machine Learning, 2017.
 [28] Q. Nguyen, M. Mukkamala, and M. Hein. Neural networks should be wide enough to learn disconnected decision regions. In Proceedings of the International Conference on Machine Learning, 2018.
 [29] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, New York, 2004.
 [30] D. Rolnick and M. Tegmark. The power of deeper networks for expressing natural functions. In Proceedings of the 6th International Conference on Learning Representations, 2018.
 [31] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the information bottleneck theory of deep learning. In Proceedings of the International Conference on Learning Representations, 2018.
 [32] H. Shen. Towards a mathematical understanding of the difficulty in learning with feedforward neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 811–820, 2018.

 [33] S. Singhal and L. Wu. Training multilayer perceptrons with the extended Kalman algorithm. In Advances in Neural Information Processing Systems, pages 133–140, 1989.
 [34] J. Sokolić, R. Giryes, G. Sapiro, and M. R. D. Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.
 [35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
 [36] S. Sun, W. Chen, L. Wang, X. Liu, and T.-Y. Liu. On the depth of deep neural networks: A theoretical view. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2066–2072, 2016.
 [37] R. S. Sutton. Two problems with backpropagation and other steepest-descent learning procedures for networks. In Proceedings of the 8th Annual Conference of the Cognitive Science Society, pages 823–831, 1986.
 [38] M. Telgarsky. Benefits of depth in neural networks. In JMLR: Proceedings of Machine Learning Research: The 29th Annual Conference on Learning Theory, volume 49, 2016.
 [39] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), 2015.
 [40] B. Widrow and M. A. Lehr. 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9):1415–1442, 1990.
 [41] D. Yu and L. Deng. Automatic Speech Recognition: A Deep Learning Approach. Springer-Verlag, London, 2015.
 [42] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In The International Conference on Learning Representations, 2017.