1 Introduction
Deep learning has had great success in many fields. Deep learning models perform extremely well in computer vision [21], image processing, video processing, face recognition [27], speech recognition [15], and natural language processing [1, 6, 30].
Deep learning has also been used in more complex systems that are able to play
games [14, 19, 28] or diagnose and classify diseases [2, 9, 10].
Along with the development of deep learning, decisions made using deep-learning principles are being implemented in a wide range of applications. In order for deep learning to be more useful for human beings, it is necessary to ensure that no personal or confidential information leaks during the learning process or the decision-making process.
In this paper, we point out a leakage problem in deep learning: the learning process of deep learning might leak sample data. This phenomenon is specific to deep-learning methods with ReLU activation functions.
The following example illustrates the difference between the linear and the ReLU model. Consider the one-dimensional least-squares model
$$E(w) = \sum_{i=1}^{m} (\langle w, x_i \rangle - y_i)^2.$$
Because many sample sets $\{(x_i, y_i)\}$ give the same $E$ (see [25]), we cannot reconstruct $\{x_i\}$ from $E$. However, consider the one-dimensional ReLU least-squares model
$$E(w) = \sum_{i=1}^{m} (\mathrm{ReLU}(\langle w, x_i \rangle) - y_i)^2.$$
Then, as the hyperplanes $\{w : \langle w, x_i \rangle = 0\}$ are the nonsmooth points of $E$, we can obtain the nonsmooth points of $E$ from $E$. Hence, we can reconstruct $\{x_i\}$ from $E$ up to scalar multiplication. Namely, we obtain $\tilde{x}_i$ satisfying $\tilde{x}_i = c_i x_i$ for some $c_i \neq 0$.
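The reconstruction in this example can be checked numerically. The following sketch (all values hypothetical, written for a two-dimensional weight space so that "up to scalar multiplication" is visible) locates the kink of the ReLU loss along the unit circle and reads the sample direction off it.

```python
import numpy as np

# Hypothetical sample (not from the paper): one input x in R^2, label y.
x = np.array([3.0, 4.0])
y = 1.0

def loss(w):
    # One-node ReLU least-squares loss E(w) = (ReLU(<w, x>) - y)^2.
    return (np.maximum(w @ x, 0.0) - y) ** 2

# E is nonsmooth exactly on the hyperplane {w : <w, x> = 0}, so scanning
# the unit half-circle for a kink of E recovers the direction of x up to
# scale and sign.
angles = np.linspace(0.0, np.pi, 20001)
ws = np.stack([np.cos(angles), np.sin(angles)], axis=1)
vals = np.array([loss(w) for w in ws])

# The second difference of E along the circle spikes at the kink.
kink = np.argmax(np.abs(np.diff(vals, 2))) + 1
w_kink = ws[kink]                            # lies on {w : <w, x> = 0}
x_tilde = np.array([-w_kink[1], w_kink[0]])  # a normal of that line

# x_tilde is parallel to x: the 2D cross product is (numerically) zero.
assert abs(x_tilde[0] * x[1] - x_tilde[1] * x[0]) < 5e-3
```

Note that the detection step only evaluates the loss; neither $x$ nor $y$ is used, which is exactly the point of the example.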
The main theorem in this paper shows that, if all possible learning processes are revealed, leakage of training samples can occur. As this example indicates, the nonsmooth points of loss functions play an important role. We show that the set of nonsmooth points coincides with a set induced by an algebraic structure (Theorem 2.4). This is a key point of this paper. Another key point is the concept of homogeneous polynomials, which is used in algebraic geometry. We find a natural multidegree (the layerwise degree) in deep neural networks and show that loss functions are essentially homogeneous polynomials (virtual polynomials) with respect to the layerwise degree. Using the theory of homogeneous polynomials, we establish the correspondence between the factorization of a virtual polynomial and its active paths (Theorem 2.8). Finally, as an application of this theorem, we prove a weak reconstruction theorem for training samples (Theorem 2.11) and give a theoretical algorithm to obtain a weak reconstruction of training samples (Section 2.5).
1.1 Related work
Leakage problem: The leakage problem in deep learning can become a serious issue in the future, and many researchers are working on it. Since a trained model contains essential information about the training samples, it is possible to extract sensitive information from the model [3, 11, 12]. B. Hitaj, G. Ateniese, and F. Pérez-Cruz considered this problem using generative adversarial networks (GANs) [18]. This type of approach is suitable for images, but unsuitable for numerical data such as medical non-image data. For example, generated human-model data, such as a height of six feet, is usually considered normal. However, if the identity of the model is made available, then it is a leakage problem. This is the difference between the model-generating approach and the deterministic approach.
Loss surfaces: Mathematically, the learning process of deep learning is to find local minima of loss surfaces (loss functions). Before this paper, some researchers had analyzed loss surfaces, one aim being a theoretical understanding of the generalization of deep learning. For example, K. Kawaguchi proved that any local minimum of a loss surface associated with a linear neural network is a global minimum [4, 20]. J. Pennington and Y. Bahri analyzed loss surfaces using random matrix theory [26]. In this paper, we present a new framework for analyzing loss surfaces: we study their structure using algebraic geometry. This approach can contribute to the theoretical understanding of the generalization of deep learning.
Algebraic geometry: Algebraic geometry is one of the most exciting fields of pure mathematics [5, 8, 16, 24]. Furthermore, algebraic geometry is frequently applied to machine learning. For example, R. Livni et al. introduced vanishing component analysis to express the algebraic (nonlinear) structure of data sets [23]. S. Watanabe applied algebraic geometry to learning theory; he proved that an invariant defined in algebraic geometry coincides with one defined in learning theory [29], and he related the values of these invariants to zeta functions. Whenever polynomials are involved, algebraic geometry is a powerful tool for analyzing them.
1.2 Contribution
We discuss the loss functions of fully connected deep neural networks with square losses. Basically, all notations are taken from the deep learning book by Goodfellow et al. [13]. Let $H$ be the number of layers. We do not use the notion of "hidden layer" for consistency with the other definitions. We denote the weight parameters by $w$, which consists of the entries of the parameter matrices $W^{(k)}$ corresponding to each layer $k$: $w = (W^{(1)}, \ldots, W^{(H-1)})$ with $W^{(k)} \in \mathbb{R}^{d_{k+1} \times d_k}$. Here, $d_k$ represents the width of the $k$th layer, where the first layer is the input layer and the $H$th layer is the output layer. We use node $(i, k)$ to indicate the $i$th node in the $k$th layer. We denote its output by $z_i^{(k)}$ and its pre-output by $y_i^{(k)}$, namely
$$y_i^{(k)} = \sum_{j} W_{ij}^{(k-1)} z_j^{(k-1)} \quad \text{and} \quad z_i^{(k)} = \mathrm{ReLU}(y_i^{(k)}).$$
We simply denote $(z_1^{(k)}, \ldots, z_{d_k}^{(k)})$ by $z^{(k)}$. We denote the output by $f(x, w)$, namely, $f(x, w) = y^{(H)}$.
Let $D = \{(x_1, t_1), \ldots, (x_m, t_m)\}$ be a training sample set. Then, we define the loss function as follows:
$$E(w) = \sum_{i=1}^{m} \| f(x_i, w) - t_i \|^2,$$
where $\| \cdot \|$ is the Frobenius norm. The main theorem of this paper is given below.
Theorem 1.1
Let $E(w)$ be the loss function of a deep neural network with ReLU activation functions. Assume that we can obtain all inputs and outputs of $E$. Then we can obtain vectors $\tilde{x}_i$ satisfying $\tilde{x}_i = c_i x_i$ for some $c_i \neq 0$, the number of layers, and the number of nodes in each layer.
This theorem means that the inputs and outputs of the loss function suffice to reconstruct the inputs of the training samples up to scalar multiplication. In other words, if we can observe the entire training process of deep learning, $x_i$ is reconstructed as $\tilde{x}_i = c_i x_i$. In general, $\tilde{x}_i$ is not equal to $x_i$. However, if we obtain one entry of $x_i$, we can determine $c_i$ in Theorem 1.1 and hence obtain $x_i$ itself. This indicates that revealing the training process of deep learning carries many risks; we need to conceal the values of loss functions to protect training samples. We can provide a stronger statement after proper mathematical preparation (see Theorem 2.11).
Note that Theorem 1.1 can be generalized as follows. First, we can add any smooth function r(w) to the loss function E(w) as a regularization term. Second, we can change the activation function to any piecewise linear function such as Leaky ReLU, Maxout, and LWTA [17,22]. For simplicity, in this paper, we only treat the simplest case.
2 Mathematical results
In this section, we prepare definitions and theorems to prove the main result. Our focus is on the loss surfaces defined by $X = \{(w, E(w)) : w \in \mathbb{R}^N\}$, where $E$ is the loss function defined above and $N$ is the number of weight parameters. From the viewpoint of deep learning, we are interested in the local minima of loss functions. We provide a mathematical framework from algebraic geometry. This is a new method for analyzing loss surfaces, which can contribute to the theoretical understanding of generalization. For the standard notation of algebraic geometry, we refer to [8, 16]. First, we define semialgebraic sets, a notion from a field of pure mathematics, algebraic geometry. Let $X$ be a subset of $\mathbb{R}^N$. $X$ is said to be a semialgebraic set if $X$ is a finite union of sets defined by polynomial equations $f = 0$ and polynomial inequations $g > 0$. If $X$ is a semialgebraic set, we say that $f$ is a defining equation of $X$ and $g$ is a defining inequation of $X$. For other notions of semialgebraic geometry, we refer to [7]. The following theorem points out that loss surfaces are semialgebraic sets.
Theorem 2.1 (Structure theorem 1)
Let $X$ be the loss surface of a square loss function. Then, $X$ is a semialgebraic set of codimension 1.
Figure 1 indicates the meaning of the theorem. The polynomials divide the loss surface into subsurfaces, each of which is defined by a (fixed) polynomial. We give the precise description later. This theorem allows us to use algebraic geometry to analyze loss surfaces. The second theorem concerns the decomposition of $X$ as a semialgebraic set. To describe it, we define virtual polynomials, which are functions written as the outputs of nodes.
2.1 Virtual polynomials
The concept of virtual polynomials plays an important role in this paper.
Definition 2.2
Fix an input $x$ and a weight $w$ on the fixed deep neural network. A node is said to be active if its output is positive. We define the set
$$A(x, w) = \{(i, k) : \text{the node } (i, k) \text{ is active}\}$$
to be the ReLU activation set.
When we speak of a ReLU activation set by itself, we mean a formal set of nodes regarded as active; it is then irrelevant whether the set is realized by an actual input and weight. A ReLU activation set induces a deep linear network, which we use to define virtual polynomials.
Definition 2.3
Fix an input $x$. A polynomial $p$ in the weight variables is defined to be a virtual polynomial of type $(i, k)$ if $p$ equals the output of the $i$th node in the $k$th layer of the deep linear network induced by some ReLU activation set. We simply call $p$ a virtual polynomial if $p$ is a virtual polynomial of type $(i, k)$ for some ReLU activation set and some $(i, k)$.
See Figure 2. The virtual polynomials of each type in this neural network, together with the corresponding ReLU activation sets, can be read off from the figure.
If we fix a ReLU activation set, we obtain a virtual polynomial. However, even if we fix the virtual polynomial, the ReLU activation set that provides it is not unique; in the example above, several ReLU activation sets give the same virtual polynomial.
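Since the displayed polynomials depend on Figure 2, here is a self-contained sketch of Definition 2.3 on a hypothetical 2-2-1 network (the variable names are ours, not the paper's), using SymPy to write node outputs as polynomials in the weights.

```python
import sympy as sp

# Hypothetical 2-2-1 network (not the paper's Figure 2).  Fix an input
# (x1, x2) and a ReLU activation set; the induced deep linear network
# computes polynomials in the weight variables.
x1, x2 = sp.symbols('x1 x2')
w111, w112, w121, w122 = sp.symbols('w111 w112 w121 w122')  # layer 1 -> 2
w211, w212 = sp.symbols('w211 w212')                        # layer 2 -> 3

# Outputs of the two middle nodes: virtual polynomials of type (1, 2)
# and (2, 2).
z21 = w111 * x1 + w112 * x2
z22 = w121 * x1 + w122 * x2

# With both middle nodes active, the output node carries a virtual
# polynomial of type (1, 3): a sum over all active paths.
z31 = sp.expand(w211 * z21 + w212 * z22)

# With only the first middle node active, the induced linear network
# drops the second node, giving a different virtual polynomial of the
# same type -- the ReLU activation set providing a given virtual
# polynomial need not be unique, as noted above.
z31_partial = sp.expand(w211 * z21)
```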
Now, we can state the second theorem.
Theorem 2.4 (Structure theorem 2)
Let $X$ be the loss surface of a square loss function. Let $\mathrm{Sing}(X)$ be the set of nonsmooth points of $X$. Then,
1. The shortest decomposition of $X$ (the decomposition that cannot be reduced by removing defining inequations) is given by virtual polynomials.
2. $\mathrm{Sing}(X)$ is purely of codimension 1 in $X$ (see [3]) and is locally defined by a virtual polynomial.
3. $\mathrm{Sing}(X)$ is a semialgebraic set.
This indicates that $\mathrm{Sing}(X)$ is a natural set from not only the differential-geometric viewpoint but also the algebro-geometric viewpoint. By this theorem, $\mathrm{Sing}(X)$ is locally defined by virtual polynomials. Hence, from the algebro-geometric viewpoint, we need to know the irreducible decomposition of virtual polynomials to obtain the geometric structure of $\mathrm{Sing}(X)$.
2.2 Irreducibility of polynomials
In this section, we review the factorization of polynomials. We first define the irreducibility of polynomials. Let $f$ be a polynomial with real coefficients in the variables $x_1, \ldots, x_n$. $f$ is said to be irreducible if we cannot write $f$ as a product of two nonconstant polynomials. Namely, $f = gh$ implies that $g$ or $h$ is a constant.
It is well known that polynomials have an irreducible decomposition [5, 8, 10]. Namely, let $f$ be a polynomial with real coefficients in the variables $x_1, \ldots, x_n$. Then, $f$ has a unique decomposition of the form
$$f = c \, f_1^{e_1} \cdots f_r^{e_r},$$
where each $f_j$ is an irreducible polynomial with real coefficients and is unique up to constant multiplication. We call each $f_j$ above an irreducible component of $f$. In Section 2.4, we give the irreducible decomposition of virtual polynomials (see Theorem 2.8).
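As a concrete illustration (our own generic example, unrelated to any network), SymPy's `factor_list` computes exactly this decomposition, returning the constant and the irreducible components with their multiplicities. Note that SymPy factors over the rationals by default; for this example the rational and real factorizations agree.

```python
import sympy as sp

# A generic polynomial with a known irreducible decomposition.
x, y = sp.symbols('x y')
f = sp.expand((x + y)**2 * (x**2 + y**2 - 1))

# factor_list returns the constant content and the irreducible factors
# with their multiplicities, each unique up to constant multiplication.
content, factors = sp.factor_list(f)
assert content == 1
assert dict(factors) == {x + y: 2, x**2 + y**2 - 1: 1}
```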
2.3 Homogeneous polynomials
In this subsection, we review the concepts of homogeneous polynomials and multidegree. Let $x_1, \ldots, x_n$ be variables. A multidegree assigns to each $x_i$ an element $\deg(x_i) \in \mathbb{Z}^k$. For any monomial $x_1^{a_1} \cdots x_n^{a_n}$, we define
$$\deg(x_1^{a_1} \cdots x_n^{a_n}) = \sum_{i=1}^{n} a_i \deg(x_i).$$
A deep neural network induces a natural multidegree.
Definition 2.5
Let $w_{ij}^{(k)}$ be the weight variable on the path passing from the $j$th node in the $k$th layer to the $i$th node in the $(k+1)$th layer. Then, we define
$$\deg(w_{ij}^{(k)}) = (0, \ldots, 0, 1, 0, \ldots, 0) \in \mathbb{Z}^{H-1},$$
where the 1 is in the $k$th entry. We call this multidegree the layerwise degree.
Fix a multidegree. A polynomial $f$ is said to be homogeneous if every monomial appearing in $f$ has the same multidegree. In this case, we define $\deg(f) = \deg(M)$, where $M$ is a monomial appearing in $f$; $\deg(f)$ does not depend on the choice of $M$. It is well known that any irreducible component of a homogeneous polynomial is homogeneous (see [5, 8, 24]).
An example of the layerwise degree can be seen in Figure 2. The following theorem points out a key feature of virtual polynomials.
Theorem 2.6
Virtual polynomials of type $(i, k)$ are homogeneous polynomials of layerwise degree $(1, \ldots, 1, 0, \ldots, 0)$, where the entries are 1 from the first entry to the $(k-1)$th entry.
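A small check of this statement on a hypothetical 2-2-1 network (variable names ours): every monomial of the output's virtual polynomial contains exactly one weight from each layer's block of weight variables, so the polynomial is homogeneous of layerwise degree $(1, 1)$.

```python
import sympy as sp

# Hypothetical 2-2-1 network with all nodes assumed active.
x1, x2 = sp.symbols('x1 x2')
layer1 = sp.symbols('a1 a2 a3 a4')   # weights from layer 1 to layer 2
layer2 = sp.symbols('b1 b2')         # weights from layer 2 to layer 3
a1, a2, a3, a4 = layer1
b1, b2 = layer2

# Virtual polynomial at the single node of layer 3.
p = sp.expand(b1 * (a1 * x1 + a2 * x2) + b2 * (a3 * x1 + a4 * x2))

def layerwise_degree(monomial, layers):
    # Degree of the monomial in each layer's block of weight variables.
    return tuple(sum(sp.degree(monomial, w) for w in block) for block in layers)

degrees = {layerwise_degree(m, (layer1, layer2)) for m in p.as_ordered_terms()}
assert degrees == {(1, 1)}   # homogeneous of layerwise degree (1, 1)
```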
2.4 Irreducible decomposition theorem
In this subsection, we give the necessary and sufficient conditions for the irreducible decomposition of virtual polynomials.
Definition 2.7
Let $A$ be the ReLU activation set of a fixed input $x$ and weight $w$. Then, the active neural network is the sub-neural network consisting of the active nodes and the paths between them.
An example of an active neural network is given below; see Figures 2 and 3. We can regard the neural network in Figure 3 as a sub-neural network of the one in Figure 2. Assume that, for some input and weight, one pre-output in Figure 2 is negative while the earlier outputs are positive. Then, with this ReLU activation set, the active neural network is equal to the one in Figure 3.
Theorem 2.8 (Irreducible decomposition theorem)
Let $A$ be the ReLU activation set of a fixed input $x$ and weight $w$. Let $p$ be the virtual polynomial of type $(i, k)$ induced by $A$. Then, $p$ is reducible if and only if the active neural network has a layer containing a unique node. Furthermore, each irreducible component of $p$ can be written as the output of the sub-neural network that starts from one unique node and ends at the next unique node.
A typical example of Theorem 2.8 is given below.
Let $p$ be a virtual polynomial whose active neural network is the one in Figure 3. Then, $p$ has the irreducible decomposition predicted by the theorem. The first component of the decomposition corresponds to the output of the unique node in the third layer. The second component corresponds to the function that starts from the third layer and ends at the output. Hence, the theorem gives the irreducible decomposition of a virtual polynomial from its active nodes.
We can see that the ReLU activation set in this example is realized by an input and a weight. Hence, $p$ is one of the defining equations of $\mathrm{Sing}(X)$. The decomposition implies that $p = 0$ if and only if one of its two irreducible components vanishes. This means that each component is a defining equation of $\mathrm{Sing}(X)$. However, the second component does not depend on the training samples. Hence, the loss surfaces have differential geometric structures that are independent of the training samples, and suitable algorithms using such structures can be developed.
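The factorization predicted by Theorem 2.8 can be reproduced symbolically. The following sketch builds a hypothetical active neural network with a unique node in its third layer (widths and names are ours, not the paper's Figure 3) and checks that the virtual polynomial at the output splits at that node.

```python
import sympy as sp

# Hypothetical active neural network with a bottleneck: widths 2-2-1-2-1,
# where the third layer has a unique active node.
x1, x2 = sp.symbols('x1 x2')
a1, a2, a3, a4 = sp.symbols('a1 a2 a3 a4')  # layer 1 -> 2
b1, b2 = sp.symbols('b1 b2')                # layer 2 -> 3 (unique node)
c1, c2 = sp.symbols('c1 c2')                # layer 3 -> 4
d1, d2 = sp.symbols('d1 d2')                # layer 4 -> 5

u = b1 * (a1 * x1 + a2 * x2) + b2 * (a3 * x1 + a4 * x2)  # unique node's output
v = d1 * c1 + d2 * c2          # sub-network from the unique node to the output
p = sp.expand(u * v)           # virtual polynomial at the output node

# SymPy finds exactly two irreducible components (u and v up to constants):
# one depends on the sample input, the other does not.
content, factors = sp.factor_list(p)
assert len(factors) == 2
```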
Proposition 2.9
$\mathrm{Sing}(X)$ has irreducible components that do not depend on the training samples.
Corollary 2.10
Linear components of virtual polynomials are weight parameters or come from the second layer.
We now state the main result of this paper. Note that, if we know the inputs and outputs of the loss function, we know the defining equations and the defining inequations of $X$.
Theorem 2.11 (Weak reconstruction theorem)
The loss function $E$ reconstructs the number of layers, the number of nodes in each layer, and the training samples (those not equal to a unit vector) up to scalar multiplication. Namely, for the input $x_i$ of each training sample, a vector $\tilde{x}_i$ satisfying $\tilde{x}_i = c_i x_i$ for some $c_i \neq 0$ is reconstructed.
In this theorem, we need infinitely many points on the loss surface. However, if we assume that the available points are sufficiently close to each other, we can reconstruct the input of a training sample from finitely many of them.
Proposition 2.12
Assume that we have finitely many nonsmooth points on the loss surface that are sufficiently close to each other. Then, we can obtain a vector $\tilde{x}_i$ satisfying $\tilde{x}_i = c_i x_i$ for some $c_i \neq 0$ and some $i$.
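A linear-algebra sketch of this statement (hypothetical data, with the weight space simplified to $\mathbb{R}^3$): given enough nonsmooth points on the hyperplane $\{w : \langle w, x \rangle = 0\}$, its normal vector, computed as a null vector of the point matrix, recovers $x$ up to scalar multiplication.

```python
import numpy as np

# Hidden sample input (hypothetical).
rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 3.0])

# Three nonsmooth points: random weights projected onto {w : <w, x> = 0}.
pts = rng.normal(size=(3, 3))
pts -= np.outer(pts @ x, x) / (x @ x)

# The hyperplane through these points (and the origin) has normal x,
# obtained as the null direction of the point matrix via SVD.
_, _, vt = np.linalg.svd(pts)
x_tilde = vt[-1]

assert np.allclose(pts @ x_tilde, 0.0, atol=1e-10)
# x_tilde is parallel to x, i.e. x_tilde = c x for some c != 0.
assert np.allclose(np.cross(x_tilde, x), 0.0, atol=1e-8)
```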
2.5 An algorithm to reconstruct training samples
We give a theoretical algorithm to obtain a weak reconstruction of training samples. The algorithm requires a long time, but terminates in finite time.
We estimate the degree of the defining equations from the dimension of the weight space. Then, we take a random weight $w_0$ and obtain the defining equation $f$ around $w_0$ by taking finitely many points near $w_0$. After that, we find an adjacent division and its defining equation $g$ by taking random points around the division boundary and comparing the values of $f$ and $g$. Here, the intersection of the zero sets of these two equations is an irreducible component of $\mathrm{Sing}(X)$; in other words, it is defined by a virtual polynomial. We repeat this procedure until we find all the divisions.
3 Sketch of proofs
We give sketches of the proofs in this section. We complete the proofs in the appendix.
3.1 Proof of Theorem 2.1
Let $p_{i,k}$ be a virtual polynomial at node $(i, k)$. Then, put $X_{i,k} = \{w : p_{i,k}(w) = 0\}$.
Since $X_{i,k}$ is defined by a single polynomial, it divides the weight space into two areas defined by inequalities. Adding all $X_{i,k}$ divides the weight space into many areas defined by inequalities. Fix an input and take two weights $w$ and $w'$. If $w$ and $w'$ belong to the same area, the ReLU activation set associated with $w$ and the one associated with $w'$ are the same. This implies that, if the weights are in the same area, every entry of the output is a polynomial; hence the loss function is a polynomial there. This means that the loss surface is a semialgebraic set.
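The division into areas used in this proof also suggests a numerical experiment in the spirit of Section 2.5 (a simplified sketch with hypothetical data, for the toy model $E(w) = (\mathrm{ReLU}(\langle w, x \rangle) - y)^2$): fit the fixed polynomial of $E$ in two adjacent areas, and read the boundary, and hence $x$ up to scale, off their difference.

```python
import numpy as np

# Hidden sample (hypothetical); the procedure below only evaluates E.
x = np.array([2.0, 1.0])
y = 1.0
E = lambda w: (np.maximum(w @ x, 0.0) - y) ** 2

def fit_quadratic(center, rng, n=60, eps=0.1):
    # Least-squares fit of E by a quadratic in w near `center`; the fit
    # is exact inside an area where E is a fixed degree-2 polynomial.
    W = center + eps * rng.normal(size=(n, 2))
    feats = np.stack([np.ones(n), W[:, 0], W[:, 1],
                      W[:, 0] ** 2, W[:, 0] * W[:, 1], W[:, 1] ** 2], axis=1)
    coef, *_ = np.linalg.lstsq(feats, np.array([E(w) for w in W]), rcond=None)
    return coef

rng = np.random.default_rng(1)
c_pos = fit_quadratic(np.array([1.0, 1.0]), rng)    # area where <w, x> > 0
c_neg = fit_quadratic(np.array([-1.0, -1.0]), rng)  # adjacent area, < 0

# The difference of the two local polynomials is
# (<w, x> - y)^2 - y^2 = <w, x>^2 - 2*y*<w, x>, whose linear part is
# -2*y*x: the boundary's virtual polynomial <w, x> appears, and the
# sample input x is recovered up to scale.
x_tilde = (c_pos - c_neg)[1:3] / (-2 * y)
assert np.allclose(x_tilde, x, atol=1e-6)
```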
3.2 Proof of Theorem 2.11 and Proposition 2.12
By the assumption, we may assume that we know the defining equations of $X$, so we can pick out the linear polynomials among them. We show that, if such a linear polynomial is not a weight parameter, its coefficients are equal to the input of some sample up to scalar multiplication. First, we remark that the coefficients of the virtual polynomials in the second layer are equal to the inputs of some samples up to scalar multiplication. Since the defining polynomials of $X$ include virtual polynomials in the second layer, it is enough to show that any linear polynomial appearing in the defining polynomials is a virtual polynomial in the second layer. If a linear polynomial appears in the defining polynomials of $X$, it must be an irreducible component of a virtual polynomial. By Theorem 2.8, it is a virtual polynomial of the second layer or a weight parameter. This is because, if a linear polynomial that is not a virtual polynomial of the second layer appears as an irreducible component of a virtual polynomial, it must start from a layer with one active node and end at the next layer with one active node; such a component is a weight parameter. Hence, we reconstruct the input of a sample up to scalar multiplication, as well as the weights on the paths from the first layer to the second layer. We can then find a quadratic polynomial in the defining equations of $X$ that contains the weights on the paths from the first layer to the second layer; the remaining weights in it are the weights on the paths from the second layer to the third layer. Inductively, we can reconstruct the number of nodes and layers.
4 Conclusion
In this paper, we presented a new mathematical framework based on algebraic geometry, together with new concepts including virtual polynomials. Using these, we proved a structure theorem for loss surfaces and an irreducible decomposition theorem for virtual polynomials. The main contribution of this paper is the reconstruction theorem for samples: the training process of deep learning can leak information about the samples. While this fact is important on its own, the proposed framework contributes more. It enables researchers in the fields of machine learning and algebraic geometry to pursue research on deep learning, and it may lead to new algorithms for security issues as well as more efficient training algorithms. Deep learning still requires more theoretical understanding, and this framework may contribute to that as well.
Acknowledgments
The author would like to thank Prof. Masashi Sugiyama and Kenichi Bannai for the opportunity to study machine learning at RIKEN AIP, Prof. Jun Sakuma and Takanori Maehara for carefully reading the draft and offering valuable advice, and Prof. Shuji Yamamoto and Sumio Watanabe for fruitful discussions. The author was partially supported by JSPS Grant-in-Aid for Young Scientists (B) 16K17581.
References
[1] A. Abdulkader, A. Lakshmiratan, and J. Zhang. (2016) Introducing DeepText: Facebook's text understanding engine. https://tinyurl.com/jj359dv
[2] A. Cruz-Roa, J. Ovalle, A. Madabhushi, and F. Osorio. (2013) A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer Berlin Heidelberg, 403–410.
[3] G. Ateniese, L. V. Mancini, A. Spognardi, A. Villani, D. Vitali, and G. Felici. (2015) Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers. International Journal of Security and Networks 10, 3, 137–150.
[4] P. Baldi and K. Hornik. (1989) Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53–58.
[5] W. Bruns and H. J. Herzog. (1998) Cohen-Macaulay Rings. Cambridge University Press.
[6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, Aug (2011), 2493–2537.
[7] M. Coste. (2002) An introduction to semialgebraic geometry. Tech. rep., Institut de Recherche Mathematiques de Rennes
[8] D. Cox, J. Little, and D. O’Shea. (1992) Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra. Springer.
[9] DeepMind. (2016) DeepMind Health: Clinician-led, patient-centred. https://deepmind.com/applied/deepmindhealth/
[10] R. Fakoor, F. Ladhak, A. Nazi, and M. Huber. (2013) Using deep learning to enhance cancer diagnosis and classification. In The 30th International Conference on Machine Learning (ICML 2013),WHEALTH workshop
[11] M. Fredrikson, S. Jha, and T. Ristenpart. (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 1322–1333.
[12] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart. (2014) Privacy in pharmacogenetics: An endtoend case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14). 17–32.
[13] I. Goodfellow, Y. Bengio & A.Courville. (2016) Deep Learning. MIT Press.
[14] Google DeepMind. (2016) AlphaGo, the first computer program to ever beat a professional player at the game of Go. https://deepmind.com/alphago
[15] A. Graves, A. Mohamed, and G. Hinton. (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 6645–6649.
[16] R. Hartshorne. (1977) Algebraic Geometry. Springer-Verlag, New York, Graduate Texts in Mathematics, No. 52.
[17] K. He, X. Zhang, S. Ren, and J. Sun. (2015) Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In IEEE International Conference on Computer Vision.
[18] B. Hitaj, G. Ateniese, and F. Pérez-Cruz. (2017) Deep models under the GAN: Information leakage from collaborative deep learning. CoRR, abs/1702.07464.
[19] M. Lai. (2015) Giraffe: Using deep reinforcement learning to play chess. arXiv preprint arXiv:1509.01549.
[20] K. Kawaguchi. (2016) Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 586–594.
[21] Y. LeCun, K. Kavukcuoglu, C. Farabet, et al. (2010) Convolutional networks and applications in vision. In ISCAS, 253–256.
[22] Z. Liao and G. Carneiro. On the Importance of Normalisation Layers in Deep Learning with Piecewise Linear Activation Units. arXiv:1508.0033.
[23] R. Livni, D. Lehavi, S. Schein, H. Nachlieli, S. Shalev-Shwartz, and A. Globerson. (2013) Vanishing Component Analysis. In 30th International Conference on Machine Learning.
[24] H. Matsumura. Commutative Ring Theory. Cambridge Studies in Advanced Mathematics.
[25] M. Marshall. (2008) Positive Polynomials and Sums of Squares. Mathematical Surveys and Monographs, Volume 146.
[26] J. Pennington and Y. Bahri. (2017) Geometry of Neural Network Loss Surfaces via Random Matrix Theory. In Proceedings of the 34th International Conference on Machine Learning, PMLR 70:2798–2806.
[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. (2014) DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 1701–1708.
[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. (2013) Playing Atari with deep reinforcement learning. arXiv:1312.5602.
[29] S. Watanabe. (2009) Algebraic Geometry and Statistical Learning Theory. Cambridge University Press.
[30] X. Zhang and Y. LeCun. (2016) Text Understanding from Scratch. arXiv preprint arXiv:1502.01710v5.
Appendix of "Reconstruction of training samples from loss functions"
4.1 Proof of Theorem 2.1
Let $x$ be the input of a training sample. Let $\{p_\lambda\}_\lambda$ be the set of virtual polynomials of the input $x$. We show that the equations and inequations given by the $p_\lambda$ define the loss surface. Put
$$U = \{w : p_\lambda(w) > 0 \text{ for } \lambda \in \Lambda, \; p_\lambda(w) \le 0 \text{ for } \lambda \notin \Lambda\}$$
for a subset $\Lambda$ of the index set. Take two weights $w$ and $w'$ from $U$. The ReLU activation set associated with $w$ and the one associated with $w'$ are the same. This implies that, if the weights are in the same area, each entry of the output is a polynomial; hence, the loss function is a fixed polynomial in this area, and the loss surface is defined by this polynomial in $U$. If the weights are in another area $U'$, we can use the same discussion for $U'$. This means that the loss surface is a semialgebraic set.
4.2 Proof of Theorem 2.4
The proof of Theorem 2.1 implies that the virtual polynomials give a decomposition of the weight space into domains. Take one such domain $U$. We first see that the loss surface is smooth over the domain $U$.
Again, by the proof of Theorem 2.1, the loss function is a polynomial in $U$. Then, the loss surface is defined in the form $y = F(w)$, where $F$ is a polynomial. By the Jacobian criterion, we see that the loss surface is smooth at every point over $U$, for any domain $U$.
We then claim that we can erase a virtual polynomial $p$ from the defining inequations if and only if the corresponding part of the boundary consists of smooth points.
Take a point of the boundary $\{p = 0\}$ on the loss surface. If the point is smooth, we can take the Taylor expansion of the loss surface there. Since the loss function is a polynomial in a neighborhood of the point, the Taylor expansion is a polynomial. Hence, we can erase $p$ from the defining inequations.
Conversely, assume that we can erase $p$ from the defining inequations. Then the loss surface around any point of the boundary is defined by a polynomial, so the point is smooth.
Finally, we show that $\mathrm{Sing}(X)$ is also a semialgebraic set of codimension 1 in $X$.
Note that $\mathrm{Sing}(X)$ is locally of the form $X \cap \{p = 0\}$. This implies that $\mathrm{Sing}(X)$ has codimension 1 in $X$. We consider the decomposition discussed in the proof of Theorem 2.1 again. In each domain, whether the points are singular or not is fixed, because the function is a polynomial in each domain. This implies that $\mathrm{Sing}(X)$ is a semialgebraic set.
4.3 Proof of Theorem 2.6
By the construction of the virtual polynomial, the weights of each layer appear precisely once in each monomial. This implies that the layerwise degree is equal to $(1, \ldots, 1, 0, \ldots, 0)$.
4.4 Proof of Theorem 2.8
We prove the theorem for any connected deep neural network by induction on the number of layers $H$. If $H = 2$, the statement is clear. Assume $H \geq 3$. First, by Theorem 2.6, virtual polynomials of type $(i, k)$ are homogeneous polynomials of layerwise degree $(1, \ldots, 1, 0, \ldots, 0)$, and the layerwise degree is realized by assigning the degree $(0, \ldots, 0, 1, 0, \ldots, 0)$, where the 1 is in the $k$th entry, to the weights on the paths passing from the $k$th layer to the $(k+1)$th layer. Assume that $p = p_1 p_2$ is a nontrivial factorization. Then, by the general theory of commutative algebra, $p_1$ and $p_2$ are homogeneous polynomials of layerwise degree, and we have $\deg(p) = \deg(p_1) + \deg(p_2)$.
We may assume that, from the first entry to the $l$th entry, the entries of $\deg(p_1)$ are 1 and the $(l+1)$th entry of $\deg(p_1)$ is zero. Hence, we may also assume that the $(l+1)$th entry of $\deg(p_2)$ is 1. Let $w_{ij}^{(l)}$ be the weight on the path passing from the $j$th node in the $l$th layer to the $i$th node in the $(l+1)$th layer. By the layerwise degree, there are monomials containing $w_{ij}^{(l)}$ in $p_1$ and monomials containing $w_{i'j'}^{(l+1)}$ in $p_2$. By the construction of virtual polynomials, there are no monomials in $p$ containing $w_{ij}^{(l)} w_{i'j'}^{(l+1)}$ if $i \neq j'$. However, $p = p_1 p_2$ implies that $p$ has monomials containing $w_{ij}^{(l)} w_{i'j'}^{(l+1)}$ for all such pairs. This implies that the $(l+1)$th layer of the active neural network has a unique node. Hence, by the construction of virtual polynomials, we obtain $p = q_1 q_2$, where $q_1$ is the output of the unique node in the $(l+1)$th layer. By the general theory of commutative algebra (uniqueness of the decomposition), $p_1$ equals $q_1$ up to constant multiplication. Since we can regard $q_2$ as a virtual polynomial of the sub-neural network starting from the $(l+1)$th layer, induced by the output of the $(l+1)$th layer and the same weights, the theorem holds for $q_2$ by the inductive hypothesis. This completes the proof of the first statement of Theorem 2.8. The remaining claim follows from the construction of $q_1$ and $q_2$.
4.5 Proof of Corollary 2.10 and Theorem 2.11
By the assumption, we may assume that we know the defining equations of $X$, so we can pick out the linear polynomials among them. Let $p$ be such a linear polynomial. We show that, if $p$ is not a weight parameter, its coefficients are equal to the input of some sample up to scalar multiplication.
First, we remark that the coefficients of the virtual polynomials in the second layer are equal to the inputs of some samples. Since the defining polynomials of $X$ include virtual polynomials in the second layer, it is enough to show that any linear polynomial appearing in the defining polynomials is a virtual polynomial in the second layer. If a linear polynomial appears in the defining polynomials of $X$, it is an irreducible component of a virtual polynomial. By Theorem 2.8, it is a virtual polynomial of the second layer or a weight parameter. This is because, if a linear polynomial that is not a virtual polynomial of the second layer appears as an irreducible component of a virtual polynomial, it must start from a layer with one active node and end at the next layer with one active node; such a component is a weight parameter.
Hence, we reconstruct the inputs of samples up to scalar multiplication, as well as the weights on the paths from the first layer to the second layer.
Next, we identify the weights on the paths from the $k$th layer to the $(k+1)$th layer by induction on $k$.
By the discussion above, we have identified the weights of the first layer.
Assume that we have identified the weights up to the $k$th layer.
Pick the polynomials among the defining equations such that each monomial consists of already identified weights except for one weight parameter. We can easily see that such a weight is on a path from the $(k+1)$th layer to the $(k+2)$th layer. Hence, we can also obtain the number of layers. This completes the proof.
4.6 Proof of Proposition 2.12
We note that the ambient space of the loss surface is the $(N+1)$-dimensional affine space, where $N$ is the number of weight parameters. Hence, we can determine the hyperplane going through the given nonsmooth points. This hyperplane is a linear component of $\mathrm{Sing}(X)$, and we obtain a sample up to scalar multiplication.