# Reconstruction of training samples from loss functions

This paper presents a new mathematical framework to analyze the loss functions of deep neural networks with ReLU functions. Furthermore, as an application of this theory, we prove that the loss functions can reconstruct the inputs of the training samples up to scalar multiplication (as vectors) and can provide the number of layers and nodes of the deep neural network. Namely, if we have all inputs and outputs of a loss function (or equivalently all possible learning processes), then for the input x_i ∈ R^n of each training sample, we can obtain a vector x'_i ∈ R^n satisfying x_i = c_i x'_i for some c_i ≠ 0. To prove this theorem, we introduce the notion of virtual polynomials, which are polynomials written as the output of a node in a deep neural network. Using virtual polynomials, we find an algebraic structure for the loss surfaces, called semi-algebraic sets. We analyze these loss surfaces from the algebro-geometric point of view. Factorization of polynomials is one of the most standard ideas in algebra. Hence, we express the factorization of the virtual polynomials in terms of their active paths. This framework can be applied to the leakage problem in the training of deep neural networks. The main theorem in this paper indicates that there are many risks associated with the training of deep neural networks. For example, if we have N + 1 nonsmooth points on the loss surface (where N is the dimension of the weight space) that are sufficiently close to each other, we can obtain the input of a training sample up to scalar multiplication. We also point out that the structures of the loss surfaces depend on the shape of the deep neural network and not on the training samples.


## 1 Introduction

Deep learning has had great success in many fields. Deep learning models perform extremely well in computer vision [21], image processing, video processing, face recognition [27], speech recognition [15], and natural language processing [1, 6, 30]. Deep learning has also been used in more complex systems that are able to play games [14, 19, 28] or diagnose and classify diseases [2, 9, 10].
Along with the development of deep learning, decisions made using deep-learning principles are being implemented in a wide range of applications. In order for deep learning to be more useful for human beings, it is necessary to ensure that there is no leakage of personal or confidential information in the learning process or decision-making process. In this paper, we point out the leakage problem in deep learning. Namely, the learning process of deep learning might leak sample data. This type of phenomenon is specific to the deep-learning methods with ReLU functions. The following example expresses the difference between linear and ReLU models. Consider a one-dimensional least-squares model.

$$g(a) := \sum_{i=1}^{m} (y_i - a x_i)^2.$$

Because many sample sets $\{(x_i, y_i)\}$ give the same $g$ (see [25]), we cannot reconstruct $(x_i, y_i)$ from $g$. However, if we consider the one-dimensional ReLU least-squares model,

$$h(a) := \sum_{i=1}^{m} \bigl(\max(0,\, y_i - a x_i)\bigr)^2.$$

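Then, as $a = y_i/x_i$ are the nonsmooth points of $h$, we can obtain the nonsmooth points $y_i/x_i$ from $h$. Hence, we can reconstruct $(x_i, y_i)$ from $h$ up to scalar multiplication. Namely, we obtain $(x'_i, y'_i)$ satisfying $x_i = c_i x'_i$ and $y_i = c_i y'_i$ for some $c_i \neq 0$.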
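The one-dimensional ReLU example can be made concrete with a small sketch. It assumes hypothetical samples and exact rational arithmetic: on each side of a breakpoint the function $h$ coincides with a quadratic, and the difference of the two adjacent quadratics is the perfect square $(y_i - a x_i)^2$, whose double root is the breakpoint $y_i/x_i$. The function names and the sample values are illustrative and are not taken from the paper.

```python
from fractions import Fraction


def relu_loss(samples):
    """Return h(a) = sum_i max(0, y_i - a*x_i)^2 for the given (x_i, y_i) pairs."""
    def h(a):
        return sum(max(Fraction(0), y - a * x) ** 2 for x, y in samples)
    return h


def fit_quadratic(h, a0, d):
    """Exact coefficients (c2, c1, c0) of the quadratic through a0-d, a0, a0+d.

    Assumes the three evaluation points lie in one region where h is a
    single quadratic, so finite differences are exact.
    """
    hm, h0, hp = h(a0 - d), h(a0), h(a0 + d)
    c2 = (hp - 2 * h0 + hm) / (2 * d * d)
    slope_at_a0 = (hp - hm) / (2 * d)
    c1 = slope_at_a0 - 2 * c2 * a0
    c0 = h0 - c1 * a0 - c2 * a0 * a0
    return c2, c1, c0


def breakpoint_between(h, left, right, d):
    """Breakpoint separating the regions around `left` and `right`.

    The difference of the two fitted quadratics is c*(a - a*)^2, so its
    double root a* = -c1 / (2*c2) is the nonsmooth point.
    """
    l2, l1, _ = fit_quadratic(h, left, d)
    r2, r1, _ = fit_quadratic(h, right, d)
    return -(l1 - r1) / (2 * (l2 - r2))
```

With the hypothetical samples `[(2, 3), (1, 4)]`, the two recovered breakpoints are the ratios $3/2$ and $4$, which determine each sample only up to a common scalar.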
The main theorem in this paper shows that, if we reveal all possible learning processes, leakage of training samples can occur. As this example indicates, nonsmooth points of loss functions play an important role. We show that the set of nonsmooth points and the set induced by some algebraic structure coincide (Theorem 2.4). This is a key point of this paper. Another key point is the concept of homogeneous polynomials, which is used in algebraic geometry. We find the natural multidegree (layer-wise degree) in deep neural networks. We show that loss functions are essentially homogeneous polynomials (virtual polynomials) of layer-wise degree. By using the theory of homogeneous polynomials, we show the correspondence between the factorization of a virtual polynomial and its active paths (Theorem 2.8). Finally, as an application of this theorem, we show the weak reconstruction theorem of training samples (Theorem 2.11). We also give a theoretical algorithm to obtain weak reconstruction of training samples (Section 2.5).

### 1.1 Related work

Leakage problem: The leakage problem in deep learning can be a serious issue in the future, and many researchers are working on it. Since the trained model has essential information about the training samples, it is possible to extract sensitive information from a model [3, 11, 12]. B. Hitaj, G. Ateniese, & F. Pérez-Cruz considered this problem by using generative adversarial networks (GANs) [18]. This type of approach is suitable for images, but unsuitable for numerical data such as medical nonimage data. For example, generating human-model data, such as a height of six feet, is usually considered normal. However, if the identity of the model is made available, then it is a leakage problem. This is the difference between the model generating approach and the deterministic approach.

Loss surfaces: Mathematically, the learning process of deep learning is to find the local minima of loss surfaces (loss functions). Before this paper, some researchers had analyzed the loss surfaces. One of their aims was achieving a theoretical understanding of the generalization of deep learning. For example, K. Kawaguchi proved that every local minimum of a loss surface associated with a linear neural network is a global minimum [4,20]. J. Pennington and Y. Bahri analyzed loss surfaces using random matrix theory [26]. In this paper, we present a new framework to analyze loss surfaces. We study the structure of loss surfaces using algebraic geometry. This approach can contribute to the theoretical understanding of the generalization of deep learning.

Algebraic geometry: Algebraic geometry is one of the most exciting fields of pure mathematics [5,8,16,24]. Furthermore, algebraic geometry is frequently applied to machine learning. For example, R. Livni et al. introduced vanishing component analysis to express the algebraic (nonlinear) structure of data sets [23]. S. Watanabe applied algebraic geometry to learning theory. He proved that an invariant defined in algebraic geometry and one defined in learning theory coincide [29]. He also related the values of these invariants to zeta functions. Algebraic geometry is a powerful tool for treating polynomials.

### 1.2 Contribution

We discuss the loss functions of fully connected deep neural networks with square losses. Basically, all notations are taken from the deep learning book by Goodfellow et al. [13]. Let $L$ be the number of layers. We do not use the notion of "hidden layer", for consistency with the other definitions. We denote the weight parameters by $w \in \mathbb{R}^N$, which consists of the entries of the parameter matrices corresponding to each layer $k$: $W_k \in \mathbb{R}^{d_{k+1} \times d_k}$. Here, $d_k$ represents the width of the $k$-th layer, where the first layer is the input layer and the $L$-th layer is the output layer. We use $(i,k)$-node to indicate the $i$-th node in the $k$-th layer. We denote its output as $x^{(k)}_i$ and pre-output as $z^{(k)}_i$, namely $x^{(k)}_i = \max(0, z^{(k)}_i)$ and

$$W_k \begin{bmatrix} x^{(k)}_1 \\ x^{(k)}_2 \\ \vdots \\ x^{(k)}_{d_k} \end{bmatrix} = \begin{bmatrix} z^{(k+1)}_1 \\ z^{(k+1)}_2 \\ \vdots \\ z^{(k+1)}_{d_{k+1}} \end{bmatrix}.$$

We simply denote $x^{(1)}_i$ by $x_i$. We denote the output by $F_w$, namely,

$$F_w\left(\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{d_1} \end{bmatrix}\right) = \begin{bmatrix} z^{(L)}_1 \\ z^{(L)}_2 \\ \vdots \\ z^{(L)}_{d_L} \end{bmatrix}.$$

Let $\Omega = \{(a_i, b_i)\}$ be a training sample set. Then, we define the loss function as follows,

$$E(w) = \sum_{(a_i, b_i) \in \Omega} \frac{1}{2} \lVert b_i - F_w(a_i) \rVert^2,$$

where $\lVert \cdot \rVert$ is the Frobenius norm. The main theorem of this paper is given below.

###### Theorem 1.1

Let $E$ be the loss function of a deep neural network with ReLU functions. Assume we can obtain all inputs and outputs of $E$. Then we can obtain $a'_i$ satisfying $a_i = c_i a'_i$ for some $c_i \neq 0$, the number of layers, and the number of nodes in each layer.

This theorem means that the inputs and outputs of the loss function can reconstruct the inputs of the training samples up to scalar multiplication. In other words, if we can obtain all possible training processes of deep learning, $a_i$ is reconstructed as $a'_i$. In general, $a'_i$ is not equal to $a_i$. However, if we obtain one entry of $a_i$, we can specify $c_i$ in Theorem 1.1. Hence, we can obtain $a_i$. This indicates that revealing the training process of deep learning carries many risks. Hence, we need to conceal the values of loss functions to protect training samples. We can provide a stronger statement after proper mathematical preparation (see Theorem 2.11).

Note that Theorem 1.1 can be generalized as follows. First, we can add any smooth function r(w) to the loss function E(w) as a regularization term. Second, we can change the activation function to any piecewise linear function such as Leaky ReLU, Maxout, and LWTA [17,22]. For simplicity, in this paper, we only treat the simplest case.
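The network $F_w$ and the square loss $E(w)$ of this subsection can be sketched in a few lines. This is a minimal illustration under stated assumptions: no bias terms, the input layer is passed through unchanged, ReLU is applied at every intermediate layer, and the last layer returns its pre-output $z^{(L)}$. The function names and the list-of-rows weight layout are hypothetical choices for the sketch.

```python
def matvec(W, x):
    """Multiply a matrix given as a list of rows by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def forward(weights, a):
    """Compute F_w(a) for weights = [W_1, ..., W_{L-1}].

    Each W_k maps the outputs x^(k) of layer k to the pre-outputs z^(k+1)
    of layer k+1. ReLU is applied everywhere except at the output layer.
    """
    x = list(a)
    for k, W in enumerate(weights):
        z = matvec(W, x)
        if k == len(weights) - 1:
            return z  # the output layer returns z^(L)
        x = [max(0.0, v) for v in z]
    return x  # no weight matrices: the network is the identity


def loss(weights, samples):
    """Square loss E(w) = sum over (a, b) of 0.5 * ||b - F_w(a)||^2."""
    total = 0.0
    for a, b in samples:
        out = forward(weights, a)
        total += 0.5 * sum((b_j - f_j) ** 2 for b_j, f_j in zip(b, out))
    return total
```

For example, with the hypothetical weights `[[[1, -1], [0, 2]], [[1, 1]]]` and the sample `([1, 2], [1])`, the first second-layer node is inactive and only the path through the second node contributes to the output.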

## 2 Mathematical results

In this section, we prepare definitions and theorems to prove the main result. Our focus is on the loss surfaces that are defined by

$$X = \{(w, y) \in \mathbb{R}^N \times \mathbb{R} \mid y = E(w)\},$$

where $E$ is the loss function defined above. From the viewpoint of deep learning, we are interested in the local minima of loss functions. We provide some mathematical frameworks from algebraic geometry. This is a new method to analyze the loss surfaces, which can contribute to the theoretical understanding of generalization. For the standard notations in algebraic geometry, we refer to [8,16]. First, we define semi-algebraic sets, a notion from algebraic geometry. Let $X$ be a subset of $\mathbb{R}^n$. $X$ is said to be a semi-algebraic set if $X$ is defined by polynomial equations $f = 0$ and polynomial inequalities $g > 0$, together with finite unions of such sets. If $X$ is a semi-algebraic set, we say that $f$ is a defining equation of $X$ and $g$ is a defining inequation of $X$. For other notations in semi-algebraic geometry, we refer to [7]. The following theorem points out that the loss surfaces are semi-algebraic sets.

###### Theorem 2.1 (Structure theorem 1)

Let $X$ be a loss surface of a square loss function. Then, $X$ is a semi-algebraic set of codimension 1.

Figure 1 indicates the meaning of the theorem. A polynomial divides the loss surface into subsurfaces, each of which is defined by a (fixed) polynomial. A precise description is given later. This theorem allows us to use algebraic geometry for analyzing loss surfaces. The second theorem is about the decomposition of $X$ as a semi-algebraic set. To describe it, we define virtual polynomials, which are functions written as the outputs of nodes.

### 2.1 Virtual polynomials

The concept of virtual polynomials plays an important role in this paper.

###### Definition 2.2

Fix an input $x$ and weight $w$ on the fixed deep neural network. An $(i,k)$-node is said to be active if its output is positive. We define the set

$$\{(i, k, q) \mid q = \text{positive or negative}\}$$

to be the ReLU activation set.

When we speak of a ReLU activation set without further qualification, it is just a formal collection of nodes paired with their activations. Hence, it is irrelevant whether it is realized by an input and a weight. When we have a ReLU activation set, it induces a deep linear network. We define virtual polynomials by using them.

###### Definition 2.3

Fix an input $x$. A polynomial $u$ in the weight variables is defined to be a virtual polynomial of type $(i,k)$ if $u = x^{(k)}_i$, where $x^{(k)}_i$ is the output of the $i$-th node in the $k$-th layer in the deep linear network induced by some ReLU activation set. We simply call $u$ a virtual polynomial if $u$ is a virtual polynomial of type $(i,k)$ for some ReLU activation set and some $(i,k)$.

See Figure 2. The virtual polynomials at the output node of this neural network are

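$$\{\omega_1\omega_5 x_1 + \omega_2\omega_6 x_1 + \omega_3\omega_5 x_2 + \omega_4\omega_6 x_2,\ \ \omega_1\omega_5 x_1 + \omega_3\omega_5 x_2,\ \ \omega_2\omega_6 x_1 + \omega_4\omega_6 x_2,\ \ 0\}.$$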

The corresponding ReLU activation sets are

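$$\begin{aligned}
&\{(1,1,\text{active}),\ (1,2,\text{active}),\ (2,1,\text{active}),\ (2,2,\text{active})\},\\
&\{(1,1,\text{active}),\ (1,2,\text{active}),\ (2,1,\text{active}),\ (2,2,\text{negative})\},\\
&\{(1,1,\text{active}),\ (1,2,\text{active}),\ (2,1,\text{negative}),\ (2,2,\text{active})\},\\
&\{(1,1,\text{active}),\ (1,2,\text{active}),\ (2,1,\text{negative}),\ (2,2,\text{negative})\}.
\end{aligned}$$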

If we fix a ReLU activation set, we have a virtual polynomial. However, even if we fix the virtual polynomial, the ReLU activation set that provides the virtual polynomial is not unique. For example, give as a virtual polynomial in the example above.
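The enumeration in this example can be reproduced mechanically. The sketch below assumes a wiring inferred from the listed polynomials (the figure itself is not available here): the first second-layer node receives $\omega_1 x_1 + \omega_3 x_2$ and feeds the output through $\omega_5$, and the second receives $\omega_2 x_1 + \omega_4 x_2$ and feeds the output through $\omega_6$. Both input nodes are taken to be active. A polynomial is represented as a set of monomials, and the names `PATHS` and `virtual_polynomial` are hypothetical.

```python
from itertools import product

# Paths input -> second-layer node -> output, keyed by the second-layer node.
# Each path is the monomial formed by its weights and its input variable.
PATHS = {
    1: [("w1", "w5", "x1"), ("w3", "w5", "x2")],
    2: [("w2", "w6", "x1"), ("w4", "w6", "x2")],
}


def virtual_polynomial(active_nodes):
    """Monomials of the output when only `active_nodes` pass their signal.

    An inactive node outputs zero, so every path through it is dropped.
    The empty set represents the zero polynomial.
    """
    return frozenset(m for node in active_nodes for m in PATHS[node])


def all_virtual_polynomials():
    """Map each activation pattern of the two second-layer nodes to its polynomial."""
    result = {}
    for pattern in product([True, False], repeat=2):
        active = tuple(node for node, on in zip((1, 2), pattern) if on)
        result[pattern] = virtual_polynomial(active)
    return result
```

The four activation patterns give the four polynomials listed above, with the all-negative pattern giving the zero polynomial.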
Now, we can state the second theorem.

###### Theorem 2.4 (Structure theorem 2)

Let $X$ be a loss surface of a square loss function. Let $\mathrm{Sing}(X)$ be the set of nonsmooth points on $X$. Then,

• The shortest decomposition (the decomposition that we cannot reduce by defining inequations) is given by .

• $\mathrm{Sing}(X)$ is purely of codimension 1 in $X$ (see [3]) and is locally defined by a virtual polynomial.

• $\mathrm{Sing}(X)$ is a semi-algebraic set.

This indicates that $\mathrm{Sing}(X)$ is a natural set from not only the differential-geometric view but also the algebro-geometric view. By this theorem, $\mathrm{Sing}(X)$ is locally defined by some virtual polynomials. Hence, from the algebro-geometric viewpoint, we need to know the irreducible decomposition of virtual polynomials to obtain the geometric structure of $\mathrm{Sing}(X)$.

### 2.2 Irreducibility of polynomials

In this section, we review factorization of polynomials. We first define the irreducibility of polynomials. Let $f$ be a polynomial in several variables with real coefficients. $f$ is said to be irreducible if we cannot write $f$ as a product of two non-constant polynomials. Namely,

$$f = gh \ \Rightarrow\ g \text{ or } h \text{ is constant}.$$

It is well-known that polynomials have an irreducible decomposition [5,8,10]. Namely, let $f$ be a polynomial in several variables with real coefficients. Then, $f$ has a unique decomposition of the following form.

$$f = f_1 \cdots f_n,$$

where each $f_i$ is an irreducible polynomial with real coefficients and each $f_i$ is unique up to constant multiplication. We call each $f_i$ above an irreducible component of $f$. In Section 2.4, we give the irreducible decomposition of virtual polynomials (see Theorem 2.8).

### 2.3 Homogeneous polynomials

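In this subsection, we review the concepts of homogeneous polynomials and multidegree. Let $x_1, \dots, x_n$ be variables. The multidegree of each $x_i$ is defined as an element $\deg(x_i)$ in $\mathbb{Z}_{\geq 0}^m$. For any monomial $x_1^{a_1} \cdots x_n^{a_n}$, we define

$$\deg(x_1^{a_1} \cdots x_n^{a_n}) = \sum_{i=1}^{n} a_i \deg(x_i).$$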

A deep neural network induces natural multidegree.

###### Definition 2.5

Let $w^{(k)}_{(i,j)}$ be the weight variable on the path passing from the $i$-th node in the $k$-th layer to the $j$-th node in the $(k+1)$-th layer. Then, we define

$$\deg\bigl(w^{(k)}_{(i,j)}\bigr) = (0, 0, \dots, 0, 1, 0, 0, \dots, 0),$$

where $1$ exists in the $k$-th entry. We call this multidegree the layer-wise degree.

Fix a multidegree. A polynomial $f$ is said to be homogeneous if every monomial appearing in $f$ has the same multidegree. In this case, we define $\deg(f) = \deg(m)$, where $m$ is a monomial appearing in $f$. $\deg(f)$ does not depend on the choice of $m$. It is well-known that any irreducible component of a homogeneous polynomial is homogeneous (see [5,8,24]).

We can see an example of layer-wise degree in Figure 2. The layer-wise degree of this neural network is

$$\deg(w_i) = (1, 0) \quad (i = 1, 2, 3, 4), \qquad \deg(w_i) = (0, 1) \quad (i = 5, 6).$$

The following theorem points out the features of virtual polynomials.

###### Theorem 2.6

Virtual polynomials of type $(i,k)$ are homogeneous polynomials of layer-wise degree $(1, \dots, 1, 0, \dots, 0)$, where $1$ exists from the first entry to the $(k-1)$-th entry.
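The layer-wise degree is easy to check by hand on the two-layer example, and the following sketch does the same bookkeeping in code. It assumes the degree assignment given above for the six weights, represents a monomial as a tuple of variable names, and treats input variables such as `x1` as having degree zero. The names `LAYER_OF`, `multidegree`, and `is_homogeneous` are hypothetical.

```python
# Layer index (0-based) of each weight in the example network:
# w1..w4 connect layer 1 to layer 2, w5 and w6 connect layer 2 to layer 3.
LAYER_OF = {"w1": 0, "w2": 0, "w3": 0, "w4": 0, "w5": 1, "w6": 1}


def multidegree(monomial, n_weight_layers=2):
    """Layer-wise degree of a monomial given as a tuple of variable names."""
    degree = [0] * n_weight_layers
    for variable in monomial:
        if variable in LAYER_OF:  # inputs such as "x1" carry no degree
            degree[LAYER_OF[variable]] += 1
    return tuple(degree)


def is_homogeneous(polynomial):
    """True if all monomials of the polynomial share one layer-wise degree."""
    return len({multidegree(m) for m in polynomial}) <= 1


# The virtual polynomial of the output node with both second-layer nodes active.
U = [("w1", "w5", "x1"), ("w2", "w6", "x1"), ("w3", "w5", "x2"), ("w4", "w6", "x2")]
```

Every monomial of `U` uses exactly one weight from each layer, so its layer-wise degree is $(1, 1)$, in line with the statement of the theorem.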

### 2.4 Irreducible decomposition theorem

In this subsection, we give the necessary and sufficient conditions for the irreducible decomposition of virtual polynomials.

###### Definition 2.7

Let be a ReLU activation set of fixed input and weight . Then, a -active neural network is a subneural network, which consists of -active nodes and the paths between them.

An example of -active neural networks is given below. See Figure 2 and 3. We can regard the neural net in Figure 2 as a sub neural network of the one in Figure 2. Assume that in Figure 2 is negative for some input and weight and the earlier output was positive. Then, with this ReLU activation set , the -active neural network is equal to the one in Figure 2.

###### Theorem 2.8 (Irreducible decomposition theorem)

Let be a ReLU activation set of fixed input and weight . Let be a virtual polynomial of type induced by . Then, if and only if the -active neural network has layers such that there is a unique node in the layer. Furthermore, we can write as the output of the subneural network which starts from a unique node and ends at the next unique node.

A typical example of Theorem 2.8 is given below.

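Let $u$ be a virtual polynomial whose active neural network is the one in Figure 3. Then, we have the following irreducible decomposition of $u$:

$$u = (\omega_1\omega_5 x_1 + \omega_2\omega_6 x_1 + \omega_3\omega_5 x_2 + \omega_4\omega_6 x_2)(\omega_7\omega_9 + \omega_8\omega_{10}).$$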

The first component of the decomposition corresponds to the output of the node in the third layer. The second component of the decomposition corresponds to a function that starts from the third layer and ends at the output. Hence, the theorem tells us the irreducible decomposition of the virtual polynomials from its active node.

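We can see that $u$ in this example is realized by an input $x$ and weight $w$. Hence, $u$ is one of the defining equations of $\mathrm{Sing}(X)$. The decomposition implies that $u = 0$ if and only if

$$\omega_1\omega_5 x_1 + \omega_2\omega_6 x_1 + \omega_3\omega_5 x_2 + \omega_4\omega_6 x_2 = 0$$

or

$$\omega_7\omega_9 + \omega_8\omega_{10} = 0.$$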

This means that $\omega_7\omega_9 + \omega_8\omega_{10}$ is a defining equation of $\mathrm{Sing}(X)$. However, this polynomial does not depend on the training samples. Hence, the loss surfaces have differential geometric structures that are independent of the training samples. Suitable algorithms using such structures can be developed.
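The factorization in this example can be verified by summing over paths. The sketch below assumes a 2-2-1-2-1 architecture inferred from the factors (two inputs, two nodes in the second layer, a single node in the third layer, two nodes in the fourth layer, one output), with all nodes active. It enumerates every input-to-output path and checks that the resulting polynomial equals the product of the two factors. The node labels and helper names are hypothetical.

```python
from collections import Counter

# Outgoing edges of each node as (weight, next node).
EDGES = {
    "x1": [("w1", "h1"), ("w2", "h2")],
    "x2": [("w3", "h1"), ("w4", "h2")],
    "h1": [("w5", "m")],
    "h2": [("w6", "m")],
    "m": [("w7", "g1"), ("w8", "g2")],
    "g1": [("w9", "out")],
    "g2": [("w10", "out")],
}


def paths_from(node):
    """All weight tuples along paths from `node` to the output."""
    if node == "out":
        return [()]
    return [(w,) + rest for w, nxt in EDGES[node] for rest in paths_from(nxt)]


def output_polynomial():
    """Sum over all paths of (product of weights) * (input variable)."""
    poly = Counter()
    for x in ("x1", "x2"):
        for path in paths_from(x):
            poly[tuple(sorted(path + (x,)))] += 1
    return poly


def poly(*monomials):
    """Polynomial with coefficient 1 on each given monomial."""
    return Counter(tuple(sorted(m)) for m in monomials)


def multiply(p, q):
    """Product of two polynomials stored as monomial -> coefficient."""
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out


FIRST = poly(("w1", "w5", "x1"), ("w2", "w6", "x1"),
             ("w3", "w5", "x2"), ("w4", "w6", "x2"))
SECOND = poly(("w7", "w9"), ("w8", "w10"))
```

Because every path must pass through the single node of the third layer, the eight path monomials split into a part before that node and a part after it, which is exactly the product `multiply(FIRST, SECOND)`.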

###### Proposition 2.9

$\mathrm{Sing}(X)$ has irreducible components that do not depend on the training samples.

###### Corollary 2.10

Linear components of virtual polynomials are weight parameters or come from the second layer.

We state the main result of this paper. Note that, if we know the inputs and outputs of the loss function, we know the defining equations and the defining inequations of $X$.

###### Theorem 2.11 (Weak reconstruction theorem)

The loss surface $X$ reconstructs the number of layers, the number of nodes in each layer, and the training samples (not equal to a unit vector) up to scalar multiplication. Namely, a vector $a'_i$ satisfying $a_i = c_i a'_i$ for some $c_i \neq 0$ is reconstructed for the input $a_i$ of each training sample.

In this theorem, we need infinitely many points on the loss surface. However, if we assume that these points are sufficiently close to each other, we can reconstruct the input of a training sample from finitely many of them.

###### Proposition 2.12

Assume that we have $N + 1$ nonsmooth points on the loss surface that are sufficiently close to each other. Then, we can obtain a vector $a'$ satisfying $a_i = c\,a'$ for some $c \neq 0$ and some $i$.
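The linear-algebra step behind this proposition is to determine the hyperplane passing through the given points; its coefficient vector is then the candidate for a sample input up to scalar multiplication. The sketch below assumes the points are given exactly and are in general position, and solves for the hyperplane $n \cdot p = c$ by Gaussian elimination over the rationals. The function names and the example points are hypothetical.

```python
from fractions import Fraction


def null_vector(rows):
    """Return a nonzero v with row . v = 0 for every row.

    Assumes the null space is one-dimensional; raises ValueError otherwise.
    """
    a = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(a[0])
    pivots = []
    r = 0
    for c in range(n_cols):
        if r == len(a):
            break
        p = next((i for i in range(r, len(a)) if a[i][c] != 0), None)
        if p is None:
            continue
        a[r], a[p] = a[p], a[r]
        lead = a[r][c]
        a[r] = [v / lead for v in a[r]]
        for i in range(len(a)):
            if i != r and a[i][c] != 0:
                f = a[i][c]
                a[i] = [vi - f * vr for vi, vr in zip(a[i], a[r])]
        pivots.append(c)
        r += 1
    free = [c for c in range(n_cols) if c not in pivots]
    if len(free) != 1:
        raise ValueError("points do not determine a unique hyperplane")
    v = [Fraction(0)] * n_cols
    v[free[0]] = Fraction(1)
    for row_index, c in enumerate(pivots):
        v[c] = -a[row_index][free[0]]
    return v


def hyperplane_through(points):
    """Return (normal, offset) with normal . p == offset for every point p."""
    rows = [list(p) + [-1] for p in points]
    v = null_vector(rows)
    return v[:-1], v[-1]
```

For the three hypothetical points `(1, 0, 2)`, `(0, 1, 3)`, `(1, 1, 5)`, the recovered normal is proportional to `(2, 3, -1)`. Only the direction of the normal is determined, which mirrors the "up to scalar multiplication" in the statement.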

### 2.5 An algorithm to reconstruct training samples

We give a theoretical algorithm to obtain weak reconstruction of training samples. The algorithm requires a long time, but it terminates in finite time.

We estimate the degree of the defining equations from the dimension of the weight space. Then, we take a random weight $w$ and obtain the defining equation around $w$ by taking finitely many points near $w$. After that, we find an adjacent division and its defining equation by taking random points around $w$ and comparing the values of the loss function with those of the defining polynomial already obtained. Here, the intersection of these two equations is an irreducible component of $\mathrm{Sing}(X)$. In other words, it is defined by a virtual polynomial. We repeat this procedure until we find all the divisions.
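One step of this procedure, finding where the loss stops agreeing with the polynomial fitted in the current division, can be sketched in one dimension. The code reuses the ReLU least-squares model from the introduction with hypothetical samples, assumes exact rational arithmetic, and uses bisection, which is a simple stand-in for the random sampling described above rather than the paper's procedure.

```python
from fractions import Fraction

# Hypothetical samples (x_i, y_i) for h(a) = sum_i max(0, y_i - a*x_i)^2.
SAMPLES = [(2, 3), (1, 4)]


def h(a):
    """The one-dimensional ReLU least-squares loss."""
    return sum(max(Fraction(0), y - a * x) ** 2 for x, y in SAMPLES)


def q(a):
    """Polynomial that h equals while both samples are active:
    (3 - 2a)^2 + (4 - a)^2 = 5a^2 - 20a + 25."""
    return 5 * a * a - 20 * a + 25


def locate_boundary(loss, poly, lo, hi, steps=40):
    """Shrink [lo, hi] around the point where `loss` leaves `poly`.

    Assumes loss(lo) == poly(lo), loss(hi) != poly(hi), and that the two
    agree exactly up to a single boundary point and disagree after it.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2
        if loss(mid) == poly(mid):
            lo = mid
        else:
            hi = mid
    return lo, hi
```

Starting from the interval $[0, 5]$, the returned interval brackets the first breakpoint $3/2$, which is the boundary between the division where both samples are active and the adjacent one.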

## 3 Sketch of proofs

We give sketches of the proofs in this section. We complete the proofs in the appendix.

### 3.1 Proof of Theorem 2.1

Let $u_{(i,k)}$ be a virtual polynomial at $(i,k)$. Then, put

$$W_u = \{w \in \mathbb{R}^N \mid u_{(i,k)}(w) = 0\}.$$

Since $W_u$ is defined by a single polynomial, $W_u$ divides the weight space into two areas defined by inequalities. Add all $W_u$ to the weight space. Then, the space is divided into many areas defined by inequalities. Fix an input $x$ and take two weights $w$ and $w'$. If $w$ and $w'$ belong to the same area, the ReLU activation set associated with $w$ and the one associated with $w'$ are the same. This implies that, if the weights are in the same area, any entry of $F_w(x)$ is a polynomial. Hence the loss function is a polynomial. This means that the loss surface is a semi-algebraic set.
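The key observation of this sketch, that weights in the same area share one ReLU activation set, can be illustrated by computing activation patterns directly. The code assumes a bias-free fully connected network with ReLU at every intermediate layer and records only the patterns of those intermediate layers. The function name and the example weights are hypothetical.

```python
def activation_pattern(weights, a):
    """Activation pattern of all intermediate layers for input `a`.

    `weights` is [W_1, ..., W_{L-1}] with each W_k a list of rows.
    A node counts as active when its pre-output is strictly positive.
    """
    x = list(a)
    pattern = []
    for W in weights[:-1]:  # the output layer has no ReLU
        z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
        pattern.append(tuple(v > 0 for v in z))
        x = [max(0, v) for v in z]
    return tuple(pattern)


# Two weights in the same area and one weight in a different area.
W_A = [[[1, -1], [0, 2]], [[1, 1]]]
W_B = [[[1, -2], [0, 3]], [[1, 1]]]
W_C = [[[3, -1], [0, 2]], [[1, 1]]]
```

For the input `[1, 2]`, `W_A` and `W_B` give the same pattern, so the network output is the same polynomial in the weights at both, while `W_C` activates the first second-layer node and therefore lies in a different area.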

### 3.2 Proof of Theorem 2.11 and Proposition 2.12

By assumption, we know the defining equations of $\mathrm{Sing}(X)$, so we can pick out the linear polynomials among them. We show that, if such a linear polynomial is not a weight parameter, its coefficients are equal to the input of some sample up to scalar multiplication. First, we remark that the coefficients of the virtual polynomials in the second layer are equal to the inputs of some samples up to scalar multiplication. Since the defining polynomials of $\mathrm{Sing}(X)$ include virtual polynomials in the second layer, it is enough to show that any linear polynomial appearing in the defining polynomials is a virtual polynomial in the second layer or a weight parameter. If a linear polynomial appears in the defining polynomials of $\mathrm{Sing}(X)$, it is an irreducible component of a virtual polynomial. By Theorem 2.8, it is a virtual polynomial of the second layer or a weight parameter. This is because, if a linear polynomial that is not a virtual polynomial of the second layer appears as an irreducible component of a virtual polynomial, it must start from the $k$-th layer with one active node and end at the $(k+1)$-th layer with one active node. This is a weight parameter. Hence, we reconstruct the input of a sample up to scalar multiplication and the weights on the paths from the first layer to the second layer. We can find a quadratic polynomial in the defining equations of $\mathrm{Sing}(X)$ that contains the weights on the paths from the first layer to the second layer. Then, the remaining weights are the weights on the paths from the second layer to the third layer. Inductively, we can reconstruct the number of nodes and layers.

## 4 Conclusion

In this paper, we presented a new mathematical framework based on algebraic geometry and some new concepts, including virtual polynomials. Using these, we proposed a structure theorem for loss surfaces and an irreducible decomposition theorem for virtual polynomials. The main contribution of this paper is the reconstruction theorem for samples. Namely, the training process of deep learning could leak information about samples. While this fact is important on its own, the proposed framework contributes more. It enables researchers in the fields of machine learning and algebraic geometry to pursue research on deep learning. We expect that new algorithms for security issues can be discovered from this framework. In addition, we may be able to find an efficient training algorithm for deep learning from this framework. More theoretical understanding of deep learning is required, and this framework may contribute to that as well.

###### Acknowledgments

The author would like to thank Prof. Masashi Sugiyama and Kenichi Bannai for giving the opportunity to study machine learning at RIKEN AIP. The author would like to thank Prof. Jun Sakuma and Takanori Maehara for carefully reading the draft and offering valuable advice. The author would like to thank Prof. Shuji Yamamoto and Sumio Watanabe for their fruitful discussion. The author was partially supported by JSPS Grant-in-Aid for Young Scientists (B) 16K17581.

[1] A. Abdulkader, A. Lakshmiratan, and J. Zhang. (2016) Introducing DeepText: Facebook’s text understanding engine. https://tinyurl.com/jj359dv

[2] A. Cruz-Roa, J. Ovalle, A. Madabhushi, and F. Osorio. (2013) A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer Berlin Heidelberg, 403–410.

[3] G. Ateniese, L. V Mancini, A. Spognardi, A. Villani, D. Vitali, & G. Felici. (2015) Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers. International Journal of Security and Networks 10, 3, 137–150.

[4] P. Baldi & K. Hornik. (1989) Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53–58.

[5] W. Bruns & H. J. Herzog. (1998) Cohen-Macaulay Rings. Cambridge University Press.

[6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, Aug (2011), 2493–2537.

[7] M. Coste. (2002) An introduction to semialgebraic geometry. Tech. rep., Institut de Recherche Mathematiques de Rennes

[8] D. Cox, J. Little, and D. O’Shea. (1992) Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra. Springer.

[9] DeepMind. (2016) DeepMind Health, Clinician-led. Patient-centred. https://deepmind.com/applied/deepmind-health/

[10] R. Fakoor, F. Ladhak, A. Nazi, and M. Huber. (2013) Using deep learning to enhance cancer diagnosis and classification. In The 30th International Conference on Machine Learning (ICML 2013),WHEALTH workshop

[11] M. Fredrikson, S. Jha, and T. Ristenpart. (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 1322–1333.

[12] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart. (2014) Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14). 17–32.

[13] I. Goodfellow, Y. Bengio & A. Courville. (2016) Deep Learning. MIT Press.

[14] Google DeepMind. 2016. AlphaGo, the first computer program to ever beat a professional player at the game of GO. (2016). https://deepmind.com/alpha-go

[15] A. Graves, A. Mohamed, and G. Hinton. (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 6645–6649.

[16] R. Hartshorne. (1977) Algebraic geometry, Springer-Verlag, New York, Graduate Texts in Mathematics, No. 52

[17] K. He, X. Zhang, S. Ren & J. Sun. (2015) Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. IEEE International Conference on Computer Vision.

[18] B. Hitaj, G. Ateniese,  & F. Perez-Cruz. (2017) Deep models under the GAN: information leakage from collaborative deep learning. CoRR, abs/1702.07464.

[19] M. Lai. (2015) Giraffe: Using deep reinforcement learning to play chess. arXiv preprint arXiv:1509.01549.

[20] K. Kawaguchi. (2016) Deep learning without poor local minima, In Advances In Neural Information Processing Systems, pp. 586–594, 2016.

[21] Y. LeCun, K. Kavukcuoglu, C. Farabet, et al. (2010) Convolutional networks and applications in vision. In ISCAS. 253–256.

[22] Z. Liao & G. Carneiro. On the Importance of Normalisation Layers in Deep Learning with Piecewise Linear Activation Units, arxiv1508.0033

[23] R. Livni, D. Lehavi , S. Schein, H. Nachlieli, S. Shalev-Shwartz & A. Globerson. (2013) Vanishing Component Analysis 30th International Conference on Machine Learning.

[24] H. Matsumura. Commutative Ring Theory Cambridge Studies in Advanced Mathematics

[25] M. Marshall. (2008) Positive Polynomials and Sums of Squares , Mathematical Surveys and Monographs Volume: 146

[26] J. Pennington  & Y. Bahri. (2017) Geometry of Neural Network Loss Surfaces via Random Matrix Theory Proceedings of the 34th International Conference on Machine Learning, PMLR 70:2798–2806.

[27] Y. Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. (2014) DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 1701–1708.

[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. (2013) Playing atari with deep reinforcement learning. arXiv:1312.5602 (2013).

[29] S. Watanabe. (2009) Algebraic Geometry and Statistical Learning Theory. Cambridge University Press.

[30] X. Zhang and Y. LeCun. (2016) Text Understanding from Scratch. arXiv preprint arXiv:1502.01710v5 (2016).

Appendix of Reconstruction of training samples from loss functions.

### 4.1 Proof of Theorem 2.1

Let be an input of a training sample. Let be the set of virtual polynomial of the input . We show that and equations define the loss surface. Put

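$$\begin{aligned}
W\bigl(u^{(p)}_{(i,k)}\bigr) &= \{w \in \mathbb{R}^N \mid u^{(p)}_{(i,k)}(w) = 0\},\\
W_{+}\bigl(u^{(p)}_{(i,k)}\bigr) &= \{w \in \mathbb{R}^N \mid u^{(p)}_{(i,k)}(w) > 0\},\\
W_{-}\bigl(u^{(p)}_{(i,k)}\bigr) &= \{w \in \mathbb{R}^N \mid u^{(p)}_{(i,k)}(w) < 0\}.
\end{aligned}$$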

Take two weights and from , where . The ReLU activation set associated with and the one associated with are the same. This implies that, if the weights are in the same area, the entry of is a polynomial. Hence, the loss function is a fixed polynomial in this area. The loss surface is defined by in . If the weights are in , put . In this case, we can use the same discussion for . This means that the loss surface is a semi-algebraic set.

### 4.2 Proof of Theorem 2.4

The proof of Theorem 2.1 implies that gives a decomposition. Put , where . We see that, the loss surface is smooth in the domain . Again, by the proof of Theorem 2.1, the loss function is a polynomial in . Then, the loss surface is defined in the form of , where is a polynomial. By the Jacobian criterion, we see that the loss surface is smooth in for any .
We then claim that we can erase the virtual polynomial from the defining inequalities if and only if consist of smooth points. Take a point in . If the point is smooth, we can take the Taylor expansion of . Since is a polynomial at a point in the neighborhood of , the Taylor expansion of will be a polynomial. Hence, we can erase from the defining inequations. Conversely, we assume that we can erase from the defining inequations. Since the loss surface around any point in is defined by a polynomial, the point is smooth. Finally, we show that Sing is also a semi-algebraic set of codimension 1 in . Note that Sing is locally of the form . This implies that Sing is codimension 1 in . We consider the decomposition discussed in the proof of Theorem2.1 again. In each domain, it is fixed that is singular or not, because the function is a polynomial in each domain. This implies that Sing is a semi-algebraic set.

### 4.3 Proof of Theorem 2.6

By the construction of the virtual polynomial, the weights of each layer appear precisely once in each monomial. This implies that the layer-wise degree is equal to .

### 4.4 Proof of Theorem 2.8

We prove the theorem for any connected deep neural network by induction on the number of layers. If , the statement is clear. Assume . First, by Theorem 2.6, virtual polynomials are homogeneous polynomials of layer-wise degree, and the layer-wise degree is realized by assigning the degree $(0, \dots, 0, 1, 0, \dots, 0)$ to the weights on the paths passing from the $k$-th layer to the $(k+1)$-th layer, where $1$ exists in the $k$-th entry. Assume that $u = g_1 \cdots g_n$. Then, by a general theory of commutative algebra, the $g_i$ are homogeneous polynomials of layer-wise degree. We have

$$\sum_{i=1}^{n} \deg(g_i) = \deg(u) = (1, 1, \dots, 1).$$

We may assume that from the first entry to the -th entry, the entry of is 1 and -th entry of is zero . Hence, we may also assume that the -th entry of is 1. Let be the weight on the path passing from the -th node in the -th layer to the -th node in the -th layer. There are the monomials containing in and monomials containing in by the layer-wise degree. After the construction of virtual polynomials, there will be no monomials in containing if . However, implies that has monomials containing for all . This implies that the -th layer of -active neural network has a unique node. Hence, after the construction of virtual polynomials, we obtain , where is the output of the unique node in -th layer. By the general theory of commutative algebra (uniqueness of the decomposition), we obtain . Since we can regard is a virtual polynomial of the subneural network which starts from -th layer induced by the output of -th layer and the same weights, the theorem holds for by inductive hypothesis. This completes the proof of the first statement in Theorem 2.18. The remaining claim follows from the construciton of .

### 4.5 Proof of Corollary 2.10 and Theorem 2.11

By assumption, we know the defining equations of $\mathrm{Sing}(X)$, so we can pick out the linear polynomials among them. We show that, if such a linear polynomial is not a weight parameter, its coefficients are equal to the input of some sample up to scalar multiplication. First, we remark that the coefficients of the virtual polynomials in the second layer are equal to the inputs of some samples. Since the defining polynomials of $\mathrm{Sing}(X)$ include virtual polynomials in the second layer, it is enough to show that any linear polynomial appearing in the defining polynomials is a virtual polynomial in the second layer or a weight parameter. If a linear polynomial appears in the defining polynomials of $\mathrm{Sing}(X)$, it is an irreducible component of a virtual polynomial. By Theorem 2.8, it is a virtual polynomial of the second layer or a weight parameter. This is because, if a linear polynomial that is not a virtual polynomial of the second layer appears as an irreducible component of a virtual polynomial, it must start from the $k$-th layer with one active node and end at the $(k+1)$-th layer with one active node. This is a weight parameter. Hence, we reconstruct the inputs of samples up to scalar multiplication and the weights on the paths from the first layer to the second layer.
Next, We identify the weights on the paths from the -th layer to the -th layer by induction on . By the discussion above, we identified the weights of the first layer, namely . Assume that we identified the weights for the -th layer if . Pick up the polynomials of the defining inequalities of degree such that each monomial in is consist of the weights of except one weight parameter. We can easily see that such weight is on the paths from the -th layer to the -th layer by induction on . Hence, we can also obtain the number of the layers. This completes the proof.

### 4.6 Proof of Proposition 2.12

We note that the ambient space of the loss surface is $\mathbb{R}^N \times \mathbb{R}$. Hence, we can determine the hyperplane going through the $N + 1$ points. This hyperplane will be a linear component of $\mathrm{Sing}(X)$, and we obtain a sample up to scalar multiplication.