1 Introduction
Deep learning is perhaps one of the most sensational scientific and technological developments of recent years. Despite the spectacular success of deep neural networks (NN), which outperform other pattern recognition methods and achieve even superhuman skills in some domains [12, 36, 58], and despite confirmed empirical successes in other areas such as speech recognition [25], optical character recognition [8], and game solving [44, 56], the mathematical understanding of machine learning technology is in its infancy. This is not only unsatisfactory from a scientific, especially mathematical, point of view; it also means that deep learning currently has the character of a black-box method, and its success cannot yet be ensured by a full theoretical explanation. This leads to a lack of acceptance in many areas where interpretability is a crucial issue (like security, cf. [10]) or in applications where one wants to extract new insights from data [61]. Several general mathematical results on neural networks have been available since the 1990s [2, 17, 38, 39, 46, 47, 48], but deep neural networks have special features, in particular superior properties in applications, that still cannot be fully explained by the known results. In recent years, new mathematical insights have been derived for understanding the approximation properties (expressivity) [27, 54] and stability properties [9, 68] of deep neural networks. Several other crucial and challenging questions remain open.
A fundamental one concerns the number of training data required to obtain a good neural network, i.e., one achieving small generalization errors on future data. Classical statistical learning theory splits this error into bias and variance and gives general estimates by means of the so-called VC dimension or Rademacher complexity of the class of neural networks used [55]. However, the currently available estimates of these parameters [26] provide very pessimistic barriers compared to empirical success. In fact, the trade-off between bias and variance is a function of the complexity of a network, which should be estimated by the number of sampling points needed to identify it uniquely. Thus, on the one hand, it is of interest to know which neural networks can be uniquely and stably determined by finitely many training points. On the other hand, unique identifiability is clearly a form of interpretability.
The motivating problem of this paper is the robust and resource-efficient identification of feedforward neural networks. Unfortunately, it is known that identifying a very simple (but general enough) neural network is indeed NP-hard
[7, 33]. Even without invoking fully connected neural networks, recent work [20, 41] showed that even the training of one single neuron (ridge function or single-index model) can exhibit any possible degree of intractability, depending on the distribution of the input. Recent results
[3, 34, 42, 57, 52], on the other hand, are more encouraging: they show that minimizing the square loss of a (deep) neural network does not in general, or asymptotically (for a large number of neurons), suffer from poor local minima, although critical saddle points may still be present.
In this paper we present conditions for a fully nonlinear two-layer neural network to be provably identifiable from a number of samples that depends polynomially on the dimension of the network. Moreover, we prove that our procedure is robust to perturbations. Our result is clearly of a theoretical nature, but it is also fully constructive and easily implementable. To our knowledge, this work is the first that allows provable de-parametrization of the problem of deep network identification, beyond the simpler case of shallow (one hidden layer) neural networks already considered in very recent literature [3, 32, 21, 34, 42, 43, 57, 52]
. For the implementation we require no black-box high-dimensional optimization methods, and no concerns about complex energy loss landscapes need to be addressed; only classical and relatively simple calculus and linear algebra tools are used (mostly function differentiation and singular value decompositions). The results of this paper build upon the work
[20, 21], where the approximation from a finite number of sampling points has already been derived for the single neuron and for one-layer neural networks. The generalization of the approach of the present paper to networks with more than two hidden layers is surprisingly simpler than one may expect; it is in the course of finalization [22], see Section 5 (v) below for some details.
1.1 Notation
Let us collect here some notation used in this paper. Given any integer , we use the symbol to indicate the index set of the first integers. We denote the Euclidean unit ball in , the Euclidean sphere, and
is its uniform probability measure. We denote
the dimensional Euclidean space endowed with the norm . For we often write interchangeably . For a matrix we denote its singular values. We denote the sphere of symmetric matrices of unit Frobenius norm . The spectral norm of a matrix is denoted . Given a closed convex set we denote the orthogonal projection operator onto it (sometimes we use such operators to project onto subspaces of or subspaces of symmetric matrices, or onto balls of such spaces). For vectors we denote the tensor product as the tensor of entries . In the case of two vectors, the tensor product equals the matrix . For any matrix,
(1) 
is its vectorization, the vector created by stacking the columns of .
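As a small concrete illustration (NumPy, with a hypothetical 2×2 matrix), vectorization stacks the columns:

```python
import numpy as np

# vec(A): stack the columns of A into a single vector.
A = np.array([[1.0, 3.0],
              [2.0, 4.0]])   # columns are [1, 2] and [3, 4]
vec_A = A.flatten(order="F") # column-major ("Fortran") order = stacked columns
print(vec_A)                 # -> [1. 2. 3. 4.]
```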
1.2 From one artificial neuron to shallow, and deeper networks
1.2.1 Meet the neuron
The simplest artificial neural network is a network consisting of exactly one artificial neuron, which is modeled by a ridge function (or single-index model) as
(2) 
where is the shifted activation function and the vector expresses the weight of the neuron. Since the beginning of the 1990s [31, 30], there has been a vast mathematical statistics literature on single-index models, which addresses the problem of approximating and possibly also from a finite number of samples of , in order to yield an expected least-squares approximation of on a bounded domain
. Now assume for the moment that we can evaluate the network
at any point in its domain; we refer to this setting as active sampling. As we aim at uniform approximations, we adhere here to the language of recent results about the sampling complexity of ridge functions from the approximation theory literature, e.g., [13, 20, 41]. In those papers, the identification of the neuron is performed by approximate differentiation. Let us clarify how this method works, as it will inspire the further developments below. For any , points , , and differentiation directions , we have
(3) 
Hence, differentiation exposes the weight of a neuron and allows one to test it against test vectors . The approximate relationship (3) forms, for every fixed index , a linear system of dimensions whose unknown is . Solving the systems approximately and independently for yields multiple approximations of the weight, the most stable of them with respect to the approximation error in (3) being the one for which is maximal. Once is learned, one can easily construct a function by approximating on further sampling points. Under smoothness assumptions on the activation function , for , , and compressibility of the weight, i.e., is small for , one can use sampling points of the function and the approach sketched above to construct a function such that
(4) 
In particular, the result constructs the approximation of the neuron with an error that has polynomial rate with respect to the number of samples, depending on the smoothness of the activation function and the compressibility of the weight vector . The dependence on the input dimension is only logarithmic. To take advantage of the compressibility of the weight, compressive sensing [23] is a key tool for solving the linear systems (3). In [13] such an approximation result was obtained by an active and deterministic choice of the input points . In order to relax the use of active sampling somewhat, a random sampling of the points was proposed in [20], where the resulting error estimate holds with high probability. The assumption is crucial, since it was pointed out in [20, 41] that any level of tractability (polynomial complexity) and intractability (superpolynomial complexity) of the problem may be exhibited otherwise.
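To make the differentiation approach of (3) concrete, here is a minimal sketch (not the algorithm of [13, 20]): a toy neuron with g = tanh and a random unit weight vector, whose direction is recovered from finite differences of f at a single point. All parameter choices below are illustrative assumptions.

```python
import numpy as np

# Toy single neuron f(x) = g(a . x) with g = tanh and a random unit weight a.
rng = np.random.default_rng(0)
d = 20
a = rng.standard_normal(d)
a /= np.linalg.norm(a)
f = lambda x: np.tanh(x @ a)

# Finite differences along coordinate directions approximate g'(a . x0) * a,
# cf. (3); here x0 = 0 and the differentiation directions are the e_i.
eps = 1e-5
x0 = np.zeros(d)
grad = np.array([(f(x0 + eps * e) - f(x0)) / eps for e in np.eye(d)])

# Normalizing recovers a up to the sign of g'(a . x0) (positive for tanh at 0).
a_hat = grad / np.linalg.norm(grad)
print(np.linalg.norm(a_hat - a))  # small residual
```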
1.2.2 Shallow networks: the onelayer case
Combining several neurons leads to richer function classes [38, 39, 46, 47, 48]. A neural network with one hidden layer and one output is simply a weighted sum of neurons whose activation functions differ only by a shift, i.e.,
(5) 
where and for all . Sometimes it will be convenient below to use the more compact notation where and (below, with slight abuse of notation, we may use the symbol also for the span of the weights ). In contrast to the case of the single neuron, the use of first-order differentiation
(6) 
may furnish information about (active subspace identification [14, 15], see also [20, Lemma 2.1]), but it does not yet allow one to extract information about the individual weights . For that, higher-order information is needed. Recent work shows that the identification of a network (5) can be related to tensor decompositions [1, 32, 21, 43]. As pointed out in Section 1.2.1, differentiation exposes the weights. In fact, one way to relate the network to tensors and tensor decompositions is via higher-order differentiation, in which case the tensor takes the form
which requires that the ’s are sufficiently smooth. In a setting where the samples are actively chosen, it is generally possible to approximate these derivatives by finite differences. However, even for passive sampling there are ways to construct similar tensors [32, 21], which rely on Stein’s lemma [59] or on integration by parts (weak differentiation). Let us explain how passive sampling in this setting may be used to obtain tensor representations of the network. If the probability measure of the sampling points ’s has known (or approximately known [18]) density with respect to the Lebesgue measure, i.e., , then we can approximate the expected value of higher-order derivatives by using exclusively point evaluations of . This follows from
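The underlying identity is Stein's lemma: for a standard Gaussian input X, E[f(X) X] = E[∇f(X)], so expectations of derivatives can be estimated from point evaluations alone. A minimal Monte Carlo check, with an illustrative single-neuron f (all choices here are assumptions for the demo, not the paper's construction):

```python
import numpy as np

# Numerical check of Stein's lemma E[f(X) X] = E[grad f(X)], X ~ N(0, I_d).
rng = np.random.default_rng(1)
d, n = 5, 200_000
a = np.array([0.6, 0.8, 0.0, 0.0, 0.0])          # illustrative weight vector
f = lambda X: np.tanh(X @ a)                      # f(x) = tanh(a . x)
grad = lambda X: (1 - np.tanh(X @ a) ** 2)[:, None] * a  # exact gradient

X = rng.standard_normal((n, d))
lhs = (f(X)[:, None] * X).mean(axis=0)            # E[f(X) X], from values only
rhs = grad(X).mean(axis=0)                        # E[grad f(X)]
print(np.linalg.norm(lhs - rhs))                  # small Monte Carlo error
```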
In the work [32], decompositions of third-order symmetric tensors [1, 35, 51] have been used for the weight identification of one-hidden-layer neural networks. Beyond the classical results about principal Hessian directions [37], it is shown in [21] that using second derivatives actually suffices, and the corresponding error estimates reflect positively the lower order and the potential for improved stability, see e.g. [16, 28, 29]. The main part of the present work is an extension of the latter approach, and therefore we give a short summary of it with emphasis on active sampling, which is the sampling method assumed in this paper. The first step of the approach in [21] takes advantage of (6) to reduce the dimensionality of the problem from to .
Reduction to the active subspace.
Before stating the core procedure, we want to introduce a simple, optional method that can help to reduce the problem complexity in practice. Assume takes the form (5), where and the are linearly independent.
From a numerical perspective, the input dimension of the network plays a relevant role in the complexity of the procedure. For this reason, in [21] the input dimension is effectively reduced to the number of neurons in the first hidden layer. With this reasoning, in the sections that follow we also consider networks whose input dimension matches the number of neurons of the first hidden layer.
Assume for the moment that the active subspace is known. Let us choose any orthonormal basis of and arrange it as the columns of a matrix . Then
which can be used to define a new network
(7) 
whose weights are , while all the other parameters remain unchanged. Note that , and therefore can be recovered from . In summary, if the active subspace of is approximately known, then we can construct such that the identification of and of are equivalent. This allows us to reduce the problem to the identification of instead of , under the condition that we approximate well enough [21, Theorem 1.1]. As recalled in (6), we can easily produce approximations to vectors in by approximate first-order differentiation of the original network and, in an ideal setting, generating linearly independent gradients would suffice to approximate . However, in general there is no way to ensure such linear independence a priori, and we have to account for the error caused by approximating gradients by finite differences. Under suitable assumptions on (see the full-rank condition on the matrix defined in (8) below) and using Algorithm 1, we obtain the following approximation result.
Theorem 1 ([21, Theorem 2.2]).
Assume the vectors are linearly independent and of unit norm. Additionally, assume that the ’s are sufficiently smooth. Let be constructed as described in Algorithm 1 by sampling values of . Let , and assume that the matrix
(8)  
has full rank, i.e., its th singular value fulfills . Then
with probability at least , where are absolute constants depending on the smoothness of ’s.
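The gradient-based subspace estimation behind this result can be sketched as follows: finite-difference gradients of f lie (up to discretization error) in the span of the weights, so an SVD of a few sampled gradients recovers the active subspace. This is a hedged illustration with a toy one-layer network, not the paper's Algorithm 1; all dimensions and parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, m = 30, 4                                       # input dimension, neurons
A = np.linalg.qr(rng.standard_normal((D, m)))[0]   # unit-norm weights a_i
f = lambda x: np.tanh(x @ A).sum()                 # f(x) = sum_i tanh(a_i . x)

def fd_grad(f, x, eps=1e-5):
    # Central finite-difference gradient of f at x.
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(len(x))])

# Gradients at random points on a small sphere, stacked as columns; each lies
# in span{a_1, ..., a_m} since grad f(x) = sum_i tanh'(a_i . x) a_i.
G = np.column_stack([fd_grad(f, 0.5 * v / np.linalg.norm(v))
                     for v in rng.standard_normal((3 * m, D))])
U = np.linalg.svd(G, full_matrices=False)[0][:, :m]  # basis of estimated span

# The projection U U^T should act as the identity on every weight a_i.
print(np.linalg.norm(U @ (U.T @ A) - A))             # close to zero
```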
Identifying the weights.
As clarified in the previous section, from now on we can assume without loss of generality. Let be a network of the type (5), with three-times differentiable activation functions and independent weights of unit norm. Then has second derivative
(9) 
whose expression represents a non-orthogonal rank-1 decomposition of the Hessian. The idea is first to modify the network by an ad hoc linear transformation (whitening) of the input
(10) 
in such a way that forms an orthonormal system. The computation of can be performed by spectral decomposition of any positive definite matrix
In fact, from the spectral decomposition of , we define (see [21, Theorem 3.7]). This procedure is called whitening; it allows one to reduce the problem to networks with nearly orthogonal weights, and it presupposes that has been obtained. By using (9) and an approach similar to Algorithm 1 (one simply substitutes the approximate gradients there with vectorized approximate Hessians), one can compute under the assumption that the second-order matrix
is of full rank, where is the vectorization of the Hessian .
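A minimal numerical sketch of the whitening step, under the assumption that a positive definite matrix of the form M = sum_i w_i w_i^T is available (in the actual procedure it is estimated from the network; here the weights are simulated): the transform M^(-1/2), computed from the spectral decomposition, maps the weights to an orthonormal system.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
V = rng.standard_normal((d, d))          # columns = simulated weights w_i
V /= np.linalg.norm(V, axis=0)           # normalize to unit vectors

M = V @ V.T                              # sum_i w_i w_i^T, positive definite
evals, U = np.linalg.eigh(M)             # spectral decomposition M = U D U^T
W = U @ np.diag(evals ** -0.5) @ U.T     # whitening transform W = M^(-1/2)

V_white = W @ V                          # whitened weights
# V_white V_white^T = W M W = I, so the whitened weights are orthonormal.
print(np.linalg.norm(V_white @ V_white.T - np.eye(d)))  # close to zero
```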
After whitening one could assume without loss of generality that the vectors are nearly orthonormal in the first place. Hence the representation (9) would be a near spectral decomposition of the Hessian and the components
would represent the approximate eigenvectors. However, the numerical stability of spectral decompositions is ensured only under spectral gaps
[50, 4]. In order to maximally stabilize the approximation of the ’s, one seeks matrices with the maximal spectral gap between the first and second largest eigenvalues. These are the maximizers of the following nonconvex program
(11) 
where and are the spectral and Frobenius norms, respectively. This program can be solved by a suitable projected gradient ascent, see for instance [21, Algorithm 3.4] and Algorithm 3 below; any resulting maximizer has its eigenvector associated with the largest eigenvalue in absolute value close to one of the ’s. Once approximations to all the ’s are retrieved, it is not difficult to identify the activation functions , see [21, Algorithm 4.1, Theorem 4.1]. The recovery of the network by this algorithmic pipeline is summarized in the following statement.
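The projected gradient ascent for a program of the form (11) can be sketched as follows. We simulate orthonormal weights (the post-whitening situation), take the span of the rank-1 matrices w_i w_i^T as the matrix space, and iterate an ascent step on the Frobenius sphere. This is an illustrative reconstruction under these assumptions, not [21, Algorithm 3.4] verbatim.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
Wts = np.linalg.qr(rng.standard_normal((d, d)))[0]         # orthonormal w_i
basis = np.stack([np.outer(w, w).ravel() for w in Wts.T])  # vec(w_i w_i^T)

def proj_W(M):
    # Orthogonal projection onto span{w_i w_i^T}; the basis is orthonormal
    # since <w_i w_i^T, w_j w_j^T> = (w_i . w_j)^2 = delta_ij.
    return (basis.T @ (basis @ M.ravel())).reshape(d, d)

M = proj_W(rng.standard_normal((d, d)))
M /= np.linalg.norm(M)                   # start on the Frobenius sphere
for _ in range(200):                     # projected gradient ascent
    evals, evecs = np.linalg.eigh(M)
    k = np.argmax(np.abs(evals))         # top eigenpair in absolute value
    u, s = evecs[:, k], np.sign(evals[k])
    M = proj_W(M + 2.0 * s * np.outer(u, u))  # ascent step (step size 2.0)
    M /= np.linalg.norm(M)               # renormalize to unit Frobenius norm

# The top eigenvector of the maximizer aligns with one w_i, up to sign.
evals, evecs = np.linalg.eigh(M)
u = evecs[:, np.argmax(np.abs(evals))]
print(np.max(np.abs(Wts.T @ u)))         # close to 1
```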
Theorem 2 ([21, Theorem 1.2]).
Let be a real-valued function defined on a neighborhood of , which takes the form
for . Let be three times continuously differentiable on a neighborhood of for all , and let be linearly independent. We additionally assume that both and are of maximal rank . Then, for all (the step size employed in the computation of finite differences), using at most random exact point evaluations of , the nonconvex program (11) constructs approximations of the weights, up to a sign change, for which
(12) 
with probability at least , for a suitable constant intervening (together with some fixed power of ) in the asymptotic constant of the approximation (12). Moreover, once the weights are retrieved, one constructs an approximating function of the form
such that
(13) 
While this result has been generalized to the case of passive sampling in [21], and through whitening allows for the identification of non-orthogonal weights, it is restricted to the case of and of linearly independent weights .
The main goal of this paper is to generalize this approach to account both for the identification of fully nonlinear two-hidden-layer neural networks and for the case where and the weights are not necessarily nearly orthogonal or even linearly independent (see Remark 2 below).
1.2.3 Deeper networks: the two layer case
What follows extends the theory discussed in the previous sections to a wider class of functions, namely neural networks with two hidden layers. In doing so, we also address a relevant open problem stated in [21], namely the identification of shallow neural networks whose number of neurons is larger than the input dimension. First, we need a precise definition of the architecture of the neural networks we intend to consider.
Definition 3.
Let , and , be sets of unit vectors, and denote , . Let and be univariate functions, and denote . We define
(14) 
with , , and satisfying

(A1) ,

(A2) a frame condition for the system with , i.e., there exist constants such that
(15) for all ,

(A3) the derivatives of and are uniformly bounded according to
(16)
Sometimes it will be convenient below to use the more compact notation where , . In the previous section we presented a dimension reduction that can be applied to one-layer neural networks, and which can be used to reduce the dimensionality from the input dimension to the number of neurons of the first layer. The same approach can be applied to networks in the class . For the approximation error of the active subspace, we end up with the following corollary of Theorem 1.
Corollary 1 (cf. Theorem 1).
Assume that and let be constructed as described in Algorithm 1 by sampling values of . Let , and assume that the th singular value of fulfills . Then we have
with probability at least and constants that depend only on for .
From now on we assume .
2 Approximating the span of tensors of weights
In the one-layer case described earlier, the unique identification of the weights is made possible by constructing a matrix space whose rank-1 basis elements are outer products of the weight profiles of the network. This section illustrates the extension of this approach beyond shallow neural networks. Once again we make use of differentiation, and overall there are many parallels to the approach in [21]. However, the intuition behind the matrix space is less straightforward, because we can no longer directly express the second derivative of a two-layer network as a linear combination of symmetric rank-1 matrices. This is due to the fact that the Hessian matrix of such a network has the form
Therefore, , which has dimension
and is in general not spanned by symmetric rank-1 matrices. This expression is indeed quite complicated, due to the chain rule and the mixed tensor contributions that appear as a consequence. At first glance it would seem impossible to use an approach similar to the one for shallow neural networks recalled in the previous section. Nevertheless, a relatively simple algebraic manipulation allows one to recognize some useful structure: For a fixed
we rearrange the expression as

which is a combination of symmetric rank-1 matrices, since . We write the latter expression more compactly by introducing the notation
(17) 
where and
(18)  
(19)  
(20) 
Let us now introduce the fundamental matrix space
(21) 
where are the weight profiles of the first layer and
for all encode entangled information about . For this reason, we call the ’s entangled weights. Let us stress at this point that the definition and the constructive approximation of the space is perhaps the most crucial and relevant contribution of this paper. In fact, by inspecting the expression (17) carefully, we immediately notice that , and also that the first sum in (17), namely , lies in for all . Moreover, for arbitrary sampling points , deviations of from are due only to the second term in (17). The intuition is that for suitably centered distributions of the sampling points ’s, so that and , the Hessians distribute themselves around the space , see Figure 1 for a two-dimensional sketch of the geometrical situation. Hence, we may attempt an approximation of by PCA of a collection of such approximate Hessians. Practically, by active sampling (targeted evaluations of the network ), we first construct estimates of the Hessian matrices by finite differences (see Section 2.1) at sampling points drawn independently from a suitable distribution . Next, we define the matrix
whose columns are the vectorizations of the approximate Hessians. Finally, we produce the approximation of as the span of the first left singular vectors of the matrix . The whole procedure for calculating is given in Algorithm 2. It should be clear that the choice of plays a crucial role in the quality of this method. In the analysis that follows, we focus on distributions that are centered and concentrated. Figure 1 helps to form a better geometrical intuition of the result of the procedure. It shows the region covered by the Hessians, indicated by the light blue area, which envelopes the space in a sort of nonlinear, nonconvex cone originating from . In general, the Hessians do not concentrate around symmetrically, which means that the “center of mass” of the Hessians can never be perfectly aligned with the space , regardless of the number of samples. In this analogy, the center of mass corresponds to the space estimated by Algorithm 2, which is essentially a non-centered principal component analysis of the observed Hessian matrices. The primary result of this section is Theorem
4, which provides an estimate of the approximation error of Algorithm 2 depending on the subgaussian norm of the sample distribution and the number of neurons in the respective layers. More precisely, this result gives a precise worst-case estimate of the error caused by the imbalance of mass. For the reasons mentioned above, the error does not necessarily vanish with an increasing number of samples, but the probability with which the statement holds tends to . In Figure 1, the estimated region is illustrated by the gray cones, which show the maximal, worst-case deviation of . One crucial condition for Theorem 4 to hold is that there exists an such that
(22) 
This assumption ensures that the space spanned by the observed Hessians has, in expectation, at least dimension . Aside from this technical aspect, the condition implicitly helps to avoid network configurations that are reducible, for which certain weights cannot be recovered. For example, we can define a network in with weights given by
It is easy to see that will never be used during a forward pass through the network, which makes it impossible to recover from the output of the network.
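The mechanics of Algorithm 2 can be sketched as follows: estimate Hessians of f by second-order central finite differences at sampling points drawn from a concentrated distribution, stack their vectorizations as columns, and keep the leading left singular vectors. The two-layer network, the subspace dimension k, and all parameters below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 4, 3
A = np.linalg.qr(rng.standard_normal((d, m)))[0]   # first-layer weights
b = rng.standard_normal(m)
f = lambda x: np.tanh(np.tanh(x @ A) @ b)          # toy two-hidden-layer net

def fd_hessian(f, x, eps=1e-4):
    # Second-order central finite differences for the full Hessian at x.
    n = len(x)
    H = np.zeros((n, n))
    E = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + eps * (E[i] + E[j])) - f(x + eps * (E[i] - E[j]))
                       - f(x - eps * (E[i] - E[j]))
                       + f(x - eps * (E[i] + E[j]))) / (4 * eps ** 2)
    return (H + H.T) / 2                           # symmetrize

# Sampling points concentrated near the origin (small subgaussian norm).
X = 0.1 * rng.standard_normal((50, d))
M = np.column_stack([fd_hessian(f, x).ravel() for x in X])

k = m                                              # assumed target dimension
U = np.linalg.svd(M, full_matrices=False)[0][:, :k]
print(U.shape)  # (16, 3): vectorized basis of the estimated subspace
```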
In the theorem below and in the proofs that follow we will make use of the subgaussian norm
of a random variable. This quantity measures how fast the tails of a distribution decay, and such decay plays an important role in several concentration inequalities. More generally, for
, the norm of a scalar random variable is defined as

For a random vector on , the norm is given by
The random variables for which are called subexponential, and those for which are called subgaussian. More generally, the Orlicz space consists of all real random variables on the probability space with finite norm, and its elements are called subexponential random variables. Below we mainly focus on subgaussian random variables. In particular, every bounded random variable is subgaussian, which covers all the cases discussed in this work. We refer to [66]
for more details. One example of a subgaussian distribution is the uniform distribution on the unit sphere, which has subgaussian norm .

Theorem 4.
Let be a neural network within the class described in Definition 3 and consider the space as defined in (21). Assume that is a probability measure with , , and that there exists an such that
(23) 
Then, for any , Algorithm 2 returns a projection that fulfills
(24) 
for a suitable subspace (we can actually assume that according to Remark 1 below) with probability at least
where
are absolute constants and are constants depending on the constants for .
Remark 1.
As already mentioned above, for we have . In this case the error bound (24) behaves like