Robust and Resource Efficient Identification of Two Hidden Layer Neural Networks

We address the structure identification and the uniform approximation of two hidden layer, fully nonlinear neural networks of the type f(x) = 1^T h(B^T g(A^T x)) on R^d from a small number of query samples. We approach the problem by actively sampling finite difference approximations to Hessians of the network. Gathering several approximate Hessians allows us to reliably approximate the matrix subspace W spanned by the symmetric tensors a_1 ⊗ a_1, ..., a_{m_0} ⊗ a_{m_0} formed by the weights of the first layer, together with the entangled symmetric tensors v_1 ⊗ v_1, ..., v_{m_1} ⊗ v_{m_1} formed by suitable combinations of the weights of the first and second layer, namely v_ℓ = A G_0 b_ℓ / ‖A G_0 b_ℓ‖_2, ℓ ∈ [m_1], for a diagonal matrix G_0 depending on the activation functions of the first layer. The identification of the rank-1 symmetric tensors within W is then performed by solving a robust nonlinear program. We provide guarantees of stable recovery under a posteriori verifiable conditions. We further address the correct attribution of approximate weights to the first or second layer. By using a suitably adapted gradient descent iteration, it is then possible to estimate, up to intrinsic symmetries, the shifts of the activation functions of the first layer and to compute exactly the matrix G_0. Our method of identification of the weights of the network is fully constructive, with quantifiable sample complexity, and therefore contributes to reducing the black-box nature of the network training phase. We corroborate our theoretical results by extensive numerical experiments.


1 Introduction

Deep learning is perhaps one of the most sensational scientific and technological developments in industry of recent years. Despite the spectacular success of deep neural networks (NN), which outperform other pattern recognition methods and achieve even superhuman skills in some domains [12, 36, 58], with confirmed empirical successes in other areas such as speech recognition [25], optical character recognition [8], and game solving [44, 56], the mathematical understanding of the technology of machine learning is in its infancy. This is not only unsatisfactory from a scientific, especially mathematical, point of view, but it also means that deep learning currently has the character of a black-box method and its success cannot yet be ensured by a full theoretical explanation. This leads to a lack of acceptance in many areas where interpretability is a crucial issue (like security, cf. [10]) or in applications where one wants to extract new insights from data [61].

Several general mathematical results on neural networks have been available since the 90's [2, 17, 38, 39, 46, 47, 48], but deep neural networks have special features and in particular superior properties in applications that still cannot be fully explained from the known results. In recent years, new interesting mathematical insights have been derived for understanding approximation properties (expressivity) [27, 54] and stability properties [9, 68] of deep neural networks. Several other crucial and challenging questions remain open.

A fundamental one is about the number of training data required to obtain a good neural network, i.e., one achieving small generalization errors for future data. Classical statistical learning theory splits this error into bias and variance and gives general estimates by means of the so-called VC-dimension or Rademacher complexity of the used class of neural networks [55]. However, the currently available estimates of these parameters [26] provide very pessimistic barriers in comparison to the empirical success. In fact, the tradeoff between bias and variance is a function of the complexity of a network, which should be estimated by the number of sampling points needed to identify it uniquely. Thus, on the one hand, it is of interest to know which neural networks can be uniquely determined in a stable way by finitely many training points. On the other hand, unique identifiability is clearly a form of interpretability.

The motivating problem of this paper is the robust and resource efficient identification of feed forward neural networks. Unfortunately, it is known that identifying a very simple (but general enough) neural network is indeed NP-hard [7, 33]. Even without invoking fully connected neural networks, recent work [20, 41] showed that even the training of one single neuron (ridge function or single index model) can exhibit any possible degree of intractability, depending on the distribution of the input. Recent results [3, 34, 42, 57, 52], on the other hand, are more encouraging, and show that minimizing a square loss of a (deep) neural network does not, in general or asymptotically (for a large number of neurons), have poor local minima, although it may retain the presence of critical saddle points.

In this paper we present conditions for a fully nonlinear two-layer neural network to be provably identifiable with a number of samples that depends polynomially on the dimension of the network. Moreover, we prove that our procedure is robust to perturbations. Our result is clearly of a theoretical nature, but it is also fully constructive and easily implementable. To our knowledge, this work is the first that allows a provable de-parametrization of the problem of deep network identification, beyond the simpler case of shallow (one hidden layer) neural networks already considered in very recent literature [3, 32, 21, 34, 42, 43, 57, 52]. For the implementation we do not require black-box high dimensional optimization methods, and no concerns about complex energy loss landscapes need to be addressed; only classical and relatively simple calculus and linear algebra tools are used (mostly function differentiation and singular value decompositions). The results of this paper build upon the work [20, 21], where the approximation from a finite number of sampling points has already been derived for the single neuron and for one hidden layer neural networks. The generalization of the approach of the present paper to networks with more than two hidden layers is surprisingly simpler than one may expect, and it is in the course of finalization [22]; see Section 5 (v) below for some details.

1.1 Notation

Let us collect here some notation used in this paper. Given any integer m, we use the symbol [m] = {1, ..., m} for the index set of the first m integers. We denote by B_1^d the Euclidean unit ball in R^d, by S^{d-1} the Euclidean unit sphere, and by μ_{S^{d-1}} its uniform probability measure. We denote by R^d the d-dimensional Euclidean space endowed with the Euclidean norm ‖x‖_2 = (∑_{j=1}^d x_j^2)^{1/2}; for x ∈ R^d we often write indifferently ‖x‖ = ‖x‖_2. For a matrix M we denote by σ_k(M) its k-th singular value. We denote by 𝒮 the sphere of symmetric matrices of unit Frobenius norm ‖·‖_F. The spectral norm of a matrix is denoted ‖·‖. Given a closed convex set C we denote by P_C the orthogonal projection operator onto C (sometimes we use such operators to project onto subspaces of R^d or subspaces of symmetric matrices or onto balls of such spaces). For vectors x_1, ..., x_k ∈ R^d we denote their tensor product x_1 ⊗ ... ⊗ x_k as the tensor of entries (x_{1,i_1} ⋯ x_{k,i_k})_{i_1,...,i_k}. For the case k = 2 the tensor product of two vectors equals the matrix x_1 ⊗ x_2 = x_1 x_2^T. For any matrix M ∈ R^{m × n},

(1)   vec(M) = (m_{11}, m_{21}, ..., m_{m1}, m_{12}, ..., m_{mn})^T ∈ R^{mn}

is its vectorization, which is the vector created by stacking the columns of M.

1.2 From one artificial neuron to shallow and deeper networks

1.2.1 Meet the neuron

The simplest artificial neural network is a network consisting of exactly one artificial neuron, which is modeled by a ridge function (or single-index model) as

(2)   f(x) = φ(⟨a, x⟩) = g(⟨a, x⟩ + θ),

where φ is the shifted activation function and the vector a ∈ R^d expresses the weight of the neuron. Since the beginning of the 90's [31, 30], there is a vast mathematical statistics literature about single-index models, which addresses the problem of approximating a, and possibly also φ, from a finite number of samples of f, to yield an expected least-squares approximation of f on a bounded domain. Now assume for the moment that we can evaluate the network f at any point in its domain; we refer to this setting as active sampling. As we aim at uniform approximations, we adhere here to the language of recent results about the sampling complexity of ridge functions from the approximation theory literature, e.g., [13, 20, 41]. In those papers, the identification of the neuron is performed by using approximate differentiation. Let us clarify how this method works, as it will be of inspiration for the further developments below. For suitable sampling points x_i ∈ R^d, i ∈ [m_X], a stepsize ε > 0, and differentiation directions u_j ∈ R^d, j ∈ [m_Φ], we have

(3)   (f(x_i + ε u_j) − f(x_i)) / ε ≈ ∂f/∂u_j(x_i) = φ'(⟨a, x_i⟩) ⟨a, u_j⟩.

Hence, differentiation exposes the weight of the neuron and allows to test it against the test vectors u_j. The approximate relationship (3) forms, for every fixed index i, a linear system of dimensions m_Φ × d whose unknown is the vector φ'(⟨a, x_i⟩) a. Solving approximately and independently the systems for i ∈ [m_X] yields multiple approximations of the weight (up to scaling); the most stable of them with respect to the approximation error in (3) is the one for which |φ'(⟨a, x_i⟩)| is maximal. Once a is learned, one can easily construct an approximation of f by approximating φ on further sampling points. Under assumptions of smoothness of the activation function φ and of compressibility of the weight, i.e., ‖a‖_q small for some 0 < q ≤ 1, by using sufficiently many sampling points of the function and the approach sketched above, one can construct a function f̂ such that

(4)

In particular, the result constructs the approximation of the neuron with an error which has polynomial rate with respect to the number of samples, depending on the smoothness of the activation function and the compressibility of the weight vector a. The dependence on the input dimension d is only logarithmic. To take advantage of the compressibility of the weight, compressive sensing [23] is a key tool to solve the linear systems (3). In [13] such an approximation result was obtained by an active and deterministic choice of the input points x_i. In order to relax a bit the usage of active sampling, in the paper [20] a random sampling of the points x_i has been proposed, and the resulting error estimate holds with high probability. The latter assumption is somehow crucial, since it was pointed out in [20, 41] that any level of tractability (polynomial complexity) and intractability (super-polynomial complexity) of the problem may be exhibited otherwise.
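
To make this procedure concrete, the following is a minimal numerical sketch of the mechanism behind (3); it is not the algorithm of [13, 20]: the function names, the tanh activation, and the plain least squares solve (instead of compressive sensing) are our own illustrative choices.

```python
import numpy as np

def neuron(x, a, phi):
    """Single neuron f(x) = phi(<a, x>)."""
    return phi(x @ a)

def recover_neuron_weight(f, d, n_dirs=200, eps=1e-5, x0=None, rng=None):
    """Estimate the weight a (up to sign and scale) from active samples of f,
    using the finite difference relation (3) at a single base point x0."""
    rng = np.random.default_rng(rng)
    x0 = np.zeros(d) if x0 is None else x0
    dirs = rng.standard_normal((n_dirs, d)) / np.sqrt(d)    # test directions u_j
    # y_j ~= phi'(<a, x0>) <a, u_j>, from first order finite differences
    y = np.array([(f(x0 + eps * u) - f(x0)) / eps for u in dirs])
    # least squares solution of dirs @ g = y, so g ~= phi'(<a, x0>) a
    g, *_ = np.linalg.lstsq(dirs, y, rcond=None)
    return g / np.linalg.norm(g)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 50
    a = rng.standard_normal(d); a /= np.linalg.norm(a)
    f = lambda x: neuron(x, a, np.tanh)                 # smooth activation, phi'(0) = 1
    a_hat = recover_neuron_weight(f, d, rng=1)
    print("alignment |<a_hat, a>| =", abs(a_hat @ a))   # close to 1
```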

1.2.2 Shallow networks: the one-layer case

Combining several neurons leads to richer function classes [38, 39, 46, 47, 48]. A neural network with one hidden layer and one output is simply a weighted sum of neurons whose activation functions only differ by a shift, i.e.,

(5)   f(x) = ∑_{i=1}^m g_i(⟨a_i, x⟩),

where g_i(t) = g(t + θ_i) and a_i ∈ R^d for all i ∈ [m]. Sometimes it may be convenient below to use the more compact writing f(x) = 1^T g(A^T x), where A = [a_1 | ... | a_m] and g(y) = (g_1(y_1), ..., g_m(y_m))^T. (With slight abuse of notation, we may also use the symbol A below for the span of the weights a_1, ..., a_m.) Differently from the case of the single neuron, the use of first order differentiation,

(6)   ∇f(x) = ∑_{i=1}^m g_i'(⟨a_i, x⟩) a_i ∈ span{a_1, ..., a_m},

may furnish information about the span of the weights (active subspace identification [14, 15], see also [20, Lemma 2.1]), but it does not yet allow to extract information about the single weights a_i. For that, higher order information is needed. Recent work shows that the identification of a network (5) can be related to tensor decompositions [1, 32, 21, 43]. As pointed out in Section 1.2.1, differentiation exposes the weights. In fact, one way to relate the network to tensors and tensor decompositions is given by higher order differentiation. In this case the tensor takes the form

∇^k f(x) = ∑_{i=1}^m g_i^{(k)}(⟨a_i, x⟩) a_i ⊗ ... ⊗ a_i   (k factors),

which requires that the g_i's are sufficiently smooth. In a setting where the samples are actively chosen, it is generally possible to approximate these derivatives by finite differences. However, even for passive sampling there are ways to construct similar tensors [32, 21], which rely on Stein's lemma [59] or differentiation by parts or weak differentiation. Let us explain how passive sampling in this setting may be used for obtaining tensor representations of the network. If the probability measure of the sampling points is μ_X with known (or approximately known [18]) density p with respect to the Lebesgue measure, i.e., dμ_X(x) = p(x) dx, then we can approximate the expected value of higher order derivatives by using exclusively point evaluations of f. This follows from the integration by parts identity (assuming suitable decay of p at infinity)

E[ f(X) ∇^k p(X) / p(X) ] = ∫ f(x) ∇^k p(x) dx = (−1)^k ∫ ∇^k f(x) p(x) dx = (−1)^k E[ ∇^k f(X) ].
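
For a standard Gaussian density p one has ∇²p(x)/p(x) = x x^T − Id, so the identity above (for k = 2) states that E[f(X)(X X^T − Id)] = E[∇²f(X)]. The following small sketch (our own illustrative choices of network, shift, and sample size) checks this numerically using only point values of f:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 3, 500_000
A = np.linalg.qr(rng.standard_normal((d, m)))[0]      # unit norm weights a_i
theta = 0.3                                           # common activation shift

X = rng.standard_normal((n, d))                       # passive Gaussian samples
Y = X @ A + theta
fvals = np.tanh(Y).sum(axis=1)                        # f(X_k) = sum_i g(<a_i, X_k> + theta)

# Stein / integration by parts estimator: uses only the point values f(X_k)
T_stein = (X * fvals[:, None]).T @ X / n - fvals.mean() * np.eye(d)

# direct Monte Carlo average of the exact Hessians, for comparison only
g2 = -2 * np.tanh(Y) * (1 - np.tanh(Y) ** 2)          # g''(<a_i, X_k> + theta)
T_hess = (A * g2.mean(axis=0)) @ A.T

print(np.linalg.norm(T_stein - T_hess) / np.linalg.norm(T_hess))  # small (Monte Carlo error)
```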

In the work [32] decompositions of third order symmetric tensors (k = 3) [1, 35, 51] have been used for the weight identification of one hidden layer neural networks. Instead, beyond the classical results about principal Hessian directions [37], in [21] it is shown that using second derivatives actually suffices, and the corresponding error estimates reflect positively the lower differentiation order and the potential for improved stability, see e.g. [16, 28, 29]. The main part of the present work is an extension of the latter approach, and therefore we will give a short summary of it, with emphasis on active sampling, which will be assumed in this paper as the sampling method. The first step of the approach in [21] is taking advantage of (6) to reduce the dimensionality of the problem from the input dimension d to the number m of neurons.

Reduction to the active subspace.

Before stating the core procedure, we want to introduce a simple and optional method, which can help to reduce the problem complexity in practice. Assume f takes the form (5), where m ≤ d and the weights a_1, ..., a_m are linearly independent. From a numerical perspective the input dimension d of the network plays a relevant role in terms of complexity of the procedure. For this reason, in [21] the input dimension is effectively reduced to the number of neurons in the first hidden layer. With this reasoning, in the sections that follow we also consider networks where the input dimension matches the number of neurons of the first hidden layer.

Assume for the moment that the active subspace A = span{a_1, ..., a_m} is known. Let us choose any orthonormal basis of A and arrange it as the columns of a matrix Â ∈ R^{d × m}. Then f(x) = f(P_A x) = f(Â Â^T x), which can be used to define a new network

(7)   f̂(y) = f(Â y),   y ∈ R^m,

whose weights are Â^T a_1, ..., Â^T a_m; all the other parameters remain unchanged. Note that f(x) = f̂(Â^T x), and therefore f can be recovered from f̂. In summary, if the active subspace of f is approximately known, then we can construct f̂ such that the identification of f and of f̂ are equivalent. This allows us to reduce the problem to the identification of f̂ instead of f, under the condition that we approximate A well enough [21, Theorem 1.1]. As recalled in (6), we can easily produce approximations to vectors in A by approximate first order differentiation of the original network and, in an ideal setting, generating m linearly independent gradients would suffice to approximate A. However, in general, there is no way to ensure such linear independence a priori, and we have to account for the error caused by approximating gradients by finite differences. By suitable assumptions on f (see the full rank condition on the matrix defined in (8) below) and using Algorithm 1 we obtain the following approximation result.

Input: a shallow neural network f as in (5), step-size ε > 0 of the finite differences, number of sampling points m_X
1 begin
2       Draw x_1, ..., x_{m_X} uniformly at random from the unit sphere S^{d-1}
3       Calculate the estimated gradients Δ_ε f(x_i) ≈ ∇f(x_i) by first order finite differences with stepsize ε.
4       Compute the singular value decomposition
            [Δ_ε f(x_1) | ... | Δ_ε f(x_{m_X})] = U Σ V^T,
        where the submatrix U_1 of U corresponds to the m largest singular values. Set Â = U_1.
5 end
Output: Â
Algorithm 1 Active subspace identification [21]
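
A minimal numpy sketch of this procedure (with our own implementation choices: forward differences along canonical directions, the number m of neurons assumed known, uniform sampling on the sphere); it is meant as an illustration, not as the reference implementation of [21]:

```python
import numpy as np

def fd_gradient(f, x, eps=1e-5):
    """Forward finite difference approximation of the gradient of f at x."""
    fx = f(x)
    return np.array([(f(x + eps * e) - fx) / eps for e in np.eye(x.size)])

def active_subspace(f, d, m, n_samples=200, eps=1e-5, rng=None):
    """Algorithm 1 (sketch): orthonormal basis of span{a_1, ..., a_m}
    estimated from finite difference gradients at random points on the sphere."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n_samples, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)               # uniform on S^{d-1}
    G = np.stack([fd_gradient(f, x, eps) for x in X], axis=1)   # d x n_samples
    U, s, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :m]     # left singular vectors of the m largest singular values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m = 30, 4
    A = np.linalg.qr(rng.standard_normal((d, m)))[0]            # orthonormal weights
    f = lambda x: np.tanh(A.T @ x + 0.1).sum()                  # shallow network (5)
    A_hat = active_subspace(f, d, m, rng=1)
    # distance between the projections onto the true and the estimated subspace
    print(np.linalg.norm(A @ A.T - A_hat @ A_hat.T))
```
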
Theorem 1 ([21], Theorem 2.2).

Assume the vectors a_1, ..., a_m are linearly independent and of unit norm, and assume additionally that the g_i's are smooth enough. Let Â be constructed as described in Algorithm 1 from sampled values of f, and assume that the matrix

(8)

has full rank, i.e., its m-th singular value fulfills σ_m ≥ α > 0. Then the projection onto the active subspace is stably approximated by Â Â^T, with an error controlled, with high probability, in terms of the finite difference stepsize and of α; the constants involved depend on the smoothness of the g_i's.

Identifying the weights.

As clarified in the previous section, we can assume from now on that d = m without loss of generality. Let f be a network of the type (5), with three times differentiable activation functions g_i and linearly independent weights a_i of unit norm. Then f has second derivative

(9)   ∇²f(x) = ∑_{i=1}^m g_i''(⟨a_i, x⟩) a_i ⊗ a_i,

whose expression represents a non-orthogonal rank-1 decomposition of the Hessian. The idea is, first of all, to modify the network by an ad hoc linear transformation (whitening) of the input,

(10)   x ↦ W x,

in such a way that the transformed weights form an orthonormal system. The computation of W can be performed by spectral decomposition of any positive definite matrix of the form

∑_{i=1}^m γ_i a_i ⊗ a_i,   γ_i > 0.

In fact, from the spectral decomposition of such a matrix we define the whitening matrix W (see [21, Theorem 3.7]). This procedure is called whitening; it allows to reduce the problem to networks with nearly-orthogonal weights, and it presupposes that such a positive definite matrix has been obtained. By using (9) and a similar approach as in Algorithm 1 (one simply substitutes there the approximate gradients with vectorized approximate Hessians), one can compute such a matrix under the assumption that also the second order matrix

is of full rank, where vec(∇²f(x)) denotes the vectorization of the Hessian ∇²f(x).
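
As an illustration of the whitening step, here is a small sketch (ours), assuming d = m and that a positive definite matrix of the form ∑_i a_i ⊗ a_i is already available; in practice such a matrix must itself be estimated from approximate Hessians, as described above, which is not shown here:

```python
import numpy as np

def whitening_matrix(M):
    """Inverse square root M^{-1/2} of a symmetric positive definite matrix,
    computed from its spectral decomposition M = U diag(lam) U^T."""
    lam, U = np.linalg.eigh(M)
    return U @ np.diag(1.0 / np.sqrt(lam)) @ U.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 6
    A = rng.standard_normal((d, d))              # non-orthogonal weights, d = m
    A /= np.linalg.norm(A, axis=0)               # unit norm columns a_i
    M = sum(np.outer(a, a) for a in A.T)         # positive definite, sum_i a_i (x) a_i
    W = whitening_matrix(M)                      # whitening transformation
    A_white = W @ A                              # transformed weights W a_i
    print(np.round(A_white.T @ A_white, 6))      # ~ identity: orthonormal system
```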

After whitening, one could assume without loss of generality that the vectors a_1, ..., a_m are nearly orthonormal in the first place. Hence the representation (9) would be a near spectral decomposition of the Hessian, and the components a_i ⊗ a_i would represent the approximate eigenvectors. However, the numerical stability of spectral decompositions is ensured only under spectral gaps [50, 4]. In order to maximally stabilize the approximation of the a_i's, one seeks matrices within the span of the Hessians with the maximal spectral gap between the first and second largest eigenvalues. This is achieved by the maximizers of the following nonconvex program,

(11)   maximize ‖M‖   subject to ‖M‖_F ≤ 1 and M in the (approximated) span of the Hessians,

where ‖·‖ and ‖·‖_F are the spectral and Frobenius norms respectively. This program can be solved by a suitable projected gradient ascent, see for instance [21, Algorithm 3.4] and Algorithm 3 below, and any resulting maximizer has the eigenvector associated to its largest eigenvalue in absolute value close to one of the a_i's. Once approximations to all the a_i's are retrieved, it is not difficult to perform the identification of the activation functions g_i, see [21, Algorithm 4.1, Theorem 4.1]. The recovery of the network resulting from this algorithmic pipeline is summarized by the following statement.
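
A minimal sketch of such a projected gradient ascent for (11) (a simplified variant of ours, not [21, Algorithm 3.4] verbatim): the matrix subspace is represented by a list of symmetric matrices with orthonormal vectorizations, and the ascent direction is the outer product of the current top eigenvector.

```python
import numpy as np

def project_subspace(M, basis):
    """Orthogonal projection of M onto the matrix subspace spanned by `basis`
    (symmetric matrices whose vectorizations are orthonormal)."""
    return sum(np.sum(B * M) * B for B in basis)

def top_eigpair(M):
    """Eigenvalue of largest absolute value of a symmetric matrix M, with eigenvector."""
    vals, vecs = np.linalg.eigh(M)
    i = np.argmax(np.abs(vals))
    return vals[i], vecs[:, i]

def maximize_spectral_norm(basis, n_iter=500, step=0.05, rng=None):
    """Projected gradient ascent for program (11): maximize the spectral norm
    over matrices of the subspace with Frobenius norm at most one."""
    rng = np.random.default_rng(rng)
    M = project_subspace(rng.standard_normal(basis[0].shape), basis)
    M /= np.linalg.norm(M)
    for _ in range(n_iter):
        lam, u = top_eigpair(M)
        M = M + step * np.sign(lam) * np.outer(u, u)   # ascent direction
        M = project_subspace(M, basis)                 # back to the subspace
        M /= max(1.0, np.linalg.norm(M))               # back to the Frobenius ball
    return M

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a1, a2 = np.linalg.qr(rng.standard_normal((5, 2)))[0].T
    basis = [np.outer(a1, a1), np.outer(a2, a2)]       # orthonormal rank-1 basis
    M = maximize_spectral_norm(basis, rng=1)
    lam, u = top_eigpair(M)
    print("alignment with a weight:", max(abs(u @ a1), abs(u @ a2)))   # close to 1
```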

Theorem 2 ([21], Theorem 1.2).

Let f be a real-valued function defined on a neighborhood of the Euclidean unit ball, which takes the form

f(x) = ∑_{i=1}^m g_i(⟨a_i, x⟩)

for x in such a neighborhood. Let the g_i be three times continuously differentiable on a neighborhood of [−1, 1] for all i ∈ [m], and let the weights a_1, ..., a_m be linearly independent. We additionally assume that both the first and the second order matrices introduced above are of maximal rank m. Then, for every sufficiently small stepsize ε > 0 employed in the computation of finite differences, using at most a prescribed number of random exact point evaluations of f, the nonconvex program (11) constructs approximations {â_i : i ∈ [m]} of the weights, up to a sign change, for which the error bound

(12)

holds with high probability, for a suitable constant intervening (together with some fixed power of the dimension) in the asymptotic constant of the approximation (12). Moreover, once the weights are retrieved, one constructs an approximating function of the form

f̂(x) = ∑_{i=1}^m ĝ_i(⟨â_i, x⟩),

such that

(13)

While this result has been generalized to the case of passive sampling in [21], and through whitening it allows for the identification of non-orthogonal weights, it is restricted to the case of m ≤ d and of linearly independent weights a_i.

The main goal of this paper is to generalize this approach to account both for the identification of two fully nonlinear hidden layer neural networks and for the case where the number of weights exceeds the input dimension, so that the weights are not necessarily nearly orthogonal or even linearly independent (see Remark 2 below).

1.2.3 Deeper networks: the two layer case

What follows further extends the theory discussed in the previous sections to a wider class of functions, namely neural networks with two hidden layers. By doing so, we will also address a relevant open problem that was stated in [21], which deals with the identification of shallow neural networks where the number of neurons is larger than the input dimension. First, we need a precise definition of the architecture of the neural networks we intend to consider.

Definition 3.

Let {a_1, ..., a_{m_0}} ⊂ R^d and {b_1, ..., b_{m_1}} ⊂ R^{m_0} be sets of unit vectors, and denote A = [a_1 | ... | a_{m_0}], B = [b_1 | ... | b_{m_1}]. Let g_1, ..., g_{m_0} and h_1, ..., h_{m_1} be univariate functions, and denote g = (g_1, ..., g_{m_0}), h = (h_1, ..., h_{m_1}). We define

(14)   f(x) = ∑_{ℓ=1}^{m_1} h_ℓ( ∑_{i=1}^{m_0} (b_ℓ)_i g_i(⟨a_i, x⟩) ),

with A ∈ R^{d × m_0}, B ∈ R^{m_0 × m_1}, and f satisfying

  1. (A1)

  2. (A2) a frame condition for the system {a_1, ..., a_{m_0}}, i.e. there exist constants 0 < c_f ≤ C_f such that

    (15)   c_f ‖x‖_2^2 ≤ ∑_{i=1}^{m_0} ⟨a_i, x⟩^2 ≤ C_f ‖x‖_2^2

    for all x ∈ span{a_1, ..., a_{m_0}},

  3. (A3) the derivatives of g and h are uniformly bounded according to

    (16)
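
For concreteness, a network of this class can be transcribed directly as follows; the shapes and the tanh activations in this sketch are arbitrary choices of ours, used only for illustration.

```python
import numpy as np

def two_layer_network(x, A, B, g, h):
    """Two hidden layer network of the class (14), evaluated in matrix form
    as 1^T h(B^T g(A^T x)); g and h act componentwise."""
    return np.sum(h(B.T @ g(A.T @ x)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m0, m1 = 10, 10, 5
    A = rng.standard_normal((d, m0));  A /= np.linalg.norm(A, axis=0)   # unit columns a_i
    B = rng.standard_normal((m0, m1)); B /= np.linalg.norm(B, axis=0)   # unit columns b_l
    g = lambda t: np.tanh(t + 0.2)            # shifted first layer activation
    h = np.tanh                               # second layer activation
    print(two_layer_network(rng.standard_normal(d), A, B, g, h))
```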

Sometimes it may be convenient below to use the more compact writing f(x) = 1^T h(B^T g(A^T x)), where g(y) = (g_1(y_1), ..., g_{m_0}(y_{m_0}))^T and h(z) = (h_1(z_1), ..., h_{m_1}(z_{m_1}))^T act componentwise. In the previous section we presented a dimension reduction that can be applied to one layer neural networks, and which can be useful to reduce the dimensionality from the input dimension d to the number m_0 of neurons of the first layer. The same approach can be applied to networks of the class defined above. For the approximation error of the active subspace, we end up with the following corollary of Theorem 1.

Corollary 1 (cf. Theorem 1).

Assume that f belongs to the class of Definition 3, and let Â be constructed as described in Algorithm 1 by sampling values of f. Assume that the m_0-th singular value of the corresponding matrix of expected gradients fulfills σ_{m_0} ≥ α > 0. Then the active subspace span{a_1, ..., a_{m_0}} is approximated by Â, with an error bound analogous to that of Theorem 1, with high probability and with constants that depend only on the bounds in Definition 3.

From now on we assume d = m_0.

2 Approximating the span of tensors of weights

In the one layer case, which was described earlier, the unique identification of the weights is made possible by constructing a matrix space whose rank-1 basis elements are outer products of the weight profiles of the network. This section illustrates the extension of this approach beyond shallow neural networks. Once again we will make use of differentiation, and overall there will be many parallels to the approach in [21]. However, the intuition behind the matrix space will be less straightforward, because we cannot anymore directly express the second derivative of a two layer network as a linear combination of symmetric rank-1 matrices. This is due to the fact that the Hessian matrix of a network f as in Definition 3 has the form

∇²f(x) = A diag( g''(A^T x) ⊙ B h'(B^T g(A^T x)) ) A^T + A G(x) B diag( h''(B^T g(A^T x)) ) B^T G(x) A^T,

where G(x) = diag(g_1'(⟨a_1, x⟩), ..., g_{m_0}'(⟨a_{m_0}, x⟩)) and ⊙ denotes the entrywise product. Therefore, ∇²f(x) ranges, as x varies, in a matrix space whose dimension can be much larger than m_0 + m_1

and which, differently from the one layer case (9), is in general not spanned by a fixed set of symmetric rank-1 matrices formed by the weights. This expression is indeed quite complicated, due to the chain rule and the mixed tensor contributions which consequently appear. At a first look, it would seem impossible to use a similar approach as the one for shallow neural networks recalled in the previous section. Nevertheless, a relatively simple algebraic manipulation allows to recognize some useful structure: for a fixed x

we rearrange the expression as

∇²f(x) = ∑_{i=1}^{m_0} ( ∑_{ℓ=1}^{m_1} h_ℓ'(⟨b_ℓ, g(A^T x)⟩)(b_ℓ)_i ) g_i''(⟨a_i, x⟩) a_i ⊗ a_i + ∑_{ℓ=1}^{m_1} h_ℓ''(⟨b_ℓ, g(A^T x)⟩) (A G(x) b_ℓ) ⊗ (A G(x) b_ℓ),

which is a combination of symmetric rank-1 matrices, since each term is the outer product of a vector with itself. We write the latter expression more compactly by introducing the notation

(17)   ∇²f(x) = S_1(x) + S_2(x),

where

(18)   S_1(x) := ∑_{i=1}^{m_0} ( ∑_{ℓ=1}^{m_1} h_ℓ'(⟨b_ℓ, g(A^T x)⟩)(b_ℓ)_i ) g_i''(⟨a_i, x⟩) a_i ⊗ a_i,
(19)   S_2(x) := ∑_{ℓ=1}^{m_1} h_ℓ''(⟨b_ℓ, g(A^T x)⟩) (A G(x) b_ℓ) ⊗ (A G(x) b_ℓ),
(20)   G(x) := diag( g_1'(⟨a_1, x⟩), ..., g_{m_0}'(⟨a_{m_0}, x⟩) ).

Figure 1: Illustration of the relationship between the space W (black line) and the region covered by the Hessians ∇²f(x) (light blue region), given by two non-linear cones that fan out from W. There is no reason to believe that these cones are symmetric around W. The gray cones show the maximal deviation of the estimated space from W.

Let us now introduce the fundamental matrix space

(21)   W := span{ a_1 ⊗ a_1, ..., a_{m_0} ⊗ a_{m_0}, v_1 ⊗ v_1, ..., v_{m_1} ⊗ v_{m_1} },

where the a_i are the weight profiles of the first layer and the unit vectors

v_ℓ := A G_0 b_ℓ / ‖A G_0 b_ℓ‖_2,

for all ℓ ∈ [m_1], with G_0 a diagonal matrix depending only on the activation functions of the first layer (in our centered sampling setting, G_0 = G(0)), encode entangled information about the weights of the first and of the second layer. For this reason, we call the v_ℓ's entangled weights. Let us stress at this point that the definition and the constructive approximation of the space W is perhaps the most crucial and relevant contribution of this paper. In fact, by inspecting carefully the expression (17), we immediately notice that ∇²f(0) ∈ W, and also that the first sum in (17), namely S_1(x), lies in W for all x. Moreover, for arbitrary sampling points x, deviations of ∇²f(x) from W are only due to the second term S_2(x) in (17). The intuition is that, for suitable centered distributions of the sampling points x, so that G(x) ≈ G_0 and hence S_2(x) stays close to span{v_1 ⊗ v_1, ..., v_{m_1} ⊗ v_{m_1}}, the Hessians will distribute themselves somehow around the space W, see Figure 1 for a two dimensional sketch of the geometrical situation. Hence, we would attempt an approximation of W by PCA of a collection of such approximate Hessians. Practically, by active sampling (targeted evaluations of the network f) we first construct estimates of the Hessian matrices by finite differences (see Section 2.1), at sampling points x_1, ..., x_{m_X} drawn independently from a suitable distribution μ_X. Next, we define the matrix

[ vec(Δ²_ε f(x_1)) | ... | vec(Δ²_ε f(x_{m_X})) ],

whose columns are the vectorizations of the approximate Hessians Δ²_ε f(x_k) ≈ ∇²f(x_k). Finally, we produce the approximation Ŵ to W as the span of the first m_0 + m_1 left singular vectors of this matrix. The whole procedure of calculating Ŵ is given in Algorithm 2. It should be clear that the choice of μ_X plays a crucial role for the quality of this method. In the analysis that follows, we focus on distributions that are centered and concentrated. Figure 1 helps to form a better geometrical intuition of the result of the procedure. It shows the region covered by the Hessians, indicated by the light blue area, which envelopes the space W in a sort of nonlinear/nonconvex cone originating from W. In general, the Hessians do not concentrate around W in a symmetric way, which means that the "center of mass" of the Hessians can never be perfectly aligned with the space W, regardless of the number of samples. In this analogy, the center of mass is equivalent to the space estimated by Algorithm 2, which essentially is a non-centered principal component analysis of the observed Hessian matrices. The primary result of this section is Theorem 4, which provides an estimate of the approximation error of Algorithm 2 depending on the subgaussian norm of the sample distribution and the number of neurons in the respective layers. More precisely, this result gives a precise worst case estimate of the error caused by the imbalance of mass. For reasons mentioned above, the error does not necessarily vanish with an increasing number of samples, but the probability under which the statement holds tends to 1. In Figure 1, the estimated region is illustrated by the gray cones that show the maximal, worst case deviation of the estimated space from W. One crucial condition for Theorem 4 to hold is that there exists an α > 0 such that

(22)

This assumption makes sure that the space spanned by the observed Hessians has, in expectation, at least dimension m_0 + m_1. Aside from this technical aspect, this condition implicitly helps to avoid network configurations which are reducible, in the sense that certain weights cannot be recovered. For example, we can define a network in which one neuron of the first layer is not connected to the second layer, i.e., the corresponding row of B vanishes.

It is easy to see that the output of this neuron will never be used during a forward pass through the network, which makes it impossible to recover its weight from the output of the network.
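
A concrete toy instance of such a reducible configuration (the specific weights below are our own illustrative choice, not taken from the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m0, m1 = 4, 4, 2
A = np.linalg.qr(rng.standard_normal((d, m0)))[0]
B = rng.standard_normal((m0, m1))
B[2, :] = 0.0                                   # third first-layer neuron never used
B /= np.linalg.norm(B, axis=0)

f = lambda x, A=A: np.sum(np.tanh(B.T @ np.tanh(A.T @ x)))

# changing a_3 leaves f unchanged, so a_3 cannot be recovered from samples of f
A_perturbed = A.copy()
A_perturbed[:, 2] = np.roll(A[:, 2], 1)
x = rng.standard_normal(d)
print(f(x), f(x, A_perturbed))                  # identical values
```
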
In the theorem below and in the proofs that follow we will make use of the subgaussian norm of a random variable. This quantity measures how fast the tails of a distribution decay, and such a decay plays an important role in several concentration inequalities. More generally, for α ≥ 1, the ψ_α-norm of a scalar random variable Z is defined as

‖Z‖_{ψ_α} := inf{ t > 0 : E exp(|Z|^α / t^α) ≤ 2 }.

For a random vector X on R^d the ψ_α-norm is given by

‖X‖_{ψ_α} := sup_{x ∈ S^{d-1}} ‖⟨X, x⟩‖_{ψ_α}.

The random variables for which the ψ_1-norm is finite are called subexponential, and those for which the ψ_2-norm is finite are called subgaussian. More generally, the Orlicz space L_{ψ_α} consists of all real random variables on the underlying probability space with finite ψ_α-norm, and its elements are called α-subexponential random variables. Below, we mainly focus on subgaussian random variables. In particular, every bounded random variable is subgaussian, which covers all the cases we discuss in this work. We refer to [66] for more details. One example of a subgaussian distribution is the uniform distribution on the unit sphere, which has subgaussian norm of the order of 1/√d.
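
The 1/√d scaling of this example can be observed empirically with a few lines (a quick illustration of ours, not part of the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000):
    X = rng.standard_normal((100_000, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # uniform on the unit sphere
    print(d, X[:, 0].std(), 1 / np.sqrt(d))         # <X, e_1> fluctuates at scale 1/sqrt(d)
```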

Input: neural network f as in Definition 3, number of estimated Hessians m_X, step-size ε > 0 of the finite difference approximation, probability distribution μ_X

1 begin
2       Draw x_1, ..., x_{m_X} independently from μ_X
3       Calculate the matrix [ vec(Δ²_ε f(x_1)) | ... | vec(Δ²_ε f(x_{m_X})) ] of vectorized finite difference Hessians
4       Set U Σ V^T equal to its singular value decomposition
5       Denote by U_1 the first m_0 + m_1 columns of U and set P_Ŵ := U_1 U_1^T, the orthogonal projection onto Ŵ := span(U_1)
6 end
Output: P_Ŵ
Algorithm 2 Approximating W
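
A compact numpy sketch of Algorithm 2 (our own simplified implementation: second order forward differences, Gaussian sampling with a small scale standing in for a centered, concentrated μ_X, and the dimension m_0 + m_1 assumed known):

```python
import numpy as np

def fd_hessian(f, x, eps=1e-4):
    """Second order forward finite difference approximation of the Hessian of f at x."""
    d = x.size
    H, E, fx = np.zeros((d, d)), np.eye(d), f(x)
    for i in range(d):
        for j in range(i, d):
            H[i, j] = (f(x + eps * E[i] + eps * E[j]) - f(x + eps * E[i])
                       - f(x + eps * E[j]) + fx) / eps ** 2
            H[j, i] = H[i, j]
    return H

def approximate_W(f, d, n_space, n_samples=100, eps=1e-4, scale=0.1, rng=None):
    """Algorithm 2 (sketch): non-centered PCA of vectorized finite difference Hessians.
    Returns the d^2 x d^2 projection onto the estimated space, n_space = m_0 + m_1."""
    rng = np.random.default_rng(rng)
    X = scale * rng.standard_normal((n_samples, d))         # centered, concentrated samples
    cols = [fd_hessian(f, x, eps).ravel() for x in X]
    U, s, _ = np.linalg.svd(np.stack(cols, axis=1), full_matrices=False)
    U1 = U[:, :n_space]
    return U1 @ U1.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m0, m1 = 6, 6, 3
    A = np.linalg.qr(rng.standard_normal((d, m0)))[0]
    B = rng.standard_normal((m0, m1)); B /= np.linalg.norm(B, axis=0)
    f = lambda x: np.sum(np.tanh(B.T @ np.tanh(A.T @ x + 0.1)))
    P = approximate_W(f, d, n_space=m0 + m1, rng=1)
    # sanity check: a_1 (x) a_1 should lie approximately in the estimated space
    M = np.outer(A[:, 0], A[:, 0]).ravel()
    print("relative residual:", np.linalg.norm(M - P @ M))
```
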
Theorem 4.

Let f be a neural network within the class described in Definition 3 and consider the space W as defined in (21). Assume that μ_X is a probability measure with mean zero and finite subgaussian norm, and that there exists an α > 0 such that

(23)

Then, for any sufficiently small stepsize ε > 0, Algorithm 2 returns a projection P_Ŵ that fulfills

(24)

for a suitable subspace of W (which, according to Remark 1 below, we can actually assume to be W itself), with probability at least 1 minus a quantity that tends to 0 as the number m_X of samples grows; the constants appearing in the bound are either absolute or depend only on the constants of Definition 3.

Remark 1.

If ε is sufficiently small, then due to (23) the space Ŵ returned by Algorithm 2 has dimension m_0 + m_1. If, in addition, the right-hand side of the error bound (24) in Theorem 4 is small enough, then Ŵ and the suitable subspace of Theorem 4 must have the same dimension. Moreover, being a subspace of W of dimension m_0 + m_1, that subspace would necessarily coincide with W. Hence, for ε and the error bound sufficiently small, we may take the suitable subspace in (24) to be W itself.

As already mentioned above, for the uniform distribution on the unit sphere the subgaussian norm is of the order of 1/√d. In this case the error bound (24) behaves like