# Identification of Shallow Neural Networks by Fewest Samples

We address the uniform approximation of sums of ridge functions ∑_i=1^m g_i(a_i· x) on R^d, representing the shallowest form of feed-forward neural network, from a small number of query samples, under mild smoothness assumptions on the functions g_i's and near-orthogonality of the ridge directions a_i's. The sample points are randomly generated and are universal, in the sense that the sampled queries on those points will allow the proposed recovery algorithms to perform a uniform approximation of any sum of ridge functions with high-probability. Our general approximation strategy is developed as a sequence of algorithms to perform individual sub-tasks. We first approximate the span of the ridge directions. Then we use a straightforward substitution, which reduces the dimensionality of the problem from d to m. The core of the construction is then the approximation of ridge directions expressed in terms of rank-1 matrices a_i ⊗ a_i, realized by formulating their individual identification as a suitable nonlinear program, maximizing the spectral norm of certain competitors constrained over the unit Frobenius sphere. The final step is then to approximate the functions g_1,...,g_m by ĝ_1,...,ĝ_m. Higher order differentiation, as used in our construction, of sums of ridge functions or of their compositions, as in deeper neural network, yields a natural connection between neural network weight identification and tensor product decomposition identification. In the case of the shallowest feed-forward neural network, we show that second order differentiation and tensors of order two (i.e., matrices) suffice.

## Authors

• 9 publications
• 7 publications
• 22 publications
07/26/2016

### Uniform Approximation by Neural Networks Activated by First and Second Order Ridge Splines

We establish sup-norm error bounds for functions that are approximated b...
06/30/2020

### Approximation with Tensor Networks. Part I: Approximation Spaces

We study the approximation of functions by tensor networks (TNs). We sho...
02/25/2021

### Recovery of regular ridge functions on the ball

We consider the problem of the uniform (in L_∞) recovery of ridge functi...
06/30/2019

### Robust and Resource Efficient Identification of Two Hidden Layer Neural Networks

We address the structure identification and the uniform approximation of...
12/22/2020

### Improving Sample and Feature Selection with Principal Covariates Regression

Selecting the most relevant features and samples out of a large set of c...
06/25/2020

### Q-NET: A Formula for Numerical Integration of a Shallow Feed-forward Neural Network

Numerical integration is a computational procedure that is widely encoun...
09/14/2021

### Approximation of Curve-based Sleeve Functions in High Dimensions

Sleeve functions are generalizations of the well-established ridge funct...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

### 1.1 Sums of ridge functions and neural networks

A ridge function, in its simplest form, is a function of the type

 f(x)=g(a⋅x), (1)

where is a scalar univariate function and

is the direction of the ridge function. Ridge functions are constant along the hyperplanes

for any given level and are among the most simple form of multivariate functions. For this reason they have been extensively studied in the past couple of decades as approximation building blocks for more complicated functions. Nevertheless, ridge functions appeared long before in several contexts. For instance, in multivariate Fourier series, the basis functions are of the form for and for arbitrary directions in the Radon transform. Also ridge polynomials are used in many settings [20]. The term “ridge function” has been actually coined by Logan and Shepp in 1975 in their work on computer tomography [22], where they show how ridge functions solve the corresponding -minimum norm approximation problem. Ridge function approximation has been as well extensively studied during the 80’s in mathematical statistics under the name of projection pursuit, see for instance [17, 9]. Projection pursuit algorithms approximate a function of variables by functions of the form

 m∑i=1gi(ai⋅x),x∈Rd, (2)

for some functions

and some non-zero vectors

In the early 90’s there has been an explosion of interest in the field of neural networks. One very popular model is the multilayer feed-forward neural network with input, hidden (internal), and output layers. The simplest case of such a network (the one with only one internal hidden layer, processing units, and one output) is described mathematically by a function of the form

 m∑i=1αiσ(m∑j=1wijxj+θi),

where is somehow given and called the activation function and are suitable weights indicating the amount of contribution of the input layer neurons. One can easily recognize this model to be of type (2). Due to the tremendous attention raised by neural networks in the early 90’s the question of whether one can use sums of ridge functions to approximate well arbitrary functions has been at the center of the attention of the approximation theory community for more than two decades, see for instance [21, 28] for two extensive overviews on the topic, and the references therein. Also the efficiency of such an approximation compared to, e.g., spline type approximation for smoothness classes of functions, has been extensively investigated [8, 27]. The identification of a ridge function has also been thoroughly considered, in particular we mention the work [3], and, for what concerns multilayer neural networks, we refer to the groundbreaking paper [10].

### 1.2 Identifications of ridge functions from sample queries

Except perhaps the work of Candès on ridglets [4], after these fundamental developments and the rather complete exploration of the subject, there has been less attention on the problem of approximating functions by means of ridge functions, until a more specific issue came into focus again recently, namely the approximation of ridge functions by the minimal amount of sampling queries. In fact the above mentioned results on the identification of such functions were mainly based on disposing of any possible output of the function or even of its derivatives, which might be in certain practical situations very expensive, hazardous or even merely impossible. In the work [5] the authors explored deterministic and adaptive choices of point queries to learn functions of the type (2) for the specific case where , under mild regularity assumptions on the function as in (1), i.e., . The approach pursued in this paper was based on suitable finite differences yielding approximations to the gradient of the function . A similar approach has been considered in [11] where a universal sampling strategy has been derived, i.e., the points on which to evaluate the function are not adaptively chosen depending on , at the price though of randomizing their selection. This approach was inspired by the recent developments of the theory of compressed sensing. The approximation of a function of the type (1) has been actually considered as a learning of a compressible vector from the minimal amount of nonlinear measurements , provided by the applications of the function on a randomized set of sampling points ’s. In the same paper and similarly, the approximation of functions of the type , for a matrix and , has been explored. The problem of identifying the matrix (and consequently the function ) has been lately popularized under the name of “active subspace” detection [6, 7] with a large number of potential applications. The optimal complexity of such sampling strategies, especially for ridge functions, has been by now completely explored [23].
In this paper, we address the open problem of approximating a function of the type (2) for from a small number of sampling points. We shall assume throughout, that (with the interesting possibility of included), that the vectors are linearly independent, and for all

. We reiterate that such functions play a relevant role, for instance, in forming layers of neural networks. The robust and fast calibration of multilayer neural networks, especially in deep learning, is a standing open problem

[16].
As a matter of fact one can rewrite (2) as follows

 f(x)=g(ATx)=m∑i=1gi(ai⋅x),x∈Rd,

where is the matrix whose columns are the vectors , so that , and . Hence the problem of learning uniformly can certainly be addressed by using the methods explored, e.g., in [11]. However, there are at least two relevant motivations for searching for alternative approaches to the learning of functions of the type (2). First of all, one can hope that the specific structure of being a sum of ridge functions can be more precisely identified, in particular, in some applications one may really wish to identify precisely the ridge directions , as it happens in the calibration of a neural network. Unfortunately, this is actually not possible with the methods in [11]. Indeed, as clarified in [11, Lemma 2.1], one can identify the matrix only up to an orthogonal transformation, since for , for any and Hence, whatever algorithm one uses to approximate uniformly with functions of the type , it is impossible to uniquely identify the ridge directions. The second motivation for searching for a more specific method to approximate (2) from samples is the intrinsic complexity of the problem. In fact, learning a function of the form requires eventually an exponential number of samples in , while functions of the type are essentially a sum of one dimensional ridge functions and one expects that the number of necessary samples should scale at most polynomially in .

### 1.3 Main results of the paper: identification of sums of ridge functions from sample queries

For these two relevant reasons we proceed differently in this paper. The approach we are intending to follow is dictated by the following result, whose proof we describe in more detail in Section 2.

###### Theorem 1 (Reduction to m dimensions).

Let us consider a function

 f(x)=m∑i=1gi(ai⋅x),x∈Bd1={x∈Rd:∥x∥2≤1}, (3)

for and we denote .444With a certain abuse of notation, we often use in this paper the symbol also to denote the matrix whose columns are the vectors . Let us now fix a -dimensional subspace , whose we choose an orthonormal basis , so that . We arrange the vectors ’s as columns of a matrix , and we denote with and the orthogonal projections onto and respectively. Then one can construct a function

 ~f(y)=m∑i=1~gi(αi⋅y),y∈Bm1⊂Rm, (4)

with , such that for any other function the following estimate holds

 ∥f−^f(BT⋅)∥∞≤∥f∥Lip∥PA−P~A∥F+∥~f−^f∥∞. (5)

Moreover, for any other set of vectors ,

 ∥ai−B^αi∥2≤∥PA−P~A∥F+∥αi−^αi∥2. (6)

In view of (4), (5), and (6), the approximation of a sum of ridge functions (3) on and the identification of its ridge directions can be reduced to the approximation of a sum of ridge functions (4) on and the identification of its ridge directions, as soon as one can approximate well the subspace by means of any other subspace . Hence, we need to focus on two relevant tasks. The first one is the approximation of the subspace and the second is the identification of ridge directions of a sum of ridge functions defined on . For both these tasks, we intend to follow the approach proposed in [11, 3] and we consider approximate higher order differentiation to “extract” from the function and “test” its principal directions ’s against some given vectors ’s:

 Dα1c1…Dαkckf(x)=m∑i=1g(α1+⋯+αk)i(ai⋅x)(ai⋅c1)α1…(ai⋅ck)αk, (7)

where , , for all and is the -th derivative in the direction . Since we will employ a finite difference approximation of (7) and we also intend to keep numerical stability in mind, we restrict our attention to first and second order derivatives of only, i.e. to and Hence, we need to assume certain regularity of the functions . Namely, we suppose that they are three times continuously differentiable and introduce the quantities

 Cj :=maxi=1,…,mmax−1≤t≤1|g(j)i(t)|<∞,j=0,1,2,3. (8)

We propose two different algorithms to construct a subspace which approximates with high accuracy. The first algorithm (Algorithm 1) requires that the matrix

 J[f]:=∫Sd−1∇f(x)∇f(x)TdμSd−1(x)

has full rank , being the uniform measure on the sphere . It computes an approximating subspace by using point evaluations of with high probability, increasing exponentially to one with the number of samples on the sphere. This algorithm computes an approximation space with arbitrary accuracy with high-probability, by using only first order derivatives, and a number of points which is actually linear with respect to the dimension .

The second algorithm (Algorithm 2) is deterministic and requires that the matrix

 H[f](x)=(∂2f(0)∂ej∂ek)dj,k=1=m∑i=1g′′i(ai⋅x)ai⋅aTi,

has rank at . It computes an approximating subspace with arbitrary accuracy by using point evaluations of . This algorithm computes an accurate approximation space deterministically, by using second order derivatives, and a number of points which is quadratic with respect to the dimension .

In case the ridge profiles exhibit some compressibility, i.e., they can be approximated by sparse vectors, then one can further reduce the number of sampling points by using techniques from compressed sensing. We do not explore this scenario in detail, as it would be a minor modification of the approach given in [11]. As soon as an approximating space of has been identified, we can then apply Theorem 1 and reduce the dimensionality of the problem to .

It remains to find a way of approximating the ridge directions of a sum of ridge functions in . We approach this latter issue in Section 3.1 (cf. Algorithm 3) by first finding an approximating matrix space to . The result (Theorem 8) resembles very much the one for the identification of provided by Algorithm 1 and it reads as follows: if the matrix

 H2[f]:=∫Sm−1H[f](x)⊗vH[f](x)dμSm−1(x)

has full rank (the symbol represents a suitable tensor product of the vectorization of the matrices , see (30) for its precise definition), then Algorithm 3 based on second differentiation of provides an approximating matrix space to with arbitrary accuracy by using point evaluations of with high probability, increasing exponentially to one with the number of samples on the sphere.

If this approximation is fine enough, then we can assume to be spanned by near rank matrices of unit Frobenius norm as well, and those to be good approximations to . For identifying such a basis for we need to find in it elements of minimal rank. Let us stress that this problem is strongly related to similar and very relevant ones appearing recently in the literature addressing nonconvex programs to identify sparse vectors and low-rank matrices in linear subspaces, see, e.g., in [25, 29]. We perform such a search by solving a nonlinear program, maximizing the spectral norm among competitors in of the Frobenius norm bounded by one, i.e.,

 arg max ∥M∥∞,s.t.M∈˜A, ∥M∥F≤1. (9)

We analyze the optimization algorithm (9) under the assumptions that the ridge profiles are -nearly-orthogonal, i.e. that they are close to some orthonormal basis of as described in the following definition.

###### Definition 2.

Let be unit vectors. Then we define

 S(a1,…,am)=inf{(m∑i=1∥ai−wi∥22)1/2:w1,…,wmorthonormal basis in Rm}. (10)

We say that are -nearly-orthogonal, if , for relatively small.

In Section 3.2 we characterize the solutions to the problem (9) by analyzing its first and second order optimality conditions, and we show that the local maximizers are actually close to as soon as is a good approximation to . In Section 3.4 we present an algorithm (Algorithm 5), which is easy to implement and which strives for the solution of this optimization problem, by a sort of iteratively projected gradient ascent, and we prove some of its convergence properties.

Once we have identified the approximations of by Algorithm 4 or Algorithm 5, the final step addressed in Section 3.3 is then to approximate the functions by . The approximation of is then given by Algorithm 6 as follows

 ^f(x)=m∑i=1^gi(^ai⋅x),x∈Bm1.

At this point, it is worth to summarize all the construction through the different algorithms in a single higher level result. We use the notations introduced so far.

###### Theorem 3.

Let be a real-valued function defined on the neighborhood of , which takes the form

 f(x)=m∑i=1gi(ai⋅x),

for . Let be three times continuously differentiable on a neighborhood of for all , and let be -nearly-orthogonal, for small enough. We additionally assume both and of maximal rank . Then using at most random point evaluations of , Algorithms 1-5 construct approximations of the ridge directions up to a sign change for which

 (m∑i=1∥^ai−ai∥22)1/2≲ε, (11)

with probability at least , for a suitable constant intervening (together with some fixed power of ) in the asymptotical constant of the approximation (11). Moreover, Algorithm 6 constructs an approximating function of the form

 ^f(x)=m∑i=1^gi(^ai⋅x),

such that

 ∥f−^f∥L∞(Bd1)≲ε. (12)

Let us mention that this constructive result would not hold if , measuring the deviation of from orthonormality, gets too large. In fact, in case deviates significantly from being an orthonormal system, it is unclear whether they can be again uniquely identified by any algorithm (see Example 1 below). Moreover, in absence of noise on the point evaluations of as in Theorem 3, the usage of more point evaluations does not improve the accuracy in (11) and (12). The result would need to be significantly modified in case of noise on the point evaluations of in order to deal with stability issues determined by employing finite differences in order to approximate the gradient and the Hessian of .

The identifiability of the parameters of (deep) neural networks has been addressed recently [19, 31]. Also the connection between fitting of neural networks and the search for the decomposition of tensors in the sense of [12] was observed in the recent literature. Using decompositions of tensors of the third order, [18] proposed and analyzed the algorithm NN-LIFT, which learns a two-layer feed-forward neural network, where the second layer has a linear activation function. Our use of first and second order of differentiation (cf. Algorithms 1-3) of functions of the form (2) yields a natural connection between neural network weight identification and decompositions of second order tensors (i.e. non-orthogonal decompositions of matrices). Interestingly, [24] shows that learning the weights of a simple neural network (which essentially coincides with (2)) is as hard as the problem of decomposition of a tensor built up from these weights. As it is known that many problems involving tensors are NP-hard [13, 14], this shows that also the exact recovery of parameters of a neural network is a difficult task, at least without some additional assumptions on its structure.

Let us conclude with a glimpse on future developments. In the case of the shallowest feed-forward neural network (2), second order differentiation and tensors of order two (i.e., matrices) suffice as we show in this paper. Since our results clarify constructively how many training samples one needs in order to train a shallow feed-forward neural network, they might be extended also to shed some light on estimating the minimal number of data needed to train a deeper neural network when the learning is performed by means of layer-by-layer procedures [2, 15].

The notation used throughout the paper is rather standard. For , we denote by the -(quasi)-norm of a vector . This notation is complemented by setting If is an matrix, we denote by its Frobenius norm and by its spectral norm. The inner product of two vectors is denoted by Their tensor product is a rank-1 matrix denoted by More specific notation is introduced along the way, when needed.

## 2 Dimensionality reduction

The aim of this paper is the uniform approximation of sums of ridge functions

 f(x)=m∑i=1gi(ai⋅x),x∈Bd1. (13)

We assume throughout that the vectors are linearly independent and, therefore, Nevertheless, the typical setting we have in mind is that the number of variables is very large and the number of summands in (13) is significantly smaller than , i.e.

The main aim of this section is the proof of Theorem 1, which allows to reduce the general case to Due to the typical range of parameters we have in mind, this step is crucial in reducing the complexity of the approximation of (13).

### 2.1 Reduction to dimension d=m

Proof of Theorem 1. Let us assume that the unknown function takes the form of a sum of ridge functions (13) with unknown univariate functions and unknown ridge profiles We denote .

We assume (and we shall discuss this point later in this section) that we were able to find a subspace , which approximates . We select an (arbitrary) orthonormal basis of , and consider the matrix with columns Finally, we set with .

We observe that the function

 ~f(y) :=f(By)=m∑i=1gi(ai⋅By)=m∑i=1gi(αi⋅y),y∈Bm1,

is a sum of ridge functions on . Furthermore, sampling of can be easily transferred to sampling of by Let us assume that is a uniform approximation of on . Then the function is a uniform approximation of on . Indeed, let . We have

 |f(x)−^f(BTx)| ≤|f(x)−~f(BTx)|+|~f(BTx)−^f(BTx)| ≤|f(x)−f(BBTx)|+∥~f−^f∥∞=|f(PAx)−f(P~Ax)|+∥~f−^f∥∞ ≤∥f∥Lip⋅∥PAx−P~Ax∥2+∥~f−^f∥∞.

If we take the supremum over , we get

 ∥f−^f(BT⋅)∥∞≤∥f∥Lip⋅∥PA−P~A∥∞+∥~f−^f∥∞.

A crucial step in the construction of the uniform approximation of on will be the identification of the ridge profiles Naturally, we will not be able to recover them exactly and we will only obtain some good approximation . Then the vectors approximate well the original ridge profiles as can be observed by using and

 ∥ai−B^αi∥2 ≤ ∥ai−Bαi∥2+∥B(αi−^αi)∥2 = ∥(PA−P~A)ai∥2+∥B(αi−^αi)∥2 ≤ ∥PA−P~A∥∞+∥αi−^αi∥2,

which finishes the proof.

###### Remark 1.

Let be -nearly orthogonal and let be such that

 S(a1,…,am)=(m∑j=1∥aj−wj∥22)1/2=ε.

By Theorem 20 (and its proof) we can assume that Then

 S(α1,…,αm) =S(BTa1,…,BTam)=S(BBTa1,…,BBTam) =S(P~Aa1,…,P~Aam)≤(m∑j=1∥P~Aaj−wj∥22)1/2 ≤(m∑j=1∥P~Aaj−P~Awj∥22)1/2+(m∑j=1∥P~Awj−PAwj∥22)1/2 ≤ε+∥P~A−PA∥F.

Hence, if the vectors are orthogonal, or nearly-orthogonal in the sense of Definition 2, the vectors behave similarly.

### 2.2 Approximation of the span of ridge profiles

As previously shown, as soon as we can produce a subspace approximating , we can eventually reduce the problem of approximating a sum of ridge functions in to the same problem in , preserving even the quasi-orthogonality, cf. Remark 1. In this section we describe two different methods of identification of . The first one is inspired by the results in [11] and makes use of first order differences. The second one works with second order differences.

#### 2.2.1 Identifying A using first order derivatives

We observe that the vector

 ∇f(x)=m∑i=1g′i(ai⋅x)ai (14)

lies in for every . We consider (14) for different , where . In a generic situation for the points ’s, is likely given as the span of .

As we would like to use only function values of in our algorithms, we use for every and every the Taylor’s expansion

 ∂∂ejf(xk)=f(xk+ϵej)−f(xk)ϵ−[∂∂ejf(xk+ηj,kej)−∂∂ejf(xk)] (15)

for some . We recast the instances of (15) into the matrix notation

 X=Y−E, (16)

where

 Xj,k =∂∂ejf(xk),Yj,k=f(xk+ϵej)−f(xk)ϵ, (17) and Ej,k =∂∂ejf(xk+ηj,kej)−∂∂ejf(xk)

for and . It follows from (14), that is the linear span of columns of . Naturally, we define using the linear span of the singular vectors of corresponding to its

largest singular values. This is formalized in the following algorithm.

###### Algorithm 1.
For a natural number , choose uniformly at random. Construct according to (17).

Compute the singular value decomposition of

(18)
where contains the largest singular values.
Set to be the row space of .

The aim of the rest of this section is to show, that constructed in Algorithm 1 is in some sense close to . To be more specific, we need to bound , i.e. the operator or the Frobenius norm of the difference between the orthogonal projections onto and , respectively. For this first approximation method we need to introduce a new quantity

 J[f]:=∫Sd−1∇f(x)∇f(x)TdμSd−1(x). (19)
###### Lemma 4.

Assume the vectors linearly independent, and for all . Additionally assume

 C1 :=maxi=1,…,mmax−1≤t≤1|g′i(t)|<∞. (20)

Suppose that , i.e., the singular value of the matrix is bounded away from zero. Then for any we have that

 σm(X)≥√mXα(1−s) (21)

with probability at least , where is constructed as in (16) for drawn uniformly at random.

###### Proof.

The result will follow by a suitable application of Theorem 24 in the Appendix. Let be an orthonormal basis of . We denote . We identify with the corresponding matrix, i.e. the matrix with rows We observe that ,

 XXT=mX∑l=1∇f(xl)∇f(xl)T

and

 PAXXT(PA)T=mX∑l=1PA∇f(xl)∇f(xl)T(PA)T.

Furthermore, we obtain for every

 σ1(PA∇f(x)∇f(x)T(PA)T) =σ1(∇f(x)∇f(x)T)=∥∇f(x)∇f(x)T∥F =∥∇f(x)∥22=d∑l=1(m∑i=1g′i(ai⋅x)ai,l)2 (22) ≤C21(m∑i=1d∑l=1|ai,l|2)2=C21m2.

Hence is a random positive-semidefinite matrix, that is almost surely bounded. Moreover,

 EXj=PA∫Sd−1∇f(x)∇f(x)TdμSd−1(x)(PA)T=PAJ[f](PA)T.

We conclude that , and by Theorem 24 in the Appendix

 σm(X)=√σm(PAXXT(PA)T)≥√μmin(1−s)≥√mXα(1−s)

with probability at least

 1−mexp(−μmins22C21m2)≥1−mexp(−mXαs22C21m2).

###### Remark 2.

If we further assume that are -nearly-orthonormal and

are orthonormal vectors with

 S(a1,…,am)=(m∑i=1∥ai−wi∥22)1/2≤ε,

we can improve (22) to

 σ1(PA∇f(x)∇f(x)T(PA)T) ≤∥∇f(x)∥22=∥∥m∑i=1g′i(ai⋅x)ai∥∥22 ≤(∥∥m∑i=1g′i(ai⋅x)wi∥∥2+∥∥m∑i=1g′i(ai⋅x)(ai−wi)∥∥2)2 ≤[(m∑i=1|g′i(ai⋅x)|2)1/2+m∑i=1|g′i(ai⋅x)|⋅∥ai−wi∥2]2 ≤(1+ε)2m∑i=1|g′i(ai⋅x)|2≤C21(1+ε)2m.

The rest of the proof then follows in the same manner, only the probability changes to

 1−mexp(−mXαs22C21(1+ε)2m).

The same remark applies also to Theorem 8.

Following theorem quantifies the distance between the subspace constructed in Algorithm 1 and .

###### Theorem 5.

Assume the vectors linearly independent, and for all . Additionally assume that

 C1 :=maxi=1,…,mmax−1≤t≤1|g′i(t)|<∞ (23)

and that the Lipschitz constants of all , are bounded by

Let be constructed as described in Algorithm 1 by sampling values of . Let , and assume Then

 ∥PA−P~A∥F≤2C2ϵm√α(1−s)−C2ϵm

with probability at least .

###### Proof.

We intend to apply the so-called Wedin’s bound, as recalled in Theorem 23 in the Appendix, to estimate the distance between and . If we choose and , we get and we observe that (88) and (89) are satisfied with . Therefore, Theorem 23 implies

 ∥PA−P~A∥F =∥V1VT1−~V1~VT1∥F≤2∥X−Y∥Fσm(YT)