Minimax Lower Bounds for Ridge Combinations Including Neural Nets

02/09/2017 ∙ by Jason M. Klusowski, et al. ∙ Yale University

Estimation of functions of d variables is considered using ridge combinations of the form ∑_{k=1}^m c_{1,k} ϕ(∑_{j=1}^d c_{0,j,k} x_j − b_k), where the activation function ϕ is a function with bounded value and derivative. These include single-hidden layer neural networks, polynomials, and sinusoidal models. From a sample of size n of possibly noisy values at random sites X ∈ B = [−1,1]^d, the minimax mean square error is examined for functions in the closure of the ℓ_1 hull of ridge functions with activation ϕ. It is shown to be of order d/n to a fractional power (when d is of smaller order than n), and to be of order (log d)/n to a fractional power (when d is of larger order than n). Dependence on constraints v_0 and v_1 on the ℓ_1 norms of inner parameter c_0 and outer parameter c_1, respectively, is also examined. Also, lower and upper bounds on the fractional power are given. The heart of the analysis is development of information-theoretic packing numbers for these classes of functions.


1 Introduction

Ridge combinations provide flexible classes for fitting functions of many variables. The ridge activation function may be a general Lipschitz function. When the ridge activation function is a sigmoid, these are single-hidden layer artificial neural nets. When the activation is a sine or cosine function, it is a sinusoidal model in a ridge combination form. We consider also a class of polynomial nets which are combinations of Hermite ridge functions. Ridge combinations are also the functions used in projection pursuit regression fitting. What distinguishes these models from other classical functional forms is the presence of parameters internal to the ridge functions which are free to be adjusted in the fit. In essence, it is a parameterized, infinite dictionary of functions from which we make linear combinations. This provides a flexibility of function modeling not present in the case of a fixed dictionary. Here we discuss results on risk properties of estimation of functions using these models and we develop new minimax lower bounds.

For a given activation function ϕ on ℝ, consider the parameterized family of functions

f_m(x) = ∑_{k=1}^m c_{1,k} ϕ(c_{0,k} · x − b_k),    (1)

where c_1 = (c_{1,1}, …, c_{1,m}) is the vector of outer layer parameters and c_{0,k} = (c_{0,1,k}, …, c_{0,d,k}) are the vectors of inner parameters for the single hidden-layer of functions ϕ(c_{0,k} · x − b_k) with horizontal shifts b_k, k = 1, …, m. For positive v_0, let

D_{v_0} = {x ↦ ϕ(θ · x − t) : ‖θ‖_1 ≤ v_0, t ∈ ℝ}    (2)

be the dictionary of all such inner layer ridge functions with parameter θ restricted to the ℓ_1 ball of size v_0 and variables x restricted to the cube [−1,1]^d. The choice of the ℓ_1 norm on the inner parameters is natural as it corresponds to |θ · x| ≤ v_0 for x ∈ [−1,1]^d.

Let F_{v_0,v_1} be the closure of the set of all linear combinations of functions in D_{v_0} with ℓ_1 norm of the outer coefficients not more than v_1. These v_0 and v_1 control the freedom in the size of this function class. They can either be fixed for minimax evaluations, or adapted in the estimation (as reflected in some of the upper bounds on risk for penalized least squares estimation). The functions of the form (1) are in F_{v_0,v_1} when ∑_{k=1}^m |c_{1,k}| ≤ v_1 and ‖c_{0,k}‖_1 ≤ v_0 for each k. Indeed, let F_{m,v_0,v_1} be the subset of such functions in F_{v_0,v_1} that use m terms.
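
For concreteness, here is a minimal numerical sketch (our own illustration, not code from the paper) of evaluating a member of the family (1) with the ℓ_1 constraints enforced; the names c0, c1, b, v0, v1 mirror the notation above, and the sine activation is just one admissible choice of ϕ.

```python
import numpy as np

def ridge_combination(x, c0, c1, b, phi=np.sin):
    """Evaluate f_m(x) = sum_k c1[k] * phi(c0[k] . x - b[k]), as in (1)."""
    # x: (d,) point in [-1, 1]^d; c0: (m, d) inner weights; c1: (m,) outer weights; b: (m,) shifts
    return c1 @ phi(c0 @ x - b)

rng = np.random.default_rng(0)
d, m, v0, v1 = 20, 5, 2.0, 3.0
c0 = rng.standard_normal((m, d))
c0 *= v0 / np.abs(c0).sum(axis=1, keepdims=True)   # rescale so each ||c0[k]||_1 = v0
c1 = rng.standard_normal(m)
c1 *= v1 / np.abs(c1).sum()                        # rescale so ||c1||_1 = v1
x = rng.uniform(-1.0, 1.0, size=d)
print(ridge_combination(x, c0, c1, b=np.zeros(m)))
```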

Data are of the form (X_i, Y_i), i = 1, …, n, drawn independently from a joint distribution P_{X,Y} with P_X on [−1,1]^d. The target function is f(x) = E[Y | X = x], the mean of the conditional distribution of Y given X = x, optimal in mean square for the prediction of future Y from corresponding input X. In some cases, assumptions are made on the error Y − f(X) of the target function (e.g. bounded, Gaussian, or sub-Gaussian).

From the data, estimators f̂ are formed, the loss at a target f is the squared L_2(P_X) error ‖f − f̂‖², and the risk is the expected squared error E‖f − f̂‖². For any class of functions F on [−1,1]^d, the minimax risk is

R_n(F) = inf_{f̂} sup_{f ∈ F} E‖f − f̂‖²,    (3)

where the infimum runs over all estimators f̂ of f based on the data.

It is known that for certain complexity penalized least squares estimators [4], [6], [17], [1] the risk satisfies

(4)

where the constant depends on parameters of the noise distribution and on properties of the activation function ϕ, which can be a step function or a fixed bounded Lipschitz function. The dimension factor in the second term is from the log-cardinality of customary d-dimensional covers of the dictionary. The right side is an index of resolvability expressing the tradeoff between approximation error and descriptive complexity relative to sample size, in accordance with risk bounds for minimum description length criteria [7], [8], [9], [5]. When the target f is in F_{v_0,v_1}, it is known as in [19], [3], [13] that the m-term approximation error is of order v_1²/m, with slight improvements possible depending on the dimension as in [22], [20], [26]. When f is not in F_{v_0,v_1}, let f* be its projection onto this convex set of functions. Then the additional error beyond ‖f − f*‖² is controlled by the bound

(5)

Moreover, with the estimator restricted to F_{v_0,v_1}, this bounds the mean squared error from the projection f*. The same risk is available from penalized least squares estimation [17], [8], [9], [20] and from greedy implementations of complexity and ℓ_1 penalized estimation [17], [20]. The slight approximation improvements (albeit not known whether available by greedy algorithms) provide the risk bound [20]

(6)

for bounded Lipschitz activation functions ϕ, improving a similar result in [14], [26]. This fact can be shown through improved upper bounds on the metric entropy from [23].
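
The following sketch makes the approximation–complexity tradeoff behind (4) concrete under schematic stand-ins for the two terms: an ℓ_1-hull approximation error of the Jones–Barron–Breiman form v_1²/m and a descriptive complexity term of the form m·d·log(n)/n. These forms, and the constants, are only illustrative; the precise bounds are in the cited works.

```python
import numpy as np

# Schematic resolvability index: approximation error ~ v1^2 / m plus
# descriptive complexity ~ m * d * log(n) / n, minimized over the number of terms m.
# (Illustrative stand-ins for the terms described in the text, not the exact bounds.)
def resolvability_index(m, n, d, v1):
    return v1**2 / m + m * d * np.log(n) / n

n, d, v1 = 10_000, 50, 2.0
ms = np.arange(1, 500)
best = ms[np.argmin(resolvability_index(ms, n, d, v1))]
print("balancing m:", best)                          # roughly v1 * sqrt(n / (d log n))
print("balanced value:", resolvability_index(best, n, d, v1))
print("comparison sqrt(v1^2 d log(n)/n):", np.sqrt(v1**2 * d * np.log(n) / n))
```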

A couple of lower bounds on the minimax risk in F_{v_0,v_1} are known [26] and, improving on [26], the working paper [20] states the lower bound

(7)

for an unconstrained v_0.

Note that for large d, these exponents are nearly the same; indeed, when d is large, the upper bound (6) and the lower bound (7) are of the same order. It is desirable to have improved lower bounds which take the form d/n to a fractional power as long as d is of smaller order than n.

Favorable performance of flexible neural network (and neural network like) models has often been observed, as in [21], in situations in which d is of much larger order than n. Current developments [20] are obtaining upper bounds on risk of the form

(8)

for a fixed positive exponent, again for bounded Lipschitz ϕ. These allow d much larger than n, as long as log d is small compared to n. We have considered two cases. First, with greedy implementations of least squares with complexity or ℓ_1 penalty, such upper bounds are obtained in [20] with one value of the exponent in the noise-free case and a smaller value in the sub-Gaussian noise case (which includes the Gaussian noise case). The larger exponent is also possible in the sub-Gaussian noise setting (as well as the noise-free setting) via a least squares estimator over a discretization of the parameter space.

It is desirable likewise to have lower bounds on the minimax risk for this setting that show that it depends primarily on (log d)/n to some power (to within logarithmic factors). It is the purpose of this paper to obtain such lower bounds. Thereby, this paper on lower bounds is to provide a companion to (refinement of) the working paper on upper bounds [20]. Lower bounding minimax risk in non-parametric regression is primarily an information-theoretic problem. This was first observed by [18] and then [11], [12], who adapted Fano's inequality to this setting. Furthermore, [26] showed conditions under which the minimax risk is characterized (to within a constant factor) by solving for the approximation error that matches the metric entropy log M(ε) relative to the sample size n, where M(ε) is the size of the largest ε-packing set. Accordingly, the core of our analysis is providing packing sets for F_{v_0,v_1} for specific choices of ϕ.
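
In schematic form (our summary of the standard recipe of [26] and [25], in generic notation, not a statement from this paper): with T uniform on an ε-packing set 𝒜 of F and Z^n denoting the data,

```latex
\inf_{\hat f}\,\max_{f\in\mathcal A}\;
  \mathbb P_f\!\Bigl(\|\hat f - f\| \ge \tfrac{\epsilon}{2}\Bigr)
  \;\ge\; 1 - \frac{I(T;\,Z^n) + \log 2}{\log|\mathcal A|},
\qquad\text{and}\qquad
\log M(\epsilon_n) \;\asymp\; n\,\epsilon_n^2
\;\Longrightarrow\;
R_n(\mathcal F) \;\gtrsim\; \epsilon_n^2 ,
```

where M(ε) is the ε-packing number of F and the noise level is absorbed into the constants. The second display is the matching used in the proofs below.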

2 Results for sinusoidal nets

We now state our main result. In this section, it is for the sinusoidal activation function ϕ = sin. We consider two regimes: when d is larger than n and vice versa. In each case, this entails putting a non-restrictive technical condition on either quantity. For d larger than n, this condition is

(9)

and when n is larger than d,

(10)

for some positive constants. Note that when n is large compared to d, condition (10) holds; likewise, (9) holds when d is large compared to n.

Theorem 1.

Consider the model Y = f(X) + ε for f in F_{v_0,v_1} with ϕ = sin, where X is uniformly distributed on [−1,1]^d and ε is mean-zero Gaussian noise with variance σ². If d is large enough so that (9) is satisfied, then

(11)

for some universal constant. Furthermore, if n is large enough so that (10) is satisfied, then

(12)

for some universal constant.

Before we prove Theorem 1, we first state a lemma which is contained in the proof of Theorem 1 (pp. 46–47) in [15].

Lemma 1.

For integers S and d with S ≤ d, define the set of binary sequences of length d with exactly S ones,

𝒮_{d,S} = {ω ∈ {0,1}^d : ∑_{j=1}^d ω_j = S}.

There exists a subset 𝒜 of 𝒮_{d,S}, with cardinality at least exp(cS log(d/S)) for a universal constant c > 0, such that the Hamming distance between any pair of distinct elements of 𝒜 is at least S/2.

Note that the elements of the set 𝒜 in Lemma 1 can be interpreted as binary codes of length d, constant Hamming weight S, and minimum Hamming distance S/2. These are called constant weight codes, and the cardinality of the largest such codebook, denoted A(d, S/2, S), is also given a combinatorial lower bound in [16]. The conclusion of Lemma 1 is thus a lower bound on A(d, S/2, S).
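
To make the object in Lemma 1 concrete, here is a small greedy construction of a constant weight code (our own illustration of weight-S codewords with pairwise Hamming distance at least S/2, not the constructions of [15] or [16], and with no claim of achieving their cardinality bounds).

```python
import itertools
import numpy as np

def greedy_constant_weight_code(d, S, min_dist):
    """Greedily collect length-d, weight-S binary vectors whose pairwise
    Hamming distance is at least min_dist (a crude Gilbert-Varshamov-style scheme)."""
    code = []
    for support in itertools.combinations(range(d), S):
        w = np.zeros(d, dtype=int)
        w[list(support)] = 1
        if all(int(np.abs(w - c).sum()) >= min_dist for c in code):
            code.append(w)
    return code

d, S = 12, 4
code = greedy_constant_weight_code(d, S, min_dist=S // 2)
print(len(code))   # size of the greedy codebook
```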

Proof of Theorem 1.

For simplicity, we henceforth write F instead of F_{v_0,v_1}. Define the collection of admissible frequency vectors used in the construction below; without loss of generality, assume the relevant ratio is an integer. Consider sinusoidal ridge functions sin(θ · x) with θ in this collection. Note that these functions are orthonormal with respect to the uniform probability measure on [−1,1]^d. This fact is easily established using an instance of Euler's formula, e^{it} = cos t + i sin t.
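
As a concrete instance of this orthonormality claim (our own illustrative choice of normalization and frequencies, not necessarily the exact ones used in the proof): the ridge sinusoids x ↦ √2 sin(π k · x), with nonzero integer vectors k distinct up to sign, are orthonormal in L_2 of the uniform measure on [−1,1]^d. The sketch below checks this by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mc = 3, 200_000
X = rng.uniform(-1.0, 1.0, size=(n_mc, d))        # uniform design on [-1, 1]^d

def ridge_sin(k):
    return np.sqrt(2.0) * np.sin(np.pi * X @ k)   # sqrt(2) * sin(pi * k . x)

k1, k2 = np.array([1, 0, 2]), np.array([0, 3, 1])
g1, g2 = ridge_sin(k1), ridge_sin(k2)
print(np.mean(g1 * g1))   # ~ 1: unit norm
print(np.mean(g1 * g2))   # ~ 0: orthogonality for distinct integer frequency vectors
```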

For an enumeration of these ridge sinusoids, define a subclass F_𝒜 of F by taking, for each ω in the set 𝒜 from Lemma 1, the linear combination with equal outer coefficients supported on the coordinates where ω equals one. Any distinct pair in F_𝒜 then has squared distance proportional to the Hamming distance between the corresponding codewords, and hence bounded below in terms of S. A prescribed separation thus determines S. Depending on the size of d relative to n, there are two different behaviors, with a different choice of S in each regime.

By Lemma 1, the cardinality of F_𝒜 is lower bounded accordingly, with logarithm of order S log(d/S). To obtain a cleaner form that highlights the dependence on d, we assume a mild upper bound on S relative to d; since S is determined by the separation, this condition puts a corresponding lower bound on d. In each of the two regimes, it follows that a lower bound on the logarithm of the packing number is of the order claimed. Thus we have found ε-packing sets of these cardinalities. As such, they are lower bounds on the metric entropy of F.

Next we use the information-theoretic lower bound techniques in [26] or [25]. For a regression function f, let p_f denote the joint density of (X, Y), where X has the uniform density on [−1,1]^d and Y − f(X) has a Gaussian density. Then

where the estimators are now restricted to F_𝒜. The supremum is at least the uniformly weighted average over f in F_𝒜. Thus a lower bound on the minimax risk is a constant times the squared separation, provided the minimax probability of failing to identify the true member of F_𝒜 is bounded away from zero, as it is for packing sets of sufficient size. Indeed, by Fano's inequality as in [26], this minimax probability is at least

for f in F_𝒜, or, by an inequality of Pinsker as in Theorem 2.5 of [25], it is at least

for some f in F_𝒜. These inequalities hold provided we have the following

bounding the mutual information between the randomly selected member of F_𝒜 and the data, where the comparison density is any fixed joint density for (X, Y). When suitable metric entropy upper bounds on the log-cardinality of covers of F are available, one may use as comparison a uniform mixture of the p_f over f in a cover, as in [26], as long as the packing and covering separations are arranged to be of the same order. In the special case that F_𝒜 has small radius, already of the order of the separation, one has the simplicity of taking the cover to be a singleton. In the present case, since each element of F_𝒜 has squared norm of the same order as the squared separation between pairs, these functions are near zero and hence we choose the comparison density to be the one with f ≡ 0. A standard calculation yields
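
(For the Gaussian-noise model of Theorem 1, the standard calculation in question is, schematically, the usual Kullback–Leibler identity for regression models sharing a common design density; the display below is our rendering of this textbook fact, not the paper's exact expression.)

```latex
D\!\left(P_f^{\otimes n} \,\middle\|\, P_g^{\otimes n}\right)
  \;=\; n\, D(p_f \,\|\, p_g)
  \;=\; \frac{n}{2\sigma^2}\, \mathbb{E}_X\!\left[(f(X)-g(X))^2\right]
  \;=\; \frac{n}{2\sigma^2}\, \|f-g\|^2 .
```

With the comparison density taken at g ≡ 0, the mutual information is then at most the maximum of n‖f‖²/(2σ²) over f in F_𝒜.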

We choose the separation so that this quantity is of the same order as the logarithm of the packing cardinality. Thus, in accordance with [26], if the two cardinalities found above are available lower bounds on the ε-packing number, then, to within a constant factor, a minimax lower bound on the squared error risk is determined by matching

and

Solving in either case, we find that

and

These quantities are valid lower bounds on the minimax risk to within constant factors, provided the two cardinalities above are indeed valid lower bounds on the ε-packing number of F. Checking the corresponding requirements yields conditions (9) and (10), respectively. ∎

3 Implications for neural nets

The variation of a function f with respect to a dictionary D [2], also called the atomic norm of f with respect to D, denoted V(f, D), is defined as the infimum of all v > 0 such that f is in the closure of the set of linear combinations of elements of D with ℓ_1 norm of the coefficients at most v. Here the closure is taken in the same sense as in the definition of F_{v_0,v_1}.

Define g(t) = sin(t). On the interval [−v_0, v_0], it can be shown that g has finite variation with respect to the dictionary of unit step activation functions 1{t ≥ b}, where |b| ≤ v_0, or equivalently, variation with respect to the dictionary of signum activation functions with shifts, sgn(t − b), where |b| ≤ v_0. This can be seen directly from the identity

sin(t) = sin(−v_0) + ∫_{−v_0}^{v_0} cos(b) 1{t ≥ b} db

for t ∈ [−v_0, v_0]. Evaluation of the integral gives the exact value of the variation of g with respect to sgn when v_0 is an integer multiple of π. Accordingly, F_{v_0,v_1} for the sinusoidal activation is contained in the corresponding class for the step (or signum) activation with the outer ℓ_1 norm inflated by a factor of order v_0.
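
A numerical illustration of the identity above (our own discretization, with v_0 = π for concreteness): a Riemann sum of unit step functions reproduces sin on [−π, π], and the ℓ_1 norm of the outer coefficients approaches ∫|cos(b)| db.

```python
import numpy as np

a = np.pi                        # the interval [-a, a], playing the role of [-v0, v0]
t = np.linspace(-a, a, 201)
b = np.linspace(-a, a, 2001)     # grid of shifts for the unit step functions
db = b[1] - b[0]

# sin(t) = sin(-a) + integral_{-a}^{a} cos(b) * 1{t >= b} db, for t in [-a, a]
steps = (t[:, None] >= b[None, :]).astype(float)    # unit step profiles 1{t >= b}
approx = np.sin(-a) + steps @ (np.cos(b) * db)      # Riemann-sum combination of steps
print(np.max(np.abs(approx - np.sin(t))))           # small discretization error
print(np.sum(np.abs(np.cos(b) * db)))               # outer l1 norm, close to 4
```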

Likewise, for the clipped linear (ramp) activation function a similar identity holds:

for t in the same interval. The above form arises from integrating the previous representation over the interval. And likewise, evaluation gives the exact variation of g with respect to the dictionary of clip activation functions when v_0 is an integer multiple of π. Accordingly, F_{v_0,v_1} for the sinusoidal activation is contained in the corresponding class for the clip activation, and hence we have the following corollary.

Corollary 1.

Using the same setup and conditions (9) and (10) as in Theorem 1, the minimax risks for the corresponding classes with step (or signum) and clipped linear activation functions have the same lower bounds (11) and (12) as for the sinusoidal class.

4 Implications for polynomial nets

It is also possible to give minimax lower bounds for the function classes with activation function equal to the standardized Hermite polynomial h_K = H_K/√(K!), where H_K is the Hermite polynomial of degree K, orthogonal with respect to the standard Gaussian measure. As with Theorem 1, this requires a lower bound on the dimension d:

(13)

for some constant . Moreover, we also need a growth condition on the order of the polynomial :

(14)

for some constant . In light of eq:cond3, condition eq:cond4 is also satisfied if is at least a constant multiple of .

Theorem 2.

Consider the model Y = f(X) + ε for f in F_{v_0,v_1} with the Hermite activation h_K, where X is a standard Gaussian vector in ℝ^d and ε is mean-zero Gaussian noise. If d and K are large enough so that conditions (13) and (14) are satisfied, respectively, then

(15)

for some universal constant.

Proof of Theorem 2.

By Lemma 1, if S and d satisfy its hypotheses, there exists a subset of the weight-S binary sequences, with cardinality at least that guaranteed by the lemma, such that each element has Hamming weight S and pairs of elements have minimum Hamming distance at least S/2. Thus, if ω and ω′ belong to this codebook, their overlap ∑_j ω_j ω′_j is at most 3S/4. Choose S accordingly (assuming it is an integer less than d), and form the collection of normalized codewords ω/√S. Note that each member of this collection has unit ℓ_2 norm and ℓ_1 norm √S. Moreover, the Euclidean inner product between each pair has magnitude bounded by 3/4. Next, we use the fact that if u and v have unit ℓ_2 norm, then the inner product of the ridge functions h_K(u · x) and h_K(v · x) in L_2 of the standard Gaussian measure is (u · v)^K. For an enumeration of this collection, define a subclass F_𝒜 of F by

where 𝒜 is the set from Lemma 1. Moreover, since each such ridge function has unit norm and the pairwise inner products are bounded as above,

provided K is large enough that the pairwise inner products are negligible. A separation of a given size determines S; when it does, the logarithm of the cardinality of the packing set is at least a constant multiple of the required level. As before in Theorem 1, a minimax lower bound on the squared error risk is determined by matching

which yields

If conditions (13) and (14) are satisfied, this is a valid lower bound on the ε-packing number of F. ∎

Remark. It is possible to obtain similar lower bounds with the Hermite activation replaced by a clipped version, in which it is extended at constant height outside a bounded interval. Then corollary conclusions follow also for sigmoid classes, using the variation of the clipped polynomial on that interval. Thereby, we obtain lower bounds for sigmoid nets for Gaussian design as well as for the uniform design of Corollary 1.

5 Discussion

Our risk lower bound in the high-dimensional regime shows that in the very high-dimensional case, it is v_0 v_1 relative to the sample size, raised to a half-power, that controls the rate (to within a logarithmic factor). The v_0 and v_1, as ℓ_1 norms of the inner and outer coefficient vectors, have the interpretation of effective dimensions of these vectors. Indeed, a vector in ℝ^d with bounded coefficients and a given number of non-negligible coordinates has ℓ_1 norm of this order. These rates confirm that it is the power of these effective dimensions over sample size (instead of the full ambient dimension d) that controls the main behavior of the statistical risk. Our lower bounds on packing numbers complement the upper bounds on covering numbers in [10] and [20].

Our rates are akin to those obtained by the authors in [24] for high-dimensional linear regression. However, there is an important difference. The richness of F_{v_0,v_1} is largely determined by the sizes of v_0 and v_1, and it more flexibly represents a larger class of functions. It would be interesting to see if the power in the high-dimensional lower minimax rate of Theorem 1 could be further improved to match or get near the upper bound (8).

References

  • [1] Andrew Barron, Lucien Birgé, and Pascal Massart. Risk bounds for model selection via penalization. Probab. Theory Related Fields, 113(3):301–413, 1999.
  • [2] Andrew R. Barron. Neural net approximation. Yale Workshop on Adaptive and Learning Systems, Yale University Press, 1992.
  • [3] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory, 39(3):930–945, 1993.
  • [4] Andrew R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, 1994.
  • [5] Andrew R. Barron and Sabyasachi Chatterjee. Information theoretic validity of penalized likelihood. Proceedings IEEE International Symposium on Information Theory, Honolulu, HI, pages 3027–3031, June 2014.
  • [6] Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, and Ronald A. DeVore. Approximation and learning by greedy algorithms. Ann. Statist., 36(1):64–94, 2008.
  • [7] Andrew R. Barron and Thomas M. Cover. Minimum complexity density estimation. IEEE Trans. Inform. Theory, 37(4):1034–1054, 1991.
  • [8] Andrew R. Barron, Cong Huang, Jonathan Li, and Xi Luo. The mdl principle, penalized likelihoods, and statistical risk. Workshop on Information Theory Methods in Science and Engineering, Tampere, Finland, 2008.
  • [9] Andrew R. Barron, Cong Huang, Jonathan Li, and Xi Luo. The mdl principle, penalized likelihoods, and statistical risk. Festschrift in Honor of Jorma Rissanen on the Occasion of his 75th Birthday, Tampere University Press, Tampere, Finland. Editor Ioan Tabus, 2008.
  • [10] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory, 44(2):525–536, 1998.
  • [11] Lucien Birgé. Approximation dans les espaces métriques et théorie de l’estimation. Z. Wahrsch. Verw. Gebiete, 65(2):181–237, 1983.
  • [12] Lucien Birgé. On estimating a density using Hellinger distance and some other strange facts. Probab. Theory Relat. Fields, 71(2):271–291, 1986.
  • [13] Leo Breiman. Hinging hyperplanes for regression, classification, and function approximation. IEEE Trans. Inform. Theory, 39(3):999–1013, 1993.
  • [14] Xiaohong Chen and Halbert White. Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Trans. Inform. Theory, 45(2):682–691, 1999.
  • [15] Fuchang Gao, Ching-Kang Ing, and Yuhong Yang. Metric entropy and sparse linear approximation of ℓ_q-hulls for 0 < q ≤ 1. J. Approx. Theory, 166:42–55, 2013.
  • [16] R. L. Graham and N. J. A. Sloane. Lower bounds for constant weight codes. IEEE Trans. Inform. Theory, 26(1):37–43, 1980.
  • [17] Cong Huang, G. H. L. Cheang, and Andrew R. Barron. Risk of penalized least squares, greedy selection and ℓ_1 penalization for flexible function libraries. Yale University, Department of Statistics technical report, 2008.
  • [18] I. A. Ibragimov and R. Z. Hasminskii. Nonparametric regression estimation. Dokl. Akad. Nauk SSSR, 252(4):780–784, 1980.
  • [19] Lee K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist., 20(1):608–613, 1992.
  • [20] Jason M. Klusowski and Andrew R. Barron. Risk bounds for high-dimensional ridge function combinations including neural networks. arXiv Preprint, 2016.
  • [21] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [22] Y. Makovoz. Random approximants and neural networks. J. Approx. Theory, 85(1):98–109, 1996.
  • [23] Shahar Mendelson. On the size of convex hulls of small sets. J. Mach. Learn. Res., 2(1):1–18, 2002.
  • [24] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over ℓ_q-balls. IEEE Trans. Inform. Theory, 57(10):6976–6994, 2011.
  • [25] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, Translated by Vladimir Zaiats.
  • [26] Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. Ann. Statist., 27(5):1564–1599, 1999.