Testing the number of parameters with multidimensional MLP

02/21/2008 ∙ by Joseph Rynkiewicz, et al. ∙ Université Paris 1

This work concerns testing the number of parameters in a one-hidden-layer multilayer perceptron (MLP). For this purpose we assume that the models are identifiable, up to a finite group of transformations on the weights; this is for example the case when the number of hidden units is known. In this framework, we show that we obtain a simple asymptotic distribution if we use the logarithm of the determinant of the empirical error covariance matrix as the cost function.


1 Introduction

Consider a sequence of i.i.d. random vectors (i.e., independent and identically distributed); each couple therefore has the same law as a generic variable. (It is not hard to extend everything we show in this paper to stationary mixing variables, and hence to time series.)

1.1 The model

Assume that the model can be written

where

  • the regression function is represented by a one-hidden-layer MLP with parameters, or weights, and sigmoidal activation functions in the hidden units;

  • the noise is a sequence of i.i.d. centered variables with an unknown invertible covariance matrix; write the generic noise variable with the same law as each term of the sequence (a sketch of the model in explicit notation follows).
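A minimal sketch of this model, with assumed notation (F_W for the MLP function, W^0 for the true weights, ε_t for the noise and Γ_0 for its covariance):

    Y_t = F_{W^0}(X_t) + \varepsilon_t, \qquad \mathbb{E}[\varepsilon_t] = 0, \qquad \mathrm{Var}(\varepsilon_t) = \Gamma_0 \ \text{(invertible)},

where F_W denotes the function computed by the one-hidden-layer MLP with weight vector W.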

Note that a finite number of transformations of the weights leaves the MLP function invariant; these permutations form a finite group (see [Sussman (1992)]). To overcome this problem, we consider equivalence classes of MLPs: two MLPs are in the same class if the first is the image of the second under such a transformation, and the parameter set considered is then the quotient of the space of parameters by this finite group of transformations.

In this space, we assume that the model is identifiable; this can be achieved by considering only MLPs with the true number of hidden units (see [Sussman (1992)]). Note that, if the number of hidden units is over-estimated, such a test can behave very badly (see [Fukumizu (2003)]). We agree that the identifiability assumption is very restrictive, but we want to emphasize that, even in this framework, the classical test for the number of parameters of a multidimensional-output MLP is not satisfactory, and we propose to improve it.

1.2 Testing the number of parameters

Consider an integer smaller than the full parametric dimension; we want to test the null hypothesis against the alternative, where the two parameter sets are compact. The null hypothesis expresses the fact that the true parameter belongs to a subset of the full parameter set with a smaller parametric dimension or, equivalently, that some weights of the MLP are null. If we consider the classical mean-square cost function, where the norm is the Euclidean norm, we get the following test statistic (sketched below):
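A hedged sketch of this classical setting, in assumed notation (Θ_s and Θ_q for the full and restricted compact parameter sets with parametric dimensions s and q, q < s; V_n and S_n are names introduced only for this sketch):

    V_n(W) = \frac{1}{n}\sum_{t=1}^{n} \| Y_t - F_W(X_t) \|^2, \qquad S_n = n\Big( \min_{W \in \Theta_q} V_n(W) - \min_{W \in \Theta_s} V_n(W) \Big).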

It is shown in [Yao (2000)] that this statistic converges in law to a weighted sum of chi-squared variables, where the chi-squared variables are i.i.d. and the weights are strictly positive values, different from 1 if the true covariance matrix of the noise is not the identity. So, in the general case where the true covariance matrix of the noise is not the identity, the asymptotic distribution is not known in practice, because the weights are unknown, and it is difficult to compute the asymptotic level of the test.
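Under the assumptions of [Yao (2000)], this limit can be sketched as follows (the number of terms s − q and the symbols λ_i, Z_i are assumptions of this sketch):

    S_n \xrightarrow{\mathcal{L}} \sum_{i=1}^{s-q} \lambda_i Z_i^2, \qquad Z_i \ \text{i.i.d.} \ \mathcal{N}(0,1), \quad \lambda_i > 0,

where the weights λ_i all equal 1 when the noise covariance matrix is the identity, and differ from 1 otherwise.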

To overcome this difficulty we propose to use instead the cost function

(1)

We will show that, under suitable assumptions, the test statistic

(2)

will converge to a classical χ² distribution, so the asymptotic level of the test will be very easy to compute. The rest of this paper is devoted to the proof of this property.
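A sketch of what equations (1) and (2) amount to, in the assumed notation (Σ_n(W) for the empirical error covariance matrix; the abstract describes the cost function as the logarithm of its determinant):

    U_n(W) = \log\det \Sigma_n(W), \qquad \Sigma_n(W) = \frac{1}{n}\sum_{t=1}^{n} \big( Y_t - F_W(X_t) \big)\big( Y_t - F_W(X_t) \big)^T, \qquad \text{(cf. (1))}

    T_n = n\Big( \min_{W \in \Theta_q} U_n(W) - \min_{W \in \Theta_s} U_n(W) \Big), \qquad \text{(cf. (2))}

and the claimed limit is a χ² law whose number of degrees of freedom, under the assumption made in this sketch, is the difference s − q of parametric dimensions.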

2 Asymptotic properties of

In order to investigate the asymptotic properties of the test, we have to prove the consistency and the asymptotic normality of the estimator. Assume, in the sequel, that the variable has a moment of order at least 2, and define the corresponding matrix; remark that this matrix and its inverse are symmetric. In the same way, we define a second matrix, which is well defined because of the moment condition.

2.1 Consistency of

First we have to identify the contrast function associated with

Lemma 1

with and if and only if .

Proof :

By the strong law of large numbers we have

where  denotes the identity matrix. So, the lemma is true if the matrix involved is positive, and null only if the parameter equals the true one. But this property is true since
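A minimal sketch of the limiting contrast suggested by this proof, in the assumed notation (Γ_0 for the noise covariance, Δ(W) for the mean outer product of the difference between the true and the candidate MLP functions):

    U_n(W) - U_n(W^0) \xrightarrow{a.s.} K(W) := \log\det\big( \Gamma_0 + \Delta(W) \big) - \log\det \Gamma_0, \qquad \Delta(W) := \mathbb{E}\Big[ \big( F_{W^0}(X) - F_W(X) \big)\big( F_{W^0}(X) - F_W(X) \big)^T \Big],

so that Δ(W) is positive semi-definite, K(W) ≥ 0, and K(W) = 0 if and only if F_W = F_{W^0}, that is, W = W^0 in the quotient parameter space under the identifiability assumption.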

We then deduce the consistency theorem:

Theorem 1

If ,

Proof

Remark that there exists a constant such that

because the parameter set is compact, so the corresponding quantity is bounded. For a matrix , let  be a norm, for example . We have

and since the function

is uniformly continuous, by the same argument as in example 19.8 of [Van der Vaart (1998)] the set of functions is Glivenko-Cantelli.

Finally, theorem 5.7 of [Van der Vaart (1998)] shows that the estimator converges in probability to the true parameter.

2.2 Asymptotic normality

For this purpose we have to compute the first and second derivatives, with respect to the parameters, of . First, we introduce a notation: if  is a multidimensional parametric function depending on a parameter , write  (resp. ) for the vector of partial derivatives (resp. second-order partial derivatives) of each component of .

First derivatives :

If  is a matrix depending on the parameter vector , we get from [Magnus and Neudecker (1988)]

Hence, if we write

using the fact that

we get

(3)
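The matrix-calculus fact from [Magnus and Neudecker (1988)] used here can be sketched as follows, for a differentiable, invertible matrix-valued function A(θ):

    \frac{\partial}{\partial \theta_k} \log\det A(\theta) = \mathrm{tr}\Big( A^{-1}(\theta)\, \frac{\partial A(\theta)}{\partial \theta_k} \Big),

so that a plausible reading of equation (3), in the assumed notation, is ∂U_n(W)/∂W_k = tr( Σ_n^{-1}(W) ∂Σ_n(W)/∂W_k ).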

Second derivatives :

We now write

and

We get

Now, [Magnus and Neudecker (1988)] gives an analytic form of the derivative of an inverse matrix, so we get

so

(4)
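The analytic form of the derivative of an inverse matrix referred to here is, in the same spirit (again for an invertible A(θ)):

    \frac{\partial A^{-1}(\theta)}{\partial \theta_l} = - A^{-1}(\theta)\, \frac{\partial A(\theta)}{\partial \theta_l}\, A^{-1}(\theta),

so that a plausible reading of equation (4), in the assumed notation, is

    \frac{\partial^2 \log\det \Sigma_n(W)}{\partial W_k \, \partial W_l} = \mathrm{tr}\Big( - \Sigma_n^{-1} \frac{\partial \Sigma_n}{\partial W_l} \Sigma_n^{-1} \frac{\partial \Sigma_n}{\partial W_k} + \Sigma_n^{-1} \frac{\partial^2 \Sigma_n}{\partial W_k \, \partial W_l} \Big).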

Asymptotic distribution of :

The previous equations allow us to derive the asymptotic properties of the estimator minimizing the cost function ; namely, from equations (3) and (4) we can compute the asymptotic properties of the first and second derivatives of . If the variable has a moment of order at least 3, we get the following theorem:

Theorem 2

Assume that  and ; let  be the gradient vector of  at  and  be the Hessian matrix of  at .

Write finally

We then get

where the  component of the matrix  is:

Proof:

We can easily show that, for all , we have:

Write

and .

Note that the component of the matrix is:

and, since the trace of a product is invariant under circular permutation,

Now, the derivative is square integrable, so it fulfills Lindeberg’s condition (see [Hall and Heyde (1980)]) and

For the component of the expectation of the Hessian matrix, remark first that

and

so

Now, since  and , by standard arguments found, for example, in [Yao (2000)], we get

2.3 Asymptotic distribution of

In this section, we write  and , where  is viewed as a subset of . The asymptotic distribution of the test statistic is then a consequence of the previous section: namely, if we replace  by its Taylor expansion around  and , then, following [Van der Vaart (1998)], chapter 16, we have:
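A sketch of the kind of limit this Taylor-expansion argument yields, in the assumed notation (the degrees of freedom s − q is part of the assumptions of this sketch):

    T_n = n\Big( \min_{W \in \Theta_q} U_n(W) - \min_{W \in \Theta_s} U_n(W) \Big) \xrightarrow{\mathcal{L}} \chi^2_{s-q} \qquad \text{under the null hypothesis},

which is the classical χ² behaviour announced in the introduction and makes the asymptotic level of the test straightforward to compute.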

3 Conclusion

It has been shown that, in the case of multidimensional output, the proposed cost function leads to a simpler test for the number of parameters of an MLP than the traditional mean-square cost function. In fact, the associated estimator is also more efficient than the least-squares estimator (see [Rynkiewicz (2003)]). We can also remark that the cost function matches twice the “concentrated Gaussian log-likelihood”, but we have to emphasize that its nice asymptotic properties require only moment conditions on  and , so it works even if the distribution of the noise is not Gaussian. Another solution could be to use an approximation of the error covariance matrix to compute a generalized least-squares estimator:

assuming that it is a good approximation of the true covariance matrix of the noise. However, it takes time to compute a good such matrix, and if we try to compute the best matrix from the data, it leads to the cost function  (see for example [Gallant (1987)]).

Finally, as we have seen in this paper, the computation of the derivatives of the cost function is easy, so we can use effective differential optimization techniques for the estimation; numerical examples can be found in [Rynkiewicz (2003)].
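As an illustration of how cheaply the cost function and the test statistic can be evaluated in practice, here is a small sketch in Python/NumPy; the function names, the use of residual matrices as inputs, and the χ² degrees of freedom s − q are assumptions of this sketch, not material from the paper.

    import numpy as np
    from scipy import stats

    def log_det_cost(residuals):
        # residuals: array of shape (n, d), rows are Y_t - F_W(X_t) for a fitted MLP.
        n = residuals.shape[0]
        sigma_n = residuals.T @ residuals / n        # empirical error covariance matrix
        sign, logdet = np.linalg.slogdet(sigma_n)    # numerically stable log of the determinant
        if sign <= 0:
            raise ValueError("empirical covariance matrix is not positive definite")
        return logdet

    def test_statistic(res_restricted, res_full, df):
        # T_n = n * (restricted cost - full cost), compared to a chi-squared law with df degrees of freedom.
        n = res_full.shape[0]
        t_n = n * (log_det_cost(res_restricted) - log_det_cost(res_full))
        p_value = stats.chi2.sf(t_n, df)             # asymptotic p-value of the test
        return t_n, p_value

Here res_restricted and res_full would be the residuals of the MLP estimated over the restricted and full parameter sets respectively, and df the difference s − q of parametric dimensions.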

References

  • Fukumizu (2003) K. Fukumizu. Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3):833–851, 2003.
  • Gallant (1987) R.A. Gallant. Nonlinear statistical models. J. Wiley and Sons, New York, 1987.
  • Hall and Heyde (1980) P. Hall and C. Heyde. Martingale limit theory and its application. Academic Press, New York, 1980.
  • Magnus and Neudecker (1988) Jan R. Magnus and Heinz Neudecker. Matrix differential calculus with applications in statistics and econometrics. J. Wiley and Sons, New York, 1988.
  • Rynkiewicz (2003) J. Rynkiewicz. Estimation of multidimensional regression model with multilayer perceptrons. In J. Mira and J.R. Alvarez, editors, Computational methods in neural modeling, volume 2686 of Lecture Notes in Computer Science, pages 310–317, 2003.
  • Sussman (1992) H.J. Sussman. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, 5:589–593, 1992.
  • Van der Vaart (1998) A.W. Van der Vaart. Asymptotic statistics. Cambridge University Press, Cambridge, UK, 1998.
  • Yao (2000) J. Yao. On least squares estimation for stable nonlinear AR processes. Annals of the Institute of Statistical Mathematics, 52:316–331, 2000.