Consider a sequence of i.i.d. (i.e. independent and identically distributed) random vectors. (It is not hard to extend everything shown in this paper to stationary mixing variables, and hence to time series.) So each couple has the same law as a generic variable.
1.1 The model
Assume that the model can be written as the sum of a regression function and a noise term, where the regression function is represented by a one-hidden-layer MLP with parameters (or weights) and sigmoidal functions in the hidden units.
The noise is a sequence of i.i.d. centered variables with unknown invertible covariance matrix. As for the observations, we also consider a generic variable with the same law as each noise term.
Note that a finite number of transformations of the weights leave the MLP function invariant; these transformations form a finite group (see [Sussman (1992)]). To overcome this problem, we consider equivalence classes of MLPs: two MLPs are in the same class if the first is the image of the second under such a transformation. The parameter set considered is then the quotient of the parameter space by this finite group of transformations.
In this space, we assume that the model is identifiable; this holds if we consider only MLPs with the true number of hidden units (see [Sussman (1992)]). Note that, if the number of hidden units is over-estimated, then such a test can behave very badly (see [Fukumizu (2003)]). We agree that the identifiability assumption is very restrictive, but we want to emphasize that, even in this framework, the classical test of the number of parameters for MLPs with multidimensional output is not satisfactory, and we propose to improve it.
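The invariance mentioned above can be illustrated numerically. The following is a minimal sketch, assuming `tanh` hidden units (one possible sigmoidal choice, not stated in the text) and hypothetical names `mlp`, `w_in`, `b_in`, `w_out`, `b_out`. It checks two typical weight transformations, a permutation of the hidden units and a sign flip of one hidden unit, both of which leave the network output unchanged.

```python
import numpy as np

def mlp(x, w_in, b_in, w_out, b_out):
    """One-hidden-layer MLP; tanh is an assumed sigmoidal activation."""
    hidden = np.tanh(x @ w_in + b_in)   # shape (n, h)
    return hidden @ w_out + b_out       # shape (n, d)

rng = np.random.default_rng(0)
n, p, h, d = 5, 3, 4, 2                 # sample size, inputs, hidden units, outputs
x = rng.normal(size=(n, p))
w_in, b_in = rng.normal(size=(p, h)), rng.normal(size=h)
w_out, b_out = rng.normal(size=(h, d)), rng.normal(size=d)
reference = mlp(x, w_in, b_in, w_out, b_out)

# Transformation 1: relabel (permute) the hidden units.
perm = np.array([2, 0, 3, 1])
permuted = mlp(x, w_in[:, perm], b_in[perm], w_out[perm, :], b_out)

# Transformation 2: flip the sign of every weight attached to hidden unit 0.
# This relies on tanh being an odd function: tanh(-a) = -tanh(a).
sign = np.ones(h)
sign[0] = -1.0
flipped = mlp(x, w_in * sign, b_in * sign, w_out * sign[:, None], b_out)

same_after_permutation = np.allclose(reference, permuted)
same_after_sign_flip = np.allclose(reference, flipped)
print(same_after_permutation, same_after_sign_flip)
```

Different weight vectors thus represent the same function, which is why the parameter set is taken modulo these transformations before identifiability is assumed.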
1.2 Testing the number of parameters
We want to test a null hypothesis, under which the parameter vector lies in a compact set of smaller dimension, against an alternative, under which it lies in a compact set of larger dimension. The null hypothesis expresses the fact that the parameter vector belongs to a subset of the larger set with a smaller parametric dimension or, equivalently, that some of the weights of the MLP in the larger set are null. If we consider the classical cost function, built from the squared Euclidean norm of the residuals, we get the classical test statistic.
It is shown in [Yao (2000)] that this statistic converges in law to a weighted sum of chi-square variables,
where the chi-square variables are i.i.d. and the weights are strictly positive values, different from 1 if the true covariance matrix of the noise is not the identity. So, in the general case, where the true covariance matrix of the noise is not the identity, the asymptotic distribution is not known, because the weights are not known, and it is difficult to compute the asymptotic level of the test.
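To see why unknown weights make the level hard to compute, the limiting law can be simulated. The sketch below is only an illustration: the function name `weighted_chi2_quantile` and the weight values are hypothetical, and each chi-square variable is assumed to have one degree of freedom. It estimates the 95% quantile of a weighted sum by Monte Carlo and compares the unit-weight case with another set of weights.

```python
import numpy as np

def weighted_chi2_quantile(weights, level=0.95, n_draws=200_000, seed=0):
    """Monte Carlo quantile of sum_i weights[i] * X_i, with X_i i.i.d. chi-square(1)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    draws = rng.chisquare(df=1, size=(n_draws, weights.size)) @ weights
    return float(np.quantile(draws, level))

# With all weights equal to 1 the sum is an ordinary chi-square with 3 degrees
# of freedom, whose 95% quantile is about 7.81.
q_unit = weighted_chi2_quantile([1.0, 1.0, 1.0])

# With other (illustrative) weights the critical value moves substantially.
q_other = weighted_chi2_quantile([0.5, 1.0, 3.0])
print(q_unit, q_other)
```

A test calibrated with the ordinary chi-square critical value would therefore have the wrong level whenever the weights differ from 1.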
To overcome this difficulty, we propose to use a different cost function instead.
We will show that, under suitable assumptions, the associated test statistic
converges to a classical chi-square distribution, so the asymptotic level of the test is very easy to compute. The remainder of this paper is devoted to the proof of this property.
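The conclusion of the paper notes that the proposed cost matches twice the concentrated Gaussian log-likelihood. A minimal sketch, assuming this means the logarithm of the determinant of the empirical covariance matrix of the residuals, and assuming the test statistic is the sample size times the difference between the minimal costs of the two nested models; the names `logdet_cost` and `logdet_test_statistic` are hypothetical.

```python
import numpy as np

def logdet_cost(residuals):
    """ln det of the empirical covariance (1/n) * sum_t r_t r_t^T of the residuals."""
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.shape[0]
    emp_cov = residuals.T @ residuals / n
    sign, logdet = np.linalg.slogdet(emp_cov)
    if sign <= 0:
        raise ValueError("empirical residual covariance is not positive definite")
    return logdet

def logdet_test_statistic(residuals_small, residuals_large):
    """n * (cost of the fitted smaller model - cost of the fitted larger model)."""
    n = len(residuals_small)
    return n * (logdet_cost(residuals_small) - logdet_cost(residuals_large))

# Toy residuals: the larger model halves every residual of the smaller one.
res_small = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
res_large = 0.5 * res_small
stat = logdet_test_statistic(res_small, res_large)
print(stat)  # equals 4 * ln(16) for these toy residuals
```

In practice the residuals would come from the two fitted MLPs, and the statistic would be compared with a chi-square quantile whose number of degrees of freedom is the number of weights set to zero under the null hypothesis.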
2 Asymptotic properties of
In order to investigate the asymptotic properties of the test, we have to prove the consistency and the asymptotic normality of the estimator. Assume, in the sequel, that
has a moment of order at least 2, and write
Remark that this matrix and its inverse are symmetric. In the same way, we write , which is well defined because of the moment condition on
2.1 Consistency of
First, we have to identify the contrast function associated with
with and if and only if .
By the strong law of large numbers, we have
where denotes the identity matrix of . So the lemma is true if is a positive matrix, null only if . But this property is true since
We then deduce the consistency theorem:
Remark that there exists a constant such that
because is compact, so is bounded. For a matrix , let be a norm, for example . We have
and since the function
is uniformly continuous, by the same argument as in Example 19.8 of [Van der Vaart (1998)], the set of functions is Glivenko-Cantelli.
2.2 Asymptotic normality
For this purpose, we have to compute the first and second derivatives with respect to the parameters of . First, we introduce a notation: if is a -dimensional parametric function depending on a parameter , write (resp. ) for the -dimensional vector of partial derivatives (resp. second-order partial derivatives) of each component of .
First derivatives :
If is a matrix depending on the parameter vector , we get from [Magnus and Neudecker (1988)]
Hence, if we write
using the fact
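A standard result of matrix differential calculus, of the kind found in [Magnus and Neudecker (1988)], is that the derivative of the log-determinant of an invertible matrix with respect to a parameter equals the trace of the inverse matrix times the derivative of the matrix. The sketch below checks this identity by finite differences on a hypothetical matrix-valued function `gamma` of a scalar parameter; it illustrates the identity only and is not the paper's exact expression.

```python
import numpy as np

A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.array([[0.5, 0.1], [0.1, 0.2]])

def gamma(w):
    """Hypothetical symmetric positive definite matrix depending on a scalar w."""
    return A + w * B + w**2 * np.eye(2)

def dgamma(w):
    """Exact derivative of gamma with respect to w."""
    return B + 2.0 * w * np.eye(2)

def logdet(m):
    return np.linalg.slogdet(m)[1]

w, eps = 0.7, 1e-6

# Analytic form: d/dw ln det gamma(w) = trace(gamma(w)^{-1} dgamma(w)).
analytic = np.trace(np.linalg.solve(gamma(w), dgamma(w)))

# Central finite difference of the log-determinant.
numeric = (logdet(gamma(w + eps)) - logdet(gamma(w - eps))) / (2.0 * eps)
print(analytic, numeric)
```

The two values agree up to the finite-difference error, which is the property that makes the gradient of a log-determinant cost cheap to evaluate.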
Second derivatives :
We now write
Now, [Magnus and Neudecker (1988)] give an analytic form of the derivative of an inverse matrix, so we get
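The analytic form of the derivative of an inverse matrix is, in its usual statement, minus the inverse times the derivative of the matrix times the inverse. The following sketch verifies it numerically on a hypothetical matrix-valued function `gamma` of a scalar parameter (an assumption made for illustration only).

```python
import numpy as np

A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.array([[0.5, 0.1], [0.1, 0.2]])

def gamma(w):
    """Hypothetical invertible matrix depending on a scalar parameter w."""
    return A + w * B + w**2 * np.eye(2)

def dgamma(w):
    """Exact derivative of gamma with respect to w."""
    return B + 2.0 * w * np.eye(2)

w, eps = 0.7, 1e-6
inv = np.linalg.inv(gamma(w))

# Analytic form: d/dw gamma(w)^{-1} = -gamma(w)^{-1} dgamma(w) gamma(w)^{-1}.
analytic = -inv @ dgamma(w) @ inv

# Central finite difference of the inverse, entry by entry.
numeric = (np.linalg.inv(gamma(w + eps)) - np.linalg.inv(gamma(w - eps))) / (2.0 * eps)
max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)
```

Combined with the log-determinant identity, this gives closed forms for second derivatives without differentiating the inverse numerically.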
Asymptotic distribution of :
The previous equations allow us to give the asymptotic properties of the estimator minimizing the cost function ; namely, from equations (3) and (4), we can compute the asymptotic properties of the first and second derivatives of . If the variable has a moment of order at least 3, then we get the following lemma:
Assume that and ; let be the gradient vector of at and be the Hessian matrix of at .
We then get
where the component of the matrix is:
We can easily show that, for all , we have:
Note that the component of the matrix is:
and, since the trace of the product is invariant by circular permutation,
Now, the derivative is square integrable, so fulfills Lindeberg’s condition (see [Hall and Heyde (1980)]) and
For the component of the expectation of the Hessian matrix, remark first that
Now, since and
, by standard arguments found, for example, in [Yao (2000)] we get
2.3 Asymptotic distribution of
In this section, we write and
, where is viewed as a subset of . The asymptotic distribution of is then a consequence of the previous section: namely, we have to replace by its Taylor expansion around and ; following [Van der Vaart (1998)], chapter 16, we have:
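The chi-square behaviour of the statistic can be checked by simulation. The sketch below swaps the MLP for a linear regression with two outputs, purely to keep the example short: for a linear model, ordinary least squares also minimizes the log-determinant of the residual covariance, so no numerical optimizer is needed. The log-determinant form of the cost, the noise covariance, the coefficients and all names are assumptions made for illustration. Under the null hypothesis two weights are zero, so the statistic should be close to a chi-square with 2 degrees of freedom even though the noise covariance is not the identity.

```python
import numpy as np

def logdet_cost(residuals):
    """ln det of the empirical covariance of the residuals (assumed cost)."""
    n = residuals.shape[0]
    return np.linalg.slogdet(residuals.T @ residuals / n)[1]

def fit_residuals(z, y):
    """Least-squares residuals of y on the columns of z (linear stand-in for the MLP)."""
    coef, *_ = np.linalg.lstsq(z, y, rcond=None)
    return y - z @ coef

def simulate_statistics(n=200, n_rep=2000, seed=0):
    rng = np.random.default_rng(seed)
    noise_cov = np.array([[1.0, 0.8], [0.8, 2.0]])    # deliberately not the identity
    chol = np.linalg.cholesky(noise_cov)
    # The last input has null weights, so the smaller model is the true one.
    true_coef = np.array([[1.0, -1.0], [0.5, 2.0], [0.0, 0.0]])
    stats = np.empty(n_rep)
    for r in range(n_rep):
        z = rng.normal(size=(n, 3))
        y = z @ true_coef + rng.normal(size=(n, 2)) @ chol.T
        small = fit_residuals(z[:, :2], y)            # model without the last input
        large = fit_residuals(z, y)                   # model with all inputs
        stats[r] = n * (logdet_cost(small) - logdet_cost(large))
    return stats

stats = simulate_statistics()
mean_stat = stats.mean()                   # a chi-square(2) variable has mean 2
rejection_rate = np.mean(stats > 5.991)    # 5.991 is the 95% quantile of chi-square(2)
print(mean_stat, rejection_rate)
```

The empirical mean and the rejection rate at the nominal 5% level give a quick sanity check of the limiting law; with an MLP the two minimizations would have to be carried out by a numerical optimizer.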
It has been shown that, in the case of multidimensional output, the cost function leads to a simpler test for the number of parameters in an MLP than the traditional mean square cost function. In fact, the estimator is also more efficient than the least squares estimator (see [Rynkiewicz (2003)]). We can also remark that this cost function matches twice the “concentrated Gaussian log-likelihood”, but we have to emphasize that its nice asymptotic properties need only moment conditions on and , so it works even if the distribution of the noise is not Gaussian. Another solution could be to use an approximation of the error covariance matrix to compute the generalized least squares estimator:
assuming that this matrix is a good approximation of the true covariance matrix of the noise. However, it takes time to compute a good matrix, and if we try to compute the best matrix with the data, this leads to the cost function proposed here (see for example [Gallant (1987)]).
Finally, as we have seen in this paper, the computation of the derivatives of is easy, so we can use efficient differential optimization techniques to estimate ; numerical examples can be found in [Rynkiewicz (2003)].
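The link between generalized least squares, the best data-driven covariance matrix and the concentrated Gaussian log-likelihood can be illustrated numerically. The sketch below rests on assumptions: `gaussian_criterion` is a hypothetical name for minus twice the Gaussian log-likelihood of the residuals (up to an additive constant), whose quadratic part is the generalized least squares cost for a fixed covariance matrix. Plugging in the empirical covariance of the residuals, which minimizes the criterion over covariance matrices, leaves the sample size times the log-determinant plus a constant.

```python
import numpy as np

def gaussian_criterion(residuals, cov):
    """n * ln det(cov) + sum_t r_t^T cov^{-1} r_t.

    The second term is the generalized least squares cost for a fixed cov.
    """
    n = residuals.shape[0]
    logdet = np.linalg.slogdet(cov)[1]
    quad = np.einsum("ti,ij,tj->", residuals, np.linalg.inv(cov), residuals)
    return n * logdet + quad

rng = np.random.default_rng(1)
# Illustrative residuals with correlated components.
residuals = rng.normal(size=(50, 2)) @ np.array([[1.0, 0.0], [0.7, 1.5]])
n, d = residuals.shape
emp_cov = residuals.T @ residuals / n

# At the empirical covariance the quadratic term collapses to n * d.
at_optimum = gaussian_criterion(residuals, emp_cov)
concentrated = n * np.linalg.slogdet(emp_cov)[1] + n * d

# Any other covariance matrix gives a larger value of the criterion.
perturbed = gaussian_criterion(residuals, emp_cov + 0.1 * np.eye(d))
print(at_optimum, concentrated, perturbed)
```

Up to the factor depending on the sample size and the additive constant, minimizing the concentrated criterion over the weights is therefore the same as minimizing a log-determinant cost, with no need to supply a covariance matrix in advance.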
- Fukumizu (2003) K. Fukumizu. Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3):833–851, 2003.
- Gallant (1987) R.A. Gallant. Nonlinear statistical models. J. Wiley and Sons, New-York, 1987.
- Hall and Heyde (1980) P. Hall and C. Heyde. Martingale limit theory and its applications. Academic Press, New-York, 1980.
- Magnus and Neudecker (1988) Jan R. Magnus and Heinz Neudecker. Matrix differential calculus with applications in statistics and econometrics. J. Wiley and Sons, New-York, 1988.
- Rynkiewicz (2003) J. Rynkiewicz. Estimation of multidimensional regression model with multilayer perceptrons. In J. Mira and J.R. Alvarez, editors, Computational methods in neural modeling, volume 2686 of Lecture Notes in Computer Science, pages 310–317, 2003.
- Sussman (1992) H.J. Sussman. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, pages 589–593, 1992.
- Van der Vaart (1998) A.W. Van der Vaart. Asymptotic statistics. Cambridge University Press, Cambridge, UK, 1998.
- Yao (2000) J. Yao. On least squares estimation for stable nonlinear AR processes. Annals of the Institute of Statistical Mathematics, 52:316–331, 2000.