 # Testing the number of parameters with multidimensional MLP

This work concerns testing the number of parameters of a one-hidden-layer multilayer perceptron (MLP). For this purpose we assume that the models are identifiable up to a finite group of transformations of the weights; this is the case, for example, when the number of hidden units is known. In this framework, we show that we obtain a simple asymptotic distribution if we use the logarithm of the determinant of the empirical error covariance matrix as the cost function.


## 1 Introduction

Consider a sequence of i.i.d.[^1] random vectors $(Z_t, Y_t)$ (i.e. independent and identically distributed). Each couple $(Z_t, Y_t)$ thus has the same law as a generic variable $(Z, Y)$.

[^1]: It is not hard to extend everything shown in this paper to stationary mixing variables, and so to time series.

### 1.1 The model

Assume that the model can be written

$$Y_t = F_{W^0}(Z_t) + \varepsilon_t$$

where

• $F_{W^0}$ is a function represented by a one-hidden-layer MLP with parameters, or weights, $W^0$, and sigmoidal activation functions in the hidden units.

• The noise $(\varepsilon_t)$ is a sequence of i.i.d. centered variables with unknown invertible covariance matrix $\Gamma_0$. Write $\varepsilon$ for the generic variable with the same law as each $\varepsilon_t$.

Note that a finite number of transformations of the weights leaves the MLP function invariant; these transformations form a finite group (see [Sussman (1992)]). To overcome this problem, we consider equivalence classes of MLPs: two MLPs are in the same class if one is the image of the other by such a transformation. The considered parameter set is then the quotient space of the parameters by this finite group of transformations.

In this space, we assume that the model is identifiable; this holds if we consider only MLPs with the true number of hidden units (see [Sussman (1992)]). Note that, if the number of hidden units is over-estimated, such a test can behave very badly (see [Fukumizu (2003)]). We agree that the identifiability assumption is very restrictive, but we want to emphasize that, even in this framework, the classical test of the number of parameters for multidimensional-output MLPs is not satisfactory, and we propose to improve it.

### 1.2 Testing the number of parameters

Let $q$ be an integer less than $s$. We want to test "$H_0: W \in \Theta_q$" against "$H_1: W \in \Theta_s$", where the sets $\Theta_q \subset \Theta_s$ are compact. $H_0$ expresses the fact that $W$ belongs to a subset of $\Theta_s$ with a parametric dimension less than $s$ or, equivalently, that $s - q$ weights of the MLP in $\Theta_s$ are null. If we consider the classical mean-square cost function $V_n(W) := \frac{1}{n}\sum_{t=1}^{n}\|Y_t - F_W(Z_t)\|^2$, where $\|\cdot\|$ denotes the Euclidean norm, we get the following test statistic:

$$S_n = n \times \left(\min_{W \in \Theta_q} V_n(W) - \min_{W \in \Theta_s} V_n(W)\right)$$

It is shown in [Yao (2000)] that $S_n$ converges in law to a weighted sum of i.i.d. $\chi^2_1$ variables:

$$S_n \xrightarrow{\mathcal{D}} \sum_{i=1}^{s-q} \lambda_i \chi^2_{i,1}$$

where the $\chi^2_{i,1}$ are i.i.d. $\chi^2_1$ variables and the $\lambda_i$ are strictly positive values, different from 1 if the true covariance matrix of the noise is not the identity. So, in this general case, the asymptotic distribution is not known, because the $\lambda_i$ are unknown, and it is difficult to compute the asymptotic level of the test.
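To see the practical difficulty, the following Monte Carlo sketch (with purely illustrative weights $\lambda_i$, not derived from any model) compares the 95% quantile of a plain $\chi^2_{s-q}$ with that of a weighted sum: the quantile moves with the $\lambda_i$, so a $\chi^2$ table no longer gives the level of the test.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000
dof = 3  # s - q, illustrative

# Plain chi-square with s - q degrees of freedom: sum of unit-weight chi2(1)'s.
plain = rng.chisquare(1, size=(n_draws, dof)).sum(axis=1)

# Weighted sum with lambda_i != 1 (illustrative values, unknown in practice).
lambdas = np.array([0.5, 1.0, 2.5])
weighted = (lambdas * rng.chisquare(1, size=(n_draws, dof))).sum(axis=1)

# The 95% quantiles differ markedly, so chi-square critical values
# applied to the weighted statistic would distort the level of the test.
print(np.quantile(plain, 0.95), np.quantile(weighted, 0.95))
```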

To overcome this difficulty we propose to use instead the cost function

$$U_n(W) := \ln\det\left(\frac{1}{n}\sum_{t=1}^{n}(Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T\right). \tag{1}$$
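In code, cost (1) takes only a few lines of numpy; the sketch below assumes the residuals $y_t - F_W(z_t)$ for a candidate $W$ have been stacked in an $(n \times d)$ array:

```python
import numpy as np

def U_n(residuals: np.ndarray) -> float:
    """Cost (1): log-determinant of the empirical error covariance matrix.

    residuals -- array of shape (n, d) whose rows are y_t - F_W(z_t).
    """
    n = residuals.shape[0]
    gamma_n = residuals.T @ residuals / n          # Gamma_n(W)
    sign, logdet = np.linalg.slogdet(gamma_n)      # numerically stable ln det
    if sign <= 0:
        raise ValueError("Gamma_n(W) must be positive definite")
    return float(logdet)
```

A quick sanity check: multiplying all residuals by $c$ shifts $U_n$ by exactly $2d\ln c$, since the empirical covariance scales by $c^2$.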

We will show that, under suitable assumptions, the test statistic

$$T_n = n \times \left(\min_{W \in \Theta_q} U_n(W) - \min_{W \in \Theta_s} U_n(W)\right) \tag{2}$$

converges in distribution to a classical $\chi^2_{s-q}$, so the asymptotic level of the test is very easy to compute. The sequel of this paper is devoted to the proof of this property.

## 2 Asymptotic properties of $T_n$

In order to investigate the asymptotic properties of the test, we have to prove the consistency and the asymptotic normality of the estimator $\hat{W}_n$ minimizing $U_n(W)$. Assume in the sequel that $\varepsilon$ has a moment of order at least 2, and write

$$\Gamma_n(W) = \frac{1}{n}\sum_{t=1}^{n}(Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T$$

Remark that this matrix and its inverse are symmetric. In the same way, write $\Gamma(W) = E\left((Y - F_W(Z))(Y - F_W(Z))^T\right)$, which is well defined because of the moment condition on $\varepsilon$.

### 2.1 Consistency of $\hat{W}_n$

First we have to identify the contrast function associated to $U_n$.

###### Lemma 1

$$U_n(W) - U_n(W^0) \xrightarrow{a.s.} K(W, W^0)$$

with $K(W, W^0) \geq 0$, and $K(W, W^0) = 0$ if and only if $W = W^0$.

#### Proof:

By the strong law of large numbers we have

$$U_n(W) - U_n(W^0) \xrightarrow{a.s.} \ln\det(\Gamma(W)) - \ln\det(\Gamma(W^0)) = \ln\frac{\det(\Gamma(W))}{\det(\Gamma(W^0))} = \ln\det\left(\Gamma^{-1}(W^0)\left(\Gamma(W) - \Gamma(W^0)\right) + I_d\right)$$

where $I_d$ denotes the identity matrix. So the lemma is true if $\Gamma(W) - \Gamma(W^0)$ is a positive semidefinite matrix, null only if $W = W^0$. But this property holds since

$$\begin{aligned}
\Gamma(W) &= E\left((Y - F_W(Z))(Y - F_W(Z))^T\right)\\
&= E\left((Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))(Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))^T\right)\\
&= E\left((Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T\right) + E\left((F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T\right)\\
&= \Gamma(W^0) + E\left((F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T\right)
\end{aligned}$$

and the last matrix is positive semidefinite, null if and only if $F_W = F_{W^0}$ almost surely, that is $W = W^0$ by identifiability. $\blacksquare$

We then deduce the consistency theorem:

###### Theorem 1

If $\hat{W}_n = \arg\min_{W \in \Theta_s} U_n(W)$, then

$$\hat{W}_n \xrightarrow{P} W^0$$

#### Proof

Remark that there exists a constant $B$ such that

$$\sup_{W \in \Theta_s} \|Y - F_W(Z)\|^2 < \|Y\|^2 + B$$

because $\Theta_s$ is compact, so $F_W(Z)$ is bounded. For a matrix $M$, let $\|M\|$ denote a matrix norm. We have

$$\liminf_{W \in \Theta_s} \|\Gamma_n(W)\| = \|\Gamma(W^0)\| > 0, \qquad \limsup_{W \in \Theta_s} \|\Gamma_n(W)\| := C < \infty$$

and since the function

$$\Gamma \mapsto \ln\det\Gamma, \quad \text{for } C \geq \|\Gamma\| \geq \|\Gamma(W^0)\|$$

is uniformly continuous, by the same argument as Example 19.8 of [Van der Vaart (1998)] the class of functions indexed by $W \in \Theta_s$ is Glivenko-Cantelli. Finally, Theorem 5.7 of [Van der Vaart (1998)] shows that $\hat{W}_n$ converges in probability to $W^0$. $\blacksquare$

### 2.2 Asymptotic normality

For this purpose we have to compute the first and second derivatives of $U_n(W)$ with respect to the parameters. First we introduce a notation: if $f_W$ is a $d$-dimensional parametric function depending on a parameter $W$, write $\frac{\partial f_W}{\partial W_k}$ (resp. $\frac{\partial^2 f_W}{\partial W_k \partial W_l}$) for the $d$-dimensional vector of partial derivatives (resp. second-order partial derivatives) of each component of $f_W$.

#### First derivatives:

Since $\Gamma_n(W)$ is a matrix depending on the parameter vector $W$, we get from [Magnus and Neudecker (1988)]

$$\frac{\partial}{\partial W_k}\ln\det(\Gamma_n(W)) = \operatorname{tr}\left(\Gamma_n^{-1}(W)\,\frac{\partial}{\partial W_k}\Gamma_n(W)\right)$$

Hence, if we write

$$A_n(W_k) = \frac{1}{n}\sum_{t=1}^{n}\left(-\frac{\partial F_W(z_t)}{\partial W_k}\,(y_t - F_W(z_t))^T\right)$$

then, using the fact that

$$\operatorname{tr}\left(\Gamma_n^{-1}(W)\,A_n(W_k)\right) = \operatorname{tr}\left(A_n^T(W_k)\,\Gamma_n^{-1}(W)\right) = \operatorname{tr}\left(\Gamma_n^{-1}(W)\,A_n^T(W_k)\right)$$

we get

$$\frac{\partial}{\partial W_k}\ln\det(\Gamma_n(W)) = 2\operatorname{tr}\left(\Gamma_n^{-1}(W)\,A_n(W_k)\right) \tag{3}$$
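Formula (3) can be checked numerically on a toy model (the linear model below merely stands in for the MLP and is purely illustrative): a central finite difference of $\ln\det\Gamma_n(W)$ should match $2\operatorname{tr}(\Gamma_n^{-1}(W)A_n(W_k))$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2
z = rng.normal(size=n)
y = rng.normal(size=(n, d))

def F(w, z):
    # Toy "model" F_w(z_t) = (w_1 z_t, w_2 z_t); stands in for the MLP.
    return np.outer(z, w)

def Gamma_n(w):
    r = y - F(w, z)
    return r.T @ r / n

def Un(w):
    return np.linalg.slogdet(Gamma_n(w))[1]

w0 = np.array([0.3, -0.5])
r = y - F(w0, z)
# A_n(W_1) = (1/n) sum_t ( -dF/dW_1 (y_t - F(z_t))^T ); here dF/dW_1 = (z_t, 0)
dF_dw1 = np.stack([z, np.zeros(n)], axis=1)
A = -(dF_dw1.T @ r) / n
analytic = 2 * np.trace(np.linalg.solve(Gamma_n(w0), A))  # formula (3)

# Central finite difference of ln det Gamma_n along W_1
h = 1e-6
e1 = np.array([h, 0.0])
numeric = (Un(w0 + e1) - Un(w0 - e1)) / (2 * h)
```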

#### Second derivatives:

We now write

$$B_n(W_k, W_l) := \frac{1}{n}\sum_{t=1}^{n}\left(\frac{\partial F_W(z_t)}{\partial W_k}\,\frac{\partial F_W(z_t)^T}{\partial W_l}\right)$$

and

$$C_n(W_k, W_l) := \frac{1}{n}\sum_{t=1}^{n}\left(-(y_t - F_W(z_t))\,\frac{\partial^2 F_W(z_t)^T}{\partial W_k \partial W_l}\right)$$

We get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = \frac{\partial}{\partial W_l}\,2\operatorname{tr}\left(\Gamma_n^{-1}(W)\,A_n(W_k)\right) = 2\operatorname{tr}\left(\frac{\partial \Gamma_n^{-1}(W)}{\partial W_l}\,A_n(W_k)\right) + 2\operatorname{tr}\left(\Gamma_n^{-1}(W)\,B_n(W_k, W_l)\right) + 2\operatorname{tr}\left(\Gamma_n^{-1}(W)\,C_n(W_k, W_l)\right)$$

Now [Magnus and Neudecker (1988)] gives an analytic form for the derivative of an inverse matrix, $\frac{\partial \Gamma_n^{-1}}{\partial W_l} = -\Gamma_n^{-1}\frac{\partial \Gamma_n}{\partial W_l}\Gamma_n^{-1}$, so we get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = -2\operatorname{tr}\left(\Gamma_n^{-1}(W)\left(A_n(W_l) + A_n^T(W_l)\right)\Gamma_n^{-1}(W)\,A_n(W_k)\right) + 2\operatorname{tr}\left(\Gamma_n^{-1}(W)\,B_n(W_k, W_l)\right) + 2\operatorname{tr}\left(\Gamma_n^{-1}(W)\,C_n(W_k, W_l)\right)$$

so

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = -4\operatorname{tr}\left(\Gamma_n^{-1}(W)\,A_n(W_l)\,\Gamma_n^{-1}(W)\,A_n(W_k)\right) + 2\operatorname{tr}\left(\Gamma_n^{-1}(W)\,B_n(W_k, W_l)\right) + 2\operatorname{tr}\left(\Gamma_n^{-1}(W)\,C_n(W_k, W_l)\right) \tag{4}$$

#### Asymptotic distribution of $\hat{W}_n$:

The previous equations allow us to derive the asymptotic properties of the estimator $\hat{W}_n$ minimizing the cost function $U_n(W)$; namely, from equations (3) and (4) we can compute the asymptotic behavior of the first and second derivatives of $U_n(W)$. If the variable $\varepsilon$ has a moment of order at least 3, then we get the following result:

###### Theorem 2

Assume that $W^0$ is identifiable and that $\varepsilon$ has a moment of order at least 3. Let $\Delta U_n(W^0)$ be the gradient vector of $U_n(W)$ at $W^0$ and $H_n(W^0)$ be the Hessian matrix of $U_n(W)$ at $W^0$. Write finally

$$B(W_k, W_l) := \frac{\partial F_W(Z)}{\partial W_k}\,\frac{\partial F_W(Z)^T}{\partial W_l}$$

We then get

$$\sqrt{n}\left(\hat{W}_n - W^0\right) \xrightarrow{\mathcal{Law}} \mathcal{N}\left(0, I_0^{-1}\right)$$

where the $(k,l)$ component of the matrix $I_0$ is:

$$\operatorname{tr}\left(\Gamma_0^{-1}\,E\left(B(W^0_k, W^0_l)\right)\right)$$

#### Proof:

We can easily show that, for all $k$ and $l$, there exists a constant $Cte$ such that:

$$\left\|\frac{\partial F_W(Z)}{\partial W_k}\right\| \leq Cte\,(1 + \|Z\|), \qquad \left\|\frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l}\right\| \leq Cte\,(1 + \|Z\|^2), \qquad \left\|\frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l}\right\| \leq Cte\,\|W - W^0\|\,(1 + \|Z\|^3)$$

Write

$$A(W_k) = \left(-\frac{\partial F_W(Z)}{\partial W_k}\,(Y - F_W(Z))^T\right)$$

and $U(W) := \ln\det(\Gamma(W))$, with $\Gamma_0 := \Gamma(W^0)$.

Note that the $(k,l)$ component of the matrix $4I_0$ is:

$$E\left(\frac{\partial U(W^0)}{\partial W_k}\,\frac{\partial U(W^0)}{\partial W_l}\right) = E\left(2\operatorname{tr}\left(\Gamma_0^{-1}A^T(W^0_k)\right) \times 2\operatorname{tr}\left(\Gamma_0^{-1}A(W^0_l)\right)\right)$$

and, since the trace of a product is invariant under circular permutation (and $\varepsilon = Y - F_{W^0}(Z)$ is independent of $Z$ with $E(\varepsilon\varepsilon^T) = \Gamma_0$),

$$\begin{aligned}
E\left(\frac{\partial U(W^0)}{\partial W_k}\,\frac{\partial U(W^0)}{\partial W_l}\right) &= 4\,E\left(-\frac{\partial F_{W^0}(Z)^T}{\partial W_k}\,\Gamma_0^{-1}(Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T\,\Gamma_0^{-1}\left(-\frac{\partial F_{W^0}(Z)}{\partial W_l}\right)\right)\\
&= 4\,E\left(\frac{\partial F_{W^0}(Z)^T}{\partial W_k}\,\Gamma_0^{-1}\,\frac{\partial F_{W^0}(Z)}{\partial W_l}\right)\\
&= 4\operatorname{tr}\left(\Gamma_0^{-1}\,E\left(\frac{\partial F_{W^0}(Z)}{\partial W_k}\,\frac{\partial F_{W^0}(Z)^T}{\partial W_l}\right)\right)\\
&= 4\operatorname{tr}\left(\Gamma_0^{-1}\,E\left(B(W^0_k, W^0_l)\right)\right)
\end{aligned}$$

Now, the derivative is square integrable, so it fulfills Lindeberg's condition (see [Hall and Heyde (1980)]) and

$$\sqrt{n}\,\Delta U_n(W^0) \xrightarrow{\mathcal{Law}} \mathcal{N}(0, 4I_0)$$

For the $(k,l)$ component of the expectation of the Hessian matrix, remark first that

$$\lim_{n \to \infty} \operatorname{tr}\left(\Gamma_n^{-1}(W^0)\,A_n(W^0_l)\,\Gamma_n^{-1}(W^0)\,A_n(W^0_k)\right) = 0$$

(since $A_n(W^0_k)$ converges almost surely to $E\left(-\frac{\partial F_{W^0}(Z)}{\partial W_k}\,\varepsilon^T\right) = 0$, the noise being centered and independent of $Z$) and

$$\lim_{n \to \infty} \operatorname{tr}\left(\Gamma_n^{-1}(W^0)\,C_n(W^0_k, W^0_l)\right) = 0$$

so

$$\lim_{n \to \infty} H_n(W^0)_{kl} = \lim_{n \to \infty}\left[-4\operatorname{tr}\left(\Gamma_n^{-1}(W^0)\,A_n(W^0_l)\,\Gamma_n^{-1}(W^0)\,A_n(W^0_k)\right) + 2\operatorname{tr}\left(\Gamma_n^{-1}(W^0)\,B_n(W^0_k, W^0_l)\right) + 2\operatorname{tr}\left(\Gamma_n^{-1}(W^0)\,C_n(W^0_k, W^0_l)\right)\right] = 2\operatorname{tr}\left(\Gamma_0^{-1}\,E\left(B(W^0_k, W^0_l)\right)\right)$$

Now, since $\sqrt{n}\,\Delta U_n(W^0) \xrightarrow{\mathcal{Law}} \mathcal{N}(0, 4I_0)$ and $H_n(W^0) \xrightarrow{P} 2I_0$, by standard arguments found, for example, in [Yao (2000)], we get

$$\sqrt{n}\left(\hat{W}_n - W^0\right) \xrightarrow{\mathcal{Law}} \mathcal{N}\left(0, I_0^{-1}\right) \quad \blacksquare$$

### 2.3 Asymptotic distribution of $T_n$

In this section, we write $\hat{W}_n = \arg\min_{W \in \Theta_s} U_n(W)$ and $\hat{W}^0_n = \arg\min_{W \in \Theta_q} U_n(W)$, where $\Theta_q$ is viewed as a subset of $\Theta_s$. The asymptotic distribution of $T_n$ is then a consequence of the previous section: namely, replacing $U_n(W)$ by its Taylor expansion around $\hat{W}_n$ and $\hat{W}^0_n$, following [Van der Vaart (1998)] chapter 16 we have:

$$T_n = \sqrt{n}\left(\hat{W}_n - \hat{W}^0_n\right)^T I_0\,\sqrt{n}\left(\hat{W}_n - \hat{W}^0_n\right) + o_P(1) \xrightarrow{\mathcal{D}} \chi^2_{s-q}$$
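In practice, once the two minimizations have been carried out, the test reduces to comparing $T_n$ with a $\chi^2_{s-q}$ quantile. A minimal sketch (`reject_h0` is a hypothetical helper; the Monte Carlo quantile only avoids a dependency on scipy, whose `chi2.ppf` would give it directly):

```python
import numpy as np

def reject_h0(T_n: float, s: int, q: int, level: float = 0.05,
              n_mc: int = 200_000, seed: int = 0) -> bool:
    """Reject H0 when statistic (2) exceeds the chi-square(s - q) critical value.

    The quantile is approximated by Monte Carlo so that only numpy is needed;
    with scipy available one would use chi2.ppf(1 - level, s - q) instead.
    """
    rng = np.random.default_rng(seed)
    critical = np.quantile(rng.chisquare(s - q, size=n_mc), 1 - level)
    return bool(T_n > critical)
```

With $s - q = 3$ the 95% critical value is about 7.81, so a statistic of 12.0 leads to rejection while 2.0 does not.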

## 3 Conclusion

It has been shown that, in the case of multidimensional output, the cost function $U_n(W)$ leads to a test for the number of parameters of an MLP which is simpler than the test obtained with the traditional mean-square cost function. In fact, the estimator $\hat{W}_n$ is also more efficient than the least-squares estimator (see [Rynkiewicz (2003)]). We can also remark that $U_n(W)$ matches twice the "concentrated Gaussian log-likelihood", but we have to emphasize that its nice asymptotic properties require only moment conditions on $Y$ and $Z$, so it works even if the distribution of the noise is not Gaussian. Another solution could be to use an approximation of the error covariance matrix to compute a generalized least-squares estimator minimizing:

$$\frac{1}{n}\sum_{t=1}^{n}(Y_t - F_W(Z_t))^T\,\Gamma^{-1}\,(Y_t - F_W(Z_t)),$$

assuming that $\Gamma$ is a good approximation of the true covariance matrix of the noise $\Gamma_0$. However, it takes time to compute a good matrix $\Gamma$, and if we try to compute the best possible $\Gamma$ from the data, this leads back to the cost function $U_n(W)$ (see for example [Gallant (1987)]).
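For comparison, the generalized least-squares cost above can be sketched as follows (again assuming the residuals are stacked in an $(n \times d)$ array and that some estimate $\Gamma$ is given):

```python
import numpy as np

def gls_cost(residuals: np.ndarray, gamma: np.ndarray) -> float:
    """Generalized least-squares cost (1/n) sum_t r_t^T Gamma^{-1} r_t.

    residuals -- (n, d) array of rows y_t - F_W(z_t);
    gamma     -- (d, d) estimate of the noise covariance matrix.
    """
    # Solve Gamma x = r_t for all t at once rather than forming the inverse.
    sol = np.linalg.solve(gamma, residuals.T)          # shape (d, n)
    return float(np.mean(np.sum(residuals.T * sol, axis=0)))
```

With $\Gamma = I_d$ this reduces to the ordinary mean-square cost $V_n(W)$.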

Finally, as we have seen in this paper, the computation of the derivatives of $U_n(W)$ is easy, so we can use efficient differential optimization techniques to estimate $\hat{W}_n$; numerical examples can be found in [Rynkiewicz (2003)].

## References

• Fukumizu (2003) K. Fukumizu. Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3):833–851, 2003.
• Gallant (1987) R.A. Gallant. Nonlinear Statistical Models. J. Wiley and Sons, New York, 1987.
• Hall and Heyde (1980) P. Hall and C. Heyde. Martingale Limit Theory and its Applications. Academic Press, New York, 1980.
• Magnus and Neudecker (1988) Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. J. Wiley and Sons, New York, 1988.
• Rynkiewicz (2003) J. Rynkiewicz. Estimation of multidimensional regression model with multilayer perceptrons. In J. Mira and J.R. Alvarez, editors, Computational Methods in Neural Modeling, volume 2686 of Lecture Notes in Computer Science, pages 310–317, 2003.
• Sussman (1992) H.J. Sussman. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, pages 589–593, 1992.
• Van der Vaart (1998) A.W. Van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, UK, 1998.
• Yao (2000) J. Yao. On least squares estimation for stable nonlinear AR processes. Annals of the Institute of Statistical Mathematics, 52:316–331, 2000.