# Convergence Rates for Mixture-of-Experts

In mixture-of-experts (ME) models, where a number of submodels (experts) are combined, there have been two longstanding problems: (i) how many experts should be chosen, given the size of the training data? (ii) given the total number of parameters, is it better to use a few very complex experts or to combine many simple experts? In this paper, we provide some insight into these problems through a theoretical study of an ME structure in which m experts are mixed, with each expert being related to a polynomial regression model of order k. We study the convergence rate of the maximum likelihood estimator (MLE), in terms of how fast the Kullback-Leibler divergence between the true density and the estimated density converges to zero as the sample size n increases. The convergence rate is found to depend on both m and k, and certain choices of m and k are found to produce optimal convergence rates. These results therefore shed light on the two problems above: how to choose m, and how m and k should be balanced to achieve good convergence rates.


## 1 Introduction

Mixture-of-experts models (ME) (Jacobs et al., 1991) and hierarchical mixture-of-experts models (HME) (Jordan and Jacobs, 1994) are powerful tools for estimating the density of a random variable $y$ conditional on a known set of covariates $x$. The idea is to “divide-and-conquer”: we first divide the covariate space into subspaces, then approximate each subspace by an adequate model and, finally, weight each model by the probability that $x$ falls in the corresponding subspace. The approach can also be seen as a generalization of the classical mixture of models, whose weights are constant across the covariate space. Mixtures-of-experts have been widely used in a variety of fields, including image recognition and classification, medicine, audio classification and finance. Such flexibility has also inspired a series of distinct models, including Wood et al. (2002), Carvalho and Tanner (2005a), Geweke and Keane (2007), Wood et al. (2008), Villani et al. (2009), Young and Hunter (2010) and Wood et al. (2011), among many others.

We consider a framework similar to Jiang and Tanner (1999a), among others. Assume each expert is in a one-parameter exponential family with mean , where $h_k$ is a $k$-degree polynomial in the conditioning variables (hence a linear function of the parameters) and is the inverse link function. In other words, each expert is a generalized linear model on a one-dimensional exponential family (GLM1). We allow the target density to be in the same family of distributions, but with conditional mean with , a Sobolev class with $\alpha$ derivatives. Some examples of target densities include the Poisson, binomial, Bernoulli and exponential distributions with unknown mean. Normal, gamma and beta distributions also fall in this class if the dispersion parameter is known.

One might be reluctant to use (H)ME models with polynomial experts, since they lead to more and more complex models as the degree of the polynomials increases. The question of whether it is better to mix many simple models or fewer, more complex models is not new in the mixture-of-experts literature. Early on, Jacobs et al. (1991) and Peng et al. (1996) proposed mixtures of many simple models; more recently, Wood et al. (2002) and Villani et al. (2009) considered using only a few complex models. Celeux et al. (2000) and Geweke (2007) advocate mixing fewer complex models, claiming that mixture models can be very difficult to estimate and interpret. We justify the use of such models through the approximation and estimation errors. We illustrate that there may be a gain from a small increase in $k$ relative to the linear model, but the number of parameters increases exponentially as $k$ increases. Therefore, a balance between the complexity of the model and the number of experts is required for achieving better error bounds.

This work extends Jiang and Tanner (1999a) in a few directions. We show that, by including polynomial terms, one is able to improve the approximation rate on sufficiently smooth classes. This rate is sharp for the piecewise polynomial approximation, as shown in Windlund (1977). Moreover, we contribute to the literature by providing rates of convergence of the maximum likelihood estimator to the true density. We emphasize that such rates have not previously been developed for this class of models, and the method used can be straightforwardly generalized to more general classes of mixtures of experts. Convergence of the estimated density function to the true density and parametric convergence of the quasi-maximum likelihood estimator to the pseudo-true parameter vector are also obtained.

We find that, under slightly weaker conditions than Jiang and Tanner (1999a), the approximation rate in Kullback-Leibler divergence is uniformly bounded by $c/m^{2(\alpha\wedge(k+1))/s}$, where $c$ is some constant not depending on or , and $s$ is the number of independent variables. This is a generalization of the rate found in Jiang and Tanner (1999a), who assume and . The convergence rate of the maximum likelihood estimator to the true density is $O_p\left(m^{-2\tau/s}+(mJ_k+v_m)\log n/n\right)$ with $\tau=\alpha\wedge(k+1)$, where $J_k$ is the total number of parameters in each polynomial (typically ), and $v_m$ is the number of parameters in the weight functions. To show these results we do not assume identifiability of the model, since it is natural for mixtures-of-experts to be unidentifiable under permutation of the experts. If we further assume identifiability (Jiang and Tanner, 1999a; Mendes et al., 2006), and that the likelihood function has a unique maximizer, we are able to remove the “$\log n$” term in the convergence rate. Optimal nonparametric rates of convergence can be attained if and (Stone, 1980, 1985; Chen, 2006).

Zeevi et al. (1998) show approximation in the norm and estimation error for the conditional expectation of the ME with generalized linear experts. Jiang and Tanner (1999a) show consistency and approximation rates for the HME with generalized linear models as experts and a general specification for the gating functions. They consider the target density to belong to the exponential family with one parameter. Their approximation rate of the Kullback-Leibler divergence between the target density and the model is , where is the number of experts and the number of covariates. Norets (2010) shows the approximation rate for the mixture of Gaussian experts where both the variance and the mean can be nonlinear and the weights are given by multinomial logistic functions. He considers the target density to be a smooth continuous function and the dependent variable $y$ to be continuous and to satisfy some moment conditions. His approximation rate is , where is assumed to have at least moments and is a small number. Despite these findings, there are no convergence rates yet in the literature for the maximum likelihood estimator of mixture-of-experts type models.

By studying the convergence rates in this paper, we are able to shed light on two long-standing problems in ME: (i) How to choose the number of experts $m$ for a given sample size $n$? (ii) Is it better to mix many simple experts or to mix a few complex experts? None of the works discussed above directly addresses these questions. Our study of an ME structure mixing $m$ of the $k$th order polynomial submodels is particularly useful in studying problem (ii), which cannot be studied in the framework of Jiang and Tanner (1999a), for example, who restrict attention to the special case $k=1$.

Throughout the paper we use the following notation. Let and an denote some measure. For any finite vector we use and , for , if we take . For some function and measure we denote , for , and for we have .

The remainder of the paper is organized as follows. In the next section we introduce the target density and mixture of experts models. We also demonstrate that the quasi-maximum likelihood estimator converges to the pseudo-true parameter vector. Section 3 establishes the main results of the paper: approximation rate, convergence rate and non-parametric consistency. Section 4 discusses model specification and the tradeoff that we unveil between the number of experts and the degree of the polynomials. In the concluding remarks we compare our results with Jiang and Tanner (1999a) and provide direction for future research. The appendix collects technical details of the paper and a deeper treatment on how to bound the estimation error.

## 2 Preliminaries

In this section we introduce the target class of densities, the mixture-of-experts model with GLM1 experts, and the estimation algorithm.

### 2.1 Target density

Consider a sequence of random vectors defined on where , and is the Borel $\sigma$-algebra generated by the set . We assume that has a density with respect to some measure . More precisely, we assume that is known and is a member of a one-dimensional exponential family, i.e.

$$p_{y|x}=\exp\{y\,a(h(x))+b(h(x))+c(y)\}, \tag{1}$$

where $a(\cdot)$ and $b(\cdot)$ are known three times continuously differentiable functions, with first derivative bounded away from zero and has a non-negative second derivative; $c(\cdot)$ is a known measurable function of $y$. The function $h$ is a member of , a Sobolev class of order $\alpha$. [Footnote 1: Suppose and is an integer. We define as the collection of measurable functions with all distributional derivatives , , on , i.e. . Here and for .] Throughout the paper we denote by the class of density functions .

The one-parameter exponential family of distributions includes the Bernoulli, exponential, Poisson and binomial distributions; it also includes the Gaussian, gamma and Weibull distributions if the dispersion parameter is known. It is possible to extend the results to the case where the dispersion parameter is unknown but lies in a compact subset bounded away from zero. In this work we focus only on the one-parameter case.

Some properties of the one-parameter exponential family are: (i) conditional on $x$, the moment generating function of $y$ exists in a neighborhood of the origin, implying that moments of all orders exist; (ii) for each positive integer , is a differentiable function of ; and (iii) the first conditional moment , where and are the first derivatives of and respectively, and is called the inverse link function. See Lehmann (1991) and McCullagh and Nelder (1989) for more results about the exponential family of distributions.
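Property (iii) can be checked numerically in a concrete case. The sketch below assumes the Poisson distribution written in the form (1) with $a(h)=h$, $b(h)=-e^{h}$ and $c(y)=-\log y!$; this parameterization and the function names are illustrative assumptions, not taken from the paper. Differentiating the normalization constraint of (1) with respect to $h$ gives the conditional mean $-b'(h)/a'(h)$, which the code compares with a direct summation over the probability mass function.

```python
import math

def poisson_pmf_expfam(y: int, h: float) -> float:
    """Poisson pmf written as exp{y*a(h) + b(h) + c(y)}.

    Assumed (illustrative) choice: a(h) = h, b(h) = -exp(h), c(y) = -log(y!).
    """
    return math.exp(y * h - math.exp(h) - math.lgamma(y + 1))

def mean_from_derivatives(h: float) -> float:
    """Conditional mean -b'(h)/a'(h); here a'(h) = 1 and b'(h) = -exp(h)."""
    a_prime = 1.0
    b_prime = -math.exp(h)
    return -b_prime / a_prime

def mean_by_summation(h: float, y_max: int = 200) -> float:
    """Direct computation of E[y | h] from the pmf, truncated at y_max."""
    return sum(y * poisson_pmf_expfam(y, h) for y in range(y_max))

h = 0.7
print(mean_from_derivatives(h), mean_by_summation(h))
```

The two printed numbers should agree up to truncation and rounding error, which illustrates how the inverse link function maps $h$ to the conditional mean.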

### 2.2 Mixture-of-experts model

The mixture-of-experts model with GLM1 experts is defined as:

$$f_{m,k}(x,y;\zeta)=\sum_{j=1}^{m}g_j(x;\nu)\,\pi(h_k(x;\theta_j),y)\cdot p_x=\sum_{j=1}^{m}g_j(x;\nu)\exp\{y\,a(h_k(x;\theta_j))+b(h_k(x;\theta_j))+c(y)\}\cdot p_x, \tag{2}$$

where the functions $g_j(\cdot;\nu)$ are the weight functions with parameters $\nu$, $v_m$ denoting the dimension of $\nu$. The functions $h_k(\cdot;\theta_j)$ are $k$-degree polynomials in $x$ with parameter vector $\theta_j$, $J_k$ denoting the dimension of $\theta_j$; write the vector of parameters of all experts as $\theta$, defined on $\Theta_k^m$. The parameter vector of the model is $\zeta$ and is defined on $V_m\times\Theta_k^m$, a subset of . Throughout the paper we denote by the class of (approximant) densities .
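A minimal sketch of how the conditional part of (2) can be evaluated is given below. It assumes a scalar covariate, multinomial logistic weight functions with linear scores, and Poisson experts in the form (1) with $a(h)=h$, $b(h)=-e^{h}$, $c(y)=-\log y!$. These choices, the parameter values and all function names are hypothetical and serve only as an illustration; the factor $p_x$ is omitted.

```python
import numpy as np
from math import lgamma

def softmax_gates(x, nu):
    """Multinomial logistic weights g_j(x; nu); nu has shape (m, 2) = (intercept, slope)."""
    scores = nu[:, 0] + nu[:, 1] * x
    scores = scores - scores.max()          # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

def h_k(x, theta_j):
    """Polynomial h_k(x; theta_j) = sum_d theta_j[d] * x**d, with k = len(theta_j) - 1."""
    return np.polyval(theta_j[::-1], x)     # polyval expects highest degree first

def poisson_expert(h, y):
    """pi(h, y) = exp{y*a(h) + b(h) + c(y)} with a(h)=h, b(h)=-exp(h), c(y)=-log(y!)."""
    return np.exp(y * h - np.exp(h) - lgamma(y + 1))

def me_conditional_density(x, y, nu, theta):
    """sum_j g_j(x; nu) * pi(h_k(x; theta_j), y), i.e. (2) without the factor p_x."""
    g = softmax_gates(x, nu)
    experts = np.array([poisson_expert(h_k(x, th), y) for th in theta])
    return float(g @ experts)

# Hypothetical parameters: m = 2 experts, quadratic polynomials (k = 2).
nu = np.array([[0.0, 0.0], [0.5, -1.0]])
theta = np.array([[0.1, 0.3, -0.2], [1.0, -0.5, 0.05]])

# Sanity check: for fixed x the mixture is a probability mass function in y.
total = sum(me_conditional_density(0.4, y, nu, theta) for y in range(100))
print(total)
```

Because the weights are non-negative and sum to one, the mixture inherits normalization from the experts, so the printed sum should be numerically close to one.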

To derive consistency and convergence rates, one needs to impose some restrictions on the functions $g_j$ and $\pi$ to avoid abnormal cases. The following condition is not restrictive and is satisfied by the multinomial logistic weight functions () and the Bernoulli, binomial, Poisson and exponential experts, among many other classes of distributions and weight functions.

###### Assumption 1.

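There exist functions and with and , such that the vector-function satisfies

$$\sup_{\nu\in V_m}\frac{\partial\log g(x;\nu)}{\partial\nu_i}\le c_g^{(i)}(x);$$

and each expert satisfies

$$\sup_{\theta\in\Theta_k}\frac{\partial\pi(h_k(x;\theta_j),y)}{\partial\theta_{ji}}\le F^{(i)}(x,y),\quad\text{for each } 1\le j\le m.$$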

### 2.3 Maximum likelihood estimation and the EM algorithm

#### 2.3.1 Maximum likelihood estimation

We consider the maximum likelihood method of estimation. We want to find the parameter vector that maximizes

$$L_n(\zeta)=n^{-1}\sum_{i=1}^{n}\log\{f_{m,k}(X_i,Y_i;\zeta)/\varphi_0(X_i,Y_i)\}, \tag{3}$$

where . That is,

$$\hat{\zeta}_n=\arg\max_{\zeta\in V_m\times\Theta_k^m}L_n(\zeta). \tag{4}$$

The maximum likelihood estimator is not necessarily unique. In general, mixture-of-experts models are not identifiable under permutation of the experts. To circumvent this issue one must impose restrictions on the experts and the weighting (or the parameter vector of the model), as shown in Jiang and Tanner (1999b).

Define the Kullback-Leibler (KL) divergence between $p_{xy}$ and $f_{m,k}$ as

$$KL(p_{xy},f_{m,k})=\int_{\Omega}\int_{A}\log\frac{p_{xy}}{f_{m,k}}\,dP_{y|x}\,dP_x. \tag{5}$$

The log-likelihood function in (3) converges to its expectation with probability one as the number of observations increases. Therefore, in the limit, the maximizer of (3) (indexed by ) also minimizes the Kullback-Leibler divergence between the true density and the estimated density.

In this work we consider only i.i.d. observations, but it is straightforward to extend the results to more general data generating processes. The next assumption formalizes this.

###### Assumption 2 (Data Generating Process).

The sequence , is an independent and identically distributed sequence of random vectors with common distribution .

The next result ensures the existence of such an estimator.

###### Theorem 2.1 (Existence).

For a sequence of compact subsets of , , there exists a measurable function , satisfying equation (4) -almost surely.

We demonstrate that, under classical assumptions such as identifiability and a unique maximizer, the maximum likelihood estimator consistently estimates the best model in the class indexed by , i.e. the maximum likelihood estimator converges almost surely to $\zeta^*$. It can be shown that the convergence results also hold for the ergodic case if we assume that is ergodic. However, simple conditions that ensure ergodicity of the likelihood function are not trivial and hence are outside the scope of this paper.

###### Assumption 3 (Identifiability).

For any distinct and in , for almost every ,

$$f_{m,k}(x,y;\zeta_1)\neq f_{m,k}(x,y;\zeta_2).$$

Jiang and Tanner (1999b) find sufficient conditions for identifiability of the parameter vector for the HME with one layer, while Mendes et al. (2006) do so for a binary tree structure. Both cases can be adapted to more general specifications. Although one can show consistency to a set, we adopt a more traditional approach and require identifiability of the parameter vector.

###### Assumption 4 (Unique Maximizer).

Let and the argument that maximizes over . Then

$$\det\left(E\,\frac{\partial^2}{\partial\zeta\,\partial\zeta'}\log f_{m,k}\Big|_{\zeta=\zeta^*}\right)\neq 0. \tag{6}$$

This assumption follows from a second order Taylor expansion of the expected likelihood around the parameter vector that minimizes (5), denoted $\zeta^*$. We require the Hessian to be invertible at $\zeta^*$. The requirement of an identifiable unique maximizer is only technical, in the sense that the objective function is not allowed to become too flat around the maximum (for more discussion on this topic see Bates and White (1985), pg 156, and White (1996), chapter 3). A similar assumption was made in the series of papers by Carvalho and Tanner (2005a, b, 2006, 2007) and in Zeevi et al. (1998), and it is a usual assumption in the estimation of misspecified models.

###### Theorem 2.2 (Parametric consistency of misspecified models).

Under Assumptions 1, 2, 3, and 4, the maximum likelihood estimate $\hat{\zeta}_n\to\zeta^*$ as $n\to\infty$, almost surely.

Huerta et al. (2003) and the series of papers by Carvalho and Tanner (2005a, b, 2006, 2007) derive similar results for time series processes.

#### 2.3.2 The EM algorithm

It is often easier to maximize the complete likelihood function of a (H)ME instead of (3) (see Jordan and Jacobs (1994), Xu and Jordan (1996) and Yang and Ma (2011)). Let $z_i$ denote a binary vector with $z_{ij}=1$ if the $i$th observation is generated by the $j$th expert (i.e. ). We assume has a multinomial distribution with parameters . The complete log-likelihood function is given by

$$l_n^c(\kappa)=\sum_{i=1}^{n}\sum_{j=1}^{m}z_{ij}\left(\log g_j(x_i;\nu)+\log\pi(h_k(x_i;\theta_j),y_i)-\log\varphi_0(x_i,y_i)\right), \tag{7}$$

where .

We can estimate this model using the expectation-maximization (EM) algorithm put forward by Dempster et al. (1977). Let $\kappa^{(l)}$ denote the parameter estimates at the $l$th iteration and define . In the E-step, we obtain by replacing $z_{ij}$ with its expectation

$$\tau_{ij}^{(l)}=\frac{g_j(x_i;\nu^{(l)})\,\pi(h_k(x_i;\theta_j^{(l)}),y_i)}{\sum_{r=1}^{m}g_r(x_i;\nu^{(l)})\,\pi(h_k(x_i;\theta_r^{(l)}),y_i)}.$$

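In the M-step we maximize with respect to $\nu$ and $\theta$. The problem simplifies to finding the parameters $\nu$ that maximize

$$q(\nu;\kappa^{(l)})=\sum_{i=1}^{n}\sum_{j=1}^{m}\tau_{ij}^{(l)}\log g_j(x_i;\nu), \tag{8}$$

and the parameters $\theta$ that maximize

$$q(\theta;\kappa^{(l)})=\sum_{i=1}^{n}\sum_{j=1}^{m}\tau_{ij}^{(l)}\left[y_i\,a(h_k(x_i;\theta_j))+b(h_k(x_i;\theta_j))\right]. \tag{9}$$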
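The E-step and the expert part of the M-step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the experts are Gaussian with known unit variance (so that $a(h)=h$ and $b(h)=-h^2/2$, and maximizing (9) reduces to one weighted least-squares fit per expert), the covariate is scalar, and the multinomial logistic gates are held fixed because maximizing (8) has no closed form. All names and the simulated data are hypothetical.

```python
import numpy as np

def design(x, k):
    """Polynomial features [1, x, ..., x**k] for a scalar covariate."""
    return np.vander(x, k + 1, increasing=True)

def gates(x, nu):
    """Multinomial logistic weights g_j(x_i; nu); nu has shape (m, 2)."""
    s = nu[:, 0][None, :] + np.outer(x, nu[:, 1])
    s -= s.max(axis=1, keepdims=True)
    w = np.exp(s)
    return w / w.sum(axis=1, keepdims=True)

def e_step(x, y, nu, theta):
    """Responsibilities tau_ij for unit-variance Gaussian experts."""
    k = theta.shape[1] - 1
    h = design(x, k) @ theta.T                    # h_k(x_i; theta_j), shape (n, m)
    log_pi = -0.5 * (y[:, None] - h) ** 2         # log pi up to a constant shared by all j
    log_w = np.log(gates(x, nu)) + log_pi
    log_w -= log_w.max(axis=1, keepdims=True)     # stabilize before exponentiating
    tau = np.exp(log_w)
    return tau / tau.sum(axis=1, keepdims=True)

def m_step_experts(x, y, tau, k):
    """Maximize (9) for Gaussian experts: weighted least squares per expert."""
    X = design(x, k)
    theta = []
    for j in range(tau.shape[1]):
        sw = np.sqrt(tau[:, j])
        coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        theta.append(coef)
    return np.array(theta)

# Simulated data (hypothetical): two regimes split at x = 0.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.where(x < 0, 1.0 + x, -1.0 + x ** 2) + rng.normal(size=200)

nu = np.array([[0.0, -2.0], [0.0, 2.0]])          # gates kept fixed in this sketch
theta = np.zeros((2, 3))                          # m = 2 experts, k = 2
for _ in range(20):
    tau = e_step(x, y, nu, theta)
    theta = m_step_experts(x, y, tau, k=2)
print(theta)
```

A full implementation would also update $\nu$ at each iteration, for example with a few Newton or gradient steps on (8).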

## 3 Main results

In this section we present the main results of the paper. Write the KL-divergence as follows:

$$KL(p_{xy},f_{m,k})=KL(p_{xy},f^*_{m,k})+E\left[\log\frac{f^*_{m,k}}{f_{m,k}}\right], \tag{10}$$

where $f^*_{m,k}$ is the minimizer of $KL(p_{xy},f_{m,k})$ over the class of approximant densities. The first term on the right-hand side is the approximation error and the second term is the estimation error. The approximation error measures “how well” an element of the approximant class approximates the target density, and it approaches zero as $m$ increases. The estimation error measures “how far” the estimated model is from the best approximant in the class. Our goal is to find bounds for both the approximation and estimation errors and to combine these results to find the convergence rate of the maximum likelihood estimator.

### 3.1 Approximation rate

We follow Jiang and Tanner (1999a) to bound the approximation error. Define the upper divergence between $p$ and $f_{m,k}$ as

$$D(p,f_{m,k})=\int_{\Omega}\sum_{j=1}^{m}g_j(x;\nu)\left(h_k(x;\theta_j)-h(x)\right)^2dP_x. \tag{11}$$

We can use the upper divergence to bound the KL-divergence.

###### Lemma 3.1.

Let and . If ,

$$KL(p,f_{m,k})\le M_{\infty}D(p,f_{m,k}),$$

where .

This lemma will be used to bound uniformly the approximation rate of the family of functions .

Before presenting the main conditions, we shall introduce some key concepts.

###### Definition 3.1 (Fine partition).

For , let be a partition of . If and if for all , , for some constant independent of , , or . Then is called a sequence of fine partitions with cardinality and bounding constant .

Here we use some abuse of notation by using as an index of the collection of partitions of . However, this abuse of notation is justified because is an increasing sequence and the collection of partitions depends on an increasing function of . The next definition will be useful later to bound the “growth rate” of the model and is useful to deal with hierarchical mixture of experts (see Jiang and Tanner (1999a)).

###### Definition 3.2 (Subgeometric).

A sequence of numbers is called sub-geometric with rate bounded by if , as , and for all and for some finite constant .

The key idea behind finding the approximation rate is to control the approximation error inside each element of a fine partition of the space. More precisely, we bound the approximation error inside the “worst” (most difficult to approximate) element. We need the following assumption.

###### Assumption 5.

There exists a fine partition of , with bounding constant and cardinality sequence , , such that is sub-geometric with rate bounded by some constant , and there exists a constant , and a parameter vector such that

$$\max_{1\le j\le r_m}\left\|g_j(\cdot;\nu_{c_1})-I_{Q_j^m}(\cdot)\right\|_{1,\lambda}\le\frac{c_1}{r_m}. \tag{12}$$

This assumption is similar to, but weaker than, the one employed in Jiang and Tanner (1999a). It requires that the vector of weight functions $g_j$ approximates the vector of characteristic functions $I_{Q_j^m}$ at a rate not slower than the bound in (12).

The notation is introduced to deal with the hierarchical mixture of experts structure. To allow more flexibility define as the maximum number of experts the structure can hold, e.g. a binary tree with layers has at most experts, and if we increase the number of layers by one, the actual number of experts is somewhere between and (here we are assuming the tree is balanced without loss of generality). If we denote this class of models by , then . The sub-geometric assumption ensures that , where is the actual number of experts in the model.

###### Theorem 3.1 (Approximation rate).

Let and . If assumption 5 holds, then

$$\sup_{p}\inf_{f_{m,k}}KL(p,f_{m,k})\le\frac{c}{m^{2(\alpha\wedge(k+1))/s}} \tag{13}$$

for some constant not depending on or .

This result is a generalization of Jiang and Tanner (1999a) in two directions. First, we allow the target function to be in a Sobolev class with $\alpha$ derivatives; second, we consider a polynomial approximation to the target function in each expert (in fact, their result is a special case when and ). This generalization enables us to address an important problem: whether it is better to mix many simple experts or to mix a few complex experts. The result also holds under more general specifications of densities and experts. If there is also a dispersion parameter to estimate, we only have to modify Lemma 3.1 accordingly and the same result holds.

This rate also agrees with the optimal approximation rate of functions on by piecewise polynomials (Windlund, 1977). One can see that, under assumption 5, it is exactly what we are doing. Therefore this approximation rate is sharp.
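The role of the exponent in (13) can be illustrated with a short sketch. The function names and the constant `c = 1.0` are assumptions made for illustration only; the point is that the exponent $2(\alpha\wedge(k+1))/s$ stops improving once $k+1$ exceeds the smoothness $\alpha$, and that it shrinks as the dimension $s$ grows.

```python
def approx_rate_exponent(alpha: float, k: int, s: int) -> float:
    """Exponent 2 * min(alpha, k + 1) / s of m in the bound (13)."""
    return 2.0 * min(alpha, k + 1) / s

def approx_bound(m: int, alpha: float, k: int, s: int, c: float = 1.0) -> float:
    """Right-hand side of (13); the constant c is unknown, c = 1.0 is a placeholder."""
    return c / m ** approx_rate_exponent(alpha, k, s)

# Hypothetical setting: smoothness alpha = 3, dimension s = 2, m = 10 experts.
for k in range(1, 6):
    print(k, approx_rate_exponent(3, k, 2), approx_bound(10, 3, k, 2))
```

In this hypothetical setting the exponent increases from $k=1$ to $k=2$ and then stays constant, since the polynomial degree can no longer exploit additional smoothness.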

### 3.2 Convergence rate

In this section we deduce the convergence rate for the mixture-of-experts model. Equation (10) gives us an expansion of the KL divergence in terms of the approximation and estimation errors. In the previous section we found a bound for the approximation error; in this section we bound the estimation error and combine it with the approximation error to find the rate of convergence.

The estimation error measures “how far” the estimated function is from the best approximant in the class. We will demonstrate that the estimation error in (10) is $O_p\left((mJ_k+v_m)\log n/n\right)$. We also show that by combining this result with the approximation rate it is possible to achieve a convergence rate of $O_p\left((\log n/n)^{2\tau/(2\tau+s)}\right)$, with $\tau=\alpha\wedge(k+1)$, which is close to the optimal nonparametric rate if . Moreover, if there is a unique identifiable maximizer of the likelihood problem (Assumptions 3 and 4), we are able to remove the “$\log n$” term and achieve a better convergence rate, possibly optimal if .

The next theorem summarizes the convergence rate of the maximum likelihood estimator with respect to the KL divergence between the true density and the estimated density.

###### Theorem 3.2 (Convergence Rate).

Let and denote its maximum likelihood estimator on . Let be allowed to increase such that and as and increase. Under Assumptions 1, 2 and 5,

$$KL(p_{xy},\hat{f}_{m,k})=O_p\left(\frac{1}{m^{2\tau/s}}+(mJ_k+v_m)\frac{\log n}{n}\right), \tag{14}$$

where $\tau=\alpha\wedge(k+1)$. In particular, if we assume , and let $m$ be proportional to $(n/\log n)^{s/(2\tau+s)}$ then

$$KL(p_{xy},\hat{f}_{m,k})=O_p\left(\left(\frac{\log n}{n}\right)^{\frac{2\tau}{2\tau+s}}\right). \tag{15}$$

Although the previous result is derived for the i.i.d. case, it also holds for more general data generating processes. Here we use a uniform probability inequality for i.i.d. processes (van der Geer, 2000) to derive Theorem A.1, but the same result can be obtained by using uniform inequalities for more general processes. This convergence rate is close to the optimal rate found in the sieves literature if , see for instance Stone (1980) and Barron and Sheu (1991).
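The step from (14) to (15) is a balancing argument, sketched numerically below. The sketch assumes, for illustration only, that $J_k$ is held fixed and that $v_m$ grows proportionally to $m$, so that the estimation term is of order $m\log n/n$; all constants are dropped and the function names are hypothetical. Choosing $m$ so that the approximation term $m^{-2\tau/s}$ equals $m\log n/n$ gives $m=(n/\log n)^{s/(2\tau+s)}$ and a common order of $(\log n/n)^{2\tau/(2\tau+s)}$.

```python
import math

def kl_bound(n: int, m: float, tau: float, s: int, J_k: int, v_m: float) -> float:
    """Order of the bound in (14), with all constants set to one."""
    return m ** (-2.0 * tau / s) + (m * J_k + v_m) * math.log(n) / n

def balanced_m(n: int, tau: float, s: int) -> float:
    """Number of experts that equates m**(-2*tau/s) and m*log(n)/n."""
    return (n / math.log(n)) ** (s / (2.0 * tau + s))

def rate_15(n: int, tau: float, s: int) -> float:
    """Order of the rate in (15)."""
    return (math.log(n) / n) ** (2.0 * tau / (2.0 * tau + s))

# Hypothetical values: n = 10_000 observations, tau = 2, s = 2 covariates.
n, tau, s = 10_000, 2.0, 2
m = balanced_m(n, tau, s)
approximation_term = m ** (-2.0 * tau / s)
estimation_term = m * math.log(n) / n
print(m, approximation_term, estimation_term, rate_15(n, tau, s))
```

In practice $m$ must be an integer, so one would round `balanced_m` and accept a bound of the same order.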

To derive this rate we do not assume that there is a unique identifiable maximizer ; in fact, we assume is any such maximizer. The price to pay for such generality is the inclusion of the “$\log n$” term in the convergence rates. If we assume is unique and uniquely identified by a parameter vector , we can exploit the localization property of Theorem A.1. More precisely, we can exploit the fact that we are only interested in the behavior of the empirical process in a neighborhood of . Under such conditions and assuming , we are able to achieve the optimal convergence rate in the sieves literature (Stone, 1980; Barron and Sheu, 1991).

###### Theorem 3.3 (Optimal Convergence Rate).

Let and denote its maximum likelihood estimator on . Let be allowed to increase such that and as and increase. Under Assumptions 1 to 5,

$$KL(p_{xy},\hat{f}_{m,k})=O_p\left(\frac{1}{m^{2\tau/s}}+\frac{mJ_k+v_m}{n}\right), \tag{16}$$

where $\tau=\alpha\wedge(k+1)$. In particular, if we assume , and let $m$ be proportional to $n^{s/(2\tau+s)}$ then

$$KL(p_{xy},\hat{f}_{m,k})=O_p\left(n^{-\frac{2\tau}{2\tau+s}}\right). \tag{17}$$

The same result follows for more general data generating processes and the same considerations after theorem 3.2 hold.

By imposing that a unique maximizer exists, we are able to remove the $\log n$ term and recover the optimal convergence rate for sieve estimates found in the literature.

### 3.3 Consistency

Now we apply the previous results to show that the maximum likelihood estimator is consistent, i.e. the KL divergence between the true density and the estimated model approaches zero as the sample size $n$ and the index of the approximation class go to infinity. Consistency follows essentially from the previous results.

###### Corollary 3.1 (Consistency).

Let and denote its maximum likelihood estimator on . Allow and as and increase. Under Assumptions 1, 2 and 5 , as and increase.

## 4 Effects of m and k

We consider a framework similar to Jiang and Tanner (1999a), but one is allowed to mix GLM1 experts whose terms are polynomials on the variables, as opposed to . We also assume that the true mean function is with , a Sobolev class with derivatives, as opposed to .

By deriving a convergence rate such as (16) in this framework, we are able to gain insight on the two important problems in the area of ME: (i) What number of experts should be chosen, given the size of the training data? (ii) Given the total number of parameters, is it better to use a few very complex experts, or is it better to combine many simple experts?

For question (i), the results in Theorem 3.3 and Corollary 3.1 suggest that good results can be obtained by choosing the number of experts to grow as with some power , which may depend on the dimension of the input space and the underlying smoothness of the target function. Smoother target functions and lower dimensions generally encourage us to use fewer experts.

Question (ii) requires a more detailed study. The complexity of the experts (submodels) is related to $k$, the order of the polynomials. We see that increasing $k$ does improve the approximation rate; however, this improvement is bounded by the number of derivatives $\alpha$ of the function $h$. Moreover, this approximation rate is known to be sharp for piecewise polynomials (Windlund, 1977). The price to pay for this increase in the approximation rate is a larger number of parameters in the model, i.e. a worse estimation error. We provide below a theoretical result on the optimal choice of , as well as some numerical evidence.

First of all, an easier expression of the upper bound of the KL divergence in (16) can be derived as where , where . [This assumes that and uses the fact that the number of parameters needed in -dimensional polynomials of order is bounded by .]

We now study the upper bound , fixing the product , where may depend on and is a bound for the rough order of the total number of parameters.

###### Proposition 4.1.

Let and (†) (which is an upper bound for the convergence rate derived in Theorem 3.3). Then the following statements are true:

I. Fixing the product , is minimized at . The corresponding optimal .

II. If is finite, then achieves the optimal rate under the following choices: is any constant that is at least and does not vary with , and for any constant .

III. If , the following choices will make to have a near-parametric rate : and is constant in , for any constant .

###### Remark 1.

This Proposition suggests that for achieving optimal performance, the (or , related to the complexity of the experts) and the (the number of experts) should be compromised. Fixing an upperbound of the total number of parameters, the optimal . The optimal compromise therefore depends both on (smoothness of the target function) and (the dimension of the input space). The formula implies that (a) a smoother target function (indexed by a larger ) will favor more complex submodels (with larger or ), (b) for a very smooth target function (with large enough ), a higher dimension of the input space will favor the use of simpler submodels (with smaller or , possibly smaller than ) and the use of more experts (bigger ).

###### Remark 2.

Although Result I shows how to construct an -optimal compromise between and , Results II and III show that good convergence rates are quite robust against deviations from these optimal solutions. We note that -optimal convergence rates can always be achieved with not being too large compared to the sample size . This is summarized in the two situations in Results II and III, where we see that even in the case , we only need about for us to achieve a near-parametric convergence rate.

One drawback of the above theoretical analysis is that it uses a rough upper bound (which has a simple expression) for the total number of parameters associated with $k$th order $s$-dimensional polynomials. Below we conduct a numerical study in which the exact number of parameters is used. When considering the choice of $k$, a first impulse is to use polynomials of order , but the number of variables in the model increases exponentially with if . In fact, in many cases it is preferable to use a smaller $k$ and many experts if one wishes to control the size of the estimation error. This is consistent with Remark 1 in our theoretical analysis.
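As a small illustration of how quickly the parameter count grows, the sketch below counts the coefficients of a full polynomial of order $k$ in $s$ variables, that is, the number of monomials of total degree at most $k$, which equals $\binom{s+k}{k}$. Treating each expert as a full polynomial with all interaction terms is an assumption made here; the paper's exact count may differ, and the function name is hypothetical.

```python
from math import comb

def num_poly_params(s: int, k: int) -> int:
    """Number of monomials of total degree <= k in s variables: C(s + k, k)."""
    return comb(s + k, k)

# Parameter count per expert for a few hypothetical dimensions and degrees.
for s in (1, 5, 10):
    print(s, [num_poly_params(s, k) for k in range(1, 6)])
```

For $s=1$ the count grows by one with each degree, while for larger $s$ it grows much faster, which is why a moderate $k$ combined with more experts can be preferable.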

Table 1 compares the approximation error using distinct values of and holding the estimation error fixed. Assume we have variables and , a modeler builds a model with experts and, since it is known that , also chooses . If we further assume , the total number of parameters in the model is . We can see the smallest approximation error is achieved at and .

Similarly, fixing the approximation error we see that a balance between and is necessary. Fix and and assume one wants a model with approximation error proportional to . Table 2 shows that the model with smaller estimation error that achieves this approximation error is the one with and .

This quick exercise illustrates one of the main conclusions of this paper: it is not true that one should always use a few complex models (small $m$ and large $k$) or always opt for many simple ones (small $k$ and large $m$); a balance between $m$ and $k$ should be used instead. Moreover, a small increase in $k$ relative to the linear model ($k=1$) can yield a substantial improvement in the approximation and estimation errors.
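The trade-off can be sketched with a short computation under a fixed parameter budget. The budget, smoothness and dimension below are hypothetical values chosen for illustration and are not the entries of Tables 1 and 2. The sketch also assumes full polynomials with $\binom{s+k}{k}$ coefficients per expert, ignores the parameters of the weight functions, and sets the constant in (13) to one.

```python
from math import comb

def tradeoff(budget: int, alpha: float, s: int, k_max: int):
    """For each degree k, spend the budget on m = budget // J_k experts and
    report (k, m, approximation bound m**(-2*min(alpha, k+1)/s))."""
    rows = []
    for k in range(1, k_max + 1):
        J_k = comb(s + k, k)            # assumed parameter count per expert
        m = budget // J_k
        if m < 1:                       # budget cannot afford a single expert
            break
        rows.append((k, m, m ** (-2.0 * min(alpha, k + 1) / s)))
    return rows

# Hypothetical setting: about 600 expert parameters, alpha = 6, s = 5 covariates.
rows = tradeoff(budget=600, alpha=6, s=5, k_max=6)
for k, m, bound in rows:
    print(k, m, bound)
best_k, best_m, best_bound = min(rows, key=lambda r: r[2])
```

For these inputs the smallest bound is attained at an intermediate degree (`best_k` is 2), neither at the linear model nor at the highest affordable degree, in line with the conclusion above.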

The results in this paper focus only on target density and mixture-of-experts specified in sections 2.1 and 2.2 respectively. However, similar results can be derived for more complex models and target densities.

## 5 Conclusion

In this paper we study the mixture-of-experts model with experts in a one-parameter exponential family with mean , where $h_k$ is a $k$th order polynomial and is the inverse link function. We derive sharp approximation rates with respect to the Kullback-Leibler divergence and the convergence rate of the maximum likelihood estimator to densities in a one-parameter exponential family with mean with , a Sobolev class with $\alpha$ derivatives. We find that the convergence rate of the maximum likelihood estimator to the true density is $O_p\left(m^{-2\tau/s}+(mJ_k+v_m)\log n/n\right)$ with $\tau=\alpha\wedge(k+1)$, where $n$ is the number of observations, $s$ is the number of covariates, $J_k$ is the number of parameters of the polynomial $h_k$, $m$ is the number of experts and $v_m$ is the number of parameters in the weight functions. Further, if the maximum likelihood estimator is uniquely identified we can remove the “$\log n$” term from the convergence rate.

We discuss model specification and its effects on the approximation and estimation errors, and conclude that the best error bound is achieved using a balance between $m$ and $k$, and that the inclusion of polynomial terms might yield better error bounds. Also, the results of this paper can be generalized to more complex target densities and models with simple modifications to the proofs.

We generalize Jiang and Tanner [1999a] in several directions: (i) we assume one can include polynomial terms of the variables on the GLM1 experts; (ii) we assume the target density is in a class, for , instead of ; (iii) we show consistency of the quasi-maximum likelihood estimator for a fixed number of experts; (iv) we calculate non-parametric convergence rates of the maximum likelihood estimator; (v) we show non-parametric consistency when the number of experts and the sample size increase; and finally (vi) we show that, by using polynomials in the experts, one can obtain better approximation and estimation error bounds. These developments shed light on the important questions of how many experts should be chosen and how complex the experts themselves should be.

## Acknowledgements

The authors would like to thank Prof. Marcelo Fernandes and Prof. Marcelo Medeiros for insightful discussions about mixture-of-experts and for comments on previous versions of this manuscript.

## References

• Barron and Sheu [1991] A. Barron and C. Sheu. Approximation of density functions by sequences of exponential families. The Annals of Statistics, 19(3):1347–1369, 1991.
• Bates and White [1985] C. Bates and H. White. A unified theory of consistent estimation for parametric models. Econometric Theory, 1(2):151–178, 1985.
• Carvalho and Tanner [2005a] A. Carvalho and M. Tanner. Modeling nonlinear time series with local mixtures of generalized linear models. Canadian Journal of Statistics, 33(1), 2005a.
• Carvalho and Tanner [2005b] A. Carvalho and M. Tanner. Mixtures-of-experts of autoregressive time series: asymptotic normality and model specification. IEEE Transactions on Neural Networks, 16(1):39–56, 2005b.
• Carvalho and Tanner [2006] A. Carvalho and M. Tanner. Modeling nonlinearities with mixtures-of-experts of time series models. International Journal of Mathematics and Mathematical Sciences, 9:1–22, 2006.
• Carvalho and Tanner [2007] A. Carvalho and M. Tanner. Modelling nonlinear count time series with local mixtures of Poisson autoregressions. Computational Statistics and Data Analysis, 51(11):5266–5294, 2007.
• Celeux et al. [2000] G. Celeux, M. Hurn, and C. Robert. Computation and inferential difficulties with mixture distributions. Journal of the American Statistical Association, 99:957–970, 2000.
• Chen [2006] X. Chen. Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6, 2006.
• Dempster et al. [1977] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.
• Geweke [2007] J. Geweke. Interpretation and inference in mixture models: simple MCMC works. Computational Statistics and Data Analysis, 51:3529–3550, 2007.
• Geweke and Keane [2007] J. Geweke and M. Keane. Smoothly mixing regressions. Journal of Econometrics, 138(1):252–290, 2007.
• Huerta et al. [2003] G. Huerta, W. Jiang, and M. Tanner. Time series modeling via hierarchical mixtures. Statistica Sinica, 13(4):1097–1118, 2003.
• Jacobs et al. [1991] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
• Jennrich [1969] R. Jennrich. Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics, 40(2):633–643, 1969.
• Jiang and Tanner [1999a] W. Jiang and M. Tanner. Hierarchical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation. The Annals of Statistics, pages 987–1011, 1999a.
• Jiang and Tanner [1999b] W. Jiang and M. Tanner. On the identifiability of mixtures-of-experts. Neural Networks, 12(9):1253–1258, 1999b.
• Jordan and Jacobs [1994] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
• Lehmann [1991] E. Lehmann. Theory of point estimation. Thomson Brooks/Cole, 1991.
• McCullagh and Nelder [1989] P. McCullagh and J. Nelder. Generalized linear models. Chapman & Hall/CRC, 1989.
• Mendes et al. [2006] E. Mendes, A. Veiga, and M. Medeiros. Estimation and asymptotic theory for a new class of mixture models. Manuscript, Pontifical Catholic University of Rio de Janeiro, 2006.
• Norets [2010] A. Norets. Approximation of conditional densities by smooth mixtures of regressions. The Annals of Statistics, 38(3):1733–1766, 2010.
• Peng et al. [1996] F. Peng, R. Jacobs, and M. Tanner. Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. Journal of the American Statistical Association, pages 953–960, 1996.
• Stone [1980] C. Stone. Optimal rates of convergence for nonparametric estimators. The Annals of Statistics, 8(6):1348–1360, 1980.
• Stone [1985] C. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689–705, 1985.
• van der Geer [2000] S. van der Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
• van der Vaart and Wellner [1996] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag New York, Inc., 1996.
• Villani et al. [2009] M. Villani, R. Kohn, and P. Giordani. Regression density estimation using smooth adaptive Gaussian mixtures. Journal of Econometrics, 153(2):155–173, 2009.
• White [1996] H. White. Estimation, inference and specification analysis. Cambridge University Press, 1996.
• Windlund [1977] O. Windlund. On best error bounds for approximation by piecewise polynomial functions. Numerische Mathematik, 27:327–338, 1977.
• Wood et al. [2002] S. Wood, W. Jiang, and M. Tanner. Bayesian mixture of splines for spatially adaptive nonparametric regression. Biometrika, 89(3):513–528, 2002.
• Wood et al. [2008] S. Wood, R. Kohn, R. Cottet, W. Jiang, and M. Tanner. Locally adaptive nonparametric binary regression. Journal of Computational and Graphical Statistics, 17(2):352–372, 2008.
• Wood et al. [2011] S. Wood, O. Rosen, and R. Kohn. Bayesian mixtures of autoregressive models. Journal of Computational and Graphical Statistics, 20(1):174–195, 2011.
• Xu and Jordan [1996] L. Xu and M. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8(1):129–151, 1996.
• Yang and Barron [2002] Y. Yang and A. Barron. An asymptotic property of model selection criteria. IEEE Transactions on Information Theory, 44(1):95–116, 2002. ISSN 0018-9448.
• Yang and Ma [2011] Y. Yang and J. Ma. Asymptotic convergence properties of the EM algorithm for mixture of experts. Neural Computation, pages 1–29, 2011.
• Young and Hunter [2010] D. Young and D. Hunter. Mixtures of regressions with predictor-dependent mixing proportions. Computational Statistics & Data Analysis, 54(10):2253–2266, 2010.
• Zeevi et al. [1998] A. Zeevi, R. Meir, and V. Maiorov. Error bounds for functional approximation and estimation using mixtures of experts. IEEE Transactions on Information Theory, 44(3):1010–1025, 1998.

## Appendix A Showing the convergence rate

In this appendix we explain and justify the main steps in proving the convergence rate.

One of the drawbacks of working with the Kullback-Leibler divergence is that it is not bounded. An alternative is to use the Hellinger distance.

###### Definition A.1 (Hellinger Distance).

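Let $P$ and $Q$ denote two probability measures absolutely continuous with respect to some measure $\lambda$. The Hellinger distance between $P$ and $Q$ is given by

$$d_h(P,Q)=\left\{\frac{1}{2}\int\left(\sqrt{\frac{dP}{d\lambda}}-\sqrt{\frac{dQ}{d\lambda}}\right)^{2}d\lambda\right\}^{1/2}. \tag{18}$$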

Alternatively, the Hellinger distance between two densities $p$ and $q$ with respect to $\lambda$ is given by

$$d_h(p,q)=\left\{\frac{1}{2}\int\left(\sqrt{p}-\sqrt{q}\right)^{2}d\lambda\right\}^{1/2}. \tag{19}$$
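As a quick illustration of definition (19), the following minimal sketch evaluates the Hellinger distance between two normal densities by numerical integration and compares it with the known closed form for normal distributions. The choice of densities, the integration range, and all function names are assumptions made for illustration only; they are not part of the paper.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def hellinger_numeric(p, q, lo, hi, n=20000):
    """d_h(p, q) = {0.5 * int (sqrt p - sqrt q)^2 dx}^(1/2), trapezoid rule on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * (math.sqrt(p(x)) - math.sqrt(q(x))) ** 2
    return math.sqrt(0.5 * total * h)

def hellinger_normal(mu1, s1, mu2, s2):
    """Closed form for two normals: d_h^2 = 1 - Bhattacharyya coefficient."""
    bc = math.sqrt(2.0 * s1 * s2 / (s1 ** 2 + s2 ** 2)) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * (s1 ** 2 + s2 ** 2))
    )
    return math.sqrt(1.0 - bc)

# Illustrative densities (assumed): N(0, 1) versus N(1, 4).
d_num = hellinger_numeric(
    lambda x: normal_pdf(x, 0.0, 1.0),
    lambda x: normal_pdf(x, 1.0, 2.0),
    -30.0,
    30.0,
)
d_exact = hellinger_normal(0.0, 1.0, 1.0, 2.0)
print(f"numeric = {d_num:.6f}, closed form = {d_exact:.6f}")
```

With the factor $1/2$ in (19), the distance always lies in $[0,1]$, which is the boundedness property that motivates using it in place of the Kullback-Leibler divergence.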

One can show that if the likelihood ratio is bounded, the KL divergence is bounded by a constant times the square of the Hellinger distance. We use the following result due to Yang and Barron [2002], which is presented together with a basic inequality relating the Hellinger distance and Kullback-Leibler divergence.

###### Lemma A.1 (Yang and Barron [2002]).

Let and , for . Then

$$d_h^{2}(p_{xy},f)\le KL(p_{xy},f)\le 2\left(1+\log c_s\right)d_h^{2}(p_{xy},f),$$

where $d_h(p_{xy},f)$ stands for the Hellinger distance between the densities $p_{xy}$ and $f$ with respect to $\lambda$.
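The relation between the two discrepancies can be explored numerically. The sketch below uses small discrete distributions (an assumption made purely for illustration, as are all names in the code) and reports the squared Hellinger distance, the Kullback-Leibler divergence, their ratio, and the largest likelihood ratio. It checks only the left-hand inequality, which holds in general; the constant on the right-hand side depends on the conditions of the lemma, which this sketch does not reproduce.

```python
import math

def kl(p, q):
    """KL(p, q) = sum_i p_i log(p_i / q_i) for discrete p, q with q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hellinger_sq(p, q):
    """Squared Hellinger distance with the 1/2 normalisation used in (19)."""
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]  # illustrative "true" distribution (assumed)
results = []
for t in (0.5, 0.1, 0.01):
    # q is p shrunk towards the uniform distribution, so p_i / q_i stays bounded.
    q = [(1.0 - t) * pi + t / len(p) for pi in p]
    c = max(pi / qi for pi, qi in zip(p, q))  # largest likelihood ratio
    d2 = hellinger_sq(p, q)
    div = kl(p, q)
    results.append((t, c, d2, div, div / d2))
    print(f"t={t}: max ratio={c:.4f}, d_h^2={d2:.3e}, KL={div:.3e}, KL/d_h^2={div / d2:.3f}")
```

In this example the ratio $KL/d_h^{2}$ stays bounded as the two distributions approach each other, which is the behaviour the lemma formalises when the likelihood ratio is bounded.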

This lemma implies that the Kullback-Leibler divergence is bounded above and below by constant multiples of the squared Hellinger distance, and therefore the convergence rate in the squared Hellinger distance is the same as the convergence rate in the Kullback-Leibler divergence. The only problem is that, in general, the boundedness condition does not hold on the whole set (the support of ). One can overcome this complication by finding the convergence rate inside some subset of where the KL divergence is bounded and controlling the tail probability outside this subset.

Let denote a scalar function of and . For every ,

$$P\left(S(Y,X)>K\right)\le P\left(\{S(Y,X)>K\}\cap B(\beta)\right)+P\left(|Y|>\beta\right). \tag{20}$$

If is bounded, we can choose , and the second term on the right hand side will be zero. Otherwise, we can take to be large enough such that is small or converges to zero at some rate.
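The truncation argument behind inequality (20) can be checked by simulation. The following is a minimal sketch, assuming that $B(\beta)$ is the event $\{|Y|\le\beta\}$ and using a hypothetical scalar function `S` and a hypothetical data-generating process; none of these choices come from the paper.

```python
import random

def S(y, x):
    # Hypothetical scalar function of (Y, X), chosen only for illustration.
    return (y - 0.5 * x) ** 2

random.seed(0)
n = 100_000
K, beta = 4.0, 2.5  # illustrative threshold and truncation level (assumed)

count_lhs = 0   # occurrences of {S(Y, X) > K}
count_in_B = 0  # occurrences of {S(Y, X) > K} together with {|Y| <= beta}
count_tail = 0  # occurrences of {|Y| > beta}
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    y = x + random.gauss(0.0, 1.0)  # unbounded response
    exceeds = S(y, x) > K
    count_lhs += exceeds
    count_in_B += exceeds and abs(y) <= beta
    count_tail += abs(y) > beta

lhs = count_lhs / n
rhs = count_in_B / n + count_tail / n
print(f"P(S > K) ~ {lhs:.4f} <= {rhs:.4f}")
```

Increasing `beta` shrinks the tail term, so the bound is driven by the behaviour of `S` on the truncated region, which is where the bounded likelihood ratio condition can be applied.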

In order to bound the estimation error we shall use results from the theory of empirical processes. The convergence rate theorem presented below is derived for the i.i.d. case; however, the same result holds for martingales (see van der Geer [2000]).

To control the estimation rate inside a class of functions we have to measure how big the class is. Let denote the number of $\varepsilon$-brackets (for a formal definition of bracketing numbers, see van der Vaart and Wellner [1996], Chapters 2.1 and 2.7) with respect to the distance , needed to cover the set and the respective bracketing entropy. Moreover, let denote some finite universal constant that may change each time it appears, and write . We show that, under some conditions,

$$H_B\left(\varepsilon,\mathcal{F}^{1/2}_{m,k},\|\cdot\|_2,\Omega\times A\right)\le \text{const.}\,(mJ_k+v_m)\log\frac{C}{\varepsilon},$$

for some finite constant not depending on .

Hence, our first task is to find the bracketing entropy of . Assumption 1 implies that

$$\frac{\partial\sqrt{f_{m,k}}}{\partial\zeta'}\le\frac{\sqrt{f_{m,k}}}{2}\left[c_g(x)',\ \delta_1F(x,y)',\ \dots,\ \delta_mF(x,y)'\right],$$

where and .

Hence, for any and in some indexed respectively by the parameter vectors and in ,

$$\left|\sqrt{f_1}-\sqrt{f_2}\right|\le c(x,y)\,|\zeta_1-\zeta_2|_2, \tag{21}$$

with . Therefore, the square-root densities in are Lipschitz in parameters, with Lipschitz function .
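The passage from the Lipschitz bound (21) to a bracketing entropy bound follows a standard argument for classes that are Lipschitz in a parameter (see van der Vaart and Wellner [1996], Section 2.7). A sketch is given here under assumptions that are ours rather than restated from the text: the parameter vector $\zeta$ ranges over a bounded set $Z\subset\mathbb{R}^{d}$ with $d=mJ_k+v_m$, and $\|c\|_2<\infty$. Take an $\varepsilon$-net $\zeta_1,\dots,\zeta_N$ of $Z$ and form the brackets $\left[\sqrt{f_{\zeta_i}}-\varepsilon c,\ \sqrt{f_{\zeta_i}}+\varepsilon c\right]$. By (21), every $\sqrt{f_\zeta}$ with $|\zeta-\zeta_i|_2\le\varepsilon$ lies in the $i$-th bracket, and each bracket has size $2\varepsilon\|c\|_2$, so that

$$H_B\left(2\varepsilon\|c\|_2,\ \mathcal{F}^{1/2}_{m,k},\ \|\cdot\|_2\right)\le\log N\le d\,\log\frac{C_0\,\operatorname{diam}(Z)}{\varepsilon},$$

where $C_0$ is a constant depending only on the dimension-free covering argument. Rescaling $\varepsilon$ then gives a bound of the form $\text{const.}\,(mJ_k+v_m)\log(C/\varepsilon)$.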

###### Lemma A.2 (Bracketing Entropy).

Under assumption 1, for any ,

$$H_B\left(\varepsilon,\mathcal{F}^{1/2}_{m,k},\|\cdot\|_2\right)\le \text{const.}\,(mJ_k+v_m)\log\frac{C}{\varepsilon}, \tag{22}$$

where ; and

 ∫δ0+logH1/2B(u,F1/2