# Convergence Rates for Gaussian Mixtures of Experts

We provide a theoretical treatment of over-specified Gaussian mixtures of experts with covariate-free gating networks. We establish the convergence rates of the maximum likelihood estimation (MLE) for these models. Our proof technique is based on a novel notion of algebraic independence of the expert functions. Drawing on optimal transport theory, we establish a connection between the algebraic independence and a certain class of partial differential equations (PDEs). Exploiting this connection allows us to derive convergence rates and minimax lower bounds for parameter estimation.

## 1 Introduction

Gaussian mixtures of experts, a class of piece-wise regression models introduced by [9, 14, 15], have found applications in many fields including social science [8, 7, 2], speech recognition [21, 19], [3, 24, 18, 19], and system identification. Gaussian mixtures of experts differ from classical finite Gaussian mixture models in two ways. First, the mixture components (the “experts”) are regression models, linking the location and scale of a Gaussian model of the response variable to a covariate vector $X$ via parametric models $h_1(X, \theta_1)$ and $h_2(X, \theta_2)$, where $\theta_1$, $\theta_2$ are parameters. Second, the mixing proportions (the “gating network”) are also functions of the covariate vector $X$, via a parametric model that maps $X$ to a probability distribution over the labels of the experts. The overall model can be viewed as a covariate-dependent finite mixture. Despite their popularity in applications, the theoretical understanding of Gaussian mixtures of experts has proved challenging and has lagged behind that of finite mixture models. The inclusion of covariates $X$ in the experts and the gating networks leads to complex interactions of their parameters, which complicates the theoretical analysis.

In the setting of finite mixture models, while the early literature focused on identifiability issues [25, 26, 27, 17], recent work has provided a substantive inferential theory; see for example [23, 20, 5, 6]. Chen set the stage for these recent developments by establishing a convergence rate of $n^{-1/4}$ for parameter estimation in the univariate setting of over-specified mixture models. Later, Nguyen used the Wasserstein metric to analyze the posterior convergence rates of parameter estimation for both finite and infinite mixtures. Recently, Ho et al. provided a unified framework to rigorously characterize the convergence rates of parameter estimation based on the singularity structures of finite mixture models. Their results demonstrated that there is a connection between the singularities of these models and the algebraic-geometric structure of the parameter space.

Moving to Gaussian mixtures of experts, a classical line of research focused on identifiability in these models and on parameter estimation in the setting of exact-fitted models where the true number of components is assumed known [11, 10, 12]. This assumption is, however, overly strong for most applications; the true number of components is rarely known in practice. There are two common practical approaches to deal with this issue. The first approach relies on model selection, most notably the BIC penalty. This approach is, however, computationally expensive, as we need to search for the optimal number of components over all the possible values. Furthermore, the sample size may not be large enough to support this form of inference. The second approach is to over-specify the true model, by using rough prior knowledge to specify more components than are necessary. However, theoretical analysis is challenging in this setting, given the complicated interaction among the parameters of the expert functions, a phenomenon that does not occur in the exact-fitted setting of Gaussian mixtures of experts. Another challenge arises from inhomogeneity: some parameters tend to have faster convergence rates than other parameters. This inhomogeneity makes it nontrivial to develop an appropriate distance for characterizing convergence rates.

In the current paper we focus on a simplified setting in which the expert functions are covariate-dependent, but the gating network is not. We refer to this as the Gaussian mixture of experts with covariate-free gating functions (GMCF) model. Although simplified, this model captures the core of the mixtures-of-experts problem, which is the interactions among the different mixture components. We believe that the general techniques that we develop here can be extended to the full mixtures-of-experts model (in particular, by an appropriate generalization of the transportation distance to capture the variation of parameters from the gating networks), but we leave the development of that direction to future work.

### 1.1 Setting

We propose a general theoretical framework for analyzing the statistical performance of maximum likelihood estimation (MLE) for parameters in the setting of over-specified Gaussian mixtures of experts with covariate-free gating functions. In particular, we assume that $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. samples from a Gaussian mixture of experts with covariate-free gating functions (GMCF) of order $k_0$, with conditional density function $g_{G_0}(Y|X)$:

$$g_{G_0}(Y|X) := \sum_{i=1}^{k_0} \pi_i^0 \, f\big(Y \,\big|\, h_1(X, \theta_{1i}^0), h_2(X, \theta_{2i}^0)\big), \tag{1}$$

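where $G_0 := \sum_{i=1}^{k_0} \pi_i^0 \delta_{(\theta_{1i}^0, \theta_{2i}^0)}$ is a true but unknown probability measure (mixing measure) and $(\theta_{1i}^0, \theta_{2i}^0) \in \Omega$ for all $1 \leq i \leq k_0$. We over-specify the true model by choosing $k > k_0$ components.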

We estimate $G_0$ under the over-specified GMCF model via maximum likelihood estimation (MLE). We denote the MLE as $\widehat{G}_n$. Our results reveal a fundamental connection between the algebraic structure of the expert functions $h_1$ and $h_2$ and the convergence rates of the MLE through a general version of the optimal transport distance, which we refer to as the generalized transportation distance. A similar distance has been used to study the effect of algebraic singularities on parameter estimation in classical finite mixtures.

### 1.2 Generalized transportation distance

In contrast to the traditional Wasserstein metric, the generalized transportation distance assigns a different order to each parameter. This special property of the generalized transportation distance provides us with a tool to capture the inhomogeneity of parameter estimation in Gaussian mixtures of experts. In order to define the generalized transportation distance, we first define the semi-metric $d_\kappa(\cdot, \cdot)$ for any vector $\kappa = (\kappa_1, \ldots, \kappa_{q_1+q_2})$ as follows:

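$$d_\kappa(\theta_1, \theta_2) := \left( \sum_{i=1}^{q_1+q_2} \big| \theta_1^{(i)} - \theta_2^{(i)} \big|^{\kappa_i} \right)^{1/\|\kappa\|_\infty},$$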

for any $\theta_1, \theta_2 \in \mathbb{R}^{q_1+q_2}$. Generally, $d_\kappa$ does not satisfy the standard triangle inequality. More precisely, when not all $\kappa_i$ are identical, $d_\kappa$ satisfies a triangle inequality only up to some positive constant less than one. When all $\kappa_i$ are identical, $d_\kappa$ becomes a metric.
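The semi-metric above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function name `d_kappa` and the example vectors are assumptions chosen here for demonstration.

```python
def d_kappa(theta1, theta2, kappa):
    """Semi-metric d_kappa: (sum_i |theta1_i - theta2_i|**kappa_i) ** (1 / max(kappa))."""
    if not (len(theta1) == len(theta2) == len(kappa)):
        raise ValueError("theta1, theta2 and kappa must have the same length")
    total = sum(abs(a - b) ** k for a, b, k in zip(theta1, theta2, kappa))
    return total ** (1.0 / max(kappa))


# Identical orders: kappa = (2, 2) reduces to the Euclidean distance.
uniform = d_kappa((0.0, 0.0), (3.0, 4.0), (2, 2))

# Mixed orders: the same perturbation eps is charged differently depending on
# which coordinate it is placed in.
eps = 0.1
high_order_coord = d_kappa((0.0, 0.0), (eps, 0.0), (4, 2))  # coordinate with order 4
low_order_coord = d_kappa((0.0, 0.0), (0.0, eps), (4, 2))   # coordinate with order 2
```

Read the other way around, a bound $d_\kappa \leq \epsilon$ with $\kappa = (4, 2)$ only forces the first coordinate to be within $\epsilon$, but forces the second to be within $\epsilon^2$. This is how different orders encode different rates for different parameters.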

Now, we let $G$ be some probability measure. The generalized transportation distance between $G$ and $G_0$ with respect to $\kappa$ is given by:

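$$\widetilde{W}_\kappa(G, G_0) := \left( \inf \sum_{i,j} q_{ij} \, d_\kappa^{\|\kappa\|_\infty}(\eta_i, \eta_j^0) \right)^{1/\|\kappa\|_\infty}, \tag{2}$$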

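where the infimum is taken over all couplings $q = (q_{ij})$ between the weights $\pi$ of $G$ and the weights $\pi^0$ of $G_0$; i.e., where $\sum_{j} q_{ij} = \pi_i$ and $\sum_{i} q_{ij} = \pi_j^0$. Additionally, $\eta_i = (\theta_{1i}, \theta_{2i})$ and $\eta_j^0 = (\theta_{1j}^0, \theta_{2j}^0)$ for all $i, j$.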
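For discrete measures, the infimum in (2) is a finite linear program over the coupling matrix. The following is a minimal sketch that solves it with SciPy's `linprog`; the function name `generalized_transport` and the example measures are illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog


def generalized_transport(pi, eta, pi0, eta0, kappa):
    """Generalized transportation distance between two discrete measures.

    pi, pi0   : weights of G and G0 (each sums to one)
    eta, eta0 : atoms of G and G0, arrays of shape (k, q) and (k0, q)
    kappa     : per-coordinate orders, length q
    """
    pi, pi0 = np.asarray(pi, dtype=float), np.asarray(pi0, dtype=float)
    eta, eta0 = np.asarray(eta, dtype=float), np.asarray(eta0, dtype=float)
    kappa = np.asarray(kappa, dtype=float)
    k, k0 = len(pi), len(pi0)

    # cost[i, j] = d_kappa(eta_i, eta0_j) ** ||kappa||_inf
    cost = (np.abs(eta[:, None, :] - eta0[None, :, :]) ** kappa).sum(axis=2)

    # Marginal constraints on the coupling q, flattened row-major:
    # sum_j q_ij = pi_i and sum_i q_ij = pi0_j.
    a_eq = np.zeros((k + k0, k * k0))
    for i in range(k):
        a_eq[i, i * k0:(i + 1) * k0] = 1.0
    for j in range(k0):
        a_eq[k + j, j::k0] = 1.0
    b_eq = np.concatenate([pi, pi0])

    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    # Guard against a tiny negative objective caused by round-off.
    return max(res.fun, 0.0) ** (1.0 / kappa.max())


# Over-specified G with two atoms versus a single-atom G0, with orders (4, 2).
w = generalized_transport([0.5, 0.5], [[0.0, 0.0], [2.0, 0.0]],
                          [1.0], [[0.0, 0.0]], (4, 2))
```

Because $G_0$ has a single atom here, the only feasible coupling sends all mass to it, so the program is trivial. With several atoms on both sides the solver picks the cheapest matching.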

In general, the convergence rates of mixing measures under generalized Wasserstein distance translate directly to the convergence rates of their associated atoms or parameters. More precisely, assume that there exist a sequence and a vector such that at rate as . Then, we can find a sub-sequence of such that each atom (support) of is the limit point of atoms of . Additionally, the convergence rates for estimating , the th component of , are while those for estimating are for and . Furthermore, the convergence rates for estimating the weights associated with these parameters are . Finally, there may exist some atoms of that converge to limit points outside the atoms of . The convergence rates of these limit points are also similar to those for estimating the atoms of .

### 1.3 Main contribution

The generalized transportation distance in (2) allows us to introduce a notion of algebraic independence between expert functions $h_1$ and $h_2$ that is expressed in the language of partial differential equations (PDEs). Using this notion, we are able to characterize the convergence rates of parameter estimation for several choices of expert functions $h_1$ and $h_2$ when they are either algebraically independent or not. Our overall contributions in the paper can be summarized as follows:

• Algebraically independent settings: When the expert functions and are algebraically independent, we establish the best possible convergence rate of order for (up to a logarithmic factor) where . Furthermore, we demonstrate that this convergence rate is minimax. That result directly translates to a convergence rate of for the support of .

• Algebraically dependent settings: When the expert functions $h_1$ and $h_2$ are algebraically dependent, we prove that the convergence rates of parameter estimation are very slow and inhomogeneous. More precisely, the rates of convergence are determined either by the solvability of a system of polynomial equations or by the admissibility of a system of polynomial limits. The formulations of these systems depend on the PDEs that capture the interactions among the parameters for the expert functions. Furthermore, we show that the inhomogeneity of parameter estimation can be characterized based on the generalized transportation distance.

Organization. The remainder of the paper is organized as follows. In Section 2, we introduce the problem setup for Gaussian mixtures of experts with covariate-free gating functions. Section 3 establishes convergence rates for parameter estimation and provides global minimax lower bounds under the algebraically independent setting. In Section 4, we consider various settings in which the expert functions are algebraically dependent and establish the convergence rates of parameter estimation under these settings. We provide proofs for a few key results in Section 5 while deferring the majority of the proofs to the Appendices. Finally, we conclude in Section 6.

Notation. For any vector , we use superscript and subscript notation interchangeably, letting or . Thus, either or is the -th component of . Additionally, for each , we denote for any . Finally, for any two vectors , we write if for all and if and .

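For any two density functions $p, q$ (with respect to the Lebesgue measure $\mu$), the total variation distance is given by $V(p, q) := \frac{1}{2} \int |p(x) - q(x)| \, d\mu(x)$. The squared Hellinger distance is defined as $h^2(p, q) := \frac{1}{2} \int \big( \sqrt{p(x)} - \sqrt{q(x)} \big)^2 \, d\mu(x)$.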
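As a numerical illustration of these two distances, the sketch below approximates them by a midpoint rule for two univariate Gaussian densities. It assumes the common normalization with a factor of 1/2 in both definitions; the helper names and the integration range are illustrative choices.

```python
import math


def normal_pdf(x, mean, std):
    """Density of N(mean, std**2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def tv_and_hellinger_sq(p, q, lo, hi, steps=20000):
    """Midpoint-rule approximations of the total variation distance and the
    squared Hellinger distance, both normalized with a factor 1/2 (assumed)."""
    width = (hi - lo) / steps
    tv_sum = hell_sum = 0.0
    for s in range(steps):
        x = lo + (s + 0.5) * width
        px, qx = p(x), q(x)
        tv_sum += abs(px - qx)
        hell_sum += (math.sqrt(px) - math.sqrt(qx)) ** 2
    return 0.5 * tv_sum * width, 0.5 * hell_sum * width


tv, hell_sq = tv_and_hellinger_sq(
    lambda x: normal_pdf(x, 0.0, 1.0),
    lambda x: normal_pdf(x, 1.0, 1.0),
    -12.0, 13.0,
)
```

For two unit-variance Gaussians whose means differ by one, both quantities have closed forms under this normalization, which makes the example easy to sanity-check.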

## 2 Background

In this section, we provide the necessary background for our analysis of the convergence rates of the MLE under over-specified Gaussian mixtures of experts with covariate-free gating functions. In particular, in Section 2.1, we define the over-specified Gaussian mixture of experts with covariate-free gating functions, and in Section 2.2, we establish identifiability and smoothness properties for these models as well as establishing the convergence rates of density estimation.

### 2.1 Problem setup

Let $Y$ be a response variable of interest and let $X$ be a vector of covariates believed to have an effect on $Y$. We start with a definition of identifiable expert functions.

###### Definition 1.

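Suppose we are given a parameter set $\Theta \subset \mathbb{R}^q$ for some $q \geq 1$. We say that an expert function $h(X, \eta)$ is identifiable if $h(X, \eta_1) = h(X, \eta_2)$ almost surely for $\eta_1, \eta_2 \in \Theta$ implies $\eta_1 = \eta_2$.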

Recall that we focus on Gaussian mixtures of experts [9, 14, 15] for which the gating functions are independent of the covariate $X$. In particular, we denote $\{ f(\cdot \mid \mu, \sigma) : \mu \in \mathbb{R},\ \sigma > 0 \}$ as the family of location-scale univariate Gaussian distributions and define our models of interest as follows.

###### Definition 2.

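Assume that we are given two identifiable expert functions $h_1(X, \theta_1)$ and $h_2(X, \theta_2)$, where $\theta_i \in \Omega_i \subset \mathbb{R}^{q_i}$ for given dimensions $q_i \geq 1$ as $1 \leq i \leq 2$. Let $\pi_1, \ldots, \pi_k$ denote weights with $\sum_{i=1}^{k} \pi_i = 1$. We say that $(X, Y)$ follows a Gaussian mixture of experts with covariate-free gating functions (GMCF) of order $k$, with respect to expert functions $h_1$, $h_2$ and gating functions $\pi_1, \ldots, \pi_k$, if the conditional density function of $Y$ given $X$ has the following form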

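$$g_G(Y|X) := \int f\big(Y \,\big|\, h_1(X, \theta_1), h_2(X, \theta_2)\big) \, dG(\theta_1, \theta_2) = \sum_{i=1}^{k} \pi_i \, f\big(Y \,\big|\, h_1(X, \theta_{1i}), h_2(X, \theta_{2i})\big),$$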

where $G = \sum_{i=1}^{k} \pi_i \delta_{(\theta_{1i}, \theta_{2i})}$ is a discrete probability measure that has exactly $k$ atoms on $\Omega$.

As an example, when , generalized linear expert functions take the form and .
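The conditional density in Definition 2 can be evaluated directly from the weights, the atoms, and the two expert functions. The sketch below is illustrative only: `h1_linear` and `h2_constant` are hypothetical expert functions picked for the example (a linear location and a covariate-free scale), not forms taken from the paper.

```python
import math


def gaussian_pdf(y, mean, std):
    """Density of N(mean, std**2) evaluated at y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gmcf_density(y, x, weights, atoms, h1, h2):
    """g_G(y | x) = sum_i pi_i * f(y | h1(x, theta_1i), h2(x, theta_2i))."""
    return sum(
        w * gaussian_pdf(y, h1(x, theta1), h2(x, theta2))
        for w, (theta1, theta2) in zip(weights, atoms)
    )


# Hypothetical expert functions, chosen only for illustration.
def h1_linear(x, theta1):
    return theta1[0] + theta1[1] * x   # location: intercept + slope * x


def h2_constant(x, theta2):
    return theta2                      # scale: does not depend on x


weights = [0.3, 0.7]                                    # covariate-free gating
atoms = [((0.0, 1.0), 0.5), ((2.0, -1.0), 1.5)]         # (theta_1i, theta_2i)
value = gmcf_density(1.0, 1.0, weights, atoms, h1_linear, h2_constant)
```

Note that the weights do not depend on `x`: this is what "covariate-free gating" means in the GMCF model, and it is the only place where the sketch differs from a full mixture of experts.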

##### Over-specified GMCF

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. draws from a GMCF of order $k_0$ with conditional density function $g_{G_0}(Y|X)$, where $G_0$ is a true but unknown probability measure (mixing measure). Since $k_0$ is generally unknown in practice, one popular approach to estimating the mixing measure $G_0$ is to over-specify the true number of components $k_0$. In particular, we fit the true model with $k > k_0$ components, where $k$ is a given threshold that is chosen based on prior domain knowledge. We refer to this setting as the over-specified GMCF.

##### Maximum likelihood estimation (MLE)

To obtain an estimate of $G_0$, we define the MLE as follows:

$$\widehat{G}_n := \mathop{\arg\max}_{G \in \mathcal{G}} \sum_{i=1}^{n} \log\big( g_G(Y_i | X_i) \big), \tag{3}$$

where $\mathcal{G}$ is some subset of $\mathcal{O}_k(\Omega)$, namely, the set of all discrete probability measures with at most $k$ components. Detailed formulations of $\mathcal{G}$ will be given later based on the specific structures of expert functions $h_1$ and $h_2$.
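To make the objective in (3) concrete, the sketch below evaluates the log-likelihood of a simulated sample under two mixing measures. The expert functions (linear location, constant scale), the simulated data, and all names are assumptions made for illustration. The example also shows why over-specification is delicate: splitting one atom of the true measure into two copies yields a different mixing measure with the same likelihood.

```python
import math
import random


def gaussian_pdf(y, mean, std):
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gmcf_density(y, x, weights, atoms):
    # Hypothetical experts: location theta1[0] + theta1[1] * x, scale theta2.
    return sum(
        w * gaussian_pdf(y, t1[0] + t1[1] * x, t2)
        for w, (t1, t2) in zip(weights, atoms)
    )


def log_likelihood(sample, weights, atoms):
    """Objective of the MLE: sum_i log g_G(Y_i | X_i)."""
    return sum(math.log(gmcf_density(y, x, weights, atoms)) for x, y in sample)


# Simulate n draws from a two-component model (k0 = 2).
rng = random.Random(0)
true_weights = [0.4, 0.6]
true_atoms = [((0.0, 1.0), 0.5), ((1.0, -2.0), 1.0)]
sample = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    t1, t2 = true_atoms[0] if rng.random() < true_weights[0] else true_atoms[1]
    sample.append((x, rng.gauss(t1[0] + t1[1] * x, t2)))

# Over-specified measure with k = 3: the second atom is split into two copies.
over_weights = [0.4, 0.25, 0.35]
over_atoms = [true_atoms[0], true_atoms[1], true_atoms[1]]

ll_true = log_likelihood(sample, true_weights, true_atoms)
ll_over = log_likelihood(sample, over_weights, over_atoms)
```

Both measures induce the same conditional density, so the likelihood alone cannot separate them. The transportation distance is designed to treat them as the same mixing measure.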

##### Universal assumptions and notation

Throughout this paper, we assume that and are compact subsets of and respectively. Additionally, and is a random vector and has a given prior density function , which is independent of the choices of expert functions , . Furthermore, is a fixed compact set of . Finally we denote

$$p_G(X, Y) := g_G(Y|X) \, \bar{f}(X)$$

as the joint distribution (or equivalently mixing density) of $X$ and $Y$ for any $G \in \mathcal{O}_k(\Omega)$.

### 2.2 General identifiability, smoothness condition, and density estimation

In order to establish the convergence rates of $\widehat{G}_n$, our analysis relies on three main ingredients: general identifiability of the GMCF, Hölder continuity of the GMCF up to any order $r \geq 1$, and parametric convergence rates for density estimation under the over-specified GMCF. We begin with the following result regarding the identifiability of GMCF.

###### Proposition 1.

For given identifiable expert functions $h_1$ and $h_2$, the GMCF is identifiable with respect to $h_1$ and $h_2$, namely, whenever there are finite discrete probability measures $G$ and $G'$ on $\Omega$ such that $p_G(X, Y) = p_{G'}(X, Y)$ almost surely in $(X, Y)$, then it follows that $G \equiv G'$.

A second result that plays a central role in analyzing convergence of the MLE in over-specified GMCF is the uniform Hölder continuity, formulated as follows:

###### Proposition 2.

For any $r \geq 1$, the GMCF admits the uniform Hölder continuity up to the $r$th order, with respect to the expert functions $h_1$, $h_2$ and prior density function $\bar{f}$:

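$$\sum_{|\kappa| = r} \bar{f}(x) \left| \left( \frac{\partial^{|\kappa|} f}{\partial \theta_1^{\kappa_1} \partial \theta_2^{\kappa_2}}\big(y \,\big|\, h_1(x, \theta_1), h_2(x, \theta_2)\big) - \frac{\partial^{|\kappa|} f}{\partial \theta_1^{\kappa_1} \partial \theta_2^{\kappa_2}}\big(y \,\big|\, h_1(x, \theta_1'), h_2(x, \theta_2')\big) \right) \gamma^{\kappa} \right| \leq C \, \big\| (\theta_1, \theta_2) - (\theta_1', \theta_2') \big\|^{\delta} \, \|\gamma\|^{r},$$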

for any and for some positive constants and that are independent of and . Here, where for any .

Finally, when the expert functions $h_1$ and $h_2$ are sufficiently smooth in terms of their parameters, we can guarantee the parametric convergence rate of density estimation.

###### Proposition 3.

Assume that the expert functions and are twice differentiable with respect to their parameters. Additionally, assume that there exist positive constants such that , for all . Then, the following holds:

$$\mathbb{P}\Big( h\big(p_{\widehat{G}_n}, p_{G_0}\big) > C (\log n / n)^{1/2} \Big) \precsim \exp(-c \log n) \tag{4}$$

for universal positive constants and that depend only on .

The proof of Proposition 3 is provided in Appendix C.

## 3 Algebraically independent expert functions

In this section, we consider the MLE in (3) over the entire parameter space $\mathcal{O}_k(\Omega)$. That is, we let $\mathcal{G} = \mathcal{O}_k(\Omega)$. To analyze the convergence rates of the MLE under the over-specified GMCF, we capture the algebraic interaction among the expert functions $h_1$ and $h_2$ via the following definition.

###### Definition 3.

We say that the expert functions $h_1$, $h_2$ are algebraically independent if they are twice differentiable with respect to their parameters $\theta_1$ and $\theta_2$ and the following holds:

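• For any $(\theta_1, \theta_2) \in \Omega$, if we have $\alpha_i, \beta_{uv} \in \mathbb{R}$ (for $1 \leq i \leq q_2$, and $1 \leq u, v \leq q_1$) such that $\beta_{uv} = \beta_{vu}$ and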

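$$\sum_{i=1}^{q_2} \alpha_i \frac{\partial h_2^2}{\partial \theta_2^{(i)}}(X, \theta_2) + \sum_{1 \leq u, v \leq q_1} \beta_{uv} \frac{\partial h_1}{\partial \theta_1^{(u)}}(X, \theta_1) \frac{\partial h_1}{\partial \theta_1^{(v)}}(X, \theta_1) = 0,$$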

almost surely in $X$, then we must also have $\alpha_i = \beta_{uv} = 0$ for all $1 \leq i \leq q_2$ and $1 \leq u, v \leq q_1$.

Note that in this definition we use the convention that if almost surely for some , then we have . The same convention goes for other derivatives in Condition (O.1). An equivalent way to express the algebraic independence notion in Definition 3 is that the elements in a set of partial derivatives,

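$$\left\{ \frac{\partial h_1}{\partial \theta_1^{(u)}}(X, \theta_1) \frac{\partial h_1}{\partial \theta_1^{(v)}}(X, \theta_1), \ \frac{\partial h_2^2}{\partial \theta_2^{(i)}}(X, \theta_2) : \ 1 \leq i \leq q_2, \ 1 \leq u, v \leq q_1 \right\},$$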

are linearly independent with respect to $X$. To exemplify Definition 3, consider the following simple examples of expert functions $h_1$ and $h_2$ that are algebraically independent.

###### Example 3.1.

(a) Let . If we choose expert functions and for all and , then and are algebraically independent.

(b) Let . If we choose expert functions for all , where and for all , then are algebraically independent.

Under the algebraic independence condition for the expert functions $h_1$ and $h_2$, we have the following result regarding the convergence rates of parameter estimation as well as their corresponding minimax lower bound under the over-specified GMCF model.

###### Theorem 1.

Assume that expert functions $h_1$ and $h_2$ are algebraically independent. Then, we have:

• (Maximum likelihood estimation) There exists a positive constant depending on and such that

$$\mathbb{P}\Big( \widetilde{W}_\kappa\big(\widehat{G}_n, G_0\big) > C_0 (\log n / n)^{1/4} \Big) \precsim \exp(-c \log n),$$

where and is a positive constant depending only on .

• (Minimax lower bound) For any such that ,

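$$\inf_{\overline{G}_n} \ \sup_{G \in \mathcal{O}_k(\Omega) \setminus \mathcal{O}_{k_0 - 1}(\Omega)} \mathbb{E}_{p_G}\Big( \widetilde{W}_{\kappa'}\big(\overline{G}_n, G\big) \Big) \geq c' \, n^{-1/(2 \|\kappa'\|_\infty)}.$$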

Here, the infimum is taken over all sequences of estimates . Furthermore, denotes the expectation taken with respect to the product measure with mixture density , and stands for a universal constant depending only on .

The proof of Theorem 1 is deferred to Section 5.1.

Remark: First, part (a) of Theorem 1 establishes a convergence rate of $n^{-1/4}$ (up to a logarithmic factor) of $\widehat{G}_n$ to $G_0$ under the generalized transportation distance, while part (b) of the theorem indicates that this convergence rate is minimax. The convergence rate of $\widehat{G}_n$ suggests that the rate of estimating individual components and is for and . The main reason for these slow convergence rates is the singularity of the Fisher information matrix for these components. Such a singularity phenomenon is caused by fitting the true model with a larger model, a phenomenon which has been observed previously in traditional mixture model settings under strong identifiability [1, 20].

Second, we would like to emphasize that Theorem 1 is not only of theoretical interest. Indeed, it provides insight into the choice of expert functions that are likely to have favorable convergence in practice. When the expert functions are not algebraically independent, we demonstrate in the next section that the convergence rates of parameter estimation in over-specified GMCF are very slow and depend on a notion of complexity level of over-specification.

## 4 Algebraically dependent expert functions

In the previous section we established a convergence rate for the MLE as well as a minimax lower bound when the expert functions $h_1$ and $h_2$ are algebraically independent. In many scenarios, however, the expert functions are taken to be algebraically dependent. Here we show that in these settings the convergence rates of the MLE can be much slower than $n^{-1/4}$.

To simplify our proofs in the algebraically dependent case, we focus on the case in which the MLE is restricted to a parameter space that has the following structure:

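$$\mathcal{G} = \mathcal{O}_{k, \bar{c}_0}(\Omega) = \left\{ G = \sum_{i=1}^{l} \pi_i \delta_{(\theta_{1i}, \theta_{2i})} : \ 1 \leq l \leq k \ \text{ and } \ \pi_i \geq \bar{c}_0 \ \ \forall i \right\}.$$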

That is, we consider the set of discrete probability measures with at most $k$ components such that their weights are lower bounded by $\bar{c}_0$ for some given sufficiently small positive number $\bar{c}_0$. Under this assumption, the true but unknown mixing measure $G_0$ is assumed to have $\pi_i^0 \geq \bar{c}_0$ for $1 \leq i \leq k_0$.

### 4.1 Linear expert functions and uniform convergence rates of the MLE

In this section, we consider a few representative examples involving expert functions $h_1$ and $h_2$ that are algebraically dependent. We establish the corresponding convergence rates of the MLE for these examples. Our analysis will be divided into two distinct choices for $h_2$: when $h_2$ is covariate independent and when $h_2$ depends on the covariate.

#### 4.1.1 Covariate-independent expert function h2

We first consider an algebraic dependence setting where the expert function $h_2$ is independent of the covariate $X$.

###### Example 4.1.

Let the expert functions be for all and for all . These expert functions and are algebraically dependent, as characterized via the following PDE relating and

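$$\left( \frac{\partial h_1}{\partial \theta_1^{(1)}}(X, \theta_1) \right)^2 = \frac{\partial h_2^2}{\partial \theta_2}(X, \theta_2), \tag{5}$$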

for all and .
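As a worked check of how such a PDE arises, suppose for illustration that the location expert is linear in the covariate and that the squared scale expert is a single covariate-free parameter. These specific forms are an assumption made here for the sketch, chosen to be consistent with the PDE (5) and the heading of this subsection.

```latex
% Assumed forms (illustrative only):
%   h_1(X, \theta_1) = \theta_1^{(1)} + \theta_1^{(2)} X,
%   h_2^2(X, \theta_2) = \theta_2.
\begin{align*}
\frac{\partial h_1}{\partial \theta_1^{(1)}}(X, \theta_1) = 1,
\qquad
\frac{\partial h_1}{\partial \theta_1^{(2)}}(X, \theta_1) = X,
\qquad
\frac{\partial h_2^2}{\partial \theta_2}(X, \theta_2) = 1.
\end{align*}
% Hence the PDE (5) holds identically in X:
\begin{align*}
\left( \frac{\partial h_1}{\partial \theta_1^{(1)}}(X, \theta_1) \right)^2
  - \frac{\partial h_2^2}{\partial \theta_2}(X, \theta_2) = 1 - 1 = 0 .
\end{align*}
% In the notation of Definition 3, the coefficients \alpha_1 = -1,
% \beta_{11} = 1 (all other \beta_{uv} = 0) give a vanishing linear
% combination with nonzero coefficients, so algebraic independence fails.
```

Under these assumed forms, the derivative with respect to the slope parameter $\theta_1^{(2)}$ equals $X$, which does not enter this vanishing combination. Only the intercept and the scale parameter are tied together.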

Let $\bar{r}$ be the minimum value of $r$ such that the following system of polynomial equations:

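$$\sum_{j=1}^{k - k_0 + 1} \ \sum_{n_1, n_2} \frac{c_j^2 \, a_j^{n_1} \, b_j^{n_2}}{n_1! \, n_2!} = 0 \quad \text{for each } \alpha = 1, \ldots, r, \tag{6}$$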

does not have any nontrivial solution for the unknown variables $(a_j, b_j, c_j)_{j=1}^{k - k_0 + 1}$. The ranges of $n_1$ and $n_2$ in the second sum consist of all natural pairs satisfying the equation $n_1 + 2 n_2 = \alpha$. A solution to the above system is considered nontrivial if all of the variables $c_j$ are nonzero, while at least one of the $a_j$ is nonzero.
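The left-hand sides of system (6) are easy to evaluate exactly for a candidate solution. The sketch below does so with rational arithmetic, assuming the index constraint $n_1 + 2 n_2 = \alpha$ used in earlier work on over-specified Gaussian mixtures; the function name and the candidate values are illustrative. Evaluating a candidate certifies that a particular $r$ admits a nontrivial solution, but it does not by itself prove that no solution exists for a larger $r$.

```python
from fractions import Fraction
from math import factorial


def system_residuals(a, b, c, r):
    """Exact left-hand sides of the polynomial system for alpha = 1, ..., r."""
    residuals = []
    for alpha in range(1, r + 1):
        total = Fraction(0)
        for n2 in range(alpha // 2 + 1):
            n1 = alpha - 2 * n2  # assumed index constraint: n1 + 2 * n2 = alpha
            for aj, bj, cj in zip(a, b, c):
                term = Fraction(cj) ** 2 * Fraction(aj) ** n1 * Fraction(bj) ** n2
                total += term / (factorial(n1) * factorial(n2))
        residuals.append(total)
    return residuals


# Two unknown triples, i.e. k - k0 + 1 = 2 (one extra component).
a = [1, -1]
b = [Fraction(-1, 2), Fraction(-1, 2)]
c = [1, 1]

first_three = system_residuals(a, b, c, 3)   # equations alpha = 1, 2, 3
with_fourth = system_residuals(a, b, c, 4)   # adds the equation alpha = 4
```

This candidate is nontrivial (all `c` entries and at least one `a` entry are nonzero) and solves the first three equations, so with one extra component the minimal $r$ must be at least four. The same candidate fails the fourth equation.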

Our use of the parameter $\bar{r}$ builds on earlier work that used it to establish convergence rates in the setting of over-specified Gaussian mixtures. The following theorem shows that $\bar{r}$ plays a role in our setting in both the upper bound for the convergence of the MLE and in the minimax lower bound.

###### Theorem 2.

Given expert functions for and for , the following holds:

• (Maximum likelihood estimation) There exists a positive constant depending only on and such that

$$\mathbb{P}\Big( \widetilde{W}_\kappa\big(\widehat{G}_n, G_0\big) > C_0 (\log n / n)^{1/(2\bar{r})} \Big) \precsim \exp(-c \log n),$$

where and is defined in (6). Here, is a positive constant depending only on .

• (Minimax lower bound) For any such that ,

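$$\inf_{\overline{G}_n} \ \sup_{G \in \mathcal{O}_k(\Omega) \setminus \mathcal{O}_{k_0 - 1}(\Omega)} \mathbb{E}_{p_G}\Big( \widetilde{W}_{\kappa'}\big(\overline{G}_n, G\big) \Big) \succsim n^{-1/(2 \|\kappa'\|_\infty)}.$$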

The proof of Theorem 2 is in Section 5.2.
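To get a feel for the gap between the bounds in Theorems 1 and 2, the sketch below evaluates the two rate expressions, ignoring constants, for a given sample size. The values of `r_bar` passed in are hypothetical inputs for illustration; the actual value is determined by system (6).

```python
import math


def rate(n, exponent):
    """Evaluate (log n / n) ** exponent, the shape of the upper bounds."""
    return (math.log(n) / n) ** exponent


def compare_rates(n, r_bar):
    """Rates under algebraic independence (exponent 1/4) and under the
    dependent setting of Theorem 2 (exponent 1 / (2 * r_bar))."""
    return {
        "independent": rate(n, 0.25),
        "dependent": rate(n, 1.0 / (2.0 * r_bar)),
    }


n = 10 ** 6
small_r = compare_rates(n, r_bar=4)   # hypothetical value of r_bar
large_r = compare_rates(n, r_bar=6)   # hypothetical, larger value of r_bar
```

Since $\log n / n < 1$ for large $n$, a smaller exponent gives a larger (slower) rate, so the bound degrades as `r_bar` grows.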

Remark: First, the convergence rates of MLE in part (a) of Theorem 2 demonstrate that the best possible convergence rates of estimating , , and are not uniform. In particular, the rates for estimating and are and , respectively, while the rate for estimating is (up to a logarithmic factor) for all . Therefore, estimation of the second component of is generally much faster than estimation of the first component of and . As is seen in the proof, the slow convergence of and arises from the way in which the structure of the PDE (5) captures the statistically relevant dependence of the expert functions and . In particular, the PDE shows that and are linearly dependent, but, since the second component of is associated with the covariate , it does not have any interaction with , which explains why it enjoys a much faster convergence rate than the other parameters.

Second, if we choose expert functions for any and where , then with a similar argument we obtain that the best possible convergence rates for estimating for are for all while those for and are and , respectively (up to a logarithmic factor).

#### 4.1.2 Covariate-dependent expert function h2

We now turn to the setting of algebraic dependence between the parameters associated with the covariate $X$ in $h_1$ and the parameters of $h_2$.

###### Example 4.2.

Define expert functions for all and , for all such that and for some positive constant . We have the following PDE for these expert functions:

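$$\left( \frac{\partial h_1}{\partial \theta_1^{(1)}}(X, \theta_1) \right)^2 = \frac{\partial h_2^2}{\partial \theta_2^{(1)}}(X, \theta_2), \tag{7}$$

$$\left( \frac{\partial h_1}{\partial \theta_1^{(2)}}(X, \theta_1) \right)^2 = \frac{\partial h_2^2}{\partial \theta_2^{(2)}}(X, \theta_2), \tag{8}$$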

which shows that $h_1$ and $h_2$ are algebraically dependent.

The main distinction between Example 4.2 and Example 4.1 is that we have the covariate $X$ in the formulation of the expert function $h_2$ in Example 4.2. This inclusion leads to a rather rich spectrum of convergence rates for the MLE. To illustrate these convergence rates, we consider two distinct cases for the expert function $h_2$:

• without offset: , i.e., .

• with offset: is taken into account; i.e., .

###### Theorem 3.

(Without offset) Let be defined as in (6). Given expert functions for and for , we have:

• (Maximum likelihood estimation) There exists a positive constant depending only on and such that

$$\mathbb{P}\Big( \widetilde{W}_\kappa\big(\widehat{G}_n, G_0\big) > C_0 (\log n / n)^{1/(2\bar{r})} \Big) \precsim \exp(-c \log n),$$

where . Here, is a positive constant depending only on .

• (Minimax lower bound) For any such that ,

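$$\inf_{\overline{G}_n} \ \sup_{G \in \mathcal{O}_k(\Omega) \setminus \mathcal{O}_{k_0 - 1}(\Omega)} \mathbb{E}_{p_G}\Big( \widetilde{W}_{\kappa'}\big(\overline{G}_n, G\big) \Big) \succsim n^{-1/(2 \|\kappa'\|_\infty)}.$$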

The proof of Theorem 3 is deferred to Appendix A.1.

In contrast to the setting of Theorem 2, the expert function is now a function of . The convergence rate of in Theorem 3 demonstrates that the convergence rates for estimating , , and are , , and , respectively, for all . Therefore, with the formulation of expert functions given in Theorem 3, estimation of the first component of is much faster than estimation of the second component of . This is in contrast to the results in Theorem 2. A high-level explanation for this phenomenon is again obtained by considering the PDE structure, which in this case is given by (8):

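$$\left( \frac{\partial h_1}{\partial \theta_1^{(2)}}(X, \theta_1) \right)^2 = \frac{\partial h_2^2}{\partial \theta_2}(X, \theta_2).$$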

Such a structure implies the dependence of the second component of and ; therefore, there exists a strong interaction between and in terms of their convergence rates. On the other hand, the first component of and are linearly independent, which implies that there is virtually no interaction between these two terms. As a consequence, will enjoy much faster convergence rates than and .

In contrast to the setting without an offset term in the expert function $h_2$, the convergence rate of the MLE under the setting with the offset term in $h_2$ suffers in two ways: one is captured by the PDE structure with respect to $\theta_1^{(1)}$ and $\theta_2^{(1)}$ in (7), and the other by the PDE structure with respect to $\theta_1^{(2)}$ and $\theta_2^{(2)}$ in (8).

###### Theorem 4.

(With offset) Let be defined as in (6). Given expert functions for and for