Many statistical analyses involve finding the maximum of an objective function. In maximum likelihood estimation, the maximum likelihood estimator (MLE) is the maximum of the likelihood function. In variational inference (Blei et al., 2017)
, the variational estimator is constructed by maximizing the evidence lower bound. In regression analysis, we estimate the parameter by minimizing the loss function, which is equivalent to maximizing the negative loss function. In nonparametric mode hunting(Parzen, 1962; Romano, 1988b, a), the parameter of interest is the location of the density global mode; therefore, we are finding the point that maximizes the density function.
Each of the above analyses works well when the objective function is concave. However, when the objective function is non-concave and processes many local maxima, finding the (global) maximum can be challenging or even computationally intractable. The problems are severe when we want to construct a confidence interval (CI) of the maximum because the estimator we compute may not be close to the (population) maximum.
In this paper, we focus on the analysis of the MLE of a multi-modal likelihood function. Our analysis can also be applied to other types of M-estimators (Van der Vaart, 1998). Maximizing a multi-modal likelihood function is a common scenario encountered while fitting a mixture model (Titterington et al., 1985; Redner and Walker, 1984). Figure 1
plots the log-likelihood function of fitting a 2-Gaussian mixture model to data generated from a 3-Gaussian mixture model; the orange color indicates the regions of parameter space with a high likelihood value. There are two local maxima, denoted by the blue and green crosses. The blue maximum is the global maximum. To find the maximum of a multi-modal likelihood function, we often apply a gradient ascent method such as the EM algorithm(Titterington et al., 1985; Redner and Walker, 1984) with an appropriate initial guess of the MLE. The right panel of Figure 1 shows the result of applying a gradient ascent algorithm to a few initial points. The black dots are the initial guess of the MLE, and the corresponding black curve indicates the gradient ascent path starting from this initial point to a nearby local maximum. Although it is ensured that a gradient ascent method does not decrease the likelihood value, it may converge to a local maximum or a critical point rather than the global maximum. For instance, in the right panel of Figure 1, three initial point converges to the green cross, which is not the global maximum. To resolve this issue, we often randomly initialize the starting point (initial guess) many times and then choose the convergent point with the highest likelihood value as the final estimator (McLachlan and Peel, 2004; Jin et al., 2016). However, as we have not explored the entire parameter space, it is hard to determine whether the final estimator is indeed the MLE. Although the theory of MLEs suggests that the MLE is a -consistent estimator of the population maximum (population MLE) under appropriate conditions (Titterington et al., 1985), our estimator may not be a -consistent estimator because it is generally not the MLE111 In fact, for a mixture model, the convergence rate could be slower than if the number of mixture is not fixed; see, e.g., Li and Barron (1999) and Genovese and Wasserman (2000).. The CI constructed from the estimator inherits the same problem; if our estimator is not the MLE, it is unclear what population quantity the resulting CI is covering.
and variance. We fit a 2-Guassian mixture with the center of the first Gaussian being set to be and the variance of both Gaussians is . The parameters of interest are the center of the second Gaussian and the proportion of the second Gaussian . Namely, the log-likelihood function is where has a PDF . Left: Contour plot of the log-likelihood function . Regions with orange color are where the log-likelihood function have a high value. The two local maxima are denoted by the blue and green crosses. Right: The trajectories of the gradient ascent method with multiple initial points. Each solid black dot is an initial point and the curve attached to it indicates the trajectory of the gradient ascent method starting from that initial point.
The goal of this paper is to analyze the statistical properties of this estimator. Note that we do not provide a solution to resolve the problem causing by multiple local maxima; instead, we attempt to analyze how the local maxima affect the performance of the estimator. Although this estimator is not the MLE, it is commonly used in practice. To understand the population quantity that this estimator is estimating, we study the behavior of estimators that result from applying a gradient ascent algorithm to a likelihood function that has multiple local maxima. We investigate the underlying population quantity being estimated and analyze the properties of resulting CIs. Specifically, our main contributions are summarized as follows.
Related Work. The analysis of MLE under a multi-modal likelihood function has been analyzed for decades; see, for example, Redner (1981); Redner and Walker (1984); Sundberg (1974); Titterington et al. (1985). In the multi-modal case, finding the MLE is often accomplished by applying a gradient ascent method such as the EM-algorithm (Dempster et al., 1977; Wu, 1983; Titterington et al., 1985) with random initializations. The analysis of initializations and convergence of a gradient ascent method can be found in Lee et al. (2016); Panageas and Piliouras (2016); Jin et al. (2016); Balakrishnan et al. (2017). In our analysis, we use the Morse theory (Milnor, 1963; Morse, 1930; Banyaga and Hurtubise, 2013) to analyze the behavior of a gradient ascent algorithm. The analysis using the Morse theory is related to the work of Chazal et al. (2014); Mei et al. (2016); Chen et al. (2017).
Outline. We begin with providing the necessary background in Section 2. Then, we discuss how to perform statistical inference with local optima in Section 3. In Section 4, we analyze different sources of uncertainties that can arise while making our inference. We extend our analysis to a nonparametric bump hunting problem in Section 6. Finally, we discuss issues and opportunities for further work related research in Section 7.
In the first few sections, we will focus on an estimator that attempts to maximize the likelihood function. Let be a random sample. For simplicity, we assume that each is continuous. In parametric estimation, we impose a model on the underlying population distribution function
. This gives a parametrized probability density function. The MLE estimates the parameter using
which can be viewed as an estimator of the population MLE:
When the likelihood function has multiple local modes (maxima), the MLE does not in general have a closed form; therefore, we need a numerical method to find it. A common approach is to apply a gradient ascent algorithm to the likelihood function with a randomly chosen initial point. To simplify our analysis, we assume that we are able to perform a continuous gradient ascent to the likelihood function (this is like conducting a gradient ascent with an infinitely small step size). When the likelihood function has multiple local maxima, the algorithm may converge to a local maximum rather than to the global maximum. As a result, we need to repeat the above procedure several times with different initial values and choose the convergent point with the highest likelihood value.
To study the behavior of a gradient ascent method, we define the following quantities. Given an initial point , let be a gradient flow such that
Namely, the flow starts at and moves according to the gradient ascent of . The stationary point is the destination of the gradient flow starting at . Different starting points lead to flows that may end at different points.
Because our initial points are chosen randomly, we view these initial points as IID from a distribution (see, e.g., Chapter 2.12.2 of McLachlan and Peel 2004) that may depend on the original data . The number denotes the number of the initializations. Later we will also assume that converges to a fixed distribution when the sample size increases to infinity. Note that different initialization methods lead to a different distribution .
By applying the gradient ascent to each of the initial parameters, we obtain a collection of stationary points
The estimator is the one that maximizes the likelihood function so it can be written as
In practice, we often treat as and use it to make inferences about the underlying population. However, unless the likelihood function is concave, there is no guarantee that . Thus, our inferences and conclusions, which were based on treating as the MLE, could be problematic.
2.1 Population-level Analysis
To better understand the inferences we make when treating as , we start with a population level analysis over . The population version of the gradient flow starting at is a gradient flow such that
The destination of this gradient flow, , is one of the critical points of .
For a critical point of , we define the basin of attraction of as the collection of initial points where the gradient flow leads to this point:
Namely, is the region where the (population) gradient ascent flow converges to critical point .
Throughout this paper, we assume that is a Morse function. That is, critical points of are non-degenerate (well-separated). By the stable manifold theorem (e.g., Theorem 4.15 of Banyaga and Hurtubise 2013), is a -dimensional manifold, where
is the number of negative eigenvalues of, the Hessian matrix of evaluated at . Thus, the Lebesgue measure of is non-zero only when is a local maximum. Because of this fact, we restrict our attention to local maxima and ignore other critical points; a randomly chosen initial point has probability zero of falling within the basin of attraction of a critical point that is not a local maximum when is continuous. Note that a similar argument also appears in Lee et al. (2016) and Panageas and Piliouras (2016). Let be the collection of local maxima with
where is the number of local maxima. The population MLE is .
Figure 3 provides an illustration of the critical points and the basin of attraction. The left panel displays the contour plot of a log-likelihood function. The three solid black dots are the local maxima (, and ), the three crosses are the critical points, and the empty box indicates a local minimum. In the middle panel, we display gradient flows from some starting points. The right panel shows the corresponding basins of attraction (, , and ). Each color patch is a basin of attraction of a local maximum.
For the -th local maximum, we define the probability
where is the population version of (i.e., converges to in the sense of assumption (A4) that will be introduced later). is the probability that the initialization method chooses an initial point within the basin of attraction of . Namely, is the chance that the gradient ascent converges to from a random initial point. Note that we add a superscript to to emphasize the fact that this quantity depends on how we choose the initialization approach. Varying the initialization method leads to different probabilities .
We define a ‘cumulative’ probability of the top local maxima as
The quantity plays a key role in our analysis because it is the probability of seeing one of the top local maxima after applying the gradient ascent method once. Note that is the probability of selecting an initial parameter value within the basin of attraction of the MLE, which is also the probability of obtaining the MLE with only one initialization. Later we will give a bound on the number of initializations we need to obtain the MLE with a high probability (Proposition 1).
Because the estimator is constructed by initializations, we introduce a population version of it. Let be the initial points and let be the corresponding destinations. The quantity
is the population analog of .
Because is constructed by initializations, it may not contain the population MLE . However, it is still the best among all these candidates so it should be one of the top local maxima in terms of the likelihood value. Let be the top local maxima, where . By simple algebra, we have
Given any fixed number , such a probability converges to as when covers the basin of attraction of every local maximum. Therefore, we can pick and choose sufficiently large to ensure that we obtain the MLE with an overwhelming probability. However, when is finite, the chance of obtaining the population MLE could be slim.
To acknowledge the effect from on the estimator, we introduce a new quantity called the precision level, denoted as . Given a precision level , we define an integer
that can be interpreted as: with a probability of at least , is among the top local maxima. We further define
Namely, with a probability of at least , recovers one element of . We often want to set to be small because later we will show that common CIs have an asymptotic coverage containing an element of (Section 3.1). If we want to control the type-1 error to be, say , we may want to choose and .
Let denotes the directional derivative with respect to , where
is the collection of all unit vectors indimensions. When either or increase, the set may shrink. Under smoothness conditions of , a sufficiently large ensures as described in the following proposition.
Assume that is a compact parameter space. Assume all eigenvalues of are less than for some positive and
Moreover, assume that is unique within . Then for every ,
Proposition 1 describes a desirable scenario: when is sufficiently large, with a probability of at least the set contains only the MLE.
If the uniqueness assumption is violated, i.e., there are multiple parameters attaining the maximum value of the likelihood function, the set converges to the collection of all these maxima in probability when . One common scenario in which we encounter this situation is in mixture models where permuting some parameters results in the same model.
2.2 Sample-level Analysis
In this section, we will show that converges to an element of with a probability at least . We first introduce some generic assumptions. We define the projection distance for any point and any set . For a set and a scalar , define .
Let be the union of boundaries of basins of attraction of local maxima. Note that the Morse theory implies that
where is a -dimensional manifold (see Appendix A for more details). Namely, the boundary can be written as the union of basins of attraction of saddle points and local minima. Thus, every point admits a subspace space that is normal to at .
The maximum is unique and there exists constants such that:
At every critical point (i.e., ), the eigenvalues of the Hessian matrix of are outside , and
for any , there exists such that
where is the collection of all critical points.
There exists and an integrable function such that for every ,
Moreover, there exists such that
where is the normal space of at point .
where is the max-norm for a vector or a matrix. We need or .
There exists such that
(6) (7) (8)
with or .
Assumption (A1) is called the strongly Morse assumption in Mei et al. (2016), and is often used in the literature to ensure critical points are well separated (Chazal et al., 2014; Genovese et al., 2016; Chen et al., 2016, 2017). Under assumption (A1), the likelihood function is a Morse function (Milnor, 1963; Morse, 1925).
Assumption (A2) consists of two parts: a third-order derivative assumption and a boundary curvature assumption. The third-order derivative assumption is a classical assumption to establish asymptotic normality (Redner and Walker, 1984; Van der Vaart, 1998). The boundary curvature assumption (Chen et al., 2017) assumes that when we are moving away from the boundary of a basin of attraction, the log-likeihood function has to behaves like a quadratic function. When , the boundary becomes the collection of local minima so this assumption reduces to requiring the second derivative is non-zero at local minima, which is a common assumption to ensure non-degenerate critical points.
Assumption (A3) requires that both the gradient and the Hessian can be uniformly estimated. It is also a common assumption in the literature, see, e.g., Genovese et al. (2016); Chazal et al. (2014); Mei et al. (2016); Chen et al. (2017).
The assumptions in (A4) are more involved. At first glance, it seems that the bound in equation (7) will be difficult one to verify. However, there is a simple rule to verify it using the fact that the collection has VC dimension so as long as the convergence of toward can be written as an empirical process, this condition holds due to the VC theory (see Theorem 2.43 of Wasserman 2006). The assumption on the bound in equation (8) is required to avoid any probability mass around the boundary.
We start with a useful lemma which states that with a high probability (a probability tending to as the sample size increases), the local maxima of and the local maxima of have a one-to-one correspondence. Denote the collection of local maxima of as
where is the number of local maxima of . Note that by definition, .
Assume (A1) and (A3). Then there exists a constant such that when , and for every ,
This result appears many places in the literature so we will omit the proof in this exposition. Interested readers are encourage to consult Chazal et al. (2014); Mei et al. (2016); Chen et al. (2017). Note that much of the time we in fact have a stronger result than what Lemma 2 suggests – not only is there a one-to-one correspondence between the local maxima, but other critical points also have this property.
The following theorem provides a bound on the distance from to .
Assume (A1–4). Let
When are non-random, with a probability of at least ,
Moreover, when are random and there exists such that
then for any sequence , with a probability of at least ,
In the first claim ( are non-random), the probability that the lemma does not hold comes from the randomness of initializations. In the second claim ( are random), the probability statement accounts for both the randomness of initializations and .
In many applications, the probability is very small because that statement is true when both are less than a fixed threshold (see Lemma 16 of Chazal et al. 2014). Further, the chance that these two quantities are less than a fixed number has a probability of for some and so often can be ignored.
Now we introduce general assumptions on the likelihood function to obtain concrete rates in the previous theorem. Let , , and , where the operator is taking the gradient with respect to .
The gradient and Hessian of the log-likelihood function are Lipschitz in the maximum norm in the following sense: there exist and with such that for every ,
The quantities in Assumption (A4) satisfy
for some constants .
Assumption (A3L) is a sufficient condition to ensure we have a uniform concentration inequality of both the gradient and the Hessian. The Lipschitz condition is to guarantee that the parametric family has an -bracketing number scaling at rate , where is the dimension of parameters (Van der Vaart, 1998). The bounds on the bracketing number can further be inverted into a concentration inequality using Talagrand’s inequality (see Theorem 1.3 of Talagrand 1994) Furthermore, (A3L) is a mild assumption because many common models, such as a Gaussian mixture model, satisfy this assumption when is defined properly.
The concentration inequality assumption of (A4L) is also mild. The assumption is true whenever the data-driven approach
is based on fitting a smooth model such as a normal distribution on the parameter space with mean and variance being estimated by the data.
Assume (A1), (A2) (A3L), and (A4L). Then when , with a probability of at least ,
3 Statistical Inference
In this section we study the procedure of making inferences when the likelihood function has multiple maxima. In the first three subsections, we will analyze properties of CIs. Then, in the last subsection, we will describe how to implement a two-sample test.
To simplify the problem of constructing CIs, we focus on constructing CIs of , where is a known function. We estimate using . Moreover, we define
The set will be the population quantity that the CIs are covering. To derive the coverage of CIs, we consider the following assumptions.
The score function satisfies
There exist constants such that for every ,
. We need the existence of the fourth-moment because we need aconvergence rate of the sample variance toward the population variance. It could be replaced by a third-moment assumption but in that case, we would not be able to obtain the convergence rate of the coverage of a CI. Assumption (T) ensures that the mapping is smooth around critical points so that we can apply the delta method (Van der Vaart, 1998) to construct a CI.
3.1 Normal Confidence Interval
A naive approach to constructing a CI is to estimate the variance of and invert it into a CI. Such CIs are based on the asymptotic normality of the MLE (Redner and Walker, 1984):
for some . In practice, we only have access to , not , so we replace by and construct a CI using the normality. This is perhaps the most common approach to the construction of a CI and the representation of the error of estimation (see Chapter 2.15 and 2.16 in McLachlan and Peel 2004 for examples of mixture models). However, we will show that when the likelihood function has multiple local maxima, this CI has undercoverage for and has coverage for an element in .
To fully describe the construction of this normal CI, we begin with an analysis of the asymptotic covariance of the MLE. Let be the score function and be the Hessian matrix of the log-likelihood function. Moreover, let and . The MLE has an asymptotic covariance matrix
Note that under regularity conditions,
is the Fisher’s information matrix, which further implies However, when the model is mis-specified, , and in this case, we cannot use the information matrix to construct a normal CI.
Thus, given an estimator , we can estimate the covariance matrix using
And the CI is
where is the quantile of a standard normal distribution. Note that under suitable assumptions, one can also use Fisher’s information matrix or the empirical information matrix to replace .
However, because is never guaranteed to be the MLE, may not cover the population MLE with the right coverage. In what follows, we show that has an asymptotic coverage of covering an element of and coverage for covering the MLE, where is defined in equation (2).
Assume (A1), (A2), (A3L), (A4L), (A5), and (T). Then
Thus, by choosing , we have
The population quantity covered by the normal CI is given by fact that has an asymptotic coverage for an element of . The quantities and play similar role in terms of coverage but they have different meanings. The quantity is the conventional confidence level, which aims to control the fluctuation of the estimator. On the other hand, is the precision level that corrects for the multiple local optima.
When applying a bootstrap approach to an estimator that requires multiple initializations (such as our estimator or the estimator from an EM-algorithm), there is always a question: How should we choose the initial point for each bootstrap sample? Should we rerun the initialization several times to pick the highest value for each bootstrap sample?
Based on the arguments to follow, we recommend using the estimator of the original sample, , as the initial point for every bootstrap sample. The purpose of using the bootstrap is to approximate the distribution of the estimator . In the M-estimator theory (Van der Vaart, 1998), we know that the variation of is caused by the randomness of the function around . Thus, to make sure the bootstrap approximates such a randomness, we need to ensure that the bootstrap estimator is around so the distribution of approximates the distribution of for some . By Lemma 2, we know that there is a local maximum of the bootstrap log-likelihood function that is close to . Therefore, we need to guarantee that the initial point to which we apply the gradient ascent method in the bootstrap sample is within the basin of attraction of . Because is close to , is often within the basin of attraction of ; as a result, is a good initial point for the bootstrap sample.
Moreover, using the same initial point in every bootstrap sample avoids the problem of label switching (Redner and Walker, 1984). Label switching occurs when the distribution function is the same after permuting some parameters. For instance, in a Gaussian mixture model with equal variance and proportion, permuting the location parameters leads to the same model. When we use the same initial point in every bootstrap sample, we alleviate this problem.
Now we describe is the formal bootstrap procedure. Let be a bootstrap sample. We first calculate the bootstrap log-likelihood function . Next, we start a gradient ascent flow from the initial point . The gradient ascent flow leads to a new local maximum, denoted as . By evaluating the function at this new local maximum, we obtain a bootstrap estimate of the parameter of interest, . We repeat the above procedure many times and construct a CI using the upper and lower quantile of the distribution of . Namely, let
The CI is
Figure 4 outlines the procedure of this bootstrap approach.
A benefit of this CI is that does not require any knowledge about the variance of . When the variance is complicated or does not have a closed form, being able to construct a CI without knowledge of the variance makes this approach particularly appealing.
Assume (A1), (A2), (A3L), (A4L), (A5), and (T). Let be defined as the above. Then
Therefore, by choosing , where is defined in Theorem 5, we conclude that
There are many other variants of bootstrap approaches and Figure 4 describes only a simple one. A common alternative to the method so far presented is bootstrapping the pivotal quantity (also known as the studentized pivotal approach in Wasserman 2006 and the percentile- approach in Hall 2013). In certain scenarios, bootstrapping a pivotal quantity leads to a CI with a higher order accuracy (namely., the coverage will be ). However, we may not have such a property because the bottleneck of the coverage error comes from the uncertainty of the basins of attraction. Such uncertainty may not be reduced when using the pivotal approach.
3.3 Confidence Intervals by Inverting a Test
In this section, we introduce three CIs of created by inverting hypothesis tests. We consider three famous tests: the likelihood ratio test, the score test, and the Wald test. Although the three tests are asymptotically equivalent in a regular setting (when the likelihood function is concave and smooth), they lead to very different CIs when the likelihood function has multiple local maxima. Figure 5 provides an example illustrating these three CIs of a multi-modal likelihood function.
Because it is easier to invert a test for a CI of , we focus on describing the procedure of constructing a CI of is this section. With a CI of , say , one can easily invert it into a CI of by
3.3.1 Likelihood ratio test
Under appropriate conditions, the likelihood ratio test implies
where is a distribution with degrees of freedom and is the dimension of the parameter. This motivates a CI of of the form
where is the quantile of .
In practice, we do not know the actual MLE and have only the estimator . Therefore, we replace