## 1 Introduction

In this work, we address the statistics problem of maximum likelihood estimation (MLE). A statistical model is a subset of

whose points have nonnegative coordinates summing to one; these points represent probability distributions. We will be interested in statistical models that are defined by zero sets of polynomials restricted to the positive orthant. The study of such models is a central part of algebraic statistics.

Given a data set and statistical model, we ask, “Which point in the model best describes the data?” One way to answer this question is by maximum likelihood estimation. By determining the point in the model that maximizes the likelihood function, one finds the probability distribution that best describes the data.

For the models we consider, the data is discrete; more specifically, the data is a list of integers representing counts. The likelihood function we consider on the model is given by the monomial

where is our data.

The maximum likelihood estimate is the point in the statistical model that maximizes the likelihood function. To determine this estimate, one can solve the likelihood equations to determine critical points of the likelihood function. This set of critical points contains all local maxima. So when the maximum likelihood estimate is a local maximum, it will be one of these critical points. When the maximum likelihood estimate is not a local maximum, then it is in the boundary (in the Euclidean topology) or the singular locus of the model. In those cases, one restricts to those loci’s Zariski closure and performs an analysis in this restriction. Iterating this procedure, one solves the global optimization problem of MLE.

The number of complex solutions to the likelihood equations (for general data) is a topological invariant called the maximum likelihood degree. For discrete models, the maximum likelihood degree was shown to be an Euler-characteristic in the smooth case [22] and recently shown to be a weighted sum of Euler characteristics in the singular case [1]. In this paper, we will only consider discrete data. For a recent survey for the ML degree, one can refer to [23].

The ML degree measures the algebraic complexity of maximum likelihood estimation by counting the number of complex solutions (real and nonreal solutions) to the likelihood equations. In statistics only real solutions with positive coordinates are relevant, and to analyze the real root classification problem, one studies a discriminant of the system. Discriminants are a large topic studied in algebraic geometry, and they are important because they are used in combination with Ehresmann’s Theorem (Theorem 2) for real root classification. In principle, discriminants can be computed with Gröbner bases to determine elimination ideals [7, Chapter 3] or with geometric resolutions [13]

. In this paper we give a probabilistic interpolation algorithm to determine discriminants of the likelihood equations. These interpolation methods can be applied to other parameterized system of equations, for example determining the Euclidean distance degree, which was defined in

[8].Our main contribution is a probabilistic algorithm (Algorithm 2) that is more efficient for larger-sized models than the standard algorithm (Algorithm 1). Tables 1–2 show our contribution is nontrivial with great performance increases in particular for Model 4. Based on the discriminants we have computed by the probabilistic algorithm, we give the real root classification for by symmetric matrix model. The main differences between this paper and our ISSAC’15 paper [27] are listed below.

• We have reimplemented [27, Algorithms 1–2] based on Maple and FGb. Comparing [27, Tables 1–2] and Tables 1–2 in Section 4, we see the new implementation performs much better. A typical example is Model 4. [27] states the old implementation of Algorithm 2 (Strategy 2) takes 30 days to compute the discriminant while our new implementation takes less than 30 minutes.

• In Section 3, we add Section 3.2.3 in which we discuss the motivations and effectiveness of different strategies. We add Strategy 3 which improves the efficiency of sampling over Strategy 2 in some cases. We also add Lemma 3 in Section 3.2.1, which guarantees the correctness of Strategy 3.

• In Section 4, we present the new timings in Tables 1–2 based on our new implementation. We try more examples such as Dense Models 3–6 and Models 5–9. We add Table 3 which compares Strategy 2 and Strategy 3.

Now we motivate our study with an illustrating example.

###### Example 1 (Illustrative Example).

Suppose we have a weighted four-sided die such that the probability of observing side of the die satisfies the constraint . We toss the die 1000 times and record a -dimensional datavector , where is the number of times we observe the side . We want to determine the probability distribution that best explains the data subject to the constraint by maximum likelihood estimation:

Maximize the likelihood function subject to

For some statistical models, the MLE problem can be solved by well known local hill climbing

algorithms such as the expectation–maximization algorithm. However, the local method can fail if there is more than one local maximum. One way to remedy this, is by solving the system of

likelihood equations [19, 3]:where and are newly introduced indeterminates (Lagrange multipliers) for formulating the likelihood equations. More specifically, for given , if is a critical point of the likelihood function, then there exist numbers and such that is a solution of the polynomial system.

For general data , the likelihood equations have complex solutions. However, only solutions with positive coordinates are statistically meaningful. A solution with all positive coordinates is said to be a positive solution. Real root classification (RRC) and postive root classification (PRC) are important problems:

For which , the polynomial system has and real/positive solutions?

According to the theory of computational (real) algebraic geometry

[33, 25], the number of (real/positive) solutions only changes when the data goes across some “special” values (see Theorem 2). The set of “special” is a (projective) variety [25, Lemma 4] in ( dimensional complex projective space) -dimensional complex space.The number of real/positive solutions is uniform over each open connected component determined by the variety. In other words, the “special” plays the similar role as the discriminant for univariate polynomials. The first step of RRC is calculating the “special” , leading to the discriminant problem:

How to effectively compute the “special” ?

Geometrically, the “special” is a projection of a variety. So it can be computed by elimination. For instance, with the command fgb_gbasis_elim in FGb [10], we compute that the “special” in Example 2 form a hypersurface defined by a homogenous polynomial in . However, for many MLE problems, the elimination computation is too expensive as shown in Tables 1–2 in Section 4.

We remark that our work can be viewed as following the numerous efforts in applying computational algebraic geometry to tackle MLE and critical points problems [19, 3, 2, 20, 31, 16, 12, 17, 23, 18, 26].

The paper is organized as follows. Section 2 proves the discriminant variety of likelihood equation is projective and gives the definition of data-discriminant. Section 3 presents the elimination algorithm and probabilistic algorithm (with three strategies). Section 4 shows the experimental results comparing the two algorithms and different strategies. Section 5 discusses the future work and gives the real root classification of by symmetric matrix model.

## 2 Definitions and preliminaries

In this section, we discuss how to define “data-discriminant”. We assume the reader is familiar with elimination theory [7, Chapter 3].

###### Notation 1.

Let denote the projective closure of the complex numbers . For homogeneous polynomials in , denotes the projective variety in defined by . Let denote the -dimensional probability simplex .

###### Definition 1.

(Algebraic Statistical Model and Model Invariant) The set is said to be an algebraic statistical model if where define an irreducible generically reduced projective variety. Each is said to be a model invariant of . If the codimension of is , we say is a set of general model invariants for whenever the variety has as an irreducible component.

For a given algebraic statistical model, there are several different ways to formulate the likelihood equations [19]. In this section, we introduce the Lagrange likelihood equations and define the data-discriminant for this formulation. One can similarly define data-discriminants for other formulations of likelihood equations.

###### Notation 2.

For any in the polynomial ring , denotes the affine variety in defined by and denotes the ideal generated by . For an ideal in , denotes the affine variety defined by .

###### Definition 2.

(Lagrange Likelihood Equations and Correspondence) Given an algebraic statistical model of codimension and a set of general model invariants , the system of polynomials below is said to be the Lagrange likelihood equations of when set to zero:

where are the model invariants of and , , are indeterminates (also denoted by , , ). More specifically, are unknowns and are parameters.

, namely the set

is said to be the Lagrange likelihood correspondence of and denoted by .

###### Notation 3.

Let denote the canonical projection from the ambient space of the Lagrange likelihood correspondence to the associated to the indeterminants : . For any set in , denotes the ideal

denotes the affine closure of in , namely .

Given an algebraic statistical model and a data vector , the maximum likelihood estimation problem is to maximize the likelihood function subject to . The MLE problem can be solved by computing . More specifically, if is a regular point of , then is a critical point of the likelihood function if and only if there exists such that . Theorem 1 states that for a general data vector , is a finite set and the cardinality of is constant over a dense Zariski open set, which inspires the definition of ML degree. For details, see [19].

###### Theorem 1.

[19] For an algebraic statistical model , there exists an affine variety and a non-negative integer such that for any ,

###### Definition 3.

###### Definition 4.

For an algebraic statistical model with a set of general model invariants , suppose are defined as in Definition 2. Then, we have the following:

• denotes the set of non-properness of , i.e., the set of the such that there does not exist a compact neighborhood of where is compact;

• denotes ; and

• denotes where denotes the determinant below

The geometric meaning of

and are as follows. The first,, is the projection of the intersection of the Lagrange likelihood correspondence with the coordinate hyperplanes. The second,

, is the projection of the intersection of the Lagrange likelihood correspondence with the hypersurface defined by . Geometrically, is the closure of the union of the projection of the singular locus of and the set of critical values of the restriction of to the regular locus of [25, Definition 2].The Lagrange likelihood equations define an affine variety. As we continuously deform the parameters , coordinates of a solution can tend to infinity. Geometrically, is the set of the data such that the Lagrange likelihood equations have some solution at infinity; this is the closure of the set of “non-properness” as defined in the page 1, [24] and page 3, [29]. It is known that the set of non-properness of is closed and can be computed by Gröbner bases (see Lemma 2 and Theorem 2 in [25]).

The ML degree captures the geometry of the likelihood equations over the complex numbers. However, statistically meaningful solutions occur over real numbers. Below, Theorem 2 states that and define open connected components such that the number of real/positive solutions is uniform over each open connected component. Theorem 2 is a corollary of Ehresmann’s theorem for which there exists semi-algebraic statements since 1992 [6].

###### Theorem 2.

For an algebraic statistical model . If is an open connected component of

then over , the following is constant:

Moreover, if is an open connected component of

then over , the following is constant:

Before we give the definition of data-discriminant, we study the structures of , and . Proposition 1 shows that the structure of the likelihood equations forces to be contained in the union of coordinate hyperplanes defined by . Proposition 2 shows that the structure of the likelihood equations forces to be a projective variety. Similar to the proof of Proposition 2, we can also show that the structure of the likelihood equations forces to be a projective variety.

###### Proposition 1.

For any algebraic statistical model ,

###### Remark 1.

Generally, . For example, suppose the algebraic statistical model is . Then .

###### Notation 4.

denotes the product .

###### Proposition 2.

For an algebraic statistical model , we have is a projective variety in , where is the zero vector in .

Proof. By the formulation of the Lagrange likelihood equations, we can prove that is a homogeneous ideal by the two basic facts below, which can be proved by Definition 2 and basic algebraic geometry arguments.

F1. For every in , each scalar multiple is also in .

F2. For any , if for any and for any scalar , , then is a homogeneous ideal in .

That means the ideal is generated by finitely many homogeneous polynomials , , . Therefore, . So .

###### Notation 5.

For an algebraic statistical model , we define the notation according to the codimension of in .

• If the codimension is , then assume are the codimension irreducible components in the minimal irreducible decomposition of in and , , are radical. denotes the homogeneous polynomial .

• If the codimension is greater than , then our convention is to take .

Similarly, we use the notation to denote the projective variety . Now we define the “data-discriminant” of Lagrange likelihood equations.

###### Definition 5.

(Data-Discriminant) For a given algebraic statistics model , the homogeneous polynomial is said to be the data-discriminant (DD) of Lagrange likelihood equations of and denoted by .

###### Remark 2.

Note that DD can be viewed as a generalization of the “discriminant” for univariate polynomials. So it is interesting to compare DD with border polynomial (BP) [33] and discriminant variety (DV) [25]. DV and BP are defined for general parametric polynomial systems. DD is defined for the likelihood equations but can be generalized to generic zero-dimensional systems. Generally, for any square and generic zero-dimensional system, DD DV BP. Note that due to the special structure of likelihood equations, DD is a homogenous polynomial despite being an affine system of equations. However, generally, DV is not a projective variety and BP is not homogenous.

###### Example 2 (Linear Model).

The algebraic statistic model for the four sided die story in Section 1 is given by

The Langrange likelihood equations are the shown in Example 1. The Langrange likelihood correspondence is . If we choose generic , , namely the ML degree is . The data-discriminant is the product of , and , where

, , and

.

By applying the well known partial cylindrical algebraic decomposition (PCAD) [5] method to the data-discriminant above, we get that for any ,

• if , then the system of likelihood equations has distinct real solutions and of them is positive;

• if , then the system of likelihood equations has exactly real solution and it is positive.

The answer above can be verified by the RealRootClassification [33, 4] command in Maple 2015. In this example, the does not effect the number of real/positive solutions since it is always positive when each is positive. However, generally, plays an important role in real root classification. Also remark that the real root classification is equivalent to the positive root classification for this example but it is not true generally (see the example discussed in Section 5).

## 3 Algorithm

In this section, we discuss how to compute the discriminant . We assume that is the closure of a given statistical model, are defined as in Definition 2, and is defined as in Definition 4. We rename as . We also rename as . Subsection 3.1 presents the standard algorithm for reference and Subsection 3.2 presents the probabilistic algorithm.

### 3.1 Standard Algorithm

Considering the data-discriminant as a projection drives a natural algorithm to compute it. This is the standard elimination algorithm in symbolic computation:

• we compute the by elimination and then get by the radical equidimensional decomposition [25, Definition 3]. The algorithm is formally described in the Algorithm 1;

• we compute by the Algorithm PROPERNESSDEFECTS presented in [25] and then get by the radical equidimensional decomposition. We omit the formal description of the algorithm.

Practically, Algorithm 1 may not terminate in a reasonable time before a computer reaches its memory limit. For instance, when using FGb for our Gröbner Bases computations the memory limit is reached due to the large size of the intermediate computational results. Since the algorithms for Gröbner Bases in FGb are based on row echelon formcomputations [11, Section 2], we record the sizes of matrices generated by FGb for Models 4–9, see Table 4 in Section 4. This motivates the probabilistic algorithm found in the next subsection.

### 3.2 Probabilistic Algorithm

In Section 3.2.1, we prepare the lemmas which are used in the algorithms in Sections 3.2.2–3.2.3. We present the probabilistic algorithm with Strategy 1 (Algorithm 2) in Section 3.2.2. We discuss two different strategies (Strategy 2 and Strategy 3) for Algorithm 2 in Section 3.2.3.

#### 3.2.1 Lemmas

We prepare the lemmas which are used in the Sections 3.2.2 and 3.2.3. Lemma 1

is used to linearly transform parameter space. Corollary

1 and Lemma 2 are used to compute the totally degree of . Corollary 2 is used in the sampling for interpolation. Lemma 3 is used to compute the degree of and to do sampling in Strategy 3. In the statements of Lemmas 1–2 and Corollaries 1–2, we say an affine variety in is non-trivial if .###### Lemma 1.

For any , there exists a non-trivial affine variety in such that for any , the total degree of equals the degree of w.r.t. to , where

Proof. Suppose the total degree of is and is the homogeneous component of with total degree . For any , let . It is easily seen that the degree of w.r.t. equals .

###### Corollary 1.

For any , there exists a non-trivial affine variety in such that for any

the total degree of equals the degree of where

###### Lemma 2.

There exists a non-trivial affine variety in such that for any , if

where is the polynomial by replacing with in and

then .

Proof. By the definition of (Notation 5), there exists an affine variety such that for any , is radical. Thus, we only need to show that there exists an affine variety in such that for any , .

Suppose is the canonical projection: . For any

let (for ), then . Hence and so . Thus

Therefore,

For any , let for , then . By the Extension Theorem [7], there exists an affine variety such that if , then , thus

###### Corollary 2.

There exists a non-trivial affine variety in such that for any , if

where is the polynomial by replacing with in () and

then .

###### Lemma 3.

Suppose is an algebraic model with ML degree . Let generate the reduced codimension component of . If and , then .

###### Proof.

By Notation 5, is radical. So we only need to show . Alternatively, we show

Comments

There are no comments yet.