# Estimating linear covariance models with numerical nonlinear algebra

Numerical nonlinear algebra is applied to maximum likelihood estimation for Gaussian models defined by linear constraints on the covariance matrix. We examine the generic case as well as special models (e.g. Toeplitz, sparse, trees) that are of interest in statistics. We study the maximum likelihood degree and its dual analogue, and we introduce a new software package LinearCovarianceModels.jl for solving the score equations. All local maxima can thus be computed reliably. In addition we identify several scenarios for which the estimator is a rational function.

## Authors

• 34 publications
• 5 publications
• 13 publications
12/21/2020

### Computing Maximum Likelihood Estimates for Gaussian Graphical Models with Macaulay2

We introduce the package GraphicalModelsMLE for computing the maximum li...
11/17/2020

### Maximum Likelihood Estimation for Nets of Conics

We study the problem of maximum likelihood estimation for 3-dimensional ...
10/14/2009

### L_0 regularized estimation for nonlinear models that have sparse underlying linear structures

We study the estimation of β for the nonlinear model y = f(Xβ) + ϵ when ...
04/24/2018

### Parallel Approximation of the Maximum Likelihood Estimation for the Prediction of Large-Scale Geostatistics Simulations

Maximum likelihood estimation is an important statistical technique for ...
03/14/2019

### Discrete Statistical Models with Rational Maximum Likelihood Estimator

A discrete statistical model is a subset of a probability simplex. Its m...
09/17/2021

### Mapping of coherent structures in parameterized flows by learning optimal transportation with Gaussian models

We present a general (i.e., independent of the underlying model) interpo...
11/12/2012

### Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

The L1-regularized maximum likelihood estimation problem has recently be...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

In many statistical applications, the covariance matrix has a special structure. A natural setting is that one imposes linear constraints on or its inverse . We here study models for Gaussians whose covariance matrix  lies in a given linear space. Such linear Gaussian covariance models were introduced by Anderson [2]. He was motivated by the Toeplitz structure of in time series analysis. Recent applications of such models include repeated time series, longitudinal data, and a range of engineering problems [26]. Other occurrences are Brownian motion tree models [29], as well as pairwise independence models, where some entries of are set to zero.

We are interested in maximum likelihood estimation (MLE) for linear covariance models. This is a nonlinear algebraic optimization problem over a spectrahedral cone, namely the convex cone of positive definite matrices satisfying the constraints. The objective function is not convex and can have multiple local maxima. Yet, if the sample size is large relative to the dimension, then the problem is essentially convex. This was shown in [31]. In general, however, the MLE problem is poorly understood, and there is a need for accurate methods that reliably identify all local maxima.

Nonlinear algebra furnishes such a method, namely solving the score equations [30, Section 7.1] using numerical homotopy continuation [27]. This is guaranteed to find all critical points of the likelihood function and hence all local maxima. A key step is the knowledge of the maximum likelihood degree (ML degree). This is the number of complex critical points. The ML degree of a linear covariance model is an invariant of a linear space of symmetric matrices which is of interest in its own right.

Our presentation is organized as follows. In Section 2 we introduce various models to be studied, ranging from generic linear equations to colored graph models. In Section 3 we discuss the maximum likelihood estimator as well as the dual maximum likelihood estimator. Starting from [30, Proposition 7.1.10], we derive a convenient form of the score equations. The natural point of entry for an algebraic geometer is the study of generic linear constraints. This is our topic in Section 4. We compute a range of ML degrees and we compare them to the dual degrees in [28, Section 2.2].

In Section 5 we present our software LinearCovarianceModels.jl. This is written in Julia and it is easy to use. It computes the ML degree and the dual ML degree for a given subspace , and it determines all complex critical points for a given sample covariance matrix . Among these, it identifies the real and positive definite solutions, and it then selects those that are local maxima. The package is available at

This rests on the software HomotopyContinuation.jl due to Breiding and Timme [3].

Section 6 discusses instances where the likelihood function has multiple local maxima. This is meant to underscore the strength of our approach. We then turn to models where the maximum is unique and the MLE is a rational function.

In Section 7 we examine Brownian motion tree models. Here the linear constraints are determined by a rooted phylogenetic tree. We study the ML degree and dual ML degree. We show that the latter equals one for binary trees, and we derive the explicit rational formula for their MLE. A census of these degrees is found in Table 5.

## 2. Models

Let be the

-dimensional real vector space of

symmetric matrices . The subset of positive definite matrices is a full-dimensional open convex cone. Consider any linear subspace of whose intersection with is nonempty. Then is a relatively open convex cone. In optimization, where one uses the closure, this is known as a spectrahedral cone. In statistics, the intersection is a linear covariance model. These are the models we study in this paper. In what follows we discuss various families of linear spaces that are of interest to us.

Generic linear constraints: Fix a positive integer and suppose that is a generic linear subspace of . Here “generic” is meant in the sense of algebraic geometry, i.e.

is a point in the Grassmannian that lies outside a certain algebraic hypersurface. This hypersurface has measure zero, so a random subspace will be generic with probability one. For a geometer, it is natural to begin with the generic case, since its complexity controls the complexity of any special family of linear spaces. In particular, the ML degree for a generic

depends only on and , and this furnishes an upper bound for the ML degree of the special families below.

Diagonal covariance matrices: Here we take and we assume that is a linear space that consists of diagonal matrices. Restricting to covariance matrices that are diagonal is natural when modeling independent Gaussians. We use the term generic diagonal model when is a generic point in the -dimensional Grassmannian of -dimensional subspaces inside the diagonal matrices.

Brownian motion tree models: A tree is a connected graph with no cycles. A rooted tree is obtained by fixing a vertex, called the root, and directing all edges away from the root. Fix a rooted tree with leaves. Every vertex of defines a clade, namely the set of leaves that are descendants of . For the Brownian motion tree model on , the space is spanned by the rank-one matrices , where is the indicator vector of . Hence, if is the set of all clades of then

 (2) Σ=∑A∈CθAeAeTA,where θA are model parameters.

The linear equations for the subspace are whenever the least common ancestors and agree in the tree . Assuming , the union of the models for all trees are characterized by the ultrametric condition

. Matrices of this form play an important role also in hierarchical clustering

[16, Section 14.3.12], phylogenetics [13], and random walks on graphs [10].

Maximum likelihood estimation for this class of models is generally complicated but recently there has been progress (cf. [1, 29]) on exploiting the nice structure of the matrices above. In Section 7 we study computational aspects of the MLE, and, more importantly, we provide a significant advance by considering the dual MLE.

Covariance graph models: We consider models that arise from imposing zero restrictions on entries of . This was studied in [5, 11]. This is similar to Gaussian graphical models where zero restrictions are placed on the inverse . We encode the sparsity structure with a graph whose edges correspond to nonzero off-diagonal entries of . Zero entries in correspond to pairwise marginal independences. These arise in statistical modeling in the context of causal inference [8]. Models with zero restrictions on the covariance matrix are known as covariance graph models. Maximum likelihood in these Gaussian models can be carried out using Iterative Conditional Fitting [5, 11], which is implemented in the ggm package in R [22].

Toeplitz matrices: Suppose

follows the autoregressive model of order 1, that is,

, where and for some . Assume that the are mutually uncorrelated. Then , and hence is a Toeplitz matrix. More generally, covariance matrices from stationary time series are Toeplitz. Multichannel and multidimensional processes have covariance matrices of block Toeplitz form [4, 24]. Similarly, if follows the moving average process of order then if and it is zero otherwise; see, for example, [14, Section 3.3]. Thus, in time series analysis, we encounter matrices like

 (3) ⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣γ0γ1γ2γ3γ4γ1γ0γ1γ2γ3γ2γ1γ0γ1γ2γ3γ2γ1γ0γ1γ4γ3γ2γ1γ0⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ or ⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣γ0γ1000γ1γ0γ1000γ1γ0γ1000γ1γ0γ1000γ1γ0⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦.

We found that the ML degree for such models is surprisingly low. This means that nonlinear algebra can reliably estimate Toeplitz matrices that are fairly large.

Colored covariance graph models: A generalization of covariance graph models is obtained by following [15], which introduces graphical models with vertex and edge symmetries. Models of this type also generalize the Toeplitz matrices and the Brownian motion tree models. Following the standard convention we use the same colors for edges or vertices when the corresponding entries of are equal. The black color is considered neutral and encodes no restrictions.

The Brownian motion tree model corresponds to a colored model over the complete graph, where edge symmetries are encoded by the tree; cf. Figure 1. Also, both matrices in (3) represent covariance graph models with edge and vertex symmetries.

## 3. Maximum likelihood estimator and its dual

Now that we have seen motivating examples, we formally define the MLE problem for a linear covariance model . Suppose we observe a random sample in from . The sample covariance matrix is . The matrix is positive semidefinite. Our aim is to maximize the function

 ℓ(Σ)=logdetΣ−1−tr(SΣ−1)% subject to Σ∈L.

Following [30, Proposition 7.1.10], this equals the log-likelihood function times .

We fix the standard inner product on the space of symmetric matrices. The orthogonal complement to a subspace is defined as usual.

###### Proposition 3.1.

Finding all the critical points of the log-likelihood function amounts to solving the following system of linear and quadratic equations in unknowns:

 (4) Σ∈L,KΣ=In,KSK−K∈L⊥.
###### Proof..

The matrix is a critical point if and only if, for every , the derivative of at in the direction vanishes. This directional derivative equals

 −tr(Σ−1U)+tr(SΣ−1UΣ−1).

This formula follows by multivariate calculus from two facts: (i) the derivative of the matrix mapping

is the linear transformation

; (ii) the derivative of the function is the linear functional .

Using the identity , vanishing of the directional derivative is equivalent to

 −⟨K,U⟩+⟨KSK,U⟩=0.

The condition for all is equivalent to . ∎

###### Example 3.2 (3×3 Toeplitz matrices).

Let be the space of Toeplitz matrices

 Σ=⎡⎢⎣γ0γ1γ2γ1γ0γ1γ2γ1γ0⎤⎥⎦.

This space has dimension in . Fix a sample covariance matrix with real entries. We need to solve the system (4). This consists of equations in unknowns, namely the entries of the covariance matrix and its inverse . The condition gives three linear polynomials:

 σ11−σ33,σ12−σ23,σ22−σ33.

The condition translates into nine bilinear polynomials:

 σ11k11+σ12k12+σ13k13−1,σ12k11+σ22k12+σ23k13,σ13k11+σ23k12+σ33k13,σ11k12+σ12k22+σ13k23,σ12k12+σ22k22+σ23k23−1,σ13k12+σ23k22+σ33k23,σ11k13+σ12k23+σ13k33,σ12k13+σ22k23+σ23k33,σ13k13+σ23k23+σ33k33−1.

Finally, the condition translates into three quadratic polynomials:

 k211s11+k212s11+k213s11+2k11k12s12+2k12k22s12+2k13k23s12+2k11k13s13+2k12k23s13+2k13k33s13+k212s22+k222s22+k223s22+2k12k13s23+2k22k23s23+2k23k33s23+k213s33+k223s33+k233s33−k11−k22−k33,
 k23s13+k12k33s13+k12k22s22+k22k23s22+k13k22s23+k12k23s23+k223s23+k22k33s23+k13k23s33+k23k33s33−k12−k23,
 k11k13s11+k12k13s12+k11k23s12+k213s13+k11k33s13+k12k23s22+k13k23s23+k12k33s23+k13k33s33−k13.

The zero set of these polynomials in unknowns consists of three points . All three solutions are real and positive definite for the sample covariance matrix

 (5) S=⎡⎢⎣4/5−9/5−1/25−9/579/1625/24−1/2525/2417/16⎤⎥⎦≈⎡⎢⎣−0.8000−1.8000−0.0400−1.8000−4.9375−1.0417−0.0400−1.0417−1.0625⎤⎥⎦.

Namely, the three Toeplitz matrices that solve the score equations for this are

 [^γ0,^γ1,^γ2]log-likelihood value−5.349366256855714global maximum−5.407087529857949local maximum−5.415351037271449  saddle point

So, even in this tiny example, our optimization problem has multiple local maxima in the cone . A numerical study of this phenomenon will be presented in Section 6.

Maximum likelihood is usually the preferred estimator in statistical analysis. However, for reasons like robustness and computational efficiency, it is customary to also consider alternative estimators. A large family of estimators of general interest are the M-estimators, which are all asymptotically normal and consistent under relatively minor conditions; see [17, Sections 6.2-6.3]. A special case of an M-estimator is the dual maximum likelihood estimator which we now discuss in more detail.

Dual estimation is based on the maximization of a dual likelihood function. In the Gaussian case this is motivated by interchanging the role of the parameter matrix and the empirical covariance matrix . The Kullback-Leibler divergence

of two Gaussian distributions

and on is equal to

 KL(Σ0,Σ1)=12(tr(Σ−11Σ0)−n+log(detΣ1detΣ0)).

Computing the MLE is equivalent to minimizing with respect to with . On the other hand, the dual MLE is obtained by minimizing with respect to with . Equivalently, we set and we maximize

 ℓ∨(Σ)=logdetΣ−tr(WΣ).

The idea of utilizing the “wrong” Kullback-Leibler distance is ubiquitous in variational inference and is central for mean field approximation and related methods. The idea of using this estimation method for Gaussian linear covariance models is very natural. It results in a unique maximum, since is a convex function on the positive definite cone . See [6] and also [5, Section 3.2] and [18, Section 4].

The following algebraic formulation is the analogue to Proposition 3.1.

###### Proposition 3.3.

Finding all the critical points of the dual log-likelihood function amounts to solving the following system of equations in unknowns:

 (6) Σ∈L,KΣ=In,K−W∈L⊥.
###### Proof..

After switching the role of and , and of and , our problem becomes MLE for linear concentration models. Formula (6) is found in [28, equation (10)]. ∎

The next result lists properties of the dual MLE that are important for statistics.

###### Proposition 3.4.

The dual maximum likelihood estimator of a Gaussian linear covariance model is consistent, asymptotically normal, and first-order efficient.

###### Proof..

See Theorem 3.1 and Theorem 3.2 in [6]. ∎

First-order efficiency means that the asymptotic variance of the properly normalized dual MLE is optimal, that is, it equals the asymptotic variance of the MLE. For finite samples the dual MLE is less efficient than the MLE, but the computational cost of the MLE is higher. In practice, one rarely has access to the real MLE and one uses a local maximum of the likelihood function. Thus, the advantage in efficiency is smaller than predicted by the theory. The dual MLE offers a convenient alternative.

In this paper, we focus on algebraic structures, and we note the following important distinction between our two estimators. The MLE requires the quadratic equations in (4), whereas the dual MLE requires the linear equations in (6). The latter are easier to solve than the former, and they give far fewer solutions. This is quantified by the tables for the ML degrees in the next sections.

We are particularly interested in models whose dual ML estimator can be written as an explicit expression in the sample covariance matrix . We identify such scenarios in Sections 6 and 7. Here is a first example to illustrate this point.

###### Example 3.5.

We revisit the Toeplitz model in Example 3.2. For the dual MLE, the three quadratic polynomials in are now replaced by three linear polynomials:

 k11+k22+k33−w11−w22−w33,k12+k23−w12−w23,k13−w13.

The are the entries of the inverse sample covariance matrix . The new system has two solutions, and we can write the and in terms of the (or the ) using the familiar formula for solving quadratic equations in one variable. Specifically, for the covariance matrix in (5) we find that the dual MLE is given by

 [ˇγ0,ˇγ1,ˇγ2]=[0.203557267562,−0.189349961613,0.1963649733282]=[128436826526803883951212363704694314904961417+52√56164777765459298768970215002736466708112363704694314904961417,−581739061180432087305161818523471574524807085−655679934637√561647777654592987689702150027364667081163146905524715599705244729886305,199045140844651067369185922254668449766828930550600+264990063915733√56164777765459298768970215002736466708158732885988897615893888102759069800].

Needless to say, nonlinear algebra goes much beyond the quadratic formula. In what follows we shall employ state-of-the-art methods for solving polynomial equations.

## 4. General Linear Constraints

The maximum likelihood degree of a linear covariance model is, by definition, the number of complex solutions to the likelihood equations (4) for generic data . This is abbreviated ML degree; see [30, Section 7.1]. To compute the ML degree, take to be a random symmetric matrix and count all complex critical points of the likelihood function for . Equivalently, the ML degree of the model is the number of complex solutions to the polynomial equations in (4).

We also consider the complex critical points of the dual likelihood function . Their number, for a generic matrix , is the dual ML degree of . It coincides with the number of complex solutions to the polynomial equations in (6).

Our ML degrees can be computed symbolically in a computer algebra system that rests on Gröbner bases. However, this approach is limited to small instances. To get further, we use the methods from numerical nonlinear algebra described in Section 5.

We here focus on a generic -dimensional linear subspace of . In practice this means that a basis for is chosen by sampling matrices at random from .

###### Proposition 4.1.

The ML degree and the dual ML degree of a generic subspace of dimension in depends only on and . It is independent of the particular choice of . For small parameter values, these ML degrees are listed in Table 1.

###### Proof..

The independence rests on general results in algebraic geometry, to the effect that the computation can be done over the rational function field where the coordinates of and are unknowns. The ML degree will be the same for all specializations to that remain outside a certain discriminant hypersurface. Table 1 and further values are computed rapidly using the software described in Section 5. ∎

The dual ML degree was already studied by Sturmfels and Uhler in [28, Section 2]. Our table on the right is in fact found in their paper. The symmetry along its columns is proved in [28, Theorem 2.3]. It states that the dual ML degree for dimension coincides with the dual ML degree for codimension . This is derived from the equations (6) by an appropriate homogenization. Namely, the middle equation is clearly symmetric under switching the role of and , and the linear equations on the left and on the right in (6) can also be interchanged under this switch.

It was conjectured in [28, Section 2] that, for fixed , the dual ML degree is a polynomial of degree in the matrix size . This is easy to see for . The polynomials for and were also derived in [28, Section 2].

The situation is similar but more complicated for the ML degree. First of all, the symmetry along columns no longer holds as seen on the left in Table 1. This is explained by the fact that the linear equation is now replaced by the quadratic equation . However, the polynomiality along the rows of Table 1 seems to persist. For the ML degree equals , as shown recently by Coons, Marigliano and Ruddy [7]. For we propose the following conjecture.

###### Conjecture 4.2.

The ML degree of a linear covariance model of dimension is a polynomial of degree in the ambient dimension . For this ML degree equals , and for it equals .

We now come to diagonal linear covariance models. For these models, is a linear subspace of dimension inside the space of diagonal -matrices. We wish to determine the ML degree and dual ML degree when is generic in .

In the diagonal case, the score equations simplify as follows. Both the covariance matrix and the concentration matrix are diagonal. We eliminate the entries of by setting and . We also write for the diagonal entries of the sample covariance matrix , and for their reciprocals. Finally, let denote the reciprocal linear space of , i.e. the variety obtained as the closure of the set of coordinatewise reciprocals of vectors in .

###### Proposition 4.3.

Let be a linear space, viewed as a Gaussian covariance model of diagonal matrices. The score equations for the likelihood in (4) and the dual likelihood in (6) can be written as systems of equations in unknowns as follows:

•  and

•  and  .

The number of complex solutions to (6’) for generic of dimension equals .

###### Proof..

The translation of (4) and (6) to (4’) and (6’) is straightforward. The equations (6’) represent a general linear section of the reciprocal linear space . Proudfoot and Speyer showed that the degree of equals the Möbius invariant of the underlying matroid. We refer to [19] for a recent study and many references. This Möbius invariant equals in the generic case, when the matroid is uniform. ∎

It would be desirable to express the number of complex solutions to (4’) as a matroid invariant, and thereby explain the entries on the left side of Table 2. As before, the th row gives the values of a polynomial of degree . For instance, for we find , and for we find .

## 5. Numerical Nonlinear Algebra

Linear algebra is the foundation of scientific computing and applied mathematics. Nonlinear algebra [23]

is a generalization where linear systems are replaced by nonlinear equations and inequalities. At the heart of this lies algebraic geometry, but there are links to many other branches, such as combinatorics, algebraic topology, commutative algebra, convex and discrete geometry, tensors and multilinear algebra, number theory and representation theory. Nonlinear algebra is not simply a rebranding of algebraic geometry. It highlights that the focus is on computation and applications, and the theoretical needs that this requires, results in a new perspective.

We refer to numerical nonlinear algebra as the branch of nonlinear algebra which is concerned with the efficient numerical solution of polynomial equations and inequalities. In the existing literature, this is referred to as numerical algebraic geometry. In the following we discuss the numerical solution of polynomial equations, and we describe the techniques used for deriving the computational results in this paper.

One of our main contributions is the Julia package LinearCovarianceModels.jl for estimating linear covariance models; see (1). Given , our package computes the ML degree, and the dual ML degree. For any , it finds all critical points and it selects those that are local maxima. The following example explains how this is done.

###### Example 5.1.

We use the package to verify Example 3.2:
julia using LinearCovarianceModels
julia toeplitz(3)
3-dimensional LCModel:

We compute the ML degree of the family by computing all solutions for a generic instance. The pair of solutions and generic instance is called an ML degree witness.
julia W = ml_degree_witness()
MLDegreeWitness:
ML degree 3
model dimension 3
dual false

By default, the computation of the ML degree witness relies on a heuristic stopping criterion. We can numerically verify the correctness by using a trace test

[21].
julia verify(W)
Compute trace...
Norm of trace: 2.6521474798326718e-12
true
We now input the specific sample covariance matrix in (5), and we compute all critical points of this MLE problem using the ML degree witness from the previous step.
julia S = [4/5 -9/5 -1/25; -9/5 79/16 25/24; -1/25 25/24 17/16];
julia critical_points(W, S)
3-element Array{Tuple{Array{Float64,1},Float64,Symbol},1}:
([2.39038, -0.286009, 0.949965], -5.421751313919751, :local_maximum)
([2.52783, -0.215929, -1.45229], -5.346601549034418, :global_maximum)
If only the global maximum is of interest then this can also be computed directly.
julia mle(W, S)
3-element Array{Float64,1}:
2.527832268219689
-0.21592947057775033
-1.4522862659134732

By default only positive definite solutions are reported. To list all critical points we run the command with an additional option.
julia critical_points(W, S, only_positive_definite=false)
In this case, since the ML degree is 3, we are not getting more solutions.

In the rest of this section we explain the mathematics behind our software, and how it applies to our MLE problems. A textbook introduction to the numerical solution of polynomial systems by homotopy continuation methods is [27].

Suppose we are given polynomials in unknowns with complex coefficients, where . We are interested in computing all isolated complex solutions of the system . These solutions comprise the zero-dimensional components of the variety where .

The general idea of homotopy continuation is as follows. Assume we have another system of polynomials for which we know some or all of its solutions. Suppose there is a homotopy with , with the property that, for every , there exists a smooth path with and for all . Then we can track each point in to a point in . This is done by solving the Davidenko differential equation

 ∂H∂x(x(t),t)⋅˙x(t)+∂H∂t(x(t),t)=0

with initial condition . Using a predictor-corrector scheme for numerical path tracking, both the local and global error can be controlled. Methods called endgames are used to handle divergent paths and singular solutions [27, Chapter 10].

Here is a general framework for start systems and homotopies. Embed in a family of polynomial systems , continuously parameterized by a convex open set . We have for some . Outside a Zariski closed set , every system in has the same number of solutions. If then is such a generic instance of the family , and the following is a suitable homotopy [25]:

 (7) H(x,t)=F(1−t)p+tq(x).

Now, to compute , it suffices to find all solutions of a generic instance and then track these along the homotopy (7). Obtaining all solutions of a generic instance can be a challenge, but this has to be done only once! That is the offline phase. Tracking from a generic to a specific instance of interest is the online phase.

A key point in applying this method is the choice of the family . For MLE problems in statistics, it is natural to choose as the space of data or instances. In our scenario, is , or a complex version thereof. We shall discuss this below.

First, we explain the monodromy method for an arbitrary family . Suppose the general instance has solutions, and that we are given one start pair . This means that is a solution to the instance . Consider the incidence variety

 Y:={(x,p)∈Cn×Q|Fp(x)=0}.

Let be the projection from onto the second factor. For , the fiber has exactly points. A loop in based at has lifts to . Associating a point in the fiber to the endpoint of the corresponding lift gives a permutation in . This defines an action of the fundamental group of on the fiber . The monodromy group of our family is the image of the fundamental group in .

The monodromy method fills the fiber by exploiting the monodromy group. For this, the start solution is numerically tracked along a loop in , yielding a solution at the end. If , then is also tracked along the same loop, possibly yielding again a new solution. This is done until no more solutions are found. Then, all solutions are tracked along a new loop, where the process is repeated. This process is stopped by use of a trace test. For a detailed description of the monodromy method and the trace test see [9, 21]. To get this off the ground, one needs a start pair . This can often be found by inverting the problem. Instead of finding a solution to a given , we start with and look for such that .

We now explain how this works for the score equations (4

) of our MLE problem. First pick a random matrix

in the subspace . We next compute by inverting . Finally we need to find a symmetric matrix such that . Note that this is a linear system of equations and hence directly solvable. In this manner, we easily find a start pair by setting and .

The number of solutions to a generic instance is the ML degree of our model. A priori knowledge of is useful because it serves as a stopping criterion in the monodromy method. This is one reason for focusing on the ML degree in this paper.

## 6. Local Maxima versus Rational MLE

The theme of this paper is maximum likelihood inference for linear covariance models. We developed some numerical nonlinear algebra for this problem, and we offer a software package (1). From the applications perspective, this is motivated by the fact that the likelihood function is non-convex. It can have multiple local maxima. A concrete instance for Toeplitz matrices was shown in Example 3.2.

In this section we undertake a more systematic experimental study of local maxima. Our aim is to answer the following question: there is the theoretical possibility that has many local maxima, but can we also observe this in practice?

To address this question, we explored a range of linear covariance models . For each model, we conducted the following experiment. We repeatedly generated sample covariance matrices . This was done as follows. We first sample a matrix

by picking each entry independently from a normal distribution with mean zero and variance one. And, then we set

. This is equivalent to sampling from the standard Wishart distribution with degrees of freedom.

For each of the generated sample covariance matrices , we computed the real solutions of the likelihood equations (4). From these, we identified the set of all local maxima in , and we extracted its subset of local maxima in the positive definite cone . We recorded the numbers of these local maxima. Moreover, we kept track of the fraction of instances for which there were multiple (positive define) local maxima. In Table 3 we present our results for and generic linear subspaces .

For each between and , we selected five generic linear subspaces in the -dimensional space . Each linear subspace was constructed by choosing a basis of positive definite matrices. The basis elements were constructed with the same sampling method as the sample covariance matrices. The ML degree of this linear covariance model is the corresponding entry in the column on the left in Table 1. These degrees are repeated in the row named ML degree in Table 3.

For each model , we generated sample covariance matrices , and we solved the likelihood equations (4) using our software  LinearCovarianceModels.jl. The row max denotes the largest number of local maxima that was observed in these experiments. The row multiple gives the fraction of instances which resulted in two or more local maxima. These two numbers pertain to local maxima in . The rows max pd and multiple pd are the analogues restricted to the positive definite cone .

For an illustration, let us discuss the models of dimension . These equations (4) have complex solutions, but the number of real solutions is much smaller. Nevertheless, in two fifths of the instances () there were two or more local maxima in . In one third of the instances () the same happened . The latter is the case of interest in statistics. One instance had four local maxima in .

The second experiment we report concerns a combinatorially defined class of linear covariance models, namely the Brownian motion tree models in (2). We consider eleven combinatorial types of trees with leaves. For each model we perform the experiment described above, but we now used sample covariance matrices per model. Our results are presented in Table 4, in the same format as in Table 3.

The eleven trees are numbered by the order in which they appear in Table 5. For instance, tree 1 gives the -dimensional model in whose covariance matrices are

 Σ=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣γ1γ6γ6γ6γ7γ6γ2γ6γ6γ7γ6γ6γ3γ6γ7γ6γ6γ6γ4γ7γ7γ7γ7γ7γ5⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦.

This model has ML degree . Around eight percent of the instances led to multiple maxima among positive definite matrices. Up to three such maxima were observed.

The results reported in Tables 3 and 4 show that the maximal number of local maxima increases with the ML degree. But, they do not increase as fast as one would expect from the growth of the ML degree. On the other hand, the frequency of observing multiple local maxima seems to be closer correlated to the ML degree.

Here is an interesting observation to be made in Table 4. The last three trees, labeled 9, 10 and 11, are the binary trees. These have the maximum dimension . For these models, every local maximum in is also in the positive definite cone . We also verified this for all binary trees with leaves. This is interesting since the positive definiteness constraint is the hardest to respect in an optimization routine. It is tempting to conjecture that this persists for all binary trees with .

There is another striking observation in Table 5. The dual ML degree for binary trees is always equal to one. We shall prove in Theorem 7.3 that this holds for any . This means that the dual MLE can be expressed as a rational function in the data . Hence there is only one local maximum, which is therefore the global maximum.

We close this section with a few remarks on the important special case when the ML degree or the dual ML degree is equal to one. This holds if and only if each entry of the estimated matrix or is a rational function in the quantities .

Rationality of the MLE has received a lot of attention in the case of discrete random variables. See [30, Section 7.1] for a textbook reference. If the MLE of a discrete model is rational then its coordinates are alternating products of linear forms in the data [30, Theorem 7.3.4]. This result due to Huh was refined in [12, Theorem 1]. At present we have no idea what the analogue in the Gaussian case might look like.

###### Problem 6.1.

Characterize all Gaussian models whose MLE is a rational function.

In addition to the binary trees in Theorem 7.3, statisticians are familiar with a number of situations when the dual MLE is rational. The dual MLE is the MLE of a linear concentration model with the sample covariance matrix replaced by its inverse . This is studied in [28] and in many other sources on Gaussian graphical models and exponential families. The following result paraphrases [28, Theorem 4.3].

###### Proposition 6.2.

If a linear covariance model is given by zero restrictions on , then the dual ML degree is equal to one if and only if the associated graph is chordal.

It would be interesting to extend this result to other combinatorial families, such as colored covariance graph models (cf. [15]), including structured Toeplitz matrices.

The following example illustrates Problem 6.1 and it raises some further questions.

###### Example 6.3.

We present a linear covariance model such that both the MLE and the dual MLE are rational functions. Fix and let

be the hyperplane with equation

. By Proposition 6.2, the dual ML degree of is one. The model is dual to the decomposable undirected graphical model with missing edge .

Following [20, 28], we obtain the rational formula for its dual MLE:

 (8) ˇk12=W1,RW−1R,RWR,2,andˇkij=wijfor (i,j)≠(1,2).

Here and is our notation for submatrices of .

The ML degree of the model is also one. To see this, we note that is the DAG model with edges whenever unless . By [20, Section 5.4.1], the MLE of any Gaussian DAG model is rational. In our case, we find , where is the matrix which is zero apart from the upper left block

The entries in and are rational functions in the data . But, unlike in the discrete case of [12], here the rational functions are not products of linear forms. Problem 6.1 asks for an understanding of its irreducible factors.

Example 6.3 raises many questions. First of all, can we characterize all linear spaces with rational formulas for their MLE, or their dual MLE, or both of them? Second, it would be interesting to study arbitrary models that are hyperplanes. Consider the entries for in Tables 1 and 3. We know from [28, Section 2.2] that the dual ML degree equals . The ML degree seems to be . In all cases there seems to be only one local (and hence global) maximum. How to prove these observations? Finally, it is worthwhile to study the MLE when is a generic symmetric matrix of rank . What is the ML degree in terms of and ?

## 7. Brownian Motion Tree Models

We now study the linear space associated with a rooted tree with  leaves. The equations of are whenever . In the literature (cf. [13, 29]) one assumes that the parameters in (2) are nonnegative. Here, we relax this hypothesis: we allow all covariance matrices in the spectrahedron .

The ML degree and its dual do not depend on how the leaves of a tree are labeled but only on the tree topology. For fixed each tree topology is uniquely identified by the set of clades. Since the root clade and the leaf-clades , …, are part of every tree, they are omitted in our notation. For example, if then the tree is the binary tree with four inner vertices corresponding to the three non-trivial clades mentioned explicitly. This tree is depicted in Figure 1.

We computed the ML degree and the dual ML degree of for many trees . In Table 5 we report results for five and six leaves. We notice that the dual ML degree is exactly one for all binary trees. This suggests that the dual MLE is a rational function. Our main result in this section (Theorem 7.3) says that this is indeed true.

The equations (6) for the dual ML degree can be written as for all clades . Here is given and is unknown. We abbreviate

 (9) wA,B=∑i∈A∑j∈Bwij=eTAWeB.

The same notation is used for general matrices. We present two examples with .

###### Example 7.1.

Consider the tree with clades , , shown in [29, Figure 1]. The dual MLE satisfies for , and , , and

 ˇkij=w12,34wi,12wj,34w12,12w34,34for i∈{1,2}, j∈{3,4}.
###### Example 7.2.

The tree with clades , has , , and

 ˇk13=w12,3w1,12w12,12,ˇk14=w123,4w1,12w12,123w12,12w123,123,ˇk23=w12,3w2,12w12,12,
 ˇk24=w123,4w2,12w12,123w12,12w123,123,ˇk34=w123,4w123,3w123,123.

Both examples were computed in Mathematica using the description of the Brownian motion tree model in terms of the inverse covariance matrix given in [29].

Recall that for we write for the set of leaves of that are descendants of . The following theorem generalizes formulas in the above two examples. It is our main result in Section 7.

###### Theorem 7.3.

Consider the model given by a rooted binary tree with leaves. The dual MLE satisfies for all , and its off-diagonal entries are

 (10) ˇkij=wA,B∏u→vwde(v),de(u)wde(u),de(u)for 1≤i

Here are the clades of the two children of . The product is over all edges of , except for the two edges with , on the path from to in .

###### Proof..

We first rewrite (6) in terms of the coordinates for and for . See [29, Example 1.1] for an example of this coordinate change. The condition means that for every clade of . This can be rewritten in the new coordinates as follows:

• for all , and

• for all inner vertices of , where is the partition of given by the two children of .

Now fix with clade partition as above, so for all , . The parametrization and in [29, Theorem 1.2] yields

 pA,B=tu⋅∑i∈A1ti⋅∑j∈B1tj=tu⋅p0,A⋅p0,B.

Using the equations in (ii), we obtain

 (11) tu=−wA,Bp0,Ap0,Band hence pij=−wA,Bp0ip0,Ap0jp0,B.

We claim that the following identity holds for any clade :