# Computationally efficient likelihood inference in exponential families when the maximum likelihood estimator does not exist

In a regular full exponential family, the maximum likelihood estimator (MLE) need not exist, in the traditional sense, but the MLE may exist in the Barndorff-Nielsen completion of the family. Existing algorithms for finding the MLE in the Barndorff-Nielsen completion solve many linear programs; they are slow in small problems and too slow for large problems. We provide new, fast, and scalable methodology for finding the MLE in the Barndorff-Nielsen completion based on approximate null eigenvectors of the Fisher information matrix. Convergence of Fisher information follows from cumulant generating function convergence, conditions for which are given.

There are no comments yet.

## Authors

• 9 publications
• 2 publications
• ### Penalized Maximum Likelihood Estimator for Mixture of von Mises-Fisher Distributions

The von Mises-Fisher distribution is one of the most widely used probabi...
09/07/2020 ∙ by Tin Lok James Ng, et al. ∙ 0

• ### Relative Performance of Fisher Information in Interval Estimation

Maximum likelihood estimates and corresponding confidence regions of the...
07/09/2021 ∙ by Sihang Jiang, et al. ∙ 0

• ### A Polynomial Time Algorithm for Log-Concave Maximum Likelihood via Locally Exponential Families

We consider the problem of computing the maximum likelihood multivariate...
07/18/2019 ∙ by Brian Axelrod, et al. ∙ 0

• ### A simple and efficient profile likelihood for semiparametric exponential family

Semiparametric exponential family proposed by Ning et al. (2017) is an e...
11/30/2017 ∙ by Lu Lin, et al. ∙ 0

• ### Quantifying and Reducing Bias in Maximum Likelihood Estimation of Structured Anomalies

Anomaly estimation, or the problem of finding a subset of a dataset that...
07/15/2020 ∙ by Uthsav Chitra, et al. ∙ 25

• ### Horizon-Independent Optimal Prediction with Log-Loss in Exponential Families

We study online learning under logarithmic loss with regular parametric ...
05/19/2013 ∙ by Peter Bartlett, et al. ∙ 0

• ### Some Investigations about the Properties of Maximum Likelihood Estimations Based on Lower Record Values for a Sub-Family of the Exponential Family

Here, in this paper it has been considered a sub family of exponential f...
11/05/2017 ∙ by S. D. Gore, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We develop an inferential framework for regular full discrete exponential families when the observed value of the canonical statistic lies on the boundary of its convex support. Then the maximum likelihood estimator (MLE) for the canonical parameter cannot exist [3, Theorem 9.13]. But the MLE may exist in a completion of the exponential family. Completions for exponential families have been described (in order of increasing generality) by Barndorff-Nielsen [3, pp. 154–156], Brown [4, pp. 191–201], Csiszár and Matúš [5], and Geyer [12, unpublished PhD thesis, Chapter 4]. The latter two are equivalent for full exponential families, but the last, most general, has much stronger algebraic properties that help with theory, so we use it. It also is the only completion that is a completion under no regularity conditions whatsoever (other than exponential family). It works for curved exponential families and other non-full exponential families ([5] works for non-full but closed convex families). Following Geyer [8] we will call all of these completions the Barndorff-Nielsen completion without fuss about the technical details differentiating them.

Geyer [8]

developed ways to do hypothesis tests and confidence intervals when the MLE in an exponential family does not exist in the conventional sense. The hypothesis test scheme was credited to Fienberg (personal communication — an answer he gave to a question at the end of a talk). The confidence interval scheme generates one-sided non-asymptotic confidence intervals, because the MLE fails to exist in the conventional sense when the canonical statistic is on the boundary of its convex support, and this is an inherently one-sided situation, and conventional asymptotics do not work near the boundary.

To simplify both explanation and computation, Geyer [8] assumed the regularity conditions that Brown [4] assumed for his completion. These conditions of Brown hold for nearly all applications known to us (applications for which a more general completion are required include aster models [10] and Markov spatial point processes [7]). We will also need to use Brown’s conditions to guarantee our methods work.

The issue of when the MLE exists in the conventional sense and what to do when it doesn’t is very important because of the wide use of generalized linear models for discrete data and log-linear models for categorical data. In every application of these, existing statistical software gives completely invalid results when the MLE does not exist in the conventional sense, but most such software either does not check for this problem or does very weak checks that have high probability of both false positives and false negatives. Moreover, even if these checks correctly detect nonexistence of the MLE in the conventional sense, conventional software implements no valid procedures for statistical inference when this happens. When this issue is detected most users will go to smaller statistical models for which the MLE seems to exist, even though such models may neither fit the data nor address the questions of scientific interest.

Geyer [8] gives examples with valid inference. Authoritative textbooks, such as Agresti [1, Section 6.5], discuss the issue but provide no solutions.

Thus a solution to this issue that is efficiently computable would be very important. The algorithms of Geyer [12, 8] and Albert and Anderson [2] are based on doing many linear programs. The algorithm of Geyer [8] is the most efficient; it is mostly due to Fukuda, who provided the underlying C code for the computational geometry functions of R package rcdd [11], which this algorithm uses. This algorithm does at most linear programs, where

is the number of cases of a generalized linear model (GLM) or the number of cells in a contingency table, in order to determine the existence of the MLE in the conventional sense. Each of these linear programs has

variables, where is the number of parameters of the model, and up to inequality constraints. Since linear programming can take time exponential in when pivoting algorithms are used, and since such algorithms are necessary in computational geometry to get correct answers despite inaccuracy of computer arithmetic (see the warnings about the need to use rational arithmetic in the documentation for R package rcdd [11]), these algorithms can be very slow. Typically, they take several minutes of computer time for toy problems and can take longer than users are willing to wait for real applications. These algorithms do have the virtue that if they use infinite-precision rational arithmetic, then their calculations are exact, as good as a mathematical proof.

Previous theoretical discussions of these issues that do not provide algorithms [3, 4, 5] use the notions of faces of convex sets or tangent cones or normal cones and all of these are much harder to compute than the algorithm of Geyer [8]. So they provide no direction toward efficient computing.

Because computational geometry is so slow and does not scale to large problems, we abandon it and return to calculations using the inexact computer arithmetic provided by computer hardware. Conventional maximum likelihood computations come close, in a sense, to finding the MLE in the Barndorff-Nielsen completion. They go uphill on the likelihood function until they meet their convergence criteria and stop. At this point, the canonical parameter estimates are still infinitely far away from their analogs for the MLE in the completion, but the corresponding probability distributions are close in total variation norm. Here we show that they are also close in the sense of moment generating function convergence (Theorem

7 below) and consequently moments of all orders are also close. The MLE in the completion is not only a limit of distributions in the original family but also a distribution in the original family conditioned on the affine hull of a face of the effective domain of the log likelihood supremum function [12, Theorem 4.3, special cases of which were known to other authors]. To do valid statistical inference when the MLE does not exist in the conventional sense, we need to know this affine hull.

This affine hull is the support of the canonical statistic under the MLE distribution (in the completion). Hence it is a translate of the null space of the Fisher information matrix, which (for an exponential family) is the variance-covariance matrix of the canonical statistic. This affine hull must contain the mean vector of the canonical statistic under the MLE distribution. Hence knowing the mean vector and variance-covariance matrix of the canonical statistic under the MLE distribution allows us to do valid statistical inference, and our conventional maximum likelihood calculation (go uphill until things don’t change much in an iteration) will give us good approximations of them (relative to the inexactness of computer arithmetic).

We will get nearly the correct affine hull if we can guess the correct null space of the Fisher information matrix from its eigenvalues and eigenvectors computed using inexact computer arithmetic. We will not be able to do this when the statistical model has an ill-conditioned model matrix (the model matrix for categorical data analysis being the model matrix when it is recast as a Poisson regression). Ill-conditioning will add spurious nearly zero eigenvalues that arise from the ill-conditioning rather than the concentration of the MLE distribution on the correct affine hull. We will suppose that the model matrix is not ill-conditioned. If a sequence of parameter estimates maximizes the likelihood, then the corresponding sequence of probability density functions (PDFs) has subsequences converging to PDFs of MLE distributions in the Barndorff-Nielsen completion

[12, Theorem 4.1]. If the MLE distribution is unique, as it always is for a full exponential family [8, Section 3.8], then all of these MLE PDFs will correspond to the same probability distribution. For a curved exponential family, the MLE need not be unique, even when it exists in the conventional sense.

## 2 Motivating example

Consider the case of complete separation in the logistic regression model as an example of a discrete exponential family with data on the boundary of the convex support of the canonical statistic. Suppose that we have one predictor vector

having values 10, 20, 30, 40, 60, 70, 80, 90, and suppose the components of the response vector are 0, 0, 0, 0, 1, 1, 1, 1. Then the simple logistic regression model that has linear predictor exhibits failure of the MLE to exist in the conventional sense. This example is the same as that of Agresti [1, Section 6.5.1].

For an exponential family, the submodel canonical statistic is , where is the model matrix [8, Section 3.9]. Figure 1 shows the observed value of the canonical statistic vector and the support (all possible values) of this vector.

As is obvious from the figure, the observed value of the canonical statistic is on the boundary of the convex support, in which case the MLE does not exist in the conventional sense [8, Theorem 4]

. In general, this figure is too computationally intensive and too high-dimensional to draw. So our methods do not use such figures. It is here to develop intuition. In our methodology, this degeneracy follows from the Fisher information matrix at the apparent MLE being nearly the zero matrix.

In this example, like in Example 1 of Geyer [8], the MLE in the Barndorff-Nielsen completion corresponds to a completely degenerate distribution. This MLE distribution says no other data than what was observed could be observed. But the sample is not the population and estimates are not parameters. So this degeneracy is not a problem. To illustrate the uncertainty of estimation we follow Figure 2 of Geyer [8], which shows confidence intervals (necessarily one-sided) for the saturated model mean value parameters.

Our Figure 2 shows that, as would be expected from so little data, the confidence intervals are very wide. The MLE in the completion says the probability of observing a response equal to one jumps from zero to one somewhere between 40 and 60. The confidence intervals show that we are fairly sure that this probability goes from near zero at to near one at but we are very unsure where jumps are if there are any. These intervals were constructed using the theory of Geyer [8, Section 3.16]. The actual computations follow some later course notes [9].

Our theory allows for inference in not only the complete separation example but also in any discrete regular full exponential family where the MLE does not exist in the traditional sense. For further motivation, see the examples in Section 2 of [8]. We redo Example 2.3 of [8] in our Section 7.2, and we find that our methodology produces the same inferences as theirs in a fraction of the time.

## 3 Laplace transforms and standard exponential families

Let be a positive Borel measure on a finite-dimensional vector space . The log Laplace transform of is the function defined by

 c(θ)=log∫e⟨x,θ⟩λ(dx),θ∈E∗, (1)

where is the dual space of , where is the canonical bilinear form placing and in duality, and where is the extended real number system, which adds the values and to the real numbers with the obvious extensions to the arithmetic and topology [15, Section 1.E].

If one prefers, one can take for some , and define

but the coordinate-free view of vector spaces offers more generality and more elegance. Also, as we are about to see, if is the sample space of a standard exponential family, then a subset of is the canonical parameter space, and the distinction between and helps remind us that we should not consider these two spaces to be the same space.

A log Laplace transform is a lower semicontinuous convex function that nowhere takes the value (the value is allowed and occurs where the integral in (1) does not exist) [12, Theorem 2.1]. The effective domain of an extended-real-valued convex function on is

 domc={θ∈E∗:c(θ)<+∞}.

For every , the function defined by

 fθ(x)=e⟨x,θ⟩−c(θ),x∈E, (2)

is a probability density with respect to . The set where is any nonempty subset of , is called a standard exponential family of densities with respect to . This family is full if . We also say is the standard exponential family generated by having canonical parameter space , and is the generating measure of .

The log likelihood of this family is (2) have log likelihood

 l(θ)=⟨x,θ⟩−c(θ). (3)

A general exponential family [12, Chapter 1] is a family of probability distributions having a sufficient statistic taking values in a finite-dimensional vector space that induces a family of distributions on that have a standard exponential family of densities with respect to some generating measure. Reduction by sufficiency loses no statistical information, so the theory of standard exponential families tells us everything about general exponential families [12, Section 1.2].

In the context of general exponential families is called the canonical statistic and the canonical parameter (the terms natural statistic and natural parameter are also used). The set is the canonical parameter space of the family, the set is the canonical parameter space of the full family having the same generating measure. A full exponential family is said to be regular if its canonical parameter space is an open subset of .

The cumulant generating function (CGF) of the distribution of the canonical statistic for parameter value is the function defined by

 kθ(t)=log∫e⟨x,t⟩fθ(x)λ(dx)=c(θ+t)−c(θ) (4)

provided this distribution has a CGF, which it does if and only if is finite on a neighborhood of zero, that is, if and only if . Thus every distribution in a full family has a CGF if and only if the family is regular. Derivatives of evaluated at zero are the cumulants of the distribution for . These are the same as derivatives of evaluated at .

## 4 Generalized affine functions

### 4.1 Characterization on affine spaces

Exponential families defined on affine spaces instead of vector spaces are in many ways more elegant [12, Sections 1.4 and 1.5 and Chapter 4]. To start, a family of densities with respect to a positive Borel measure on an affine space is a standard exponential family if the log densities are affine functions. Following Geyer [12, Chapter 4], we complete the exponential family by taking pointwise limits of densities, allowing and as limits.

We call these limits generalized affine functions. Real-valued affine functions on an affine space are functions that are are both convex and concave. Generalized affine functions on an affine space are extended-real-valued functions that are are both concave and convex [12, Chapter 4]. (For a definition of extended-real-valued convex functions see Rockafellar [14, Chapter 4].)

We thus have two characterizations of generalized affine functions: functions that are both convex and concave and functions that are limits of sequences of affine functions. Further characterizations will be given below.

Let denote a sequence of affine functions that are log densities in a standard exponential family with respect to , that is, for all . Since pointwise if and only if pointwise, the idea of completing an exponential family naturally leads to the the study of generalized affine functions.

If is a generalized affine function, we use the notation

 h−1(R) ={x∈E:h(x)∈R} h−1(∞) ={x∈E:h(x)=∞} h−1(−∞) ={x∈E:h(x)=−∞}
###### Theorem 1.

An extended-real-valued function on a finite-dimensional affine space is generalized affine if and only if one of the following cases holds

• ,

• ,

• and is an affine function, or

• there is a hyperplane

such that for all points on one side of , for all points on the other side of , and restricted to is a generalized affine function.

All theorems for which a proof does not follow the theorem statement are proved in either the appendix or the supplementary material.

The intention is that this theorem is applied recursively. If we are in case (d), then the restriction of to is another generalized affine function to which the theorem applies. Since a nested sequence of hyperplanes can have length at most the dimension of , the recursion always terminates.

### 4.2 Topology

Let denote the space of generalized affine functions on a finite-dimensional affine space with the topology of pointwise convergence.

###### Theorem 2.

is a compact Hausdorff space.

###### Theorem 3.

is a first countable topological space.

###### Corollary 1.

is sequentially compact.

Sequentially compact means every sequence has a (pointwise) convergent subsequence. That this follows from the two preceding theorems is well known [16, p. 22, gives a proof].

The space is not metrizable, unless is zero-dimensional [12, penultimate paragraph of Section 3.3]. So we cannot use - arguments, but we can use arguments involving sequences, using sequential compactness.

Let be a positive Borel measure on , and let be a nonempty subset of such that

 ∫ehdλ=1,h∈H. (5)

Then, following Geyer [12, Chapter 4], we call a standard generalized exponential family of log densities with respect to . Let denote the closure of in .

###### Theorem 4.

Maximum likelihood estimates always exist in the closure .

###### Proof.

Suppose is the observed value of the canonical statistic. Then there exists a sequence in such that

 hn(x)→suph∈Hh(x).

This sequence has a convergent subsequence in . This limit is in and maximizes the likelihood. ∎

We claim this is the right way to think about completion of exponential families. For full exponential families or even closed convex exponential families the closure only contains proper log probability densities ( that satisfy the equation in (5)). This is shown by Geyer [12, Chapter 2] and also by Csiszár and Matúš [5].

For curved exponential families and for general non-full exponential families, applying Fatou’s lemma to pointwise convergence in gives only

 0≤∫ehdλ≤1,h∈¯¯¯¯¯H. (6)

When the integral in (6) is strictly less than one we say is an improper log probability density. The examples in Geyer [12, Chapter 4] show that improper probability densities cannot be avoided in curved exponential families.

Geyer [12, Theorem 4.3] shows that this closure of an exponential family can be thought of as a union of exponential families, so this generalizes the conception of Brown [4] of the closure as an aggregate exponential family. Thus our method generalizes all previous methods of completing exponential families.

Admittedly, this characterization of the completion of an exponential family is very different from any other in its ignoring of parameters. Only log densities appear. Unless one wants to call them parameters — and that conflicts with the usual definition of parameters as real-valued — parameters just do not appear.

So in the next section, we bring parameters back.

### 4.3 Characterization on vector spaces

In this section we take sample space to be vector space (which, of course, is also an affine space, so the results of the preceding section continue to hold). Recall from Section 3 above, that denotes the dual space of , which contains the canonical parameter space of the exponential family.

###### Theorem 5.

An extended-real-valued function on a finite-dimensional vector space is generalized affine if and only if there exist finite sequences (perhaps of length zero) of vectors , in in and scalars , such that , are linearly independent and has the following form. Define and, inductively, for integers such that

 Hi ={x∈Hi−1:⟨x,ηi⟩=δi} C+i ={x∈Hi−1:⟨x,ηi⟩>δi} C−i ={x∈Hi−1:⟨x,ηi⟩<δi}

all of these sets (if any) being nonempty. Then whenever for any , whenever for any , and is either affine or constant on , where and are allowed for constant values.

The “if any” refers to the case where the sequences have length zero, in which case the theorem asserts that that is affine on or constant on .

As we saw in the preceding section, we are interested in likelihood maximizing sequences. Here we represent the likelihood maximizing sequence in the coordinates of the linearly independent vectors that characterize the generalized affine function according to its Theorem 5 representation. Let be a likelihood maximizing sequence of canonical parameter vectors, that is,

 l(θn)→supθ∈Θl(θ),as n→∞, (7)

where the log likelihood is given by (3) and where is the canonical parameter space of the family. To make connection with the preceding section, define

 hθ(x)=l(θ)=⟨x,θ⟩−c(θ).

Then is a sequence of affine functions, which has a subsequence that converges (in ) to some generalized affine function , which maximizes the likelihood:

 h(x)=supθ∈Θl(θ). (8)

The following lemma gives us a better understanding of the convergence .

###### Lemma 1.

Suppose that a generalized affine function on a finite dimensional vector space is finite at at least one point. Represent as in Theorem 5, and extend , to be a basis , for . Suppose is a sequence of affine functions converging to in . Then there are sequences of scalars and such that

 hn(y)=an+j∑i=1bi,n(⟨y,ηi⟩−δi)+p∑i=j+1bi,n⟨y,ηi⟩,y∈E, (9)

and, as , we have

1. , for ,

2. , for ,

3. converges, for , and

4. converges.

In (9) the first sum is empty when and the second sum is empty when . Such empty sums are zero by convention.

The results given in Lemma 1 are applicable to generalized affine functions in full generality. The case of interest to us, however, is when is the likelihood maximizing sequence constructed above.

###### Corollary 2.

For data from a regular full exponential family defined on a vector space , suppose is a likelihood maximizing sequence satisfying (7) with log densities defined by (8) converging pointwise to a generalized affine function . Characterize and as in Theorem 5 and Lemma 1. Define . Then conclusions (a) and (b) of Lemma 1 hold in this setting and

 ψn→θ∗,asn→∞,

where is the MLE of the exponential family conditioned to .

In case the conclusion is the trivial zero converges to zero. The original exponential family conditioned on the event is what Geyer [8] calls the limiting conditional model (LCM).

###### Proof.

The conditions of Lemma 1 are satisfied by our assumptions so all conclusions of Lemma 1 are satisfied. As a consequence, as . The fact that is the MLE of the LCM restricted to follows from our assumption that is a likelihood maximizing sequence. ∎

Taken together, Theorem 5, Lemma 1, and Corollary 2 provide a theory of maximum likelihood estimation in the completions of exponential families that is the theory of the preceding section with canonical parameters brought back.

## 5 Convergence theorems

### 5.1 Cumulant generating function convergence

We now show CGF convergence along likelihood maximizing sequences (7). This implies convergence in distribution and convergence of moments of all orders.

Theorems 6 and 7 in this section say when CGF convergence occurs. Their conditions are somewhat unnatural (especially those of Theorem 6). However, the example in Section 4 of the supplementary material shows not only that some conditions are necessary to obtain CGF convergence (it does not occur for all full discrete exponential families) but also that the conditions of Theorem 6 are sharp, being just what is needed to rule out that example.

The CGF of the distribution having log density that is the generalized affine function is defined by

 κ(t)=log∫e⟨y,t⟩eh(y)λ(dy),

and similarly

 κn(t)=log∫e⟨y,t⟩ehn(y)λ(dy)

where we assume are the log densities for a likelihood maximizing sequence such that pointwise. The next theorem characterizes when pointwise.

Let denote the log Laplace transform of the restriction of to the set , that is,

 cA(θ)=log∫Ae⟨y,θ⟩λ(dy),

where, as usual, the value of the integral is taken to be when the integral does not exist (a convention that will hold for the rest of this section).

###### Theorem 6.

Let be a finite-dimensional vector space of dimension . For data from a regular full exponential family with natural parameter space and generating measure , assume that all LCMs are regular exponential families. Suppose that is a likelihood maximizing sequence satisfying (7) with log densities converging pointwise to a generalized affine function . Characterize as in Theorem 5. When , and for , define

 Di={y∈C−i:⟨y,ηk⟩>δk,somek>i},F=E∖∪j−1i=1Di={y:⟨y,ηi⟩≤δi,1≤i≤j}, (10)

and assume that

 supθ∈Θsupy∈∪j−1i=1Die⟨y,θ⟩−c∪j−1i=1Di(θ)<∞orλ(∪j−1i=1Di)=0. (11)

Then converges to pointwise for all in a neighborhood of 0.

Discrete exponential families automatically satisfy (11) when

 infy∈∪j−1i=1Diλ({y})>0.

In this setting,

corresponds to the probability mass function for the random variable conditional on the occurrence of

. Thus,

 supθ∈Θsupy∈∪j−1i=1Di(e⟨y,θ⟩−c∪j−1i=1Di(θ)) =supθ∈Θsupy∈∪j−1i=1Di⎛⎜⎝e⟨y,θ⟩λ({y})λ({y})∑x∈∪j−1i=1Die⟨x,θ⟩λ({x})⎞⎟⎠ ≤supy∈∪j−1i=1Di(1/λ({y}))<∞.

Therefore, Theorem 6 is applicable for the non-existence of the maximum likelihood estimator that may arise in logistic and multinomial regression.

We show in the next theorem that discrete families with convex polyhedral support also satisfy (11) under additional regularity conditions that hold in practical applications. When is convex polyhedron, we can write as in [15, Theorem 6.46]. When the MLE does not exist, the data is on the boundary of . Denote the active set of indices corresponding to the boundary containing by In preparation for Theorem 7 we define the normal cone , the tangent cone , and faces of convex sets and then state conditions required on .

###### Definition 1.

The normal cone of a convex set in the finite dimensional vector space at a point is

 NK(x)={η∈E∗:⟨y−x,η⟩≤0% for ally∈K}.
###### Definition 2.

The tangent cone of a convex set in the finite dimensional vector space at a point is

 TK(x)=cl{s(y−x):y∈Kands≥0}

where denotes the set closure operation.

When is a convex polyhedron, and are both convex polyhedron with formulas given in [15, Theorem 6.46]. These formulas are

 TK(x) ={y:⟨y,αi⟩≤0for alli∈I(x)}, NK(x) ={c1α1+⋯+cmαm:ci≥0fori∈I(x),ci=0fori∉I(x)}.
###### Definition 3.

A face of a convex set is a convex subset of such that every (closed) line segment in with a relative interior point in has both endpoints in . An exposed face of is a face where a certain linear function achieves its maximum over [14, p. 162].

The conditions required on for our theory to hold are from Brown [4, pp. 193–197]. These conditions are:

• The support of the exponential family is a countable set .

• The exponential family is regular.

• Every is contained in the relative interior of an exposed face of the convex support .

• The convex support of the measure equals , where is the generating measure for the exponential family.

Conditions (i) and (ii) are already assumed in Theorem 6. It is now shown that discrete exponential families satisfy (11) under the above conditions.

###### Theorem 7.

Assume the conditions of Theorem 6 with the omission of (11) when . Let denote the convex support of the exponential family. Assume that the exponential family satisfies the conditions of Brown. Then (11) holds.

### 5.2 Consequences of CGF convergence

Theorems 6 and 7 both verify CGF convergence along likelihood maximizing sequences (7) on neighborhoods of 0. The next theorems show that CGF convergence on neighborhoods of 0 is enough to imply convergence in distribution and of moments of all orders. Therefore moments of distributions with log densities that are affine functions converge along likelihood maximizing sequences (7) to those of a limiting distributions whose log density is a generalized affine function.

Suppose that is a random vector in a finite-dimensional vector space having a moment generating function (MGF) , then

 φX(t)=φ⟨X,t⟩(1),t∈E∗,

regardless of whether the MGF exist or not. It follows that the MGF of for all determine the MGF of and vice versa, when these MGF exist. More generally,

 φ⟨X,t⟩(s)=φX(st),t∈E∗ands∈R. (12)

This observation applied to characteristic functions rather than MGF is called the Cramér-Wold theorem. In that context it is more trivial because characteristic functions always exist.

If , is a basis for a vector space , then there exists a unique dual basis , for that satisfies

 ⟨vi,wj⟩={1,i=j0,i≠j (13)

[13, Theorem 2 of Section 15].

###### Theorem 8.

If is a random vector in having an MGF, then the random scalar has an MGF for all . Conversely, if has an MGF for all , then has an MGF.

###### Theorem 9.

Suppose , , , is a sequence of random vectors, and suppose their moment generating functions converge pointwise on a neighborhood of zero. Then

 Xnd⟶X, (14)

and has an MGF , and

 φXn(t)→φX(t),t∈E∗.
###### Theorem 10.

Under the assumptions of Theorem 9, suppose , , are vectors defined on , the dual space of . Then is uniformly integrable so

 E{k∏i=1⟨Xn,ti⟩}→E{k∏i=1⟨X,ti⟩}.

The combination of Theorems 6-10 provide a methodology for statistical inference along likelihood maximizing sequences when the MLE is in the Barndorff-Nielsen completion. In particular, we have convergence in distribution and convergence of moments of all orders along likelihood maximizing sequence. The limiting distribution in this context is a generalized exponential family with density where is a generalized affine function.

### 5.3 Convergence of null spaces of Fisher information

Our method for finding the MLE in the Barndorff-Nielsen completion relies on finding the null space of the Fisher information matrix. We need to show that we have convergence for that. In order to prove this we need an appropriate notion of convergence of vector subspaces.

###### Definition 4.

Painlevé-Kuratowski set convergence [15, Section 4.A] can be defined as follows (Rockafellar and Wets [15] give many equivalent characterizations). If is a sequence of sets in and is another set in , then we say if

• For every there exists a subsequence of the natural numbers and there exist such that .

• For every sequence in such that there exists a natural number such that whenever , we have .

###### Theorem 11.

Suppose that is a sequence of positive semidefinite matrices and componentwise. Fix less than half of the least nonzero eigenvalue of unless is the zero matrix in which case may be chosen arbitrarily. Let denote the subspace spanned by the eigenvectors of corresponding to eigenvalues that are less than . Let denote the null space of . Then (Painlevé-Kuratowski).

## 6 Calculating the MLE in the completion

### 6.1 Assumptions

So far everything has been for general exponential families except for Theorems 6 and 7, the later of which assumes the conditions of Brown [4], and those conditions hold for GLM and log-linear models for categorical data analysis.

Now, following Geyer [8] we restrict our attention to discrete GLM. This, in effect, includes log-linear models for contingency tables because we can always assume Poisson sampling, which makes them equivalent to GLM [1, Section 8.6.7; 8, Section 3.17].

### 6.2 The form of the MLE in the completion

#### 6.2.1 First characterization

Suppose we know the affine support of the MLE distribution in the completion. This is the smallest affine set that contains the canonical statistic with probability one. Denote the affine support by . An affine set is a translate of a vector subspace. Since the observed value of the canonical statistic is contained in with probability one, and the canonical statistic for a GLM is , where is the model matrix, is the response vector, and its observed value [8, Section 3.9], we have for some vector space .

Then the LCM in which the MLE in the completion is found is the OM conditioned on the event

 MT(Y−y)∈V,almost surely

[12, Theorem 4.3]. Suppose we characterize as the subspace where a finite set of linear equalities are satisfied

 V={w∈Rp:⟨w,ηi⟩=0, i=1,…,j}.

Then the LCM is the OM conditioned on the event

 ⟨MT(Y−y),ηi⟩=⟨Y−y,Mηi⟩=0,i=1,…,j.

From this we see that the vectors , span the null space of the Fisher information matrix for the LCM, which our Theorems 7 and 10 say is well approximated by the Fisher information matrix for the OM at parameter values that are close to maximizing the likelihood.

The vector subspace spanned by the vectors , is called the constancy space of the LCM in [8].

#### 6.2.2 Second characterization

Any vector in the canonical parameter space of an exponential family is called a direction of recession (DOR) if the likelihood function is nondecreasing in that direction or, equivalently, if

 ⟨Y,Mδ⟩≤⟨y,Mδ⟩,almost surely, (15)

where denotes the response vector considered as a random vector, denotes its observed value, and is the model matrix [12, Section 2.2; 8, Theorem 3 and the following discussion and Section 3.9].

If we can find a DOR, then the MLE in the completion is a distribution in the LCM, which is the family of distributions in the original model (OM) conditioned on the event

 ⟨Y,Mδ⟩=⟨y,Mδ⟩,almost surely, (16)

or a distribution in the completion of the LCM [12, Chapter 2; 8, Section 3].

Define . In light of (15) the only way (16) can hold is if implies almost surely.

Thus the distributions in the LCM are the distributions in the OM conditioned on this event. Moreover the MLE in the LCM is easily found using standard software. We simply change the model by removing the components of the response that are fixed in the LCM. Using R function GLM this is done using the optional argument subset = zeta == 0, where zeta is the R object corresponding to the vector .

If the MLE for the LCM exists in the conventional sense, then we have solved the problem of finding the MLE in the completion of the OM. If not we have to solve the problem of finding the MLE in the completion of the LCM we found, and that is done as we did before. And so forth. The iteration must terminate because each LCM has smaller dimension than the one before. Geyer [12, Chapter 2] gives details.

#### 6.2.3 Third characterization

Geyer [8, Section 3] shows that the recursion in the preceding section can be avoided by use of a generic direction of recession (GDOR), which is a DOR in the relative interior of the set of all DOR.

### 6.3 Calculating limiting conditional models

#### 6.3.1 Based on the first characterization

We do not need a DOR because we only use that to determine the affine support of the LCM, and we can estimate that by other methods. Suppose , and other notation are as in Section 6.2.1 above. The LCM is the OM conditioned on the event

 ⟨Y,Mηi⟩=⟨y,Mηi⟩,almost % surely for i∈1,…,j. (17)

We have no readily available way to fit a GLM subject to (17).

We know, however, (Section 6.2.2 above) that the event (17) fixes some components of the response vector at their observed values and leaves the rest entirely unconstrained. Those components, that are entirely unconstrained are those for which the corresponding component of is zero (or, taking account of the inexactness of computer arithmetic, nearly zero) for all , .

#### 6.3.2 Based on the second characterizations

We can find a DOR by minimizing the function defined by

 f(a)=maxi∈1,…,mj∑k=1ak⟨vi,ηk⟩, (18)

where , are vectors that generate the tangent cone for the GLM, which are defined in Geyer [8, Sections 3.10 and 3.11] and for which R code to calculating them is given in the technical reports accompanying [8], and where , are as in Section 6.2.3 above.

We minimize over all unit vectors , to avoid unbounded domain (if we minimized over all vectors, the optimal value might be , and no solution would exist) and also to avoid the zero vector being a solution.

If , are the components of the solution, the DOR is

 δ=j∑k=1¯akηk,

and the LCM is the OM conditioned on the event that every component of the response vector for which the corresponding component of is nonzero is constrained to be equal to its observed value.

We fit the model using the subset argument of R function glm as explained in Section 6.2.2 above.

We can use ideas from Section 6.2.1 to tell us whether we need to iterate. We already know the dimension of the constancy space of the MLE in the completion. If the LCM determined by this GDOR has a constancy space of this dimension, then we have the correct LCM and do not need to iterate. R function glm will figure out the dimension of the constancy space (how many coefficients it needs to drop to get an identifiable model) on its own.

#### 6.3.3 Based on the third characterization

As far as we know, there is no way to calculate a GDOR except by using the very time consuming computational geometry calculations explained by [8].

## 7 Examples

### 7.1 Complete separation example

We return to the motivating example of Section 2. Here we see that the Fisher information matrix has only null eigenvectors. Thus the LCM is completely degenerate at the one point set containing only the observed value of the canonical statistic of this exponential family.

We adopt the techniques of Section 3.16.2 of [8] to make inferences about mean-value parameters (success probability considered as a function of the predictor ). This is outlined in Section 2. One-sided confidence intervals are seen in Figure 2. As stated in Section 2, the actual computations follow some later course notes [9].

### 7.2 Example in Section 2.3 of [8]

This example consists of a contingency table with seven dimensions hence cells. These data now have a permanent location [6]

. There is one response variable

that gives the cell counts and seven categorical predictors , ,

that specify the cells of the contingency table. We fit a generalized linear regression model where

is taken to be Poisson distributed. We consider a model with all three-way interactions included but no higher-order terms.

Geyer [8] shows the MLE in this example does not exist in the traditional sense, and then computes a generic direction of recession using the repeated linear programming with R package rcdd (Section 6.3.3). In this example there is only single null eigenvector of the Fisher information matrix, which consequently must be a generic direction of recession. Therefore all of our methods of determining the support of the LCM in Sections 6.3.1, 6.3.2, and 6.3.3 must do the same thing.

Table 1 displays the comparison between the characterizations in Sections 6.3.2 and 6.3.3. The vector is the estimated null eigenvector of the Fisher information matrix using our implementation. The vector is the estimated gdor in [8]. The vector is identical to up to six decimal places (the results in Table 1 are rounded). Therefore, the inferences resulting from these two distinct approaches is identical up to rounding. The only material difference between our implementation and the linear programming in [8] is computational time. Our implementation estimates in 0.017 seconds of computer time, while the functions in the rcdd package estimates in 4.481 seconds of computer time. This is a big difference for a relatively small amount of data.

Inference for the MLE in the LCM are included in the supplementary materials.

### 7.3 Big data example

This example uses the other dataset at [6]. It shows our methods are much faster than the linear programming method of [8] for recovering directions of recession (Sections 6.2.2 and 6.2.3). The characterization in Section 6.2.1

is even faster since no direction of recession is computed. This dataset consists of five categorical variables with four levels each and a response variable

that is Poisson distributed. A model with all four-way interaction terms is fit to this data. It may seem that the four way interaction model is too large (1024 data points vs 781 parameters) but tests select this model over simpler models, see Table 2.

We estimate that the dimension of the null space of the estimated Fisher information matrix is 23. In the Section 6.3.2 characterization we minimize over in (18), to find a DOR. The resulting vector is a GDOR since it satisfies conditions (20a) and (20b) of [8]. Fitting the model, estimating the dimension of the null space of estimated Fisher information, finding , and estimating the support of the LCM took less than 2 seconds of computer time. In the Section 6.3.3 characterization, the functions in the rcdd package perform the same tasks in 334701 seconds (roughly 3.8 days) of computer time. The two different methods estimated different GDORs but they estimate the same support for the LCM.

Inferences for the MLE in the LCM are included in the supplementary materials. One-sided 95% confidence intervals for mean-value parameters that correspond to components of the canonical statistic which are on the boundary of their support (MLE equal to 0) are also included in the supplementary materials. We provide a new method for calculating these intervals that has not been previously published, but whose concept is found in Geyer [8] in the penultimate paragraph of Section 3.16.2.

Let denote the index set of the components of the response vector on which we condition the OM to get the LCM, and let and denote these components considered as a random vector and as an observed value, respectively. Let denote the saturated model canonical parameter (usually called “linear predictor” in GLM theory) with being the submodel canonical parameter vector. Then endpoints for a confidence interval for a scalar parameter are

 minγ∈Γlimpr^β+γ(YI=yI)≥αg(^β+γ)andmaxγ∈Γlimpr^β+γ(YI=yI)≥αg(^β+γ) (19)

where is a basis for the null space of Fisher information. At least one of (19) is at the end of the range of this parameter (otherwise we can use conventional two-sided intervals). For Poisson sampling, let denote the mean value parameter (here operates componentwise like the R function of the same name does), then We take the confidence interval problem to be

 maximizeμksubject to−∑i∈Iμi≥log(α) (20)

where is taken to be the function of described in (19). The optimization in (20) can be done for any . Implementation details are included in Sections 10.7.1 and 10.7.2 in the supplementary materials.

One-sided 95% confidence intervals for mean-valued parameters whose MLE is equal to 0 are displayed in Table 3. The full table is included in the supplementary materials. Some of the intervals in Table 3 are relatively wide which represents non-trivial uncertainty about the observed MLE being 0.

## 8 Discussion

The chance of observing a canonical statistic on the boundary of its support increases when the dimension of the model increases. Researchers naturally want to include all possibly relevant covariates in an analysis, and this will often result in the MLE not existing in the conventional sense. Our methods provide a computationally inexpensive solution to this problem.

The theory of generalized affine functions and the geometry of exponential families allows GLM software to provide a MLE when the observed value of the canonical statistic is on the boundary of its support. In such settings, the MLE does not exist in the traditional sense and is said to belong to the Barndorff-Nielsen completion of the exponential family [3, 4, 8, 5] when the supremum of the log likelihood is finite. [3, 4, 5] all provided a MLE when it exists in the Barndorff-Nielsen completion of the family and [8] provided estimates of variability under the conditions of [4]. We do the same here using the theory of generalized affine functions.

The limiting distribution evaluated along the iterates of such an optimization is a generalized exponential family taking the form of a generalized affine function with structure given by Theorem 5. Cumulant generating functions converge along this sequence of iterates (Theorems 6 and 7) as well as estimates of moments of all orders (Theorem 10) for distributions taking estimated parameter values along this sequence of iterates. We can then use the null eigenvectors of estimated Fisher information to find a DOR and the support of a LCM. Parameter estimation in the LCM is conducted in the traditional manner using GLM software. One-sided confidence intervals for mean-value and canonical parameters that are observed to be on the boundary can also be provided.

The costs of computing a DOR and the support of a LCM are minimal compared to the repeated linear programming in the rcdd package, especially when the dimension of the data is large. This is where the desirability of our approach stems from. It is much faster to let optimization software, such as glm in R, simply go uphill on the log likelihood of the exponential family until a convergence tolerance is reached. Our examples show what kind of time saving is possible using our methods on small and large datasets.

## Appendix A Technical appendix

###### Proof of Theorem 6.

First consider the case when , the sequences of vectors and scalars are both of length zero. There are no sets and in this setting and is affine on . From Lemma 1 we have . From Corollary 2, as . We observe that from continuity of the cumulant function. The existence of the MLE in this setting implies that there is a neighborhood about 0 denoted by such that . Pick and observe that . Therefore when .

Now consider the case when . Define for all . In this scenario we have

 κn(t) =c(ψn+t+b1,nη1)−c(ψn+b1,nη1) =c(ψn+t+b1,nηj)−c(ψn+b1,nη1)±b1,nδ1 =[c(ψn+t+b1,nη1)−b1,nδ1]−[c(ψn+b1,nη1)−b1,nδ1].

From [12, Theorem 2.2], we know that

 c(θ∗+t+sη1)−sδ1→c1(θ∗+t),c(θ∗+sη1)−sδ1→c1(θ∗), (21)

as since for all . The left hand side of (21) is a convex function of and the right hand side is a proper convex function. If is nonempty, which holds whenever is nonempty, then the convergence in (21) is uniform on compact subsets of [15, Theorem 7.17]. Also [15, Theorem 7.14], uniform convergence on compact sets is the same as continuous convergence. Using continuous convergence, we have that both

 c(ψn+t+b1,nη1)−b1,nδ1 →c1(θ∗+t), c(ψn+b1,nη1)−b1,nδ1 →c1(θ∗),

where as by Lemma 1. Thus

 κn(t) =c(θn+t)−c(θn)→c1(θ∗+t)−c1(θ∗) =log∫H1e⟨y+t,θ∗⟩−c(θ∗)λ(dy)=log∫H1e⟨y,t⟩+h(y)λ(dy) =log∫e⟨y,t⟩+h(y)λ(dy)=κ(t).

This concludes the proof when .

For the rest of the proof we will assume that where dim() = . Represent the sequence in coordinate form as

 θn=b1,nη1+b2,nη2+⋯+bp,nηp, (22)

with scalars , . For , we know that as from Corollary 2. The existence of the MLE in this setting implies that there is a neighborhood about 0, denoted by , such that . Pick , fix , and construct -boxes about and , denoted by and respectively, such that both . Let be the set of vertices of . For all define

 Mt,ε(y)=maxv∈Vt,ε{⟨v,y⟩},˜Mt,ε(y)=minv∈Vt,ε{⟨v,y⟩}. (23)

From the conclusions of Lemma 1 and Corollary 2, we can pick an integer such that and for all and . For all , we have

 ⟨y,θn+t⟩−j∑i=1bi,nδi=⟨y,ψn+t⟩+j∑i=1bi,n(⟨y,ηi⟩−δi)≤Mt,ε(y) (24)

for all . The integrability of and follows from

 ∫e˜Mt,ε(y)λ(dy)≤∫eMt,ε(y)λ(dy)=∑v∈Vt,ε∫{y:⟨y,v⟩=Mt,ε(y)}e⟨y,v⟩λ(dy) ≤∑v∈Vt,ε∫e⟨y,v⟩λ(dy)<∞.

Therefore,

 ⟨y,ψn+t⟩+j∑i=1bi,n(⟨y,ηi⟩−δi)→{⟨y,θ∗+t⟩,y∈Hj,−∞,y∈F∖Hj.

which implies that

 cF(θn+t)−cF(θn)→cHj(θ∗+t)−cHj(θ∗) (25)

by dominated convergence. To complete the proof, we need to verify that

 c(θn+t)−c(θn)=cF(θn+t)−cF(θn)+c∪j−1i=1Di(θn+t)−c∪j−1i=1Di(θn)→cHj(θ∗+t)−cHj(θ∗). (26)

We know that (26) holds when in (11) because of (25). Now suppose that . We have,

 ⟨y,ψn+t⟩+j∑i=1bi,n(⟨y,ηi⟩−δi)→−∞,y∈∪j−1i=1Di, (27)

and

 exp(c∪j−1i=1Di(θn+t)−c∪j−1i=1Di(θn))=∫∪j−1i=1Die⟨y,θn+t⟩−c∪j−1i=1Di(θn)λ(dy)≤∫∪j−1i=1DieMt,ε(y)−˜M0,ε(y)+⟨y,θn⟩−c∪j−1i=1Di(θn</