# On optimal designs for non-regular models

Classically, Fisher information is the relevant object in defining optimal experimental designs. However, for models that lack certain regularity conditions, the Fisher information does not exist and, hence, there is no notion of design optimality available in the literature. This article seeks to fill the gap by proposing a so-called Hellinger information, which generalizes Fisher information in the sense that the two measures agree in regular problems, but the former also exists for certain types of non-regular problems. We derive a Hellinger information inequality, showing that Hellinger information defines a lower bound on the local minimax risk of estimators. This provides a connection between features of the underlying model (in particular, the design) and the performance of estimators, motivating the use of this new Hellinger information for non-regular optimal design problems. Hellinger optimal designs are derived for several non-regular regression problems, with numerical results empirically demonstrating the improved efficiency of these designs compared to alternatives.

## 1 Introduction

Optimal experimental design is a classical problem with substantial recent developments. For example, Biedermann et al. (2006), Dette et al. (2008), Feller et al. (2017), and Schorning et al. (2017) studied optimal designs for dose-response models; Dette et al. (2016) and Dette et al. (2017) investigated optimal designs for correlated observations; Dror and Steinberg (2006) and Gotwalt et al. (2009) studied robustness issues in optimal designs; López-Fidalgo et al. (2007), Waterhouse et al. (2008), Biedermann et al. (2009), Dette and Titoff (2009), and Dette et al. (2018) studied optimal discrimination designs; Biedermann et al. (2011) studied optimal design for additive partially nonlinear models; Yu (2011), Yang et al. (2013), Sagnol and Harman (2015), and Harman and Benková (2017) investigated algorithms for deriving optimal designs; and Yang and Stufken (2009), Yang (2010), Dette and Melas (2011), Yang and Stufken (2012), and Dette and Schorning (2013) built a new theoretical framework for studying optimal designs. The focus of these developments has been exclusively on regular models that enjoy certain normal features asymptotically, such as generalized linear models. However, certain non-regular models may be appropriate in practical applications (e.g., Chernozhukov and Hong 2004; Hirose and Lai 1997; Cousineau 2009). In particular, Smith (1994) describes a class of non-regular linear regression models,

$$y = x^\top\theta + \varepsilon,$$

where the error $\varepsilon$ is non-negative, which implies a non-regular model for $y$, given $x$, since its distribution has $\theta$-dependent support. Such models are useful if the goal is to study extremes; for example, $x^\top\theta$ might represent the lower bound on remission time when a patient is subjected to treatment settings described by the vector $x$. To date, there is no literature on optimal designs for cases like this, and the goal of this paper is to fill this gap by developing an approach to optimal design in non-regular problems.

Towards formulating a design problem in a non-regular model, the first obstacle is that the Fisher information matrix—the fundamental object in the classical optimal design context—does not exist. To overcome this, we draw inspiration from recent work on the development of non-informative priors in the Bayesian context, thereby backtracking the path taken by Lindley (1956) and Bernardo (1979) from information in an experiment to non-informative priors. In particular, Shemyakin (2014) proposes an alternative to Fisher information and generalizes the non-informative prior construction of Jeffreys. An important feature of the Fisher information is how it describes the local behavior of the Hellinger distance (see Section 2), leading to its connection to estimator quality via the information inequality. Unfortunately, the role that Shemyakin’s information plays in the local approximation of Hellinger distance for multi-parameter models remains unclear; see Remark 2. Since a connection to the quality of estimators is essential to our efforts to define a meaningful notion of optimal designs, we take an alternative approach where the focus is on a local approximation of Hellinger distance.

We start by looking at the local behavior of the squared Hellinger distance between models $P_\theta$ and $P_\vartheta$, for $\vartheta$ near $\theta$. In the regular cases, there is a local quadratic approximation to the squared distance and the Fisher information matrix appears in the approximating quadratic form. In non-regular problems, by definition, the squared Hellinger distance is not locally quadratic, so there is no reason to expect that an “information matrix” can be extracted from this approximation. In fact, not being differentiable in quadratic mean implies that the Hellinger distance is continuous at $\vartheta = \theta$, but not differentiable, so important features of the local approximation will generally depend on both the magnitude and the direction of the departure of $\vartheta$ from $\theta$. From the local Hellinger distance approximation for a given direction, we define a direction-dependent Hellinger information, which is additive like Fisher information for independent data, and establish a corresponding information inequality that suitably lower-bounds the risk function of an arbitrary estimator along that direction. The direction-dependence is removed via profiling, and the result is a locally minimax lower bound on the risk of arbitrary estimators, which is inversely related to our direction-free Hellinger information. Therefore, just like in the familiar Cramér–Rao inequality for regular models, larger Hellinger information means a smaller lower bound and, consequently, better estimation in terms of risk.

The established connection between our Hellinger information for non-regular models and the quality of estimators provides a natural path to approach the optimal design problem. In particular, our Hellinger information depends on the design, so we define the optimal design as one that maximizes the Hellinger information. The intuition, just like in the regular case, is that maximizing the information minimizes the lower bound on the risk, thereby leading to improved estimation. If the model happens to be regular, then our proposed optimal design corresponds to the classical E-optimal design that maximizes the minimum eigenvalue of the Fisher information matrix, so the new approach at least has intuitive appeal. After formally defining the notion of optimal design in this context, we develop some novel theoretical results, in particular a complete class theorem for symmetric designs in the context of non-regular polynomial regression. This theorem, along with some special cases presented in Propositions 4 and 5, suggests the potential for a line of developments parallel to that for regular models.

The remainder of the paper is organized as follows. Section 2 sets our notation and briefly reviews the Fisher information and its properties under regularity conditions. We relax those regularity conditions in Section 3 and develop a notion of Hellinger information for certain non-regular models. The main result of the paper, Theorem 1, establishes a connection between this Hellinger information and the quality of estimators, thus paving the way for a framework of optimal designs for non-regular models in Section 4. Some specific non-regular regression models are considered in Section 5, and we derive some analytical optimality results and some numerical demonstrations of the improved efficiency of the optimal designs over other designs. Some concluding remarks are given in Section 6 and proofs of the two main theorems are presented in Appendix A; the remaining details are given in the Supplementary Material (Lin et al. 2018).

## 2 Review of information in regular models

The proposed model assumes that the $\mathbb{Y}$-valued observations $Y_1,\ldots,Y_n$ are independent, and the marginal distribution of $Y_i$ is $P_{i,\theta}$, where $\theta$ is a fixed and unknown parameter in $\Theta \subseteq \mathbb{R}^d$. For example, $P_{i,\theta}$ might be a distribution that depends on both the parameter $\theta$ and a fixed covariate vector $x_i$. We will further assume that, for each $i$, $P_{i,\theta}$ has a density $p_{i,\theta}$ with respect to a fixed dominating $\sigma$-finite measure $\mu$. When the index $i$ is not important, and there is no risk of confusion, we will drop the index and write simply $p_\theta$ for the density function with respect to $\mu$.

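It is common to assume that the model is regular in the sense that $\theta \mapsto p_\theta(y)$ is smooth for each $y$, and that $\theta$-derivatives of expectations can be evaluated by interchanging differentiation and integration. For example, under conditions (6.6) in Lehmann and Casella (1998), one can define the $d \times d$ Fisher information matrix $I_i(\theta)$, whose $(k,\ell)$ entry is given by

$$E_\theta\Bigl\{\frac{\partial}{\partial\theta_k}\log p_{i,\theta}(Y_i)\cdot\frac{\partial}{\partial\theta_\ell}\log p_{i,\theta}(Y_i)\Bigr\},\qquad k,\ell=1,\ldots,d. \tag{1}$$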

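The Fisher information matrix can be defined in broader generality for families of distributions with a differentiability in quadratic mean property (e.g., Pollard 1997; van der Vaart 1998). That is, assume that there exists a function $\dot\ell_\theta$, typically the gradient of $\log p_\theta$, taking values in $\mathbb{R}^d$, such that

$$\int\bigl(p_{\theta+\varepsilon}^{1/2}-p_\theta^{1/2}-\tfrac12\,\varepsilon^\top\dot\ell_\theta\,p_\theta^{1/2}\bigr)^2\,d\mu=o(\|\varepsilon\|^2),\qquad \varepsilon\to0,$$

where $\|\cdot\|$ denotes the $\ell_2$-norm. Then the Fisher information matrix exists and is given by the formula $I(\theta)=\int\dot\ell_\theta\dot\ell_\theta^\top\,p_\theta\,d\mu$. If we let $H$ denote the Hellinger distance and define $h$ as

$$h(\theta;\vartheta)\equiv H^2(P_\theta,P_\vartheta):=\int\bigl(p_\theta^{1/2}-p_\vartheta^{1/2}\bigr)^2\,d\mu=2-2\int(p_\theta p_\vartheta)^{1/2}\,d\mu,$$

then the above condition amounts to $h$ being locally quadratic:

$$h(\theta;\theta+\varepsilon)=\tfrac14\,\varepsilon^\top I(\theta)\,\varepsilon+o(\|\varepsilon\|^2).$$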

Therefore, a model is regular if the squared Hellinger distance is locally approximately quadratic, with the Fisher information matrix characterizing that quadratic approximation. This is the description of Fisher information that we will attempt to extend to the non-regular case below.

Recall, also, that Fisher information is additive under independence. That is, if $Y_1,\ldots,Y_n$ are independent, with $Y_i\sim P_{i,\theta}$, regular as above for each $i$, then the Fisher information in the sample of size $n$ satisfies

$$I_n(\theta)=\sum_{i=1}^n I_i(\theta),$$

where $I_i(\theta)$ is the Fisher information matrix in (1) based on $Y_i$ alone. This property has a nice interpretation: larger samples have more information.

Under differentiability in quadratic mean, one can prove an information inequality which states that, for any unbiased estimator $T$ of $m(\theta)$ with finite second moment, the variance is lower-bounded and satisfies

$$V_\theta(T)\ge\dot m(\theta)^\top I_n(\theta)^{-1}\,\dot m(\theta),$$

where $\dot m(\theta)$ is the gradient of $m(\theta)$; see Pollard (2005). The information inequality above, and its various extensions, establishes a fundamental connection between the quality of an estimator (in this case, the variance of an unbiased estimator) and the Fisher information matrix. This connection has been essential to the development of optimal design theory and practice since the quality of an estimator can be “optimized” by choosing a design that makes the quadratic form in the lower bound as small as possible, or the Fisher information as large as possible.

Finally, differentiability in quadratic mean implies local asymptotic normality (e.g., van der Vaart 1998, Theorem 7.2) which is almost all one needs to show that maximum likelihood estimators are efficient in the sense that they attain the information inequality lower bound (e.g., van der Vaart 1998, Theorem 7.12). Therefore, in sufficiently regular problems, there is a general procedure for constructing high-quality estimators, and the quality of such estimators is controlled by the Fisher information matrix. The remainder of this paper is concerned with non-regular cases and, unfortunately, these differ from their regular counterparts in several fundamental ways. First, the Fisher information is not well-defined in non-regular cases, so we have no general way of measuring the quality of estimators. Second, one cannot rely on maximum likelihood for constructing good estimators. For example, Le Cam writes (see van der Vaart 2002, p. 674):

> The author is firmly convinced that a recourse to maximum likelihood is justifiable only when one is dealing with families of distributions that are extremely regular. The cases in which maximum likelihood estimates are readily obtainable and have been proved to have good properties are extremely restricted.

Therefore, to achieve our goals, we need a measure of information that is flexible enough to handle non-regular problems and is connected to estimation quality in general, but does not depend on a particular estimator. The Hellinger information, defined in Section 3.1, will meet these criteria and will provide a basis for defining optimal designs in non-regular problems.

## 3 Information in non-regular models

### 3.1 Definition and basic properties

To start, we consider the scalar case with $\Theta\subseteq\mathbb{R}$. Suppose that there exists a constant $\alpha\in(0,2]$ such that, for each $\theta$, the limit $J(\theta)=\lim_{\vartheta\to\theta}h(\theta;\vartheta)/|\theta-\vartheta|^\alpha$ exists, is finite, and non-zero. If such an $\alpha$ exists, then it must be unique; but there are cases where existence fails, e.g., when $\theta$ is not identifiable, so that $h(\theta;\theta+\varepsilon)=0$ for all sufficiently small $\varepsilon$. The case $\alpha=2$ corresponds to differentiable in quadratic mean and, hence, “regular,” while $\alpha\in(0,2)$ corresponds to “non-regular.” Differentiability of $\theta\mapsto p_\theta^{1/2}$ in quadratic mean, or lack thereof, determines a model’s regularity, so the largest value $\alpha$ can take is 2; otherwise, the limit is infinite. From the above limit, there is a local approximation,

$$h(\theta;\vartheta)=J(\theta)\,|\theta-\vartheta|^\alpha+o(|\theta-\vartheta|^\alpha). \tag{2}$$

This resembles the local Hölder condition considered in Ibragimov and Hasminskii (1981, Section I.6). We call $\alpha$ the index of regularity and $J(\theta)$ the Hellinger information. Of course, if $\alpha=2$, then $J(\theta)$ is proportional to $I(\theta)$, the Fisher information. Next are a few quick examples, all with $\alpha=1$.

• If , , then .

• If , , then .

• If , , then .
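As a minimal numerical sketch of the local approximation (2), consider the uniform model on $(0,\theta)$. This model is an assumption of the sketch, chosen because its squared Hellinger distance has the closed form $h(\theta;\theta+\varepsilon)=2-2\{\theta/(\theta+\varepsilon)\}^{1/2}$ for $\varepsilon>0$. The helper names `h_uniform` and `estimate_index` are hypothetical. The code estimates $\alpha$ from the slope of $\log h$ against $\log\varepsilon$ and then reads off $J(\theta)$ as $h/\varepsilon$.

```python
import math

def h_uniform(theta, eps):
    """Squared Hellinger distance between Unif(0, theta) and Unif(0, theta + eps), eps > 0."""
    return 2.0 - 2.0 * math.sqrt(theta / (theta + eps))

def estimate_index(h, theta, eps_small=1e-6, eps_large=1e-4):
    """Slope of log h against log eps: a crude estimate of the index alpha in (2)."""
    rise = math.log(h(theta, eps_large)) - math.log(h(theta, eps_small))
    run = math.log(eps_large) - math.log(eps_small)
    return rise / run

theta = 2.0
alpha_hat = estimate_index(h_uniform, theta)

# With the slope close to 1, the ratio h / eps approximates J(theta).
eps = 1e-6
J_hat = h_uniform(theta, eps) / eps

print("estimated index of regularity:", alpha_hat)
print("estimated Hellinger information:", J_hat, "versus 1/theta =", 1.0 / theta)
```

For this assumed model a first-order expansion gives $h(\theta;\theta+\varepsilon)\approx\varepsilon/\theta$, so the sketch should report an index near 1 and an information value near $1/\theta$.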

A class of non-regular models of particular interest to us here are those in Smith (1994) based on location shifts of distributions supported on the positive half-line. Consider a density $p_0$ on $(0,\infty)$ that satisfies

$$p_0(y)=\beta c\,y^{\beta-1},\qquad\text{as } y\to0, \tag{3}$$

where $\beta>0$ and $c>0$. For example, the gamma and Weibull families, with shape parameter $\beta$ and scale 1, have $c=\{\beta\Gamma(\beta)\}^{-1}$ and $c=1$, respectively. The next result identifies the regularity index and the Hellinger information for this class of location parameter problems, with $p_\theta(y)=p_0(y-\theta)$. It shows that $\alpha$ need not be an integer and the Hellinger information, like Fisher’s, is constant in location models. When $\beta>2$, the model is regular (with $\alpha=2$ and the Fisher information defined as usual), so we focus here on the non-regular case with $\beta<2$.

###### Proposition 1.

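Let $p_0$ satisfy (3) with $\beta<2$. If, for some $\Delta>0$,

$$\int_\Delta^\infty\Bigl(\frac{d}{dy}\log p_0(y)\Bigr)^2p_0(y)\,dy<\infty, \tag{4}$$

then $\alpha=\beta$ and $J(\theta)\equiv c\,\{1+\beta\,r(\beta)\}$, where $c$ is as in (3) and

$$r(\beta)=\int_0^\infty\bigl\{(w+1)^{(\beta-1)/2}-w^{(\beta-1)/2}\bigr\}^2\,dw. \tag{5}$$

###### Proof.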

See Section S2.1 in the Supplementary Material. ∎
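The integral in (5) has no obvious elementary form for general $\beta$, so a numerical sketch may help in seeing how the constants behave. The code below approximates $r(\beta)$ by a midpoint rule and then evaluates $c\{1+\beta r(\beta)\}$. Taking this expression as the Hellinger information is an assumption of the sketch: it is what the extra mass $c\,\varepsilon^\beta$ near the shifted endpoint plus the rescaled integral in (5) suggest. The function names, the substitution, and the grid size are hypothetical choices, and the quadrature is only intended for $1\le\beta\le1.5$.

```python
import math

def r(beta, n=20000):
    """Midpoint-rule approximation of r(beta) in (5), intended for 1 <= beta <= 1.5.

    The substitution w = (t / (1 - t))**2 maps (0, 1) onto (0, infinity); for
    beta in this range the transformed integrand stays bounded at both ends.
    """
    a = (beta - 1.0) / 2.0
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n
        s = t / (1.0 - t)
        w = s * s
        dw_dt = 2.0 * s / (1.0 - t) ** 2
        diff = (w + 1.0) ** a - w ** a
        total += diff * diff * dw_dt
    return total / n

def hellinger_info(beta, c):
    """Assumed form c * (1 + beta * r(beta)) of the Hellinger information."""
    return c * (1.0 + beta * r(beta))

def c_gamma(beta):
    """Constant c in (3) for a unit-scale gamma density, which behaves like y**(beta - 1) / Gamma(beta) near 0."""
    return 1.0 / (beta * math.gamma(beta))

for beta in (1.0, 1.25, 1.5):
    # Columns: beta, r(beta), value with the gamma constant, value with c = 1 (Weibull).
    print(beta, r(beta), hellinger_info(beta, c_gamma(beta)), hellinger_info(beta, 1.0))
```

When $\beta=1$ the integrand in (5) vanishes identically, so $r(1)=0$ and the assumed expression reduces to $c$; for the shifted unit exponential this gives the value 1, which matches the direct calculation $h(\theta;\theta+\varepsilon)=2-2e^{-\varepsilon/2}\approx\varepsilon$.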

Ibragimov and Hasminskii (1981, Theorem VI.1.1) show that, in this case, $h(\theta;\theta+\varepsilon)$ is of order $|\varepsilon|^\beta$ as $\varepsilon\to0$, but they do not identify $J(\theta)$. Similar results have appeared elsewhere in the literature on non-regular models; our condition (4) is basically the same as a condition in Woodroofe (1974), which is basically the same as Assumption 9 in Smith (1985).

Turning to the general, non-regular multi-parameter case, where $\Theta$ is an open subset of $\mathbb{R}^d$, defining Hellinger information requires some additional effort. In particular, non-regularity implies that the familiar local quadratic approximation of $h$ fails, so we should not expect to have an “information matrix” to describe the local behavior in such cases. In fact, $h(\theta;\vartheta)$ depends locally on the direction along which $\vartheta$ approaches $\theta$, so there is no “direction-free” summary of the local structure and, hence, no “information matrix”; see Remark 2. But this lack of a convenient quadratic approximation need not stop us from defining a suitable Hellinger information.

###### Definition 1.

Let $\Theta$ be an open subset of $\mathbb{R}^d$, for $d\ge1$, and let $u$ denote a generic direction, a $d$-vector with $\|u\|=1$. Suppose there exists $\alpha\in(0,2]$ such that, for all $\theta\in\Theta$ and all directions $u$, the following limit exists and is neither 0 nor $\infty$:

$$\lim_{\varepsilon\to0}\frac{h(\theta;\theta+\varepsilon u)}{|\varepsilon|^\alpha}=J(\theta;u). \tag{6}$$

Then, the following local approximation holds:

$$h(\theta;\theta+\varepsilon u)=J(\theta;u)\,|\varepsilon|^\alpha+o(|\varepsilon|^\alpha),\qquad\varepsilon\to0. \tag{7}$$

This defines the index of regularity $\alpha$ and the Hellinger information $J(\theta;u)$ at $\theta$ in the direction of $u$.

Since the approximation (7) is in terms of $|\varepsilon|$, it follows that $J(\theta;u)=J(\theta;-u)$, so $J(\theta;u)$ really only depends on the line defined by $u$. If $d=1$, then there is only one line, i.e., $u=\pm1$; hence, for the scalar case, we can drop the argument $u$ entirely and write $J(\theta)$ as described above. It is also worth pointing out that Definition 1 assumes that a single index $\alpha$ suffices to describe the regularity of a model with a $d$-dimensional parameter. This is appropriate for the kinds of regression models we have in mind here, but can be a limitation in other cases; see Remark 1 below.

As a quick example, let , where and . In this form, and are location and scale parameters, respectively. If is a generic vector on the unit circle, then , where has a form which is slightly too complicated to present here; see Section S1 in the Supplementary Material. This expression agrees with the familiar properties of Fisher information for location–scale models.

Although we do not define an “information matrix” in the non-regular case (see Remark 2), when the model is regular, i.e., when $\alpha=2$, there are still some connections between our Hellinger information and the familiar Fisher information. In particular, $J(\theta;u)$ is a quadratic form involving the Fisher information and the direction $u$. This gives an alternative explanation of how the regular models admit a separation of the dependence on $\theta$ and on the direction of departure from $\theta$.

###### Proposition 2.

For a regular model, with $\alpha=2$, if $I(\theta)$ denotes the Fisher information matrix, then $J(\theta;u)=\tfrac14\,u^\top I(\theta)\,u$.
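A short numerical sketch can make the regular case concrete. It assumes the normal location-scale model $\mathsf{N}(\mu,\sigma^2)$ with $\theta=(\mu,\sigma)$, which is an illustrative choice, and it takes the relation $J(\theta;u)=\tfrac14u^\top I(\theta)u$ from the local quadratic approximation in Section 2. The closed-form Hellinger affinity between two normal distributions and the Fisher information $\mathrm{diag}(\sigma^{-2},2\sigma^{-2})$ are standard facts; the function names are hypothetical.

```python
import math

def h_normal(mu1, s1, mu2, s2):
    """Squared Hellinger distance, in the 2 - 2 * affinity convention, between N(mu1, s1^2) and N(mu2, s2^2)."""
    var_sum = s1 ** 2 + s2 ** 2
    affinity = math.sqrt(2.0 * s1 * s2 / var_sum) * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))
    return 2.0 - 2.0 * affinity

def directional_info(mu, sigma, u, eps=1e-3):
    """Finite-eps version of the limit (6) with alpha = 2."""
    return h_normal(mu, sigma, mu + eps * u[0], sigma + eps * u[1]) / eps ** 2

def quarter_quadratic_form(sigma, u):
    """(1/4) u' I u with I = diag(1 / sigma^2, 2 / sigma^2), the Fisher information in (mu, sigma)."""
    return 0.25 * (u[0] ** 2 + 2.0 * u[1] ** 2) / sigma ** 2

mu, sigma = 0.0, 1.5
angles = (0.0, math.pi / 6, math.pi / 4, math.pi / 2, 2.0)
for angle in angles:
    u = (math.cos(angle), math.sin(angle))
    print(round(angle, 3), directional_info(mu, sigma, u), quarter_quadratic_form(sigma, u))
```

The two printed columns should agree up to a relative error of order `eps`, for every direction on the unit circle.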

Another useful and familiar feature of Fisher information that also holds for Hellinger information is the reparametrization formula (Proposition 3), which comes in handy for regression problems where the natural parameter is expressed as a function of covariates and another parameter.

### 3.2 Hellinger information inequality

We now return to our original setup where $Y_1,\ldots,Y_n$ are independent, but not necessarily identically distributed, with $Y_i\sim P_{i,\theta}$, $i=1,\ldots,n$, and $\theta$ is an unknown parameter taking values in an open subset $\Theta$ of $\mathbb{R}^d$ for some $d\ge1$. Let $P_\theta^n$ denote the joint distribution of $Y^n=(Y_1,\ldots,Y_n)$. Motivated by the regression problems below, we assume that each $P_{i,\theta}$ has the same index of regularity, $\alpha$. Following our intuition from the regular case, define the Hellinger information at $\theta$, in the direction of $u$, based on the sample of size $n$, as

$$J_n(\theta;u)=\sum_{i=1}^nJ_i(\theta;u), \tag{8}$$

where $J_i(\theta;u)$ is the Hellinger information based on $P_{i,\theta}$ as described above. See Remark 3 for more on this additivity property. Theorem 1 below will establish a suitable connection between $J_n(\theta;u)$ and the quality of an estimator, and this will provide the necessary foundation for defining optimal designs for non-regular models.

Suppose the goal is to estimate $\psi(\theta)$, where $\psi:\Theta\to\mathbb{R}^q$, $q\le d$, is sufficiently smooth. Let $T_n$ be an estimator of $\psi(\theta)$, and measure its quality by the risk

$$R_\psi(T_n,\theta)=E_\theta^n\,\|T_n-\psi(\theta)\|^2, \tag{9}$$

the $q$-vector version of mean square error, where expectation, $E_\theta^n$, is with respect to $P_\theta^n$. This covers the case where $q=d$ and $\psi(\theta)=\theta$, so that interest is in the full parameter $\theta$, and the case where $\psi(\theta)$ is a single component of $\theta$ and $q=1$, as well as other intermediate cases. Next is the aforementioned lower bound on the risk in terms of the total Hellinger information.

###### Theorem 1.

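Let $Y^n$ consist of independent observations with $Y_i\sim P_{i,\theta}$, $i=1,\ldots,n$. Let $\alpha$ denote the common index of regularity, and $J_n(\theta;u)$ the corresponding Hellinger information in (8). Let $\psi:\Theta\to\mathbb{R}^q$ be a differentiable function with full-rank derivative matrix $D_\psi(\theta)$, and let $T_n$ be any estimator of $\psi(\theta)$ with risk function defined in (9). If , and

$$\lim_{n\to\infty}\inf_u\bigl[n^{-1}J_n(\theta;u)\bigr]>0, \tag{10}$$

then, for all large $n$,

$$\inf_{T_n}\sup_{\vartheta\in B_n(\theta)}R_\psi(T_n,\vartheta)\gtrsim\Bigl[\inf_u\bigl\{\|D_\psi(\theta)u\|^{-\alpha}J_n(\theta;u)\bigr\}\Bigr]^{-2/\alpha}, \tag{11}$$

where $B_n(\theta)$ is the region whose boundary is determined by the union of over all directions $u$.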

###### Proof.

See Appendix A.1. ∎

Two very brief comments: first, the universal constant hidden in “$\gtrsim$” is known and given in the proof; second, there is nothing special about “3” in the definition of $B_n(\theta)$; any number strictly greater than 2 would suffice.

Some additional comments about the interpretation of Theorem 1 are in order. First, the reason for taking supremum over a small “neighborhood” of $\theta$ is that a lucky choice of $T_n$ may have excellent performance at $\theta$, but poor performance at a nearby $\vartheta$. The theorem basically says that, if one looks at a locally uniform measure of risk, which prevents “cheating” towards or luck at a particular $\theta$, then one cannot have smaller risk than that in the lower bound (11). The classical Cramér–Rao lower bound uses unbiasedness of the estimator to prevent this kind of cheating/luck.

To assess the sharpness of the bound in (11) when regularity conditions do not apply, consider the case where $q=1$, so that $\psi$ is a scalar function. For the rate, if we consider the independent and identically distributed case, so that $J_n(\theta;u)=nJ_1(\theta;u)$, then it follows that the lower bound is of order $n^{-2/\alpha}$, which agrees with the known minimax rate for estimators in non-regular models (Ibragimov and Hasminskii 1981, Sec. I.5). Therefore, the bound cannot be improved in terms of dependence on the sample size. To assess the quality of the lower bound in terms of its dependence on $J_n(\theta)$, if the observations come from $\mathsf{Unif}(0,\theta)$, which has $\alpha=1$ and $J(\theta)=\theta^{-1}$, the maximum likelihood estimator is the sample maximum, and its mean square error is given by

$$\frac{\theta^2n}{(n+1)^2(n+2)}+\Bigl(\frac{\theta n}{n+1}-\theta\Bigr)^2.$$

Asymptotically, this expression is proportional to $\theta^2n^{-2}=(n\theta^{-1})^{-2}$, which agrees with our lower bound. Therefore, up to universal constants, the bound in Theorem 1 is sharp. Whether there exists an estimator that can attain the bound exactly or asymptotically is unclear in general; see Remark 4.
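The comparison just described can be checked with a few lines of arithmetic. The sketch below assumes the uniform model on $(0,\theta)$, for which the displayed mean square error is that of the sample maximum and, by a first-order expansion of the Hellinger distance, $\alpha=1$ and $J(\theta)=\theta^{-1}$. It compares the exact mean square error with the order $J_n(\theta)^{-2/\alpha}$ of the lower bound, ignoring the universal constant; the function names are hypothetical.

```python
def mse_sample_max(theta, n):
    """Exact mean square error of the sample maximum of n draws from Unif(0, theta)."""
    variance = theta ** 2 * n / ((n + 1) ** 2 * (n + 2))
    bias = theta * n / (n + 1) - theta
    return variance + bias ** 2

def bound_order(theta, n, alpha=1.0):
    """Order of the lower bound, J_n ** (-2 / alpha), with J_n = n / theta (assumed uniform model)."""
    J_n = n / theta
    return J_n ** (-2.0 / alpha)

theta = 2.0
for n in (10, 100, 1000, 10000):
    # The ratio simplifies to 2 n^2 / ((n + 1)(n + 2)), which does not depend on theta.
    print(n, mse_sample_max(theta, n) / bound_order(theta, n))
```

The ratio settles at a constant as $n$ grows, so the mean square error and the bound share both the rate $n^{-2}$ and the dependence on $\theta$.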

It is worth stating the special case where $\alpha=2$ as a corollary to Theorem 1. This reveals some connection to the classical Cramér–Rao bound, even though we do not have access to an information matrix, and demonstrates the generality of our result.

###### Corollary 1.

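When $\alpha=2$, if $\psi$ has derivative matrix $D_\psi(\theta)$ of rank $q$, and $I_n(\theta)$ is the positive definite Fisher information matrix, then the lower bound in (11) is proportional to

$$\lambda_{\max}\bigl\{D_\psi(\theta)\,I_n(\theta)^{-1}D_\psi(\theta)^\top\bigr\},$$

where $\lambda_{\max}(A)$ denotes the maximal eigenvalue of a matrix $A$.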

###### Proof.

See Section S2.2 in the Supplementary Material. ∎

For comparison to the classical setting, if we take $\psi(\theta)=\theta$, then the expression in the above display simplifies to

$$\lambda_{\max}\bigl\{I_n(\theta)^{-1}\bigr\}=\lambda_{\min}^{-1}\bigl\{I_n(\theta)\bigr\}. \tag{12}$$

Wanting the information matrix to have a large minimal eigenvalue is a familiar concept in the classical optimal design theory; see Section 4.

This and the previous subsection, along with the remarks in Section 3.3, establish some important properties and insights concerning our proposed Hellinger information. A difficulty that has not yet been addressed is the dependence of $J_n(\theta;u)$ on the arbitrary direction $u$. However, the lower bound in (11) is free of a direction, so it makes sense to formulate a direction-free Hellinger information based on that. For a non-regular model as formulated above, with index of regularity $\alpha$, we set the direction-free Hellinger information at $\theta$, for interest parameter $\psi(\theta)$, as

$$J_n^\psi(\theta)=\inf_u\bigl\{\|D_\psi(\theta)u\|^{-\alpha}J_n(\theta;u)\bigr\}. \tag{13}$$

In the special case where $\psi(\theta)=\theta$, this simplifies to

$$J_n(\theta)=\inf_uJ_n(\theta;u). \tag{14}$$

Moreover, in the regular case with $\alpha=2$, it follows from Corollary 1 and, in particular, (12), that $J_n(\theta)$ above is (proportional to) the smallest eigenvalue of the Fisher information matrix. Therefore, definition (13) seems very reasonable; more details are presented in Section 4.

### 3.3 Technical remarks

###### Remark 1.

Definition 1 does not allow $\alpha$ to depend on $u$, so each component of $\theta$, treated individually, must have the same index of regularity. To see this, consider an exponential distribution with location and rate parameters $\theta_1$ and $\theta_2$, respectively. If $\theta_1$ were fixed and only $\theta_2$ were unknown, then it is a regular problem and the above definition would hold with $\alpha=2$. Similarly, if $\theta_2$ were fixed and only $\theta_1$ were unknown, then the definition holds with $\alpha=1$ according to Proposition 1. However, if both $\theta_1$ and $\theta_2$ are unknown, then the model does not satisfy the conditions of Definition 1. Consider two unit vectors $u=(1,0)^\top$ and $u'=(0,1)^\top$. If $\alpha=1$, then $J(\theta;u)$ is in $(0,\infty)$ but $J(\theta;u')$ is zero; likewise, if $\alpha=2$, then $J(\theta;u')$ is in $(0,\infty)$ but $J(\theta;u)$ is infinite. Therefore, the above definition cannot accommodate situations where the components of $\theta$, treated individually, would have different regularity indices. But the design applications we have in mind in this paper fit naturally within a setting where all components have the same regularity; the more general case will be considered elsewhere.

###### Remark 2.

Our definition of Hellinger information coincides with that in Shemyakin (2014) for one-parameter models, but our perspectives differ when it comes to multi-parameter models. Shemyakin defines a “Hellinger information matrix” for non-regular problems, which seems to contradict our above claim that no such matrix is available, so some more detailed comments are necessary. Shemyakin makes no claim that his information matrix is related to the local behavior of $h$, and we are unable to conclude definitively whether it is or is not. We do know, however, that $\vartheta\mapsto h(\theta;\vartheta)$ is “bowl-shaped” (though not smooth) at each $\theta$, so if such a matrix could describe the local behavior, then it ought to be non-negative definite. However, Shemyakin (2014, p. 931) admits that a general non-negative definiteness result has not been established for his information matrix. Without a non-negative definiteness result for his Hellinger information matrix, lower bounds like those in, e.g., Shemyakin (1991, 1992) may not be informative, and its use in defining optimal designs lacks justification.

###### Remark 3.

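In (8) we defined the Hellinger information in an independent sample of size $n$ as $J_n(\theta;u)$, the sum of the individual Hellinger information measures. This, however, is not a choice made by us; it is a consequence of the proof of Theorem 1. To see this, heuristically, start with the Hellinger distance between joint distributions $P_\theta^n$ and $P_\vartheta^n$, assuming independence. A straightforward calculation reveals

$$\begin{aligned}
H^2(P_\theta^n,P_\vartheta^n)&=2-2\prod_{i=1}^n\int\bigl\{p_{i,\theta}(y_i)\,p_{i,\vartheta}(y_i)\bigr\}^{1/2}\,dy_i\\
&=2-2\exp\Bigl\{\sum_{i=1}^n\log\bigl[1-\tfrac12H^2(P_{i,\theta},P_{i,\vartheta})\bigr]\Bigr\}.
\end{aligned}$$

Since $\log(1-x)\approx-x$ for $x\approx0$, if $\vartheta$ is sufficiently close to $\theta$, then the exponent is approximately $-\tfrac12\sum_{i=1}^nH^2(P_{i,\theta},P_{i,\vartheta})$ and then, by Taylor’s theorem applied to $e^{-x}$ at $x=0$, we conclude that

$$H^2(P_\theta^n,P_\vartheta^n)\approx\sum_{i=1}^nH^2(P_{i,\theta},P_{i,\vartheta}).$$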

Therefore, a local approximation of the left-hand side is roughly equal to a sum of local approximations on the right-hand side, which leads to (8).
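The heuristic in this remark is easy to check numerically. The sketch below takes a handful of small per-observation squared Hellinger distances (arbitrary illustrative values, not taken from any model in the paper), forms the exact joint distance through the product of affinities $1-\tfrac12H^2(P_{i,\theta},P_{i,\vartheta})$, and compares it with the plain sum; the function name `joint_h` is hypothetical.

```python
def joint_h(individual_h):
    """Exact squared Hellinger distance between product measures, from per-observation values."""
    affinity = 1.0
    for h_i in individual_h:
        affinity *= 1.0 - 0.5 * h_i  # affinity of the i-th pair of marginals
    return 2.0 - 2.0 * affinity

individual = [1e-4, 2e-4, 5e-5, 3e-4]  # illustrative values only
joint = joint_h(individual)
total = sum(individual)

print("joint distance:", joint)
print("sum of individual distances:", total)
print("relative gap:", (total - joint) / total)
```

The gap is of second order in the individual distances, which is why the sum in (8) captures the leading local behavior.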

###### Remark 4.

An important unanswered question in the above theory is whether there are any estimators that are efficient in the sense that they attain the lower bound in Theorem 1 in some generality. In the simple example above, we showed that the bound is asymptotically attained, up to universal constants, by the sample maximum. General results about the rate of convergence in non-regular models are consistent with our lower bound, but, to our knowledge, more precise results concerning the asymptotic behavior of estimators in non-regular problems are limited to certain special cases. Our work here provides some motivation for further investigation of these asymptotic properties. Not having an estimator that provably attains the lower bound complicates our attempts to demonstrate the efficiency gains of our proposed optimal designs in Section 4, but a quality estimator is available in the applications we have in mind; see Section 5.3.

## 4 Optimal designs for non-regular models

### 4.1 Definition

The previous section built up a framework of information, based on a local approximation of the squared Hellinger distance, suitable for non-regular problems where Fisher information does not exist. Our motivation for building such a framework was to address the problem of optimal experimental designs in cases where the underlying statistical model is non-regular. This section defines what we mean by an optimal design for non-regular models, and provides some additional details about the Hellinger information that are particularly relevant to the design problem.

We start here with a slightly different setup than in the previous section, but quickly connect it back to the preceding. Let $Y_1,\ldots,Y_n$ be independent observations, where $Y_i$ has density function $\tilde p_{\eta_i}$, for $i=1,\ldots,n$. That is, each $Y_i$ has its own parameter $\eta_i$, which we will assume is real-valued, as is typical in linear and generalized linear models. Then the design problem proceeds by expressing the unit-specific parameter $\eta_i$ as a given function of a common parameter $\theta$ and a vector $x_i$ of unit-specific covariates; here, of course, the covariates are constants that the investigator is able to set in any way they please, but preferably in a way that is “optimal” in some sense. By linking each $\eta_i$ to a common $\theta$, we obtain the setup from previous sections, i.e., $Y_i\sim P_{i,\theta}$, independent, for $i=1,\ldots,n$.

The next result, stated in the context of $n=1$, parallels a familiar one in the regular case for Fisher information. It aids in computing the Hellinger information under a reparametrization like the one described above.

###### Proposition 3.

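Let $\tilde p_\eta$ be a density function depending on a scalar parameter $\eta$, and suppose that the index of regularity is $\alpha$ and the Hellinger information is $\tilde J(\eta)$. Define a new density $p_\theta$, for $\theta\in\Theta\subseteq\mathbb{R}^d$, as $p_\theta=\tilde p_{g(\theta)}$, where $g$ is a smooth function with non-vanishing gradient $\dot g$. Then $p_\theta$ also has index of regularity $\alpha$, and the corresponding Hellinger information at $\theta$, in the direction of $u$, is

$$J(\theta;u)=|\dot g(\theta)^\top u|^\alpha\,\tilde J(g(\theta)).$$

###### Proof.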

See Section S2.3 in the Supplementary Material. ∎

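From the general theory in Section 3, if $Y_1,\ldots,Y_n$ are independent, then under the assumptions in Proposition 3, i.e., $Y_i\sim\tilde p_{g_i(\theta)}$, the Hellinger information at $\theta$, in direction of $u$, is

$$J_n(\theta;u)=\sum_{i=1}^n|\dot g_i(\theta)^\top u|^\alpha\,\tilde J(g_i(\theta)).$$

For the special case where $g_i(\theta)=g(x_i,\theta)$ for covariates $x_i$, it is clear that $J_n(\theta;u)$ depends on $x_1,\ldots,x_n$. For example, if $Y_1,\ldots,Y_n$ are independent, with $Y_i=\sum_{k=0}^p\theta_{k+1}x_i^k+\varepsilon_i$, where the errors $\varepsilon_i$ have a gamma density with shape $\beta$ and unit scale, then it follows from Propositions 1 and 3 that

$$J_n(\theta;u)=\frac{1+\beta\,r(\beta)}{\beta\,\Gamma(\beta)}\sum_{i=1}^n\Bigl|\sum_{k=0}^px_i^k\,u_{k+1}\Bigr|^\beta.$$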

The Hellinger information’s dependence on the covariates is what makes our theory of optimal design possible.
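To make the dependence on the covariates tangible, the following sketch evaluates a directional Hellinger information of the form $\text{const}\times\sum_i|\sum_kx_i^ku_{k+1}|^\beta$ and approximates the infimum over directions by a grid on the half circle, for a straight-line model ($p=1$). The choice $\beta=1$ (so that $r(\beta)=0$ and the gamma constant equals 1), the covariate values, the grid size, and the function names are all assumptions made for illustration.

```python
import math

def J_n(xs, u, beta, const=1.0):
    """const * sum_i | sum_k x_i^k u_{k+1} |^beta for a polynomial mean with coefficients indexed by u."""
    total = 0.0
    for x in xs:
        inner = sum(u_k * x ** k for k, u_k in enumerate(u))
        total += abs(inner) ** beta
    return const * total

def direction_free(xs, beta, const=1.0, grid=3600):
    """Grid approximation of inf_u J_n(u) for p = 1; a half circle suffices since J_n(u) = J_n(-u)."""
    best = math.inf
    for j in range(grid):
        angle = math.pi * j / grid
        u = (math.cos(angle), math.sin(angle))
        best = min(best, J_n(xs, u, beta, const))
    return best

xs = [-1.0, -1.0, 1.0, 1.0]   # illustrative covariate values
beta = 1.0                     # exponential-type errors, so r(beta) = 0 and const = 1

print("intercept direction:", J_n(xs, (1.0, 0.0), beta))
print("slope direction:", J_n(xs, (0.0, 1.0), beta))
print("direction-free value:", direction_free(xs, beta))
```

For these inputs the directional information equals $2|u_1-u_2|+2|u_1+u_2|=4\max(|u_1|,|u_2|)$, so the least favorable directions are the diagonals and the infimum is $2\sqrt2$.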

In what follows, we focus exclusively on the case of $\psi(\theta)=\theta$, and the direction-free definition of Hellinger information in (14), though this is only for simplicity. The same derivations can be carried out with any specific interest parameter in mind.

Following the now-standard approximate design theory put forth by Kiefer (1974), let

denote a discrete probability measure defined on the design space—the space where the covariates

live—with at most distinct atoms, representing the design itself. That is, the atoms of represent the specific design points, and the probabilities correspond to the weights (more details below). Next, with a slight abuse of our previous notation, we write to indicate that the Hellinger information in the direction depends on the design through the specific covariate values. For example, given design , then , where is the Hellinger information in the direction based on one observation taken at location . Following (14), the Hellinger information based on design is defined as

$$J_\xi(\theta) = \inf_{u} J_\xi(\theta; u).$$

Naturally, the optimal design under this setup would be defined as the one that maximizes this measure of information.

###### Definition 2.

Under the non-regular model setup presented above, the optimal design is one which maximizes the Hellinger information, i.e.,

$$\xi^\star = \arg\max_{\xi} J_\xi(\theta).$$
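To make the definition concrete, here is a rough numerical sketch for a linear mean function, where the direction is a unit vector in the plane and the infimum can be approximated by scanning angles. The helper names, the grid size, and the choice of index 1 are assumptions made for illustration; the information is computed only up to a constant that does not depend on the design.

```python
import math


def info_in_direction(points, weights, u, alpha):
    """Design information in direction u, up to a design-free constant."""
    return sum(
        w * abs(sum(c * x ** k for k, c in enumerate(u))) ** alpha
        for x, w in zip(points, weights)
    )


def design_info_linear(points, weights, alpha, n_grid=3600):
    """Approximate the infimum over unit vectors u in R^2 on an angle grid."""
    best = math.inf
    for j in range(n_grid):
        t = math.pi * j / n_grid  # u and -u give the same value
        u = (math.cos(t), math.sin(t))
        best = min(best, info_in_direction(points, weights, u, alpha))
    return best


# Two candidate designs on [-1, 1], compared at index alpha = 1.
two_point = design_info_linear([-1.0, 1.0], [0.5, 0.5], alpha=1.0)
uniform5 = design_info_linear([-1.0, -0.5, 0.0, 0.5, 1.0], [0.2] * 5, alpha=1.0)
```

The design with the larger value is preferred under Definition 2. A grid scan is crude, but it is enough to rank a handful of candidate designs.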

For comparison to the classical design theory, property (12) implies that our optimal design in Definition 2, under a regular model, corresponds to an E-optimal design, one that maximizes the minimum eigenvalue of the Fisher information matrix. For the non-regular case, however, we do not have an information matrix, so it is not clear if other common notions of optimality, such as A- and D-optimality, have any meaning. For example, non-regularity will cause sampling distributions of estimators to be non-ellipsoidal, so we cannot expect the determinant of some information matrix to correspond to the volume of a confidence ellipsoid.

Definition 2 formulates a new class of optimal design problems, deserving further attention. As discussed briefly in Section 1, there is now a substantial literature on theory and computation related to the optimal design problem in regular cases, and we hope that this paper stimulates a parallel line of work with similar developments for non-regular cases. There are some similarities to the regular case; in particular, the Hellinger information is non-negative and additive, like Fisher information. Also, the map $\xi \mapsto J_\xi(\theta)$ is concave for fixed $\theta$, i.e., for any two designs $\xi$ and $\xi'$ and any $w \in [0,1]$,

$$J_{w\xi + (1-w)\xi'}(\theta) \ge w J_\xi(\theta) + (1-w) J_{\xi'}(\theta), \tag{15}$$

which is important for numerical and/or analytical solution of the optimal design problem. The following gives some first results along these lines.
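The concavity property (15) can be checked numerically on small examples. The sketch below assumes a linear mean function, works up to a design-free constant, and uses helper names (`min_info`, `mix`) and example designs that are hypothetical.

```python
import math


def min_info(points, weights, alpha, n_grid=720):
    """Grid approximation of the minimum over unit directions in R^2."""
    best = math.inf
    for j in range(n_grid):
        t = math.pi * j / n_grid
        u1, u2 = math.cos(t), math.sin(t)
        value = sum(w * abs(u1 + u2 * x) ** alpha for x, w in zip(points, weights))
        best = min(best, value)
    return best


def mix(design_a, design_b, w):
    """Mixture w*a + (1-w)*b as (points, weights); repeated atoms are fine."""
    (pa, wa), (pb, wb) = design_a, design_b
    return pa + pb, [w * v for v in wa] + [(1 - w) * v for v in wb]


xi = ([-1.0, 1.0], [0.5, 0.5])
xi_prime = ([-1.0, 0.0, 1.0], [0.25, 0.5, 0.25])
w, alpha = 0.3, 1.5

lhs = min_info(*mix(xi, xi_prime, w), alpha)
rhs = w * min_info(*xi, alpha) + (1 - w) * min_info(*xi_prime, alpha)
```

Here `lhs` should be at least `rhs`, because the minimum of a sum is never smaller than the sum of the minima.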

### 4.2 A general result for non-regular polynomial models

Motivated by the setup in Smith (1994), we consider a non-regular model of the form

$$y_i = g(x_i, \theta) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{16}$$

where are scalars, is a degree- polynomial, , with , is an unknown parameter, and are independent and identically distributed with density given in (3) and known shape parameter . As is customary (e.g., Koenker and Hallock 2001), we will insist that the design points be centered at the origin, which puts a constraint on the design itself. In particular, we will consider the space of designs given by

$$\Xi = \Big\{ \xi = (w_i, x_i) : \sum_i w_i x_i = 0, \; x_i \in [-A, A] \Big\},$$

i.e., designs on $[-A, A]$ that are “balanced” in the sense that the mean value is 0, where $A$ is fixed and known.

The following result shows that, among balanced designs, the subclass of symmetric designs is complete in the sense that the maximum information over symmetric designs is the same as that over the larger class of balanced designs. This implies that the search for an optimal design can be simplified by restricting it to the smaller class of symmetric designs.

###### Theorem 2.

Let $\Xi_{\text{sym}}$ denote the set of all balanced designs that are also symmetric in the sense that if $x$ is a design point, then the design assigns equal weight to both $x$ and $-x$. Then

$$\max_{\xi \in \Xi_{\text{sym}}} J_\xi(\theta) = \max_{\xi \in \Xi} J_\xi(\theta).$$
###### Proof.

See Appendix A.2. ∎

The next section applies this general result to identify optimal designs in some special cases of the non-regular polynomial regression model above. The two results, Propositions 4 and 5, suggest that there is a de la Garza phenomenon (e.g., de la Garza 1954) in the non-regular case as well, which would be an interesting theoretical topic to pursue in future work.

## 5 Optimal designs for some non-regular regression models

In this section, we apply the general result in Theorem 2 to identify optimal designs in two important special cases of the polynomial model, namely, linear and quadratic. Throughout we assume the model stated in (16), namely, that the regression model has non-negative errors with distribution having density of the form (3), with known shape parameter .

### 5.1 Linear model

Consider the linear version of (16), where . For linear models we have a strong intuition from the regular case as to what the optimal design might be. It turns out that the same result holds in the non-regular case as well.

###### Proposition 4.

The optimal design $\xi^\star$, according to Definition 2, for the non-regular linear regression model is the symmetric two-point design with weight $1/2$ on $\pm A$.

###### Proof.

See Section S2.4 in the Supplementary Material. ∎
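A quick numerical illustration of the proposition, restricted for simplicity to symmetric two-point designs on $\pm x$ with $0 < x \le A$. This restriction, the grid sizes, and the values of `A` and `alpha` are assumptions of the sketch; the proposition itself concerns all balanced designs.

```python
import math


def min_info_two_point(x, alpha, n_grid=3600):
    """Minimum over unit u of 0.5*|u1 - x*u2|**alpha + 0.5*|u1 + x*u2|**alpha."""
    best = math.inf
    for j in range(n_grid):
        t = math.pi * j / n_grid
        u1, u2 = math.cos(t), math.sin(t)
        value = 0.5 * abs(u1 - x * u2) ** alpha + 0.5 * abs(u1 + x * u2) ** alpha
        best = min(best, value)
    return best


A, alpha = 1.0, 1.5
candidates = [A * j / 20 for j in range(1, 21)]
values = [min_info_two_point(x, alpha) for x in candidates]
best_x = candidates[values.index(max(values))]
```

In this scan the information grows as the two points move outward, so the best candidate sits at the boundary of the design space.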

### 5.2 Quadratic model

Consider a quadratic case where . Here we restrict our attention to the case where the errors in the model are exponential, .

###### Proposition 5.

For the quadratic model, with and the balanced design constraint, the optimal design , according to Definition 2, is one with three distinct points with respective weights for some .

###### Proof.

See Section S2.5 in the Supplementary Material. ∎

Although the proof of Proposition 5 holds only for the case, we expect that the result also holds for , and the numerical results in Figure 3 (b) support this conjecture. The practical importance is that it simplifies the search over to a search over the scalar . The weight at point of the optimal design—or the likely optimal design for the case of —depends on the value of and . Based on Proposition 5 and the definition of Hellinger information, the optimal weight can be obtained by solving the optimization problem

$$\pi_A(\alpha) = \arg\max_{\pi \in [0,1]} f(\pi), \tag{17}$$

where is given by

This search for the optimal weight, , along with that over on the surface of the unit sphere, can be handled numerically.
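One way to carry out this search is sketched below. It assumes that the three design points are $-A$, $0$, $A$ with weights $(1-\pi)/2$, $\pi$, $(1-\pi)/2$, and that the objective is the minimum over unit directions of the weighted sum of $|u_1 + u_2 x + u_3 x^2|^\alpha$, up to a constant. Both are inferences from the surrounding text, not formulas quoted from the paper, and the grid resolutions are arbitrary.

```python
import math


def sphere_grid(n_polar=90, n_azimuth=180):
    """Unit vectors covering the half-sphere u3 >= 0 (u and -u are equivalent)."""
    grid = []
    for j in range(n_polar + 1):
        phi = 0.5 * math.pi * j / n_polar
        for k in range(n_azimuth):
            th = 2.0 * math.pi * k / n_azimuth
            grid.append((math.sin(phi) * math.cos(th),
                         math.sin(phi) * math.sin(th),
                         math.cos(phi)))
    return grid


def center_weight_search(A, alpha, n_weights=20):
    """Grid search for the weight at 0 in the assumed design {-A, 0, A}."""
    terms = []
    for u1, u2, u3 in sphere_grid():
        at_zero = abs(u1) ** alpha
        at_ends = 0.5 * (abs(u1 - A * u2 + A * A * u3) ** alpha
                         + abs(u1 + A * u2 + A * A * u3) ** alpha)
        terms.append((at_zero, at_ends))
    best_pi, best_val = 0.0, -1.0
    for j in range(n_weights + 1):
        pi = j / n_weights
        val = min(pi * z + (1.0 - pi) * e for z, e in terms)  # objective at pi
        if val > best_val:
            best_pi, best_val = pi, val
    return best_pi, best_val


pi_hat, f_hat = center_weight_search(A=1.0, alpha=1.0)
```

The per-direction terms are computed once and reused for every candidate weight, which keeps the double search cheap. A finer grid or a local optimizer would be needed for accurate values.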

Figure 1 shows for several values of . In particular, we see that the (likely) optimal designs put more weight on 0 as either or increases. Our optimal designs for non-regular regression models have a similar format to their E-optimal counterparts in the regular case. That is, a regular E-optimal design for quadratic regression over is given by

$$\Big\{ \Big(-A, \frac{1 - w_A}{2}\Big), \; (0, w_A), \; \Big(A, \frac{1 - w_A}{2}\Big) \Big\},$$

and, for in , the corresponding values of are . From Figure 1, we observe that for , matches the weight of the corresponding regular E-optimal design. This is explained by Corollary 1: when , the optimal design under Hellinger information is the E-optimal design.
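For reference, the regular E-optimal center weight can be computed from the information matrix of the three-point design for the regressors $(1, x, x^2)$. The sketch below uses the fact that the odd moments of a symmetric design vanish, so the matrix splits into a scalar and a $2 \times 2$ block. The function names and the grid search over the weight are illustrative choices.

```python
import math


def min_eigenvalue(w, A):
    """Smallest eigenvalue of the information matrix of the design
    {(-A, (1-w)/2), (0, w), (A, (1-w)/2)} with regressors (1, x, x^2)."""
    s = 1.0 - w
    m2, m4 = s * A ** 2, s * A ** 4  # second and fourth design moments
    # Odd moments vanish, so x decouples from (1, x^2):
    # eigenvalues are m2 and those of the block [[1, m2], [m2, m4]].
    block_min = 0.5 * (1.0 + m4 - math.sqrt((1.0 - m4) ** 2 + 4.0 * m2 ** 2))
    return min(m2, block_min)


def e_optimal_center_weight(A, n_grid=1000):
    """Grid search for the center weight maximizing the smallest eigenvalue."""
    return max((j / n_grid for j in range(n_grid + 1)),
               key=lambda w: min_eigenvalue(w, A))


w_A = e_optimal_center_weight(A=1.0)
```

Calling `e_optimal_center_weight` for several values of `A` gives the regular-case weights against which the Hellinger optimal weights can be compared.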

Henceforth, we call the regular E-optimal design counterpart of a non-regular model “regular-optimal.” For the non-regular linear model, based on Proposition 4, the optimal design coincides with the “regular-optimal” design. In the numerical results presented below, we compare optimal designs of non-regular quadratic models to their “regular-optimal” counterparts.

### 5.3 Numerical results

Here we show some numerical results to demonstrate the efficiency gain in using the proposed optimal designs over other reasonable designs. Recall our model is of the form (16) with non-negative errors having density (3), with known shape parameter .

One complication is that there are currently no results that identify an estimator whose risk attains the lower bound in Theorem 1. Consequently, we cannot guarantee that minimizing this lower bound will improve estimation for any given estimator. But we do have a reasonable estimator, described next, and the results below indicate that the design that minimizes the lower bound in Theorem 1 does improve efficiency for this particular estimator.

For the class of non-regular polynomial regression problems in consideration here, Smith (1994) proposed an estimator based on solving a linear programming problem: choosing such that is maximized subject to the condition that for each . This estimator agrees with the maximum likelihood estimator in the case , has a convergence rate, which matches the one given by the lower bound in (11), and can be readily computed using the quantreg package in R (Koenker 2013). Moreover, as Smith (1994, p. 174) argues, it is generally superior to maximum likelihood in non-regular cases. For these reasons, comparisons of designs based on this estimator ought to be informative.

Figure 2 presents simulation results on the quality of estimation for the Hellinger optimal design versus 5-, 10-, and 15-point uniform designs for the non-regular linear models, while Figure 3 presents simulation results comparing the Hellinger optimal design with the 5-point uniform design and the regular-optimal design. The study proceeds as follows. For each design space and candidate design, the -vector is simulated from the corresponding model, with the specified value of and , and then Smith’s estimator is computed. This process is repeated 1000 times, and the Monte Carlo estimate of the risk is computed as usual. This risk is the sum of the mean square errors of the components of the parameter vector.
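A scaled-down version of such a study for the linear model might look as follows. This is a sketch under several assumptions. The errors are drawn from a Weibull distribution with shape `alpha` as a stand-in for density (3), which is not reproduced here. The linear program is solved by enumerating pairs of binding constraints rather than with quantreg. The sample size, parameter values, and number of replications are arbitrary.

```python
import random


def smith_fit_linear(xs, ys, tol=1e-9):
    """Maximize sum_i (t0 + t1*x_i) subject to t0 + t1*x_i <= y_i for all i.

    Assumes at least two distinct x values with the mean of xs strictly
    between them, so the linear program is bounded.
    """
    lowest = {}  # only the smallest response at each distinct x can bind
    for x, y in zip(xs, ys):
        if x not in lowest or y < lowest[x]:
            lowest[x] = y
    pts = sorted(lowest.items())
    n, sx = len(xs), sum(xs)
    best = None
    for a in range(len(pts)):
        for b in range(a + 1, len(pts)):
            (xa, ya), (xb, yb) = pts[a], pts[b]
            t1 = (yb - ya) / (xb - xa)
            t0 = ya - t1 * xa
            if all(t0 + t1 * x <= y + tol for x, y in pts):
                obj = n * t0 + t1 * sx
                if best is None or obj > best[0]:
                    best = (obj, t0, t1)
    return best[1], best[2]


def mc_risk(design_x, theta, alpha, reps, rng):
    """Monte Carlo estimate of the summed mean square error of (t0, t1)."""
    t0, t1 = theta
    total = 0.0
    for _ in range(reps):
        ys = [t0 + t1 * x + rng.weibullvariate(1.0, alpha) for x in design_x]
        e0, e1 = smith_fit_linear(design_x, ys)
        total += (e0 - t0) ** 2 + (e1 - t1) ** 2
    return total / reps


rng = random.Random(0)
two_point = [-1.0] * 10 + [1.0] * 10
uniform5 = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4
risk_two_point = mc_risk(two_point, (1.0, 2.0), 1.0, 500, rng)
risk_uniform5 = mc_risk(uniform5, (1.0, 2.0), 1.0, 500, rng)
```

Pair enumeration is only practical for a handful of distinct design points. For larger problems a proper linear programming solver would be the natural choice.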

Figure 2 shows that, under different regularity conditions, the optimal design from Proposition 4 is superior in terms of risk. In particular, it is significantly better in the estimation of the slope, , whereas no design performs significantly better than the others in the estimation of the intercept. The results presented in Figure 3(a) are consistent with Proposition 5 in the case of . In each case, the optimal design performs significantly better than both the 5-point uniform design and the regular-optimal design, despite the similarity of the optimal and regular-optimal designs in terms of weight at point 0. Similarly, Figure 3(b) supports our intuition that Proposition 5 can be extended to cases with .

## 6 Conclusion

This paper aims to establish a framework for optimal design in the context of non-regular models where the Fisher information matrix does not exist. Towards this goal, we defined an alternative measure of information, based on a local approximation of the squared Hellinger distance between models, suitable for non-regular problems. The proposed Hellinger information has some close connection to the Fisher information when both exist and, more generally, the former has many of the familiar properties of the latter. In particular, in Theorem 1 we establish a parallel to the classical Cramér–Rao inequality which connects our proposed Hellinger information measure to the quality of estimators. This naturally leads to a notion of optimal designs in non-regular problems, i.e., the “optimal design” is one that minimizes the lower bound in Theorem 1.

The proposed optimal design framework introduces a new class of optimization problems to solve; what we have considered here is only the tip of the iceberg. However, the tools currently available in the optimal design literature for regular problems are expected to be useful here. For example, in a particular non-regular polynomial regression setting, we establish a theorem to simplify the numerical and/or analytical search for a particular optimal design, and we apply this general result in the linear and quadratic cases. Developing the theory and computational methods to handle more complex non-regular models, as well as identifying estimators that attain the lower bound (11), are interesting topics for future investigation.

Aside from creating a new class of design problems to be investigated, the developments here also shed light on how much our current understanding of design problems depends on the regularity of the models being considered. That is, beyond its value in helping us tackle specific cases in which regularity conditions do not apply, the study of non-regular problems also deepens our understanding of regularity itself and how it affects optimal design. For example, questions about the type of optimality criterion to consider (e.g., A- versus D- versus E-optimal) are apparently only relevant for those regular cases where the Fisher information matrix is exactly or approximately related to the dispersion matrix of an estimator. While this paper provides some important insights about non-regular models and corresponding optimal design problems, there is still much more to be done.

## Acknowledgments

The authors are grateful for the helpful suggestions from the Associate Editor, the referees, and Professor Arkady Shemyakin. The authors also thank Mr. Zhiqiang Ye for pointing out a mistake in the statement and proof of Proposition 1 in a previous version.

## Appendix A Proofs of theorems

### A.1 Proof of Theorem 1

The proof requires a connection between Hellinger distance and risk of an estimator. This first step is based in part on Section I.6 of Ibragimov and Hasminskii (1981), although our setup and conclusions are more general in certain ways. We summarize this in the following lemma, proved in the Supplementary Material.

###### Lemma 1.

For data , consider a model , with -density , indexed by a parameter . Let be the interest parameter, where . For an estimator of , the risk function for the estimator satisfies

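$$R_\psi(T, \theta) + R_\psi(T, \vartheta) \ge \min\Big\{ \frac{1 - h(\theta; \vartheta)}{4 h(\theta; \vartheta)}, \frac{1}{16} \Big\} \, \|\psi(\theta) - \psi(\vartheta)\|^2.$$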

For the proof of Theorem 1, start with the squared Hellinger distance between joint distributions and , given by

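$$h_n(\theta; \vartheta) := H^2(P_\theta^n, P_\vartheta^n) = 2\Big[ 1 - \prod_{i=1}^{n} \Big\{ 1 - \frac{h_i(\theta; \vartheta)}{2} \Big\} \Big],$$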

where is the squared Hellinger distance between individual components. If and are sufficiently close, in the sense that for each , then, given the following inequalities,

$$1 - x \le -\log x \quad \text{and} \quad -\log(1 - x) \le 2x, \quad x \in [0, 1/2],$$

it follows that

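$$h_n(\theta; \vartheta) \le -2 \sum_{i=1}^{n} \log\Big\{ 1 - \frac{h_i(\theta; \vartheta)}{2} \Big\} \le 2 \sum_{i=1}^{n} h_i(\theta; \vartheta). \tag{18}$$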
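The chain of inequalities in (18) is easy to check numerically for given component distances. The sketch below assumes each $h_i$ lies in $[0, 1]$, so that $h_i/2$ falls in the range where the second elementary inequality applies. The function name and the example values are illustrative.

```python
import math


def hellinger_chain(h):
    """Return (h_n, middle, upper), the three quantities in display (18),
    for component squared Hellinger distances h_i in [0, 1]."""
    h_n = 2.0 * (1.0 - math.prod(1.0 - hi / 2.0 for hi in h))
    middle = -2.0 * sum(math.log(1.0 - hi / 2.0) for hi in h)
    upper = 2.0 * sum(h)
    return h_n, middle, upper


h_n, middle, upper = hellinger_chain([0.1, 0.2, 0.05])
```

The three returned values should be in increasing order, with the gap between them shrinking as the component distances become small.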

According to our assumption about local expansion of the individual ’s, if for a unit vector , then

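$$h_n(\theta; \theta + \varepsilon u) \le 2 J_n(\theta; u) \varepsilon^{\alpha} + o(n \varepsilon^{\alpha}), \quad \varepsilon \to 0.$$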

When we take equal to , then we get

$$h_n(\theta; \theta + \varepsilon_{n,u} u) \le \tfrac{2}{3} + o(1), \quad n \to \infty,$$

where the latter “” conclusion is justified by the assumption (10) about the rate of information accumulation. Therefore, for large enough , with , , it follows from the above lemma that

$$R_\psi(T_n, \theta) + R_\psi(T_n, \vartheta_{n,u}) \ge \tfrac{1}{16} \|\psi(\theta) - \psi(\vartheta_{n,u})\|^2.$$

Since is differentiable, there is a Taylor approximation at :

$$\psi(\theta) - \psi(\vartheta_{n,u}) = D\psi(\theta)(\theta - \vartheta_{n,u}) + o(\|\theta - \vartheta_{n,u}\|),$$

where the latter little-oh means a -vector whose entries are all of that magnitude. Plugging in the definition of gives

$$\psi(\theta) - \psi(\theta + \varepsilon_{n,u} u) = -\varepsilon_{n,u} D\psi(\theta) u + o(\varepsilon_{n,u}), \quad n \to \infty,$$

and, hence,

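$$\|\psi(\theta) - \psi(\theta + \varepsilon_{n,u} u)\|^2 = \varepsilon_{n,u}^2 \|D\psi(\theta) u + o(1)\|^2 \ge \tfrac{1}{2} \varepsilon_{n,u}^2 \|D\psi(\theta) u\|^2.$$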

Plugging in the definition of establishes that

$$R_\psi(T_n, \theta + \varepsilon_{n,u} u) + R_\psi(T_n, \theta) \gtrsim \|D\psi(\theta) u\|^2 \, J_n(\theta; u)^{-2/\alpha}.$$

Also, the constant that has been absorbed in “” is . Finally, the claim (11) follows from the above display and the general fact that, for a function defined on a set , is smaller than .
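The absorbed constant can be traced through the displays above. The following sketch assumes the choice $\varepsilon_{n,u} = \{3 J_n(\theta; u)\}^{-1/\alpha}$, which is the choice consistent with the bound $2/3 + o(1)$ but is an inference made here, not a formula quoted from the text.

```latex
\begin{aligned}
2 J_n(\theta; u)\, \varepsilon_{n,u}^{\alpha}
  &= \tfrac{2}{3}, \\
R_\psi(T_n, \theta) + R_\psi(T_n, \theta + \varepsilon_{n,u} u)
  &\ge \tfrac{1}{16} \cdot \tfrac{1}{2}\, \varepsilon_{n,u}^{2}\, \|D\psi(\theta) u\|^{2} \\
  &= \tfrac{1}{32}\, 3^{-2/\alpha}\, \|D\psi(\theta) u\|^{2}\, J_n(\theta; u)^{-2/\alpha}.
\end{aligned}
```

Under that assumption, the constant would be $3^{-2/\alpha}/32$, valid for $n$ large enough that the Hellinger bound and the Taylor remainder bound both hold.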

### A.2 Proof of Theorem 2

Take any fixed design , and define a function

$$L(u; x) = J_\xi(\theta; u) = \sum_{m=1}^{M} w_m \Big| \sum_{k=0}^{p} x_m^k u_{k+1} \Big|^{\alpha}.$$

The function does not depend on because it is based on the information in a location parameter problem, but it does depend implicitly on the component of the design . From the trivial identity,

$$a x_m^k = a (-1)^k (-x_m)^k, \quad \text{any } a \in \mathbb{R}, \text{ any } m, \text{ and any } k,$$

it follows immediately that , for any unit vector , where , . Since this new vector is also a unit vector, we have

$$\min_{u} L(u; x) = \min_{v} L(v; -x).$$

This implies that the reflected design —the one that replaces the original in with , but keeps the same weights—satisfies . Define the mixture design , which is symmetric by construction, and by concavity (15) satisfies

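$$J_{\xi^\dagger}(\theta) = \min_{u} \Big\{ \tfrac{1}{2} J_\xi(\theta; u) + \tfrac{1}{2} J_{\xi'}(\theta; u) \Big\} \ge \tfrac{1}{2} \min_{u} J_\xi(\theta; u) + \tfrac{1}{2} \min_{u} J_{\xi'}(\theta; u).$$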

We showed above that the two terms in the lower bound are equal and, consequently, . Therefore, for any design there exists a symmetric design with Hellinger information at least as big; hence, symmetric designs form a complete class.
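The construction in the proof can be tried out numerically. The sketch below takes a balanced but asymmetric design for a linear mean function, forms the equal mixture with its reflection, and compares the two information values, computed up to a design-free constant on an angle grid. The example design, the index `alpha`, and the helper names are assumptions.

```python
import math


def min_info(points, weights, alpha, n_grid=720):
    """Grid approximation of the minimum over unit directions in R^2."""
    best = math.inf
    for j in range(n_grid):
        t = math.pi * j / n_grid
        u1, u2 = math.cos(t), math.sin(t)
        value = sum(w * abs(u1 + u2 * x) ** alpha for x, w in zip(points, weights))
        best = min(best, value)
    return best


def symmetrize(points, weights):
    """Equal mixture of a design and its reflection about 0."""
    pts = list(points) + [-x for x in points]
    wts = [0.5 * w for w in weights] * 2
    return pts, wts


points, weights = [-1.0, 0.5], [1.0 / 3.0, 2.0 / 3.0]  # balanced: sum w*x = 0
alpha = 1.5
original = min_info(points, weights, alpha)
symmetric = min_info(*symmetrize(points, weights), alpha)
```

As the proof predicts, `symmetric` should be no smaller than `original`, which is why the search can be restricted to symmetric designs.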

## References

• Bernardo (1979) Bernardo, J.-M. (1979). Reference posterior distributions for Bayesian inference. J. Roy. Statist. Soc. Ser. B, 41(2):113–147. With discussion.
• Biedermann et al. (2009) Biedermann, S., Dette, H., and Hoffmann, P. (2009). Constrained optimal discrimination designs for Fourier regression models. Ann. Inst. Statist. Math., 61(1):143–157.
• Biedermann et al. (2011) Biedermann, S., Dette, H., and Woods, D. C. (2011). Optimal design for additive partially nonlinear models. Biometrika, 98(2):449–458.
• Biedermann et al. (2006) Biedermann, S., Dette, H., and Zhu, W. (2006). Optimal designs for dose-response models with restricted design spaces. J. Amer. Statist. Assoc., 101(474):747–759.
• Chernozhukov and Hong (2004) Chernozhukov, V. and Hong, H. (2004). Likelihood estimation and inference in a class of nonregular econometric models. Econometrica, 72(5):1445–1480.
• Cousineau (2009) Cousineau, D. (2009). Fitting the three-parameter Weibull distribution: Review and evaluation of existing and new methods. IEEE Transactions on Dielectrics and Electrical Insulation, 16(1):281–288.
• de la Garza (1954) de la Garza, A. (1954). Spacing of information in polynomial regression. Ann. Math. Statistics, 25:123–130.
• Dette et al. (2008) Dette, H., Bretz, F., Pepelyshev, A., and Pinheiro, J. (2008). Optimal designs for dose-finding studies. J. Amer. Statist. Assoc., 103(483):1225–1237.
• Dette et al. (2018) Dette, H., Guchenko, R., Melas, V. B., and Wong, W. K. (2018). Optimal discrimination designs for semiparametric models. Biometrika, 105(1):185–197.
• Dette et al. (2017) Dette, H., Konstantinou, M., and Zhigljavsky, A. (2017). A new approach to optimal designs for correlated observations. Ann. Statist., 45(4):1579–1608.
• Dette and Melas (2011) Dette, H. and Melas, V. B. (2011). A note on the de la Garza phenomenon for locally optimal designs. Ann. Statist., 39(2):1266–1281.
• Dette et al. (2016) Dette, H., Pepelyshev, A., and Zhigljavsky, A. (2016). Optimal designs in regression with correlated errors. Ann. Statist., 44(1):113–152.
• Dette and Schorning (2013) Dette, H. and Schorning, K. (2013). Complete classes of designs for nonlinear regression models and principal representations of moment spaces. Ann. Statist., 41(3):1260–1267.
• Dette and Titoff (2009) Dette, H. and Titoff, S. (2009). Optimal discrimination designs. Ann. Statist., 37(4):2056–2082.
• Dror and Steinberg (2006) Dror, H. A. and Steinberg, D. M. (2006). Robust experimental design for multivariate generalized linear models. Technometrics, 48(4):520–529.
• Feller et al. (2017) Feller, C., Schorning, K., Dette, H., Bermann, G., and Bornkamp, B. (2017). Optimal designs for dose response curves with common parameters. Ann. Statist., 45(5):2102–2132.
• Gotwalt et al. (2009) Gotwalt, C. M., Jones, B. A., and Steinberg, D. M. (2009). Fast computation of designs robust to parameter uncertainty for nonlinear settings. Technometrics, 51(1):88–95.
• Harman and Benková (2017) Harman, R. and Benková, E. (2017). Barycentric algorithm for computing D-optimal size- and cost-constrained designs of experiments. Metrika, 80(2):201–225.
• Hirose and Lai (1997) Hirose, H. and Lai, T. L. (1997). Inference from grouped data in three-parameter Weibull models with applications to breakdown-voltage experiments. Technometrics, 39(2):199–210.
• Ibragimov and Hasminskii (1981) Ibragimov, I. A. and Hasminskii, R. Z. (1981). Statistical Estimation, volume 16 of Applications of Mathematics. Springer-Verlag, New York-Berlin. Asymptotic theory, Translated from the Russian by Samuel Kotz.
• Kiefer (1974) Kiefer, J. (1974). General equivalence theory for optimum designs (approximate theory). Ann. Statist., 2:849–879.
• Koenker (2013) Koenker, R. (2013). quantreg: Quantile Regression. R package version 5.05.
• Koenker and Hallock (2001) Koenker, R. and Hallock, K. (2001). Quantile regression: An introduction. Journal of Economic Perspectives, 15(4):43–56.
• Lehmann and Casella (1998) Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation. Springer Texts in Statistics. Springer-Verlag, New York, second edition.
• Lin et al. (2018) Lin, Y., Martin, R., and Yang, M. (2018). Supplement to “on optimal designs for non-regular models”. DOI…
• Lindley (1956) Lindley, D. V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist., 27:986–1005.
• López-Fidalgo et al. (2007) López-Fidalgo, J., Tommasi, C., and Trandafir, P. C. (2007). An optimal experimental design criterion for discriminating between non-normal models. J. R. Stat. Soc. Ser. B Stat. Methodol., 69(2):231–242.
• Pollard (1997) Pollard, D. (1997). Another look at differentiability in quadratic mean. In Festschrift for Lucien Le Cam, pages 305–314. Springer, New York.
• Pollard (2005) Pollard, D. (2005). Asymptopia. Chapter 6 on “Hellinger differentiability,” http://www.stat.yale.edu/~pollard/Courses/607.spring05/handouts/DQM.pdf.
• Sagnol and Harman (2015) Sagnol, G. and Harman, R. (2015). Computing exact D-optimal designs by mixed integer second-order cone programming. Ann. Statist., 43(5):2198–2224.
• Schorning et al. (2017) Schorning, K., Dette, H., Kettelhake, K., Wong, W. K., and Bretz, F. (2017). Optimal designs for active controlled dose-finding trials with efficacy-toxicity outcomes. Biometrika, 104(4):1003–1010.
• Shemyakin (2014) Shemyakin, A. (2014). Hellinger distance and non-informative priors. Bayesian Anal., 9(4):923–938.
• Shemyakin (1991) Shemyakin, A. E. (1991). Multidimensional integral inequalities of Rao-Cramér type for parametric families with singularities. Sibirsk. Mat. Zh., 32(4):204–215, 230.
• Shemyakin (1992) Shemyakin, A. E. (1992). On information inequalities in parametric estimation theory. Teor. Veroyatnost. i Primenen., 37(1):121–123.
• Smith (1985) Smith, R. L. (1985). Maximum likelihood estimation in a class of nonregular cases. Biometrika, 72(1):67–90.
• Smith (1994) Smith, R. L. (1994). Nonregular regression. Biometrika, 81(1):173–183.
• van der Vaart (2002) van der Vaart, A. (2002). The statistical work of Lucien Le Cam. Ann. Statist., 30(3):631–682. Dedicated to the memory of Lucien Le Cam.
• van der Vaart (1998) van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
• Waterhouse et al. (2008) Waterhouse, T. H., Woods, D. C., Eccleston, J. A., and Lewis, S. M. (2008). Design selection criteria for discrimination/estimation for nested models and a binomial response. J. Statist. Plann. Inference, 138(1):132–144.
• Woodroofe (1974) Woodroofe, M. (1974). Maximum likelihood estimation of translation parameter of truncated distribution. II. Ann. Statist., 2:474–488.
• Yang (2010) Yang, M. (2010). On the de la Garza phenomenon. Ann. Statist., 38(4):2499–2524.
• Yang et al. (2013) Yang, M., Biedermann, S., and Tang, E. (2013). On optimal designs for nonlinear models: a general and efficient algorithm. J. Amer. Statist. Assoc., 108(504):1411–1420.
• Yang and Stufken (2009) Yang, M. and Stufken, J. (2009). Support points of locally optimal designs for nonlinear models with two parameters. Ann. Statist., 37(1):518–541.
• Yang and Stufken (2012) Yang, M. and Stufken, J. (2012). Identifying locally optimal designs for nonlinear models: a simple extension with profound consequences. Ann. Statist., 40(3):1665–1681.
• Yu (2011) Yu, Y. (2011). D-optimal designs via a cocktail algorithm. Stat. Comput., 21(4):475–481.

## S  Supplementary material

### S1. A multi-parameter example

As an illustrative example, consider the case where is two-dimensional and is