1 Introduction
Optimal experimental design is a classical problem with substantial recent developments. For example, Biedermann et al. (2006), Dette et al. (2008), Feller et al. (2017), and Schorning et al. (2017) studied optimal designs for dose-response models; Dette et al. (2016) and Dette et al. (2017) investigated optimal designs for correlated observations; Dror and Steinberg (2006) and Gotwalt et al. (2009) studied robustness issues in optimal designs; López-Fidalgo et al. (2007), Waterhouse et al. (2008), Biedermann et al. (2009), Dette and Titoff (2009), and Dette et al. (2018) studied optimal discrimination designs; Biedermann et al. (2011) studied optimal design for additive partially nonlinear models; Yu (2011), Yang et al. (2013), Sagnol and Harman (2015), and Harman and Benková (2017) investigated algorithms for deriving optimal designs; and Yang and Stufken (2009), Yang (2010), Dette and Melas (2011), Yang and Stufken (2012), and Dette and Schorning (2013) built a new theoretical framework for studying optimal designs. The focus of these developments has been exclusively on regular models that enjoy certain normal features asymptotically, such as generalized linear models. However, certain nonregular models may be appropriate in practical applications (e.g., Chernozhukov and Hong 2004; Hirose and Lai 1997; Cousineau 2009). In particular, Smith (1994)
describes a class of nonregular linear regression models in which the error is nonnegative; this implies a nonregular model for the response, given the covariates, since the response distribution has parameter-dependent support. Such models are useful if the goal is to study extremes; for example, the regression function might represent the lower bound on remission time when a patient is subjected to treatment settings described by a covariate vector. To date, there is no literature on optimal designs for cases like this, and the goal of this paper is to fill this gap by developing an approach to optimal design in nonregular problems.

Towards formulating a design problem in a nonregular model, the first obstacle is that the Fisher information matrix—the fundamental object in the classical optimal design context—does not exist. To overcome this, we draw inspiration from recent work on the development of noninformative priors in the Bayesian context, thereby backtracking the path taken by Lindley (1956) and Bernardo (1979) from information in an experiment to noninformative priors. In particular, Shemyakin (2014) proposes an alternative to Fisher information and generalizes the noninformative prior construction of Jeffreys. An important feature of the Fisher information is how it describes the local behavior of the Hellinger distance (see Section 2), leading to its connection to estimator quality via the information inequality. Unfortunately, the role that Shemyakin's information plays in the local approximation of the Hellinger distance for multiparameter models remains unclear; see Remark 2. Since a connection to the quality of estimators is essential to our efforts to define a meaningful notion of optimal design, we take an alternative approach where the focus is on a local approximation of the Hellinger distance.
We start by looking at the local behavior of the squared Hellinger distance between the models indexed by $\theta$ and $\theta'$, for $\theta'$ near $\theta$. In the regular cases, there is a local quadratic approximation to the squared distance, and the Fisher information matrix appears in the approximating quadratic form. In nonregular problems, by definition, the squared Hellinger distance is not locally quadratic, so there is no reason to expect that an "information matrix" can be extracted from this approximation. In fact, the failure of differentiability in quadratic mean implies that the Hellinger distance is continuous at $\theta' = \theta$, but not differentiable there, so important features of the local approximation will generally depend on both the magnitude and the direction of the departure of $\theta'$ from $\theta$. From the local Hellinger distance approximation for a given direction, we define a direction-dependent Hellinger information, which is additive like Fisher information for independent data, and establish a corresponding information inequality that suitably lower-bounds the risk function of an arbitrary estimator along that direction. The direction-dependence is removed via profiling, and the result is a locally minimax lower bound on the risk of arbitrary estimators, which is inversely related to our direction-free Hellinger information. Therefore, just like in the familiar Cramér–Rao inequality for regular models, larger Hellinger information means a smaller lower bound and, consequently, better estimation in terms of risk.
The established connection between our Hellinger information for nonregular models and the quality of estimators provides a natural path to the optimal design problem. In particular, our Hellinger information depends on the design, so we define the optimal design as one that maximizes the Hellinger information. The intuition, just like in the regular case, is that maximizing the information minimizes the lower bound on the risk, thereby leading to improved estimation. If the model happens to be regular, then our proposed optimal design corresponds to the classical E-optimal design that maximizes the minimum eigenvalue of the Fisher information matrix, so the new approach at least has intuitive appeal. After formally defining the notion of optimal design in this context, we develop some novel theoretical results, in particular, a complete class theorem for symmetric designs in the context of nonregular polynomial regression. This theorem, along with the special cases presented in Propositions 4–5, suggests the potential for a line of developments parallel to that for regular models.

The remainder of the paper is organized as follows. Section 2 sets our notation and briefly reviews the Fisher information and its properties under regularity conditions. We relax those regularity conditions in Section 3 and develop a notion of Hellinger information for certain nonregular models. The main result of the paper, Theorem 1, establishes a connection between this Hellinger information and the quality of estimators, thus paving the way for a framework of optimal designs for nonregular models in Section 4. Some specific nonregular regression models are considered in Section 5, where we derive some analytical optimality results and give numerical demonstrations of the improved efficiency of the optimal designs over other designs. Some concluding remarks are given in Section 6, and proofs of the two main theorems are presented in Appendix A; the remaining details are given in the Supplementary Material (Lin et al. 2018).
2 Review of information in regular models
The proposed model assumes that the observations $Y_1, \ldots, Y_n$ are independent, and the marginal distribution of $Y_i$ is $P_{\theta,i}$, where $\theta$ is a fixed and unknown parameter in the parameter space $\Theta$. For example, $P_{\theta,i}$ might be a distribution that depends on both the parameter $\theta$ and a fixed covariate vector $x_i$. We will further assume that, for each $i$, $P_{\theta,i}$ has a density $p_{\theta,i}$ with respect to a fixed dominating $\sigma$-finite measure $\mu$. When the index $i$ is not important, and there is no risk of confusion, we will drop the index and write simply $p_\theta$ for the density function with respect to $\mu$.
It is common to assume that the model is regular in the sense that $\theta \mapsto p_\theta(y)$ is smooth for each $y$, and that derivatives of expectations can be evaluated by interchanging differentiation and integration. For example, under conditions (6.6) in Lehmann and Casella (1998), one can define the Fisher information matrix $I(\theta)$, whose $(j,k)$ entry is given by
$$I_{jk}(\theta) = \int \Bigl\{\frac{\partial \log p_\theta(y)}{\partial \theta_j}\Bigr\} \Bigl\{\frac{\partial \log p_\theta(y)}{\partial \theta_k}\Bigr\}\, p_\theta(y)\, \mu(dy). \qquad (1)$$
The Fisher information matrix can be defined in broader generality for families of distributions with a differentiability in quadratic mean property (e.g., Pollard 1997; van der Vaart 1998). That is, assume that there exists a function $\dot\ell_\theta$, typically the gradient of $\theta \mapsto \log p_\theta$, taking values in $\mathbb{R}^d$, such that
$$\int \Bigl( p_{\theta+u}^{1/2} - p_\theta^{1/2} - \tfrac{1}{2}\, u^\top \dot\ell_\theta\, p_\theta^{1/2} \Bigr)^2 \, d\mu = o(\|u\|^2), \quad u \to 0,$$
where $\|\cdot\|$ denotes the Euclidean norm. Then the Fisher information matrix exists and is given by the formula $I(\theta) = \int \dot\ell_\theta \dot\ell_\theta^\top\, p_\theta\, d\mu$. If we let $h$ denote the Hellinger distance and define $h^2(\theta, \theta') = \int ( p_\theta^{1/2} - p_{\theta'}^{1/2} )^2\, d\mu$, then the above condition amounts to $h^2$ being locally quadratic:
$$h^2(\theta, \theta + u) = \tfrac{1}{4}\, u^\top I(\theta)\, u + o(\|u\|^2), \quad u \to 0.$$
Therefore, a model is regular if the squared Hellinger distance is locally approximately quadratic, with the Fisher information matrix characterizing that quadratic approximation. This is the description of Fisher information that we will attempt to extend to the nonregular case below.
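This description is easy to check numerically. The sketch below (our illustration, not from the paper) uses the $N(\theta, 1)$ model, for which the Hellinger affinity has the closed form $\exp\{-(a-b)^2/8\}$, and verifies that $h^2(\theta, \theta+u)/u^2$ approaches $I(\theta)/4 = 1/4$ under the convention $h^2 = \int (\sqrt{p} - \sqrt{q})^2\, d\mu$.

```python
import numpy as np

def hellinger_sq_normal(a, b):
    # h^2(N(a,1), N(b,1)) under the convention h^2 = int (sqrt p - sqrt q)^2,
    # using the closed-form Hellinger affinity exp(-(a-b)^2/8) for unit-variance normals
    return 2.0 * (1.0 - np.exp(-(a - b) ** 2 / 8.0))

theta = 0.3          # Fisher information I(theta) = 1 for the N(theta, 1) model
for u in [1e-1, 1e-2, 1e-3]:
    ratio = hellinger_sq_normal(theta, theta + u) / u ** 2
    print(u, ratio)  # ratio tends to I(theta)/4 = 0.25 as u -> 0
```

The same check, with the limiting ratio scaled accordingly, works under any of the common normalizations of the Hellinger distance.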
Recall, also, that Fisher information is additive under independence. That is, if $Y_1, \ldots, Y_n$ are independent, with $Y_i \sim P_{\theta,i}$ regular as above for each $i$, then the Fisher information in the sample of size $n$ satisfies
$$I_n(\theta) = \sum_{i=1}^n I_i(\theta),$$
where $I_i(\theta)$ is the Fisher information matrix in (1) based on $Y_i$ alone. This property has a nice interpretation: larger samples carry more information.
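A quick simulation illustrates this additivity (our sketch, with hypothetical variances): for independent $Y_i \sim N(\theta, \sigma_i^2)$, the per-observation informations are $1/\sigma_i^2$, the joint score is the sum of the individual scores, and the variance of the joint score matches the total information.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigmas = 1.0, np.array([1.0, 2.0, 0.5])   # hypothetical heteroscedastic setup
fisher_each = 1.0 / sigmas ** 2                   # I_i(theta) = 1 / sigma_i^2
fisher_total = fisher_each.sum()                  # additivity: I_n = sum_i I_i = 5.25

# Monte Carlo check: the joint log-likelihood score is the sum of the
# individual scores, and its variance equals the total Fisher information
y = rng.normal(theta, sigmas, size=(200_000, 3))
joint_score = ((y - theta) / sigmas ** 2).sum(axis=1)
print(joint_score.var(), fisher_total)            # both close to 5.25
```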
Under differentiability in quadratic mean, one can prove an information inequality which states that, for any unbiased estimator $\hat\eta$ of $\eta(\theta)$ with finite second moment, the variance is lower-bounded as
$$\mathrm{Var}_\theta(\hat\eta) \geq \dot\eta(\theta)^\top I(\theta)^{-1} \dot\eta(\theta),$$
where $\dot\eta(\theta)$ is the gradient of $\eta$ at $\theta$; see Pollard (2005). The information inequality above, and its various extensions, establishes a fundamental connection between the quality of an estimator—in this case, the variance of an unbiased estimator—and the Fisher information matrix. This connection has been essential to the development of optimal design theory and practice, since the quality of an estimator can be "optimized" by choosing a design that makes the quadratic form in the lower bound as small as possible or, equivalently, the Fisher information as large as possible.
Finally, differentiability in quadratic mean implies local asymptotic normality (e.g., van der Vaart 1998, Theorem 7.2), which is almost all one needs to show that maximum likelihood estimators are efficient in the sense that they attain the information inequality lower bound (e.g., van der Vaart 1998, Theorem 7.12). Therefore, in sufficiently regular problems, there is a general procedure for constructing high-quality estimators, and the quality of such estimators is controlled by the Fisher information matrix. The remainder of this paper is concerned with nonregular cases and, unfortunately, these differ from their regular counterparts in several fundamental ways. First, the Fisher information is not well-defined in nonregular cases, so we have no general way of measuring the quality of estimators. Second, one cannot rely on maximum likelihood for constructing good estimators. For example, Le Cam writes (see van der Vaart 2002, p. 674):
The author is firmly convinced that a recourse to maximum likelihood is justifiable only when one is dealing with families of distributions that are extremely regular. The cases in which maximum likelihood estimates are readily obtainable and have been proved to have good properties are extremely restricted.
Therefore, to achieve our goals, we need a measure of information that is flexible enough to handle nonregular problems and is connected to estimation quality in general, but does not depend on a particular estimator. The Hellinger information, defined in Section 3.1, will meet these criteria and will provide a basis for defining optimal designs in nonregular problems.
3 Information in nonregular models
3.1 Definition and basic properties
To start, we consider the scalar case, with $d = 1$. Suppose that there exists a constant $\alpha > 0$ such that, for each $\theta$, the limit
$$H(\theta) = \lim_{u \to 0} \frac{h^2(\theta, \theta + u)}{|u|^\alpha}$$
exists, is finite, and is nonzero. If such an $\alpha$ exists, then it must be unique; but there are cases where existence fails, e.g., when $\theta$ is not identifiable, so that $h^2(\theta, \theta + u) = 0$ for all sufficiently small $u$. The case $\alpha = 2$ corresponds to differentiability in quadratic mean and, hence, "regular" models, while $\alpha < 2$ corresponds to "nonregular" models. Differentiability of the square-root density, or lack thereof, determines a model's regularity, so the largest value $\alpha$ can take is 2; for larger exponents, the limit is infinite. From the above limit, there is a local approximation,
$$h^2(\theta, \theta + u) = H(\theta)\, |u|^\alpha + o(|u|^\alpha), \quad u \to 0. \qquad (2)$$
This resembles the local Hölder condition considered in Ibragimov and Hasminskii (1981, Section I.6). We call $\alpha$ the index of regularity and $H(\theta)$ the Hellinger information. Of course, if $\alpha = 2$, then $H(\theta)$ is proportional to $I(\theta)$, the Fisher information. Next are a few quick examples.

If , , then .

If , , then .

If , , then .
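To make one such example concrete (our illustration), consider ${\sf Unif}(0, \theta)$: the affinity between ${\sf Unif}(0,\theta)$ and ${\sf Unif}(0,\theta+u)$ is $\sqrt{\theta/(\theta+u)}$ in closed form, so the log-log slope of $u \mapsto h^2(\theta, \theta+u)$ recovers $\alpha = 1$, and $h^2/u$ converges to $H(\theta) = 1/\theta$ under the convention $h^2 = \int (\sqrt{p} - \sqrt{q})^2$.

```python
import numpy as np

def hellinger_sq_unif(theta, u):
    # For Unif(0, theta) vs Unif(0, theta + u), u > 0, the affinity is
    # int_0^theta dx / sqrt(theta (theta + u)) = sqrt(theta / (theta + u)),
    # so h^2 = 2 (1 - sqrt(theta / (theta + u)))
    return 2.0 * (1.0 - np.sqrt(theta / (theta + u)))

theta = 2.0
u1, u2 = 1e-3, 1e-5
slope = np.log(hellinger_sq_unif(theta, u1) / hellinger_sq_unif(theta, u2)) / np.log(u1 / u2)
print(slope)                              # index of regularity: close to alpha = 1
print(hellinger_sq_unif(theta, u2) / u2)  # Hellinger information: close to 1/theta = 0.5
```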
A class of nonregular models of particular interest to us here are those in Smith (1994), based on location shifts of distributions supported on the positive half-line. Consider a density $f$ on $(0, \infty)$ that satisfies
$$f(x) = \gamma x^{\delta - 1}\{1 + o(1)\}, \quad x \downarrow 0, \qquad (3)$$
where $\gamma > 0$ and $\delta > 0$. For example, the gamma and Weibull families both satisfy (3), with $\delta$ equal to the respective shape parameter. The next result identifies the regularity index and the Hellinger information for this class of location parameter problems, with $p_\theta(x) = f(x - \theta)$. It shows that $\alpha$ need not be an integer and that the Hellinger information, like Fisher's, is constant in location models. When $\delta > 2$, the model is regular—with $\alpha = 2$ and the Fisher information defined as usual—so we focus here on the nonregular case with $\delta \in (0, 2)$.
Proposition 1.
Let $p_\theta(x) = f(x - \theta)$, where $f$ satisfies (3) with $\delta \in (0, 2)$. Then the index of regularity is $\alpha = \delta$, and the Hellinger information $H(\theta)$ is a positive constant, free of $\theta$.
Proof.
See Section S2.1 in the Supplementary Material. ∎
Ibragimov and Hasminskii (1981, Theorem VI.1.1) show that, in this case, $h^2(\theta, \theta + u) \asymp |u|^\delta$ as $u \to 0$, but they do not identify $H(\theta)$. Similar results have appeared elsewhere in the literature on nonregular models; our condition (4) is basically the same as a condition in Woodroofe (1974), which, in turn, is basically the same as Assumption 9 in Smith (1985).
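This behavior can be checked numerically. The sketch below (ours) takes $f$ to be the Weibull density with shape $1/2$, so $\delta = 1/2$, and estimates the index as the log-log slope of $u \mapsto h^2(u)$ for the location shift; the Hellinger affinity is computed by quadrature after the substitution $x = u + t^2$, which removes the endpoint singularity, and the slope comes out near $\delta = 1/2$, a non-integer index.

```python
import numpy as np
from scipy.integrate import quad

# Weibull density with shape 1/2: f(x) = 0.5 x^(-1/2) exp(-sqrt(x)), which
# satisfies (3) with delta = 1/2. For the shift by u > 0, substituting
# x = u + t^2 (dx = 2t dt) in the affinity int sqrt(f(x) f(x - u)) dx gives
# the bounded integrand below.
def hellinger_sq_shift(u):
    g = lambda t: np.sqrt(t) * (u + t * t) ** (-0.25) * np.exp(-(np.sqrt(u + t * t) + t) / 2.0)
    affinity = quad(g, 0.0, 1.0, limit=200)[0] + quad(g, 1.0, np.inf, limit=200)[0]
    return 2.0 * (1.0 - affinity)   # h^2 = int (sqrt p - sqrt q)^2 = 2 (1 - affinity)

u1, u2 = 1e-3, 1e-5
slope = np.log(hellinger_sq_shift(u1) / hellinger_sq_shift(u2)) / np.log(u1 / u2)
print(slope)   # close to alpha = delta = 0.5
```

As a sanity check on the substitution, setting $u = 0$ reduces the integrand to $e^{-t}$, whose integral is 1, i.e., full affinity.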
Turning to the general nonregular multiparameter case, where $\Theta$ is an open subset of $\mathbb{R}^d$, defining Hellinger information requires some additional effort. In particular, nonregularity implies that the familiar local quadratic approximation of $h^2$ fails, so we should not expect to have an "information matrix" to describe the local behavior in such cases. In fact, $h^2(\theta, \theta')$ depends locally on the direction along which $\theta'$ approaches $\theta$, so there is no "direction-free" summary of the local structure and, hence, no "information matrix"; see Remark 2. But this lack of a convenient quadratic approximation need not stop us from defining a suitable Hellinger information.
Definition 1.
Let $\Theta$ be an open subset of $\mathbb{R}^d$, for $d \geq 1$, and let $u$ denote a generic direction, i.e., a vector with $\|u\| = 1$. Suppose there exists $\alpha > 0$ such that, for all $\theta$ and all directions $u$, the following limit exists and is neither 0 nor $\infty$:
$$H_\theta(u) = \lim_{\varepsilon \to 0} \frac{h^2(\theta, \theta + \varepsilon u)}{|\varepsilon|^\alpha}. \qquad (6)$$
Then the following local approximation holds:
$$h^2(\theta, \theta + \varepsilon u) = H_\theta(u)\, |\varepsilon|^\alpha + o(|\varepsilon|^\alpha), \quad \varepsilon \to 0. \qquad (7)$$
This defines the index of regularity $\alpha$ and the Hellinger information $H_\theta(u)$ at $\theta$ in the direction of $u$.
Since the approximation (7) is in terms of $|\varepsilon|$, it follows that $H_\theta(u) = H_\theta(-u)$, so $H_\theta(u)$ really only depends on the line defined by $u$. If $d = 1$, then there is only one line, i.e., $u = \pm 1$; hence, for the scalar case, we can drop the argument $u$ entirely and write $H(\theta)$ as described above. It is also worth pointing out that Definition 1 assumes that a single index $\alpha$ suffices to describe the regularity of a model with a $d$-dimensional parameter. This is appropriate for the kinds of regression models we have in mind here, but can be a limitation in other cases; see Remark 1 below.
As a quick example, consider a location–scale model with $\theta = (\mu, \sigma)$, where $\mu$ is the location and $\sigma$ the scale parameter. If $u$ is a generic vector on the unit circle, then $H_\theta(u)$ factors into a power of $\sigma$ times a function $q(u)$ of the direction alone, where $q$ has a form which is slightly too complicated to present here; see Section S1 in the Supplementary Material. This expression agrees with the familiar properties of Fisher information for location–scale models.
Although we do not define an "information matrix" in the nonregular case (see Remark 2), when the model is regular, i.e., when $\alpha = 2$, there are still some connections between our Hellinger information and the familiar Fisher information. In particular, $H_\theta(u)$ is a quadratic form in $u$ involving the Fisher information matrix. This gives an alternative explanation of how regular models admit a separation of the dependence on $\theta$ and on the direction of departure from $\theta$.
Proposition 2.
For a regular model, with $\alpha = 2$, if $I(\theta)$ denotes the Fisher information matrix, then $H_\theta(u) = \tfrac{1}{4}\, u^\top I(\theta)\, u$.
Another useful and familiar feature of Fisher information that also holds for Hellinger information is the reparametrization formula (Proposition 3), which comes in handy for regression problems where the natural parameter is expressed as a function of covariates and another parameter.
3.2 Hellinger information inequality
We now return to our original setup, where $Y_1, \ldots, Y_n$ are independent, but not necessarily identically distributed, with $Y_i \sim P_{\theta,i}$, $i = 1, \ldots, n$, and $\theta$ is an unknown parameter taking values in an open subset $\Theta$ of $\mathbb{R}^d$ for some $d \geq 1$. Let $P_\theta^n = \prod_{i=1}^n P_{\theta,i}$ denote the joint distribution of $(Y_1, \ldots, Y_n)$. Motivated by the regression problems below, we assume that each $P_{\theta,i}$ has the same index of regularity, $\alpha$. Following our intuition from the regular case, define the Hellinger information at $\theta$, in the direction of $u$, based on the sample of size $n$, as
$$H_\theta^n(u) = \sum_{i=1}^n H_{\theta,i}(u), \qquad (8)$$
where $H_{\theta,i}(u)$ is the Hellinger information based on $Y_i$ alone, as described above. See Remark 3 for more on this additivity property. Theorem 1 below will establish a suitable connection between $H_\theta^n(u)$ and the quality of an estimator, and this will provide the necessary foundation for defining optimal designs for nonregular models.
Suppose the goal is to estimate $\psi = g(\theta)$, where $g: \Theta \to \mathbb{R}^q$, with $1 \leq q \leq d$, is sufficiently smooth. Let $\hat\psi$ be an estimator of $\psi$, and measure its quality by the risk
$$R_n(\theta, \hat\psi) = E_\theta \|\hat\psi - g(\theta)\|^2, \qquad (9)$$
the vector version of mean square error, where the expectation, $E_\theta$, is with respect to $P_\theta^n$. This covers the case where $q = d$ and $g(\theta) = \theta$, so that interest is in the full parameter $\theta$; the case where interest is in a single component of $\theta$, with $q = 1$; as well as other intermediate cases. Next is the aforementioned lower bound on the risk in terms of the total Hellinger information.
Theorem 1.
Let $Y^n = (Y_1, \ldots, Y_n)$ consist of independent observations with $Y_i \sim P_{\theta,i}$, $i = 1, \ldots, n$. Let $\alpha$ denote the common index of regularity, and $H_\theta^n(u)$ the corresponding Hellinger information in (8). Let $g: \Theta \to \mathbb{R}^q$ be a differentiable function with full-rank derivative matrix $\dot g(\theta)$, and let $\hat\psi$ be any estimator of $\psi = g(\theta)$ with risk function $R_n$ defined in (9). If the condition
(10) 
holds, then, for all large $n$,
(11) 
where $A_n(\theta)$ is the region whose boundary is determined by the union, over all directions $u$, of suitable direction-dependent neighborhoods of $\theta$; see Appendix A.1.
Proof.
See Appendix A.1. ∎
Two very brief comments: first, the universal constant hidden in the "$\gtrsim$" of (11) is known and given in the proof; second, there is nothing special about the "3" in the definition of $A_n(\theta)$; any number strictly greater than 2 would suffice.
Some additional comments about the interpretation of Theorem 1 are in order. First, the reason for taking a supremum over a small "neighborhood" of $\theta$ is that a lucky choice of $\hat\psi$ may have excellent performance at $\theta$ itself, but poor performance at a nearby $\theta'$. The theorem basically says that, if one looks at a locally uniform measure of risk, which prevents "cheating" towards, or luck at, a particular $\theta$, then one cannot have smaller risk than that in the lower bound (11). The classical Cramér–Rao lower bound uses unbiasedness of the estimator to prevent this kind of cheating/luck.
To assess the sharpness of the bound in (11) when regularity conditions do not apply, consider the case where $q = 1$, so that $g$ is a scalar function. For the rate, if we consider the independent and identically distributed case, so that $H_\theta^n(u) = n H_\theta(u)$, then it follows that the lower bound is of order $n^{-2/\alpha}$, which agrees with the known minimax rate for estimators in nonregular models (Ibragimov and Hasminskii 1981, Sec. I.5). Therefore, the bound cannot be improved in terms of its dependence on the sample size. To assess the quality of the lower bound in terms of its dependence on $\theta$, if the observations come from ${\sf Unif}(0, \theta)$, which has $\alpha = 1$ and $H(\theta) \propto \theta^{-1}$, the maximum likelihood estimator is the sample maximum, and its mean square error is given by
$$E_\theta(\hat\theta - \theta)^2 = \frac{2\theta^2}{(n+1)(n+2)}.$$
Asymptotically, this expression is proportional to $(n/\theta)^{-2}$, which agrees, up to constants, with our lower bound of order $\{n H(\theta)\}^{-2/\alpha}$. Therefore, up to universal constants, the bound in Theorem 1 is sharp. Whether there exists an estimator that can attain the bound exactly or asymptotically is unclear in general; see Remark 4.
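The mean square error of the sample maximum here is elementary to verify. The snippet below (our check) computes the exact moments of the maximum $M$ of $n$ draws from ${\sf Unif}(0, \theta)$, confirms the finite-sample expression $2\theta^2/\{(n+1)(n+2)\}$, and shows that $n^2 \times \mathrm{MSE}$ approaches $2\theta^2$.

```python
theta = 2.0

def mse_sample_max(n, theta):
    # M = max of n iid Unif(0, theta) has density n m^(n-1) / theta^n, so
    # E[M] = n theta / (n + 1) and E[M^2] = n theta^2 / (n + 2)
    EM = n * theta / (n + 1)
    EM2 = n * theta ** 2 / (n + 2)
    return theta ** 2 - 2 * theta * EM + EM2   # E[(M - theta)^2]

for n in [10, 100, 1000]:
    exact = 2 * theta ** 2 / ((n + 1) * (n + 2))
    print(n, mse_sample_max(n, theta), exact, n ** 2 * exact)  # last column -> 2 theta^2 = 8
```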
It is worth stating the special case where $\alpha = 2$ as a corollary to Theorem 1. This reveals some connection to the classical Cramér–Rao bound, even though we do not have access to an information matrix in general, and demonstrates the generality of our result.
Corollary 1.
When $\alpha = 2$, if $g$ has derivative matrix $\dot g(\theta)$ of rank $q$, and $I_n(\theta)$ is the positive definite Fisher information matrix, then the lower bound in (11) is proportional to
$$\lambda_{\max}\{\dot g(\theta)\, I_n(\theta)^{-1}\, \dot g(\theta)^\top\},$$
where $\lambda_{\max}(A)$ denotes the maximal eigenvalue of a matrix $A$.
Proof.
See Section S2.2 in the Supplementary Material. ∎
For comparison to the classical setting, if we take $g(\theta) = \theta$, then the expression in the above display simplifies to
$$\lambda_{\max}\{I_n(\theta)^{-1}\} = \frac{1}{\lambda_{\min}\{I_n(\theta)\}}. \qquad (12)$$
Wanting the information matrix to have a large minimal eigenvalue is a familiar concept in classical optimal design theory; see Section 4.
This and the previous subsection, along with the remarks in Section 3.3, establish some important properties and insights concerning our proposed Hellinger information. A difficulty that has not yet been addressed is the dependence of $H_\theta^n(u)$ on the arbitrary direction $u$. However, the lower bound in (11) is free of a direction, so it makes sense to formulate a direction-free Hellinger information based on that. For a nonregular model as formulated above, with index of regularity $\alpha$, we set the direction-free Hellinger information at $\theta$, for interest parameter $\psi = g(\theta)$, as
$$J_n(\theta) = \inf_{\|u\| = 1} \frac{H_\theta^n(u)}{\|\dot g(\theta)^\top u\|^\alpha}. \qquad (13)$$
In the special case where $g(\theta) = \theta$, this simplifies to
$$J_n(\theta) = \inf_{\|u\| = 1} H_\theta^n(u). \qquad (14)$$
Moreover, in the regular case with $\alpha = 2$, it follows from Corollary 1 and, in particular, (12), that $J_n(\theta)$ above is (proportional to) the smallest eigenvalue of the Fisher information matrix. Therefore, definition (13) seems very reasonable; more details are presented in Section 4.
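This reduction to the smallest eigenvalue is easy to see numerically. Below (our sketch, with an arbitrary positive definite matrix standing in for the Fisher information), the direction-free quantity $\min_{\|u\|=1} u^\top I u$ is computed by a grid search over the unit circle and matches the smallest eigenvalue of $I$.

```python
import numpy as np

I_fisher = np.array([[3.0, 1.0],
                     [1.0, 2.0]])   # arbitrary positive definite stand-in for I_n(theta)

# In the regular case, H_theta(u) is proportional to u' I u (Proposition 2);
# profile out the direction by minimizing over the unit circle
t = np.linspace(0.0, np.pi, 20_000)          # u and -u give the same value
U = np.column_stack([np.cos(t), np.sin(t)])
vals = np.einsum("ij,jk,ik->i", U, I_fisher, U)

print(vals.min(), np.linalg.eigvalsh(I_fisher).min())   # both close to (5 - sqrt(5))/2
```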
3.3 Technical remarks
Remark 1.
Definition 1 does not allow $\alpha$ to depend on the direction $u$, so each component of $\theta$, treated individually, must have the same index of regularity. To see this, consider an exponential distribution with location and rate parameters $\mu$ and $\lambda$, respectively. If $\mu$ were fixed and only $\lambda$ were unknown, then it would be a regular problem, and the above definition would hold with $\alpha = 2$. Similarly, if $\lambda$ were fixed and only $\mu$ were unknown, then the definition holds with $\alpha = 1$, according to Proposition 1. However, if both $\mu$ and $\lambda$ are unknown, then the model does not satisfy the conditions of Definition 1. Consider the two coordinate directions, one for $\mu$ and one for $\lambda$. If $\alpha = 1$, then the limit (6) is in $(0, \infty)$ for the $\mu$-direction but is zero for the $\lambda$-direction; likewise, if $\alpha = 2$, then the limit is in $(0, \infty)$ for the $\lambda$-direction but is infinite for the $\mu$-direction. Therefore, the above definition cannot accommodate situations where the components of $\theta$, treated individually, would have different regularity indices. But the design applications we have in mind in this paper fit naturally within a setting where all components have the same regularity; the more general case will be considered elsewhere.

Remark 2.
Our definition of Hellinger information coincides with that in Shemyakin (2014) for one-parameter models, but our perspectives differ when it comes to multiparameter models. Shemyakin defines a "Hellinger information matrix" for nonregular problems, which seems to contradict our above claim that no such matrix is available, so some more detailed comments are necessary. Shemyakin makes no claim that his information matrix is related to the local behavior of $h^2$, and we are unable to conclude definitively whether it is or is not. We do know, however, that $\theta' \mapsto h^2(\theta, \theta')$ is "bowl-shaped" (though not smooth) at each $\theta$, so if such a matrix could describe the local behavior, then it ought to be nonnegative definite. However, Shemyakin (2014, p. 931) admits that a general nonnegative definiteness result has not been established for his information matrix. Without such a result, lower bounds like those in, e.g., Shemyakin (1991, 1992) may not be informative, and the use of his Hellinger information matrix in defining optimal designs lacks justification.
Remark 3.
In (8) we defined the Hellinger information in an independent sample of size $n$ as $H_\theta^n(u) = \sum_{i=1}^n H_{\theta,i}(u)$, the sum of the individual Hellinger information measures. This, however, is not a choice made by us; it is a consequence of the proof of Theorem 1. To see this, heuristically, start with the Hellinger distance between the joint distributions $P_\theta^n$ and $P_{\theta'}^n$, assuming independence. A straightforward calculation reveals
$$1 - \tfrac{1}{2}\, h^2(P_\theta^n, P_{\theta'}^n) = \prod_{i=1}^n \bigl\{1 - \tfrac{1}{2}\, h^2(P_{\theta,i}, P_{\theta',i})\bigr\}.$$
Writing the right-hand side as an exponential of a sum of logarithms, if $\theta'$ is sufficiently close to $\theta$, then each $h^2(P_{\theta,i}, P_{\theta',i})$ is small, the exponent is approximately $-\frac{1}{2}\sum_{i=1}^n h^2(P_{\theta,i}, P_{\theta',i})$, and then, by Taylor's theorem applied to the exponential at 0, we conclude that
$$h^2(P_\theta^n, P_{\theta'}^n) \approx \sum_{i=1}^n h^2(P_{\theta,i}, P_{\theta',i}).$$
Therefore, a local approximation of the left-hand side is roughly equal to a sum of local approximations on the right-hand side, which leads to (8).
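This heuristic is easy to confirm numerically; the sketch below (ours) uses two independent unit-variance normal components, for which the Hellinger affinity is $\exp\{-\Delta^2/8\}$ and affinities multiply across independent components, and checks that the joint squared distance is locally the sum of the marginal ones.

```python
import numpy as np

def affinity_normal(a, b):
    # Hellinger affinity int sqrt(p q) between N(a,1) and N(b,1)
    return np.exp(-(a - b) ** 2 / 8.0)

u = 1e-2
rho1, rho2 = affinity_normal(0.0, u), affinity_normal(1.0, 1.0 + 2 * u)
# with h^2 = int (sqrt p - sqrt q)^2 = 2 (1 - affinity), and affinities
# multiplying across independent components:
h_joint = 2.0 * (1.0 - rho1 * rho2)
h_sum = 2.0 * (1.0 - rho1) + 2.0 * (1.0 - rho2)
print(h_joint, h_sum)   # nearly equal for small u
```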
Remark 4.
An important unanswered question in the above theory is whether there are any estimators that are efficient in the sense that they attain the lower bound in Theorem 1 in some generality. In the simple example above, we showed that the bound is asymptotically attained, up to universal constants, by the sample maximum. General results about the rate of convergence in nonregular models are consistent with our lower bound but, to our knowledge, more precise results concerning the asymptotic behavior of estimators in nonregular problems are limited to certain special cases. Our work here provides some motivation for further investigation of these asymptotic properties. Not having an estimator that provably attains the lower bound complicates our attempts to demonstrate the efficiency gains of our proposed optimal designs in Section 4, but a quality estimator is available in the applications we have in mind; see Section 5.3.
4 Optimal designs for nonregular models
4.1 Definition
The previous section built up a framework of information, based on a local approximation of the squared Hellinger distance, suitable for nonregular problems where Fisher information does not exist. Our motivation for building such a framework was to address the problem of optimal experimental designs in cases where the underlying statistical model is nonregular. This section defines what we mean by an optimal design for nonregular models, and provides some additional details about the Hellinger information that are particularly relevant to the design problem.
We start here with a slightly different setup than in the previous section, but quickly connect it back to the preceding one. Let $Y_1, \ldots, Y_n$ be independent observations, where $Y_i$ has density function $p_{\eta_i}$, for $i = 1, \ldots, n$. That is, each $Y_i$ has its own parameter $\eta_i$, which we will assume is real-valued, as is typical in linear and generalized linear models. Then the design problem proceeds by expressing the unit-specific parameter as a given function of a common parameter $\theta$ and a vector of unit-specific covariates, i.e., $\eta_i = g_i(\theta)$, where $g_i$ depends on the covariate $x_i$; here, of course, the covariates are constants that the investigator is able to set in any way he/she pleases, but preferably in a way that is "optimal" in some sense. By linking each $\eta_i$ to a common $\theta$, we obtain the setup from the previous sections, i.e., $Y_i \sim P_{\theta,i}$, independent, for $i = 1, \ldots, n$.
The next result, stated in the context of a scalar parameter, parallels a familiar one in the regular case for Fisher information. It aids in computing the Hellinger information under a reparametrization like the one described above.
Proposition 3.
Let $p_\eta$ be a density function depending on a scalar parameter $\eta$, and suppose that the index of regularity is $\alpha$ and the Hellinger information is $H(\eta)$. Define a new density $q_\theta$, for $\theta \in \mathbb{R}^d$, as $q_\theta = p_{g(\theta)}$, where $g$ is a smooth function with nonvanishing gradient $\dot g(\theta)$. Then $q_\theta$ also has index of regularity $\alpha$, and the corresponding Hellinger information at $\theta$, in the direction of $u$, is
$$H_\theta(u) = H(g(\theta))\, |\dot g(\theta)^\top u|^\alpha.$$
Proof.
See Section S2.3 in the Supplementary Material. ∎
From the general theory in Section 3, if $Y_1, \ldots, Y_n$ are independent, then, under the assumptions in Proposition 3, i.e., with $\eta_i = g_i(\theta)$, the Hellinger information at $\theta$, in the direction of $u$, is
$$H_\theta^n(u) = \sum_{i=1}^n H(g_i(\theta))\, |\dot g_i(\theta)^\top u|^\alpha.$$
For the special case where $g_i(\theta) = g(\theta, x_i)$ for covariates $x_i$, it is clear that $H_\theta^n(u)$ depends on $x_1, \ldots, x_n$. For example, if $Y_1, \ldots, Y_n$ are independent, with $Y_i = x_i^\top \theta + \varepsilon_i$, where the errors have density $f$ as in (3), then it follows from Propositions 1 and 3 that
$$H_\theta^n(u) = H \sum_{i=1}^n |x_i^\top u|^\alpha,$$
where $H$ is the constant Hellinger information from Proposition 1.
The Hellinger information’s dependence on the covariates is what makes our theory of optimal design possible.
In what follows, we focus exclusively on the case of $g(\theta) = \theta$ and the direction-free definition of Hellinger information in (14), though this is only for simplicity. The same derivations can be carried out with any specific interest parameter in mind.
Following the now-standard approximate design theory put forth by Kiefer (1974), let $\xi$ denote a discrete probability measure defined on the design space—the space where the covariates live—with a finite number of distinct atoms, representing the design itself. That is, the atoms of $\xi$ represent the specific design points, and the probabilities correspond to the weights (more details below). Next, with a slight abuse of our previous notation, we write $H_{\theta,\xi}(u)$ to indicate that the Hellinger information in the direction $u$ depends on the design through the specific covariate values. For example, for a given design $\xi$ with atoms $t_j$ and weights $w_j$, we have $H_{\theta,\xi}(u) = \sum_j w_j H_{\theta,t_j}(u)$, where $H_{\theta,t}(u)$ is the Hellinger information in the direction $u$ based on one observation taken at location $t$. Following (14), the Hellinger information based on design $\xi$ is defined as
$$J(\xi) = \inf_{\|u\| = 1} H_{\theta,\xi}(u).$$
Naturally, the optimal design under this setup would be defined as the one that maximizes this measure of information.
Definition 2.
Under the nonregular model setup presented above, the optimal design is one which maximizes the Hellinger information, i.e.,
$$\xi^\star = \arg\max_{\xi} J(\xi).$$
For comparison to the classical design theory, property (12) implies that our optimal design in Definition 2, under a regular model, corresponds to an E-optimal design, one that maximizes the minimum eigenvalue of the Fisher information matrix. For the nonregular case, however, we do not have an information matrix, so it is not clear if other common notions of optimality, such as A- and D-optimality, have any meaning. For example, nonregularity will cause sampling distributions of estimators to be non-ellipsoidal, so we cannot expect the determinant of some information matrix to correspond to the volume of a confidence ellipsoid.
Definition 2 formulates a new class of optimal design problems deserving further attention. As discussed briefly in Section 1, there is now a substantial literature on theory and computation related to the optimal design problem in regular cases, and we hope that this paper stimulates a parallel line of work with similar developments for nonregular cases. There are some similarities to the regular case; in particular, the Hellinger information is nonnegative and additive like Fisher information. Also, the map $\xi \mapsto J(\xi)$ is concave for fixed $\theta$, i.e., for any two designs $\xi_1$ and $\xi_2$ and any $\lambda \in [0, 1]$,
$$J(\lambda \xi_1 + (1 - \lambda) \xi_2) \geq \lambda J(\xi_1) + (1 - \lambda) J(\xi_2), \qquad (15)$$
which is important for numerical and/or analytical solution of the optimal design problem. The following gives some first results along these lines.
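The concavity in (15) can be verified numerically. The sketch below (ours) assumes the per-point information takes the form $|f(t_j)^\top u|^\alpha$ suggested by Propositions 1 and 3, up to a constant factor, with $f(t) = (1, t)$ and an illustrative index $\alpha = 3/2$; since the information is linear in the design weights and a pointwise minimum of linear functions is concave, the inequality holds for every pair of designs on a common support.

```python
import numpy as np

alpha = 1.5                                  # hypothetical nonregular index of regularity
support = np.linspace(-1.0, 1.0, 5)          # common support points t_j for all designs
F = np.column_stack([np.ones_like(support), support])   # f(t) = (1, t), linear model

t = np.linspace(0.0, np.pi, 2000)            # directions u and -u are equivalent
A = np.abs(np.column_stack([np.cos(t), np.sin(t)]) @ F.T) ** alpha

def J(w):
    # direction-free information of the design with weights w on `support`,
    # assuming per-point information proportional to |f(t_j)' u|^alpha
    return (A @ w).min()

rng = np.random.default_rng(3)
for _ in range(100):
    w1, w2 = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    lam = rng.uniform()
    mix = lam * w1 + (1 - lam) * w2
    assert J(mix) >= lam * J(w1) + (1 - lam) * J(w2) - 1e-12
print("concavity (15) holds for 100 random design pairs")
```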
4.2 A general result for nonregular polynomial models
Motivated by the setup in Smith (1994), we consider a nonregular model of the form
$$Y_i = m_\theta(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (16)$$
where the $x_i$ are scalars, $m_\theta(x) = \theta_0 + \theta_1 x + \cdots + \theta_k x^k$ is a degree-$k$ polynomial, with $\theta = (\theta_0, \theta_1, \ldots, \theta_k)$ an unknown parameter, and the errors $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed with density $f$ given in (3) and known shape parameter $\delta$. As is customary (e.g., Koenker and Hallock 2001), we will insist that the design points be centered at the origin, which puts a constraint on the design itself. In particular, we will consider the space of designs given by
$$\Xi = \Bigl\{\xi \text{ on } [-a, a]: \int t\, \xi(dt) = 0 \Bigr\},$$
i.e., designs on $[-a, a]$ that are "balanced" in the sense that the mean value is 0, where $a$ is fixed and known.
The following result shows that, among balanced designs, the subclass of symmetric designs is complete in the sense that the maximum information over symmetric designs is the same as that over the larger class of balanced designs. This implies that the search for an optimal design can be simplified by restricting it to the smaller class of symmetric designs.
Theorem 2.
Let $\Xi_s$ denote the set of all balanced designs that are also symmetric, in the sense that if $t$ is a design point, then the design assigns equal weight to both $t$ and $-t$. Then
$$\sup_{\xi \in \Xi} J(\xi) = \sup_{\xi \in \Xi_s} J(\xi).$$
Proof.
See Appendix A.2. ∎
The next section applies this general result to identify optimal designs in some special cases of the nonregular polynomial regression model above. The two results, Propositions 4 and 5, suggest that there is a de la Garza phenomenon (e.g., de la Garza 1954) in the nonregular case as well, which would be an interesting theoretical topic to pursue in future work.
5 Optimal designs for some nonregular regression models
In this section, we apply the general result in Theorem 2 to identify optimal designs in two important special cases of the polynomial model, namely, linear and quadratic. Throughout, we assume the model stated in (16), namely, that the regression model has nonnegative errors whose distribution has density of the form (3), with known shape parameter $\delta$.
5.1 Linear model
Consider the linear version of (16), where $k = 1$, i.e., $Y_i = \theta_0 + \theta_1 x_i + \varepsilon_i$. For linear models, we have a strong intuition from the regular case as to what the optimal design might be. It turns out that the same result holds in the nonregular case as well.
Proposition 4.
The optimal design , according to Definition 2, for the nonregular linear regression model is the symmetric two-point design with weight on .
Proof.
See Section S2.4 in the Supplementary Material. ∎
5.2 Quadratic model
Consider a quadratic case where . Here we restrict our attention to the case where the errors in the model are exponential, .
Proposition 5.
For the quadratic model, with and the balanced design constraint, the optimal design , according to Definition 2, is one with three distinct points with respective weights for some .
Proof.
See Section S2.5 in the Supplementary Material. ∎
Although the proof of Proposition 5 holds only for the case, we expect that the result also holds for , and the numerical results in Figure 3(b) support this conjecture. The practical importance is that it simplifies the search over to a search over the scalar . The weight at point of the optimal design—or the likely optimal design for the case of —depends on the value of and . Based on Proposition 5 and the definition of Hellinger information, the optimal weight can be obtained by solving the optimization problem
(17) 
where is given by
This search for the optimal weight, , along with that over on the surface of the unit sphere, can be handled numerically.
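This nested search, an outer optimization over the scalar weight and an inner minimization over directions on the unit sphere, can be handled by brute-force grid search. The sketch below is purely illustrative: `hellinger_info_direction` is a hypothetical placeholder for the directional Hellinger information of the symmetric three-point design, not the paper's actual expression.

```python
import numpy as np

def hellinger_info_direction(w, u, alpha=1.0):
    """Hypothetical stand-in for the directional Hellinger information
    of the symmetric three-point design with weight w at each endpoint
    and 1 - 2*w at the origin; NOT the paper's actual expression."""
    x = np.array([-1.0, 0.0, 1.0])
    wts = np.array([w, 1.0 - 2.0 * w, w])
    F = np.vstack([np.ones(3), x, x**2])   # regression functions at design points
    M = (F * wts) @ F.T                    # moment-type matrix of the design
    return float(u @ M @ u) ** (1.0 + alpha)

def optimal_weight(alpha=1.0, n_w=60, n_dir=200, seed=0):
    """Outer grid search over w in (0, 1/2); inner minimization over
    random unit vectors approximating the surface of the unit sphere."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dir, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    best_w, best_val = None, -np.inf
    for w in np.linspace(1e-3, 0.5 - 1e-3, n_w):
        inner = min(hellinger_info_direction(w, u, alpha) for u in dirs)
        if inner > best_val:
            best_w, best_val = w, inner
    return best_w

w_star = optimal_weight(alpha=1.0)
```

Any derivative-free optimizer could replace the grid; the grid simply makes the nested max-min structure of (17) explicit.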
Figure 1 shows for several values of . In particular, we see that the (likely) optimal designs put more weight on 0 as either or increases. Our optimal designs for nonregular regression models take a form similar to their E-optimal counterparts in the regular case. That is, a regular E-optimal design for quadratic regression over is given by
and, for in , the corresponding values of are . From Figure 1, we observe that for , matches the weight of the corresponding regular E-optimal design, as anticipated by Corollary 1: when , the optimal design under Hellinger information is the E-optimal design.
Henceforth, we call the regular E-optimal design counterpart of a nonregular model “regular-optimal.” For the nonregular linear model, based on Proposition 4, the optimal design coincides with the “regular-optimal” design. In the numerical results presented below, we compare optimal designs of nonregular quadratic models to their “regular-optimal” counterparts.
5.3 Numerical results
Here we show some numerical results to demonstrate the efficiency gain in using the proposed optimal designs over other reasonable designs. Recall our model is of the form (16) with nonnegative errors having density (3), with known shape parameter .
One complication is that currently there are no results identifying an estimator whose risk attains the lower bound in Theorem 1. Consequently, we cannot guarantee that minimizing this lower bound will result in improved estimation for any given estimator. But we do have a reasonable estimator, described next, and the results below indicate that the design minimizing the lower bound in Theorem 1 does indeed yield improved efficiency for this particular estimator.
For the class of nonregular polynomial regression problems in consideration here, Smith (1994)
proposed an estimator based on solving a linear programming problem: choosing
such that is maximized subject to the condition that for each . This estimator agrees with the maximum likelihood estimator in the case , has a convergence rate which matches the one given by the lower bound in (11), and can be readily computed using the quantreg package in R (Koenker 2013). Moreover, as Smith (1994, p. 174) argues, it is generally superior to maximum likelihood in nonregular cases. For these reasons, comparisons of designs based on this estimator ought to be informative.
Figure 2 presents simulation results on the quality of estimation for the Hellinger optimal design versus 5-, 10-, and 15-point uniform designs for the nonregular linear models, while Figure 3 presents simulation results comparing the Hellinger optimal design versus the 5-point uniform design and the regular-optimal design. The study proceeds as follows. For each design space and candidate design, the vector is simulated from the corresponding model, with the specified values of and , and then Smith’s estimator is computed. This process is repeated 1000 times, and the Monte Carlo estimate of the risk is computed as usual. This risk is the sum of the mean squared errors of the components of the parameter vector.
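For concreteness, Smith's linear-programming estimator and the Monte Carlo risk computation just described can be sketched as follows. This is a minimal illustration in Python via scipy, rather than the quantreg package in R mentioned above; the exponential error distribution, the particular designs, and the parameter values here are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.optimize import linprog

def smith_lp_estimator(X, y):
    """Smith (1994): maximize sum_i x_i' beta subject to y_i >= x_i' beta
    for all i; solved as a linear program (linprog minimizes, so negate)."""
    c = -X.sum(axis=0)
    res = linprog(c, A_ub=X, b_ub=y, bounds=[(None, None)] * X.shape[1])
    return res.x

def mc_risk(design_pts, beta_true, n_rep=1000, seed=0):
    """Monte Carlo estimate of the risk: the sum over components of the
    mean squared errors, with nonnegative exponential errors (illustrative)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(design_pts)), design_pts])
    sq_err = np.zeros(len(beta_true))
    for _ in range(n_rep):
        y = X @ beta_true + rng.exponential(scale=1.0, size=len(design_pts))
        sq_err += (smith_lp_estimator(X, y) - beta_true) ** 2
    return float(sq_err.sum() / n_rep)

# illustrative comparison: symmetric two-point design vs. 5-point uniform design
beta = np.array([1.0, 2.0])   # hypothetical true parameter, for illustration only
risk_two_point = mc_risk(np.array([-1.0, 1.0] * 10), beta, n_rep=200)
risk_uniform = mc_risk(np.tile(np.linspace(-1.0, 1.0, 5), 4), beta, n_rep=200)
```

One detail worth noting: linprog's default bounds restrict variables to be nonnegative, so the explicit (None, None) bounds on the coefficients are essential here.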
Figure 2 shows that, under different regularity conditions, the optimal design from Proposition 4 is superior in terms of risk. In particular, it is significantly better in the estimation of the slope, , whereas no design performs significantly better than the others in the estimation of the intercept. The results presented in Figure 3(a) are consistent with Proposition 5 in the case of . In each case, the optimal design performs significantly better than both the 5-point uniform design and the regular-optimal design, despite the similarity of the optimal and regular-optimal designs in terms of the weight at point 0. Similarly, Figure 3(b) supports our intuition that Proposition 5 can be extended to cases with .
6 Conclusion
This paper aims to establish a framework for optimal design in the context of nonregular models where the Fisher information matrix does not exist. Towards this goal, we defined an alternative measure of information, based on a local approximation of the squared Hellinger distance between models, suitable for nonregular problems. The proposed Hellinger information has a close connection to the Fisher information when both exist and, more generally, the former enjoys many of the familiar properties of the latter. In particular, in Theorem 1 we establish a parallel to the classical Cramér–Rao inequality, connecting our proposed Hellinger information measure to the quality of estimators. This naturally leads to a notion of optimal design in nonregular problems, i.e., the “optimal design” is one that minimizes the lower bound in Theorem 1.
The proposed optimal design framework introduces a new class of optimization problems to solve; what we have considered here is only the tip of the iceberg. However, the tools currently available in the optimal design literature for regular problems are expected to be useful here. For example, in a particular nonregular polynomial regression setting, we establish a theorem to simplify the numerical and/or analytical search for a particular optimal design, and we apply this general result in the linear and quadratic cases. Developing the theory and computational methods to handle more complex nonregular models, as well as identifying estimators that attain the lower bound (11), are interesting topics for future investigation.
Aside from creating a new class of design problems to be investigated, the developments here also shed light on how much our current understanding of design problems depends on the regularity of the models being considered. That is, beyond its value in helping us tackle specific cases in which regularity conditions do not apply, the study of nonregular problems also deepens our understanding of regularity itself and how it affects optimal design. For example, questions about the type of optimality criterion to consider (e.g., A versus D versus Eoptimal) are apparently only relevant for those regular cases where the Fisher information matrix is exactly or approximately related to the dispersion matrix of an estimator. While this paper provides some important insights about nonregular models and corresponding optimal design problems, there is still much more to be done.
Acknowledgments
The authors are grateful for the helpful suggestions from the Associate Editor, the referees, and Professor Arkady Shemyakin. The authors also thank Mr. Zhiqiang Ye for pointing out a mistake in the statement and proof of Proposition 1 in a previous version.
Appendix A Proofs of theorems
a.1 Proof of Theorem 1
The proof requires a connection between Hellinger distance and risk of an estimator. This first step is based in part on Section I.6 of Ibragimov and Hasminskii (1981), although our setup and conclusions are more general in certain ways. We summarize this in the following lemma, proved in the Supplementary Material.
Lemma 1.
For data , consider a model , with density , indexed by a parameter . Let be the interest parameter, where . For an estimator of , the risk function for the estimator satisfies
For the proof of Theorem 1, start with the squared Hellinger distance between joint distributions and , given by
where is the squared Hellinger distance between individual components. If and are sufficiently close, in the sense that for each , then, given the following inequalities,
it follows that
(18) 
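For reference, the factorization and elementary bounds presumably being invoked in this step are the standard ones for Hellinger distances of product measures. Writing $H_n$ for the joint distance and $h_i$ for the componentwise distances, they can be stated as:

```latex
% Hellinger affinity factorizes over independent components,
%   \int \sqrt{p_\theta\, p_{\theta'}} = \prod_{i=1}^n \int \sqrt{p_{i,\theta}\, p_{i,\theta'}},
% which gives
H_n^2(\theta,\theta') \;=\; 2 - 2\prod_{i=1}^n \Bigl\{1 - \tfrac12\, h_i^2(\theta,\theta')\Bigr\},
% and, using e^{-2x} \le 1 - x \le e^{-x} for 0 \le x \le \tfrac12,
2\Bigl(1 - e^{-\frac12 \sum_i h_i^2}\Bigr)
  \;\le\; H_n^2(\theta,\theta')
  \;\le\; 2\Bigl(1 - e^{-\sum_i h_i^2}\Bigr).
```

The upper bound requires the componentwise distances to be small, which is where the "sufficiently close" condition enters.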
According to our assumption about local expansion of the individual ’s, if for a unit vector , then
When we take equal to , then we get
where the latter “” conclusion is justified by the assumption (10) about the rate of information accumulation. Therefore, for large enough , with , , it follows from the above lemma that
Since is differentiable, there is a Taylor approximation at :
where the latter little-oh means a vector whose entries are all of that magnitude. Plugging in the definition of gives
and, hence,
Plugging in the definition of establishes that
Also, the constant that has been absorbed in “” is . Finally, the claim (11) follows from the above display and the general fact that, for a function defined on a set , is smaller than .
a.2 Proof of Theorem 2
Take any fixed design , and define a function
The function does not depend on because it is based on the information in a location parameter problem, but it does depend implicitly on the component of the design . From the trivial identity,
it follows immediately that , for any unit vector , where , . Since this new vector is also a unit vector, we have
This implies that the reflected design —the one that replaces the original in with , but keeps the same weights—satisfies . Define the mixture design , which is symmetric by construction, and by concavity (15) satisfies
We showed above that the two terms in the lower bound are equal and, consequently, . Therefore, for any design there exists a symmetric design with Hellinger information at least as large; hence, symmetric designs form a complete class.
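The symmetrization argument can be illustrated numerically. In the Python sketch below, `g` is a hypothetical stand-in for the directional Hellinger information (linear, hence trivially concave, in the design weights, mimicking the concavity property (15)); the code checks that the mixture of a design and its reflection never has smaller worst-case directional information, provided the set of test directions is closed under the sign flip.

```python
import numpy as np

S = np.diag([1.0, -1.0, 1.0])    # sign flip on the odd-degree coordinate

def g(points, weights, u):
    """Hypothetical directional information functional; NOT the paper's
    expression, but linear (hence concave) in the design weights."""
    F = np.vstack([np.ones(len(points)), points, points**2])
    M = (F * weights) @ F.T       # moment-type matrix of the design
    return float(u @ M @ u)

def design_info(points, weights, dirs):
    """Worst-case directional information over a set of unit vectors."""
    return min(g(points, weights, u) for u in dirs)

def symmetrize(points, weights):
    """Mixture of a design and its reflection, each with weight 1/2."""
    return np.concatenate([points, -points]), np.concatenate([weights, weights]) / 2

rng = np.random.default_rng(0)
pts, wts = rng.uniform(-1, 1, 4), rng.dirichlet(np.ones(4))
U = rng.normal(size=(100, 3))
U /= np.linalg.norm(U, axis=1, keepdims=True)
U = np.vstack([U, U @ S])        # close the direction set under reflection

# the symmetrized design is at least as informative as the original
assert design_info(*symmetrize(pts, wts), U) >= design_info(pts, wts, U) - 1e-9
```

The key identity mirrored here is that the reflected design evaluated at direction u equals the original design evaluated at the sign-flipped direction Su, so averaging the design and its reflection can only raise the worst case.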
References

Bernardo (1979) Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference. J. Roy. Statist. Soc. Ser. B, 41(2):113–147. With discussion.
 Biedermann et al. (2009) Biedermann, S., Dette, H., and Hoffmann, P. (2009). Constrained optimal discrimination designs for Fourier regression models. Ann. Inst. Statist. Math., 61(1):143–157.
 Biedermann et al. (2011) Biedermann, S., Dette, H., and Woods, D. C. (2011). Optimal design for additive partially nonlinear models. Biometrika, 98(2):449–458.
 Biedermann et al. (2006) Biedermann, S., Dette, H., and Zhu, W. (2006). Optimal designs for dose-response models with restricted design spaces. J. Amer. Statist. Assoc., 101(474):747–759.
 Chernozhukov and Hong (2004) Chernozhukov, V. and Hong, H. (2004). Likelihood estimation and inference in a class of nonregular econometric models. Econometrica, 72(5):1445–1480.
 Cousineau (2009) Cousineau, D. (2009). Fitting the three-parameter Weibull distribution: Review and evaluation of existing and new methods. IEEE Transactions on Dielectrics and Electrical Insulation, 16(1):281–288.
 de la Garza (1954) de la Garza, A. (1954). Spacing of information in polynomial regression. Ann. Math. Statistics, 25:123–130.
 Dette et al. (2008) Dette, H., Bretz, F., Pepelyshev, A., and Pinheiro, J. (2008). Optimal designs for dose-finding studies. J. Amer. Statist. Assoc., 103(483):1225–1237.
 Dette et al. (2018) Dette, H., Guchenko, R., Melas, V. B., and Wong, W. K. (2018). Optimal discrimination designs for semiparametric models. Biometrika, 105(1):185–197.
 Dette et al. (2017) Dette, H., Konstantinou, M., and Zhigljavsky, A. (2017). A new approach to optimal designs for correlated observations. Ann. Statist., 45(4):1579–1608.
 Dette and Melas (2011) Dette, H. and Melas, V. B. (2011). A note on the de la Garza phenomenon for locally optimal designs. Ann. Statist., 39(2):1266–1281.
 Dette et al. (2016) Dette, H., Pepelyshev, A., and Zhigljavsky, A. (2016). Optimal designs in regression with correlated errors. Ann. Statist., 44(1):113–152.
 Dette and Schorning (2013) Dette, H. and Schorning, K. (2013). Complete classes of designs for nonlinear regression models and principal representations of moment spaces. Ann. Statist., 41(3):1260–1267.
 Dette and Titoff (2009) Dette, H. and Titoff, S. (2009). Optimal discrimination designs. Ann. Statist., 37(4):2056–2082.
 Dror and Steinberg (2006) Dror, H. A. and Steinberg, D. M. (2006). Robust experimental design for multivariate generalized linear models. Technometrics, 48(4):520–529.
 Feller et al. (2017) Feller, C., Schorning, K., Dette, H., Bermann, G., and Bornkamp, B. (2017). Optimal designs for dose response curves with common parameters. Ann. Statist., 45(5):2102–2132.
 Gotwalt et al. (2009) Gotwalt, C. M., Jones, B. A., and Steinberg, D. M. (2009). Fast computation of designs robust to parameter uncertainty for nonlinear settings. Technometrics, 51(1):88–95.
 Harman and Benková (2017) Harman, R. and Benková, E. (2017). Barycentric algorithm for computing optimal size and cost-constrained designs of experiments. Metrika, 80(2):201–225.
 Hirose and Lai (1997) Hirose, H. and Lai, T. L. (1997). Inference from grouped data in three-parameter Weibull models with applications to breakdown-voltage experiments. Technometrics, 39(2):199–210.
 Ibragimov and Hasminskii (1981) Ibragimov, I. A. and Hasminskii, R. Z. (1981). Statistical Estimation, volume 16 of Applications of Mathematics. Springer-Verlag, New York–Berlin. Asymptotic theory, translated from the Russian by Samuel Kotz.
 Kiefer (1974) Kiefer, J. (1974). General equivalence theory for optimum designs (approximate theory). Ann. Statist., 2:849–879.

Koenker (2013) Koenker, R. (2013). quantreg: Quantile Regression. R package version 5.05.
 Koenker and Hallock (2001) Koenker, R. and Hallock, K. (2001). Quantile regression: An introduction. Journal of Economic Perspectives, 15(4):43–56.
 Lehmann and Casella (1998) Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation. Springer Texts in Statistics. SpringerVerlag, New York, second edition.
 Lin et al. (2018) Lin, Y., Martin, R., and Yang, M. (2018). Supplement to “On optimal designs for nonregular models”. DOI…
 Lindley (1956) Lindley, D. V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist., 27:986–1005.
 López-Fidalgo et al. (2007) López-Fidalgo, J., Tommasi, C., and Trandafir, P. C. (2007). An optimal experimental design criterion for discriminating between non-normal models. J. R. Stat. Soc. Ser. B Stat. Methodol., 69(2):231–242.
 Pollard (1997) Pollard, D. (1997). Another look at differentiability in quadratic mean. In Festschrift for Lucien Le Cam, pages 305–314. Springer, New York.
 Pollard (2005) Pollard, D. (2005). Asymptopia. Chapter 6 on “Hellinger differentiability,” http://www.stat.yale.edu/~pollard/Courses/607.spring05/handouts/DQM.pdf.
 Sagnol and Harman (2015) Sagnol, G. and Harman, R. (2015). Computing exact optimal designs by mixed integer second-order cone programming. Ann. Statist., 43(5):2198–2224.
 Schorning et al. (2017) Schorning, K., Dette, H., Kettelhake, K., Wong, W. K., and Bretz, F. (2017). Optimal designs for active controlled dose-finding trials with efficacy-toxicity outcomes. Biometrika, 104(4):1003–1010.
 Shemyakin (2014) Shemyakin, A. (2014). Hellinger distance and noninformative priors. Bayesian Anal., 9(4):923–938.
 Shemyakin (1991) Shemyakin, A. E. (1991). Multidimensional integral inequalities of Rao–Cramér type for parametric families with singularities. Sibirsk. Mat. Zh., 32(4):204–215, 230.
 Shemyakin (1992) Shemyakin, A. E. (1992). On information inequalities in parametric estimation theory. Teor. Veroyatnost. i Primenen., 37(1):121–123.
 Smith (1985) Smith, R. L. (1985). Maximum likelihood estimation in a class of nonregular cases. Biometrika, 72(1):67–90.
 Smith (1994) Smith, R. L. (1994). Nonregular regression. Biometrika, 81(1):173–183.
 van der Vaart (2002) van der Vaart, A. (2002). The statistical work of Lucien Le Cam. Ann. Statist., 30(3):631–682. Dedicated to the memory of Lucien Le Cam.
 van der Vaart (1998) van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
 Waterhouse et al. (2008) Waterhouse, T. H., Woods, D. C., Eccleston, J. A., and Lewis, S. M. (2008). Design selection criteria for discrimination/estimation for nested models and a binomial response. J. Statist. Plann. Inference, 138(1):132–144.
 Woodroofe (1974) Woodroofe, M. (1974). Maximum likelihood estimation of translation parameter of truncated distribution. II. Ann. Statist., 2:474–488.
 Yang (2010) Yang, M. (2010). On the de la Garza phenomenon. Ann. Statist., 38(4):2499–2524.
 Yang et al. (2013) Yang, M., Biedermann, S., and Tang, E. (2013). On optimal designs for nonlinear models: a general and efficient algorithm. J. Amer. Statist. Assoc., 108(504):1411–1420.
 Yang and Stufken (2009) Yang, M. and Stufken, J. (2009). Support points of locally optimal designs for nonlinear models with two parameters. Ann. Statist., 37(1):518–541.
 Yang and Stufken (2012) Yang, M. and Stufken, J. (2012). Identifying locally optimal designs for nonlinear models: a simple extension with profound consequences. Ann. Statist., 40(3):1665–1681.
 Yu (2011) Yu, Y. (2011). D-optimal designs via a cocktail algorithm. Stat. Comput., 21(4):475–481.
S Supplementary material
S1. A multiparameter example
As an illustrative example, consider the case where is two-dimensional and is