I Introduction
Estimation after model selection arises in a variety of problems in signal processing, communication, and multivariate data analysis [1, 2, 3, 4, 5, 6, 7, 8]. In post-model-selection estimation, the common practice is to select a model from a pool of candidate models and then, in the second stage, estimate the unknown parameters associated with the selected model. For example, in direction-of-arrival (DOA) estimation, first, the number of sources is selected, and then, the DOA of each detected source is estimated [9, 10, 11]. The selection in this case is usually based on information-theoretic criteria, such as the Akaike Information Criterion (AIC) [12], the Minimum Description Length (MDL) [13], and the generalized information criterion (GIC) [14]. In regression models [15, 16], the significant predictors are identified first, and then the corresponding coefficients of the selected model are typically estimated by the least squares method. A special case of estimation after model selection arises in the problem of estimating a sparse unknown parameter vector from noisy measurements. Sparse estimation has been analyzed intensively in the past few years, and has already given rise to numerous successful signal processing algorithms (see, e.g., [17, 18]). In particular, in greedy compressive sensing algorithms [19, 20], the support set of the signal is selected, based on some selection criterion, and then the associated nonzero values, i.e., the signal coefficients, are estimated. Thus, the problem of non-Bayesian sparse vector recovery can be interpreted as a special case of estimation after model selection.
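For concreteness, the two-stage practice described above can be sketched numerically. The following is an illustrative toy only (the Gaussian linear model, the AIC-style criterion with known noise variance, and all names are our own assumptions, not part of the paper): a support is selected by minimizing a penalized residual criterion, and the coefficients of the selected columns are then estimated by least squares, with the deselected coefficients forced to zero.

```python
import numpy as np

def aic_score(y, X, support, sigma2=1.0):
    """AIC-style criterion for a Gaussian linear model restricted to `support`:
    residual sum of squares (scaled by an assumed noise variance) + 2 * model size."""
    Xs = X[:, support]
    coef = np.linalg.lstsq(Xs, y, rcond=None)[0]
    rss = np.sum((y - Xs @ coef) ** 2)
    return rss / sigma2 + 2 * len(support)

def select_then_estimate(y, X, candidate_supports):
    """Stage 1: pick the candidate support minimizing the criterion.
    Stage 2: least-squares fit on the selected columns; deselected
    coefficients are forced to zero."""
    scores = [aic_score(y, X, s) for s in candidate_supports]
    s_hat = candidate_supports[int(np.argmin(scores))]
    beta_hat = np.zeros(X.shape[1])
    beta_hat[s_hat] = np.linalg.lstsq(X[:, s_hat], y, rcond=None)[0]
    return beta_hat, s_hat
```

Note that the same data `y` is used both to pick `s_hat` and to fit the coefficients, which is exactly the source of the "selection bias" discussed below.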
The oracle Cramér–Rao bound (CRB), which assumes perfect knowledge of the model, is commonly used for performance analysis and system design in these cases (see, e.g., [21, 22, 23]). However, the oracle CRB does not take into account the pre-screening process and the fact that the true model is unknown; thus, it is not a tight bound [22]. A more significant problem is the fact that the estimation is based on the same dataset utilized in the model selection step. The data-driven selection process creates a "selection bias" and produces a model that is itself stochastic, and this stochastic aspect is not accounted for by classical non-Bayesian estimation theory [24]. For example, it has been shown that ignoring the model selection step may lead to invalid analysis, such as non-covering confidence intervals [25, 26]. As a consequence, statistical inferential guarantees derived from classical theory, such as the CRB, are not valid outside the asymptotic region, nor can they predict the threshold region. Despite the importance of estimation after model selection and its widespread use in signal processing, the impact of the model selection procedure on the fundamental limits of estimation performance for general parametric models is not well understood.
I-A Summary of Results
In this paper we investigate the post-model-selection estimation performance for a given selection rule, when the estimated parameters belong to a model that has been selected from a set of candidate models. We assume that the data-based selection criterion is known in advance, and we analyze the post-model-selection performance for this specific criterion. We further assume coherency of the considered estimators, i.e., estimators that force the deselected parameters to zero. In order to characterize the estimation performance, we introduce the mean-squared-selected-error (MSSE) criterion as a performance measure, and derive the concept of selective unbiasedness by using the non-Bayesian Lehmann-unbiasedness definition [27]. Then, we develop a new post-model-selection Cramér–Rao-type lower bound, named the selective CRB, on the MSSE of any coherent and selective unbiased estimator. As a special case, we derive the proposed selective CRB for the setting in which a deterministic sparse vector is to be estimated from a small number of noisy measurements. The selective CRB is examined in simulations for a linear regression problem and for sparse estimation; in both cases it is shown to be a valid bound also outside the asymptotic region, where the oracle CRB is not, and to be tighter than the SMS-CRB from [1].
I-B Related Works
The majority of work on selective inference in the mathematical statistics literature is concerned with constructing confidence intervals [24, 25, 28, 29, 30, 31, 32, 33, 34], testing after model selection [35, 36], and post-selection maximum likelihood (ML) estimation [37, 36]. These works usually consider specific models, such as linear models, and specific estimators, such as M-estimators [25] or the Lasso method, for any selection rule. The current paper provides a general non-Bayesian estimation framework for any parametric model and unbiased estimators, but with specific model selection procedures.
In the context of signal processing, the works in [38] and [39] investigate Bayesian estimation after the detection of an unknown data region of interest. A novel CRB on the conditional MSE was developed in [40, 41] for the problem of post-detection estimation. However, in [38, 39, 40, 41], the useful data is selected, and not the model. In [42, 43, 44, 45], we developed the CRB and estimation methods for models whose "parameters of interest" are selected based on the data, i.e., estimation after parameter selection, in which the model is perfectly known. In contrast, in the case presented here, the measurement model is assumed to be unknown and is selected from a finite collection of competing models. Thus, the bounds from [42, 43, 44, 45] are irrelevant for estimation after model selection. In addition, it should be emphasized that the considered architecture is well-specified, and is different from the important problem of developing performance bounds for estimation with a misspecified (or mismatched) model [46, 47, 48, 49], in which the estimation is based on a continuous deviation from the true model [50]. In the considered scenario, however, we know the full finite set of candidate models that can be assumed. Thus, in the proposed approach, the estimation errors are from specific categories and can be averaged over these models.
To the best of our knowledge, the only existing bound in this context is the pioneering work of Sando, Mitra, and Stoica in [1], which presents a CRB-type bound for estimation after model order selection, named here the SMS-CRB. The SMS-CRB is based on some restrictive assumptions on the selection rule and on averaging the Fisher information matrices (FIMs) over the different models. As a result, it is not a tight bound, as shown in the simulations herein. In addition to this bound, for the special case of sparse vector estimation, the associated constrained CRB (CCRB) [51, 52] reduces to the CRB of the oracle estimator [23, 53], which assumes perfect knowledge of the support and is non-informative outside the asymptotic region. The effects of random compression on the CRB have been studied in [54]. However, in our context, the compression matrix is assumed to be known.
I-C Organization and Notation
The remainder of the paper is organized as follows: Section II presents the mathematical model for the problem of estimation after model selection. In Section III, the proposed selective CRB is derived, along with its marginal version. In Section IV, we develop the selective CRB for the special case of sparse vector estimation. The performance of the proposed bound is evaluated in simulations in Section V. Finally, our conclusions appear in Section VI.
In the rest of this paper, we denote vectors by boldface lowercase letters and matrices by boldface uppercase letters. The operators $(\cdot)^T$, $(\cdot)^{-1}$, and $\mathrm{Tr}(\cdot)$ denote the transpose, inverse, and trace operators, respectively. For a matrix $\mathbf{A}$ with full column rank, $\mathbf{A}^{\dagger} \triangleq (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T$ denotes its Moore–Penrose pseudo-inverse, and $\mathbf{I}_m$ is the identity matrix of order $m$. The $k$th element of a vector $\mathbf{a}$ and the $(k,l)$th element of a matrix $\mathbf{A}$ are denoted by $a_k$ and $\mathbf{A}_{k,l}$, respectively, and the corresponding notation is used for submatrices of $\mathbf{A}$. The notation $\mathbf{A} \succeq \mathbf{B}$ implies that $\mathbf{A} - \mathbf{B}$ is a positive-semidefinite matrix, where $\mathbf{A}$ and $\mathbf{B}$ are positive-semidefinite matrices of the same size. The gradient of a vector function $\mathbf{g}(\boldsymbol{\theta})$ w.r.t. $\boldsymbol{\theta} \in \mathbb{R}^{M}$, $\nabla_{\boldsymbol{\theta}} \mathbf{g}$, is the matrix whose $(k,l)$th element equals $\frac{\partial g_k}{\partial \theta_l}$. For any index set, $S$, $\mathbf{a}_{S}$ is the $|S|$-dimensional subvector of $\mathbf{a}$ containing the elements indexed by $S$, where $|S|$ and $S^c$ denote the set's cardinality and complement set, respectively. The notation $\mathbf{A}_{S}$ stands for the submatrix of $\mathbf{A}$ consisting of the columns indexed by $S$, $\mathbb{1}\{\mathcal{A}\}$ denotes the indicator function of an event $\mathcal{A}$, and the number of nonzero entries in $\mathbf{a}$ is denoted by $\|\mathbf{a}\|_0$. Finally, $\mathrm{E}[\cdot\,;\boldsymbol{\theta}]$ and $\mathrm{E}[\cdot|\cdot\,;\boldsymbol{\theta}]$ represent the expected value and the conditional expected value, parameterized by a deterministic parameter $\boldsymbol{\theta}$.
II Estimation After Model Selection
We consider a random observation vector, $\mathbf{x} \in \Omega_{\mathbf{x}}$, where $\Omega_{\mathbf{x}}$ is the observation space. We assume that $\mathbf{x}$ is distributed according to the probability density function (pdf) $f(\mathbf{x};\boldsymbol{\theta})$, where $\boldsymbol{\theta} \in \mathbb{R}^{M}$ and $S$ are an unknown deterministic parameter vector and its associated unknown support, respectively. We assume in the following that this true pdf belongs to a known set of candidate pdfs, each parameterized by its own unknown parameter vector, $\boldsymbol{\theta}_m$. The competing models can be nested or non-nested, and overlapping or not (see, e.g., p. 36 in [6]). We denote the associated set of models by $\mathcal{M}$.
In this paper we are interested in the estimation of $\boldsymbol{\theta}$ based on $\mathbf{x}$. Since the observation pdf is only known to belong to a set of candidate models, a model selection approach is conducted before the estimation. We take this model selection for granted and analyze the consequent estimation. Estimation after model selection, which is presented schematically in Fig. 1, consists of two stages: first, a certain model is selected according to a predetermined data-driven selection rule, $\hat{m} = \hat{m}(\mathbf{x})$, such as AIC or MDL, which is assumed here to be a deterministic function of $\mathbf{x}$. Then, in the second stage, the unknown parameter vector, $\boldsymbol{\theta}$, is estimated based on the same data, $\mathbf{x}$. We denote by $\hat{S}$ the selected support according to the selection rule, $\hat{m}$. We denote the probability of selecting the $m$th model as
where this probability is computed with respect to (w.r.t.) the true pdf, $f(\mathbf{x};\boldsymbol{\theta})$. We assume that the deterministic sets $\mathcal{A}_m \triangleq \{\mathbf{x} \in \Omega_{\mathbf{x}} : \hat{m}(\mathbf{x}) = m\}$, $m \in \mathcal{M}$, form a partition of $\Omega_{\mathbf{x}}$. By using Bayes' rule, it can be verified that
(1) 
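Since the selection rule is a deterministic function of the data, the selection probabilities are determined by the decision regions of the rule and can be approximated by simple Monte Carlo simulation. The following sketch is our own illustration (the threshold rule on the sample mean and all names are assumptions, not from the paper):

```python
import numpy as np

def selection_probabilities(select, sample, n_models, n_trials=4000, seed=1):
    """Monte Carlo approximation of the selection probabilities of each
    candidate model. `select` maps an observation to a model index; since the
    rule is deterministic, its decision regions partition the observation space,
    so the estimated probabilities sum to one."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_models)
    for _ in range(n_trials):
        counts[select(sample(rng))] += 1
    return counts / n_trials

# Toy example: two candidate models for the mean of a Gaussian sample;
# model 1 is declared iff the sample-mean magnitude exceeds 0.5.
theta = 0.8
probs = selection_probabilities(
    select=lambda x: int(abs(x.mean()) > 0.5),
    sample=lambda rng: rng.normal(theta, 1.0, size=25),
    n_models=2,
)
```

With the true mean well above the threshold, almost all of the probability mass falls on model 1, but the probability of the other model is nonzero, which is what makes the selection step stochastic.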
Let $\hat{\boldsymbol{\theta}}$ be an estimator of $\boldsymbol{\theta}$, based on the random observation vector, $\mathbf{x}$, i.e., $\hat{\boldsymbol{\theta}} = \hat{\boldsymbol{\theta}}(\mathbf{x})$, with a bounded second moment. The usual practice in post-model-selection estimation is to force the deselected parameters to zero and estimate the parameters that belong to the selected support. The following is a formal definition of this commonly-used practice, named here "coherency", which is defined w.r.t. the selection rule.
Definition 1.
An estimator, $\hat{\boldsymbol{\theta}}$, is said to be a coherent estimator of $\boldsymbol{\theta}$ w.r.t. the selection rule, $\hat{m}$, if
(2) 
The supports of the unknown parameter vectors, $\boldsymbol{\theta}_m$, $m \in \mathcal{M}$, differ in size. In order to compare the estimation errors in different models in the following, we introduce the zero-padded vectors and their associated support matrices, where the zero-padding in this paper is always to the length of the true parameter vector, $M$.
Definition 2.
For an arbitrary vector, $\mathbf{a}_m$, and any candidate support, $S_m$, the vector $\bar{\mathbf{a}}_m$ is a zero-padded, length-$M$ vector, whose nonzero elements correspond to the elements of $\mathbf{a}_m$. The associated diagonal matrix $\bar{\mathbf{A}}_m$ represents the true support of $\bar{\mathbf{a}}_m$, and its diagonal elements are given by
(3) 
for any $k = 1, \ldots, M$.
According to this definition, only estimation errors that belong to the true parameter vector and to the estimated nonzero parameters are relevant in the resulting zero-padded vector. The following example demonstrates our notation.
Example 1.
Let us consider a case with , i.e. , , and with the support . Then, and the estimated support is . According to Definition 2, the zero-padded estimation error vector for this case is and is a diagonal matrix with on its diagonal.
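The zero-padding and masking operations of Definition 2 can be sketched in code. This is our own illustration with made-up supports and values (not the example's original numbers, which concern a specific small case); the mask keeps exactly the indices that lie in both the true and the selected support, which are the only errors that matter below:

```python
import numpy as np

def zero_pad(sub_vector, support, M):
    """Length-M vector whose entries on `support` equal `sub_vector`,
    and are zero elsewhere (the zero-padding of Definition 2)."""
    padded = np.zeros(M)
    padded[list(support)] = sub_vector
    return padded

def support_mask(true_support, selected_support, M):
    """Diagonal 0/1 matrix whose k-th diagonal entry is 1 iff index k lies in
    both the true support and the selected support."""
    diag = np.zeros(M)
    diag[list(set(true_support) & set(selected_support))] = 1.0
    return np.diag(diag)
```

For instance, with a true support {0, 2} and a selected support {0, 1} in dimension 4, only index 0 survives the mask: the error on index 2 was forced to zero by the selection, and index 1 is not part of the true parameter vector.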
In this paper, we are interested in analyzing the performance of coherent estimators, as defined in Definition 1. Therefore, we use the following selected-square-error (SSE) matrix cost function:
(4) 
The corresponding mean SSE (MSSE) is given by
(5)  
(6) 
where the last equality is obtained by using (1) and the law of total expectation.
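The MSSE can be approximated empirically by averaging masked error outer products over independent trials. The sketch below is our own (the trial setup and names are assumptions); it follows the reading of the SSE above, in which only errors on indices in both the true and the selected support are counted:

```python
import numpy as np

def empirical_msse(trial, theta, true_support, n_trials=2000, seed=2):
    """Monte Carlo MSSE matrix: average outer product of the estimation error
    masked to the indices in both the true and the selected support."""
    rng = np.random.default_rng(seed)
    M = len(theta)
    acc = np.zeros((M, M))
    for _ in range(n_trials):
        theta_hat, selected_support = trial(rng)
        mask = np.zeros(M)
        mask[list(set(true_support) & set(selected_support))] = 1.0
        err = mask * (theta_hat - theta)
        acc += np.outer(err, err)
    return acc / n_trials

# Toy trial: theta = (1, 0) with true support {0}; the observation is theta
# plus N(0, 0.3^2) noise, and the rule (for illustration) selects everything.
theta = np.array([1.0, 0.0])
def trial(rng):
    return theta + rng.normal(0.0, 0.3, size=2), [0, 1]

msse = empirical_msse(trial, theta, [0])
```

Here the mask zeroes the second coordinate (outside the true support), so only the noise variance of the first coordinate appears on the diagonal.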
The marginal MSSE of a specific parameter, $\theta_k$, is given by the $k$th diagonal element of the MSSE, that is,
(7) 
, where $\mathcal{M}_k$ is the set of all models in which the parameter $\theta_k$ is part of the support. Similarly, $\mathcal{M}_k^c$ is the set of indices of all models for which the parameter $\theta_k$ is zero. It can be seen that
(8) 
is the probability that the parameter $\theta_k$ has been selected by the considered selection rule. Thus, (7) can be written as
(9) 
It can be seen that the SSE cost function from (4) only takes into account the estimation errors of the elements that belong to the true unknown parameter vector, i.e., to the support $S$, and that are also not forced to zero by the selection stage. The rationale behind this cost function is that the estimation errors of the deselected parameters (that belong to the true model) are determined solely by the selection rule, and cannot be reduced by any coherent estimator. Thus, what matters for designing and analyzing post-model-selection estimators are only the estimation errors of the selected parameters (that are also in the true model), which can be controlled. The relation between the MSSE and the MSE is described in Subsection III-B.
III Selective CRB
In this section, a CRB-type lower bound for estimation after model selection is derived. The proposed bound is a lower bound on the MSSE of any coherent and selective unbiased estimator, where selective unbiasedness is defined in Section III-A by using the concept of Lehmann unbiasedness. Section III-B shows the relation between the MSE and the MSSE. This relation can be used to obtain a lower bound on the MSE of coherent and selective unbiased estimators directly from any CRB-type lower bound on the MSSE. The main contribution of this work, the selective CRB, is presented in Section III-C, followed by important special cases in Section III-D. An early derivation of the selective CRB for a scalar cost function appears in [55].
III-A Selective Unbiasedness
In order to exclude trivial estimators, the mean-unbiasedness constraint is commonly used in non-Bayesian parameter estimation [56]. However, this constraint is inappropriate for estimation after model selection, since we are interested only in the errors of the selected parameters, and since the data-based model selection step induces bias [24]. Lehmann [27] proposed a generalization of the unbiasedness concept based on the considered cost function. In our previous work (p. 13 in [57]), we extended the scalar Lehmann unbiasedness definition to the general case of a matrix cost function, as follows.
Definition 3.
The estimator, $\hat{\boldsymbol{\theta}}$, is said to be a uniformly unbiased estimator of $\boldsymbol{\theta}$ in the Lehmann sense w.r.t. the positive semidefinite matrix cost function, $\mathbf{C}(\hat{\boldsymbol{\theta}}, \boldsymbol{\theta})$, if
(10) 
where $\Omega_{\boldsymbol{\theta}}$ is the parameter space.
Lehmann unbiasedness conditions for various cost functions can be found in [42, 43, 44, 58, 59, 60]. The following proposition defines the selective unbiasedness property of estimators w.r.t. the SSE matrix cost function and the selection rule.
Proposition 1.
An estimator, $\hat{\boldsymbol{\theta}}$, is an unbiased estimator for the problem of estimating the true parameter vector, $\boldsymbol{\theta}$, in the Lehmann sense w.r.t. the SSE matrix defined in (4) and the selection rule, $\hat{m}$, iff
(11) 
for all $m \in \mathcal{M}$, such that $\Pr(\hat{m} = m; \boldsymbol{\theta}) > 0$.
Proof:
The proof appears in Appendix A. ∎
It should be noted that while the considered estimators are length-$M$ vectors, the selective unbiasedness restricts only the values that are in the intersection of the true and estimated supports. That is, under this definition, only the estimators of the true parameters that have not been forced to zero should be unbiased.
The condition in (11) is equivalent to the requirement that all the scalar estimators of the parameters from the true model satisfy
(12) 
for all $m \in \mathcal{M}$, such that $\Pr(\hat{m} = m; \boldsymbol{\theta}) > 0$. Moreover, by multiplying (12) by the selection probability of the $m$th model and summing over the models that include the parameter $\theta_k$, $m \in \mathcal{M}_k$, we obtain the requirement that the scalar estimators are conditionally unbiased, conditioned on the event that they have been selected by the considered selection rule, i.e.
(13) 
In addition, by multiplying (11) by the selection probability of the $m$th model and summing over the candidate models, $m \in \mathcal{M}$, we obtain the following necessary condition for selective unbiasedness:
(14) 
It can be seen that selective unbiasedness is defined as a function of the specific selection rule. In the following, an estimator, $\hat{\boldsymbol{\theta}}$, is said to be a selective unbiased estimator for the problem of estimating $\boldsymbol{\theta}$ for a given model selection rule, $\hat{m}$, if (11) is satisfied.
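To see why conditional unbiasedness of the kind required in (13) is a nontrivial requirement, a quick simulation helps. The setup below is our own toy (a scalar Gaussian observation and a hard-threshold rule, not from the paper): the naive estimator is mean-unbiased, yet conditioned on having been selected it is biased upward.

```python
import numpy as np

# x ~ N(theta, 1); a hypothetical rule "selects" the parameter iff x > 1.
rng = np.random.default_rng(7)
theta = 0.5
x = rng.normal(theta, 1.0, size=200_000)

unconditional_mean = x.mean()            # close to theta: mean-unbiased
conditional_mean = x[x > 1.0].mean()     # E[x | selected]: well above theta,
                                         # i.e., conditionally biased
```

The conditional mean lands near 1.6 rather than 0.5, because conditioning on selection keeps only the large noise realizations. Any coherent estimator that reuses the data after such a rule inherits this selection bias unless it is explicitly corrected.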
III-B Relation Between the MSE and the MSSE
In this subsection we describe the relation between the MSE of any estimator of the true parameter vector,
(15) 
and the MSSE from (6) for coherent and selective unbiased estimators. First, by using Definition 2, the estimation error of the true parameter vector, $\boldsymbol{\theta}$, can be decomposed w.r.t. the selected support, $\hat{S}$, and its complement, $\hat{S}^c$, as follows:
(16) 
where all the vectors in (16) have the same dimension, $M$. By substituting (16) into the MSE matrix from (15), one obtains
(17) 
The coherency property from (2) implies that
or, after zero-padding, that
(18) 
By substituting (18) into (17), we obtain that for any coherent estimator the MSE satisfies
(19) 
Now, by using the law of total expectation, it can be seen that for any selective unbiased estimator the second term on the r.h.s. of (19) satisfies
(20) 
where the last equality is obtained by substituting the selective unbiasedness property from (11). Similarly, by using the law of total expectation, the last term on the r.h.s. of (19) satisfies
(21) 
By substituting (20) and (21) into (19), we obtain that the MSE of a coherent and selective unbiased estimator is given by
(22) 
That is, the MSE in (22) is the sum of the MSSE and an additional term, which is only a function of the selection rule and is not affected by the estimator, $\hat{\boldsymbol{\theta}}$. Therefore, for post-model-selection estimation, the significant part of the MSE from the estimation point of view is the MSSE. Moreover, by deriving a CRB-type lower bound on the MSSE, we also obtain a lower bound on the MSE of any coherent and selective unbiased estimator. Finally, by substituting (9) into the $k$th diagonal element of the MSE from (22), we obtain that the marginal MSE of a specific parameter, $\theta_k$, is given by
(23) 
.
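The decomposition of the MSE into the MSSE plus a selection-only term can be checked numerically. The experiment below is our own toy (a randomized selection rule that keeps the second parameter with probability one half, which makes the coherent estimator selective unbiased by construction; all values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.array([2.0, 1.0])
n = 20_000
x = theta + rng.normal(scale=0.5, size=(n, 2))   # noisy observations
keep = rng.integers(0, 2, size=n)                # 1 iff index 1 is selected

theta_hat = x.copy()
theta_hat[keep == 0, 1] = 0.0                    # coherency: deselected -> 0

err1 = theta_hat[:, 1] - theta[1]
mse1 = np.mean(err1 ** 2)                             # marginal MSE of theta_1
msse1 = np.mean((keep * (x[:, 1] - theta[1])) ** 2)   # masked (selected) errors
sel_term = np.mean(1 - keep) * theta[1] ** 2          # depends on the rule only
# mse1 equals msse1 + sel_term (up to floating-point rounding)
```

When the parameter is deselected, the coherent estimate is zero and the squared error is exactly the squared true value; that contribution depends only on the selection rule, which is why no coherent estimator can reduce it.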
III-C Selective CRB
Obtaining the estimator with the minimum MSSE among all coherent and selective unbiased estimators is usually intractable. Thus, lower bounds on the MSSE and the MSE of any coherent and selective unbiased estimator are useful for performance analysis and system design. In the following, a novel CRB for estimation after model selection, named here the selective CRB, is derived. To this end, we define the following post-model-selection likelihood gradient vectors:
(24) 
. The vectors in (24) are all $M$-dimensional. Let the $m$th selective FIM be defined as
(25) 
. Next, we define the following regularity conditions:

C.1 The post-model-selection likelihood gradient vectors from (24) exist, and the selective FIMs from (25) are well-defined, nonsingular, and nonzero matrices, for all $m \in \mathcal{M}$.

C.2 The operations of integration w.r.t. $\mathbf{x}$ and differentiation w.r.t. $\boldsymbol{\theta}$ can be interchanged, as follows:
(26)
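The selective FIM can be approximated by Monte Carlo simulation once the post-selection score is available. The example below is our own scalar toy (a Gaussian mean with a magnitude-threshold rule; the closed-form selection probability and all names are assumptions): the post-selection score is taken as the usual score minus the derivative of the log selection probability, so it has zero conditional mean on each decision region.

```python
import math
import numpy as np

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def selective_fims(theta, sigma=1.0, t=1.0, n=200_000, seed=0):
    """Monte Carlo selective FIMs for x ~ N(theta, sigma^2) with the toy rule
    'model 1 iff |x| > t'. Returns the selection probabilities and the
    per-model mean squared post-selection scores."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, sigma, size=n)
    m = (np.abs(x) > t).astype(int)
    # Pr(model 1; theta) and its derivative w.r.t. theta, in closed form.
    p1 = 1.0 - normal_cdf((t - theta) / sigma) + normal_cdf((-t - theta) / sigma)
    dp1 = (normal_pdf((t - theta) / sigma) - normal_pdf((-t - theta) / sigma)) / sigma
    probs = np.array([1.0 - p1, p1])
    dlogp = np.array([-dp1 / (1.0 - p1), dp1 / p1])
    fims = np.empty(2)
    for k in (0, 1):
        score = (x[m == k] - theta) / sigma**2 - dlogp[k]  # conditional score
        fims[k] = np.mean(score**2)
    return probs, fims

probs, fims = selective_fims(0.3)
```

Note that subtracting the log-selection-probability derivative reduces the probability-weighted average information relative to the unconditional FIM, reflecting the information "spent" by the selection step.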
The following theorem presents the proposed selective CRB.
Theorem 1.
Let the regularity conditions C.1–C.2 be satisfied, and let $\hat{\boldsymbol{\theta}}$ be a coherent and selective unbiased estimator for the problem of estimating $\boldsymbol{\theta}$, for a given selection rule, $\hat{m}$. Then, the MSSE satisfies
(27) 
where the selective CRB is given by
(28) 
in which $\bar{\mathbf{A}}_m$ is the zero-one diagonal matrix defined in (3), and the $m$th selective FIM is defined in (25). Furthermore, the MSE from (22) is bounded by
(29) 
Proof:
The proof appears in Appendix B. ∎
The MSE and MSSE bounds in Theorem 1 are matrix bounds. As such, they imply, in particular, the associated marginal bounds on the diagonal elements and on the trace. That is, by using the $k$th element of the MSSE from (7) and the bound from (27)–(28), we obtain the marginal selective CRB on the MSSE of the $k$th element of $\boldsymbol{\theta}$:
(30) 
, where the last equality is obtained by substituting (3). Similarly, using the $k$th element of the MSE from (22) and the matrix MSE bound from (29) implies the following marginal MSE bounds:
(31) 
, where the selection probability of $\theta_k$ is defined in (8). Summing (31) over $k$, we obtain the associated selective CRB on the trace of the MSE,
(32)  
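The marginal bounds above combine, per candidate model, a selection probability with a per-model selective-information term. The snippet below is a schematic sketch of that probability-weighted structure with made-up numbers (it is not the exact expression in (28) or (30)–(32); the probabilities and informations are pure assumptions for illustration):

```python
import numpy as np

# Assumed selection probabilities of three candidate models for one scalar
# parameter theta_k, and assumed marginal selective Fisher information of
# theta_k under each model. np.inf marks a model whose support excludes
# theta_k, so it contributes nothing to the MSSE bound (1 / inf == 0).
probs = np.array([0.6, 0.3, 0.1])
fim_k = np.array([5.0, 2.0, np.inf])

# Probability-weighted sum of per-model inverse-information terms.
msse_bound_k = np.sum(probs / fim_k)
```

The structure makes two qualitative properties visible: models that deselect the parameter drop out of the MSSE bound, and models with little post-selection information dominate the weighted sum.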
Finally, the following lemma presents an alternative formula for the selective FIM. This formula may be more tractable for some estimation problems.
Lemma 1.
Assume that Conditions C.1–C.2 are satisfied, in addition to the following regularity conditions:

C.3 The second derivatives of the post-model-selection log-likelihood w.r.t. the elements of $\boldsymbol{\theta}$ exist and are bounded and continuous.

C.4 The integral $\int_{\mathcal{A}_m} f(\mathbf{x};\boldsymbol{\theta})\,d\mathbf{x}$ is twice differentiable under the integral sign, $\forall m \in \mathcal{M}$.
Then, the $m$th selective FIM in (25) satisfies
(33)  
, $\forall m \in \mathcal{M}$.
Proof:
The proof appears in Appendix C. ∎
III-D Special Cases
III-D1 Single Model
When only a single model is assumed, i.e., $|\mathcal{M}| = 1$, there is only one possible selection; thus, the single model is selected with probability one under any selection rule. In this case, the SSE, selective unbiasedness, and selective CRB reduce to the MSE, mean-unbiasedness, and CRB for estimating $\boldsymbol{\theta}$, respectively. Thus, the proposed paradigm generalizes conventional non-Bayesian parameter estimation, which assumes a known generative model.
III-D2 Nested Models and the Relation to the SMS-CRB
A model class is nested if smaller models are always special cases of larger models. Thus, in this special case, we assume a model order selection problem in which the candidate supports satisfy $S_1 \subset S_2 \subset \ldots$, where $m_0$ denotes the true model, i.e., $S = S_{m_0}$. In this case, it can be verified that
(34) 
By substituting (34) in the selective CRB from (28), we obtain
(35) 
where
(36) 
The SMS-CRB bound from [1] was developed for the problem of model order selection with nested models, under the following assumptions:

A.1 The order selection rule is such that, asymptotically, the probability of selecting any model order below the true order, $m_0$, vanishes. Hence, asymptotically, we allow only possible overestimation of the order by the considered model selection rule.

A.2 The FIMs under the $m$th candidate model,
(37)
are nonsingular matrices for any $m \in \mathcal{M}$.
Under Assumptions A.1–A.2, the SMS-CRB is given by [1]
(38) 
where
(39) 
It can be seen that the proposed selective CRB for nested models from (35) has a structure similar to that of the SMS-CRB from (38). However, the SMS-CRB matrix bound and the proposed matrix bound have different dimensions, since the proposed bound is defined under the true model, $m_0$. The proposed selective CRB accounts for both overestimation and underestimation of the model order, while the SMS-CRB accounts only for overestimation. The selective CRB is based on a different selective FIM for each model, while the SMS-CRB is based on averaging over the FIMs of the different candidate models, as can be seen by comparing (36) and (39). Finally, our bound is not limited to the problem of model order selection, and is shown to be tighter than the SMS-CRB in the simulations.
III-D3 Randomized Selection Rule
In this degenerate case, we consider a random selection rule that is independent of the data, in which the $m$th model is selected with a probability that does not depend on $\boldsymbol{\theta}$, where these probabilities are nonnegative and sum to one. Thus, the derivative of the log of the probability of selecting the $m$th model w.r.t. $\boldsymbol{\theta}$ vanishes, i.e.
(40) 
In addition, since the selection is independent of the observation vector, $\mathbf{x}$, then