Cramer-Rao Bound for Estimation After Model Selection and its Application to Sparse Vector Estimation

by Elad Meir, et al.
Ben-Gurion University of the Negev

In many practical parameter estimation problems, such as coefficient estimation of polynomial regression and direction-of-arrival (DOA) estimation, model selection is performed prior to estimation. In these cases, it is assumed that the true measurement model belongs to a set of candidate models. The data-based model selection step affects the subsequent estimation, which may result in biased estimation. In particular, the oracle Cramer-Rao bound (CRB), which assumes knowledge of the model, is inappropriate for post-model-selection performance analysis and system design outside the asymptotic region. In this paper, we analyze the estimation performance of post-model-selection estimators, by using the mean-squared-selected-error (MSSE) criterion. We assume coherent estimators that force unselected parameters to zero, and introduce the concept of selective unbiasedness in the sense of Lehmann unbiasedness. We derive a non-Bayesian Cramer-Rao-type bound on the MSSE and on the mean-squared-error (MSE) of any coherent and selective unbiased estimator. As an important special case, we illustrate the computation and applicability of the proposed selective CRB for sparse vector estimation, in which the selection of a model is equivalent to the recovery of the support. Finally, we demonstrate in numerical simulations that the proposed selective CRB is a valid lower bound on the performance of the post-model-selection maximum likelihood estimator for the general linear model with different model selection criteria, and for sparse vector estimation with one-step thresholding. It is shown that for these cases the selective CRB outperforms the existing bounds: the oracle CRB, the averaged CRB, and the SMS-CRB from [1].




I Introduction

Estimation after model selection arises in a variety of problems in signal processing, communication, and multivariate data analysis [1, 2, 3, 4, 5, 6, 7, 8]. In post-model-selection estimation the common practice is to select a model from a pool of candidate models and then, in the second stage, estimate the unknown parameters associated with the selected model. For example, in direction-of-arrival (DOA) estimation, first, the number of sources is selected, and then, the DOA of each detected source is estimated [9, 10, 11]. The selection in this case is usually based on information theoretic criteria, such as the Akaike Information Criterion (AIC) [12], the Minimum Description Length (MDL) [13], and the generalized information criterion (GIC) [14]. In regression models [15, 16], the significant predictors are identified, and then, the corresponding coefficients of the selected model are typically estimated by the least squares method. A special case of estimation after model selection arises in the problem of estimating a sparse unknown parameter vector from noisy measurements. Sparse estimation has been analyzed intensively in the past few years, and has already given rise to numerous successful signal processing algorithms (see, e.g. [17, 18]). In particular, in greedy compressive sensing algorithms [19, 20], the support set of the signals is selected, based on a selection criterion, and then the associated nonzero values, i.e. the signal coefficients, are estimated. Thus, the problem of non-Bayesian sparse vector recovery can be interpreted as a special case of estimation after model selection.

The oracle Cramer-Rao bound (CRB), which assumes perfect knowledge of the model, is commonly used for performance analysis and system design in these cases (see, e.g. [21, 22, 23]). However, the oracle CRB does not take into account the prescreening process and the fact that the true model is unknown, and, thus, it is not a tight bound [22]. A more significant problem is the fact that the estimation is based on the same dataset utilized in the model selection step. The data-driven selection process creates “selection bias” and produces a model that is itself stochastic, and this stochastic aspect is not accounted for by classical non-Bayesian estimation theory [24]. For example, it is shown that ignoring the model selection step may lead to invalid analysis, such as non-covering confidence intervals [25, 26]. As a consequence, statistical inferential guarantees derived from classical theory, such as the CRB, are not valid outside the asymptotic region, nor can they predict the threshold effect. Despite the importance of estimation after model selection and its widespread use in signal processing, the impact of the model selection procedure on the fundamental limits of estimation performance for general parametric models is not well understood.

I-A Summary of results

In this paper we investigate the post-model-selection estimation performance for a given selection rule, when the estimated parameters belong to a model that has been selected from a set of candidate models. We assume that the data-based selection criterion is known in advance and we analyze the post-model-selection performance for this specific criterion. We further assume coherency of the considered estimators, i.e. estimators that force the deselected parameters to zero. In order to characterize the estimation performance we introduce the mean-squared-selected-error (MSSE) criterion as a performance measure, and develop the concept of selective unbiasedness, by using the non-Bayesian Lehmann-unbiasedness definition [27]. Then we develop a new post-model-selection Cramer-Rao-type lower bound, named the selective CRB, on the MSSE of any coherent and selective unbiased estimator. As a special case, we derive the proposed selective CRB for the setting in which a deterministic sparse vector is to be estimated from a small number of noisy measurements. The selective CRB is examined in simulations for a linear regression problem and for sparse estimation, and in both cases it is shown to be a valid bound also outside the asymptotic region, where the oracle CRB is not, and to be tighter than the SMS-CRB from [1].


I-B Related works

The majority of work on selective inference in mathematical statistics literature is concerned with constructing confidence intervals [24, 25, 28, 29, 30, 31, 32, 33, 34], testing after model selection [35, 36], and post-selection maximum likelihood (ML) estimation [37, 36]. These works usually considered specific models, such as linear models, and specific estimators, such as M-estimators [25] or the Lasso method, for any selection rule. The current paper provides a general non-Bayesian estimation framework for any parametric model and unbiased estimators, but with specific model selection procedures.

In the context of signal processing, the works in [38] and [39] investigate Bayesian estimation after the detection of an unknown data region of interest. A novel CRB on the conditional MSE is developed in [40, 41] for the problem of post-detection estimation. However, in [38, 39, 40, 41], the useful data is selected and not the model. In [42, 43, 44, 45], we developed the CRB and estimation methods for models whose “parameters of interest” are selected based on the data, i.e. estimation after parameter selection, in which the model is perfectly known. In contrast, in the case presented here, the measurement model is assumed to be unknown and is selected from a finite collection of competing models. Thus, the bound from [42, 43, 44, 45] is irrelevant for estimation after model selection. In addition, it should be emphasized that the considered architecture is well-specified and is different from the important problem of the development of performance bounds for estimation with a misspecified (or mismatched) model [46, 47, 48, 49], in which the estimation is based on a continuous deviation from the true model [50]. In the considered scenario, however, we know the full finite set of candidate models that can be assumed. Thus, in the proposed approach the estimation errors are from specific categories and can be averaged along these models.

To the best of our knowledge, the only existing bound in this context is the pioneering work of Sando, Mitra, and Stoica in [1], which presents a CRB-type bound for estimation after model order selection, named here the SMS-CRB. The SMS-CRB is based on some restrictive assumptions on the selection rule and on averaging the Fisher information matrices (FIMs) over the different models. As a result, it is not a tight bound, as shown in the simulations herein. In addition to this bound, for the special case of sparse vector estimation, the associated constrained CRB (CCRB) [51, 52] reduces to the CRB of the oracle estimator [23, 53], which assumes perfect knowledge of the support and is non-informative outside the asymptotic region. The effects of random compression on the CRB have been studied in [54]. However, in our context, the compression matrix is assumed to be known.

I-C Organization and notations

The remainder of the paper is organized as follows: Section II presents the mathematical model for the problem of estimation after model selection. In Section III the proposed selective CRB is derived, with its marginal version. In Section IV, we develop the selective CRB for the special case of sparse vector estimation. The performance of the proposed bound is evaluated in simulations in Section V. Finally, our conclusions can be found in Section VI.

In the rest of this paper, we denote vectors by boldface lowercase letters and matrices by boldface uppercase letters. The operators $(\cdot)^T$, $(\cdot)^{-1}$, and $\mathrm{tr}(\cdot)$ denote the transpose, inverse, and trace operators, respectively. For a matrix $\mathbf{A}$ with a full column rank, $\mathbf{A}^{\dagger} \triangleq (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$ denotes its Moore-Penrose pseudo-inverse, which satisfies $\mathbf{A}^{\dagger}\mathbf{A} = \mathbf{I}_K$, where $\mathbf{I}_K$ is the identity matrix of order $K$. The $i$th element of the vector $\mathbf{a}$, the $(i,j)$th element of the matrix $\mathbf{A}$, and the $(\Lambda_1,\Lambda_2)$ submatrix of $\mathbf{A}$ are denoted by $a_i$, $A_{i,j}$, and $\mathbf{A}_{\Lambda_1,\Lambda_2}$, respectively. The notation $\mathbf{A} \succeq \mathbf{B}$ implies that $\mathbf{A}-\mathbf{B}$ is a positive-semidefinite matrix, where $\mathbf{A}$ and $\mathbf{B}$ are positive-semidefinite matrices of the same size. The gradient of a vector function, $\mathbf{g}(\boldsymbol{\theta})$, of $\boldsymbol{\theta} \in \mathbb{R}^M$, denoted $\nabla_{\boldsymbol{\theta}}\mathbf{g}$, is a matrix in $\mathbb{R}^{M \times K}$, with the $(i,j)$th element equal to $\partial g_j/\partial\theta_i$, where $\mathbf{g}(\boldsymbol{\theta}) \in \mathbb{R}^K$ and $\boldsymbol{\theta} \in \mathbb{R}^M$. For any index set, $\Lambda$, $\boldsymbol{\theta}_{\Lambda}$ is the $|\Lambda|$-dimensional subvector of $\boldsymbol{\theta}$ containing the elements indexed by $\Lambda$, where $|\Lambda|$ and $\Lambda^c$ denote the set's cardinality and complement set, respectively. The notation $\mathbf{A}_{\Lambda}$ stands for a submatrix of $\mathbf{A}$ consisting of the columns indexed by $\Lambda$, $\mathbb{1}\{\mathcal{E}\}$ denotes the indicator function of an event $\mathcal{E}$, and the number of non-zero entries in $\boldsymbol{\theta}$ is denoted by $\|\boldsymbol{\theta}\|_0$. Finally, $\mathrm{E}[\,\cdot\,;\boldsymbol{\theta}]$ and $\mathrm{E}[\,\cdot\,|\,\mathcal{A};\boldsymbol{\theta}]$ represent the expected value and the conditional expected value, parameterized by a deterministic parameter $\boldsymbol{\theta}$.

II Estimation after model selection

We consider a random observation vector, $\mathbf{x} \in \Omega_x$, where $\Omega_x$ is the observation space. We assume that $\mathbf{x}$ is distributed according to the probability density function (pdf) $f(\mathbf{x};\boldsymbol{\theta}_0)$, where $\boldsymbol{\theta}_0$ and $\Lambda_0$ are an unknown deterministic parameter vector and its associated unknown support, respectively. We assume in the following that this true pdf belongs to a known set of candidate pdfs, $f_m(\mathbf{x};\boldsymbol{\theta}_m)$, $m \in \mathcal{M}$, where each pdf in this set is parameterized by its own unknown parameter vector, $\boldsymbol{\theta}_m$. The competing models can be nested or non-nested, and overlapping or not (see, e.g. p. 36 in [6]). We denote the associated set of models by $\mathcal{M}$.

In this paper we are interested in the estimation of $\boldsymbol{\theta}_0$ based on $\mathbf{x}$. Since the observation pdf is only known to belong to a set of candidate models, a model selection approach is conducted before the estimation. We take this model selection for granted and analyze the consequent estimation. Estimation after model selection, which is presented schematically in Fig. 1, consists of two stages: first, a certain model is selected according to a predetermined data-driven selection rule, $\hat{m} = \hat{m}(\mathbf{x})$, such as AIC or MDL, which is assumed here to be a deterministic function of $\mathbf{x}$. Then, in the second stage, the unknown parameter vector, $\boldsymbol{\theta}_0$, is estimated based on the same data, $\mathbf{x}$. We denote by $\hat{\Lambda}$ the support selected according to the selection rule, $\hat{m}$, and denote the probability of selecting the $m$th model as

$$\pi_m(\boldsymbol{\theta}) \triangleq \Pr(\hat{m} = m;\boldsymbol{\theta}),$$

where this probability is computed with respect to (w.r.t.) the true pdf. We assume that the deterministic decision regions, $\{\mathbf{x} \in \Omega_x : \hat{m}(\mathbf{x}) = m\}$, $m \in \mathcal{M}$, form a partition of $\Omega_x$. By using Bayes' rule, it can be verified that the conditional pdf of $\mathbf{x}$, given that the $m$th model is selected, is $f(\mathbf{x}\,|\,\hat{m}=m;\boldsymbol{\theta}) = f(\mathbf{x};\boldsymbol{\theta})\,\mathbb{1}\{\hat{m}(\mathbf{x})=m\}/\pi_m(\boldsymbol{\theta})$.

Fig. 1: Estimation after model selection: The measurement vector is generated based on a pdf that belongs to the set of candidate models. Then, in the first processing stage, a model is selected according to a predetermined selection rule. In the second stage, the unknown parameter vector is estimated based on the observation vector and the model selection output.

Let $\hat{\boldsymbol{\theta}} = \hat{\boldsymbol{\theta}}(\mathbf{x})$ be an estimator of the true parameter vector, based on the random observation vector, $\mathbf{x}$, with a bounded second moment. The usual practice in post-model-selection estimation is to force the deselected parameters to zero and estimate the parameters that belong to the selected support. The following is a formal definition of this commonly-used practice, named here “coherency”, which is defined w.r.t. the selection rule.

Definition 1.

An estimator, $\hat{\boldsymbol{\theta}}$, is said to be a coherent estimator of the true parameter vector w.r.t. the selection rule, $\hat{m}$, if all of its entries outside the selected support are identically zero, i.e. $\hat{\boldsymbol{\theta}}_{\hat{\Lambda}^c} = \mathbf{0}$.


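As a concrete illustration of coherency, the sketch below implements a two-stage post-model-selection estimator for a toy linear Gaussian model. The hard-thresholding selection rule, the threshold `tau`, and the matrix `H` are all illustrative assumptions, not the paper's prescribed choices; the restricted least-squares refit then forces every deselected coefficient to zero, as the definition requires.

```python
import numpy as np

def coherent_estimate(y, H, tau=0.5):
    """Two-stage estimate: select a support by hard-thresholding the full
    least-squares solution (an assumed selection rule), then re-fit on the
    selected columns only and force deselected coefficients to zero."""
    theta_ls, *_ = np.linalg.lstsq(H, y, rcond=None)
    support = np.flatnonzero(np.abs(theta_ls) > tau)  # data-driven selection
    theta_hat = np.zeros(H.shape[1])
    if support.size > 0:
        theta_hat[support], *_ = np.linalg.lstsq(H[:, support], y, rcond=None)
    return theta_hat, support

rng = np.random.default_rng(0)
H = rng.standard_normal((50, 8))
theta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0])
y = H @ theta_true + 0.1 * rng.standard_normal(50)
theta_hat, support = coherent_estimate(y, H)

# Coherency check: entries outside the selected support are exactly zero.
deselected = np.setdiff1d(np.arange(H.shape[1]), support)
assert np.all(theta_hat[deselected] == 0.0)
```

Any selection rule could replace the thresholding step; coherency only constrains what happens to the deselected entries.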
The supports of the unknown parameter vectors of the different candidate models differ in size. In order to compare the estimation errors across different models in the following, we introduce the zero-padded vectors and their associated support matrices, where the zero-padding in this paper is always to the length of the true parameter vector.


Definition 2.

For an arbitrary vector, , and any candidate support, , the vector , is a zero padded, -length vector, whose nonzero elements correspond to the elements of . The associated diagonal matrix represents the true support of and its diagonal elements are given by


for any .

According to this definition, only estimation errors that belong to the true parameter vector and to the estimated nonzero parameters are relevant in the resultant zero-padded vector. The following example demonstrates our notation.

Example 1.

Let us consider a case with , i.e. , , and with the support . Then, and the estimated support is . According to Definition 2, the zero-padded estimation error vector for this case is and is a diagonal matrix with on its diagonal.
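The masking in Definition 2 can be made concrete with a small numerical sketch. The supports and values below are illustrative assumptions: only errors at coordinates lying in both the true and the selected support survive; all other coordinates of the zero-padded error vector are zero.

```python
import numpy as np

# Illustrative (assumed) setup: true support {0, 2}, selected support {0, 1}.
true_support = np.array([0, 2])
selected_support = np.array([0, 1])
theta_true = np.array([2.0, 0.0, -1.5, 0.0])
theta_hat = np.array([1.8, 0.3, 0.0, 0.0])  # coherent: zero off the selected support

# Keep only coordinates in the intersection of the true and selected
# supports; zero-pad everything else to the length of theta_true.
keep = np.intersect1d(true_support, selected_support)         # index 0 only
M = np.diag(np.isin(np.arange(theta_true.size), keep).astype(float))
err_zp = M @ (theta_hat - theta_true)  # error survives only at coordinate 0
```

The diagonal zero-one matrix `M` plays the role of the support matrix: it discards the error at coordinate 2 (a true parameter deselected by the rule) and at coordinate 1 (a selected parameter outside the true support).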

In this paper, we are interested in analyzing the performance of coherent estimators, as defined in Definition 1. Therefore, we use the following selected-square-error (SSE) matrix cost function:


The corresponding mean SSE (MSSE) is given by


where the last equality is obtained by using (1) and the law of total expectation.
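Operationally, the law of total expectation means the MSSE can be approximated by Monte Carlo: condition the squared errors on the selection event and weight by the empirical selection probability. A minimal sketch for a scalar Gaussian mean with a threshold selection rule (all numerical choices are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, tau, trials = 1.0, 1.0, 10, 0.4, 200_000

# Sufficient statistic: sample mean of n Gaussian observations.
xbar = theta + (sigma / np.sqrt(n)) * rng.standard_normal(trials)
selected = np.abs(xbar) > tau              # nonzero-mean model selected
theta_hat = np.where(selected, xbar, 0.0)  # coherent post-selection estimator

# MSSE: squared error conditioned on selecting the true (nonzero) model,
# weighted by the empirical probability of that selection.
p_sel = selected.mean()
msse = p_sel * np.mean((xbar[selected] - theta) ** 2)
mse = np.mean((theta_hat - theta) ** 2)
print(p_sel, msse, mse)
```

Here the MSSE never exceeds the MSE, since the MSE also carries the error of the deselected trials.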

The marginal MSSE of a specific parameter, $\theta_i$, is given by the $i$th diagonal element of the MSSE matrix. Let $\mathcal{M}_i$ denote the set of all the models in which the parameter $\theta_i$ is a part of the support; similarly, $\mathcal{M}_i^c$ is the set of indices of all the models for which the parameter $\theta_i$ is zero. It can be seen that

$$\Pr\big(i \in \hat{\Lambda};\boldsymbol{\theta}\big) = \sum_{m \in \mathcal{M}_i} \Pr\big(\hat{m} = m;\boldsymbol{\theta}\big)$$

is the probability that the parameter $\theta_i$ has been selected by the considered selection rule. Thus, (II) can be written as


It can be seen that the SSE cost function from (4) only takes into account the estimation errors of the elements of the true unknown parameter vector, i.e. belongs to the support , and that are also not forced to zero by the selection stage. The rationale behind this cost function is that the estimation errors of the deselected parameters (that belong to the true model) are only determined by the selection rule, and cannot be reduced by any coherent estimator. Thus, what is important for designing and analyzing post-model-selection estimators are only the estimation errors of the selected parameters (that are also in the true model) that can be controlled. The relation between the MSSE and MSE is described in Subsection III-B.

III Selective CRB

In this section, a CRB-type lower bound for estimation after model selection is derived. The proposed bound is a lower bound on the MSSE of any coherent and selective unbiased estimator, where selective unbiasedness is defined in Section III-A by using the concept of Lehmann unbiasedness. Section III-B shows the relation between the MSE and the MSSE. This relation can be used for obtaining a lower bound on the MSE of coherent and selective unbiased estimators directly from any CRB-type lower bound on the MSSE. The main contribution of this work, the selective CRB, is presented in Section III-C, followed by important special cases, in Section III-D. An early derivation of the selective CRB for a scalar cost function appears in [55].

III-A Selective unbiasedness

In order to exclude trivial estimators, the mean-unbiasedness constraint is commonly used in non-Bayesian parameter estimation [56]. However, this constraint is inappropriate for estimation after model selection, since we are interested only in errors of the selected parameters and since the data-based model selection step induces bias [24]. Lehmann [27] proposed a generalization of the unbiasedness concept based on the considered cost function. In our previous work (p. 13 in [57]) we extended the scalar Lehmann unbiasedness definition to the general case of a matrix cost function, as follows.

Definition 3.

The estimator, $\hat{\boldsymbol{\theta}}$, is said to be a uniformly unbiased estimator of $\boldsymbol{\theta}$ in the Lehmann sense w.r.t. the positive semidefinite matrix cost function, $\mathbf{C}(\hat{\boldsymbol{\theta}},\boldsymbol{\theta})$, if

$$\mathrm{E}\big[\mathbf{C}(\hat{\boldsymbol{\theta}},\boldsymbol{\eta});\boldsymbol{\theta}\big] \succeq \mathrm{E}\big[\mathbf{C}(\hat{\boldsymbol{\theta}},\boldsymbol{\theta});\boldsymbol{\theta}\big],\quad \forall \boldsymbol{\eta},\boldsymbol{\theta} \in \Omega_{\boldsymbol{\theta}},$$

where $\Omega_{\boldsymbol{\theta}}$ is the parameter space.

Lehmann unbiasedness conditions for various cost functions can be found in [42, 43, 44, 58, 59, 60]. The following proposition defines the selective unbiasedness property of estimators w.r.t. the SSE matrix cost function and the selection rule.

Proposition 1.

An estimator, $\hat{\boldsymbol{\theta}}$, is an unbiased estimator for the problem of estimating the true parameter vector in the Lehmann sense, w.r.t. the SSE matrix defined in (4) and the selection rule, $\hat{m}$, iff


for all candidate models and parameter values for which the selection probability is nonzero.


The proof appears in Appendix A. ∎

It should be noted that while the considered estimators are full-length vectors, the selective unbiasedness restricts only the entries that lie in the intersection of the true and the estimated supports. That is, under this definition, only the estimators of the true parameters that have not been forced to zero are required to be unbiased.

The condition in (11) is equivalent to the requirement that all the scalar estimators of the parameters from the true model satisfy


for all parameter values with nonzero selection probability. Moreover, by multiplying (12) by the corresponding selection probability and summing over the models that include the parameter $\theta_i$, we obtain the requirement that the scalar estimators are conditionally unbiased, conditioned on the event that they have been selected by the considered selection rule, i.e.


In addition, by multiplying (11) by the corresponding selection probability and summing over the candidate models, we obtain the following necessary condition for selective unbiasedness:


It can be seen that selective unbiasedness is defined as a function of the specific selection rule. In the following, an estimator, , is said to be a selective unbiased estimator for the problem of estimating and given model selection rule, , if (11) is satisfied.
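The conditional-unbiasedness requirement in (13) is not automatic: for the naive post-selection ML estimator, the conditional mean given selection deviates from the true value near the selection threshold. A sketch under an assumed scalar Gaussian model with threshold selection makes the induced bias visible:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma_eff, tau, trials = 0.5, 0.6, 400_000

for theta in (0.3, 1.0, 3.0):
    xbar = theta + sigma_eff * rng.standard_normal(trials)
    sel = np.abs(xbar) > tau
    cond_bias = xbar[sel].mean() - theta  # E[estimate | selected] - theta
    print(f"theta={theta}: conditional bias ~ {cond_bias:+.3f}")
# Near the threshold the naive post-selection ML is conditionally biased,
# so it is not selectively unbiased there; far from the threshold the
# conditional bias vanishes.
```

This is the same asymptotic picture as in the paper's simulations: selective unbiasedness, and hence the selective CRB, is most relevant outside the asymptotic region.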

III-B Relation between MSE and MSSE

In this subsection we describe the relation between the MSE of any estimator of the true parameter vector and the MSSE from (II) of coherent and selective unbiased estimators. First, by using Definition 2, the estimation error of the true parameter vector can be decomposed w.r.t. the selected support and its complement, as follows:


where all the vectors in (16) have the same dimension, . By substituting (16) in the MSE matrix from (15), one obtains


The coherency property from (2) implies that

or, after a zero-padding approach, that


By substituting (18) in (III-B), we obtain that for any coherent estimator, the MSE satisfies


Now, by using the law of total expectation, it can be seen that for any selective unbiased estimator the second term on the r.h.s. of (III-B) satisfies


where the last equality is obtained by substituting the selection unbiasedness property from (11). Similarly, by using the law of total expectation, the last term of the r.h.s. of (III-B) satisfies


By substituting (III-B) and (21) in (III-B) we obtain that the MSE of a coherent and selective unbiased estimator is given by


That is, the MSE in (III-B) is the sum of the MSSE and an additional term that is only a function of the selection rule and is not affected by the estimator. Therefore, for post-model-selection estimation, the significant part of the MSE from the estimation point of view is the MSSE. Moreover, by deriving a CRB-type lower bound on the MSSE we also obtain a lower bound on the MSE of any coherent and selective unbiased estimator. Finally, by substituting (II) in the $i$th diagonal element of the MSE from (III-B), we obtain that the marginal MSE on a specific parameter, $\theta_i$, is given by
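The MSE decomposition can be checked empirically for a coherent estimator. In the assumed scalar example below, the selection-only penalty is the deselection probability times the squared true parameter, and the empirical MSE matches the MSSE plus this penalty:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma_eff, tau, trials = 0.8, 0.4, 0.5, 500_000

xbar = theta + sigma_eff * rng.standard_normal(trials)
sel = np.abs(xbar) > tau
theta_hat = np.where(sel, xbar, 0.0)       # coherent estimator

mse = np.mean((theta_hat - theta) ** 2)
msse = sel.mean() * np.mean((xbar[sel] - theta) ** 2)
penalty = (1.0 - sel.mean()) * theta ** 2  # selection-rule term only
print(mse, msse + penalty)  # agree up to floating-point rounding
```

The penalty term does not depend on the estimated values at all, which is exactly why only the MSSE part of the MSE can be controlled by the estimator.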



III-C Selective CRB

Obtaining the estimator with the minimum MSSE among all coherent and selective unbiased estimators is usually intractable. Thus, lower bounds on the MSSE and MSE of any coherent and selective unbiased estimator are useful for performance analysis and system design. In the following, a novel CRB for estimation after model selection, named here selective CRB, is derived. To this end, we define the following post-model-selection likelihood gradient vectors:


The vectors are all of the dimension of the true parameter vector. The marginal selective FIM is then defined as


. Next, we define the following regularity conditions:

  1. The post-model-selection likelihood gradient vectors exist, and the selective FIMs are well-defined, nonsingular, and nonzero matrices for every candidate model.

  2. The operations of integration w.r.t. $\mathbf{x}$ and differentiation w.r.t. $\boldsymbol{\theta}$ can be interchanged, as follows:



The following theorem presents the proposed selective CRB.

Theorem 1.

Let the regularity conditions C.1-C.2 be satisfied, and let $\hat{\boldsymbol{\theta}}$ be a coherent and selective unbiased estimator for the problem of estimating the true parameter vector, for a given selection rule. Then, the MSSE satisfies


where the selective CRB is given by


in which is the zero-one diagonal matrix, defined in (3), and is the th selective FIM, defined in (25). Furthermore, the MSE from (III-B) is bounded by


The proof appears in Appendix B. ∎

The MSE and MSSE bounds in Theorem 1 are matrix bounds. As such, they imply, in particular, the associated marginal bounds on the diagonal elements and on the trace. That is, by using the $i$th diagonal element of the MSSE from (II) and the bound from (27)-(28), we obtain the marginal selective CRB on the MSSE of the $i$th element of the parameter vector:


, where the last equality is obtained by substituting (3). Similarly, using the th element of the MSE from (III-B) and the matrix MSE bound from (1) implies the following marginal MSE bounds:


, where the selection probability of the $i$th parameter is defined in (8). Summing (III-C) over the elements of the parameter vector, we obtain the associated selective CRB on the trace MSE,


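The per-model selective FIM can be approximated by Monte Carlo from the score of the truncated (conditional) pdf. The sketch below does this for an assumed scalar Gaussian model with a threshold selection rule; it only illustrates the conditional-score computation and does not reproduce the exact weighting of the per-model terms in the matrix bound (28):

```python
import numpy as np
from math import erfc, exp, pi, sqrt

def Phi(z):   # standard normal CDF
    return 0.5 * erfc(-z / sqrt(2.0))

def phi(z):   # standard normal pdf
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

theta, sig, tau, trials = 1.0, 0.5, 0.6, 400_000
rng = np.random.default_rng(4)

def p_sel(t):    # Pr(|xbar| > tau) when xbar ~ N(t, sig^2)
    return (1.0 - Phi((tau - t) / sig)) + Phi((-tau - t) / sig)

def dp_sel(t):   # derivative of p_sel w.r.t. t, in closed form
    return (phi((tau - t) / sig) - phi((-tau - t) / sig)) / sig

xbar = theta + sig * rng.standard_normal(trials)
sel = np.abs(xbar) > tau

# Score of the conditional pdf f(x | selected; theta), i.e. the derivative
# of log[f(x; theta) / p_sel(theta)] on the selection region.
score = (xbar[sel] - theta) / sig**2 - dp_sel(theta) / p_sel(theta)
J_sel = np.mean(score**2)   # Monte Carlo approximation of the selective FIM
print(p_sel(theta), J_sel)
```

Note that the conditional score differs from the classical score by the term involving the derivative of the selection probability; dropping that term recovers the (oracle-style) FIM that ignores the selection stage.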
Finally, the following lemma presents an alternative formula for the selective FIM. This formula can be more tractable for some estimation problems.

Lemma 1.

Assuming that Conditions C.1-C.2 are satisfied in addition to the following regularity conditions:

  1. The second derivatives of the log-likelihood w.r.t. the elements of the parameter vector exist and are bounded and continuous.

  2. The selection probability is twice differentiable under the integral sign.

Then, the $m$th selective FIM in (25) satisfies


, .


The proof appears in Appendix C. ∎

III-D Special cases

III-D1 Single model

When only a single model is assumed, i.e. , then there is only one possible selection and thus, and for any selection rule. In this case, the SSE, selective unbiasedness, and selective CRB are reduced to the MSE, mean-unbiasedness, and CRB for estimating . Thus, the proposed paradigm generalizes the conventional non-Bayesian parameter estimation, which assumes a known generative model.

III-D2 Nested models and the relation to SMS-CRB

A model class is nested if smaller models are always special cases of larger models. Thus, in this special case we assume a model order selection problem in which the candidate supports are nested and one of them is the true support. In this case, it can be verified that


By substituting (34) in the selective CRB from (28), we obtain




The SMS-CRB bound from [1] was developed for the problem of model order selection with nested models under the assumptions:

  1. The order selection rule is such that, asymptotically, the probability of underestimating the order vanishes. Hence, asymptotically we allow only possible overestimation of the order by the considered model selection rule.

  2. The FIMs under the th candidate model,


    are nonsingular matrices for any .

Under Assumptions A.1-A.2, the SMS-CRB is given by [1]




It can be seen that the proposed selective CRB for nested models from (III-D2) has a structure similar to that of the SMS-CRB from (38). However, the two matrix bounds are of different dimensions: the proposed bound has the dimension of the true model. The proposed selective CRB accounts for both overestimation and underestimation of the model order, while the SMS-CRB accounts only for overestimation. In addition, the selective CRB is based on a different selective FIM for each model, while the SMS-CRB is based on averaging the FIMs of the different candidate models, as can be seen by comparing (36) and (39). Finally, our bound is not limited to the problem of model order selection and is shown in the simulations to be tighter than the SMS-CRB.

III-D3 Randomized selection rule

In this degenerate case, we consider a randomized selection rule that is independent of the data, where each model is selected with a fixed probability and the probabilities sum to one. Thus, the derivative of the log of the probability of selecting the $m$th model w.r.t. the unknown parameter vector vanishes, i.e.


In addition, since the selection is independent of the observation vector, $\mathbf{x}$, then