Parameter estimation is a fundamental problem in many areas of science and engineering. The goal is to recover an unknown deterministic parameter given realizations of random variables whose distribution depends on that parameter. An estimator is typically designed by minimizing some distance between the observed variables and their statistics, e.g., least squares, maximum likelihood estimation (MLE) or the method of moments (MoM) [kay1993fundamentals]. Performance is often measured in terms of the mean squared error (MSE) and the bias, both of which depend on the unknown parameter. MLE is asymptotically a Minimum Variance Unbiased Estimator (MVUE) for any value of the parameter. In the last decade, many works have suggested designing estimators using deep learning. These estimators offer a promising tradeoff between computational complexity and accuracy, and are useful when classical methods such as MLE are intractable. Learning based estimators directly minimize the average MSE with respect to a given dataset, so their performance may deteriorate for other values of the parameter. To close this gap, we propose Bias Constrained Estimators (BCE), in which we add a squared bias term to the learning loss in order to promote unbiasedness for any value of the parameter.
The starting point of this paper is the definitions of MSE and bias, which differ between fields. Classical statistics defines these metrics as functions of the unknown parameter (see, for example, Eq. 2.25 in [friedman2017elements], Chapter 2 in [kay1993fundamentals], or [lehmann1948completeness]):
The “M” in MSE and all the expectations in (1) are with respect to the distribution of the measurements, which is parameterized by the unknown. The unknowns are not integrated out, and the metrics are functions of the unknown parameter. The goal of parameter estimation is to minimize the MSE for any value of the parameter. This problem is ill-defined, as different estimators are better for different parameter values. Therefore, most of this field focuses on minimizing the MSE only among unbiased estimators, i.e., those that have zero bias for any value of the parameter. An estimator which is optimal in this sense is called an MVUE. As explained above, the popular MLE is indeed asymptotically MVUE.
Learning theory and Bayesian statistics also often use the acronym MSE, but with a slightly different definition (see, for example, Chapter 10 in [kay1993fundamentals] or Chapter 2.4 in [friedman2017elements]). Bayesian methods model the unknown as a random vector and use the expectation of the MSE with respect to its prior. To avoid confusion, we refer to this metric as BMSE:
Unlike the MSE, the BMSE is not a function of the unknown parameter. The BMSE has a well-defined minimizer, but that minimizer is optimal only on average with respect to the prior and therefore depends on the chosen prior.
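To make the distinction concrete, the following sketch contrasts the two metrics by Monte Carlo simulation. It is only an illustration under assumptions that are not taken from the paper: a scalar parameter `y`, a measurement `x` equal to `y` plus Gaussian noise, a hypothetical shrinkage estimator `0.5 * x`, and a zero-mean Gaussian prior for the BMSE.

```python
import numpy as np

def bias_and_mse(estimator, y, sigma, n_trials, rng):
    """Monte Carlo bias and MSE at a FIXED parameter value y (classical metrics)."""
    x = y + sigma * rng.standard_normal(n_trials)    # assumed model: x ~ N(y, sigma^2)
    err = estimator(x) - y
    return err.mean(), np.mean(err ** 2)

def bmse(estimator, prior_std, sigma, n_trials, rng):
    """Monte Carlo Bayesian MSE: y is also random, here y ~ N(0, prior_std^2)."""
    y = prior_std * rng.standard_normal(n_trials)
    x = y + sigma * rng.standard_normal(n_trials)
    return np.mean((estimator(x) - y) ** 2)

rng = np.random.default_rng(0)
shrink = lambda x: 0.5 * x                           # hypothetical biased estimator
for y in (0.0, 1.0, 3.0):
    b, m = bias_and_mse(shrink, y, 1.0, 200_000, rng)
    print(f"y={y}: bias={b:+.3f}, mse={m:.3f}")      # both change with y
print("BMSE:", bmse(shrink, 1.0, 1.0, 200_000, rng)) # one number, tied to the prior
```

For this toy estimator the bias and the MSE grow with the magnitude of `y`, whereas the BMSE is a single number that changes only if the assumed prior changes.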
In recent years, there has been a growing body of work on learning based approximations to MLE when it is computationally intractable [ongie2020deep, schlemper2018stochastic, dong2015image, diskin2017deep, yang2019snr, naimipour2020unfolded, izacard2019data, dreifuerst2021signalnet]. This trend opens a window of opportunity for low cost and accurate estimators. Yet, there is a subtle issue with the MSE definitions. Deep learning based algorithms minimize the BMSE and are not optimal for specific values of the parameter. Partial solutions to this issue include uninformative priors in special cases [jaynes1968prior] and minmax approaches that are often too pessimistic [eldar2005minimax]. To close this gap, BCE tries to minimize the BMSE while promoting unbiasedness. We prove that BCE converges to an MVUE. Namely, given a rich enough architecture and a sufficiently large number of samples, BCE is asymptotically unbiased and achieves the lowest possible MSE among unbiased estimators for any value of the parameter. Numerical experiments support the theory and show that BCE leads to near-MVUE estimators for all parameter values. Estimators based on BMSE alone are better on average but can perform poorly for specific parameter values.
To gain more intuition into BCE we consider the cases of linear networks and linear generative models. In these settings, BCE has a closed form solution which summarizes the story nicely. On the classical side, the MVUE in linear models is the Least Squares estimator (which is also the Gaussian MLE). On the Bayesian machine learning side, the minimum BMSE estimator is known as linear regression. By penalizing the bias, BCE allows us to learn MVUE-like regressors. The formulas clearly illustrate how BCE reduces the dependency on the training prior. Indeed, we also show that BCE can be used as a regularizer against overfitting in inverse problems where the signal to noise ratio (SNR) is large but the number of available labels is limited.
A second motivation for BCE arises in the context of averaging estimators at test time. In this setting, the goal is to learn a single network that will be applied to multiple inputs, and then take the average of its outputs as the final output. This is the case, for example, in a sensor network where multiple independent measurements of the same phenomenon are available. Each sensor applies the network locally and sends its estimate to a fusion center. The fusion center then uses the average of the estimates as the final estimate [li2009distributed]. Averaging at test time has also become standard in image classification, where the same network is applied to multiple crops at inference time [krizhevsky2012imagenet]. In such settings, unbiasedness of the local estimates is a goal on its own, as it is a necessary condition for asymptotic consistency of the global estimate. BCE enforces this condition and improves accuracy. We demonstrate this advantage in the context of test-time data augmentation on the CIFAR10 image classification dataset.
I-A Related Work
Deep learning for estimation: Deep learning based estimators have been proposed for inverse problems [ongie2020deep], robust regression [diskin2017deep], SNR estimation [yang2019snr], phase retrieval [naimipour2020unfolded], frequency estimation [izacard2019data, dreifuerst2021signalnet] and more. In contrast to standard data-driven machine learning, these networks are trained on synthetic data generated from a well specified probabilistic model. The unknown parameters do not follow any known prior, but a fictitious label prior is devised in order to generate a training set. The resulting estimators are optimal on average with respect to the fake prior, but have no guarantee otherwise. BCE directly addresses this limitation and makes it possible to achieve near optimal performance among all unbiased estimators for arbitrary parameters (independently of the chosen fake prior).
Fairness: BCE is intimately related to the topics of “fairness” and “out of distribution (OOD) generalization” in machine learning. The topics are related both in the terminology and in the solutions. Fair learning tries to eliminate biases in the training set and considers properties that need to be protected [agarwal2019fair]. OOD works introduce an additional “environment” variable and the goal is to train a model that will generalize well on new unseen environments [creager2021environment, maity2020there]. Among the proposed solutions are distributionally robust optimization [bagnell2005robust] which is a type of minmax estimator, as well as invariant risk minimization [arjovsky2019invariant] and calibration constraints [wald2021calibration], both of which are reminiscent of BCE. A main difference is that in our work the protected properties are the labels themselves. Another core difference is that in parameter estimation we assume full knowledge of the generative model, whereas the above works are purely data-driven and discriminative.
Averaging in test-time: Averaging of multiple estimates in test-time is used for different applications where centralized estimation is difficult or not possible. This includes distributed estimation due to large datasets [zhang2013communication], aggregation of sequential measurements in object tracking [ho2012bias], averaging scores on different crops and flips in image classification [krizhevsky2012imagenet, simonyan2014very] or detection confidence of the same object on different frames in video object detection [han2016seq]. A key idea for improving the accuracy of the averaged estimate is bias reduction of the local estimators [ho2012bias, luengo2015bias]. BCE gives the ability to use this concept in the framework of deep learning based estimators.
We use regular symbols for scalars, bold symbols for vectors and bold capital letters for matrices. Throughout the paper, we use conditional sampling: first, the parameters are chosen or randomly generated; then, measurements are generated conditioned on them. To simplify the expressions, we use the following notation for empirical means:
where is an arbitrary function.
II Bias Constrained Estimation
Consider a random vector whose probability distribution is parameterized by an unknown deterministic vector in some region. We are interested in the case in which the parameter is a deterministic variable without any prior distribution. We assume that the measurement depends on the parameter either via a likelihood function or via a generative model defined by a (possibly stochastic) known transformation. Given this information, our goal is to estimate the parameter given the measurement.
The recent deep learning revolution has led many to apply machine learning methods for estimation [dong2015image, ongie2020deep, gabrielli2017introducing, rudi2020parameter, dua2011artificial]. For this purpose, a representative dataset
is collected. Then, a class of possible estimators is chosen, and the EMMSE estimator is defined as the solution to:
The main contribution of this paper is an alternative BCE approach in which we minimize the empirical MSE along with an empirical squared bias regularization. For this purpose, we collect an enhanced dataset
where are all associated with the same . BCE is then defined as the solution to:
The objective function of the above optimization problem will be referred to as the "BCE loss". In the next sections, we show that BCE leads to better approximations of the MVUE/MLE than EMMSE, and in this sense resembles the classical methods of parameter estimation. We also show that it is beneficial when multiple local estimators are averaged together. Finally, it can also be used as a regularizer in a Bayesian setting when the prior is not known exactly.
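As an illustration of the loss just described (empirical MSE plus a weighted empirical squared bias, where the bias is computed per group of measurements sharing one parameter), here is a minimal NumPy sketch. The array layout, the function name `bce_loss`, the weight name `lam` and the exact normalization are assumptions made for this example, since the displayed objective is not reproduced in this text.

```python
import numpy as np

def bce_loss(y_hat, y, lam):
    """Empirical MSE plus lam times the empirical squared bias (assumed form).

    y_hat: (N, M, d) estimates for the M measurements of each of N parameters
    y:     (N, d)    parameter that generated each group of measurements
    lam:   non-negative weight of the squared-bias penalty
    """
    err = y_hat - y[:, None, :]                          # (N, M, d) estimation errors
    mse = np.mean(np.sum(err ** 2, axis=2))              # average over parameters and measurements
    group_bias = err.mean(axis=1)                        # (N, d) mean error per parameter
    sq_bias = np.mean(np.sum(group_bias ** 2, axis=1))   # average squared norm of the bias
    return mse + lam * sq_bias

# Two parameters, two measurements each: the first group's errors cancel, the second's do not.
y = np.array([[0.0], [2.0]])
y_hat = np.array([[[1.0], [-1.0]], [[3.0], [3.0]]])
print(bce_loss(y_hat, y, lam=0.0), bce_loss(y_hat, y, lam=2.0))
```

Setting `lam = 0` recovers the plain empirical MSE; a positive `lam` only penalizes errors that do not average out within a group.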
III BCE for Approximating MVUE
The main motivation for BCE is as an approximation to the MVUE (or an asymptotic MLE) when the latter are difficult to compute. To understand this idea, we recall the basic definitions in classical (non-Bayesian) parameter estimation. Briefly, a core challenge in this field is that the unknowns are deterministic parameters without any prior distributions. The bias, variance and MSE all depend on the unknown parameter and cannot be minimized directly for all parameter values simultaneously. A traditional workaround is to consider only unbiased estimators and seek an MVUE. It is well known that, asymptotically and under mild conditions, MLE is an MVUE. In practice, computing the MLE often involves difficult or intractable optimization problems, and there is ongoing research on efficient approximations. In this section, we show that BCE allows us to learn estimators with fixed and prescribed computational complexity that are also asymptotically MVUE.
Classical parameter estimation is not data-driven but based on a generative model. To enjoy the benefits of data-driven learning, it is common to generate synthetic data using the model and train a network on this data [gabrielli2017introducing, rudi2020parameter, dua2011artificial]. Specifically, the unknown parameter is assumed random with some fictitious prior that is positive over the entire parameter region. An artificial dataset is synthetically generated according to this prior and the true measurement model. Standard practice is to then learn the network using EMMSE in (5). This leads to an estimator which is optimal on average with respect to the fictitious prior. In contrast, we propose to generate synthetic data with multiple measurements for the same parameter value and use the BCE optimization in (7). The complete method is provided in Algorithm 1 below.
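The data-generation step described above can be sketched as follows. This is not a reproduction of Algorithm 1; the function name `make_enhanced_dataset`, its arguments and the toy prior and measurement model are hypothetical and chosen only to show the grouped structure of the data (several measurements per sampled parameter).

```python
import numpy as np

def make_enhanced_dataset(sample_prior, sample_measurements, n_params, n_per_param, rng):
    """Draw n_params parameters from a fictitious prior, then n_per_param
    measurements for EACH parameter from the known generative model."""
    ys = np.stack([sample_prior(rng) for _ in range(n_params)])             # (N, d_y)
    xs = np.stack([sample_measurements(y, n_per_param, rng) for y in ys])   # (N, M, d_x)
    return ys, xs

# Hypothetical toy model: y uniform on [1, 2]^2 and x = y + unit Gaussian noise.
rng = np.random.default_rng(0)
prior = lambda rng: rng.uniform(1.0, 2.0, size=2)
measure = lambda y, m, rng: y + rng.standard_normal((m, y.size))
ys, xs = make_enhanced_dataset(prior, measure, n_params=5, n_per_param=4, rng=rng)
print(ys.shape, xs.shape)
```

Keeping the measurements grouped by parameter is what allows a per-parameter bias to be estimated during training; a flat dataset of independent pairs would not provide that.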
We note that the BCE objective penalizes the average squared bias across the training set. Intuitively, to achieve MVUE performance we would need to penalize the bias for every possible value of the parameter, which would be much more difficult both statistically and computationally (e.g., using a minmax approach). Nonetheless, our analysis below proves that simply penalizing the average is sufficient.
To analyze BCE, we begin with two definitions.
An estimator is called unbiased if its bias is zero for all values of the unknown parameter.
An MVUE is an unbiased estimator that has a variance lower than or equal to that of any other unbiased estimator for all values of the unknown parameter.
An MVUE exists in simple models and MLE is asymptotically near MVUE [kay1993fundamentals]. The following theorem suggests that deep learning based estimators with BCE loss functions are also near MVUE.
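An MVUE exists within the hypothesis class.
The fake prior is non-singular, i.e., it is positive for every parameter value in the region.
The variance of the MVUE estimate under the fake prior is finite.
Then, BCE coincides with the MVUE for a sufficiently large weight on the squared bias term and sufficiently large numbers of training parameters and of measurements per parameter.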
Before proving the theorem, we note that the first assumption can be met by choosing a sufficiently expressive and universal hypothesis class. The second and third assumptions are technical and less important in practice.
The main idea of the proof is that, because the squared bias is non-negative, it is equal to zero for all parameter values if and only if its expectation over any non-singular prior is zero. Thus, taking the weight of the squared bias term to infinity enforces a solution that is unbiased for any value of the parameter. Among the unbiased solutions, only the MSE term of the BCE loss is left (and it is now equal to the variance), so the solution is the MVUE.
For sufficiently large and , (7) converges to its population form:
We now define Suppose that is the MVUE. Thus:
First, assume the BCE is biased for some :
for some . Thus
Since is finite, we get a contradiction
for sufficiently large , where is the BCE loss.
Second, assume BCE is an unbiased estimator. But by the definition of MVUE
Together, the BCE loss of the MVUE is smaller than the BCE whether it is biased or not.∎
To summarize, EMMSE is optimal on average with respect to its training set. Otherwise, it can be arbitrarily bad (see experiments in Section VI). On the other hand, BCE is optimal for any value of the parameter among the unbiased estimators. Thus, in problems where an efficient estimator exists, it achieves the Cramér-Rao Bound [kay1993fundamentals], making its performance predictable for any value of the parameter.
IV Linear BCE
In this section, we focus on linear BCE (LBCE). This results in a closed form solution which is computationally efficient and interesting to analyze. LBCE is defined by the class of linear estimators:
This class is easy to fit, store and implement. In fact, it has a simple closed form solution.
The LBCE is given by where
assuming the inverse exists. LBCE coincides with the classical linear MMSE estimator, where the true expectations are replaced by the empirical ones:
Note that the second term in the inverse is easy to compute by sampling from .
The proof is based on calculating the gradient of the BCE objective function and setting it to zero. The full proof is in the Supplementary Material. ∎
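The gradient argument can be illustrated numerically. The sketch below assumes the BCE objective has the form "empirical MSE plus `lam` times the empirical squared bias" and that the linear estimator is a matrix `A` applied to the measurement; under those assumptions, setting the gradient to zero gives the expression implemented in `linear_bce`. The names and array layout are hypothetical, and the formula is derived from the assumed objective rather than copied from the theorem.

```python
import numpy as np

def linear_bce(xs, ys, lam):
    """Closed-form minimizer of the assumed BCE objective over estimators A @ x.

    xs: (N, M, p) measurements, ys: (N, q) parameters, lam >= 0.
    Stationarity gives  A (S + lam * B) = (1 + lam) * R  with
      S = mean of x x^T over all samples,
      B = mean over parameters of xbar xbar^T (xbar = per-parameter mean of x),
      R = mean over parameters of y xbar^T.
    """
    n, m, _ = xs.shape
    x_bar = xs.mean(axis=1)                                       # (N, p)
    second_moment = np.einsum('ijk,ijl->kl', xs, xs) / (n * m)    # S
    mean_outer = x_bar.T @ x_bar / n                              # B
    cross = ys.T @ x_bar / n                                      # R, shape (q, p)
    return (1.0 + lam) * cross @ np.linalg.inv(second_moment + lam * mean_outer)

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 2))                                   # hypothetical measurement matrix
ys = rng.standard_normal((50, 2))
xs = ys[:, None, :] @ H.T + 0.1 * rng.standard_normal((50, 20, 3))
A = linear_bce(xs, ys, lam=10.0)
print(A.shape)
```

With `lam = 0` the expression reduces to ordinary least-squares regression of the parameters on the measurements, which is a convenient sanity check on the derivation.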
Theorem 2 is applicable to any likelihood model, yet it is interesting to consider the classical case of linear models [kay1993fundamentals]:
where is a noise vector such that . More precisely, it is sufficient to consider the case where the conditional mean of is linear in . Here, we focus on an “inverse problem” where we have access to a dataset of realizations of and a stochastic generator of . By choosing a large enough , the conditional means converge to their population values and .
Consider observations with a linear conditional mean
For , LBCE reduces to
and we assume all are invertible. Taking yields the seminal Weighted Least Squares (WLS) estimator. (Footnote 1: In fact, (16) is slightly more general than WLS, which assumes that does not depend on .)
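The limiting behavior can be checked numerically on a small example. The sketch below assumes a linear model with measurement matrix `H`, a noise covariance `C` that does not depend on the parameter, and the large-sample form of the linear BCE solution obtained from the assumed objective (empirical MSE plus `lam` times squared bias). The matrices `H`, `C` and `S` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical linear model x = H y + n with noise covariance C (assumed constant in y).
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C = np.diag([1.0, 2.0, 4.0])
S = np.eye(2)    # assumed second moment of the training parameters

def population_lbce(lam):
    """Large-sample linear BCE matrix under the assumed objective and model."""
    signal = H @ S @ H.T                       # second moment of the conditional means H y
    return (1.0 + lam) * S @ H.T @ np.linalg.inv(signal + C + lam * signal)

C_inv = np.linalg.inv(C)
wls = np.linalg.inv(H.T @ C_inv @ H) @ H.T @ C_inv   # classical WLS matrix

def gap(lam):
    """Largest entrywise difference between the linear BCE matrix and WLS."""
    return np.max(np.abs(population_lbce(lam) - wls))

for lam in (0.0, 10.0, 1e4):
    print(f"lam={lam:g}: max |A - WLS| = {gap(lam):.2e}")
```

In this example the gap shrinks as `lam` grows, and the dependence on the training second moment `S` fades with it, which is the sense in which a large penalty removes the influence of the training prior.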
LBCE in (14) clearly shows that increasing the weight of the bias penalty reduces the dependence on the specific parameter values used in training. In some sense, BCE acts as a regularizer which is useful in finite sample settings. To gain more intuition on this effect, consider the following toy problem:
LBCE reduces to
where . Assume that an estimator is learned using random samples from the prior where is an SNR parameter. Let BMSE denote the Bayesian MSE defined in (2) but where the expectation is also with respect to random sampling in training (details and proof in the Supplementary Material). Then, we have the following result.
For a large enough , there exist such that yields a smaller BMSE than that of .
Intuitively, the prior is less important in high SNR, and it makes sense to use BCE and be less sensitive to it. On the other hand, in low SNR, the prior is important and is useful even when its estimate is erroneous.
For completeness, it is interesting to compare BCE to classical regularizers. The most common approach relies on an ℓ2 penalty, also known as ridge regression [shalev2014understanding]. Surprisingly, in the above inverse problem setting with finite and , ridge regression does not help and even increases the BMSE. Ridge regression involves the following optimization problem:
and its solution is
In high SNR, we already showed that BCE is beneficial and to get this effect one must resort to a negative ridge regularization which is highly unstable. Interestingly, this is reminiscent of known results in linear models with uncertainty where Total Least Squares (TLS) coincides with a negative ridge regularization [wiesel2008linear].
V BCE for Averaging
In this section, we consider a different motivation for BCE, where unbiasedness is a goal on its own. Specifically, BCEs are advantageous in scenarios where multiple estimators are combined for improved performance. Typical examples include sensor networks where each sensor provides a local estimate of the unknown, or computer vision applications where the same network is applied to different crops of a given image to improve accuracy [krizhevsky2012imagenet, szegedy2015going].
The underlying assumption in BCE for averaging is that we have access, at both training and test time, to multiple measurements associated with the same parameter. This assumption holds in learning with synthetic data (e.g., Algorithm 1), or with real world data that has multiple views or augmentations. In any case, the data structure allows us to learn a single network which is applied to each measurement and whose outputs are then averaged, as summarized in Algorithm 2. The following theorem then proves that BCE results in asymptotic consistency.
Let be the solution to BCE in (7).
Define the global estimator as
where is the number of local estimators at test time.
Consider the case in which independent and identically distributed (i.i.d.) measurements of the same are available. Thus BCE with a sufficiently large , and , allows consistent estimation as increases if an unbiased estimator with finite variance exists within the hypothesis class.
Following the proof of Theorem 1, BCE with a sufficiently large weight on the squared bias term and sufficiently many training samples results in an unbiased estimator, if one exists within the hypothesis class. The global metrics satisfy:
Thus, the global variance decreases as the number of local estimators grows, whereas the global bias remains constant; for an unbiased local estimator it is equal to zero. ∎
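The effect of averaging on the two metrics can be seen in a small simulation. The sketch assumes scalar measurements equal to the parameter plus unit-variance Gaussian noise and compares an unbiased local estimator with a hypothetical shrunk one; the function name and all numerical values are illustrative only.

```python
import numpy as np

def global_bias_and_var(local_estimator, y, k, n_trials, rng):
    """Monte Carlo bias and variance of the average of k local estimates of y."""
    x = y + rng.standard_normal((n_trials, k))       # k i.i.d. measurements per trial
    global_est = local_estimator(x).mean(axis=1)     # average of the local estimates
    return global_est.mean() - y, global_est.var()

rng = np.random.default_rng(0)
unbiased = lambda x: x
shrunk = lambda x: 0.8 * x                           # hypothetical biased local estimator
for k in (1, 10, 100):
    print(k,
          global_bias_and_var(unbiased, 2.0, k, 20_000, rng),
          global_bias_and_var(shrunk, 2.0, k, 20_000, rng))
```

In both cases the variance of the average falls roughly in proportion to the number of local estimates, but the bias of the shrunk estimator does not shrink with it, so its error stops improving once the variance is small.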
In this section, we present the results of numerical experiments. We focus on the main ideas and conclusions; all the implementation details are available in the Supplementary Material.
VI-A SNR Estimation
Our first experiment addresses a non-convex estimation problem of a single unknown. The unknown is scalar and therefore we can easily compute the MLE and visualize its performance. Specifically, we consider non-data-aided SNR estimation [alagha2001cramer]. The received signal is
are equi-probable binary symbols, is an unknown signal and is a white Gaussian noise with unknown variance denoted by and . The goal is to estimate the SNR defined as . Different estimators have been proposed for this problem, including MLE [wiesel2002non] and the method of moments [pauluzzi2000comparison]. For our purposes, we train a network based on synthetic data using EMMSE and BCE.
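For readers who want to experiment with this setting, the sketch below generates data from one plausible reading of the model (a real scalar amplitude `h` multiplied by equiprobable symbols of value plus or minus one, with additive Gaussian noise) and applies a simple moment-based SNR estimate. The scalar real-valued model, the function names and the specific moment estimator are assumptions for illustration and are not claimed to match the baselines used in the experiments.

```python
import numpy as np

def simulate_received(h, noise_var, n_samples, rng):
    """x_i = a_i * h + w_i with a_i = +/-1 equiprobable and w_i ~ N(0, noise_var)."""
    symbols = rng.choice([-1.0, 1.0], size=n_samples)
    return symbols * h + np.sqrt(noise_var) * rng.standard_normal(n_samples)

def snr_m2m4(x):
    """Moment-based SNR estimate for the real-valued model above.

    With s the noise variance: E[x^2] = h^2 + s and E[x^4] = h^4 + 6 h^2 s + 3 s^2,
    hence 3 E[x^2]^2 - E[x^4] = 2 h^4.
    """
    m2, m4 = np.mean(x ** 2), np.mean(x ** 4)
    power = np.sqrt(max(3.0 * m2 ** 2 - m4, 0.0) / 2.0)   # estimate of h^2
    return power / max(m2 - power, 1e-12)                  # h^2 / noise variance

rng = np.random.default_rng(0)
x = simulate_received(h=2.0, noise_var=1.0, n_samples=100_000, rng=rng)
print(snr_m2m4(x))   # the true SNR in this toy setup is h**2 / noise_var = 4
```

The symbols are never observed by the estimator, which is what makes the problem non-data-aided: only even moments of the received samples carry usable information.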
Figure 1 compares the MSE and the bias of the different estimators as a function of the SNR. It is evident that BCE is a better approximation of MLE than EMMSE. EMMSE is very biased towards a narrow regime of the SNR. This is because the MSE scales as the square of the SNR and the average MSE loss is dominated by the large MSE examples. For completeness, we also plot the MSE in terms of inverse SNR:
Functional invariance is a well known and attractive property of MLE. The figure shows that both MLE and BCE are robust to the inverse transformation, whereas EMMSE is unstable and performs poorly in low SNR.
VI-B Structured Covariance Estimation
Our second experiment considers a more interesting high dimensional structured covariance estimation. Specifically, we consider the estimation of a sparse covariance matrix [chaudhuri2007estimation]. The measurement model is
are unknown parameters. We train a neural network using both EMMSE and BCE. Computing the MLE is non-trivial in these settings and therefore we compare the performance to the theoretical asymptotic variance defined by the Cramér-Rao Bound (CRB) [kay1993fundamentals]:
where is the Fisher Information Matrix (FIM)
The CRB depends on the specific values of the unknown parameters. Therefore, we take random realizations and provide scatter plots ordered according to the CRB value.
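As a concrete illustration of how such a bound can be evaluated, the sketch below computes the CRB for the parameters of a zero-mean Gaussian covariance using the standard Fisher information expression for that family. The particular two-parameter covariance structure in the usage example is hypothetical and is not the structure used in the experiment.

```python
import numpy as np

def gaussian_covariance_crb(cov, cov_derivs, n_samples):
    """CRB matrix for the parameters of a zero-mean Gaussian covariance.

    cov:        (p, p) covariance at the parameter value of interest
    cov_derivs: list of (p, p) derivatives of the covariance w.r.t. each parameter
    n_samples:  number of i.i.d. observations
    """
    cov_inv = np.linalg.inv(cov)
    k = len(cov_derivs)
    fim = np.empty((k, k))
    for a in range(k):
        for b in range(k):
            # FIM entry: (n/2) * trace(C^-1 dC/da C^-1 dC/db)
            fim[a, b] = 0.5 * n_samples * np.trace(
                cov_inv @ cov_derivs[a] @ cov_inv @ cov_derivs[b])
    return np.linalg.inv(fim)

# Hypothetical structure: theta[0] on the diagonal, theta[1] on the first off-diagonals.
theta = np.array([2.0, 0.5])
basis = [np.eye(3), np.eye(3, k=1) + np.eye(3, k=-1)]
cov = theta[0] * basis[0] + theta[1] * basis[1]
crb = gaussian_covariance_crb(cov, basis, n_samples=100)
print(np.trace(crb))   # lower bound on the total variance of any unbiased estimator
```

Because the covariance enters through its inverse, the bound changes with the parameter value, which is why a scatter plot over random realizations is a natural way to present it.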
Figure 2 presents the results of two simulations. In the first, we generate data according to the training distribution . As expected, EMMSE which was optimized for this distribution provides the best MSEs. In the second, we generate data according to a different distribution . Here the performance of EMMSE significantly deteriorates. In both cases, BCE is near MVUE and provides MSEs close to the CRB while ignoring the prior distribution used in training.
VI-C BCE as Regularization
Our third experiment considers the use of LBCE as a regularizer in a linear model with additive Gaussian noise. Theorem 4 analyzed this scalar case, and here we address the high dimensional case using numerical simulations. The underlying model is
Dimensions are with and with , and is rank deficient. We assume a limited number of training samples , but full knowledge of the measurement and noise models that allow
. We train three linear regressors using the EMMSE loss, EMMSE plus Ridge loss and BCE. Optimal regularization hyperparameters are chosen using a large validation set.
Figure 3 shows the resulting MSEs as a function of . As predicted by the theory, BCE significantly improves the results when is small. The optimal ridge parameter is negative and provides a negligible gain.
VI-D Image Classification with Soft Labels
Our fourth experiment considers BCE for averaging in the context of image classification. We consider the popular CIFAR10 dataset. This paper focuses on regression rather than classification. Therefore we consider soft labels as proposed in the knowledge distillation technique [hinton2015distilling]. The soft labels are obtained using a strong “teacher” network from [yu2018deep]. To exploit the benefits of averaging we rely on data augmentation in the training and test phases [krizhevsky2012imagenet]. For augmentation, we use random cropping and flipping. We train two small “student” convolution networks with identical architectures using the EMMSE loss and the BCE loss. More precisely, following other distillation works, we also add a hard valued cross entropy term to the losses.
Figure 4 compares the accuracy of EMMSE vs BCE as a function of the number of test-time data augmentation crops. It can be seen that while on a single crop EMMSE achieves a slightly better accuracy, BCE achieves better results when averaged over many crops in the test phase.
In recent years, deep neural networks (DNNs) have been replacing classical algorithms for estimation in many fields. While DNNs give remarkable improvements in performance "on average", in some situations one would prefer to use classical frequentist algorithms that have guarantees on their performance for any value of the unknown parameters. In this work, we show that when an efficient estimator exists, deep neural networks can be used to learn it using a bias constrained loss, provided that the architecture is expressive enough. Future work includes extending the approach to objective functions other than the MSE and applying BCE to real world problems.
This research was partially supported by ISF grant 2672/21
X Proofs of Theorems
X-A Proof of Theorem 2
We insert in (8) and get:
Now we take the derivative with respect to and compare to zero:
Using , yields
completing the proof.
X-B Proof of Theorem 3
We assume that is positive definite and we can define its square root, which is also positive definite. In addition, denote and thus and =0. Thus, we insert in (14) and get:
completing the proof.
X-C Proof of Theorem 4
In this case, (14) of the main paper becomes:
where and . The estimator and its error depend on the training set only through :
The proof continues by showing that for small enough the derivative of the expectation of (31) over with respect to in is negative, and there exist some such that
The derivative of (31) with respect to is:
In particular, in high SNR we have
Therefore if we take the expectation over training set:
for all distributions of positive variable with mean 1, and the equality is only if is deterministic and equal to 1, completing the proof.
XI Implementation Details of the Experiments
XI-A SNR Estimation
We train a simple fully connected model with one hidden layer. First, the data is normalized by the second moment, and then the input is augmented by hand crafted features: the fourth and sixth moments and different functions of them. We train the network using synthetic data in which the mean is sampled uniformly in and then the SNR is sampled uniformly in (which corresponds to ). The data is generated independently in each batch. We trained the model using the standard MSE loss, and using BCE with . We use the ADAM solver with a multistep learning rate scheduler. We use batch sizes of and as defined in (8).
XI-B Structured Covariance Estimation
We train a neural network for estimating the covariance matrix with the following architecture. First, the sample covariance is calculated from the input, as it is the sufficient statistic for the covariance in a Gaussian distribution. In addition, a vector is initialized to and a vector is initialized to zero. Next, a fully connected network with one hidden layer and a concatenated input of and is used to predict a modification and for the vectors and respectively, such that = and similarly for . Then an updated covariance is calculated using and equation (26). The process is repeated (with the updated and as an input to the fully connected network) for 50 iterations. The final covariance is the output of the network. The network is trained on synthetic data in which the parameters of the covariance are generated uniformly in their valid region and then different measurements are generated from a normal distribution with zero mean and the generated covariance. We use an ADAM solver with a "ReduceLROnPlateau" scheduler that monitors the desired loss (BCE loss for BCE and MSE loss for EMMSE) on a synthetic validation set. We trained the model using the standard MSE loss, and using BCE with . We use batch sizes of and as defined in (8).
XI-C BCE as Regularization
The underlying model in the experiment is:
Dimensions are with and with . The covariance matrix of the noise is non-diagonal and satisfies =1. The covariance matrix of the prior of the parameter is non-diagonal, with five of its eigenvalues equal to and the others equal to ; that is, it is approximately low rank. In order to take the expectation over the training set, the experiment is repeated 500 times with independently sampled training sets (that is, 500 different sets of samples from the prior). We assume a limited number of training samples , but full knowledge of the measurement model. Thus we use (14) for BCE (and for EMMSE as a special case) and:
The best was found using a grid search over 100 values of in for BCE and in for Ridge (as the optimal parameter was found experimentally to be in these regions in the above setup).
XI-D Image Classification with Soft Labels
We generate soft labels using a "teacher" network. Specifically, we work on the CIFAR10 dataset, and use as a teacher a DLA architecture [yu2018deep], which achieves 0.95 accuracy. Our student network is a very small convolutional neural network (CNN) with two convolutional layers with 16 and 32 channels respectively and a single fully connected layer. We now use the following notation: the original dataset is a set of triplets of images , a one-hot vector of hard labels and a vector of soft labels . In the augmented data, different images are generated randomly from each original image using random cropping and flipping. The output of the network for the class is denoted by and the vector of "probabilities" is defined by:
where is a "temperature" that controls the softness of the probabilities. We define the following loss functions: