1 Introduction
In many areas of science and engineering, we are faced with the task of estimating or inferring certain parameters from heterogeneous data sources. These data sources can provide data of different quality for the estimation task: the data from some sources can be noisier or more distorted than the data from others. For example, in sensor networks monitoring the trajectories of moving targets, the sensing data from different sensors can have different signal-to-noise ratios, depending on factors such as the distances between the moving target and the sensors, and the precision of the sensors. As another example, in political polling, the polling data can come from diverse demographic groups, and the data from different demographic groups can have different levels of noise and distortion for a particular questionnaire.
Even within a single data source, one can obtain data of different quality for inference, through different sensing modalities with different costs. For example, when we use sensors with higher precision to sense data from a given data source, we obtain data of higher quality, but at a higher sensing cost. Moreover, when we estimate the parameters of interest, we often operate under sensing cost constraints; namely, the total cost spent on obtaining data from heterogeneous data sources cannot exceed a certain threshold.
This raises a natural question: “Under given cost constraints, how do we perform optimal estimation of the parameters of interest from heterogeneous data?” By “optimal estimation”, we mean minimizing the estimation error in terms of a chosen performance metric, such as the mean squared error.
In this paper, we propose a generic framework to answer this question, namely to optimally estimate the parameters of interest from heterogeneous data under cost constraints. In particular, we consider how to optimally allocate sensing resources to obtain data of heterogeneous quality from heterogeneous data sources, so as to achieve the highest fidelity in parameter estimation.
Our research is partially motivated by the results of the United States presidential election of November 8, 2016, and by the pre-election polling results, which mostly contradicted the actual election results. Donald Trump was lagging behind in nearly all opinion polls leading up to the election, but he surprisingly won with 306 electoral votes versus Hillary Clinton’s 232 electoral votes (state-by-state tallies, without accounting for faithless electors). Right before the election, some polling analysts were very confident that Hillary Clinton would win the presidency: there was near unanimity among forecasters in predicting a Clinton victory. A notable example is neuroscientist and polling analyst Sam Wang, one of the founders of the Princeton Election Consortium, whose Bayesian model predicted a greater than 99% chance of a Clinton victory [2, 3], as seen in his election morning blog post titled “Final Projections: Clinton 323 EV, 51 Democratic Senate seats, GOP House” [4, 1]. As an anecdote, Dr. Wang was so confident in his prediction that, before the election, he promised to eat a bug if Donald J. Trump won more than 240 Electoral College votes, a promise he later kept by eating a cricket with honey on CNN [10].
The contrast between most poll predictions and the actual results of the 2016 presidential election was so dramatic that it surprised and puzzled many pollsters. In fact, the actual election results differed from the polling results noticeably, sometimes dramatically, both nationally and statewise. Donald Trump performed better than expected in the fiercely competitive Midwestern battleground states where the polls predicted he had an advantage, such as Iowa, Ohio, and Missouri. Trump also won Wisconsin, Michigan and Pennsylvania, which were considered part of the blue firewall. For example, consider the final polling average published by Real Clear Politics on November 7, 2016. In Wisconsin, the poll average showed Clinton with a +6.5 advantage over Trump, while the actual election result showed Trump with a +0.7 advantage over Clinton; in Michigan, the poll average showed Clinton with a +3.4 advantage, while the actual result showed Trump with a +0.3 advantage; and in Pennsylvania, the poll average showed Clinton with a +1.9 advantage, while the actual result showed Trump with a +0.3 advantage. In Iowa, the poll average showed Trump with a +3.9 advantage over Clinton, while in the actual result Trump’s advantage increased to +9.5; in Missouri, the poll average showed Trump with a +9.5 advantage, while in the actual result his advantage increased to +18.5; and in Minnesota, the poll average showed Clinton with a +6.2 advantage, while in the actual result her advantage shrank to +1.5. While Trump outperformed the polls in most of the states where the polling results differed clearly from the actual voting results, Clinton also outperformed the polls in a small number of states, such as Nevada, Colorado, and New Mexico.
Figure 1, cited from [5], shows the difference between the final polling average published by Real Clear Politics [6] on November 7, 2016, and the final voting results in 16 states. It is worth mentioning that, among dozens of polls, the UPI/CVoter poll and the University of Southern California/Los Angeles Times poll were the only two that often predicted a Trump popular vote victory or showed a nearly tied election.
The dramatic and consistent differences between the polling results and the actual election returns, both nationally and statewise, cannot be explained by the “margins of error” of these polls. This indicates that there were significant and systematic errors in the polling results. The contrast is also alarming considering that election forecasters already had access to big data and had applied advanced big data analytics techniques. It is therefore imperative to understand why the polling-based predictions were so far off.
Given the significance of United States presidential elections, this raises the following important questions: 1) why were most opinion polls inaccurate in 2016? and 2) how can the accuracy of opinion polls be improved? While there are many possible explanations for the inaccuracy of the opinion polls for the 2016 presidential election, in this paper we examine the possibility that the opinion data collected in polling were distorted and noisy, with noise and distortion levels that were heterogeneous across demographic groups. For example, supporters of a candidate might be embarrassed to tell the truth, and thus more likely to lie in polling, when their friends and/or local/national news media are vocal supporters of the opposing candidate.
In this paper, we study and explain the inaccuracy of most opinion polls through the lens of information theory. We first propose a general framework of parameter estimation in information science, called clean sensing (polling), which performs optimal parameter estimation under sensing cost constraints, from heterogeneous and potentially distorted data sources. We then cast opinion polling as a problem of parameter estimation from potentially distorted heterogeneous data sources, and derive the optimal polling strategy using heterogeneous and possibly distorted data under cost constraints. Our results show that a larger number of data samples does not necessarily lead to better polling accuracy. Instead, the optimal sensing (polling) strategy optimally allocates sensing resources over heterogeneous data sources and, for each data source, strikes an optimal balance between the quality of the data samples and the number of data samples.
As a byproduct of this research, we derive a series of new lower bounds on the mean squared errors of unbiased and biased parameter estimators, in the general setting of parameter estimation. These new lower bounds can be tighter than the classical Cramér-Rao bound (CRB) and Chapman-Robbins bound. Our derivations proceed by studying the Lagrange dual problems of certain convex programs, and the classical Cramér-Rao bound and Chapman-Robbins bound follow naturally from our results as special cases of these convex programs.
The rest of this paper is organized as follows. In Section 2, we introduce the problem formulation of parameter estimation using potentially distorted data from heterogeneous data sources, namely the problem of clean sensing. In Section 4, we cast finding the optimal sensing strategies in parameter estimation using heterogeneous data sources as explicit mathematical optimization problems. In Section 5, we derive asymptotically optimal solutions to the optimization problems of finding the optimal sensing strategies for clean sensing. In Section 6, we consider clean sensing for the special case of Gaussian random variables. In Section 7, we cast the problem of opinion polling in political elections as a problem of clean sensing, and give a possible explanation for why the polling for the 2016 presidential election was not accurate. We also derive the optimal polling strategies under potentially distorted data to achieve the smallest polling error. In Section 8, we derive new lower bounds on the mean-squared errors of parameter estimators, which can be tighter than the classical Cramér-Rao bound and Chapman-Robbins bound. Our derivations are via solving the Lagrange dual problems of certain convex programs, and are of independent interest.
2 Problem Formulation
In this section, we introduce the problem formulation, and model setup for clean sensing (polling). Suppose we want to estimate a parameter
(which can be a scalar or a vector), or a function of the parameter, say,
. We assume that there are heterogeneous data sources, where is a positive integer. From each of heterogeneous data sources, say, the th data source, we obtain samples, where . We denote the samples from the th data source as , , …, and , and these samples take values from domain . We assume that cost was spent on acquiring the th sample from the th data source, where , and . We assume that we take action in sampling from the heterogeneous data sources. We assume that under action , the samples , , …, , , , , …, , , …, , , …, follow distribution . This distribution depends on the parameter , and the action .
In this paper, for simplicity, we assume that under action , the samples across data sources are independent, and samples from a single data source are independent. Hence we can express the distribution as follows:
where
is the probability distribution of
, namely the th sample from the th data source, with and . We can of course also extend the analysis to more general cases where the samples are not independent.
In this paper, we consider the following problem: under a budget on the total cost for sensing (polling), what is the optimal action to guarantee the most accurate estimation of or its function? More specifically, determining the sampling action means determining the number of data samples from each data source, and determining the cost spent on obtaining each data sample from each data source. We consider the non-sequential setting, where the sampling action is fixed before the actual sampling takes place. In this paper, we use the mean-squared error to quantify the accuracy of the estimation of or a function of . We can also extend this work to use performance metrics other than the mean-squared error, such as those concerning the distribution of the estimation error or the tail bound on the estimation error.
3 Related Works
In [9], the authors considered controlled sensing for sequential hypothesis testing with the freedom of selecting sensing actions with different costs for best detection performance. Compared with [9], our work differs in three aspects: 1) in [9], the authors considered a sequential hypothesis testing problem, while in this paper we consider a parameter estimation problem; 2) in [9], the authors worked with samples from a single data source (possibly having different distributions under different sampling actions), while in this paper we consider heterogeneous data sources where samples follow distributions determined not only by the sampling actions but also by the types of data sources; 3) in this paper, we consider continuous-valued sampling actions whose costs can take continuous values, compared with discrete-valued sampling actions with discrete-valued costs in [9].
In [11], the authors considered designs of experiments for sequential estimation of a function of several parameters, using data from heterogeneous sources (types of experiments), with a budget constraint on the total number of samples. In [11], each data sample from each data source always requires a unit cost to obtain, and the observer does not have the freedom of controlling the data quality of any individual sample. In contrast, in this paper, each data sample can require a different or variable cost to obtain, depending on the type of data source involved and the specific sampling action used to obtain that data sample. Moreover, in this paper, the quality of each data sample depends both on the type of data source and on the sampling action used to obtain that data sample. For example, in this paper, we have the freedom of optimizing not only the number of data samples for each data source, but also the effort (cost) spent on obtaining a particular data sample from a particular data source (depending on the cost-quality tradeoff of that data source), while in [11], one only has the freedom of choosing the number of data samples from each data source. In [11], each data source (type of experiment) reveals information about only one element of the parameter vector ; while in this work, a sample from a data source can possibly reveal information about several elements of the parameter vector .
4 Clean Sensing: Optimal Estimation Using Heterogeneous Data under Cost Constraints
In this section, we introduce the framework of clean sensing, namely optimal estimation using heterogeneous data under cost constraints. As explained in Section 2, we assume that we spend cost on acquiring the th sample from the th data source, and, given the costs spent on acquiring each data sample, all the samples are independent of each other. Our goal is to optimally allocate the sensing resources to each data source and each data sample, in order to minimize the Cramér-Rao bound on the mean-squared error of parameter estimation. One can also extend this framework to minimize other types of bounds on the mean-squared error of parameter estimation.
We consider a parameter column vector denoted by
Under the parameter vector
, we assume that probability density function of an observation sample is given by
. Let be an estimator of any vector function of parameters, (note that we can also consider cases where is of a different dimension), and denote its expectation vector by .
The Fisher information matrix is a matrix with its element in the th row and th column defined as
We note that the Cramér-Rao bound on the estimation error relies on some regularity conditions on the probability density function, , and the estimator . For a scalar parameter , the Cramér-Rao bound depends on two weak regularity conditions on the probability density function, , and the estimator . The first condition is that the Fisher information is always defined; namely, for all such that ,
exists, and is finite. The second condition is that the operations of integration with respect to and differentiation with respect to can be interchanged in the expectation of , namely,
whenever the right-hand side is finite.
For the multivariate parameter vector , the Cramér-Rao bound then states that the covariance matrix of satisfies
where is the Jacobian matrix with the element in the th row and th column as .
We define the Fisher information matrix of from the th sample of the th data source as (sometimes when the context is clear, we abbreviate it as ). Thus, we have the Fisher information matrix of for data from the th data source as
since the Fisher information matrices are additive for independent observations of the same parameters. Since we assume that the samples from different data sources are also independently generated, the Fisher information matrix of considering all the data sources is given by
which we abbreviate as in the following derivations.
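The additivity used here can be checked numerically. The following is a minimal sketch, assuming (purely for illustration) that every sample is a Gaussian observation of a scalar mean with a known, source-dependent variance, so that a single sample carries Fisher information equal to the reciprocal of its variance. The function names and the example variances are hypothetical and are not taken from the paper.

```python
import numpy as np


def total_fisher_information(variances_by_source):
    """Sum per-sample Fisher information over all samples and all sources.

    Assumption (illustrative): each sample is N(theta, sigma^2) with known
    sigma^2, so one sample contributes Fisher information 1 / sigma^2.
    """
    per_source = [np.sum(1.0 / np.asarray(v, dtype=float))
                  for v in variances_by_source]
    return float(np.sum(per_source))


def inverse_variance_estimator_variance(variances_by_source):
    """Variance of the inverse-variance weighted mean of all samples."""
    all_vars = np.concatenate([np.asarray(v, dtype=float)
                               for v in variances_by_source])
    weights = (1.0 / all_vars) / np.sum(1.0 / all_vars)
    return float(np.sum(weights ** 2 * all_vars))


# Hypothetical example: two sources with different noise levels.
variances = [[1.0, 1.0, 1.0], [4.0, 4.0]]
fisher_total = total_fisher_information(variances)  # 3 * 1 + 2 * 0.25
crb = 1.0 / fisher_total
est_var = inverse_variance_estimator_variance(variances)
```

In this Gaussian special case, the variance of the inverse-variance weighted mean coincides with the reciprocal of the summed Fisher information, so the bound built from the summed information is attained.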
Without loss of generality, we consider estimating a scalar function of . To estimate
, we can lower bound the variance of the estimate of
by
where is a scalar. If the estimate of is unbiased, namely , we can lower bound the mean-squared error of the estimate of by
The goal of clean sensing is to minimize the error of parameter estimation from heterogeneous data sources. If we require the estimate of a function of parameter to be unbiased, one way to minimize the mean-squared error of the estimate is to design sensing strategies that minimize the Cramér-Rao lower bound on the estimate. Mathematically, we are trying to solve the following optimization problem:
(1)  
(2)  
(3)  
(4)  
(5) 
where is the total budget, and is the set of nonnegative integers.
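To make the objective and the budget constraint concrete, here is a minimal sketch of how one candidate sensing action could be evaluated. It assumes the standard Cramér-Rao form for an unbiased estimator of a scalar function, namely the quadratic form of the function's gradient with the inverse of the total Fisher information matrix. The per-sample information models `info_fns`, the function names, and the numbers are hypothetical placeholders rather than the paper's notation.

```python
import numpy as np


def crb_scalar(grad, fisher):
    """Cramér-Rao bound grad^T F^{-1} grad for a scalar function."""
    grad = np.asarray(grad, dtype=float)
    fisher = np.asarray(fisher, dtype=float)
    return float(grad @ np.linalg.solve(fisher, grad))


def evaluate_action(costs_by_source, info_fns, grad, budget):
    """Return (feasible, crb) for one sensing action.

    costs_by_source[i] lists the cost spent on each sample of source i;
    info_fns[i](c) returns the Fisher information matrix of one sample
    from source i acquired at cost c (hypothetical model).
    Assumes at least one sample and an invertible total Fisher matrix.
    """
    total_cost = sum(sum(costs) for costs in costs_by_source)
    # Independent samples: the total Fisher information is the sum.
    fisher = sum(info_fns[i](c)
                 for i, costs in enumerate(costs_by_source)
                 for c in costs)
    return total_cost <= budget, crb_scalar(grad, fisher)


# Hypothetical two-parameter example: source i only informs theta_i,
# and its information grows linearly with the cost spent per sample.
info_fns = [lambda c: np.diag([c, 0.0]), lambda c: np.diag([0.0, 2.0 * c])]
feasible, bound = evaluate_action([[1.0, 1.0], [0.5, 0.5]], info_fns,
                                  grad=[1.0, 1.0], budget=3.0)
```

A search over sensing actions would call such an evaluation repeatedly, keeping only feasible actions and choosing the one with the smallest bound.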
We notice that this optimization depends on knowledge of the parameter vector . However, before sampling begins, we have limited knowledge of the parameters. Depending on the goals of estimation, we can change the objective function of the optimization problem (5) using the limited knowledge of the parameters. For example, if we know in advance a prior distribution of
, we would like to minimize the expectation of the Cramér-Rao lower bound for an unbiased estimator over the prior distribution. Mathematically, we formulate the corresponding optimization problem as
(6)  
(7)  
(8)  
(9)  
(10) 
where is the total budget. If we instead know in advance that the parameter to be estimated belongs to a set , we can also try to minimize the worst-case Cramér-Rao lower bound for an unbiased estimator over . We can thus write the minimax optimization problem as
(11)  
(12)  
(13)  
(14)  
(15) 
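The prior-averaged and worst-case formulations differ only in how the bound is aggregated over the unknown parameter. The sketch below illustrates the two aggregations, assuming a hypothetical Bernoulli polling model with n undistorted samples, for which the Cramér-Rao bound on an unbiased estimate of the parameter θ is θ(1 − θ)/n. The parameter grid, the uniform prior, and the function names are illustrative assumptions.

```python
import numpy as np


def bernoulli_crb(theta, n):
    """CRB for an unbiased estimate of a Bernoulli parameter from n samples."""
    return theta * (1.0 - theta) / n


def expected_crb(thetas, prior_weights, n):
    """Average the bound over a discrete prior on the parameter."""
    w = np.asarray(prior_weights, dtype=float)
    w = w / w.sum()
    return float(np.sum(w * bernoulli_crb(np.asarray(thetas), n)))


def worst_case_crb(thetas, n):
    """Largest value of the bound over a candidate parameter set."""
    return float(np.max(bernoulli_crb(np.asarray(thetas), n)))


thetas = np.linspace(0.3, 0.7, 5)  # candidate values 0.3, 0.4, ..., 0.7
avg = expected_crb(thetas, np.ones_like(thetas), n=100)
worst = worst_case_crb(thetas, n=100)
```

In a full design problem, either quantity would then be minimized over the sensing action; here the action is fixed and only the aggregation step is shown.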
5 Optimal Sensing Strategy for Independent Heterogeneous Data Sources with Diagonal Fisher Information Matrices
In this section, we investigate the optimal sensing strategy for independent heterogeneous data sources with diagonal Fisher information matrices. We assume that, under every possible action ,
One can show that under this assumption, the Fisher information matrix
is a diagonal matrix, where is a Fisher information matrix based on observation . Moreover, for every and , is a diagonal matrix, for which only the th element of the diagonal, denoted by , can possibly be nonzero.
Theorem 5.1.
Let us consider estimating a function of an unknown parameter vector of dimension , using data from independent heterogeneous data sources, where is a positive integer. From the th data source, we obtain samples, where . We denote the samples from the th data source as , , …, and . We assume that cost was spent on acquiring the th sample from the th data source, where , and . We further assume that , , …, and are mutually independent.
We let be a Fisher information matrix based on observation , as a function of and cost . We assume that only reveals information about ; namely, we assume that, for every and , is a diagonal matrix, for which only the th element of the diagonal, denoted by as a function of and , can possibly be nonzero.
We assume, under a cost , the function satisfies the following conditions:

is a nondecreasing function in for ;

is well defined for ;

achieves its minimum value at a finite , and the corresponding minimum value is denoted by .
Let be an unbiased estimator of a function of parameter vector . Let be the total allowable budget for acquiring samples from the data sources. When , the smallest possible achievable Cramér-Rao lower bound on the mean-squared error of is given by
Moreover, when , for the optimal sensing strategy that achieves this smallest possible Cramér-Rao lower bound on the mean-squared error, the optimal cost allocated to the th data source, denoted by , satisfies:
The optimal cost associated with obtaining the th sample of the th source is given by
The number of samples obtained from the th data source satisfies
Proof.
We assume that budget is allocated to obtain samples from the th data source, namely
Then we can conclude that
(21)  
(22)  
(23)  
(24)  
(25) 
where in (23) we use the fact that achieves its minimum value at a finite , and the corresponding minimum value is denoted by .
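The role of this minimum can be seen in a small numerical experiment. The sketch below assumes a hypothetical information-versus-cost curve F(c) = c²/(1 + c²), chosen only because it is nondecreasing and the ratio c/F(c) = c + 1/c has a finite minimizer (c = 1, with minimum value 2); none of these names or values come from the paper. With a fixed budget split evenly over n samples, the total information never exceeds the budget divided by that minimum value, and taking more samples than the best n lowers it.

```python
def info_per_sample(cost):
    """Hypothetical Fisher information of one sample acquired at `cost`."""
    return cost ** 2 / (1.0 + cost ** 2)


def total_info_even_split(budget, n_samples):
    """Total information when `budget` is split evenly over n samples."""
    return n_samples * info_per_sample(budget / n_samples)


budget = 10.0
min_cost_per_info = 2.0  # min over c of c / F(c) = c + 1/c, attained at c = 1
upper_bound = budget / min_cost_per_info

# Compare a few sample counts under the same total budget.
totals = {n: total_info_even_split(budget, n) for n in (5, 10, 20, 100)}
best_n = max(totals, key=totals.get)
```

Under these assumptions the best choice among the tested counts spends the ratio-minimizing cost on every sample, and spreading the same budget over 100 cheap samples yields far less information than using 10 samples.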
Moreover, we claim that there exists a strategy of allocating budget to samples in such a way that
(26)  
(27) 
In fact, one can take
samples, and we spend to obtain each of the samples except for the last sample, on which we spend . Then we have
(28)  
(29)  
(30)  
(31)  
(32) 
In summary, we have
Then the smallest possible achievable Cramér-Rao lower bound , as defined by the optimal objective function of the following optimization problem,
(33)  
(34)  
(35)  
(36)  
(37) 
can be lower bounded by the optimal value of the following optimization problem:
(38)  
(39)  
(40)  
(41) 
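A budget-allocation problem of this kind can be solved in closed form, which the following sketch illustrates under an explicit assumption about its shape. It assumes the reduced problem has the form: minimize the sum of a_i / C_i over per-source budgets C_i > 0, subject to the budgets summing to at most the total budget, with nonnegative weights a_i. For a diagonal Fisher information matrix, a_i would plausibly be the squared partial derivative of the estimated function with respect to the i-th parameter, multiplied by the minimum cost-per-information ratio of the i-th source. For this assumed form, the Karush-Kuhn-Tucker conditions give budgets proportional to the square roots of the weights, and an optimal value equal to the squared sum of those square roots divided by the total budget. The function name and the numbers are hypothetical.

```python
import numpy as np


def allocate_budget(weights, budget):
    """Closed-form minimizer of sum(a_i / C_i) s.t. sum(C_i) = budget."""
    root = np.sqrt(np.asarray(weights, dtype=float))
    allocation = budget * root / root.sum()
    optimal_value = root.sum() ** 2 / budget
    return allocation, optimal_value


weights = np.array([1.0, 4.0, 9.0])  # hypothetical a_i
budget = 12.0
allocation, optimal_value = allocate_budget(weights, budget)

# Compare against random feasible allocations: none should do better.
rng = np.random.default_rng(0)
trials = rng.dirichlet(np.ones(len(weights)), size=1000) * budget
trial_values = (weights / trials).sum(axis=1)
```

The random comparison is only a sanity check; optimality for the assumed form follows from the Cauchy-Schwarz inequality.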
We can solve (41) through its Lagrange dual problem and the Karush-Kuhn-Tucker conditions (please refer to Appendix 10.1). For the optimal solution, we have obtained that
and, moreover, under the optimal , the optimal value of (41) is given by
Now let us consider upper bounding the smallest possible achievable Cramér-Rao lower bound , as defined by the following optimization problem,
(42)  
(43)  
(44)  
(45)  
(46) 
We note that, because
the optimal objective value of (46) can be upper bounded by the optimal objective value of the following optimization problem (47):
(47) 
We further notice that the optimal objective value of (47) is upper bounded by the objective value of the following optimization problem (47):