Despite the availability of large datasets, in many applications, collecting labels for all data points is not possible due to measurement constraints. We consider the problem of measurement-constrained regression where we are given a large pool of data points but can only observe a small set of labels. Classical experimental design approaches in statistical literature Pukelsheim (1993) have investigated this problem, but the proposed solutions tend to be often combinatorial. In this work, we investigate computationally tractable methods for selecting data points to label from a given pool of design points in measurement-constrained regression.
Despite the simplicity and wide applicability of OLS, in practice it may not be possible to obtain the full
-dimensional response vectordue to measurement constraints. It is then a common practice to select a small subset of rows (e.g., rows) in so that the statistical efficiency of regression on the selected subset of design points is maximized. Compared to the classical experimental design problem (Pukelsheim, 1993) where can be freely designed, in this work we consider the setting where the selected design points must come from an existing (finite) design pool .
Below we list three example applications where such measurement constraints are relevant:
Example 1 (Material synthesis)
Material synthesis experiments are time-consuming and expensive, whose results are sensitive to experimental setting features such as temperature, duration and reactant ratio. Given hundreds, or even thousands of possible experimental settings, it is important to select a handful of representative settings such that a model can be built with maximized statistical efficiency to predict quality of the outcome material from experimental features. In this paper we consider such an application of low-temperature microwave-assisted thin film crystallization (Reeja-Jayan et al., 2012; Nakamura et al., 2017) and demonstrate the effectiveness of our proposed algorithms.
Example 2 (CPU benchmarking)
Central processing units (CPU) are vital to the performance of a computer system. It is an interesting statistical question to understand how known manufacturing parameters (clock period, cache size, etc.) of a CPU affect its execution time (performance) on benchmark computing tasks (Ein-Dor and Feldmesser, 1987). As the evaluation of real benchmark execution time is time-consuming and costly, it is desirable to select a subset of CPUs available in the market with diverse range of manufacturing parameters so that the statistical efficiency is maximized by benchmarking for the selected CPUs.
Example 3 (Wind speed prediction)
In Chen et al. (2015) a data set is created to record wind speed across a year at established measurement locations on highways in Minnesota, US. Due to instrumentation constraints, wind speed can only be measured at intersections of high ways, and a small subset of such intersections is selected for wind speed measurement in order to reduce data gathering costs.
In this work, we primarily focus on the linear regression model (though extensions to generalized linear models and functions of linear models are also considered later)
where is the design matrix, is the response and
are homoscedastic Gaussian noise with variance. is a -dimensional regression model that one wishes to estimate. We consider the “large-scale, low-dimensional” setting where both and , and has full column rank. A common estimator is the Ordinary Least Squares (OLS) estimator:
The mean square error of the estimated regression coefficients . Under measurement constraints, it is well-known that the statistically optimal subset for estimating the regression coefficients is given by the A-optimality criterion (Pukelsheim, 1993):
Despite the statistical optimality of Eq. (1), the optimization problem is combinatorial in nature and the optimal subset is difficult to compute. A brute-force search over all possible subsets of size requires operations, which is infeasible for even moderately sized designs .
In this work, we focus on computationally tractable methods for experiment selection that achieve near-optimal statistical efficiency in linear regression models. We consider two experiment selection models: the with replacement model where each design point (row of ) can be selected more than once with independent noise involved for each selection, and the without replacement model where distinct row subsets are required. We propose two computationally tractable algorithms: one sampling based algorithm that achieves approximation of the statistically optimal solution for the with replacement model and, when is well-conditioned, the algorithm also works for the without replacement model. In the “soft budget” setting , the approximation ratio can be further improved to . We also propose a greedy method that achieves approximation for the without replacement model regardless of the conditioning of the design pool .
2 Problem formulation and backgrounds
We first give a formal definition of the experiment selection problem in linear regression models: [experiment selection problem in linear regression models] Let be a known design matrix with full column rank and be the subset budget, with . An experiment selection problem aims to find a subset of size , either deterministically or randomly, then observes , where each coordinate of
is i.i.d. Gaussian random variable with zero mean and equal covariance, and then generates an estimateof the regression coefficients based on . Two types of experiment selection algorithms are considered:
With replacement: is a multi-set which allows duplicates of indices. Note that fresh (independent) noise is imposed after, and independent of, experiment selection, and hence duplicate design points have independent noise. We use to denote the class of all with replacement experiment selection algorithms.
Without replacement: is a standard set that may not have duplicate indices. We use to denote the class of all without replacement experiment selection algorithms.
|Ma et al. (2015)||with rep.||additive111“Additive” means that the statistical error of the resulting estimator cannot be bounded by a multiplicative factor of the minimax optimal error.||-222The leverage score sampling method in Ma et al. (2015) does not have rigorous approximation guarantees in terms of or . However, the bounds in that paper establish that leverage score sampling can be worse or better than uniform sampling under different settings.||asymptotic|
|Avron and Boutsidis (2013)||with rep.||additive|
|, or if -approximation desired|
). Sampling based algorithms can be with or without replacement and are randomized, which succeed with at least 0.8 probability.
As evaluation criterion, we consider the mean square error , where is an estimator of with and as inputs. We study computationally tractable algorithms that approximately achieves the minimax rate of convergence over or :
Formally, we give the following definition: [-approximate algorithm] Fix and . We say an algorithm (either deterministic or randomized) is a -approximate algorithm if for any with full column rank, produces a subset with size and an estimate in polynomial time such that, with probability at least 0.8,
Here both expectations are taken over the randomness in the noise variables and the inherent randomness in . Table 1 gives an overview of approximation ratio for algorithms proposed in this paper. We remark that the combinatorial A-optimality solution of Eq. (1) upper bounds the minimax risk (since minimaxity is defined over deterministic algorithms as well), hence the approximation guarantees also hold with respect to the combinatorial A-optimality objective.
2.1 Related work
There has been an increasing amount of work on fast solvers for the general least-square problem . Most of existing work along this direction (Woodruff, 2014; Dhillon et al., 2013; Drineas et al., 2011; Raskutti and Mahoney, 2015) focuses solely on the computational aspects and do not consider statistical constraints such as limited measurements of . A convex optimization formulation was proposed in Davenport et al. (2015) for a constrained adaptive sensing problem, which is a special case of our setting, but without finite sample guarantees with respect to the combinatorial problem. In Horel et al. (2014) a computationally tractable approximation algorithm was proposed for the D-optimality criterion of the experimental design problem. However, the core idea in Horel et al. (2014) of pipage rounding an SDP solution (Ageev and Sviridenko, 2004) is not applicable in our problem because the objective function we consider in Eq. (4) is not submodular.
Popular subsampling techniques such as leverage score sampling (Drineas et al., 2008) were studied in least square and linear regression problems (Zhu et al., 2015; Ma et al., 2015; Chen et al., 2015). While computation remains the primary subject, measurement constraints and statistical properties were also analyzed within the linear regression model (Zhu et al., 2015). However, none of the above-mentioned (computationally efficient) methods achieve near minimax optimal statistical efficiency in terms of estimating the underlying linear model , since the methods can be worse than uniform sampling which has a fairly large approximation constant for general . One exception is Avron and Boutsidis (2013), which proposed a greedy algorithm that achieves an error bound of estimation of that is within a multiplicative factor of the minimax optimal statistical efficiency. The multiplicative factor is however large and depends on the size of the full sample pool , as we remark in Table 1 and also in Sec. 3.4.
Another related area is active learning (Chaudhuri et al., 2015; Hazan and Karnin, 2015; Sabato and Munos, 2014), which is a stronger setting where feedback from prior measurements can be used to guide subsequent data selection. Chaudhuri et al. (2015) analyzes an SDP relaxation in the context of active maximum likelihood estimation. However, the analysis in Chaudhuri et al. (2015) only works for the with replacement model and the two-stage feedback-driven strategy proposed in Chaudhuri et al. (2015) is not available under the experiment selection model defined in Definition 2 where no feedback is assumed.
For a matrix , we use to denote the induced -norm of . In particular, and . denotes the Frobenius norm of . Let
be the singular values of, sorted in descending order. The condition number is defined as . For sequences of random variables and , we use to denote converges in probability to . We say if and if . For two -dimensional symmetric matrices and , we write if for all , and if for all .
3 Methods and main results
We describe two computationally feasible algorithms for the experiment selection problem, both based on a continuous convex optimization problem. Statistical efficiency bounds are presented for both algorithms, with detailed proofs given in Sec. 7.
3.1 Continuous optimization and minimax lower bounds
We consider the following continuous optimization problem, which is a convex relaxation of the combinatorial A-optimality criterion of Eq. (1):
Note that the constraint is only relevant for the without replacement model and for the with replacement model we drop this constraint in the optimization problem. It is easy to verify that both the objective and the feasible set in Eq. (4) are convex, and hence the global optimal solution of Eq. (4) can be obtained using computationally tractable algorithms. In particular, we describe an SDP formulation and a practical projected gradient descent algorithm in Appendix B and D, both provably converge to the global optimal solution of Eq. (4) with running time scaling polynomially in and .
We first present two facts, which are proved in Sec. 7.
Let and be feasible solutions of Eq. (4) such that for all . Then , with equality if and only if .
We remark that the inverse monotonicity of in implies second fact, which can potentially be used to understand sparsity of in later sections.
The following theorem shows that the optimal solution of Eq. (4) lower bounds the minimax risk defined in Eq. (2). Its proof is placed in Sec. 7. Let and be the optimal objective values of Eq. (4) for with replacement and without replacement, respectively. Then for ,
Despite the fact that Eq. (4) is computationally tractable, its solution is not a valid experiment selection algorithm under a measurement budget of because there can be much more than components in that are not zero. In the following sections, we discuss strategies for choosing a subset of rows with (i.e., soft constraint) or (i.e., hard constraint), using the solution to the above.
3.2 Sampling based experiment selection: soft size constraint
We first consider a weaker setting where soft constraint is imposed on the size of the selected subset ; in particular, it is allowed that with high probability, meaning that a constant fraction of over-selection is allowed. The more restrictive setting of hard constraint is treated in the next section.
A natural idea of obtaining a valid subset of size is by sampling from a weighted row distribution specified by . Let and define distributions and for as
Note that both and sum to one because and .
Under the without replacement setting the distribution is straightforward: is proportional to the optimal continuous weights ; under the with replacement setting, the sampling distribution takes into account leverage scores (effective resistance) of each data point in the conditioned covariance as well. Later analysis (Theorem 3.2) shows that it helps with the finite-sample condition on . Figure 1 gives details of the sampling based algorithms for both with and without replacement settings.
The following proposition bounds the size of in high probability: For any with probability at least it holds that . That is, . Apply Markov’s inequality and note that an additional samples need to be added due to the ceiling operator in with replacement sampling.
The sampling procedure is easy to understand in an asymptotic sense: it is easy to verify that and , for both with and without replacement settings. Note that by feasibility constraints and hence is a valid distribution for all
. For the with replacement setting, by weak law of large numbers,as and hence by continuous mapping theorem. A more refined analysis is presented in Theorem 3.2 to provide explicit conditions under which the asymptotic approximations are valid and on the statistical efficiency of as well as analysis under the more restrictive without replacement regime.
Fix as an arbitrarily small accuracy parameter. Suppose the following conditions hold:
Here and denotes the conditional number of . Then with probability at least the subset OLS estimator satisfies
We adapt the proof technique of Spielman and Srivastava in their seminal work on spectral sparsification of graphs (Spielman and Srivastava, 2011). More specifically, we prove the following stronger “two-sided” result which shows that is a spectral approximation of with high probability, under suitable conditions.
Under the same conditions in Theorem 3.2, it holds that with probability at least that
where and .
3.3 Sampling based experiment selection: hard size constraint
In some applications it is mandatory to respect a hard subset size constraint; that is, a randomized algorithm is expected to output that satisfies almost surely, and no over-sampling is allowed. To handle such hard constraints, we revise the algorithm in Figure 1 as follows:
We have the following theorem, which mimics Theorem 3.2 but with weaker approximation bounds: Suppose the following conditions hold:
Then with probability at least 0.8 the subset estimator satisfies
The following lemma is key to the proof of Theorem 3.3. Unlike Lemma 3.2, in Lemma 6 we only prove one side of the spectral approximation relation, which suffices for our purposes. To handle without replacement, we cite matrix Bernstein for combinatorial matrix sums in (Mackey et al., 2014). Define . Suppose the following conditions hold:
Then with probability at least the following holds:
where for with replacement and for without replacement.
Finally, we need to relate conditions on in Lemma 6 to interpretable conditions on subset budget : Let be an arbitrarily small fixed failure probability. The with probability at least we have that for with replacement and for without replacement. For with replacement we have and for without replacement we have . Applying Markov’s inequality on for and/or we complete the proof of Lemma 2. Combining Lemmas 6 and 2 with and note that almost surely (because ), we prove Theorem 3.3.
3.4 Greedy experiment selection
Avron and Boutsidis (2013) proposed an interesting greedy removal algorithm (outlined in Figure 3) and established the following result: Suppose of size is obtained by running algorithm in Figure 3 with an initial subset , . Both and are standard sets (i.e., without replacement). Then
In Avron and Boutsidis (2013) the greedy removal procedure in Figure 3 is applied to the entire design set , which gives approximation guarantee . This results in an approximation ratio of as defined in Eq. (2), by applying the trivial bound , which is tight for a design that has exactly non-zero rows.
To further improve the approximation ratio, we consider applying the greedy removal procedure with equal to the support of ; that is, . Because under the without replacement setting, we have the following corollary: Let be the support of and suppose . Then
It is thus important to upper bound the support size . With the trivial bound of we recover the approximation ratio by applying Figure 3 to . In order to bound away from , we consider the following assumption imposed on :
Define mapping as , where denotes the th coordinate of a -dimensional vector and if and otherwise. Denote as the affine version of . For any distinct rows of , their mappings under are linear independent.
Assumption 3.1 is essentially a general-position assumption, which assumes that no design points in lie on a degenerate affine subspace after a specific quadratic mapping. Like other similar assumptions in the literature (Tibshirani, 2013), Assumption 3.1 is very mild and almost always satisfied in practice, for example, if each row of is independently sampled from absolutely continuous distributions.
We are now ready to state the main lemma bounding the support size of . if Assumption 3.1 holds.
Lemma 3.1 is established by an interesting observation into the properties of Karush-Kuhn-Tucker (KKT) conditions of the optimization problem Eq. (4), which involves a linear system with variables. The complete proof of Lemma 3.1 is given in Sec. 7.5. To contrast the results in Lemma 3.1
with classical rank/support bounds in SDP and/or linear programming (e.g. the Pataki’s bound(Pataki, 1998)), note that the number of constraints in the SDP formulation of Eq. (4) (see also Appendix B) is linear in , and hence analysis similar to (Pataki, 1998) would result in an upper bound of that scales with , which is less useful for our analytical purpose.
Combining results from both Lemma 3.1 and Corollary 3.4 we arrive at the following theorem, which upper bounds the approximation ratio of the greedy removal procedure in Figure 3 initialized by the support of . Let be the optimal solution of the without replacement version of Eq. (4) and be the output of the greedy removal procedure in Figure 3 initialized with . If and Assumption 3.1 holds then the subset OLS estimator satisfies
Under a slightly stronger condition that , the approximation ratio can be simplified to . In addition, if , meaning that near-optimal experiment selection is achievable with computationally tractable methods if design points are allowed in the selected subset.
3.5 Interpretable subsampling example: anisotropic Gaussian design
Even though we consider a fixed pool of design points so far, here we use an anisotropic Gaussian design example to demonstrate that a non-uniform sampling can outperform uniform sampling even under random designs, and to interpret the conditions required in previous analysis. Let
be i.i.d. distributed according to an isotropic Gaussian distribution.
We first show that non-uniform weights could improve the objective . Let be the uniformly weighted solution of , corresponding to selecting each row of uniformly at random. We then have
On the other hand, let for some universal constant and , . By Markov inequality, . Define weighted solution as normalized such that . 111Note that may not be the optimal solution of Eq. (4); however, it suffices for the purpose of the demonstration of improvement resulting from non-uniform weights. Then
Here in the last inequality we apply Lemma A. Because by Jensen’s inequality, we conclude that in general , and the gap is larger for ill-conditioned covariance . This example shows that uneven weights in helps reducing the trace of inverse of the weighted covariance .
Under this model, we also simplify the conditions for the without replacement model in theorem 3.2 and 3.3. Because , it holds that . In addition, by simple algebra . Using a very conservative upper bound of by sampling rows in uniformly at random and apply weak law of large numbers and the continuous mapping theorem, we have that . In addition, . Subsequently, the condition is implied by
Essentially, the condition is reduced to . The linear dependency on is necessary, as we consider the low-dimensional linear regression problem and would imply an infinite mean-square error in estimation of . We also remark that the condition is scale-invariant, as and share the same quantity .
We discuss possible extension of our results beyond estimation of in the linear regression model.
3.6.1 Generalized linear models
In a generalized linear model satisfies for some known link function . Under regularity conditions (Van der Vaart, 2000), the maximum-likelihood estimator satisfies , where is the Fisher’s information matrix:
Here both expectations are taken over conditioned on and the last equality is due to the sufficiency of . The experiment selection problem is then formulated to select a subset of size , either with or without duplicates, that minimizes .
It is clear from Eq. (8) that the optimal subset depends on the unknown parameter , which itself is to be estimated. This issue is known as the design dependence problem for generalized linear models (Khuri et al., 2006). One approach is to consider locally optimal designs (Khuri et al., 2006; Chernoff, 1953), where a consistent estimate of is first obtained on an initial design subset 222Notice that a consistent estimate can be obtained using much fewer points than an estimate with finite approximation guarantee. and then is supplied to compute a more refined design subset to get the final estimate . With the initial estimate available, one may apply transform defined as
Note that under regularity conditions