1 Introduction
Optimal design is a classical problem in statistics [5] with many applications, from diversity sampling to machine learning. Optimal design has many different criteria, such as $A$-, $D$-, $E$-, and $V$-optimality, which correspond to different objectives to be optimized. In this work, we focus on $A$-optimality. We refer the reader to [11] and references therein for applications of optimal design and other optimality criteria. The problem of $A$-optimal design can be defined as follows. We are given input vectors $v_1, \ldots, v_n \in \mathbb{R}^d$ and a budget $k \geq d$, and the goal is to find a subset $S$ of size $k$ that minimizes the trace of $\left(\sum_{i \in S} v_i v_i^\top\right)^{-1}$ (if $\sum_{i \in S} v_i v_i^\top$ does not have full rank, we ignore the zero eigenvalues in calculating the harmonic mean of the eigenvalues of $\sum_{i \in S} v_i v_i^\top$). Approximation algorithms for $A$-optimal design include approximation by volume sampling [2], $(1+\epsilon)$-approximation for large $k$ by a connection of optimal design with matrix sparsification [12], $(1+\epsilon)$-approximation for $k = \Omega(d/\epsilon^2)$ by regret minimization [1], and approximations for $A$- and for $D$-optimal design using a variant of local search and greedy algorithms [9]. The best approximation known in the regime with large $k$ is obtained by [11] as follows.

Theorem 1.1 ([11]).
There exists a polynomial-time $(1+\epsilon)$-approximation algorithm for the $A$-optimal design problem for $k = \Omega\left(\frac{d}{\epsilon} + \frac{\log(1/\epsilon)}{\epsilon^2}\right)$.

The result follows from solving a convex relaxation of $A$-optimal design and sampling a set with proportional volume sampling based on the fractional solution obtained from the relaxation. Nikolov et al. [11] show that the approximation guarantee for $A$-optimal design follows from an approximately independent distribution, and that a general class of hardcore distributions is approximately independent. Finally, they show that proportional volume sampling with a hardcore measure can be implemented efficiently, which concludes the proof of the approximation.
In this work, we generalize this approach to the $\lambda$-regularized $A$-optimal design problem, where one aims to minimize the trace of $\left(\sum_{i \in S} v_i v_i^\top + \lambda I_d\right)^{-1}$, where $\lambda \geq 0$ and $I_d$ is the $d \times d$ identity matrix. The problem is motivated by the use of ridge regression, a variant of linear regression with an $\ell_2$ regularization penalty, to find the best linear estimator. We define near-pairwise independent distributions, show that they include a general class of hardcore distributions, and show that near-pairwise independence implies an approximation guarantee for $\lambda$-regularized $A$-optimal design. Finally, we define $\lambda$-regularized proportional volume sampling, show its near-pairwise independence property, and give a polynomial-time implementation. All of these results imply the approximation for $\lambda$-regularized $A$-optimal design, which is our main result and is stated as follows.

Theorem 1.2.
There exists a polynomial-time $(1+\epsilon)$-approximation algorithm for the $\lambda$-regularized $A$-optimal design problem for $k = \Omega\left(\frac{d}{\epsilon} + \frac{\log(1/\epsilon)}{\epsilon^2}\right)$. In fact, the approximation ratio is $1 + \epsilon f(\lambda')$, where $\lambda'$ is a normalized regularization parameter, $f(\lambda') \leq 1$, and $f(\lambda') \to 0$ as $\lambda' \to \infty$.

The exact approximation ratio and constants in the bound on $k$ can be found in Theorem 6.1. Our analysis follows similarly to the one in [11], which heavily involves elementary symmetric polynomials of the eigenvalues of the matrix $\sum_{i \in S} v_i v_i^\top$. The key idea in extending the previous results to $\lambda$-regularized $A$-optimal design is the fact that an elementary symmetric polynomial of the eigenvalues of $\sum_{i \in S} v_i v_i^\top + \lambda I_d$ is a sum of elementary symmetric polynomials of the eigenvalues of $\sum_{i \in S} v_i v_i^\top$ weighted by powers of $\lambda$ (Lemma 4.2). We then carefully group these polynomials and bound each of those groups using similar but more complicated inequalities than those in [11].
1.1 Related Work
For work related to optimal design and its approximation algorithms, we refer the reader to [11] and references therein. Here, we focus on work related to regularized optimal design, where one uses ridge regression in place of linear regression to find a linear estimator in optimal design.
Ridge regression, or $\ell_2$-regularized regression, was introduced by Hoerl and Kennard [7] to ensure a unique solution of linear regression when the data matrix is singular, i.e., when the training data points do not span all $d$ dimensions. Ridge regression has been applied to many practical problems [10] and is one of the classical linear methods for regression in machine learning [6].
Derezinski and Warmuth [3] introduced regularized volume sampling, and their results imply an approximation guarantee for $\lambda$-regularized $A$-optimal design whose ratio grows linearly in $n$. The linear dependence on $n$ in the approximation ratio is a result of their bound comparing the error of the sampled subset to that of the full data matrix $V$, rather than to that of an optimal subset of size $k$ of the problem as in our work. We compare their result to ours in more detail in Appendix A.
1.2 Organization
In Section 2, we provide background on optimal design and the motivation and definition of the $\lambda$-regularized $A$-optimal design problem. In Section 3, we describe our algorithm based on a convex relaxation and $\lambda$-regularized proportional volume sampling. In Section 4, we state the near-pairwise independence property and prove that it suffices for approximating $\lambda$-regularized $A$-optimal design. In Section 5, we show that the measure we use inside $\lambda$-regularized proportional volume sampling is hardcore, and that this class of hardcore distributions is near-pairwise independent. In Section 6, we state and prove our main technical result, namely the approximation guarantee for $\lambda$-regularized $A$-optimal design. In Section 7, we show a polynomial-time implementation of $\lambda$-regularized proportional volume sampling. We note in Appendix A the comparison of regularized volume sampling [3, 4] with our $\lambda$-regularized proportional volume sampling. Appendix B contains derivations of formulas deferred from the main body.
2 Notation, Background, and Motivation of $\lambda$-Regularized $A$-Optimal Design
Let $V$ be the $d \times n$ matrix of input vectors $v_1, \ldots, v_n \in \mathbb{R}^d$. We use the notation $V_S$ for the matrix of column vectors $v_i$, $i \in S \subseteq [n]$, and $V_x$ for the matrix of column vectors $\sqrt{x_i}\, v_i$, $i \in [n]$, so that $V_x V_x^\top = \sum_{i=1}^n x_i v_i v_i^\top$. Let $y \in \mathbb{R}^n$ be the label (or response) column vector, and $y_S$ the column vector $(y_i)_{i \in S}$. Denote by $\mathcal{U}_k$ and $\mathcal{U}_{\leq k}$ the sets of all subsets of $[n]$ of size $k$ and of size at most $k$, respectively. Let $e_i(x)$ be the degree-$i$ elementary symmetric polynomial in the variables $x_1, \ldots, x_d$, i.e., $e_i(x) = \sum_{T \subseteq [d], |T| = i} \prod_{t \in T} x_t$. By convention, $e_0(x) = 1$ for any $x$, and $e_i(x) = 0$ for $i < 0$ or $i > d$. For any positive semidefinite matrix $M$, we define $e_i(M)$ to be $e_i(\lambda(M))$, where $\lambda(M)$ is the vector of eigenvalues of $M$. Denote by $I_d$ the identity matrix of dimension $d$, and by $A \bullet B = \sum_{i,j} A_{ij} B_{ij}$ the dot product of two matrices of the same dimension. We denote by $N(\mu, \Sigma)$ the multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$.

Different optimality criteria of optimal design can be viewed as different scalarizations of the matrix $\left(\sum_{i \in S} v_i v_i^\top\right)^{-1}$, such as the trace of the inverse, as in $A$-design, or the determinant, as in $D$-design. One motivation on which we focus in this work for $A$-design is the squared error of the estimator in the linear model. In the linear model, we assume that $y_i = v_i^\top w^* + \eta_i$, where the $\eta_i$'s are independent Gaussian noise with mean zero and variance $\sigma^2$. We want to pick a subset $S$ of size $k$ to obtain labels $y_S$ which provide as much information as possible to best estimate $w^*$.

Linear Regression.
One choice to estimate $w^*$ is by minimizing the sum of squared errors on the labeled samples:

$\hat{w}_S = \operatorname{argmin}_{w \in \mathbb{R}^d} \sum_{i \in S} (v_i^\top w - y_i)^2$   (1)

which is also called linear regression. This estimate is also known to be the maximum likelihood estimate (with no prior). The expected squared error of this estimator from $w^*$ is $\mathbb{E}\|\hat{w}_S - w^*\|_2^2 = \sigma^2 \operatorname{tr}\left((V_S V_S^\top)^{-1}\right)$ (see Appendix B for its derivation). Hence, to get as useful a predictor as possible, one can minimize $\operatorname{tr}\left((V_S V_S^\top)^{-1}\right)$, which is a motivation for the $A$-design objective.
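As a sanity check on this formula, the following simulation (our illustration, not part of the paper; all names are ours) compares a Monte Carlo estimate of the expected squared error against $\sigma^2 \operatorname{tr}\left((V_S V_S^\top)^{-1}\right)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma = 3, 8, 0.5
V_S = rng.normal(size=(d, k))      # columns are the chosen vectors v_i, i in S
w_star = rng.normal(size=d)        # ground-truth linear model

Z = V_S @ V_S.T                    # moment matrix of the chosen set S
predicted = sigma**2 * np.trace(np.linalg.inv(Z))

errors = []
for _ in range(20000):
    y_S = V_S.T @ w_star + sigma * rng.normal(size=k)  # noisy labels
    w_hat = np.linalg.solve(Z, V_S @ y_S)              # least-squares estimate
    errors.append(np.sum((w_hat - w_star) ** 2))

print(predicted, np.mean(errors))  # the two numbers should be close
```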
Ridge Regression.
Suppose we estimate $w^*$ by minimizing the sum of squared errors on the labeled samples with an additional regularization term with parameter $\lambda > 0$:

$\hat{w}_S^\lambda = \operatorname{argmin}_{w \in \mathbb{R}^d} \sum_{i \in S} (v_i^\top w - y_i)^2 + \lambda \|w\|_2^2$   (2)

which is also called ridge regression. Ridge regression with $\lambda > 0$ increases the stability of linear regression against outliers, and forces the optimization problem to have a unique solution even when $V_S$ does not span full rank, which would make linear regression ill-defined. When $\lambda = 0$, the problem reverts to standard linear regression. It is also known that $\hat{w}_S^\lambda$ is the maximum a posteriori estimate of the linear model given the Gaussian prior $w^* \sim N\!\left(0, \frac{\sigma^2}{\lambda} I_d\right)$. The expected squared error of $\hat{w}_S^\lambda$ from $w^*$ is

$\mathbb{E}\|\hat{w}_S^\lambda - w^*\|_2^2 = \sigma^2 \operatorname{tr}\left((V_S V_S^\top + \lambda I_d)^{-2} V_S V_S^\top\right) + \lambda^2 \, {w^*}^\top (V_S V_S^\top + \lambda I_d)^{-2} w^*$   (3)
[Table 1: the distribution of the model error $\hat{w}_S^\lambda - w^*$ of the ridge regression estimator under different settings of $\lambda$ and of the prior on $w^*$.]

[Table 2: the distribution of the prediction error $V^\top(\hat{w}_S^\lambda - w^*)$ under the same settings.]
We summarize the distribution of the predictor (or model) error $\hat{w}_S^\lambda - w^*$ and the prediction error $V^\top(\hat{w}_S^\lambda - w^*)$ of the ridge regression estimate, with respect to a data matrix $V$ of $n$ points in $d$ dimensions, in Tables 1 and 2. Some optimality criteria concern the prediction error; for example, $V$-optimal design minimizes the expected squared norm of the prediction error with respect to the full data matrix $V$. We note that, in general, we may also assume $w^*$ is a random Gaussian vector with $w^* \sim N\!\left(0, \frac{\sigma^2}{\lambda} I_d\right)$ (instead of a fixed $w^*$), and the results in this work still hold; the errors to be minimized will be upper bounded by the same quantity as if $\|w^*\| \leq \sigma/\sqrt{\lambda}$. The derivations of Tables 1 and 2 can be found in Appendix B.
Bounding the Error of Ridge Regression Predictor.
The challenge in upper-bounding (3) is the second-order term $\lambda^2 \, {w^*}^\top (V_S V_S^\top + \lambda I_d)^{-2} w^*$. One way to address this is to consider only first-order terms. For example, Derezinski and Warmuth [3] assume that $\|w^*\| \leq \sigma/\sqrt{\lambda}$, which gives $\lambda^2 \, {w^*}^\top (V_S V_S^\top + \lambda I_d)^{-2} w^* \leq \lambda \sigma^2 \operatorname{tr}\left((V_S V_S^\top + \lambda I_d)^{-2}\right)$, and then we have

$\mathbb{E}\|\hat{w}_S^\lambda - w^*\|_2^2 \leq \sigma^2 \operatorname{tr}\left((V_S V_S^\top + \lambda I_d)^{-1}\right)$   (4)

The right-hand side of (4) now contains only a first-order term, which can be easier to optimize. For example, results in [3, 4] imply an approximation guarantee for the objective $\operatorname{tr}\left((V_S V_S^\top + \lambda I_d)^{-1}\right)$. To the best of our knowledge, it is an open question whether there is an approximation algorithm that directly bounds (3) without any assumption on $w^*$.
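To illustrate the derivation of (4), the following snippet (ours, with arbitrary test values) evaluates the exact error (3) and the bound (4) for a $w^*$ scaled to satisfy the assumption $\|w^*\| \leq \sigma/\sqrt{\lambda}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, sigma, lam = 4, 10, 1.0, 2.0
V_S = rng.normal(size=(d, k))
Z = V_S @ V_S.T
R = np.linalg.inv(Z + lam * np.eye(d))   # (V_S V_S^T + lam*I)^{-1}

# any w* obeying the assumption ||w*|| <= sigma / sqrt(lam)
w_star = rng.normal(size=d)
w_star *= (sigma / np.sqrt(lam)) / np.linalg.norm(w_star)

exact = sigma**2 * np.trace(R @ R @ Z) + lam**2 * w_star @ R @ R @ w_star  # eq. (3)
bound = sigma**2 * np.trace(R)                                             # eq. (4)
print(exact, bound)   # exact <= bound
```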
2.1 $\lambda$-Regularized $A$-Optimal Design
The upper bound (4) on the expected squared predictor error is similar to the $A$-optimal design objective $\operatorname{tr}\left((V_S V_S^\top)^{-1}\right)$, and we follow Derezinski and Warmuth [3] in using it as the objective to be optimized. In particular, we define the $\lambda$-regularized $A$-optimal design problem as follows: given input vectors $v_1, \ldots, v_n \in \mathbb{R}^d$, a positive integer $k$, and $\lambda \geq 0$, find a subset $S \subseteq [n]$ of size $k$ to minimize

$\operatorname{tr}\left(\left(\sum_{i \in S} v_i v_i^\top + \lambda I_d\right)^{-1}\right)$   (5)
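On small instances, objective (5) can be optimized by brute force, which is useful for testing; a minimal sketch (ours; exponential time in $n$, for illustration only):

```python
import numpy as np
from itertools import combinations

def reg_a_objective(V, S, lam):
    """tr((sum_{i in S} v_i v_i^T + lam*I)^{-1}) for columns V[:, i]."""
    d = V.shape[0]
    Z = V[:, S] @ V[:, S].T + lam * np.eye(d)
    return np.trace(np.linalg.inv(Z))

def brute_force_design(V, k, lam):
    """Exhaustively search all size-k subsets; only viable for tiny n."""
    n = V.shape[1]
    return min(combinations(range(n), k),
               key=lambda S: reg_a_objective(V, list(S), lam))

rng = np.random.default_rng(2)
V = rng.normal(size=(3, 7))
print(brute_force_design(V, k=4, lam=1.0))
```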
$\lambda$-Regularized Generalized Ratio Objective.
Similar to the generalized ratio objective in [11], we can also define its regularized counterpart. The generalized ratio objective is a ratio of elementary symmetric polynomials of the eigenvalues of $V_S V_S^\top$, which captures both the $A$- and $D$-design problems. Given $0 \leq l' < l \leq d$, the goal is to choose a subset $S$ of size $k$ to minimize

$\left(\frac{e_{l'}(V_S V_S^\top)}{e_l(V_S V_S^\top)}\right)^{\frac{1}{l - l'}}$   (6)

($A$-design corresponds to $l' = d-1$, $l = d$, and $D$-design to $l' = 0$, $l = d$). Hence, one can also define the $\lambda$-regularized generalized ratio objective as

$\left(\frac{e_{l'}(V_S V_S^\top + \lambda I_d)}{e_l(V_S V_S^\top + \lambda I_d)}\right)^{\frac{1}{l - l'}}$   (7)
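Since $e_0(M), \ldots, e_d(M)$ are, up to sign, the coefficients of the characteristic polynomial of $M$, objectives (6) and (7) are easy to evaluate; a small helper (our illustration) follows, checking that $l = d$, $l' = d-1$ recovers the $A$-objective:

```python
import numpy as np

def elem_sym(M):
    """e_0(M), ..., e_d(M): elementary symmetric polynomials of the eigenvalues."""
    coeffs = np.poly(np.linalg.eigvalsh(M))   # char. poly coefficients, leading 1
    return np.array([(-1) ** i * c for i, c in enumerate(coeffs)])

def generalized_ratio(M, l, lp):
    """(e_{l'}(M) / e_l(M))^(1/(l-l')), the objective in (6) and (7)."""
    e = elem_sym(M)
    return (e[lp] / e[l]) ** (1.0 / (l - lp))

rng = np.random.default_rng(3)
d = 4
A = rng.normal(size=(d, d))
M = A @ A.T + np.eye(d)                       # positive definite test matrix
# l = d, l' = d - 1 recovers the A-objective tr(M^{-1})
print(generalized_ratio(M, d, d - 1), np.trace(np.linalg.inv(M)))
```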
3 $\lambda$-Regularized Proportional Volume Sampling Algorithm
Recall that we denote by $\mathcal{U}_k$ ($\mathcal{U}_{\leq k}$) the set of all subsets of $[n]$ of size $k$ (of size at most $k$). Given $\lambda \geq 0$ and a distribution $\mu$ over $\mathcal{U}_k$ (or $\mathcal{U}_{\leq k}$), we define the $\lambda$-regularized proportional volume sampling with measure $\mu$ to be the distribution $\mu'$ over $\mathcal{U}_k$ (or $\mathcal{U}_{\leq k}$) where $\mu'(S) \propto \mu(S) \det(V_S V_S^\top + \lambda I_d)$ for all $S$. Given $z \in \mathbb{R}^n_{\geq 0}$, we say a distribution $\mu$ over $\mathcal{U}_k$ or $\mathcal{U}_{\leq k}$ is hardcore with parameter $z$ if $\mu(S) \propto z^S := \prod_{i \in S} z_i$ for all $S$. Denote by $\|M\|$ the spectral norm of a matrix $M$.
To solve $\lambda$-regularized $A$-optimal design, we solve the convex relaxation of the optimization problem, namely

minimize $\operatorname{tr}\left(\left(\sum_{i=1}^n x_i v_i v_i^\top + \lambda I_d\right)^{-1}\right)$   (8)
subject to $\sum_{i=1}^n x_i = k$   (9)
$0 \leq x_i \leq 1$ for all $i \in [n]$   (10)

to get a fractional solution $x \in [0,1]^n$. Note that convexity follows from the convexity of the function $M \mapsto \operatorname{tr}(M^{-1})$ over the set of all positive definite matrices $M$. Then, we sample a set $S$ by $\lambda$-regularized proportional volume sampling with a hardcore measure $\mu$, where the parameter $z$ of the measure depends on the fractional solution $x$. The summary of the algorithm is in Algorithm 1. We choose $z$ in such a way as to obtain the desired approximation result. The approximation guarantee and the motivation for how we set $z$ can be found in Section 6.
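A compact end-to-end sketch of Algorithm 1 for small instances follows (our illustration; it assumes cvxpy for the relaxation (8)–(10) and enumerates $\mathcal{U}_k$ instead of using the polynomial-time sampler of Section 7; the choice $z = x/(1-x)$ is one natural instantiation, while the $\beta$-scaled choice used in the analysis appears in Sections 5 and 6):

```python
import numpy as np
import cvxpy as cp
from itertools import combinations

rng = np.random.default_rng(4)
d, n, k, lam = 3, 8, 4, 1.0
V = rng.normal(size=(d, n))

# Convex relaxation (8)-(10): minimize tr((sum_i x_i v_i v_i^T + lam*I)^{-1}).
x = cp.Variable(n, nonneg=True)
A = sum(x[i] * np.outer(V[:, i], V[:, i]) for i in range(n)) + lam * np.eye(d)
objective = cp.matrix_frac(np.eye(d), A)   # tr(I^T A^{-1} I) = tr(A^{-1}), convex
prob = cp.Problem(cp.Minimize(objective), [cp.sum(x) == k, x <= 1])
prob.solve()
xv = np.clip(x.value, 1e-9, 1 - 1e-9)

# lambda-regularized proportional volume sampling with a hardcore measure:
# Pr[S] is proportional to mu(S) * det(V_S V_S^T + lam*I) with mu(S) ~ prod z_i.
z = xv / (1 - xv)
subsets = list(combinations(range(n), k))
weights = np.array([
    np.prod(z[list(S)]) *
    np.linalg.det(V[:, list(S)] @ V[:, list(S)].T + lam * np.eye(d))
    for S in subsets])
S = subsets[rng.choice(len(subsets), p=weights / weights.sum())]
print("sampled set:", S)
```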
4 Reduction of Approximability to Near-Pairwise Independence
In this section, we show that the approximation guarantee of $\lambda$-regularized proportional volume sampling with measure $\mu$ reduces to a property of $\mu$ which we call near-pairwise independence, stated formally in Theorem 4.3. We first define near-pairwise independence of a distribution.
Definition 4.1.
Let $\mu$ be a distribution on $\mathcal{U}_{\leq k}$. Let $x \in [0,1]^n$ with $\sum_{i=1}^n x_i = k$. We say $\mu$ is $(c_1, c_2)$-near-pairwise independent with respect to $x$ if for all subsets $T, R \subseteq [n]$ each of size at most $d$,

$\frac{\Pr_{S \sim \mu}[T \subseteq S]}{\Pr_{S \sim \mu}[R \subseteq S]} \leq c_1 c_2^{|R| - |T|} \frac{x^T}{x^R}$   (11)

where $x^A := \prod_{i \in A} x_i$ for $A \subseteq [n]$.
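On small instances, Definition 4.1 can be checked by exhaustive enumeration. The helper below (ours) reports the maximum of $\frac{\Pr[T \subseteq S]}{\Pr[R \subseteq S]} \cdot \frac{x^R}{x^T}$ over all pairs of sets of size at most $d$, i.e., the smallest flat constant witnessing (11) with $c_2 = 1$:

```python
import numpy as np
from itertools import chain, combinations

def containment_prob(mu, T):
    """Pr_{S ~ mu}[T subseteq S] for mu given as {frozenset: probability}."""
    return sum(p for S, p in mu.items() if set(T) <= S)

def worst_ratio(mu, x, d):
    """max over T, R of size <= d of (Pr[T in S] / Pr[R in S]) * (x^R / x^T)."""
    n = len(x)
    small = list(chain.from_iterable(combinations(range(n), s)
                                     for s in range(d + 1)))
    best = 0.0
    for T in small:
        for R in small:
            num = containment_prob(mu, T) * np.prod(x[list(R)])
            den = containment_prob(mu, R) * np.prod(x[list(T)])
            best = max(best, num / den)
    return best

# hardcore measure on U_{<=k} with parameter z = x / (1 - x)
n, k, d = 5, 3, 2
x = np.full(n, k / n)
z = x / (1 - x)
supp = list(chain.from_iterable(combinations(range(n), s) for s in range(k + 1)))
w = np.array([np.prod(z[list(S)]) for S in supp])
mu = {frozenset(S): wi / w.sum() for S, wi in zip(supp, w)}
print(worst_ratio(mu, x, d))   # finite, so (11) holds with some (c1, c2)
```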
We omit the phrase "with respect to $x$" when the context is clear. Before we prove the main result, we make some calculations which will be used later.
Lemma 4.2.
For any $d \times d$ positive semidefinite matrix $M$ and $\lambda \geq 0$,

$\operatorname{tr}\left((M + \lambda I_d)^{-1}\right) = \frac{e_{d-1}(M + \lambda I_d)}{e_d(M + \lambda I_d)}$   (12)

and, for any $j = 0, 1, \ldots, d$,

$e_j(M + \lambda I_d) = \sum_{i=0}^{j} \lambda^{j-i} \binom{d-i}{j-i} e_i(M)$   (13)
Proof.
Let $\lambda_1, \ldots, \lambda_d$ be the eigenvalues of $M$, so that $\lambda_1 + \lambda, \ldots, \lambda_d + \lambda$ are the eigenvalues of $M + \lambda I_d$. Then we have

$\operatorname{tr}\left((M + \lambda I_d)^{-1}\right) = \sum_{t=1}^d \frac{1}{\lambda_t + \lambda} = \frac{\sum_{t=1}^d \prod_{s \neq t} (\lambda_s + \lambda)}{\prod_{s=1}^d (\lambda_s + \lambda)} = \frac{e_{d-1}(M + \lambda I_d)}{e_d(M + \lambda I_d)}$

which proves the first equality (each term in the numerator is the product of the eigenvalues of $M + \lambda I_d$ with one element deleted). Next, we have

$e_j(M + \lambda I_d) = \sum_{T \subseteq [d], |T| = j} \prod_{t \in T} (\lambda_t + \lambda) = \sum_{T \subseteq [d], |T| = j} \sum_{U \subseteq T} \lambda^{j - |U|} \prod_{t \in U} \lambda_t$

where each product over $T$ is expanded into monomials indexed by the subsets $U \subseteq T$. For each fixed $U$ of size $i$, we have

$\#\{T \subseteq [d] : U \subseteq T, |T| = j\} = \binom{d-i}{j-i}$   (14)

by counting the number of occurrences of each monomial $\lambda^{j-i} \prod_{t \in U} \lambda_t$ in the expansion. Noting that $\sum_{U \subseteq [d], |U| = i} \prod_{t \in U} \lambda_t = e_i(M)$, we finish the proof. ∎
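Both identities are easy to verify numerically; the snippet below (ours) checks (12) and (13) on a random positive semidefinite matrix:

```python
import numpy as np
from math import comb

def elem_sym(eigs):
    """e_0, ..., e_d of the given values, via characteristic polynomial coefficients."""
    return np.array([(-1) ** i * c for i, c in enumerate(np.poly(eigs))])

rng = np.random.default_rng(5)
d, lam = 5, 0.7
A = rng.normal(size=(d, d))
M = A @ A.T                                   # random PSD matrix
eigs = np.linalg.eigvalsh(M)
e_M, e_Mlam = elem_sym(eigs), elem_sym(eigs + lam)

# (12): tr((M + lam*I)^{-1}) = e_{d-1}(M + lam*I) / e_d(M + lam*I)
print(np.trace(np.linalg.inv(M + lam * np.eye(d))), e_Mlam[d - 1] / e_Mlam[d])

# (13): e_j(M + lam*I) = sum_i lam^{j-i} * C(d-i, j-i) * e_i(M)
for j in range(d + 1):
    rhs = sum(lam ** (j - i) * comb(d - i, j - i) * e_M[i] for i in range(j + 1))
    print(j, e_Mlam[j], rhs)
```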
Now we are ready to state and prove the main result in this section.
Theorem 4.3.
Let $x \in [0,1]^n$ with $\sum_{i=1}^n x_i = k$. Let $\mu$ be a distribution on $\mathcal{U}_{\leq k}$ that is $(c_1, c_2)$-near-pairwise independent with respect to $x$, with $c_2 \geq 1$. Then the $\lambda$-regularized proportional volume sampling $\mu'$ with measure $\mu$ satisfies

$\mathbb{E}_{S \sim \mu'}\left[\operatorname{tr}\left((V_S V_S^\top + \lambda I_d)^{-1}\right)\right] \leq c_1 c_2^d \cdot \operatorname{tr}\left(\left(\sum_{i=1}^n x_i v_i v_i^\top + \lambda I_d\right)^{-1}\right)$   (15)

That is, the sampling gives a $(c_1 c_2^d)$-approximation guarantee to $\lambda$-regularized $A$-optimal design in expectation.
Note that since the value of the relaxation (8)–(10) at an optimal fractional solution is at most the integral optimum, (15) also implies a $(c_1 c_2^d)$-approximation guarantee for the original $\lambda$-regularized $A$-optimal design problem. However, we can exploit the gap between these two quantities to get a better approximation ratio, which converges to 1 as $\lambda \to \infty$. This is done formally in Section 6.
Proof.
We apply Lemma 4.2 to the right-hand side of (15) to get

$\operatorname{tr}\left(\left(\sum_i x_i v_i v_i^\top + \lambda I_d\right)^{-1}\right) = \frac{\sum_{i=0}^{d-1} \lambda^{d-1-i} (d-i) \sum_{|R|=i} x^R \det(V_R^\top V_R)}{\sum_{i=0}^{d} \lambda^{d-i} \sum_{|R|=i} x^R \det(V_R^\top V_R)}$

where we apply the Cauchy–Binet formula, $e_i\left(\sum_j x_j v_j v_j^\top\right) = \sum_{|R|=i} x^R \det(V_R^\top V_R)$, to the last equality. Next, we apply Lemma 4.2 to the left-hand side of (15) to get

$\mathbb{E}_{S \sim \mu'}\left[\operatorname{tr}\left((V_S V_S^\top + \lambda I_d)^{-1}\right)\right] = \frac{\sum_S \mu(S)\, e_{d-1}(V_S V_S^\top + \lambda I_d)}{\sum_S \mu(S)\, e_d(V_S V_S^\top + \lambda I_d)} = \frac{\sum_{i=0}^{d-1} \lambda^{d-1-i} (d-i) \sum_{|T|=i} \det(V_T^\top V_T) \Pr_{S \sim \mu}[T \subseteq S]}{\sum_{i=0}^{d} \lambda^{d-i} \sum_{|T|=i} \det(V_T^\top V_T) \Pr_{S \sim \mu}[T \subseteq S]}$

Therefore, by cross-multiplying the numerators and denominators, the ratio of the left-hand side of (15) to the right-hand side equals

$\frac{\sum_{T, R} \lambda^{2d-1-|T|-|R|} (d - |T|) \det(V_T^\top V_T) \det(V_R^\top V_R) \Pr_{S \sim \mu}[T \subseteq S] \, x^R}{\sum_{T, R} \lambda^{2d-1-|T|-|R|} (d - |T|) \det(V_T^\top V_T) \det(V_R^\top V_R) \, x^T \Pr_{S \sim \mu}[R \subseteq S]}$

where $T$ ranges over sets of size at most $d-1$ and $R$ over sets of size at most $d$. For each fixed pair $(T, R)$, we want to upper bound the numerator term by the corresponding denominator term. By the definition of near-pairwise independence (11),

$\Pr_{S \sim \mu}[T \subseteq S] \, x^R \leq c_1 c_2^{|R| - |T|} \, x^T \Pr_{S \sim \mu}[R \subseteq S]$   (16)
$\leq c_1 c_2^d \, x^T \Pr_{S \sim \mu}[R \subseteq S]$   (17)

Therefore, the ratio is also bounded above by $c_1 c_2^d$. ∎
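On small instances, the bound (15) can be checked directly by enumerating both the hardcore measure $\mu$ and the induced distribution $\mu'$; a brute-force sketch (ours, restricted to $\mathcal{U}_k$ for simplicity):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
d, n, k, lam = 2, 6, 3, 0.5
V = rng.normal(size=(d, n))
x = np.full(n, k / n)                  # a feasible fractional solution
z = x / (1 - x)                        # a hardcore parameter (beta = 1 here)

subsets = list(combinations(range(n), k))
mu = np.array([np.prod(z[list(S)]) for S in subsets])
mu /= mu.sum()
vol = np.array([np.linalg.det(V[:, list(S)] @ V[:, list(S)].T + lam * np.eye(d))
                for S in subsets])
mu_prime = mu * vol / np.sum(mu * vol)   # regularized proportional volume sampling

obj = np.array([np.trace(np.linalg.inv(V[:, list(S)] @ V[:, list(S)].T
                                       + lam * np.eye(d))) for S in subsets])
expected = float(mu_prime @ obj)
fractional = np.trace(np.linalg.inv(V @ np.diag(x) @ V.T + lam * np.eye(d)))
print(expected / fractional)             # the ratio bounded in (15)
```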
5 Constructing a NearPairwiseIndependent Distribution
In this section, we construct a distribution $\mu$ on $\mathcal{U}_{\leq k}$ and prove its $(c_1, c_2)$-near-pairwise independence. Our proposed $\mu$ is hardcore with parameter $z$ defined by $z_i = \frac{\beta x_i}{1 - \beta x_i}$ (coordinatewise) for some $\beta \in (0, 1)$ to be chosen later. With this choice of $z$, we upper bound the ratio in (11) in terms of $\beta$. Later, in Section 6, after getting an explicit approximation ratio in terms of $\beta$, we will optimize over $\beta$ to get the desired approximation guarantee for Algorithm 1.
Lemma 5.1.
Let $x \in [0,1]^n$ such that $\sum_{i=1}^n x_i = k$. Let $\mu$ be the distribution on $\mathcal{U}_{\leq k}$ that is hardcore with parameter $z$ defined by $z_i = \frac{\beta x_i}{1 - \beta x_i}$ (coordinatewise) for some $\beta \in (0,1)$ with $(1-\beta)k > d$. Then, for all $T, R \subseteq [n]$ of size between 0 and $d$, we have

$\frac{\Pr_{S \sim \mu}[T \subseteq S]}{\Pr_{S \sim \mu}[R \subseteq S]} \leq \frac{\beta^{|T| - |R|}}{\Pr[|B| \leq k - d]} \cdot \frac{x^T}{x^R}$   (18)

where $B$ is the random subset of $[n]$ that includes each $i$ independently with probability $\beta x_i$. That is, $\mu$ is $(c_1, c_2)$-near-pairwise independent with $c_1 = \Pr[|B| \leq k-d]^{-1}$ and $c_2 = 1/\beta$.
Proof.
Fix $T, R \subseteq [n]$ of size at most $d$. Recall that $B$ is the random set that includes each $i \in [n]$ independently with probability $\beta x_i$. Since $\Pr[B = S] \propto \prod_{i \in S} \frac{\beta x_i}{1 - \beta x_i} = z^S$, the measure $\mu$ is exactly the distribution of $B$ conditioned on the event $|B| \leq k$. Then, noting that $|R| \leq d$, we have

$\Pr_{S \sim \mu}[T \subseteq S] \leq \frac{\Pr[T \subseteq B]}{\Pr[|B| \leq k]} = \frac{\beta^{|T|} x^T}{\Pr[|B| \leq k]}$

and

$\Pr_{S \sim \mu}[R \subseteq S] \geq \frac{\Pr[R \subseteq B] \cdot \Pr[|B \setminus R| \leq k - d]}{\Pr[|B| \leq k]} \geq \frac{\beta^{|R|} x^R \cdot \Pr[|B| \leq k - d]}{\Pr[|B| \leq k]}$

where the first inequality uses the independence of $B \cap R$ and $B \setminus R$; dividing the two bounds gives (18). Let $\delta = \frac{(1-\beta)k - d}{\beta k} > 0$. Then, by the Chernoff bound,

$\Pr[|B| > k - d] = \Pr[|B| > (1 + \delta) \beta k] \leq \exp\left(-\frac{\delta^2 \beta k}{2 + \delta}\right)$   (19)

so that $c_1 = \Pr[|B| \leq k - d]^{-1}$ is finite, which finishes the proof. ∎
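Both the conditioning step in the proof and the tail probability $\Pr[|B| > k - d]$ can be computed exactly for small $n$; the snippet below (ours) uses a standard Poisson-binomial dynamic program for the tail and reports the resulting constants $c_1$ and $c_2$:

```python
import numpy as np

def size_distribution(p):
    """Exact law of |B| when element i is included independently w.p. p[i]."""
    dist = np.array([1.0])
    for pi in p:
        dist = np.convolve(dist, [1 - pi, pi])   # Poisson-binomial DP
    return dist

n, k, d, beta = 12, 6, 3, 0.8
x = np.full(n, k / n)
tail = size_distribution(beta * x)[k - d + 1:].sum()   # Pr[|B| > k - d]
print("Pr[|B| > k - d] =", tail)
print("c1 =", 1 / (1 - tail), "c2 =", 1 / beta)        # constants from Lemma 5.1
```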
6 The Proof of the Main Result
The main aim of this section is to prove the approximation guarantee of the $\lambda$-regularized proportional volume sampling algorithm (Algorithm 1) for $\lambda$-regularized $A$-optimal design. The main result is stated formally in Theorem 6.1.
Theorem 6.1.
Let $v_1, \ldots, v_n \in \mathbb{R}^d$, $0 < \epsilon \leq 1$, and $\lambda \geq 0$, and suppose

$k \geq C\left(\frac{d}{\epsilon} + \frac{\log(1/\epsilon)}{\epsilon^2}\right)$   (20)

for a sufficiently large universal constant $C$. Let $x \in [0,1]^n$ be an optimal solution of the convex relaxation (8)–(10), and denote $\lambda' := \lambda / \left\|\sum_{i=1}^n x_i v_i v_i^\top\right\|$, a scale-free version of the regularizer. Then the $\lambda$-regularized proportional volume sampling with hardcore measure $\mu$ with parameter $z_i = \frac{\beta x_i}{1 - \beta x_i}$ (coordinatewise) with $\beta = 1 - \epsilon/4$ satisfies

$\mathbb{E}_{S \sim \mu'}\left[\operatorname{tr}\left((V_S V_S^\top + \lambda I_d)^{-1}\right)\right] \leq \left(1 + \epsilon f(\lambda')\right) \cdot \operatorname{tr}\left(\left(\sum_{i=1}^n x_i v_i v_i^\top + \lambda I_d\right)^{-1}\right)$   (21)

where $f$ is an explicit function, given in the proof, with $f(\lambda') \leq 1$ for all $\lambda' \geq 0$ and $f(\lambda') \to 0$ as $\lambda' \to \infty$. Therefore, Algorithm 1 gives a $(1 + \epsilon f(\lambda'))$-approximation ratio for $\lambda$-regularized $A$-optimal design.
The approximation guarantee of Algorithm 1 follows from (21) because $x$ in Algorithm 1 is an optimal solution of the convex relaxation of $\lambda$-regularized $A$-optimal design, so the objective value achieved by $x$ is at most the optimal value of the original problem.
We briefly outline the proof of Theorem 6.1 here, which combines results from previous sections. Lemma 5.1 shows that our constructed $\mu$ is $(c_1, c_2)$-near-pairwise independent for constants depending on $\beta$. Theorem 4.3 converts near-pairwise independence into an approximation guarantee with respect to the fractional relaxation of $\lambda$-regularized $A$-optimal design. However, there may be a gap between the optimum of the relaxation and that of $\lambda$-regularized $A$-optimal design itself. As $\lambda$ increases, the gap grows, so the approximation tightens even more (we quantify this gap formally in Claim 2). As a result, we want to pick $\beta$ to balance two effects: it must be close enough to 1 to keep the near-pairwise independence constants small, yet far enough from 1 to exploit this gap. Choosing the $\beta$ that gives our desired approximation is done in the proof of Theorem 6.1.
Before proving the main theorem, Theorem 6.1, we first simplify the parameters of the near-pairwise independent distribution that we constructed. The claim below shows that condition (20) on $k$ is the right condition to obtain $c_1 \leq 1 + \epsilon/4$.
Claim 1.
Let $0 < \epsilon \leq 1$, $\beta = 1 - \epsilon/4$, and let $B$ be the random subset of $[n]$ that includes each $i$ independently with probability $\beta x_i$, where $\sum_{i=1}^n x_i = k$. Suppose

$k \geq C\left(\frac{d}{\epsilon} + \frac{\log(1/\epsilon)}{\epsilon^2}\right)$   (22)

for a sufficiently large universal constant $C$. Then

$\Pr[|B| > k - d] \leq \frac{\epsilon}{8}, \quad \text{and hence} \quad c_1 = \Pr[|B| \leq k - d]^{-1} \leq 1 + \frac{\epsilon}{4}$   (23)
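For concreteness, the claim can be checked numerically with the same Poisson-binomial computation used after Lemma 5.1; in the snippet below (ours), the constant $C$ is a hypothetical choice for illustration, not the constant from the proof:

```python
import numpy as np

def tail_above(p, m):
    """Exact Pr[sum of independent Bernoulli(p_i) > m]."""
    dist = np.array([1.0])
    for pi in p:
        dist = np.convolve(dist, [1 - pi, pi])
    return dist[m + 1:].sum()

eps, d, C = 0.25, 4, 40                 # C is a hypothetical constant
beta = 1 - eps / 4
k = int(C * (d / eps + np.log(1 / eps) / eps**2))
n = 4 * k                               # any n >= k works; x is uniform here
x = np.full(n, k / n)
print(tail_above(beta * x, k - d), "<=", eps / 8)
```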