 # λ-Regularized A-Optimal Design and its Approximation by λ-Regularized Proportional Volume Sampling

In this work, we study the λ-regularized A-optimal design problem and introduce the λ-regularized proportional volume sampling algorithm, generalized from [Nikolov, Singh, and Tantipongpipat, 2019], for this problem, with an approximation guarantee that extends the previous work. In this problem, we are given vectors $v_1,\ldots,v_n\in\mathbb{R}^d$ in $d$ dimensions, a budget $k\le n$, and a regularization parameter $\lambda\ge 0$, and the goal is to find a subset $S\subseteq[n]$ of size $k$ that minimizes the trace of $\left(\sum_{i\in S}v_iv_i^\top+\lambda I_d\right)^{-1}$, where $I_d$ is the $d\times d$ identity matrix. The problem is motivated by optimal design in ridge regression, where one tries to minimize the expected squared error of the ridge regression predictor from the true coefficient in the underlying linear model. We introduce λ-regularized proportional volume sampling and give a polynomial-time implementation of it to solve this problem. We show that it achieves a $(1+\epsilon/\sqrt{1+\lambda'})$-approximation for $k=\Omega\left(\frac{d}{\epsilon}+\frac{\log(1/\epsilon)}{\epsilon^2}\right)$, where $\lambda'$ is proportional to $\lambda$, extending the previous bound in [Nikolov, Singh, and Tantipongpipat, 2019] to the case $\lambda>0$ and obtaining asymptotic optimality as $\lambda\to\infty$.



## 1 Introduction

Optimal design is a classical problem in statistics with many applications from diversity sampling to machine learning. Optimal design has many different criteria, such as A-, D-, E-, and V-optimality, which correspond to different objectives to be optimized. In this work, we focus on A-optimality. We refer the reader to  and references therein for applications of optimal design and other optimality criteria.

The problem of A-optimal design can be defined as follows. We are given input vectors $v_1,\ldots,v_n\in\mathbb{R}^d$ in $d$ dimensions and a budget $k\le n$, and the goal is to find a subset $S\subseteq[n]$ of size $k$ that minimizes the trace of $\left(\sum_{i\in S}v_iv_i^\top\right)^{-1}$ (if $\sum_{i\in S}v_iv_i^\top$ does not have full rank, we ignore the zero eigenvalues in calculating the harmonic mean of its eigenvalues). Approximation algorithms for A-optimal design include ones based on volume sampling, on a connection of optimal design with matrix sparsification, on regret minimization, and on variants of local search and greedy algorithms, each with its own approximation ratio and admissible range of $k$. The best approximation known in the regime with large $k$ is obtained by Nikolov, Singh, and Tantipongpipat [2019] as follows.

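###### Theorem 1.1 ([Nikolov, Singh, and Tantipongpipat, 2019]).

There exists a polynomial-time $(1+\epsilon)$-approximation algorithm for the A-optimal design problem for $k=\Omega\left(\frac{d}{\epsilon}+\frac{\log(1/\epsilon)}{\epsilon^2}\right)$.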

The result follows from solving the convex relaxation of A-optimal design and sampling a set with proportional volume sampling based on the fractional solution obtained from the relaxation. Nikolov et al. show that an approximation guarantee for A-optimal design follows from an approximately independent distribution, and that a general class of hard-core distributions is approximately independent. Finally, they show that proportional volume sampling can be efficiently implemented and is, indeed, a hard-core distribution, which concludes the proof of the approximation.

In this work, we generalize this approach to the λ-regularized A-optimal design problem, where one aims to minimize the trace of $\left(\sum_{i\in S}v_iv_i^\top+\lambda I_d\right)^{-1}$, where $I_d$ is the $d\times d$ identity matrix. The problem is motivated by the use of ridge regression, a variant of linear regression with an $\ell_2$-regularization penalty, to find the best linear estimator. We define $(c,\alpha)$-near-pairwise independent distributions, and show that they also include a general class of hard-core distributions, and that near-pairwise independence implies an approximation guarantee for λ-regularized A-optimal design. Finally, we define λ-regularized proportional volume sampling and show its near-pairwise independence property and its polynomial-time implementation. Together, these results imply the approximation guarantee for λ-regularized A-optimal design, which is our main result and is stated as follows.

###### Theorem 1.2.

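There exists a polynomial-time $(1+\epsilon)$-approximation algorithm for the λ-regularized A-optimal design problem for $k=\Omega\left(\frac{d}{\epsilon}+\frac{\log(1/\epsilon)}{\epsilon^2}\right)$. In fact, the approximation ratio is $1+\epsilon/\sqrt{1+\lambda'}$, where $\lambda'$ is proportional to $\lambda$, so the ratio tends to $1$ as $\lambda\to\infty$.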

The exact approximation ratio and the constants in the bound on $k$ can be found in Theorem 6.1. Our analysis is similar to the one in [Nikolov, Singh, and Tantipongpipat, 2019], which heavily involves elementary symmetric polynomials of the eigenvalues of the matrix $V_SV_S^\top$. The key idea in extending the previous results to λ-regularized A-optimal design is the fact that an elementary symmetric polynomial of the eigenvalues of $V_SV_S^\top+\lambda I_d$ is a weighted sum of elementary symmetric polynomials of the eigenvalues of $V_SV_S^\top$. We then carefully group these polynomials and bound each of those groups using similar but more complicated inequalities from [Nikolov, Singh, and Tantipongpipat, 2019].

### 1.1 Related Work

For related work on A-optimal design and its approximation algorithms, we refer the reader to  and references therein. Here, we focus on work related to λ-regularized A-optimal design, where one uses ridge regression in place of linear regression to find a linear estimator in optimal design.

Ridge regression, or regularized regression, was introduced by Hoerl and Kennard to ensure a unique solution of linear regression when the data matrix is singular, i.e., when the training data points do not span the full dimension. Ridge regression has been applied to many practical problems and is one of the classical linear methods for regression in machine learning.

Derezinski and Warmuth introduced λ-regularized volume sampling, and their results imply -approximation for λ-regularized A-optimal design. The linear dependence on in the approximation ratio is a result of their bound of that compares to rather than to for an optimal of the problem as in our work. We compare their result to ours in more detail in Appendix A.

### 1.2 Organization

In Section 2, we provide background on optimal design and the motivation and definition of the λ-regularized A-optimal design problem. In Section 3, we describe our algorithm based on convex relaxation and λ-regularized proportional volume sampling. In Section 4, we state the near-pairwise independence property and prove that it suffices to approximate λ-regularized A-optimal design. In Section 5, we show that λ-regularized proportional volume sampling is hard-core, and that hard-core distributions are near-pairwise independent. In Section 6, we state and prove our main technical result, namely the approximation guarantee for λ-regularized A-optimal design. In Section 7, we show a polynomial-time implementation of λ-regularized proportional volume sampling. In Appendix A, we compare λ-regularized volume sampling [3, 4] with our λ-regularized proportional volume sampling. Appendix B contains derivations of formulas deferred from the main body.

## 2 Notation, Background, and Motivation of λ-Regularized A-Optimal Design

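Let $V=[v_1\ \cdots\ v_n]$ be the $d$-by-$n$ matrix of input vectors $v_1,\ldots,v_n\in\mathbb{R}^d$. We use the notation $x^S=\prod_{i\in S}x_i$, $V_S$ for the matrix of column vectors $v_i$ for $i\in S$, and $V_S(x)$ for the matrix of column vectors $\sqrt{x_i}\,v_i$ for $i\in S$. Let $y\in\mathbb{R}^n$ be the label (or response) column vector, and let $y_S$ be the column vector $(y_i)_{i\in S}$. Denote by $\mathcal{U}_k$ and $\mathcal{U}_{\le k}$ the sets of all subsets of $[n]$ of size $k$ and at most $k$, respectively. Let $e_\ell(z)$ be the degree-$\ell$ elementary symmetric polynomial in the variables $z=(z_1,\ldots,z_m)$, i.e., $e_\ell(z)=\sum_{S\subseteq[m],\,|S|=\ell}z^S$. By convention, $e_0(z)=1$ for any $z$, and $e_\ell(z)=0$ whenever $\ell$ exceeds the number of variables. For any positive semi-definite matrix $M$, we define $E_\ell(M)$ to be $e_\ell(\nu)$, where $\nu$ is the vector of eigenvalues of $M$. Denote by $I_d$ the identity matrix of dimension $d$, by $Z_S(\lambda)=V_SV_S^\top+\lambda I_d$ the regularized design matrix, and by $\langle A,B\rangle=\operatorname{tr}(A^\top B)$ the dot product of two matrices of the same dimension. We denote by $\mathcal{N}(\mu,\Sigma)$ the multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$.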

Different optimality criteria of optimal design can be viewed as different scalarizations of the matrix $V_SV_S^\top$, such as the trace of the inverse in A-design, or the determinant in D-design. One motivation for A-design, on which we focus in this work, is the squared error of the estimator in the linear model. In the linear model, we assume that $y_i=\langle v_i,w^*\rangle+\eta_i$, where the $\eta_i$'s are independent Gaussian noise with mean zero and variance $\sigma^2$. We want to pick $S\subseteq[n]$ of size $k$ to obtain labels $y_S$ which provide as much information as possible to best estimate $w^*$.

#### Linear Regression.

One choice to estimate $w^*$ is by minimizing the sum of squared errors on the labeled samples:

$$\hat{w}_S=\arg\min_{w\in\mathbb{R}^d}\left\{\left\|y_S-V_S^\top w\right\|_2^2\right\} \tag{1}$$

which is also called linear regression. This estimate is also known to be the maximum likelihood estimate (with no prior). The expected squared error of this estimator from $w^*$ is $\sigma^2\operatorname{tr}\left((V_SV_S^\top)^{-1}\right)$ (see Appendix B for its derivation). Hence, to get as useful a predictor as possible, one can minimize $\operatorname{tr}\left((V_SV_S^\top)^{-1}\right)$, which motivates the A-design objective.

#### Ridge Regression.

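Suppose we estimate $w^*$ by minimizing the sum of squared errors on the labeled samples with an additional $\ell_2$-regularization term with parameter $\lambda\ge 0$:

$$\hat{w}_S(\lambda)=\arg\min_{w\in\mathbb{R}^d}\left\{\left\|y_S-V_S^\top w\right\|_2^2+\lambda\left\|w\right\|_2^2\right\} \tag{2}$$

which is also called ridge regression. Ridge regression with $\lambda>0$ increases the stability of linear regression against outliers, and forces the optimization problem to have a unique solution when $V_S$ does not have full rank, a case in which linear regression is ill-defined. When $\lambda=0$, the problem reverts to standard linear regression. It is also known that $\hat{w}_S(\lambda)$ is the maximum a posteriori estimate of the linear model given the Gaussian prior $w^*\sim\mathcal{N}\left(0,\frac{\sigma^2}{\lambda}I_d\right)$. The expected squared error of $\hat{w}_S(\lambda)$ from $w^*$ is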

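$$\mathbb{E}_{\eta_S}\left[\left\|\hat{w}_S(\lambda)-w^*\right\|_2^2\right]=\sigma^2\operatorname{tr}Z_S(\lambda)^{-1}-\lambda\left\langle Z_S(\lambda)^{-2},\,\sigma^2I_d-\lambda w^*w^{*\top}\right\rangle. \tag{3}$$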
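Formula (3) can be sanity-checked numerically. The sketch below assumes the standard closed form of the ridge estimate, $\hat{w}_S(\lambda)=Z_S(\lambda)^{-1}V_Sy_S$, and computes the expected squared error directly as squared bias plus variance, then compares it with the right-hand side of (3). The dimensions, seed, and variable names are illustrative choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lam, sigma = 3, 5, 0.7, 1.3
V_S = rng.standard_normal((d, k))        # columns are the selected vectors v_i
w_star = rng.standard_normal(d)          # true coefficient vector (illustrative)

Z = V_S @ V_S.T + lam * np.eye(d)        # Z_S(lambda)
Z_inv = np.linalg.inv(Z)

# Direct computation, assuming w_hat = Z^{-1} V_S y_S with y_S = V_S^T w* + eta:
# E[w_hat] - w* = -lam Z^{-1} w*, and Cov(w_hat) = sigma^2 Z^{-1} V_S V_S^T Z^{-1}.
bias = -lam * Z_inv @ w_star
variance = sigma**2 * np.trace(Z_inv @ V_S @ V_S.T @ Z_inv)
direct = bias @ bias + variance

# Right-hand side of (3); <A, B> = tr(A^T B) is the entrywise sum of A * B.
second_order = np.sum((Z_inv @ Z_inv) * (sigma**2 * np.eye(d) - lam * np.outer(w_star, w_star)))
formula = sigma**2 * np.trace(Z_inv) - lam * second_order

print(direct, formula)
```

The two printed values should agree up to floating-point error for any choice of inputs with $\lambda>0$.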

We summarize the distribution of the predictor or model error, and the prediction error with respect to a data matrix in dimensions, , of the ridge regression estimate in Tables 1 and 2. Some optimality criteria concern prediction error; for example, -optimal design minimizes the expected squared norm of with . We note that in general, we may also assume is a random Gaussian vector with (instead of ), and the results in this work still hold; the errors to be minimized will be upper bounded by as if . The derivation of Tables 1 and 2 can be found in Appendix B.

#### Bounding the Error of Ridge Regression Predictor.

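The challenge in upper-bounding (3) is the second-order term $\lambda\left\langle Z_S(\lambda)^{-2},\,\sigma^2I_d-\lambda w^*w^{*\top}\right\rangle$. One way to address this is to consider only the first-order term $\sigma^2\operatorname{tr}Z_S(\lambda)^{-1}$. For example, Derezinski and Warmuth assume that $\|w^*\|_2^2\le\frac{\sigma^2}{\lambda}$, which gives $\sigma^2I_d-\lambda w^*w^{*\top}\succeq 0$, and then we have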

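$$\mathbb{E}_{\eta_S}\left[\left\|\hat{w}_S(\lambda)-w^*\right\|_2^2\right]\le\sigma^2\operatorname{tr}\left(Z_S(\lambda)^{-1}\right). \tag{4}$$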

The right-hand side of (4) now contains only the first-order term $\sigma^2\operatorname{tr}Z_S(\lambda)^{-1}$, which can be easier to optimize. For example, results in [3, 4] imply an approximation for the objective $\operatorname{tr}Z_S(\lambda)^{-1}$. To the best of our knowledge, it is an open question whether there is an approximation algorithm that directly bounds $\mathbb{E}_{\eta_S}\left[\|\hat{w}_S(\lambda)-w^*\|_2^2\right]$ without any assumption on $w^*$.

### 2.1 λ-Regularized A-Optimal Design

The upper bound on the expected squared predictor error in (4) is similar to the A-optimal design objective $\operatorname{tr}\left((V_SV_S^\top)^{-1}\right)$, and we follow Derezinski and Warmuth in using it as the objective to be optimized. In particular, we define the λ-regularized A-optimal design problem as follows: given input vectors $v_1,\ldots,v_n\in\mathbb{R}^d$ in $d$ dimensions, a positive integer $k\le n$, and $\lambda\ge 0$, we find a subset $S\subseteq[n]$ of size $k$ to minimize

$$\min_{S\subseteq[n],\,|S|=k}\operatorname{tr}\left(V_SV_S^\top+\lambda I_d\right)^{-1}. \tag{5}$$
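To make objective (5) concrete, the following sketch evaluates it for a given subset and solves tiny instances by exhaustive search. The function names `regularized_a_objective` and `brute_force_design` are hypothetical helpers introduced here for illustration; exhaustive search takes time exponential in $n$ and is not the algorithm studied in this work.

```python
import itertools
import numpy as np

def regularized_a_objective(V, S, lam):
    """tr((V_S V_S^T + lam * I_d)^{-1}) for the columns of V indexed by S (needs lam > 0 or full rank)."""
    d = V.shape[0]
    V_S = V[:, list(S)]
    return np.trace(np.linalg.inv(V_S @ V_S.T + lam * np.eye(d)))

def brute_force_design(V, k, lam):
    """Exhaustively search all size-k subsets; only feasible for very small n."""
    n = V.shape[1]
    return min(itertools.combinations(range(n), k),
               key=lambda S: regularized_a_objective(V, S, lam))

# Toy instance: the vectors are e_1, e_2, e_1 in R^2.
V = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
best = brute_force_design(V, k=2, lam=1.0)
print(best, regularized_a_objective(V, best, 1.0))
```

On this toy instance, picking one copy of each coordinate direction gives $Z_S(\lambda)=2I_2$ and objective value $1$, whereas picking both copies of $e_1$ gives $\tfrac13+1$.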

#### λ-regularized Generalized Ratio Objective.

Similar to the generalized ratio objective in [Nikolov, Singh, and Tantipongpipat, 2019], we can also define its λ-regularized counterpart. The generalized ratio objective is the ratio of elementary symmetric polynomials of the eigenvalues of $V_SV_S^\top$, which captures both the A- and D-design problems. Given $0\le l'<l\le d$, the goal is to choose a subset $S\subseteq[n]$ of size $k$ to minimize

$$\min_{S\subseteq[n],\,|S|=k}\left(\frac{E_{l'}(V_SV_S^\top)}{E_{l}(V_SV_S^\top)}\right)^{\frac{1}{l-l'}}. \tag{6}$$

Hence, one can also define the λ-regularized generalized ratio objective as

$$\min_{S\subseteq[n],\,|S|=k}\left(\frac{E_{l'}(V_SV_S^\top+\lambda I_d)}{E_{l}(V_SV_S^\top+\lambda I_d)}\right)^{\frac{1}{l-l'}}. \tag{7}$$
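As an illustration of how the ratio objective specializes, the sketch below computes $\left(E_{l'}(M)/E_l(M)\right)^{1/(l-l')}$ for a positive definite matrix $M$ and checks two special cases: $l'=d-1,\ l=d$ gives $\operatorname{tr}(M^{-1})$ (the A-design objective) and $l'=0,\ l=d$ gives $\det(M)^{-1/d}$ (a D-design objective). The helper names `elem_sym` and `ratio_objective` are hypothetical.

```python
import numpy as np

def elem_sym(values):
    """Return [e_0, e_1, ..., e_m] of the given values, read off the product of (1 + v t)."""
    coeffs = np.array([1.0])
    for v in values:
        coeffs = np.convolve(coeffs, [1.0, v])
    return coeffs

def ratio_objective(M, l_prime, l):
    """(E_{l'}(M) / E_l(M))^(1 / (l - l')) for a symmetric positive definite M."""
    E = elem_sym(np.linalg.eigvalsh(M))
    return (E[l_prime] / E[l]) ** (1.0 / (l - l_prime))

M = np.diag([1.0, 2.0, 4.0])
d = M.shape[0]
print(ratio_objective(M, d - 1, d), np.trace(np.linalg.inv(M)))
print(ratio_objective(M, 0, d), np.linalg.det(M) ** (-1.0 / d))
```

Applying the same function to $M=V_SV_S^\top+\lambda I_d$ gives the regularized objective in (7).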

## 3 λ-Regularized Proportional Volume Sampling Algorithm

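Recall that we denote by $\mathcal{U}_k$ ($\mathcal{U}_{\le k}$) the set of all subsets $S\subseteq[n]$ of size $k$ (of size at most $k$). Given $\lambda\ge 0$ and a distribution $\mu$ over $\mathcal{U}\in\{\mathcal{U}_k,\mathcal{U}_{\le k}\}$, we define the λ-regularized proportional volume sampling with measure $\mu$ to be the distribution $\mu'$ over $\mathcal{U}$ where $\mu'(S)\propto\mu(S)\det Z_S(\lambda)$ for all $S\in\mathcal{U}$. Given $z\in\mathbb{R}^n_{+}$, we say a distribution $\mu$ over $\mathcal{U}$ is hard-core with parameter $z$ if $\mu(S)\propto z^S$ for all $S\in\mathcal{U}$. Denote by $\|A\|_2$ the spectral norm of a matrix $A$.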

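To solve λ-regularized A-optimal design, we solve the convex relaxation of the optimization problem, namely

$$\min_{x\in\mathbb{R}^n}\ \frac{E_{d-1}\left(V(x)V(x)^\top+\lambda I\right)}{E_{d}\left(V(x)V(x)^\top+\lambda I\right)}\quad\text{subject to} \tag{8}$$

$$\sum_{i=1}^n x_i=k, \tag{9}$$

$$1\ge x_i\ge 0 \tag{10}$$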

where $V(x):=[\sqrt{x_1}\,v_1\ \cdots\ \sqrt{x_n}\,v_n]$, to get a fractional solution $x\in[0,1]^n$. Note that the objective equals $\operatorname{tr}\left((V(x)V(x)^\top+\lambda I)^{-1}\right)$, so convexity follows from the convexity of the function $\operatorname{tr}(M^{-1})$ over the set of all PSD matrices $M$. Then, we sample a set $\mathcal{S}$ by λ-regularized proportional volume sampling with a hard-core measure $\mu$, where the parameter $z$ of the measure depends on the fractional solution $x$. The algorithm is summarized in Algorithm 1. We choose $z$ in such a way as to obtain the desired approximation result. The approximation guarantee and the motivation for how we set $z$ can be found in Section 6.
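The sampling step can be illustrated on tiny instances by enumerating all candidate sets. The sketch below assumes a hard-core measure on subsets of size at most $k$, so that $\mu'(S)\propto z^S\det\left(V_SV_S^\top+\lambda I_d\right)$; it takes the parameter vector `z` as an input rather than deriving it from a fractional solution. The function names are hypothetical, and explicit enumeration is exponential in $n$, so this is not the polynomial-time implementation of Section 7.

```python
import itertools
import numpy as np

def pvs_distribution(V, z, k, lam):
    """Return {S: probability} with probability proportional to z^S det(V_S V_S^T + lam I), |S| <= k."""
    d, n = V.shape
    weights = {}
    for size in range(k + 1):
        for S in itertools.combinations(range(n), size):
            idx = list(S)
            z_S = np.prod(z[idx])                        # empty product is 1 for S = ()
            Z = V[:, idx] @ V[:, idx].T + lam * np.eye(d)
            weights[S] = z_S * np.linalg.det(Z)
    total = sum(weights.values())
    return {S: w / total for S, w in weights.items()}

def sample_pvs(V, z, k, lam, rng):
    """Draw one subset from the enumerated distribution."""
    dist = pvs_distribution(V, z, k, lam)
    subsets = list(dist)
    i = rng.choice(len(subsets), p=[dist[S] for S in subsets])
    return subsets[i]

V = np.eye(2)
z = np.array([1.0, 1.0])
dist = pvs_distribution(V, z, k=1, lam=1.0)
sample = sample_pvs(V, z, 1, 1.0, np.random.default_rng(0))
print(dist, sample)
```

In this toy case the weights are $1$ for the empty set and $2$ for each singleton, so the probabilities are $0.2$, $0.4$, and $0.4$.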

## 4 Reduction of Approximability to Near-Pairwise Independence

In this section, we show that an approximation guarantee for λ-regularized proportional volume sampling with measure $\mu$ reduces to showing a property of $\mu$ which we call near-pairwise independence, stated formally in Theorem 4.3. We first define near-pairwise independence of a distribution.

###### Definition 4.1.

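Let $\mu$ be a distribution on $\mathcal{U}\in\{\mathcal{U}_k,\mathcal{U}_{\le k}\}$. Let $x\in[0,1]^n$. We say $\mu$ is $(c,\alpha)$-near-pairwise independent with respect to $x$ if for all $T,R\subseteq[n]$, each of size at most $d$,

$$\frac{\Pr_{\mathcal{S}\sim\mu}\left[\mathcal{S}\supseteq T\right]}{\Pr_{\mathcal{S}\sim\mu}\left[\mathcal{S}\supseteq R\right]}\le c\,\alpha^{|R|-|T|}\,\frac{x^T}{x^R}. \tag{11}$$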

We omit the phrase "with respect to $x$" when the context is clear. Before we prove the main result, we carry out a calculation which will be used later.

###### Lemma 4.2.

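For any PSD matrix $M\in\mathbb{R}^{d\times d}$ and $\lambda\ge 0$,

$$E_d(M+\lambda I)=\sum_{i=0}^{d}E_i(M)\,\lambda^{d-i}$$

and

$$E_{d-1}(M+\lambda I)=\sum_{i=0}^{d-1}(d-i)\,E_i(M)\,\lambda^{d-1-i}.$$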

###### Proof.

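Let $\nu=(\nu_1,\ldots,\nu_d)$ be the eigenvalues of $M$. Then we have

$$E_d(M+\lambda I)=\prod_{i=1}^{d}(\nu_i+\lambda)=\sum_{i=0}^{d}e_i(\nu)\,\lambda^{d-i}=\sum_{i=0}^{d}E_i(M)\,\lambda^{d-i},$$

which proves the first equality. Next, we have

$$E_{d-1}(M+\lambda I)=\sum_{j=1}^{d}\prod_{i\ne j}(\nu_i+\lambda)=\sum_{j=1}^{d}\sum_{i=0}^{d-1}e_i(\nu_{-j})\,\lambda^{d-1-i},$$

where $\nu_{-j}$ is $\nu$ with the element $\nu_j$ deleted. For each fixed $i$, we have

$$\sum_{j=1}^{d}e_i(\nu_{-j})=(d-i)\,e_i(\nu) \tag{14}$$

by counting the number of times each monomial of $e_i(\nu)$ appears on the left-hand side: a monomial of degree $i$ appears in $e_i(\nu_{-j})$ exactly for the $d-i$ indices $j$ that it does not contain. Noting that $E_i(M)=e_i(\nu)$, we finish the proof. ∎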
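The two expansions used later in the proof of Theorem 4.3, namely $E_d(M+\lambda I)=\sum_{i=0}^{d}E_i(M)\lambda^{d-i}$ and $E_{d-1}(M+\lambda I)=\sum_{i=0}^{d-1}(d-i)E_i(M)\lambda^{d-1-i}$, can be checked numerically on a random PSD matrix. The sketch below is only a sanity check under illustrative choices of `d`, `lam`, and the random seed; `elem_sym` is a hypothetical helper.

```python
import numpy as np

def elem_sym(values):
    """Return [e_0, e_1, ..., e_m] of the given values, read off the product of (1 + v t)."""
    coeffs = np.array([1.0])
    for v in values:
        coeffs = np.convolve(coeffs, [1.0, v])
    return coeffs

rng = np.random.default_rng(1)
d, lam = 4, 0.5
A = rng.standard_normal((d, d))
M = A @ A.T                                              # a random PSD matrix

E = elem_sym(np.linalg.eigvalsh(M))                      # E_i(M), i = 0..d
E_shift = elem_sym(np.linalg.eigvalsh(M + lam * np.eye(d)))

rhs_d = sum(E[i] * lam ** (d - i) for i in range(d + 1))
rhs_d_minus_1 = sum((d - i) * E[i] * lam ** (d - 1 - i) for i in range(d))

print(E_shift[d], rhs_d)
print(E_shift[d - 1], rhs_d_minus_1)
```

Each printed pair should agree up to floating-point error.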

Now we are ready to state and prove the main result in this section.

###### Theorem 4.3.

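Let $x\in[0,1]^n$. Let $\mu$ be a distribution on $\mathcal{U}\in\{\mathcal{U}_k,\mathcal{U}_{\le k}\}$ that is $(c,\alpha)$-near-pairwise independent. Then the λ-regularized proportional volume sampling $\mu'$ with measure $\mu$ satisfies

$$\mathbb{E}_{\mathcal{S}\sim\mu'}\left[\frac{E_{d-1}(Z_{\mathcal{S}}(\lambda))}{E_{d}(Z_{\mathcal{S}}(\lambda))}\right]\le c\alpha\,\frac{E_{d-1}\left(V(x)V(x)^\top+\alpha\lambda I\right)}{E_{d}\left(V(x)V(x)^\top+\alpha\lambda I\right)}. \tag{15}$$

That is, the sampling gives a $c\alpha$-approximation guarantee for $\alpha\lambda$-regularized A-optimal design in expectation.

Note that, since $\operatorname{tr}\left((M+\alpha\lambda I)^{-1}\right)\le\operatorname{tr}\left((M+\lambda I)^{-1}\right)$ for any PSD matrix $M$ when $\alpha\ge 1$, (15) also implies a $c\alpha$-approximation guarantee for the original λ-regularized A-optimal design. However, we can exploit the gap between these two quantities to get a better approximation ratio which converges to $1$ as $\lambda\to\infty$. This is done formally in Section 6.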

###### Proof.

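We apply Lemma 4.2 to the right-hand side of (15) to get

$$\begin{aligned}\frac{E_{d-1}\left(V(x)V(x)^\top+\alpha\lambda I\right)}{E_{d}\left(V(x)V(x)^\top+\alpha\lambda I\right)}&=\frac{\sum_{h=0}^{d-1}(d-h)\,E_h\left(V(x)V(x)^\top\right)(\alpha\lambda)^{d-1-h}}{\sum_{\ell=0}^{d}E_\ell\left(V(x)V(x)^\top\right)(\alpha\lambda)^{d-\ell}}\\&=\frac{\sum_{h=0}^{d-1}\sum_{|T|=h}(d-h)(\alpha\lambda)^{d-1-h}\,x^T\det\left(V_T^\top V_T\right)}{\sum_{\ell=0}^{d}\sum_{|R|=\ell}(\alpha\lambda)^{d-\ell}\,x^R\det\left(V_R^\top V_R\right)}\end{aligned}$$

where we apply Cauchy-Binet in the last equality. Next, we apply Lemma 4.2 to the left-hand side of (15) to get

$$\begin{aligned}\mathbb{E}_{\mathcal{S}\sim\mu'}\left[\frac{E_{d-1}(Z_{\mathcal{S}}(\lambda))}{E_{d}(Z_{\mathcal{S}}(\lambda))}\right]&=\frac{\sum_{S\in\mathcal{U}}\mu(S)\,E_d(Z_S(\lambda))\,\frac{E_{d-1}(Z_S(\lambda))}{E_d(Z_S(\lambda))}}{\sum_{S\in\mathcal{U}}\mu(S)\,E_d(Z_S(\lambda))}=\frac{\sum_{S\in\mathcal{U}}\mu(S)\,E_{d-1}(Z_S(\lambda))}{\sum_{S\in\mathcal{U}}\mu(S)\,E_d(Z_S(\lambda))}\\&=\frac{\sum_{S\in\mathcal{U}}\mu(S)\sum_{h=0}^{d-1}(d-h)\,E_h\left(V_SV_S^\top\right)\lambda^{d-1-h}}{\sum_{S\in\mathcal{U}}\mu(S)\sum_{\ell=0}^{d}E_\ell\left(V_SV_S^\top\right)\lambda^{d-\ell}}\\&=\frac{\sum_{S\in\mathcal{U}}\mu(S)\sum_{h=0}^{d-1}\sum_{|T|=h,\,T\subseteq S}(d-h)\,\lambda^{d-1-h}\det\left(V_T^\top V_T\right)}{\sum_{S\in\mathcal{U}}\mu(S)\sum_{\ell=0}^{d}\sum_{|R|=\ell,\,R\subseteq S}\lambda^{d-\ell}\det\left(V_R^\top V_R\right)}\\&=\frac{\sum_{h=0}^{d-1}\sum_{|T|=h}\sum_{S\in\mathcal{U},\,S\supseteq T}\mu(S)\,(d-h)\,\lambda^{d-1-h}\det\left(V_T^\top V_T\right)}{\sum_{\ell=0}^{d}\sum_{|R|=\ell}\sum_{S\in\mathcal{U},\,S\supseteq R}\mu(S)\,\lambda^{d-\ell}\det\left(V_R^\top V_R\right)}\\&=\frac{\sum_{h=0}^{d-1}\sum_{|T|=h}(d-h)\,\lambda^{d-1-h}\det\left(V_T^\top V_T\right)\Pr_{\mathcal{S}\sim\mu}\left[\mathcal{S}\supseteq T\right]}{\sum_{\ell=0}^{d}\sum_{|R|=\ell}\lambda^{d-\ell}\det\left(V_R^\top V_R\right)\Pr_{\mathcal{S}\sim\mu}\left[\mathcal{S}\supseteq R\right]}.\end{aligned}$$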

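Therefore, by cross-multiplying the numerators and denominators, the ratio of the left-hand side of (15) to the quotient on its right-hand side equals

$$\frac{\sum_{h=0}^{d-1}\sum_{|T|=h}\sum_{\ell=0}^{d}\sum_{|R|=\ell}(d-h)\det\left(V_T^\top V_T\right)\det\left(V_R^\top V_R\right)\cdot\lambda^{d-1-h}(\alpha\lambda)^{d-\ell}\,x^R\Pr_{\mu}\left[\mathcal{S}\supseteq T\right]}{\sum_{h=0}^{d-1}\sum_{|T|=h}\sum_{\ell=0}^{d}\sum_{|R|=\ell}(d-h)\det\left(V_T^\top V_T\right)\det\left(V_R^\top V_R\right)\cdot\lambda^{d-\ell}(\alpha\lambda)^{d-1-h}\,x^T\Pr_{\mu}\left[\mathcal{S}\supseteq R\right]}.$$

All terms are nonnegative, so it suffices to bound the ratio term by term. For each fixed $T,R$ of size $h,\ell$, we want to upper bound $\frac{\lambda^{d-1-h}(\alpha\lambda)^{d-\ell}\,x^R\Pr_{\mu}[\mathcal{S}\supseteq T]}{\lambda^{d-\ell}(\alpha\lambda)^{d-1-h}\,x^T\Pr_{\mu}[\mathcal{S}\supseteq R]}$. By the definition of near-pairwise independence (11),

$$\frac{\lambda^{d-1-h}(\alpha\lambda)^{d-\ell}\,x^R\Pr_{\mu}\left[\mathcal{S}\supseteq T\right]}{\lambda^{d-\ell}(\alpha\lambda)^{d-1-h}\,x^T\Pr_{\mu}\left[\mathcal{S}\supseteq R\right]}\le\frac{\lambda^{d-1-h}(\alpha\lambda)^{d-\ell}}{\lambda^{d-\ell}(\alpha\lambda)^{d-1-h}}\,c\,\alpha^{\ell-h} \tag{16}$$

$$=\alpha^{h-\ell+1}\cdot c\,\alpha^{\ell-h}=c\alpha. \tag{17}$$

Therefore, the ratio is also bounded above by $c\alpha$. ∎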
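The Cauchy-Binet step in the proof, $E_h\left(V(x)V(x)^\top\right)=\sum_{|T|=h}x^T\det\left(V_T^\top V_T\right)$, can be checked numerically on a small random instance. The sketch below is a sanity check under illustrative dimensions and seed; the helper names are hypothetical.

```python
import itertools
import numpy as np

def elem_sym(values):
    """Return [e_0, e_1, ..., e_m] of the given values, read off the product of (1 + v t)."""
    coeffs = np.array([1.0])
    for v in values:
        coeffs = np.convolve(coeffs, [1.0, v])
    return coeffs

rng = np.random.default_rng(2)
d, n = 3, 5
V = rng.standard_normal((d, n))
x = rng.uniform(0.1, 1.0, size=n)

M = (V * x) @ V.T                            # V(x) V(x)^T = sum_i x_i v_i v_i^T
E = elem_sym(np.linalg.eigvalsh(M))          # E_h(V(x) V(x)^T), h = 0..d

def subset_sum(h):
    """Sum over all T of size h of x^T det(V_T^T V_T)."""
    total = 0.0
    for T in itertools.combinations(range(n), h):
        idx = list(T)
        V_T = V[:, idx]
        total += np.prod(x[idx]) * np.linalg.det(V_T.T @ V_T)
    return total

for h in range(1, d + 1):
    print(h, E[h], subset_sum(h))
```

For each $h$ from $1$ to $d$, the two printed values should agree up to floating-point error.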

## 5 Constructing a Near-Pairwise-Independent Distribution

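In this section, we construct a distribution $\mu$ on $\mathcal{U}_{\le k}$ and prove its $(c,\alpha)$-near-pairwise independence. Our proposed $\mu$ is hard-core with parameter $z$ defined by $z_i=\frac{x_i}{\beta-x_i}$ (coordinate-wise) for some $\beta$ to be chosen later. With this choice of $\mu$, we upper bound the ratio $\frac{\Pr_{\mathcal{S}\sim\mu}[\mathcal{S}\supseteq T]}{\Pr_{\mathcal{S}\sim\mu}[\mathcal{S}\supseteq R]}$ in terms of $\beta$. Later in Section 6, after getting an explicit approximation ratio in terms of $\beta$, we will optimize over $\beta$ to get the desired approximation result for Algorithm 1.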

###### Lemma 5.1.

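Let $x\in[0,1]^n$ such that $\sum_{i=1}^n x_i=k$. Let $\mu$ be a distribution on $\mathcal{U}_{\le k}$ that is hard-core with parameter $z$ defined by $z_i=\frac{x_i}{\beta-x_i}$ (coordinate-wise) for some $\beta$. Then, for all $T,R\subseteq[n]$ of size $h,\ell$ between 0 and $d$, we have

$$\frac{\Pr_{\mathcal{S}\sim\mu}\left[\mathcal{S}\supseteq T\right]}{\Pr_{\mathcal{S}\sim\mu}\left[\mathcal{S}\supseteq R\right]}\le\frac{\beta^{\ell-h}}{1-\exp\left(-\frac{\left((\beta-1)k-\beta d\right)^2}{3\beta k}\right)}\cdot\frac{x^T}{x^R}. \tag{18}$$

That is, $\mu$ is $\left(\frac{1}{1-\exp\left(-\frac{((\beta-1)k-\beta d)^2}{3\beta k}\right)},\,\beta\right)$-near-pairwise independent.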

###### Proof.

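Fix $T,R\subseteq[n]$ of size $h,\ell$. Define $\mathcal{B}$ to be the random set that includes each $i\in[n]$ independently with probability $\frac{x_i}{\beta}$. Let $Y_i$ be the indicator of the event $i\in\mathcal{B}$. Then, noting that $\mu$ is the distribution of $\mathcal{B}$ conditioned on $|\mathcal{B}|\le k$, we have

$$\frac{\Pr_{\mathcal{S}\sim\mu}\left[\mathcal{S}\supseteq T\right]}{\Pr_{\mathcal{S}\sim\mu}\left[\mathcal{S}\supseteq R\right]}=\frac{\Pr\left[\mathcal{B}\supseteq T,\,|\mathcal{B}|\le k\right]}{\Pr\left[\mathcal{B}\supseteq R,\,|\mathcal{B}|\le k\right]}\le\frac{\Pr\left[\mathcal{B}\supseteq T\right]}{\Pr\left[\mathcal{B}\supseteq R,\,|\mathcal{B}|\le k\right]}=\beta^{\ell-h}\,\frac{x^T}{x^R}\cdot\frac{1}{\Pr\left[\sum_{i\notin R}Y_i\le k-\ell\right]}.$$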

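Let $Y=\sum_{i\notin R}Y_i$, so that $\mathbb{E}[Y]=\frac{k-x(R)}{\beta}$ with $x(R)=\sum_{i\in R}x_i$. Then by the Chernoff bound,

$$\Pr\left[Y>k-\ell\right]\le\exp\left(-\frac{\left((\beta-1)k+x(R)-\beta\ell\right)^2}{3\beta\left(k-x(R)\right)}\right)\le\exp\left(-\frac{\left((\beta-1)k-\beta d\right)^2}{3\beta k}\right) \tag{19}$$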

which finishes the proof. ∎
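The intermediate inequality in the proof, before the Chernoff bound is applied, can be verified exactly on a small instance by enumeration. The sketch below assumes inclusion probabilities $p_i=x_i/\beta$ and hard-core parameter $z_i=x_i/(\beta-x_i)=p_i/(1-p_i)$; the instance (`n`, `k`, `beta`, `x`, `T`, `R`) and the helper names are illustrative assumptions.

```python
import itertools
import numpy as np

n, k, beta = 5, 3, 1.5
x = np.array([0.9, 0.7, 0.6, 0.5, 0.3])     # fractional point with sum(x) = k
p = x / beta                                 # inclusion probabilities of the random set B
z = x / (beta - x)                           # hard-core parameter, z_i = p_i / (1 - p_i)

# Hard-core distribution mu on subsets of size at most k: mu(S) proportional to z^S.
subsets = [S for r in range(k + 1) for S in itertools.combinations(range(n), r)]
w = np.array([np.prod(z[list(S)]) for S in subsets])
mu = w / w.sum()

def prob_contains(T):
    """Pr_{S ~ mu}[S contains T]."""
    return sum(m for S, m in zip(subsets, mu) if set(T) <= set(S))

def prob_rest_at_most(R, m):
    """Pr[sum_{i not in R} Y_i <= m] for independent Y_i ~ Bernoulli(p_i)."""
    rest = [i for i in range(n) if i not in R]
    total = 0.0
    for r in range(min(m, len(rest)) + 1):
        for A in itertools.combinations(rest, r):
            total += np.prod([p[i] if i in A else 1 - p[i] for i in rest])
    return total

T, R = (0,), (1, 2)
lhs = prob_contains(T) / prob_contains(R)
rhs = (beta ** (len(R) - len(T)) * np.prod(x[list(T)]) / np.prod(x[list(R)])
       / prob_rest_at_most(R, k - len(R)))
print(lhs, rhs)
```

The printed left-hand side should not exceed the right-hand side; the Chernoff bound (19) is then used only to lower bound the probability in the denominator.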

## 6 The Proof of the Main Result

The main aim of this section is to prove the approximation guarantee of the λ-regularized proportional volume sampling algorithm (Algorithm 1) for λ-regularized A-optimal design. The main result is stated formally in Theorem 6.1.

###### Theorem 6.1.

Let , and , and suppose

$$k\ge\frac{10d}{\epsilon}+\frac{60}{\epsilon^2}\log(4/\epsilon). \tag{20}$$

Denote . Then the -proportional volume sampling with hard-core measure with parameter (coordinate-wise) with satisfies

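$$\mathbb{E}_{\mathcal{S}\sim\mu'}\left[\frac{E_{d-1}(Z_{\mathcal{S}}(\lambda))}{E_{d}(Z_{\mathcal{S}}(\lambda))}\right]\le\left(1+\frac{\epsilon}{\sqrt{1+\lambda'}}\right)\frac{E_{d-1}\left(V(x)V(x)^\top+\lambda I\right)}{E_{d}\left(V(x)V(x)^\top+\lambda I\right)}. \tag{21}$$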

Therefore, Algorithm 1 gives a $\left(1+\frac{\epsilon}{\sqrt{1+\lambda'}}\right)$-approximation ratio for λ-regularized A-optimal design.

The approximation guarantee of Algorithm 1 follows from (21) because $x$ in Algorithm 1 is an optimal solution to the convex relaxation of λ-regularized A-optimal design, so the objective achieved by $x$ is at most the optimal value of the original problem.

We briefly outline the proof of Theorem 6.1 here, which combines results from previous sections. Lemma 5.1 shows that our constructed $\mu$ is $(c,\alpha)$-near-pairwise independent for some $c,\alpha$ dependent on $\beta$. Theorem 4.3 converts $(c,\alpha)$-near-pairwise independence to a $c\alpha$-approximation guarantee for $\alpha\lambda$-regularized A-optimal design. However, there may be a gap between the optima of λ- and $\alpha\lambda$-regularized A-optimal design. As $\alpha$ increases, the gap becomes larger, so the approximation tightens even more (we quantify this gap formally in Claim 2). As a result, we want to pick $\beta$ small enough to have a small $c\alpha$-approximation ratio but also big enough to exploit this gap. Choosing the $\beta$ that gives our desired approximation is done in the proof of Theorem 6.1.

Before proving the main theorem, Theorem 6.1, we first simplify the parameter $c$ of the $(c,\alpha)$-near-pairwise independent distribution that we constructed. The claim below gives a condition on $k$ under which the exponential term appearing in $c$ is at most $\epsilon'$.

###### Claim 1.

Let . Suppose

$$k\ge\frac{2\beta d}{\beta-1}+\frac{3\beta}{(\beta-1)^2}\log(1/\epsilon'). \tag{22}$$

Then

$$\exp\left(-\frac{\left((\beta-1)k-\beta d\right)^2}{3\beta k}\right)\le\epsilon'. \tag{23}$$

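The claim can be spot-checked numerically: for several values of $\beta>1$, $d$, and $\epsilon'\in(0,1)$, take the smallest integer $k$ satisfying (22) and evaluate the left-hand side of (23). The grid of parameter values and the function names below are illustrative assumptions; this is a sanity check, not a proof.

```python
import math

def k_lower_bound(beta, d, eps_prime):
    """Right-hand side of (22)."""
    return 2 * beta * d / (beta - 1) + 3 * beta / (beta - 1) ** 2 * math.log(1 / eps_prime)

def tail_term(beta, d, k):
    """Left-hand side of (23)."""
    return math.exp(-((beta - 1) * k - beta * d) ** 2 / (3 * beta * k))

results = []
for beta in (1.1, 1.5, 2.0):
    for d in (1, 5, 20):
        for eps_prime in (0.5, 0.1, 0.01):
            k = math.ceil(k_lower_bound(beta, d, eps_prime))
            results.append((beta, d, eps_prime, k, tail_term(beta, d, k)))

for row in results:
    print(row)
```

In every row, the last entry should be at most the corresponding $\epsilon'$.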
###### Proof.

(23) is equivalent to

$$(\beta-1)k-\beta d\ge\sqrt{3\beta\log(1/\epsilon')\,k}$$

which, by solving the quadratic equation in $\sqrt{k}$, is further equivalent to

 √k≥√3βlog(1/ϵ′)+√3βlog