1 Introduction
Consider a collection of experiments parameterized by dimensional vectors , and let denote the matrix with rows . The outcome of the th experiment is a random variable , where is the parameter vector of a linear model with prior distribution , and is independent noise. In experimental design, we have access to the vectors , for , but we are allowed to observe only a small number of outcomes for experiments of our choosing. Suppose that we observe the outcomes from a subset of experiments. The posterior distribution of given (the vector of outcomes in ) is , where denotes the matrix with rows for . In the Bayesian framework of experimental design [CV95], we assume that the prior precision matrix of the linear model is known, and our goal is to choose so as to minimize some quantity (a.k.a. an optimality criterion) measuring the “size” of the posterior covariance matrix . This quantity is a function of the subset covariance . Note that if the matrix is non-invertible then, even though the prior distribution is ill-defined, we can still interpret it as encoding no prior information in the directions with eigenvalue 0. In particular, for we recover classical experimental design, where the covariance matrix of given is . We will write the Bayesian optimality criteria as functions , where corresponds to the subset covariance . The following standard criteria [CV95, Puk06] are of primary interest to us:
A-optimality: ;

C-optimality: for some vector ;

D-optimality: ;

V-optimality: .
Other popular criteria (less relevant to our discussion) include E-optimality, (here, denotes the spectral norm), and G-optimality, .
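Since the inline formulas above did not survive formatting, the following sketch evaluates the four main criteria using their standard Bayesian forms from the cited literature [CV95, Puk06]. The names `X_S`, `A`, `X`, and `c` are our own, and the exact normalizations (e.g., the exponent in D-optimality) are assumptions consistent with the usual definitions, not quotations from this paper.

```python
import numpy as np

def bayesian_criteria(X_S, A, X=None, c=None):
    """Standard Bayesian optimality criteria on the subset covariance
    M = X_S^T X_S, regularized by the prior precision matrix A
    (assumed standard forms; see [CV95, Puk06])."""
    d = X_S.shape[1]
    # Posterior covariance (up to the noise scale): (M + A)^{-1}.
    P = np.linalg.inv(X_S.T @ X_S + A)
    out = {
        "A": float(np.trace(P)),               # A-optimality: trace
        "D": float(np.linalg.det(P) ** (1.0 / d)),  # D-optimality: det^{1/d}
    }
    if c is not None:
        out["C"] = float(c @ P @ c)            # C-optimality: quadratic form
    if X is not None:
        out["V"] = float(np.trace(X @ P @ X.T) / len(X))  # V-optimality
    return out
```

Each criterion is monotone: adding experiments to the subset can only shrink the posterior covariance, and hence the criterion value.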
The general task we consider is given as follows, where denotes :
Bayesian experimental design. Given an matrix , a criterion and , find a subset of size at most that minimizes .
Optimal value. Given , and , we denote the optimum as .
The prior work on this problem can be grouped into two research questions. The first asks what we can infer about using only spectral information about the problem, which is contained in the data covariance matrix . The second asks when there exists a polynomial-time algorithm for finding a -approximation of .
Question 1: Given only , and , what is the best upper bound on ?
Question 2: Given , and , can we efficiently find a -approximation of ?
A key aspect of both of these questions is how large the subset size has to be for us to provide useful answers. As a baseline, we should expect meaningful results when is at least (see the discussion in [AZLSW17]); in fact, for classical experimental design (i.e., when ), the problem becomes ill-defined when . In the Bayesian setting, we should be able to exploit the additional prior knowledge to achieve strong results even for . Intuitively, the larger the prior precision matrix , the fewer degrees of freedom we have in the problem. To measure this, we use the statistical notion of effective dimension [AM15].

Definition 1
For psd matrices and , let the effective dimension of be defined as . We will use the shorthand when referring to .
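As a concrete reference point, the standard effective-dimension formula from [AM15] can be computed directly. Since the displayed definition above was lost in formatting, the form used here, tr(B(A + B)^{-1}), is our assumption matching the cited literature rather than a verbatim quotation.

```python
import numpy as np

def effective_dimension(B, A):
    """Effective dimension d_A(B) = tr(B (A + B)^{-1}) for psd A and B
    (assumed standard form, following [AM15])."""
    return float(np.trace(B @ np.linalg.inv(A + B)))
```

The quantity always lies between 0 and rank(B): a stronger prior (larger A) decreases it, and rescaling B toward a smaller subset covariance decreases it further, which is the rescaling discussed next.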
Recently, [DW18b] obtained bounds on the Bayesian A/V-optimality criteria for , suggesting that is the right notion of degrees of freedom for this problem. We argue that can in fact be far too large an estimate, because it does not take the size into account when computing the effective dimension. Intuitively, since is computed using the full data covariance , it is not on the appropriate scale with respect to the smaller covariance . One way to correct this is to increase the regularization on from to and use as the degrees of freedom. Another way is to rescale the full covariance to and use as the degrees of freedom. In fact, since , these two approaches are identical. Note that , and this gap can be very large for some problems (see the discussion in Appendix B). The following result supports the above reasoning by showing that for any such that , there is an of size which satisfies . This not only improves on [DW18b] in terms of the supported range of sizes , but also in terms of the obtained bound (see Section 2 for a comparison).

Theorem 1
Let be A/C/D/V-optimality and be . For any such that ,
Remark 2
We give an -time algorithm for finding a subset that certifies this bound.
To establish Theorem 1, we propose a new sampling distribution , where is a vector of weights. This is a special regularized variant of a determinantal point process (DPP), a well-studied family of distributions [KT12] with numerous applications in sampling diverse subsets of elements. Given a psd matrix and a weight vector , we define as a distribution over subsets (of all sizes) such that:
A number of regularized DPPs have been proposed recently [Der19, DW18b], mostly within the context of Randomized Numerical Linear Algebra (RandNLA) [Mah11, DM16, DM17]. To our knowledge, ours is the first such definition that strictly falls under the traditional definition of a DPP [KT12]. We show this in Section 3, where we also prove that regularized DPPs can be decomposed into a low-rank DPP plus i.i.d. Bernoulli sampling (Theorem 5). This decomposition reduces the sampling cost from to , and involves a more general result about DPPs defined via a correlation kernel (Lemma 9), which is of independent interest.
To prove Theorem 1, in Section 4 we demonstrate a fundamental connection between the regularized DPP and Bayesian experimental design with precision matrix . For simplicity of exposition, let the weight vector be uniformly equal to . If and is any one of the A/C/D/V-optimality criteria, then:
(1) 
Theorem 1 follows by showing an inequality similar to (1a) when is restricted to subsets of size at most (proof in Section 4). When , then bears a close similarity to proportional volume sampling, an (unregularized) determinantal distribution proposed by [NSTT19]. That work used an inequality similar to (1a) to obtain approximation algorithms for A/D-optimal classical experimental design (Question 2 with ). However, the algorithm of [NSTT19] for proportional volume sampling takes time, making it practically infeasible. On the other hand, the time complexity of sampling from is only , and recent advances in RandNLA for DPP sampling [DWH18, DWH19, Der19] suggest that time is also possible. Extending the ideas of [NSTT19] to our new regularized DPP distribution, we obtain efficient approximation algorithms for A/C/D/V-optimal Bayesian experimental design.
Theorem 3
Let be A/C/D/V-optimality and be . If for some , then there is a polynomial-time algorithm that finds of size such that
Remark 4
The algorithm referred to in Theorem 3 first solves a convex relaxation of the task via a semidefinite program (SDP) to find the weights , then samples from the distribution times. The expected cost in addition to the SDP is .
Note that, unlike in Theorem 1, in Theorem 3 we use the unrescaled effective dimension instead of the rescaled one, . The actual effective dimension that applies here (given in the proof in Section 4) depends on the SDP solution. It is always upper bounded by , but it may be significantly smaller. This result is a direct extension of [NSTT19] to the Bayesian setting and to the C/V-optimality criteria. Moreover, in their case, proportional volume sampling is usually the computational bottleneck (because its time dependence on the dimension can reach ), whereas for us the cost of sampling is negligible compared to the SDP. A number of different methods can be used to solve the SDP relaxation (see Section 5). For example, [AZLSW17] suggest using an iterative optimizer called entropic mirror descent, which is known to exhibit fast convergence and can run in time, where is the number of iterations.
2 Related work
We first discuss the prior works that focus on bounding the experimental design optimality criteria without obtaining approximation algorithms. The first nontrivial bounds for the classical A-optimality criterion (with ) were shown by [AB13]. Their result implies that for any , and they provide polynomial-time algorithms for finding such solutions. The result was later extended by [DW17, DW18b, DW18a] to the case where , obtaining that for any , we have , and a faster -time algorithm was also provided. Their result can easily be extended to cover any psd matrix and V/C-optimality (but not D-optimality). The key improvements of our Theorem 1 are that we cover a potentially much wider range of subset sizes, because , and that our bound can be much tighter, because . Finally, [DCMW19] propose a new notion of minimax experimental design, which is related to A/V-optimality. They also use a determinantal distribution for subset selection; however, due to different assumptions, their bounds are incomparable.
A number of works have proposed approximation algorithms for experimental design which start by solving a convex relaxation of the problem and then use some rounding strategy to obtain a discrete solution (see Table 1 for a comparison). For example, [WYS17] gave an approximation algorithm for classical A/V-optimality with , where the rounding is done in a greedy fashion; some randomized rounding strategies are also discussed. [NSTT19] suggested proportional volume sampling for the rounding step and obtained approximation algorithms for classical A/D-optimality with . Their approach is particularly similar to ours (when ). However, as discussed earlier, while their algorithms are polynomial, they are virtually intractable. [AZLSW17] proposed an efficient algorithm with a approximation guarantee for a wide range of optimality criteria, including A/C/D/E/V/G-optimality, both classical and Bayesian, when . Our results improve on this work in two ways: (1) in terms of the dependence on for A/C/D/V-optimality, and (2) in terms of the dependence on the dimension (by replacing with ) in the Bayesian setting. A lower bound shown by [NSTT19] implies that our Theorem 3 cannot be directly extended to E-optimality, but no similar lower bound exists for G-optimality. We remark that the approximation approaches relying on a convex relaxation can generally be converted to an upper bound on akin to our Theorem 1; however, none of them apply to the regime , which is of primary interest in the Bayesian setting.
Purely greedy approximation algorithms have been shown to provide guarantees in a number of special cases of experimental design. One example is the classical D-optimality criterion, which can be converted to a submodular function [BGS10]. Greedy algorithms for the Bayesian A/V-optimality criteria have also been considered [BBKT17, CR17b]. These methods can only provide a constant-factor approximation guarantee (as opposed to ), and the factor is generally problem-dependent (which means it could be arbitrarily large). Finally, a number of heuristics with good empirical performance have been proposed, such as Fedorov’s exchange method [CN80]. However, in this work we focus on methods that provide theoretical approximation guarantees.

3 A new regularized determinantal point process
In this section we introduce the determinantal sampling distribution we use to obtain guarantees in Bayesian experimental design. Determinantal point processes (DPPs) form a family of distributions which are used to model repulsion between elements in a random set, with many applications in machine learning [KT12]. Here, we focus on the setting where we are sampling out of all subsets . Traditionally, a DPP is defined by a correlation kernel, which is an psd matrix with eigenvalues between 0 and 1, i.e., such that . Given a correlation kernel , the corresponding DPP is defined as , where is the submatrix of with rows and columns indexed by . Another way of defining a DPP, popular in the machine learning community, is via an ensemble kernel . Any psd matrix is an ensemble kernel of a DPP defined as:
Crucially, every is also a , but not the other way around. Specifically, we have:
but (b) requires that be invertible, which is not true for some DPPs. (This will be important in our analysis.) The classical algorithm for sampling from a DPP requires the eigendecomposition of either matrix or , which in general costs , followed by a sampling procedure which costs [HKP06, KT12].
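For intuition, both kernel definitions can be checked by brute force on a tiny example. The sketch below (with our own names: `L` for the ensemble kernel, `K` for the correlation kernel) uses the standard facts from the DPP literature [KT12] that an ensemble DPP has P(S) proportional to det(L_S) with normalizer det(I + L), and that the corresponding correlation kernel K = L(I + L)^{-1} has the inclusion marginals on its diagonal.

```python
import itertools
import numpy as np

def lensemble_probs(L):
    """Exact subset probabilities of an ensemble DPP:
    P(S) = det(L_S) / det(I + L)  (standard formula, e.g. [KT12])."""
    n = len(L)
    Z = np.linalg.det(np.eye(n) + L)  # normalizer: sum of det(L_S) over all S
    probs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            LS = L[np.ix_(S, S)]
            probs[S] = (np.linalg.det(LS) if S else 1.0) / Z
    return probs

def corr_kernel(L):
    """Correlation kernel of the same DPP: K = L (I + L)^{-1}."""
    n = len(L)
    return L @ np.linalg.inv(np.eye(n) + L)
```

Enumerating all subsets confirms both that the probabilities sum to one and that the marginal probability of including element i equals K[i, i]; for non-invertible kernels only the correlation-kernel description survives, which is the point made above.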
We now define our regularized DPP and describe its connection with correlation and ensemble DPPs.
Definition 2
Given a matrix , a sequence , and a psd matrix such that is full rank, let be a distribution over :
(2) 
The fact that this is a proper distribution (i.e., that it sums to one) can be restated as a determinantal expectation formula: if are independent Bernoulli random variables, then
which was shown by [DM19, Lemma 7], where . Our main result in this section is the following efficient algorithm for , which reduces it to sampling from a correlation DPP.
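The determinantal expectation formula of [DM19, Lemma 7] referenced here (whose display was lost in formatting) is, in one standard form, E[det(X_b^T X_b + A)] = det(X^T diag(p) X + A) for independent b_i ~ Bernoulli(p_i); it holds because the determinant is multi-affine in the b_i. A brute-force check, with all names our own:

```python
import itertools
import numpy as np

def expected_det(X, A, p):
    """Brute-force E[det(X_b^T X_b + A)] over independent b_i ~ Bernoulli(p_i),
    where X_b keeps the rows of X with b_i = 1."""
    n = len(X)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n):
        prob = np.prod([pi if b else 1 - pi for pi, b in zip(p, bits)])
        S = [i for i, b in enumerate(bits) if b]
        M = X[S].T @ X[S] + A if S else A.copy()
        total += prob * np.linalg.det(M)
    return float(total)

def closed_form(X, A, p):
    """Closed form det(X^T diag(p) X + A), valid since det is
    multi-affine in the Bernoulli variables."""
    return float(np.linalg.det(X.T @ np.diag(p) @ X + A))
```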
Theorem 5
For any , and a psd matrix s.t. is full rank, let
If are independent random variables, then .
Remark 6
Since the correlation kernel matrix has rank at most , the preprocessing cost of eigendecomposition is . Then, each sample costs only .
We prove the theorem in three steps. First, we express as an ensemble DPP, which requires some additional assumptions on and . Then, we convert the ensemble to a correlation kernel (eliminating the extra assumptions), and finally we show that this kernel can be decomposed into a rank- kernel plus Bernoulli sampling.

Algorithm (sketch): given the input, compute the correlation kernel, compute its SVD, sample from the low-rank DPP [HKP06], sample the independent Bernoulli variables, and return the combined subset.
Lemma 7
Given , and as in Theorem 5, assume that and are invertible. Then,
Proof Let . By Definition 2 and the fact that ,
which matches the definition of the L-ensemble DPP.
At this point, to sample from , we could simply invoke any algorithm for sampling from an ensemble DPP. However, this would only work for invertible , which in particular excludes the important case of corresponding to classical experimental design. Moreover, the standard algorithm would require computing the eigendecomposition of the ensemble kernel, which (at least if done naïvely) costs . Even after this is done, the sampling cost would still be , which can be considerably more than . We first address the issue of invertibility of the matrix by expressing our distribution via a correlation DPP.
Lemma 8
Given , , and as in Theorem 5 (without any additional assumptions), we have
When and are invertible, the proof (given in Appendix A) is a straightforward calculation. We then extend it to the general case via a limit argument with and , where .
Finally, we show that the correlation DPP arrived at in Lemma 8 can be decomposed into a smaller DPP plus Bernoulli sampling. In fact, in the following lemma we obtain a more general recipe for combining DPPs with Bernoulli sampling, which may be of independent interest. Note that if are independent random variables then .
Lemma 9
Let and be psd matrices with eigenvalues between 0 and 1, and assume that is diagonal. If and , then
4 Guarantees for Bayesian experimental design
In this section we prove our main results regarding Bayesian experimental design (Theorems 1 and 3). First, we establish certain properties of the regularized DPP distribution that make it effective in this setting. Even though the size of the sampled subset is random and can be as large as , it is also highly concentrated around its expectation, which can be bounded in terms of the effective dimension. This is crucial, since both of our main results require a subset of deterministically bounded size. Recall that the effective dimension is defined as a function . The omitted proofs are in Appendix A.
Lemma 10
Given any , and a psd matrix s.t. is full rank, let be defined as in Theorem 5. Then
Next, we show two expectation inequalities for the matrix inverse and matrix determinant, which hold for the regularized DPP. We use them to bound the Bayesian optimality criteria in expectation.
Lemma 11
Whenever is a well-defined distribution, it holds that
(3)  
(4) 
Corollary 12
Let be A/C/D/V-optimality. Whenever is well-defined,
Proof In the case of A-, C-, and V-optimality, the function is a linear transformation of the matrix , so the bound follows from (3). For D-optimality, we apply (4) as follows:
which completes the proof.
Finally, we present the key lemma that puts everything together. This result is essentially a generalization of Theorem 1, from which Theorem 3 also follows.
Lemma 13
Let be A/C/D/V-optimality and be . For some , let and assume that . If , then a subset of size can be found in time that satisfies
Proof Let be defined so that , and suppose that . Then, using Lemma 11, we have
Using Lemma 10 we can bound the expected size of as follows:
Let and . If , then . Since is a determinantal point process, is a Poisson binomial random variable, so for ,
For any , we have
Denoting as , Markov’s inequality implies that
Also, we showed that . Setting for a sufficiently large , we obtain that with probability , the random set has size at most and
We can sample from conditioned on and bounded as above by rejection sampling. When , the set is completed to with arbitrary indices. On average, samples from are needed, so the cost is for the eigendecomposition, for sampling, and for recomputing .
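The rejection-sampling step in the proof above can be sketched generically as follows; here `sample_S` stands in for a sampler from the regularized DPP, and `f` and `f_bound` for the criterion and the Markov-inequality threshold. All names and constants are ours, for illustration only.

```python
import numpy as np

def sample_conditioned(sample_S, k, f, f_bound, max_tries=1000):
    """Rejection-sampling sketch: draw subsets until both the size bound
    |S| <= k and the criterion bound f(S) <= f_bound hold. Since each
    event holds with constant probability, O(1) draws suffice on average."""
    for _ in range(max_tries):
        S = sample_S()
        if len(S) <= k and f(S) <= f_bound:
            # If |S| < k, the proof completes S to size k with
            # arbitrary indices; we return S as-is here.
            return sorted(S)
    raise RuntimeError("acceptance probability too low")
```

Because the acceptance probability is bounded below by a constant, rejection adds only a constant factor to the sampling cost, which is why the dominant costs remain the eigendecomposition and the per-sample work.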
To prove the main results, we use Lemma 13 with appropriately chosen weights .
Proof of Theorem 1 Let in Lemma 13. Then, we have and also . Since for any set of size , we have , the result follows.
Proof of Theorem 3 As discussed in [AZLSW17, BV04], the following convex relaxation of experimental design can be written as a semidefinite program and solved using standard SDP solvers:
(5) 
The solution satisfies . If we use in Lemma 13, then, observing that and setting for a sufficiently large , the algorithm in the lemma finds a subset such that
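To make the relaxation concrete (its display above did not survive formatting), a standard form consistent with [AZLSW17] is: minimize f over fractional weights p with sum(p) at most k and each p_i in [0, 1]. The sketch below runs entropic mirror descent, the optimizer mentioned earlier, on the A-optimality objective; the rescale-and-clip projection and all constants are our simplifications, not the paper's algorithm.

```python
import numpy as np

def relaxed_a_optimality(X, A, k, iters=200, lr=0.05):
    """Entropic mirror descent sketch for the continuous relaxation
        min_p  tr((X^T diag(p) X + A)^{-1})
        s.t.   sum(p) <= k,  0 <= p_i <= 1,
    in the spirit of [AZLSW17]. The projection step is simplified."""
    n, d = X.shape
    p = np.full(n, k / n)  # start from the uniform fractional solution
    for _ in range(iters):
        M_inv = np.linalg.inv(X.T @ (p[:, None] * X) + A)
        # d/dp_i tr(M^{-1}) = -x_i^T M^{-2} x_i
        grad = -np.einsum('ij,jk,ik->i', X, M_inv @ M_inv, X)
        p = p * np.exp(-lr * grad)            # multiplicative (entropic) update
        p = np.clip(p * (k / p.sum()), 0, 1)  # approximate projection
    return p
```

The fractional solution p would then serve as the weight vector fed into the regularized DPP for the rounding step.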