Consider a collection of experiments parameterized by -dimensional vectors , and let denote the matrix with rows . The outcome of the
th experiment is a random variable, where is the parameter vector of a linear model with prior distribution , and is independent noise. In experimental design, we have access to the vectors , for , but we are allowed to observe only a small number of outcomes for experiments we choose. Suppose that we observe the outcomes from a subset of experiments. The posterior distribution of given (the vector of outcomes in ) is:
where denotes the matrix with rows for . In the Bayesian framework of experimental design [CV95], we assume that the prior precision matrix of the linear model is known, and our goal is to choose so as to minimize some quantity (a.k.a. an optimality criterion) measuring the “size” of the posterior covariance matrix . This quantity is a function of the subset covariance . Note that if matrix
is non-invertible then, even though the prior distribution is ill-defined, we can still interpret it as having no prior information in the directions with eigenvalue 0. In particular, forwe recover classical experimental design, where the covariance matrix of given is . We will write the Bayesian optimality criteria as functions , where corresponds to the subset covariance . The following standard criteria [CV95, Puk06] are of primary interest to us:
C-optimality: for some vector ;
Other popular criteria (less relevant to our discussion) include E-optimality, (here, denotes the spectral norm) and G-optimality, .
The general task we consider is given as follows, where denotes :
Bayesian experimental design. Given an matrix , a criterion and ,
Optimal value. Given , and , we denote the optimum as .
The prior work around this problem can be grouped into two research questions. The first question asks what can we infer about just from the spectral information about the problem, which is contained in the data covariance matrix . The second question asks when does there exist a polynomial time algorithm for finding a )-approximation for .
Question 1: Given only , and , what is the upper bound on ?
Question 2: Given , and , can we efficiently find a -approximation for ?
A key aspect of both of these questions is how large the subset size has to be for us to provide useful answers. As a baseline, we should expect meaningful results when is at least (see discussion in [AZLSW17]), and in fact, for classical experimental design (i.e., when ), the problem becomes ill-defined when . In the Bayesian setting we should be able to exploit the additional prior knowledge to achieve strong results even for . Intuitively, the larger the prior precision matrix
, the fewer degrees of freedom we have in the problem. To measure this, we use the statistical notion ofeffective dimension [AM15].
For psd matrices and , let the -effective dimension of be defined as . We will use the shorthand when referring to .
Recently, [DW18b] obtained bounds on Bayesian A/V-optimality criteria for , suggesting that is the right notion of degrees of freedom for this problem. We argue that
can in fact be far too large of an estimate because it does not take into account the sizewhen computing the effective dimension. Intuitively, since is computed using the full data covariance , it is not in the appropriate scale with respect to the smaller covariance . One way to correct this is to increase the regularization on from to and use as the degrees of freedom. Another way is to rescale the full covariance to and use as the degrees of freedom. In fact, since , these two approaches are identical. Note that and this gap can be very large for some problems (see discussion in Appendix B). The following result supports the above reasoning by showing that for any such that , there is of size which satisfies . This not only improves on [DW18b] in terms of the supported range of sizes , but also in terms of the obtained bound (see Section 2 for a comparison).
Let be A/C/D/V-optimality and be . For any such that ,
We give an time algorithm for finding subset that certifies this bound.
To establish Theorem 1, we propose a new sampling distribution , where is a vector of weights. This is a special regularized variant of a determinantal point process (DPP), which is a well-studied family of distributions [KT12] with numerous applications in sampling diverse subsets of elements. Given a psd matrix and a weight vector , we define as a distribution over subsets (of all sizes) such that:
A number of regularized DPPs have been proposed recently [Der19, DW18b], mostly within the context of Randomized Numerical Linear Algebra (RandNLA) [Mah11, DM16, DM17]. To our knowledge, ours is the first such definition that strictly falls under the traditional definition of a DPP [KT12]. We show this in Section 3, where we also prove that regularized DPPs can be decomposed into a low-rank DPP plus i.i.d. Bernoulli sampling (Theorem 5). This decomposition reduces the sampling cost from to , and involves a more general result about DPPs defined via a correlation kernel (Lemma 9), which is of independent interest.
To prove Theorem 1, in Section 4 we demonstrate a fundamental connection between an -regularized DPP and Bayesian experimental design with precision matrix . For simplicity of exposition, let the weight vector be uniformly equal . If and is any one of the A/C/D/V-optimality criteria, then:
Theorem 1 follows by showing an inequality similar to (1a) when is restricted to subsets of size at most (proof in Section 4). When , then bears a lot of similarity to proportional volume sampling which is an (unregularized) determinantal distribution proposed by [NSTT19]. That work used an inequality similar to (1a) for obtaining -approximate algorithms in A/D-optimal classical experimental design (Question 2 with ). However, the algorithm of [NSTT19] for proportional volume sampling takes time, making it practically infeasible. On the other hand, the time complexity of sampling from is only , and recent advances in RandNLA for DPP sampling [DWH18, DWH19, Der19] suggest that time is also possible. Extending the ideas of [NSTT19] to our new regularized DPP distribution, we obtain efficient -approximation algorithms for A/C/D/V-optimal Bayesian experimental design.
Let be A/C/D/V-optimality and be . If for some , then there is a polynomial time algorithm that finds of size such that
The algorithm referred to in Theorem 3 first solves a convex relaxation of the task via a semi-definite program (SDP) to find the weights , then samples from the distribution times. The expected cost in addition to the SDP is .
Note that unlike in Theorem 3 we use the unrescaled effective dimension instead of the rescaled one, . The actual effective dimension that applies here (given in the proof in Section 4) depends on the SDP solution. It is always upper bounded by , but it may be significantly smaller. This result is a direct extension of [NSTT19] to Bayesian setting and to C/V-optimality criteria. Moreover, in their case, proportional volume sampling is usually the computational bottleneck (because its time dependence on the dimension can reach ), whereas for us the cost of sampling is negligible compared to the SDP. A number of different methods can be used to solve the SDP relaxation (see Section 5). For example, [AZLSW17] suggest using an iterative optimizer called entropic mirror descent, which is known to exhibit fast convergence and can run in time, where is the number of iterations.
2 Related work
We first discuss the prior works that focus on bounding the experimental design optimality criteria without obtaining -approximation algorithms. First non-trivial bounds for the classical A-optimality criterion (with ) were shown by [AB13]. Their result implies that for any , and they provide polynomial time algorithms for finding such solutions. The result was later extended by [DW17, DW18b, DW18a] to the case where , obtaining that for any , we have , and also a faster time algorithm was provided. Their result can be easily extended to cover any psd matrix and V/C-optimality (but not D-optimality). The key improvements of our Theorem 1 are that we cover a potentially much wider range of subset sizes, because , and our bound can be much tighter because . Finally, [DCMW19] propose a new notion of minimax experimental design, which is related to A/V-optimality. They also use a determinantal distribution for subset selection, however, due to different assumptions, their bounds are incomparable.
A number of works proposed -approximation algorithms for experimental design which start with solving a convex relaxation of the problem, and then use some rounding strategy to obtain a discrete solution (see Table 1 for comparison). For example, [WYS17] gave an approximation algorithm for classical A/V-optimality with , where the rounding is done in a greedy fashion, and some randomized rounding strategies are also discussed. [NSTT19] suggested proportional volume sampling for the rounding step and obtained approximation algorithms for classical A/D-optimality with . Their approach is particularly similar to ours (when ). However, as discussed earlier, while their algorithms are polynomial, they are virtually intractable. [AZLSW17] proposed an efficient algorithm with a -approximation guarantee for a wide range of optimality criteria, including A/C/D/E/V/G-optimality, both classical and Bayesian, when . Our results improve on this work in two ways: (1) in terms of the dependence on for A/C/D/V-optimality, and (2) in terms of the dependence on the dimension (by replacing with ) in the Bayesian setting. A lower bound shown by [NSTT19] implies that our Theorem 3 cannot be directly extended to E-optimality, but a similar lower bound does not exist for G-optimality. We remark that the approximation approaches relying on a convex relaxation can generally be converted to an upper bound on akin to our Theorem 1, however none of them apply to the regime of , which is of primary interest in the Bayesian setting.
Purely greedy approximation algorithms have been shown to provide guarantees in a number of special cases for experimental design. One example is classical D-optimality criterion, which can be converted to a submodular function [BGS10]. Also, greedy algorithms for Bayesian A/V-optimality criteria have been considered [BBKT17, CR17b]. These methods can only provide a constant factor approximation guarantee (as opposed to
), and the factor is generally problem dependent (which means it could be arbitrarily large). Finally, a number of heuristics with good empirical performance have been proposed, such as Fedorov’s exchange method[CN80]. However, in this work we focus on methods that provide theoretical approximation guarantees.
3 A new regularized determinantal point process
In this section we introduce the determinantal sampling distribution we use for obtaining guarantees in Bayesian experimental design. Determinantal point processes (DPP) form a family of distributions which are used to model repulsion between elements in a random set, with many applications in machine learning[KT12]. Here, we focus on the setting where we are sampling out of all subsets . Traditionally, a DPP is defined by a correlation kernel, which is an psd matrix with eigenvalues between 0 and 1, i.e., such that . Given a correlation kernel , the corresponding DPP is defined as
where is the submatrix of with rows and columns indexed by . Another way of defining a DPP, popular in the machine learning community, is via an ensemble kernel . Any psd matrix is an ensemble kernel of a DPP defined as:
Crucially, every is also a , but not the other way around. Specifically, we have:
but (b) requires that be invertible which is not true for some DPPs. (This will be important in our analysis.) The classical algorithm for sampling from a DPP requires the eigendecomposition of either matrix or , which in general costs , followed by a sampling procedure which costs [HKP06, KT12].
We now define our regularized DPP and describe its connection with correlation and ensemble DPPs.
Given matrix , a sequence and a psd matrix such that is full rank, let be a distribution over :
The fact that this is a proper distribution (i.e., that it sums to one) can be restated as a determinantal expectation formula: if are independent Bernoulli random variables, then
where was shown by [DM19, Lemma 7] and . Our main result in this section is the following efficient algorithm for which reduces it to sampling from a correlation DPP.
For any , and a psd matrix s.t. is full rank, let
If are independent random variables, then .
Since the correlation kernel matrix has rank at most , the preprocessing cost of eigendecomposition is . Then, each sample costs only .
We prove the theorem in three steps. First, we express as an ensemble DPP, which requires some additional assumptions on and to be possible. Then, we convert the ensemble to a correlation kernel (eliminating the extra assumptions), and finally show that this kernel can be decomposed into a rank kernel plus Bernoulli sampling.
Given , and as in Theorem 5, assume that and are invertible. Then,
Proof Let . By Definition 2 and the fact that ,
which matches the definition of the L-ensemble DPP.
At this point, to sample from , we could simply invoke any algorithm for sampling from an ensemble DPP. However, this would only work for invertible , which in particular excludes the important case of corresponding to classical experimental design. Moreover, the standard algorithm would require computing the eigendecomposition of the ensemble kernel, which (at least if done naïvely) costs . Even after this is done, the sampling cost would still be which can be considerably more than . We first address the issue of invertibility of matrix by expressing our distribution via a correlation DPP.
Given , , and as in Theorem 5 (without any additional assumptions), we have
When and are invertible, then the proof (given in Appendix A) is a straightforward calculation. Then, we use a limit argument with and , where .
Finally, we show that the correlation DPP arrived at in Lemma 8 can be decomposed into a smaller DPP plus Bernoulli sampling. In fact, in the following lemma we obtain a more general recipe for combining DPPs with Bernoulli sampling, which may be of independent interest. Note that if are independent random variables then .
Let and be psd matrices with eigenvalues between 0 and 1, and assume that is diagonal. If and , then
4 Guarantees for Bayesian experimental design
In this section we prove our main results regarding Bayesian experimental design (Theorems 1 and 3). First, we establish certain properties of the regularized DPP distribution that make it effective in this setting. Even though the size of the sampled subset is random and can be as large as , it is also highly concentrated around its expectation, which can be bounded in terms of the -effective dimension. This is crucial, since both of our main results require a subset of deterministically bounded size. Recall that the effective dimension is defined as a function . The omitted proofs are in Appendix A.
Given any , and a psd matrix s.t. is full rank, let be defined as in Theorem 5. Then
Next, we show two expectation inequalities for the matrix inverse and matrix determinant, which hold for the regularized DPP. We use them to bound the Bayesian optimality criteria in expectation.
Whenever is a well-defined distribution it holds that
Let be A/C/D/V-optimality. Whenever is well-defined,
Proof In the case of A-, C-, and V-optimality, the function
is a linear transformation of the matrixso the bound follows from (3). For D-optimality, we apply (4) as follows:
Let be A/C/D/V-optimality and be . For some , let and assume that . If , then a subset of size can be found in time that satisfies
Proof Let be defined so that , and suppose that . Then, using Theorem 11, we have
Using Lemma 10 we can bound the expected size of as follows:
Let and . If , then . Since is a determinantal point process, is a Poisson binomial r.v. so for ,
For any , we have
Denoting as , Markov’s inequality implies that
Also, we showed that . Setting for sufficiently large
we obtain that with probability, the random set has size at most and
We can sample from conditioned on
and bounded as above by rejection
sampling. When , the set is completed to
with arbitrary indices. On average, samples from
are needed, so the cost is for the
sampling and for recomputing .
To prove the main results, we use Lemma 13 with appropriately chosen weights .
The solution satisfies . If we use in Lemma 13, then observing that , and setting for sufficiently large , the algorithm in the lemma finds subset such that