On greedy heuristics for computing D-efficient saturated subsets

05/18/2019 · by Radoslav Harman, et al. · Comenius University in Bratislava

Let F be a set consisting of n real vectors of dimension m ≤ n. For any saturated, i.e., m-element, subset S of F, let vol(S) be the volume of the parallelotope formed by the vectors of S. A set S^* is called a D-optimal saturated subset of F, if it maximizes vol(S) among all saturated subsets of F. In this paper, we propose two greedy heuristics for the construction of saturated subsets performing well with respect to the criterion of D-optimality: an improvement of the method suggested by Galil and Kiefer for the initiation of D-optimal experimental design algorithms, and a modification of the Kumar-Yildirim method, the original version of which was proposed for the initiation of the minimum-volume enclosing ellipsoid algorithms. We provide geometric and analytic insights into the two methods, and compare them to the commonly used random and regularized greedy heuristics. We also suggest variants of the greedy methods for a large set F, for the construction of D-efficient non-saturated subsets, and for alternative optimality criteria.


1 Introduction

Consider a set F = {f_1, …, f_n} consisting of n vectors in R^m, m ≤ n, which we will call "regressors". Our aim is to select a k-element subset S of F, focusing on the "saturated" subsets, that is, the subsets of size k = m. The saturated subsets are natural initial solutions of various algorithms applied in statistics and computational geometry; note that in some of the applications the set F is the result of a random process, in others it is a deterministic set obtained by an algebraic construction. Typically, n is much larger than m.

To measure the quality of the subsets of F, saturated or non-saturated, we consider an "information-based" D-optimality criterion, which is widely used in the theory of optimal experimental design (e.g., [5], [20], [22], and [2]).

Define the information matrix of S ⊆ F as M(S) = Σ_{f ∈ S} f f^T; for the empty set, M(∅) is the m × m zero matrix. Then, the D-criterion is

φ(S) = det^{1/m}(M(S)). (1)

The D-optimal k-element subset is any subset S* of F that maximizes φ(S) in the class of all k-element subsets S of F.

To make the problem meaningful, we assume that F is "non-singular" in the sense that M(F) is a non-singular matrix, i.e., the n × m matrix with rows f_1^T, …, f_n^T has full column rank. This assumption is clearly equivalent to the existence of a saturated non-singular subset of F.

Note that for the case of saturated subsets, we can equivalently define D-optimality as follows: a saturated subset S is D-optimal if and only if it maximizes the volume vol(S) of the parallelotope formed by the vectors of S. Geometrically, a D-optimal saturated subset corresponds to the maximum-volume parallelotope formed by an m-tuple of vectors from F. For a general k ≥ m, a D-optimal k-element subset S of F maximizes the mean squared volume of the parallelotopes formed by m-tuples of the vectors of S; this is a direct consequence of the Cauchy-Binet formula.
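The determinant-volume identity for the saturated case is easy to verify numerically. The following R sketch (our own illustration, with an arbitrary random 3 × 3 example) checks that det(M(S)) equals the squared volume of the parallelotope:

```r
# Sketch: for a saturated subset S, det(M(S)) equals the squared volume of
# the parallelotope formed by the vectors of S (here, the rows of Fs).
set.seed(1)
Fs <- matrix(rnorm(9), 3, 3)   # three regressors of dimension m = 3
d1 <- det(crossprod(Fs))       # det(M(S)) with M(S) = Fs' Fs
d2 <- det(Fs)^2                # squared parallelotope volume
all.equal(d1, d2)              # the two quantities agree
```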

In the theory of optimal design of experiments, a D-optimal k-element subset is called a replication-free D-optimal exact experimental design of size k (e.g., [23], cf. [6]).^1 Computing such sets, or designs, is generally a difficult problem of discrete optimization; for a review of optimal design algorithms, see, e.g., Chapter 12 of [2], [9] and [18].

^1 For the saturated case, the problem of D-optimal exact design with replications and the problem of D-optimal exact design without replications are equivalent.

Let Ξ be the set of all probability measures on F. Clearly, if we find some

ξ* maximizing φ(ξ) = det^{1/m}(M(ξ)) over ξ ∈ Ξ, where M(ξ) = Σ_{f ∈ F} ξ(f) f f^T, (2)

then we obtain the following upper bound on φ(S) for any k-element subset S of F:

φ(S) ≤ k φ(ξ*). (3)

The probability measure ξ* from (2) is called the D-optimal approximate design for the linear regression model with regressors f_1, …, f_n. The D-optimal approximate designs and, consequently, the bounds (3), can be computed via a number of specialized algorithms of convex optimization (e.g., [34], [33], [11], cf. Chapter 9 of [21]).

The D-efficiency of a set S relative to a non-singular set R is defined as the ratio of their criterion values, eff(S | R) = φ(S)/φ(R). If S* is a D-optimal k-element subset, then the relative efficiency of a k-element subset S with respect to S* will be denoted by eff(S). Note that (3) provides a computable lower bound on eff(S).

The algorithms for D-optimal subsets can also be used for various purposes other than constructing maximum-volume (configurations of) parallelotopes or designs of statistical experiments. For instance, we may be interested in computing the MVEE^2 of F, i.e., the ellipsoid with the minimum volume among all ellipsoids covering the set F and centred in the origin, or the ellipsoid with the minimum volume among all ellipsoids covering F. It turns out that the computation of the MVEEs is equivalent to the computation of a D-optimal approximate design for the statistical model with regressors f_1, …, f_n in the centred case, and for a statistical model with suitably extended regressors in the general case. For details, see [26], [28], [29], [1], [30] or the introduction of [11].

^2 MVEE stands for "Minimum-Volume Enclosing Ellipsoid".

Another application of D-efficient subsets is in information-based data subselection. Consider a large sample consisting of points x_1, …, x_n (e.g., the vectors of covariates) and let f_i = f(x_i), i = 1, …, n, be the corresponding regressors, where f is an appropriate "regression function". Then, one can select an informative subsample of size k by choosing k regressors (and the corresponding points x_i) that make up a D-optimal or D-efficient subset (e.g., [4] and [32]).

To compute an information-based subsample, [32] provides a fast specialized deterministic algorithm, primarily for subsample sizes much larger than m. Note, however, that the algorithm, although fast and potentially very useful for big data, produces subsets that are not saturated and generally not particularly D-efficient, in some cases even singular. The authors of [4] also construct a subsample motivated by criteria used in experimental design, by solving two optimization problems: the first one on a possibly continuous superset of F, and the second one on F.

As shown above, there is a large number of algorithms for constructing D-optimal or D-efficient subsets of a general size. However, these algorithms often require a small and non-singular initiation subset, preferably as efficient as possible. In particular, the saturated subsets can be used to initiate algorithms for both optimal approximate designs and optimal exact designs. In many such algorithms, such as [34], [33], and [2], the initial solution suggested by the authors is either random or based on a regularized greedy heuristic (cf. Subsections 2.1 and 2.2), which may not be the ideal method. The main motivation of this paper was to analyse and compare alternative methods of generating such initial subsets.

2 Greedy heuristics for saturated subsets

A general form of the greedy procedure for the construction of saturated subsets is the following:

Scheme of greedy heuristics for saturated subsets
1. Set i = 1 and S_0 = ∅
2. Select f from F (S)
3. Set S_i = S_{i-1} ∪ {f}
4. If i < m, set i = i + 1 and return to 2
5. Output S_m

The main difficulty for the greedy construction of saturated subsets is that during the process of "saturation" we work with subsets of fewer than m elements, and such subsets are necessarily singular. That is, their criterion value is zero and the quality of such subsets cannot be directly compared via the D-criterion. In particular, the general greedy heuristic, which in each step adds the regressor that provides the highest increase of the objective function (e.g., [19]), cannot be used. In [24], the author circumvented this deficiency by considering optimality criteria that are not constantly zero for singular subsets, but we will not use this approach.

The methods described next solve the saturated subset problem in different ways, leading to various specifications of the critical step 2, which we call a saturation step and denote by (S).

The heuristic outlined above is sometimes called a "forward" method. However, there also exist "drop" or "backward" methods (e.g., [23] and [2], Section 12.4), where we begin with the full set F and delete the least informative regressors one by one. These, however, turn out to be slow, because their asymptotic complexity grows with n, and in applications n is usually a large number.

2.1 Random subsampling

The simplest method for obtaining a saturated subset is to select m regressors uniformly at random. Formally, in step (S) we select a random element of F not yet chosen. This method is extremely fast, but the random saturated subsets tend to have low efficiency. It is even possible that such a saturated subset is singular; this happens not only in theory, but also in applications, as shown in Example 1.

Example 1.

Consider the classical factorial linear regression model with m two-level factors, main effects only, and no constant term, so that the set of regressors is F = {−1, +1}^m. Suppose that we intend to construct a saturated experimental design at random. Then, the probability of obtaining a singular subset (i.e., a singular design) can be undesirably high, even for small values of m.
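The probability from Example 1 can be estimated by direct simulation. The R sketch below is our own illustration and assumes the regressor set is the full ±1 factorial set, as in our reading of Example 1:

```r
# Sketch (assumption: F is the full +/-1 factorial set {-1,+1}^m): Monte
# Carlo estimate of the probability that a uniformly random saturated
# subset of F is singular.
set.seed(1)
m <- 3
Fx <- as.matrix(expand.grid(rep(list(c(-1, 1)), m)))   # all 2^m regressors
p_singular <- mean(replicate(5000, {
  S <- Fx[sample(nrow(Fx), m), , drop = FALSE]         # random saturated draw
  abs(det(S)) < 1e-9                                   # singular?
}))
p_singular   # non-negligible already for small m
```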

To obtain more efficient subsets, methods for nonuniform random subsampling were proposed. In particular, the leverage-based random subsampling (e.g., [17] and [16]) selects f_i from F with probability proportional to its so-called leverage score h_i = f_i^T M^{-1}(F) f_i.^3 Note that nonuniform subsampling can be much more computationally demanding than uniform sampling. Moreover, the leveraging methods are also not guaranteed to produce a non-singular subset, especially if one seeks a saturated subset, as Example 2 demonstrates.

^3 Leverages based on subsets of F, called sensitivities in optimal design of experiments (e.g., [5]), naturally occur in most D-optimum design algorithms because they have an important analytic and statistical interpretation.

Example 2.

In Example 1, the information matrix M(F) is proportional to the identity matrix, and all regressors have the same norm. Therefore, all regressors have the same leverage score, which means that the leveraging method is equivalent to the uniform random subset selection, including the probabilities of generating a singular subset discussed in Example 1.
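For concreteness, leverage-weighted subsampling can be sketched in R as follows; the function name and the Gaussian test set are our own illustration, not code from the paper:

```r
# Sketch: leverage scores h_i = f_i' M(F)^{-1} f_i for the rows of Fx, and
# a leverage-weighted random draw of a saturated subset (illustrative only).
leverage_scores <- function(Fx) {
  Minv <- solve(crossprod(Fx))     # inverse of M(F) = Fx' Fx
  rowSums((Fx %*% Minv) * Fx)      # h_i for every row f_i of Fx
}
set.seed(1)
Fx <- matrix(rnorm(50 * 3), 50, 3)
h <- leverage_scores(Fx)           # the scores always sum to m
ind <- sample(nrow(Fx), ncol(Fx), prob = h / sum(h))   # weighted draw
```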

A non-zero probability of generating a singular subset for a given F is not necessarily a damaging feature;^4 we can simply use the method repeatedly until we hit a non-singular subset. See Subsection 3.3 for further discussion of this "multi-start" approach.

^4 Unless, of course, the probability of generating a singular subset is 1 or very close to 1.

2.2 The regularized greedy heuristic (RGH)

Besides the simple or the weighted random choice of a saturated set, the most common approach is based on a regularization of the information matrix of a sub-saturated set (see, e.g., Section 11.2 of [2]; cf. [31]). The regularized greedy heuristic (RGH) specifies step (S) as

(4)

Here, δ is a small positive constant; see [2] for the values recommended in the context of experimental design. The matrix determinant lemma directly implies that (4) is equivalent to

(5)

A short R implementation of the RGH based on an efficient computation of (5) is given in the Appendix. The asymptotic complexity of this algorithm is O(nm^3).

Numerical experience shows that the RGH usually finds a reasonably D-efficient saturated subset. However, a little-known fact is that, similarly to the random methods, the RGH is not guaranteed to provide a non-singular saturated subset, even if one exists:

Example 3.

Consider the RGH with a small δ for m = 3 and a suitably chosen set F. Then, the RGH first chooses one regressor, then a second, and then a third such that the resulting subset is singular, although a non-singular subset of size m does exist.

That is, we cannot fully rely on the non-singularity of the design produced by the RGH. Moreover, unlike the random methods, the RGH is deterministic (cf. Subsection 3.3), so there is no point in running it multiple times in search of an improvement. On the other hand, the RGH can also be applied to criteria other than D-optimality in a straightforward manner. We will show in the following sections that to construct a D-efficient saturated subset, it may be preferable to use modifications of the original Galil-Kiefer and Kumar-Yildirim methods.

2.3 The Galil-Kiefer method (GKM)

In [7] (see also [2], Section 12.4) Galil and Kiefer proposed to specify step (S) as

(6)

where λ_1(A) ≥ … ≥ λ_m(A) denote the eigenvalues of a non-negative definite m × m matrix A. That is, instead of maximizing the product of all m eigenvalues of the information matrix, which is zero for sub-saturated subsets, we maximize the product of its largest eigenvalues, which can be non-zero. Let S be an i-element subset and let F_S denote the i × m matrix whose rows are the vectors of S. Using the well-known fact (e.g., 6.54 of [25]) that the largest eigenvalues of M(S) = F_S^T F_S are equal to the eigenvalues of F_S F_S^T, we see that (6) is equivalent to

(7)

which was also noted in [7]. Formula (7) replaces the computation of eigenvalues by the simpler computation of determinants.

Although the approach of Galil and Kiefer appears to be reasonable, it seems that it has not been much analysed in the literature or used in applications. This is probably because the determinant form (7) is still computationally expensive. The paper [7] does not provide a more efficient way to update the determinants; the authors simply state that the standard formulas can be "implemented with appropriate changes".

In the sequel, we show a more efficient and geometrically interpretable form of the optimization step. The basic insight is that the determinant form of the GKM has a simple geometric interpretation: we choose the regressor that maximizes the volume of the parallelotope spanned by the already selected vectors together with the candidate regressor. Alternatively, we can formulate it as follows.

Theorem 1.

The GKM always produces a non-singular saturated subset of F if such a subset exists. Moreover, the optimization (6) (or (7)) is equivalent to

f ∈ argmax_{f ∈ F} ||(I_m − P) f||, (8)

where P is the projector on the column space of the matrix of the already selected regressors.

Proof.

If then is not of full rank and its determinant is zero; here denotes the column space. It follows that the GKM always chooses and by induction, we obtain that the rank of is for . Therefore, can be expressed as , where is an matrix of rank . The positive eigenvalues of are equal to the positive eigenvalues of . That is,

which is equal to by the block determinant formula (e.g. [12], Theorem 13.3.8). Moreover, does not depend on and . ∎

As a side result, we obtained that the GKM is guaranteed to produce a non-singular subset.

Clearly,

||(I_m − P) f||^2 = f^T f − f^T P f, (9)

therefore Theorem 1 implies that the algebraic approach based on the positive eigenvalues of the information matrix is geometrically a successive projection method: it projects the regressors onto the orthogonal complement of the span of the already selected regressors and then finds the projected regressor of the largest norm. In an intuitive sense, such a regressor provides the most information not already captured by the current subset.
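The projection step can be sketched directly in R. Here `gkm_step`, its arguments, and the toy data are our own illustration, assuming the rows of `Fs` are the regressors already selected:

```r
# Sketch of one GKM saturation step: project every candidate (row of Fx)
# onto the orthogonal complement of span(Fs) and return the index of the
# candidate with the longest projection.
gkm_step <- function(Fx, Fs) {
  Q <- qr.Q(qr(t(Fs)))                     # orthonormal basis of span(Fs)
  P <- diag(ncol(Fx)) - tcrossprod(Q)      # projector onto the complement
  which.max(rowSums((Fx %*% P)^2))         # largest squared projected norm
}
Fx <- rbind(c(1, 0, 0), c(0.9, 0.1, 0), c(0, 2, 0))
gkm_step(Fx, Fs = matrix(c(1, 0, 0), 1, 3))   # picks the third candidate
```

The second candidate is almost parallel to the chosen regressor, so despite its moderate norm it carries little new information; the projection correctly prefers the third one.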

Another observation is that the GKM is, roughly stated, a limit of the RGH as the regularization parameter δ tends to zero. First, we need a technical lemma.

Lemma 1.

Let , , . The set on the right hand side of (4) is equal to

(10)
Proof.

Let . Observe that . The non-zero eigenvalues of coincide with the non-zero eigenvalues of . It follows that is equal to , which can be expressed as

The term can be ignored in the maximization, and using the formula for determinant of a block matrix, (10) is equivalent to . Since the first term of the expression does not depend on , and the additive term does not affect the maximization, we obtain the desired result. ∎

Theorem 2.

The GKM is the limit of the RGH in the sense that, for any finite set F and the corresponding matrices and orthogonal projector, there is some δ_0 > 0 such that

(11)

for all positive δ ≤ δ_0.

Proof.

Let and let be such that

For any matrix , the pseudoinverse satisfies (e.g. [12], Theorem 20.7.1) and , therefore

for any . Since is finite, there is some such that

for all and all . Let and let be such that . Then , therefore . It follows that , i.e., by the previous lemma. ∎

If the regressors of F are the result of iid sampling from a continuous m-dimensional distribution, then the set on the RHS of (11) is a singleton with probability 1, i.e., both sides of (11) coincide. However, for a symmetric F, the right-hand side can be a large set, and the solutions of the RGH can form a strict subset of those of the GKM for any δ > 0; see Section 4, specifically panels (a), (b), (d) in Figure 2.

The GKM is not only guaranteed to produce a non-singular saturated subset; it is guaranteed to produce a subset with a positive lower bound on its efficiency, as we show next. First, a technical lemma.

Lemma 2.

Let S = {f_1, …, f_m}. Then, det(F_S F_S^T) = Π_{i=1}^m ||g_i||^2, where g_1 = f_1 and, for i ≥ 2, g_i can be defined by any of the three equivalent formulas:

g_i = (I_m − P_{i−1}) f_i, (12)
g_i = f_i − Σ_{j=1}^{i−1} (f_i^T g_j / g_j^T g_j) g_j, (13)
g_i = (I_m − g_{i−1} g_{i−1}^T / ||g_{i−1}||^2) ⋯ (I_m − g_1 g_1^T / ||g_1||^2) f_i. (14)

In the expressions above, P_{i−1} is the projection matrix to span{f_1, …, f_{i−1}} and I_m − P_{i−1} is the projection matrix to its orthogonal complement.

Proof.

From formula (13), it is clear that the transformation from to is in fact a Gram-Schmidt orthogonalization, where the obtained orthogonal vectors are not normalized. That is, for some upper triangular matrix with ones on the diagonal. It follows that , which yields the desired result .

Since are orthogonal, we have for any , and hence (14) is obtained. Formula (12) holds because . ∎
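Lemma 2 is easy to check numerically. The unnormalized Gram-Schmidt routine below is our own sketch, written for a saturated set whose regressors are the rows of `Fs`:

```r
# Sketch: unnormalized Gram-Schmidt of the rows of a saturated Fs; the
# product of the squared norms of the orthogonalized rows equals det(Fs)^2.
gram_schmidt <- function(Fs) {
  G <- Fs
  for (i in 2:nrow(Fs))
    for (j in 1:(i - 1))                   # subtract projections on g_1..g_{i-1}
      G[i, ] <- G[i, ] - sum(G[i, ] * G[j, ]) / sum(G[j, ]^2) * G[j, ]
  G
}
set.seed(1)
Fs <- matrix(rnorm(16), 4, 4)
G <- gram_schmidt(Fs)
all.equal(prod(rowSums(G^2)), det(Fs)^2)   # the identity of Lemma 2
```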

Theorem 3.

Let be the saturated subset of produced by the GKM. Then .

Proof.

Let and define , where the ’s are given by Lemma 2. Then . Let us also work with “normalized” information matrix , where is the size of . Without loss of generality, let us rotate the coordinate axes of , so that , which is possible because are orthogonal. Then, . Moreover, the set that consists of all points of the form satisfies and for any because all the elements of lie in . In particular, , where is the -optimal subset of size . It follows that . Therefore,

As a side result, we obtain an explicit formula for the D-criterion value of S in terms of the norms of the projected regressors g_i constructed during the GKM process.

A naïve implementation of the GKM using (8) is relatively time-consuming, as it requires the calculation of projection matrices and of the projections of the regressors. However, (8) depends only on the norms of the projected regressors, and formula (14) shows that the projected regressors can be calculated successively, by projecting out one chosen direction at a time. Therefore, the algorithm can "forget" the original regressors, which allows for the following efficient implementation of the GKM:

Efficient GKM
1. Set i = 1 and compute the squared norms of all regressors of F
2. Select the index j of a regressor with the maximal squared norm (S)
3. Set ind_i = j
4. If i < m, project the regressors onto the orthogonal complement of the chosen regressor, update the squared norms, set i = i + 1 and return to 2
5. Output ind

The efficient formulation of the GKM has asymptotic complexity O(nm^2). Its implementation in the programming language R is given in the Appendix.

The GKM is non-random, which is, in some situations, a drawback.^5 Galil and Kiefer were aware of this problem and proposed a randomisation based on first selecting some points at random and only then starting to apply the GKM. However, such an approach may lead to singular m-point subsets if the first random points were chosen "unluckily".

^5 For instance, if we use the saturated subsets for the initiation of deterministic heuristics, a randomized initiation method is crucial for the approach of multiple restarts.

Nevertheless, the GKM can also be randomised as follows: in step (S), instead of selecting the regressor that maximizes the norm of the projection, we may select the regressor at random, with the probabilities of the particular regressors proportional to a power of the norms of their projections; the higher the power is, the less random the procedure is.

Because the randomised GKM never chooses regressors that lie in the span of the already selected ones, it also arrives at a non-singular subset if such a subset exists, by the same argument as in the non-random case.
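Our reading of the randomised selection step can be sketched in R as follows; the function name and the tuning power `gamma` are our own illustrative choices:

```r
# Sketch (our illustration) of the randomised GKM step: given the squared
# norms v2 of the projected regressors, sample an index with probability
# proportional to v2^gamma; larger gamma makes the choice closer to greedy.
random_step <- function(v2, gamma = 5) {
  w <- v2^gamma
  sample(length(v2), 1, prob = w / sum(w))
}
set.seed(1)
random_step(c(0, 1, 4))   # never returns 1, since its weight is exactly zero
```

A regressor already in the current span has projected norm zero, hence sampling weight zero, which is precisely why the randomised GKM preserves non-singularity.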

2.4 The Kumar-Yildirim method (KYM)

Originally, the Kumar-Yildirim method was proposed in [15] to find initial points for computing the MVEE of a set of points, which need not be centred at the origin. In each iteration, the method chooses the points maximizing and minimizing the scalar product with a randomly chosen direction. Thus, the output is a set of 2m points.

But the D-optimal approximate design problem is equivalent to computing the MVEE centred at the origin and containing F. For such a problem, one can apply the original Kumar-Yildirim method with the formally added points −f_1, …, −f_n. Then, the method selects in each step both a point and its negative, which is redundant. It follows that it is enough to choose one of the two points.

Hence, the modified KYM can be stated as a special case of the general greedy heuristic where step (S) selects a regressor of F maximizing the absolute value of its scalar product with ψ, where ψ is a random direction from the orthogonal complement of the span of the already selected regressors.

It is not difficult to see that the output of the KYM is a non-singular saturated subset. However, we can formulate a stronger statement as follows.

Theorem 4.

Let be the saturated subset of produced by the KYM. Then

Proof.

The authors of [15] obtain the efficiency bound for the output of their algorithm by applying the results of the paper [3]. There, it is shown that a product bound holds, where each factor is equal to the difference appearing in the KYM steps, but for a general convex body. By considering the convex body conv(F ∪ (−F)), called the Elfving set in the optimal experimental design literature, we obtain for our setting that

where appears in the inequality because in each iteration of the algorithm in [3], one chooses for the maximizing .

Let . In [13], John showed that if is centred at the origin; hence in our settings. Combining the previous results, we obtain that

From the results mentioned, e.g., in [15] it follows that

where is the -optimal saturated subset and is the volume of the unit ball. Then,

and

For a large m, the Stirling approximation yields

Note that a direct application of the efficiency bound provided by Lemma 3.1 of [15] would give a weaker bound than that in Theorem 4. The bound in Theorem 4 is tighter because Kumar and Yildirim deal with general sets, while the centrally symmetric sets considered in this paper allow for stronger results.

An efficient implementation of the KYM in the programming language R is given in the Appendix. This implementation has asymptotic complexity O(nm^2), i.e., the same as the GKM implementation, although non-asymptotically the KYM tends to be faster than the GKM, with some exceptions.

Interestingly, the main idea of the KYM is similar to the projective interpretation of the GKM: both procedures seek a "good" new regressor in the space not yet spanned by the chosen regressors. However, the two methods measure the quality of a candidate regressor differently.

Note that both the GKM and the KYM are invariant under rotations and, unlike the RGH, they are also invariant with respect to a centred dilation of the regressors, i.e., invariant under a common change of the units of measurement if the regressors correspond to explanatory variables.

3 Variants and extensions

3.1 Subsets of sizes other than m

The greedy heuristics for saturated subsets can be easily modified to generate efficient non-saturated subsets, i.e., subsets of sizes other than m.

The problem of obtaining efficient supersaturated experimental designs (e.g., [8]) is very similar to the problem of constructing D-efficient subsets with k < m. In fact, the main idea of the Galil-Kiefer initiation for D-optimality, i.e., maximizing the product of the non-zero eigenvalues of the information matrix, is the same as the approach to D-optimal supersaturated designs in [14]. Note, however, that the motivation of Jones and Majumdar in [14] for working with this criterion arises from a minimum bias estimator of the model parameters, and that they provide analytical results on optimal supersaturated designs in a main-effects model, not design algorithms.

Based on a greedy heuristic for saturated subsets, we can formulate a general scheme of a greedy heuristic for subsets of an arbitrary size k. Simply stated, since the greedy heuristics can be stopped prematurely, for k > m we can repeatedly perform the heuristic for a subset of size m (or fewer than m elements in the last run), removing the obtained regressors from F after each run. We can generalize this idea so that at each selection step we select not a single regressor but a batch of regressors, which can be computationally advantageous. More precisely:

Scheme of greedy heuristics for subsets of size
1. Set and
2. Set
3. Select an -element subset of (S)
4. Set
5. If set and return to 2.
6. Output .

Of course, here we also need to keep and properly update the parameters that regulate step (S). A special case of this scheme provides the heuristics for saturated subsets, but note that the IBOSS method mentioned in the introduction and the original KYM are also special cases of the general scheme. The multitude of possible specifications of the above scheme and the comparison of the resulting algorithms is left for further research.
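The repeated-run idea can be sketched as follows; `greedy_k` and the argument `heur` (any saturated heuristic returning m row indices, such as the Appendix implementations) are our own illustrative names:

```r
# Sketch: build a k-element subset by repeated saturated runs of a heuristic
# `heur`, removing the chosen regressors from the pool after each run.
greedy_k <- function(Fx, k, heur) {
  pool <- seq_len(nrow(Fx)); chosen <- integer(0)
  while (length(chosen) < k) {
    take <- min(ncol(Fx), k - length(chosen))       # fewer than m in last run
    idx <- heur(Fx[pool, , drop = FALSE])[seq_len(take)]
    chosen <- c(chosen, pool[idx])                  # map back to original rows
    pool <- pool[-idx]
  }
  chosen
}
```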

3.2 A pre-selection strategy for a big F

Let F̃ be a uniformly randomly selected N-element subset of F, where N ≥ m. Let us estimate the probability that F̃ is non-singular. To this end, let f̃_1, f̃_2, … be uniformly randomly chosen from F with replacement, and for each consecutive group of m draws let A_i be the event that the corresponding m × m matrix is singular. Clearly, the events A_1, A_2, … are independent; therefore, the probability that they all occur is p raised to the number of groups, where p < 1 is the probability of each of the A_i. That is, the probability that an N-element uniformly selected random subset of F is singular vanishes at least exponentially with respect to N.

At the same time, the probability p is itself not too close to 1 for most sets F used in applications. Even for such a symmetric set as the one from Example 1, it is possible to show that p is bounded away from 1; cf. [27].

This motivates the following strategy to obtain an efficient saturated subset of :

  1. Randomly pre-select a moderately sized subset of F.

  2. Use either the GKM or the KYM to construct a saturated subset of the pre-selected set.

Note that the size of the pre-selected set does not depend on n, i.e., it can be small also for an enormous set F. Even if the pre-selected set is only moderately large, the probability that the resulting saturated subset is singular can be negligibly small.

We will show in Section 4 that the pre-selection strategy is a good compromise between speed and efficiency.
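The two-stage strategy can be sketched in a few lines of R; the pool size and the helper names are our own, and `heur` stands for any saturated heuristic such as the Appendix GKM:

```r
# Sketch of the pre-selection strategy: run a saturated heuristic `heur`
# on a moderate uniformly random pool of regressors (rows of Fx).
preselect <- function(Fx, N, heur) {
  pool <- sample(nrow(Fx), min(N, nrow(Fx)))   # random pre-selected pool
  pool[heur(Fx[pool, , drop = FALSE])]         # map pool indices back to Fx
}
```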

3.3 Multi-run approach

Let H be a randomized heuristic. Since H can generate subsets of various efficiencies, a natural idea is to run H multiple times and then select the best subset obtained. Note that in this respect the randomness inherent in the result can be a positive, not a negative feature of an algorithm.

Denote by X_1, X_2, … the iid random variables representing the efficiencies of the subsets obtained by independent runs of H, and let q be a fixed quantile of their common distribution. Then it can be shown that the number of runs of H needed for the median of the best efficiency found so far to reach q is roughly the reciprocal of the corresponding tail probability. In this sense, two heuristics are similarly good if they achieve similar efficiencies per unit of time, where the time of a single run of H includes the time needed to compute the quality of the resulting subset.

The performance of the multi-run approach is therefore determined by the speed of H as well as by the tails of the distribution of the efficiencies, and both quantities may be difficult to assess. However, numerical simulations for the randomized heuristics considered in this paper suggest that the variability of the efficiencies tends to be rather small; the performance is therefore mostly determined by the median efficiency. Hence, in the assessment of a heuristic, the median efficiency of the subsets resulting from a single run is a more important factor than the speed. See Section 4 for numerical results.
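A minimal multi-run wrapper can be sketched as follows; `multi_run` and the argument `heur` (any randomised heuristic returning m row indices) are our own illustrative names:

```r
# Sketch of the multi-run approach: run a randomised heuristic `heur`
# several times and keep the subset with the largest value of det(M(S)).
multi_run <- function(Fx, heur, runs = 10) {
  best <- NULL; best_det <- -Inf
  for (r in seq_len(runs)) {
    ind <- heur(Fx)
    d <- det(crossprod(Fx[ind, , drop = FALSE]))   # det(M(S)) of this run
    if (d > best_det) { best_det <- d; best <- ind }
  }
  best
}
```

Keeping the best of several runs also makes a singular outcome (determinant zero) vanishingly unlikely for the random methods of Subsection 2.1.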

3.4 Eigenvalue-based criteria

We say that a criterion is eigenvalue-based if it depends only on the eigenvalues of the information matrix. For instance, D-optimality can be expressed as a normalized product of the eigenvalues. Besides D-optimality, the second most important eigenvalue-based criterion is A-optimality, defined via the sum of the reciprocal eigenvalues for a non-singular information matrix and as zero for a singular one (e.g., [22], Chapter 6).

Clearly, the basic idea of the Galil-Kiefer method can be used for any eigenvalue-based criterion: we can simply specify (S) as selecting the regressor that maximizes an appropriate function of the non-zero eigenvalues of the information matrix. In some cases, such as A-optimality, this maximization can be performed much more easily than via a numerical evaluation of the eigenvalues. As an example, we can utilize the following theorem for A-optimality:

Theorem 5.

Let be of rank . Then,

(15)
Proof.

The positive eigenvalues of are equal to the positive eigenvalues of with the notation as in the proof of Theorem 1. Hence

which is equal to , where , based on the formula for the inverse of a block matrix ([12], Theorem 8.5.11). Observing that yields that the studied maximization problem is equivalent to the minimization of

(16)

Because does not depend on , the minimization of (16) is equivalent to the maximization of (5). ∎

That is, the GKM for -optimality is very similar to the GKM for -optimality, but -optimality requires an additional “correction term” to be added to (1).

It is unclear how the KYM can be tailored to A-optimality and other eigenvalue-based criteria. However, the efficiency of the D-optimal approximate design with respect to a very general class of eigenvalue-based criteria is bounded from below ([10]), which suggests that the KYM (as well as the GKM for D-optimality) will produce reasonably efficient subsets even for criteria other than D-optimality. This is also supported by our numerical experience. We leave the development of the heuristics for general eigenvalue-based criteria to further research.
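For reference, the two eigenvalue-based criteria can be evaluated in R as below; the normalizations are one common convention from the optimal design literature, not necessarily the exact ones used in this paper:

```r
# Sketch: two eigenvalue-based criteria of an information matrix M (the
# normalizations are a common convention, not necessarily the paper's).
crit_D <- function(M) det(M)^(1 / ncol(M))                # D-criterion
crit_A <- function(M) {                                   # A-criterion
  ev <- eigen(M, symmetric = TRUE, only.values = TRUE)$values
  if (min(ev) <= 0) return(0)                             # singular case
  ncol(M) / sum(1 / ev)                                   # harmonic mean
}
crit_D(diag(c(1, 4)))   # 2
crit_A(diag(c(1, 4)))   # 1.6
```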

4 Numerical comparisons

To provide a brief but comprehensive comparison of the performance of the greedy methods, we generate the sets F both at random and deterministically:

  1. We first generate a covariance matrix from the Wishart distribution and then independently from . We set , .

  2. We set .

The efficiencies of the saturated subsets are calculated relative to the optimal value computed via the REX algorithm proposed in [11]. We used a computer with a 64-bit Windows 10 system and an Intel Core i7-5500U processor. For the RGH, we used a small regularization parameter δ.

The results are exhibited in Figures 1 and 2. The numerical observations can be summarized as follows:

  • Both the RGH and the GKM tend to provide highly efficient subsets. For a random F, the results of the GKM and the RGH usually coincide, and in some cases the RGH gives slightly more efficient subsets than the GKM. On the other hand, the GKM tends to be as fast as or faster than the RGH.

  • The KYM is somewhat faster but less efficient than the GKM.

  • As one might expect, the random initiation (RND) is the fastest and the least efficient method. The leverage-weighted random sampling (RNDl) performs badly in all analysed cases. Both RND and RNDl occasionally fail to the extent that they provide singular subsets.

  • The GKM and the KYM applied to a random subsample (cf. Subsection 3.2), denoted by GKMf and KYMf in the figures, are both very fast and efficient. In some cases, the speed of GKMf and KYMf is only slightly lower than that of the random subsampling methods, yet, with a few exceptions, the efficiency is the same or almost the same as that of the GKM and the KYM applied to the full set F. Moreover, the computation times and the resulting efficiencies are relatively stable across the studied settings.

To assess the degree of efficiency improvement achievable by the multi-run approach described in Subsection 3.3, we computed the "time profiles" of the increase in the efficiency of the best subset found by repeated independent applications of the individual heuristics. Evidently, for all procedures the efficiency increases very slowly (notice that the horizontal axis denotes the logarithm of the time), i.e., for the performance of the multi-run method, the efficiency of a single run is crucial. Naturally, the numerical differences between the studied methods depend on a multitude of factors, such as the actual implementation used and the set F itself. Nevertheless, the exhibited figures are representative of the general behaviour of the heuristics that we observed.

Figure 1: The running times (in decadic logarithm) of the greedy saturated subsample heuristics and the D-efficiencies of the resulting subsets. Regressors generated from a lifted normal distribution, as detailed in the text.
Figure 2: The running times (in decadic logarithm) of the greedy saturated subsample heuristics and the -efficiencies of the resulting subsets. Regressors are deterministically set to .
Figure 3: The dependence of the best efficiency on the computation time achieved by the multistart approach. Regressors generated from a lifted -dimesional normal distribution () as detailed in the text.
Figure 4: The dependence of the best efficiency on the computation time achieved by the multistart approach. Regressors are deterministically set as detailed in the text. Note that for this model the RGH and the GKM produce a perfectly optimal saturated subset, i.e., one whose matrix of selected regressors is a Hadamard matrix.

5 Discussion

We showed that a good approach to the construction of D-efficient saturated subsets is to use the efficient version of the Galil-Kiefer method or the modified Kumar-Yildirim method. Nevertheless, the regularized greedy method and random sampling can also be reasonable choices for some applications: the regularized greedy method if we are more concerned about efficiency than speed, and random sampling in the opposite situation. These heuristics can be directly generalized to the construction of non-saturated subsets. For a large initial set, the pre-sampling strategy can be recommended. To minimize the risk of a singular subset, the multi-run approach can be used. Finally, the regularized greedy method and the closely related Galil-Kiefer method can be modified for use with criteria other than D-optimality.
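The generalization to non-saturated subsets is immediate for the regularized greedy rule: the same selection step is simply run for k > m iterations, since the delta*I regularization keeps the information matrix invertible throughout. The sketch below is our Python illustration of this extension (not the authors' code); it masks already-chosen indices so that the k selected rows are distinct.

```python
import numpy as np

def regularized_greedy_k(Fx, k, delta=1e-4):
    """Regularized greedy rule run for k steps (k > m gives a non-saturated
    subset). At each step, pick the unchosen regressor f maximizing
    f' M^{-1} f, where M = delta*I + sum of outer products of chosen rows."""
    n, m = Fx.shape
    M = delta * np.eye(m)
    avail = np.ones(n, dtype=bool)        # indices not yet chosen
    scores = np.sum(Fx**2, axis=1)        # proportional to f' M^{-1} f at start
    ind = []
    for _ in range(k):
        j = int(np.argmax(np.where(avail, scores, -np.inf)))
        ind.append(j)
        avail[j] = False
        M = M + np.outer(Fx[j], Fx[j])
        Minv = np.linalg.inv(M)
        scores = np.einsum('ij,jk,ik->i', Fx, Minv, Fx)  # f_i' M^{-1} f_i
    return np.array(ind)
```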

Appendix - R Scripts

Regularized greedy heuristic

function(Fx, del=1e-4) {
# Fx is the n x m matrix of regressors
# del is the regularization parameter delta
  m <- ncol(Fx); one <- rep(1, m)
  M <- del * diag(m)                 # regularized information matrix
  j <- which.max((Fx^2) %*% one)     # regressor with the largest norm
  ind <- rep(0, m); ind[1] <- j
  for(i in 2:m) {
    M <- M + tcrossprod(Fx[j, ])
    # pick the regressor f maximizing f' M^{-1} f
    j <- which.max(((Fx %*%
      t(chol(solve(M))))^2) %*% one)
    ind[i] <- j
  }
  return(ind)
}

Galil-Kiefer method

function(Fx) {
# Fx is the n x m matrix of regressors
  m <- ncol(Fx)
  v2 <- (Fx^2) %*% rep(1, m)         # squared norms of the regressors
  j <- which.max(v2)
  ind <- rep(0, m); ind[1] <- j
  for(i in 2:m) {
    # project all regressors onto the orthogonal complement of Fx[j, ]
    scp <- Fx %*% Fx[j, ]
    Fx <- Fx - scp %*% t(Fx[j, ]) / v2[j]
    v2 <- v2 - scp^2 / v2[j]         # updated squared norms
    j <- which.max(v2); ind[i] <- j
  }
  return(ind)
}

Kumar-Yildirim method

function(Fx) {
# Fx is the n x m matrix of regressors
  m <- ncol(Fx); P <- diag(m)        # projector onto the complement
  # farthest regressor along a random direction
  j <- which.max(abs(Fx %*% rnorm(m)))
  ind <- rep(0, m); ind[1] <- j
  for(i in 2:m) {
    fx <- P %*% Fx[j, ]
    P <- P - tcrossprod(fx) / sum(fx^2)
    # random direction within the complement of the chosen regressors
    j <- which.max(abs(Fx %*%
                      (P %*% rnorm(m))))
    ind[i] <- j
  }
  return(ind)
}

References

  • Ahipasaoglu [2015] S. D. Ahipasaoglu. Fast algorithms for the minimum volume estimator. Journal of Global Optimization, 62:351–370, 2015.
  • Atkinson et al. [2007] A. C. Atkinson, A. Donev, and R. Tobias. Optimum experimental designs, with SAS. Oxford University Press, New York, 2007.
  • Betke and Henk [1993] U. Betke and M. Henk. Approximating the volume of convex bodies. Discrete Computational Geometry, 10:15–21, 1993.
  • Drovandi et al. [2017] C. C. Drovandi, C. C. Holmes, J. M. McGree, K. Mengersen, S. Richardson, and E. G. Ryan. Principles of experimental design for big data analysis. Statistical Science, 32:385–404, 2017.
  • Fedorov [1972] V. V. Fedorov. Theory of optimal experiments. Academic Press, New York, 1972.
  • Fedorov [1989] V. V. Fedorov. Optimal design with bounded density: optimization algorithms of the exchange type. Journal of Statistical Planning and Inference, 22:1–13, 1989.
  • Galil and Kiefer [1980] Z. Galil and J. Kiefer. Time- and space-saving computer methods, related to Mitchell’s DETMAX, for finding D-optimum designs. Technometrics, 22:301–313, 1980.
  • Gilmour [2006] S. G. Gilmour. Factor screening via supersaturated designs. In A. Dean and S. Lewis, editors, Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics, pages 169–190, New York, 2006. Springer.
  • Goos and Jones [2011] P. Goos and B. Jones. Optimal design of experiments: a case study approach. Wiley, Chichester, 2011.
  • Harman [2004] R. Harman. Lower bounds on efficiency ratios based on -optimal designs. In A. Di Bucchianico, H. Läuter, and H. P. Wynn, editors, Proceedings of the 7th International Workshop on Model-Oriented Design and Analysis, pages 77–84. Heeze, 2004.
  • Harman et al. [2019] R. Harman, L. Filová, and P. Richtárik. A randomized exchange algorithm for computing optimal approximate designs of experiments. Journal of the American Statistical Association, online first, 2019.
  • Harville [1997] D. A. Harville. Matrix Algebra From A Statistician’s Perspective. Springer-Verlag, New York, 1997.
  • John [1948] F. John. Extremum problems with inequalities as subsidiary conditions. In Studies and Essays in Honor of R. Courant, pages 187–204. Wiley, New York, 1948.
  • Jones and Majumdar [2014] B. Jones and D. Majumdar. Optimal supersaturated designs. Journal of the American Statistical Association, 109:1592–1600, 2014.
  • Kumar and Yildirim [2005] P. Kumar and E. A. Yildirim. Minimum volume enclosing ellipsoids and core sets. Journal of Optimization Theory and Applications, 126:1–21, 2005.
  • Ma and Sun [2015] P. Ma and X. Sun. Leveraging for big data regression. WIREs Computational Statistics, 7:70–76, 2015.
  • Ma et al. [2015] P. Ma, M. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16:861–911, 2015.
  • Mandal et al. [2015] A. Mandal, W. K. Wong, and Y. Yu. Algorithmic searches for optimal designs. In A. Dean, M. Morris, J. Stufken, and D. Bingham, editors, Handbook of Design and Analysis of Experiments. Chapman & Hall/CRC, Boca Raton, 2015.
  • Nemhauser et al. [1978] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14:265–294, 1978.
  • Pázman [1986] A. Pázman. Foundation of Optimum Experimental Design. Reidel Publ., Dordrecht, 1986.
  • Pronzato and Pázman [2013] L. Pronzato and A. Pázman. Design of Experiments in Nonlinear Models. Springer, New York, 2013.
  • Pukelsheim [1993] F. Pukelsheim. Optimal design of experiments. Wiley, New York, 1993.
  • Rasch et al. [1997] D. A. M. K. Rasch, E. M. T. Hendrix, and E. P. J. Boer. Replication-free optimal designs in regression analysis. Computational Statistics, 12:19–52, 1997.
  • Sagnol [2013] G. Sagnol. Approximation of a maximum-submodular-coverage problem involving spectral functions, with application to experimental designs. Discrete Applied Mathematics, 161:258–276, 2013.
  • Seber [2008] G. A. Seber. A Matrix Handbook for Statisticians. John Wiley & Sons, New Jersey, 2008.
  • Silvey and Titterington [1973] S. D. Silvey and D. H. Titterington. A geometric approach to optimum design theory. Biometrika, 60:21–32, 1973.
  • Tao and Vu [2007] T. Tao and V. Vu. On the singularity probability of random Bernoulli matrices. Journal of the American Mathematical Society, 20:603–628, 2007.
  • Titterington [1975] D. M. Titterington. Some geometrical aspects of D-optimality. Biometrika, 62:313–320, 1975.
  • Titterington [1978] D. M. Titterington. Estimation of correlation coefficients by ellipsoidal trimming. Journal of the Royal Statistical Society: Series C, 27:227–234, 1978.
  • Todd [2016] M. J. Todd. Minimum-Volume Ellipsoids: Theory and Algorithms. SIAM, Philadelphia, 2016.
  • Vuchkov [1977] I. N. Vuchkov. A ridge-type procedure for design of experiments. Biometrika, 64:147–150, 1977.
  • Wang et al. [2018] H. Wang, M. Yang, and J. Stufken. Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, online first, 2018.
  • Yang et al. [2013] M. Yang, S. Biedermann, and E. Tang. On optimal designs for nonlinear models: a general and efficient algorithm. Journal of the American Statistical Association, 108:1411–1420, 2013.
  • Yu [2011] Y. Yu. D-optimal designs via a cocktail algorithm. Statistics and Computing, 21:475–481, 2011.