A Randomized Exchange Algorithm for Computing Optimal Approximate Designs of Experiments

01/17/2018
by   Radoslav Harman, et al.

We propose a class of subspace ascent methods for computing optimal approximate designs that covers both existing as well as new and more efficient algorithms. Within this class of methods, we construct a simple, randomized exchange algorithm (REX). Numerical comparisons suggest that the performance of REX is comparable to, or better than, the performance of state-of-the-art methods across a broad range of problem structures and sizes. We focus on the most commonly used criterion of D-optimality, which also has applications beyond experimental design, such as the construction of the minimum volume ellipsoid containing a given set of data points. For D-optimality, we prove that the proposed algorithm converges to the optimum. We also provide formulas for the optimal exchange of weights in the case of the criterion of A-optimality. These formulas enable one to use REX for computing A-optimal and I-optimal designs.


1 Introduction

The topic of this paper is the computation of optimal approximate designs for regression models with uncorrelated errors (e.g., [11], [28], [31], [3]). As a special case, we also briefly discuss the minimum volume enclosing ellipsoid problem (e.g., [44]).

Suppose that we intend to perform an experiment consisting of a set of trials. Assume that the observed response of each trial depends on a design point chosen from a finite design space. For simplicity, but without loss of generality, we identify the design points with their indices. Note that in experimental design problems, the size of the design space can be large, often many thousands of points.

For each design point, the corresponding real-valued observation is assumed to satisfy a linear regression model: the mean response is the scalar product of a known regressor and a vector of unknown model parameters, and the error has an unknown positive variance. It is straightforward to apply the algorithms proposed in this paper to the computation of locally optimal designs for non-linear regression models; in this sense, we gain clarity and do not lose generality by formulating our method for linear regression (see [3], chapter 17). We assume homoskedasticity and normality of the errors only for the sake of simplicity; our algorithms can be applied to all models with uncorrelated errors with finite variances after a simple transformation of the problem ([3], chapter 23). For different trials, the errors are assumed to be uncorrelated. We let the model be non-singular in the sense that the regressors span the full parameter space, and we avoid redundant design points by assuming that no regressor is the zero vector. Note that in applications, the number of parameters tends to be relatively small.

We formalize an (approximate experimental) design as a vector with one non-negative component per design point, with the components summing to one. (Often, an approximate experimental design is formalized as a probability measure on the design space, but for a finite design space, the representations by a probability measure and by a vector of probabilities are equivalent.) From the point of view of an experimenter, each component of a design represents the proportion of the trials to be performed at the corresponding design point. The support of a design is the set of design points with strictly positive weights. (Note that there are theoretical reasons, corroborated by extensive numerical evidence, that optimal designs tend to be sparse in the sense of having a support much smaller than the design space; more precisely, there always exists an optimal design supported on a set whose size is bounded by a quantity that depends only on the number of parameters.) The set of all designs is the probability simplex, which is compact and convex.

The information matrix associated with a design is defined as the design-weighted sum of the outer products of the regressors. Under our model’s assumptions, the information matrix of a design is proportional to the Fisher information matrix for the unknown parameters. Therefore, the general aim is to choose a design whose information matrix is “as large as possible”, which we make precise next.
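A minimal R sketch of this definition, assuming the standard form of the information matrix as the design-weighted sum of outer products of the regressors; the regressor matrix F (one row per design point) and the weight vector w are illustrative names introduced here, not notation from the paper.

    # Information matrix of a design: the weighted sum of outer products of the regressors.
    # F is an n x m matrix whose i-th row is the regressor of the i-th design point;
    # w is a vector of non-negative weights summing to one (both names are illustrative).
    info.matrix <- function(F, w) {
      crossprod(F * sqrt(w))   # equals t(F) %*% diag(w) %*% F
    }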

Let be the set of symmetric non-negative definite matrices, and let be a criterion of optimality, that is, a function measuring the “size” of information matrices. A design is said to be a -optimal design if it maximizes in the class of all designs:

(1)

Matrix is referred to as the -optimal information matrix.

Due to their natural statistical interpretations, the two most common optimality criteria are -optimality and -optimality. In this paper, we use them in the forms (e.g., [31]) and , respectively, for any positive definite matrix . For a singular non-negative definite matrix , the values of the criteria are defined to be . Functions and are positively homogeneous, continuous, and concave on the set of all non-negative definite matrices. Hence, for both - and -optimality, the optimal design always exists, and the optimal information matrix is non-singular.
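As a concrete illustration, the following R sketch assumes the standard positively homogeneous forms of the two criteria, namely (det M)^(1/m) for D-optimality and m/trace(M^{-1}) for A-optimality (cf. [31]), with the value 0 assigned to singular matrices; these specific expressions are assumptions consistent with the properties listed above rather than quotations of the paper's definitions.

    # D- and A-criteria in positively homogeneous form; 0 is returned for singular matrices.
    Phi.D <- function(M) {
      dt <- determinant(M, logarithm = TRUE)        # numerically safe log-determinant
      if (dt$sign <= 0) return(0)
      exp(as.numeric(dt$modulus) / nrow(M))         # (det M)^(1/m)
    }
    Phi.A <- function(M) {
      if (rcond(M) < .Machine$double.eps) return(0)
      nrow(M) / sum(diag(solve(M)))                 # m / trace(M^{-1})
    }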

Let a design be called regular if its information matrix is non-singular, and consider the set of all regular designs. For this paragraph, let a regular design be fixed. The variance function of this design is the vector that assigns to each design point the value of the quadratic form of the inverse information matrix at the corresponding regressor; it is thus a function that maps regular designs to vectors with one component per design point. From the statistical point of view, the value of the variance function at a design point is proportional to the variance of the best linear unbiased estimator (BLUE) of the mean response at that point, provided that the approximate design can actually be carried out (hence the term). We also note the identity

(2)

which connects the variance function with a version of the D-optimality criterion evaluated along the segment between the current design and the singular design concentrated at a single design point (formally, a standardized unit vector, i.e., a vertex of the probability simplex). This explains why the variance function is often used in iterative algorithms for computing D-optimal designs to select “the most promising direction” of change of the current design.
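A minimal R sketch of the variance function, assuming its usual definition as the quadratic form of the inverse information matrix evaluated at each regressor; it reuses the illustrative info.matrix helper sketched above.

    # Variance function of a regular design: d_i(w) = f(x_i)' M(w)^{-1} f(x_i) for each i.
    variance.function <- function(F, w) {
      M.inv <- solve(info.matrix(F, w))        # requires a regular (non-singular) design
      rowSums((F %*% M.inv) * F)
    }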

For A-optimality, the directional derivative of the criterion at a regular design in the direction of a vertex of the probability simplex also admits a simple closed form. This allows us to define a vector that plays, for A-optimality, the role that the variance function plays for D-optimality.

In the following, this sensitivity vector refers either to the variance function or to its A-optimality analogue, depending on the criterion under consideration. Besides indicating the relative importance of the design points, the vector can be used to compute a natural stopping rule for algorithms, based on the notion of statistical efficiency, as follows.

The - and -efficiencies of a design relative to are (see [31])

Because both criteria are positively homogeneous, the notion of efficiency as defined above can be interpreted in terms of the relative numbers of trials needed to achieve a given criterion value: if a design has a certain efficiency relative to another design, then, asymptotically, the latter design requires only that fraction of the trials needed by the former to achieve the same amount of information as measured by the criterion. For a D-optimal design and an A-optimal design, the efficiencies of any regular design can be bounded from below as follows (see [31], Ch. 6):

(3)

If the efficiency of the current design relative to the optimal design is almost equal to one, there is no practical reason to continue the computation.
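As an illustration of such a stopping rule for D-optimality, the following sketch uses the classical Kiefer-Wolfowitz-type bound, i.e., the number of parameters divided by the maximum of the variance function, as a lower bound on D-efficiency; this is assumed to be in the spirit of (3), whose exact statement is not reproduced here.

    # Lower bound on the D-efficiency of w relative to the (unknown) D-optimal design.
    d.efficiency.bound <- function(F, w) {
      ncol(F) / max(variance.function(F, w))
    }
    # A natural stopping rule: stop as soon as d.efficiency.bound(F, w) >= 1 - 1e-6, say.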

1.1 Methods for computing optimal designs of experiments

In this paper, we are concerned with the numerical computation of (approximate) optimal designs as given in (1), with a special focus on D- and A-optimality. Within the field of optimal experimental design, the first contributions to solving this problem were made by V. V. Fedorov and H. Wynn (e.g., [11] and [50]), who developed the so-called vertex direction methods (VDMs), which are closely related to the Frank-Wolfe algorithm. In each iteration, VDMs move the current design in the direction of a chosen vertex of the probability simplex while decreasing all remaining components of the design by the same proportion; the vertex can be chosen to maximize the variance function. Although these methods converge, some of them monotonically, they tend to be inconveniently slow. More efficient variants (e.g., [4]) also allow the decrease of a single component of the current design. A related alternative approach, the vertex exchange method (VEM) for D-optimality, was proposed by Böhning [6].

The VEM algorithm (see Algorithm 1 for pseudo-code) is one of the simplest special cases of the class of algorithms presented in this paper. In each iteration, VEM selects the pair consisting of the support point of the current design with the smallest value of the variance function and the design point with the largest value of the variance function; we call such a pair a Böhning pair of design points. VEM then computes the step-length that maximizes the criterion along the exchange of weight within this pair. Such a step-length, which we call the Böhning step, can be computed analytically. After each exchange, the variance function is recomputed.

input : Regressors of the model, required efficiency , maximum time
output : Approximate design
1 Generate a regular -point design ;
2 while  do
3      Let belong to
4       Let belong to
5       Let belong to
6       ;
7 end while
Algorithm 1 A variant of Böhning's vertex exchange method (VEM). The value of the efficiency bound is calculated as a lower bound on the efficiency of the current design based on a standard formula (see (3)). The time variable measures the time elapsed from the start of the computation.
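The following is a hedged R sketch of the analytic exchange on which VEM relies for D-optimality (cf. [6], [48] and formula (8) in the Appendix); the clipping to the interval that keeps both weights non-negative and the treatment of linearly dependent regressors are standard choices stated here as assumptions, not a transcription of the paper's formulas.

    # Optimal amount of weight moved from design point k to design point j for D-optimality.
    # M.inv is the inverse information matrix of the current design; f.j and f.k are the two
    # regressors, w.j and w.k the two current weights (illustrative argument names).
    d.exchange <- function(M.inv, f.j, f.k, w.j, w.k) {
      d.j  <- drop(t(f.j) %*% M.inv %*% f.j)
      d.k  <- drop(t(f.k) %*% M.inv %*% f.k)
      d.jk <- drop(t(f.j) %*% M.inv %*% f.k)
      denom <- d.j * d.k - d.jk^2              # > 0 when f.j and f.k are linearly independent
      alpha <- if (denom > 0) (d.j - d.k) / (2 * denom) else if (d.j >= d.k) w.k else -w.j
      min(w.k, max(-w.j, alpha))               # clip so that both updated weights remain >= 0
    }

Applying d.exchange to a Böhning pair and updating the two weights accordingly constitutes one VEM-type exchange.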

A substantially different strategy for solving the problem (1), involving simultaneously updating all components of , is implemented in the multiplicative algorithm (MUL). The MUL and its variants can be attributed to [40]; extensions can be found in [10] and [17].

The three methods of VDM, VEM and MUL were more recently combined by Yu ([48]) in the “cocktail” algorithm for -optimality. This approach often increases the speed while preserving convergence.

Another efficient recent method is presented in [46]. It is a simplicial decomposition algorithm in which the master non-linear problem is solved by a second-order Newton-type method; it can be used for various twice-differentiable concave criteria, including both D- and A-optimality. See also [45] for an earlier application of simplicial decomposition in the optimal design of experiments and for further references.

The problem of searching for an optimal design can also be solved by specific mathematical programming methods. Namely, in [49], it is shown that the problem of A-optimality can be cast as semidefinite programming, while D-optimality can be reduced to maxdet programming. More recently, the work [38] made it possible to solve both A- and D-optimal design problems by a second-order cone programming approach, for the approximate design case without additional constraints; for an improvement that also enables solving exact design problems and approximate design problems with non-standard constraints using second-order cone programming, see [39].

The optimal design problem is closely related to the minimum volume enclosing ellipsoid (MVEE) problem. Considerable attention has been paid to developing fast algorithms for the MVEE problem in mathematical programming and optimization literature. For the latest developments, see [43], [2], and [44].

Consider a finite set of points spanning the full space. The task of finding the minimum-volume ellipsoid containing this set can be cast as a convex optimization problem whose dual is a D-optimal design problem. Thus, the problem of finding the D-optimal design is equivalent to the MVEE problem (see, e.g., [44]), making the task of developing a fast and efficient algorithm for computing D-optimal designs relevant for a wider optimization community.
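As a brief illustration of this duality in the origin-symmetric setting (the MVEE of the points together with their reflections through the origin), the shape matrix of the optimal ellipsoid can be recovered from a D-optimal design; the normalization used below follows the classical equivalence (see, e.g., [44]) and is stated as an assumption, since the formulas above are not reproduced here.

    # Given a D-optimal weight vector w for the rows of F, the origin-centred MVEE of the
    # points and their reflections is { x : t(x) %*% H %*% x <= 1 } with H = M(w)^{-1} / m.
    mvee.shape <- function(F, w) {
      solve(info.matrix(F, w)) / ncol(F)
    }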

With D-optimality being the most popular criterion, the literature focusing on the computational aspects of A-optimality is scarcer, although in some areas the use of A-optimality is very natural. In particular, the computation of the important I-optimal designs, which appear in the literature under several alternative names (e.g., [8], [13]), can be converted into an A-optimal design problem (cf. [3], Section 10.6). That is, I-optimal designs on a finite design space can also be computed using the algorithm developed in this paper. In addition to the mathematical programming methods mentioned above, a generalization of the vertex direction method that focuses on A-optimality, together with an analytical formula for the optimal VDM step-length, is presented in [1].

1.2 Summary of contributions

In Section 2, we introduce a general class of methods for finding optimal approximate designs that we call the subspace ascent method (SAM) and describe some known algorithms as special cases.

In Section 3, we present our key method: REX (randomized exchange algorithm). REX is a simple batch-randomized exchange algorithm that is a member of the SAM family and whose performance is superior to that of the state-of-the-art algorithms in nearly all problems we tested. The proposed algorithm can be viewed as an efficient extension or combination of both the VEM algorithm and the KL exchange algorithm that is used to compute exact designs (see [3]). In the same section, we compare our method to known subspace ascent/descent methods that were recently developed in the optimization and machine learning communities, highlighting the similarities and differences. The numerical comparison of the state-of-the-art algorithms with the newly proposed method is the subject of Section 4.

In the Appendix, we provide a proof of convergence of the proposed algorithm in the case of D-optimality. Additionally, many of the methods mentioned in the existing literature have only been tailored to D-optimality, and although the transition to A-optimality is possible, it is not trivial. Therefore, in the Appendix, we also present formulas for the optimal exchange of weights under A-optimality, which, to the best of our knowledge, were previously derived only for the D-optimality criterion.

2 Subspace ascent method

In Section 2.1, we propose a generic algorithmic paradigm for solving (1). It helps illustrate several existing approaches and the newly proposed method from a broader perspective. We refer to this paradigm as the “Subspace Ascent Method” (SAM), formalized as Algorithm 2. Subsequently, in Section 3.2, we briefly comment on the context of the “subspace ascent” idea within existing optimization, linear algebra and machine learning literature.

2.1 SAM

SAM (Algorithm 2) is an iterative procedure for finding an approximate optimal design. We start with an arbitrary regular design. In each iteration, a subset of the design points is chosen via a certain rule (e.g., a small random subset of all design points), and only the weights of the selected design points are allowed to change; all other weights are frozen at their last values. That is, we deliberately restrict the search to the subset of all designs lying in the intersection of the probability simplex and a particular affine subspace. Note that the current design always belongs to this restricted set; in particular, there exists a design within the restricted set that satisfies the ascent condition.

input : Initial regular design
output : Approximate design
1 Set
2 while  does not satisfy a stopping condition do
3      Select a subset of the set of design points
4       Define the active subspace of as
5       Compute as an approximate solution of satisfying the ascent condition
6       Set
7 end while
Algorithm 2 Subspace Ascent Method (SAM)

If , then is a singleton and the method cannot progress. As soon as , this problem is removed, and has a chance to contain points better than .

SAM is a generic method in the sense that it does not specify how the set is chosen or how the approximate solution is obtained. Formally, any optimum design algorithm for a discrete design space (including VDMs and the MUL) is a special case of SAM if we allow the choice . However, the intended philosophy of the SAM approach is to make the sets much smaller than . This makes the partial optimization problem small-dimensional and potentially simple to solve, at least approximately.

The algorithm VEM chooses the active set to be a two-point set, which allows the partial optimization problem to be solved analytically for D-optimality (and, as we show in the Appendix, for A-optimality). However, this method requires an update of the variance function after every single exchange, which is a computationally non-trivial operation for a large design space. Therefore, the speed of VEM declines significantly as the size of the design space increases.

Similarly, SAM also includes the algorithm YBT ([46]), which chooses to be the support of the current design enriched by one additional support point . Then, a partial optimization problem is solved by a specific constrained Newton method. However, this approach is generally inefficient for larger values of because then the sets tend to be large and the efficiency of the Newton method deteriorates rapidly as the dimension of the problem increases (as demonstrated in Section 4).

3 Randomized exchange algorithm

In this section, we propose a particular case of SAM for which we coin the name randomized exchange algorithm (REX). Then, we briefly comment on the connection of this method to the existing literature on mathematical optimization and machine learning.

3.1 REX

Let the current design be regular and consider the associated sensitivity vector with one component per design point, as defined in the introduction. Recall that the value of this vector at a design point is a criterion-based estimate of the plausibility that the point is a support point of an optimal design.

The general idea behind the proposed batch-randomized algorithm is to repeatedly select a batch of several pairs of design points (the number of pairs is not fixed and may vary from iteration to iteration) and to perform optimal exchanges of weights within these pairs in a random order. The selection of the batch depends on the sensitivity vector evaluated at the best available design, and its size can be fine-tuned by a tuning parameter. We will call the resulting algorithm the randomized exchange algorithm (REX). We now proceed to a description of the REX algorithm (see Algorithm 3 for a concise pseudo-code).

  1. LBE step. Given the current design, compute the sensitivity vector and subsequently perform the “leading Böhning exchange” (LBE) as follows:

    (4)

    where

    An optimal step is called “nullifying” if it is equal to either or .

  2. Subspace selection. The subspace in which the method will operate arises as the union of two sets: one is formed by a greedy process informed by the elements of the sensitivity vector, and the other is identical to the support of the current design.

    1. Greedy. Set , and choose points

      where corresponds to the th largest component of the vector .

    2. Support. Set

      Let be the size of : .

    3. Active subspace. The active set of design points is defined as

      The weights of no other design points are modified in this iteration. Note that if or if , then . However, a standard situation is , and our method operates in a proper subspace of the full design space .

  3. Subspace step. We now perform a specific update of the design points in . That is, we allow for to be updated. Elements for are kept unchanged.

    1. Formation of pairs. Let be a uniform random permutation of , and let be a uniform random permutation of . Form the sequence

      (5)

      of pairs of active design points.

    2. Update. If the LBE step was nullifying, sequentially perform all nullifying -optimal exchanges between the pairs in (5) with the corresponding updates of and . If the LBE was not nullifying in the current iteration, sequentially perform all -optimal exchanges between the pairs in (5) with the corresponding updates of and .

  4. Stopping rule. If a stopping rule is satisfied, stop. Otherwise, iterate by returning to step 1.

We remark that for the criteria of - and -optimality, we have rapid explicit formulas providing the optimal exchanges in steps 1 and 3b of REX; see Appendix A. However, in principle, REX can also be applied to any other concave (even non-differentiable) optimality criterion using numerical procedures for the one-dimensional convex optimization on an interval, provided that we find a suitable analogy to the function .
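For instance, a criterion-agnostic version of the pairwise exchange can be sketched in R with one-dimensional numerical optimization over the admissible interval of step-lengths; Phi stands for any concave criterion, such as the Phi.D and Phi.A sketches above, and the helper below is illustrative rather than part of the REX implementation.

    # Optimal weight exchange between design points j and k for a general concave criterion
    # Phi, found numerically on the admissible interval [-w_j, w_k] (closed forms exist for
    # the D- and A-criteria; this generic fallback only needs Phi itself).
    generic.exchange <- function(Phi, F, w, j, k) {
      if (w[j] + w[k] <= 0) return(0)          # nothing can be exchanged
      g <- function(a) {
        w.new <- w
        w.new[j] <- w.new[j] + a
        w.new[k] <- w.new[k] - a
        Phi(info.matrix(F, w.new))
      }
      optimize(g, lower = -w[j], upper = w[k], maximum = TRUE)$maximum
    }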

REX makes explicit use of the fast optimal weight exchange between two design points, which makes it much less dependent on than YBT. Moreover, REX requires less frequent re-computations of the function , which leads to significant computational savings compared to VEM for large . Note that REX is very simple. In particular, it is simpler than the hybrid cocktail algorithm of Yu [48]. Moreover, unlike the method of [48], the result does not depend on the ordering of the design points. Finally, as we demonstrate, in most cases, it is also numerically more efficient than both of the state-of-the-art methods proposed in [48] and [46].

input : Regressors , parameter , threshold efficiency , maximum time
output : Approximate design
1 Generate a random regular design
2 while  and  do
3      Perform the LBE in as given by (4).
4       Let be the vector corresponding to a random permutation of the elements of
5       Let be the vector corresponding to a random permutation of the indices of the greatest elements of
6       for  in  do
7            for  in  do
8                  Find in
9                   if the LBE was not nullifying or or  then
10                        
11                   end if
12                  
13             end for
14            
15       end for
16      
17 end while
Algorithm 3 Randomized exchange algorithm (REX). The value of is the lower bound on the current design efficiency compared to the optimal design. For - or -optimality, it is possible to use formulas (3). The variable determines the time from the beginning of the computation. The optimal step-length can be computed by explicit formulas given in the appendix (in the case of - or -optimality) or by a procedure for the one-dimensional convex optimization on a finite interval.
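To make the pseudo-code concrete, here is a hedged R sketch of a single REX iteration for D-optimality. It reuses the illustrative helpers info.matrix, variance.function and d.exchange introduced earlier, recomputes the inverse information matrix from scratch for readability (the actual implementation relies on low-rank updates), and is meant to illustrate the control flow rather than to reproduce the authors' code; details such as the exact batch size and tie-breaking may differ.

    # One REX iteration for D-optimality; gamma tunes the size of the greedy batch.
    rex.iteration <- function(F, w, gamma = 4) {
      m <- ncol(F); n <- nrow(F)
      M.inv <- solve(info.matrix(F, w))
      d <- variance.function(F, w)
      # 1. LBE step: least promising support point versus most promising point overall.
      supp <- which(w > 0)
      k <- supp[which.min(d[supp])]
      j <- which.max(d)
      a <- d.exchange(M.inv, F[j, ], F[k, ], w[j], w[k])
      lbe.nullifying <- (a == w[k]) || (a == -w[j])
      w[j] <- w[j] + a; w[k] <- w[k] - a
      # 2. Active set: support of the current design plus a greedy batch of points.
      supp <- which(w > 0)
      K <- supp[sample.int(length(supp))]                      # support, in random order
      greedy <- order(d, decreasing = TRUE)[seq_len(min(ceiling(gamma * m), n))]
      L <- greedy[sample.int(length(greedy))]                  # greedy batch, in random order
      # 3. Randomized pairwise exchanges within the active set.
      for (k in K) for (l in L) {
        M.inv <- solve(info.matrix(F, w))     # sketch only; rank-two updates are much faster
        a <- d.exchange(M.inv, F[l, ], F[k, ], w[l], w[k])
        if (!lbe.nullifying || a == w[k] || a == -w[l]) {      # cf. step 3b of REX
          w[l] <- w[l] + a; w[k] <- w[k] - a
        }
      }
      w
    }

A full run would repeat rex.iteration until a stopping rule, such as the efficiency bound sketched earlier, is satisfied.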

3.2 Connection to existing literature on subspace descent methods

Subspace ascent/descent methods of various types have recently been extensively studied in the optimization [26, 36, 12], linear algebra [22, 21, 14], machine learning [41, 32, 20] and image processing [7] literature. Of particular importance are the randomized and greedy variants.

Randomized subspace descent methods operate by taking steps in random subspaces of the domain of interest according to some distribution over subspaces (all of which typically have a relatively small and fixed dimension) that is fixed throughout the iterative process. Once a subspace has been identified, a gradient, Newton or a proximal point step is usually taken in the subspace. These methods have been found to scale to extremely large dimensions for certain applications [36, 37] (e.g., billions of unknowns), and have become the state of the art for a variety of big data analysis tasks. Randomized selection rules are, in practice, often better than cyclic rules that pass through the subspaces in a fixed predefined order. Intuitively, this happens because randomization helps avoid/break unfavorable orderings. It was recently shown by Sun and Ye [42] that there is a large theoretical gap between randomized and cyclic rules (proportional to the square of the dimension of the problem), which favors randomized rules.

Greedy subspace descent methods operate by taking steps in subspaces selected greedily according to some potential/importance function of interest. Again, once a subspace has been identified, typically a gradient, Newton or a proximal point step is taken in the subspace. The ideal greedy method employs the maximum improvement rule, which requires that at any given iteration, one should choose the subspace that leads to the maximum improvement in the objective. Since this is almost always impractical, one needs to resort to approximating the maximum improvement rule by an efficiently implementable proxy rule. Whether one should use a greedy or a randomized method depends on the problem structure, as both have applications in which they dominate.

Let us now highlight some of the key similarities and differences between REX and existing subspace descent methods:

  1. Support plays a role. The subspace selection rule in REX has a greedy component through the inclusion of the set . However, unlike most existing subspace descent methods, it also has an “active set / support” component through the inclusion of .

  2. Greedy subspaces. Greedy subspace rules beyond subspaces of dimension one have not been explored in the optimization and machine learning literature. To the best of our knowledge, the only exception is the work of Csiba and Richtárik [9]. However, both their greedy subspace rule and the problem that they tackle are different from ours.

  3. Subspace step. While existing subspace descent methods typically rely on traditional deterministic (as opposed to stochastic) optimization steps such as gradient or Newton steps, REX performs a randomized pairwise exchange step in the subspace. We use pairwise exchange steps because, for the most important optimal design criteria, they can be calculated exactly in closed form, which facilitates fast computations. This is similar to the sequential minimal optimization technique of Platt [29], which is influential in the machine learning literature. Methods with randomized steps performed in the subspace, such as REX, are rare in the optimization literature. One of the earliest examples of such an approach is the CoCoA framework of Jaggi et al. [19], aimed at distributed primal-dual optimization in machine learning. Both their subspace subroutine and the problem that they tackle are very different from ours.

  4. Random reshuffling. Randomization in REX is present in the form of a random permutation of the pairs of active design points. Methods relying on random permutations can be seen as a hybrid between stochastic and cyclic methods. These approaches are not well understood theoretically, due to the intrinsic difficulty of capturing their behavior using complexity analysis. The work of Gürbüzbalaban et al. [15] provides one of the first insights into this problem. However, their techniques do not apply to our problem or algorithm.

  5. Nonseparable constraints. Virtually all modern stochastic and greedy subspace descent methods in optimization apply to unconstrained problems (problems with a convex constraint admitting a fast, i.e., closed-form, projection are considered as simple as unconstrained problems) or to problems with a separable constraint structure, as this property is crucial in the analysis of these methods. One of the early works that explicitly tackles non-separable constraints is that of Necoara et al. [25]. However, their work applies to linear constraints only.

  6. Smoothness. The vast majority of papers on stochastic and greedy subspace descent methods assume that the objective function has a Lipschitz continuous gradient (a property often called smoothness in the machine learning literature). This is not true in our case; hence, fundamentally different convergence analysis techniques are needed. Only a handful of subspace descent methods do not rely on this assumption. One of them is the stochastic Chambolle-Pock algorithm [7], a stochastic extension of a famous state-of-the-art method in the area of computational imaging: the Chambolle-Pock method.

4 Numerical comparisons

In this section, we compare the performance of an R ([33]) implementation of REX with two state-of-the-art algorithms for computing D-optimal designs of experiments. As the first competitor, we used the cocktail algorithm (CO), more precisely its R implementation made available by the author of CO (see [48]). There are two variants of CO: one has a pre-specified neighborhood structure, and the other computes nearest neighbors on the fly; for each model, we selected the variant that performed better. As the second competitor, we selected the Newton-type method of [46] (YBT). For a fair comparison, the SAS code kindly provided by the authors of YBT was converted into R with as high fidelity as possible. The R codes of REX are available online; note that it is possible to use the codes directly, without any commercial software and without technical knowledge of the algorithms.

In the following two subsections, we compare the competing algorithms for two very different, generic -optimum design problems with widely varying sizes. This approach provides comprehensive information about the relative strengths of the methods in a broad range of situations. However, we remark that the numerical comparisons must be taken with a grain of salt because they depend on a multitude of factors. Different implementations and different hardware capabilities can lead to somewhat different relative results.

In all computations with REX, we used a fixed value of the tuning constant and uniformly random initial designs. We used a computer with a 64-bit Windows 10 operating system and an Intel Core i7-5500U processor.

4.1 Quadratic models

Consider the full quadratic regression model on a multidimensional cube. That is, the observations are modeled as

(6)

with an intercept, all linear and pure quadratic terms of the factors, and all pairwise interactions. We discretized the cube to obtain a lattice of candidate design points, so that the set of regressors corresponds to the quadratic-model expansions of the lattice points. These are representatives of structured models, such as those used in response surface methodology (see, e.g., [24]). Each of the above-mentioned algorithms was run for a range of problem sizes (see Figure 1).
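For concreteness, a hedged R sketch of how the regressor matrix of such a benchmark can be built; the grid resolution g, the cube dimension d and the ordering of the model terms are illustrative choices, not the exact setup behind Figure 1.

    # Regressors of a full quadratic model on a regular grid in the d-dimensional cube
    # [-1, 1]^d with g points per axis: intercept, linear terms, pairwise interactions
    # and pure quadratic terms.
    quad.regressors <- function(d, g) {
      grid <- expand.grid(rep(list(seq(-1, 1, length.out = g)), d))
      Q <- as.matrix(grid)^2
      colnames(Q) <- paste0(colnames(Q), ".sq")   # pure quadratic terms
      cbind(model.matrix(~ .^2, data = grid), Q)
    }
    # Example: quad.regressors(d = 3, g = 11) yields an 11^3 x 10 regressor matrix.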

Figure 1: Typical performance of the algorithms CO (black dotted line), YBT (red dashed line) and REX (blue solid line) for D-optimality in the case of quadratic models (6) on a lattice discretization of the cube. The vertical axis shows the log-efficiency of the current design relative to the D-optimal design; the horizontal axis corresponds to the computation time in seconds. Each algorithm was run repeatedly to illustrate the degree of variability.

The numerical results confirm the claim of [46] that YBT is superior to CO for a large size of the design space and a small number of model parameters (see panels (b) and (f) of Figure 1). However, for a smaller design space, particularly if the number of parameters is large, the algorithm CO tends to perform better than YBT (cf. panels (a), (c) and (e) of Figure 1). A similar observation can also be drawn for other models, such as the one presented in the next subsection.

Importantly, the performance of REX is comparable to, or better than, that of both state-of-the-art algorithms for all problem configurations that we explored.

4.2 Random models

As a second example, we chose a very different problem in which the regressors are drawn independently from a multivariate normal distribution. These models represent unstructured and unordered optimal design situations in which the possible covariates of the model are drawn from a random population. The results are shown in Figure 2.
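A short sketch of how such a benchmark can be generated, assuming (as the description suggests but does not state explicitly here) that the regressors follow a standard multivariate normal distribution; the sizes are illustrative.

    # Unstructured benchmark: n random regressors of dimension m, drawn from the standard
    # normal distribution (an assumption; the sizes n and m below are illustrative).
    set.seed(1)
    n <- 5000; m <- 10
    F <- matrix(rnorm(n * m), nrow = n, ncol = m)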

Figure 2: Typical performance of the algorithms CO (black dotted line), YBT (red dashed line) and REX (blue solid line) for D-optimality in the case of models with Gaussian random regressors. The vertical axis shows the log-efficiency of the current design relative to the D-optimal design; the horizontal axis corresponds to the computation time. Each algorithm was run repeatedly.

The random models again show the superior performance of the proposed algorithm. The differences in favor of REX are even more pronounced compared to the quadratic models. This is possibly caused by the fact that on the unstructured set of regressors, procedures such as CO that depend on the ordering of design points are disadvantaged. Note that the algorithm REX is independent of the ordering of the design points.

4.3 Further remarks on the numerical comparisons

A-optimality. In addition to D-optimality, we performed a comparison of CO, YBT and REX for several problems under the criterion of A-optimality. Note that the algorithm CO was originally developed for D-optimality only, but the formulas given in the Appendix (Subsection A.2) allowed us to modify it for the computation of A-optimal designs. For A-optimality, the relative efficiencies of the three algorithms were similar to the D-optimality case, although the advantage of REX is generally less pronounced.

Other algorithms for D-optimality. For some models with a small number of design points and a very large number of parameters, the classical methods, particularly the multiplicative algorithm, outperform all three “state-of-the-art” algorithms (REX, YBT, CO), at least in an initial part of the computation. However, in the vast majority of test problems, the classical methods cannot compete with the three modern algorithms. A comprehensive comparison of all available methods is beyond the scope of this paper.

Disadvantages of REX. We believe it is important to inform the reader of the disadvantages of our method that we are aware of. First, the method occasionally generates streaks of only slowly increasing criterion values, which are, however, usually followed by a rapid improvement. Second, we were not able to prove convergence to the optimum except for the case of D-optimality. (Note that the convergence of the criterion values per se follows trivially from the monotonicity of REX; however, convergence to the optimal value of the problem at hand seems to be difficult to prove for a general concave criterion.) Nevertheless, we also tested REX for other criteria, and the algorithm converged in all test problems.

5 Conclusions

As we have numerically demonstrated, the randomized exchange algorithm (REX) largely outperforms recent state-of-the-art methods for computing D-optimal approximate designs of experiments, particularly in cases where the model consists of randomly generated or otherwise unstructured regressors. Overall, in comparison to the competing methods, the performance of REX deteriorates much less when both the size of the design space and the number of parameters increase. It is also worth noting that REX is concise, has relatively low memory requirements, and does not utilize any advanced mathematical programming solvers. In addition, for D-optimality, we theoretically proved the convergence of REX to the correct optimum. To adapt the proposed algorithm to the A-optimality criterion, we derived appropriate vertex-exchange formulas.

REX offers many possibilities for further development. Besides the exploration of its convergence properties for a general concave criterion, there is room for improvement in an optimized, perhaps adaptive, selection of the parameter that regulates the batch size. One way to further increase the speed of the algorithm is to incorporate specific methods of initial design construction and deletion rules for non-supporting points (see [16] for D-optimality and [30] for a wider class of criteria).

Moreover, it is possible to employ REX for the computation of optimal designs on continuous (infinite) design spaces. One approach is to choose a large number of random points inside the continuous design space, use the speed of REX on unstructured design spaces to construct the optimal approximate design on the finite sub-sample, and then fine-tune the design by using standard routines of constrained continuous optimization. However, a numerical and theoretical analysis of this methodology is beyond the scope of this paper.

Acknowledgments The work of the first two authors was supported by Grant Number 1/0521/16 from the Slovak Scientific Grant Agency (VEGA). The last author acknowledges support through the KAUST baseline research funding scheme.

References

  • [1] Ahipasaoglu SD (2015): “A first-order algorithm for the A-optimal experimental design problem: a mathematical programming approach.” Statistics and Computing, Vol 25.6, pp. 1113-1127.
  • [2] Ahipasaoglu SD (2015): “Fast algorithms for the minimum volume estimator.” Journal of Global Optimization, Vol 62.2, pp. 351-370.
  • [3] Atkinson AC, Donev AN, Tobias RD (2007): Optimum Experimental Designs, With SAS. Oxford University Press
  • [4] Atwood, CL (1973): “Sequences converging to D-optimal designs of experiments”, The Annals of Statistics 1, pp. 342-352
  • [5] Atwood CL (1976): “Convergent design sequences, for sufficiently regular optimality criteria.” The Annals of Statistics 4.6, pp. 1124-1138.
  • [6] Böhning D (1986): “A vertex-exchange-method in D-optimal design theory.” Metrika 33.1: pp. 337-347.
  • [7] Chambolle A, Ehrhardt MJ, Richtárik P, Schönlieb CB (2017): “Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications.” arXiv:1706.04957, pp. 1-25.
  • [8] Cook RD, Nachtsheim CJ (1982): “Model robust, linear-optimal designs.” Technometrics 24.1, pp. 49-54.
  • [9] Csiba D, Richtárik P (2017): “Global convergence of arbitrary-block gradient methods for generalized Polyak-Lojasiewicz functions.” arXiv preprint arXiv:1709.03014.
  • [10] Dette H, Pepelyshev A, Zhigljavsky A (2008): “Improving updating rules in multiplicative algorithms for computing D-optimal designs.” Computational Statistics & Data Analysis 53.2, pp. 312-320.
  • [11] Fedorov V (1972): “Theory of Optimal Experiments”, Academic Press, New York.
  • [12] Fercoq O, Richtárik P (2015): “Accelerated, Parallel and PROXimal coordinate descent.” SIAM Journal on Optimization, Vol 25.4, pp. 1997-2023.
  • [13] Goos P, Jones B, Syafitri U (2016): “I-Optimal Design of Mixture Experiments”, Journal of the American Statistical Association, Vol. 111, pp. 899-911.
  • [14] Gower RM, Richtárik P (2015): “Randomized iterative methods for linear systems.” SIAM Journal on Matrix Analysis and Applications, Vol 36.4, pp. 1660-1690.
  • [15] Gürbüzbalaban M, Ozdaglar A, Parrilo P (2015): “Why random reshuffling beats stochastic gradient descent”, arXiv preprint arXiv:1510.08560.
  • [16] Harman R, Pronzato L (2007): “Improvements on removing nonoptimal support points in D-optimum design algorithms”, Statistics & Probability Letters, Vol 77.1, pp. 90-94.
  • [17] Harman R (2014): “Multiplicative methods for computing D-optimal stratified designs of experiments”, Journal of Statistical Planning and Inference, Vol 146, pp. 82-94.
  • [18] Harman R, Filová L (2016): “Package ’OptimalDesign’”, https://cran.r-project.org/web/packages/OptimalDesign/index.html
  • [19] Jaggi M, Smith V, Takáč M, Terhorst J, Krishnan S, Hofmann T, Jordan MI (2014): “Communication-efficient distributed dual coordinate ascent”, Advances in Neural Information Processing Systems, pp. 3068-3076.
  • [20] Konečný J, Liu J, Richtárik P, Takáč M (2016): “Mini-batch semi-stochastic gradient descent in the proximal setting.” IEEE Journal of Selected Topics in Signal Processing, Vol 10.2, pp. 242-255.

  • [21] Lee YT, Sidford A (2010): “Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems.” IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS).
  • [22] Leventhal D, Lewis AS (2010): “Randomized methods for linear constraints: convergence rates and conditioning.” Mathematics of Operations Research, Vol 35, pp. 641-654.
  • [23] Mandal S, Torsney B (2006): “Construction of optimal designs using a clustering approach”, Journal of Statistical Planning and Inference, Vol 136.3, pp. 1120-1134.
  • [24] Myers RH, Montgomery DC, Anderson-Cook CM (2016): “Response surface methodology: process and product optimization using designed experiments”, John Wiley & Sons.
  • [25] Necoara I, Nesterov Yu, Glineur F (2011): “A random coordinate descent method on large optimization problems with linear constraints”, Manuscript.
  • [26] Nesterov Yu (2012): “Efficiency of coordinate descent methods on huge-scale optimization problems.” SIAM Journal on Optimization, Vol 22, pp. 341-362.
  • [27] Nutini, J, Schmidt M, Laradji IH, Friedlander M, Koepke H (2015): “Coordinate descent converges faster with the Gauss-Southwell rule than random selection”, Proceedings of the 32nd International Conference on Machine Learning.
  • [28] Pázman A (1986): “Foundations of Optimum Experimental Design”. Reidel.
  • [29] Platt J (1998): “Sequential minimal optimization: A fast algorithm for training support vector machines”. Manuscript.

  • [30] Pronzato L (2013): “A delimitation of the support of optimal designs for Kiefer’s -class of criteria.” Statistics & Probability Letters, Vol 83.12, pp. 2721-2728.
  • [31] Pukelsheim F (2006): “Optimal Design of Experiments (Classics in Applied Mathematics)”, Society for Industrial and Applied Mathematics.
  • [32] Qu Z, Richtárik P, Zhang T (2015): “Quartz: Randomized dual coordinate ascent with arbitrary sampling.” In Advances in Neural Information Processing Systems, Vol 28, pp. 865-873.
  • [33] R Core Team (2016): “R: A language and environment for statistical computing.” R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
  • [34] Richtárik P, Takáč M (2016): “On optimal probabilities in stochastic coordinate descent methods.” Optimization Letters, Vol 10.6, pp. 1233-1243.
  • [35] Richtárik P, Takáč M (2014): “Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function .” Mathematical Programming Vol 156.1, pp. 1-38.
  • [36] Richtárik P, Takáč M (2016): “Parallel coordinate descent methods for big data optimization.” Mathematical Programming, Vol 144.2, pp. 433-484.
  • [37] Richtárik P, Takáč M (2016): “Distributed coordinate descent method for learning with big data.” Journal of Machine Learning Research, Vol 17.75, pp. 1-25.
  • [38] Sagnol G (2011): ”Computing optimal designs of multiresponse experiments reduces to second-order cone programming”, Journal of Statistical Planning and Inference, Vol 121, pp. 1684-1708.
  • [39] Sagnol G, Harman R (2015): “Computing exact D-optimal designs by mixed integer second order cone programming”, The Annals of Statistics, Vol 43.5, pp. 2198-2224.
  • [40] Silvey SD, Titterington DH, Torsney B (1978): “An algorithm for optimal designs on a design space.” Communications in Statistics-Theory and Methods, Volume 7.14, pp. 1379-1389.
  • [41] Shalev-Shwartz S, Zhang T (2013): “Stochastic dual coordinate ascent methods for regularized loss minimization.” Journal of Machine Learning Research, Vol 14, pp. 567-599.
  • [42] Sun R, Ye Y (2016): “Worst-case complexity of cyclic coordinate descent: gap with randomized version.” arXiv:1604.07130, pp. 1-39.
  • [43] Todd MJ, Yildirim EA (2007): “On Khachiyan’s algorithm for the computation of minimum-volume enclosing ellipsoids”, Discrete Applied Mathematics, Vol 155.13, pp. 1731-1744.
  • [44] Todd MJ (2016): “Minimum-Volume Ellipsoids: Theory and Algorithms”. Society for Industrial and Applied Mathematics.
  • [45] Ucinski D, Patan M (2007): “D-optimal design of a monitoring network for parameter estimation of distributed systems”, Journal of Global Optimization, Vol 39, pp. 291-322.
  • [46] Yang M, Biedermann S, Tang E (2013): “On optimal designs for nonlinear models: a general and efficient algorithm.” Journal of the American Statistical Association, Vol 108, pp. 1411-1420.
  • [47] Yu Y (2010): “Monotonic convergence of a general algorithm for computing optimal designs”, The Annals of Statistics, Vol 38.3, pp. 1593-1606.
  • [48] Yu Y (2011): “D-optimal designs via a cocktail algorithm.” Statistics and Computing, Vol 21.4, pp. 475-481.
  • [49] Vandenberghe L, Boyd S, Wu S (1998): “Determinant maximization with linear matrix inequality constraints”, SIAM Journal on Matrix Analysis, Vol 19, pp. 499-533.
  • [50] Wynn HP (1970): “The sequential generation of D-optimum experimental designs.” The Annals of Mathematical Statistics, Vol 41.5, pp. 1655-1664.

Appendix A Optimal Step-length and Convergence

Let , , and let be a criterion of design optimality. The optimal step-length of weight exchange in the design between design points and is any

(7)

In accord with the definition (7), we provide an optimal step-length formula for any design . We also prove the convergence of the REX algorithm for -optimality.

A.1 D-optimality: Optimal Step-length and a Proof of Convergence of REX

Let and . If , the optimal step-length is trivially . Therefore, we can assume that at least one of the weights and is strictly positive. We use the notation

For D-optimality, it has been shown (see [6], [48]) that an optimal choice of the step-length in (7) is as follows.

If the two regressors are linearly independent, then the optimal step-length is given by (8). (Note that if the regressors are linearly independent, the Cauchy-Schwarz inequality holds strictly, so the expression in (8) is well defined.)

(8)

If the two regressors are linearly dependent, we can set the step-length as in (9); in this degenerate case, the choice of the optimal step-length need not be unique.

(9)

A -optimal vertex-exchange step for between design points is then

The D-optimal vertex exchange step is called nullifying if the step-length attains one of its two extreme values, that is, if the corresponding component of the updated design becomes zero. The resulting pair of indices will be called the D-optimal Böhning vertex pair, and the corresponding step-length will be called the D-optimal Böhning step. In the following, we will use three well-known facts, (i), (ii) and (iii), concerning the variance function, the D-efficiency relative to a D-optimal design, and the characterization of D-optimality (see, e.g., [3], 9.2).

Lemma 1.

Let and . Then

(10)
(11)

Moreover, if the design is not D-optimal and the pair is the D-optimal Böhning vertex pair, then .

Proof.

Inequality (10) follows from the choice of . We will prove (11). If , then . Assume that . Then the rules for imply and . Now, to show that , it is enough to verify that the directional derivative of in in the direction of is strictly positive. Noting that is the gradient of in a non-singular , we obtain that the directional derivative is

For the case , the strict inequality in (11) can be proved analogously.

Finally, we prove the last statement of the lemma. If is not optimal, then , i.e., (i) implies that . However, , therefore , and the formulas for the optimal step-length imply , i.e., . We can close the proof by using (11). ∎

Recall that the optimal step for a Böhning pair is called nullifying if it is equal to one of its two extreme values.

Lemma 2.

For any , there exists such that for all satisfying and a non-nullifying D-optimal Böhning pair , , we have

Proof.

Let and . Clearly, is a continuous function of on the compact . Therefore, it is bounded on from above by some constant . Set , which is clearly finite.

Assume first that and are linearly dependent. Then, the Böhning step can only be non-nullifying if . However, this cannot happen because from fact (i) and since , from fact (iii).

Therefore, and are linearly independent. Then, since the D-optimal Böhning step given by (8) is assumed to be non-nullifying, we have . From the matrix determinant lemma and the Sherman-Morrison formula, it is straightforward to show that for any positive definite matrix , any and

(12)

Setting , , and in (A.1) gives

Recalling that , and , we obtain the inequality from the lemma. ∎

Theorem 1.

The algorithm REX with the step-length rule defined by (8) and (9) converges to a D-optimal design in the sense that the sequence of D-criterion values of the designs produced by the algorithm converges to the D-optimal value. (The convergence is the sure convergence of random variables. The D-optimal value is defined as the criterion value of the D-optimal information matrix.)
Proof.

Let the initial design of REX be given, and consider the sequence of designs that REX generates by the leading D-optimal Böhning exchanges. Because all the transformations used by REX are non-decreasing with respect to the criterion, the sequence of criterion values has a limit. Assume that this limit is not equal to the D-optimal value of the problem. Now, inspired by the proof strategy of Böhning ([6]), we will split the proof into two cases (we deal with Case 1 differently than Böhning). In each case, we show a contradiction, which then implies the claim of the theorem.

Case 1: Assume that there is only a finite number of non-nullifying leading D-optimal Böhning exchanges. Each nullifying exchange can only decrease the size of the support of the current design or keep it constant. Since there can only be a finite number of support-reducing nullifying exchanges, there is some iteration index such that all the exchanges performed afterwards are nullifying and keep the size of the support constant. Hence, after this iteration, the algorithm does not alter the set of design points with non-zero weights. At the same time, REX is designed such that after this iteration, it performs only nullifying exchanges (also in the non-leading exchanges). Since the design space is finite and such exchanges only relocate whole weights between design points, only finitely many designs can be generated, so the algorithm must return to one of the previous designs after a finite number of steps. This is impossible because, according to Lemma 1, each D-optimal Böhning step (even a nullifying one) strictly increases the criterion value of a design that is not D-optimal.

Case 2: Assume that there is an infinite number of non-nullifying leading D-optimal Böhning steps. Then, Lemmas 1 and 2 imply that

(13)

holds along this subsequence of iterations for some positive constants. This is impossible under the assumption that the limit of the criterion values lies strictly below the D-optimal value, because the right-hand side of (13) then diverges to infinity, whereas the efficiency of any design relative to a D-optimal design is bounded from above by one. ∎

A.2 A-optimality: Optimal Step-length

Let be a fixed pair of design points and let be a fixed design. Let at least one of the weights and be strictly positive.

Denote , , and . For in the open interval , the matrix as well as the information matrix are positive definite, and we can use the Woodbury formula to obtain

Note that (A.1) implies