1 Introduction
The topic of this paper is the computation of optimal approximate designs for regression models with uncorrelated errors (e.g., [11], [28], [31], [3]). As a special case, we also briefly discuss the minimum volume enclosing ellipsoid problem (e.g., [44]).
Suppose that we intend to perform an experiment consisting of a set of trials. Assume that the observed response of each trial depends on a design point chosen from a finite set $\mathfrak{X}$. For simplicity, but without loss of generality, we will assume that $\mathfrak{X}=\{1,\ldots,n\}$. Note that in experimental design problems, $n$ can be a large number, often many thousands.
For $i\in\mathfrak{X}$, the real-valued observation $y_i$ is assumed to satisfy the linear [Footnote 1: It is also straightforward to apply the algorithms proposed in this paper to the computation of locally optimal designs for nonlinear regression models. In this sense, we gain clarity and do not lose generality if we formulate our method for linear regression (see [3], Chapter 17).] regression model $y_i=f_i^{\top}\beta+\varepsilon_i$, where $f_i\in\mathbb{R}^m$ is the known regressor associated with $i$, the vector $\beta\in\mathbb{R}^m$ contains the unknown parameters of the model, and $\varepsilon_i$ is a normal random error with mean $0$ and an unknown variance $\sigma^2>0$.
[Footnote 2: We assume homoskedasticity and the normality of errors for the sake of simplicity. Our algorithms can be applied to all models with uncorrelated errors $\varepsilon_i$, $i\in\mathfrak{X}$, with finite variances, after a simple transformation of the problem ([3], Chapter 23).] For different trials, the errors are assumed to be uncorrelated. Let the model be nonsingular in the sense that $\{f_i: i\in\mathfrak{X}\}$ spans $\mathbb{R}^m$. We also avoid redundant regressors by assuming that $f_i\neq 0_m$ for all $i\in\mathfrak{X}$. Note that in applications, the number $m$ of parameters tends to be relatively small. We formalize an (approximate experimental) design on $\mathfrak{X}$ as an $n$-dimensional vector $w$ with nonnegative real components summing to one. [Footnote 3: Often, an approximate experimental design is formalized as a probability measure on $\mathfrak{X}$, but for a finite design space, the representations by a probability measure and by a vector of probabilities are equivalent.] From the point of view of an experimenter, the $i$th component $w_i$ of $w$ represents the proportion of the trials to be performed at the design point $i$. The support of a design $w$ is $\mathrm{supp}(w)=\{i\in\mathfrak{X}: w_i>0\}$. [Footnote 4: Note that there are theoretical reasons, corroborated by extensive numerical evidence, that optimal designs tend to be sparse in the sense of having support much smaller than $n$. More precisely, there always exists an optimal design supported on a set of a size that does not exceed $m(m+1)/2$.] The set of all designs on $\mathfrak{X}$, denoted by $\Xi$,
is the probability simplex in $\mathbb{R}^n$, which is compact and convex. The information matrix associated with a design $w$ is defined by

$M(w)=\sum_{i\in\mathfrak{X}} w_i f_i f_i^{\top}.$
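As a concrete toy illustration (our own example, not code from this paper), the information matrix $M(w)=\sum_i w_i f_i f_i^{\top}$ can be assembled as follows; the simple linear regressor $f(x)=(1,x)^{\top}$ with $m=2$ is a hypothetical choice made only for this sketch.

```python
# Sketch: information matrix M(w) = sum_i w_i f_i f_i^T for the hypothetical
# regressor f(x) = (1, x)^T, so m = 2 and M(w) is a symmetric 2x2 matrix.

def regressor(x):
    return (1.0, x)

def info_matrix(points, w):
    """Return M(w) stored as ((a, b), (b, c))."""
    a = b = c = 0.0
    for x, wx in zip(points, w):
        f0, f1 = regressor(x)
        a += wx * f0 * f0
        b += wx * f0 * f1
        c += wx * f1 * f1
    return ((a, b), (b, c))

# Uniform design on the three-point design space {-1, 0, 1}:
M = info_matrix([-1.0, 0.0, 1.0], [1/3, 1/3, 1/3])
```

Because $M(w)$ is a convex combination of the rank-one matrices $f_i f_i^{\top}$, it is always symmetric nonnegative definite.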
Under our model’s assumptions, the information matrix $M(w)$ is proportional to the Fisher information matrix corresponding to $w$. Therefore, the general aim is to choose $w$ such that $M(w)$ is “as large as possible”, which we make precise next.
Let $\mathbb{S}^m_+$ be the set of $m\times m$ symmetric nonnegative definite matrices, and let $\Phi:\mathbb{S}^m_+\to[0,\infty)$ be a criterion of optimality, that is, a function measuring the “size” of information matrices. A design $w^*$ is said to be a $\Phi$-optimal design if it maximizes $\Phi(M(w))$ in the class of all designs:
(1) $\quad w^*\in\operatorname{argmax}_{w\in\Xi}\Phi(M(w)).$
The matrix $M^*:=M(w^*)$ is referred to as the $\Phi$-optimal information matrix.
Due to their natural statistical interpretations, the two most common optimality criteria are D-optimality and A-optimality. In this paper, we use them in the forms (e.g., [31]) $\Phi_D(M)=(\det M)^{1/m}$ and $\Phi_A(M)=m/\mathrm{tr}(M^{-1})$, respectively, for any positive definite matrix $M$. For a singular nonnegative definite matrix $M$, the values of both criteria are defined to be $0$. The functions $\Phi_D$ and $\Phi_A$ are positively homogeneous, continuous, and concave on the set of all nonnegative definite matrices. Hence, for both D- and A-optimality, the optimal design always exists, and the optimal information matrix is nonsingular.
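The two criteria in their homogeneous forms $\Phi_D(M)=(\det M)^{1/m}$ and $\Phi_A(M)=m/\mathrm{tr}(M^{-1})$ are easy to evaluate directly; the sketch below (our own illustration) hard-codes $m=2$, with $M$ stored as ((a, b), (b, c)).

```python
# Sketch: Phi_D(M) = (det M)^(1/m) and Phi_A(M) = m / tr(M^{-1}),
# specialized to m = 2 with M stored as ((a, b), (b, c)).

def phi_D(M):
    (a, b), (_, c) = M
    det = a * c - b * b
    return det ** 0.5 if det > 0 else 0.0   # defined as 0 for singular M

def phi_A(M):
    (a, b), (_, c) = M
    det = a * c - b * b
    if det <= 0:
        return 0.0                          # defined as 0 for singular M
    return 2.0 * det / (a + c)              # tr(M^{-1}) = (a + c)/det for 2x2 M

M = ((1.0, 0.0), (0.0, 2/3))                # e.g., uniform design on {-1, 0, 1}
```

Positive homogeneity means $\Phi(\lambda M)=\lambda\Phi(M)$ for $\lambda\geq 0$, which is what makes criterion ratios interpretable as efficiencies later on.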
Let $\Xi_+$ denote the set of regular designs, i.e., $\Xi_+=\{w\in\Xi: M(w)\ \text{is nonsingular}\}$.
For this paragraph, let $w\in\Xi_+$ be fixed. The variance function of $w$ is the $n$-dimensional vector $d(w)$ with components $d_i(w)=f_i^{\top}M^{-1}(w)f_i$, $i\in\mathfrak{X}$.
Therefore, $d$ is a function that maps regular designs to $\mathbb{R}^n$. From the statistical point of view, $d_i(w)$ is proportional to the variance of the best linear unbiased estimator (BLUE) of $f_i^{\top}\beta$, provided that the approximate design $w$ can actually be carried out (hence the term). We also note that

(2) $\quad \frac{\partial}{\partial\alpha}\Big|_{\alpha=0^+}\log\det\big((1-\alpha)M(w)+\alpha M(e_i)\big)=d_i(w)-m,$

where $\log\det$ is a version of the D-optimality criterion and $e_i$ is the singular design concentrated at the design point $i$ (formally the $i$th standardized unit vector, i.e., a vertex of $\Xi$), so that $M(e_i)=f_i f_i^{\top}$. This explains why $d(w)$ is often used in iterative algorithms for computing D-optimal designs to select “the most promising direction” of change of the current design.
For A-optimality, because the directional derivative of $-\mathrm{tr}(M^{-1})$ at $M(w)$ in the direction of $M(e_i)$ is $f_i^{\top}M^{-2}(w)f_i-\mathrm{tr}(M^{-1}(w))$, we can define an $n$-dimensional vector $a(w)$ with components $a_i(w)=f_i^{\top}M^{-2}(w)f_i$, $i\in\mathfrak{X}$, as a quantity analogous to the variance function in the case of A-optimality.
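Both vectors are cheap to evaluate once $M^{-1}(w)$ is available. The following sketch (again our own $m=2$ toy, with the hypothetical regressor $f(x)=(1,x)^{\top}$) computes $d_i(w)=f_i^{\top}M^{-1}(w)f_i$ and $a_i(w)=f_i^{\top}M^{-2}(w)f_i$.

```python
# Sketch: variance function d_i(w) = f_i^T M^{-1}(w) f_i and its A-optimality
# analogue a_i(w) = f_i^T M^{-2}(w) f_i, hard-coded for a 2x2 symmetric
# M = ((a, b), (b, c)) and the hypothetical regressor f(x) = (1, x)^T.

def solve2(M, f):
    """Return M^{-1} f for a symmetric nonsingular 2x2 matrix M."""
    (a, b), (_, c) = M
    det = a * c - b * b
    return ((c * f[0] - b * f[1]) / det, (a * f[1] - b * f[0]) / det)

def d_value(M, f):
    """f^T M^{-1} f."""
    g = solve2(M, f)
    return f[0] * g[0] + f[1] * g[1]

def a_value(M, f):
    """f^T M^{-2} f = ||M^{-1} f||^2, since M is symmetric."""
    g = solve2(M, f)
    return g[0] * g[0] + g[1] * g[1]

# Sanity check in the spirit of the Kiefer-Wolfowitz theorem: for this model,
# equal weights on {-1, 1} give a D-optimal design with M(w) = I, and the
# variance function attains its maximum value m = 2 at the support points.
M = ((1.0, 0.0), (0.0, 1.0))
d_vals = [d_value(M, (1.0, x)) for x in (-1.0, 0.0, 1.0)]
```

The check illustrates the role of $d(w)$ as an importance measure: the points where $d_i(w)$ is largest are the most promising candidates for receiving more weight.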
In the following, let $d^{\Phi}(w)$ be either $d(w)$ or $a(w)$, depending on the criterion under consideration. Besides indicating the degree of importance of the design points, another reason for utilizing the vector $d^{\Phi}(w)$ is that it can be used to compute a natural stopping rule for algorithms based on the notion of statistical efficiency, as follows.
The D- and A-efficiencies of a design $w$ relative to a design $\tilde w\in\Xi_+$ are (see [31])

$\mathrm{eff}_D(w\mid\tilde w)=\frac{\Phi_D(M(w))}{\Phi_D(M(\tilde w))} \quad\text{and}\quad \mathrm{eff}_A(w\mid\tilde w)=\frac{\Phi_A(M(w))}{\Phi_A(M(\tilde w))}.$

Because $\Phi_D$ and $\Phi_A$ are positively homogeneous, the notion of efficiency as defined above can be interpreted in terms of the relative numbers of trials needed to achieve a given criterion value. For instance, $\mathrm{eff}(w\mid\tilde w)=0.5$ means that, asymptotically, to achieve the same amount of information measured by the criterion of optimality, we only need one half of the trials using the design $\tilde w$ compared to the number of trials needed if we use the design $w$. Let $w\in\Xi_+$. If $w^*_D$ is a D-optimal design and $w^*_A$ is an A-optimal design, then (see [31], Ch. 6)

(3) $\quad \mathrm{eff}_D(w\mid w^*_D)\geq\frac{m}{\max_i d_i(w)} \quad\text{and}\quad \mathrm{eff}_A(w\mid w^*_A)\geq\frac{\mathrm{tr}(M^{-1}(w))}{\max_i a_i(w)}.$
If $w$ is the current design and its efficiency compared to the optimal design is almost equal to $1$, there is no practical reason to continue the computation.
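Thanks to positive homogeneity, an efficiency is just a ratio of criterion values. The sketch below (our own toy example with $f(x)=(1,x)^{\top}$ on $\{-1,0,1\}$, $m=2$) computes the D-efficiency of the uniform design relative to the design with equal weights on $\{-1,1\}$, which is D-optimal for this particular model.

```python
# Sketch: D-efficiency eff_D(w | w_ref) = Phi_D(M(w)) / Phi_D(M(w_ref)) for
# the hypothetical model f(x) = (1, x)^T on {-1, 0, 1}, m = 2.

def info_matrix(points, w):
    a = b = c = 0.0
    for x, wx in zip(points, w):
        a += wx
        b += wx * x
        c += wx * x * x
    return ((a, b), (b, c))

def det2(M):
    (a, b), (_, c) = M
    return a * c - b * b

def eff_D(w, w_ref, points):
    """(det M(w) / det M(w_ref))^(1/m); lies in (0, 1] if w_ref is D-optimal."""
    return (det2(info_matrix(points, w)) / det2(info_matrix(points, w_ref))) ** 0.5

pts = [-1.0, 0.0, 1.0]
# Equal weights on {-1, 1} give the D-optimal design for this model:
e = eff_D([1/3, 1/3, 1/3], [0.5, 0.0, 0.5], pts)
```

Here the uniform design has D-efficiency $\sqrt{2/3}\approx 0.816$, i.e., it wastes roughly 18 percent of the trials relative to the optimal design.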
1.1 Methods for computing optimal designs of experiments
In this paper, we are concerned with the numerical computation of (approximate) optimal designs as given in (1), with a special focus on D- and A-optimality. Within the field of optimal experimental design, the first contributions to solving this problem were made by V. V. Fedorov and H. Wynn (e.g., [11] and [50]), who developed the so-called vertex direction methods (VDMs), which are closely related to the Frank-Wolfe algorithm. In each iteration, VDMs move the current design $w$ in the direction of a vertex $e_i$ for some design point $i$ while decreasing all remaining components of $w$ by the same proportion. Here, $i$ can be chosen to maximize $d^{\Phi}_i(w)$. Although these methods converge, some of them monotonically, they tend to be inconveniently slow. More efficient variants (e.g., [4]) also allow the decrease of a single component of the current design. A related alternative approach, called the vertex exchange method (VEM) for D-optimality, was proposed by Böhning [6].
The VEM algorithm (see Algorithm 1 for a pseudocode) is one of the simplest special cases of the class of algorithms presented in this paper. In each iteration, with a current design $w$, VEM selects a pair of design points $(j,k)$ defined by

$j\in\operatorname{argmax}_{i\in\mathfrak{X}} d_i(w), \qquad k\in\operatorname{argmin}_{i\in\mathrm{supp}(w)} d_i(w).$

We call such a pair a Böhning pair of design points. Then, VEM computes the step $\alpha^*\in[-w_j,w_k]$ such that the criterion value of $w+\alpha^*(e_j-e_k)$ is maximized. Such an $\alpha^*$, which we call the Böhning step, can be computed analytically. After each exchange, the variance function is recomputed.
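To make the closed-form nature of the Böhning step concrete, consider the following sketch (our own toy illustration; the exact exchange formulas used in this paper are derived in its Appendix). Transferring weight $\alpha$ from point $k$ to point $j$ multiplies $\det M(w)$ by $(1+\alpha d_j)(1-\alpha d_k)+\alpha^2 d_{jk}^2$, where $d_{jk}=f_j^{\top}M^{-1}(w)f_k$; the unconstrained maximizer of this quadratic is $\alpha=(d_j-d_k)/(2(d_jd_k-d_{jk}^2))$, which must be clipped to the feasible interval $[-w_j,w_k]$.

```python
# Sketch of a Bohning-type exchange step for D-optimality.  Moving weight
# alpha from point k to point j multiplies det M(w) by
#   (1 + alpha*d_j) * (1 - alpha*d_k) + alpha^2 * d_jk^2,
# with d_j = f_j^T M^{-1} f_j, d_k = f_k^T M^{-1} f_k, d_jk = f_j^T M^{-1} f_k.

def bohning_step(d_j, d_k, d_jk, w_j, w_k):
    denom = 2.0 * (d_j * d_k - d_jk * d_jk)   # > 0 unless f_j, f_k are parallel
    alpha = (d_j - d_k) / denom if denom != 0 else w_k
    return max(-w_j, min(w_k, alpha))         # clip to the feasible interval

# Toy example: uniform design on {-1, 0, 1} with f(x) = (1, x)^T, so
# M = ((1, 0), (0, 2/3)) and M^{-1} = ((1, 0), (0, 3/2)).  Take j at x = 1
# (d_j = 2.5) and k at x = 0 (d_k = 1.0, d_jk = 1.0):
alpha = bohning_step(d_j=2.5, d_k=1.0, d_jk=1.0, w_j=1/3, w_k=1/3)
```

In this example, the unconstrained maximizer $0.5$ exceeds $w_k=1/3$, so the clipped step removes all weight from point $k$: a nullifying step in the terminology used later in the paper.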
A substantially different strategy for solving problem (1), involving the simultaneous updating of all components of $w$, is implemented in the multiplicative algorithm (MUL). The MUL and its variants can be attributed to [40]; extensions can be found in [10] and [17].
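A common textbook form of the multiplicative update for D-optimality is $w_i \leftarrow w_i\, d_i(w)/m$, which preserves the simplex because $\sum_i w_i d_i(w)=m$; the variants in [40], [10] and [17] differ in details. A sketch on our toy model $f(x)=(1,x)^{\top}$ on $\{-1,0,1\}$:

```python
# Sketch of a multiplicative update for D-optimality, w_i <- w_i * d_i(w) / m
# (one standard form; the cited variants differ in details).  Toy model:
# f(x) = (1, x)^T on {-1, 0, 1}, m = 2.

def d_vector(points, w):
    a = b = c = 0.0
    for x, wx in zip(points, w):
        a += wx
        b += wx * x
        c += wx * x * x
    det = a * c - b * b
    # d_i = f_i^T M^{-1} f_i with M = ((a, b), (b, c)) and f_i = (1, x_i)^T
    return [(c - 2 * b * x + a * x * x) / det for x in points]

pts = [-1.0, 0.0, 1.0]
w = [1/3, 1/3, 1/3]
for _ in range(100):
    d = d_vector(pts, w)
    w = [wx * dx / 2.0 for wx, dx in zip(w, d)]   # sum_i w_i d_i = m keeps sum(w) = 1
# w approaches the D-optimal design (1/2, 0, 1/2) for this model.
```

Each iteration updates all $n$ weights simultaneously, in contrast to the two-point exchanges of VEM.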
The three methods VDM, VEM and MUL were more recently combined by Yu ([48]) in the “cocktail” algorithm for D-optimality. This approach often increases the speed while preserving convergence.
Another efficient recent method is presented in [46]. It is a simplicial decomposition algorithm in which the master nonlinear problem is solved by a second-order Newton method; it can be used for various twice-differentiable concave criteria, including both D- and A-optimality. See also [45] for an earlier application of simplicial decomposition in the optimal design of experiments and for further references.
The problem of searching for an optimal design can also be solved by specific mathematical programming methods. Namely, in [49], it is shown that the problem of A-optimality can be cast as semidefinite programming, while D-optimality can be reduced to maxdet programming. More recently, the work [38] enables one to solve both D- and A-optimal design problems (for the approximate design case without additional constraints [Footnote 5: For an improvement that also enables solving exact design problems and approximate design problems with nonstandard constraints using second-order cone programming, see [39].]) by the second-order cone programming approach.
The D-optimal design problem is closely related to the minimum-volume enclosing ellipsoid (MVEE) problem. Considerable attention has been paid to developing fast algorithms for the MVEE problem in the mathematical programming and optimization literature. For the latest developments, see [43], [2], and [44].
Consider a set $\{f_1,\ldots,f_n\}$ spanning $\mathbb{R}^m$. Then, the task of finding the MVEE $\{x\in\mathbb{R}^m: x^{\top}Hx\leq m\}$, $H$ positive definite, containing $\{\pm f_1,\ldots,\pm f_n\}$ can be cast as

$\min_{H\succ 0}\ -\log\det H \quad\text{subject to}\quad f_i^{\top}Hf_i\leq m\ \text{for all}\ i\in\mathfrak{X},$

with the dual problem

$\max_{w\in\Xi}\ \log\det M(w).$
Thus, the problem of finding the D-optimal design is equivalent to the MVEE problem (see, e.g., [44]), making the task of developing fast and efficient algorithms for computing D-optimal designs relevant to a wider optimization community.
With D-optimality being the most popular criterion, the literature focusing on the computational issues of A-optimality is scarcer, although in some areas, the use of A-optimality is very natural. In particular, the computation of the important I-optimal designs [Footnote 6: These are occasionally called IV- or V-optimal designs.] (e.g., [8], [13]) can be converted into A-optimality (cf. [3], Section 10.6). That is, I-optimal designs on a finite design space can also be computed using the algorithm developed in this paper. In addition to the mathematical programming methods mentioned above, a generalization of the vertex direction method that focuses on A-optimality, together with an analytical formula for the optimal VDM step length, is presented in [1].
1.2 Summary of contributions
In Section 2, we introduce a general class of methods for finding optimal approximate designs, which we call the subspace ascent method (SAM), and describe some known algorithms as its special cases.
In Section 3, we present our key method: REX (randomized exchange method). REX is a simple batch-randomized exchange algorithm that is a member of the SAM family and whose performance is superior to that of the state-of-the-art algorithms in nearly all problems we tested. The proposed algorithm can be viewed as an efficient extension or combination of both the VEM algorithm and the KL-exchange algorithm used to compute exact designs (see [3]). In the same section, we compare our method to known subspace ascent/descent methods recently developed in the optimization and machine learning communities by highlighting the similarities and differences. The numerical comparison of the state-of-the-art algorithms with the newly proposed method is the subject of Section 4.

In the Appendix, we provide a proof of convergence of the proposed algorithm in the case of D-optimality. Additionally, in the existing literature, many of the mentioned methods have only been tailored to D-optimality, and although the transition to A-optimality is possible, it is not trivial. Therefore, in the Appendix, we also present formulas for the optimal exchange of weights for A-optimality, which, to the best of our knowledge, were previously only derived for the D-optimality criterion.
2 Subspace ascent method
In Section 2.1, we propose a generic algorithmic paradigm for solving (1). It helps to illustrate several existing approaches, as well as the newly proposed method, from a broader perspective. We refer to this paradigm as the “Subspace Ascent Method” (SAM), formalized as Algorithm 2. Subsequently, in Section 3.2, we briefly comment on the context of the “subspace ascent” idea within the existing optimization, linear algebra and machine learning literature.
2.1 SAM
SAM (Algorithm 2) is an iterative procedure for finding an optimal approximate design. We start with an arbitrary regular design $w^0\in\Xi_+$. In iteration $k$, a subset $S_k$ of the set of design points is chosen via a certain rule (e.g., a small random subset of all design points). In this iteration, we only allow the weights $w_i$ for $i\in S_k$ to change. All other weights are frozen at their last values. That is, we deliberately reduce our search for $w^{k+1}$ to a subset $\Xi_k$ of the set of all designs, given by the intersection of $\Xi$ and a particular affine subspace of $\mathbb{R}^n$:

$\Xi_k=\{w\in\Xi: w_i=w^k_i\ \text{for all}\ i\notin S_k\}.$
Note that $w^k\in\Xi_k$ for all $k$ and, therefore, $\Xi_k\neq\emptyset$. In particular, there exists $w^{k+1}\in\Xi_k$ such that the ascent condition $\Phi(M(w^{k+1}))\geq\Phi(M(w^k))$ is satisfied.
If $S_k\cap\mathrm{supp}(w^k)=\emptyset$, then $\Xi_k=\{w^k\}$ is a singleton and the method cannot progress. As soon as $S_k\cap\mathrm{supp}(w^k)\neq\emptyset$, this problem is removed, and $\Xi_k$ has a chance to contain points better than $w^k$.
SAM is a generic method in the sense that it specifies neither how the set $S_k$ is chosen nor how the approximate solution of the partial optimization problem over $\Xi_k$ is obtained. Formally, any optimal design algorithm for a discrete design space (including the VDMs and the MUL) is a special case of SAM if we allow the choice $S_k=\mathfrak{X}$. However, the intended philosophy of the SAM approach is to make the sets $S_k$ much smaller than $\mathfrak{X}$. This makes the partial optimization problem small-dimensional and potentially simple to solve, at least approximately.
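The SAM pattern can be sketched as follows (our own illustration, not the paper’s Algorithm 2): repeatedly pick a small subset $S_k$ and approximately maximize the criterion over the weights in $S_k$ only, keeping all other weights frozen. Here $S_k$ is a random pair, the criterion is $\log\det M(w)$, and the partial problem is solved by a crude 1-D grid search, which already guarantees the ascent condition. The toy model $f(x)=(1,x)^{\top}$ on $\{-1,0,1\}$ is our own choice.

```python
# Sketch of the generic SAM pattern with two-point subsets S_k and a grid
# search for the partial problem.  Toy model: f(x) = (1, x)^T, m = 2.

import math
import random

def log_det(points, w):
    a = b = c = 0.0
    for x, wx in zip(points, w):
        a += wx
        b += wx * x
        c += wx * x * x
    return math.log(max(a * c - b * b, 1e-300))

def sam_step(points, w, rng):
    j, k = rng.sample(range(len(points)), 2)        # subset S_k = {j, k}
    best_val, best_alpha = log_det(points, w), 0.0  # alpha = 0 keeps w: ascent
    for t in range(101):                            # feasible transfers k -> j
        alpha = -w[j] + t * (w[j] + w[k]) / 100
        trial = list(w)
        trial[j] += alpha
        trial[k] -= alpha
        val = log_det(points, trial)
        if val > best_val:
            best_val, best_alpha = val, alpha
    w[j] += best_alpha                              # monotone step within Xi_k
    w[k] -= best_alpha
    return w

rng = random.Random(1)
pts = [-1.0, 0.0, 1.0]
w = [1/3, 1/3, 1/3]
for _ in range(200):
    w = sam_step(pts, w, rng)
# w approaches the D-optimal design (1/2, 0, 1/2) for this model.
```

Replacing the grid search with an analytic two-point exchange recovers a VEM-like method, while enlarging $S_k$ and using a Newton solver recovers a YBT-like method, which is exactly the flexibility SAM is meant to expose.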
The algorithm VEM chooses $S_k$ to be a two-point set, which allows the partial optimization problem to be solved analytically for D-optimality (and, as we show in the Appendix, for A-optimality). However, this method requires overly frequent recomputations of the vector $d^{\Phi}(w)$, which is a computationally nontrivial operation for large $n$. Therefore, the speed of VEM declines significantly with the increasing size of the design space.
Similarly, SAM also includes the algorithm YBT ([46]), which chooses $S_k$ to be the support of the current design enriched by one additional point. The partial optimization problem is then solved by a specific constrained Newton method. However, this approach is generally inefficient for larger values of $m$ because the sets $S_k$ then tend to be large, and the efficiency of the Newton method deteriorates rapidly as the dimension of the problem increases (as demonstrated in Section 4).
3 Randomized exchange algorithm
In this section, we propose a particular case of SAM for which we coin the name randomized exchange algorithm (REX). Then, we briefly comment on the connection of this method to the existing literature on mathematical optimization and machine learning.
3.1 REX
Let $w$ be a regular design and let $d^{\Phi}(w)$ be the $n$-dimensional vector with components $d^{\Phi}_i(w)$, as defined in the introduction. Recall that the value $d^{\Phi}_i(w)$ is a $w$-based estimate of the plausibility that $i$ is a support point of an optimal design.
The general idea behind the proposed batch-randomized algorithm is to repeatedly select a batch of several pairs of design points (the number of pairs is not fixed and may vary from iteration to iteration) and to randomly perform optimal exchanges of weights between the pairs of design points. The selection of the batch depends on the vector $d^{\Phi}$ evaluated at the best available design $w$. Its size can be fine-tuned by a tuning parameter $\gamma$. We call the resulting algorithm the randomized exchange algorithm (REX). We now proceed to a description of the REX algorithm (see Algorithm 3 for a concise pseudocode).

LBE step. Given the current design $w$, compute $d^{\Phi}(w)$ and subsequently perform the “leading Böhning exchange” (LBE) as follows:

(4) $\quad w \leftarrow w+\alpha^*(e_j-e_k),$

where $(j,k)$ is the Böhning pair of design points for $w$ and $\alpha^*\in[-w_j,w_k]$ is the corresponding optimal step.
An optimal step $\alpha^*$ is called “nullifying” if it is equal to either $-w_j$ or $w_k$, i.e., if it sets one of the two exchanged weights to zero.

Subspace selection. The subspace of $\Xi$ in which the method will operate arises from the union of two sets of design points. One is formed by a greedy process (the greedy set $\mathcal{G}$) informed by the elements of the vector $d^{\Phi}(w)$, and the other is identical to the support of the current design ($\mathrm{supp}(w)$).

Greedy. Set the batch size $K$, which is determined by the tuning parameter $\gamma$ and the number of parameters $m$, and choose the points $\mathcal{G}=\{i_1,\ldots,i_K\}$, where $i_h$ corresponds to the $h$th largest component of the vector $d^{\Phi}(w)$.

Support. Set $\mathcal{S}=\mathrm{supp}(w)$. Let $s$ be the size of $\mathcal{S}$: $s=|\mathcal{S}|$.

Active subspace. The active set of design points is defined as $\mathcal{A}=\mathcal{G}\cup\mathcal{S}$. The weights of no other design points are modified in this iteration. Note that if $K=n$ or if $\mathcal{G}\cup\mathcal{S}=\mathfrak{X}$, then $\mathcal{A}=\mathfrak{X}$. However, the standard situation is $|\mathcal{A}|\ll n$, and our method then operates in a proper subspace of the full design space $\Xi$.


Subspace step. We now perform a specific update of the weights of the design points in $\mathcal{A}$. That is, we allow the weights $w_i$ for $i\in\mathcal{A}$ to be updated. The components $w_i$ for $i\notin\mathcal{A}$ are kept unchanged.

Formation of pairs. Let $\pi$ be a uniform random permutation of $\mathcal{S}$, and let $\sigma$ be a uniform random permutation of $\mathcal{G}$. Form the sequence

(5) $\quad (\pi(u),\sigma(v)),\quad u=1,\ldots,s,\ v=1,\ldots,K,$

of pairs of active design points.

Update. If the LBE step was nullifying, sequentially perform all nullifying optimal exchanges between the pairs in (5), with the corresponding updates of $w$ and $M(w)$. If the LBE was not nullifying in the current iteration, sequentially perform all optimal exchanges between the pairs in (5), with the corresponding updates of $w$ and $M(w)$.


Stopping rule. If a stopping rule is satisfied, stop. Otherwise, iterate by returning to step 1.
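The pair-formation step can be sketched as follows; this is one plausible reading of (5), under which every support point is paired with every greedily selected point, in an order randomized by shuffling both sets independently (the precise ordering used by REX is specified in Algorithm 3).

```python
# Sketch of the batch pair formation: shuffle the support set and the greedy
# set, then pair every support point with every greedy point.

import random

def form_pairs(support, greedy, rng):
    s = list(support)
    g = list(greedy)
    rng.shuffle(s)      # uniform random permutation of supp(w)
    rng.shuffle(g)      # uniform random permutation of the greedy set
    return [(u, v) for u in s for v in g]

rng = random.Random(7)
# Hypothetical index sets (for illustration only):
pairs = form_pairs(support=[2, 5, 9], greedy=[1, 5, 6, 9], rng=rng)
```

The batch thus contains $sK$ candidate exchanges, and only the random order in which they are processed differs between runs, which is why the result of REX does not depend on any fixed ordering of the design points.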
We remark that for the criteria of D- and A-optimality, we have rapid explicit formulas providing the optimal exchanges in Steps 1 and 3b of REX; see Appendix A. However, in principle, REX can also be applied to any other concave (even nondifferentiable) optimality criterion by using numerical procedures for one-dimensional convex optimization on an interval, provided that we find a suitable analogy of the vector $d^{\Phi}$.
REX makes explicit use of the fast optimal weight exchange between two design points, which makes it much less dependent on $m$ than YBT. Moreover, REX requires less frequent recomputations of the vector $d^{\Phi}(w)$, which leads to significant computational savings compared to VEM for large $n$. Note that REX is very simple. In particular, it is simpler than the hybrid cocktail algorithm of Yu [48]. Moreover, unlike the method of [48], the result does not depend on the ordering of the design points. Finally, as we demonstrate, in most cases, it is also numerically more efficient than both of the state-of-the-art methods proposed in [48] and [46].
3.2 Connection to existing literature on subspace descent methods
Subspace ascent/descent methods of various types have recently been extensively studied in the optimization [26, 36, 12], linear algebra [22, 21, 14], machine learning [41, 32, 20] and image processing [7] literature. Of particular importance are the randomized and greedy variants.
Randomized subspace descent methods operate by taking steps in random subspaces of the domain of interest according to some distribution over subspaces (all of which typically have a relatively small and fixed dimension) that is fixed throughout the iterative process. Once a subspace has been identified, a gradient, Newton, or proximal point step is usually taken in the subspace. These methods have been found to scale to extremely large dimensions for certain applications [36, 37] (e.g., billions of unknowns), and have become the state of the art for a variety of big data analysis tasks. Randomized selection rules are, in practice, often better than cyclic rules that pass through the subspaces in a fixed predefined order. Intuitively, this happens because randomization helps avoid or break unfavorable orderings. It was recently shown by Sun and Ye [42] that there is a large theoretical gap between randomized and cyclic rules (proportional to the square of the dimension of the problem), which favors randomized rules.
Greedy subspace descent methods operate by taking steps in subspaces selected greedily according to some potential/importance function of interest. Again, once a subspace has been identified, typically a gradient, Newton or a proximal point step is taken in the subspace. The ideal greedy method employs the maximum improvement rule, which requires that at any given iteration, one should choose the subspace that leads to the maximum improvement in the objective. Since this is almost always impractical, one needs to resort to approximating the maximum improvement rule by an efficiently implementable proxy rule. Whether one should use a greedy or a randomized method depends on the problem structure, as both have applications in which they dominate.
Let us now highlight some of the key similarities and differences between REX and existing subspace descent methods:

Support plays a role. The subspace selection rule in REX has a greedy component through the inclusion of the set $\mathcal{G}$. However, unlike most existing subspace descent methods, it also has an “active set / support” component through the inclusion of $\mathrm{supp}(w)$.

Greedy subspaces. Greedy subspace rules for subspaces of dimension greater than one have not been explored in the optimization and machine learning literature. To the best of our knowledge, the only exception is the work of Csiba and Richtárik [9]. However, both their greedy subspace rule and the problem that they tackle are different from ours.

Subspace step. While existing subspace descent methods typically rely on traditional deterministic (as opposed to stochastic) optimization steps, such as gradient or Newton steps, REX performs a randomized pairwise exchange step in the subspace. We use pairwise exchange steps because, for the most important optimal design criteria, they can be calculated exactly in closed form, which facilitates fast computation. This is similar to the sequential minimal optimization technique of Platt [29], which is influential in the machine learning literature. Methods with randomized steps performed in the subspace, such as REX, are rare in the optimization literature. One of the earliest examples of such an approach is the CoCoA framework of Jaggi et al. [19], aimed at distributed primal-dual optimization in machine learning. However, both their subspace subroutine and the problem that they tackle are very different from ours.

Random reshuffling. Randomization in REX is present in the form of a random permutation of the pairs of active design points. Methods relying on random permutations can be seen as a hybrid between stochastic and cyclic methods. These approaches are not well understood theoretically due to the intrinsic difficulty of capturing their behavior using complexity analysis. The work of Gürbüzbalaban et al. [15] provides one of the first insights into this problem. However, their techniques do not apply to our problem or algorithm.

Nonseparable constraints. Virtually all modern stochastic and greedy subspace descent methods in optimization apply to unconstrained problems [Footnote 7: Problems with a convex constraint admitting a fast (i.e., closed-form) projection are considered equally simple as unconstrained problems.], or to problems with a separable constraint structure, as this property is crucial in the analysis of these methods. One of the early works that explicitly tackles nonseparable constraints is that of Necoara et al. [25]. However, their work applies to linear constraints only.

Smoothness. The vast majority of papers on stochastic and greedy subspace descent methods assume that the objective function has a Lipschitz continuous gradient (this property is often called smoothness in the machine learning literature). However, this is not true in our case. Hence, fundamentally different convergence analysis techniques are needed. Only a handful of subspace descent methods do not rely on this assumption. One of them is the stochastic Chambolle-Pock algorithm [7], which is a stochastic extension of a famous state-of-the-art method in the area of computational imaging: the Chambolle-Pock method.
4 Numerical comparisons
In this section, we compare the performance of an R ([33]) implementation of REX with two state-of-the-art algorithms for computing optimal designs of experiments. As the first candidate, we used the cocktail algorithm (CO), more precisely its R implementation, which was made available by the author of CO (see [48]). [Footnote 8: There are two variants of CO. One has a prespecified neighborhood structure, and the other has nearest neighbors computed on the fly. For each model, we selected the variant that performed better.] As the second algorithm, we selected the Newton-type method of [46] (YBT). For a fair comparison, the SAS code kindly provided by the authors of YBT was converted into R with as high fidelity as possible. The R codes of REX are freely available. Note that it is possible to directly use the codes without any commercial software and without technical knowledge of the algorithms.
In the following two subsections, we compare the competing algorithms for two very different, generic optimum design problems with widely varying sizes. This approach provides comprehensive information about the relative strengths of the methods in a broad range of situations. However, we remark that the numerical comparisons must be taken with a grain of salt because they depend on a multitude of factors. Different implementations and different hardware capabilities can lead to somewhat different relative results.
In all computations with REX, we chose a fixed value of the tuning constant $\gamma$ and uniformly random initial designs. We used a computer with a 64-bit Windows 10 operating system, running an Intel Core i7-5500U processor.
4.1 Quadratic models
Consider the full quadratic regression model on the $d$-dimensional cube $[-1,1]^d$, $d\geq 1$. In other words, for $x=(x_1,\ldots,x_d)^{\top}\in[-1,1]^d$, the observations are modeled as

(6) $\quad y(x)=\beta_0+\sum_{i=1}^{d}\beta_i x_i+\sum_{1\leq i\leq j\leq d}\beta_{ij}x_i x_j+\varepsilon(x),$
where the regression terms correspond to the parameters $\beta_0$, $\beta_i$, and $\beta_{ij}$, $1\leq i\leq j\leq d$. We discretized the cube to obtain a lattice of $n$ points, which are numbered by the indices from $\mathfrak{X}=\{1,\ldots,n\}$. Thus, the set of regressors corresponds to the full quadratic expansion of the lattice points. These are representatives of structured models, such as those used in the response surface methodology (see, e.g., [24]). Each of the abovementioned algorithms was run for various combinations of the values of $d$ and $n$ (see Figure 1).
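For concreteness, the design space and the regressors of model (6) can be generated as follows (our own illustrative sketch, independent of the implementations compared below); the model has $m=(d+1)(d+2)/2$ parameters.

```python
# Sketch: regressors of the full quadratic model (6) on a grid discretization
# of [-1, 1]^d with k levels per coordinate, giving n = k^d design points and
# m = (d + 1)(d + 2)/2 parameters.

from itertools import combinations_with_replacement, product

def quadratic_regressor(x):
    """f(x) = (1, x_1, ..., x_d, x_i * x_j for i <= j)."""
    f = [1.0] + list(x)
    f += [x[i] * x[j] for i, j in combinations_with_replacement(range(len(x)), 2)]
    return f

def lattice(d, k):
    """The k^d grid points of the cube [-1, 1]^d."""
    levels = [-1 + 2 * t / (k - 1) for t in range(k)]
    return list(product(levels, repeat=d))

pts = lattice(d=2, k=5)                 # n = 25 design points
m = len(quadratic_regressor(pts[0]))    # m = 6 for d = 2
```

Even a moderate discretization makes $n$ grow quickly with $d$ (e.g., $k=5$ levels with $d=6$ already yield $n=15625$), which illustrates why the scaling of the compared algorithms with $n$ matters.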
The numerical results confirm the claim of [46] that YBT is superior to CO for a large size of the design space and a small number of model parameters (see panels (b) and (f) of Figure 1). However, for a smaller design space, particularly if the number of parameters is large, the algorithm CO tends to perform better than YBT (cf. panels (a), (c) and (e) of Figure 1). A similar observation can also be drawn for other models, such as the one presented in the next subsection.
Importantly, the performance of REX is comparable or superior to that of both state-of-the-art algorithms for all combinations of $n$ and $m$ that we explored.
4.2 Random models
As a second example, we chose a very different problem in which the regressors are drawn independently and randomly from the $m$-dimensional standard normal distribution. These models represent unstructured and unordered optimal design situations in which the possible covariates of the model are drawn from a random population. The results are demonstrated in Figure 2.

The random models again show the superior performance of the proposed algorithm. The differences in favor of REX are even more pronounced than for the quadratic models. This is possibly caused by the fact that on an unstructured set of regressors, procedures such as CO that depend on the ordering of design points are disadvantaged. Note that the algorithm REX is independent of the ordering of the design points.
4.3 Further remarks on the numerical comparisons
A-optimality. In addition to D-optimality, we performed a comparison of CO, YBT and REX for several problems under the criterion of A-optimality. Note that the algorithm CO was originally developed for D-optimality only, but the formulas given in the Appendix (Subsection A.2) allowed us to modify it for the computation of A-optimal designs. For A-optimality, the relative efficiencies of the three algorithms were similar to the case of D-optimality, although the advantage of REX is generally less pronounced.
Other algorithms for D-optimality. For some models with a small number of design points and a very large number of parameters, the classical methods, particularly the multiplicative algorithm, outperform all three “state-of-the-art” algorithms (REX, YBT, CO), at least in an initial part of the computation. However, in a vast majority of the test problems, the classical methods cannot compete with the three modern algorithms. A comprehensive comparison of all available methods is beyond the scope of this paper.
Disadvantages of REX. We believe it to be important that the reader is informed of the disadvantages of our method that we are aware of. First, the method occasionally generates streaks of only slowly increasing criterion values, which are, however, usually followed by a rapid improvement. Second, we were not able to prove the convergence to the optimum [Footnote 9: Note that the convergence of the criterion values per se follows trivially from the monotonicity of REX. However, the convergence to the value optimal for the problem at hand seems to be difficult to prove for a general concave criterion.] except for the case of D-optimality. Nevertheless, we also tested REX for other criteria, and the algorithm converged in all test problems.
5 Conclusions
As we have numerically demonstrated, the randomized exchange algorithm (REX) largely outperforms the recent state-of-the-art methods for computing optimal approximate designs of experiments, mostly in the cases where the model consists of randomly generated or otherwise unstructured regressors. Overall, in comparison with the competing methods, the performance of REX deteriorates much less when both the size of the design space and the number of parameters increase. It is also worth noting that REX is concise, has relatively low memory requirements, and does not utilize any advanced mathematical programming solvers. In addition, for D-optimality, we theoretically proved the convergence of REX to the correct optimum. To adapt the proposed algorithm to the A-optimality criterion, we derived the appropriate vertex-exchange formulas.
REX offers many possibilities for further development. Besides the exploration of its convergence properties for a general concave criterion, there is space for improvement in an optimized, perhaps adaptive, selection of the parameter $\gamma$ that regulates the batch size. One way to further increase the speed of the algorithm is to incorporate specific methods of initial design construction and deletion rules for nonsupporting points (see [16] for D-optimality and [30] for A-optimality).
Moreover, it is possible to employ REX for the computation of optimal designs on continuous (infinite) design spaces. One approach is to choose a large number of random points inside the continuous design space, use the speed of REX on unstructured design spaces to construct the optimal approximate design on the finite subsample, and fine-tune the design by using standard routines of constrained continuous optimization. However, a numerical and theoretical analysis of this methodology is beyond the scope of this paper.
Acknowledgments The work of the first two authors was supported by Grant Number 1/0521/16 from the Slovak Scientific Grant Agency (VEGA). The last author acknowledges support through the KAUST baseline research funding scheme.
References
 [1] Ahipasaoglu SD (2015): “A first-order algorithm for the A-optimal experimental design problem: a mathematical programming approach.” Statistics and Computing, Vol 25.6, pp. 1113–1127.
 [2] Ahipasaoglu SD (2015): “Fast algorithms for the minimum volume estimator.” Journal of Global Optimization, Vol 62.2, pp. 351–370.
 [3] Atkinson AC, Donev AN, Tobias RD (2007): Optimum Experimental Designs, With SAS. Oxford University Press.
 [4] Atwood CL (1973): “Sequences converging to D-optimal designs of experiments.” The Annals of Statistics, Vol 1, pp. 342–352.
 [5] Atwood CL (1976): “Convergent design sequences, for sufficiently regular optimality criteria.” The Annals of Statistics, Vol 4.6, pp. 1124–1138.
 [6] Böhning D (1986): “A vertex-exchange-method in D-optimal design theory.” Metrika, Vol 33.1, pp. 337–347.
 [7] Chambolle A, Ehrhardt MJ, Richtárik P, Schönlieb CB (2017): “Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications.” arXiv:1706.04957, pp. 1–25.
 [8] Cook RD, Nachtsheim CJ (1982): “Model robust, linear-optimal designs.” Technometrics, Vol 24.1, pp. 49–54.
 [9] Csiba D, Richtárik P (2017): “Global convergence of arbitrary-block gradient methods for generalized Polyak-Łojasiewicz functions.” arXiv preprint arXiv:1709.03014.
 [10] Dette H, Pepelyshev A, Zhigljavsky A (2008): “Improving updating rules in multiplicative algorithms for computing D-optimal designs.” Computational Statistics & Data Analysis, Vol 53.2, pp. 312–320.
 [11] Fedorov V (1972): Theory of Optimal Experiments. Academic Press, New York.
 [12] Fercoq O, Richtárik P (2015): “Accelerated, parallel and proximal coordinate descent.” SIAM Journal on Optimization, Vol 25.4, pp. 1997–2023.
 [13] Goos P, Jones B, Syafitri U (2016): “I-optimal design of mixture experiments.” Journal of the American Statistical Association, Vol 111, pp. 899–911.
 [14] Gower RM, Richtárik P (2015): “Randomized iterative methods for linear systems.” SIAM Journal on Matrix Analysis and Applications, Vol 36.4, pp. 1660–1690.
 [15] Gürbüzbalaban M, Ozdaglar A, Parrilo P (2015): “Why random reshuffling beats stochastic gradient descent.” arXiv preprint arXiv:1510.08560.
 [16] Harman R, Pronzato L (2007): “Improvements on removing non-optimal support points in D-optimum design algorithms.” Statistics & Probability Letters, Vol 77.1, pp. 90–94.
 [17] Harman R (2014): “Multiplicative methods for computing D-optimal stratified designs of experiments.” Journal of Statistical Planning and Inference, Vol 146, pp. 82–94.
 [18] Harman R, Filová L (2016): “Package ’OptimalDesign’.” https://cran.r-project.org/web/packages/OptimalDesign/index.html
 [19] Jaggi M, Smith V, Takáč M, Terhorst J, Krishnan S, Hofmann T, Jordan MI (2014): “Communication-efficient distributed dual coordinate ascent.” Advances in Neural Information Processing Systems, pp. 3068–3076.
 [20] Konečný J, Liu J, Richtárik P, Takáč M (2016): “Mini-batch semi-stochastic gradient descent in the proximal setting.” IEEE Journal of Selected Topics in Signal Processing, Vol 10.2, pp. 242–255.
 [21] Lee YT, Sidford A (2013): “Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems.” IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS).
 [22] Leventhal D, Lewis AS (2010): “Randomized methods for linear constraints: convergence rates and conditioning.” Mathematics of Operations Research, Vol 35, pp. 641–654.
 [23] Mandal S, Torsney B (2006): “Construction of optimal designs using a clustering approach.” Journal of Statistical Planning and Inference, Vol 136.3, pp. 1120–1134.
 [24] Myers RH, Montgomery DC, Anderson-Cook CM (2016): Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley & Sons.
 [25] Necoara I, Nesterov Yu, Glineur F (2011): “A random coordinate descent method on large optimization problems with linear constraints.” Manuscript.
 [26] Nesterov Yu (2012): “Efficiency of coordinate descent methods on huge-scale optimization problems.” SIAM Journal on Optimization, Vol 22, pp. 341–362.
 [27] Nutini J, Schmidt M, Laradji IH, Friedlander M, Koepke H (2015): “Coordinate descent converges faster with the Gauss-Southwell rule than random selection.” Proceedings of the 32nd International Conference on Machine Learning.
 [28] Pázman A (1986): Foundations of Optimum Experimental Design. Reidel.
 [29] Platt J (1998): “Sequential minimal optimization: A fast algorithm for training support vector machines.” Manuscript.
 [30] Pronzato L (2013): “A delimitation of the support of optimal designs for Kiefer’s class of criteria.” Statistics & Probability Letters, Vol 83.12, pp. 2721–2728.
 [31] Pukelsheim F (2006): Optimal Design of Experiments (Classics in Applied Mathematics). Society for Industrial and Applied Mathematics.
 [32] Qu Z, Richtárik P, Zhang T (2015): “Quartz: Randomized dual coordinate ascent with arbitrary sampling.” Advances in Neural Information Processing Systems, Vol 28, pp. 865–873.
 [33] R Core Team (2016): “R: A language and environment for statistical computing.” R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
 [34] Richtárik P, Takáč M (2016): “On optimal probabilities in stochastic coordinate descent methods.” Optimization Letters, Vol 10.6, pp. 1233–1243.
 [35] Richtárik P, Takáč M (2014): “Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function.” Mathematical Programming, Vol 144, pp. 1–38.
 [36] Richtárik P, Takáč M (2016): “Parallel coordinate descent methods for big data optimization.” Mathematical Programming, Vol 156, pp. 433–484.
 [37] Richtárik P, Takáč M (2016): “Distributed coordinate descent method for learning with big data.” Journal of Machine Learning Research, Vol 17.75, pp. 1–25.
 [38] Sagnol G (2011): “Computing optimal designs of multiresponse experiments reduces to second-order cone programming.” Journal of Statistical Planning and Inference, Vol 141, pp. 1684–1708.
 [39] Sagnol G, Harman R (2015): “Computing exact D-optimal designs by mixed integer second order cone programming.” The Annals of Statistics, Vol 43.5, pp. 2198–2224.
 [40] Silvey SD, Titterington DM, Torsney B (1978): “An algorithm for optimal designs on a finite design space.” Communications in Statistics - Theory and Methods, Vol 7.14, pp. 1379–1389.
 [41] Shalev-Shwartz S, Zhang T (2013): “Stochastic dual coordinate ascent methods for regularized loss minimization.” Journal of Machine Learning Research, Vol 14, pp. 567–599.
 [42] Sun R, Ye Y (2016): “Worst-case complexity of cyclic coordinate descent: O(n^2) gap with randomized version.” arXiv:1604.07130, pp. 1–39.
 [43] Todd MJ, Yildirim EA (2007): “On Khachiyan’s algorithm for the computation of minimum-volume enclosing ellipsoids.” Discrete Applied Mathematics, Vol 155.13, pp. 1731–1744.
 [44] Todd MJ (2016): Minimum-Volume Ellipsoids: Theory and Algorithms. Society for Industrial and Applied Mathematics.
 [45] Uciński D, Patan M (2007): “D-optimal design of a monitoring network for parameter estimation of distributed systems.” Journal of Global Optimization, Vol 39, pp. 291–322.
 [46] Yang M, Biedermann S, Tang E (2013): “On optimal designs for nonlinear models: a general and efficient algorithm.” Journal of the American Statistical Association, Vol 108, pp. 1411–1420.
 [47] Yu Y (2010): “Monotonic convergence of a general algorithm for computing optimal designs.” The Annals of Statistics, Vol 38.3, pp. 1593–1606.
 [48] Yu Y (2011): “D-optimal designs via a cocktail algorithm.” Statistics and Computing, Vol 21.4, pp. 475–481.
 [49] Vandenberghe L, Boyd S, Wu S (1998): “Determinant maximization with linear matrix inequality constraints.” SIAM Journal on Matrix Analysis and Applications, Vol 19, pp. 499–533.
 [50] Wynn HP (1970): “The sequential generation of optimum experimental designs.” The Annals of Mathematical Statistics, Vol 41.5, pp. 1655–1664.
Appendix A Optimal Step-length and Convergence
Let $i, j \in \{1, \ldots, n\}$, $i \neq j$, and let $\Phi$ be a criterion of design optimality. The optimal step-length of a weight exchange in the design $w$ between the design points $x_i$ and $x_j$ is any

(7)  $\alpha^* \in \operatorname{argmax}_{\alpha \in [-w_i, w_j]} \Phi\big(M(w + \alpha(e_i - e_j))\big),$

where $e_i$ and $e_j$ denote standard unit vectors and $M(w) = \sum_{k=1}^n w_k f_k f_k^T$ is the information matrix of $w$.
In accord with definition (7), we provide optimal step-length formulas for $D$- and $A$-optimality, valid for any design $w$. We also prove the convergence of the REX algorithm for $D$-optimality.
A.1 D-optimality: Optimal Step-length and a Proof of Convergence of REX
Let $i \neq j$ and let $w$ be a design with a nonsingular information matrix $M(w)$. If $w_i = w_j = 0$, the optimal step-length is trivially $\alpha^* = 0$. Therefore, we can assume that at least one of the weights $w_i$ and $w_j$ is strictly positive. We use the notation

$d_i = f_i^T M^{-1}(w) f_i, \quad d_j = f_j^T M^{-1}(w) f_j, \quad d_{ij} = f_i^T M^{-1}(w) f_j.$

For $D$-optimality, it has been shown (see [6], [48]) that an optimal choice of the step-length in (7) is as follows.
If $f_i$ and $f_j$ are linearly independent (note that in this case the Cauchy-Schwarz inequality implies $d_i d_j > d_{ij}^2$), then

(8)  $\alpha^* = \min\Big\{ w_j, \max\Big\{ -w_i, \frac{d_i - d_j}{2(d_i d_j - d_{ij}^2)} \Big\} \Big\}.$

If $f_i$ and $f_j$ are linearly dependent, we can set (for $d_i = d_j$, the choice of the optimal step-length is not unique)

(9)  $\alpha^* = w_j$ if $d_i \geq d_j$, and $\alpha^* = -w_i$ if $d_i < d_j$.

A $D$-optimal vertex-exchange step for $w$ between the design points $x_i$ and $x_j$ is then

$w^* = w + \alpha^*(e_i - e_j).$

The optimal vertex-exchange step is called nullifying if $\alpha^* = -w_i$ or if $\alpha^* = w_j$, that is, if the $i$th or the $j$th component of $w^*$ is equal to $0$. The pair of indices

$i^* \in \operatorname{argmax}_{i}\, d_i, \qquad j^* \in \operatorname{argmin}_{j:\, w_j > 0}\, d_j$

will be called the optimal Böhning's vertex pair, and the step-length $\alpha^*$ for $i = i^*$, $j = j^*$ will be called the optimal Böhning's step. In the following, we will use the well-known facts (i) $\max_k d_k \geq m$, (ii) $\mathrm{eff}_D(w) \geq m / \max_k d_k$, and (iii) $w$ is $D$-optimal if and only if $\max_k d_k = m$ (see, e.g., [3], 9.2). In (ii), $\mathrm{eff}_D(w)$ denotes the efficiency of $w$ relative to a $D$-optimal design.
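To make the step-length rule concrete, the sketch below evaluates the interior optimum of (8) and clips it to the feasible interval, falling back to an endpoint as in (9) when the regressors are (numerically) linearly dependent. The model, design points, weights, and indices are invented for illustration and are not from the paper.

```python
import numpy as np

# Hypothetical 2-parameter model on four design points (illustrative data).
F = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # regressors f_k
w = np.array([0.4, 0.1, 0.4, 0.1])                               # a design

def info_matrix(w, F):
    """Information matrix M(w) = sum_k w_k f_k f_k^T."""
    return F.T @ (w[:, None] * F)

def optimal_step(w, F, i, j):
    """Optimal D-step alpha for w -> w + alpha*(e_i - e_j), alpha in [-w_i, w_j]."""
    Minv = np.linalg.inv(info_matrix(w, F))
    fi, fj = F[i], F[j]
    di, dj = fi @ Minv @ fi, fj @ Minv @ fj
    dij = fi @ Minv @ fj
    denom = 2.0 * (di * dj - dij**2)
    if denom > 1e-12:                  # f_i, f_j linearly independent, rule (8)
        alpha = (di - dj) / denom
    else:                              # linearly dependent, endpoint rule (9)
        alpha = w[j] if di >= dj else -w[i]
    return float(np.clip(alpha, -w[i], w[j]))

alpha = optimal_step(w, F, 1, 3)
w_new = w.copy()
w_new[1] += alpha
w_new[3] -= alpha
```

Here the unconstrained optimum lies outside the feasible interval, so the step is clipped to the endpoint alpha = -w[1], i.e., the exchange is nullifying and removes the second design point from the support.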
Lemma 1.
Let $w$ be a design with a nonsingular $M(w)$, let $i \neq j$, and write $w^* = w + \alpha^*(e_i - e_j)$. Then

(10)  $\det M(w^*) \geq \det M(w)$,

(11)  $\det M(w^*) > \det M(w)$ whenever $d_i > d_j$ and $w_j > 0$, or $d_i < d_j$ and $w_i > 0$.

Moreover, if $w$ is not $D$-optimal and $(i^*, j^*)$ is the optimal Böhning's vertex pair, then $\det M(w + \alpha^*(e_{i^*} - e_{j^*})) > \det M(w)$.
Proof.
Inequality (10) follows from the choice of $\alpha^*$, since $\alpha = 0$ is always feasible in (7). We will prove (11). Assume that $d_i > d_j$ and $w_j > 0$. Then the rules (8) and (9) for $\alpha^*$ imply $\alpha^* > 0$ and $w^* \neq w$. Now, to show that $\det M(w^*) > \det M(w)$, it is enough to verify that the directional derivative of $\log\det$ at $M(w)$ in the direction of $f_i f_i^T - f_j f_j^T$ is strictly positive. Noting that $M^{-1}$ is the gradient of $\log\det$ at a nonsingular $M$, we obtain that the directional derivative is

$\mathrm{tr}\big(M^{-1}(w)(f_i f_i^T - f_j f_j^T)\big) = d_i - d_j > 0,$

so the criterion strictly increases for all sufficiently small $\alpha > 0$, and the optimality of $\alpha^*$ gives $\det M(w^*) > \det M(w)$.
For the case $d_i < d_j$ and $w_i > 0$, the strict inequality in (11) can be proved analogously.
Finally, we prove the last statement of the lemma. If $w$ is not $D$-optimal, then facts (i) and (iii) imply that $d_{i^*} = \max_k d_k > m$. However, $\sum_k w_k d_k = \mathrm{tr}(M^{-1}(w)M(w)) = m$, therefore $d_{j^*} \leq m < d_{i^*}$, and $w_{j^*} > 0$ by the definition of $j^*$. We can close the proof by using (11). ∎
Recall that for a Böhning's pair $(i^*, j^*)$, the optimal Böhning's step is called nullifying if $\alpha^*$ is equal to $-w_{i^*}$ or $w_{j^*}$.
Lemma 2.
For any $\delta \in (0, 1)$ and $v_0 > 0$, there exists $\epsilon > 0$ such that for all $w$ satisfying $\det M(w) \geq v_0$ and $\mathrm{eff}_D(w) \leq 1 - \delta$ and a non-nullifying optimal Böhning's pair $(i^*, j^*)$, we have

$\det M(w + \alpha^*(e_{i^*} - e_{j^*})) \geq (1 + \epsilon)\, \det M(w).$

Proof.
Let $\delta \in (0, 1)$, $v_0 > 0$, and $\Xi_0 = \{w : \det M(w) \geq v_0\}$. Clearly, $\max_k d_k$ is a continuous function of $w$ on the compact $\Xi_0$. Therefore, it is bounded on $\Xi_0$ from above by some constant $C$. Set $\epsilon = m^2\delta^2 / \big(4C^2(1-\delta)^2\big)$, which is clearly positive and finite.
Assume first that $f_{i^*}$ and $f_{j^*}$ are linearly dependent. Then, by (9), the Böhning's step can only be non-nullifying if $d_{i^*} = d_{j^*}$. However, this cannot happen: $d_{j^*} \leq m$ because $\sum_k w_k d_k = m$, while $d_{i^*} > m$ from facts (i) and (iii), since $w$ is not $D$-optimal.
Therefore, $f_{i^*}$ and $f_{j^*}$ are linearly independent. Then, since the optimal Böhning's step given by (8) is assumed to be non-nullifying, we have

$\alpha^* = \frac{d_{i^*} - d_{j^*}}{2(d_{i^*} d_{j^*} - d_{i^*j^*}^2)}.$

From the matrix determinant lemma and the Sherman-Morrison formula, it is straightforward to show that for any positive definite $m \times m$ matrix $M$, any $f, g \in \mathbb{R}^m$, and any $\alpha \in \mathbb{R}$,

(12)  $\det\big(M + \alpha(f f^T - g g^T)\big) = \det(M)\big[(1 + \alpha f^T M^{-1} f)(1 - \alpha g^T M^{-1} g) + \alpha^2 (f^T M^{-1} g)^2\big].$

Setting $M = M(w)$, $f = f_{i^*}$, $g = f_{j^*}$, and $\alpha = \alpha^*$ in (12) gives

$\det M(w + \alpha^*(e_{i^*} - e_{j^*})) = \det M(w)\left[1 + \frac{(d_{i^*} - d_{j^*})^2}{4(d_{i^*} d_{j^*} - d_{i^*j^*}^2)}\right].$

Recalling that $d_{i^*} \geq m/(1-\delta)$ by fact (ii), $d_{j^*} \leq m$, and $d_{i^*} d_{j^*} - d_{i^*j^*}^2 \leq d_{i^*} d_{j^*} \leq C^2$, we obtain the inequality from the lemma. ∎
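The rank-two determinant identity (12) is easy to verify numerically; the positive definite matrix and vectors below are random illustrative data.

```python
import numpy as np

# Numerical check of identity (12):
# det(M + a(ff' - gg')) = det(M) [(1 + a f'M^-1 f)(1 - a g'M^-1 g) + a^2 (f'M^-1 g)^2]
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
M = A @ A.T + 4.0 * np.eye(4)          # random positive definite matrix
f = rng.normal(size=4)
g = rng.normal(size=4)
alpha = 0.3

Minv = np.linalg.inv(M)
lhs = np.linalg.det(M + alpha * (np.outer(f, f) - np.outer(g, g)))
rhs = np.linalg.det(M) * ((1 + alpha * f @ Minv @ f) * (1 - alpha * g @ Minv @ g)
                          + alpha**2 * (f @ Minv @ g) ** 2)
```

Both sides agree up to floating-point rounding, which also confirms the closed-form value of the determinant at the interior optimum used in the proof.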
Theorem 1.
The algorithm REX with the step-length rule defined by (8) and (9) converges to a $D$-optimal design in the sense that the sequence of criterion values of the designs produced by the algorithm converges to the optimal value. (The convergence is the sure convergence of random variables. The optimal value is defined as $\det M^*$, where $M^*$ is the $D$-optimal information matrix.)
Proof.
Let the initial design of REX be $w_1$ with $\det M(w_1) > 0$, and let $w_1, w_2, \ldots$ be the sequence of designs that REX generates by the leading optimal Böhning's exchanges at iterations $k = 1, 2, \ldots$. Because all the transformations used by REX are non-decreasing with respect to the $D$-criterion, the sequence $\det M(w_1), \det M(w_2), \ldots$ has a limit, say $v$. Assume that $v$ is not equal to the optimal value $\det M^*$ of the problem and let $\delta \in (0, 1)$ be such that $\mathrm{eff}_D(w_k) \leq 1 - \delta$ for all $k$. Now, inspired by the proof strategy of Böhning ([6]), we will split the proof into two cases (we will deal with Case 1 differently than Böhning). In each case, we show a contradiction, which then implies the claim of the theorem.
Case 1: Assume that there is only a finite number of non-nullifying leading optimal Böhning's exchanges. Each nullifying exchange can only decrease the size of the support of the current design, or it can keep the size of the support constant. Since there can only be a finite number of support-reducing nullifying exchanges, there is some index $K$ such that all the exchanges performed after the $K$th iteration are nullifying and keep a constant size of the support. Hence, after the $K$th iteration, the algorithm does not alter the set of nonzero weight values. At the same time, REX is designed such that after the $K$th iteration, it will perform only nullifying exchanges (also in the non-leading exchanges). Since the design space is finite, only finitely many designs carry this fixed set of nonzero weights, which means that the generated designs must return to one of the previous designs after a finite number of steps. This is impossible because, according to Lemma 1, each optimal Böhning's step (even if it is nullifying) strictly increases the criterion value of a suboptimal design.
Case 2: Assume that there is an infinite number of non-nullifying leading optimal Böhning's steps, at iterations $k_1 < k_2 < \ldots$. Then, Lemmas 1 and 2 (the latter applied with $v_0 = \det M(w_1)$) imply that

(13)  $\det M(w_{k_l}) \geq (1 + \epsilon)^{l-1} \det M(w_{k_1})$

for all $l$ and some positive real number $\epsilon$. This is impossible if we assume that $\det M(w_{k_1}) > 0$ because the RHS of (13) converges to infinity with $l$, but the efficiency of any design compared to the $D$-optimal design is bounded by $1$ from above, so $\det M(w_{k_l}) \leq \det M^*$ for all $l$. ∎
A.2 A-optimality: Optimal Step-length
Let $(x_i, x_j)$ be a fixed pair of design points and let $w$ be a fixed design with a nonsingular information matrix. Let at least one of the weights $w_i$ and $w_j$ be strictly positive.
Denote $f = f_i$, $g = f_j$, and $M = M(w)$. For $\alpha$ in the open interval $(-w_i, w_j)$, the matrix $M + \alpha(f f^T - g g^T)$, that is, the information matrix $M(w + \alpha(e_i - e_j))$, is positive definite, and we can use the Woodbury formula to obtain the inverse of the updated information matrix in closed form.
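The Woodbury step can be sketched numerically as a rank-two update of the inverse information matrix, from which the A-criterion value of the exchanged design follows without a full re-inversion; the matrix and vectors below are random illustrative stand-ins, not the paper's data.

```python
import numpy as np

# Rank-two Woodbury update: (M + U C U^T)^{-1} with U = [f, g], C = diag(a, -a),
# which corresponds to the exchange M(w) -> M(w + a(e_i - e_j)).
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
M = A @ A.T + 5.0 * np.eye(3)          # stand-in for a nonsingular M(w)
f = rng.normal(size=3)                 # stand-in for f_i
g = rng.normal(size=3)                 # stand-in for f_j
alpha = 0.2                            # step inside (-w_i, w_j)

Minv = np.linalg.inv(M)
U = np.column_stack([f, g])
C = np.diag([alpha, -alpha])
K = np.linalg.inv(np.linalg.inv(C) + U.T @ Minv @ U)
Minv_new = Minv - Minv @ U @ K @ U.T @ Minv   # Woodbury formula
A_crit_new = np.trace(Minv_new)               # updated A-criterion value tr(M^{-1})
```

The update costs only a 2x2 inversion plus matrix-vector products, which is what makes repeated step-length evaluations cheap inside an exchange algorithm.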