1 Introduction
Supervised learning models typically need to be trained on large amounts of labeled instances to perform well. While many modern systems can easily produce a large number of unlabeled instances at low cost, the labeling process can be very difficult, expensive or timeconsuming. However, different labeled instances contain various information and contribute to the learning process in different ways. Therefore, an interesting question arises: how to choose the most informative instances to label so that one can improve learning rate of the model and reduce the labeling cost at the same time?
In statistics, the problem of selecting which instances to label is referred as Design of Experiments (DOE) (Federov (1972); Box et al. (2005); Del Castillo (2007)). DOE is a systematic method to explore the relationship between process input (or experimental “factors”) and output responses under limited resources for conducting experiments. Traditional DOE was developed for physical experiments in agricultural and industrial applications where the goal is to optimize some continuous function of the covariates. A classic theory of experimental design exists for linear statistical models that assume the response is a function of a small number of covariates (for a summary of this theory, see, e.g., Federov (1972)
). In these problems, the number of covariates or “factors” of interest in an experiment is relatively small and so is the number of tests or size of the experiment, given the high experimental cost. However, with the development of modern technology, scientists and engineers frequently face different challenges arising not only from higher dimensional data but also from more complex data structures. Furthermore, in experiments where the response is an image that needs to be classified or text that needs to be recognized and categorized, the dimension of each data instance is much higher than those typically dealt with in the classical DOE literature. A particular type of data complexity specially important in experiments with image and text data occurs when the problem data actually lies on an unknown manifold of much smaller dimension of the space in which it appears to reside.
The goal of this paper is to discuss a new methodology for designing optimal experimental designs that minimize the number of experimental runs for highdimensional manifold data and at the same time acquire as much useful information about the response as possible. As far as we know, no existing work has provided a theoretical justification of optimal experimental design methods for highdimensional manifold data. This paper contributes to the theoretical development of DOE methods on manifolds. In particular, we prove a new Equivalence Theorem for a continuous optimal design on a Riemannian manifolds, and also provide a converging algorithm for finding the optimal design.
The rest of this paper is organized as follows. In Section 2, we briefly introduce the traditional Optimal Design of Experiment (ODOE) on Euclidean space, and then explain the idea of manifold learning, in particular the manifold regularization model from Belkin et al. (2006). More importantly, a manifoldbased ODOE scheme is discussed. In Section 3, we provide the theoretical justification of optimal experimental design and present a new equivalence theorem of ODOE on Riemannian manifolds. In Section 4, we illustrate our proposed algorithm and provide a convergence analysis. In Section 5, several numerical experiments are conducted to demonstrate the effectiveness of the proposed algorithm for finding optimal designs on manifold data. In Section 6, we conclude this paper and discuss some possible future directions.
2 Optimal Design of Experiments on Manifolds
2.1 Traditional ODOE on Euclidean Space
Consider initially a classical linear regression model
(1) 
where is some nonlinear mapping function that maps from the input space to the feature space ,
is a column vector of unknown parameters, and
is assumed to have a distribution . Given a sample of design points , if the corresponding response valuesare available, the wellknown ordinary least squares estimates of the
parameters are given by:(2) 
where is a design matrix with th row defined as , and is a response vector. As a result, the corresponding fitted function is .
Classical work on Optimal Design of Experiments (ODOE) was developed by Kiefer and Wolfowitz (1960) and summarized by Fedorov (1972) . The goal of ODOE is to find an experimental design that is optimal with respect to some statistical criterion related either to the model parameter estimates or to the model predictions. For example, given the linear regression model (1), the Doptimality criterion minimizes the determinant of the covariance matrix of the parameter estimates , while the Goptimality criterion minimizes the maximum prediction variance . These and similar criteria are called “alphabetic optimality” design criteria by Box and Draper (2007).
Traditional ODOE methods assume both the covariate and response data lie on an Euclidean space under the classical linear regression model (2). However, in applications to highdimensional image and text datasets, traditional ODOE is not applicable as it does not consider the intrinsic manifold structure these type of datasets usually have (Tenenbaum et al. (2000); Roweis and Saul (2000)).
While there have been recent attempts at applying alphabetic optimality criteria to manifold learning models (He (2010); Chen et al. (2010); Alaeddini et al. (2019)), no theoretical justification exists, as far as we know, to these methods, and no guarantees can be given for their success other than empirical experiments. A new theory for the optimal experimental designs is therefore needed that explicitly considers highdimensional manifold data, justify existing methods if possible, and shows a principled way to develop new methods. Before we discuss the design of experiments on manifolds, first we need to introduce a manifold learning model by Belkin et al. (2006) that will be used in the sequel.
2.2 Manifold Regularization Model
In the wellknown paradigm of machine learning, the process of learning is seen as using the training data
to construct a function that maps a data instance to a label variable . Letbe the joint distribution that generates labeled data
and be the marginal distribution that generates unlabeled data . In order to extend the learning of functions to nonlinear manifolds, Belkin et al. (2006) assume that the conditional distribution varies smoothly as moves along a manifold that supports . In other words, if two data pointsare close as measured by an intrinsic (or geodesic) distance on this manifold, then the two probabilities of the labels,
and, will be similar. These authors developed a semisupervised learning framework that involves solving the following double regularized objective function:
(3) 
where
is a given loss function (such as squared loss
), is a Reproducing Kernel Hilbert Space (RKHS) (Aronszajn (1950)) with associated Mercer kernel , is a penalty term with the norm equipped in that imposes smoothness conditions in the ambient space (Wahba (1990)), and is a penalty term for nonsmoothness along geodesics on the intrinsic manifold structure of . Moreover, and are two regularization parameters that control the amount of penalization in the ambient space and in the intrinsic manifold that supports , respectively. Some other work in recent literature can be explained as a particular case of this general framework. For example, the spatial regression model proposed by Ettinger et al. (2016) can be seen as the manifold regularization model (3) without the ambient space regularization. There are also different work about nonparametric regression models on manifolds (Cheng and Wu (2013); Marzio et al. (2014); Lin et al. (2017)), but in this paper we focus on the manifold regularization model from Belkin et al. (2006).Intuitively, the choice of
should be a smoothness penalty corresponding to the probability distribution
. However, in most realworld applications is not known, and therefore empirical estimates of the marginal distribution must be used. Considerable research has been devoted to the case when is supported on a compact manifold (Roweis and Saul (2000); Tenenbaum et al. (2000); Belkin and Niyogi (2003); Donoho and Grimes (2003); Coifman et al. (2005)). Under this assumption, it can be shown that problem (3) can be reduced to(4) 
where and is the Laplacian matrix associated with the data adjacency graph that is constructed on all the labeled and the unlabeled data points . In particular, the graph Laplacian approximates the LaplaceBeltrami operator acting on the continuous Riemannian manifold (see Belkin (2003); Lafon (2004); Belkin and Niyogi (2005); Coifman et al. (2005); Hein et al. (2005)). In this way, Belkin et al. provide a theoretical justification to the common trick in manifold learning of using a graph and geodesic distances on the graph as an approximate representation of the manifold , providing a precise sense in which the graph approaches as the number of data points gets denser. This way, the term serves as an approximation for , and enforces the penalization on the lack of smoothness of as it varies between adjacent points in the graph .
In addition, Belkin et al. (2006) proceed to prove a representer theorem (similar to the theory of splines in Wahba (1990)), which shows how the solution of the infinite dimensional problem (4) can be represented in terms of a finite sum over the labeled and unlabeled points:
(5) 
where is the Mercer kernel associated with the ambient space . More details about the manifold regularized model can be found in Belkin et al. (2006).
2.3 Regularized ODOE on Manifolds
As far as we know, the first discussion of regularized ODOE comes from Vuchkov (1977)
, who proposed a ridgetype procedure for ODOE based on the ridge regression estimator:
(6) 
Vuchkov’s motivation was to use the ridge estimator to solve the singular or illconditional problems that might exist in the sequential Doptimal design algorithm when the number of design points is smaller than the number of parameters to estimate. The ridge solution (6) can be seen as a particular case of the more general learning problem (4) where is a squaredloss function, and the RKHS is equipped with a norm and the manifold regularization parameter .
To discuss the optimal experimental design for the general manifold regularization model (4), we first need to clarify some notation. Without loss of generality, assume a sequential experimental design generation, starting with no labeled data at the beginning of the sequence. Let be the set of points that has been labeled at the th iteration, and be the corresponding label vector. Given a square loss function, the manifold regularization model (4) becomes the Laplacian Regularized Least Squares (LapRLS) problem:
(7) 
Substituting the representer theorem (5) to (7), we get a convex differentiable objective function with respect to :
(8) 
where and are the Gram matrix defined by
and is the kernel embedded in the RKHS . Taking a derivative of (8) with respect to and making it equal to 0, we arrive at the following expression:
(9) 
Consider a linear regression model form (1) and a linear kernel for , the regression parameters can be estimated by
(10) 
where
(11) 
By some simple linear algebra (a formal proof is provided in the Appendix), the estimated parameters (10) can be simplified to
(12) 
He (2010) demonstrated that the covariance matrix of (12) can be approximated as:
(13) 
The determinant of covariance matrix (13) is the statistical criterion we will minimize to obtain a Doptimal design. Before we discuss the optimal design algorithm, first we will provide some theoretical justification of ODOE on Riemannian manifolds in the following section.
3 Theoretical Results
When the determinant of is maximized, one obtains a socalled Doptimal experimental design. In Euclidean space, “continuous” or “exact” optimal design theory (which considers the proportion of experimental tests allocated to different locations over the space) indicates the equivalence between the Doptimality criteria and the socalled Goptimality criteria, where the maximum prediction variance is minimized, as stated by the celebrated KieferWolfowitz (KW) theorem (Kiefer and Wolfowitz (1960); Kiefer (1974)). In analogy with the KW theorem, we aim to develop a new equivalence result for optimal experimental design based on the manifold regularization model (4), which can then be used for designing an experiment on a Riemannian manifold.
Assume there is an infinite number of points
that are uniformly distributed on a Riemannian manifold
. Let be a continuous design on . For any continuous design , based on the Carathéodory Theorem, it’s known that can be represented as(14) 
For any , the corresponding information matrix of LapRLS model is defined as
(15) 
where is a probability measure of design on the experimental region , is the LaplaceBeltrami operator on , and is the uniform measure on . Note that the last two terms in (15) are independent of the design , thus for simplicity, define
(16) 
Then (15) can be written as
(17) 
Based on the parameters estimates (12), for a given continuous design , the prediction variance at a test point is
(18) 
As it can be seen, under the LapRLS model one can obtain a Doptimal design by maximizing the determinant of and a Goptimal design by minimizing . Similarly to the optimal design of experiments in Euclidean space, we prove next an equivalence theorem on Riemannian manifolds that shows how the D and G optimality criteria lead to the same optimal design. Before the equivalence theorem is discussed, we need to prove some auxiliary results. The proofs of these proposition are provided in the Appendix.
Proposition 1.
Let and be two designs with the corresponding information matrices and . Then
(19) 
where is the information matrix of the design
(20) 
Proposition 2.
Let and be two designs with the corresponding information matrices and . Then
(21) 
where is the information matrix of the design
(22) 
Proposition 3.
For any continuous design ,

(23) 
(24)
Proposition 4.
The function is a strictly concave function.
Based on Propositions 14, we can now prove the equivalence theorem for the LapRLS model. In summary, the following theorem demonstrates that the Doptimal design and Goptimal design are equivalent on the Riemannian manifold . It also provides the theoretical value of maximum prediction variance of the LapRLS model when the D/G optimal design is achieved.
Theorem 1 (Equivalence Theorem).
The following statements are equivalent:

the design maximizes

the design minimizes

Proof
(1) 1 2
Let be the design that maximizes and define , where is some arbitrary design. According to Proposition 2, we have that
(25) 
When , we have . Thus
(26)  
(27)  
(28) 
Since is the maximal solution, then
(29) 
Without loss of generality, assume the design has only one instance . Then we have
(30) 
and
Thus
(31) 
In addition, based on Proposition 3, we have
(32) 
Combining (31) and (32), we can conclude that the Doptimal design minimizes .
(2) 2 1
Let be the design that minimizes , but assume it is not Doptimal. Based on Proposition 4, we know there must exist a design such that:
(33) 
where
(34) 
Then
Since is the design that minimizes , by Proposition 3, we have
(35) 
Thus, for any ,
(36) 
Then
(37)  
(38)  
(39) 
Therefore, we have
This contradicts with (33). Therefore, the design is also Doptimal.
(3) 1 3
Let be the Doptimal design. From the previous proof, in particular Equation (31), we know that
(40) 
(4) 3 1
Let be the design such that
(41) 
Then, for any ,
(42) 
Based on the previous proof, we know that equation (42) implies that there is no improving direction for the Doptimal criteria. Thus is the Doptimal design.
(5) Since 1 2, 1 3, then 2 3 and the equivalence theoreme is proved.
Different from the classical equivalence theorem on Euclidean space, Theorem 1 demonstrates the equivalence of Doptimal design and Goptimal design on the Riemannian manifold. In addition, for any given design , Equation (24) provides a new lower bound for the maximum prediction variance. Theorem 1 shows that this lower bound (24) can be achieved at the D/G optimal design . Therefore, Theorem 1 also provides a theoretical justification that the optimal D/G design would minimize the maximum prediction variance of the model.
4 Proposed Algorithm and Convergence Analysis
Before we discuss our proposed algorithm for finding optimal experimental design on manifolds, some auxiliary results need to be derived.
Proposition 5.
Let be the information matrix of the design at th iteration. Let be the information matrix of the design concentrated at one single point . Given , then
(43) 
Proposition 6.
Let be the information matrix of the design at th iteration. Construct the design at th iteration as
(44) 
where
(45) 
Then the resulting sequence is nondecreasing.
Based on Proposition 5 and 6, we propose a new algorithm to find the DG optimal experimental design on a manifold, as shown in Algorithm 1. Note how after obtaining an optimal design for the data to be labeled, and obtaining the corresponding labels, we use both labeled and unlabeled instances to train the manifold regularized model.
We next provide a convergence analysis of the proposed algorithm.
(46) 

Find s.t.
(47) 
Update the design
(48) where is a user choice that satisfies
(49) 
Compute the information matrix , set and repeat step 13.
Theorem 2 (Convergence Theorem).
The iterative procedure in Algorithm 1 converges to the Doptimal design ,
(50) 
Proof
Let the design not be Doptimal. Based on Proposition 6, we have
(51) 
It is known that any bounded monotone sequence converges. Thus the sequence , , …, converges to some limit . Next we need to show
(52) 
The proof proceeds by contradiction. Assume
(53) 
By the convergence of the sequence , , …, , we know that, for , there s.t.
(54) 
Based on Proposition 5, we have
Then,
(55) 
Define , then we can rewrite (55) as
(56) 
Define a function as
(57) 
Then
(58) 
Clearly, for . Thus, for a given , is an increasing function with respect to . On the other hand,
(59) 
Let , we have
(60)  
(61)  
(62)  
(63)  
(64) 
Thus, for and a given , is an increasing function. In particular, plug in the formula of , we have
(65) 
Notice that is the same choice of in the proposed algorithm.
In addition, based on Equation (57), it can be seen that for any and , we have . Note that is an arbitrary positive number in equation (56), which implies need to be an infinitely small positive number to satisfy equation (56), i.e. given , there s.t.
(66) 
However, based on the assumption (53) and the Theorem 1, we have that
(67) 
Choosing , we have a contradiction, and therefore, the convergence theorem is proved.
From the derivation of Algorithm 1, it is not difficult to notice that ODOEM is a modeldependent design. The corresponding manifold regularization model (7) need to be trained after a desired number of instances is labeled. As it is shown before, Algorithm 1 is a converging algorithm on a continuous design space. However, sometimes the experimental design space is not continuous and only a set of candidate points is available. For a discrete design space with a set of candidate points, one can evaluate each candidate point and choose the point with maximum prediction variance. The resulting sequence of is still nondecreasing, since
where .
5 Numerical Results
To illustrate the empirical performance of the proposed ODOEM algorithm in practice, we consider its application to both synthetic datasets, low dimensional manifold datasets that permit straightforward visualization of the resulting designs, and also its application to the high dimensional realworld image datasets.
5.1 Synthetic Datasets
In this section, we generated four different twodimensional manifold datasets: data on a Torus, on a Möbius Strip, on a figure “8” immersion of a Klein bottle and on a classic Klein bottle. Each of the first three datasets contains 400 instances and the last dataset contains 1600 instances. For all four datasets, we plot these twodimensional manifolds in a threedimensional Euclidean space, as shown in Figure 18. The colors on these manifolds represent the corresponding response values (or their estimates based on different experimental designs), which were defined by
(68) 
where and . The red numbers on the manifolds represent the sequence of labeled instances by different design algorithms.
In order to show the improvement provided by using the ODOEM algorithm, we compare it with a classical Doptimal design algorithm on a kernel regression model, which does not consider the manifold structure. For both of the learning models, we choose a RBF kernel and set the range parameter to be 0.01. In addition, we choose in both models for numerical stability, and in ODOEM (a discussion of the choice of is provided in the Appendix).
For some realworld applications, the data may not strictly lie on a given manifold due to noise. In order to explore the robustness of the ODOEM algorithm to noise, we also let the four synthetic datasets fluctuate around their manifolds by adding noise to . In other words, for each of the four manifolds, we investigate both of the case when the data lie exactly on the given manifold and the case when are not exactly on the manifold.
The results are shown in Figure 14. As it can be seen, on all four synthetic datasets, ODOEM performs much better than kernel regression Doptimal Design in terms of instance selection and function fitting, under both of the case with noises and the case without noises.