Supervised learning models typically need to be trained on large amounts of labeled instances to perform well. While many modern systems can easily produce a large number of unlabeled instances at low cost, the labeling process can be very difficult, expensive, or time-consuming. Moreover, different labeled instances carry different amounts of information and contribute to the learning process in different ways. Therefore, an interesting question arises: how should one choose the most informative instances to label so as to improve the learning rate of the model and reduce the labeling cost at the same time?
In statistics, the problem of selecting which instances to label is referred to as Design of Experiments (DOE) (Fedorov (1972); Box et al. (2005); Del Castillo (2007)). DOE is a systematic method to explore the relationship between process inputs (or experimental “factors”) and output responses under limited resources for conducting experiments. Traditional DOE was developed for physical experiments in agricultural and industrial applications where the goal is to optimize some continuous function of the covariates. A classical theory of experimental design exists for linear statistical models that assume the response is a function of a small number of covariates (for a summary of this theory, see, e.g., Fedorov (1972)). In these problems, the number of covariates or “factors” of interest in an experiment is relatively small, and so is the number of tests or the size of the experiment, given the high experimental cost. However, with the development of modern technology, scientists and engineers frequently face new challenges arising not only from higher-dimensional data but also from more complex data structures. Furthermore, in experiments where the response is an image that needs to be classified or text that needs to be recognized and categorized, the dimension of each data instance is much higher than those typically dealt with in the classical DOE literature. A type of data complexity especially important in experiments with image and text data occurs when the data actually lie on an unknown manifold of much smaller dimension than that of the space in which they appear to reside.
The goal of this paper is to discuss a new methodology for constructing optimal experimental designs that minimize the number of experimental runs for high-dimensional manifold data while acquiring as much useful information about the response as possible. As far as we know, no existing work has provided a theoretical justification of optimal experimental design methods for high-dimensional manifold data. This paper contributes to the theoretical development of DOE methods on manifolds. In particular, we prove a new Equivalence Theorem for a continuous optimal design on a Riemannian manifold, and we also provide a convergent algorithm for finding the optimal design.
The rest of this paper is organized as follows. In Section 2, we briefly introduce the traditional Optimal Design of Experiments (ODOE) on Euclidean space and then explain the idea of manifold learning, in particular the manifold regularization model from Belkin et al. (2006). More importantly, a manifold-based ODOE scheme is discussed. In Section 3, we provide the theoretical justification of optimal experimental design and present a new equivalence theorem for ODOE on Riemannian manifolds. In Section 4, we describe our proposed algorithm and provide a convergence analysis. In Section 5, several numerical experiments are conducted to demonstrate the effectiveness of the proposed algorithm for finding optimal designs on manifold data. In Section 6, we conclude the paper and discuss some possible future directions.
2 Optimal Design of Experiments on Manifolds
2.1 Traditional ODOE on Euclidean Space
Consider initially a classical linear regression model
$$y = \beta^\top \phi(x) + \varepsilon, \qquad (1)$$
where $\phi(\cdot)$ is some nonlinear mapping function that maps from the input space $\mathcal{X}$ to the feature space $\mathcal{F}$, $\beta$ is a column vector of unknown parameters, and $\varepsilon$ is assumed to have a $N(0, \sigma^2)$ distribution. Given a sample of design points $\{x_i\}_{i=1}^n$ and the corresponding response values $\{y_i\}_{i=1}^n$, the least squares estimates of the parameters are given by:
$$\hat{\beta} = (X^\top X)^{-1} X^\top Y, \qquad (2)$$
where $X$ is a design matrix with $i$-th row defined as $\phi(x_i)^\top$, and $Y = (y_1, \dots, y_n)^\top$ is a response vector. As a result, the corresponding fitted function is $\hat{f}(x) = \hat{\beta}^\top \phi(x)$.
Classical work on Optimal Design of Experiments (ODOE) was developed by Kiefer and Wolfowitz (1960) and summarized by Fedorov (1972). The goal of ODOE is to find an experimental design that is optimal with respect to some statistical criterion related either to the model parameter estimates or to the model predictions. For example, given the linear regression model (1), the D-optimality criterion minimizes the determinant of the covariance matrix of the parameter estimates, while the G-optimality criterion minimizes the maximum prediction variance over the design region. These and similar criteria are called “alphabetic optimality” design criteria by Box and Draper (2007).
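As a concrete numerical illustration (not taken from the original text), the sketch below evaluates both criteria for a simple quadratic model on [-1, 1]; the feature map and the candidate grid are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: evaluating the D- and G-optimality criteria for a
# simple quadratic model y = b0 + b1*x + b2*x^2 on [-1, 1].

def features(x):
    """Feature map phi(x) = (1, x, x^2) for a one-dimensional input."""
    return np.array([1.0, x, x ** 2])

def d_criterion(design):
    """det(X^T X); maximizing it minimizes det of the covariance of beta-hat."""
    X = np.array([features(x) for x in design])
    return np.linalg.det(X.T @ X)

def g_criterion(design, candidates):
    """Maximum scaled prediction variance phi(x)^T (X^T X)^{-1} phi(x)."""
    X = np.array([features(x) for x in design])
    Minv = np.linalg.inv(X.T @ X)
    return max(features(x) @ Minv @ features(x) for x in candidates)

candidates = np.linspace(-1, 1, 21)
# The D-optimal 3-run design for a quadratic model on [-1, 1] places its
# runs at {-1, 0, 1}; an interior design has a smaller determinant.
print(d_criterion([-1.0, 0.0, 1.0]), d_criterion([-0.5, 0.0, 0.5]))
print(g_criterion([-1.0, 0.0, 1.0], candidates))
```

Consistently with the Kiefer-Wolfowitz theorem discussed in Section 3, the maximum scaled prediction variance of the D-optimal design equals the number of parameters divided by the number of runs, 3/3 = 1, in this toy case.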
Traditional ODOE methods assume both the covariate and response data lie in a Euclidean space under the classical linear regression model (1). However, in applications to high-dimensional image and text datasets, traditional ODOE is not applicable, as it does not consider the intrinsic manifold structure these types of datasets usually have (Tenenbaum et al. (2000); Roweis and Saul (2000)).
While there have been recent attempts at applying alphabetic optimality criteria to manifold learning models (He (2010); Chen et al. (2010); Alaeddini et al. (2019)), no theoretical justification exists, as far as we know, for these methods, and no guarantees can be given for their success other than empirical experiments. A new theory of optimal experimental design is therefore needed that explicitly considers high-dimensional manifold data, justifies existing methods where possible, and shows a principled way to develop new methods. Before we discuss the design of experiments on manifolds, we first need to introduce a manifold learning model by Belkin et al. (2006) that will be used in the sequel.
2.2 Manifold Regularization Model
In the well-known paradigm of machine learning, the process of learning is seen as using the training data $\{(x_i, y_i)\}_{i=1}^{l}$ to construct a function $f$ that maps a data instance $x$ to a label variable $y$. Let $P$ be the joint distribution that generates labeled data $(x, y)$ and $P_X$ be the marginal distribution that generates unlabeled data $x$. In order to extend the learning of functions to nonlinear manifolds, Belkin et al. (2006) assume that the conditional distribution $P(y \mid x)$ varies smoothly as $x$ moves along a manifold $\mathcal{M}$ that supports $P_X$. In other words, if two data points $x_1, x_2 \in \mathcal{M}$ are close as measured by an intrinsic (or geodesic) distance on this manifold, then the two probabilities of the labels, $P(y \mid x_1)$ and $P(y \mid x_2)$, will be similar. These authors developed a semi-supervised learning framework that involves solving the following doubly regularized objective function:
$$f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2, \qquad (3)$$
where $V$ is a given loss function (such as the squared loss), $\mathcal{H}_K$ is a Reproducing Kernel Hilbert Space (RKHS) (Aronszajn (1950)) with associated Mercer kernel $K$, $\|f\|_K^2$ is a penalty term, with the norm of $\mathcal{H}_K$, that imposes smoothness conditions in the ambient space (Wahba (1990)), and $\|f\|_I^2$ is a penalty term for non-smoothness along geodesics on the intrinsic manifold structure that supports $P_X$. Moreover, $\gamma_A$ and $\gamma_I$ are two regularization parameters that control the amount of penalization in the ambient space and in the intrinsic manifold that supports $P_X$, respectively. Other work in the recent literature can be explained as particular cases of this general framework. For example, the spatial regression model proposed by Ettinger et al. (2016) can be seen as the manifold regularization model (3) without the ambient space regularization. There is also other work on nonparametric regression models on manifolds (Cheng and Wu (2013); Marzio et al. (2014); Lin et al. (2017)), but in this paper we focus on the manifold regularization model from Belkin et al. (2006).
Intuitively, the choice of $\|f\|_I^2$ should be a smoothness penalty corresponding to the probability distribution $P_X$. However, in most real-world applications $P_X$ is not known, and therefore empirical estimates of the marginal distribution must be used. Considerable research has been devoted to the case when $P_X$ is supported on a compact manifold $\mathcal{M}$ (Roweis and Saul (2000); Tenenbaum et al. (2000); Belkin and Niyogi (2003); Donoho and Grimes (2003); Coifman et al. (2005)). Under this assumption, it can be shown that problem (3) can be reduced to
$$f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(l+u)^2} \mathbf{f}^\top L \mathbf{f}, \qquad (4)$$
where $\mathbf{f} = (f(x_1), \dots, f(x_{l+u}))^\top$ and $L$ is the Laplacian matrix associated with the data adjacency graph $G$ that is constructed on all the labeled and the unlabeled data points $\{x_i\}_{i=1}^{l+u}$. In particular, the graph Laplacian $L$ approximates the Laplace-Beltrami operator acting on the continuous Riemannian manifold (see Belkin (2003); Lafon (2004); Belkin and Niyogi (2005); Coifman et al. (2005); Hein et al. (2005)). In this way, Belkin et al. provide a theoretical justification for the common trick in manifold learning of using a graph and geodesic distances on the graph as an approximate representation of the manifold $\mathcal{M}$, providing a precise sense in which the graph approaches $\mathcal{M}$ as the data points get denser. This way, the term $\mathbf{f}^\top L \mathbf{f}$ serves as an approximation for $\|f\|_I^2$ and penalizes the lack of smoothness of $f$ as it varies between adjacent points in the graph $G$.
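As an illustration of this construction, the sketch below builds a graph Laplacian and evaluates the intrinsic penalty, the quadratic form of a function vector in that Laplacian. The k-NN rule, heat-kernel weights, and circle dataset are our own assumptions, not prescribed by the text.

```python
import numpy as np

# Illustrative sketch: building a k-NN graph Laplacian L (a common way to
# approximate the Laplace-Beltrami operator) and evaluating the intrinsic
# smoothness penalty f^T L f.

def graph_laplacian(X, k=5, t=1.0):
    """L = D - W for a symmetrized k-NN graph with heat-kernel weights."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros_like(sq)
    for i in range(len(X)):
        nbrs = np.argsort(sq[i])[1:k + 1]      # k nearest, excluding self
        W[i, nbrs] = np.exp(-sq[i, nbrs] / t)  # heat-kernel weights
    W = np.maximum(W, W.T)                     # symmetrize
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([np.cos(theta), np.sin(theta)])  # points on a circle
L = graph_laplacian(X)

f_smooth = np.cos(theta)        # varies smoothly along the manifold
f_rough = rng.normal(size=200)  # ignores the manifold entirely
print(f_smooth @ L @ f_smooth, f_rough @ L @ f_rough)
```

Since the quadratic form equals half the weighted sum of squared differences between adjacent values, the function that varies smoothly along the manifold incurs a much smaller penalty than the rough one.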
In addition, Belkin et al. (2006) proceed to prove a representer theorem (similar to the theory of splines in Wahba (1990)), which shows how the solution of the infinite-dimensional problem (4) can be represented in terms of a finite sum over the labeled and unlabeled points:
$$f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* K(x_i, x), \qquad (5)$$
where $K(\cdot, \cdot)$ is the Mercer kernel associated with the ambient space RKHS $\mathcal{H}_K$. More details about the manifold regularization model can be found in Belkin et al. (2006).
2.3 Regularized ODOE on Manifolds
As far as we know, the first discussion of regularized ODOE comes from Vuchkov (1977)
, who proposed a ridge-type procedure for ODOE based on the ridge regression estimator:
$$\hat{\beta}_{\text{ridge}} = (X^\top X + k I)^{-1} X^\top Y. \qquad (6)$$
Vuchkov’s motivation was to use the ridge estimator to solve the singular or ill-conditioned problems that may arise in the sequential D-optimal design algorithm when the number of design points is smaller than the number of parameters to estimate. The ridge solution (6) can be seen as a particular case of the more general learning problem (4) in which $V$ is a squared-loss function, the RKHS is equipped with an $\ell_2$-norm, and the manifold regularization parameter $\gamma_I$ equals zero.
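A minimal sketch of the ridge estimator (6) follows; the data and ridge constant are illustrative assumptions. It shows that the estimator remains well defined even with fewer runs than parameters, which is exactly the singular case that motivated Vuchkov.

```python
import numpy as np

# Minimal sketch of the ridge estimator (6): beta = (X^T X + k I)^{-1} X^T y.
# The data and the ridge constant k below are illustrative assumptions.

def ridge(X, y, k=1e-2):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# One run, three parameters: ordinary least squares is singular here,
# but the ridge estimator still exists and nearly interpolates the run.
X = np.array([[1.0, 2.0, 3.0]])
beta = ridge(X, np.array([1.0]))
print(beta, X @ beta)
```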
To discuss the optimal experimental design for the general manifold regularization model (4), we first need to clarify some notation. Without loss of generality, assume a sequential experimental design generation, starting with no labeled data at the beginning of the sequence. Let $Z_k$ be the set of points that have been labeled at the $k$-th iteration, and let $Y_k$ be the corresponding label vector. Given a squared loss function, the manifold regularization model (4) becomes the Laplacian Regularized Least Squares (LapRLS) problem:
where $Y$ is the label vector and $K$ is the Gram matrix defined by $K_{ij} = K(x_i, x_j)$, with $K(\cdot, \cdot)$ the kernel embedded in the RKHS $\mathcal{H}_K$. Taking the derivative of (8) with respect to $\alpha$ and setting it equal to zero, we arrive at the following expression:
Considering the linear regression model (1) and a linear kernel for $K$, the regression parameters $\beta$ can be estimated by
By some simple linear algebra (a formal proof is provided in the Appendix), the estimated parameters (10) can be simplified to
The determinant of the covariance matrix (13) is the statistical criterion we will minimize to obtain a D-optimal design. Before discussing the optimal design algorithm, we first provide some theoretical justification of ODOE on Riemannian manifolds in the following section.
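The closed-form LapRLS fit used above can be sketched numerically, following the solution form in Belkin et al. (2006), in which the expansion coefficients solve a single regularized linear system over labeled and unlabeled points. The RBF kernel, the parameter values, and the placeholder identity Laplacian below are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the closed-form LapRLS solution of Belkin et al. (2006):
# alpha = (J K + gA*l*I + gI*l/(l+u)^2 * L K)^{-1} Y, where the first l
# points are labeled and the remaining u are unlabeled.

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def laprls_fit(X, y_labeled, L, l, gA=1e-2, gI=1e-2, gamma=1.0):
    n = len(X)                                # n = l + u
    K = rbf_kernel(X, X, gamma)
    J = np.diag([1.0] * l + [0.0] * (n - l))  # selects the labeled rows
    Y = np.concatenate([y_labeled, np.zeros(n - l)])
    M = J @ K + gA * l * np.eye(n) + gI * l / n ** 2 * (L @ K)
    alpha = np.linalg.solve(M, Y)
    return lambda Xnew: rbf_kernel(Xnew, X, gamma) @ alpha

# Toy usage: 2 labeled and 4 unlabeled points on a line, with an identity
# matrix standing in for the graph Laplacian.
X = np.linspace(0, 1, 6)[:, None]
f = laprls_fit(X, np.array([0.0, 1.0]), L=np.eye(6), l=2)
print(f(X))
```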
3 Theoretical Results
When the determinant of the information matrix is maximized, one obtains a so-called D-optimal experimental design. In Euclidean space, “continuous” or “approximate” optimal design theory (which considers the proportion of experimental tests allocated to different locations over the space) establishes the equivalence between the D-optimality criterion and the so-called G-optimality criterion, in which the maximum prediction variance is minimized, as stated by the celebrated Kiefer-Wolfowitz (KW) theorem (Kiefer and Wolfowitz (1960); Kiefer (1974)). In analogy with the KW theorem, we aim to develop a new equivalence result for optimal experimental design based on the manifold regularization model (4), which can then be used for designing an experiment on a Riemannian manifold.
Assume there is an infinite number of points that are uniformly distributed on a Riemannian manifold $\mathcal{M}$. Let $\xi$ be a continuous design on $\mathcal{M}$. For any continuous design $\xi$, it is known, based on the Carathéodory Theorem, that $\xi$ can be represented as
For any , the corresponding information matrix of LapRLS model is defined as
where is a probability measure of design on the experimental region , is the Laplace-Beltrami operator on , and is the uniform measure on . Note that the last two terms in (15) are independent of the design , thus for simplicity, define
Then (15) can be written as
Based on the parameter estimates (12), for a given continuous design $\xi$, the prediction variance at a test point $x$ is
As can be seen, under the LapRLS model one can obtain a D-optimal design by maximizing the determinant of the information matrix and a G-optimal design by minimizing the maximum prediction variance. Similarly to the optimal design of experiments in Euclidean space, we next prove an equivalence theorem on Riemannian manifolds that shows how the D- and G-optimality criteria lead to the same optimal design. Before the equivalence theorem is discussed, we need to prove some auxiliary results. The proofs of these propositions are provided in the Appendix.
Proposition 1. Let $\xi_1$ and $\xi_2$ be two designs with the corresponding information matrices $M(\xi_1)$ and $M(\xi_2)$. Then
where is the information matrix of the design
Proposition 2. Let $\xi_1$ and $\xi_2$ be two designs with the corresponding information matrices $M(\xi_1)$ and $M(\xi_2)$. Then
where is the information matrix of the design
Proposition 3. For any continuous design $\xi$,
Proposition 4. The function $\log \det M(\xi)$ is a strictly concave function.
Based on Propositions 1-4, we can now prove the equivalence theorem for the LapRLS model. In summary, the following theorem demonstrates that the D-optimal design and G-optimal design are equivalent on the Riemannian manifold . It also provides the theoretical value of maximum prediction variance of the LapRLS model when the D/G optimal design is achieved.
Theorem 1 (Equivalence Theorem).
The following statements are equivalent:
the design maximizes
the design minimizes
(1) 1 ⇒ 2:
Let $\xi^*$ be the design that maximizes $\det M(\xi)$ and define $\xi' = (1 - \alpha)\xi^* + \alpha \bar{\xi}$, where $\bar{\xi}$ is some arbitrary design. According to Proposition 2, we have that
When , we have . Thus
Since $\xi^*$ is the maximal solution, we have
Without loss of generality, assume the design $\bar{\xi}$ has only one instance $x$. Then we have
In addition, based on Proposition 3, we have
(2) 2 ⇒ 1:
Let $\xi^*$ be the design that minimizes the maximum prediction variance, but assume it is not D-optimal. Based on Proposition 4, we know there must exist a design such that:
Since is the design that minimizes , by Proposition 3, we have
Thus, for any ,
Therefore, we have
This contradicts (33). Therefore, the design is also D-optimal.
(3) 1 ⇒ 3:
Let be the D-optimal design. From the previous proof, in particular Equation (31), we know that
(4) 3 ⇒ 1:
Let be the design such that
Then, for any ,
Based on the previous proof, we know that equation (42) implies that there is no improving direction for the D-optimality criterion. Thus the design is D-optimal.
(5) Since 1 ⇔ 2 and 1 ⇔ 3, it follows that 2 ⇔ 3, and the equivalence theorem is proved.
In contrast to the classical equivalence theorem in Euclidean space, Theorem 1 demonstrates the equivalence of the D-optimal and G-optimal designs on a Riemannian manifold. In addition, for any given design $\xi$, Equation (24) provides a new lower bound for the maximum prediction variance. Theorem 1 shows that this lower bound (24) is achieved at the D/G-optimal design $\xi^*$. Therefore, Theorem 1 also provides a theoretical justification that the D/G-optimal design minimizes the maximum prediction variance of the model.
4 Proposed Algorithm and Convergence Analysis
Before we discuss our proposed algorithm for finding optimal experimental designs on manifolds, some auxiliary results need to be derived.
Proposition 5. Let $M(\xi_k)$ be the information matrix of the design at the $k$-th iteration. Let $M(\xi_x)$ be the information matrix of the design concentrated at one single point $x$. Given $\alpha \in (0, 1)$, then
Proposition 6. Let $M(\xi_k)$ be the information matrix of the design at the $k$-th iteration. Construct the design at the $(k+1)$-th iteration as
Then the resulting sequence $\{\det M(\xi_k)\}$ is non-decreasing.
Based on Propositions 5 and 6, we propose a new algorithm to find the D/G-optimal experimental design on a manifold, as shown in Algorithm 1. Note how, after obtaining an optimal design for the data to be labeled and acquiring the corresponding labels, we use both labeled and unlabeled instances to train the manifold regularized model.
We next provide a convergence analysis of the proposed algorithm.
Update the design
where is a user choice that satisfies
Compute the information matrix $M(\xi_{k+1})$, set $k = k + 1$, and repeat steps 1-3.
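The sequential procedure above can be sketched for a discrete candidate set: at each step, evaluate the prediction variance at every candidate and label the maximizer. The rank-one information-matrix update and the regularization matrix A (standing in for the design-independent terms of the information matrix) are illustrative simplifications.

```python
import numpy as np

# Sketch of a sequential D/G design on a discrete candidate set: repeatedly
# pick the candidate with maximum prediction variance x^T M^{-1} x and add
# its rank-one contribution to the information matrix M.

def sequential_design(candidates, n_select, A):
    M = A.copy()                               # current information matrix
    selected, dets = [], [np.linalg.det(M)]
    for _ in range(n_select):
        Minv = np.linalg.inv(M)
        # prediction variance for every candidate at once
        var = np.einsum("ij,jk,ik->i", candidates, Minv, candidates)
        i = int(np.argmax(var))
        selected.append(i)
        M = M + np.outer(candidates[i], candidates[i])  # rank-one update
        dets.append(np.linalg.det(M))
    return selected, dets

rng = np.random.default_rng(1)
cands = rng.normal(size=(100, 3))
picks, dets = sequential_design(cands, n_select=5, A=1e-3 * np.eye(3))
print(picks)
```

Since det(M + xx^T) = det(M)(1 + x^T M^{-1} x), the determinant sequence produced this way is non-decreasing, mirroring Proposition 6.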
Theorem 2 (Convergence Theorem).
The iterative procedure in Algorithm 1 converges to the D-optimal design.
Suppose the design is not D-optimal. Based on Proposition 6, we have
It is known that any bounded monotone sequence converges. Thus the sequence $\{\det M(\xi_k)\}$ converges to some limit. Next, we need to show
The proof proceeds by contradiction. Assume
By the convergence of the sequence $\{\det M(\xi_k)\}$, we know that for any $\epsilon > 0$ there exists $N$ such that
Based on Proposition 5, we have
Define , then we can rewrite (55) as
Define a function as
Clearly, for . Thus, for a given , is an increasing function with respect to . On the other hand,
Let , we have
Thus, for and a given , is an increasing function. In particular, plug in the formula of , we have
Notice that is the same choice of in the proposed algorithm.
In addition, based on Equation (57), it can be seen that for any and , we have . Note that is an arbitrary positive number in equation (56), which implies that it needs to be an infinitely small positive number to satisfy equation (56); i.e., given , there exists such that
Choosing , we have a contradiction, and therefore, the convergence theorem is proved.
From the derivation of Algorithm 1, it is not difficult to notice that ODOEM (Optimal Design of Experiments on Manifolds) is a model-dependent design: the corresponding manifold regularization model (7) needs to be trained after a desired number of instances is labeled. As shown above, Algorithm 1 is a convergent algorithm on a continuous design space. However, sometimes the experimental design space is not continuous and only a set of candidate points is available. For a discrete design space with a set of candidate points, one can evaluate each candidate point and choose the point with maximum prediction variance. The resulting sequence of determinants is still non-decreasing, since
5 Numerical Results
To illustrate the empirical performance of the proposed ODOEM algorithm in practice, we consider its application both to synthetic datasets, i.e., low-dimensional manifold datasets that permit straightforward visualization of the resulting designs, and to high-dimensional real-world image datasets.
5.1 Synthetic Datasets
In this section, we generated four different two-dimensional manifold datasets: data on a Torus, on a Möbius Strip, on a figure-“8” immersion of the Klein bottle, and on a classic Klein bottle. Each of the first three datasets contains 400 instances, and the last dataset contains 1600 instances. For all four datasets, we plot these two-dimensional manifolds in a three-dimensional Euclidean space, as shown in Figures 1-8. The colors on these manifolds represent the corresponding response values (or their estimates based on different experimental designs), which were defined by
where and . The red numbers on the manifolds represent the sequence of labeled instances by different design algorithms.
In order to show the improvement provided by the ODOEM algorithm, we compare it with a classical D-optimal design algorithm on a kernel regression model, which does not consider the manifold structure. For both learning models, we choose an RBF kernel and set the range parameter to 0.01. In addition, we choose in both models for numerical stability, and in ODOEM (a discussion of the choice of is provided in the Appendix).
In some real-world applications, the data may not lie strictly on a given manifold due to noise. In order to explore the robustness of the ODOEM algorithm to noise, we also let the four synthetic datasets fluctuate around their manifolds by adding noise. In other words, for each of the four manifolds, we investigate both the case when the data lie exactly on the given manifold and the case when they do not.
The results are shown in Figures 1-4. As can be seen, on all four synthetic datasets, ODOEM performs much better than the kernel regression D-optimal design in terms of instance selection and function fitting, in both the noiseless and noisy cases.