High-dimensional function approximation plays an important role in building predictive models for a variety of scientific and engineering problems. It is typical for scientists to build an accurate and fast-to-evaluate surrogate model to replace a computationally expensive physical model, in order to reduce the overall cost of a large set of model executions. However, when the dimension of the target function’s input space becomes large, the data fitting becomes a computationally challenging task. Due to the curse of dimensionality, an accurate function approximation would require the number of samples in the training dataset to increase exponentially with respect to the dimension of input variables. On the other hand, given the complexity of the underlying physical model, the amount of observational data is often very limited. This causes classical approximation methods such as sparse polynomial approximation (e.g. sparse grids) to fail on high-dimensional problems outside of some special situations. One way to alleviate the challenge is to reduce the input dimension of the target function by finding intrinsically low-dimensional structures.
The existing methods for dimension reduction in function approximation can be divided into two main categories. The first one is to exploit the dependence between input variables to build low-dimensional manifolds in the input space. For example, principal component analysis is widely used, due to its simplicity, to compress the input space to a low-dimension manifold. Isometric feature mapping 
is an effective method to compute a globally nonlinear low-dimensional embedding of high-dimensional data. Its modification known as locally linear embedding[28, 11] provides solutions to more general cases. However, in practice there are often no dependences between input variables to exploit, so that the dimension of the input space cannot be effectively reduced by methods reliant on this assumption. This represents a challenging research question for function approximation, namely, how to effectively reduce the dimension of a function with independent input variables.
To answer this question, the second category of dimension reduction methods aims at reducing the input dimension by exploiting the relationship between the input and the output, i.e., learning the geometry of a function’s level sets. This includes methods such as sufficient dimension reduction (SDR) [8, 2, 20, 26], the active subspace (AS) method [7, 6], and neural network based methods [34, 31, 17, 3]. This type of method first identifies a linear/nonlinear transformation that maps the input variables to a handful of active variables (or coordinates), then projects the observational data onto the subspace spanned by the active variables, and finally performs the data fitting in the low-dimensional subspace to determine the function approximation.
The SDR method [8, 2, 20] provides a general framework for finding reduced variables in statistical regression. Given the predictor (input) and its associated scalar response (output) , the SDR seeks such that . Various algorithms have been developed to determine , including the sliced inverse regression [22, 25, 9]10], and principal Hessian directions 
in which the population moment matrices of the inverse regression are approximated based on the given regression data. These methods can be extended to the nonlinear setting by introducing the kernel approaches as done in[21, 19, 32, 33] .
The AS method [7, 6] is a popular dimension reduction approach that seeks a set of directions in the input space, named active components, affecting the function value most significantly on average. Given the values of the function and its gradient at a set of sample points, this method first evaluates the uncentered covariance matrix of the gradient, denoted by , are selected to define active components
, which is a linear transformation of the input. The subspace spanned by the set of active components describes a low-dimensional linear subspace embedded in the original input space that captures most of the variation in model output. A regression surface is then constructed based on the data projected onto the active subspace , i.e., .
Recently, neural network based approaches [34, 31, 17, 3] were developed to extract low-dimensional structures from high-dimensional functions. For instance, a feature map was built in  by aligning its Jacobian with the target function’s gradient field, and the function was approximated by solving a gradient-enhanced least-squares problem. The Nonlinear Level set Learning (NLL) method  finds a bijective nonlinear transformation that maps an input point to a new point which is of the same dimension as , more specifically, with modeled by a reversible residual neural network (RevNet) [12, 5]. In this approach, the transformed variables are expected to be split into two sets: a set of active variables (coordinates) and a set of inactive variables , so that the function value is insensitive to perturbations in . That is, if , a small perturbation in within the neighborhood ofto be orthogonal to the derivative of with respect to each inactive variable . Therefore, after a successful training, the NLL method can provide a manifold that captures the low-dimensional structure of the function’s level sets. Similar to the AS method, once is determined, a regression surface can be built using the data projected onto the subspace of active variables, . It has been shown in  that NLL outperforms AS when the level sets of the function have nontrivial curvature. An improved algorithm for the NLL method was studied in . However, there still are even simple cases in which the NLL fails to effectively extract low-dimensional manifolds as shown later in this paper.
In this paper, we introduce a new Dimension Reduction via Learning Level Sets (DRiLLS) method for function approximation that improves upon existing level set learning methods in the following aspects: (1) To enhance the model’s capability, we propose a novel pseudo-reversible neural network (PRNN) to model the nonlinear transformation for extracting active variables. (2) The learning process is driven by geometric features of the unknown function, which is reflected in a loss function consisting of three terms: the pseudo-reversibility loss, the active direction fitting loss, and the bounded derivative loss. (3) A novel synthesized regression on the manifold spanned by the learned active variables is also proposed, which helps to resolve numerical oscillation issues and provides accuracy benefits over traditional local and global regressions. Extensive numerical experiments demonstrate that the proposed DRiLLS method leads to significant improvements on high-dimensional function approximations with limited or sparse data.
The rest of paper is organized as follows. In Section 2 the setting of the function approximation problem is introduced and the DRiLLS method is proposed and discussed. More specifically, the PRNN module is described in Section 2.1 and the synthesized regression module in Section 2.2. We then numerically investigate the performance of our DRiLLS method in Section 3, including ablation studies in Section 3.1, high-dimensional function approximations with limited/sparse data in Section 3.2 and a PDE-related application in Section 3.3. Finally, some concluding remarks are drawn in Section 4.
2 The proposed DRiLLS method
We consider a scalar target function, which is continuously differentiable on a bounded Lipschitz domain in :
The input variables are assumed to be independent from each other, which implies that the input space itself does not possess a low-dimensional structure. The goal is to find an approximation of the target function, given the information of and on a set of training samples in . We denote the training dataset by
which contains the input, the output and the gradient information at the samples. When the number of dimensions is large, taking a handful of random selections in each coordinate would result in a huge amount of data, which is infeasible in many application scenarios. Therefore, the sample dataset is usually sparse for high-dimensional problems.
The NLL method has achieved successes in high-dimensional function approximation on sparse data for real-world applications such as composite material design problems , however, it has difficulties in learning level sets of certain functions. In particular, NLL struggles on functions with critical points contained in the interior of the domain , such as the functions or on to be discussed later in Section 3.1
. One reason for such drawback is that the RevNet employed by NLL enforces invertibility as a hard constraint, which limits the capability of the RevNet in learning the structure of functions whose level sets are not homeomorphic to hyperplanes in the input space. Another reason is that the rate of change in the target function with respect to the inactivate variables is always zero at any interior critical points, as the gradient of the function vanishes there. Hence, the training process tends to ignore samples lying in a small neighborhood of the critical points since they do not contribute much to the training loss.
To overcome these issues and improve the performance of level set learning based function approximation, the proposed DRiLLS method consists of two major components: (1) the PRNN module that identifies active variables and reduces the dimension of input space, and (2) the synthesized regression module that selects neighboring sample points according to their Euclidean distances in the original input space and performs a local least-squares fitting based on the learned active variables to approximate the target function. A schematic diagram of the proposed method is shown in Figure 1.
2.1 The pseudo-reversible neural network
To construct the PRNN, we first define a nonlinear mapping from the input to a new point of the same dimension. In contrast to the RevNet used by the NLL method, the invertibility of this transformation is relaxed by defining another mapping from to and encouraging to be close to in distance. Thus, the reversibility is imposed as a soft constraint on the PRNN model. Specifically, the two nonlinear transformations are denoted by
respectively, where , with and being their learnable parameters. Since is not exactly invertible by definition, can be viewed as a pseudo-inverse function to . Both and are represented by a fully connected neural network (FCNN), as displayed in Figure 2
. The PRNN network structure is reminiscent of an autoencoder, but the dimension of latent space (i.e., the dimension of ) remains the same as the dimension of . While there are no theoretical restrictions on the structure of and , the experiments in Section 3 use the same FCNN architecture for both mappings.
2.1.1 The loss function
The learnable parameters and are updated synchronously during the training process by minimizing the following total loss function:
Here, is the pseudo-reversibility loss which measures the difference between and the PRNN output , is the active direction fitting loss which which enforces the tangency between and the level sets of , and is the bounded derivative loss which regularizes the sensitivity of with respect to the active variables . The weights and are hyper-parameters for balancing the three loss terms. Below each term of is discussed in detail.
The pseudo-reversibility loss
In order to train to be a pseudo inverse of , the pseudo-reversibility condition is simply enforced in the sense:
which is the same as the standard loss used to train autoencoders.
The active direction fitting loss
This loss is defined based on the fact that if the -th output of is inactive, a small perturbation of in a neighborhood of would change the target function along a direction tangent to its level sets. Specifically, we define the Jacobian matrix of the nonlinear transformation as:
In the ideal case, if is completely inactive, then the gradient vector field is orthogonal to , that is with denoting the inner product. Thus the active direction fitting loss is defined to encourage the orthogonality, i.e.,
where the scaling factors
contain the hyper-parameter and are weight hyper-parameters determining how strictly the orthogonality condition is enforced for each of the variables. A typical choice is
where denotes the dimension of the active variables/coordinates. An ideal case would be , which implies that there exists only one active variable and the intrinsic dimension of is exactly one when . The scaling factor distinguishes from the one used in , and its value changes according to the magnitude of the gradient: it approaches if gets close to and stays close to otherwise. Therefore, it serves as a rescaling factor designed to overcome the situation where the contributions of samples near interior critical points are ignored by the optimization due to their small gradients.
The bounded derivative loss
Existing methods such as NLL do not place any restrictions on the active variables, because the used RevNet imposes sufficient regularization on those variables. On the other hand, using PRNNs without regularization in may cause the network to learn an active subspace which changes too fast, producing undesirable oscillations in the target function. To address this issue, we introduce a regularization term into the loss as
where is a positive rescaling hyper-parameter. The purpose is to regularize the magnitude of to be not much greater than one. In the practical implementation, we further approximate with by considering the pseudo-reversibility of the PRNN.
2.2 The synthesized regression
The active variables (coordinates) is naturally identified based on the pre-setting values of the weights . Once the PRNN training is completed, the sample points then can be nonlinearly projected through PRNN to a much lower dimensional space spanned by . Ideally, approximating the high-dimensional function often can be achieved by approximating the low-dimensional function
. Many existing methods could be used, including classic polynomial interpolations, least-squares polynomial fitting, and regression by deep neural networks. However, because the control on is quite loose through the PRNN, could be very oscillatory with respect to or even make fail to form a function. For example, there could exist two sample points and , which are separated in the input space with different values and but mapped close together in the transformed space, i.e., . This is often the case for functions with interior critical points. The top row of Figure 3 presents an example illustration of such case, where we take and set as the active variable and the inactive variable. Consequently, general global or local regression approaches based solely on the projected information in the space of active variables are not able to effectively handle this case due to large numerical oscillations.
We develop a synthesized regression method to address this type of numerical oscillation problem. The method uses local least-squares polynomial fitting in the space of active variables, but selects neighboring sample points based on the Euclidean distance in the original input space to help to keep track of original neighborhood relationships. Our synthesized regression algorithm can be described as follows:
Given an unseen input sample , we select a set of points closest to from the set of all training samples, denoted by .
The samples are fed into the trained PRNN to generate the samples of the active variables .
We perform least-squares polynomial fitting using the data that is a subset of the training set. The approximation of , denoted by , is defined by the value of the resulting polynomial at .
Note that when the graph of in has several branches, the first two steps in the proposed synthesized regression encourages localization of the polynomial data fitting to only one of the branches. Indeed, the selected neighbors to usually stay on the same branch or intersecting region without much oscillations as shown in the bottom row of Figure 3.
3 Experimental results
The goal of this section is two-fold: the first is to test the influence of each ingredient of the proposed DRiLLS method on its overall performance, and the second is to investigate the numerical performance of the method in approximating high-dimensional functions. In particular, an ablation study is implemented in Section 3.1, including PRNN vs. RevNet and the effect of in Section 3.1.1, the effect of bounded derivative loss in Section 3.1.2, and the synthesized regression vs. some existing regression methods in Section 3.1.3. Then, through extensive comparisons with the AS and the NLL methods, we demonstrate the effectiveness and accuracy of the proposed DRiLLS method under limited/sparse data. Particularly, high-dimensional example functions are considered in Section 3.2 and a PDE-related application is given in Section 3.3.
The training dataset of size is randomly generated using the Latin hypercube sampling (LHS) method . To measure the approximation accuracy, we use the normalized root-mean-square error () and the relative error () over a test set of randomly selected input points from the domain:
where are the exact function values and and are the approximated values. In the experiments, we set for low-dimensional problems () and for high-dimensional problems (). This procedure is replicated for 10 times and the average values are reported as the final and errors for function approximation.
Our DRiLLS method is implemented using PyTorch. If not specified otherwise, we choose the followingdefault model setting: and in the PRNN are constructed by FCNNs that contain 4 hidden layers with hidden neurons per layer, respectively;
is used as the activation function; the hyper-parameters, , and are selected in the loss function; and cubic polynomial are used for the local least-squares fitting in the synthesized regression. For the training of PRNN, we use a combination of the Adam optimizer  and the L-BFGS optimizer . The Adam iteration  is first applied with the initial learning rate 0.001, and the learning rate decays every steps by a factor of for up to steps. Then, the L-BFGS iteration is applied for a maximum of steps to accelerate the convergence. The training process is immediately stopped when training error reduces to . Both the AS and the NLL methods used for comparison are implemented in ATHENA111ATHENA codes available at https://github.com/mathLab/ATHENA. , which is a Python package for parameter space dimension reduction in the context of numerical analysis. All the experiments reported in this work are performed on an Ubuntu 20.04.2 LTS desktop with a 3.6GHz AMD Ryzen 7 3700X CPU, 32GB DDR4 memory and NVIDIA RTX 2080Ti GPU.
3.1 Ablation studies
We first numerically investigate the effect of major components in the proposed DRiLLS method, including the PRNN, the loss functions, the hyper-parameters and the synthesized regression. Several functions of two dimensions are considered. Since the dimension is , it is natural to take , i.e., in (6) with being the active variable and the inactive one in the transformed space of . For the same reason, two hidden layers are used for each of the FCNNs representing and , different from the default settings. From the tests reported in Sections 3.1.2 and 3.1.1, we observe that the Adam optimization during PRNN training terminated within 20000 steps in all cases, while the tests in Section 3.1.3 required up to 60000 steps to meet the stopping criterion due to more complicated geometric structures in the target function.
To visually evaluate the function approximation, we present two types of plots: The quiver plot shows the gradient field of (blue arrows) and the vector field corresponding to the second Jacobian column (red arrows) on a uniform grid, where increased orthogonality between the red and blue arrows indicates increased accuracy in the network mapping; The regression plot draws the approximated function values (red circles) over 400 randomly generated points in the domain together with the associated exact function values (blue stars), where good performance is indicated by a thin regression curve and a large degree of overlap between the blue stars and the red circles (exact and approximate function values).
3.1.1 PRNN vs. RevNet and the effect of
One of the main differences between the proposed PRNN and the RevNet is their treatments of reversibility
: the former imposes it as a soft constraint while the latter imposes a hard constraint (realized by a special network structure). Furthermore, the special structure of the RevNet requires an equal separation of the inputs into two groups. Thus, if the input space has an odd dimension, it has to be padded with an auxiliary variable (e.g., a column of zero). On the other hand, the PRNN represents a larger class of functions than the RevNet, so that a better nonlinear transformation can be found when there is no need for explicit invertibility. To compare these two neural network structures, the following two functions are considered for testing:
where the domain of is either or . Note that both and reach their minimum at the origin, which is located in the interior of but only on the boundary of . Since we focus on the influence of reversibility in this subsection, we temporarily set and . The corresponding total loss for our DRiLLS method defined by (3) then becomes
respectively, because is automatically zero in the case of RevNet. In the following tests, the RevNet uses 10 RevNet Blocks with 2 neurons each, as the input space has the dimension two, and a step size (see  for details about the used RevNet structure). We choose the size of the sample dataset for training to be and both the PRNN and the RevNet are trained using the same dataset.
The testing results of are presented in Figure 4 for the case and in Figure 5 for the case , where several choices of are considered, i.e., the first column for , the second column for , and the third column for . It is observed that both network structures, the PRNN and the RevNet, work well for with the domain as shown in Figure 4. It is worth noting that and for any , i.e, the behavior of in is somehow monotonic. However, when the domain is changed to , the behavior of in is not monotonic anymore and the RevNet encounters difficulties in finding the appropriate active variable. Indeed, as shown in the third row of Figure 5, the gradient is not orthogonal to at many points no matter the value of is, which indicates the function value is still sensitive to the first inactivate variable . This further leads larger errors in the regression process and function approximation, as seen in the fourth row of Figure 5.
The testing results of are displayed in Figures 7 and 6 for the function respectively defined in and . We remark that the behavior of in either or in is not monotonic at all. It is observed from Figures 7 and 6 that the PRNN achieves superior performances on both domains: the quiver plots indicate that the RevNet has difficulty in ensuring the function value to be insensitive to in both and cases, and the associated regression plots show that the RevNet produces a more erroneous function approximation. The PRNN, on the contrary, still works well on both domains, which further leads to more accurate function approximations.
Meanwhile, we also observe that the value of does not have much impact on the performance of RevNet. For the PRNN, the effect of on the performance also seems negligible for the case , but becomes significantly different for the case . As increases from to and , the learned level sets and the function approximations get more and more accurate, especially for . As shown in the first rows of Figures 7 and 5, red arrows are well perpendicular to the blue arrows in the quiver plots for two larger values of , manifesting more effective dimension reductions. Moreover, less blue dots are visible in the regression plots in the third column than those in the first two column, which indicates less discrepancy between the predicted values and the exact function values.
3.1.2 The effectiveness of the bounded derivative loss
We use with the domain to investigate the effect of the bounded derivative loss that is a new loss term compared to those used in the NLL method. The purpose of is to reduce the oscillation in the function values after they are projected onto the active variable space, thus, it mainly can be regarded as a regularization term.
To check whether the proposed bounded derivative loss helps the training process of the proposed PRNN in our DRiLLS method, we vary the value of from to and while fixing the other experimental settings. The training dataset again has size 500. The evolutions of the total loss , the pseudo-reversibility loss , and the active direction fitting loss during the training process are presented in Figure 8. It is observed that the pseudo-reversibility loss is not affected by the choice of , but the total training loss and the active direction fitting loss both decay faster when than when . Conversely, the even larger value does not further accelerate the training process.
3.1.3 The synthesized regression v.s. other regression methods
Once the transformation to the active variable is obtained through the PRNN, we apply the proposed synthesized regression for approximating the target function. To better demonstrate the advantage of our synthesized regression, we consider the following example featured in Figure 3:
Due to the complicated behavior of the function in , we set the size of training dataset in the PRNN and the associated quiver and regression plots produced by our method are presented in Figure 9. The former demonstrates the efficacy of PRNN dimension reduction as the derivative in the function with respect to is tangent to the level sets, and the latter indicates accurate regressions have been obtained as almost all the blue stars and red circles coincide with each other, though the graph of has several branches.
|Synthesized Regression||Direct Local Fitting||Global Fitting||Neural Network|
The performance of the proposed synthesized regression is also compared with some other popular regression methods based on the same PRNN transformed data, including the polynomial regression in local and global fashions and the nonlinear regression by neural networks. In particular, cubic polynomial fitting is applied, and the neural network regression uses a FCNN of 3 hidden layers with 20 neurons in each layer. The function approximation errors are summarized in Table 1, which shows that the direct local fitting, the global fitting, and the neural network regression all fail to provide accurate predictions while the synthesized regression performs significantly well.
3.2 High-dimensional function approximation with limited data
Here we compare our DRiLLS method with two popular dimension reduction methods, the AS and the NLL, for function approximation with limited/sparse data. To ensure a fair comparison, the proposed synthesized regression will be applied to all compared methods for regression after active subspaces/variables are identified. In particular, the dimension of the active variables is set to and for DRiLLS and NLL and similarly for AS, which are often the typical choices in practice. We consider the following four functions:
For and , with and with are respectively used as the domain of the functions. For , we only take with is used as the domain. Obviously, the behaviors of the these functions are much more complicated in than in . We observed that for all the tests reported in this section their training processes again terminated within 60000 Adam optimization steps.