Nonlinear Level Set Learning for Function Approximation on Sparse Data with Applications to Parametric Differential Equations

A dimension reduction method based on the "Nonlinear Level set Learning" (NLL) approach is presented for the pointwise prediction of functions which have been sparsely sampled. Leveraging geometric information provided by the Implicit Function Theorem, the proposed algorithm effectively reduces the input dimension to the theoretical lower bound with minor accuracy loss, providing a one-dimensional representation of the function which can be used for regression and sensitivity analysis. Experiments and applications are presented which compare this modified NLL with the original NLL and the Active Subspaces (AS) method. While accommodating sparse input data, the proposed algorithm is shown to train quickly and provide a much more accurate and informative reduction than either AS or the original NLL on two example functions with high-dimensional domains, as well as two state-dependent quantities depending on the solutions to parametric differential equations.

1. Introduction

It is frequently the case that scientists or engineers need to draw conclusions about the output of a function based on limited or incomplete data. Such situations arise, for example, when the output depends on the solution of expensive differential equations, or when lack of time and resources precludes the collection of sufficient high-quality samples. When this occurs, it becomes critical to maximize the value of the limited resources at hand, which requires informed algorithms for dimension reduction.

More specifically, let $\Omega \subset \mathbb{R}^d$ be a bounded domain and consider the problem of approximating a continuously differentiable function $f:\Omega\to\mathbb{R}$ based on some predefined samples $\{(x_s, f(x_s), \nabla f(x_s))\}_{s=1}^N$ of the function and its gradient vector field. In particular, let $\rho$ be a probability density function supported on $\Omega$ such that $f$ is square-integrable with respect to $\rho$, i.e.

$\int_\Omega |f(x)|^2\,\rho(x)\,dx < \infty.$

To generate a pointwise approximation to $f$, it is reasonable to seek a function $\tilde f$ which satisfies the minimization condition

$\tilde f = \operatorname*{argmin}_{h} \int_\Omega \big|f(x) - h(x)\big|^2\,\rho(x)\,dx. \qquad (1)$

However, if the dimension $d$ is large relative to the number of available samples (i.e. sparse data), training a regression model to approximate $f$ directly becomes infeasible. Indeed, unless the training data itself has a hidden low-dimensional structure, flexible learning methods such as feed-forward neural networks are prone to overfitting, leading to poor accuracy on new data as a result of inadequate generalizability. Therefore, it is necessary to employ some kind of dimension reduction to increase the density of the sampling data to the point where it is useful for approximating solutions to (1).

A prototypical example of this issue arises when studying the numerical solutions of differential equations with a limited computational budget. Let $\alpha$ be a multi-index with $|\alpha|\le k$, let $\mu\in\Omega$ denote a vector of parameters, and consider a $k$-th order parameterized system of partial differential equations (PDE) for the function $u = u(t,x;\mu)$,

$F\big(t,\,x,\,\mu,\,\{D^\alpha u\}_{|\alpha|\le k}\big) = 0, \qquad (2)$

which may depend on some number of initial or boundary conditions. Suppose additionally that the assignment $\mu\mapsto u(\cdot\,;\mu)$ is unique, so that solutions to (2) are parameterized by the variables $\mu$. For prediction and sensitivity analysis it is often necessary to compute the value of some functional $G$ on PDE solutions (e.g. temperature or total kinetic energy) which is implicitly a function of the parameters, i.e. $f(\mu) = G\big(u(\cdot\,;\mu)\big)$. On the other hand, it is usually not feasible to simulate the (potentially expensive) system (2) for every parameter configuration desired, so it is necessary to have a reasonable yet inexpensive approximation to $f$ which can be computed for any $\mu$ in place of numerically solving (2). In the language used above, this means finding $\tilde f$ which satisfies (1) with $f(\mu) = G\big(u(\cdot\,;\mu)\big)$. This poses difficulty when the set of configurations for which solutions are available is sparse in $\Omega$, as occurs when solutions to (2) require hours (or even days) on a supercomputer to obtain. On the other hand, Section 5.3 shows that dimension reduction can be used to produce certain low-dimensional approximations to functionals like $G$ which reliably approximate the true values.

The state-of-the-art strategies for dimension reduction can be loosely categorized as either intrinsic or extrinsic based on how they organize the available data. Intrinsic methods look for patterns directly within the input samples, which can be exploited for lower-dimensional classification and clustering once revealed. At the present time, there are several effective dimension-reduction methods which operate intrinsically, including Isomap, Locally Linear Embedding, and others (see e.g. [3, 35, 4, 7] and references therein). Moreover, these methods have demonstrated good performance on low-dimensional encoding problems as well as the recovery of geodesic distances along the data manifold (see e.g. [9, 42, 28]). On the other hand, intrinsic methods are of no use when the data is unstructured, since there is no low-dimensional structure to be found. This motivates the search for extrinsic dimension-reduction methods, which do not assume the given data possesses any structure at all. Instead, these methods take advantage of the structure which is inherited from an external object acting on the input space. For example, consider a surjective mapping onto a low-dimensional target, which stratifies the source according to its level sets. If this induced level set structure can be learned, then it can be used to produce a low-dimensional approximation to the original function. The advantage of such extrinsic methods is that they are applicable to any kind of sampling, including sparsity patterns which are indistinguishable from random noise. Conversely, these techniques are necessarily dependent on the object the source data inherits from, meaning there can be no single solution which applies to every notion of inherited structure.

Remark 1.

We use the notation $df$ to denote the derivative of $f$, i.e. the linear map induced by $\nabla f$ satisfying $df(v) = \nabla f\cdot v$ for all vectors $v$. Similarly, if $e_1,\dots,e_d$ denote the standard basis for $\mathbb{R}^d$, then we let $\partial_i f = df(e_i)$ denote the derivative of $f$ with respect to $x_i$. The extension of this notation to vector-valued functions is straightforward.

Thankfully, in the case of continuously differentiable functions there is a good deal of information provided by classical theory which can be exploited for algorithm design. Given a regular value $c\in\mathbb{R}$ of $f$, the Implicit Function Theorem (IFT) guarantees that the level set $f^{-1}(c)$ is a differentiable submanifold of $\Omega$, and around any preimage $x\in f^{-1}(c)$ there is the representation

$T_x f^{-1}(c) = \ker df_x = \big\{\, v\in\mathbb{R}^d : \nabla f(x)\cdot v = 0 \,\big\},$

which characterizes the local change in the function in terms of tangent vectors to the level set at the point $x$. In particular, since $\ker df_x$ is a linear subspace of (the tangent space to) $\Omega$ and has dimension $d-1$, it follows that the local dimension of $f$ is just one at almost every point in its domain. Moreover, there is only one direction in $T_x\Omega\cong\mathbb{R}^d$, the direction of the gradient $\nabla f(x)$, in which $x$ can move to produce any change in $f$ whatsoever. This fact shows that the local structure of surjective functions is actually quite rigid, and there is the potential for constructing a (locally) one-dimensional representation of any such mapping despite the size of its domain.

The present work takes steps toward this idea by building on the “Nonlinear Level set Learning” (NLL) algorithm from [41], which is a neural network procedure for extrinsic dimension reduction in regression applications. In particular, it is shown that the performance of this algorithm can be improved dramatically by incorporating the information provided by the IFT, which leads to less computational expense, faster training, and more accurate regression results. To accomplish this, the NLL algorithm is recast as the minimization of a Dirichlet-type energy functional whose minimizers include mappings $g$ which send the high-dimensional inputs to simple slices where the function is constant, and which can be reliably approximated with neural network algorithms. Moreover, once a minimizer has been computed, dimension reduction becomes a matter of simple truncation, and a one-dimensional regression can be performed to recover a model which predicts $f$ at any point in the original data space, i.e. $f(x)\approx h\big((g(x))_1\big)$. Experiments are provided which demonstrate the improved performance of this version of NLL over the original algorithm and over the state-of-the-art linear dimension reduction method Active Subspaces.

The remainder of the work is structured as follows: Section 2 discusses related extrinsic dimension reduction methods and regression techniques, Section 3 details the NLL algorithm and its original formulation, Section 4 discusses the mentioned improvements to NLL and its connection to Dirichlet energy, and Section 5 exhibits numerical experiments which compare the present version of NLL to the original algorithm and Active Subspaces on a variety of test cases. Some concluding remarks are finally drawn in Section 6.

2. Related Work

It is well known that the regression problem (1) is difficult in the presence of sparse data; see e.g. [22, 20] and the references therein. As such, there are a plethora of techniques which have been developed to mitigate the high response variability that is inevitable in this situation. Ranging from optimal sampling (e.g. [36, 16]) to central subspace methods (e.g. [29, 2]), all such works share the common themes of trying to increase predictive power and decrease model overfitting. Extrinsic methods for dimension reduction in pointwise regression applications (e.g. [13, 5, 41]) operate in the same way, although more regularity is usually assumed since a stronger type of convergence is discussed.

The dimension reduction approach most similar to ours is known as approximation by ridge functions [34, 10, 12]. In particular, consider a linear projection $A:\mathbb{R}^d\to\mathbb{R}^m$ with $m<d$ and a (not necessarily differentiable) function $h:\mathbb{R}^m\to\mathbb{R}$. If the functions satisfy

$f(x) = h(Ax)$

for all $x\in\Omega$, then $f$ is called a (generalized) ridge function (c.f. [12]). Clearly, this definition implies that $f$ is constant on the kernel of $A$.

Using ridge functions for dimension reduction typically involves optimizing over the profile functions $h$. In particular, suppose the projection $A$ is given and consider computing an $h$ which satisfies

$h = \operatorname*{argmin}_{\tilde h} \int_\Omega \big|f(x) - \tilde h(Ax)\big|^2\,\rho(x)\,dx. \qquad (3)$

If $f$ is nearly constant on the kernel of $A$, then $x\mapsto h(Ax)$ will be a reasonable pointwise approximation to the original function. Moreover, the minimization in (3) is usually much more feasible than the one in (1) when the input data is sparse, as the projection naturally increases its density.

Of course, to compute a useful solution to (3) it is necessary to have a projection mapping $A$ which adequately captures the change in $f$. This usually involves constructing a low-dimensional “response surface” containing information about the dependence of $f$ on its independent variables (see e.g. [29, 2, 12, 25] and their references), and frequently employs techniques including principal component analysis, projection pursuit regression, kriging, and others [18, 40, 1, 32, 11, 33, 31, 15, 39]. One popular algorithm for response surface construction among differentiable functions is known as Active Subspaces (AS) [13]: a procedure for determining the affine subspace of $\mathbb{R}^d$ where $f$ changes the most on average. In particular, AS computes a Monte Carlo approximation to the covariance matrix

$C = \int_\Omega \nabla f(x)\,\nabla f(x)^\top \rho(x)\,dx,$

so that the eigenvalue decomposition $C = W\Lambda W^\top$, $\Lambda = \mathrm{diag}(\lambda_1\ge\dots\ge\lambda_d)$, gives a global linear transformation of the input coordinates in terms of how much they affect the value of $f$. In the case that only the first $m$ eigenvalues are significant, this yields a suitable low-dimensional projection in terms of the first $m$ columns $W_1$ of $W$. More precisely, decomposing $W = (W_1\;\; W_2)$ in terms of its “active” and “inactive” components, it can be shown that

$\int_\Omega \big|f(x) - h\big(W_1^\top x\big)\big|^2\,\rho(x)\,dx \;\le\; C_P\big(\lambda_{m+1} + \dots + \lambda_d\big),$

where $h$ is a suitably chosen ridge function (see e.g. [12, Theorem 2]) and $C_P$ is a Poincaré constant depending on the density $\rho$. Due to this fact and others, AS projections are known to be quite useful for dimension reduction, regression, and sensitivity analysis (see e.g. [13, 30, 17, 27]).
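For concreteness, the Monte Carlo construction described above can be sketched in a few lines; the sampling density, synthetic test function, and eigenvalue cutoff below are illustrative assumptions rather than prescriptions from [13].

```python
import numpy as np

def active_subspace(grads, m):
    """Estimate an active subspace from gradient samples.

    grads : (N, d) array whose rows are gradient samples of f drawn from rho.
    m     : number of active directions to retain.
    Returns the eigenvalues of the sampled covariance and the (d, m) matrix W1
    whose columns span the estimated active subspace.
    """
    # Monte Carlo approximation of C = E[ grad f  grad f^T ]
    C = grads.T @ grads / grads.shape[0]
    # Symmetric eigendecomposition, sorted by decreasing eigenvalue
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order][:, :m]

# Example with a synthetic ridge-like function f(x) = sin(a . x)
rng = np.random.default_rng(0)
a = rng.standard_normal(10)
X = rng.uniform(-1.0, 1.0, size=(500, 10))
grads = np.cos(X @ a)[:, None] * a          # exact gradients of f
evals, W1 = active_subspace(grads, m=1)
print(evals[:3])                            # one dominant eigenvalue expected
y = X @ W1                                  # projected (active) coordinates
```

The projected coordinates $W_1^\top x$ then serve as the low-dimensional inputs to the ridge regression (3).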

On the other hand, it is frequently the case that the eigenvalues of the covariance matrix decay quite slowly, making a linear AS reduction ineffective for low-dimensional regression (c.f. Section 5.2). This has motivated another line of work into nonlinear methods for extrinsic dimension reduction. One such idea was introduced as the Active Manifolds (AM) algorithm [5], which takes advantage of the local decomposition of the domain afforded by the IFT. In particular, AM aims to construct an integral curve $\gamma$ of the (normalized) gradient field, i.e. a solution to

$\gamma'(t) = \frac{\nabla f(\gamma(t))}{\big|\nabla f(\gamma(t))\big|}, \qquad \gamma(0) = x_0.$

Once this “active manifold” $\gamma$ has been constructed, it is necessarily the case that the entire range of $f$ on the set of level sets intersecting $\gamma$ is represented simply by the values $f(\gamma(t))$; i.e. for every $x$ such that $f^{-1}(f(x))$ and $\gamma$ intersect, there is a value $t$ such that $f(x) = f(\gamma(t))$. Therefore, to determine the value of $f$ at any suitable point in the input space it is sufficient to know the values of $f$ along the 1-D curve $\gamma$, as well as a projection map $\pi$ sending each such point to its level-set representative on the curve, $\pi(x) = \gamma(t)$. This algorithm has the benefit of zero intrinsic error (since there is no averaging involved), but comes with the significant challenge of computing the necessary projection map.
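As a rough illustration of this construction (not the implementation of [5]), the integral curve can be traced numerically with a simple forward Euler march of the normalized gradient field; the gradient oracle, step size, and box domain below are hypothetical.

```python
import numpy as np

def active_manifold(grad_f, x0, lo, hi, step=1e-2, max_steps=10_000):
    """Trace an approximate integral curve of grad f / |grad f| from x0.

    grad_f : callable returning the gradient of f at a point.
    lo, hi : arrays bounding the box domain; the march stops on exit.
    Returns the list of points visited along the curve.
    """
    path = [np.asarray(x0, dtype=float)]
    for _ in range(max_steps):
        g = grad_f(path[-1])
        norm = np.linalg.norm(g)
        if norm < 1e-12:                       # stop at (near) critical points
            break
        x_new = path[-1] + step * g / norm     # forward Euler, unit-speed ascent
        if np.any(x_new < lo) or np.any(x_new > hi):
            break
        path.append(x_new)
    return path

# Example: f(x) = |x|^2 on [-1, 1]^3, starting near the origin
curve = active_manifold(lambda x: 2.0 * x, x0=[0.1, 0.0, 0.0],
                        lo=-np.ones(3), hi=np.ones(3))
```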

However, recent work has also shown that sufficiently deep neural networks have the ability to reproduce arbitrary measurable functions (see e.g. [24]), motivating another line of research into dimension reduction. Most network-based intrinsic methods to date are based on constructing an encoder-decoder (autoencoder) network (see e.g. [23, 38]) which learns projection and expansion functions that compose to yield an approximation to the identity mapping on the input space. The extrinsic methods which employ neural networks are more varied, including the DrLIM method in [21], which computes an invariant nonlinear function mapping the input data evenly to a low-dimensional space, and the method of DIPNets (see [33]), which uses AS in conjunction with projected neural networks for increased generalizability.

3. The NLL Algorithm

In contrast to the work previously mentioned, the present approach is based on a neural network algorithm introduced in [41] called “Nonlinear Level set Learning” (NLL). At its core, NLL uses neural network techniques to extend the idea of AS to more general transformations of the input data. In particular, consider computing a diffeomorphism (differentiable bijection with differentiable inverse) $g:\Omega\to\mathbb{R}^d$, $z = g(x)$, and its inverse $g^{-1}$, which separates the domain of the push-forward function $f\circ g^{-1}$ into global pairs of “active” and “inactive” coordinates $z = (z_1,\dots,z_k,\,z_{k+1},\dots,z_d)$. If $g$ can be constructed such that the sensitivity of $f\circ g^{-1}$ to the inactive coordinates $z_{k+1},\dots,z_d$ is sufficiently low, then it is reasonable to conclude that the transformed data lies in a lower-dimensional submanifold of the ambient space $\mathbb{R}^d$, and for any inactive coordinate the domain of $f\circ g^{-1}$ can be restricted to a representative slice with negligible impact on the function value. Therefore, regression can be applied to obtain a lower-dimensional mapping $h:\mathbb{R}^k\to\mathbb{R}$ such that $f(x)\approx h(z_1,\dots,z_k)$. Writing $(g)_{1:k}$ to denote the first $k$ components of the function $g$, this means computing a generalized ridge function

$f(x) \approx h\big((g(x))_{1:k}\big), \qquad (4)$

where $(g)_{1:k}$ acts as the (nonlinear) projection operator. Once $g$ is obtained, computing the required $h$ in (4) is automatically a more feasible regression problem than (1), since almost all of the variation in $f$ has been concentrated in the lower-dimensional image $(g)_{1:k}(\Omega)$. Most importantly, the necessary projection from $\mathbb{R}^d$ to $\mathbb{R}^k$ is simple and canonical: after truncating the domain of $f\circ g^{-1}$ by the span of the inactive variables, the active variables parameterize the low-dimensional inputs by definition.

Remark 2.

In the original NLL formulation [41], regression on $f$ is still performed on the full set of $d$ inputs without dimension reduction. On the other hand, we find that this is not necessary when the mappings $g, g^{-1}$ are well-trained. Therefore, the algorithm in Section 3 does not require the use of additional inputs beyond the active variables.

Of course, to make use of this idea it is necessary to have an efficient way to compute the diffeomorphism $g$. The authors of [41] have demonstrated that carefully designed neural networks are well suited to this task, and a minimization procedure can be employed to obtain a mapping which attempts to concentrate the sensitivity of $f\circ g^{-1}$ in some predefined number of active directions. The particular network architecture and appropriate notion of loss which support this approach will now be discussed, as well as the present modifications which improve the overall efficacy of the method.

3.1. Network Architecture

In [41] as well as presently, the constrained minimization for $g$ and its inverse is accomplished using a specialized RevNet architecture [8, 19] based on the Verlet discretization of Hamiltonian systems. RevNets are neural networks which are reversible by construction, yielding improved memory efficiency as intermediate layer activations do not have to be stored. The primary benefit of using RevNets inside the NLL algorithm is their connection to Hamiltonian systems, whose solution curves are generated by local diffeomorphisms of the source domain. Because of this, propagating the input data through a RevNet structure (with a sufficiently small step-size) will necessarily give a configuration which is diffeomorphic to the original, removing the need for an explicit constraint during the minimization of the loss functional.

More precisely, let $x = (u, v)$ be a channel-wise partition of the inputs, $\sigma$ be an activation function, and let $K_1(t), K_2(t)$ resp. $b_1(t), b_2(t)$ denote operator resp. vector valued functions of time. It is shown in [8] that the dynamical system

$\dot u(t) = \sigma\big(K_1(t)\,v(t) + b_1(t)\big), \qquad \dot v(t) = -\sigma\big(K_2(t)\,u(t) + b_2(t)\big) \qquad (5)$

is stable and well-posed, hence usable for forward propagation along a neural network. Defining $(u_0, v_0) = (u, v)$ and discretizing (5) with $L$ layers then yields the system

$u_{n+1} = u_n + h\,\sigma\big(K_{1,n}\,v_n + b_{1,n}\big), \qquad v_{n+1} = v_n - h\,\sigma\big(K_{2,n}\,u_{n+1} + b_{2,n}\big), \qquad 0 \le n < L,$

where $h$ is the discrete time step, and $K_{1,n}, K_{2,n}$ resp. $b_{1,n}, b_{2,n}$ are the weight matrices resp. biases at layer $n$. It is easily checked that this mapping is invertible at each layer, therefore it is reasonable to define $g(x) = (u_L, v_L)$ and to obtain $g^{-1}$ by running the same updates in reverse. In practice, the partition $x = (u,v)$ is chosen so that $u$ contains the first block of variables and $v$ contains the remainder. Propagation forward and backward through this scheme then yields mappings $g, g^{-1}$ which satisfy the desired diffeomorphism condition by construction, allowing for an unconstrained minimization of the loss functional. The next goal is to discuss the particular loss criterion which is used to update the parameters.
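A minimal PyTorch sketch of such a reversible block structure is given below. The tanh activation, weight shapes, channel split, and initialization are illustrative assumptions rather than the configuration of [41]; the point is only that the inverse map is available in closed form by undoing the layer updates in reverse order.

```python
import torch
import torch.nn as nn

class VerletRevNet(nn.Module):
    """Reversible network built from Verlet-type two-channel updates.

    Each layer applies
        u <- u + h * tanh(K1 v + b1)
        v <- v - h * tanh(K2 u + b2)
    so the inverse is obtained exactly by undoing the updates in reverse.
    """

    def __init__(self, dim, n_layers, h=0.25):
        super().__init__()
        self.du = dim - dim // 2          # first block of variables (u)
        self.dv = dim // 2                # remaining variables (v)
        self.h = h
        self.K1 = nn.ParameterList([nn.Parameter(0.1 * torch.randn(self.du, self.dv))
                                    for _ in range(n_layers)])
        self.K2 = nn.ParameterList([nn.Parameter(0.1 * torch.randn(self.dv, self.du))
                                    for _ in range(n_layers)])
        self.b1 = nn.ParameterList([nn.Parameter(torch.zeros(self.du))
                                    for _ in range(n_layers)])
        self.b2 = nn.ParameterList([nn.Parameter(torch.zeros(self.dv))
                                    for _ in range(n_layers)])

    def forward(self, x):                 # x: (batch, dim) -> z = g(x)
        u, v = x[:, :self.du], x[:, self.du:]
        for i in range(len(self.K1)):
            u = u + self.h * torch.tanh(v @ self.K1[i].T + self.b1[i])
            v = v - self.h * torch.tanh(u @ self.K2[i].T + self.b2[i])
        return torch.cat([u, v], dim=1)

    def inverse(self, z):                 # z -> x = g^{-1}(z), exact by construction
        u, v = z[:, :self.du], z[:, self.du:]
        for i in reversed(range(len(self.K1))):
            v = v + self.h * torch.tanh(u @ self.K2[i].T + self.b2[i])
            u = u - self.h * torch.tanh(v @ self.K1[i].T + self.b1[i])
        return torch.cat([u, v], dim=1)

# Round-trip check on random data
net = VerletRevNet(dim=20, n_layers=30)
x = torch.randn(8, 20)
print(torch.allclose(net.inverse(net(x)), x, atol=1e-5))
```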

3.2. The Loss Functional from [41]

It remains to discuss the criterion by which the mappings $g, g^{-1}$ are trained. The key observation to the approach in [41] is that if $z_i$ is an inactive coordinate, then the gradient vector field $\nabla f$ is orthogonal to the derivative of $g^{-1}$ with respect to $z_i$ at any point in the input space. Said differently, this means that the derivative vector $\partial_{z_i} g^{-1}$ is tangent to the level set of $f$ at $x$, hence lies in the kernel of $df_x$. Enforcing this condition during neural network training leads the authors of [41] to a minimization problem over the network parameters $\theta$ which involves the (regularized) loss functional

$L_1(\theta) = \sum_{s=1}^{N} \sum_{i > k} \omega_i \Big( \nabla f(x_s)\cdot \hat J_i(x_s) \Big)^2 + \lambda\,R(\theta). \qquad (6)$

Here $\hat J$ is the column-normalized Jacobian matrix of the transformation $g^{-1}$ (with $\hat J_i$ its $i$-th column), the $\omega_i$ are user-defined weights influencing the strength of the constraint in each dimension, $R$ is the regularization term, and $\lambda$ is a user-defined weight influencing the strength of the regularization. Note that the choice of column-normalization in $\hat J$ is introduced in order to simplify the calculation of derivatives in the original implementation, which uses a finite difference scheme. Moreover, the regularization arises for similarly practical considerations, providing extra rigidity which helps direct the minimization toward a reasonable solution. On the other hand, these additional inclusions come at the cost of diminishing the geometric meaning of the procedure, which has a substantial effect on both the rate of training and the overall effectiveness of the algorithm (c.f. Section 5).

Remark 3.

The informed reader will notice that the publicly available implementation of the NLL algorithm in [41] uses a slightly different loss functional than the one defined in (6); this implemented variant is denoted (7) in what follows. Because the practical behavior of the original algorithm seems to be strongly dependent on this choice, each of the comparisons to “Old NLL” in Section 5 uses whichever loss functional, (6) or (7), yields the best performance.

While NLL in its original form has shown promising performance in several cases (c.f. [41, Figure 2]), it requires the specification of many parameters whose significance is not clear and whose influence is not easily estimated a priori. Additionally, it is relatively slow to train and requires the use of an expensive regularization term in order to guarantee reasonable performance and stability on high-dimensional data. The goal of what follows is to describe theoretically-justified modifications to this algorithm which successfully eliminate these issues while also improving the overall efficacy of the dimension reduction. By reformulating the NLL procedure as the minimization of an appropriate energy functional, a discretization is found which achieves both faster training and more accurate results.

4. A Modified NLL Algorithm

The present algorithm augments NLL with the geometric knowledge afforded by the IFT. Consider using the same RevNet architecture to construct a bijective mapping $g$ such that the level sets of $f$ are parameterized by the images of the inactive variables $z_2,\dots,z_d$, i.e. for each regular value $c$ there is a value $t_c$ of the active variable such that $f^{-1}(c) = g^{-1}\big(\{z_1 = t_c\}\big)$. Since $f$ is differentiable and scalar-valued, the IFT asserts that this can be done locally away from points where $\nabla f = 0$, but there is no guarantee that a global mapping exists unless $\nabla f \neq 0$ everywhere and each level set in the domain is diffeomorphic to a $(d-1)$-dimensional hyperplane. On the other hand, as with AS it is always reasonable to ask for a function $g$ which parameterizes these sets “the most on average”. Supposing $g$ does this sufficiently well, the problem of approximating $f$ on the full input space can be reduced to that of approximating $f\circ g^{-1}$ on the one-dimensional space spanned by the active variable $z_1$. In particular, an inexpensive regression approximation can be trained (e.g. simple neural network or local/global least-squares) using the information that $f(x) \approx h\big((g(x))_1\big)$.
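As an example of this final step, a global polynomial least-squares fit in the active coordinate can be assembled as follows, assuming a trained transformation is available as a callable g returning the transformed coordinates row-wise (the helper names and polynomial degree are hypothetical).

```python
import numpy as np

def fit_active_profile(g, X_train, f_train, degree=4):
    """Fit a 1-D polynomial profile h so that f(x) ~ h(g(x)_1)."""
    z1 = g(X_train)[:, 0]                     # active coordinate of each sample
    coeffs = np.polyfit(z1, f_train, degree)  # global least-squares fit in z1
    return np.poly1d(coeffs)

def predict(g, h, X_new):
    """Evaluate the low-dimensional surrogate at new points."""
    return h(g(X_new)[:, 0])
```

A local least-squares fit or a small neural network could be substituted for np.polyfit without changing the structure of this step.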

Remark 4.

If $X_1,\dots,X_k$ form a frame of vector fields on an open set $U$, local coordinates $(y_1,\dots,y_d)$ such that $X_i = \partial/\partial y_i$ for $1\le i\le k$ are called simultaneous flow-box coordinates on $U$ [6]. Such coordinates exist around any point of $U$ provided the vectors $X_1,\dots,X_k$ are linearly independent and the Lie bracket identities $[X_i, X_j] = 0$ hold on $U$ for all $1\le i,j\le k$. In accordance with this notion, the NLL mapping $g^{-1}$ can be understood as providing approximate, best-on-average flow-box coordinates for the level sets of $f$. Of course, it is easy to check that $\big[\partial/\partial z_i, \partial/\partial z_j\big] = 0$ for all $i,j$.

4.1. A New Loss Functional

Computing the mapping $g$ in this way requires a loss functional which reflects the meaning of the procedure. In practice, it suffices to note that if $f\circ g^{-1}$ is insensitive to perturbations in the inactive coordinates at a point $z$, then its derivative is identically zero on the span of the inactive basis vectors $e_2,\dots,e_d$. Note that this condition is readily expressed coordinate-wise as

$\partial_{z_i}\big(f\circ g^{-1}\big)(z) = \nabla f\big(g^{-1}(z)\big)\cdot \partial_{z_i} g^{-1}(z) = 0, \qquad 2\le i\le d,$

yielding precisely the motivating statement for the original NLL algorithm: that the vectors $\partial_{z_i} g^{-1}$ are orthogonal to $\nabla f$. Conversely, note that the mathematical meaning of $\partial_{z_i}\big(f\circ g^{-1}\big)$ as a rate of change is preserved only if there is no normalization of the Jacobian of $g^{-1}$. Therefore, given a set of training samples $\{(x_s, f(x_s), \nabla f(x_s))\}_{s=1}^N$ in the original input space and a RevNet architecture as before, the goal of our modified NLL algorithm is to compute $g$ which minimizes the loss functional

$L(g) = \sum_{s=1}^{N} \big\| \nabla\big(f\circ g^{-1}\big)(z_s) \big\|_\perp^2 = \sum_{s=1}^{N}\sum_{i=2}^{d} \Big( \nabla f(x_s)\cdot \partial_{z_i} g^{-1}(z_s) \Big)^2, \qquad z_s = g(x_s), \qquad (8)$

where $\|\cdot\|_\perp$ denotes the norm on the subspace orthogonal to the active direction at each point, i.e. the trace of the quadratic form $\nabla\big(f\circ g^{-1}\big)\otimes\nabla\big(f\circ g^{-1}\big)$ with respect to the basis $e_2,\dots,e_d$. This encourages the algorithm to find a mapping $g$ which reduces the composite function $f\circ g^{-1}$ to a function of one variable $z_1$. Moreover, since $g$ is naturally constrained to be a diffeomorphism and hence a proper map, it follows that the summand is quadratic and strongly coercive (on the inactive directions) whenever $f$ is, making gradient descent based on $L$ a feasible strategy. Note the similarity to the method of AS: while AS seeks a linear subspace of the input data where the average change in $f$ is maximized, the proposed algorithm seeks a linear subspace of the transformed data where the average change in the push-forward $f\circ g^{-1}$ is minimized. It follows that the complementary subspace maximizes this change, and the original nonlinear submanifold which maximizes the average change in $f$ can be recovered through $g^{-1}$.
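A direct per-sample realization of (8) using PyTorch automatic differentiation might look as follows; the code assumes an invertible network object exposing forward and inverse maps (as in the sketch of Section 3.1) and is an illustration rather than the released implementation, which would batch the Jacobian evaluations.

```python
import torch

def modified_nll_loss(model, x, grad_f):
    """Discrete loss (8): sum over samples of the squared derivatives of
    f o g^{-1} with respect to the inactive coordinates z_2, ..., z_d.

    model  : invertible network with forward (g) and inverse (g^{-1}) maps.
    x      : (N, d) tensor of training inputs.
    grad_f : (N, d) tensor of gradient samples of f at x.
    """
    z = model(x)                                           # z_s = g(x_s)
    loss = x.new_zeros(())
    for z_s, df_s in zip(z, grad_f):
        # Full Jacobian of g^{-1} at z_s (columns are dg^{-1}/dz_i), via autograd
        J = torch.autograd.functional.jacobian(
            lambda w: model.inverse(w.unsqueeze(0)).squeeze(0), z_s,
            create_graph=True)
        # Chain rule: d(f o g^{-1})/dz_i = grad f(x_s) . dg^{-1}/dz_i
        dz = df_s @ J                                      # (d,) vector of partials
        loss = loss + (dz[1:] ** 2).sum()                  # keep only inactive dirs
    return loss
```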

Remark 5.

Observe the lack of regularization in (8). Experiments show (c.f. Section 5) that this criterion is sufficient to drive the descent to a minimum without additional penalty, suggesting the benefits of using un-normalized derivatives of the network mapping.

4.2. Connection to Energy Minimization

From a continuous perspective, it is meaningful to note that the loss $L$ defined in (8) is (up to scale) a discretization of the Dirichlet-type energy functional

$E[g] = \frac{1}{2}\int_{I}\int_{\Sigma_{z_1}} \big|\nabla_\perp h(z)\big|^2\,dz'\,dz_1, \qquad h = f\circ g^{-1}, \qquad (9)$

where $\nabla_\perp$ collects the derivatives in the inactive directions, $I$ is a bounded interval containing the range of the active variable $z_1$, and $\Sigma_{z_1}$ denotes the corresponding slice of transformed inputs. To examine the structure of potential minimizers, it is useful to compute the variational derivative of $E$. Recall that a variation of $g$ is a one-parameter family of mappings $g_\varepsilon$ (also denoted $\delta g$) satisfying $g_\varepsilon = g + \varepsilon\varphi + o(\varepsilon)$ for all $\varepsilon$ in a compactly supported interval and some compactly supported $\varphi$. The variational derivative of a functional depending on $g$ is then the first-order term in its Taylor expansion around $\varepsilon = 0$. In particular, there is the notation $\delta E[g;\varphi]$, which denotes the variational derivative (or simply the variation) of $E$ at the point $g$ in the direction of $\varphi$. It is a fundamental fact in the calculus of variations that $g$ is stationary for $E$ if and only if $\delta E[g;\varphi] = 0$ for all $\varphi$.

Specifically to the present case, note that the variation in $g$ induces a variation $\delta h$ in $h = f\circ g^{-1}$, so that the integrand of $E$ varies as

$\delta\big|\nabla_\perp h\big|^2 = 2\,\big\langle \nabla_\perp h,\ \nabla_\perp \delta h \big\rangle.$

Moreover, since $\langle\cdot,\cdot\rangle$ is just the inner product induced from $\mathbb{R}^{d-1}$ on the slices and the variation vanishes on the boundary of each slice, integration by parts implies the equality

$\int_{\Sigma_{z_1}} \big\langle \nabla_\perp h, \nabla_\perp \delta h \big\rangle\,dz' = -\int_{\Sigma_{z_1}} \big(\Delta_\perp h\big)\,\delta h\,dz'.$

Putting this together with the fact that spatial and variational derivatives commute in this setting, the variation of $E$ is expressed as

$\delta E[g;\varphi] = -\int_{I}\int_{\Sigma_{z_1}} \big(\Delta_\perp h\big)\,\delta h\,dz'\,dz_1. \qquad (10)$

Since $\varphi$ is an arbitrary variation, the quantity (10) vanishes identically if and only if $\Delta_\perp h = 0$ on each slice, i.e. if and only if $h = f\circ g^{-1}$ is harmonic on the slices $\Sigma_{z_1}$. It is clear that this condition is satisfied when $g$ parameterizes the level sets of $f$ on $\Omega$, since $h$ is constant on each slice. Moreover, as minimization of Dirichlet-type energies such as (9) is known to exhibit good stability and convergence properties (see e.g. [37]), it is reasonable to expect that the minimization of (8) will be similarly well-behaved. The next Section provides experimental justification for this idea.

5. Numerical Examples

This section details examples of the present NLL algorithm “New NLL” and its improvements over both AS and the original NLL algorithm “Old NLL” on sparse data sets. In particular, the effectiveness of each algorithm is measured using two metrics: the relative sensitivity of the function to the active variable(s), and the predictive ability of the low-dimensional approximation. After some discussion of implementation, two high-dimensional functions from [41] are used for validation of New NLL. Following this, New NLL is applied to the problem of predicting quantities of interest which depend on the solutions to systems of differential equations. In all cases, New NLL is shown to effectively reduce the dimension to one, providing a low-dimensional submanifold which affords accurate approximation of the desired function.

5.1. Implementation Details

The relevant algorithms are implemented in Python on an early 2015 MacBook Pro with a 2.7GHz Intel i5 processor and 8GB of RAM. The implementation of AS is done in Python 2.7 following P. Constantine’s code library [14], while the implementation of the NLL algorithms is done in Python 3.9 using version 1.7 of the PyTorch library. Note that the necessary derivatives of the network mapping $g^{-1}$ are computed using the PyTorch version of automatic differentiation, and not through finite differences as in [41]. Moreover, New NLL employs the Adaptive Moment Estimation (ADAM) optimizer during training [26] while Old NLL uses stochastic gradient descent (SGD). In both cases, the RevNet layer step size $h$ and the activation function $\sigma$ are fixed, and 500 additional samples are used for validation as the models train. The weights $\omega_i$ for Old NLL are chosen as in [41], taking one value for the inactive variables and another otherwise. For visualization, the sensitivities of $f\circ g^{-1}$ (computed using gradient information) are reported as percentages relative to the sum over all coordinates, and the NLL training and validation losses are reported as percentages relative to their initial values.
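One way to compute such sensitivity percentages, assuming the sensitivity of each transformed coordinate is measured by the mean absolute partial derivative of $f\circ g^{-1}$ over the training samples (an assumption on our part), is the following.

```python
import numpy as np

def sensitivity_percentages(grad_f, J_inv):
    """Relative sensitivity of f o g^{-1} to each transformed coordinate.

    grad_f : (N, d) gradients of f at the training points x_s.
    J_inv  : (N, d, d) Jacobians of g^{-1} evaluated at z_s = g(x_s).
    Returns percentages summing to 100 across the d coordinates.
    """
    # Chain rule: row s holds the partials of f o g^{-1} at z_s
    dz = np.einsum('ni,nij->nj', grad_f, J_inv)
    s = np.mean(np.abs(dz), axis=0)          # average absolute sensitivity
    return 100.0 * s / s.sum()
```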

After dimension reduction, low-dimensional regression approximations are generated using the original training data projected onto the computed low-dimensional space. To demonstrate that both traditional and modern regression methods can be used effectively after this dimension reduction, some of the experiments below use local or global polynomial least-squares to approximate the function value, while the others use a simple feed-forward neural network; the choice is noted in each case. An additional 10000 uniformly distributed data samples are used for testing the regression in all but one case, where 500 additional samples are used instead. Results of the low-dimensional regressions are visualized through plots of the projected testing data against both the true and approximate function values. The error metrics reported are the relative root-mean-square error (RRMSE), the relative $\ell^2$ error, and the relative $\ell^\infty$ error. Note finally that in the case of AS the mapping $g$ should be interpreted as the linear transformation defined by the matrix $W$ from Section 2. The results for all simulations in this Section are summarized in Tables 1 and 2.
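For reference, such error metrics can be computed as in the sketch below; the normalizations shown are common conventions and are assumptions on our part rather than the paper's exact definitions.

```python
import numpy as np

def regression_errors(f_true, f_pred):
    """Relative error metrics for a low-dimensional surrogate, in percent.

    Assumed conventions: the RRMSE normalizes the RMSE by that of the data
    about its mean, while the relative l2 / l-infinity errors normalize by
    the corresponding norms of the true values.
    """
    e = f_pred - f_true
    rrmse = np.sqrt(np.mean(e**2) / np.mean((f_true - f_true.mean())**2))
    rel_l2 = np.linalg.norm(e) / np.linalg.norm(f_true)
    rel_inf = np.max(np.abs(e)) / np.max(np.abs(f_true))
    return 100 * rrmse, 100 * rel_l2, 100 * rel_inf
```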

                 100 Samples                       500 Samples                       2500 Samples
Method        Sens %  RRMSE %  ℓ2 %   ℓ∞ %     Sens %  RRMSE %  ℓ2 %   ℓ∞ %     Sens %  RRMSE %  ℓ2 %   ℓ∞ %
New NLL       78.7    3.86     8.27   10.9     89.8    1.82     3.52   5.16     94.5    0.827    1.72   2.35
Old NLL       60.4    6.63     14.5   18.8     65.9    4.58     10.5   13.0     69.2    4.02     9.11   11.4
AS 1-D        25.8    30.3     75.9   85.9     25.9    21.7     39.5   61.4     25.9    15.9     37.6   44.8
New NLL       75.1    0.920    5.79   7.92     88.6    0.370    2.78   3.97     93.8    0.154    1.63   1.98
Old NLL 1     54.6    0.699    7.48   9.40     55.4    0.942    7.26   9.52     56.1    0.784    6.91   8.05
Old NLL 2     61.8    1.80     12.9   21.1     68.7    1.03     9.22   11.1     67.5    0.894    8.16   9.69
Table 1. Results from the experiments in Section 5.2, including sensitivity measures and low-dimensional regression errors. For each training-set size, the columns report the sensitivity concentrated in the active direction(s) (Sens %), the RRMSE, the relative ℓ2 error, and the relative ℓ∞ error, all in percent. The two blocks of rows correspond to the two test functions of Section 5.2.

5.2. Two High-Dimensional Examples

Consider the two test functions from [41] used in this section, numbered consistently with their source.

Remark 6.

Note that the corresponding function in [41] is written in terms of slightly different variables, and is therefore identical to the present version up to a scale factor and a dilation.

First, the experiment from [41] on the first test function is repeated, which uses a 30-layer RevNet. Here both Old NLL and New NLL are trained for 5000 epochs (passes forward and backward through the training data) with learning rates of 0.5 resp. 0.003, and regression is performed through local quadratic least-squares using the 10 nearest training neighbors of each test point. The results of New NLL are compared alongside simulations of Old NLL using one resp. two active variables, where Old NLL has been trained using the better-performing loss functional (c.f. Remark 3).

Figure 1 shows that New NLL outperforms Old NLL in training speed (b) and sensitivity concentration (a). In particular, with 500 training samples New NLL concentrates 94% of the total sensitivity of $f\circ g^{-1}$ in the $z_1$ direction, while Old NLL can only achieve 56%, or 67% of the total if two active variables are used. On the other hand, the regression results in Figure 2 illustrate that low-dimensional approximations built using New NLL are also more accurate, as can be seen in both the regression errors as well as the tightness of the projected testing data around the fit curve. Indeed, when using New NLL, 500 training samples are sufficient for relative errors around 3%. Interestingly, the inclusion of two active variables in Old NLL does not appear to improve the training rate nor the regression accuracy, despite concentrating more sensitivity in the active directions.

Figure 1. (a) Relative sensitivity of $f\circ g^{-1}$ to each $z$-coordinate; (b) relative value of the loss during the first 2500 epochs of NLL training.
Figure 2. Regression and errors on the first test function: (a) New NLL; (b) Old NLL 1; (c) Old NLL 2; (d) errors of the three approaches.

Next, performance is measured on the second, higher-dimensional test function. Again, a 30-layer RevNet is used, but now the regression is performed using a small feed-forward neural network of two fully-connected layers with 20 neurons each. In each case, the regression is trained for 5000 epochs with a learning rate of 0.05, while the NLL networks Old NLL resp. New NLL are trained for 5000 epochs with learning rates of 0.02 resp. 0.003. Again, the better-performing loss functional is used to train Old NLL (c.f. Remark 3).

The results of this experiment are illustrated in Figures 3 and 4. Again, Figure 3 (b) shows that New NLL reaches roughly 1% of its initial training loss after just 100 epochs, while Old NLL plateaus around 15% regardless of the training length. The sensitivities of $f\circ g^{-1}$ also behave as expected, even in the presence of few training samples. Indeed, for 500 samples New NLL concentrates 90% of the total sensitivity in the active direction, Old NLL concentrates 66%, and AS concentrates 26%. Note that the number of samples does not affect the quality of the AS reduction, which is both a strength and a weakness of this method. Conversely, all algorithms benefit from increased training samples during the regression, although the approximation built using New NLL remains the most accurate. Observe that 2500 samples are enough for around 2% error with New NLL, while Old NLL still produces errors around 10%. Figure 4 shows that Active Subspaces is also able to find the general shape of the function, but the one-dimensional approximation using this linear technique still produces around 40% error.

Figure 3. (a) Relative sensitivity of $f\circ g^{-1}$ to each $z$-coordinate; (b) relative value of the loss during the first 2500 epochs of NLL training.
Figure 4. Regression and errors on the second test function: (a) New NLL; (b) Old NLL; (c) Active Subspaces; (d) errors of the three approaches.
                 20 Samples                        100 Samples                       500 Samples
Method        Sens %  RRMSE %  ℓ2 %   ℓ∞ %     Sens %  RRMSE %  ℓ2 %   ℓ∞ %     Sens %  RRMSE %  ℓ2 %   ℓ∞ %
New NLL       75.1    0.435    0.731  1.15     96.4    0.249    0.433  0.605    97.9    0.254    0.469  0.612
Old NLL       62.0    1.64     2.89   4.59     73.2    1.42     3.09   3.89     72.2    1.24     2.39   3.12
AS 1-D        55.0    9.18     18.5   22.1     54.4    18.2     22.0   9.14     53.6    18.8     22.7   9.43
AS 2-D        79.3    4.34     7.87   10.4     79.7    8.01     10.8   4.49     79.6    8.26     10.8   4.50
New NLL       97.6    0.425    1.12   1.27     98.3    0.186    0.496  0.555    98.3    0.101    0.502  0.540
Old NLL       80.1    3.52     9.82   10.5     80.5    3.19     8.99   9.53     80.3    3.25     9.15   9.70
AS 1-D        64.4    6.64     18.8   19.8     65.1    6.81     19.6   20.3     65.0    6.78     19.4   20.2
AS 2-D        87.5    3.32     9.54   9.90     88.7    2.64     6.96   7.88     88.7    2.65     7.06   7.91
Table 2. Results from the experiments in Section 5.3, including sensitivity measures and low-dimensional regression errors. Columns follow the same conventions as Table 1, for training sets of 20, 100, and 500 samples. The first block of rows corresponds to the basic reproduction number experiment and the second block to the total kinetic energy experiment.

5.3. Application to Parametric Differential Equations

As mentioned in Section 1, one of the primary applications of dimension reduction methods such as NLL and AS lies in predicting quantities of interest which arise from systems of differential equations. To that end, we now consider two parameterized models for physical phenomena: a modified SEIR model for disease spread, and an idealized model for fluid dynamics.

Predicting the Basic Reproduction Number of a Disease

Consider the modified SEIR model from [17] for the spread of Ebola in West Africa, a compartmental system of ordinary differential equations in time. Its state variables represent, respectively, the fraction of the population that is susceptible to infection, the (infected but asymptomatic) exposed population, the infected fraction, the hospitalized fraction, the infectious dead (not properly buried), the non-infectious dead (properly buried), and the recovered population; the remaining quantity appearing in the model is a constant. When studying the spread of disease, it is important to compute the basic reproduction number $R_0$, which depends on the eight parameters of the model and measures the potential of the disease to transmit throughout the population. In [17], parameter ranges for this model are given around a baseline computed using data provided by the World Health Organization which was collected in the country of Liberia, and AS is used for the prediction of $R_0$. It is interesting to apply NLL to this problem for comparison with the existing results. In this case, the training of the NLL models is done with a 15-layer RevNet and uniformly distributed samples drawn from the ranges in [17, Table 7]. In particular, Old NLL resp. New NLL are trained for 5000 epochs with learning rates of 0.1 resp. 0.005, and regression is performed using 10-neighbor local quadratic least-squares. Conversely, here global quartic least-squares are used for the AS regression, since global methods outperform local fitting when the spread of function values is wide (c.f. Figure 6 (c)). Note that Old NLL is trained using whichever loss functional from Remark 3 leads to better performance here.

The results of this comparison show that the benefits of New NLL persist in this situation also. In particular, New NLL concentrates 90% of the total sensitivity in the active direction (for 100+ training samples), and even a 20-sample training set is sufficient for 1% regression error. Contrast this with Old NLL and either a one-dimensional or two-dimensional AS, which yield significantly more erroneous approximations. Figure 6 illustrates the low-dimensional regressions, and Figure 5 (a) shows the errors. As expected, Figure 6 (a) shows that the approximation trained on the New NLL reduction produces a much tighter fit than the existing methods. Interestingly, note that here both NLL algorithms train relatively well (c.f. Figure 5 (c)), and the Old NLL regression outperforms the 2-D AS regression despite concentrating less sensitivity in the active directions.

Figure 5. (a) Relative regression errors on $R_0$; (b) relative sensitivity to $z_1$ as a percentage of the total; (c) relative value of the loss during the first 2500 epochs of NLL training (log scale).
Figure 6. Regression on $R_0$: (a) New NLL; (b) Old NLL; (c) 1-D AS; (d) 2-D AS.

Predicting the Total Kinetic Energy

The last experiment in this Section considers the parameterized one-dimensional inviscid Burgers’ equation, which is a common model for fluids whose motion can develop discontinuities. In particular, consider the initial value problem

$u_t + u\,u_x = 0, \qquad u(x,0) = u_0(x;\mu), \qquad (11)$

where $x$ represents the position of the fluid, $u$ its velocity, and the initial condition $u_0$ depends on the parameters $\mu$. It is interesting to examine the performance of NLL and AS in predicting the total kinetic energy at a fixed final time $T$, given by

$E(\mu) = \frac{1}{2}\int u(x,T;\mu)^2\,dx.$

As $E$ is a function of the PDE solution, this represents a case where prediction is most valuable. Indeed, for more expensive PDE simulations it may not be possible to generate as much data as desired. Therefore, the goal is to use sparsely sampled values of $E$ obtained from simulations of (11) to train an approximation to this function at any $\mu$ in parameter space.

To accomplish this using the methods NLL and AS, it is necessary to have access to the gradient $\nabla_\mu E$ at the sampled points. Noting that $E$ depends on $\mu$ only through the solution $u$, simple differentiation yields

$\frac{\partial E}{\partial \mu_i}(\mu) = \int u(x,T;\mu)\,\frac{\partial u}{\partial \mu_i}(x,T;\mu)\,dx.$

It follows that the first factor in the integrand can be obtained immediately from the solution $u$, and the components $\partial u/\partial\mu_i$ are further computable by solving sensitivity equations. In particular, differentiating (11) with respect to $\mu_i$ yields the following system of initial value problems in the variables $v_i = \partial u/\partial\mu_i$,

$\partial_t v_i + \partial_x\big(u\,v_i\big) = 0, \qquad v_i(x,0) = \partial_{\mu_i} u_0(x;\mu). \qquad (12)$

Solving (11) and (12) provides the necessary samples for applying the NLL and AS algorithms.

To study the example at hand, the parameters are chosen to take values in a prescribed range. Systems (11) and (12) are discretized using simple forward Euler in time with upwinding in space. Old NLL resp. New NLL are trained for 5000 epochs using a 7-layer RevNet with learning rates of 0.1 resp. 0.005, and regressions are performed using a feed-forward neural network as in the previous high-dimensional example. Again, Old NLL is trained using the better-performing loss functional from Remark 3.
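To illustrate the discretization, a minimal sketch of the upwind forward Euler march and the kinetic energy quadrature is given below; the grid, time horizon, and parameterized initial condition are placeholders invented for illustration, and the sensitivity system (12) can be marched in the same way alongside (11).

```python
import numpy as np

def burgers_kinetic_energy(u0, x, T, cfl=0.5):
    """March the inviscid Burgers' equation with forward Euler + upwinding
    and return the total kinetic energy 0.5 * int u(x, T)^2 dx.

    u0 : initial data sampled on the uniform grid x (assumed nonnegative here,
         so the upwind direction is always to the left).
    """
    dx = x[1] - x[0]
    u = u0.copy()
    t = 0.0
    while t < T:
        dt = min(cfl * dx / max(np.max(np.abs(u)), 1e-12), T - t)
        flux = 0.5 * u**2                                   # f(u) = u^2 / 2
        u[1:] = u[1:] - dt / dx * (flux[1:] - flux[:-1])    # left boundary held fixed
        t += dt
    return 0.5 * np.sum(u**2) * dx                          # midpoint-style quadrature

# Hypothetical parameterized initial condition u0(x; mu), for illustration only
x = np.linspace(0.0, 1.0, 401)
mu = np.array([1.0, 0.5, 0.25])
u0 = mu[0] + mu[1] * np.sin(2 * np.pi * x) ** 2 + mu[2] * x * (1 - x)
print(burgers_kinetic_energy(u0, x, T=0.5))
```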

Results are displayed in Figures 7 and 8. Predictably, both versions of NLL are able to produce a reasonable one-dimensional approximation, although the New NLL approximation is much more accurate. Moreover, while a one-dimensional AS approximation can only capture the general trend in the function, a two-dimensional AS approximation (slightly) outperforms the Old NLL approximation. When using New NLL, Figure 7 (a) and (b) show that 100 training samples are sufficient for a sensitivity concentration of 98% and regression errors of around 0.5%.

Figure 7. (a) Relative regression errors on the total kinetic energy; (b) relative sensitivity to $z_1$ as a percentage of the total; (c) relative value of the loss during the first 2500 epochs of NLL training.
Figure 8. Regression on the total kinetic energy: (a) New NLL; (b) Old NLL; (c) 1-D AS; (d) 2-D AS.

6. Conclusion

An improved version of the NLL algorithm from [41] has been proposed which reduces the input dimension to one in every case. By reformulating the central learning problem as the minimization of a Dirichlet-type energy functional, good stability and convergence properties are exhibited despite the use of sparse and high-dimensional training data. Through various illustrative examples it has been demonstrated that New NLL has several benefits over the original NLL algorithm and the linear method of Active Subspaces, including faster training than Old NLL and a sharper, more complete dimension reduction. Results show that regression approximations trained after New NLL are also more accurate, leading to confident prediction when approximating functionals of ODE/PDE solutions. Future work includes obtaining rigorous estimates on the data dependence and algorithmic convergence of New NLL, as well as studying the connection between harmonic maps and minimizers of the NLL algorithm.

Acknowledgement

This work is partially supported by U.S. Department of Energy Scientific Discovery through Advanced Computing under grants DE-SC0020270 and DE-SC0020418.

References

  • [1] H. Abdi and L. J. Williams (2010) Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2 (4), pp. 433–459. Cited by: §2.
  • [2] K. P. Adragni and R. D. Cook (2009) Sufficient dimension reduction and prediction in regression. Philos. Trans. A Math. Phys. Eng. Sci. 367 (1906), pp. 4385–4405. Cited by: §2, §2.
  • [3] M. Balasubramanian, E. L. Schwartz, J. B. Tenenbaum, V. de Silva, and J. C. Langford (2002) The Isomap algorithm and topological stability. Science 295 (5552), pp. 7–7. Cited by: §1.
  • [4] Y. Bengio, J. Paiement, P. Vincent, O. Delalleau, N. Roux, and M. Ouimet (2003) Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. NeurIPS 16, pp. 177–184. Cited by: §1.
  • [5] R. Bridges, A. Gruber, C. Felder, M. Verma, and C. Hoff (2019) Active manifolds: a non-linear analogue to Active Subspaces. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 764–772. Cited by: §2, §2.
  • [6] R. L. Bryant (1995) An introduction to Lie groups and symplectic geometry. Geometry and Quantum Field Theory 1, pp. 321–347. Cited by: Remark 4.
  • [7] M. Budninskiy, G. Yin, L. Feng, Y. Tong, and M. Desbrun (2019) Parallel transport unfolding: a connection-based manifold learning approach. SIAGA 3 (2), pp. 266–291. Cited by: §1.
  • [8] B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert, and E. Holtham (2018) Reversible architectures for arbitrarily deep residual neural networks. In Proc. Conf. AAAI Artif. Intell., Vol. 32. Cited by: §3.1, §3.1.
  • [9] H. Choi and S. Choi (2007) Robust kernel isomap. Pattern Recognit. 40 (3), pp. 853–862. Cited by: §1.
  • [10] C. K. Chui and X. Li (1992) Approximation by ridge functions and neural networks with one hidden layer. J. Approx. Theory 70 (2), pp. 131–141. Cited by: §2.
  • [11] P. G. Constantine, E. Dow, and Q. Wang (2014) Active subspace methods in theory and practice: applications to kriging surfaces. SIAM J. Sci. Comput. 36 (4), pp. A1500–A1524. Cited by: §2.
  • [12] P. G. Constantine, A. Eftekhari, J. Hokanson, and R. A. Ward (2017) A near-stationary subspace for ridge approximation. Comput. Method. Appl. M. 326, pp. 402–421. Cited by: §2, §2.
  • [13] P. G. Constantine (2015) Active Subspaces: emerging ideas for dimension reduction in parameter studies. SIAM. Cited by: §2, §2.
  • [14] P. Constantine, R. Howard, A. Glaws, Z. Grey, P. Diaz, and L. Fletcher (2016) Python active-subspaces utility library. J. Open Source Softw. 1 (5), pp. 79. Cited by: §5.1.
  • [15] R. D. Cook (2007) Fisher lecture: dimension reduction in regression. Statistical Science 22 (1), pp. 1–26. Cited by: §2.
  • [16] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney (2009) Sampling algorithms and coresets for regression. SIAM J. Sci. Comput. 38 (5), pp. 2060–2078. Cited by: §2.
  • [17] P. Diaz, P. Constantine, K. Kalmbach, E. Jones, and S. Pankavich (2018) A modified SEIR model for the spread of Ebola in Western Africa and metrics for resource allocation. Appl. Math. Comput. 324, pp. 141–155. Cited by: §2, §5.3, §5.3.
  • [18] J. H. Friedman and W. Stuetzle (1981) Projection pursuit regression. J. Am. Stat. Assoc. 76 (376), pp. 817–823. Cited by: §2.
  • [19] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse (2017) The reversible residual network: backpropagation without storing activations. In NeurIPS. Cited by: §3.1.
  • [20] S. Greenland, M. A. Mansournia, and D. G. Altman (2016) Sparse data bias: a problem hiding in plain sight. BMJ 352. Cited by: §2.
  • [21] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §2.
  • [22] D. M. Hawkins (2004) The problem of overfitting. J. Chem. Inform. Comput. Sci. 44 (1), pp. 1–12. Cited by: §2.
  • [23] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507. Cited by: §2.
  • [24] K. Hornik, M. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural Netw. 2 (5), pp. 359–366. Cited by: §2.
  • [25] A. I. Khuri and S. Mukhopadhyay (2010) Response surface methodology. Wiley Interdiscip. Rev. Comput. Stat. 2 (2), pp. 128–149. Cited by: §2.
  • [26] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • [27] R. R. Lam, O. Zahm, Y. M. Marzouk, and K. E. Willcox (2020) Multifidelity dimension reduction via Active Subspaces. SIAM J. Sci. Comput. 42 (2), pp. A929–A956. Cited by: §2.
  • [28] J. A. Lee, A. Lendasse, and M. Verleysen (2004) Nonlinear projection with curvilinear distances: isomap versus curvilinear distance analysis. Neurocomputing 57, pp. 49–76. Cited by: §1.
  • [29] L. Li (2007) Sparse sufficient dimension reduction. Biometrika 94 (3), pp. 603–613. Cited by: §2, §2.
  • [30] T. W. Lukaczyk, P. Constantine, F. Palacios, and J. J. Alonso (2014) Active subspaces for shape optimization. In 10th AIAA multidisciplinary design optimization conference, pp. 1171. Cited by: §2.
  • [31] Y. Ma and L. Zhu (2013) A review on dimension reduction. Int. Stat. Rev. 81 (1), pp. 134–150. Cited by: §2.
  • [32] R. H. Myers, D. C. Montgomery, and C. M. Anderson-Cook (2016) Response surface methodology: process and product optimization using designed experiments. John Wiley & Sons. Cited by: §2.
  • [33] T. O’Leary-Roseberry, U. Villa, P. Chen, and O. Ghattas (2020) Derivative-informed projected neural networks for high-dimensional parametric maps governed by PDEs. arXiv preprint arXiv:2011.15110. Cited by: §2, §2.
  • [34] A. Pinkus et al. (1997) Approximating by ridge functions. Surface Fitting and Multiresolution Methods, pp. 279–292. Cited by: §2.
  • [35] S. T. Roweis and L. K. Saul (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326. Cited by: §1.
  • [36] R. M. Royall (1970) On finite population sampling theory under certain linear regression models. Biometrika 57 (2), pp. 377–387. Cited by: §2.
  • [37] R. Schoen and K. Uhlenbeck (1983) Boundary regularity and the Dirichlet problem for harmonic maps. J. Diff. Geom. 18 (2), pp. 253–268. Cited by: §4.2.
  • [38] Y. Wang, H. Yao, and S. Zhao (2016) Auto-encoder based dimensionality reduction. Neurocomputing 184, pp. 232–242. Cited by: §2.
  • [39] J. Weng and D. S. Young (2017) Some dimension reduction strategies for the analysis of survey data. J. Big Data 4 (1), pp. 43. Cited by: §2.
  • [40] S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemom. Intell. Lab. Syst. 2 (1-3), pp. 37–52. Cited by: §2.
  • [41] G. Zhang, J. Zhang, and J. Hinkle (2019) Learning nonlinear level sets for dimensionality reduction in function approximation. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Cited by: §1, §2, §3.1, §3.2, §3.2, §3.2, §3, §3, §5.1, §5.2, §5.2, §5, §6, Remark 2, Remark 3, Remark 6.
  • [42] Y. Zhang, Z. Zhang, J. Qin, L. Zhang, B. Li, and F. Li (2018) Semi-supervised local multi-manifold isomap by linear embedding for feature extraction. Pattern Recognit. 76, pp. 662–678. Cited by: §1.