Partition of unity networks: deep hp-approximation

01/27/2021 · Kookjin Lee et al. · Sandia National Laboratories

Approximation theorists have established best-in-class optimal approximation rates of deep neural networks by utilizing their ability to simultaneously emulate partitions of unity and monomials. Motivated by this, we propose partition of unity networks (POUnets) which incorporate these elements directly into the architecture. Classification architectures of the type used to learn probability measures are used to build a meshfree partition of space, while polynomial spaces with learnable coefficients are associated to each partition. The resulting hp-element-like approximation allows use of a fast least-squares optimizer, and the resulting architecture size need not scale exponentially with spatial dimension, breaking the curse of dimensionality. An abstract approximation result establishes desirable properties to guide network design. Numerical results for two choices of architecture demonstrate that POUnets yield hp-convergence for smooth functions and consistently outperform MLPs for piecewise polynomial functions with large numbers of discontinuities.

Overview

We consider regression over a data set $\mathcal{D} = \{(x_i, y_i)\}$, where the $y_i$ are point samples of a piecewise smooth function at locations $x_i$. Following successes in classification problems in high-dimensional spaces Chollet (2017), deep neural networks (DNNs) have garnered tremendous interest as tools for regression problems and numerical analysis, partially due to their apparent ability to alleviate the curse of dimensionality in the presence of latent low-dimensional structure. This is in contrast to classical methods, for which the computational expense grows exponentially with the ambient dimension, a major challenge for the solution of high-dimensional PDEs Bach (2017); Bengio and Bengio (2000); Han, Jentzen, and Weinan (2018).

Understanding the performance of DNNs requires accounting for both optimal approximation error and optimization error. While one may prove existence of DNN parameters providing exponential convergence with respect to architecture size, in practice a number of issues conspire to prevent realizing such convergence. Several approximation theoretic works seek to understand the role of width/depth in the absence of optimization error He et al. (2018); Daubechies et al. (2019); Yarotsky (2017, 2018); Opschoor, Petersen, and Schwab (2019). In particular, Yarotsky and Opschoor et al. prove the existence of parameters for a deep neural network architecture that approximate algebraic operations, partitions of unity (POUs), and polynomials to exponential accuracy in the depth of the network. This shows that sufficiently deep DNNs may in theory learn a spectrally convergent hp-element space without a hand-tailored mesh by constructing a POU to localize polynomial approximation. In practice, however, such convergent approximations are not realized when training DNNs using gradient descent optimizers, even for smooth target functions Fokina and Oseledets (2019); Adcock and Dexter (2020).

The thesis of this work is to incorporate the POU and polynomial elements directly into a deep learning architecture. Rather than attempting to force a DNN to simultaneously perform localization and high-order approximation, introducing localized polynomial spaces frees the DNN to play to its strengths by focusing exclusively on partitioning space, as in classification problems.

An attractive property of the proposed architecture is its amenability to a fast training strategy. In previous work, we developed an optimizer which alternates between a gradient descent update of hidden layer parameters and a globally optimal least squares solve for a final linear layer Cyr et al. (2019); this was applied as well to classification problems Patel et al. (2020). A similar strategy is applied in the current work: updating the POU with gradient descent before finding a globally optimal polynomial fit at each iteration ensures an optimal representation of data over the course of training.

While DNNs have been explored as a means of solving high-dimensional PDEs Geist et al. (2020); Han, Jentzen, and Weinan (2018), optimization error prevents a practical demonstration of convergence with respect to size of either data or model parameters Beck, Jentzen, and Kuckuck (2019); Wang, Teng, and Perdikaris (2020). The relatively simple regression problem considered here provides a critical first example of how the optimization error barrier may be circumvented to provide accuracy competitive with finite element methods.

An abstract POU network

Consider a partition of unity $\Phi = \{\Phi_\alpha\}_{\alpha=1}^{N_{\mathrm{part}}}$ satisfying $\sum_\alpha \Phi_\alpha(x) = 1$ and $\Phi_\alpha(x) \ge 0$ for all $x$. We work with the approximant

$y_{\mathrm{POU}}(x) = \sum_{\alpha=1}^{N_{\mathrm{part}}} \Phi_\alpha(x) \sum_{\beta=1}^{\dim V} c_{\alpha\beta} P_\beta(x),$   (1)

where $c_{\alpha\beta} \in \mathbb{R}$ and $\{P_\beta\}$ is a basis for a function space $V$. For this work, we take $V = \pi_m(\mathbb{R}^d)$ to be the space of polynomials of order $m$, while $\Phi$ is parametrized as a neural network with weights and biases $\xi$ and output dimension $N_{\mathrm{part}}$:

$\Phi = \Phi(\,\cdot\,; \xi), \qquad \Phi(\,\cdot\,;\xi): \mathbb{R}^d \to \mathbb{R}^{N_{\mathrm{part}}}.$   (2)

We consider two architectures for $\Phi$, to be specified later. Approximants of the form (1) allow a “soft” localization of the basis elements to an implicit partition of space parametrized by the $\Phi_\alpha$. While approximation with broken polynomial spaces corresponds to taking the $\Phi_\alpha$ to consist of characteristic functions on the cells of a computational mesh, the parametrization of $\Phi$ by a DNN generalizes more broadly to differentiable partitions of space. For applications of partitions of unity in numerical analysis, see Strouboulis, Copps, and Babuška (2001); Fries and Belytschko (2010); Wendland (2002); for their use in differential geometry, see Spivak (2018); Hebey (2000).

In a traditional numerical procedure, the partition is constructed prior to fitting the data through a geometric “meshing” process. We instead work with a POU in the form of a DNN (2) whose weights and biases $\xi$ are trained to fit the data. We therefore fit both the localized basis coefficients and the localization itself simultaneously by solving the optimization problem

$(\xi^*, c^*) = \operatorname{argmin}_{\xi,\,c} \sum_{(x_i, y_i) \in \mathcal{D}} \Big| y_i - \sum_\alpha \Phi_\alpha(x_i; \xi) \sum_\beta c_{\alpha\beta} P_\beta(x_i) \Big|^2.$   (3)
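
To make the structure of (1) and (3) concrete, the following minimal NumPy sketch evaluates a POUnet approximant and, with the partition held fixed, solves the linear least-squares problem for the coefficients. The function names and the one-dimensional monomial basis are illustrative assumptions, not the authors' implementation.

import numpy as np

def poly_basis(x, m):
    # 1D monomial basis 1, x, ..., x^m at points x of shape [N]
    return np.stack([x ** k for k in range(m + 1)], axis=1)          # [N, m+1]

def pounet_eval(phi, P, c):
    # y(x_n) = sum_a Phi_a(x_n) * sum_b c_{ab} P_b(x_n), cf. Eqn. (1)
    # phi: [N, n_part], P: [N, n_basis], c: [n_part, n_basis]
    return np.einsum('na,nb,ab->n', phi, P, c)

def fit_coeffs(phi, P, y):
    # With Phi frozen, Eqn. (3) is linear in c: assemble the design matrix
    # A[n, (a,b)] = Phi_a(x_n) P_b(x_n) and solve the least-squares problem.
    N, n_part = phi.shape
    n_basis = P.shape[1]
    A = (phi[:, :, None] * P[:, None, :]).reshape(N, n_part * n_basis)
    c_flat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return c_flat.reshape(n_part, n_basis)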

Error analysis and architecture design

Before specifying a choice of architecture for $\Phi$ in (2), we present a basic estimate of the optimal training error for the POUnet architecture to highlight desirable properties for the partition to have. We denote by $\operatorname{diam}(A)$ the diameter of a set $A$.

Theorem 1.

Consider an approximant of the form (1) with $V = \pi_m(\mathbb{R}^d)$. If $\xi^*$ and $c^*$ solve (3) to yield the approximant $y^*_{\mathrm{POU}}$, then

$\|y - y^*_{\mathrm{POU}}\|_{\ell^2(\mathcal{D})} \le C_y \, \max_\alpha \operatorname{diam}\big(\operatorname{supp}\,\Phi_\alpha\big)^{m+1},$   (4)

where $\|\cdot\|_{\ell^2(\mathcal{D})}$ denotes the root-mean-square norm over the training data pairs in $\mathcal{D}$,

$\|f\|_{\ell^2(\mathcal{D})} = \Big(\tfrac{1}{|\mathcal{D}|} \sum_{(x_i, y_i) \in \mathcal{D}} f(x_i)^2\Big)^{1/2},$   (5)

and

$C_y = \dfrac{1}{(m+1)!}\, \max_\alpha\, |y|_{C^{m+1}(\operatorname{supp}\,\Phi_\alpha)}.$   (6)
Proof.

For each $\alpha$, take $s_\alpha$ to be the $m$th-order Taylor polynomial of $y$ centered at any point of $\operatorname{supp}\,\Phi_\alpha$. Then for all $x \in \operatorname{supp}\,\Phi_\alpha$,

$|y(x) - s_\alpha(x)| \le \dfrac{|y|_{C^{m+1}(\operatorname{supp}\,\Phi_\alpha)}}{(m+1)!}\, \operatorname{diam}\big(\operatorname{supp}\,\Phi_\alpha\big)^{m+1}.$   (7)

Define the approximant $\tilde y = \sum_\alpha \Phi_\alpha s_\alpha$, which is of the form (1) and is represented by a feasible $(\xi, c)$. Then by the definition of $y^*_{\mathrm{POU}}$ and (3), we have

$\|y - y^*_{\mathrm{POU}}\|_{\ell^2(\mathcal{D})} \le \|y - \tilde y\|_{\ell^2(\mathcal{D})} = \Big\| \sum_\alpha \Phi_\alpha\,(y - s_\alpha) \Big\|_{\ell^2(\mathcal{D})} \le \max_i \sum_\alpha \Phi_\alpha(x_i)\, |y(x_i) - s_\alpha(x_i)|.$   (8)-(10)

For each $\alpha$, if $x_i \in \operatorname{supp}\,\Phi_\alpha$, then we apply (7); otherwise, the summand vanishes. So

$\|y - y^*_{\mathrm{POU}}\|_{\ell^2(\mathcal{D})} \le \max_i \sum_\alpha \Phi_\alpha(x_i)\, C_y\, \operatorname{diam}\big(\operatorname{supp}\,\Phi_\alpha\big)^{m+1} \le C_y\, \max_\alpha \operatorname{diam}\big(\operatorname{supp}\,\Phi_\alpha\big)^{m+1},$   (11)-(14)

using $\sum_\alpha \Phi_\alpha(x_i) = 1$. ∎

Theorem 1 indicates that for smooth $y$, the resulting convergence rate of the training error is independent of the specific choice of parameterization of the $\Phi_\alpha$ and depends only upon their support. Further, the error does not explicitly depend upon the ambient dimension $d$ of the problem; if the trained $\Phi$ encodes a covering by the $\operatorname{supp}\,\Phi_\alpha$ of a low-dimensional manifold containing the data locations with latent dimension $d_M < d$, then the approximation cost scales only with $d_M$ and thus may break the curse of dimensionality, even for, e.g., linear polynomial approximation. If the parameterization is able to find compactly supported, quasi-uniform partitions such that the maximum diameter in (4) scales as $N_{\mathrm{part}}^{-1/d_M}$, the training loss will exhibit a scaling of $N_{\mathrm{part}}^{-(m+1)/d_M}$.
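
To make this scaling explicit, the following short derivation assumes partitions that quasi-uniformly cover a manifold of latent dimension $d_M$ containing the data, so that the bound (4) gives

$\max_\alpha \operatorname{diam}\big(\operatorname{supp}\,\Phi_\alpha\big) \sim N_{\mathrm{part}}^{-1/d_M} \quad\Longrightarrow\quad \|y - y^*_{\mathrm{POU}}\|_{\ell^2(\mathcal{D})} \lesssim C_y\, N_{\mathrm{part}}^{-(m+1)/d_M};$

that is, for fixed polynomial order $m$, increasing $N_{\mathrm{part}}$ plays the role of $h$-refinement, while increasing $m$ plays the role of $p$-refinement.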

The above analysis indicates the advantage of enforcing highly localized, compact support of the POU functions. However, in practice we found that for a POU parametrized by a shallow RBF network (described below), rapidly decaying but globally supported POU functions exhibited more consistent training results. This is likely due to a higher tolerance to initializations poorly placed with respect to the data locations. Similarly, when the POU is parametrized by a deep ResNet, we obtained good results using ReLU activation functions while training with a regularizer to promote localization (Algorithm 2). Thus, the properties playing a role in the above analysis, namely localization of the partition functions and distribution of their support over a latent manifold containing the data, are the motivation for the architectures and training strategies we consider below.

POU #1 - RBF-Net: A shallow RBF-network Broomhead and Lowe (1988); Billings and Zheng (1995) implementation of $\Phi$ is given by (1) and

$\Phi_\alpha(x) = \dfrac{\exp\!\big(-\epsilon_\alpha^2 \lVert x - x_\alpha\rVert^2\big)}{\sum_{\beta} \exp\!\big(-\epsilon_\beta^2 \lVert x - x_\beta\rVert^2\big)}.$   (15)

Here, $x_\alpha$ denotes the RBF centers and $\epsilon_\alpha$ the RBF shape parameters, both of which evolve during training. A measure of the localization of these functions is the magnitude of $\epsilon_\alpha$. Such an architecture works well for approximation of smooth functions, but the smoothness of $\Phi$ causes difficulty in the approximation of piecewise smooth functions.
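
For concreteness, a minimal NumPy sketch of normalized Gaussian RBF partition functions of this type follows; the exact role of the shape parameter is an illustrative assumption rather than the authors' exact parametrization.

import numpy as np

def rbf_pou(x, centers, shape):
    # x: [N, d] points, centers: [n_part, d], shape: [n_part].
    # Returns [N, n_part] nonnegative partition functions summing to one per point.
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logits = -(shape[None, :] ** 2) * sq_dist
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable normalization
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)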

POU #2 - ResNet: We compose a residual network architecture He et al. (2016) with a softmax layer to define (2). For the experiments considered we use a ReLU activation, allowing representation of functions with discontinuities in their first derivative.
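
A sketch of one way to realize such a ResNet-plus-softmax partition network in Keras is given below; the width, depth, and residual-block form are illustrative choices, not the authors' exact architecture.

import tensorflow as tf

def make_resnet_pou(dim_in, n_part, width=8, depth=8):
    x_in = tf.keras.Input(shape=(dim_in,))
    h = tf.keras.layers.Dense(width, activation='relu')(x_in)
    for _ in range(depth):
        # residual block: h <- h + W2 relu(W1 h + b1) + b2
        z = tf.keras.layers.Dense(width, activation='relu')(h)
        h = tf.keras.layers.Add()([h, tf.keras.layers.Dense(width)(z)])
    # softmax output: nonnegative partition functions that sum to one
    phi = tf.keras.layers.Dense(n_part, activation='softmax')(h)
    return tf.keras.Model(x_in, phi)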

Initialization: All numerical experiments take data supported within the unit hypercube. To initialize the POU #1 architecture we use unit shape parameters and uniformly distributed centers. We initialize POU #2 with the box initialization of Cyr et al. (2019). We found that these initializations provide an initial set of partitions sufficiently “well-distributed” throughout the domain for successful training.

Fast optimizer

The least-squares structure of (3) allows application of the least-squares gradient descent (LSGD) block coordinate descent strategy of Cyr et al. (2019). At each epoch, one may hold the hidden parameters $\xi$ fixed to obtain a least-squares problem for the optimal polynomial coefficients $c$. The step is concluded by then taking a step of gradient descent for the POU parameters, thereby evolving the partitions along the manifold corresponding to optimal representation of the data; for details see Cyr et al. (2019).

Algorithm 1 with $\lambda = 0$ specifies the application of LSGD to Eqn. (3). We will demonstrate that, while effective, several of the learned partition functions may “collapse” to near-zero values everywhere. To remedy this, we also consider a pre-training step in Algorithm 2, which adds an $\ell_2$ regularizer to the polynomial coefficients. The intuition is that a given partition regresses data using an element of the form $\Phi_\alpha \sum_\beta c_{\alpha\beta} P_\beta$. If $\Phi_\alpha$ is scaled by a small constant, the LSGD solver may pick up a compensating scaling of the coefficients $c_{\alpha\beta}$ and achieve the same approximation. Limiting the size of the coefficients thus indirectly penalizes this mode of partition function collapse, promoting more quasi-uniform partitions of space.

Data: training set $\mathcal{D} = \{(x_i, y_i)\}$, initial POU parameters $\xi$, regularization weight $\lambda$, learning rate, number of epochs $n_{\mathrm{epoch}}$, stagnation tolerance $n_{\mathrm{stag}}$
Result: trained parameters $(\xi, c)$
Function LSGD($\mathcal{D}$, $\xi$, $\lambda$, learning rate, $n_{\mathrm{epoch}}$, $n_{\mathrm{stag}}$):
       initialize the polynomial coefficients $c$;
       for $k = 1, \ldots, n_{\mathrm{epoch}}$ do
              // least-squares solve for the coefficients of Eqn. (1) with an $\ell_2$ regularizer, holding $\xi$ fixed
              $c \leftarrow \operatorname{argmin}_c \sum_i \big| y_i - y_{\mathrm{POU}}(x_i; \xi, c) \big|^2 + \lambda \|c\|^2$;
              take a gradient descent (Adam) step on $\xi$ with $c$ held fixed;
              if the training loss stagnates for more than $n_{\mathrm{stag}}$ epochs then
                    terminate early
              end if
       end for
       final least-squares solve for $c$ with the trained $\xi$;
       return $(\xi, c)$;
Algorithm 1 The regularized least-squares gradient descent (LSGD) method. Setting $\lambda = 0$ recovers the original LSGD method Cyr et al. (2019).
Data: training set $\mathcal{D}$, initial POU parameters $\xi$, regularization weight $\lambda$, learning rates and epoch counts for the two phases
Result: trained parameters $(\xi, c)$
Function Two-phase-LSGD($\mathcal{D}$, $\xi$, $\lambda$, …):
       // Phase 1: LSGD with the $\ell_2$ regularizer
       $(\xi, c) \leftarrow$ LSGD($\mathcal{D}$, $\xi$, $\lambda$, …);
       // Phase 2: LSGD without a regularizer
       $(\xi, c) \leftarrow$ LSGD($\mathcal{D}$, $\xi$, $0$, …);
       return $(\xi, c)$;
Algorithm 2 The two-phase LSGD method with the $\ell_2$-regularized least-squares solves.
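
As a concrete illustration, the following is a minimal TensorFlow sketch of a regularized LSGD loop in the spirit of Algorithm 1: a least-squares solve for the polynomial coefficients (via tf.linalg.lstsq with an $\ell_2$ regularizer) alternated with an Adam step on the partition network. The toy target, network sizes, and hyperparameters are illustrative assumptions, not the settings used in the experiments below.

import numpy as np
import tensorflow as tf

n_part, degree, lam, n_epoch = 4, 2, 1e-3, 100

# Partition network: a small MLP with softmax output (a stand-in for POU #2).
pou = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(n_part, activation='softmax'),
])
opt = tf.keras.optimizers.Adam(1e-3)

def poly_basis(x):
    # monomials 1, x, ..., x^degree evaluated at x of shape [N, 1]
    return tf.concat([x ** k for k in range(degree + 1)], axis=1)

x = tf.constant(np.linspace(0.0, 1.0, 256)[:, None], dtype=tf.float32)
y = tf.sin(4.0 * np.pi * x)   # toy smooth target

P = poly_basis(x)                                            # [N, degree+1]
for epoch in range(n_epoch):
    # 1) Regularized least-squares solve for the coefficients c with Phi frozen.
    phi = tf.stop_gradient(pou(x))                           # [N, n_part]
    A = tf.reshape(phi[:, :, None] * P[:, None, :], (x.shape[0], -1))
    c = tf.linalg.lstsq(A, y, l2_regularizer=lam)            # [n_part*(degree+1), 1]

    # 2) Adam step on the partition-network parameters with c frozen.
    with tf.GradientTape() as tape:
        phi = pou(x)
        A = tf.reshape(phi[:, :, None] * P[:, None, :], (x.shape[0], -1))
        loss = tf.reduce_mean(tf.square(y - A @ c))
    grads = tape.gradient(loss, pou.trainable_variables)
    opt.apply_gradients(zip(grads, pou.trainable_variables))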

Numerical experiments

In this section, we assess the performance of POUnets with the two POU architectures. We implement the proposed algorithms in Python, construct POUnets using TensorFlow 2.3.0 Abadi et al. (2016), and employ the NumPy and SciPy packages for generating training data and solving the least-squares problems. For all considered neural networks, training is performed by batch gradient descent using the Adam optimizer Kingma and Ba (2014). For the set of $d$-variate polynomials $\{P_\beta\}$, we choose truncated Taylor polynomials of maximal degree $m_{\mathrm{poly}}$, providing $\binom{m_{\mathrm{poly}}+d}{d}$ polynomial coefficients per partition. The coefficients are initialized by sampling from the unit normal distribution.
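
As a point of reference, the number of $d$-variate monomials of total degree at most $m_{\mathrm{poly}}$ is $\binom{m_{\mathrm{poly}}+d}{d}$; a small helper that enumerates and evaluates such a basis (an illustrative sketch, not the authors' implementation) is:

from itertools import product
from math import comb
import numpy as np

def monomial_basis(x, m):
    # x: [N, d] sample points; returns [N, comb(m + d, d)] monomial values
    N, d = x.shape
    exps = [k for k in product(range(m + 1), repeat=d) if sum(k) <= m]
    assert len(exps) == comb(m + d, d)
    return np.stack([np.prod(x ** np.array(k), axis=1) for k in exps], axis=1)

# e.g. degree-2 polynomials in 2 variables have comb(4, 2) = 6 coefficients
print(monomial_basis(np.random.rand(5, 2), 2).shape)   # (5, 6)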

Smooth functions

We consider an analytic function as our first benchmark, specifically a sine function defined on a cross-shaped one-dimensional manifold embedded in a two-dimensional domain.

Figure 1: Relative $\ell^2$-errors (log-log scale) of approximants produced by (a) POUnets with RBF-Net partition functions, for varying numbers of partitions and polynomial degrees, and (b) standard MLPs, for varying width and depth.

We test RBF-Nets for varying numbers of partitions and varying maximal polynomial degrees. For training, we collect data by uniformly sampling 501 points from the interval [-1,1] along each arm of the cross, i.e., 1001 pairs in total after removing the duplicate point at the origin. We initialize the centers of the RBF basis functions by sampling uniformly from the domain and initialize the shape parameters to one. We then train the RBF-Nets using the LSGD method with $\lambda = 0$ (Algorithm 1), with Adam as the gradient descent optimizer. The maximum number of epochs is set to 100, and we choose the centers and shape parameters that yield the best training loss over the specified epochs.
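
A sketch of how such training data could be generated follows; the sine frequency and the parametrization of the cross are illustrative assumptions.

import numpy as np

t = np.linspace(-1.0, 1.0, 501)
horiz = np.stack([t, np.zeros_like(t)], axis=1)            # arm along the x-axis
vert = np.stack([np.zeros_like(t), t], axis=1)             # arm along the y-axis
vert = vert[~np.isclose(t, 0.0)]                           # drop the duplicate origin
X = np.concatenate([horiz, vert], axis=0)                  # 1001 points in [-1,1]^2

# arc-length-like coordinate along the arms, and the (assumed) sine target
s = np.where(X[:, 1] == 0.0, X[:, 0], X[:, 1])
y = np.sin(2.0 * np.pi * s)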

Figure 1(a) reports the relative $\ell^2$-errors of approximants produced by POUnets for varying numbers of partitions and varying polynomial degrees. The results are obtained from 10 independent runs for each combination, with a single log-normal standard deviation. Algebraic convergence is observed, of order increasing with the polynomial degree, before saturating. We conjecture this saturation is due to eventual loss of precision caused by the ill-conditioning of the least-squares matrix; however, we leave a formal study and treatment to future work.

For comparison with the performance achieved by POUnets, we assess standard MLPs with depths in {4, 8, 12, 16, 20} and widths in {8, 16, 32, 64, 128}. The standard MLPs are trained using Adam with a maximum of 1000 epochs. In contrast to the convergence behavior of the RBF-Net-based POUnets, the standard MLPs briefly exhibit roughly first-order convergence before stagnating.

Figure 2: Relative $\ell^2$-errors (log-log scale) of approximants produced by the standard ResNets and by POUnets with ResNet partition functions, for varying width, when approximating (a) the triangular-wave and (b) the quadratic-wave target functions with varying frequency.

Piecewise smooth functions

We next consider piecewise linear and piecewise quadratic functions: triangle waves with varying frequencies and their quadratic variants, where

(16)

We study the introduction of increasingly many discontinuities by increasing the frequency. Reproduction of such sawtooth functions by ReLU networks, via both wide networks He et al. (2018) and very deep networks Telgarsky (2016), can be achieved theoretically via construction of carefully chosen architectures and weights/biases, but to our knowledge has not been achieved via standard training.
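
For illustration, a common construction of such targets is sketched below; the exact definition in (16) may differ, so the piece count and scaling here are assumptions.

import numpy as np

def triangle_wave(x, omega):
    # triangle wave on [0, 1] with 2*omega linear pieces, values in [0, 1]
    return 2.0 * np.abs(omega * x - np.floor(omega * x + 0.5))

def quadratic_wave(x, omega):
    # piecewise-quadratic variant (assumed here to be the pointwise square)
    return triangle_wave(x, omega) ** 2

x = np.random.uniform(0.0, 1.0, size=1000)
y_tri, y_quad = triangle_wave(x, 4), quadratic_wave(x, 4)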

The smoothness of the RBF-Net partition functions (POU #1) precludes rapidly convergent approximation of piecewise smooth functions, and we instead employ ResNets (POU #2) for $\Phi$ in this case, as the piecewise linear nature of such partition functions is better suited. As a baseline for the performance comparison, we regress the same data under a mean-square-error loss using standard ResNets. The standard ResNets share the same architectural design and hyperparameters as the ResNets used to parametrize the POU functions in the POUnets. The only exception is the output layer: the standard ResNets produce a scalar-valued output, whereas the POU parametrization produces a vector-valued output, which is followed by the softmax activation.

Figure 3: Snapshots of the target functions (black) and the approximants produced by the standard ResNet (light green) and the POUnet (orange). The target functions correspond to triangular waves (left column) and their quadratic variants (right column); the bottom row, panels (e) and (f), shows zoomed-in views over the subdomain [0.5, 0.625].

We consider five frequency parameters for both target functions, resulting in piecewise linear and piecewise quadratic functions with increasing numbers of pieces. Based on the number of pieces in the target function, we scale the width of the baseline neural networks and of the POUnets while fixing the depth at 8, and for the POUnets the number of partitions is scaled accordingly. For the POUnets, we choose the maximal polynomial degree to be 1 and 2 for the piecewise linear and piecewise quadratic target functions, respectively.

Figure 4: Triangular wave with two pieces (top row; panels (a)-(d): phase 1 at epochs 0, 15, 30, and 60; panel (e): phase 2 at epoch 1000; panel (f): final approximation) and triangular wave with eight pieces (bottom row; panels (g)-(j): phase 1 at epochs 0, 30000, 60000, and 90000; panel (k): phase 2 at epoch 1000; panel (l): final approximation). Phase 1 LSGD constructs localized disjoint partitions (4(a)-4(d) and 4(g)-4(j)) and phase 2 LSGD produces an accurate approximation.

Both neural networks are trained on the same data $\{(x_i, y_i)\}$, where the $x_i$ are uniformly sampled from [0,1]. The standard ResNets are trained using gradient descent and the POUnets are trained using the LSGD method with $\lambda = 0$ (Algorithm 1), with Adam as the optimizer. The maximum number of epochs is set to 2000, and we choose the neural network weights and biases that yield the best training loss over the specified epochs.

Figure 2 reports the relative $\ell^2$-errors of approximants produced by the standard ResNets and the POUnets for varying frequency and width, for the piecewise linear functions (left) and the piecewise quadratic functions (right). The statistics are obtained from five independent runs. Figure 2 shows that the POUnets outperform the standard ResNets in approximation accuracy; specifically, the standard ResNets with the ReLU activation fail to produce accurate approximations, while the POUnets retain small errors even for large numbers of discontinuities.

Figure 3 illustrates snapshots of the target functions and the approximants produced by the two neural networks at a fixed frequency parameter. Figure 3 confirms the trends shown in Figure 2: the benefits of using POUnets are more pronounced for functions with larger frequency and for the quadratic targets (and, potentially, for higher-order polynomial pieces). Figures 3(c)-3(f) clearly show that the POUnets accurately approximate the target functions with sub-1% errors, while the standard ResNets fail to produce accurate approximations.

Results of two-phase LSGD

Now we demonstrate the effectiveness of the two-phase LSGD method (Algorithm 2) in constructing partitions which are nearly disjoint and localized according to the features in the data (i.e., the breakpoints in the target function), which we found leads to better approximation accuracy.

The first phase of the algorithm aims to construct such partitions. To this end, we limit the expressibility of the polynomial regression by regularizing the Frobenius norm of the coefficients in the least-squares solves. This prevents a small number of partitions from dominating while others collapse, per our discussion of the fast optimizer above. These "quasi-uniform" partitions are then used as the initial guess for the unregularized second phase. Finally, we study the qualitative differences in the resulting POU, particularly the ability of the network to learn a disjoint partition of space. We stress, however, that there is no particular "correct" form for the resulting POU; this is primarily of interest because it shows the network may recover a traditional discontinuous polynomial approximation.

In the first phase, we employ a relatively large learning rate to prevent the weights and biases from getting stuck in local minima. In the second phase, we employ a relatively small learning rate to ensure that the training error decreases.

Figure 5: Quadratic wave with two pieces (top row; panels (a)-(d): phase 1 at epochs 0, 3000, 6000, and 9000; panel (e): phase 2 at epoch 1000; panel (f): final approximation) and quadratic wave with eight pieces (bottom row; panels (g)-(k): phase 1 at epochs 0, 100000, 150000, 300000, and 500000; panel (l): final approximation). Phase 1 LSGD constructs localized disjoint partitions (5(a)-5(d) and 5(g)-5(j)) and phase 2 LSGD produces an accurate approximation.

Triangular waves.

We demonstrate how the two-phase LSGD works in two example cases. The first example problem is the triangular wave with two pieces. We use the same POU network architecture described in the previous section (i.e., a ResNet with width 8 and depth 8) and the same initialization scheme (i.e., the box initialization). We set the initial learning rates to 0.1 and 0.05 for phase 1 and phase 2, respectively; the remaining LSGD parameters are held fixed. Figure 4 (top) depicts both how the partitions are constructed during phase 1 (Figures 4(a)-4(d)) and snapshots of the partitions and the approximant at the 1000th epoch of phase 2 (Figures 4(e) and 4(f)). The relative $\ell^2$-norm of the error continues to decrease through 12000 epochs of phase 2.

Next, we consider the triangular wave with eight pieces. We use a POU network architecture constructed as a ResNet with width 8 and depth 10, and use the box initialization. We set the initial learning rates to 0.05 and 0.01 for phase 1 and phase 2, respectively; the remaining LSGD parameters are held fixed. Figure 4 (bottom) again depicts how the partitions evolve during phase 1 and the resulting approximant. Figures 4(g)-4(j) show that the two-phase LSGD constructs partitions that are disjoint and localized according to the features during the first phase.

Quadratic waves.

Finally, we approximate the piecewise quadratic waves while employing the same network architectures used for approximating the triangular waves. For the two-piece case, the learning rates are set to 0.5 and 0.25 for phase 1 and phase 2, and the remaining parameter settings match the previous experiment. Again, disjoint partitions are constructed in phase 1, and an accurate approximation is produced in phase 2 (Figures 5(a)-5(f)). Moreover, we observe that the partitions are further refined during phase 2. For the eight-piece case, we again employ the architecture used for the eight-piece triangular wave (i.e., a ResNet with width 8 and depth 10) and the same hyperparameters (i.e., learning rates of 0.05 and 0.01); the only exceptions are the weight of the regularization and one additional LSGD parameter. Figures 5(g)-5(k) illustrate that the phase 1 LSGD constructs disjoint supports for the partitions, but requires many more epochs. Again, the two-phase LSGD produces an accurate approximation (Figure 5(l)).

Discussion.

The results of this section demonstrate that it is important to apply a good regularizer during a pre-training stage to obtain approximately disjoint, piecewise constant partitions. For the piecewise polynomial functions considered here, such partitions allowed in some cases recovery to near-machine precision. Given the abstract error analysis, it is clear that such localized partitions, which adapt to the features of the target function, act similarly to traditional hp-approximation. There are several challenges with this approach, however: the strong regularization during phase 1 requires a large number of training epochs, and we were unable to obtain a single set of hyperparameters which provides such clean partitions across all cases. This suggests two potential areas of future work. First, an improved means of regularizing during pre-training that does not compromise accuracy may allow faster training; deep learning strategies for parametrizing charts, as in Schonsheck, Chen, and Lai (2019), may be useful for this. Second, it may be fruitful to understand the approximation error under assumptions less restrictive than compactly supported or highly localized partitions. We have been guided by the idea that polynomials defined on indicator-function partitions reproduce the constructions used by the finite element community, but a less restrictive paradigm combined with an effective learning strategy may prove more flexible for general datasets.

Conclusions

Borrowing ideas from numerical analysis, we have demonstrated a novel architecture and optimization strategy whose computational complexity need not scale exponentially with the ambient dimension and which provides high-order convergence for smooth functions and errors consistently below 1% for piecewise smooth problems. Such an architecture has the potential to provide DNN methods for solving high-dimensional PDEs that converge in a manner competitive with traditional finite element spaces.

Acknowledgements

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003530. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. SAND Number: SAND2020-6022 J.

The work of M. Gulian, R. Patel, and N. Trask are supported by the U.S. Department of Energy, Office of Advanced Scientific Computing Research under the Collaboratory on Mathematics and Physics-Informed Learning Machines for Multiscale and Multiphysics Problems (PhILMs) project. E. C. Cyr and N. Trask are supported by the Department of Energy early career program. M. Gulian is supported by the John von Neumann fellowship at Sandia National Laboratories.

References