Overview
We consider regression over a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathbb{R}^d \times \mathbb{R}$, where the $(x_i, y_i)$ are point samples of a piecewise smooth function. Following successes in classification problems in high-dimensional spaces Chollet (2017), deep neural networks (DNNs) have garnered tremendous interest as tools for regression problems and numerical analysis, partially due to their apparent ability to alleviate the curse of dimensionality in the presence of latent low-dimensional structure. This is in contrast to classical methods, for which the computational expense grows exponentially with the dimension $d$, a major challenge for the solution of high-dimensional PDEs Bach (2017); Bengio and Bengio (2000); Han, Jentzen, and Weinan (2018).
Understanding the performance of DNNs requires accounting for both optimal approximation error and optimization error. While one may prove the existence of DNN parameters providing exponential convergence with respect to architecture size, in practice a number of issues conspire to prevent realizing such convergence. Several approximation-theoretic works seek to understand the role of width/depth in the absence of optimization error He et al. (2018); Daubechies et al. (2019); Yarotsky (2017, 2018); Opschoor, Petersen, and Schwab (2019). In particular, Yarotsky and Opschoor et al. prove the existence of parameters for a deep neural network architecture that approximate algebraic operations, partitions of unity (POUs), and polynomials to exponential accuracy in the depth of the network. This shows that sufficiently deep DNNs may in theory learn a spectrally convergent element space without a hand-tailored mesh by constructing a POU to localize polynomial approximation. In practice, however, such convergent approximations are not realized when training DNNs using gradient descent optimizers, even for smooth target functions Fokina and Oseledets (2019); Adcock and Dexter (2020).
The thesis of this work is to incorporate the POU and polynomial elements directly into a deep learning architecture. Rather than attempting to force a DNN to simultaneously perform localization and high-order approximation, introducing localized polynomial spaces frees the DNN to play to its strengths by focusing exclusively on partitioning space, as in classification problems.
An attractive property of the proposed architecture is its amenability to a fast training strategy. In previous work, we developed an optimizer which alternates between a gradient descent update of hidden layer parameters and a globally optimal least squares solve for a final linear layer Cyr et al. (2019); this was applied as well to classification problems Patel et al. (2020). A similar strategy is applied in the current work: updating the POU with gradient descent before finding a globally optimal polynomial fit at each iteration ensures an optimal representation of data over the course of training.
While DNNs have been explored as a means of solving high-dimensional PDEs Geist et al. (2020); Han, Jentzen, and Weinan (2018), optimization error prevents a practical demonstration of convergence with respect to the size of either the data or the model parameters Beck, Jentzen, and Kuckuck (2019); Wang, Teng, and Perdikaris (2020). The relatively simple regression problem considered here provides a critical first example of how the optimization error barrier may be circumvented to provide accuracy competitive with finite element methods.
An abstract POU network
Consider a partition of unity $\{\Phi_\alpha\}_{\alpha=1}^{N_{\mathrm{part}}}$ satisfying $\sum_{\alpha=1}^{N_{\mathrm{part}}} \Phi_\alpha(x) = 1$ and $\Phi_\alpha(x) \geq 0$ for all $x$. We work with the approximant

$$u(x) = \sum_{\alpha=1}^{N_{\mathrm{part}}} \Phi_\alpha(x) \sum_{\beta} c_{\alpha\beta} P_\beta(x), \quad (1)$$

where $c_{\alpha\beta} \in \mathbb{R}$ and $\{P_\beta\}$ is a basis for a space $V$. For this work, we take $V = \pi_m(\mathbb{R}^d)$, the space of polynomials of degree at most $m$, while $\Phi$ is parametrized as a neural network with weights and biases $\xi$ and output dimension $N_{\mathrm{part}}$:

$$\Phi(x; \xi) = \big( \Phi_1(x; \xi), \dots, \Phi_{N_{\mathrm{part}}}(x; \xi) \big). \quad (2)$$

We consider two architectures for $\Phi$ to be specified later. Approximants of the form (1) allow a “soft” localization of the basis elements to an implicit partition of space parametrized by the $\Phi_\alpha$. While approximation with broken polynomial spaces corresponds to taking $\{\Phi_\alpha\}$ to consist of characteristic functions on the cells of a computational mesh, the parametrization of $\Phi$ by a DNN generalizes more broadly to differentiable partitions of space. For applications of partitions of unity in numerical analysis, see Strouboulis, Copps, and Babuška (2001); Fries and Belytschko (2010); Wendland (2002); for their use in differential geometry, see Spivak (2018); Hebey (2000).

In a traditional numerical procedure, $\{\Phi_\alpha\}$ is constructed prior to fitting data through a geometric “meshing” process. We instead work with a POU in the form of a DNN (2) with weights and biases $\xi$, which are trained to fit the data. We therefore fit both the localized basis coefficients and the localization itself simultaneously by solving the optimization problem

$$(\xi^*, c^*) = \operatorname*{argmin}_{\xi,\, c} \; \sum_{i=1}^{N} \Big( y_i - \sum_{\alpha=1}^{N_{\mathrm{part}}} \Phi_\alpha(x_i; \xi) \sum_{\beta} c_{\alpha\beta} P_\beta(x_i) \Big)^2. \quad (3)$$
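The approximant (1) can be made concrete in a few lines. The sketch below evaluates $u(x) = \sum_\alpha \Phi_\alpha(x) \, p_\alpha(x)$ for 1-d inputs, using a normalized-Gaussian partition (one hypothetical choice, in the spirit of the RBFNet architecture described later) and per-partition monomial coefficients; the function name and argument layout are illustrative, not the paper's code.

```python
import numpy as np

def pou_net(x, centers, shapes, coeffs):
    """Evaluate u(x) = sum_a Phi_a(x) * p_a(x) at 1-d points x.

    Phi is a normalized-Gaussian partition of unity (a hypothetical choice);
    p_a is the per-partition polynomial with coefficient row coeffs[a].
    """
    # Unnormalized Gaussian bumps, then row normalization -> partition of unity
    logits = -(shapes[None, :] * (x[:, None] - centers[None, :])) ** 2
    phi = np.exp(logits)
    phi /= phi.sum(axis=1, keepdims=True)          # rows sum to 1
    # Per-partition polynomial values p_a(x) = sum_b coeffs[a, b] * x^b
    powers = x[:, None] ** np.arange(coeffs.shape[1])[None, :]   # Vandermonde
    p = powers @ coeffs.T                           # shape (n_pts, n_parts)
    return (phi * p).sum(axis=1)
```

Note that because the $\Phi_\alpha$ sum to one, any polynomial shared by all partitions is reproduced exactly, regardless of where the partitions sit.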
Error analysis and architecture design
Before specifying a choice of architecture for $\Phi$ in (2), we present a basic estimate of the optimal training error using the POUnet architecture to highlight desirable properties for the partition to have. We denote by $\mathrm{diam}(\omega)$ the diameter of a set $\omega$.

Theorem 1.
Consider an approximant of the form (1) with $V = \pi_m(\mathbb{R}^d)$. If $\xi^*$ and $c^*$ solve (3) to yield the approximant $u^*$, then

$$\| y - u^* \|_{\ell_2(\mathcal{D})} \;\leq\; \frac{|f|_{C^{m+1}}}{(m+1)!} \, \max_\alpha \, \mathrm{diam}\big(\mathrm{supp}\,\Phi_\alpha\big)^{m+1}, \quad (4)$$

where $\| \cdot \|_{\ell_2(\mathcal{D})}$ denotes the root-mean-square norm over the training data pairs in $\mathcal{D}$,

$$\| y - u \|_{\ell_2(\mathcal{D})} = \Big( \frac{1}{N} \sum_{i=1}^{N} \big( y_i - u(x_i) \big)^2 \Big)^{1/2}, \quad (5)$$

and

$$|f|_{C^{m+1}} = \max_{|\gamma| = m+1} \, \sup_x \big| D^{\gamma} f(x) \big|. \quad (6)$$
Proof.
For each $\alpha$, take $p_\alpha$ to be the $m$th order Taylor polynomial of $f$ centered at any point of $\mathrm{supp}\,\Phi_\alpha$. Then for all $x \in \mathrm{supp}\,\Phi_\alpha$,

$$|f(x) - p_\alpha(x)| \;\leq\; \frac{|f|_{C^{m+1}}}{(m+1)!} \, \mathrm{diam}\big(\mathrm{supp}\,\Phi_\alpha\big)^{m+1}. \quad (7)$$

Define the approximant $\tilde{u} = \sum_\alpha \Phi_\alpha \, p_\alpha$, which is of the form (1) and represented by feasible $(\xi, c)$. Then by definition of $u^*$ and (3), we have

$$\| y - u^* \|_{\ell_2(\mathcal{D})} \leq \| y - \tilde{u} \|_{\ell_2(\mathcal{D})} \quad (8)$$
$$= \Big\| \sum_\alpha \Phi_\alpha \, (f - p_\alpha) \Big\|_{\ell_2(\mathcal{D})} \quad (9)$$
$$\leq \max_i \sum_\alpha \Phi_\alpha(x_i) \, \big| f(x_i) - p_\alpha(x_i) \big|. \quad (10)$$

For each $i$ and $\alpha$, if $x_i \in \mathrm{supp}\,\Phi_\alpha$, then we apply (7); otherwise, the summand vanishes. So

$$\| y - u^* \|_{\ell_2(\mathcal{D})} \leq \max_i \sum_\alpha \Phi_\alpha(x_i) \, \frac{|f|_{C^{m+1}}}{(m+1)!} \, \mathrm{diam}\big(\mathrm{supp}\,\Phi_\alpha\big)^{m+1} \quad (11)$$
$$\leq \frac{|f|_{C^{m+1}}}{(m+1)!} \, \max_\alpha \, \mathrm{diam}\big(\mathrm{supp}\,\Phi_\alpha\big)^{m+1} \, \max_i \sum_\alpha \Phi_\alpha(x_i) \quad (12)$$
$$= \frac{|f|_{C^{m+1}}}{(m+1)!} \, \max_\alpha \, \mathrm{diam}\big(\mathrm{supp}\,\Phi_\alpha\big)^{m+1} \cdot 1 \quad (13)$$
$$= \frac{|f|_{C^{m+1}}}{(m+1)!} \, \max_\alpha \, \mathrm{diam}\big(\mathrm{supp}\,\Phi_\alpha\big)^{m+1}, \quad (14)$$

using the partition of unity property $\sum_\alpha \Phi_\alpha(x_i) = 1$ in the last two steps. $\square$
Theorem 1 indicates that for smooth $f$, the resulting convergence rate of the training error is independent of the specific choice of parameterization of the $\Phi_\alpha$, and depends only upon their support. Further, the error does not explicitly depend upon the dimension $d$ of the problem: if the trained $\Phi$ encodes a covering by $\mathrm{supp}\,\Phi_\alpha$ of a low-dimensional manifold containing the data locations with latent dimension $d_{\mathcal{M}} < d$, then the approximation cost scales only with $d_{\mathcal{M}}$ and thus may break the curse of dimensionality, e.g., for linear polynomial approximation. If the parameterization is able to find compactly supported quasi-uniform partitions such that the maximum diameter in (4) scales as $N_{\mathrm{part}}^{-1/d_{\mathcal{M}}}$, the training loss will exhibit a scaling of $N_{\mathrm{part}}^{-(m+1)/d_{\mathcal{M}}}$.
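The scaling predicted by Theorem 1 is easy to check empirically in the idealized setting of indicator-function partitions, where the least-squares solve decouples per partition. The sketch below (our own illustration, not the paper's experiment) fits degree-$m$ polynomials on uniform intervals and confirms that halving the partition width shrinks the RMS error by roughly $2^{m+1}$.

```python
import numpy as np

def pou_ls_error(n_parts, m, f, n_pts=2000):
    """RMS training error when f on [0, 1] is fit with indicator-function
    partitions of width h = 1/n_parts and per-partition degree-m polynomials
    solved by least squares (the idealized 'broken polynomial' setting)."""
    x = np.linspace(0.0, 1.0, n_pts)
    y = f(x)
    err2 = 0.0
    for a in range(n_parts):
        mask = (x >= a / n_parts) & (x < (a + 1) / n_parts)
        if a == n_parts - 1:
            mask |= x == 1.0                   # include right endpoint
        V = np.vander(x[mask], m + 1)          # local polynomial basis
        c, *_ = np.linalg.lstsq(V, y[mask], rcond=None)
        err2 += np.sum((V @ c - y[mask]) ** 2)
    return np.sqrt(err2 / n_pts)

# Halving h should shrink the error by roughly 2^(m+1) = 4 for m = 1
e_coarse = pou_ls_error(8, 1, np.sin)
e_fine = pou_ls_error(16, 1, np.sin)
```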
The above analysis indicates the advantage of enforcing highly localized, compact support of the POU functions. However, in practice we found that for a POU parametrized by a shallow RBF network (described below), rapidly decaying but globally supported POU functions exhibited more consistent training results. This is likely due to a higher tolerance to initializations poorly placed with respect to the data locations. Similarly, when the POU is parametrized by a deep ResNet, we obtained good results using ReLU activation functions while training with a regularizer to promote localization (Algorithm 2). Thus, the properties playing a role in the above analysis – localization of the partition functions and distribution of their support over a latent manifold containing the data – are the motivation for the architectures and training strategies we consider below.

POU #1 – RBFNet: A shallow RBF network Broomhead and Lowe (1988); Billings and Zheng (1995) implementation of $\Phi$ is given by (1) and

$$\Phi_\alpha(x) = \frac{\exp\big( -\epsilon_\alpha^2 \, \| x - x_\alpha \|^2 \big)}{\sum_{\beta=1}^{N_{\mathrm{part}}} \exp\big( -\epsilon_\beta^2 \, \| x - x_\beta \|^2 \big)}. \quad (15)$$

Here, $x_\alpha$ denotes the RBF centers and $\epsilon_\alpha$ denotes the RBF shape parameters, both of which evolve during training. A measure of the localization of these functions can be taken to be the magnitude of $\epsilon_\alpha$. Such an architecture works well for the approximation of smooth functions, but the continuity of $\Phi_\alpha$ causes difficulty in the approximation of piecewise smooth functions.
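A minimal sketch of the normalized-RBF partition in the style of Eq. (15), written for multi-dimensional inputs; the function name and array layout are our own illustrative choices. By construction each row of the output is nonnegative and sums to one, i.e., the outputs form a partition of unity.

```python
import numpy as np

def rbf_pou(x, centers, eps):
    """POU #1 sketch: normalized Gaussian RBFs in the style of Eq. (15).

    x: (n, d) evaluation points; centers: (N_part, d); eps: (N_part,)
    shape parameters. Returns (n, N_part) partition values.
    """
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared dists
    w = np.exp(-(eps[None, :] ** 2) * d2)
    return w / w.sum(axis=1, keepdims=True)   # rows sum to 1
```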
POU #2 – ResNet: We compose a residual network architecture He et al. (2016) with a softmax layer to define (2). For the experiments considered we use a ReLU activation, allowing representation of functions with discontinuities in their first derivative.
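The softmax head is what turns an arbitrary vector-valued network output into a partition of unity; a self-contained sketch (the ResNet trunk itself is omitted, so the input here stands in for its final-layer features):

```python
import numpy as np

def softmax_pou(z):
    """POU #2 head (sketch): softmax over final-layer outputs z, with rows
    as points and columns as partitions. Each row of the result is
    nonnegative and sums to 1, hence a partition of unity."""
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```

With a ReLU trunk, the pre-softmax features are piecewise linear in the input, which is what makes this head well suited to targets with kinks.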
Initialization: All numerical experiments take data supported within the unit hypercube. To initialize the POU #1 architecture we use unit shape parameters and uniformly distributed centers. We initialize POU #2 with the box initialization Cyr et al. (2019). We found that these initializations provide an initial set of partitions sufficiently “well-distributed” throughout the domain for successful training.

Fast optimizer
The least-squares structure of (3) allows application of the least-squares gradient descent (LSGD) block coordinate descent strategy Cyr et al. (2019). At each epoch, one may hold the hidden parameters $\xi$ fixed to obtain a least squares problem for the optimal polynomial coefficients $c$. The step may be concluded by then taking a step of gradient descent for the POU parameters, thereby evolving the partitions along the manifold corresponding to optimal representation of the data; for details see Cyr et al. (2019).

Algorithm 1 specifies the application of LSGD to Eqn. (3). We will demonstrate that while effective, several of the learned partition functions may “collapse” to near-zero values everywhere. To remedy this, we will also consider a pre-training step in Algorithm 2, which adds a regularizer to the polynomial coefficients. The intuition behind this is that a given partition $\Phi_\alpha$ regresses data using an element of the form $\Phi_\alpha \, p_\alpha$. If $\Phi_\alpha$ is scaled by a small $\delta$, the LSGD solver may pick up a scaling $1/\delta$ for $p_\alpha$ and achieve the same approximation. Limiting the coefficients thus indirectly penalizes this mode of partition function collapse, promoting more quasi-uniform partitions of space.
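One LSGD iteration can be sketched as follows for a 1-d RBF POUnet. This is a minimal illustration, not the paper's Algorithm 1: a finite-difference gradient stands in for the backpropagated Adam step, and only the centers are updated.

```python
import numpy as np

def lsgd_step(x, y, centers, shapes, degree, lr=1e-2, fd=1e-6):
    """One LSGD iteration (sketch): (1) exact least-squares solve for the
    polynomial coefficients with the partition fixed, then (2) a
    finite-difference gradient-descent step on the RBF centers with the
    coefficients fixed. Returns (new_centers, coeffs, ls_loss)."""
    powers = x[:, None] ** np.arange(degree + 1)[None, :]

    def design(c):
        # Normalized-Gaussian partition functions, Eq. (15)-style
        w = np.exp(-(shapes[None, :] * (x[:, None] - c[None, :])) ** 2)
        phi = w / w.sum(axis=1, keepdims=True)
        # Features linear in the coefficients: phi_a(x) * x^b
        return np.hstack([phi[:, a:a + 1] * powers for a in range(len(c))])

    # Step 1: globally optimal coefficients for the current partition
    A = design(centers)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

    def loss(c):
        r = design(c) @ coeffs - y
        return float((r ** 2).mean())

    # Step 2: gradient descent on the partition parameters (centers only)
    base = loss(centers)
    grad = np.array([(loss(centers + fd * np.eye(len(centers))[i]) - base) / fd
                     for i in range(len(centers))])
    return centers - lr * grad, coeffs, base
```

Because the coefficient solve is globally optimal at every iteration, the data are always optimally represented for the current partition, which is the key property of the alternating strategy.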
Numerical experiments
In this section, we assess the performance of POUnets with the two POU architectures. We implement the proposed algorithms in Python, construct POUnets using TensorFlow 2.3.0 Abadi et al. (2016), and employ the NumPy and SciPy packages for generating training data and solving the least squares problems. For all considered neural networks, training is performed by batch gradient descent using the Adam optimizer Kingma and Ba (2014). For the set of $d$-variate polynomials $V$, we choose truncated Taylor polynomials of maximal degree $m$, providing $\binom{m+d}{d}$ polynomial coefficients per partition. The coefficients are initialized via sampling from the unit normal distribution.
Smooth functions
We consider an analytic function as our first benchmark, specifically a sine function defined on a cross-shaped one-dimensional manifold embedded in a two-dimensional domain.
We test RBFNets for a varying number of partitions $N_{\mathrm{part}}$ and maximal polynomial degrees $m$. For training, we collect data by uniformly sampling 501 points from the interval [-1,1] along each branch, i.e., 1001 pairs in total after removing the duplicate point at the origin. We initialize the centers of the RBF basis functions by sampling uniformly from the domain and initialize the shape parameters as ones. We then train the RBFNets using the LSGD method (Algorithm 1) with Adam. The maximum number of epochs is set to 100, and we choose the centers and shape parameters that yield the best training loss during the specified epochs.
Figure 1(a) reports the relative errors of approximants produced by POUnets for varying $N_{\mathrm{part}}$ and varying $m$. The results are obtained from 10 independent runs for each value of $N_{\mathrm{part}}$ and $m$, and are reported with a single log-normal standard deviation. Algebraic convergence is observed, of order increasing with the polynomial degree, before saturating. We conjecture this saturation is due to an eventual loss of precision caused by the ill-conditioning of the least squares matrix; however, we leave a formal study and treatment for future work.

In comparison to the performance achieved by POUnets, we assess the performance of standard MLPs with varying depth {4, 8, 12, 16, 20} and width {8, 16, 32, 64, 128}. The standard MLPs are trained using Adam with a maximum of 1000 epochs. As opposed to the convergence behavior of the RBFNet-based POUnets, the standard MLPs briefly exhibit roughly first-order convergence before stagnating.
Piecewise Smooth Functions
We next consider piecewise linear and piecewise quadratic functions: triangle waves with varying frequencies and their quadratic variants, where
(16) 
We study the introduction of increasingly many discontinuities by increasing the frequency. Reproduction of such sawtooth functions by ReLU networks, via both wide networks He et al. (2018) and very deep networks Telgarsky (2016), can be achieved theoretically via construction of a carefully chosen architecture and weights/biases, but to our knowledge has not been achieved via standard training.
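For concreteness, one plausible form of these target families can be sketched as follows; the exact normalization used in Eq. (16) may differ, so treat this as an illustrative stand-in.

```python
import numpy as np

def triangle_wave(x, k):
    """A frequency-k triangle wave on [0, 1] (hypothetical normalization):
    piecewise linear with 2k pieces, peaking at the midpoint of each period."""
    t = (k * np.asarray(x, dtype=float)) % 1.0
    return 1.0 - 2.0 * np.abs(t - 0.5)

def quadratic_wave(x, k):
    """Piecewise-quadratic variant, here obtained by squaring the hat."""
    return triangle_wave(x, k) ** 2
```

Increasing `k` multiplies the number of derivative discontinuities, which is the difficulty knob in the experiments below.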
The smoothness possessed by the RBFNet partition functions (POU #1) precludes rapidly convergent approximation of piecewise smooth functions, and we instead employ ResNets (POU #2) for $\Phi$ in this case, as the piecewise linear nature of such partition functions is better suited. As a baseline for the performance comparison, we apply regression of the same data over a mean square error loss using standard ResNets. The standard ResNets share the same architectural design and hyperparameters as the ResNets used to parametrize the POU functions in the POUnets. The only exception is the output layer: the standard ResNets produce a scalar-valued output, whereas the POU parametrization produces a vector-valued output, which is followed by the softmax activation.
We consider five frequency parameters for both target functions, resulting in piecewise linear and quadratic functions with increasing numbers of pieces. Based on the number of pieces in the target function, we scale the width of the baseline neural networks and POUnets, while fixing the depth at 8; for POUnets the number of partitions is scaled accordingly. For POUnets, we choose the maximal degree of polynomials to match the local polynomial degree of the piecewise linear and quadratic target functions, respectively.
Both neural networks are trained on the same data $\{(x_i, y_i)\}$, where the $x_i$ are uniformly sampled from [0,1] and the $y_i$ are the corresponding function values. The standard ResNets are trained using gradient descent and the POUnets are trained using the LSGD method (Algorithm 1). The maximum number of epochs is set to 2000 and we choose the neural network weights and biases that yield the best training loss during the specified epochs.
Figure 2 reports the relative errors of approximants produced by the standard ResNets and POUnets for varying frequency and width for the piecewise linear functions (left) and the piecewise quadratic functions (right). The statistics are obtained from five independent runs. Figure 2 shows that the POUnets outperform the standard ResNets in terms of approximation accuracy; specifically, the standard ResNets with the ReLU activation function fail to produce accurate approximations, while POUnets maintain low error even for large numbers of discontinuities.
Figure 3 illustrates snapshots of the target functions and the approximants produced by the two neural networks for a fixed frequency parameter. Figure 3 confirms the trends shown in Figure 2: the benefits of using POUnets are more pronounced in approximating functions with larger frequency and in approximating quadratic functions (and, potentially, higher-order polynomials). Figures 3(c)–3(f) clearly show that the POUnets accurately approximate the target functions with sub-1% errors, while the standard ResNets fail to produce accurate approximations.
Results of two-phase LSGD
Now we demonstrate the effectiveness of the two-phase LSGD method (Algorithm 2) in constructing partitions which are localized according to the features in the data (i.e., the breakpoints in the target function) and nearly disjoint, which we found leads to better approximation accuracy.
The first phase of the algorithm aims to construct such partitions. To this end, we limit the expressibility of the polynomial regression by regularizing the Frobenius norm of the coefficients in the least-squares solves. This prevents a small number of partitions from dominating while others collapse, per our discussion of the fast optimizer above. These “quasi-uniform” partitions are then used as the initial guess for the unregularized second phase. Finally, we study the qualitative differences in the resulting POU, particularly the ability of the network to learn a disjoint partition of space. We stress, however, that there is no particular “correct” form for the resulting POU; it is primarily of interest because it shows that the network may recover a traditional discontinuous polynomial approximation.
In the first phase, we employ a relatively larger learning rate in order to prevent the weights and biases from getting stuck in local minima. In the second phase, by contrast, we employ a relatively smaller learning rate to ensure that the training error decreases.
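The phase-1 coefficient solve described above amounts to a ridge-regularized least squares problem; a minimal sketch, with our own function name and the penalty applied to the stacked coefficient vector:

```python
import numpy as np

def ridge_coeffs(A, y, lam):
    """Phase-1 style solve (sketch): least squares with a Frobenius-norm
    penalty lam * ||c||^2 on the stacked polynomial coefficients, solving
    (A^T A + lam I) c = A^T y. The penalty discourages a few partitions
    from dominating the fit while others collapse."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)
```

With `lam = 0` this reduces to the unregularized phase-2 solve; increasing `lam` shrinks the coefficient norm, which is the mechanism that indirectly penalizes partition collapse.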
Triangular waves.
We demonstrate how the two-phase LSGD works in two example cases. The first example problem is one of the triangular waves considered above. We use the same POU network architecture described in the previous section (i.e., a ResNet with width 8 and depth 8) and the same initialization scheme (i.e., the box initialization). We set 0.1 and 0.05 as the initial learning rates for phase 1 and phase 2, respectively. Figure 4 (top) depicts both how partitions are constructed during phase 1 (Figures 4(a)–4(d)) and snapshots of the partitions and the approximant at the 1000th epoch of phase 2 (Figures 4(e) and 4(f)). The approximation error, measured in the relative norm, reaches its minimum after 12000 epochs in phase 2.
Next we consider a triangular wave with a different frequency parameter. We use a POU network constructed as a ResNet with width 8 and depth 10, and use the box initialization. We set 0.05 and 0.01 as the initial learning rates for phase 1 and phase 2, respectively. Figure 4 (bottom) again depicts how partitions evolve during phase 1 and the resulting approximants. Figures 4(g)–4(j) show that the two-phase LSGD constructs partitions that are disjoint and localized according to the features during the first phase.
Quadratic waves.
Finally, we approximate the piecewise quadratic waves while employing the same network architectures used for approximating the triangular waves. For the first case the learning rates are set as 0.5 and 0.25 for phase 1 and phase 2, with the other parameter settings as in the previous experiment. Again, disjoint partitions are constructed in phase 1, and an accurate approximation is produced in phase 2 (Figures 5(a)–5(f)). Moreover, we observe that the partitions are further refined during phase 2. For the second case we again employ the architecture used for the corresponding triangular wave (i.e., a ResNet with width 8 and depth 10) and the same hyperparameters (i.e., 0.05 and 0.01 for the learning rates), with the exception of the weight for the regularization. Figures 5(g)–5(k) illustrate that the phase-1 LSGD constructs disjoint supports for the partitions, but takes many more epochs to do so. Again, the two-stage LSGD produces an accurate approximation (Figure 5(l)).
Discussion.
The results of this section have demonstrated that it is important to apply a good regularizer during a pre-training stage to obtain approximately disjoint, piecewise-constant partitions. For the piecewise polynomial functions considered here, such partitions allowed in some cases recovery to near-machine precision. Given the abstract error analysis, it is clear that such localized partitions, which adapt to the features of the target function, act similarly to traditional approximation. There are several challenges regarding this approach, however: the strong regularization during phase 1 requires a large number of training epochs, and we were unable to obtain a single set of hyperparameters providing such clean partitions across all cases. This suggests two potential areas of future work. First, an improved means of regularizing during pre-training that does not compromise accuracy may allow faster training; deep learning strategies for parametrizing charts, as in Schonsheck, Chen, and Lai (2019), may be useful for this. Second, it may be fruitful to understand the approximation error under assumptions less restrictive than requiring the partitions to be compactly supported or highly localized. We have been guided by the idea that polynomials defined on indicator-function partitions reproduce the constructions used by the finite element community, but a less restrictive paradigm combined with an effective learning strategy may prove more flexible for general datasets.
Conclusions
Borrowing ideas from numerical analysis, we have demonstrated a novel architecture and optimization strategy whose computational complexity need not scale exponentially with the ambient dimension, and which provides high-order convergence for smooth functions and errors consistently under 1% for piecewise smooth problems. Such an architecture has the potential to provide DNN methods for solving high-dimensional PDEs that converge in a manner competitive with traditional finite element spaces.
Acknowledgements
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003530. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. SAND Number: SAND2020-6022 J.
The work of M. Gulian, R. Patel, and N. Trask is supported by the U.S. Department of Energy, Office of Advanced Scientific Computing Research under the Collaboratory on Mathematics and Physics-Informed Learning Machines for Multiscale and Multiphysics Problems (PhILMs) project. E. C. Cyr and N. Trask are supported by the Department of Energy early career program. M. Gulian is supported by the John von Neumann fellowship at Sandia National Laboratories.
References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.
Adcock, B.; and Dexter, N. 2020. The gap between theory and practice in function approximation with deep neural networks. arXiv preprint arXiv:2001.07523.
Bach, F. 2017. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research 18(1): 629–681.
Beck, C.; Jentzen, A.; and Kuckuck, B. 2019. Full error analysis for the training of deep neural networks. arXiv preprint arXiv:1910.00121.
Bengio, S.; and Bengio, Y. 2000. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks 11(3): 550–557.
Billings, S. A.; and Zheng, G. L. 1995. Radial basis function network configuration using genetic algorithms. Neural Networks 8(6): 877–890.
Broomhead, D. S.; and Lowe, D. 1988. Radial basis functions, multivariable functional interpolation and adaptive networks. Technical report, Royal Signals and Radar Establishment Malvern (United Kingdom).
Chollet, F. 2017. Deep Learning with Python. Manning Publications Company.
Cyr, E. C.; Gulian, M. A.; Patel, R. G.; Perego, M.; and Trask, N. A. 2019. Robust training and initialization of deep neural networks: An adaptive basis viewpoint. arXiv preprint arXiv:1912.04862.
Daubechies, I.; DeVore, R.; Foucart, S.; Hanin, B.; and Petrova, G. 2019. Nonlinear approximation and (deep) ReLU networks. arXiv preprint arXiv:1905.02199.
Fokina, D.; and Oseledets, I. 2019. Growing axons: greedy learning of neural networks with application to function approximation. arXiv preprint arXiv:1910.12686.
Fries, T.-P.; and Belytschko, T. 2010. The extended/generalized finite element method: an overview of the method and its applications. International Journal for Numerical Methods in Engineering 84(3): 253–304.
Geist, M.; Petersen, P.; Raslan, M.; Schneider, R.; and Kutyniok, G. 2020. Numerical solution of the parametric diffusion equation by deep neural networks. arXiv preprint arXiv:2004.12131.
Han, J.; Jentzen, A.; and Weinan, E. 2018. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115(34): 8505–8510.
He, J.; Li, L.; Xu, J.; and Zheng, C. 2018. ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hebey, E. 2000. Nonlinear Analysis on Manifolds: Sobolev Spaces and Inequalities, volume 5. American Mathematical Soc.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Opschoor, J. A.; Petersen, P.; and Schwab, C. 2019. Deep ReLU networks and high-order finite element methods. SAM, ETH Zürich.
Patel, R. G.; Trask, N. A.; Gulian, M. A.; and Cyr, E. C. 2020. A block coordinate descent optimizer for classification problems exploiting convexity. arXiv preprint arXiv:2006.10123.
Schonsheck, S.; Chen, J.; and Lai, R. 2019. Chart Auto-Encoders for Manifold Structured Data. arXiv preprint arXiv:1912.10094.
Spivak, M. 2018. Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. CRC Press.
Strouboulis, T.; Copps, K.; and Babuška, I. 2001. The generalized finite element method. Computer Methods in Applied Mechanics and Engineering 190(32–33): 4081–4193.
Telgarsky, M. 2016. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485.
Wang, S.; Teng, Y.; and Perdikaris, P. 2020. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv preprint arXiv:2001.04536.
Wendland, H. 2002. Fast evaluation of radial basis functions: Methods based on partition of unity. In Approximation Theory X: Wavelets, Splines, and Applications. Citeseer.
Yarotsky, D. 2017. Error bounds for approximations with deep ReLU networks. Neural Networks 94: 103–114.
Yarotsky, D. 2018. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620.