1 Introduction
The emergence of Physics-Informed Neural Networks (PINNs) (Raissi2017) has sparked considerable interest in domains that are regularly confronted with problems in the low-data regime. By leveraging well-known physical laws and incorporating them as an implicit prior into the deep-learning pipeline, PINNs were shown to require little to no data in order to approximate partial differential equations (PDEs) of varying complexity (Raissi2017; hongweicollocationbendinganalysis).
We consider the case where PINNs are used to find the unknown function underlying a parameterised PDE. PDEs generally consist of a governing equation and a set of boundary as well as initial conditions. When trained jointly, i.e. as a multi-objective optimisation (MOO), these equations form a set of objective functions that drive the model to approximate a function satisfying the PDE. Several established and well-studied numerical methods already exist for addressing this problem, such as the Finite Element Method (FEM), the Finite Difference Method, or Wavelet and Laplace Transform Methods (bathe1996finite; hughes2012finite; smith1985a; Grossmann2007NumericalTO). However, PINNs present a differentiable, mesh-free approach and avoid the curse of dimensionality (grohs2018proof; poggio2017and). They could therefore prove useful in several engineering applications, such as inversion and surrogate modelling in solid mechanics (haghighat2021physics), design optimisation (martins_ning_2021), or structural health monitoring and system identification (yuan2020machine). PINNs have also found rich applications in computational fluid mechanics and dynamics, e.g. for surrogate modelling of numerically expensive fluid flow simulations (xiang2021self), identification of hidden quantities of interest (velocity, pressure) from spatio-temporal visualisations of a passive scalar (dye or smoke) (raissi2020hidden), or in an inverse heat transfer setting for flow past a cylinder without thermal boundaries (cai2020heat). However, before PINNs can find their application in practice, further research is necessary to tackle current failure modes (wang2020; mcclenny2020a), one of which is the issue of gradient pathologies arising from imbalanced loss terms (wanggradientpathologies).
Since the various terms in the objective function stem from physical laws, they are naturally bound to units of measurement that can vary significantly in magnitude. Consequently, the signal strengths of backpropagated gradients may differ from term to term and lead to pathologies that were shown to impede proper training and cause imbalanced solutions (wanggradientpathologies), hence posing challenges to global optimisation methods such as Adam, Stochastic Gradient Descent (SGD), or L-BFGS (zhang1801a; adam; theodoridis2015a; kendall2018a). As a countermeasure, every individual term may be scaled by a factor in order to balance its contribution to the total gradient. However, manual tuning of these scaling factors requires laborious grid search and becomes intractable as the number of terms grows, due to the sensitivity and interdependence of these hyperparameters.
This work investigates different algorithms that aim at adaptively balancing the contributions of multiple terms and their gradients in the loss function by selecting optimal scaling factors, in order to improve the approximation capabilities of PINNs. To this end, we compare the effectiveness of Learning Rate Annealing (LR Annealing) (wanggradientpathologies), proposed in the context of PINNs, to two approaches originating from Computer Vision applications: GradNorm (gradnorm) and SoftAdapt (softadapt). In addition, we derive and present our own variation of an adaptive loss scaling technique, ReLoBRaLo (Relative Loss Balancing with Random Lookback), which we found to be more effective at similar efficiency compared to the state of the art, by testing the algorithms on various benchmark problems for PINNs in the forward and inverse setting: the Helmholtz, Burgers and Kirchhoff PDEs. In a future paper (kraufbischof_relobralo) we show that, besides the application to scaling the loss in PINNs, the proposed ReLoBRaLo approach generalises to multi-task penalty problems (as defined by (coello2000use)) and provides an effective, self-adaptive approach for optimising the penalty factors.
This paper is organised as follows: we first provide a short introduction to the problem as well as the state of the art of PINNs in sec. 2. Further methodical background on multi-objective optimisation (MOO), the framework of Physics-Informed Neural Networks (PINNs) and loss balancing for PINN training is presented in sec. 3. In sec. 4 we introduce ReLoBRaLo as a novel self-adaptive loss balancing method. Sec. 6 reports numerical results of the developed approach against state-of-the-art methods for several examples in the forward and inverse setting: the Burgers, Kirchhoff and Helmholtz PDEs. Sec. 7 presents results of ablation studies as well as a discussion of findings, drawing further conclusions on ReLoBRaLo and its hyperparameter settings across all examples of this paper. Finally, a summary together with an outlook is given in sec. 8. All code produced within this publication is freely available and open access at: https://github.com/rbischof/relative_balancing
2 Physics-Informed Neural Networks (PINNs)
This section reviews basic Physics-Informed Neural Network (PINN) concepts and recent developments.
2.1 Problem Statement and Basic Concept
Consider the following abstract parameterised and nonlinear PDE problem:
(1) $u_t(x, t) + \mathcal{N}[u(x, t); \lambda] = 0, \quad x \in \Omega,\; t \in [0, T]$
$u(x, 0) = u_0(x), \quad x \in \Omega$
$u(x, t) = g(x, t), \quad x \in \partial\Omega,\; t \in [0, T]$
where $x$ is the spatial coordinate and $t$ is the time; $\mathcal{N}[\cdot; \lambda]$ denotes the residual of the PDE, containing the differential operators; $\lambda$ are the PDE parameters; $u(x, t)$ is the solution of the PDE with initial condition $u_0(x)$ and boundary condition $g(x, t)$ (which can be Dirichlet, Neumann or mixed); $\Omega$ and $\partial\Omega$ represent the spatial domain resp. its boundary. A special example considered in this paper is the Burgers equation (given in eq. 12), with the viscosity coefficient $\nu$ as PDE parameter $\lambda$. This paper is concerned with solving forward as well as inverse problems from different fields of application. For the forward problem, solutions $u(x, t)$ of PDEs are to be inferred with fixed parameters $\lambda$, while in the inverse problem setting, $\lambda$ is unknown and has to be learned from observed data together with the PDE solution.
Sticking with the initial "vanilla" implementation of PINNs (Raissi2017), a fully-connected feedforward neural network (FCNN) is used to approximate the function $u(x, t)$ which solves the PDE. A FCNN consists of multiple hidden layers with trainable parameters (weights and biases, jointly denoted by $\theta$) and takes as inputs the space and time coordinates $(x, t)$, cf. fig. 1. The losses are then defined as follows:
(2) $\mathcal{L}_{PDE}(\theta) = \dfrac{1}{N_{PDE}} \sum_{i=1}^{N_{PDE}} \big| \hat{u}_t(x_i, t_i) + \mathcal{N}[\hat{u}(x_i, t_i); \lambda] \big|^2$
$\mathcal{L}_{BC}(\theta) = \dfrac{1}{N_{BC}} \sum_{i=1}^{N_{BC}} \big| \hat{u}(x_i, t_i) - g(x_i, t_i) \big|^2$
$\mathcal{L}_{IC}(\theta) = \dfrac{1}{N_{IC}} \sum_{i=1}^{N_{IC}} \big| \hat{u}(x_i, 0) - u_0(x_i) \big|^2$
$\mathcal{L}_{data}(\theta) = \dfrac{1}{N_{data}} \sum_{i=1}^{N_{data}} \big| \hat{u}(x_i, t_i) - u^*(x_i, t_i) \big|^2$
where $\{(x_i, t_i)\}$ denotes a set of collocation points sampled on the physical domain for the PDE residual, on the boundary for the boundary conditions (BC), at $t = 0$ for the initial conditions (IC), and a set of measurement coordinates (data); the function $u^*$ maps those coordinates to the measurements; $\hat{u} = f_\theta(x, t)$ is the output of the neural network. PINNs are generally trained using the L2-norm (mean squared error / MSE) on uniformly sampled collocation points defined as a data set prior to training. Note that the number of points (denoted by $N$ in eq. 2) may vary between loss terms.
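To make the individual MSE objectives concrete, they can be sketched in plain NumPy for a Burgers-type setup with $u(x, 0) = -\sin(\pi x)$ and homogeneous Dirichlet boundaries; the candidate function `u_hat` and the point counts are hypothetical stand-ins for the network output and the sampled collocation sets:

```python
import numpy as np

# Hypothetical stand-in for the network output u(x, t); in a real PINN this
# would be the forward pass of the FCNN f_theta(x, t).
def u_hat(x, t):
    return -np.sin(np.pi * x) * np.exp(-t)

# Collocation sets for the initial condition u(x, 0) = -sin(pi*x) and the
# Dirichlet boundary condition u(-1, t) = u(1, t) = 0 (Burgers-type example).
# Note that each loss term may use a different number of points N.
x_ic = np.linspace(-1.0, 1.0, 100)   # N_IC = 100
t_bc = np.linspace(0.0, 1.0, 50)     # N_BC = 50

# Each objective is the mean squared error over its own collocation set.
loss_ic  = np.mean((u_hat(x_ic, 0.0) - (-np.sin(np.pi * x_ic))) ** 2)
loss_bc1 = np.mean(u_hat(-1.0, t_bc) ** 2)
loss_bc2 = np.mean(u_hat(+1.0, t_bc) ** 2)
```

Since the chosen `u_hat` happens to satisfy the initial and boundary conditions exactly, these two terms vanish; the PDE residual term would not.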
The four objectives in eq. 2 are trained jointly and hence fall into the class of multi-objective optimisation (MOO) (cf. eq. 4):
(3) $\mathcal{L}(\theta) = \mathcal{L}_{PDE}(\theta) + \sum_j \mathcal{L}_{BC_j}(\theta) + \sum_j \mathcal{L}_{IC_j}(\theta) + \mathcal{L}_{data}(\theta)$
where the individual terms can be interpreted the following way:

the first term penalises the residual of the governing equations (PDEs), included in both the forward and inverse problem.

the following terms enforce the boundary conditions (BCs), included only in the forward problem.

the following terms enforce the initial conditions (ICs), included only in the forward problem.

the last term makes the network approximate the measurements, included in both the forward (albeit not strictly necessary) and the inverse problem.
A common approach to handling MOO is linear scalarisation, described in more detail in sec. 3.1.
2.2 State of the Art and Related Work
Using neural networks to approximate the solutions of Ordinary Differential Equations (ODEs) and PDEs has been intensively studied over the past decade. Initially, (lagaris1998artificial; lagaris_2000) trained neural networks to solve ODEs and PDEs on a predefined set of grid points, while (Sirignano2018) proposed a method for solving high-dimensional PDEs through approximation of the solution by a neural network, emphasising training efficiency by incorporating mini-batch sampling in high-dimensional settings, in contrast to computationally intractable finite mesh-based schemes. (Raissi2018; wu_2018; hochreiter_2003; attention_is_all_you_need; Sirignano2018; jagtap2020a; SUN2020) recently reported Physics-Informed Neural Networks (PINNs) to be a notable method for solving a variety of PDEs or PDE-based systems using observed data in a supervised regression manner while satisfying the physical properties specified by nonlinear PDEs. In (Raissi2018), the authors provided empirical justification by numerical simulations for a variety of nonlinear PDEs, including the Navier-Stokes and Burgers equations. (shin_2020) provided a first theoretical justification for PINNs by demonstrating convergence for linear elliptic and parabolic PDEs in the L2 sense. Furthermore, a significant advantage of PINNs is that the pipeline for solving the forward problem can be turned into data-driven PDE discovery, also known as the inverse problem, with only minor adaptations of the code.
However, PINN training efficiency, convergence and accuracy remain serious challenges (raissi2019physics; xiang2021self). Current research may be ordered into four main approaches: modifying the structure of the NN, divide-and-conquer/domain decomposition, parameter initialisation, and loss balancing.
Only a few studies have addressed the acceleration of convergence by modifying the structure of PINNs. (jagtap2020b; jagtap2020a) introduced parameters that scale the input to the activation functions and are updated alongside the network's parameters through gradient descent. The authors showed that the adaptive activation function significantly accelerated convergence and also improved solution accuracy. (kim2020) presented a fast and accurate PINN reduced-order model (ROM) with a nonlinear manifold solution representation, where the NN structure includes an encoder and a decoder part. Furthermore, a shallow masked encoder was trained using data from full-order model simulations in order to use the trained decoder as a representation of the nonlinear manifold solution. (peng2020) proposed dictionary-based PINNs to store and retrieve features and speed up convergence by merging prior information into the structure of NNs.
Other research focused on decomposing the computational domain in order to accelerate convergence. (jagtap2020c; jagtap2020d) proposed conservative PINNs and extended PINNs that decompose the computational domain into several discrete subdomains, each one solved independently using a separate, shallow PINN. Inspired by the work of (jagtap2020c; jagtap2020d), (shukla2021) derived and investigated a distributed training framework for PINNs that uses domain decomposition methods in both space and time-space. To accelerate convergence, the distributed framework combines the benefits of conservative and extended PINNs. The time-space domain may become very large when solving PDEs with long time integration, causing the training cost of NNs to become extremely expensive. To that end, (meng2020) proposed a parareal PINN to address this long-standing issue. The authors decomposed the long-time domain into many discrete short-time domains using a fast coarse-grained solver. Training multiple PINNs with many small data sets was much faster than training a single PINN with a large data set. For PDEs with long-time integration, the parareal PINN achieved a significant speedup.
(kharazmi2021) introduced hp-variational PINNs to divide the computational space into trial and test spaces by combining domain decomposition and projection onto high-order polynomials.
In most works, researchers resort to the Xavier initialisation (glorot_2010) for selecting the PINN's initial weights and biases. The effects of using more refined initialisation procedures have recently been gaining attention, with (liu2021novel) showing that a good initialisation can provide PINNs with a head start, allowing them to achieve fast convergence and improved accuracy. Transfer learning for PINNs was introduced by (CHAKRABORTY2021) and (GOSWAMI2020) to initialise PINNs for dealing with multi-fidelity problems and brittle fracture problems, respectively. After their success in other fields of Deep Learning, meta-learning algorithms have also been implemented in the context of PINNs (Rajeswaran2019; Smith2009; Finn2017a; Finn2018), with Model-Agnostic Meta-Learning (MAML) being amongst the most popular ones (Finn2017b). Its second-order objective is to find an initialisation that is in itself suboptimal, but from which the network requires only a few labeled training samples and optimisation steps in order to specialise on a task and achieve high accuracy (few-shot learning). Subsequently, (Nichol2018) proposed the REPTILE algorithm, which turns the second-order optimisation of MAML into a first-order approximation and therefore requires significantly less computation and memory while achieving similar performance. (liu2021novel) applied the REPTILE algorithm to PINNs by regarding modifications of PDE parameters as separate tasks. The resulting initialisation is such that the PINN converges in just a few optimisation steps for any choice of PDE parameters.
Using derivative information of the target function during training of a neural network was introduced by (Czarnecki2017) under the term Sobolev Training. Sobolev Training proved to be more efficient in many fields of application due to lower sample complexity compared to regular training. In (son2021sobolev), the concept of Sobolev Training is enhanced in the strict mathematical sense using Sobolev norms in the loss functions of neural networks for solving PDEs. It was found that these novel Sobolev loss functions lead to significantly faster convergence on the investigated examples compared to traditional L2 loss functions.
NNs were used in a plain as well as a Sobolev Training manner for constitutive modelling in (vlassis2021sobolev; vlassis2020geometric; krausphdthesis; kraus2020artificial), where it was shown that mechanical relations can be cast as Sobolev training to successfully encapsulate several aspects of constitutive behaviour, such as strain-stress relationships arising from derivatives of a Helmholtz potential in hyperelasticity.
(colby2020) observed that a weighted scalarisation of the multiple loss functions, defined by the sampled data and physical laws for PINN training, plays a significant role for convergence. (wanggradientpathologies) recently published a learning rate annealing algorithm that employs backpropagated gradient statistics during training in order to adaptively balance the terms' contributions to the final loss. (wang2020) investigated the issue of vanishing and exploding gradients that currently limits the applicability of PINNs. To that end, the authors introduced a Neural Tangent Kernel (NTK) perspective, which appropriately assigns weights to each loss term at a subtle performance improvement, in order to better understand the training process of PINNs. (shin_2020) developed the Lipschitz regularised loss for solving linear second-order elliptic and parabolic PDEs. (mcclenny2020a) proposed a method for updating the adaptation weights in the loss function in relation to the network parameters; such self-adaptive PINNs are forced to meet all physical constraints as equitably as possible. Although many studies have been conducted to confirm the effects of loss functions on generalisation performance, the competitive relationship between the physical objectives is seldom taken into account. According to (elhamod2020), tuning the competing physics-guided (PG) loss functions at various neural network learning stages is crucial. To that end, two approaches for selecting the trade-off weights of loss terms with different characteristics were proposed: annealing and cold starting, which affect the initial or final epochs, respectively. A drawback of that method, however, is that selecting the appropriate type of sigmoid function, which primarily influences accuracy, is still time-consuming.
As a result of the literature review, self-adaptive PINN training can be regarded as a PDE-constrained optimisation problem. This paper pays careful attention to competitiveness and adaptability and seeks inspiration from loss balancing techniques proposed across several fields of machine learning. Investigations on performance, convergence and accuracy are conducted for the forward and inverse problem settings for several PDEs from engineering and the natural sciences.
3 Methodology
This section introduces relevant methodical background on multi-objective optimisation (MOO), adaptive training and hyperparameter tuning.
3.1 Multi-Objective Optimisation
Multi-objective optimisation (MOO) is concerned with simultaneously optimising a set of $k$ potentially conflicting objectives (multitask_caruana; jones2002multi):
(4) $\min_{\theta} \big( \mathcal{L}_1(\theta),\; \mathcal{L}_2(\theta),\; \ldots,\; \mathcal{L}_k(\theta) \big)$
Many problems in engineering, the natural sciences or economics can be formulated as multi-objective optimisation problems and generally require trade-offs to simultaneously satisfy all objectives to a certain degree (moo_engineering). The solution of MOO models is usually expressed as a set of Pareto optima, representing these optimal trade-offs between the given criteria according to the following definitions (pareto_multi_task):
Definition 1
A solution $\theta^{(1)}$ Pareto dominates a solution $\theta^{(2)}$ (denoted $\theta^{(1)} \prec \theta^{(2)}$) if and only if $\mathcal{L}_i(\theta^{(1)}) \le \mathcal{L}_i(\theta^{(2)})\ \forall i \in \{1, \ldots, k\}$ and $\exists\, j \in \{1, \ldots, k\}$ such that $\mathcal{L}_j(\theta^{(1)}) < \mathcal{L}_j(\theta^{(2)})$.
Definition 2
A solution $\theta^{*}$ is said to be Pareto optimal if there exists no other solution $\theta$ with $\theta \prec \theta^{*}$. The set of all Pareto optimal points is called the Pareto set, and the image of the Pareto set in the loss space is called the Pareto front.
A multi-objective optimisation problem can be turned into a single objective through linear scalarisation:
(5) $\min_{\theta} \sum_{i=1}^{k} \lambda_i\, \mathcal{L}_i(\theta), \qquad \lambda_i \ge 0$
In theory, a Pareto optimal solution is independent of the scalarisation (efficient_moo). However, when using neural networks for MOO, the solution space becomes highly non-convex. Thus, although neural networks are universal function approximators (Hornik1990), they are not guaranteed to find the optimal solution through first-order gradient-based optimisation. Scaling the loss space therefore provides the option of guiding the gradients into having an a priori deemed desirable property. However, manually finding optimal scalings $\lambda_i$ requires laborious grid search and becomes intractable as $k$ gets large. Furthermore, one might want to let the $\lambda_i$ evolve over time. This raises the need for an automated heuristic to dynamically choose the scalings $\lambda_i$.
3.2 Adaptive Loss Balancing Methods
This section reviews different methods aiming at balancing the various terms within a multi-objective optimisation. To this end, we compare the effectiveness of Learning Rate Annealing (wanggradientpathologies), proposed in the context of PINNs, as well as two approaches originating from Computer Vision applications: GradNorm (gradnorm) and SoftAdapt (softadapt). This forms the basis for deriving and presenting our own loss balancing method in sec. 4.
3.2.1 Learning Rate Annealing
(wanggradientpathologies) conducted a study on gradients in PINNs and identified pathologies that explain some failure modes. One pathology is gradient stiffness in the boundary conditions, caused by the imbalance amongst the different loss terms. As a remedy, it is proposed to adaptively scale the loss using gradient statistics, thus reducing the laborious manual tuning of these hyperparameters:
(6) $\hat{\lambda}_i^{(t)} = \dfrac{\max_{\theta} \big| \nabla_{\theta} \mathcal{L}_{PDE}(\theta) \big|}{\overline{\big| \nabla_{\theta} \mathcal{L}_i(\theta) \big|}}, \qquad \lambda_i^{(t)} = (1 - \alpha)\, \lambda_i^{(t-1)} + \alpha\, \hat{\lambda}_i^{(t)}$
where $\overline{| \nabla_{\theta} \mathcal{L}_i(\theta) |}$ is the mean of the gradient of term $i$ w.r.t. the parameters $\theta$; $\alpha$ is a hyperparameter with a value recommended by the authors.
With this method, whenever the maximum value of $|\nabla_{\theta} \mathcal{L}_{PDE}|$ grows considerably larger than the average gradient magnitude of a term $\mathcal{L}_i$, the scalings $\lambda_i$ correct for this discrepancy such that all gradients have similar magnitudes. Additionally, exponential decay is used in order to smoothen the balancing and avoid drastic changes of the loss space between optimisation steps.
This procedure has a few drawbacks. Its unboundedness potentially involves up- or downscaling of terms by several orders of magnitude. The upscaling in particular can cause problems similar to the effect of choosing a learning rate that is too large, potentially overshooting the objective. Furthermore, scaling all terms to the same magnitude throughout training can incite the network to optimise for the "low-hanging fruit": a term whose loss decreased considerably in the last optimisation step will see its contribution to the total gradient scaled back up to match the other terms. The network might therefore focus on the objectives that are easiest to optimise for.
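A minimal NumPy sketch of the scaling update, assuming the max/mean gradient-statistics ratio and moving-average form described above; the decay value and the mocked gradient statistics are assumptions for illustration:

```python
import numpy as np

def lr_annealing_update(lambdas, grad_pde, grads_i, alpha=0.9):
    """One Learning Rate Annealing step (sketch).

    lambdas  : current scaling factors for the non-PDE terms
    grad_pde : gradient vector of the PDE (residual) loss w.r.t. theta
    grads_i  : list of gradient vectors, one per remaining loss term
    alpha    : decay factor of the moving average (value is an assumption)
    """
    lam_hat = np.array([np.max(np.abs(grad_pde)) / np.mean(np.abs(g))
                        for g in grads_i])
    # Exponential decay smooths the balancing between optimisation steps.
    return alpha * np.asarray(lambdas) + (1.0 - alpha) * lam_hat

# Mocked gradient statistics: the BC gradient is ~100x weaker than the
# PDE gradient, so its scaling should be pushed above 1.
rng = np.random.default_rng(0)
g_pde = rng.normal(0.0, 1.0, 1000)
g_bc  = rng.normal(0.0, 0.01, 1000)

lam = lr_annealing_update([1.0], g_pde, [g_bc])
```

The unboundedness criticised above is visible here: the max/mean ratio can push a scaling up by several orders of magnitude in a single step.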
3.2.2 GradNorm
(gradnorm) takes a different approach and makes the scalings trainable. The updates on these trainable scalings are performed by a separate optimiser and chosen such that all terms improve at the same relative rate w.r.t. their initial loss. A term that has improved at a higher rate since the beginning of training compared to the other terms gets a weaker scaling until all terms have made the same relative progress. One could therefore argue that GradNorm weakly enforces each optimisation step to Pareto dominate (cf. definition 1) its predecessor. The loss for updating the scalings within GradNorm is computed as follows:
(7) $\mathcal{L}_{grad} = \sum_{i=1}^{k} \Big| G^{(i)}_{\theta}(t) - \bar{G}_{\theta}(t) \cdot \big[ r_i(t) \big]^{\alpha} \Big|$
where $G^{(i)}_{\theta}(t) = \| \nabla_{\theta}\, \lambda_i \mathcal{L}_i(t) \|_2$ is the norm of the gradient of the scaled loss of objective $i$ w.r.t. the network's parameters $\theta$; $\bar{G}_{\theta}(t)$ is the average of all gradient norms; $r_i(t)$ defines the rate at which term $i$ has improved so far; $\alpha$ is a hyperparameter representing the strength of the restoring force which pulls tasks back to a common training rate. Note that $\bar{G}_{\theta}(t) \cdot [r_i(t)]^{\alpha}$ is the desirable value that $G^{(i)}_{\theta}(t)$ should take on, so gradients must be prevented from flowing through this expression. The final loss for updating the network's parameters is then simply a linear scalarisation with the scalings that were previously updated:
(8) $\mathcal{L}(\theta) = \sum_{i=1}^{k} \lambda_i\, \mathcal{L}_i(\theta)$
This algorithm is fairly involved and, despite solving some of Learning Rate Annealing's issues, it still requires a separate backward pass for each task, which becomes prohibitively expensive as $k$ gets large. Furthermore, it relies on two separate optimisation rounds at each step: one for adapting the scalings $\lambda_i$ and another for updating the weights $\theta$. By means of eq. 4, GradNorm can thus be formulated as a scalarised MOO objective via:
(9) $\min_{\theta,\, \lambda} \Big( \sum_{i=1}^{k} \lambda_i\, \mathcal{L}_i(\theta) + \mathcal{L}_{grad}(\lambda) \Big)$
which in turn requires empirical hyperparameter tuning (learning rate, initialisation, etc.) to keep the system balanced: exactly the problem we are actually trying to solve through the use of adaptive loss balancing techniques.
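The auxiliary loss for the scalings can be sketched as follows, assuming the per-task gradient norms have already been computed; in practice this quantity is minimised w.r.t. the trainable scalings by a separate optimiser, with the target expression held constant (the stop-gradient mentioned above):

```python
import numpy as np

def gradnorm_loss(grad_norms, losses, init_losses, alpha=0.5):
    """GradNorm's auxiliary loss for the trainable scalings (sketch).

    grad_norms  : G_i = ||grad_theta(lambda_i * L_i)|| for each task i
    losses      : current loss values L_i(t)
    init_losses : initial loss values L_i(0)
    alpha       : restoring-force strength (value is an assumption)
    """
    grad_norms = np.asarray(grad_norms, dtype=float)
    g_bar = grad_norms.mean()                               # average gradient norm
    l_tilde = np.asarray(losses) / np.asarray(init_losses)  # inverse training rate
    r = l_tilde / l_tilde.mean()                            # relative inverse rate
    # The targets g_bar * r^alpha are treated as constants (no gradient flow).
    targets = g_bar * r ** alpha
    return np.abs(grad_norms - targets).sum()
```

When all tasks have progressed equally and their gradient norms match, the auxiliary loss is zero and the scalings receive no update.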
3.2.3 SoftAdapt
Similar to GradNorm, SoftAdapt (softadapt) leverages the ansatz of relative progress in order to balance the loss terms. However, the authors relax it by only considering the previous time step $t-1$. The scalings are then normalised using a softmax function:
(10) $\lambda_i^{(t)} = \dfrac{\exp\big( \mathcal{L}_i(t) / \mathcal{L}_i(t-1) \big)}{\sum_{j=1}^{k} \exp\big( \mathcal{L}_j(t) / \mathcal{L}_j(t-1) \big)}$
where $\mathcal{L}_i(t)$ is the loss of term $i$ at optimisation step $t$.
SoftAdapt also differs from GradNorm in the sense that it does not require gradient statistics and thus eliminates the need of performing separate backward passes for each objective. Instead, it makes use of the fact that the magnitudes of the gradients directly depend on the magnitudes of the terms in the loss function and therefore aims at achieving the balance solely through loss statistics. Of course, this is only true if the same loss function is used for every objective (e.g. the $L_2$ loss). However, this setting generalises to the vast majority of applications involving PINNs.
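A minimal sketch of the loss-statistics-only balancing, assuming the ratio-of-consecutive-losses form described above:

```python
import numpy as np

def softadapt_scalings(losses_t, losses_prev):
    """SoftAdapt-style scalings from loss statistics only (sketch).

    The ratio L_i(t) / L_i(t-1) measures each term's relative progress;
    a softmax normalises the resulting scalings.
    """
    ratios = np.asarray(losses_t) / np.asarray(losses_prev)
    ratios = ratios - ratios.max()          # numerically stable softmax
    w = np.exp(ratios)
    return w / w.sum()

# The term that made the least progress (ratio close to 1) gets the
# largest scaling; a term that improved a lot is scaled down.
lam = softadapt_scalings([1.0, 0.2], [1.0, 1.0])
```

No backward pass is needed: two scalar loss histories are enough, which is the efficiency advantage over GradNorm and Learning Rate Annealing noted above.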
4 Relative Loss Balancing with Random Lookback (ReLoBRaLo)
Drawing inspiration from existing balancing techniques as outlined in sec. 3.2, we propose a novel method and implementation for balancing the multiple terms in the scalarised MOO loss function for the training of PINNs, built upon the following ideas:

SoftAdapt’s balancing method is employed, using the rate of change between consecutive training steps and normalising them through a softmax function.

similarly to Learning Rate Annealing, the scalings are updated using an exponential decay in order to utilise loss statistics from more than just one training step in the past.

in addition, a random lookback (called saudade) is introduced into the exponential decay, which decides whether to use the previous steps' loss statistics to compute the scalings, or whether to look all the way back to the start of training.
(11) $\hat{\lambda}_i^{(t;\, t')} = m \cdot \dfrac{\exp\Big( \frac{\mathcal{L}_i(t)}{\mathcal{T}\, \mathcal{L}_i(t')} \Big)}{\sum_{j=1}^{m} \exp\Big( \frac{\mathcal{L}_j(t)}{\mathcal{T}\, \mathcal{L}_j(t')} \Big)}$
$\lambda_i^{(t)} = \alpha \Big( \rho\, \lambda_i^{(t-1)} + (1 - \rho)\, \hat{\lambda}_i^{(t;\, 0)} \Big) + (1 - \alpha)\, \hat{\lambda}_i^{(t;\, t-1)}$
where $m$ is the number of terms, $\mathcal{T}$ is a temperature, $\rho$ is a Bernoulli random variable, and the expected value $\mathbb{E}[\rho]$ should be chosen close to 1.
This method is an attempt at combining the best attributes of the aforementioned approaches into a new heuristic for scalarised MOO objective functions. First and foremost, it still weakly enforces every training step to Pareto dominate its predecessor, which is an important property in physical applications. It also avoids using gradient statistics, making it considerably more efficient than Learning Rate Annealing and GradNorm. Furthermore, it reduces drastic changes in the loss space by using exponential decay and can easily be adapted to use more or less information from past optimisation steps by tuning the hyperparameter $\alpha$. One can think of $\alpha$ as the model's ability to remember the past, with a high $\alpha$ giving a lot of weight to past loss statistics, while a lower $\alpha$ increases stochasticity. Setting $\alpha = 1$ results in each term's relative progress being computed w.r.t. the initial loss $\mathcal{L}_i(0)$. However, we found this to be too restrictive, since it causes the model to stop making progress as soon as one term reaches a local minimum. We chose values between 0.9 and 0.999 and report the effects of varying this hyperparameter in sec. 7. Note that $\alpha = 0$ makes Relative Loss Balancing equivalent to SoftAdapt (up to the constant factor $m$).
Choosing the value of $\alpha$ also requires a trade-off: a high value means the model will remember potential deteriorations of certain terms for longer and therefore leaves a longer time frame in which to compensate them. However, it also induces a latency between a term starting to deteriorate and the scalings reacting accordingly. We therefore study the effect of introducing the saudade Bernoulli random variable $\rho$ that causes the model to occasionally look back to the start of training. $\mathbb{E}[\rho] = 0$ is maximum saudade, as the lookback then always takes the loss value of the initial training step, while $\mathbb{E}[\rho] = 1$ corresponds to minimum saudade, taking into account only the last value from the history of the $i$-th scaling factor. Selecting $\mathbb{E}[\rho]$ somewhere between 0 and 1 allows setting a lower value for $\alpha$, thus making the model more flexible while still occasionally "reminding" it of the progress made since the start of training. Furthermore, the random lookback can give episodic new impulses and let the model escape local minima by changing the loss space, as well as inciting it to explore more of the parameter space. In case an impulse turns out to have a negative effect on the accuracy, one can still choose to roll back and reset the network's parameters to the previous state.
The last hyperparameter is the so-called temperature $\mathcal{T}$. Setting $\mathcal{T} \to \infty$ recalibrates the softmax to output uniform values and thus all $\hat{\lambda}_i = 1$. On the other hand, $\mathcal{T} \to 0$ essentially turns the softmax into an argmax function, with the scaling $m$ resulting for the term with the lowest relative progress and $0$ for all others.
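The complete update can be sketched in NumPy as follows (a minimal sketch consistent with the description above; the default hyperparameter values are assumptions for illustration):

```python
import numpy as np

def softmax_scalings(losses_t, losses_ref, temperature):
    """m * softmax(L_i(t) / (T * L_i(t'))): balancing scalings (sketch)."""
    m = len(losses_t)
    s = np.asarray(losses_t) / (temperature * np.asarray(losses_ref))
    s = s - s.max()                             # numerically stable softmax
    w = np.exp(s)
    return m * w / w.sum()

def relobralo_step(lam_prev, losses_t, losses_prev, losses_0,
                   alpha=0.999, temperature=1.0, expected_rho=0.99,
                   rng=np.random.default_rng()):
    """One ReLoBRaLo update of the scalings (sketch)."""
    rho = float(rng.random() < expected_rho)    # Bernoulli "saudade" draw
    lam_bal  = softmax_scalings(losses_t, losses_prev, temperature)
    lam_init = softmax_scalings(losses_t, losses_0, temperature)
    # Random lookback: either continue from the previous scalings (rho = 1)
    # or look all the way back to the start of training (rho = 0).
    lam_hist = rho * np.asarray(lam_prev) + (1.0 - rho) * lam_init
    return alpha * lam_hist + (1.0 - alpha) * lam_bal

# With equal losses everywhere, all scalings stay at 1 regardless of rho.
lam = relobralo_step([1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
```

The multiplication by `m` keeps the total scaling mass constant, so in the balanced case each term keeps a weight of exactly 1.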
A pedagogical example with interpretation of expected behaviours and how to draw conclusions from the histories of the scalings is given for Burgers’ equation in sec. 6.1.
5 Hyperparameter Tuning and Meta Learning
This paper uses grid search in combination with Bayesian Optimisation (BO) (gridsearch; bayesian_optimization_snoek; bayesian_opt_implementation) for hyperparameter tuning. This study uses hyperparameters defining the NN architecture (number of hidden layers and neurons per layer) and the training settings (learning rate, exponential decay rate $\alpha$ and expected saudade $\mathbb{E}[\rho]$). Tab. 1 contains the ranges and distributions for the hyperparameters. BO reduces the empiricism of selecting the PINN hyperparameters to learn an optimal NN structure. First, 20 random points in the hyperparameter space are sampled and evaluated. The model's performance at those points serves as evidence for fitting prior Gaussian Processes in order to estimate the unknown loss function w.r.t. the hyperparameters. Using Expected Improvement (EI) (expected_improvement), a further 80 points are then sampled and evaluated to refine the prior. This procedure provides an educated guess as to which are the optimal hyperparameters for the task at hand. Finally, we fine-tune the results by performing fine-grained grid search around the hyperparameters returned by BO.

Hyperparameter              Range              Log-scaling
Learning Rate               []                 yes
Layers                      [2, 4]             no
Neurons per Layer           [32, 512]          no
Exponential Decay Rate      [0, 1]             no
Temperature                 []                 yes
Expected Saudade            [0, 1]             no
Activation function         {tanh, sigmoid}    no
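The BO loop described above can be sketched in one dimension with NumPy; the RBF kernel, its length scale, the noise level and the toy objective (standing in for the validation loss of a trained PINN) are all assumptions for illustration:

```python
import numpy as np
from math import erf

def rbf_kernel(a, b, length=0.2):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Gaussian Process posterior mean and variance at the query points."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_train, x_query)
    Kss = rbf_kernel(x_query, x_query)
    K_inv = np.linalg.inv(K)
    mu = Ks.T @ K_inv @ y_train
    var = np.diag(Kss - Ks.T @ K_inv @ Ks)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, y_best):
    """Closed-form EI for minimisation under a Gaussian posterior."""
    sigma = np.sqrt(var)
    z = (y_best - mu) / sigma
    cdf = 0.5 * (1.0 + np.array([erf(v / np.sqrt(2.0)) for v in z]))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (y_best - mu) * cdf + sigma * pdf

# Toy "validation loss" as a function of one hyperparameter in [0, 1].
f = lambda x: (x - 0.3) ** 2

x_train = np.array([0.0, 0.5, 1.0])     # already-evaluated points
y_train = f(x_train)
x_query = np.linspace(0.0, 1.0, 101)    # candidate grid
mu, var = gp_posterior(x_train, y_train, x_query)
x_next = x_query[np.argmax(expected_improvement(mu, var, y_train.min()))]
```

EI proposes the next evaluation where the posterior combines a low predicted loss with a high uncertainty, here in the gap left of 0.5, which is exactly the exploration/exploitation trade-off used to refine the prior before the final grid search.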
6 Results
We test the different balancing techniques on three problems (Burgers' equation, Kirchhoff plate bending and the Helmholtz equation) originating from physics-informed deep learning, where the objective function consists of various terms of potentially considerably different magnitudes, and compare their performance as well as their computational efficiency. Training was done on networks of varying depth and width (acc. to tab. 1) and limited to a fixed budget of gradient descent (GD) steps using the Adam optimiser (adam) with an initial learning rate of 0.001. Additionally, we reduced the learning rate by a multiplicative factor of 0.1 whenever the optimisation stopped making progress for over 3'000 optimisation steps, and finally used early stopping in case of 9'000 steps without improvement. When addressing the inverse problem, i.e. approximating a set of measurements while subjecting the network to PDE constraints for finding unknown PDE parameters $\lambda$, we further investigated the payoff of using two separate optimisers: one for updating the network weights $\theta$, and a separate one for updating the PDE parameters $\lambda$. Further details on hyperparameter tuning and meta learning are given in secs. 5 and 7.
6.1 Burgers’ Equation
Burgers’ equation is a onedimensional NavierStokes equation used i.a. to model shock waves, gas dynamics or traffic flow (burgers_pde). Using Dirichlet boundary conditions, the PDE takes the following form:
(12)  
At first, we investigate the solution of the forward problem, where we fix the PDE parameter $\nu$. In order to find the latent function $u(x, t)$, we parameterise it with a neural network and turn the set of equations into a linear scalarised objective (cf. eq. 5) of Mean Squared Errors (MSE). This loss function weakly enforces the network to approximate the solution of the PDE:
(13) $\mathcal{L}_{PDE}(\theta) = \dfrac{1}{N_{PDE}} \sum_{i=1}^{N_{PDE}} \big| \hat{u}_t(x_i, t_i) + \hat{u}(x_i, t_i)\, \hat{u}_x(x_i, t_i) - \nu\, \hat{u}_{xx}(x_i, t_i) \big|^2$
$\mathcal{L}_{IC}(\theta) = \dfrac{1}{N_{IC}} \sum_{i=1}^{N_{IC}} \big| \hat{u}(x_i, 0) + \sin(\pi x_i) \big|^2$
$\mathcal{L}_{BC_1}(\theta) = \dfrac{1}{N_{BC}} \sum_{i=1}^{N_{BC}} \big| \hat{u}(-1, t_i) \big|^2, \qquad \mathcal{L}_{BC_2}(\theta) = \dfrac{1}{N_{BC}} \sum_{i=1}^{N_{BC}} \big| \hat{u}(1, t_i) \big|^2$
For the forward problem, PINN training induces the following loss function:
(14) $\mathcal{L}(\theta) = \lambda_{PDE}\, \mathcal{L}_{PDE}(\theta) + \lambda_{IC}\, \mathcal{L}_{IC}(\theta) + \lambda_{BC_1}\, \mathcal{L}_{BC_1}(\theta) + \lambda_{BC_2}\, \mathcal{L}_{BC_2}(\theta)$
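The assembly of this scalarised objective can be sketched in NumPy. The network output is mocked by an analytic candidate `u_hat`, the derivatives in the residual $u_t + u\,u_x - \nu\,u_{xx}$ are approximated by central finite differences (a real PINN would use automatic differentiation instead), and the value of $\nu$ is an assumption for illustration:

```python
import numpy as np

nu = 0.01 / np.pi          # viscosity; this exact value is an assumption

def u_hat(x, t):
    """Mocked network output; satisfies IC and BCs but not the PDE."""
    return -np.sin(np.pi * x) * np.exp(-t)

def pde_residual(x, t, h=1e-4):
    """Burgers residual u_t + u*u_x - nu*u_xx via central differences."""
    u    = u_hat(x, t)
    u_t  = (u_hat(x, t + h) - u_hat(x, t - h)) / (2 * h)
    u_x  = (u_hat(x + h, t) - u_hat(x - h, t)) / (2 * h)
    u_xx = (u_hat(x + h, t) - 2 * u + u_hat(x - h, t)) / h ** 2
    return u_t + u * u_x - nu * u_xx

# Collocation points and the four MSE terms of the scalarised objective.
rng = np.random.default_rng(0)
x_f, t_f = rng.uniform(-1, 1, 256), rng.uniform(0, 1, 256)
t_b = rng.uniform(0, 1, 64)
x_0 = rng.uniform(-1, 1, 64)

losses = {
    "PDE": np.mean(pde_residual(x_f, t_f) ** 2),
    "BC1": np.mean(u_hat(-1.0, t_b) ** 2),
    "BC2": np.mean(u_hat(+1.0, t_b) ** 2),
    "IC":  np.mean((u_hat(x_0, 0.0) + np.sin(np.pi * x_0)) ** 2),
}
lambdas = {k: 1.0 for k in losses}   # scalings, e.g. supplied by ReLoBRaLo
total_loss = sum(lambdas[k] * losses[k] for k in losses)
```

The mocked candidate satisfies the IC and BC terms exactly but leaves a non-zero PDE residual, which is exactly the kind of imbalance between loss magnitudes that the balancing schemes address.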
After successful convergence of PINN training using ReLoBRaLo, we obtain the results displayed in fig. 2(b), while the final algorithm settings are reported in tab. 8. As there is no analytical solution available for the Burgers equation, we computed a reference solution using the finite element method (FEM), displayed in fig. 2(a). A plot of the squared difference between the FEM and PINN solutions is shown in fig. 2(c) and indicates a relative maximum error below 5%.
Fig. 2: (a) FEM result, (b) PINN result, (c) squared error.
However, Burgers’ equation can also be turned into an inverse problem by regarding the PDE parameter as an unknown to be estimated from a set of observations (i.e. data) over the spatial and temporal domain. In this setting, the PINNs induced loss function to be deployed reads:
(15)  
Similar to the network’s weights and biases, the additional trainable PDE variable (here viscosity ) is now also updated through gradient descent:
(16) \( \nu^{(k+1)} = \nu^{(k)} - \eta \, \nabla_{\nu} \mathcal{L}\big(\theta, \nu^{(k)}\big) \)
Measurement data for the inverse problem setting were obtained from our FEM reference solution without addition of noise. At every iteration, we sample from the available data in order to generate a batch of collocation points.
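A toy sketch of the parameter update in eq. (16): writing the residual as f = (u_t + u u_x) - ν u_xx and treating the derivative terms as precomputed arrays a = u_t + u u_x and b = u_xx, the gradient of the mean squared residual w.r.t. ν is available in closed form, so plain gradient descent recovers the viscosity. Function name, learning rate and data are illustrative, not the paper's:

```python
def estimate_nu(a, b, nu0=0.0, lr=0.05, steps=500):
    """Gradient descent on L(nu) = mean((a - nu*b)^2).

    dL/dnu = -2 * mean((a - nu*b) * b), so each step moves nu towards
    the least-squares viscosity estimate.
    """
    nu = nu0
    n = len(a)
    for _ in range(steps):
        grad = sum(-2 * (ai - nu * bi) * bi for ai, bi in zip(a, b)) / n
        nu -= lr * grad  # gradient-descent update of the PDE parameter
    return nu
```

With synthetic data generated from a known viscosity, the iteration converges to that value; in the actual PINN setting the gradient is instead obtained by automatic differentiation through the network.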




Fig. 3: Loss over multiple training runs (i/left) of Burgers' equation and the mean and variance of the corresponding scaling factors (ii/right) computed with ReLoBRaLo.

Fig. 3 shows the scaling factors of our ReLoBRaLo method with varying hyperparameters for Burgers' equation. As can be expected, a larger value of the exponential decay rate leads to a smoother curve, because past loss statistics are carried along longer, thereby countering the stochasticity that might arise at every optimisation step. The temperature, on the other hand, influences the magnitude of the scalings. Another general tendency in the plots of the scaling factors is that the relatively smaller loss contributions (here: BC 1 and BC 2) correspond to higher scaling values (potentially greater than 1), while the larger loss contributions (here: PDE and IC) correspond to lower scaling values (potentially less than 1). Note that, whenever "log" is used in this and all subsequent figures, we refer to the natural logarithm of that quantity.
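The scaling behaviour discussed above can be sketched schematically as follows. This is a simplified reading of ReLoBRaLo, not the paper's exact formulation: a softmax over the relative losses L_i(t)/L_i(t'), scaled so the weights sum to the number of terms, combined through an exponential moving average with decay rate `alpha` and a Bernoulli "saudade" lookback to the losses at the start of training. All names and defaults are illustrative:

```python
import math
import random

def balance(cur, ref, temperature):
    """Softmax over relative losses L_i(t)/L_i(t'), scaled so the
    resulting weights sum to the number of terms."""
    m = len(cur)
    z = [c / (r * temperature + 1e-12) for c, r in zip(cur, ref)]
    zmax = max(z)
    e = [math.exp(v - zmax) for v in z]  # numerically stable softmax
    s = sum(e)
    return [m * v / s for v in e]

def relobralo_step(lam_prev, cur, prev, first,
                   alpha=0.999, rho_p=0.999, temperature=1.0, rng=random):
    """One schematic update: exponential moving average of the previous
    scalings and the freshly balanced ones, with a Bernoulli lookback
    that occasionally rebalances against the losses at the first step."""
    rho = 1.0 if rng.random() < rho_p else 0.0
    hist_bal = balance(cur, first, temperature)
    hist = [rho * lp + (1 - rho) * hb for lp, hb in zip(lam_prev, hist_bal)]
    bal = balance(cur, prev, temperature)
    return [alpha * h + (1 - alpha) * b for h, b in zip(hist, bal)]
```

Consistent with the figure, a term whose loss has decreased less than the others (a larger relative loss) receives a weight above 1, and the weights of the remaining terms drop below 1 so that the total stays at the number of terms.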
Burgers | Manual | GradNorm | LR anneal. | SoftAdapt | ReLoBRaLo
Forward, train | 5.5 | 6.6 | 9.9 | 4.5 | 5.6
Forward, val | 1.2 | 2.0 | 1.6 | 1.8 | 1.4
Forward, std val | 5.7 | 2.1 | 2.3 | 3.2 | 6.8
Inverse, val | 1.90 | 6.82 | 2.45 | 1.07 | 2.17
Inverse, std | 1.23 | 5.07 | 3.44 | 4.27 | 2.11

Tab. 2: Training and validation loss on Burgers' equation. The reported values are the median over four independent runs with identical settings. Additionally, we report the standard deviation over the runs of the best performing model on the validation loss.
The relatively small variances across training runs suggest that the optimisation progress follows similar patterns, even when varying the depth and width of the network. These values can therefore provide valuable insight into the training and help identify possibilities for improving the model. For example, the fact that the scaling for the governing equation has the largest value after 50,000 epochs indicates that it was the first term to stop making progress. Fig. 6, on the other hand, shows that the opposite is true for Helmholtz's and Kirchhoff's equations, where the boundary conditions have more difficulty making progress. This knowledge can help in taking informed decisions to improve the framework, e.g. by adapting the activation functions, the loss function or the model's architecture accordingly.
Tab. 2 summarises the performance of the different balancing techniques in the forward and inverse problem settings against a baseline, for which we manually chose the optimal scalings through grid search. As can be observed, the adaptive scaling techniques perform similarly well to the baseline, with Learning Rate Annealing and ReLoBRaLo reaching a considerably lower validation error. The results show that either of these methods greatly reduces the amount of work required for hyperparameter search, while still achieving strong results with high probability.
Fig. 4: Convergence of PINNs parameter estimation for Burgers' equation.
Besides accuracy, computational efficiency is another important metric for evaluating adaptive loss balancing methods. Since our ReLoBRaLo method is designed to require only one backward pass, its computational overhead can be expected to be small compared to GradNorm and Learning Rate Annealing, which both utilise gradient statistics and hence separate backward passes for each term. Indeed, tab. 3 shows that Burgers' equation with its four loss terms can be solved by ReLoBRaLo about 40% faster than with Learning Rate Annealing and 70% faster than with GradNorm, thus adding to the efficiency and sustainability of PINNs training. Note that the reported values in tab. 3 stem from tasks where the balancing operation was performed at every optimisation step. Both GradNorm and Learning Rate Annealing can be made more efficient by updating the scaling terms only once every fixed number of iterations. However, this introduces a trade-off between flexibility and efficiency and therefore an additional, very sensitive hyperparameter with a high impact on the method's accuracy and efficiency. ReLoBRaLo, on the other hand, adapts its scalings at every iteration at very low computational cost.
| Manual | GradNorm | LR annealing | SoftAdapt | ReLoBRaLo
Time per 1'000 it. | 3.7 | 14.3 | 7.2 | 4.1 | 4.3

Tab. 3: Execution time per 1'000 iterations on Burgers' equation.
It is interesting to note that, while the forward problem induced a loss function consisting of four terms, the inverse problem requires only two terms (eq. 15). Hence, selecting the scalings manually is significantly less time-consuming than for the forward problem. Consequently, tab. 2 shows that the baseline was harder to outperform, with Learning Rate Annealing (LR annealing) and ReLoBRaLo being the only methods yielding better results. It is worth noting, however, that Learning Rate Annealing approximates the true value of \( \nu \) significantly faster than ReLoBRaLo and is therefore the optimal choice for this particular problem setting, cf. fig. 4. Further conclusions and comparisons across the different loss balancing methods are made in sec. 7.
(a) Analytical result  (b) PINN result  (c) Squared error

6.2 Kirchhoff Plate Bending Equation
The Kirchhoff–Love theory of plates arose from civil and mechanical engineering and consists of a two-dimensional mathematical model used to determine stresses and deformations in thin plates subjected to forces and moments (bathe1996finite). The Kirchhoff plate bending problem assumes that a mid-surface plane can be used to represent a three-dimensional plate in two-dimensional form; together with a linear elastic material, a fourth-order PDE can be derived to describe its mechanical behaviour:

(17) \( \frac{\partial^4 w}{\partial x^4} + 2 \frac{\partial^4 w}{\partial x^2 \partial y^2} + \frac{\partial^4 w}{\partial y^4} = \frac{p(x, y)}{D} \)
where \( p(x, y) \) is the load acting on the plate at coordinates \( (x, y) \), \( w(x, y) \) is the plate's deflection, and \( D = \frac{E h^3}{12 (1 - \nu^2)} \) is the plate's flexural stiffness, computed from Young's modulus \( E \), the plate's thickness \( h \) and Poisson's ratio \( \nu \). The Kirchhoff plate bending problem poses several severe challenges to FEM solutions (bathe1996finite), yet analytical solutions can be inferred, e.g. using Fourier series, for special cases such as an applied sinusoidal load:
(18) \( p(x, y) = p_0 \sin\!\left(\frac{\pi x}{a}\right) \sin\!\left(\frac{\pi y}{b}\right), \qquad w(x, y) = \frac{p_0}{\pi^4 D \left(\frac{1}{a^2} + \frac{1}{b^2}\right)^2} \sin\!\left(\frac{\pi x}{a}\right) \sin\!\left(\frac{\pi y}{b}\right) \)
In this paper we consider a concrete plate of width \( a \), length \( b \), base load \( p_0 \), Young's modulus \( E \), plate height \( h \) and Poisson's ratio \( \nu \) with simply supported edge boundary conditions, as arises in typical civil engineering structures such as slabs. We hence consider the following boundary conditions (BC):
(19) \( w = 0, \quad m_x = 0 \quad \text{for } x \in \{0, a\}; \qquad w = 0, \quad m_y = 0 \quad \text{for } y \in \{0, b\} \)
where \( m_x \) and \( m_y \) are bending moments computed as \( m_x = -D \left( \frac{\partial^2 w}{\partial x^2} + \nu \frac{\partial^2 w}{\partial y^2} \right) \) and \( m_y = -D \left( \frac{\partial^2 w}{\partial y^2} + \nu \frac{\partial^2 w}{\partial x^2} \right) \).
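As a sanity check on the plate model, one can verify numerically that the Navier-type closed-form deflection for a sinusoidal load satisfies the biharmonic equation ∇⁴w = p/D. The sketch below uses illustrative values a = b = D = p₀ = 1 (assumptions for the demo, not the paper's concrete plate) and a central finite-difference biharmonic stencil:

```python
import math

# Illustrative plate parameters (assumptions, not the paper's values)
a = b = D = p0 = 1.0
C = p0 / (math.pi**4 * D * (1 / a**2 + 1 / b**2)**2)

def w(x, y):
    """Closed-form deflection for the load p0*sin(pi x/a)*sin(pi y/b)."""
    return C * math.sin(math.pi * x / a) * math.sin(math.pi * y / b)

def p(x, y):
    """Sinusoidal load."""
    return p0 * math.sin(math.pi * x / a) * math.sin(math.pi * y / b)

def biharmonic_fd(f, x, y, h=1e-2):
    """Central finite-difference approximation of the biharmonic operator
    d4/dx4 + 2 d4/dx2dy2 + d4/dy4 applied to f at (x, y)."""
    d4x = (f(x - 2*h, y) - 4*f(x - h, y) + 6*f(x, y)
           - 4*f(x + h, y) + f(x + 2*h, y)) / h**4
    d4y = (f(x, y - 2*h) - 4*f(x, y - h) + 6*f(x, y)
           - 4*f(x, y + h) + f(x, y + 2*h)) / h**4
    fxx = lambda yy: (f(x - h, yy) - 2*f(x, yy) + f(x + h, yy)) / h**2
    d2x2y = (fxx(y - h) - 2*fxx(y) + fxx(y + h)) / h**2
    return d4x + 2*d2x2y + d4y
```

At an interior point, `biharmonic_fd(w, 0.3, 0.7)` agrees with `p(0.3, 0.7) / D` up to discretisation error, and `w` vanishes on the simply supported edges.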
In total, we obtain 8 boundary conditions and therefore 9 terms in the PINNs loss function, making this a challenging task for balancing the contributions of the various objectives:
(20) \( \mathcal{L}(\theta) = \lambda_0 \, \mathrm{MSE}_{\mathrm{PDE}}(\theta) + \sum_{i=1}^{8} \lambda_i \, \mathrm{MSE}_{\mathrm{BC}_i}(\theta) \)
(i) PINNs convergence  (ii) ReLoBRaLo loss scaling
After successful convergence of PINNs training, we obtain the results displayed in fig. 5(b) and compare them to the analytically available solution displayed in fig. 5(a). A plot of the squared difference between the analytical and PINN results is shown in fig. 5(c) and exhibits a negligible maximum error. The final algorithm settings are reported in tab. 8.
Fig. 6 shows an example of ReLoBRaLo's training process on Kirchhoff's equation. In this particular example, one notices the larger variance of the scaling values towards the end of training. Moreover, the scalings did not converge towards 1, suggesting that training stopped before all terms had ceased making progress: the scalings for the boundary conditions on the moments (yellow) were still increasing at the end of training, while those for the boundary conditions on the displacements (red) were decreasing. This gives a strong indication as to where the model's limitations lie. In this case, additional attention should be paid to the moments, e.g. by selecting an activation function that is better behaved in the second derivative than tanh (siren).
Kirchhoff | Manual | GradNorm | LR anneal. | SoftAdapt | ReLoBRaLo
Forward, train | 1.2 | 5.3 | 9.1 | 2.0 | 6.0
Forward, val | 1.3 | 1.7 | 2.7 | 4.2 | 4.0
Forward, std val | 3.9 | 2.2 | 1.0 | 4.7 | 7.7
Inverse, val | 2.13 | 3.60 | 5.99 | 9.53 | 3.23
Inverse, std | 1.56 | 4.72 | 0.80 | 4.58 | 2.91

Tab. 4: Training and validation loss on Kirchhoff's equation.
Fig. 7: Convergence of PINNs parameter estimation for Kirchhoff's equation.
Concerning performance, ReLoBRaLo outperforms the baseline and the other algorithms by almost an order of magnitude in accuracy, while also yielding a very small standard deviation and hence being very consistent across training runs (cf. tab. 4). The results show its effectiveness, even on Kirchhoff's challenging problem with a total of 9 loss terms (cf. eq. 20). Furthermore, the execution times in tab. 5 underline the efficiency benefit (up to a sixfold speedup) of balancing the loss without gradient statistics, as separate backward passes for each term become increasingly expensive as the number of terms in the loss function grows. Further conclusions and comparisons across the different loss balancing methods are made in sec. 7.
| Manual | GradNorm | LR annealing | SoftAdapt | ReLoBRaLo
Time per 1'000 it. | 17.3 | 128.6 | 139.7 | 20.2 | 22.5

Tab. 5: Execution time per 1'000 iterations on Kirchhoff's equation.
For the inverse Kirchhoff problem setting, we select the PDE parameter \( D \) (i.e. the flexural stiffness) to be learned from given data, which we obtained by sampling from the analytically known solution. Given the large disparity between the initialisation of \( D \) and its target value, we empirically found the use of two separate optimisers beneficial in this case: one optimiser updates the network's parameters \( \theta \), a different one the PDE parameter \( D \). In contrast to Burgers' equation, ReLoBRaLo also sets a new benchmark on Kirchhoff's inverse problem, both in accuracy and in convergence speed, cf. fig. 7 and tab. 5.
6.3 Helmholtz equation
The Helmholtz equation represents a time-independent form of the wave equation and arises in many physical and engineering problems, such as acoustics and electromagnetism (pde_sommerfeld). The equation has the form:
(21) \( \Delta u(x, y) + k^2 u(x, y) = q(x, y) \)
where \( k \) is the wave number and \( q(x, y) \) a forcing term. This is a common benchmark problem for PINNs and possesses an analytical solution in combination with Dirichlet boundaries:
(22) \( u(x, y) = \sin(a_1 \pi x) \sin(a_2 \pi y), \qquad q(x, y) = \left( k^2 - (a_1^2 + a_2^2) \pi^2 \right) \sin(a_1 \pi x) \sin(a_2 \pi y) \)
(a) Analytical result  (b) PINN result  (c) Squared error

Both the \( x \) and \( y \) input variables are bounded below by -1 and above by 1. The boundary conditions therefore add four terms to the loss function of the forward problem, resulting in a five-term physics-informed loss:
(23) \( \mathcal{L}(\theta) = \lambda_0 \, \mathrm{MSE}_{\mathrm{PDE}}(\theta) + \sum_{i=1}^{4} \lambda_i \, \mathrm{MSE}_{\mathrm{BC}_i}(\theta) \)
where \( \hat{u}(x, y; \theta) \) is the parameterisation of the latent function \( u(x, y) \) using a neural network with parameters \( \theta \).
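Assuming a manufactured solution of the form u(x, y) = sin(a₁πx) sin(a₂πy) with forcing q = (k² − (a₁² + a₂²)π²) u (our reading of the standard Helmholtz PINN benchmark; the constants a₁ = 1, a₂ = 4, k = 1 below are illustrative, not necessarily the paper's), a finite-difference check confirms that the PDE residual vanishes:

```python
import math

# Illustrative benchmark constants (assumptions, not necessarily the paper's)
a1, a2, k = 1, 4, 1.0

def u(x, y):
    """Manufactured solution, zero on the boundary of [-1, 1]^2."""
    return math.sin(a1 * math.pi * x) * math.sin(a2 * math.pi * y)

def q(x, y):
    """Forcing chosen so that u solves Helmholtz: lap(u) + k^2 u = q."""
    return (k**2 - (a1**2 + a2**2) * math.pi**2) * u(x, y)

def helmholtz_residual(x, y, h=1e-4):
    """Finite-difference residual lap(u) + k^2 u - q at (x, y),
    using the standard 5-point Laplacian stencil."""
    lap = (u(x - h, y) + u(x + h, y) + u(x, y - h) + u(x, y + h)
           - 4 * u(x, y)) / h**2
    return lap + k**2 * u(x, y) - q(x, y)
```

In the PINN, this residual is evaluated with automatic differentiation at the collocation points and contributes the MSE_PDE term of eq. (23), while the four boundary terms penalise the (here zero) trace of u on the edges.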
(a) GradNorm Result  (b) LR Annealing Result 
After successful convergence of PINNs training, we obtain the results displayed in fig. 8(b) and compare it to the analytically available solution displayed in fig. 8(a). A plot of the squared difference in as given by the analytical and PINNs results is shown in fig. 8(c) and delivers a negligible max error. The final algorithm settings are reported in tab. 8.
(a) PINNs convergence  (b) ReLoBRaLo loss scaling
The Helmholtz equation reveals a limitation of our basic loss balancing approach and motivates the introduction of the random lookback. GradNorm and Learning Rate Annealing both achieve impressive results and substantially outperform the baseline as well as ReLoBRaLo without random lookback in terms of accuracy on the BC terms, cf. fig. 9. This is likely due to the considerable initial difference in magnitudes between the governing equation and the boundary conditions. Furthermore, the high values of the exponential decay rate, necessary for "remembering" deteriorations longer, induce a latency between the increase of a term's loss and the corresponding reaction of its scaling (cf. fig. 10). On the other hand, GradNorm and Learning Rate Annealing do not succeed in decreasing the error on the governing equation term as much as ReLoBRaLo. Fig. 9 shows that both GradNorm and Learning Rate Annealing focus on improving the boundary conditions right from the beginning of training, whereas ReLoBRaLo without random lookback counters the initial deterioration, but eventually "forgets" and instead focuses on the more dominant governing equation. This is also reflected in the discrepancy between the training and validation loss: GradNorm and Learning Rate Annealing have a higher training loss than ReLoBRaLo, but still excel at approximating the underlying function (cf. tab. 6). This triggered further investigation of the saudade and temperature parameters, as described in the next section.
Helmholtz | Manual | GradNorm | LR anneal. | SoftAdapt | ReLoBRaLo
Forward, train | 1.4 | 7.1 | 2.7 | 4.9 | 4.7
Forward, val | 7.1 | 5.6 | 1.4 | 8.4 | 2.6
Forward, val std | 8.1 | 1.9 | 7.6 | 7.3 | 8.2
Inverse, val | 2.68 | 1.48 | 5.10 | 7.8 | 3.65
Inverse, std | 4.95 | 3.55 | 7.21 | 1.9 | 2.54

Tab. 6: Training and validation loss on Helmholtz's equation.
| Manual | GradNorm | LR annealing | SoftAdapt | ReLoBRaLo
Time per 1'000 it. | 4.8 | 10.4 | 6.7 | 5.0 | 5.2

Tab. 7: Execution time per 1'000 iterations on Helmholtz's equation.
Fig. 11: Convergence of PINNs parameter estimation for Helmholtz's equation.
For the inverse Helmholtz problem setting, we select the wave number \( k \) to be learned from given data, which we obtained by sampling from the analytically known solution. We initialised an estimate \( \hat{k} \) and tasked the network with approximating \( k \). Similarly to the inverse Burgers problem, a single optimiser was used to update both the network's parameters \( \theta \) and the PDE parameter \( k \). ReLoBRaLo also sets a new benchmark on Helmholtz's inverse problem, both in accuracy and in convergence speed, cf. fig. 11 and tab. 7.
Hyperparameter | Burgers | Kirchhoff | Helmholtz
Learning Rate |  |  | 
Layers | 4 | 4 | 2
Neurons per Layer | 256 | 360 | 256
Exponential Decay Rate | 0.999 | 0.999 | 0.99
Temperature |  |  | 
Expected Saudade | 0.9999 | 0.9999 | 0.99
Activation function | tanh | tanh | tanh

Tab. 8: Final hyperparameter settings for the three benchmark problems.
7 Ablation and Sensitivity Study, Discussion and Conclusions
The proposed ReLoBRaLo loss balancing approach, together with the PINNs architecture, incorporates many hyperparameters with potential influence on performance, efficiency and accuracy. In order to investigate these tendencies further, this section conducts an ablation and sensitivity study w.r.t. the hyperparameters temperature, exponential decay rate and expected saudade on the three presented PDE examples. After presenting the results of these investigations, we discuss the findings obtained so far with our novel method and draw more general conclusions from the presented content of this paper.
(a) Burgers’  (b) Kirchhoff’s  (c) Helmholtz’s 

Fig. 12 visualises the models' sensitivity to the exponential decay rate and the temperature. A larger decay rate causes the network to "remember" longer, while the temperature controls how much the scalings "sheer out". Fig. 12(c) shows that Helmholtz's equation benefits most from small temperature values, which make the balancing more aggressive. This is in line with the findings of the previous section (cf. sec. 6.3), where we noted that the large difference in magnitudes between the terms in the loss function caused issues for ReLoBRaLo and that resolute balancing was necessary to prevent the boundary conditions from being neglected; the optimal temperature is reported in tab. 8. Burgers' and Kirchhoff's equations, on the other hand, require smoother scalings, with a tendency towards a higher decay rate and temperature. It is worth noting that all three tasks benefit from the relaxation through the exponential decay, as disabling it always causes a deterioration of the model's performance.
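The "memory" effect of the decay rate can be illustrated with a bare exponential moving average: the larger the decay rate, the longer past loss statistics linger and the slower the averaged signal reacts to a sudden change. This is a toy sketch of that single mechanism, not the full ReLoBRaLo update:

```python
def ema(values, alpha):
    """Exponential moving average with decay rate alpha: larger alpha
    puts more weight on the past, yielding a smoother, slower signal."""
    out, m = [], values[0]
    for v in values:
        m = alpha * m + (1 - alpha) * v
        out.append(m)
    return out
```

After a step change from 0 to 1, the average with alpha = 0.9 has only reached 1 − 0.9⁵ ≈ 0.41 after five steps, while alpha = 0.5 already reaches ≈ 0.97, mirroring the smooth-but-sluggish scaling curves observed for high decay rates.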
However, the relaxation through the exponential decay induces a new trade-off between making the model remember longer and letting it adapt quickly to changes during training. We therefore study the effect of a random lookback through a Bernoulli random variable (saudade). It allows setting a lower value for the decay rate, thus making the model more flexible, while occasionally "reminding" it of its progress since the start of training.
Expected Saudade | Helmholtz | Burgers | Kirchhoff
0.0 | 2.0 | 1.0 | 1.5
0.5 | 5.2 | 9.5 | 2.7
0.9 | 4.0 | 1.3 | 2.1
0.99 | 2.6 | 4.9 | 6.9
0.999 | 4.1 | 3.8 | 5.6
0.9999 | 1.2 | 1.4 | 4.0
1 | 8.1 | 4.7 | 7.4

Tab. 9: Performance when varying the expected saudade on the three experiments.
Tab. 9 summarises the change in performance when varying the expected saudade across all three experiments. It is apparent that Helmholtz benefits more from frequent lookbacks, hitting its best performance at an expected saudade of 0.99, whereas Burgers and Kirchhoff only require an expected lookback every 10'000 optimisation steps. Figs. 13 and 3 illustrate the effect of the random lookbacks. While the stochasticity of the scaling factors increases, making them less interpretable, the weight on the boundary conditions also increases. The scaled contribution of the boundary conditions consequently leads to a better approximation of the underlying function.
(a) PINNs convergence  (b) ReLoBRaLo loss scaling
It is worth noting that the addition of the random lookback improves the accuracy on Helmholtz’ equation by more than an order of magnitude and by almost one order of magnitude for Burgers’ and Kirchhoff’s equation.
8 Synopsis and Outlook
From previous work we observe that a competitive relationship between the physics loss terms exists in the training of PINNs, potentially spoiling training success, performance or efficiency. This paper investigated different methods for adaptively balancing a loss function consisting of several, potentially conflicting objectives, as arises in scalarised MOO for PINNs. We proposed a novel adaptive loss balancing method by (i) combining the best attributes of existing approaches and (ii) introducing a saudade parameter to occasionally incorporate historic loss contributions. This forms a new heuristic, called Relative Loss Balancing with Random Lookback (ReLoBRaLo), for selecting bespoke weights with which to combine multiple loss terms during the training of PINNs. The effectiveness and merits of ReLoBRaLo are demonstrated empirically on several standard PDEs, including the Helmholtz equation, Burgers' equation and the Kirchhoff plate bending equation, considering both forward and inverse problems, where unknown parameters in the PDEs are estimated. Our computations show that ReLoBRaLo consistently outperforms the baseline of existing scaling methods (GradNorm, Learning Rate Annealing, SoftAdapt and manual scaling) in terms of accuracy, while also being up to six times more computationally efficient (in training epochs or wall-clock time). Using Bayesian optimisation (BO) instead of grid search turned out to be an effective tool for reducing the laborious work of finding optimal hyperparameters for the PINNs. Finally, we showed that the adaptively chosen scalings can be inspected to learn about the PINNs training process and to identify weak points, making it possible to take informed decisions to improve the framework.
Future research concerns the inspection of the performance, efficiency, robustness and scalability of ReLoBRaLo on further PDE classes, such as the Navier–Stokes equations. The adoption of Sobolev training with Sobolev norms together with ReLoBRaLo may mitigate the high costs involved in estimating neural network solutions of PDEs. In addition, the generalisation capabilities of ReLoBRaLo to the wider class of penalised optimisation problems, including PDE-constrained and Sobolev training problems, remain to be addressed.