Multi-Objective Loss Balancing for Physics-Informed Deep Learning

by Rafael Bischof, et al.
ETH Zurich

Physics-Informed Neural Networks (PINNs) are deep learning algorithms that leverage physical laws by including partial differential equations (PDEs), together with a respective set of boundary and initial conditions (BC / IC), as penalty terms in their loss function. As the PDE, BC and IC loss terms can differ significantly in magnitude, due to their underlying physical units or the stochasticity of initialisation, training of PINNs may suffer from severe convergence and efficiency problems, causing PINNs to fall short of the desired approximation quality. In this work, we observe the significant role of correctly weighting the combination of multiple competing loss functions for training PINNs effectively. To that end, we implement and evaluate different methods aiming at balancing the contributions of multiple terms of the PINN loss function and their gradients. After a review of three existing loss scaling approaches (Learning Rate Annealing, GradNorm and SoftAdapt), we propose a novel self-adaptive loss balancing method for PINNs called ReLoBRaLo (Relative Loss Balancing with Random Lookback). Finally, the performance of ReLoBRaLo is compared and verified against these approaches by solving both forward and inverse problems on three benchmark PDEs for PINNs: Burgers' equation, Kirchhoff's plate bending equation and Helmholtz's equation. Our simulation studies show that ReLoBRaLo training is much faster and achieves higher accuracy than training PINNs with other balancing methods and hence is very effective and increases sustainability of PINN algorithms. The adaptability of ReLoBRaLo illustrates robustness across different PDE problem settings. The proposed method can also be applied to the wider class of penalised optimisation problems, including PDE-constrained and Sobolev training, apart from the studied PINN examples.




1 Introduction

The emergence of Physics-Informed Neural Networks (PINNs) (Raissi2017) has sparked a lot of interest in domains that see themselves regularly confronted with problems in the low data regime. By leveraging well-known physical laws and incorporating them as implicit prior into the deep-learning pipeline, PINNs were shown to require little to no data in order to approximate partial differential equations (PDE) of varying complexity (Raissi2017; hongweicollocationbendinganalysis).

We consider the case where PINNs are used for finding the unknown, underlying solution function of a parameterised PDE. PDEs generally consist of a governing equation and a set of boundary as well as initial conditions. When trained jointly, i.e. as multi-objective optimisation (MOO), these equations form a set of objective functions that drive the model to approximate a function that satisfies the PDE. Several established and well-studied numerical methods already exist for addressing this problem, such as the Finite Element Method (FEM), the Finite Difference Method or Wavelets and the Laplace Transform Method (bathe1996finite; hughes2012finite; smith1985a; Grossmann2007NumericalTO). However, PINNs present a differentiable, mesh-free approach and avoid the curse of dimensionality (grohs2018proof; poggio2017and). They could therefore prove useful in several engineering applications, such as inversion and surrogate modelling in solid mechanics (haghighat2021physics), design optimisation (martins_ning_2021) or structural health monitoring and system identification (yuan2020machine). PINNs have also found rich applications in computational fluid mechanics and dynamics for surrogate modelling of numerically expensive fluid flow simulations (xiang2021self), identification of hidden quantities of interest (velocity, pressure) from spatio-temporal visualisations of a passive scalar (dye or smoke) (raissi2020hidden) or in an inverse heat transfer application setting in flow past a cylinder without thermal boundaries (cai2020heat).

However, before PINNs can find their application in practice, further research is necessary to tackle current failure modes (wang2020; mcclenny2020a), one of which is the issue of gradient pathologies arising from imbalanced loss terms (wanggradientpathologies). With the various terms in the objective function stemming from physical laws, they are naturally bound to units of measurement that can vary significantly in magnitude. Consequently, the signal strengths of backpropagated gradients might differ from term to term and lead to pathologies that were shown to impede proper training and cause imbalanced solutions, hence posing challenges to global optimisation methods such as Adam, Stochastic Gradient Descent (SGD), or L-BFGS (zhang1801a; adam; theodoridis2015a; kendall2018a). As a counter-measure, every individual term may be scaled by a factor $\lambda_i$ in order to balance its contribution to the total gradient. However, manual tuning of these scaling factors requires laborious grid search and becomes intractable as the number of terms grows due to the sensitivity and interdependence of these hyperparameters.

This work investigates different algorithms aiming at adaptively balancing the contributions of multiple terms and their gradients in the loss function by selecting the optimal scaling factors in order to improve approximation capabilities of PINNs. To this end, we compare the effectiveness of Learning Rate Annealing (LRAnnealing) (wanggradientpathologies), proposed in the context of PINNs, to two approaches originating from Computer Vision applications: GradNorm (gradnorm) and SoftAdapt (softadapt). In addition, we derive and present our own variation of an adaptive loss scaling technique, ReLoBRaLo (Relative Loss Balancing with Random Lookback), that we found to be more effective at similar efficiency compared to the state-of-the-art by testing the algorithms on various benchmark problems for PINNs in the forward and inverse setting: Helmholtz, Burgers and Kirchhoff PDEs. In a future paper (kraufbischof_relobralo) we show that besides the application to scaling the loss in PINNs, the proposed ReLoBRaLo approach generalises to multi-task penalty problems (as defined by (coello2000use)) and provides an effective, self-adaptive approach for optimising the penalty factors.

This paper is organised as follows: we first provide a short introduction to the problem as well as the state-of-the-art of PINNs in sec. 2. Further methodical background on multi-objective optimisation (MOO), the framework of Physics-Informed Neural Networks (PINNs) and loss balancing for PINNs training is presented in sec. 3. In sec. 4 we introduce ReLoBRaLo as a novel self-adaptive loss balancing method. Sec. 6 reports numerical results of the developed approach against state-of-the-art methods for several examples in the forward and inverse setting: Burgers, Kirchhoff and Helmholtz PDEs. Sec. 7 presents results of ablation studies as well as a discussion of findings and further conclusions on ReLoBRaLo and its hyperparameter settings across all examples of this paper. Finally, a summary together with an outlook is given in sec. 8. All code produced within this publication is freely available and open access here:

2 Physics-Informed Neural Networks (PINNs)

This section reviews basic Physics-Informed Neural Networks (PINNs) concepts and recent developments.

2.1 Problem Statement and Basic Concept

Consider the following abstract parameterised and nonlinear PDE problem:

$$
\begin{aligned}
f\left(x, t, u, \partial_x u, \partial_t u, \dots, \mu\right) &= 0, && x \in \Omega,\ t \in [0, T], \\
u(x, 0) &= g_0(x), && x \in \Omega, \\
u(x, t) &= g_\Gamma(x, t), && x \in \partial\Omega,\ t \in [0, T],
\end{aligned}
\tag{1}
$$

where $x$ is the spatial coordinate and $t$ is the time; $f$ denotes the residual of the PDE, containing the differential operators (i.e. $\partial_x u, \partial_t u, \dots$); $\mu$ are the PDE parameters; $u(x, t)$ is the solution of the PDE with initial condition $g_0$ and boundary condition $g_\Gamma$ (which can be Dirichlet, Neumann or mixed); $\Omega$ and $\partial\Omega$ represent the spatial domain resp. boundary. A special example considered in this paper is the Burgers equation (given in eq. 12), with the viscosity coefficient $\nu$ as PDE parameter. This paper is concerned with solving forward as well as inverse problems from different fields of application. For the forward problem, solutions of PDEs are to be inferred with fixed parameters $\mu$, while for the inverse problem setting, $\mu$ is unknown and has to be learned from observed data together with the PDE solution.

Figure 1: Schematic of a Physics-Informed Neural Network (PINN): A fully-connected feed-forward neural network (FCNN) with space and time coordinates $(x, t)$ as inputs, approximating a solution $\hat{u}(x, t)$. Derivatives of $\hat{u}$ w.r.t. inputs are computed by automatic differentiation (AD) and then incorporated into residuals of the governing equations as the loss function, which is composed of multiple terms weighted by different coefficients. Parameters $\theta$ of the FCNN and the unknown PDE parameters $\mu$ may be optimised simultaneously by minimising the loss function.

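Sticking with the initial "vanilla" implementation of PINNs (Raissi2017), a fully-connected feed-forward neural network (FCNN) is used to approximate the function $u$ which solves the PDE. A FCNN consists of multiple hidden layers with trainable parameters (weights and biases; denoted by $\theta$) and takes as inputs the space and time coordinates $(x, t)$, cf. fig. 1. The losses are then defined as follows:

$$
\begin{aligned}
\mathcal{L}_f &= \frac{1}{N_f} \sum_{i=1}^{N_f} \left| f\left(x_f^i, t_f^i, \hat{u}, \partial_x \hat{u}, \partial_t \hat{u}, \dots, \mu\right) \right|^2, \\
\mathcal{L}_{BC} &= \frac{1}{N_{BC}} \sum_{i=1}^{N_{BC}} \left| \hat{u}(x_{BC}^i, t_{BC}^i) - g_\Gamma(x_{BC}^i, t_{BC}^i) \right|^2, \\
\mathcal{L}_{IC} &= \frac{1}{N_{IC}} \sum_{i=1}^{N_{IC}} \left| \hat{u}(x_{IC}^i, 0) - g_0(x_{IC}^i) \right|^2, \\
\mathcal{L}_d &= \frac{1}{N_d} \sum_{i=1}^{N_d} \left| \hat{u}(x_d^i, t_d^i) - u_d(x_d^i, t_d^i) \right|^2,
\end{aligned}
\tag{2}
$$

where $\{x_f^i, t_f^i\}$ is a set of collocation points on the physical domain, $\{x_{BC}^i, t_{BC}^i\}$ for the boundary conditions (BC), $\{x_{IC}^i\}$ for the initial conditions (IC) and $\{x_d^i, t_d^i\}$ represents a set of measurement coordinates (data); the function $u_d$ maps $(x_d^i, t_d^i)$ to measurements at those coordinates; $\hat{u}$ is the output from the neural network with parameters $\theta$. PINNs are generally trained using the L2-norm (mean squared error / MSE) on uniformly sampled collocation points defined as a data set prior to training. Note that the number of points (denoted by $N$ in eq. 2) may vary for different loss terms.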

The four objectives in eq. 2 are trained jointly and hence fall into the class of multi-objective optimisation (MOO) (cf. eq. 4)


where the individual terms can be interpreted the following way:

  • the first term penalises the residual of the governing equations (PDEs), included in both the forward and inverse problem.

  • the following terms enforce the boundary conditions (BCs), included only in the forward problem.

  • the following terms enforce the initial conditions (ICs), included only in the forward problem.

  • the last term makes the network approximate the measurements, included both in the forward (albeit not strictly necessary) and inverse problem.

A common approach at handling MOO is through linear scalarisation, described in more detail in sec. 3.1.
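To make the structure of the PDE, BC and IC loss terms concrete, the following minimal sketch evaluates them for Burgers' equation in its standard form $u_t + u\,u_x - \nu\,u_{xx} = 0$. It is an illustration under stated assumptions and not the implementation used in this paper: derivatives are taken by central finite differences instead of automatic differentiation, the candidate function is a plain Python callable instead of a neural network, and the helper names, the grid and the toy solution $u(x, t) = x / (1 + t)$ are all hypothetical choices.

```python
def mse(values):
    """Mean of the squared entries of a list of residuals."""
    return sum(v * v for v in values) / len(values)


def burgers_residual(u, x, t, nu, h=1e-3):
    """Residual u_t + u*u_x - nu*u_xx, approximated by central differences.

    A PINN would obtain these derivatives by automatic differentiation;
    finite differences are a stand-in to keep the sketch dependency-free.
    """
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / (h * h)
    return u_t + u(x, t) * u_x - nu * u_xx


def pinn_loss_terms(u, nu, collocation, boundary, initial, g_bc, g_ic):
    """Return the (PDE, BC, IC) mean squared error terms for a candidate u."""
    loss_pde = mse([burgers_residual(u, x, t, nu) for x, t in collocation])
    loss_bc = mse([u(x, t) - g_bc(x, t) for x, t in boundary])
    loss_ic = mse([u(x, 0.0) - g_ic(x) for x in initial])
    return loss_pde, loss_bc, loss_ic


# Hypothetical toy setup on x in [-1, 1], t in [0, 1].
xs = [-1.0 + 0.2 * i for i in range(11)]
ts = [0.1 * j for j in range(11)]
collocation = [(x, t) for x in xs for t in ts]
boundary = [(x, t) for x in (-1.0, 1.0) for t in ts]


def exact(x, t):
    """u = x / (1 + t) satisfies Burgers' equation for any viscosity."""
    return x / (1.0 + t)


def poor(x, t):
    """A candidate that ignores time and therefore violates the PDE."""
    return x


def g_bc(x, t):
    return exact(x, t)


def g_ic(x):
    return exact(x, 0.0)


good_losses = pinn_loss_terms(exact, 0.01, collocation, boundary, xs, g_bc, g_ic)
poor_losses = pinn_loss_terms(poor, 0.01, collocation, boundary, xs, g_bc, g_ic)
```

In this toy case the poor candidate satisfies the initial condition exactly but has clearly non-zero PDE and BC terms, which illustrates that the individual objectives can behave very differently for one and the same function.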

2.2 State-of-the-Art and Related Work

Using neural networks to approximate the solutions of Ordinary Differential Equations (ODEs) and PDEs has been intensively studied over the past decade. Initially, (lagaris1998artificial; lagaris_2000) trained neural networks to solve ODEs and PDEs on a predefined set of grid points, while (Sirignano2018) proposed a method for solving high-dimensional PDEs through approximation of the solution by a neural network, with particular emphasis on training efficiency: mini-batch sampling is used in high-dimensional settings, where finite mesh-based schemes are computationally intractable.

(Raissi2018; wu_2018; hochreiter_2003; attention_is_all_you_need; Sirignano2018; jagtap2020a; SUN2020) recently reported Physics-Informed Neural Networks (PINNs) to be a notable method for solving a variety of PDEs or PDE based systems using observed data in a supervised regression manner while satisfying all of the physical properties specified by nonlinear PDEs. In (Raissi2018), the authors provided empirical justification by numerical simulations for a variety of nonlinear PDEs, including the Navier–Stokes equation and the Burgers equation. (shin_2020) provided first theoretical justification for PINNs by demonstrating convergence of linear elliptic and parabolic PDEs in L2 sense. Furthermore, a significant advantage of PINNs is the fact that the pipeline for solving the forward problem can be turned into a data-driven PDE discovery, also known as the inverse problem, with just minor adaptations of the code.

However, PINN training efficiency, convergence and accuracy remain serious challenges (raissi2019physics; xiang2021self). Current research may be ordered into four main approaches: modifying structure of the NN, divide-and-conquer/domain decomposition, parameter initialisation and loss balancing.

Only few studies addressed the acceleration of convergence by modifying the structure of PINNs. (jagtap2020b; jagtap2020a) introduced parameters that scale the input to the activation functions and get updated alongside the network’s parameters through gradient descent. The authors showed that the adaptive activation function significantly accelerated convergence and also improved solution accuracy. (kim2020) presented a fast and accurate PINN ROM with a nonlinear manifold solution representation, where the NN structure included an encoder and a decoder part. Furthermore, a shallow masked encoder was trained using data from the full order model simulations in order to use the trained decoder as representation of the nonlinear manifold solution. (peng2020) proposed dictionary-based PINNs to store and retrieve features and speed up convergence by merging prior information into the structure of NNs.

Other research focused on decomposing the computational domain in order to accelerate convergence. (jagtap2020c; jagtap2020d) proposed conservative PINNs and extended PINNs that decompose the computational domain into several discrete sub-domains, each one solved independently using a separate, shallow PINN. Inspired by the work of (jagtap2020c; jagtap2020d), (shukla2021) derived and investigated a distributed training framework for PINNs that used domain decomposition methods in both space and time-space. To accelerate convergence, the distributed framework combined the benefits of conservative and extended PINNs. The time-space domain may become very large when solving PDEs with long time integration, causing the training cost of NNs to become extremely expensive. To that end, (meng2020) proposed a parareal PINN to address the long-standing issue. The authors decomposed the long-time domain into many discrete short-time domains using a fast coarse-grained solver. Training multiple PINNs with many small data sets was much faster than training a single PINN with a large data set. For PDEs with long-time integration, the parareal PINN achieved a significant speedup. (kharazmi2021) introduced hp-variational PINNs to divide the computational space into the trial space and test space by combining domain decomposition and projection onto high-order polynomials.

In most works, researchers resort to the Xavier initialisation (glorot_2010) for selecting the PINN’s initial weights and biases. The effects of using more refined initialisation procedures have recently been gaining attention, with (liu2021novel) showing that a good initialisation can provide PINNs with a head start, allowing them to achieve fast convergence and improved accuracy. Transfer learning for PINNs was introduced by (CHAKRABORTY2021) and (GOSWAMI2020) to initialise PINNs for dealing with multi-fidelity problems and brittle fracture problems, respectively. After their success in other fields of Deep Learning, meta-learning algorithms have also been implemented in the context of PINNs (Rajeswaran2019; Smith2009; Finn2017a; Finn2018), with Model-Agnostic Meta-Learning (MAML) being amongst the most popular ones (Finn2017b). Its second-order objective is to find an initialisation that is in and of itself sub-optimal, but from where the network requires only few labelled training samples and optimisation steps in order to specialise on a task and achieve high accuracy (few-shot learning). Subsequently, (Nichol2018) proposed the REPTILE algorithm, which turns the second-order optimisation of MAML into a first-order approximation and therefore requires significantly less computation and memory while achieving similar performance. (liu2021novel) applied the REPTILE algorithm to PINNs by regarding modifications of PDE parameters as separate tasks. The resulting initialisation is such that the PINN converges in just a few optimisation steps for any choice of PDE parameters.

Using derivative information of the target function during training of a neural network was introduced by (Czarnecki2017) under the term Sobolev Training. Sobolev Training proved to be more efficient in many applicable fields due to lower sample complexity compared to regular training. In (son2021sobolev) the concept of Sobolev Training is enhanced in the strict mathematical sense using Sobolev norms in loss functions of neural networks for solving PDEs. It was found that these novel Sobolev loss functions lead to significantly faster convergence on investigated examples compared to traditional L2 loss functions. NNs were used in plain as well as a Sobolev Training manner for constitutive modelling in (vlassis2021sobolev; vlassis2020geometric; krausphdthesis; kraus2020artificial), where it was shown that mechanical relations can be seen as Sobolev training to successfully encapsulate several aspects of the constitutive behavior, such as strain-stress-relationships arising from derivatives of a Helmholtz potential in hyperelasticity.

(colby2020) observed that a weighted scalarisation of the multiple loss functions, defined by the sampled data and physical laws for PINNs training, plays a significant role for convergence. (wanggradientpathologies) recently published a learning rate annealing algorithm that employs back-propagated gradient statistics in the training procedure in order to adaptively balance the terms’ contributions to the final loss. (wang2020) investigated the issue of vanishing and exploding gradients that currently limits the applicability of PINNs. To that end, the authors introduced a Neural Tangent Kernel (NTK) perspective in order to comprehend the training process of PINNs and used it to assign weights to each loss term, albeit at subtle performance improvement. (shin_2020) developed the Lipschitz regularised loss for solving linear second-order elliptic and parabolic type PDEs. (mcclenny2020a) proposed a method for updating the adaptation weights in the loss function in relation to network parameters. Self-adaptive PINNs are forced to meet all physical constraints as equitably as possible. Although many studies have been conducted to confirm the effects of loss functions on generalisation performance, the competitive relationship between the physical objectives is seldom taken into account. According to (elhamod2020), tuning the competing physics-guided (PG) loss functions at various neural network learning stages is crucial. To that end, two approaches for selecting the trade-off weights of loss terms with different characteristics were proposed: annealing and cold starting, which affect the initial or final epochs. A drawback of that method, however, is that selecting the appropriate type of sigmoid function, which primarily influences accuracy, is still time-consuming.

As a result of the literature review, the self-adaptive PINNs training procedures can be regarded as a PDE-constrained optimisation problem. This paper is concerned with careful consideration of competitiveness and adaptability and seeks inspiration from loss balancing techniques proposed across several fields of machine learning. Investigations on the performance, convergence and accuracy are conducted for the forward and inverse problem setting for several PDEs from engineering and natural sciences.

3 Methodology

This section introduces relevant methodical background from multi-objective optimisation (MOO), adaptive training and hyperparameter tuning.

3.1 Multi-Objective Optimisation

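Multi-objective optimisation (MOO) is concerned with simultaneously optimising a set of $k \geq 2$, potentially conflicting objectives (multitask_caruana; jones2002multi):

$$
\min_{\theta} \; \mathcal{L}(\theta) = \min_{\theta} \; \left[ \mathcal{L}_1(\theta), \dots, \mathcal{L}_k(\theta) \right]^\top .
$$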

Many problems in engineering, natural sciences or economics can be formulated as multi-objective optimisations and generally require trade-offs to simultaneously satisfy all objectives to a certain degree (moo_engineering). The solution of MOO models is usually expressed as a set of Pareto optima, representing these optimal trade-offs between given criteria according to the following definitions (pareto_multi_task):

Definition 1

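A solution $\theta$ Pareto dominates solution $\bar{\theta}$ (denoted $\theta \prec \bar{\theta}$) if and only if $\mathcal{L}_i(\theta) \leq \mathcal{L}_i(\bar{\theta}) \;\; \forall i \in \{1, \dots, k\}$ and $\exists j \in \{1, \dots, k\}$ such that $\mathcal{L}_j(\theta) < \mathcal{L}_j(\bar{\theta})$.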

Definition 2

A solution $\theta^*$ is said to be Pareto optimal if $\nexists \, \theta$ such that $\theta \prec \theta^*$. The set of all Pareto optimal points is called the Pareto set and the image of the Pareto set in the loss space is called the Pareto front.
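The two definitions translate directly into code. The sketch below is a minimal illustration operating on loss vectors, i.e. on tuples of objective values for a finite set of candidate solutions; the function names and the example values are hypothetical.

```python
from typing import List, Sequence


def pareto_dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if loss vector a Pareto dominates loss vector b.

    a must be no worse than b in every objective and strictly
    better in at least one objective (losses are minimised).
    """
    if len(a) != len(b):
        raise ValueError("loss vectors must have the same length")
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the points that are not dominated by any other point."""
    return [p for p in points
            if not any(pareto_dominates(q, p) for q in points)]


# Four candidate solutions with two objectives each.
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(candidates)  # (3.0, 3.0) is dominated by (2.0, 2.0)
```

Note that a loss vector never dominates itself, and that the three remaining points are mutually incomparable: each one represents a different trade-off between the two objectives.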

A multi-objective optimisation can be turned into a single objective through linear scalarisation:

$$
\mathcal{L}(\theta) = \sum_{i=1}^{k} \lambda_i \, \mathcal{L}_i(\theta), \qquad \lambda_i \in \mathbb{R}_{>0} .
$$

In theory, a Pareto optimal solution is independent of the scalarisation (efficient_moo). However, when using neural networks for MOO, the solution space becomes highly non-convex. Thus, although neural networks are universal function approximators (Hornik1990), they are not guaranteed to find the optimal solution through first-order gradient-based optimisation. Scaling the loss space therefore provides the option of guiding the gradients into having an a priori deemed desirable property. However, manually finding optimal $\lambda_i$ requires laborious grid search and becomes intractable as $k$ gets large. Furthermore, one might want to let $\lambda_i$ evolve over time. This raises the need for an automated heuristic to dynamically choose the scalings $\lambda_i$.
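The influence of the scalings on the solution found by gradient descent can be illustrated with a deliberately simple toy problem. The two quadratic objectives below, $(\theta - 1)^2$ and $100\,(\theta + 1)^2$, are hypothetical and only mimic two loss terms whose magnitudes differ by two orders; the closed-form minimiser of their scalarisation is $\theta^* = (\lambda_1 - 100\lambda_2) / (\lambda_1 + 100\lambda_2)$.

```python
def scalarised_gradient(theta, scalings):
    """Gradient of lam1*(theta - 1)^2 + lam2*100*(theta + 1)^2 w.r.t. theta."""
    lam1, lam2 = scalings
    return lam1 * 2.0 * (theta - 1.0) + lam2 * 200.0 * (theta + 1.0)


def minimise(scalings, theta=0.5, lr=1e-3, steps=5000):
    """Plain gradient descent on the scalarised toy objective."""
    for _ in range(steps):
        theta -= lr * scalarised_gradient(theta, scalings)
    return theta


# Without balancing, the large-magnitude term dictates the solution.
theta_unscaled = minimise((1.0, 1.0))    # approx. -99/101, close to -1
# Scaling the second term by 0.01 lets both terms contribute equally.
theta_balanced = minimise((1.0, 0.01))   # approx. 0, midway between optima
```

With equal scalings the first objective is almost ignored, whereas the manually balanced scalings yield a compromise between both optima. An adaptive heuristic aims at finding such scalings without manual tuning.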


3.2 Adaptive Loss Balancing Methods

This section reviews different methods aiming at balancing the various terms within multi-objective optimisation. To this end, we compare the effectiveness of Learning Rate Annealing (wanggradientpathologies), proposed in the context of PINNs, as well as two approaches originating from Computer Vision applications: GradNorm (gradnorm) and SoftAdapt (softadapt). This forms the basis for deriving and presenting our own loss balancing method as given in sec. 4.

3.2.1 Learning Rate Annealing

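wanggradientpathologies conducted a study on gradients in PINNs and identified pathologies that explained some failure modes. One pathology is gradient stiffness in the boundary conditions caused by the imbalance amongst the different loss terms. As a remedy, it is proposed to adaptively scale the loss using gradient statistics, thus reducing the laborious tuning of these hyperparameters:

$$
\hat{\lambda}_i(t) = \frac{\max_{\theta} \left| \nabla_\theta \mathcal{L}_f(t) \right|}{\overline{\left| \nabla_\theta \mathcal{L}_i(t) \right|}}, \qquad
\lambda_i(t) = \alpha \, \lambda_i(t-1) + (1 - \alpha) \, \hat{\lambda}_i(t),
$$

where $\mathcal{L}_f$ is the loss of the PDE residual; $\overline{\left| \nabla_\theta \mathcal{L}_i(t) \right|}$ is the mean magnitude of the gradient of term $i$ w.r.t. the parameters $\theta$; $\alpha$ is a hyperparameter with a value recommended by the authors.

With this method, whenever the maximum value of $\left| \nabla_\theta \mathcal{L}_f(t) \right|$ grows considerably larger than the average value in $\left| \nabla_\theta \mathcal{L}_i(t) \right|$, the scalings correct for this discrepancy such that all gradients have similar magnitudes. Additionally, exponential decay is used in order to smoothen the balancing and avoid drastic changes of the loss space between optimisation steps.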

This procedure induces a few drawbacks. Its unboundedness potentially involves up- or down-scaling of terms by means of several orders of magnitude. The up-scaling in particular can cause problems similar to the effect of choosing a learning rate that is too large and therefore potentially overshooting the objective. Furthermore, scaling all terms to have the same magnitude throughout training can incite the network to optimise for the "low hanging fruit". A term, whose loss decreased considerably in the last optimisation step, will see its contribution to the total gradient scaled back up to the same magnitude to match the other terms. Therefore, the network might focus on the objectives that are easiest to optimise for.
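The update of the scalings from gradient statistics can be sketched as follows. This is a minimal illustration and not the reference implementation of (wanggradientpathologies): gradients are passed in as flat lists of numbers, the smoothing convention (weight $\alpha$ on the previous scaling) and the default `alpha=0.9` are assumptions made for this sketch, and the function name is hypothetical.

```python
def lr_annealing_update(prev_scalings, residual_grad, term_grads,
                        alpha=0.9, eps=1e-12):
    """One update of the scalings from gradient statistics.

    prev_scalings: scalings lambda_i from the previous step
    residual_grad: flat gradient of the PDE residual loss
    term_grads:    one flat gradient per balanced term (e.g. BC, IC)
    alpha:         smoothing factor (assumed value, weight on the past)
    """
    max_residual = max(abs(g) for g in residual_grad)
    new_scalings = []
    for lam_prev, grad in zip(prev_scalings, term_grads):
        mean_abs = sum(abs(g) for g in grad) / len(grad)
        # Ratio between the largest residual gradient and the mean
        # gradient magnitude of the term; eps guards against division by 0.
        lam_hat = max_residual / (mean_abs + eps)
        # Exponential decay smooths the change between steps.
        new_scalings.append(alpha * lam_prev + (1.0 - alpha) * lam_hat)
    return new_scalings


# The first term has much weaker gradients and is scaled up more strongly.
new_scalings = lr_annealing_update(
    prev_scalings=[1.0, 1.0],
    residual_grad=[0.5, -2.0, 1.0],
    term_grads=[[0.01, -0.03], [0.2, 0.2]],
)
```

In the example the instantaneous ratios are 100 and 10, which illustrates the unboundedness discussed above: a term with very small gradients can be scaled up by several orders of magnitude.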

3.2.2 GradNorm

GradNorm (gradnorm) takes a different approach and makes the scalings $\lambda_i$ trainable. The updates on these trainable scalings are performed by a separate optimiser and chosen such that all terms improve at the same relative rate w.r.t. their initial loss. A term that improved at a higher rate since the beginning of training compared to the other terms gets a weaker scaling until all terms have made the same percentage progress. Therefore, one could argue that GradNorm weakly enforces each optimisation step to Pareto dominate (cf. definition 1) its predecessor. The loss for updating the scalings within GradNorm is computed as follows:

$$
\mathcal{L}_{\text{grad}}(t; \lambda) = \sum_{i=1}^{k} \left| G_i(t) - \bar{G}(t) \, \left[ r_i(t) \right]^{\alpha} \right|_1 ,
$$

where $G_i(t) = \left\| \nabla_\theta \lambda_i(t) \mathcal{L}_i(t) \right\|_2$ is the norm of the gradient w.r.t. the network's parameters $\theta$ for the scaled loss of objective $i$; $\bar{G}(t)$ is the average of all gradient norms; $r_i(t)$ defines the rate at which term $i$ improved so far; $\alpha$ is a hyperparameter representing the strength of the restoring force which pulls tasks back to a common training rate. Note that $\bar{G}(t) \left[ r_i(t) \right]^{\alpha}$ is the desirable value that $G_i(t)$ should take on, so gradients must be prevented from flowing through this expression. The final loss for updating the network's parameters is then simply a linear scalarisation with the scalings that were previously updated:

$$
\mathcal{L}(\theta) = \sum_{i=1}^{k} \lambda_i(t) \, \mathcal{L}_i(\theta) .
$$

This algorithm is fairly involved and, despite solving some of Learning Rate Annealing’s issues, it still requires a separate backward pass for each task, which becomes prohibitively expensive as $k$ gets large. Furthermore, it relies on two separate optimisation rounds at each step: one for adapting the scalings $\lambda_i$ and another for updating the weights $\theta$. By means of eq. 4, GradNorm can thus be formulated as a scalarised MOO objective via:


which in turn requires empirical hyperparameter tuning (learning rate, initialisation, etc.) to keep the system balanced - exactly the problem we are actually trying to solve through the use of adaptive loss balancing techniques.
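The value of the GradNorm loss for a given training state can be sketched as below. The sketch assumes that the per-term gradient norms have already been computed by separate backward passes and uses the relative inverse training rate as defined in the GradNorm paper, i.e. the loss ratio $\mathcal{L}_i(t) / \mathcal{L}_i(0)$ divided by its mean over all terms. The function name and the default `alpha=1.5` are hypothetical; in an actual training loop the targets would be treated as constants and the loss differentiated w.r.t. the scalings by the framework's automatic differentiation.

```python
def gradnorm_loss(grad_norms, losses, initial_losses, alpha=1.5):
    """Return (GradNorm loss, target gradient norms).

    grad_norms:     norm of the gradient of each scaled loss term
    losses:         current loss of each term
    initial_losses: loss of each term at the start of training
    alpha:          strength of the restoring force (assumed value)
    """
    k = len(grad_norms)
    mean_norm = sum(grad_norms) / k
    # Loss ratios: small values mean the term has improved a lot.
    ratios = [l / l0 for l, l0 in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / k
    rel_rates = [r / mean_ratio for r in ratios]
    # Targets are constants: no gradient should flow through them.
    targets = [mean_norm * r ** alpha for r in rel_rates]
    loss = sum(abs(g - t) for g, t in zip(grad_norms, targets))
    return loss, targets


# Term 2 improved much faster than term 1, so its target norm is lowered.
loss, targets = gradnorm_loss(
    grad_norms=[2.0, 2.0],
    losses=[0.5, 0.1],
    initial_losses=[1.0, 1.0],
    alpha=1.0,
)
```

When all terms have improved at the same rate and have equal gradient norms, the loss is zero and the scalings are left unchanged.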

3.2.3 SoftAdapt

Similar to GradNorm, SoftAdapt (softadapt) leverages the ansatz of relative progress in order to balance the loss terms. However, the authors relax it by only considering the previous time-step $t - 1$. The scalings are then normalised by using a softmax function:

$$
\lambda_i(t) = \frac{\exp\left( \frac{\mathcal{L}_i(t)}{\mathcal{L}_i(t-1)} \right)}{\sum_{j=1}^{k} \exp\left( \frac{\mathcal{L}_j(t)}{\mathcal{L}_j(t-1)} \right)}, \qquad i \in \{1, \dots, k\},
$$

where $\mathcal{L}_i(t)$ is the loss of term $i$ at optimisation step $t$.

SoftAdapt also differs from GradNorm in the sense that it does not require gradient statistics and thus eliminates the need of performing separate backward passes for each objective. Instead, it makes use of the fact that magnitudes in the gradients directly depend on the magnitudes of the terms in the loss function and therefore aims at achieving the balance solely through loss statistics. Of course, this is only true if the same loss function is used for every objective (e.g. the $L^2$ loss). However, this setting generalises to a vast majority of applications involving PINNs.
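A minimal sketch of such a loss-statistics-based update is given below. It assumes that relative progress is measured as the ratio of consecutive loss values, which is one possible reading of the ansatz described above; the original SoftAdapt publication also discusses variants based on loss differences. The function name and the `eps` safeguard are hypothetical.

```python
import math


def softadapt_scalings(losses, prev_losses, eps=1e-12):
    """Softmax over the ratios of current to previous loss values.

    A term whose loss decreased less (larger ratio) than the others
    receives a larger scaling. No gradients are required.
    """
    ratios = [l / (p + eps) for l, p in zip(losses, prev_losses)]
    # Subtracting the maximum keeps the exponentials numerically stable.
    shift = max(ratios)
    exps = [math.exp(r - shift) for r in ratios]
    total = sum(exps)
    return [e / total for e in exps]


# Term 1 only dropped to 90 % of its previous value, term 2 to 50 %,
# so term 1 receives the larger scaling in the next step.
scalings = softadapt_scalings(losses=[0.9, 0.5], prev_losses=[1.0, 1.0])
```

Since only scalar loss values enter the computation, the cost of this update is negligible compared to the backward passes required by Learning Rate Annealing and GradNorm.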

4 Relative Loss Balancing with Random Lookback (ReLoBRaLo)

Drawing inspiration from existing balancing techniques as outlined in sec. 3.2, we propose a novel method and implementation for balancing the multiple terms in the scalarised MOO loss function for training of PINNs. It builds upon the following ideas:

  • SoftAdapt’s balancing method is employed, using the rate of change between consecutive training steps and normalising them through a softmax function.

  • similarly to Learning Rate Annealing, the scalings are updated using an exponential decay in order to utilise loss statistics from more than just one training step in the past.

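  • in addition, a random lookback (called saudade, $\rho$) is introduced into the exponential decay, which decides whether to use the previous steps’ loss statistics to compute the scalings, or whether to look all the way back until the start of training.

$$
\begin{aligned}
\lambda_i^{\text{bal}}(t, t') &= k \cdot \frac{\exp\left( \frac{\mathcal{L}_i(t)}{\mathcal{T} \, \mathcal{L}_i(t')} \right)}{\sum_{j=1}^{k} \exp\left( \frac{\mathcal{L}_j(t)}{\mathcal{T} \, \mathcal{L}_j(t')} \right)}, \qquad i \in \{1, \dots, k\}, \\
\lambda_i^{\text{hist}}(t) &= \rho \, \lambda_i(t-1) + (1 - \rho) \, \lambda_i^{\text{bal}}(t, 0), \\
\lambda_i(t) &= \alpha \, \lambda_i^{\text{hist}}(t) + (1 - \alpha) \, \lambda_i^{\text{bal}}(t, t-1),
\end{aligned}
$$

where $k$ is the number of loss terms, $\alpha$ is the exponential decay rate and $\mathcal{T}$ is the temperature of the softmax. $\rho$ is a Bernoulli random variable and $\mathbb{E}[\rho]$ should be chosen close to 1.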

This method is an attempt at combining the best attributes of the aforementioned approaches into a new heuristic for scalarised MOO objective functions. First and foremost, it still weakly enforces every training step to Pareto dominate its predecessor, which is an important property in physical applications. It also avoids using gradient statistics, making it considerably more efficient than Learning Rate Annealing and GradNorm. Furthermore, it reduces drastic changes in the loss space by using exponential decay and can easily be adapted to use more or less information of past optimisation steps by tuning the hyperparameter $\alpha$. One can think of $\alpha$ as being the model’s ability to remember the past, with a high $\alpha$ giving lots of weight to past loss statistics, while a lower $\alpha$ increases stochasticity. Setting $\alpha = 1$ results in each term’s relative progress being computed w.r.t. the initial loss $\mathcal{L}_i(0)$. However, we found this to be too restrictive, since it causes the model to stop making progress as soon as one term reaches a local minimum. We chose values between 0.9 and 0.999 and report the effects of varying this hyperparameter in sec. 7. Note that $\alpha = 0$ and $\mathcal{T} = 1$ make Relative Loss Balancing equivalent to SoftAdapt (up to the constant factor $k$).

Choosing the value of $\alpha$ also requires a trade-off: a high value means the model will remember potential deteriorations of certain terms for longer and therefore leave a longer time-frame in order to compensate for them. However, it also induces a latency between a term starting to deteriorate and the scalings reacting accordingly. We therefore study the effect of introducing the saudade Bernoulli random variable $\rho$ that causes the model to occasionally look back until the start of training. $\mathbb{E}[\rho] = 0$ is maximum saudade as it always takes the loss value of the initial training step, while $\mathbb{E}[\rho] = 1$ corresponds to minimum saudade, taking only into account the last value from the history of the $i$-th scaling factor. Selecting $\mathbb{E}[\rho]$ somewhere between 0 and 1 allows a lower value to be set for $\alpha$, thus making the model more flexible while still occasionally "reminding" it of the progress made since the start of training. Furthermore, the random lookback can give episodic new impulses and let the model escape local minima by changing the loss space, as well as inciting it to explore more of the parameter space. In case the impulse turns out to have a negative effect on the accuracy, one can still choose to roll back and reset the network’s parameters to the previous state.

The last hyperparameter is the so-called temperature $\mathcal{T}$. Setting $\mathcal{T} \to \infty$ recalibrates the softmax to output uniform values and thus all $\lambda_i = 1$. On the other hand, $\mathcal{T} \to 0$ essentially turns the softmax into an argmax function, with the scaling $\lambda_i = k$ resulting for the term with the lowest relative progress and $\lambda_i = 0$ for all others.

A pedagogical example with interpretation of expected behaviours and how to draw conclusions from the histories of the scalings is given for Burgers’ equation in sec. 6.1.
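The update described in this section can be sketched in a few lines of framework-independent Python. This is a minimal illustration under stated assumptions rather than the published implementation: the scalings are initialised to 1, relative progress is the ratio of loss values divided by the temperature, the softmax output is multiplied by the number of terms, and the class name, the default hyperparameter values and the `eps` safeguard are hypothetical. In a training loop, the returned scalings would be treated as constants when computing the scalarised loss.

```python
import math
import random


def balance(losses, ref_losses, temperature, eps=1e-12):
    """k * softmax of loss ratios w.r.t. reference losses."""
    k = len(losses)
    z = [l / (temperature * r + eps) for l, r in zip(losses, ref_losses)]
    shift = max(z)  # for numerical stability of the exponentials
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [k * e / total for e in exps]


class ReLoBRaLo:
    """Relative Loss Balancing with Random Lookback (sketch)."""

    def __init__(self, num_terms, alpha=0.999, temperature=1.0,
                 expected_rho=0.999, seed=0):
        self.alpha = alpha                  # exponential decay rate
        self.temperature = temperature      # softmax temperature
        self.expected_rho = expected_rho    # E[rho] of the Bernoulli lookback
        self.rng = random.Random(seed)
        self.scalings = [1.0] * num_terms
        self.initial_losses = None
        self.prev_losses = None

    def update(self, losses):
        """Return the scalings for the current step's loss values."""
        losses = list(losses)
        if self.initial_losses is None:
            # First step: store the statistics, keep uniform scalings.
            self.initial_losses = losses
            self.prev_losses = losses
            return list(self.scalings)
        # rho = 1: keep the history; rho = 0: look back to the start.
        rho = 1.0 if self.rng.random() < self.expected_rho else 0.0
        bal_prev = balance(losses, self.prev_losses, self.temperature)
        bal_init = balance(losses, self.initial_losses, self.temperature)
        hist = [rho * lam + (1.0 - rho) * b0
                for lam, b0 in zip(self.scalings, bal_init)]
        self.scalings = [self.alpha * h + (1.0 - self.alpha) * b
                         for h, b in zip(hist, bal_prev)]
        self.prev_losses = losses
        return list(self.scalings)


# Usage: three loss terms, scalings recomputed at every step.
balancer = ReLoBRaLo(num_terms=3, alpha=0.9, temperature=0.1,
                     expected_rho=0.5, seed=1)
history = [balancer.update(step_losses)
           for step_losses in ([1.0, 2.0, 3.0], [0.9, 1.0, 2.9],
                               [0.8, 0.9, 2.9], [0.8, 0.5, 2.8])]
```

By construction, the scalings returned by this sketch always sum to the number of terms, so the overall magnitude of the scalarised loss stays comparable to that of the unweighted sum.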

5 Hyperparameter Tuning and Meta Learning

This paper uses grid search in combination with Bayesian Optimisation (BO) (gridsearch; bayesian_optimization_snoek; bayesian_opt_implementation) for hyperparameter tuning. The tuned hyperparameters define the NN architecture (number of hidden layers and neurons per layer) and the training settings (learning rate, exponential decay rate and saudade). Tab. 1 contains the ranges and distributions for the hyperparameters. BO reduces the empiricism of selecting the PINNs hyperparameters to learn an optimal NN structure. First, 20 random points in the hyperparameter space are sampled and evaluated. The model’s performance at those points serves as evidence for fitting Gaussian Processes that estimate the unknown loss function w.r.t. the hyperparameters. Using Expected Improvement (EI) (expected_improvement), a further 80 points are then sampled and evaluated to refine the estimate. This procedure provides an educated guess as to which hyperparameters are optimal for the task at hand. Finally, we fine-tune the results by performing a fine-grained grid search around the hyperparameters returned by BO.
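The acquisition step can be sketched with the textbook closed form of Expected Improvement for a minimisation problem. This assumes a Gaussian predictive distribution at each candidate point; which EI variant the cited implementation uses is not stated here, and the function name is illustrative.

```python
import math

def expected_improvement(mu, sigma, best_loss):
    """EI for minimisation under a Gaussian prediction N(mu, sigma^2)."""
    gain = best_loss - mu              # predicted improvement over the incumbent
    if sigma <= 0.0:
        return max(gain, 0.0)          # no uncertainty: improvement is deterministic
    z = gain / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + sigma * pdf
```

Candidates with a low predicted loss or a high predictive uncertainty both receive a large EI, which is what lets the further evaluations balance exploitation and exploration.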

| Hyperparameter | Range | Log-scaling |
|---|---|---|
| Learning Rate | [] | yes |
| Layers | [2, 4] | no |
| Neurons per Layer | [32, 512] | no |
| Exponential Decay Rate | [0, 1] | no |
| Temperature | [] | yes |
| Expected Saudade | [0, 1] | no |
| Activation function | {tanh, sigmoid} | no |

Table 1: Hyperparameters for architecture and training settings together with ranges as used for Bayesian Optimisation

Within this study, the exact same BO configuration was used for all examples presented in sec. 6, hence it is sufficient to only display tab. 1. Respective results of the Grid Search and BO can also be found in sec. 6.

6 Results

We test the different balancing techniques on three problems (Burgers’ equation, Kirchhoff plate bending and Helmholtz’s equation) originating from physics-informed deep learning, where the objective function consists of various terms of potentially considerably different magnitudes, and compare their performances as well as their computational efficiency. Training was done on networks of varying depth and width (acc. to tab. 1) and limited to a fixed number of gradient descent (GD) steps using the Adam optimiser (adam) and an initial learning rate of 0.001. Additionally, we reduced the learning rate by a multiplicative factor of 0.1 whenever the optimisation stopped making progress for over 3’000 optimisation steps and finally used early stopping in case of 9’000 steps without improvement. When addressing the inverse problem, i.e. approximating a set of measurements while subjecting the network to PDE constraints for finding unknown PDE parameters, we further investigated the payoff of using two separate optimisers: one for updating the network weights and a separate one for updating the PDE parameters. Further details on hyperparameter tuning and meta learning are given in secs. 5 and 7.
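The learning-rate reduction and early-stopping rule can be sketched as a small framework-independent helper. The class name and the choice not to reset the stall counter after a reduction are assumptions; the text only fixes the factor 0.1 and the thresholds of 3’000 and 9’000 steps.

```python
class PlateauSchedule:
    """Reduce the learning rate on plateaus and signal early stopping."""

    def __init__(self, lr=1e-3, factor=0.1, patience=3000, stop_patience=9000):
        self.lr = lr
        self.factor = factor
        self.patience = patience            # steps without progress before a reduction
        self.stop_patience = stop_patience  # steps without progress before stopping
        self.best = float("inf")
        self.steps_since_best = 0

    def update(self, loss):
        """Record one optimisation step; return True when training should stop."""
        if loss < self.best:
            self.best = loss
            self.steps_since_best = 0
            return False
        self.steps_since_best += 1
        if self.steps_since_best % self.patience == 0:
            self.lr *= self.factor          # assumed: reduce again every `patience` stalled steps
        return self.steps_since_best >= self.stop_patience
```

In a training loop, `update` would be called once per step with the monitored loss, and `lr` would be passed on to the optimiser whenever it changes.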

6.1 Burgers’ Equation

Burgers’ equation is a one-dimensional Navier-Stokes equation used, among other things, to model shock waves, gas dynamics or traffic flow (burgers_pde). Using Dirichlet boundary conditions, the PDE takes the following form:
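In its standard viscous form with generic Dirichlet data, the problem can be written as follows. The symbols $u_0$, $g_L$, $g_R$ and the domain bounds $x_L$, $x_R$ are assumed placeholder notation; the specific initial and boundary values used in the experiments are not restated in this sketch.

```latex
\begin{aligned}
\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} - \nu \frac{\partial^2 u}{\partial x^2} &= 0, \\
u(0, x) &= u_0(x), \\
u(t, x_L) &= g_L(t), \\
u(t, x_R) &= g_R(t).
\end{aligned}
```

The governing equation, the initial condition and the two boundary conditions correspond to the four loss terms (PDE, IC, BC 1 and BC 2) of the forward problem.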


At first, we investigate the solution of the forward problem, where the PDE parameter is set to a fixed value. In order to find the latent function, we parameterise it with a neural network and turn the set of equations into a linear scalarised objective (cf. eq. 5) of Mean Squared Errors (MSE). This loss function weakly enforces that the network approximates the PDE solution.


For the forward problem, PINNs training uses the following loss function:
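A minimal sketch of such a linear scalarised MSE objective is given below. It assumes that the residuals of the governing equation, the initial condition and the two boundary conditions have already been evaluated at their sample points; the function name and the dictionary keys are illustrative.

```python
import numpy as np

def scalarised_mse_loss(residuals, scalings):
    """Weighted sum of per-term mean squared residuals.

    residuals: dict mapping a term name to an array of residual values
    scalings:  dict mapping the same names to scalar weights
    """
    terms = {name: float(np.mean(np.square(r))) for name, r in residuals.items()}
    total = sum(scalings[name] * value for name, value in terms.items())
    return total, terms
```

Returning the unweighted terms alongside the total is convenient because a loss balancing method needs the individual values to update its scalings.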


After successful convergence of PINNs training using ReLoBRaLo, we obtain the results displayed in fig. 2(b); the final algorithm settings are reported in tab. 8. As there is no analytical solution available for the Burgers equation, we computed a reference solution using the finite element method (FEM), displayed in fig. 2(a). A plot of the squared difference between the FEM and PINNs solutions is shown in fig. 2(c); the maximum relative error is below 5%.

(a) FEM-Result (b) PINN-Result (c) Squared Error
Figure 2: Burgers’ equation problem: (a) FEM reference solution, (b) PINNs results predicted with a fully-connected network consisting of two layers and 128 nodes each, and (c) squared error.

However, Burgers’ equation can also be turned into an inverse problem by regarding the PDE parameter as an unknown to be estimated from a set of observations (i.e. data) over the spatial and temporal domain. In this setting, the PINNs-induced loss function reads:


Similar to the network’s weights and biases, the additional trainable PDE variable (here the viscosity) is now also updated through gradient descent:
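In generic notation, a plain gradient descent step on both sets of trainable quantities can be written as below. The symbols $\theta$ (network parameters), $\nu$ (viscosity), $\eta$ (learning rate) and $\mathcal{L}$ (total loss) are assumed notation for this sketch, and Adam replaces the raw gradient by its moment-normalised estimate.

```latex
\begin{aligned}
\theta^{(t+1)} &= \theta^{(t)} - \eta \, \nabla_{\theta} \mathcal{L}\big(\theta^{(t)}, \nu^{(t)}\big), \\
\nu^{(t+1)} &= \nu^{(t)} - \eta \, \frac{\partial \mathcal{L}}{\partial \nu}\big(\theta^{(t)}, \nu^{(t)}\big).
\end{aligned}
```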


Measurement data for the inverse problem setting were obtained from our reference solution computed using the FEM without addition of noise. At every iteration, we sample from the available data in order to generate a batch of collocation points.

Figure 3: Median of the log loss over multiple training runs (i/left) of Burgers’ equation and the mean and variance of the corresponding scaling factors (ii/right) computed with ReLoBRaLo. Panels (a) to (d) show different hyperparameter settings.

Fig. 3 shows the scaling factors of our ReLoBRaLo method with varying hyperparameters for Burgers’ equation. As can be expected, a larger exponential decay rate leads to a smoother curve because past loss statistics are carried along for longer, which counters the stochasticity that might arise at every optimisation step. The temperature, on the other hand, influences the magnitude of the scalings. Another general tendency in the plots of the scaling factors is that the relatively lower loss contributions (here: BC 1 and BC 2) correspond to higher scaling values (potentially greater than 1), while the larger loss contributions (here: PDE and IC) correspond to lower scaling values (potentially less than 1). Note that, whenever "log" is used in this and all subsequent figures, we refer to the natural logarithm of that quantity.

| Burgers | | Manual | GradNorm | LR anneal. | SoftAdapt | ReLoBRaLo |
|---|---|---|---|---|---|---|
| Forward | train | 5.5 | 6.6 | 9.9 | 4.5 | 5.6 |
| | val | 1.2 | 2.0 | 1.6 | 1.8 | 1.4 |
| | std val | 5.7 | 2.1 | 2.3 | 3.2 | 6.8 |
| Inverse | val | 1.90 | 6.82 | 2.45 | 1.07 | 2.17 |
| | std | 1.23 | 5.07 | 3.44 | 4.27 | 2.11 |

Table 2: Comparison of the median training and validation loss on Burgers’ equation. The reported values are the median over four independent runs with identical settings. Additionally, we report the standard deviation over the runs of the best performing model on the validation loss.

The relatively small variances across training runs suggest that the optimisation progress follows similar patterns, even when varying the depth and width of the network. Therefore, these values can provide valuable insight into the training and help identify possibilities for improving the model. For example, the fact that the scaling for the governing equation has the largest value after 50,000 epochs indicates that it was the first term to stop making progress. On the other hand, fig. 6 shows that the opposite is true for Helmholtz’s and Kirchhoff’s equations, where the boundary conditions have more difficulties making progress. This knowledge can support informed decisions to improve the framework, e.g. by adapting the activation functions, the loss function or the model’s architecture accordingly.

Tab. 2 summarises the performances of the different balancing techniques against a baseline for the forward and inverse problem settings, where we manually chose the optimal scalings through grid search. As can be observed, the adaptive scaling techniques perform similarly well to the baseline, with Learning Rate Annealing and ReLoBRaLo reaching a considerably lower validation error. The results show that either one of these methods greatly reduces the amount of work required for hyperparameter search, while still achieving great results with high probability.

Convergence of PINNs Parameter Estimation for Burgers’ equation
Figure 4: Approximation of the true PDE parameter value (dashed line) for the inverse problem setting of Burgers’ equation. Reported values are the mean (solid line) and standard deviation (shaded area) of four independent computational runs.

Besides accuracy, computational efficiency is another important metric for evaluating adaptive loss balancing methods. Because ReLoBRaLo is designed to require only one backward pass, its computational overhead can be expected to be relatively small compared to GradNorm and Learning Rate Annealing, which both utilise gradient statistics and hence separate backward passes for each term. Indeed, tab. 3 shows that Burgers’ equation with its four terms in the loss function can be solved by ReLoBRaLo about 40% faster than with Learning Rate Annealing and 70% faster than with GradNorm, which adds to the efficiency and sustainability of PINNs training. Note that the reported values in tab. 3 stem from tasks where the balancing operation was performed at every optimisation step. Both GradNorm and Learning Rate Annealing can be made more efficient by updating the scaling terms only once every fixed number of iterations. However, this introduces a trade-off between flexibility and efficiency and therefore an additional, very sensitive hyperparameter with a high impact on the method’s accuracy and efficiency. ReLoBRaLo, on the other hand, adapts its scalings at every iteration at very low computational cost.

| | Manual | GradNorm | LR annealing | SoftAdapt | ReLoBRaLo |
|---|---|---|---|---|---|
| Time (s) per 1’000 it. | 3.7 | 14.3 | 7.2 | 4.1 | 4.3 |

Table 3: Median execution times (in s) per 1’000 optimisation steps for different balancing methods on Burgers’ equation.
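The quoted relative savings follow directly from the execution times in tab. 3, as the short calculation below shows. The dictionary simply restates the table; the helper name is illustrative.

```python
# Median execution times in seconds per 1'000 optimisation steps (tab. 3).
times_s = {
    "Manual": 3.7,
    "GradNorm": 14.3,
    "LR annealing": 7.2,
    "SoftAdapt": 4.1,
    "ReLoBRaLo": 4.3,
}

def time_saving(method, reference, times=times_s):
    """Fraction of the reference method's run time saved by `method`."""
    return 1.0 - times[method] / times[reference]

for reference in ("LR annealing", "GradNorm"):
    saving = time_saving("ReLoBRaLo", reference)
    print(f"ReLoBRaLo vs {reference}: {saving:.0%} less time")
```

The same numbers also give the overhead of ReLoBRaLo relative to manual scaling, 4.3 / 3.7, i.e. roughly 16% more time per step.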

It is worth noting that, while the forward problem induces a loss function consisting of four terms, the inverse problem requires only two terms (eq. 15). Hence, selecting the scalings manually is significantly less time-consuming than for the forward problem. Consequently, tab. 2 shows that the baseline was harder to outperform, with Learning Rate Annealing (LR annealing) and ReLoBRaLo being the only methods yielding better results. However, Learning Rate Annealing approximates the true value of the PDE parameter significantly faster than ReLoBRaLo and is therefore the optimal choice for this particular problem setting, cf. fig. 4. Further conclusions and comparisons across different loss balancing methods are made in sec. 7.

(a) Analytical Result (b) PINN-Result (c) Squared Error
Figure 5: Kirchhoff plate bending problem: (a) analytical reference solution, (b) PINNs results predicted with a fully-connected network consisting of three layers and 128 nodes each, and (c) squared error.

6.2 Kirchhoff Plate Bending Equation

The Kirchhoff–Love theory of plates arose from civil and mechanical engineering and consists of a two-dimensional mathematical model used to determine stresses and deformations in thin plates subjected to forces and moments (bathe1996finite). The Kirchhoff plate bending problem assumes that a mid-surface plane can be used to represent a three-dimensional plate in two-dimensional form; together with a linear elastic material, a fourth-order PDE can be derived to describe its mechanical behaviour:


where the forcing term is the load acting on the plate at the given in-plane coordinates, and the plate’s flexural stiffness is computed from Young’s modulus, the plate’s thickness and Poisson’s ratio. The Kirchhoff plate bending problem poses severe difficulties for FEM solutions (bathe1996finite), yet analytical solutions can be inferred, e.g. using Fourier series, for special cases such as an applied sinusoidal load:
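For orientation, the classical single-term Navier solution for a simply supported rectangular plate can be sketched as follows. It assumes a plate occupying [0, width] x [0, length] and a load of the form base_load * sin(pi x / width) * sin(pi y / length); the exact load and geometry of the study are not restated here, and all function and argument names are illustrative.

```python
import math

def flexural_stiffness(youngs_modulus, thickness, poisson_ratio):
    """Plate flexural stiffness D = E h^3 / (12 (1 - nu^2))."""
    return youngs_modulus * thickness**3 / (12.0 * (1.0 - poisson_ratio**2))

def navier_deflection(x, y, width, length, base_load, stiffness):
    """Deflection under the assumed single sinusoidal load (Navier solution)."""
    amplitude = base_load / (
        math.pi**4 * stiffness * (1.0 / width**2 + 1.0 / length**2) ** 2
    )
    return amplitude * math.sin(math.pi * x / width) * math.sin(math.pi * y / length)
```

Because the deflection is a product of sines, both the displacement and its second derivatives vanish along the edges, which is exactly what simply supported boundary conditions require.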


In this paper we consider a concrete plate, characterised by its width, length, base load, Young’s modulus, plate height and Poisson’s ratio, with simply supported edges as they arise in typical civil engineering structures such as slabs. We hence consider the following boundary conditions (BC):


where the two moment terms are the bending moments, computed from the second derivatives of the displacement.

In total, we obtain 8 boundary conditions and therefore 9 terms in the PINNs loss function, making this a challenging task for balancing the contributions of the various objectives:

(i) PINNs Convergence (ii) ReLoBRaLo Loss Scaling
Figure 6: Median of the log loss over multiple training runs (i) and the mean and variance of the corresponding scaling factors (ii) computed with ReLoBRaLo on Kirchhoff’s equation. For the sake of readability, the boundary conditions on the displacements and those on the moments were each aggregated by taking the mean value.

After successful convergence of PINNs training, we obtain the results displayed in fig. 5(b) and compare them to the analytically available solution displayed in fig. 5(a). A plot of the squared difference between the analytical and PINNs results is shown in fig. 5(c) and reveals a negligible maximum error. The final algorithm settings are reported in tab. 8.

Fig. 6 shows an example of ReLoBRaLo’s training process on Kirchhoff’s equation. In this particular example, one can notice the larger variance of scaling values towards the end of training. Also, the scalings did not converge towards the value 1, which suggests that training stopped before all terms had stopped making progress: the scalings for the boundary conditions on the moments (yellow) were increasing at the end of training, while those for the boundary conditions on the displacements (red) were decreasing. This gives a strong indication as to where the model’s limitations lie. In this case, additional attention should be paid to the moments, e.g. by selecting an activation function which is better behaved in the second derivative than tanh (siren).

| Kirchhoff | | Manual | GradNorm | LR anneal. | SoftAdapt | ReLoBRaLo |
|---|---|---|---|---|---|---|
| Forward | train | 1.2 | 5.3 | 9.1 | 2.0 | 6.0 |
| | val | 1.3 | 1.7 | 2.7 | 4.2 | 4.0 |
| | std val | 3.9 | 2.2 | 1.0 | 4.7 | 7.7 |
| Inverse | val | 2.13 | 3.60 | 5.99 | 9.53 | 3.23 |
| | std | 1.56 | 4.72 | 0.80 | 4.58 | 2.91 |

Table 4: Comparison of the median training and validation loss on Kirchhoff’s equation. The reported values are the median over four independent runs with identical settings. Additionally, we report the standard deviation over the runs of the best performing model on the validation loss.

Convergence of PINNs Parameter Estimation for Kirchhoff’s equation
Figure 7: Approximation of the true PDE parameter value (dashed line) for the inverse problem setting of Kirchhoff plate bending. Reported values are the mean (solid line) and standard deviation (shaded area) of four independent runs.

Concerning performance, ReLoBRaLo outperforms the baseline and the other algorithms by almost an order of magnitude in accuracy, while also yielding a very small standard deviation and hence being very consistent across training runs (cf. tab. 4). The results show its effectiveness, even on Kirchhoff’s challenging problem with a total of 9 terms (cf. eq. 20). Furthermore, the execution times in tab. 5 underline the efficiency benefit (up to sixfold speedup) of balancing the loss without gradient statistics, as separate backward passes for each term become increasingly expensive as the number of terms in the loss function grows. Further conclusions and comparisons across different loss balancing methods are made in sec. 7.

| | Manual | GradNorm | LR annealing | SoftAdapt | ReLoBRaLo |
|---|---|---|---|---|---|
| Time (s) per 1’000 it. | 17.3 | 128.6 | 139.7 | 20.2 | 22.5 |

Table 5: Median execution times (in s) per 1’000 optimisation steps for different balancing methods on Kirchhoff’s equation.

For the inverse Kirchhoff problem setting, we select the flexural stiffness as the PDE parameter to be learned from given data, which we obtained by sampling from the analytically known solution. More specifically, we initialised the parameter far from its true value and tasked the network with approximating the latter. Given the large disparity between the initialisation and the target, we empirically found the use of two separate optimisers beneficial in this case, where one optimiser is used for updating the network’s parameters and a different one for updating the PDE parameter. Unlike for Burgers’ equation, ReLoBRaLo also sets a new benchmark in Kirchhoff’s inverse problem, both in accuracy and in convergence speed, cf. fig. 7 and tab. 5.

6.3 Helmholtz equation

The Helmholtz equation represents a time-independent form of the wave equation and arises in many physical and engineering problems such as acoustics and electromagnetism (pde_sommerfeld). The equation has the form:


where the scalar parameter is the wave number. This is a common benchmark problem for PINNs and possesses an analytical solution in combination with Dirichlet boundaries:
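A common way to obtain such a benchmark is a manufactured solution, sketched below under explicit assumptions: a product-of-sines solution on the square [-1, 1] x [-1, 1], a residual convention of Laplacian plus squared wave number times u minus a source term, and illustrative values for `a1`, `a2` and `k`. None of these specifics are taken from this section.

```python
import numpy as np

def u_exact(x, y, a1=1, a2=4):
    """Assumed manufactured solution; vanishes on the boundary of [-1, 1]^2."""
    return np.sin(a1 * np.pi * x) * np.sin(a2 * np.pi * y)

def source_term(x, y, k=1.0, a1=1, a2=4):
    """Source q such that laplace(u) + k^2 u - q = 0 holds for u_exact."""
    return (k**2 - (a1 * np.pi) ** 2 - (a2 * np.pi) ** 2) * u_exact(x, y, a1, a2)

def helmholtz_residual(u, x, y, k=1.0, h=1e-3):
    """Finite-difference PDE residual of a candidate solution u at (x, y)."""
    laplacian = (
        u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h) - 4.0 * u(x, y)
    ) / h**2
    return laplacian + k**2 * u(x, y) - source_term(x, y, k)
```

In a PINN, the finite-difference Laplacian would be replaced by automatic differentiation of the network output, and the mean squared residual over sampled points would form the governing-equation term of the loss.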

(a) Analytical Result (b) PINN-Result (c) Squared Error
Figure 8: Helmholtz’s problem: (a) analytical reference solution, (b) PINNs results predicted with a fully-connected network consisting of two layers and 128 nodes each, and (c) squared error.

Both input variables are bounded below by -1 and above by 1. Therefore, the boundary conditions add four terms to the loss function of the forward problem, resulting in a 5-term total physics-informed loss:


where the latent function is parameterised by a neural network with trainable parameters.

(a) GradNorm Result (b) LR Annealing Result
Figure 9: Median of the log loss over multiple training runs on Helmholtz’s equation using (a) GradNorm and (b) Learning Rate Annealing loss balancing methods.

After successful convergence of PINNs training, we obtain the results displayed in fig. 8(b) and compare them to the analytically available solution displayed in fig. 8(a). A plot of the squared difference between the analytical and PINNs results is shown in fig. 8(c) and reveals a negligible maximum error. The final algorithm settings are reported in tab. 8.

(a) PINNs Convergence (b) ReLoBRaLo Loss Scaling
Figure 10: Median of the log loss over multiple training runs (a) and the mean and variance of the corresponding scaling factors (b) computed with ReLoBRaLo on Helmholtz’s equation.

The Helmholtz equation reveals a limitation of our basic loss balancing approach and motivates the introduction of the random lookback. GradNorm and Learning Rate Annealing both achieve impressive results and substantially outperform the baseline as well as ReLoBRaLo without random lookback in terms of accuracy for the BC terms, cf. fig. 9. This is likely due to the considerable initial difference in magnitudes between the governing equation and the boundary conditions. Furthermore, the high values of the exponential decay rate, necessary for "remembering" the deteriorations longer, induce a latency between the increase of a term’s loss and the reaction of its scaling (cf. fig. 10). On the other hand, GradNorm and Learning Rate Annealing do not succeed in decreasing the error as much as ReLoBRaLo for the governing equation term. Fig. 9 shows that both GradNorm and Learning Rate Annealing focus on improving the boundary conditions right from the beginning of training, whereas ReLoBRaLo without random lookback counters the initial deterioration, but eventually "forgets" it and instead focuses on the more dominant governing equation. This is also reflected in the discrepancy between the training and validation loss: GradNorm and Learning Rate Annealing have a higher training loss than ReLoBRaLo, but still excel at approximating the underlying function (cf. tab. 6). This triggered further investigation of the saudade and temperature parameters, as described in the next section.

| Helmholtz | | Manual | GradNorm | LR anneal. | SoftAdapt | ReLoBRaLo |
|---|---|---|---|---|---|---|
| Forward | train | 1.4 | 7.1 | 2.7 | 4.9 | 4.7 |
| | val | 7.1 | 5.6 | 1.4 | 8.4 | 2.6 |
| | std val | 8.1 | 1.9 | 7.6 | 7.3 | 8.2 |
| Inverse | val | 2.68 | 1.48 | 5.10 | 7.8 | 3.65 |
| | std | 4.95 | 3.55 | 7.21 | 1.9 | 2.54 |

Table 6: Comparison of the median training and validation loss on Helmholtz’s equation. The reported values are the median over four independent runs with identical settings. Additionally, we report the standard deviation over the runs of the best performing model on the validation loss.

| | Manual | GradNorm | LR annealing | SoftAdapt | ReLoBRaLo |
|---|---|---|---|---|---|
| Time (s) per 1’000 it. | 4.8 | 10.4 | 6.7 | 5.0 | 5.2 |

Table 7: Median execution times (in s) per 1’000 optimisation steps for different balancing methods on Helmholtz’s equation.

Convergence of PINNs Parameter Estimation for Helmholtz’s equation
Figure 11: Approximation of the true PDE parameter value (dashed line) in the inverse problem setting of Helmholtz’s equation. Reported values are the mean (solid line) and standard deviation (shaded area) of four independent runs.

For the inverse Helmholtz problem setting, we select the wave number as the parameter to be learned from given data, which we obtained by sampling from the analytically known solution. The wave number was initialised away from its true value and the network was tasked with approximating the latter. As in the inverse Burgers problem, a single optimiser was used to update the network’s parameters together with the PDE parameter. ReLoBRaLo also sets a new benchmark for Helmholtz’s inverse problem, both in accuracy and in convergence speed, cf. fig. 11 and tab. 7.

| Hyperparameter | Burgers | Kirchhoff | Helmholtz |
|---|---|---|---|
| Learning Rate | | | |
| Layers | 4 | 4 | 2 |
| Neurons per Layer | 256 | 360 | 256 |
| Exponential Decay Rate | 0.999 | 0.999 | 0.99 |
| Expected Saudade | 0.9999 | 0.9999 | 0.99 |
| Activation function | tanh | tanh | tanh |

Table 8: Final choices of hyperparameters for architecture and training settings

7 Ablation and Sensitivity Study, Discussion and Conclusions

The proposed ReLoBRaLo loss balancing approach, together with the PINNs architecture, incorporates many hyperparameters which potentially influence performance, efficiency and accuracy. In order to investigate these tendencies further, this section presents an ablation and sensitivity study w.r.t. the temperature, the exponential decay rate and the expected saudade for the three presented PDE examples. After showing the results of these investigations, we discuss the findings obtained so far with our novel method and draw more general conclusions from the content of this paper.

(a) Burgers’ (b) Kirchhoff’s (c) Helmholtz’s
Figure 12: Ablation of the model’s performance when varying the exponential decay rate and the temperature. The reported values are the median of the log loss over multiple training runs.

Fig. 12 visualises the models’ sensitivity to the exponential decay rate and the temperature. A larger decay rate causes the network to "remember" longer, while the temperature controls how strongly the scalings diverge from one another. Fig. 12(c) shows that Helmholtz’s equation benefits most from small values of the temperature, which make the balancing more aggressive. This is in line with the findings in the previous section (cf. sec. 6.3), where we noted that the large difference in magnitudes between the terms in the loss function caused issues for ReLoBRaLo and that resolute balancing was necessary to prevent the boundary conditions from being neglected (cf. tab. 8 for the final settings). On the other hand, Burgers and Kirchhoff require smoother scalings, with a tendency towards higher decay rates and temperatures. It is worth noting that all three tasks benefit from the relaxation through the exponential decay, as disabling it with a decay rate of 0 always causes a deterioration of the model’s performance.

However, the relaxation through the exponential decay induces a new trade-off between making the model remember longer and letting it adapt quickly to changes during training. We therefore study the effects of a random lookback through a Bernoulli random variable (saudade). It allows a lower decay rate to be set, thus making the model more flexible, while occasionally "reminding" it of its progress since the start of training.

| Expected saudade | Helmholtz | Burgers | Kirchhoff |
|---|---|---|---|
| 0.0 | 2.0 | 1.0 | 1.5 |
| 0.5 | 5.2 | 9.5 | 2.7 |
| 0.9 | 4.0 | 1.3 | 2.1 |
| 0.99 | 2.6 | 4.9 | 6.9 |
| 0.999 | 4.1 | 3.8 | 5.6 |
| 0.9999 | 1.2 | 1.4 | 4.0 |
| 1 | 8.1 | 4.7 | 7.4 |

Table 9: Validation loss when varying the expected value of the saudade. The reported values are the median over three independent runs.

Tab. 9 summarises the change in performance when varying the expected saudade on all three experiments. It is apparent that Helmholtz benefits more from frequent lookbacks, as it reaches its best performance at an expected saudade of 0.99 (cf. tab. 8), whereas Burgers and Kirchhoff only require an expected lookback every 10’000 optimisation steps. Figs. 13 and 3 illustrate the effect of random lookbacks. While the stochasticity in the scaling factors increases, making them less interpretable, the lookbacks increase the weight on the boundary conditions. The scaled contribution of the boundary conditions consequently leads to a better approximation of the underlying function.
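The relation between the expected saudade and the lookback frequency can be made explicit with a short calculation. It assumes that a lookback is triggered whenever the Bernoulli draw equals 0, i.e. with probability one minus the expected saudade, so that the waiting time is geometrically distributed; the function name is illustrative.

```python
def expected_steps_between_lookbacks(expected_saudade):
    """Mean number of optimisation steps between two random lookbacks."""
    p_lookback = 1.0 - expected_saudade       # probability of a lookback per step
    if p_lookback <= 0.0:
        return float("inf")                   # saudade of 1: never look back
    return 1.0 / p_lookback                   # mean of a geometric distribution

for value in (0.99, 0.999, 0.9999):
    steps = expected_steps_between_lookbacks(value)
    print(f"expected saudade {value}: one lookback every ~{steps:.0f} steps")
```

An expected saudade of 0.9999 thus corresponds to one lookback roughly every 10’000 steps, and 0.99 to one roughly every 100 steps.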

(a) PINNs Convergence (b) ReLoBRaLo Loss Scaling
Figure 13: Example of a single training process of ReLoBRaLo on Helmholtz’s equation with random lookbacks.

It is worth noting that the addition of the random lookback improves the accuracy on Helmholtz’s equation by more than an order of magnitude and by almost one order of magnitude for Burgers’ and Kirchhoff’s equations.

8 Synopsis and Outlook

From previous work we observe that a competitive relationship between the physics loss terms in the training of PINNs exists and potentially spoils training success, performance or efficiency. This paper investigated different methods aiming at adaptively balancing a loss function consisting of various, potentially conflicting objectives as it may arise in scalarised MOO in PINNs. We proposed a novel adaptive loss balancing method by (i) combining the best attributes of existing approaches and (ii) introducing a saudade parameter to occasionally incorporate historic loss contributions. This forms a new heuristic called Relative Loss Balancing with Random Lookback (ReLoBRaLo) for selecting bespoke weights in order to combine multiple loss terms for the training of PINNs. The effectiveness and merits of using ReLoBRaLo are then demonstrated empirically by investigating several standard PDEs, including Helmholtz’s equation, Burgers’ equation and the Kirchhoff plate bending equation, and considering both forward problems and inverse problems, where unknown parameters in the PDEs are estimated. Our computations show that ReLoBRaLo is able to consistently outperform the baseline of existing scaling methods (GradNorm, Learning Rate Annealing, SoftAdapt or manual scaling) in terms of accuracy, while also being up to six times more computationally efficient (training epochs or wall-clock time). Using BO instead of grid search turned out to be an effective tool to reduce the laborious work of finding optimal hyperparameters for the PINNs. Finally, we showed that the adaptively chosen scalings can be inspected to learn about the PINNs training process and identify weak points. This allows informed decisions to be taken in order to improve the framework.

Future research is concerned with inspecting the performance, efficiency, robustness and scalability of ReLoBRaLo on further PDE classes such as the Navier-Stokes equations. The adoption of Sobolev Training with Sobolev norms together with ReLoBRaLo may solve the drawback associated with the high costs involved in estimating the neural network solutions of PDEs. In addition, the generalising capabilities of ReLoBRaLo to the wider class of penalised optimisation problems, including PDE-constrained and Sobolev training problems, will be addressed as well.