
Learning Neural Networks with Competing Physics Objectives: An Application in Quantum Mechanics

07/02/2020
by   Jie Bu, et al.
Binghamton University
Virginia Polytechnic Institute and State University

Physics-guided Machine Learning (PGML) is an emerging field of research in machine learning (ML) that aims to harness the power of ML advances without ignoring the rich knowledge of physics underlying scientific phenomena. One of the promising directions in PGML is to modify the objective function of neural networks by adding physics-guided (PG) loss functions that measure the violation of physics objectives in the ANN outputs. Existing PGML approaches generally focus on incorporating a single physics objective as a PG loss, using constant trade-off parameters. However, in the presence of multiple physics objectives with competing non-convex PG loss terms, there is a need to adaptively tune the importance of competing PG loss terms during the process of neural network training. We present a novel approach to handle competing PG loss terms in the illustrative application of quantum mechanics, where the two competing physics objectives are minimizing the energy while satisfying the Schrödinger equation. We conducted a systematic evaluation of the effects of PG loss on the generalization ability of neural networks in comparison with several baseline methods in PGML. All the code and data used in this work are available at https://github.com/jayroxis/Cophy-PGNN.


1 Introduction

With the increasing impact of machine learning (ML) methods in diverse scientific disciplines (Appenzeller (2017); Graham-Rowe et al. (2008)), there is a growing realization in the scientific community of the need to harness the power of ML advances without ignoring the rich knowledge of physics underlying scientific phenomena, thus using both physics and ML on an equal footing (Karpatne et al. (2017a); Gil (2017)). This is the emerging field of physics-guided machine learning (PGML) (Karpatne et al. (2017a); Raissi et al. (2019); de Bezenac et al. (2019); Karpatne et al. (2017b); Jia et al. (2019); Daw et al. (2020); Wang et al. (2017a); Pilania et al. (2018)), which is gaining widespread attention in several scientific disciplines including geoscience, climate science, fluid dynamics, and thermodynamics. One of the promising directions in PGML is to modify the objective function of neural networks by adding loss functions that measure the violation of physics in the ANN outputs, termed physics-guided (PG) loss functions (Karpatne et al. (2017c); Stewart and Ermon (2017)). By anchoring ANN models to be consistent with physics, PG loss functions help in learning generalizable solutions from data. They have been used in existing research to incorporate a variety of physics objectives such as energy conservation (Jia et al. (2019)), monotonic relations (Karpatne et al. (2017b); Muralidhar et al. (2018)), and partial differential equations (PDEs) (Raissi et al. (2017a, 2019); de Bezenac et al. (2019)).

While some existing methods in PGML learn neural networks by solely minimizing PG loss (and thus being label-free) (Raissi et al. (2019); Stewart and Ermon (2017)), others use both PG loss and data label loss in their objective function using appropriate trade-off hyper-parameters (Karpatne et al. (2017b); Jia et al. (2019)). However, what is even more challenging is when there are multiple physics objectives with competing PG loss terms that need to be minimized together, where each PG loss may show multiple local minima. In such situations, simple addition of PG loss terms in the objective function with constant trade-off hyper-parameters may result in the learning of non-generalizable solutions. This may seem counter-intuitive since the incorporation of PG loss is generally assumed to offer generalizability in the PGML literature (Karpatne et al. (2017b); de Bezenac et al. (2019); Shin et al. (2020)).

Figure 1 shows a toy example with two competing physics objectives to illustrate their effects on the generalizability of learned solutions. If we add the two objectives with constant weights and optimize the weighted sum using gradient descent methods, it is easy to end up at one of the local minima. However, if we give more importance to physics objective 2 in the first few epochs of gradient descent, and then optimize physics objective 1 at later epochs, we are more likely to arrive at the global minimum. This simple observation, although derived from an artificially constructed toy problem, motivates us to ask the question: is it possible to adaptively balance the importance of competing physics objectives at different stages of neural network learning to arrive at generalizable PGML solutions?


Figure 1: A toy example showing competing physics objectives that can lead to local minima when minimized together.
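To make the intuition concrete, here is a small, self-contained sketch in Python. The two objective functions below are made up for illustration and are not the ones plotted in Figure 1, but they reproduce the same qualitative behavior: constant weights get trapped in a local minimum, while shifting emphasis from objective 2 to objective 1 over epochs reaches the global basin.

```python
import numpy as np

# Two hypothetical competing objectives (not the ones in Figure 1):
# f1 has several local minima; f2 is smooth and points toward f1's global basin near x = -0.3.
f1 = lambda x: np.sin(5 * x) + 0.1 * x**2
f2 = lambda x: (x + 0.3) ** 2
g1 = lambda x: 5 * np.cos(5 * x) + 0.2 * x       # gradient of f1
g2 = lambda x: 2 * (x + 0.3)                     # gradient of f2

def descend(schedule, x0=2.0, lr=0.02, epochs=500):
    x = x0
    for t in range(epochs):
        w1, w2 = schedule(t)                     # per-epoch weights of the two objectives
        x -= lr * (w1 * g1(x) + w2 * g2(x))
    return x

constant = lambda t: (1.0, 1.0)                                 # fixed trade-off weights
adaptive = lambda t: (0.0, 1.0) if t < 100 else (1.0, 0.1)      # objective 2 first, then objective 1

print("constant weights ->", descend(constant))  # typically converges to a local minimum of the weighted sum
print("adaptive weights ->", descend(adaptive))  # typically reaches the global basin near x = -0.3
```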

In this work, we introduce a novel framework of CoPhy-PGNN, which is an abbreviation for Competing Physics Objectives Physics-Guided Neural Networks, to handle competing physics objectives as PG loss terms in neural network learning. We specifically consider the domain of scientific problems where physics objectives are represented as eigenvalue equations and we are required to solve for the highest or lowest eigen-solution. This representation is common to many types of physics such as the Schrödinger equation and Maxwell's equations and routinely shows up in applications in quantum mechanics, computational chemistry, and electromagnetism.

We specifically focus on the target application in quantum mechanics of predicting the ground-state wave function of an Ising chain model (Bonfim et al. (2019)) with N particles. In this testbed problem, the two competing physics objectives are minimizing the energy (E-Loss) while satisfying the Schrödinger equation (S-Loss). As we empirically demonstrate in this paper, S-Loss is fraught with exponentially many local minima that challenge the generalization performance of existing PGML formulations. In contrast, CoPhy-PGNN effectively balances S-Loss and E-Loss in an adaptive way throughout the learning process, leading to better generalizability even on input distributions with no labels during training. We compare our results with several baseline methods in PGML and discuss novel insights about: (a) the effects of PG loss on the loss-landscape visualizations of neural networks, (b) effective ways of incorporating PG loss in the learning process, and (c) the advantage of using data (labeled and/or unlabeled) along with physics.

2 Related work

PGML has found successful applications in several disciplines including fluid dynamics (Wang et al. (2017a, 2016, b)), climate science (de Bezenac et al. (2019)), and lake modeling (Karpatne et al. (2017b); Jia et al. (2019); Daw et al. (2020)). However, to the best of our knowledge, PGML formulations have not yet been explored for our target application of wave-function prediction in quantum mechanics. Existing work in PGML can be broadly divided into two categories. The first category involves label-free learning by only minimizing PG loss without using any labeled data. Early work in this category includes that of Stewart and Ermon (2017) on the use of domain constraints (e.g., the kinematic equations of motion) as PG loss to supervise neural networks. More recently, physics-informed neural networks (PINNs) and their variants (Raissi et al. (2019, 2017a, 2017b)) have been developed to solve PDEs by minimizing PG loss terms for simple canonical problems such as Burgers' equation. Since these methods are label-free, they do not explore the interplay between PG loss and label loss. We consider an analogue of PINN for our target application as a baseline in our experiments.

The second category of methods incorporates PG loss as additional terms in the objective function along with label loss, using trade-off hyper-parameters. This includes work on Physics-Guided Neural Networks (PGNNs) (Karpatne et al. (2017b); Jia et al. (2019)) for the target application of lake temperature modeling, where the PG loss terms capture conservation of energy as well as monotonic density-depth physics. We use an analogue of PGNN as a baseline in our experiments.

While some recent works in PGML have investigated the effects of PG loss on generalization performance (Shin et al. (2020)) and the importance of normalizing the scale of hyper-parameters corresponding to PG loss terms (Wang et al. (2020)), they do not study the effects of competing physics objectives, which is the focus of this paper. Our work is related to the field of multi-task learning (MTL) (Caruana (1993)), as the minimization of physics objectives and label loss can be viewed as multiple shared tasks. For example, alternating minimization techniques in MTL (Kang et al. (2011)) can be used to alternate between minimizing different PG loss and label loss terms over different mini-batches. We consider this as a baseline approach in our experiments.

3 Methodology

3.1 Problem statement

We consider a widely studied problem in quantum mechanics, the Ising chain model, as the testbed application in this paper (see Supplementary Materials for a detailed description of the physics of the Ising model). From an ML perspective, we are given a collection of training pairs, each consisting of a Hamiltonian matrix H and its corresponding ground-state wave-function and energy (ψ, E), generated by diagonalization solvers. We consider the problem of learning an ANN model that predicts (ψ, E) for any Hamiltonian matrix H, where the weights and biases of the ANN are its learnable parameters. We are also given a set of unlabeled examples (Hamiltonian matrices without ground-state labels), which will be used for testing. We consider a simple feed-forward architecture in all our formulations.

3.2 Designing physics-guided loss functions

A naïve approach for learning the ANN is to minimize the mean squared error (MSE) of its predictions on the training set, referred to as the Train-MSE. However, instead of solely relying on Train-MSE, we consider the following PG loss terms to guide the learning towards generalizable solutions:

Schrödinger Loss:

A fundamental equation that we want our predicted wave-function and energy to satisfy for any input Hamiltonian H is the Schrödinger equation, Hψ = Eψ. Hence, we consider minimizing the following loss:

\text{S-Loss} = \frac{1}{m}\sum_{i=1}^{m} \frac{\lVert H_i \hat{\psi}_i - \hat{E}_i \hat{\psi}_i \rVert_2^2}{\lVert \hat{\psi}_i \rVert_2^2}    (1)

where the sum runs over the m samples on which the loss is evaluated, and the denominator term makes the loss invariant to the scale of the predicted wave-function, effectively constraining it to the unit hyper-sphere and thus avoiding scaling issues. Note that, by construction, S-Loss only depends on the predictions of the network and does not rely on the true labels. Hence, S-Loss can be evaluated even on the unlabeled test data.
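For concreteness, a minimal PyTorch-style sketch of this term is shown below, assuming the reconstructed form in Eq. (1); the tensor shapes and function name are illustrative and not taken from the released code.

```python
import torch

def schrodinger_loss(H, psi_hat, E_hat):
    """S-Loss sketch: squared residual of H psi = E psi, normalized by ||psi||^2.

    H:       (batch, d, d) Hamiltonian matrices
    psi_hat: (batch, d)    predicted ground-state wave-functions
    E_hat:   (batch,)      predicted ground-state energies
    """
    residual = torch.bmm(H, psi_hat.unsqueeze(-1)).squeeze(-1) - E_hat.unsqueeze(-1) * psi_hat
    norm_sq = (psi_hat ** 2).sum(dim=-1)          # denominator keeps the loss scale-invariant
    return (residual.pow(2).sum(dim=-1) / norm_sq).mean()
```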

Energy Loss:

Note that there are many non-interesting solutions that can appear as "local minima" in the optimization landscape of S-Loss. For example, for every input Hamiltonian H there are 2^N possible eigen-solutions (where 2^N is the dimension of H), each of which results in a perfectly low value of S-Loss and thus acts as a local minimum. However, we are only interested in the ground-state eigen-solution for every H. Therefore, we consider minimizing another PG loss term that ensures the predicted energy at every sample is low, as follows:

\text{E-Loss} = \frac{1}{m}\sum_{i=1}^{m} \exp\!\left(\hat{E}_i\right)    (2)

The use of the exponential function ensures that E-Loss is always positive, even when the predicted energies are negative.
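A corresponding sketch of E-Loss, under the assumption that the positivity-enforcing function is the exponential, as in the reconstructed Eq. (2); the released code may use a different form.

```python
import torch

def energy_loss(E_hat):
    """E-Loss sketch: pushes predicted energies toward low values.

    The exponential is an assumption consistent with the text (always positive,
    even for negative predicted energies).
    """
    return torch.exp(E_hat).mean()
```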

3.3 Adaptive tuning of PG loss weights

A simple strategy for incorporating PG loss terms in the learning objective is to add them to Train-MSE using trade-off weight parameters, λ_S and λ_E, for S-Loss and E-Loss, respectively. Conventionally, such trade-off weights are kept constant across all epochs of gradient descent. This inherently assumes that the importance of the PG loss terms in guiding the learning towards a generalizable solution is constant across all stages (or epochs) of gradient descent, and that they are in agreement with each other. However, in practice, we empirically find that S-Loss, E-Loss, and Train-MSE compete with each other and have varying importance at different stages (or epochs) of ANN learning. Hence, we consider the following ways of adaptively tuning the trade-off weights λ_S(t) and λ_E(t) of S-Loss and E-Loss as a function of the epoch number t.

Annealing λ_E:

The first observation we make is that E-Loss plays a critical role in the initial stages of learning, where gradient descent has a tendency to move towards a local minimum and then refine that solution until convergence. Having a large value of λ_E in the first few epochs is thus helpful to avoid the selection of local minima and instead converge towards a generalizable solution. Further, note that E-Loss is designed such that it is always non-zero, even when we have converged to a generalizable solution with the lowest energy at every sample, which can result in instabilities during convergence. To avoid this, we perform a simulated annealing of λ_E that takes a high value in the beginning epochs and slowly decays to 0 after sufficiently many epochs. Specifically, we consider the following annealing procedure for λ_E:

\lambda_E(t) = \lambda_E^{0}\,\alpha^{\,t/\tau}    (3)

where λ_E^0 is a hyper-parameter denoting the starting value of λ_E at epoch 0, α is a hyper-parameter that controls the rate of annealing, and τ is a hyper-parameter that scales the annealing process across epochs.
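A sketch of one annealing schedule consistent with this description; the hyper-parameter names mirror Eq. (3) as reconstructed above, and the default values are placeholders rather than the tuned values from the paper.

```python
def lambda_E(t, lambda_E0=1.0, alpha=0.5, tau=10.0):
    """Annealed weight for E-Loss: starts at lambda_E0 and decays toward 0 as epoch t grows."""
    return lambda_E0 * alpha ** (t / tau)
```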

Cold Starting λ_S:

The second observation we make is on the effect of S-Loss on the convergence of gradient descent towards a generalizable solution. Note that S-Loss, while being critical in ensuring physical consistency of our predictions with the Schrödinger equation, suffers from a large number of local minima and hence is susceptible to favoring the learning of non-generalizable solutions due to its high non-convexity. Hence, in the beginning epochs, when we are taking large steps in the gradient descent algorithm to move towards a minimum, it is important to keep S-Loss turned off so that the learning process does not get stuck in one of the non-generalizable minima of S-Loss. Once we have crossed a sufficient number of epochs and have already zoomed into a region of the parameter space in close vicinity of a generalizable solution, we can safely turn on S-Loss so that it can help refine the solution and converge to the generalizable solution. Based on this observation, we consider "cold starting" λ_S, where its value is kept at 0 in the beginning epochs, after which it is raised to a constant value, as given by the following procedure:

\lambda_S(t) = \frac{\lambda_S^{\max}}{1 + \exp\!\left(-\beta\,(t - t_0)\right)}    (4)

where λ_S^max is a hyper-parameter denoting the constant value of λ_S after a sufficient number of epochs, β is a hyper-parameter that dictates the rate of growth of the sigmoid function, and t_0 is a hyper-parameter that controls the cutoff number of epochs after which λ_S is activated from a cold start of 0.
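Analogously, a sketch of the sigmoid cold-start schedule of Eq. (4) as reconstructed above, again with placeholder default values.

```python
import math

def lambda_S(t, lambda_S_max=1.0, beta=0.2, t0=50):
    """Cold-started weight for S-Loss: close to 0 for t << t0, rising to lambda_S_max after t0."""
    return lambda_S_max / (1.0 + math.exp(-beta * (t - t0)))
```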

Overall Learning Objective:

Combining all of the innovations described above in designing and incorporating PG loss functions, we consider the following overall learning objective:

\text{Loss} = \text{Train-MSE} + \lambda_S(t)\,\text{S-Loss} + \lambda_E(t)\,\text{E-Loss}    (5)

Note that Train-MSE is only computed over the labeled training set, whereas the PG loss terms, S-Loss and E-Loss, are computed over both the labeled training inputs and the set of unlabeled samples. We refer to our proposed model trained using the above learning objective as CoPhy-PGNN, which is an abbreviation for Competing Physics Objectives PGNN.
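Putting the pieces together, the following sketch shows how one training step under Eq. (5) might look, reusing the illustrative helpers from the earlier sketches. The assumptions that the model takes a flattened Hamiltonian as input and returns a (wave-function, energy) pair, and that Train-MSE simply sums the MSE of both outputs, are ours and not statements about the released implementation.

```python
import torch
import torch.nn.functional as F

def cophy_step(model, optimizer, H_lab, psi_true, E_true, H_unlab, t):
    """One training step of a CoPhy-PGNN-style objective (sketch, not the released code)."""
    optimizer.zero_grad()

    # Train-MSE on the labeled batch (equal weighting of the two targets is an assumption).
    psi_hat, E_hat = model(H_lab.flatten(1))
    train_mse = F.mse_loss(psi_hat, psi_true) + F.mse_loss(E_hat, E_true)

    # PG losses are evaluated on labeled plus unlabeled Hamiltonians.
    H_all = torch.cat([H_lab, H_unlab], dim=0)
    psi_all, E_all = model(H_all.flatten(1))

    loss = train_mse \
        + lambda_S(t) * schrodinger_loss(H_all, psi_all, E_all) \
        + lambda_E(t) * energy_loss(E_all)
    loss.backward()
    optimizer.step()
    return loss.item()
```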

4 Evaluation setup

Data and Experiment Design:

We considered spin systems of Ising chain models for predicting their ground-state wave-function under the varying influence of two controlling parameters: B_x and B_z, which represent the strength of the external magnetic field along the x axis (parallel to the direction of the Ising chain) and the z axis (perpendicular to the direction of the Ising chain), respectively. The Hamiltonian matrix for these systems is then given as:

H = -\sum_{i=1}^{N} \sigma^{x}_{i}\sigma^{x}_{i+1} \;-\; B_x \sum_{i=1}^{N} \sigma^{x}_{i} \;-\; B_z \sum_{i=1}^{N} \sigma^{z}_{i}, \qquad \sigma^{x}_{N+1} \equiv \sigma^{x}_{1}    (6)

where σ_i^x and σ_i^z are Pauli operators and ring boundary conditions are imposed. Note that the size of H is 2^N x 2^N for a chain of N spins. We set B_x equal to 0.01 to break the ground-state degeneracy, while B_z was sampled from a uniform distribution over the interval [0, 2].

Note that when B_z is below its critical value, the system is said to be in a ferromagnetic phase, since all the spins prefer to point collectively either upward or downward. However, when B_z exceeds the critical value, the system transitions to a paramagnetic phase, where upward and downward spins are equally possible. Because the ground-state wave-function behaves differently in the two regions, the system exhibits different physical properties. Hence, in order to test the generalizability of ANN models when training and test distributions are different, we generate training data only from a region deep inside the ferromagnetic phase, while the test data is generated from a much wider range of B_z covering both the ferromagnetic and paramagnetic phases. In particular, the training set comprises points with B_z uniformly sampled from a narrow low-field interval, while the test set comprises points with B_z uniformly sampled from a much wider interval. Labels for the ground-state wave-function for all training and test points were obtained by direct diagonalization of the Ising Hamiltonian using Intel's implementation of LAPACK (MKL). We used uniform sub-sampling of the training data and varied the training size to study its effect on the generalization performance of the comparative ANN models. For validation, we also used sub-sampling on the training set to obtain a validation set of 2000 samples. We performed 10 random runs of uniform sampling for every training size, to show the mean and variance of the performance metrics of the comparative ANN models, where at every run a different random initialization of the ANN models is also used. Unless otherwise stated, the results in any experiment are presented for a fixed default training size. Links to all code and data used in this paper, as well as details about hyper-parameter tuning of all models, are provided in the supplementary materials.
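As a concrete illustration of the data-generation step, the sketch below builds the Hamiltonian of Eq. (6) as reconstructed above (the sign and axis conventions are our assumption) and obtains ground-state labels by direct diagonalization with LAPACK via NumPy.

```python
import numpy as np
from functools import reduce

# Pauli matrices and the single-site identity.
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.array([[1.0, 0.0], [0.0, -1.0]])
I2 = np.eye(2)

def kron_at(op, site, N):
    """Place a single-site operator at the given position in an N-spin chain."""
    ops = [I2] * N
    ops[site] = op
    return reduce(np.kron, ops)

def ising_hamiltonian(N, Bx=0.01, Bz=1.0):
    """Ising chain with ring boundary conditions (reconstructed convention of Eq. 6)."""
    H = np.zeros((2 ** N, 2 ** N))
    for i in range(N):
        j = (i + 1) % N                              # ring boundary: the last spin couples to the first
        H -= kron_at(sx, i, N) @ kron_at(sx, j, N)   # nearest-neighbor coupling
        H -= Bx * kron_at(sx, i, N)                  # small field along the chain (breaks degeneracy)
        H -= Bz * kron_at(sz, i, N)                  # transverse field (drives the phase transition)
    return H

# Ground-state label by direct diagonalization:
H = ising_hamiltonian(N=4, Bz=np.random.uniform(0.0, 2.0))
E, V = np.linalg.eigh(H)
E0, psi0 = E[0], V[:, 0]                             # lowest eigenvalue and its eigenvector
```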

Baseline Methods:

Since there does not exist any related work in PGML that has been explored for our target application, we construct analogue versions of PINN (Raissi et al. (2019)) and PGNN (Karpatne et al. (2017b)), adapted to our problem using their major features. We describe these baselines along with others in the following:

  1. Black-box NN (or NN): This refers to the “black-box” ANN model trained just using Train-Loss without any PG loss terms.

  2. PGNN-analogue: The analogue version of PGNN (Karpatne et al. (2017b)) for our problem, where the hyper-parameters corresponding to S-Loss and E-Loss are set to constant values.

  3. PINN-analogue: The analogue version of PINN (Raissi et al. (2019)) for our problem that performs label-free learning only using PG loss terms with constant weights. Note that the PG loss terms are not defined as PDEs in our problem.

  4. MTL-PGNN: A Multi-task Learning (MTL) variant of PGNN where the loss terms are optimized alternately (Kang et al. (2011)), by randomly selecting one of the loss terms for each mini-batch in every epoch.

We also consider the following ablation models of CoPhy-PGNN.

  1. CoPhy-PGNN (only-Train): This is an ablation model where the PG loss terms are computed only over the labeled training set. Comparing our results with this model helps in evaluating the importance of using unlabeled samples in the computation of PG loss.

  2. CoPhy-PGNN (w/o E-Loss): This is another ablation model where we only consider S-Loss in the learning objective, while discarding E-Loss.

  3. CoPhy-PGNN (Label-free): This ablation model drops Train-MSE from the learning objective and hence performs label-free (LF) learning only using PG loss terms.

Evaluation Metrics:

We use two evaluation metrics: (a) Test-MSE, and (b) Cosine Similarity between the predicted wave-function and the ground-truth wave-function, averaged across all test samples. We particularly chose cosine similarity since Euclidean distances are not very meaningful in the high-dimensional spaces of wave-functions considered in our analyses. Further, an ideal cosine similarity of 1 provides an intuitive baseline to evaluate the goodness of results.
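The cosine-similarity metric is straightforward to compute; a small NumPy sketch (array shapes are illustrative):

```python
import numpy as np

def mean_cosine_similarity(psi_pred, psi_true):
    """Average cosine similarity between predicted and ground-truth wave-functions.

    psi_pred, psi_true: (num_samples, d) arrays; an ideal model gives values close to 1.
    """
    num = (psi_pred * psi_true).sum(axis=1)
    den = np.linalg.norm(psi_pred, axis=1) * np.linalg.norm(psi_true, axis=1)
    return float(np.mean(num / den))
```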

5 Results and analysis

Table 1: Test-MSE and Cosine Similarity (in %) of the comparative ANN models (CoPhy-PGNN (proposed), Black-box NN, PINN-analogue, PGNN-analogue, MTL-PGNN, CoPhy-PGNN (only-Train), CoPhy-PGNN (w/o E-Loss), and CoPhy-PGNN (Label-free)) at the default training size.
Figure 2: Cosine similarity w.r.t. different training sizes.

Table 1 provides a summary of the comparison of the CoPhy-PGNN model with baseline methods, where we can see that our proposed model shows significantly better performance in terms of both Test-MSE and Cosine Similarity. In fact, the cosine similarity of our proposed model is almost 1, indicating an almost perfect fit with the test labels. (Note that even a small drop in cosine similarity can lead to cascading errors in the estimation of other physical properties derived from the ground-state wave-function.) An interesting observation from Table 1 is that CoPhy-PGNN (Label-free) actually performs worse than Black-box NN. This shows that solely relying on PG loss without considering Train-MSE is fraught with challenges in arriving at a generalizable solution. Indeed, using a small number of labeled examples to compute Train-MSE provides a significant nudge to ANN learning to arrive at more accurate solutions. Another interesting observation is that CoPhy-PGNN (only-Train) also performs worse than Black-box NN. This demonstrates that it is important to use unlabeled samples, which are representative of the test set, to compute the PG loss. Furthermore, notice that CoPhy-PGNN (w/o E-Loss) actually performs the worst across all models, possibly due to the highly non-convex nature of the S-Loss function, which can easily lead to local minima when used without E-Loss. This sheds light on another important aspect of PGML that is often overlooked: it does not suffice to simply add a PG loss term to the objective function in order to achieve generalizable solutions. In fact, an improper use of PG loss can result in worse performance than a black-box model.

5.1 Effect of varying training size

Fig. 2 shows the differences in performance of the comparative algorithms as we vary the training size. We can see that PGNN-analogue, which does not perform adaptive tuning, shows a high variance in its results across training sizes. This is because, without cold starting λ_S, S-Loss can be quite unstable in the beginning epochs and can guide the gradient descent into one of its many local minima, especially when the gradients of Train-MSE are weak due to the paucity of training data. On the other hand, CoPhy-PGNN performs consistently better than all other baseline methods, with the smallest variance in its results across 10 random runs. In fact, our proposed model is able to perform well even with only 100 training samples.

5.2 Studying convergence across epochs

Figure 3 shows the variations in Train-MSE, Test-MSE, and S-Loss for four comparative models at every epoch of gradient descent. We can see that all models are able to achieve a reasonably low value of Train-MSE at the final solution except CoPhy-PGNN (Label-free), which is expected since it does not consider minimizing Train-MSE in the learning objective. Black-box NN actually shows the lowest Train-MSE of all models. However, the quantity that we really care to minimize is not the Train-MSE but the Test-MSE, which is indicative of generalization performance. We can see that while our proposed model, CoPhy-PGNN, shows slightly higher Train-MSE than Black-box NN, it shows a drastically smaller Test-MSE at the converged solution, demonstrating the effectiveness of our proposed approach.


Figure 3: Convergence plots showing Train-MSE, Test-MSE, and S-Loss over epochs.


Figure 4: Cosine Similarity on test samples as a function of B_z. The dashed line represents the boundary between the interval used for training (left) and the one used for testing (right).

A contrasting feature of the convergence plots of CoPhy-PGNN relative to Black-box NN is the presence of an initial jump in the Test-MSE values during the first few epochs. This likely arises due to the competing nature of the two loss terms that we are trying to minimize in the beginning epochs: the Train-MSE, which tries to move towards local minima favorable to the training data, and E-Loss, which pushes the gradient descent towards generalizable solutions. Indeed, this initial jump in Test-MSE helps in moving out of local minima, after which the Test-MSE plummets to significantly smaller values. Notice that CoPhy-PGNN (Label-free) shows a similar jump in Test-MSE in the beginning epochs, because it experiences a similar effect of E-Loss gradients during the initial stages of ANN learning. However, we can see that its Test-MSE is never able to drop beyond a certain value after the initial jump, as it does not receive the gradients of Train-MSE that are necessary for converging towards generalizable solutions.

Another interesting observation is that CoPhy-PGNN (w/o E-Loss) does not show any jump in Test-MSE during the beginning epochs, in contrast to our proposed model, since it is not affected by E-Loss. If we further look at the S-Loss curves, we can see that CoPhy-PGNN (w/o E-Loss) achieves the lowest values, since it only considers S-Loss as the PG loss term to be minimized in the learning objective. However, we know that S-Loss is home to a large number of local minima, and for that reason, even though CoPhy-PGNN (w/o E-Loss) shows low S-Loss values, its Test-MSE quickly grows to a large value, indicating convergence to a local minimum. These results demonstrate that a careful trade-off of PG loss terms along with Train-MSE is critical to ensure good generalization performance, such as that of our proposed model.

To better understand the behavior of competing loss terms, we conducted a novel gradient analysis that can be found in the supplementary materials.

5.3 Evaluating generalization power

Instead of computing the average cosine similarity across all test samples, Figure 4 analyzes the trends in cosine similarity over test samples with different values of B_z for four comparative models. Note that none of these models has observed any labeled data during training outside the training interval of B_z. Hence, by testing the cosine similarity over test samples with B_z outside that interval, we are directly testing the ability of the ANN models to generalize outside the data distribution they were trained upon. Evidently, all label-aware models perform well inside the training interval. However, except for CoPhy-PGNN, all baseline models degrade significantly outside that interval, demonstrating their lack of generalizability. Moreover, the label-free model, CoPhy-PGNN (Label-free), is highly erratic and performs poorly across the board.

5.4 Analysis of loss landscapes

To truly understand the effect of adding PG loss on the ANN's generalization performance, we visualize here the landscape of different loss functions w.r.t. the ANN model parameters. In particular, we use the code in (Bernardi (2019)) to plot a 2D view of the landscape of different loss functions, namely Train-MSE, Test-MSE, and PG-Loss (the sum of S-Loss and E-Loss), in the neighborhood of a model solution, as shown in Figure 5. In each of the sub-figures of this plot, the model's parameters are treated with filter normalization as described in (Li et al. (2018)), and hence the coordinate values on the axes are unit-less. Also, the model solutions are represented by blue dots. As can be seen, all label-aware models have found a minimum of the Train-MSE landscape. However, when the Test-MSE loss surface is plotted, it is clear that while the CoPhy-PGNN model is still at a minimum, the other baseline models are not. This is a strong indication that using the PG loss with unlabeled data can lead to better extrapolation; it allows the model to generalize beyond in-distribution data. We can also see that without using labels, CoPhy-PGNN (Label-free) fails to reach a good minimum of Test-MSE, even though it arrives at a minimum of the PG loss.


Figure 5: A comprehensive comparison between CoPhy-PGNN and different baseline models. The 1st and 2nd columns show that without using unlabeled data, the model does not generalize well. On the other hand, the 3rd column shows that without labeled data, the model fails to reach a good minimum. Only the last column, our proposed model, shows a good fit across both labeled and unlabeled data. The best-performing model is also the one that best optimizes the PG loss.
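For readers who want to reproduce this kind of plot, the sketch below follows the general filter-normalization recipe of Li et al. (2018); it is not the code of Bernardi (2019), and the grid resolution and span are arbitrary choices.

```python
import numpy as np
import torch

def filter_normalized_direction(model):
    """One random direction in parameter space, filter-normalized as in Li et al. (2018):
    each row ("filter") of the random direction is rescaled to the norm of the corresponding
    row of the trained parameters; bias vectors are rescaled as a whole."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() > 1:
            for d_row, p_row in zip(d, p):
                d_row.mul_(p_row.norm() / (d_row.norm() + 1e-10))
        else:
            d.mul_(p.norm() / (d.norm() + 1e-10))
        direction.append(d)
    return direction

def loss_surface(model, loss_fn, steps=25, span=1.0):
    """Evaluate loss_fn(model) on a 2D grid around the current (trained) parameters."""
    base = [p.detach().clone() for p in model.parameters()]
    d1, d2 = filter_normalized_direction(model), filter_normalized_direction(model)
    alphas = [float(a) for a in np.linspace(-span, span, steps)]
    surface = np.zeros((steps, steps))
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                for p, p0, u, v in zip(model.parameters(), base, d1, d2):
                    p.copy_(p0 + a * u + b * v)      # perturb the model along the two directions
                surface[i, j] = float(loss_fn(model))
        for p, p0 in zip(model.parameters(), base):   # restore the trained weights
            p.copy_(p0)
    return alphas, surface
```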

To understand the interactions among competing PG loss terms, we further computed the projection of the gradient of every loss term w.r.t. the optimal gradient direction (computed empirically) at every epoch and investigated the importance of PG loss terms in guiding towards the optimal gradient at different stages of neural network learning. See supplementary materials for more details.

6 Conclusions and future work

This work proposed novel strategies to address the problem of competing physics objectives in PGML. For the target problem in quantum mechanics, we designed a physics-guided machine learning (PGML) model, CoPhy-PGNN, to predict the ground-state wave-function and energy of Ising chain models. We also designed comprehensive evaluations that demonstrated the efficacy of our CoPhy-PGNN model. From our results, we found that: 1) PG loss helps the model extrapolate and gives it better generalizability; 2) using labeled data along with PG loss results in more stable PGML models. Moreover, we visualized the loss landscape to give a better understanding of how the combination of labeled data and PG loss leads to better generalization performance.

Future extensions of our work can explore the applicability of CoPhy-PGNN to other application domains with varying types of physics objectives. While this work empirically demonstrated the value of CoPhy-PGNN in handling competing PG loss terms, future work can focus on theoretical analyses. Finally, future work can also address the scalability of our model to larger systems in quantum mechanics.

7 Broader impacts

This research has several societal and scientific implications. Our work is motivated by a real-world scientific application to predict the wave function of an N-particle system under varying external parameters. This is key to predicting several properties of the system, such as magnetization, entanglement entropy, and quantum phase transitions. While the analysis presented in our paper is a proof-of-concept of the validity of neural networks in predicting the wave function of N-particle systems under simplified external parameters, we hope that this work serves as a stepping stone in accelerating scientific discovery in quantum science by harnessing the power of neural networks for studying large and complex quantum systems while respecting the physics of the problem to ensure generalizability. Existing numerical approaches for solving for the ground-state wave function of N-particle systems involve computationally expensive techniques for diagonalizing the Hamiltonian matrix of the system, whose size grows exponentially (as 2^N) with the size of the system. This makes it practically impossible to scale such methods to large systems even using modern high-performance computing infrastructure. Even for systems with smaller N, a great number of calculations for different external parameters have to be performed in order to construct the full phase diagram. In contrast, deep learning models, once trained, are quite inexpensive for inference on test Hamiltonian matrices, as observed during our preliminary scalability experiments to test the feasibility of our work on larger systems (see Supplementary Materials for more details). We hope that future extensions of our work explore the scalability of our proposed CoPhy-PGNN model on larger systems, without compromising the generalizability of neural network solutions. Our work also has broader impacts in several other scientific applications involving the solution of eigenvalue equations for the lowest (or, equivalently, the highest) eigenvalues. Such problems appear ubiquitously in fields such as electromagnetism, non-linear optics, and materials science requiring quantum mechanical descriptions.

References

Appendix A Relevant Physics Background

A.1 Ising Chain and Quantum Mechanics

Quantum mechanics provides a theoretically rigorous framework to investigate physical properties of quantum materials by solving the Schrödinger equation, the fundamental law in quantum mechanics. The Schrödinger equation is essentially a PDE that can be easily transformed into an eigenvalue problem of the form Hψ = Eψ, where H is the Hamiltonian matrix, ψ is the wave-function, and E is the energy, a scalar quantity. (Note that many other PDEs in the physical sciences, e.g., Maxwell's equations, yield to a similar transformation into an eigenvalue problem.) All information related to the dynamics of the quantum system is encoded in the eigen-vectors of H. Among these eigen-vectors, the ground-state wave-function, defined as the eigen-vector with the lowest energy, is a fundamental quantity for understanding the properties of quantum systems. Exploring how the ground state evolves with controlling parameters, e.g., magnetic field and bias voltage, is an important subject of study in materials science.

A major computational bottleneck in solving for the ground-state wave-function is the diagonalization of the Hamiltonian matrix H, whose dimension grows exponentially with the size of the system. In order to study the effects of controlling parameters on the physical properties of a quantum system, theorists routinely have to perform diagonalizations on an entire family of Hamiltonian matrices with the same structure but slightly different parameters.


Figure 6: Schematic illustration of the Ising spin chain. Each site is occupied by a spin that can only take two values, either spin up (+1) or spin down (-1). The external magnetic field is applied along the chain direction.

Here we study a quintessential model, the transverse-field Ising chain model (Bonfim et al. [2019]), which is a one-dimensional spin chain model under the influence of a transverse magnetic field, as shown in Fig. 6. Spin is the intrinsic angular momentum possessed by elementary particles, including electrons, protons, and neutrons. The Ising spin chain model describes a system in which multiple spins are located along a chain and interact only with their neighbors. By adding an external magnetic field, the ground-state wave-function can change dramatically. This model and its derivatives have been used to study a number of novel quantum materials (Brando et al. [2016], Zhou et al. [2017]) and can also be used for quantum computing (Terhal [2015]), since the qubit, the basic unit of quantum computing, can also be represented as a spin. However, the challenge in finding the ground-state wave-function of this model is that the dimension of the Hamiltonian grows exponentially as 2^N, where N equals the number of spins. We aim to develop PGML approaches that can learn the predictive mapping from the space of Hamiltonians to ground-state wave-functions, using the physics of the Schrödinger equation along with labels produced by diagonalization solvers on the training set.

Appendix B Hyper-parameter Selection

B.1 Hyperparameter Search

To exploit the best potential of the models, we conducted a hyperparameter search prior to many of our experiments (all the code and data used in this work are available at https://github.com/jayroxis/Cophy-PGNN; a complete set of code, data, pretrained models, and stored variables can be found at https://osf.io/ps3wx/?view_only=9681ddd5c43e48ed91af0db019bf285a, in cophy-pgnn.tar.gz). The search was performed on a training set of fixed size by randomly sampling each hyper-parameter value within a fixed range. We chose the average of the top-5 hyper-parameter settings that showed the lowest error on the validation set, which consisted of 2000 instances sampled from the training set. For the proposed CoPhy-PGNN model, this procedure determined the values of the schedule hyper-parameters, which were then kept fixed for all models. The same hyper-parameter values were used across all training sizes in our experiments to show the robustness of these values.

We searched for the best model architecture using simple multi-layer neural networks that do not show significant overfitting or underfitting, and then fixed the architecture for all the models in our work. The models comprise four fully-connected hidden layers with a non-linear activation and a linear output layer. The width of all the hidden layers is 100. All the experiments used the Adamax optimizer (Kingma and Ba [2014]) and set the maximum number of training epochs to 500, so that most of the models converge before that limit.
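A sketch of the described architecture and optimizer setup; the tanh activation and the input/output dimensions are assumptions, since the extracted text does not specify them.

```python
import torch
import torch.nn as nn

class WaveFunctionNet(nn.Module):
    """Four fully-connected hidden layers of width 100 and a linear output layer that
    emits the wave-function coefficients plus the energy (tanh activation assumed)."""

    def __init__(self, input_dim, psi_dim, hidden=100):
        super().__init__()
        layers, d = [], input_dim
        for _ in range(4):
            layers += [nn.Linear(d, hidden), nn.Tanh()]
            d = hidden
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, psi_dim + 1)       # psi_dim outputs for psi, one for E

    def forward(self, x):
        out = self.head(self.body(x))
        return out[:, :-1], out[:, -1]                   # (psi_hat, E_hat)

model = WaveFunctionNet(input_dim=256, psi_dim=16)       # dimensions are placeholders
optimizer = torch.optim.Adamax(model.parameters())       # Adamax, as stated; train for up to 500 epochs
```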

Since different models use different loss terms, the number of hyperparameters to search differs between models, and some hyperparameters were not searched at all. We used random search (Bergstra and Bengio [2012]) and ran around 300 to 500 trials per model to balance search quality against the time spent; a sketch of this procedure is given after the list below. The hyperparameters we searched include:

  1. For E-Loss: the annealing hyper-parameters of Eq. (3).

  2. For S-Loss: the cold-start hyper-parameters of Eq. (4).
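A hedged sketch of the random-search-plus-top-5-averaging procedure described above; the search ranges, the hyper-parameter names (which follow the earlier sketches), and the train_and_validate callable are placeholders.

```python
import random

# Hypothetical search space over the schedule hyper-parameters; ranges are placeholders.
SEARCH_SPACE = {
    "lambda_E0": (0.1, 10.0), "alpha": (0.1, 0.9), "tau": (1.0, 100.0),
    "lambda_S_max": (0.1, 10.0), "beta": (0.01, 1.0), "t0": (0.0, 200.0),
}

def sample_config():
    """Draw one random configuration (t0 is sampled continuously; round if needed)."""
    return {k: random.uniform(*bounds) for k, bounds in SEARCH_SPACE.items()}

def random_search(train_and_validate, n_trials=400, top_k=5):
    """Run n_trials random configurations, then average the top-k by validation error."""
    scored = [(train_and_validate(cfg), cfg) for cfg in (sample_config() for _ in range(n_trials))]
    scored.sort(key=lambda pair: pair[0])                 # lowest validation error first
    best = [cfg for _, cfg in scored[:top_k]]
    return {k: sum(cfg[k] for cfg in best) / top_k for k in SEARCH_SPACE}
```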

B.2 Sigmoid Cold-Start and Other Modes

Additionally, to further verify that our choice of the sigmoid cold-start for λ_S is indeed effective, we compared it against three other modes: inversed-sigmoid (Eq. 7), quick-drop (Eq. 8), and quick-start (Eq. 9).

(7)
(8)
(9)

We replaced the sigmoid cold-start with each of the three modes in CoPhy-PGNN and ran 400 trials per mode to perform a hyperparameter search over the corresponding schedule hyper-parameters, again averaging the hyperparameter values of the top-5 models with the lowest validation error. The results are:

  1. quick-drop: , , .

  2. quick-start: , , .

  3. sigmoid: , , .

  4. inversed-sigmoid: , , .


Figure 7: Wave-function cosine similarity of the different adaptive λ_S schedules w.r.t. training size. Left: cosine similarity at different training sizes. Right: mean cosine similarity over all training sizes.

Using these hyperparameter values to set up the models and running 10 times per setting on different training sizes, we obtain the results shown in Fig. 7. We can see that the sigmoid cold-start consistently performed better than the other modes in both stability and accuracy. Another important observation is that quick-start, even though it increases the weight of S-Loss much faster, yields much more unstable results; in fact, quick-start dominates the leaderboard of both the top-10 best and the top-10 worst performances (also shown in the bar plot in Figure 7). This implies that a smooth and gradual switch of dominance between the different loss terms is better in terms of stability.

Appendix C Analysis of gradients

C.1 Contribution of Loss Terms

The complicated interactions between competing loss terms motivate us to further investigate the different roles each loss term plays in the aggregated loss function. The sharp bulge in both the Test-MSE and S-Loss curves in the first few epochs (Fig. 3 in the main document) shows that the optimization process is not smooth. Our speculation is that S-Loss and E-Loss are competing loss terms in this multi-objective optimization. To monitor the contribution of every loss term in the learning process, we need to measure whether the gradient of a loss term points towards the optimal direction of descent to a generalizable model. One way to achieve this is to compute the component of the gradient of a loss term along the optimal direction of descent (leading to a generalizable model). Suppose the desired (or optimal) direction is d_t and the gradient of a loss term is g_t. We can then compute the projection of g_t along the direction of d_t at epoch t as:

\mathrm{proj}_t = \frac{\langle g_t,\, d_t \rangle}{\lVert d_t \rVert}    (10)

A higher projection value indicates a larger step toward the optimal direction at epoch t. The optimal direction d_t is defined as:

d_t = \theta^{*} - \theta_t    (11)

where θ_t is the model parameters (i.e., the weights and biases of the neural network) at epoch t and θ* is the optimal state of the model that is known to be generalizable. Note that finding an exact solution for θ* that is the global optimum of the loss function is practically infeasible for deep neural networks (Choromanska et al. [2014]). Hence, in our experiments, we consider the final model arrived at on convergence of the training process as a reasonable approximation of θ*. For methods such as PGNN-analogue and CoPhy-PGNN, the final models at convergence performed significantly well and showed a high cosine similarity with the ground truth, very close to that of a model trained directly on the test set, which only reaches 99.8%. This gives us some confidence that the final models at convergence are good approximations of θ*. To compute the inner products between g_t and d_t, we used a flattened representation of the model parameters by concatenating the weights and biases across the layers.
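A sketch of this computation in PyTorch, following Eqs. (10)-(11) as reconstructed above; the helper names are illustrative and not taken from the released code.

```python
import torch

def flatten_params(model):
    """Concatenate all weights and biases into a single vector, as described in the text."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def gradient_projection(model, loss_term, theta_star):
    """Scalar projection of a loss term's gradient onto d_t = theta* - theta_t (Eqs. 10-11).

    loss_term:  a scalar tensor computed from the current model outputs
    theta_star: flattened parameters of the converged model used as the approximate optimum
    """
    model.zero_grad()
    loss_term.backward(retain_graph=True)   # keep the graph so other loss terms can be analyzed too
    grad = torch.cat([
        (p.grad if p.grad is not None else torch.zeros_like(p)).reshape(-1)
        for p in model.parameters()
    ])
    direction = theta_star - flatten_params(model)
    return float(torch.dot(grad, direction) / (direction.norm() + 1e-12))
```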

C.2 Experiment Results


Figure 8: The projection of the gradient of each loss term onto the optimal direction. The optimal direction at a given iteration points from the current state to the optimal state.

We analyze the roles of E-Loss and S-Loss in the training process of two methods: CoPhy-PGNN and PGNN-analogue. Both methods were run with the same initialization of model parameters. The training size is 2000 and the rest of the settings are the same as in Section 5 of the main document.

Figure 8 shows that in the early epochs, E-Loss has positive projection values, which means that it is helping the method move towards the optimal state. On the other hand, the projection of S-Loss starts with a negative value, indicating that the gradient of the S-Loss term is counterproductive at the beginning. Hence, E-Loss helps in moving out of the neighborhood of the local minima of S-Loss towards a generalizable solution. However, the projection of S-Loss does not remain negative (and thus counterproductive) across all epochs. In fact, S-Loss makes a significant contribution by having a large positive projection value after around 50 epochs. This shows that as long as we manage to escape from the initial trap caused by the local minima of S-Loss, it can then guide the model towards the desired direction. By initially setting λ_S close to zero, we allow E-Loss to dominate in the initial epochs and move out of the local minima. Later, we let λ_S recover to a reasonable value so that S-Loss starts to play its role. These findings align well with the cold-start and annealing ideas proposed in this work and show that the approach works best when the two loss terms are combined using adaptive weights.

Note that for this analysis to produce valid findings, we need to ensure that the loss terms are not pointing towards the direction of an equally good optimum that can be arrived at from the same initialization. To verify this, we investigated how similar the trained models (optimal states) are when started from the same initialization for the two methods. The parameters of PGNN-analogue and CoPhy-PGNN showed an average cosine similarity of 98.6%, and in many cases reached 99%. This gave us more confidence that our approximations of the optimal model were sufficient.