1 Introduction
With the increasing impact of machine learning (ML) methods in diverse scientific disciplines (Appenzeller (2017); Graham-Rowe et al. (2008)), there is a growing realization in the scientific community of the need to harness the power of ML advances without ignoring the rich knowledge of physics underlying scientific phenomena, thus using physics and ML on an equal footing (Karpatne et al. (2017a); Gil (2017)). This is the emerging field of physics-guided machine learning (PGML) (Karpatne et al. (2017a); Raissi et al. (2019); de Bezenac et al. (2019); Karpatne et al. (2017b); Jia et al. (2019); Daw et al. (2020); Wang et al. (2017a); Pilania et al. (2018)), which is gaining widespread attention in several scientific disciplines including geoscience, climate science, fluid dynamics, and thermodynamics. One promising direction in PGML is to modify the objective function of neural networks by adding loss functions that measure the violation of physics in the ANN outputs, termed physics-guided (PG) loss functions (Karpatne et al. (2017c); Stewart and Ermon (2017)). By anchoring ANN models to be consistent with physics, PG loss functions help in learning generalizable solutions from data. They have been used in existing research to incorporate a variety of physics objectives such as energy conservation (Jia et al. (2019)), monotonic relations (Karpatne et al. (2017b); Muralidhar et al. (2018)), and partial differential equations (PDEs) (Raissi et al. (2017a, 2019); de Bezenac et al. (2019)).

While some existing methods in PGML learn neural networks by solely minimizing PG loss (and are thus label-free) (Raissi et al. (2019); Stewart and Ermon (2017)), others use both PG loss and data label loss in their objective function with appropriate trade-off hyperparameters (Karpatne et al. (2017b); Jia et al. (2019)). An even more challenging setting arises when multiple physics objectives with competing PG loss terms need to be minimized together, where each PG loss may exhibit multiple local minima. In such situations, simply adding the PG loss terms to the objective function with constant trade-off hyperparameters may result in the learning of non-generalizable solutions. This may seem counter-intuitive, since the incorporation of PG loss is generally assumed to improve generalizability in the PGML literature (Karpatne et al. (2017b); de Bezenac et al. (2019); Shin et al. (2020)).
Figure 1 shows a toy example with two competing physics objectives to illustrate their effects on the generalizability of learned solutions. If we add the two objectives with constant weights and optimize the weighted sum using gradient descent methods, it is easy to end up at one of the local minima. However, if we give more importance to physics objective 2 in the first few epochs of gradient descent, and then optimize physics objective 1 in later epochs, we are more likely to arrive at the global minimum.
This simple observation, although derived from an artificially constructed toy problem, motivates us to ask the question: is it possible to adaptively balance the importance of competing physics objectives at different stages of neural network learning to arrive at generalizable PGML solutions?

In this work, we introduce CoPhy-PGNN, an abbreviation for Competing Physics Objectives Physics-Guided Neural Networks, a novel framework for handling competing physics objectives as PG loss terms in neural network learning. We specifically consider the domain of scientific problems where physics objectives are represented as eigenvalue equations and we are required to solve for the highest or lowest eigensolution. This representation is common to many types of physics, such as the Schrödinger equation and Maxwell's equations, and routinely shows up in applications in quantum mechanics, computational chemistry, and electromagnetism.
We specifically focus on the target application in quantum mechanics of predicting the ground-state wave function of an Ising chain model (Bonfim et al. (2019)) with N particles. In this testbed problem, the two competing physics objectives are minimizing the energy (E-Loss) while satisfying the Schrödinger equation (S-Loss). As we empirically demonstrate in this paper, S-Loss is fraught with exponentially many local minima that challenge the generalization performance of existing PGML formulations. In contrast, CoPhy-PGNN effectively balances S-Loss and E-Loss in an adaptive way throughout the learning process, leading to better generalizability even on input distributions with no labels during training. We compare our results with several baseline methods in PGML and discuss novel insights about: (a) the effects of PG loss on the loss landscape visualizations of neural networks, (b) effective ways of incorporating PG loss in the learning process, and (c) the advantage of using data (labeled and/or unlabeled) along with physics.
2 Related work
PGML has found successful applications in several disciplines including fluid dynamics (Wang et al. (2017a, 2016, b)), climate science (de Bezenac et al. (2019)), and lake modeling (Karpatne et al. (2017b); Jia et al. (2019); Daw et al. (2020)). However, to the best of our knowledge, PGML formulations have not yet been explored for our target application of wave function prediction in quantum mechanics. Existing work in PGML can be broadly divided into two categories. The first category involves label-free learning that only minimizes PG loss without using any labeled data. Early work in this category includes that of Stewart and Ermon (2017) on the use of domain constraints (e.g., the kinematic equations of motion) as PG loss to supervise neural networks. More recently, physics-informed neural networks (PINNs) and their variants (Raissi et al. (2019, 2017a, 2017b)) have been developed to solve PDEs by minimizing PG loss terms for simple canonical problems such as Burgers' equation. Since these methods are label-free, they do not explore the interplay between PG loss and label loss. We consider an analogue of PINN for our target application as a baseline in our experiments.
The second category of methods incorporates PG loss as additional terms in the objective function along with label loss, using trade-off hyperparameters. This includes work on Physics-guided Neural Networks (PGNNs) (Karpatne et al. (2017b); Jia et al. (2019)) for the target application of lake temperature modeling, where the PG loss terms capture conservation of energy as well as monotonic density-depth physics. We use an analogue of PGNN as a baseline in our experiments.
While some recent works in PGML have investigated the effects of PG loss on generalization performance (Shin et al. (2020)) and the importance of normalizing the scale of hyperparameters corresponding to PG loss terms (Wang et al. (2020)), they do not study the effects of competing physics objectives, which is the focus of this paper. Our work is related to the field of multi-task learning (MTL) (Caruana (1993)), as the minimization of physics objectives and label loss can be viewed as multiple shared tasks. For example, alternating minimization techniques in MTL (Kang et al. (2011)) can be used to alternate between minimizing different PG loss and label loss terms over different mini-batches. We consider this as a baseline approach in our experiments.
3 Methodology
3.1 Problem statement
We consider a widely studied problem in quantum mechanics, the Ising chain model, as the testbed application in this paper (see Supplementary Materials for a detailed description of the physics of the Ising model). From an ML perspective, we are given a collection of training pairs, D_Train = {(H_i, ψ_i, E_i)}, where H_i is a Hamiltonian matrix and (ψ_i, E_i) are its corresponding ground-state wavefunction and energy, generated by diagonalization solvers. We consider the problem of learning an ANN model, f_θ, that can predict (ψ̂, Ê) for any Hamiltonian matrix H, where θ are the learnable parameters of the ANN. We are also given a set of unlabeled examples, D_U, which will be used for testing. We consider a simple feed-forward architecture for f_θ in all our formulations.
3.2 Designing physics-guided loss functions
A naïve approach for learning f_θ is to minimize the mean sum of squared errors (MSE) of its predictions on the training set, referred to as Train-MSE. However, instead of solely relying on Train-MSE, we consider the following PG loss terms to guide the learning of f_θ towards generalizable solutions:
Schrödinger Loss (S-Loss):
A fundamental equation that we want our predictions, (ψ̂, Ê), to satisfy for any input H is the Schrödinger equation, H ψ = E ψ. Hence, we consider minimizing the following loss:
(1)   S-Loss = Σ_{H ∈ D_Train ∪ D_U} ‖H ψ̂ − Ê ψ̂‖² / ‖ψ̂‖²
where the denominator term ensures that ψ̂ effectively resides on a unit hypersphere with ‖ψ̂‖ = 1, thus avoiding scaling issues. Note that, by construction, S-Loss only depends on the predictions of f_θ and does not rely on the true labels. Hence, S-Loss can be evaluated even on the unlabeled test data, D_U.
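To make this concrete, the following is a minimal PyTorch sketch of an S-Loss of this form. The function name, tensor shapes, and mean reduction are our own illustrative choices and are not taken from the authors' released implementation.

```python
import torch

def schrodinger_loss(H, psi_hat, E_hat):
    """S-Loss sketch: penalizes violation of H psi = E psi.

    H:       (batch, d, d) Hamiltonian matrices
    psi_hat: (batch, d) predicted wavefunctions
    E_hat:   (batch,) predicted energies
    """
    # Residual of the Schrodinger equation for each sample.
    residual = torch.bmm(H, psi_hat.unsqueeze(-1)).squeeze(-1) - E_hat.unsqueeze(-1) * psi_hat
    # Dividing by ||psi_hat||^2 makes the loss invariant to the scale of psi_hat.
    norm_sq = (psi_hat ** 2).sum(dim=-1).clamp_min(1e-12)
    return ((residual ** 2).sum(dim=-1) / norm_sq).mean()
```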
Energy Loss (E-Loss):
Note that there are many non-interesting solutions of f_θ that can appear as "local minima" in the optimization landscape of S-Loss. For example, for every input Hamiltonian H, there are 2^N possible eigensolutions (where 2^N is the dimension of H), each of which results in a perfectly low value of S-Loss, thus acting as a local minimum. However, we are only interested in the ground-state eigensolution for every H. Therefore, we consider minimizing another PG loss term that ensures the predicted energy at every sample is low, as follows:
(2)   E-Loss = Σ_{H ∈ D_Train ∪ D_U} exp(Ê)
The use of the exponential function ensures that E-Loss is always positive, even when the predicted energies are negative.
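A minimal sketch of such an energy term is shown below; the exponential form follows our reading of Eq. (2), and the exact reduction used by the authors may differ.

```python
import torch

def energy_loss(E_hat):
    """E-Loss sketch: encourages low predicted energies.

    Exponentiating the predicted energy keeps the loss positive and
    monotonically increasing in E_hat, even when energies are negative.
    """
    return torch.exp(E_hat).mean()
```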
3.3 Adaptive tuning of PG loss weights
A simple strategy for incorporating PG loss terms in the learning objective of f_θ is to add them to Train-MSE using trade-off weight parameters, λ_S and λ_E, for S-Loss and E-Loss, respectively. Conventionally, such trade-off weights are kept constant at a certain value across all epochs of gradient descent. This inherently assumes that the importance of the PG loss terms in guiding the learning of f_θ towards a generalizable solution is constant across all stages (or epochs) of gradient descent, and that they are in agreement with each other. However, in practice, we empirically find that S-Loss, E-Loss, and Train-MSE compete with each other and have varying importance at different stages (or epochs) of ANN learning. Hence, we consider the following ways of adaptively tuning the trade-off weights of S-Loss and E-Loss, λ_S(t) and λ_E(t), as functions of the epoch number t.
Annealing λ_E:
The first observation we make is that E-Loss plays a critical role in the initial stages of learning, where gradient descent has a tendency to move towards a local minimum and then refine that solution until convergence. Having a large value of λ_E in the first few epochs is thus helpful to avoid the selection of local minima and instead converge towards a generalizable solution. Further, note that E-Loss is designed such that it is always non-zero, even when we have converged at a generalizable solution with the lowest energy at every sample, which can result in instabilities during convergence. To avoid this, we perform a simulated annealing of λ_E so that it takes a high value in the beginning epochs and slowly decays to 0 after sufficiently many epochs. Specifically, we consider the following annealing procedure for λ_E:
(3)   λ_E(t) = λ_E^0 · α^(t/τ)
where λ_E^0 is a hyperparameter denoting the starting value of λ_E at epoch 0, α is a hyperparameter that controls the rate of annealing, and τ is a hyperparameter that scales the annealing process across epochs.
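The schedule can be implemented as a small helper; the geometric-decay form and the default values below are illustrative reconstructions of Eq. (3) rather than the authors' exact settings.

```python
def lambda_e_schedule(t, lam_e0=1.0, alpha=0.5, tau=10.0):
    """Annealing sketch for the E-Loss weight (Eq. 3): starts at lam_e0 at
    epoch t = 0 and decays towards 0; alpha (< 1) sets the decay rate and
    tau stretches the decay across epochs."""
    return lam_e0 * (alpha ** (t / tau))
```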
Cold Starting λ_S:
The second observation we make is on the effect of S-Loss on the convergence of gradient descent towards a generalizable solution. Note that S-Loss, while critical in ensuring the physical consistency of our predictions with the Schrödinger equation, suffers from a large number of local minima and is hence susceptible to favoring the learning of non-generalizable solutions due to its high non-convexity. Hence, in the beginning epochs, when we are taking large steps in the gradient descent algorithm to move towards a minimum, it is important to keep S-Loss turned off so that the learning process does not get stuck in one of the non-generalizable minima of S-Loss. Once we have crossed a sufficient number of epochs and have already zoomed into a region of the parameter space in close vicinity to a generalizable solution, we can safely turn on S-Loss so that it can help refine f_θ to converge to the generalizable solution. Based on this observation, we consider "cold starting" λ_S, where its value is kept at 0 in the beginning epochs, after which it is raised to a constant value, as given by the following procedure:
(4)   λ_S(t) = λ_S^max · σ(s (t − t_0))
where λ_S^max is a hyperparameter denoting the constant value of λ_S after a sufficient number of epochs, s is a hyperparameter that dictates the rate of growth of the sigmoid function σ(·), and t_0 is a hyperparameter that controls the cutoff number of epochs after which λ_S is activated from a cold start of 0.

Overall Learning Objective:
Combining all of the innovations described above in designing and incorporating PG loss functions, we consider the following overall learning objective:
(5)   Loss(θ; t) = Train-MSE + λ_S(t) · S-Loss + λ_E(t) · E-Loss
Note that Train-MSE is only computed over the labeled training set, D_Train, whereas the PG loss terms, S-Loss and E-Loss, are computed over D_Train as well as the set of unlabeled samples, D_U. We refer to our proposed model trained using the above learning objective as CoPhy-PGNN, an abbreviation for Competing Physics Objectives PGNN.
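Putting the pieces together, the following sketch shows one full-batch training step under the objective in Eq. (5), reusing the schrodinger_loss, energy_loss, and lambda_e_schedule helpers sketched earlier; the cold-start schedule follows Eq. (4). The model is assumed to return a (wavefunction, energy) pair, and all names and default values are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def lambda_s_schedule(t, lam_s_max=1.0, growth=0.5, t0=50.0):
    """Cold-start sketch for the S-Loss weight (Eq. 4): close to 0 before
    epoch t0, then rising smoothly to lam_s_max."""
    return lam_s_max / (1.0 + math.exp(-growth * (t - t0)))

def train_epoch(model, optimizer, H_lab, psi_lab, E_lab, H_unlab, epoch):
    """One epoch of the CoPhy-PGNN objective (Eq. 5), sketched for full-batch
    training. Train-MSE uses labeled data only; S-Loss and E-Loss are computed
    over labeled and unlabeled Hamiltonians together."""
    optimizer.zero_grad()
    psi_hat_lab, E_hat_lab = model(H_lab)
    psi_hat_unlab, E_hat_unlab = model(H_unlab)

    train_mse = F.mse_loss(psi_hat_lab, psi_lab) + F.mse_loss(E_hat_lab, E_lab)

    H_all = torch.cat([H_lab, H_unlab])
    psi_all = torch.cat([psi_hat_lab, psi_hat_unlab])
    E_all = torch.cat([E_hat_lab, E_hat_unlab])

    loss = (train_mse
            + lambda_s_schedule(epoch) * schrodinger_loss(H_all, psi_all, E_all)
            + lambda_e_schedule(epoch) * energy_loss(E_all))
    loss.backward()
    optimizer.step()
    return float(loss)
```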
4 Evaluation setup
Data and Experiment Design:
We considered N-spin Ising chain systems for predicting their ground-state wavefunction under varying influences of two controlling parameters, B_z and B_x, which represent the strength of the external magnetic field along the z-axis (parallel to the direction along which the spins of the Ising chain align) and along the x-axis (perpendicular to it), respectively. The Hamiltonian matrix for these systems is then given as:
(6)   H = − Σ_{i=1}^{N} ( σ_i^z σ_{i+1}^z + B_z σ_i^z + B_x σ_i^x )
where σ_i^z and σ_i^x are Pauli operators acting on spin i, and ring boundary conditions are imposed (i.e., σ_{N+1} ≡ σ_1). Note that the size of H is 2^N × 2^N. We set the longitudinal field B_z equal to 0.01 to break the ground-state degeneracy, while the transverse field B_x was sampled from a uniform distribution over the interval [0, 2].
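For readers who want to reproduce the label generation, the sketch below builds such a Hamiltonian from Pauli matrices via Kronecker products and extracts the ground state by dense diagonalization; NumPy's eigensolver stands in for the MKL/LAPACK diagonalization described below, and the coupling convention reflects our reading of Eq. (6), which may differ in detail from the authors' setup.

```python
import numpy as np

sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.array([[1.0, 0.0], [0.0, -1.0]])
id2 = np.eye(2)

def site_op(op, i, n):
    """Embed a single-site Pauli operator `op` at site i of an n-spin chain."""
    mats = [id2] * n
    mats[i] = op
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

def ising_hamiltonian(n, b_z, b_x):
    """Ising chain with ring boundary conditions (our reading of Eq. 6):
    H = -sum_i (sz_i sz_{i+1} + b_z sz_i + b_x sx_i)."""
    dim = 2 ** n
    H = np.zeros((dim, dim))
    for i in range(n):
        H -= site_op(sz, i, n) @ site_op(sz, (i + 1) % n, n)    # ring coupling
        H -= b_z * site_op(sz, i, n) + b_x * site_op(sx, i, n)  # external fields
    return H

def ground_state(H):
    """Label generation: lowest eigenpair via dense diagonalization."""
    evals, evecs = np.linalg.eigh(H)
    return evals[0], evecs[:, 0]

# Example: one training label for a 4-spin chain deep in the ferromagnetic regime.
H = ising_hamiltonian(n=4, b_z=0.01, b_x=0.3)
E0, psi0 = ground_state(H)
```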
Note that when B_x < 1, the system is said to be in a ferromagnetic phase, since all the spins prefer to either point upward or downward collectively. However, when B_x > 1, the system transitions to a paramagnetic phase, where both upward and downward spins are equally possible. Because the ground-state wavefunction behaves differently in the two regions, the system exhibits different physical properties. Hence, in order to test the generalizability of ANN models when training and test distributions are different, we generate training data only from a region deep inside the ferromagnetic phase (small values of B_x), while the test data is generated from the much wider range B_x ∈ [0, 2], covering both ferromagnetic and paramagnetic phases. In particular, the training set comprises points with B_x uniformly sampled from a narrow sub-interval deep inside the ferromagnetic phase, while the test set comprises points with B_x uniformly sampled over the full interval [0, 2]. Labels for the ground-state wavefunction for all training and test points were obtained by direct diagonalization of the Ising Hamiltonian using Intel's implementation of LAPACK (MKL). We used uniform subsampling and varied the training size to study its effect on the generalization performance of the comparative ANN models. For validation, we also used subsampling on the training set to obtain a validation set of 2,000 samples. We performed 10 random runs of uniform sampling for every training size, to show the mean and variance of the performance metrics of the comparative ANN models, where at every run a different random initialization of the ANN models is also used. Unless otherwise stated, the results in any experiment are presented for a fixed default training size. Links to all code and data used in this paper, as well as details about the hyperparameter tuning of all models, are provided in the supplementary materials.

Baseline Methods:
Since no related work in PGML has been explored for our target application, we construct analogue versions of PINN (Raissi et al. (2019)) and PGNN (Karpatne et al. (2017b)) adapted to our problem using their major features. We describe these baselines along with others in the following:

Black-box NN (or simply NN): This refers to the "black-box" ANN model trained using only Train-MSE, without any PG loss terms.

PGNN-analogue: The analogue version of PGNN (Karpatne et al. (2017b)) for our problem, where the hyperparameters corresponding to S-Loss and E-Loss are set to constant values.

PINN-analogue: The analogue version of PINN (Raissi et al. (2019)) for our problem, which performs label-free learning using only PG loss terms with constant weights. Note that the PG loss terms are not defined as PDEs in our problem.

MTL-PGNN: A multi-task learning (MTL) variant of PGNN where the loss terms are optimized in an alternating fashion (Kang et al. (2011)) by randomly selecting one of the loss terms for each mini-batch in every epoch (a minimal sketch of this selection scheme appears after this list).
We also consider the following ablation models of CoPhy-PGNN.

CoPhy-PGNN (D_Train only): This is an ablation model where the PG loss terms are computed only over the labeled training set, D_Train. Comparing our results with this model will help in evaluating the importance of using unlabeled samples in the computation of PG loss.

CoPhy-PGNN (w/o E-Loss): This is another ablation model where we only consider S-Loss in the learning objective, while discarding E-Loss.

CoPhy-PGNN (Label-free): This ablation model drops Train-MSE from the learning objective and hence performs label-free (LF) learning using only PG loss terms.
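As referenced in the MTL-PGNN description above, the alternating optimization over loss terms can be sketched as follows; the random per-mini-batch choice is our illustrative reading of that baseline, and the function names are our own.

```python
import random

def mtl_pgnn_step(model, optimizer, batch, loss_fns):
    """MTL-PGNN sketch: each mini-batch optimizes one randomly chosen loss
    term (e.g., Train-MSE, S-Loss, or E-Loss) instead of their weighted sum."""
    loss_fn = random.choice(loss_fns)   # alternate between competing objectives
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    optimizer.step()
    return float(loss)
```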
Evaluation Metrics:
We use two evaluation metrics: (a) Test-MSE, and (b) Cosine Similarity between our predicted wavefunction, ψ̂, and the ground-truth, ψ, averaged across all test samples. We particularly chose cosine similarity because Euclidean distances are not very meaningful in high-dimensional spaces of wavefunctions, such as the ones we are considering in our analyses. Further, an ideal cosine similarity of 1 provides an intuitive baseline to evaluate the goodness of results.

5 Results and analysis
Table 1 provides a summary of the comparison of the CoPhy-PGNN model with baseline methods, where we can see that our proposed model shows significantly better performance in terms of both Test-MSE and Cosine Similarity. In fact, the cosine similarity of our proposed model is almost 1, indicating an almost perfect fit with the test labels. (Note that even a small drop in cosine similarity can lead to cascading errors in the estimation of other physical properties derived from the ground-state wavefunction.) An interesting observation from Table 1 is that CoPhy-PGNN (Label-free) actually performs worse than the black-box NN. This shows that solely relying on PG loss without considering Train-MSE is fraught with challenges in arriving at a generalizable solution. Indeed, using a small number of labeled examples to compute Train-MSE provides a significant nudge to ANN learning, helping it arrive at more accurate solutions. Another interesting observation is that CoPhy-PGNN (D_Train only) also performs worse than the black-box NN. This demonstrates that it is important to use unlabeled samples in D_U, which are representative of the test set, to compute the PG loss. Furthermore, notice that CoPhy-PGNN (w/o E-Loss) actually performs worst across all models, possibly due to the highly non-convex nature of S-Loss, which can easily lead to local minima when used without E-Loss. This sheds light on another important, and often overlooked, aspect of PGML: it does not suffice to simply add a PG loss term to the objective function in order to achieve generalizable solutions. In fact, an improper use of PG loss can result in worse performance than a black-box model.

5.1 Effect of varying training size
Fig. 2 shows the differences in performance of the comparative algorithms as we vary the training size. We can see that PGNN-analogue, which does not perform adaptive tuning, shows a high variance in its results across training sizes. This is because, without cold starting λ_S, S-Loss can be quite unstable in the beginning epochs and can guide the gradient descent into one of its many local minima, especially when the gradients of Train-MSE are weak due to a paucity of training data. On the other hand, CoPhy-PGNN performs consistently better than all other baseline methods, with the smallest variance in its results across 10 random runs. In fact, our proposed model is able to perform well even with only 100 training samples.
5.2 Studying convergence across epochs
Figure 3 shows the variations in Train-MSE, Test-MSE, and the PG loss terms for four comparative models at every epoch of gradient descent. We can see that all models are able to achieve a reasonably low value of Train-MSE at the final solution except CoPhy-PGNN (Label-free), which is expected since it does not consider minimizing Train-MSE in the learning objective. Black-box NN actually shows the lowest value of Train-MSE among all models. However, the quantity we really care to minimize is not Train-MSE but Test-MSE, which is indicative of generalization performance. We can see that while our proposed model, CoPhy-PGNN, shows slightly higher Train-MSE than Black-box NN, it shows drastically smaller Test-MSE at the converged solution, demonstrating the effectiveness of our proposed approach.
A contrasting feature of the convergence plots of CoPhy-PGNN relative to Black-box NN is the presence of an initial jump in the Test-MSE values during the first few epochs. This likely arises due to the competing nature of the two loss terms that we are trying to minimize in the beginning epochs: Train-MSE, which tries to move towards local minima solutions favorable to the training data, and E-Loss, which pushes the gradient descent towards generalizable solutions. Indeed, this initial jump in Test-MSE helps in moving out of local minima solutions, after which the Test-MSE plummets to significantly smaller values. Notice that CoPhy-PGNN (Label-free) shows a similar jump in Test-MSE in the beginning epochs, because it experiences a similar effect of E-Loss gradients during the initial stages of ANN learning. However, we can see that its Test-MSE is never able to drop below a certain value after the initial jump, as it does not receive the necessary gradients of Train-MSE that help in converging towards generalizable solutions.
Another interesting observation is that CoPhy-PGNN (w/o E-Loss) does not show any jump in Test-MSE during the beginning epochs, in contrast to our proposed model, since it is not affected by E-Loss. If we further look at the S-Loss curves, we can see that CoPhy-PGNN (w/o E-Loss) achieves the lowest values, since it only considers S-Loss as the PG loss term to be minimized in the learning objective. However, we know that S-Loss is home to a large number of local minima, and for that reason, even though CoPhy-PGNN (w/o E-Loss) shows low S-Loss values, its Test-MSE quickly grows to a large value, indicating its convergence to a local minimum. These results demonstrate that a careful trade-off of PG loss terms along with Train-MSE is critical to ensure good generalization performance, such as that of our proposed model.
To better understand the behavior of competing loss terms, we conducted a novel gradient analysis that can be found in the supplementary materials.
5.3 Evaluating generalization power
Instead of computing the average cosine similarity across all test samples, Figure 4 analyzes the trends in cosine similarity over test samples with different values of B_x, for four comparative models. Note that none of these models has observed any labeled data during training outside the narrow training interval of B_x. Hence, by testing the cosine similarity on test samples with B_x outside this interval, we are directly testing the ability of the ANN models to generalize beyond the data distribution they have been trained upon. Evidently, all label-aware models perform well on the training interval. However, except for CoPhy-PGNN, all baseline models degrade significantly outside that interval, demonstrating their lack of generalizability. Moreover, the label-free model, CoPhy-PGNN (Label-free), is highly erratic and performs poorly across the board.
5.4 Analysis of loss landscapes
To truly understand the effect of adding PG loss on an ANN's generalization performance, here we visualize the landscape of different loss functions w.r.t. the ANN model parameters. In particular, we use the code in (Bernardi (2019)) to plot a 2D view of the landscape of different loss functions, namely Train-MSE, Test-MSE, and PG loss (the sum of S-Loss and E-Loss), in the neighborhood of a model solution, as shown in Figure 5. In each of the subfigures of this plot, the model's parameters are treated with filter normalization as described in (Li et al. (2018)); hence, the coordinate values of the axes are unit-less. The model solutions are represented by blue dots. As can be seen, all label-aware models have found a minimum in the Train-MSE landscape. However, when the Test-MSE loss surface is plotted, it is clear that while the CoPhy-PGNN model is still at a minimum, the other baseline models are not. This is a strong indication that using the PG loss with unlabeled data can lead to better extrapolation; it allows the model to generalize beyond in-distribution data. We can also see that, without using labels, CoPhy-PGNN (Label-free) fails to reach a good minimum of Test-MSE, even though it arrives at a minimum of the PG loss.
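The sketch below outlines the kind of 2D slice used for such visualizations: the loss is evaluated on a grid around the trained weights along two random directions, each rescaled to the norm of the corresponding parameter tensor (a per-tensor simplification of the per-filter normalization of Li et al. (2018)). The actual figures were produced with the loss-landscapes package of Bernardi (2019), whose API differs from this hand-rolled stand-in.

```python
import torch

def filter_normalized_direction(model):
    """Random direction rescaled per parameter tensor (simplified filter
    normalization in the spirit of Li et al., 2018)."""
    dirs = []
    for p in model.parameters():
        d = torch.randn_like(p)
        dirs.append(d * p.norm() / (d.norm() + 1e-10))
    return dirs

def loss_surface(model, loss_fn, steps=25, span=1.0):
    """Evaluate loss_fn(model) on a (steps x steps) grid around the trained
    weights, along two filter-normalized random directions."""
    base = [p.detach().clone() for p in model.parameters()]
    d1, d2 = filter_normalized_direction(model), filter_normalized_direction(model)
    grid = torch.linspace(-span, span, steps)
    surface = torch.zeros(steps, steps)
    with torch.no_grad():
        for i, a in enumerate(grid):
            for j, b in enumerate(grid):
                for p, p0, u, v in zip(model.parameters(), base, d1, d2):
                    p.copy_(p0 + a * u + b * v)
                surface[i, j] = float(loss_fn(model))
        for p, p0 in zip(model.parameters(), base):   # restore trained weights
            p.copy_(p0)
    return surface
```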
To understand the interactions among competing PG loss terms, we further computed the projection of the gradient of every loss term onto the optimal descent direction (computed empirically) at every epoch, and investigated the importance of the PG loss terms in guiding the learning towards this optimal direction at different stages of neural network learning. See the supplementary materials for more details.
6 Conclusions and future work
This work proposed novel strategies to address the problem of competing physics objectives in PGML. For the target problem in quantum mechanics, we designed a physics-guided machine learning (PGML) model, CoPhy-PGNN, to predict the ground-state wavefunction and energy of Ising chain models. We also designed comprehensive evaluations that demonstrate the efficacy of our CoPhy-PGNN model. From our results, we found that: 1) PG loss helps the model extrapolate and gives it better generalizability; 2) using labeled data along with PG loss results in more stable PGML models. Moreover, we visualized the loss landscape to provide a better understanding of how the combination of labeled data and PG loss leads to better generalization performance.
Future extensions of our work can explore the applicability of CoPhy-PGNN to other application domains with varying types of physics objectives. While this work empirically demonstrated the value of CoPhy-PGNN in combating competing PG loss terms, future work can focus on theoretical analyses. Finally, future work can also address the scalability of our model to larger systems in quantum mechanics.
7 Broader impacts
This research has several societal and scientific implications. Our work is motivated by a real-world scientific application: predicting the wave function of an N-particle system under varying external parameters. This is key to predicting several properties of the system, such as magnetization, entanglement entropy, and quantum phase transitions. While the analysis presented in our paper is a proof-of-concept of the validity of neural networks in predicting wave functions for N-particle systems under simplified external parameters, we hope that this work serves as a stepping stone in accelerating scientific discovery in quantum science by harnessing the power of neural networks for studying large and complex quantum systems while respecting the physics of the problem to ensure generalizability. Existing numerical approaches for solving for the ground-state wave function of N-particle systems involve computationally expensive techniques for diagonalizing the Hamiltonian matrix of the system, whose size grows exponentially (as 2^N) with the size of the system. This makes it practically impossible to scale such methods to systems with large N, even using modern high-performance computing infrastructure. Even for systems with smaller N, a great number of calculations for different external parameters have to be performed in order to construct the full phase diagram. In contrast, deep learning models, once trained, are quite inexpensive for performing inference on test Hamiltonian matrices, as observed during our preliminary scalability experiments to test the feasibility of our work for larger systems (see Supplementary Materials for more details). We hope that future extensions of our work explore the scalability of our proposed CoPhy-PGNN model on larger systems, without compromising the generalizability of neural network solutions. Our work also has broader impacts in several other scientific applications involving the solution of eigenvalue equations for the lowest (or, equivalently, the highest) eigenvalues. Such problems appear ubiquitously in fields such as electromagnetism, nonlinear optics, and materials science requiring quantum mechanical descriptions.

References
 Appenzeller [2017] Tim Appenzeller. The scientists' apprentice. Science, 357(6346):16–17, 2017.
 Graham-Rowe et al. [2008] D Graham-Rowe, D Goldston, C Doctorow, M Waldrop, C Lynch, F Frankel, R Reid, S Nelson, D Howe, and SY Rhee. Big data: science in the petabyte era. Nature, 455(7209):8–9, 2008.
 Karpatne et al. [2017a] Anuj Karpatne, Gowtham Atluri, James H Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering, 29(10):2318–2331, 2017a.
 Gil [2017] Yolanda Gil. Thoughtful artificial intelligence: Forging a new partnership for data science and scientific discovery. Data Science, 1(1-2):119–129, 2017.
 Raissi et al. [2019] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
 de Bezenac et al. [2019] Emmanuel de Bezenac, Arthur Pajot, and Patrick Gallinari. Deep learning for physical processes: Incorporating prior scientific knowledge. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124009, 2019.
 Karpatne et al. [2017b] Anuj Karpatne, William Watkins, Jordan Read, and Vipin Kumar. Physics-guided neural networks (pgnn): An application in lake temperature modeling. arXiv preprint arXiv:1710.11431, 2017b.
 Jia et al. [2019] Xiaowei Jia, Jared Willard, Anuj Karpatne, Jordan Read, Jacob Zwart, Michael Steinbach, and Vipin Kumar. Physics guided rnns for modeling dynamical systems: A case study in simulating lake temperature profiles. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 558–566. SIAM, 2019.
 Daw et al. [2020] Arka Daw, R Quinn Thomas, Cayelan C Carey, Jordan S Read, Alison P Appling, and Anuj Karpatne. Physics-guided architecture (pga) of neural networks for quantifying uncertainty in lake temperature modeling. In Proceedings of the 2020 SIAM International Conference on Data Mining, pages 532–540. SIAM, 2020.
 Wang et al. [2017a] Jian-Xun Wang, Jin-Long Wu, and Heng Xiao. Physics-informed machine learning approach for reconstructing Reynolds stress modeling discrepancies based on DNS data. Physical Review Fluids, 2(3):034603, 2017a.
 Pilania et al. [2018] Ghanshyam Pilania, Kenneth James McClellan, Christopher Richard Stanek, and Blas P Uberuaga. Physics-informed machine learning for inorganic scintillator discovery. The Journal of Chemical Physics, 148(24):241729, 2018.
 Karpatne et al. [2017c] Anuj Karpatne, Gowtham Atluri, James H Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering, 29(10):2318–2331, 2017c.
 Stewart and Ermon [2017] Russell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain knowledge. In AAAI, 2017.
 Muralidhar et al. [2018] Nikhil Muralidhar, Mohammad Raihanul Islam, Manish Marwah, Anuj Karpatne, and Naren Ramakrishnan. Incorporating prior domain knowledge into deep neural networks. In 2018 IEEE International Conference on Big Data (Big Data), pages 36–45. IEEE, 2018.
 Raissi et al. [2017a] Maziar Raissi, Paris Perdikaris, and G Karniadakis. Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561, 2017a.
 Shin et al. [2020] Yeonjong Shin, Jerome Darbon, and George Em Karniadakis. On the convergence and generalization of physics informed neural networks. arXiv preprint arXiv:2004.01806, 2020.
 Bonfim et al. [2019] O. F. de Alcantara Bonfim, B. Boechat, and J. Florencio. Ground-state properties of the one-dimensional transverse Ising model in a longitudinal magnetic field. Phys. Rev. E, 99:012122, Jan 2019. doi: 10.1103/PhysRevE.99.012122. URL https://link.aps.org/doi/10.1103/PhysRevE.99.012122.
 Wang et al. [2016] Jian-Xun Wang, Jin-Long Wu, and Heng Xiao. Physics-informed machine learning for predictive turbulence modeling: Using data to improve RANS modeled Reynolds stresses. arXiv preprint arXiv:1606.07987, 2016.
 Wang et al. [2017b] Jian-Xun Wang, Jinlong Wu, Julia Ling, Gianluca Iaccarino, and Heng Xiao. A comprehensive physics-informed machine learning framework for predictive turbulence modeling. arXiv preprint arXiv:1701.07102, 2017b.
 Raissi et al. [2017b] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (part ii): Data-driven discovery of nonlinear partial differential equations. arXiv preprint arXiv:1711.10566, 2017b.
 Wang et al. [2020] Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv preprint arXiv:2001.04536, 2020.
 Caruana [1993] Rich Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on International Conference on Machine Learning, ICML'93, page 41–48, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc. ISBN 1558603077.
 Kang et al. [2011] Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, page 521–528, Madison, WI, USA, 2011. Omnipress. ISBN 9781450306195.
 Bernardi [2019] Marcello De Bernardi. loss-landscapes, 2019. URL https://github.com/marcellodebernardi/loss-landscapes/.
 Li et al. [2018] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6389–6399. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets.pdf.
 Brando et al. [2016] M. Brando, D. Belitz, F. M. Grosche, and T. R. Kirkpatrick. Metallic quantum ferromagnets. Rev. Mod. Phys., 88:025006, May 2016. doi: 10.1103/RevModPhys.88.025006. URL https://link.aps.org/doi/10.1103/RevModPhys.88.025006.
 Zhou et al. [2017] Yi Zhou, Kazushi Kanoda, and Tai-Kai Ng. Quantum spin liquid states. Rev. Mod. Phys., 89:025003, Apr 2017. doi: 10.1103/RevModPhys.89.025003. URL https://link.aps.org/doi/10.1103/RevModPhys.89.025003.
 Terhal [2015] Barbara M. Terhal. Quantum error correction for quantum memories. Rev. Mod. Phys., 87:307–346, Apr 2015. doi: 10.1103/RevModPhys.87.307. URL https://link.aps.org/doi/10.1103/RevModPhys.87.307.
 Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. URL http://arxiv.org/abs/1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
 Bergstra and Bengio [2012] James Bergstra and Yoshua Bengio. Random search for hyperparameter optimization. J. Mach. Learn. Res., 13(1):281–305, February 2012. ISSN 15324435.
 Choromanska et al. [2014] Anna Choromanska, Mikael Henaff, Michaël Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surface of multilayer networks. CoRR, abs/1412.0233, 2014. URL http://arxiv.org/abs/1412.0233.
Appendix A Relevant Physics Background
A.1 Ising Chain and Quantum Mechanics
Quantum mechanics provides a theoretically rigorous framework to investigate the physical properties of quantum materials by solving the Schrödinger equation, the fundamental law of quantum mechanics. The Schrödinger equation is essentially a PDE that can be easily transformed into an eigenvalue problem of the form H ψ = E ψ, where H is the Hamiltonian matrix, ψ is the wavefunction, and E is the energy, a scalar quantity. (Note that many other PDEs in the physical sciences, e.g., Maxwell's equations, yield to a similar transformation to an eigenvalue problem.) All information related to the dynamics of the quantum system is encoded in the eigenvectors of H. Among these eigenvectors, the ground-state wavefunction, defined as the eigenvector with the lowest energy, is a fundamental quantity for understanding the properties of quantum systems. Exploring how the ground state evolves with controlling parameters, e.g., magnetic field and bias voltage, is an important subject of study in materials science.

A major computational bottleneck in solving for the ground-state wavefunction is the diagonalization of the Hamiltonian matrix, H, whose dimension grows exponentially with the size of the system. In order to study the effects of controlling parameters on the physical properties of a quantum system, theorists routinely have to perform diagonalizations on an entire family of Hamiltonian matrices with the same structure but slightly different parameters.
Here we study a quintessential model, the transverse-field Ising chain model (Bonfim et al. [2019]), which is a one-dimensional spin chain under the influence of a transverse magnetic field, as shown in Fig. 6. Spin is the intrinsic angular momentum possessed by elementary particles including electrons, protons, and neutrons. The Ising spin chain model describes a system in which multiple spins are located along a chain and interact only with their neighbors. By adding an external magnetic field, the ground-state wavefunction can change dramatically. This model and its derivatives have been used to study a number of novel quantum materials (Brando et al. [2016], Zhou et al. [2017]) and can also be used for quantum computing (Terhal [2015]), since the qubit, the basic unit of quantum computing, can also be represented as a spin. However, the challenge in finding the ground-state wavefunction of this model is that the dimension of the Hamiltonian grows exponentially as 2^N, where N equals the number of spins. We aim to develop PGML approaches that can learn the predictive mapping from the space of Hamiltonians, H, to ground-state wavefunctions, ψ, using the physics of the Schrödinger equation along with labels produced by diagonalization solvers on the training set.

Appendix B Hyperparameter Selection
B.1 Hyperparameter Search
To exploit the best potential of the models, we conducted a hyperparameter search prior to many of our experiments (all code and data used in this work are available at https://github.com/jayroxis/CophyPGNN; a complete set of code, data, pre-trained models, and stored variables can be found at https://osf.io/ps3wx/?view_only=9681ddd5c43e48ed91af0db019bf285a, file cophypgnn.tar.gz), by randomly sampling every hyperparameter value from a fixed range on a fixed training set size. We chose the average of the top-5 hyperparameter settings that showed the lowest error on the validation set, which consisted of 2,000 instances sampled from the training set. For the proposed CoPhy-PGNN model, this resulted in a fixed set of hyperparameter values, which we kept fixed for all models. The same hyperparameter values were used across all training sizes in our experiments to show the robustness of these values.

We searched for the best model architecture using simple multi-layer neural networks that do not show significant overfitting or underfitting, and then fixed that architecture for all the models in our work. The models comprise four fully-connected hidden layers with nonlinear activations and a linear output layer. The widths of all hidden layers are 100. All experiments used the Adamax optimizer (Kingma and Ba [2014]) and set the maximum number of training epochs to 500, before which most models converge.
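For concreteness, a sketch of such an architecture is shown below. The tanh activation, the flattened-Hamiltonian input, and the combined (wavefunction, energy) output head are our own assumptions for illustration, since those details are elided here.

```python
import torch
import torch.nn as nn

class WaveFunctionNet(nn.Module):
    """Sketch of the described architecture: four fully-connected hidden layers
    of width 100 and a linear output layer predicting (psi_hat, E_hat)."""
    def __init__(self, n_spins=4, hidden=100):
        super().__init__()
        dim = 2 ** n_spins                     # wavefunction dimension
        layers, width = [], dim * dim          # input: flattened Hamiltonian (assumption)
        for _ in range(4):
            layers += [nn.Linear(width, hidden), nn.Tanh()]
            width = hidden
        self.body = nn.Sequential(*layers)
        self.out = nn.Linear(hidden, dim + 1)  # wavefunction (dim) + energy (1)

    def forward(self, H):
        z = self.out(self.body(H.flatten(start_dim=1)))
        return z[:, :-1], z[:, -1]             # (psi_hat, E_hat)

model = WaveFunctionNet()
optimizer = torch.optim.Adamax(model.parameters(), lr=1e-3)
```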
Since different models may use different loss terms, the number of hyperparameters to search differs across models, and some hyperparameters were not searched. We used random search (Bergstra and Bengio [2012]) and ran around 300 to 500 trials per model to keep a balance between search quality and the time spent (a minimal sketch of this search procedure is shown after the list below). The hyperparameters we searched include:

For E-Loss: the annealing hyperparameters of Eq. (3).

For S-Loss: the cold-start hyperparameters of Eq. (4).
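A minimal sketch of the random-search procedure referenced above is given below; the sampling ranges and the trial budget are illustrative only, and `train_and_validate` is a hypothetical callback that trains a model with the given configuration and returns its validation error.

```python
import random

def random_search(train_and_validate, n_trials=400, top_k=5):
    """Random hyperparameter search sketch (Bergstra and Bengio, 2012):
    sample schedule hyperparameters uniformly, rank by validation error,
    and average the top-k settings."""
    trials = []
    for _ in range(n_trials):
        config = {
            "lam_e0": 10 ** random.uniform(-2, 1),     # E-Loss annealing start (Eq. 3)
            "alpha": random.uniform(0.1, 0.9),          # E-Loss annealing rate (Eq. 3)
            "lam_s_max": 10 ** random.uniform(-2, 1),   # S-Loss final weight (Eq. 4)
            "t0": random.uniform(0, 200),               # S-Loss cold-start cutoff (Eq. 4)
        }
        trials.append((train_and_validate(config), config))
    trials.sort(key=lambda t: t[0])                     # lowest validation error first
    best = [cfg for _, cfg in trials[:top_k]]
    return {k: sum(cfg[k] for cfg in best) / top_k for k in best[0]}
```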
B.2 Sigmoid Cold-start and Other Modes
Additionally, to further demonstrate that our choice of the sigmoid cold-start for λ_S is indeed effective, we compared it with three other modes: quick-drop (Eq. 8), quick-start (Eq. 9), and inversed-sigmoid (Eq. 7).
(7) 
(8) 
(9) 
We replaced the sigmoid cold-start with each of the three modes in CoPhy-PGNN and ran 400 trials for each mode to perform a hyperparameter search over the corresponding schedule parameters, using the average hyperparameter values of the top-5 models with the lowest validation error. The results are:

quick-drop: , , .

quick-start: , , .

sigmoid: , , .

inversed-sigmoid: , , .
Using these hyperparameter values to set up the models, and running 10 times per setting on different training sizes, we obtain the results shown in Fig. 7. We can see that the sigmoid cold-start consistently performed better than the other modes in both stability and accuracy. Another important piece of information conveyed here is that quick-start, even though it increases the weight of S-Loss much faster, yields much more unstable results. In fact, our results show that quick-start dominates the leaderboard of both the top-10 best and the top-10 worst performances (also shown in the bar plot in Figure 7). This implies that a smooth and gradual switch of dominance between different loss terms is better in terms of stability.
Appendix C Analysis of gradients
C.1 Contribution of Loss Terms
The complicated interactions between competing loss terms motivate us to further investigate the different role each loss term plays in the aggregated loss function. The sharp bulge in both Test-MSE and the PG loss curves in the first few epochs (Fig. 3 in the main document) shows that the optimization process is not quite smooth. Our speculation is that S-Loss and E-Loss are competing loss terms in the multi-objective optimization. To monitor the contribution of every loss term to the learning process, we need to measure whether the gradient of a loss term points towards the optimal direction of descent to a generalizable model. One way to achieve this is to compute the component of the gradient of a loss term along the optimal direction of descent (leading to a generalizable model). Suppose the desired (or optimal) direction at the t-th epoch is d_t and the gradient of a loss term is g_t. We can then compute the projection of g_t along the direction of d_t at the t-th epoch as:
(10)   proj_t = ⟨g_t, d_t⟩ / ‖d_t‖
A higher projection value indicates a larger step toward the optimal direction at the t-th epoch, d_t, which is defined as:
(11)   d_t = W* − W_t
where W_t denotes the model parameters (i.e., the weights and biases of the neural network) at the t-th epoch and W* is the optimal state of the model that is known to be generalizable. Note that finding an exact solution for W* that is the global optimum of the loss function is practically infeasible for deep neural networks (Choromanska et al. [2014]). Hence, in our experiments, we consider the final model arrived at on convergence of the training process as a reasonable approximation of W*. For methods such as PGNN-analogue and CoPhy-PGNN, the final models at convergence performed significantly well and showed a high cosine similarity with the ground-truth, very close to that of a model trained directly on the test set, which reaches 99.8%. This gives some confidence that the final models at convergence are good approximations of W*. To compute the inner products between g_t and d_t, we used a flattened representation of the model parameters by concatenating the weights and biases across the layers.
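A sketch of this projection computation with flattened parameters is shown below; it uses the converged model as the stand-in for the optimal parameters W*, as described above, and the function names are our own.

```python
import torch

def flatten_params(model):
    """Concatenate all weights and biases into a single vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def loss_projection(model, loss, w_star_flat):
    """Projection of a loss term's gradient onto the optimal direction
    d_t = W* - W_t (Eqs. 10-11), evaluated at the current parameters W_t.

    `loss` is the scalar value of the loss term at the current parameters;
    `w_star_flat` is the flattened converged (approximately optimal) model."""
    grads = torch.autograd.grad(loss, list(model.parameters()), retain_graph=True)
    g = torch.cat([gr.reshape(-1) for gr in grads])
    d = w_star_flat - flatten_params(model)
    return torch.dot(g, d) / (d.norm() + 1e-12)
```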
C.2 Experiment Results
We analyze the role of S-Loss and E-Loss in the training process of two methods: CoPhy-PGNN and PGNN-analogue. Both methods were run with the same initialization of model parameters. The training size is 2000 and the rest of the settings are the same as in Section 5 of the main document.
Figure 8 shows that in the early epochs, E-Loss has positive projection values, which means that it is helping the method move towards the optimal state. On the other hand, the projection of S-Loss starts with a negative value, indicating that the gradient of S-Loss is counter-productive at the beginning. Hence, E-Loss helps in moving out of the neighborhood of the local minima of S-Loss towards a generalizable solution. However, the projection of S-Loss does not remain negative (and thus counter-productive) across all epochs. In fact, S-Loss makes a significant contribution by having a large positive projection value after around 50 epochs. This shows that as long as we manage to escape the initial trap caused by the local minima of S-Loss, it turns to guide the model towards the desired direction d_t. By initially setting λ_S close to zero, we allow E-Loss to dominate in the initial epochs and move out of the local minima. Later, we let λ_S recover to a reasonable value, and S-Loss starts to play its role. These findings align quite well with the cold-start and annealing ideas proposed in this work and show that the two loss terms work best when combined using adaptive weights.
Note that for this analysis method to produce valid findings, we need to ensure that the loss terms are not pointing towards the direction of an equally good alternative optimum that can be arrived at from the same initialization. To ensure this, we investigated how similar the trained models (optimal states) are when started from the same initialization for the two methods. The parameters of PGNN-analogue and CoPhy-PGNN showed an average cosine similarity of 98.6%, and in many cases reached 99%. This gave us more confidence that our approximations to the optimal model were sufficient.