1 Introduction
Most of the dramatic successes of machine learning in the 21st century have relied on very large datasets to achieve their performance (https://www.edge.org/responsedetail/26587 argues that the creation of high-quality datasets is a better predictor of progress in AI than algorithmic advancement). In supervised tasks, including image recognition [RDS15], speech recognition [GrMH13] and machine translation [SVL14], large datasets had to be assembled before neural network architectures particularly suited to these applications could emerge and achieve human-level performance on these tasks. In reinforcement learning [SB98], successful agents, including Go champion AlphaZero [SSS17] and Atari-playing DQNs [MKS15], operate in easily simulated toy environments that enable the collection of large quantities of data in the form of observations and interactions with the environment.
The paradigm of weakly supervised learning [RBVR17] seeks to reduce the data requirements by encoding our prior knowledge into machine learning systems. In this work, we explore the ability to encode our prior knowledge about the physical world into an appropriate loss function to guide deep learning.
We are interested in inference within physical environments whose rules can be defined by differential equations. We focus on the case of 2-dimensional heat transport [Fou07], whose dynamics are defined by a second-order differential equation. We seek to encode the dynamics of the system into a loss function, such that the learning algorithm can learn to produce correct solutions to the future state of the system without having to observe any labeled data. We develop a convolutional kernel which encodes the constraint that must be satisfied by any steady-state solution to the heat flow problem, and we use this kernel to determine the loss function. By seeking to minimize this loss, the network learns to satisfy the differential equations of heat transport, effectively learning the underlying physics of the system despite never explicitly being shown the outcomes for any given initial condition.
While the physical system examined throughout this paper is 2D heat transport, the methods developed are extremely general and can be applied to any system defined by partial differential equations and theoretically capable of being solved by the finite difference method (even if it is not practical to do so).
This points us toward the possibility of encoding the equations we have discovered, which govern many physical environments of interest to us, into neural networks. It would be quite convenient if we could encode this information directly in our learning agents, such that they could benefit from the knowledge we have already gained about how the world works.
2 Related Work
This work exists at the intersection of two different lines of research pursued by disparate research communities. The first line of work, weakly supervised machine learning, is pursued primarily within the machine learning community and seeks to reduce the data requirements of machine learning applications. It aims to do so by incorporating some form of prior knowledge, either to augment existing data, or to create context-aware learning algorithms that are able to achieve high performance with less data. The second line of work involves the use of physics-informed machine learning for modeling physical systems, and is pursued primarily within the mathematical physics and engineering communities.
[RBVR17] defines weak supervision as a unified approach to incorporating various types of weak signal into the machine learning pipeline. These forms of weak signal include crowdsourced [WfWB09], noisy [NDRT13], or sparse labels (as in active and semi-supervised learning) [ZG09, Set12]. Additionally, weak signal can be specified in the form of constraints and invariances over the output space (often provided by domain experts), or in terms of weak or biased classifiers (as in transfer learning or boosting). Although these techniques and approaches are disparate, they are unified by their aim to alleviate the need for vast quantities of data to solve machine learning problems.
The present work is most closely connected to work that aims to incorporate domain-specific prior knowledge in the form of constraints over the output space. The nearest cousin to this work is [SE16], in which physical constraints that must be satisfied by solutions to motion-tracking problems are specified over the output trajectories, enabling tracking to be done in a label-free way. We extend this work to the broader domain of prediction in physical systems governed by nonlinear PDEs. In natural language processing, [LJK13, AZ11] seek to semantically parse statements or questions (i.e. convert them into their logical forms) with weak supervision signals (e.g. in the form of responses to queries rather than the meaning of the query itself). Recent works [CGCR10, GPLL17] in this area apply constraints on the output space to provide weak signal to the semantic parser.
The second line of research that the present work extends leverages machine learning techniques to either discover the underlying dynamics of a system governed by unknown PDEs, or to build faster and more accurate differential equation solvers. [RK17, Rai18] use traditional machine learning techniques and deep learning techniques respectively, both for predicting the future of time-dependent dynamical systems and for discovering their underlying equations. [FGP17] uses conditional generative models to find the equilibrium solution to a number of transport problems faster than traditional iterative methods. [HJE17, SS17] leverage deep learning techniques to approximately solve partial differential equations in high-dimensional spaces, where traditional iterative techniques break down. All of these techniques use large quantities of data in a supervised way, unlike the present work.
3 Background
3.1 Heat Transport
In 2D heat transport, we consider a flat square plate made of some thermally conductive material that is insulated along its edges. Heat is applied to the plate in some way, and our goal is to model the way thermal energy moves through the plate. The initial condition is given by $u_0(x, y) = u(x, y, 0)$, and we wish to determine $u(x, y, t)$, the temperature field on the plate at time $t$. In our model we assume that nonzero elements of $u_0$ represent an applied heat, i.e. heat applied at that point on the grid for the duration of the transport experiment. Under ideal assumptions, it can be shown that $u$ satisfies the two-dimensional heat equation [Fou07]
$$\frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right),$$
where $\alpha$ is a constant for the thermal conductivity of the plate. We can also study solutions that do not vary with time, known as the steady-state solutions to the system:
$$\frac{\partial u}{\partial t} = 0.$$
In this case we get the Laplace equation:
$$\nabla^2 u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0. \qquad (1)$$
Solutions to the Laplace equation are known as harmonic functions, and the particular solution to the steady-state heat transport problem is the harmonic function which also satisfies the initial condition of the system. When heat is applied only to the boundary of the plate, as it is in the cases we study in this paper, solving equation (1) is known as the Dirichlet boundary problem [Dir50].
3.2 Finite Difference
Finite difference is an iterative method used to compute numerical solutions to partial differential equations via a discretization of the equations and an update rule that is defined by the equations. To solve the 2D steady-state heat equation using finite difference, we need to discretize $\nabla^2 u = 0$. Considering an evenly spaced 2D grid, the discretized form of $\nabla^2 u = 0$ for node $(i, j)$ would be:
$$u_{i,j} = \frac{1}{4} \left( u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} \right). \qquad (2)$$
The nodal relation expressed in the above equation is solved iteratively by applying the rule above at each node (point in the grid) until convergence.
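The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the solver used in the experiments: the function name `solve_laplace`, the stopping tolerance, and the 16 x 16 example plate are all assumptions made for the sketch.

```python
import numpy as np


def solve_laplace(boundary, tol=1e-8, max_iters=100_000):
    """Apply the four-neighbour averaging rule until the grid stops changing.

    `boundary` is a 2D array whose edge rows and columns hold the fixed
    temperatures; its interior entries only serve as the starting guess.
    Returns the converged grid and the number of sweeps performed.
    """
    u = np.asarray(boundary, dtype=float).copy()
    iteration = 0
    for iteration in range(1, max_iters + 1):
        # Average of the up, down, left and right neighbours of each interior node,
        # computed from the previous sweep (a Jacobi-style update).
        averaged = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
        change = np.max(np.abs(averaged - u[1:-1, 1:-1]))
        u[1:-1, 1:-1] = averaged
        if change < tol:
            break
    return u, iteration


# Hypothetical example: top edge held at 100, the other edges at 0.
plate = np.zeros((16, 16))
plate[0, :] = 100.0
solution, n_iters = solve_laplace(plate)
```

Only the interior nodes are updated, so the edge values act as the fixed boundary condition throughout.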
4 Model
4.1 Approach
The aim of this work is to train a fully convolutional neural network to directly infer the solution to the Laplace equation (1) when given the initial condition as input (i.e. train the neural network to be a solver for the Dirichlet boundary problem [Dir50]). We accomplish this without ever seeing solutions to the boundary problem by encoding the differential equations into a physics-informed loss function, described in section 4.3, which motivates the network to find the solution without the use of supervision in the form of data. The architecture of the network is described in section 4.2.
Each instantiation of the Dirichlet boundary problem is given to the neural network in the form of an image matrix representing a thermally conductive plate, with the value in each entry representing the temperature applied to each point on the plate. In this work, we only apply heat to the boundary of the plate, so the only nonzero entries in the input are at the boundaries. Zero values in the input represent points on the plate with no temperature applied, meaning they can change over the course of time, as they are influenced by the temperature at neighboring points on the thermally conductive plate.
The desired output is also an image matrix representing the temperature values at each point on the plate after the heat flow process has converged to an equilibrium temperature distribution.
The network is trained with randomly initialized boundary conditions. Because the network is never provided solutions to the boundary value problem, we can generate new data points on the fly virtually free of cost by creating new matrices with the boundaries filled in by random values. In this work, we choose four random temperatures uniformly between 0 and 100 and set the temperatures of the top, right, left and bottom of the plate to be constant, and equal to each of the random temperatures correspondingly. An example input and output are shown in Figure 1.
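The sampling procedure above might look as follows. This is a sketch under stated assumptions: the function name `random_boundary_problem` is hypothetical, and the text does not say which edge owns the four corner pixels, so here the top and bottom rows overwrite them.

```python
import numpy as np


def random_boundary_problem(size, rng, low=0.0, high=100.0):
    """Build one size x size input: constant random edge temperatures, zero interior."""
    top, right, left, bottom = rng.uniform(low, high, size=4)
    grid = np.zeros((size, size))
    grid[:, 0] = left
    grid[:, -1] = right
    # Assumption: the corner pixels take the top and bottom edge temperatures.
    grid[0, :] = top
    grid[-1, :] = bottom
    return grid


rng = np.random.default_rng(0)
sample = random_boundary_problem(64, rng)
```

Because no solved target is needed, a fresh batch can be drawn this way at every training step.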
In section 4.4, we describe how the output image is downsampled repeatedly and a multiscale loss function is computed. Downsampling and training over multiple losses in this fashion dramatically speeds up training and improves the quality of the output images.
4.2 Network Architecture
The architecture of the network is a fully convolutional encoder-decoder network adapted from the U-Net architecture [RFB15]. The fully convolutional network is composed of several encoding convolutional layers that decrease the image size to that of the latent space (in our case, ). The decoding layers consist of transposed convolutions on the output of the previous layer concatenated with the corresponding encoding layer. The concatenations amount to skip connections from each encoding layer to its corresponding output layer. Ultimately, an image of the original input size is recovered, and the aim is for the final image to represent the equilibrium temperatures on the plate. The architecture is shown in Figure 1.
The motivation for using a fully convolutional architecture is to flexibly use the same architecture to solve problems at multiple scales. The purpose of the skip connections is to pass the boundary values of the input to the output layers, so the network is not forced to memorize the structure of the input in its bottleneck layers. The architecture mirrors that of [FGP17], in which this network architecture was used to solve a variety of differential equations in a supervised way.
4.3 Physics Informed Kernel
The equilibrium condition defined in equation (1) can be encoded in a simple rule: the temperature at each point on the plate (that is not initially driven by a heat source) should be the average of its neighbors. In fact, it is by iteration of this rule that the finite-difference method for solving partial differential equations typically solves this problem, as was shown in equation (2).
Examining this condition, we find that it can easily be encoded into a 3x3 convolutional kernel as follows:
$$\begin{bmatrix} 0 & -1 & 0 \\ -1 & +4 & -1 \\ 0 & -1 & 0 \end{bmatrix}$$
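This kernel is run convolutionally across an image, and the outputs are flattened and normed, to calculate the loss of the output. Writing $K$ for the kernel and $\hat{u}$ for the output image:
$$\mathcal{L}(\hat{u}) = \left\lVert K * \hat{u} \right\rVert. \qquad (3)$$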
By minimizing this loss, the neural network learns to identify harmonic functions that form solutions to the given Dirichlet boundary problem specified by the initial condition, since the boundary of the output is fixed to be equal to the initial condition.
Although this kernel can easily be identified from the structure of the heat equations, we show in section 5.3 that this kernel can also be learned if labels are provided (in the form of correct solutions to the heat flow equations). Although this kernel is easy to derive and justify for the case of heat transport, in principle a similar kernel can be found for any system whose dynamics are defined by partial differential equations. In general, given an update rule for finite difference equations, it is easy to encode this rule into a convolutional kernel.
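To make the loss concrete, the following NumPy sketch slides the 3x3 kernel over every interior pixel and norms the responses. It is an illustration under assumptions, not the training code: the name `physics_loss` is hypothetical, the L2 norm is assumed because the text only says the outputs are "normed", and the kernel uses the sign convention in which the four neighbors carry weight -1 against the +4 center.

```python
import numpy as np

# Kernel from this section: 4 * centre minus the four direct neighbours.
LAPLACE_KERNEL = np.array([[0.0, -1.0, 0.0],
                           [-1.0, 4.0, -1.0],
                           [0.0, -1.0, 0.0]])


def physics_loss(u, kernel=LAPLACE_KERNEL):
    """Slide the 3x3 kernel over all interior pixels of `u`; return the L2 norm of the responses."""
    u = np.asarray(u, dtype=float)
    rows, cols = u.shape
    response = np.zeros((rows - 2, cols - 2))
    for di in range(3):
        for dj in range(3):
            # Shifted view of `u` aligned with kernel entry (di, dj).
            response += kernel[di, dj] * u[di:di + rows - 2, dj:dj + cols - 2]
    return np.linalg.norm(response.ravel())


# x^2 - y^2 satisfies the averaging rule exactly on a grid; x^2 + y^2 does not.
y, x = np.mgrid[0:32, 0:32].astype(float)
harmonic = x**2 - y**2
not_harmonic = x**2 + y**2
```

Here `physics_loss(harmonic)` is zero while `physics_loss(not_harmonic)` is strictly positive, which is the signal the network is trained to drive down.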
4.4 Progressive Downsampling for Growing Loss
The network can have difficulty learning to output the heat distribution that minimizes the loss function for very large input sizes, in part because outputting a constant on the entire field has a loss of 0, if we ignore the boundary. As the image size grows larger, the boundary values become a proportionally smaller fraction of the total image, reducing the proportion of loss contributed by outputting incorrect values for the points near the border of the plate. Thus, the network is less able to accurately fill in correct temperatures, especially towards the center of the image, instead opting to output constant temperature fields.
In order to solve this problem, we adopt a modified version of the strategy of progressively growing the output of the network to the final problem size, first introduced in [KALL17]. In our case, instead of progressively adding decoding layers, as in [KALL17], we compute the loss on several downsampled versions of the output image, and weight their contribution to the overall loss according to a weight vector $w = (w_1, \ldots, w_k)$, where each $w_i$ is the weight of the loss on the $i$-th downsampled version of the output. The loss for each individual downsampled output is computed in the same way as described in section 4.3, but the weight vector can slowly be tuned from one concentrated on the coarsest scale to one concentrated on the full-resolution output. Early in training, this motivates the network to output images which satisfy the overall higher-level structure of the desired output. We slowly increase the weights associated with getting the finer-grained details of the output correct.
In our training, we choose a downsample factor of 4, until the images are of size , the scale at which training works well even without a progressively downsampled loss function.
The general framework is shown in Figure 2.
This framework can be viewed as a special case of curriculum learning [BLCW09], where the difficulty of the curriculum scales as the weight vector becomes more concentrated on the largest scale. This provides a flexible framework by which output size can be used to craft a curriculum, and we suspect that it can be widely applied to multiscale problems in image generation, as well as other domains.
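A weighted multiscale loss of this kind could be sketched as below. Several details are assumptions made for illustration: downsampling is done by average pooling, the per-scale loss is the L2 norm of the kernel response, and the linear `weight_schedule` is a hypothetical stand-in for however the weights were actually tuned. All function names are illustrative.

```python
import numpy as np


def laplace_residual_norm(u):
    """L2 norm of (4 * centre - four neighbours) over all interior pixels."""
    residual = (4.0 * u[1:-1, 1:-1] - u[:-2, 1:-1] - u[2:, 1:-1]
                - u[1:-1, :-2] - u[1:-1, 2:])
    return np.linalg.norm(residual.ravel())


def downsample(u, factor):
    """Average-pool by `factor` (assumption; sides must be divisible by `factor`)."""
    rows, cols = u.shape
    return u.reshape(rows // factor, factor, cols // factor, factor).mean(axis=(1, 3))


def multiscale_loss(u, weights, factor=4):
    """weights[0] scales the full-resolution loss, weights[i] the loss after i downsamplings."""
    scaled = np.asarray(u, dtype=float)
    total = 0.0
    for i, w in enumerate(weights):
        if i > 0:
            scaled = downsample(scaled, factor)
        total += w * laplace_residual_norm(scaled)
    return total


def weight_schedule(progress, n_scales):
    """Hypothetical schedule: all weight on the coarsest scale at progress=0,
    moving scale by scale to full resolution at progress=1."""
    position = (1.0 - progress) * (n_scales - 1)
    lower = int(np.floor(position))
    frac = position - lower
    weights = np.zeros(n_scales)
    weights[lower] = 1.0 - frac
    if lower + 1 < n_scales:
        weights[lower + 1] = frac
    return weights


# A 64 x 64 output evaluated at 64, 16 and 4 pixels per side.
y, x = np.mgrid[0:64, 0:64].astype(float)
early_loss = multiscale_loss(x**2, weight_schedule(0.0, 3))
late_loss = multiscale_loss(x**2, weight_schedule(1.0, 3))
```

With a downsample factor of 4, each extra scale shrinks the side length fourfold, so early in training the loss only sees the coarse structure of the output.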
5 Experiments & Applications
5.1 Solving the Boundary Value Problem
Test results are shown for inputs of size 256 × 256 in Figure 3. During training of this experiment, the network was never shown the correct answer to the boundary value problem, but only ever given random initializations and trained to minimize the loss in equation (3). The figure shows a number of images containing the specified boundary value problem, the true equilibrium temperatures given that boundary condition (solved to high precision via finite difference), and the output of the neural network after training.
The network has sixteen hidden layers (8 encoding, 8 decoding) and has been trained for 128 epochs using a progressive downsampling strategy to grow the loss function as described in section 4.4. The optimization is done with an Adam optimizer [KB14]. Despite never seeing a correct desired output for the boundary value problem, the average per-pixel output error is only 1.39% with a standard deviation of 1.24%. The test examples are unseen during training, and are generated randomly according to the same procedure as during training.
5.2 Solving at Very Large Scale
As described in section 4.4, the network has difficulty learning the correct solution at larger scales, because outputting constant values along the entire image becomes a viable strategy to achieve a low loss. However, by using the strategy of progressive downsampling to grow the loss function, the network is able to easily find solutions to the boundary value problem, achieving good results in just a few epochs. The basic method, without progressive downsampling, fails entirely to learn.
Results for inputs of size 1024 × 1024 are shown in Figure 4. The network has twenty hidden layers (10 encoding, 10 decoding) and has been trained for 128 epochs, optimized with Adam [KB14]. The average per-pixel output error is only 1.41% with a standard deviation of 1.64%. The test examples are unseen during training, and are generated randomly according to the same procedure as during training.
5.3 Learning the Convolutional Kernel
The kernel defined in section 4.3, copied below, can be easily determined by inspection of the differential equations defining heat transport (i.e. the form of the Laplace equation (1)).
$$\begin{bmatrix} 0 & -1 & 0 \\ -1 & +4 & -1 \\ 0 & -1 & 0 \end{bmatrix}$$
This may not be the case in general, however, as the local invariants satisfied by a system that obeys a certain set of differential equations may not be readily apparent. Furthermore, for some systems the differential equations governing their evolution may not even be known.
In this experiment, we show that for the case of heat transport, this convolutional kernel can be learned to high precision as long as we have data. This data should consist of the equilibrium conditions of heat transport (i.e. solutions to the Dirichlet problem defined in equation (1) for various initial conditions; the initial conditions themselves are not required).
We generate data on the fly by randomizing the initial condition and running an iterative finite-difference solver to solve the resulting problem. The data generated are 8 × 8 images which are converged to high precision. We then run a learnable convolutional kernel across the data and sum the absolute value of the outputs. The convolutional kernel is optimized via Adam, and its parameters seek to minimize the absolute value of the outputs of the convolution. After several thousand iterations on randomly generated data points, this procedure learns the following kernel:
0  0.0545  0.0001 
0.0545  0.2181  0.0545 
0  0.0547  0 
This is almost exactly a scaled version of the original kernel, which encodes the same rule: that each point on the equilibrium heat distribution should be the average of the temperature of its four neighboring points. This shows that if we have data which describes converged conditions of the phenomenon of interest, we can learn the local rules that define the equilibrium condition directly from this data. In principle, this can be used to discover differential equations governing the dynamics of unknown systems, although more work is needed to show that this can be done with more complex systems than heat transport (e.g. systems with higher-order nonlinearities).
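The idea that the local rule is recoverable from converged data alone can be illustrated without a gradient-based optimizer. The sketch below is not the procedure used in this experiment: it swaps Adam and the absolute-value objective for a least-squares fit, taking the unit-norm kernel that minimizes the squared response over all 3x3 patches (the right singular vector with the smallest singular value). It also assumes, for illustration only, that each boundary pixel is drawn independently rather than using four constant edge temperatures. Function names are hypothetical.

```python
import numpy as np


def solve_laplace(boundary, n_iters=1000):
    """Fixed number of averaging sweeps; ample for an 8 x 8 grid to converge."""
    u = boundary.copy()
    for _ in range(n_iters):
        u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
    return u


def extract_patches(u):
    """Every 3x3 neighbourhood centred on an interior pixel, flattened to length 9."""
    rows, cols = u.shape
    return np.array([u[i - 1:i + 2, j - 1:j + 2].ravel()
                     for i in range(1, rows - 1)
                     for j in range(1, cols - 1)])


rng = np.random.default_rng(0)
patch_sets = []
for _ in range(64):
    # Assumption: random temperature at every pixel; only the edge values
    # influence the converged solution, the interior is just a starting guess.
    field = rng.uniform(0.0, 100.0, size=(8, 8))
    patch_sets.append(extract_patches(solve_laplace(field)))
A = np.concatenate(patch_sets)

# The kernel k minimising ||A k|| subject to ||k|| = 1 is the last right singular vector.
_, singular_values, vt = np.linalg.svd(A, full_matrices=False)
kernel = vt[-1].reshape(3, 3)
# Rescale so the centre weight is 4, for comparison with the kernel of section 4.3.
kernel_scaled = 4.0 * kernel / kernel[1, 1]
```

Up to an overall scale and sign, the recovered kernel matches the five-point rule: the corner weights vanish and the four neighbors carry minus one quarter of the center weight.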
5.4 Application: Speeding Up Finite Difference Calculation
The finite difference method described in section 3.2 can be slow to converge to the correct solution [Rem12], and is particularly sensitive to the initial condition given to the algorithm. Thus, we can use a trained neural network to provide an output that is approximately correct (within 1.5% error), and this answer can be used as an initial condition to be refined by the finite difference method. The finite difference method can then compute the solution to a desired level of precision. Once trained, the forward pass through a neural network is fast, accounting for a negligible fraction of the time the finite-difference algorithm takes to converge. Our results demonstrate that this "warm-start" approach for solving the heat equations converges much faster than a constant initialization, i.e. one in which all pixels not on the border of the image are set to the average of the pixels on the border. In Figure 5, we compare the errors from ground truth at each iteration of finite difference for the two different solvers, one that is given the output of the trained neural net as a warm start and one that is not. We compute "ground truth" by running finite difference to a very high level of precision relative to both initializations of the algorithm.
A more appropriate comparison for this method would be to hierarchical finite difference solvers, which are more often used in practice, but it is hard to make a direct comparison of the error between a hierarchical solver and our proposed method, as the hierarchical solver does not solve the full problem until near the end of its computation. In terms of wall-clock time, our method is not competitive with the hierarchical solvers; however, it is possible that if a separately trained neural network is used at each layer of the hierarchy, our strategy will be superior to hierarchical finite difference algorithms.
6 Conclusion
We have shown how the equilibrium solution of the heat transport problem can be learned in a weakly supervised way by use of a physics-informed loss function that encodes the local rules defined by the differential equations of heat transport. Further, we have shown that it is possible to use this technique to speed up finite difference calculations relative to the naive approach. We have also demonstrated that the local rule defining heat transport can be learned, in the form of a kernel, directly from data. In principle, the differential equations for heat transport can be replaced with those for any other phenomenon whose dynamics are defined by partial differential equations. This work points toward the possibility of encoding our knowledge of the dynamics of the world into neural networks, allowing them to learn how physical systems work even in the absence of data.
References
 [AZ11] Yoav Artzi and Luke Zettlemoyer. Bootstrapping semantic parsers from conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 421–432, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
 [BLCW09] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pages 41–48, 2009.
 [CGCR10] James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. Driving semantic parsing from the world’s response. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL ’10, pages 18–27, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
 [Dir50] P.G.L. Dirichlet. Über einen neuen Ausdruck zur Bestimmung der Dichtigkeit einer unendlich dünnen Kugelschale, wenn der Werth des Potentials derselben in jedem Punkte ihrer Oberfläche gegeben ist. Abh. Königlich. Preuss. Akad. Wiss., 1850.
 [FGP17] Amir Barati Farimani, Joseph Gomes, and Vijay S. Pande. Deep learning the physics of transport phenomena. CoRR, abs/1709.02432, 2017.
 [Fou07] Jean-Baptiste Joseph Fourier. 1807.
 [GPLL17] Kelvin Guu, Panupong Pasupat, Evan Zheran Liu, and Percy Liang. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. CoRR, abs/1704.07926, 2017.

 [GrMH13] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649, May 2013.
 [HJE17] Jiequn Han, Arnulf Jentzen, and Weinan E. Overcoming the curse of dimensionality: Solving high-dimensional partial differential equations using deep learning. CoRR, abs/1707.02568, 2017.
 [KALL17] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
 [KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [LJK13] Percy Liang, Michael I. Jordan, and Dan Klein. Learning dependency-based compositional semantics. Comput. Linguist., 39(2):389–446, June 2013.
 [MKS15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529, Feb 2015.
 [NDRT13] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1196–1204. Curran Associates, Inc., 2013.
 [Rai18] Maziar Raissi. Deep hidden physics models: Deep learning of nonlinear partial differential equations. CoRR, abs/1801.06637, 2018.
 [RBVR17] Alex Ratner, Stephen Bach, Paroma Varma, and Chris Ré. Weak supervision: The new programming paradigm for machine learning (blog post), 2017.

 [RDS15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3):211–252, December 2015.
 [Rem12] Courtney Remani. Numerical methods for solving systems of nonlinear equations, 2012.
 [RFB15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
 [RK17] Maziar Raissi and George Em Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 2017.
 [SB98] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
 [SE16] Russell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain knowledge. CoRR, abs/1609.05566, 2016.
 [Set12] Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
 [SS17] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. ArXiv e-prints, August 2017.
 [SSS17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550:354, Oct 2017. Article.
 [SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014.
 [WfWB09] Jacob Whitehill, Ting fan Wu, Jacob Bergsma, Javier R. Movellan, and Paul L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 2035–2043. Curran Associates, Inc., 2009.
 [ZG09] Xiaojin Zhu and Andrew B. Goldberg. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130, 2009.