MeshfreeFlowNet: Physics-Constrained Space-Time Super-Resolution
We propose MeshfreeFlowNet, a novel deep learning-based super-resolution framework to generate continuous (grid-free) spatio-temporal solutions from low-resolution inputs. While being computationally efficient, MeshfreeFlowNet accurately recovers the fine-scale quantities of interest. MeshfreeFlowNet allows for: (i) the output to be sampled at all spatio-temporal resolutions, (ii) a set of Partial Differential Equation (PDE) constraints to be imposed, and (iii) training on fixed-size inputs on arbitrarily sized spatio-temporal domains, owing to its fully convolutional encoder. We empirically study the performance of MeshfreeFlowNet on the task of super-resolution of turbulent flows in the Rayleigh–Bénard convection problem. Across a diverse set of evaluation metrics, we show that MeshfreeFlowNet significantly outperforms existing baselines. Furthermore, we provide a large scale implementation of MeshfreeFlowNet and show that it efficiently scales across large clusters, achieving 96.80% scaling efficiency and a training time of less than 4 minutes.
In recent years, along with the significant growth of computational resources, there has been increasing effort toward accurately resolving the wide range of length and time scales present within different physical systems. These length and time scales often differ by several orders of magnitude. Such time and length scale disparities are commonly observed in physical phenomena such as turbulent flows, convective-diffusive systems, subsurface systems, multiphase flows, chemical processes, and climate systems [Hu2007, Hurrell2009, Bagchi2010, Liu1995, Szymczak2013, Qin2020, Quan2011, RIAZ2006, Esmaeilzadeh2020, Esmaeilzadeh2019]. Yet many key physical quantities of interest, such as the energy spectrum in turbulent flows, depend strongly on correctly resolving these fine scales.
From a numerical perspective, resolving the wide range of spatio-temporal scales within such physical systems is challenging since extremely small spatial and temporal numerical stencils would be required. In order to alleviate the computational burden of fully resolving such a wide range of spatial and temporal scales, multiscale computational approaches have been developed. For instance, in the subsurface flow problem, the main idea of the multiscale approach is to build a set of operators that map between the unknowns associated with the computational cells in a fine-grid and the unknowns on a coarser grid. The operators are computed numerically by solving localized flow problems. The multiscale basis functions have subgrid-scale resolutions, ensuring that the fine-scale heterogeneity is correctly accounted for in a systematic manner [Jenny2006, Jenny2005, Jenny2003]. Similarly, in turbulent flows [Bramkamp2004, Modest2005], climate forecast [Shen2013], multiphase systems [Luo2009, Bi2004], reactive transport [Bauer2001, Odman1991], complex fluids [Ren2005] and biological systems [Perdikaris2016], multiscale approaches attempt to relate the coarse-scale solutions to the fine-scale local solutions taking into account the fine-scale physics and dynamics. In these multiscale approaches, although the small scale physics and dynamics are being captured and accounted for at larger scales, the fine-grid solutions are extremely costly and hence impractical to solve for. Reconstructing the fine-scale subgrid solutions by having coarse-scale solutions in space and/or time still remains a challenge.
In this manuscript, we refer to such a process of reconstructing the fine-scale subgrid solutions from the coarse-scale solutions as super-resolution, where the high-resolution solutions are reconstructed using the low-resolution physical solutions in space and/or time. One key insight is that there are inherent statistical correlations between such pairs of low-resolution and high-resolution solutions that can be leveraged to reconstruct high-resolution solutions from the low-resolution inputs. Furthermore, this super-resolution process can be effectively modeled using deep learning models that learn such statistical correlations in a self-supervised manner from low-resolution and high-resolution pairs that exist in a training dataset. A successful super-resolution model should be able to efficiently represent the high-resolution outputs, effectively scale to large spatio-temporal domains as in a realistic problem setting, and allow for incorporating physical constraints in the form of PDEs to regularize the outputs to physically valid solutions. Furthermore, in order for a learning-based methodology to be applied to real problems such as Computational Fluid Dynamics (CFD) simulations and climate modeling, the methodology needs to address the High Performance Computing (HPC) challenges of scalability to large scale computing systems, for both the training and inference stages.
To this end, we propose MeshfreeFlowNet, a novel physics-constrained, deep learning based super-resolution framework to generate continuous (grid-free) spatio-temporal solutions from low-resolution inputs. MeshfreeFlowNet first maps the low-resolution inputs to a localized latent context grid using a convolutional encoder, which can then be continuously queried at arbitrary resolutions. In summary, our main contributions are as follows:
We propose MeshfreeFlowNet, a novel and efficient physics-constrained deep learning model for super-resolution tasks.
We implement a set of physics-based metrics that allow for an objective assessment of the reconstruction quality of the super-resolved high-resolution turbulent flows.
We empirically assess the effectiveness of the MeshfreeFlowNet framework on the task of super-resolving turbulent flows in the Rayleigh–Bénard convection problem, showing a consistent and significant improvement over competing baselines in recovering key physical quantities of interest in the super-resolved solutions.
We demonstrate the scalability of our framework to larger and more challenging problems by providing a large scale implementation of MeshfreeFlowNet that scales to up to 128 GPUs while retaining a 96.80% scaling efficiency.
Recently, deep learning models have been applied for studying fluid flow problems in different applications. In particular, a certain class of research works has studied flow problems on a grid representation where dynamic flow properties (e.g., velocity, pressure, etc.) could be considered as structured data similar to images. For instance, guo2016convolutional used convolutional architectures for approximating non-uniform steady laminar flow around 2D and 3D objects. Similarly, bhatnagar2019prediction
used convolutional neural networks for prediction of the flow fields around airfoil objects and for studying aerodynamic forces in a turbulent flow regime.
Zabaras used Bayesian deep convolutional neural networks to build surrogate flow models for the purpose of uncertainty quantification for flow applications in heterogeneous porous media. Alternatively, a different area of research has attempted to solve the PDEs governing different physical phenomena by using simple fully connected deep neural networks. In order to enforce physical constraints, the PDEs under consideration are added as a term to the loss function at the training stage. For instance,
Weinan used fully connected deep neural networks with skip connections for solving different classes of PDEs such as Poisson equations and eigenvalue problems.
bar2019unsupervised used fully connected deep neural networks for solving forward and inverse problems in the context of electrical impedance tomography governed by elliptical PDEs. long2017pde proposed a framework to uncover unknown physics in the form of PDEs by learning constrained convolutional kernels. raissi2019physics applied fully connected neural networks for solving PDEs in a forward or inverse setup; using their proposed framework, they solved PDEs in different contexts such as fluids, quantum mechanics, seismology, reaction-diffusion systems, and the propagation of nonlinear shallow-water waves. smith2020eikonet deployed residual neural networks to solve the Eikonal equation and perform ray tracing for an application in earthquake hypocenter inversion and seismic tomography. In the computer vision literature, latent grid representations have been used for the task of reconstructing 3D scenes, where latent context grids are decoded into continuous implicit functions that represent the surfaces of different geometries.
In this section, we provide a detailed explanation of the problem setup, notation, and datasets, along with the learning paradigm and evaluation metrics.
Consider a space and time dependent partial differential equation (PDE) with a unique solution, written in the general form as

$\mathcal{N}[u](\mathbf{x}, t) = f(\mathbf{x}, t), \quad (\mathbf{x}, t) \in \Omega \times [0, T]$ (1a)

$\mathcal{B}[u](\mathbf{x}, t) = b(\mathbf{x}, t), \quad (\mathbf{x}, t) \in \partial\Omega \times [0, T]$ (1b)

defined on a spatio-temporal domain $\Omega \times [0, T]$, where $\Omega \subset \mathbb{R}^d$ represents a $d$-dimensional spatial domain and $[0, T]$ represents the one dimensional temporal domain. Here $\mathcal{N}$ is the operator defining the PDE within the domain $\Omega$, $f$ is the source function, $b$ is the boundary condition function, $\mathcal{B}$ is the operator defining the PDE on the boundary $\partial\Omega$, and the solution to the PDE in Eqn. 1 is $u$.
For a partial differential equation (Eqn. 1), a problem instance is realized with given source and boundary condition functions $f$ and $b$. For a pair $(f, b)$, we consider $\mathcal{G}_h$ to be an operator that generates the high-resolution solution $u_h$, i.e., $u_h = \mathcal{G}_h(f, b)$, and $\mathcal{G}_l$ to be an operator that produces the low-resolution solution $u_l$, i.e., $u_l = \mathcal{G}_l(f, b)$. Consider a compact normed space of operators $\mathcal{H}$, a set of operators $\Psi \in \mathcal{H}$ mapping low-resolution solutions to high-resolution ones. For a given compact normed space of problem instances $\mathcal{F}$, let $\Gamma(\mathcal{F})$ denote the approximation gap with respect to $\mathcal{H}$, i.e.,

$\Gamma(\mathcal{F}) = \max_{(f,b) \in \mathcal{F}} \; \min_{\Psi \in \mathcal{H}} \; \big\| \mathcal{G}_h(f,b) - \Psi\big(\mathcal{G}_l(f,b)\big) \big\|$ (2)

with continuous approximation error in $\mathcal{F}$. Here the norm is with respect to a desired space $L^q_\rho$, where $q \geq 1$ and $\rho$ is a preferred measure. (Note that in the max-min game of Eqn. 2, the action of the environment player, $(f,b)$, is revealed to the approximator player to choose $\Psi$. Therefore, since the game is in the favor of the approximating player, the value of the game, $\Gamma(\mathcal{F})$, can be much smaller than its min-max counterpart. We require the min-max value to be desirably small when we aim to have a single model perform well on a set of designated problems.) In this work, given a problem instance specified by $(f, b)$, we are interested in learning $\Psi$ using parametric models. In order to map a low-resolution solution to its corresponding super-resolution with low approximation error, we require $\min_{\Psi \in \mathcal{H}} \| \mathcal{G}_h(f,b) - \Psi(\mathcal{G}_l(f,b)) \|$ to be small and comparable with a desired error level. Therefore, given a problem set $\mathcal{F}$, a proper and rich class of operators $\mathcal{H}$ allows for a desirable approximation error.

For a given PDE in Eqn. 1, defined on a hyper-rectangular domain in $\mathbb{R}^{d+1}$, we consider a spatio-temporal grid of resolutions $(n_1, \ldots, n_d, n_t)$, with $n_t$ denoting the number of discretized points in the time domain, and a dataset $\mathcal{D}$ of points and their solution values.
In this work, we generate the dataset as the solution to a classical fluid dynamics system with a chaotic nature. We consider the well-known Rayleigh–Bénard instability problem in 2D, where a static bulk fluid (kinematic viscosity $\nu$, thermal diffusivity $\kappa$) initially occupies the space between two horizontal plates (see Fig. 1). The lower and upper plates are considered to be hot and cold, respectively, with temperatures $T_h$ and $T_c$. Gradually the temperature of the bulk fluid adjacent to the hot (cold) plate increases (decreases), and due to buoyancy effects and the density gradient the bulk fluid ascends (descends), leading to the formation of vortices and the growth of flow instability in a chaotic and turbulent regime. The governing partial differential equations for the Rayleigh–Bénard instability problem are
$\partial \mathbf{u}/\partial t + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\nabla p + \nu \nabla^2 \mathbf{u} + T\,\hat{\mathbf{z}}$ (3a)

$\partial T/\partial t + \mathbf{u} \cdot \nabla T = \kappa \nabla^2 T$ (3b)

$\nabla \cdot \mathbf{u} = 0$ (3c)
where $\nu = \sqrt{Pr/Ra}$ and $\kappa = 1/\sqrt{Ra\,Pr}$ are the non-dimensional viscosity and diffusivity, with $Ra$ and $Pr$ being the Rayleigh and Prandtl numbers, respectively defined in terms of the dimensional fluid properties as $Ra = g\alpha\Delta T L^3/(\nu\kappa)$ and $Pr = \nu/\kappa$, with $g$, $\alpha$, $\nu$, $\kappa$, $\Delta T$, and $L$ respectively being the gravity acceleration, thermal expansion coefficient, kinematic viscosity, thermal diffusivity, temperature difference between the hot and cold plates, and the separation length between the plates. From a physical perspective, $Ra$ quantifies the balance between the gravitational forces and viscous damping, and $Pr$ quantifies the ratio between momentum diffusivity and thermal diffusivity (the balance between heat convection and conduction).
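One common non-dimensionalization for Rayleigh–Bénard convection, assumed here, sets $\nu = \sqrt{Pr/Ra}$ and $\kappa = 1/\sqrt{Ra\,Pr}$; a short sketch of computing these coefficients for a sample case (the specific $Ra$ and $Pr$ values are illustrative, not taken from the paper):

```python
import math

# Hypothetical parameter values for illustration; the paper sweeps ranges of Ra and Pr.
Ra, Pr = 1e6, 1.0

# Non-dimensional diffusivities used in the governing equations (assumed convention):
nu = math.sqrt(Pr / Ra)            # effective kinematic viscosity
kappa = 1.0 / math.sqrt(Ra * Pr)   # effective thermal diffusivity

# For Pr = 1 the two coefficients coincide.
assert abs(nu - kappa) < 1e-15
```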
We use the Dedalus framework [Burns2019] in order to numerically solve the system of Equations (3a)-(3c) using the spectral method approach. We solve Equations (3a)-(3c) for a duration of in time with a time step size of . In a coordinate system of -axis, and -axis, we consider a plate length and a separation distance , and discretize the domain with and points respectively in the and directions.
For the simulation cases, we consider Rayleigh and Prandtl numbers respectively in the range of and . Upon solving the system of Eqns. (3a)-(3c), we create a high-resolution Rayleigh–Bénard simulation dataset , unless otherwise mentioned, with a spatial resolution of = = , and a temporal resolution of = (upon adaptive time stepping). We consider a normalized domain size of unit length in -direction with a domain aspect ratio of 4, i.e., [], and solve the Rayleigh–Bénard problem for a duration of 50 []. Then we create a low-resolution dataset , by downsampling the high-resolution data in both space and time. We use downsampling factors of and for creating the low-resolution data in the temporal and spatial dimensions respectively.
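The construction of the low-resolution dataset by strided subsampling of the high-resolution fields can be sketched as follows; the grid sizes and downsampling factors below are placeholders (the specific values are elided in the text above), and the array layout is an assumption:

```python
import numpy as np

# Hypothetical downsampling factors and grid sizes, for illustration only.
t_factor, y_factor, x_factor = 4, 8, 8
n_t, n_y, n_x, n_c = 64, 64, 256, 4  # time, y, x, physical channels (p, T, u, v)

# Stand-in for the high-resolution Rayleigh-Benard solution fields.
u_high = np.random.rand(n_t, n_y, n_x, n_c)

# Low-resolution counterpart: strided subsampling in time and space.
u_low = u_high[::t_factor, ::y_factor, ::x_factor, :]
```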
In this work, we use multiple physical metrics, each accounting for a different aspect of the flow field, in order to evaluate the model for super-resolving low-resolution data. As the specific evaluation metrics, we report the Normalized Mean Absolute Error (NMAE) and the R² score of the physical metrics between the ground truth and the predicted high-resolution data. These physical metrics are listed in the following.
Total Kinetic Energy ($E_k$): the kinetic energy per unit mass associated with the flow is defined as the total kinetic energy and can be expressed as $E_k = \frac{1}{2}\langle u_i u_i \rangle$.
Root Mean Square Velocity ($u_{rms}$): the square root of the scaled total kinetic energy is defined as the root mean square (RMS) velocity, $u_{rms} = \sqrt{2E_k/3}$.
Dissipation ($\varepsilon$): the rate at which turbulence kinetic energy is converted into thermal internal energy by molecular viscosity, defined as $\varepsilon = 2\nu \langle s_{ij} s_{ij} \rangle$, with $s_{ij} = \frac{1}{2}(\partial u_i/\partial x_j + \partial u_j/\partial x_i)$ and $\nu$ respectively being the rate of strain tensor and the kinematic viscosity.
Taylor Microscale ($\lambda$): the intermediate length scale at which viscous forces significantly affect the dynamics of turbulent eddies, defined as $\lambda = \sqrt{15\,\nu\, u_{rms}^2/\varepsilon}$. Length scales larger than the Taylor microscale (i.e., the inertial range) are not strongly affected by viscosity. Below the Taylor microscale (i.e., the dissipation range), turbulent motions are subject to strong viscous forces and kinetic energy is dissipated into heat.
Taylor-scale Reynolds Number ($Re_\lambda$): defined as the ratio of RMS inertial forces to viscous forces, expressed as $Re_\lambda = u_{rms}\lambda/\nu$.
Kolmogorov Time ($\tau_\eta$) and Length ($\eta$) Scales: Kolmogorov microscales are the smallest scales in turbulent flows, where viscosity dominates and the turbulent kinetic energy dissipates into heat. The Kolmogorov time and length scales can be respectively expressed as $\tau_\eta = (\nu/\varepsilon)^{1/2}$ and $\eta = (\nu^3/\varepsilon)^{1/4}$.
Turbulent Integral Scale ($\ell$): a measure of the average spatial extent or coherence of the fluctuations, which can be expressed as $\ell = \frac{\pi}{2 u_{rms}^2} \int \frac{E(k)}{k}\,dk$, where $E(k)$ is the energy spectrum.
Large Eddy Turnover Time ($T_\ell$): defined as the typical time scale for an eddy of length scale $\ell$ to undergo significant distortion, which is also the typical time scale for the transfer of energy from scale $\ell$ to smaller scales, since this distortion is the mechanism for energy transfer; it is expressed as $T_\ell = \ell/u_{rms}$.
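Assuming the standard turbulence-statistics definitions sketched above (the paper's exact averaging conventions are not fully recoverable from this text), several of these metrics can be computed from a 2D velocity field roughly as follows:

```python
import numpy as np

def turbulence_metrics(u, v, nu, dx=1.0, dy=1.0):
    """Standard turbulence statistics from a 2D velocity field (illustrative
    sketch; averaging conventions assumed, arrays indexed [y, x])."""
    # Total kinetic energy per unit mass and RMS velocity.
    e_k = 0.5 * np.mean(u**2 + v**2)
    u_rms = np.sqrt(2.0 * e_k / 3.0)  # 3D convention, as in DNS databases

    # Rate-of-strain tensor components and dissipation: eps = 2 nu <s_ij s_ij>.
    dudy, dudx = np.gradient(u, dy, dx)
    dvdy, dvdx = np.gradient(v, dy, dx)
    s11, s22 = dudx, dvdy
    s12 = 0.5 * (dudy + dvdx)
    eps = 2.0 * nu * np.mean(s11**2 + s22**2 + 2.0 * s12**2)

    # Taylor microscale, Taylor-scale Reynolds number, Kolmogorov scales.
    lam = np.sqrt(15.0 * nu * u_rms**2 / eps)
    re_lam = u_rms * lam / nu
    tau_eta = np.sqrt(nu / eps)
    eta = (nu**3 / eps) ** 0.25
    return dict(e_k=e_k, u_rms=u_rms, eps=eps, lam=lam,
                re_lam=re_lam, tau_eta=tau_eta, eta=eta)
```

For example, evaluating these on a Taylor–Green-like test field yields positive, mutually consistent statistics.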
In order to further illustrate the feasibility of utilizing our proposed framework for large physical systems requiring orders of magnitude more computation, here we study the scalability of MeshfreeFlowNet. We scale our model on a GPU cluster on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC). We use data-distributed parallelism with synchronous gradient descent, and test performance on up to 16 nodes with a total of 128 GPUs.
In a data-parallel distribution strategy, the model is replicated over all GPUs. At every step, each GPU receives its own random batch of the dataset to calculate the local gradients. Gradients are averaged across all devices with an all_reduce operation. To achieve better scaling efficiency, the communication of one layer's gradients is overlapped with the backprop computation of the previous layer. The PyTorch torch.distributed [torch_distributed] package provides communication primitives for multiprocess parallelism with different collective communication backends. We use NVIDIA's Collective Communications Library (NCCL) [nccl] backend, which gave us the best performance when running on GPUs, both within and across multiple nodes. The PyTorch torch.nn.parallel.DistributedDataParallel [torch_ddp] wrapper, which we use for the results in this paper, builds on top of the torch.distributed package to provide efficient data-parallel distributed training. With this setup, we achieve more than 96% throughput efficiency on 16 Cori GPU nodes, 128 GPUs in total. Cori has 8 V100 (Volta) GPUs per node; the GPUs are interconnected with NVLinks in a hybrid cube-mesh topology [cori_gpu_topog]. The nodes are equipped with Mellanox MT27800 (ConnectX-5) EDR InfiniBand network cards.

In this work, we propose the MeshfreeFlowNet framework as a novel computational algorithm for constructing the super-resolution solutions to partial differential equations from their low-resolution counterpart solutions.
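The synchronous gradient-averaging step described above can be sketched in a framework-agnostic way; here plain NumPy arrays stand in for per-GPU gradient tensors and the NCCL all_reduce collective, so this is an illustration of the averaging semantics, not the actual DistributedDataParallel code:

```python
import numpy as np

def all_reduce_mean(local_grads):
    """Average gradients across workers: the effect of an all_reduce (sum)
    followed by division by the world size."""
    world_size = len(local_grads)
    return np.sum(local_grads, axis=0) / world_size

# Each of 4 hypothetical workers computes gradients on its own random batch.
rng = np.random.default_rng(0)
local_grads = [rng.normal(size=8) for _ in range(4)]

# After the reduction, every replica applies the same averaged gradient,
# which keeps the model weights synchronized across workers.
global_grad = all_reduce_mean(local_grads)
```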
A schematic of the training pipeline for the MeshfreeFlowNet model is presented in Fig. 4. The model takes as input a low-resolution spatio-temporal grid, which can be acquired from a PDE solver, along with query locations that can take any continuous value within the range of the input domain. MeshfreeFlowNet produces a predicted output vector at each query location.
MeshfreeFlowNet consists of two subnetworks, namely the Context Generation Network and the Continuous Decoding Network. The Context Generation Network produces a Latent Context Grid, which is a learned localized representation of the function. The Latent Context Grid contains latent context vectors at each grid vertex which, along with spatio-temporal coordinates, are concatenated and fed into the Continuous Decoding Network, modeled as a Multilayer Perceptron (MLP), to generate the physical outputs. For training the model, two loss terms, namely the prediction loss and the equation loss, are used. We present a more detailed architectural breakdown of our method and our baseline method in Fig. 5. We discuss the details of the Context Generation Network (Sec. 4.1), Continuous Decoding Network (Sec. 4.2), and loss functions (Sec. 4.3) in the following subsections.
100×NMAE (R²); metric columns follow the physical metrics listed above.

| $\lambda_{eqn}$ | $E_k$ | $u_{rms}$ | $\varepsilon$ | $\lambda$ | $Re_\lambda$ | $\tau_\eta$ | $\eta$ | $\ell$ | $T_\ell$ | avg. R² |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.664 (0.9992) | 0.752 (0.9986) | 0.701 (0.9991) | 0.643 (0.9978) | 0.503 (0.9987) | 0.884 (0.9968) | 0.849 (0.9974) | 1.311 (0.9925) | 0.548 (0.9994) | 0.9977 |
| 0.0125 | 0.698 (0.9990) | 0.666 (0.9987) | 0.671 (0.9989) | 0.554 (0.9983) | 0.408 (0.9991) | 0.805 (0.9971) | 0.767 (0.9978) | 1.048 (0.9946) | 0.515 (0.9992) | 0.9981 |
| 0.025 | 1.271 (0.9976) | 0.742 (0.9988) | 1.261 (0.9976) | 0.623 (0.9984) | 0.682 (0.9984) | 0.849 (0.9972) | 0.819 (0.9978) | 0.821 (0.9966) | 0.931 (0.9984) | 0.9978 |
| 0.05 | 0.951 (0.9985) | 0.732 (0.9987) | 0.948 (0.9985) | 0.611 (0.9977) | 0.473 (0.9987) | 0.929 (0.9962) | 0.881 (0.9971) | 0.811 (0.9964) | 0.732 (0.9989) | 0.9978 |
| 0.1 | 1.691 (0.9960) | 1.091 (0.9970) | 1.672 (0.9960) | 0.902 (0.9967) | 0.852 (0.9975) | 1.221 (0.9940) | 1.193 (0.9950) | 1.269 (0.9931) | 1.152 (0.9976) | 0.9959 |
| 0.2 | 2.031 (0.9940) | 2.093 (0.9926) | 2.041 (0.9938) | 1.623 (0.9932) | 1.659 (0.9923) | 1.832 (0.9912) | 1.841 (0.9927) | 1.632 (0.9895) | 1.673 (0.9955) | 0.9927 |
| 0.4 | 14.701 (0.7569) | 4.033 (0.9652) | 14.521 (0.7877) | 2.374 (0.9840) | 5.506 (0.9117) | 4.447 (0.9486) | 4.269 (0.9580) | 4.352 (0.9365) | 10.985 (0.8841) | 0.9036 |
| 0.8 | 27.162 (-0.1611) | 4.979 (0.9564) | 27.132 (0.0733) | 6.843 (0.9087) | 12.604 (0.4617) | 6.645 (0.9010) | 6.195 (0.9198) | 8.132 (0.8057) | 20.661 (0.5900) | 0.6062 |
| 1.0 | 39.101 (-2.2671) | 5.722 (0.9231) | 39.923 (-1.4482) | 11.562 (0.6148) | 21.043 (-1.5721) | 5.094 (0.9271) | 5.117 (0.9371) | 10.109 (0.6812) | 32.252 (0.0209) | -0.1314 |
The Context Generation Network is a convolutional encoder that produces a Latent Context Grid from the low-resolution physical input $u_l$. Denote this network as

$\mathcal{C} = g(u_l; \theta_g)$ (4)

where $\theta_g$ is the set of learnable parameters associated with this network and $\mathcal{C}$ is the generated context grid, with $c$ latent channels, i.e., $c$ is the dimension of each latent context vector. It is worth noting that the function $g$ can be applied on a small fixed-size sub-window of $u_l$, since the model is fully-convolutional. By applying it to the input in a fully-convolutional manner, the size of the domain at test time (both spatial and temporal sizes) can be orders of magnitude greater than the dimensions of $u_l$.
In this work, we implement the Context Generation Network as a 3D variant of the U-Net architecture originally proposed by Ronneberger2015. U-Net has been successfully applied to different computer vision tasks that involve dense localized predictions, such as image style transfer [Zhang2018], image segmentation [Lian2019, Chen2019], image enhancement [Chen2018], image coloring [Liu2019, Fang2019], and image generation [Esser2018, Zhang].
Different from the original U-Net architecture, we replace the 2D convolutions with their 3D counterparts and utilize residual blocks instead of individual convolution layers for better training properties of the network. The U-Net comprises a contractive part followed by an expansive part. The contractive part is composed of multiple stages of convolutional residual blocks, each followed by a max-pooling layer (of stride 2). Each residual block consists of 3 convolution layers (1×1, 3×3, 1×1) interleaved with batch normalization layers and ReLU activations. The expansive part mirrors the contractive part, replacing max-pooling with nearest neighbor upsampling. Between layers with matching grid sizes in the contractive and expansive parts, a skip connection concatenates the features from the contractive layer with the features in the expansive layer as the input to the subsequent layers in the expansive part, in order to preserve localized contextual information.
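The grid-size bookkeeping behind this contractive/expansive design can be illustrated with a small sketch: each contractive stage halves the grid (stride-2 pooling), each expansive stage doubles it (nearest-neighbor upsampling), so stages at matching depths have identical grid sizes and their features can be concatenated via skip connections. The helper below is hypothetical (not part of the released code), with arbitrary stage count and input size:

```python
def unet_grid_sizes(size, n_stages):
    """Trace per-stage grid sizes through a U-Net-style encoder/decoder."""
    contractive = [size]
    for _ in range(n_stages):
        size //= 2               # max-pooling with stride 2
        contractive.append(size)
    expansive = [size]
    for _ in range(n_stages):
        size *= 2                # nearest-neighbor upsampling
        expansive.append(size)
    return contractive, expansive

# 64 -> 32 -> 16 -> 8 on the way down, 8 -> 16 -> 32 -> 64 on the way up,
# so each expansive stage has a same-size contractive partner to concatenate with.
down, up = unet_grid_sizes(64, 3)
```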
100×NMAE (R²); metric columns follow the physical metrics listed above.

| Model | $E_k$ | $u_{rms}$ | $\varepsilon$ | $\lambda$ | $Re_\lambda$ | $\tau_\eta$ | $\eta$ | $\ell$ | $T_\ell$ | avg. R² |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (I) | 69.6360 (-19.7894) | 3470.132 (-57.7717) | 76.338 (-14) | 78.164 (-11096) | 75.729 (-4934) | 77.410 (-12599) | 122.55 (-3147) | 137.376 (-0.5190) | 109.398 (-3.0092) | -3541 |
| Baseline (II) | 6.489 (0.9557) | 8.769 (0.8967) | 6.144 (0.9593) | 3.903 (0.9490) | 2.489 (0.9711) | 5.584 (0.9382) | 6.019 (0.94019) | 2.902 (0.9597) | 5.076 (0.9644) | 0.9482 |
| MeshfreeFlowNet ($\lambda_{eqn} = 0$) | 0.664 (0.9992) | 0.752 (0.9986) | 0.701 (0.9991) | 0.643 (0.9978) | 0.503 (0.9987) | 0.884 (0.9968) | 0.849 (0.9974) | 1.311 (0.9925) | 0.548 (0.9994) | 0.9977 |
| MeshfreeFlowNet ($\lambda_{eqn} = \lambda^*$) | 0.698 (0.9990) | 0.666 (0.9987) | 0.671 (0.9989) | 0.554 (0.9983) | 0.408 (0.9991) | 0.805 (0.9971) | 0.767 (0.9978) | 1.048 (0.9946) | 0.515 (0.9992) | 0.9981 |
100×NMAE (R²); metric columns follow the physical metrics listed above.

| Training dataset(s) | $E_k$ | $u_{rms}$ | $\varepsilon$ | $\lambda$ | $Re_\lambda$ | $\tau_\eta$ | $\eta$ | $\ell$ | $T_\ell$ | avg. R² |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.698 (0.9990) | 0.666 (0.9987) | 0.671 (0.9989) | 0.554 (0.9983) | 0.408 (0.9991) | 0.805 (0.9971) | 0.767 (0.9978) | 1.048 (0.9946) | 0.515 (0.9992) | 0.9981 |
| 10 | 0.527 (0.9996) | 0.574 (0.9989) | 0.517 (0.9996) | 0.3930 (0.9991) | 0.3433 (0.9994) | 0.540 (0.9988) | 0.536 (0.9990) | 0.677 (0.9985) | 0.345 (0.9998) | 0.9992 |
One unique property of our super-resolution methodology is that the output is continuous instead of discrete. This removes the limitations in output resolution, and additionally, it allows for an effective computation of the gradients of predicted output physical quantities, enabling an easy way of enforcing PDE-based physical constraints.
The continuous decoding network can be implemented using a simple Multilayer Perceptron, see Fig. 4. For each query, denote the spatio-temporal query location as $\mathbf{x}$ and the latent context grid as $\mathcal{C} = \{(\mathbf{x}_i, \mathbf{c}_i)\}$, where $\mathbf{x}_i$ and $\mathbf{c}_i$ are the spatio-temporal coordinates and the latent context vector for the $i$-th vertex of the grid. Denote $N(\mathbf{x})$ as the set of neighboring vertices that bound $\mathbf{x}$, where $|N(\mathbf{x})| = 2^{d+1}$ for a $(d{+}1)$-dimensional spatio-temporal grid. Denote the continuous decoding network, implemented as a Multilayer Perceptron, as

$f(\mathbf{x} - \mathbf{x}_i, \mathbf{c}_i; \theta_f)$ (5)

where $\theta_f$ is the set of trainable parameters of the Multilayer Perceptron network. The query value at $\mathbf{x}$ with respect to the shared network and the latent context grid can be calculated as

$\hat{u}(\mathbf{x}) = \sum_{i \in N(\mathbf{x})} w_i \, f(\mathbf{x} - \mathbf{x}_i, \mathbf{c}_i; \theta_f)$ (6)

where $w_i$, with $\sum_{i \in N(\mathbf{x})} w_i = 1$, is the trilinear interpolation weight with respect to the bounding vertex $\mathbf{x}_i$, and $|N(\mathbf{x})|$ is the stencil size corresponding to the discretization grid vertices.
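The weighted-sum decoding over the bounding vertices can be sketched as follows; a stand-in linear decoder replaces the trained MLP, and the helper names are hypothetical:

```python
import itertools
import numpy as np

# Sketch: the output at a query point is a weighted sum of the decoder
# evaluated at the 2^(d+1) bounding vertices (here d = 2 space + 1 time,
# so 8 corners) with multilinear interpolation weights.

def trilinear_weights(frac):
    """frac: query offsets in [0, 1]^3 within its cell.
    Returns the 8 unit-cell corners and their weights, which sum to 1."""
    corners = list(itertools.product([0, 1], repeat=3))
    weights = []
    for c in corners:
        w = 1.0
        for f, bit in zip(frac, c):
            w *= f if bit else (1.0 - f)
        weights.append(w)
    return corners, np.array(weights)

def decode(frac, corner_latents, decoder):
    """Weighted sum of decoder(x - x_i, c_i) over the bounding corners."""
    corners, w = trilinear_weights(frac)
    vals = np.array([decoder(np.array(frac) - np.array(c), corner_latents[c])
                     for c in corners])
    return (w[:, None] * vals).sum(axis=0)
```

As a sanity check, if every corner carries the same latent vector and the decoder simply returns it, the interpolated output equals that vector, since the weights sum to one.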
Since the Continuous Decoding Network is implemented as an MLP, arbitrary spatio-temporal derivatives of the output quantities (e.g., $\partial \hat{u}/\partial x$, $\partial \hat{u}/\partial t$) can be effectively computed via backpropagation through the MLP. We combine these partial derivatives to compute the equation loss as the norm of the residual of the governing equations.
We use a weighted combination of two losses to train our network: the norm of the difference between the predicted physical outputs and the ground truth physical outputs, which we refer to as the Prediction Loss, and the norm of the residuals of the governing PDEs, which we refer to as the Equation Loss.
Denote the set of sample locations within a mini-batch of training samples as $S = \{S_i\}$, where $S_i = \{(\mathbf{x}_{ij}, \mathbf{u}_{ij})\}$ is the mini-batch of point samples for the $i$-th low-resolution input and $\mathbf{u}_{ij}$ is the vector that represents the ground truth physical output quantities. In the case of the Rayleigh–Bénard convection example in this study, we have $\mathbf{u} = (p, T, u, v)$, where $p$, $T$, $u$, and $v$ are the pressure, temperature, x-velocity and y-velocity terms respectively. The super-resolution for the learned model queried at $\mathbf{x}_{ij}$, conditioned on the low-resolution input $u_l^{(i)}$, is

$\hat{\mathbf{u}}_{ij} = \hat{u}\big(\mathbf{x}_{ij};\, g(u_l^{(i)}; \theta_g),\, \theta_f\big)$ (7)

where $\hat{\mathbf{u}}_{ij}$ is the predicted output vector. The prediction loss for a mini-batch can be formulated as

$L_{pred} = \frac{1}{|S|} \sum_{i} \sum_{j} \big\| \hat{\mathbf{u}}_{ij} - \mathbf{u}_{ij} \big\|_1$ (8)

where $\|\cdot\|_1$ is the norm of the difference; we use the L1 norm for computing the prediction loss. The Equation loss for a mini-batch can be formulated as

$L_{eqn} = \frac{1}{|S|} \sum_{i} \sum_{j} \big\| \mathcal{N}[\hat{\mathbf{u}}](\mathbf{x}_{ij}) - f(\mathbf{x}_{ij}) \big\|^2$ (9)

which is the norm of the PDE residual, using the PDE definition in Eqn. 1. Finally, a single loss term for training the network can be represented as a weighted sum of the two loss terms as

$L = L_{pred} + \lambda_{eqn} L_{eqn}$ (10)

where $\lambda_{eqn}$ is a hyperparameter for weighting the equation loss.
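The combined objective (L1 prediction loss plus a weighted squared PDE residual) can be sketched on a toy example; here the residual is a plain array standing in for the backprop-computed residual of the governing equations, and all names are illustrative:

```python
import numpy as np

def total_loss(u_pred, u_true, residual, lambda_eqn):
    """Weighted sum of the prediction loss and the equation loss."""
    pred_loss = np.mean(np.abs(u_pred - u_true))   # L1 prediction loss
    eqn_loss = np.mean(residual ** 2)              # mean squared PDE residual
    return pred_loss + lambda_eqn * eqn_loss

# Toy mini-batch: a uniformly biased prediction and a PDE that is
# (pretend-)satisfied exactly, so only the prediction term contributes.
u_true = np.linspace(0.0, 1.0, 10)
u_pred = u_true + 0.1
residual = np.zeros(10)
loss = total_loss(u_pred, u_true, residual, lambda_eqn=0.05)
```

Increasing the weighting coefficient shifts the optimization pressure from fitting the data toward satisfying the governing equations, mirroring the trade-off studied in the ablation over the equation-loss weight.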
100×NMAE (R²); metric columns follow the physical metrics listed above.

| Test $Ra$ | $E_k$ | $u_{rms}$ | $\varepsilon$ | $\lambda$ | $Re_\lambda$ | $\tau_\eta$ | $\eta$ | $\ell$ | $T_\ell$ | avg. R² |
|---|---|---|---|---|---|---|---|---|---|---|
| | 1.480 (0.9962) | 2.444 (0.9872) | 0.881 (0.9988) | 31.319 (0.3788) | 0.422 (0.9998) | 1.947 (0.8005) | 1.564 (0.9490) | 32.070 (0.3881) | 0.388 (0.997) | 0.8328 |
| | 1.299 (0.9968) | 1.226 (0.9981) | 1.247 (0.9972) | 0.339 (0.9996) | 0.325 (0.9996) | 1.022 (0.9978) | 1.040 (0.9981) | 0.946 (0.9988) | 0.735 (0.9988) | 0.9983 |
| | 1.242 (0.9961) | 1.703 (0.9923) | 1.270 (0.9955) | 1.070 (0.99517) | 0.931 (0.9967) | 1.661 (0.9927) | 1.669 (0.9927) | 0.794 (0.9970) | 0.446 (0.9987) | 0.9952 |
| | 1.336 (0.9965) | 2.659 (0.9839) | 1.352 (0.9965) | 1.742 (0.9873) | 1.504 (0.9926) | 2.205 (0.9890) | 2.275 (0.9885) | 4.218 (0.9578) | 0.831 (0.9985) | 0.9878 |
| | 2.115 (0.9931) | 3.11 (0.9832) | 2.084 (0.9933) | 5.873 (0.9209) | 7.483 (0.8742) | 4.607 (0.9478) | 4.281 (0.9594) | 7.381 (0.9192) | 1.077 (0.9978) | 0.9543 |
In all the experiments, we use an Adam optimizer with learning rate of , and
regularization for the loss function, 3000 random samples per epoch, and train for 100 epochs.
In this part, we investigate the influence of the relative importance given to the Equation loss and the Prediction loss (see Sec. 4.3) on the performance of MeshfreeFlowNet. As presented in Eqn. 10, the total loss ($L$) comprises the Prediction loss ($L_{pred}$) and the Equation loss ($L_{eqn}$), where the Equation loss is weighted with a scaling coefficient $\lambda_{eqn}$. Accordingly, we study the influence of the hyperparameter $\lambda_{eqn}$ in the loss function given in Eqn. 10 on the accuracy of MeshfreeFlowNet. For this purpose, we consider a range of $\lambda_{eqn}$ values, as listed in Table 1. The values of the evaluation metrics for MeshfreeFlowNet trained with each of the $\lambda_{eqn}$ values are presented in Table 1 for a validation set. $\lambda_{eqn} = 0$ indicates a loss function that depends only on the Prediction loss, so the physical aspects (PDE imposed constraints) of the predicted high-resolution data are not accounted for. As presented in Table 1, the best performance on the validation set is achieved for an intermediate value of the Equation loss weighting coefficient. Throughout this work we refer to this optimum weighting coefficient as $\lambda^*$ and perform the training tasks using the loss function in Eqn. 10 with the Equation loss weighted by $\lambda^*$. As presented in Table 1, a model trained with $\lambda_{eqn} = 0$, which only focuses on the data (i.e., uses the Prediction loss only) and does not account for the physics and the PDE constraints (i.e., ignores the Equation loss), underperforms compared to the model trained with $\lambda^*$. On the other hand, models trained with a significant focus on the physical constraints only (i.e., large $\lambda_{eqn}$) underperform in super-resolving the low-resolution data. In general, a balance between the model's focus on the data and on the physical constraints leads to optimal super-resolution performance. In achieving that, in Eqn. 10, the Prediction loss ($L_{pred}$) captures the global structure of the data and the Equation loss ($L_{eqn}$) further guides the model in accurately reconstructing the local structure of the data.
In this section, we present a comparison of the performance of our proposed MeshfreeFlowNet framework for super-resolving low-resolution data against two baselines, namely a classic trilinear interpolation algorithm (Baseline (I)) and a deep learning based 3D U-Net model (Baseline (II)). Specifically, for the 3D U-Net model of Baseline (II), we use the same U-Net backbone as in our framework, with the difference that while MeshfreeFlowNet uses the 3D U-Net to generate a latent context grid, the baseline U-Net continues with 3D up-sampling and transpose-convolution operations up to the target high-resolution space (see Fig. 5). Table 2 presents the comparison of MeshfreeFlowNet with the baseline models. As shown in Table 2, Baseline (I) is a purely interpolation-based approach and fails to reconstruct the high-resolution data and resolve the fine-scale details, leading to large errors in the flow-based evaluation metrics that characterize the flow dynamics. The extremely large normalized mean absolute error (NMAE) of the velocity calculated for the fine-scale solution found by Baseline (I) indicates that a merely interpolative scheme cannot accurately reconstruct the fine-scale local dynamics (e.g., the flow velocities). Moreover, the NMAE of the total kinetic energy ($E_k$), as a global characterizing parameter of the turbulent dynamics, is also very large for the high-resolution solution by Baseline (I). On the other hand, the deep learning based Baseline (II), which utilizes the 3D U-Net model to directly map the low-resolution data to the high-resolution space, achieves better performance compared to Baseline (I). However, as presented in Table 2, our model performs significantly better than Baselines (I) and (II). The specification of $\lambda_{eqn}$ in Table 2 refers to the weighting coefficient of the Equation loss component in the loss function in Eqn. 10.
In Table 2, $\lambda_{eqn} = 0$ indicates that only the prediction loss has been considered for training the model, whereas $\lambda^*$ refers to the optimum value of the Equation loss weighting coefficient from the ablation study (see Table 1, Sec. 5.1).
We further evaluate the robustness of MeshfreeFlowNet for resolution enhancement of low-resolution datasets that have physical initial and boundary conditions different from the datasets the model has been trained on. We refer to such initial and boundary conditions as unseen initial/boundary conditions. In order to investigate the generalizability of MeshfreeFlowNet on unseen initial and boundary conditions, we study the effect of each condition separately in the following setups.
We investigate the robustness of a trained MeshfreeFlowNet for enhancing the resolution of low-resolution unseen data with physical initial conditions different from those of the training datasets. Table 3 shows the performance of MeshfreeFlowNet on a dataset with unseen initial conditions. The first row of Table 3 shows the values of the evaluation metrics for unseen test data when MeshfreeFlowNet is trained on only one dataset, whereas the second row shows the values of the evaluation metrics when MeshfreeFlowNet is trained on 10 datasets, each with a different initial condition. As can be observed from the results, the performance of MeshfreeFlowNet on unseen cases can be improved by training on a more diverse set of initial conditions.
We further investigate the robustness of a trained MeshfreeFlowNet for super-resolving an unseen low-resolution dataset with physical boundary conditions different from those of the training datasets. Since using more datasets was shown in Sec. 5.3.1 to improve super-resolution performance, here we use a dataset that comprises 10 different sets of boundary conditions (i.e., different Rayleigh numbers, corresponding to Reynolds numbers of up to 10,000). Table 4 shows the performance of such a trained MeshfreeFlowNet on 5 different test datasets, each with a different Rayleigh number boundary condition. In Table 4, MeshfreeFlowNet's performance is evaluated for a Rayleigh number within the range of boundary conditions of the training sets, for Rayleigh numbers slightly below and above that range, and for Rayleigh numbers far below and above that range. As presented in Table 4, MeshfreeFlowNet achieves good performance not only on boundary conditions within the range of Rayleigh numbers it has been trained on, but also on unseen boundary conditions far outside that range. This illustrates that a trained model can generalize well to unseen boundary conditions.
Last but not least, we study the scalability of the model to assess its applicability to larger problems that require orders-of-magnitude more compute. The scaling results are presented in Fig. 7. Fig. 6(a) demonstrates that near-ideal scaling performance can be achieved for the model on up to 128 GPU workers, reaching approximately 96.80% scaling efficiency. In Fig. 6(b), we show the convergence of the model training loss with respect to the number of epochs. In Fig. 6(c), we show the loss convergence with respect to total wall time, where an increasing number of GPU workers leads to a drastic decrease in total training time. As the models reach similar loss levels after 100 epochs, yet the training throughput scales almost linearly with the number of GPUs, we observe close-to-ideal scaling of convergence speed on up to 16 GPUs. A small anomaly in the loss curve for 128 GPUs, where the loss does not decrease to the same level after 100 epochs, suggests that a very large effective batch size can diminish scalability with respect to model convergence. This has been observed in numerous machine learning scaling studies and requires further investigation within the community.
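Throughput-based scaling efficiency of this kind can be computed as below; the measured samples-per-second figures are hypothetical, chosen only to illustrate the calculation:

```python
def scaling_efficiency(throughput, base_workers=1):
    # Efficiency = actual speedup over the base run, divided by the
    # ideal linear speedup implied by the worker count.
    base = throughput[base_workers]
    return {w: (t / base) / (w / base_workers)
            for w, t in throughput.items()}

# Hypothetical training throughput (samples/sec) at each GPU count.
measured = {1: 100.0, 16: 1550.0, 128: 12390.0}
print(scaling_efficiency(measured))
```

An efficiency close to 1.0 at 128 workers corresponds to the near-linear throughput scaling described above.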
In this work, we presented, for the first time, MeshfreeFlowNet, a physics-constrained super-resolution framework that produces continuous super-resolution outputs, allows imposing arbitrary combinations of PDE constraints, and can be evaluated on arbitrarily sized spatio-temporal domains owing to its fully convolutional nature. We further demonstrated that MeshfreeFlowNet can recover a wide range of important physical flow quantities (e.g., Turbulent Kinetic Energy and the Kolmogorov time and length scales) by super-resolving turbulent flows significantly more accurately than traditional (trilinear interpolation) and deep learning based (3D U-Net) baselines. We further illustrated the scalability of MeshfreeFlowNet on a large cluster of GPU nodes with a high-speed interconnect, demonstrating its applicability to problems that require orders of magnitude more computational resources.
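The Kolmogorov scales mentioned above follow the standard definitions from the kinematic viscosity nu and the mean dissipation rate eps; the numerical values in the example are illustrative only:

```python
def kolmogorov_scales(nu, eps):
    # Standard Kolmogorov microscales:
    #   length   eta = (nu^3 / eps)^(1/4)
    #   time     tau = (nu / eps)^(1/2)
    #   velocity vel = (nu * eps)^(1/4)
    eta = (nu ** 3 / eps) ** 0.25
    tau = (nu / eps) ** 0.5
    vel = (nu * eps) ** 0.25
    return eta, tau, vel

# Example with a water-like viscosity (illustrative values).
eta, tau, vel = kolmogorov_scales(nu=1e-6, eps=1e-4)
```

By construction, the Reynolds number at the Kolmogorov scale, eta * vel / nu, is unity, which marks the scale at which viscous dissipation dominates.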
Future work includes exploring the applicability of MeshfreeFlowNet to other physical applications beyond 2D Rayleigh-Benard convection. One interesting direction is to explore the use of 4D spatio-temporal convolution operators, such as those proposed in [choy20194d], to extend this framework to 4D space-time simulations. That would open up a wide range of applications in turbulence modeling, where statistical priors between the low-resolution simulation and subgrid-scale physics can be learned from Direct Numerical Simulations (DNS). The scalability of the model on HPC clusters will be critical for learning from such large-scale datasets, where single-node training would be prohibitively slow. The fully convolutional nature of the framework, along with its demonstrated scalability, makes it well poised for such challenges. Moreover, owing to the generalizability of our PDE-constrained framework, it would be interesting to apply it to applications beyond turbulent flows.
This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The research was performed at the Lawrence Berkeley National Laboratory for the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. K. Kashinath is supported by the Intel Big Data Center at NERSC. K. Azizzadenesheli gratefully acknowledges the financial support of Raytheon and Amazon Web Services. A. Anandkumar is supported in part by the Bren endowed chair, DARPA PAI (HR00111890035) and LwLL grants, Raytheon, Microsoft, Google, and Adobe faculty fellowships. We also acknowledge the Industrial Consortium on Reservoir Simulation Research at Stanford University (SUPRI-B).