Neural Implicit Flow (NIF)
High-dimensional spatio-temporal dynamics can often be encoded in a low-dimensional subspace. Engineering applications for modeling, characterization, design, and control of such large-scale systems often rely on dimensionality reduction to make solutions computationally tractable in real-time. Common existing paradigms for dimensionality reduction include linear methods, such as the singular value decomposition (SVD), and nonlinear methods, such as variants of convolutional autoencoders (CAE). However, these encoding techniques lack the ability to efficiently represent the complexity associated with spatio-temporal data, which often requires variable geometry, non-uniform grid resolution, adaptive meshing, and/or parametric dependencies. To resolve these practical engineering challenges, we propose a general framework called Neural Implicit Flow (NIF) that enables a mesh-agnostic, low-rank representation of large-scale, parametric, spatial-temporal data. NIF consists of two modified multilayer perceptrons (MLPs): (i) ShapeNet, which isolates and represents the spatial complexity, and (ii) ParameterNet, which accounts for any other input complexity, including parametric dependencies, time, and sensor measurements. We demonstrate the utility of NIF for parametric surrogate modeling, enabling the interpretable representation and compression of complex spatio-temporal dynamics, efficient many-spatial-query tasks, and improved generalization performance for sparse reconstruction.
Machine learning and artificial intelligence algorithms have broadly transformed science and engineering, including the application areas of computer vision (Krizhevsky et al., 2012), natural language processing
(Sutskever et al., 2014), molecular dynamics (Zhang et al., 2018; Mardt et al., 2018), and dynamical systems (Brunton and Kutz, 2019). The subfield of scientific machine learning, which often focuses on modeling, characterization, design, and control of large-scale, physics-based models, has also experienced significant growth. Despite the achievements in scientific machine learning (Duraisamy et al., 2019; Karniadakis et al., 2021; Kutz, 2017; Brunton et al., 2020), there remain significant challenges in the representation of high-dimensional spatio-temporal dynamics, which are often modeled by nonlinear partial differential equations (PDEs). Data-driven modeling of PDE systems often relies on a more advantageous representation of the physics. In general, manifold-based methods are the dominant paradigm (Hesthaven et al., 2016; Carlberg et al., 2011; Peherstorfer and Willcox, 2016; Zahr and Farhat, 2015; Benner et al., 2015). However, there are recent innovations in developing mesh-based methods (Long et al., 2018; Zhu and Zabaras, 2018; Geneva and Zabaras, 2020; Bar-Sinai et al., 2019; Li et al., 2020a; Pfaff et al., 2020) and mesh-agnostic methods (Raissi et al., 2020; Lu et al., 2021a, b; Sun et al., 2020). Despite the diversity of algorithmic innovations, each faces challenges in efficiently or accurately representing complex spatio-temporal dynamics. Indeed, practical engineering applications require handling variable geometry, non-uniform grid resolutions, adaptive meshes, and/or parametric dependencies. We advocate a general mathematical framework called Neural Implicit Flow (NIF) that enables a mesh-agnostic, low-rank representation of large-scale, parametric, spatial-temporal data. NIF leverages a hypernetwork structure that allows one to isolate the spatial complexity, accounting for all other complexity in a second network where parametric dependencies, time, and sensor measurements are encoded and used to modulate the spatial layers.
We show that NIF is highly advantageous for representing spatio-temporal dynamics in comparison with current methods.

Spatio-temporal data is ubiquitous. As such, a diversity of methods have been developed to characterize the underlying physics. In manifold-based modeling, which is the most dominant paradigm for reduced-order modeling, one first extracts a low-dimensional manifold from the solution of a PDE, typically using the singular value decomposition (Benner et al., 2015; Noack et al., 2003; Rowley et al., 2004) or a convolutional autoencoder (CAE) (Brunton and Kutz, 2019; Holmes et al., 2012; Mohan et al., 2019; Murata et al., 2020; Xu and Duraisamy, 2020; Lee and You, 2019; Ahmed et al., 2021b). Then one either directly solves the projected governing equation on the manifold (Carlberg et al., 2011, 2013)
or learns the projected dynamics from the data on the manifold with, for example, a Long Short-Term Memory (LSTM) network (Mohan and Gaitonde, 2018), an artificial neural network (Pan and Duraisamy, 2018b; San et al., 2019; Lui and Wolf, 2019), or polynomials (Qian et al., 2020; Brunton et al., 2016; Peherstorfer and Willcox, 2016). Other variants include jointly learning the dynamics together with the manifold (Champion et al., 2019; Kalia et al., 2021; Lusch et al., 2018; Takeishi et al., 2017; Yeung et al., 2019; Otto and Rowley, 2019; Pan and Duraisamy, 2020; Lange et al., 2021; Mardt et al., 2020), and closure modeling to account for non-Markovian effects (Pan and Duraisamy, 2018a; Wang et al., 2020; Maulik et al., 2020; Ahmed et al., 2021a). Manifold-based approaches first reduce the prohibitively large spatial degrees of freedom of, e.g., fluid dynamics simulations into a moderate number by learning a low-dimensional representation on which we project and solve the PDE, thus inheriting the physics (Carlberg et al., 2011), or simply performing time-series modeling on the reduced manifold (Xu and Duraisamy, 2020) or modal expansion (Taira et al., 2017). Equivalently, this preprocessing step can be viewed as learning a time-dependent, vector-valued, low-dimensional representation of a spatio-temporal field. More broadly, learning an effective low-dimensional representation is often domain specific, for example, using a real-valued matrix with RGB channels for representing images in computer vision
(Szeliski, 2010), Word2Vec for representing words in natural language processing (NLP) (Mikolov et al., 2013), and spectrograms for representing audio signals in speech recognition (Flanagan, 2013), among others.

Manifold-based methods have several drawbacks when applied to realistic spatio-temporal data. Specifically, practical engineering applications require handling variable geometry, non-uniform grid resolutions, adaptive meshes, and/or parametric dependencies. This includes data from incompressible flow in unbounded domains (Yu et al., 2022), combustion (Bell and Day, 2011), astrophysics (Almgren et al., 2010), multiphase flows (Sussman, 2005), and fluid-structure interactions (Bhalla et al., 2013), which are generated with advanced meshing techniques such as adaptive mesh refinement (AMR) (Berger and Oliger, 1984) and/or overset grids (Chan, 2009). Such meshes typically change with time or parameters (e.g., the dependence of shock location on Mach number) in order to efficiently capture multi-scale phenomena (for some systems (Bryan et al., 2014), AMR is the only way to make such simulations possible), which violates the fixed-mesh requirement of common SVD approaches. CNNs, meanwhile, require preprocessing the flowfield as an image, i.e., a voxel representation on a uniform Cartesian grid, in order to perform discrete convolution; this is affordable in 2D but becomes increasingly expensive in 3D (Park et al., 2019) due to cubic memory requirements. The memory footprint severely limits the attainable resolution (Mescheder et al., 2019) unless one resorts to an optimized parallel implementation of 3D CNNs on clusters (Mathuriya et al., 2018).
Additionally, such uniform processing is inherently inconsistent with the nonuniform multi-scale nature of PDEs and can decrease the prediction accuracy of downstream tasks, e.g., failing to accurately recover total lift/drag from flowfield predictions (Bhatnagar et al., 2019) or efficiently capture small-scale geometry variations that can trigger critical physics phenomena. This leads us to pursue a mesh-agnostic and expressive paradigm beyond SVD and CAE for dimensionality reduction of spatio-temporal data.
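The cubic memory scaling mentioned above is easy to make concrete. The following back-of-envelope sketch (grid sizes and the single-channel float32 assumption are illustrative, not the cited works' exact settings) shows why uniform voxel grids become prohibitive in 3D:

```python
# Back-of-envelope memory cost of a uniform voxel grid (float32, single
# channel), illustrating why 3D CNN preprocessing becomes expensive:
# memory grows cubically with resolution.
def voxel_grid_bytes(n: int, channels: int = 1, bytes_per_value: int = 4) -> int:
    """Memory needed to store an n x n x n voxel grid."""
    return n ** 3 * channels * bytes_per_value

# Doubling the resolution multiplies memory by 8.
assert voxel_grid_bytes(256) == 8 * voxel_grid_bytes(128)

# A 1024^3 single-precision scalar field already takes 4 GiB.
print(voxel_grid_bytes(1024) / 2 ** 30)  # grid memory in GiB
```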
Recent advances in mesh-based methods (in this paper, “mesh-based” is synonymous with “graph-based”; a key characteristic of graph-based methods is that the online computational cost scales with the size of the graph, i.e., the number of grid points (Chen et al., 2021), whereas manifold-based methods only access the mesh during postprocessing) have shown promising results, either with discretization-invariant operator learning (Li et al., 2020a, c, b) or with meshes based on the numerical solver (Pfaff et al., 2020; Sanchez-Gonzalez et al., 2020; Xu et al., 2021). On the other hand, mesh-agnostic frameworks (e.g., the physics-informed neural network (PINN) (Raissi et al., 2020)) have been successfully applied to solving canonical PDEs on problems where mesh-based algorithms can be cumbersome due to, for instance, arbitrary geometry (Berg and Nyström, 2018; Sun et al., 2020) and/or high dimensionality (Sirignano and Spiliopoulos, 2018). In addition to leveraging the PDE governing equation, the PINN successfully leverages the structure of multilayer perceptrons (MLPs) with the coordinate as input and the solution field as output. It can be viewed as using an adaptive global basis in space-time to approximate the solution with the known PDE, instead of the traditional local polynomial basis inside a cell, e.g., as in finite element analysis (Hughes, 2012). Concurrently, such an MLP structure has also been employed in the computer graphics community for learning 3D shape representations (Park et al., 2019; Mescheder et al., 2019) and scenes (Mildenhall et al., 2020; Sitzmann et al., 2020; Tancik et al., 2020). A closely related work called MeshfreeFlowNet (MFN) (Esmaeilzadeh et al., 2020)
uses a CNN and a coordinate-based MLP to perform super-resolution for fluid problems. MFN first uses a 2D CNN to extract features from the coarse-scale spatio-temporal field. The extracted features are then concatenated with the space-time coordinate as an augmented input to an MLP, which outputs the high-resolution measurement at that coordinate.

Motivated by the above seminal works, we introduce a mesh-agnostic representation learning paradigm called neural implicit flow (NIF) that exploits the expressiveness and flexibility of MLPs for dimensionality reduction of parametric spatio-temporal fields. In section 2, we present the NIF paradigm and several derived problem-specific frameworks. In section 3, we demonstrate the following capabilities of NIF:
To our knowledge, NIF enables the first scalable 3D nonlinear dimensionality reduction for spatio-temporal datasets defined on arbitrarily different meshes. Examples include a three-dimensional fully turbulent flow with over 2 million cells (see section 3.3).
NIF also enables modal analysis of spatial-temporal dynamics on adaptive meshes. As an example, we explore dynamic mode decomposition on the fluid flow past a cylinder with adaptive mesh refinement (see section 3.5).
NIF also provides a performance improvement in the following applications on several canonical spatio-temporal dynamics:
NIF generalizes 40% better in terms of root-mean-square error (RMSE) than a generic mesh-agnostic MLP for mesh-agnostic surrogate modeling of the parametric Kuramoto–Sivashinsky PDE, given the same amount of training data or the same number of trainable parameters (see section 3.1).
NIF outperforms both the (linear) SVD and the (nonlinear) CAE for nonlinear dimensionality reduction, as demonstrated on the Rayleigh-Taylor instability on an adaptive mesh, with error reductions ranging from 50% up to a factor of 10 (see section 3.2).
Compared with the original implicit neural representation, which feeds all inputs (including time) into a single feedforward network (Sitzmann et al., 2020; Lu et al., 2021c), NIF enables efficient spatial sampling with 30% less CPU time and around 26% less memory consumption at the same level of accuracy when learning the aforementioned turbulence dataset (see section 3.4).
NIF outperforms the state-of-the-art method (POD-QDEIM (Drmac and Gugercin, 2016)) in the task of data-driven sparse sensing with 34% smaller testing error on the sea surface temperature dataset. (see section 3.6)
Finally, conclusions and future perspectives of NIF are presented in section 4.
We begin by considering 3D spatio-temporal data with varying time/parameters, namely $u(x, t, \mu)$, with spatial coordinate $x \in \Omega \subseteq \mathbb{R}^3$, time $t$, and parameters $\mu$ (e.g., Reynolds number). Without loss of generality, consider a supervised learning problem: using an $l$-layer MLP with spatial coordinate $x$ as input to fit a single spatial realization at an arbitrary time $t'$ and parameter $\mu'$, i.e., $u(x, t', \mu')$. An $l$-layer MLP with $x$ as input is a vector-valued function $\hat{u}(x)$ composed of affine maps $W_i \cdot + b_i$ followed by activations, with hidden width $w$. The first layer weight $W_1$ has size $w \times 3$, each remaining hidden layer weight $W_i$ has size $w \times w$, and the last layer weight $W_l$ has size $o \times w$ for output dimension $o$. Biases are denoted as $b_i$, where the subscript $i$ denotes the index of the layer. The sets of weights and biases are defined as $\mathcal{W} = \{W_i\}_{i=1}^{l}$ and $\mathcal{B} = \{b_i\}_{i=1}^{l}$, correspondingly. The activation is a non-linear continuous function. One can use gradient-based optimization to find the weights and biases that minimize the discrepancy between $\hat{u}(x)$ and the single snapshot data at $t'$ and $\mu'$:

$$\min_{\mathcal{W}, \mathcal{B}} \int_{\Omega} \mathcal{L}\big(\hat{u}(x),\, u(x, t', \mu')\big)\, d\nu_x \qquad (1)$$
where $\mathcal{L}$ is a loss function (e.g., mean squared error, mean absolute error) and $\nu_x$ is some spatial measure, which can depend on $t$ and $\mu$. This minimization leads to a key observation: a well-trained set of weights and biases of the MLP fully determines a spatial field that approximates the target field, in a mesh-agnostic sense. Such a trained MLP (what we call ShapeNet) is closely related to the so-called neural implicit representation (Sitzmann et al., 2020); works in computer graphics (Park et al., 2019; Mescheder et al., 2019) used such an MLP to fit signed distance functions (SDFs) of a desired surface, which is implicitly defined by the zero iso-surface of the SDF. To explicitly correlate the corresponding spatial fields with external factors of interest, such as time, parameters, and sparse sensor measurements, we use a second MLP (denoted as ParameterNet) to learn mappings from these factors to the weights $\mathcal{W}$ and biases $\mathcal{B}$. Note that the typical output dimension of ParameterNet, i.e., the total number of scalars in $\mathcal{W}$ and $\mathcal{B}$, ranges from thousands to tens of thousands. Dimensionality reduction assumes the existence of a rank-$r$ subspace that can approximate the spatio-temporal data. The reduced coordinates are denoted as $z \in \mathbb{R}^r$. One can simply consider a linear output layer in ParameterNet after a bottleneck layer of width $r$. Once $z$ is determined, $\mathcal{W}$ and $\mathcal{B}$ in the ShapeNet are completely determined, which in turn determines the spatial “flow” field conditioned on the time or, more generally, any other external parameters. As a result, the bottleneck layer in fig. 1 can be viewed as an $r$-dimensional latent representation, similar to the bottleneck layer in generic autoencoders (Goodfellow et al., 2016) (note, however, that we have not yet introduced an “encoder” in fig. 1; we will introduce an encoder based on sparse sensors in section 3.2).
In summary, we can find a mesh-agnostic representation of parametric spatio-temporal fields as in the aforementioned manifold-based methods, where spatial complexity, e.g., coherent structures, is explicitly decoupled from temporal and parametric complexity, e.g., chaos and bifurcations.
Note that any neural network that generates the weights and biases of another neural network is generally called a hypernetwork (Ha et al., 2016). This concept has been applied in image (Sitzmann et al., 2020) and language modeling (Ha et al., 2016), and its inspiration dates back to fast-weight memory control (Schmidhuber, 1992) in the early 90's. Therefore, NIF can be viewed as a special family of hypernetworks in which the MLP takes only spatial input, while all other factors, e.g., time $t$ and parameter $\mu$, are fed into the hypernetwork. This implies that dimensionality reduction is performed only on the spatial complexity, which is the biggest cause of computational and storage challenges for non-linear PDEs. Thus, spatial complexity is decoupled from the other factors.
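As a minimal illustration of this hypernetwork structure, the sketch below uses made-up layer sizes and random, untrained weights (it is not the paper's architecture or its TensorFlow implementation): ParameterNet maps $(t, \mu)$ through a narrow bottleneck and a linear last layer to the flattened weights and biases of ShapeNet, which then evaluates the field at arbitrary spatial query points.

```python
import numpy as np

# Illustrative NIF-style hypernetwork: ParameterNet produces the parameters
# of ShapeNet, which maps a spatial coordinate x to the field value.
rng = np.random.default_rng(0)

# ShapeNet: 1 -> w -> 1 MLP; its parameters are *produced*, not trained here.
w = 8
n_shape_params = (1 * w + w) + (w * 1 + 1)  # layer-1 W,b plus layer-2 W,b

# ParameterNet: 2 -> r -> n_shape_params, with a linear last layer so the
# bottleneck of width r fixes the rank of the representation.
r = 2
P1 = rng.normal(size=(2, r)); c1 = np.zeros(r)
P2 = rng.normal(size=(r, n_shape_params)); c2 = np.zeros(n_shape_params)

def parameter_net(t_mu):
    z = np.tanh(t_mu @ P1 + c1)        # r-dimensional latent code
    return z @ P2 + c2                 # flattened ShapeNet weights and biases

def shape_net(x, theta):
    W1 = theta[:w].reshape(1, w); o = w
    b1 = theta[o:o + w]; o += w
    W2 = theta[o:o + w].reshape(w, 1); o += w
    b2 = theta[o:o + 1]
    return np.sin(x @ W1 + b1) @ W2 + b2   # sine activation, as in SIREN

theta = parameter_net(np.array([0.5, 1.0]))            # one (t, mu) pair
u = shape_net(np.linspace(0, 1, 100).reshape(-1, 1), theta)
assert u.shape == (100, 1)  # one field value per spatial query point
```

Because the spatial queries are just rows of inputs, the same generated `theta` can be evaluated on any set of points, which is the mesh-agnostic property the text describes.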
A comparison of the NIF architecture with other recent frameworks of machine learning for PDEs is shown in fig. 2. DeepONet (Lu et al., 2021a; Wang et al., 2021a) pioneered the learning of general operators associated with PDEs from data. It has been a building block of a DeepM&MNet (Cai et al., 2021), which has been applied to data assimilation in electroconvection (Cai et al., 2021) and hypersonic flows (Mao et al., 2021). Interestingly, the structure of DeepONet can be viewed as that of NIF with only the last linear layer of the ShapeNet determined by the so-called trunk net. To highlight the expressiveness of NIF, we compare a tiny NIF with only 51 parameters and a moderate DeepONet with 3003 parameters on a 1D modulated traveling sine wave. Figure 3 shows no visual difference between the ground truth and NIF.
Alternatively, neural operators (Li et al., 2020b, c, a; Kovachki et al., 2021) are discretization-independent, mesh-based frameworks for learning solution operators of PDEs, going beyond the straightforward CNN approach on a fixed grid (Zhu and Zabaras, 2018; Long et al., 2018). Neural operators have been successful in recovering Green's functions, learning chaotic PDE systems (Li et al., 2021a), and leveraging known PDEs such as the Navier-Stokes equations (Li et al., 2021b). By using graph neural networks, they inherit translational and rotational invariance, as opposed to NIF and DeepONet, which might be advantageous in the small-data regime.
Distinct from the above pioneering works aimed at learning solution operators of PDEs, NIF emphasizes a scalable nonlinear dimensionality reduction paradigm that outperforms SVD and CAE in dealing with complex parametric spatio-temporal data (e.g., turbulence) on arbitrary mesh structures. In the following subsections, we develop problem-specific, NIF-based frameworks for a variety of learning tasks.
Consider a class of non-linear parametric PDEs, $\partial u / \partial t = f(u; \mu)$, where $u(x, t; \mu)$ is the state and $f$ is a non-linear function or operator in general. An example of data-fit parametric surrogate modeling (Benner et al., 2015; Sobieszczanski-Sobieski and Haftka, 1997; Frangos et al., 2010; Amsallem et al., 2013; Bhatnagar et al., 2019) is to find an approximate relation between the parameter $\mu$ and the corresponding PDE solution under fixed initial and boundary conditions. After the data-fit surrogate model is trained, it is expected to make predictions for an unseen input, possibly in real-time (note that our major goal in the following PDE examples is to demonstrate NIF for dimensionality reduction, not for learning solution operators of PDEs as advocated in other neural network architectures; we present the following example to show the capability of efficient surrogate modeling where the computational cost is independent of mesh size, whereas existing graph-based frameworks still suffer from poor scaling of computational cost with mesh size). An illustration of the method is shown in fig. 4. This is attractive for many engineering tasks that require many-query analyses, such as optimization, uncertainty quantification, and control. In contrast to more physics-based models (Benner et al., 2015), a data-fit model simplifies surrogate modeling for PDEs to a high-dimensional regression without access to prior knowledge. Despite its many disadvantages, including large sample complexity and lack of interpretability, it is the simplest and most widely used type of surrogate model (Qian et al., 2020; Loiseau et al., 2021).
As illustrated in fig. 4, we apply NIF to the above parametric surrogate modeling by simply allowing the weights and biases of ShapeNet to depend on time $t$ and parameters $\mu$ through the ParameterNet,
$$\mathrm{vec}(\mathcal{W}, \mathcal{B}) = g_{\Theta}(t, \mu) \qquad (2)$$
where $g_{\Theta}$ is an MLP with its own weights and biases denoted as $\Theta$, vec is the matrix vectorization, and the output dimension of $g_{\Theta}$ is the total number of unknown parameters in $\mathcal{W}$ and $\mathcal{B}$. Again, it is important to note that the width of the layer before the last linear layer of $g_{\Theta}$ controls the total number of parameters we need to optimize, since the last linear layer contributes most of $\Theta$. We set this bottleneck width to $r$, which equals the input dimension of ParameterNet; hence, the rank of the resulting representation is $r$. We denote the above dependency of weights and biases simply as $\mathcal{W}(t, \mu)$ and $\mathcal{B}(t, \mu)$. By considering eq. 2 and extending eq. 1 from a single $(t', \mu')$ to all $(t, \mu)$, we arrive at the following minimization formulation,
$$\min_{\Theta} \int \mathcal{L}\big(\hat{u}(x;\, \mathcal{W}(t, \mu), \mathcal{B}(t, \mu)),\, u(x, t, \mu)\big)\, d\nu \qquad (3)$$
where $\nu$ now becomes a measure in $x$, $t$, and $\mu$. A natural choice of $\nu$ to cover the domain of interest is the empirical distribution based on the numerical discretization of $x$, $t$, and $\mu$ where the data comes from. For example, a tensor product of discretizations independently in $x$, $t$, and $\mu$ leads to $\nu = \frac{1}{MNP}\sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{k=1}^{P} \delta(x - x_i)\,\delta(t - t_j)\,\delta(\mu - \mu_k)$, where $\delta$ denotes the Dirac measure. Here the total number of spatial mesh points is $M$, the number of snapshots in time is $N$, and the number of parameters selected is $P$. Once a proper $\nu$ is chosen, eq. 3 can be solved via a gradient-based method, e.g., the Adam optimizer (Kingma and Ba, 2014). It should be noted that eq. 3 permits using any kind of proper empirical measure to cover $x$, $t$, and $\mu$. As shown in fig. 1, this can be especially advantageous for problems where an efficient adaptive mesh (e.g., AMR, Overset), moving mesh (e.g., fluid-structure interaction), or simply parameter-dependent mesh (e.g., varying discontinuity location with parameters) is adopted. This is a distinct feature that makes NIF different from previous pipelines of dimensionality reduction (Xu and Duraisamy, 2020; Mohan et al., 2019) with CAE (i.e., a homogeneous isotropic spatial measure) and SVD (i.e., a tensor product measure structure between $x$ and $(t, \mu)$). In this paper, we take $\mathcal{L}$ as the mean square error and write the objective in its most general form (taking the example of the tensor product measure, the total number of training data points is the product of the resolutions in each dimension, i.e., $MNP$; in practice this is typically larger than millions). Thus eq. 3 becomes the standard least-squares regression,
$$\min_{\Theta} \frac{1}{MNP} \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{P} \big\| \hat{u}(x_i;\, \mathcal{W}(t_j, \mu_k), \mathcal{B}(t_j, \mu_k)) - u(x_i, t_j, \mu_k) \big\|_2^2 \qquad (4)$$
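The least-squares regression over pointwise samples described above is what makes the training mesh-agnostic: every sample is an independent tuple $(x, t, \mu, u)$, so snapshots taken on completely different meshes can be pooled into one flat dataset. A minimal sketch (with synthetic stand-in data and a trivial predictor, purely for illustration):

```python
import numpy as np

# Assemble a pointwise training set: each row is (x, t, mu, u), so snapshots
# on completely different meshes can be pooled into one flat regression set.
rng = np.random.default_rng(1)

rows = []
for t in [0.0, 0.1]:
    for mu in [0.2, 0.28]:
        # Each (t, mu) snapshot may come with its own mesh of arbitrary size.
        x = np.sort(rng.uniform(0, 1, size=rng.integers(50, 200)))
        u = np.sin(2 * np.pi * (x - t)) * mu        # stand-in field data
        rows.append(np.column_stack([x, np.full_like(x, t),
                                     np.full_like(x, mu), u]))

data = np.concatenate(rows)          # shape: (total points, 4)

def mse(pred, target):
    return np.mean((pred - target) ** 2)

# A (deliberately useless) constant predictor, just to show how the loss is
# evaluated over the flat dataset regardless of which mesh each point came from.
loss = mse(np.zeros(len(data)), data[:, 3])
assert loss >= 0.0
```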
As described in section 2, ShapeNet is a general MLP with the activation function still to be determined. Given the universal approximation theorem (Hornik et al., 1989), the choice seems initially arbitrary. However, there is a significant and often overlooked difference between what an MLP can approximate and what an MLP can efficiently learn with gradient-based algorithms. Standard MLPs with common activation functions, such as ReLU, tanh, swish, and sigmoid, have recently been observed and proven
(Tancik et al., 2020) to suffer from extreme inefficiency in learning high-frequency functions, even with increased network complexity. Indeed, the failure of standard MLPs on high-frequency datasets is a well-known phenomenon called spectral bias (Rahaman et al., 2019) or the F-principle (Xu et al., 2019). Recall that in fig. 1, ShapeNet critically relies on an MLP with input $x$ approximating the “shape” of the spatial data, which can be problematic in the case of fluid flows with high-wavenumber content, e.g., eddies and hairpin vortices. Note that this issue is also relevant to PINNs (Wang et al., 2021b), which might explain the challenges of using data-free PINNs to solve the Navier-Stokes equations in fully turbulent regimes.

Recent advances in computer graphics propose remedies that use Fourier features (Tancik et al., 2020) or $\omega_0$-scaled sine activation functions (the activation function becomes $\sin(\omega_0 x)$; we fix $\omega_0$ throughout this work) in an MLP, called SIREN (Sitzmann et al., 2020), to learn high-frequency content. The former Fourier feature approach has recently been introduced to the PINN community by Wang et al. (2021b), with the need to tune the length scale of the Fourier features for each dataset. Here we take the latter approach, since we empirically found SIREN to be much less sensitive to network hyper-parameters than the original Fourier feature approach (Tancik et al., 2020). Thus, as shown in fig. 5, we design ShapeNet using SIREN with a ResNet-like structure. However, implementing such an $\omega_0$-scaled sine activation function requires a special type of initialization (Sitzmann et al., 2020) of both the MLP parameters and a uniform normalization of the dataset. Implementation details are documented in appendix D. With SIREN for the ShapeNet, we can feed time $t$ into ParameterNet to learn a compressed representation of spatio-temporal data in a mesh-agnostic way, as illustrated in fig. 6.
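A sketch of the $\omega_0$-scaled sine layer and its matching initialization, following the scheme of Sitzmann et al. (2020); the layer sizes and $\omega_0 = 30$ (SIREN's commonly used default) are illustrative assumptions, not this paper's tuned settings:

```python
import numpy as np

# Omega_0-scaled sine layers with the SIREN initialization of Sitzmann et al.
# (2020): first layer U(-1/n, 1/n), hidden layers U(-sqrt(6/n)/w0, sqrt(6/n)/w0).
# Without this scheme, sine MLPs are notoriously hard to train.
rng = np.random.default_rng(2)

def siren_layer_weights(fan_in, fan_out, omega_0=30.0, first=False):
    if first:
        bound = 1.0 / fan_in                     # first layer: U(-1/n, 1/n)
    else:
        bound = np.sqrt(6.0 / fan_in) / omega_0  # hidden layers
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def siren_forward(x, weights, omega_0=30.0):
    h = x
    for W in weights[:-1]:
        h = np.sin(omega_0 * (h @ W))            # omega_0-scaled sine activation
    return h @ weights[-1]                       # linear last layer

ws = [siren_layer_weights(1, 64, first=True),
      siren_layer_weights(64, 64),
      siren_layer_weights(64, 1)]
y = siren_forward(np.linspace(-1, 1, 10).reshape(-1, 1), ws)
assert y.shape == (10, 1)
```

The narrow hidden-layer bound compensates for the $\omega_0$ factor inside the sine, keeping pre-activations well-scaled at initialization.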
Furthermore, instead of feeding time into ParameterNet, we can build an encoder that uses sensor measurements from only a few locations as the input to ParameterNet. Such applications are demonstrated in sections 3.2 and 3.3.
The proper orthogonal decomposition (POD) was introduced to the fluid dynamics community by Lumley (1967) in order to give a mathematical definition of “large eddy” by applying a Karhunen-Loeve expansion (Loeve, 1955) to turbulent velocity fields. Formally, the first $r$ POD modes are determined by the first $r$ eigenfunctions of an integral operator with the cross-correlation kernel of the turbulent velocity field (George, 1988). In practice, the SVD is often used to obtain a discrete realization of the POD modes (Brunton and Kutz, 2019). Such POD modes not only span an $r$-dimensional linear subspace that optimally minimizes the projection error (as shown in eq. 5) but also provide a sequence of ordered basis functions weighted by their singular values (Djouadi, 2008).
$$\min_{\{\phi_i\}_{i=1}^{r}} \int \Big\| u(x, t) - \sum_{i=1}^{r} \big\langle u(\cdot, t),\, \phi_i \big\rangle\, \phi_i(x) \Big\|^2\, dt \qquad (5)$$
subject to the orthonormality constraints $\langle \phi_i, \phi_j \rangle = \delta_{ij}$ for $1 \le i, j \le r$, where $\delta_{ij}$ is the Kronecker delta. POD via SVD relies on a fixed mesh in order to provide a closed-form discrete approximation of the POD modes. If the mesh changes with time and/or parameters, which is the case for many problems of interest (Teyssier, 2002; Bryan et al., 2014; Vay et al., 2004), then SVD-based approaches are ill-suited for many downstream tasks, such as modal analysis of fluid flows (Taira et al., 2017) and reduced-order modeling (Benner et al., 2015; Loiseau et al., 2021; Noack et al., 2003).
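On a fixed mesh, the discrete realization of POD via the SVD can be sketched as follows (synthetic low-rank data for illustration; uniform cell areas are assumed so no quadrature weighting is needed):

```python
import numpy as np

# POD via SVD on a fixed-mesh snapshot matrix: rows are mesh points, columns
# are snapshots in time; the left singular vectors are the discrete POD modes,
# ordered by singular value.
rng = np.random.default_rng(3)
x = np.linspace(0, 2 * np.pi, 200)
t = np.linspace(0, 1, 40)

# Synthetic rank-2 field: two separable space-time structures plus tiny noise.
A = (np.outer(np.sin(x), np.cos(2 * np.pi * t))
     + 0.5 * np.outer(np.sin(2 * x), np.sin(2 * np.pi * t))
     + 1e-6 * rng.normal(size=(200, 40)))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2
Phi = U[:, :r]                      # first r POD modes

# The modes are orthonormal, and the rank-2 reconstruction is near-exact here.
assert np.allclose(Phi.T @ Phi, np.eye(r), atol=1e-10)
A_r = Phi @ np.diag(s[:r]) @ Vt[:r]
assert np.linalg.norm(A - A_r) / np.linalg.norm(A) < 1e-3
```

The closed-form discreteness is exactly what breaks on adaptive meshes: the rows of the snapshot matrix must refer to the same mesh points in every column.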
Since multi-scale features often appear in spatio-temporal data, we employ NIF with SIREN from section 2.2 for the applications considered in the rest of this paper. As shown in fig. 7, we first provide a framework based on NIF to directly approximate an optimal $r$-dimensional linear space in the sense of classical POD theory (Djouadi, 2008). The key observation is that we can use ParameterNet with input $t$ to parameterize only the last layer weights and biases of ShapeNet, while the rest of the weights and biases of ShapeNet are determined by optimization. As shown in eq. 6, we arrive at an interpretable approximation of the original spatio-temporal field as a sum of products of temporal coefficients and spatial functions parameterized by MLPs,
$$u(x, t) \approx \sum_{i=1}^{r} a_i(t)\, \phi_i(x) \qquad (6)$$
The notation in eq. 6 is slightly different from eq. 5 in order to highlight that NIF only approximates the $r$-dimensional linear subspace rather than obtaining a set of ordered orthonormal spatial functions. Note that one needs to take the cell area into account when implementing eq. 6. To remove the effect of the slightly varying magnitude of the spatial basis learned with a neural network, we normalize the spatial basis, taking $\tilde{\phi}_i = \phi_i / \|\phi_i\|$ and $\tilde{a}_i(t) = \|\phi_i\|\, a_i(t)$, so that $u(x, t) \approx \sum_{i=1}^{r} \tilde{a}_i(t)\, \tilde{\phi}_i(x)$. Since $\tilde{a}(t) = [\tilde{a}_1(t), \ldots, \tilde{a}_r(t)]$ is the corresponding mesh-agnostic $r$-dimensional linear representation, one can effortlessly apply any existing SVD-based framework (e.g., SVD-DMD (Schmid, 2010)) on datasets with arbitrarily varying meshes.
The goal of data-driven sparse reconstruction is to use limited sensor measurements to infer the entire high-dimensional system state, given a priori information about the low-dimensional manifold on which the system evolves. It has been widely applied in projection-based reduced-order modeling, especially for large-scale nonlinear systems (where it is also known as “hyper-reduction”), since computing expensive nonlinear terms can be avoided. Currently, POD-QDEIM (Drmac and Gugercin, 2016) is one of the most popular methods, showing improved performance over classical compressive sensing techniques (Manohar et al., 2018). The idea of POD-DEIM (Chaturantabut and Sorensen, 2010) is to use least-squares estimators for the latent representation by measuring only a few locations (e.g., sensors).
Chaturantabut and Sorensen (2010) derived an error upper bound for DEIM that indicates two contributions: 1) a spectral norm related to sensor selection, and 2) the projection error of the linear subspace. The idea of POD-QDEIM is to minimize the former spectral-norm contribution with QR pivoting. Without loss of generality, given a scalar field on $M$ mesh points and $N$ snapshots, the data matrix is

$$A = \begin{bmatrix} u(x_1, t_1) & \cdots & u(x_1, t_N) \\ \vdots & \ddots & \vdots \\ u(x_M, t_1) & \cdots & u(x_M, t_N) \end{bmatrix} \in \mathbb{R}^{M \times N}. \qquad (7)$$
The corresponding reduced rank-$r$ SVD is $A \approx U_r \Sigma_r V_r^{\top}$. Near-optimal sensor placement can be achieved via QR factorization with column pivoting of the transpose of the spatial basis,
$$U_r^{\top} P = Q R \qquad (8)$$
Note that the top $r$ rows of the permutation matrix $P^{\top}$ give a sparse measurement matrix $[e_{\gamma_1}, \ldots, e_{\gamma_r}]^{\top}$, where the $e_{\gamma_i}$ are the canonical basis vectors with a unit entry at index $\gamma_i$ and zeros elsewhere; $\gamma_i$ corresponds to the index of the $i$-th best sensor location. One can recover the full field by least-squares estimation of the latent representation from the measurements at those $r$ sensors.
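The pivoting step and the least-squares recovery can be sketched as follows. The greedy pivoting loop is an illustrative stand-in for a library routine such as `scipy.linalg.qr` with pivoting, and the snapshot data is synthetic:

```python
import numpy as np

# QDEIM-style sensor placement: column-pivoted QR on the transposed POD basis
# ranks mesh points; the first r pivots are taken as the sensor locations.

def qr_column_pivots(M, r):
    """First r column pivots of a Businger-Golub style pivoted QR (greedy)."""
    M = M.astype(float).copy()
    piv = []
    for _ in range(r):
        norms = np.linalg.norm(M, axis=0)
        j = int(np.argmax(norms))          # pick the largest remaining column
        piv.append(j)
        q = M[:, j] / norms[j]
        M -= np.outer(q, q @ M)            # deflate along the chosen direction
    return piv

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 300)
modes = np.column_stack([np.sin((k + 1) * np.pi * x) for k in range(5)])
A = modes @ rng.normal(size=(5, 60))       # exactly rank-5 snapshot matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 5
sensors = qr_column_pivots(U[:, :r].T, r)  # r mesh indices = sensor locations

# Recover a full snapshot from the r sensor readings via least squares.
snapshot = A[:, 0]
z = np.linalg.lstsq(U[sensors, :r], snapshot[sensors], rcond=None)[0]
recon = U[:, :r] @ z
assert np.linalg.norm(recon - snapshot) / np.linalg.norm(snapshot) < 1e-6
```

Because this toy snapshot lies exactly in the rank-5 subspace, the recovery is essentially exact; with real data, the residual reflects the projection error discussed next.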
Given the success of POD-QDEIM, we turn our attention to its remaining error contribution, i.e., the projection error of the linear subspace, by replacing POD with NIF. We first use the optimized sensor locations determined by POD-QDEIM. Then, as shown in fig. 8, with the sensor measurements as the input to ParameterNet and the ground-truth field as the target output, we end up with a standard supervised learning problem. Once the model is trained, it can be used to predict the full spatial field at any location given a measurement from the sensors.
The code and data for the following applications is available at https://github.com/pswpswpsw/paper-nif. The Python package for NIF is available at https://github.com/pswpswpsw/nif.
We apply the data-fit parametric surrogate modeling framework in eq. 4 to the 1D Kuramoto–Sivashinsky (K-S) equation with periodic boundary conditions,
$$\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} + \frac{\partial^2 u}{\partial x^2} + \nu \frac{\partial^4 u}{\partial x^4} = 0 \qquad (9)$$
with varying parameter $\nu$ as shown in fig. 4. For the K-S equation, we fix the initial condition and vary $\nu$ from 0.2 to 0.28, a regime that is not chaotic (although the K-S equation is most often simulated in the chaotic regime, it is also famous for rich bifurcation phenomena as its parameters change (Papageorgiou and Smyrlis, 1991)). The training data consists of 20 points in the parameter space (i.e., 20 simulations with distinct $\nu$). The testing data consists of 59 simulations with a finer sampling of $\nu$. As shown in fig. 4, the system response as $\nu$ varies from 0.2 to 0.28 is relatively smooth, without any chaos. This makes the regression problem well-posed. The training data is preprocessed with standard normalization. Details are given in section B.1.
For NIF, we take 4 layers with units 2-30-30-2-6553 for ParameterNet and 5 layers with units 1-56-56-56-1 with ResNet-like skip connections for ShapeNet. We empirically found that such ResNet-like skip connections help accelerate convergence. Note that 6553 corresponds to the total number of weights and biases in the aforementioned ShapeNet. The Swish activation function (Ramachandran et al., 2017) is adopted. As a comparison, we use a standard MLP with 5 layers and units 3-100-100-100-1, which has around the same number of model parameters, with $(x, t, \nu)$ as input and $u$ as output. This can be viewed as the straightforward idea of using PINNs (Raissi et al., 2020) for regression without minimizing the PDE loss. Note that the same skip connections are employed as well. The model parameters are initialized with a truncated normal with standard deviation 0.1 in both cases (we empirically find that such “small-weights initialization” is an easy way to help the convergence of NIF; systematic work on the initialization of hypernetworks (Chang et al., 2019) might further improve the results). For the standard MLP case, two extra cases with tanh and ReLU activations are considered. We implemented both models in Tensorflow (Abadi et al., 2016); note that we rely heavily on einsum in implementing NIF. We adopt the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-3, a batch size of 1024, and 40000 epochs. To take training randomness into account and remove outliers, we take the average of 4 well-converged trials for both network structures. As shown in
table 1, NIF with Swish activation achieves better performance on both training and testing data than the three MLP counterparts.

| Model | Train RMSE | Test RMSE |
| --- | --- | --- |
| NIF (Swish) | 0.64 | |
| MLP (Swish) | | |
| NIF (tanh) | 0.99 | |
| MLP (tanh) | | |
| MLP (ReLU) | | |
For simplicity, we fix the Swish activation function in what follows. We first vary the size of the training data and retrain the models to evaluate data efficiency, changing the number of sampling points in parameter space from the previous 20 to 15, 24, and 29. As shown in fig. 9, NIF with Swish activation performs consistently better than the MLP with Swish activation; to achieve the same level of testing error, NIF requires approximately half of the training data. Finally, we vary the number of model parameters from 7000 to 34,000 while fixing the number of points in parameter space at 20, and retrain the models to evaluate model expressiveness. As displayed in fig. 9, given the same number of parameters, NIF halves the testing error compared to its MLP counterpart. Details of the comparisons are given in section C.1.
To highlight the advantage of NIF on multi-scale problems, here we compare the performance of SVD, CAE, and a NIF-based autoencoder for reducing the dimensionality of the density field from the classical 2D Rayleigh-Taylor (R-T) instability. In this R-T problem, the interface between the high-density fluid above and the low-density fluid below is initially perturbed with a single-mode density profile. The CFD simulation is performed with CASTRO (Almgren et al., 2010), a compressible hydrodynamics code with AMR. Since the mesh changes with time, CAE or SVD can only be applied after projecting the data from the adaptive mesh onto a static fine mesh. Such preprocessing can introduce the so-called projection error and can be computationally challenging, especially in 3D. In the following, we take the static fine mesh to be 128x256 for SVD and CAE. In contrast, NIF directly takes the pointwise raw data on the adaptive mesh for training.
The goal here is to encode the flow state onto a low-dimensional latent space, from which one can faithfully reconstruct the flow state. Since we only take data from a single initial condition and the collection of data ends before the symmetry breaks, the minimal latent dimension that preserves the information in this case is one. We sample the flowfield uniformly in time and split this single trajectory into 84 training and 28 testing snapshots such that the testing snapshots fall in between the training snapshots. Details of data preparation are provided in section B.2.
Note that the structure of NIF in fig. 1 is only a decoder rather than an encoder. Hence, unlike SVD and CAE, we still need to choose an encoder that feeds information to the network so that it can discern one snapshot from another. One can choose either a convolution layer acting on coarse data on a Cartesian mesh, just as in a CAE, or a neural network taking a cluster of pointwise measurements at certain locations. Here we choose the latter: we consider 32 uniformly distributed sensors along the vertical axis in the middle of the flowfield.
For NIF, ParameterNet consists of two ResNet-like blocks with 64 units (fig. 5) followed by a linear bottleneck layer, and ShapeNet contains 2 ResNet-like blocks (fig. 5) with 128 units. ParameterNet takes the 32 sensor measurements as input and outputs 66,561 parameters, the total number of weights and biases of ShapeNet; ShapeNet takes pointwise spatial coordinates as input and outputs the predicted density at that location. To enforce a smooth latent representation, we use Jacobian and approximate Hessian regularization (Rifai et al., 2011), together with a small-weights initialization for the encoder, which we find empirically to be helpful.
For the CAE, we choose a typical deep convolutional architecture used for fluid flows (Wiewel et al., 2019), with the detailed setup in section C.2.1. Gradient-based optimization is performed with the Adam optimizer. The learning rate is 2e-5 for NIF and 1e-3 for the CAE, with a batch size of 3150 for NIF and 4 for the CAE (note that pointwise data is fed to NIF, while image snapshot data projected onto a uniform Cartesian mesh is fed to the CAE). The total number of training epochs is 10,000 for the CAE and 800 for NIF.
To make quantitative comparisons, we project all of the predictions on testing data, together with the ground-truth data, onto a very fine mesh with resolution 256x512, using nearest-neighbor interpolation (for CAE and SVD) or direct sampling (for NIF). As shown in fig. 10, where a rank-1 reduction is performed, the prediction of NIF is better than those of SVD and CAE across the static mesh resolutions considered. As the static mesh resolution increases, the errors from the nonlinear models are expected to change little, owing to the single-realization nature of the data, while the error of the linear method decreases. On average, the predictions of the CAE at the three increasing resolutions incur 10, 5, and 1.7 times more error than the NIF model. Further error analysis in section C.2.2 shows that the NIF-based autoencoder outperforms SVD mainly due to the lack of nonlinear expressiveness of SVD, and it outperforms the CAE because of the excess projection error from the uniform Cartesian grid.
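The projection step used for this comparison can be sketched as follows: nearest-neighbor resampling of scattered adaptive-mesh data onto a common fine grid, followed by an RMSE. This is a minimal NumPy illustration with made-up sizes (a small stand-in for the 256x512 mesh), not the ParaView-based pipeline the paper uses.

```python
import numpy as np

def project_to_grid(points, values, grid_xy):
    """Nearest-neighbor projection of scattered (adaptive-mesh) data onto a grid.

    Brute-force distances for illustration; a KD-tree is advisable at scale.
    """
    d2 = ((grid_xy[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return values[d2.argmin(axis=1)]

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
# scattered "adaptive mesh" samples of a smooth field on [0, 0.5] x [0, 1]
pts = rng.uniform([0.0, 0.0], [0.5, 1.0], size=(2000, 2))
rho = pts[:, 0] + 2.0 * pts[:, 1]            # stand-in for the density field

# common fine evaluation grid
gx, gy = np.meshgrid(np.linspace(0, 0.5, 16), np.linspace(0, 1.0, 32))
grid = np.column_stack([gx.ravel(), gy.ravel()])
rho_proj = project_to_grid(pts, rho, grid)
err = rmse(rho_proj, grid[:, 0] + 2.0 * grid[:, 1])
assert err < 0.1   # small for a dense scatter of a smooth field
```

The same `rmse` on the common grid gives a mesh-independent way to compare SVD, CAE, and NIF predictions.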
Next, we apply NIF as described in section 2.2 to learn a spatially compressed representation of a 3D multi-scale spatio-temporal field. Our goal is to find a vector-valued, continuously differentiable function that fits the original spatio-temporal data and is linearly determined by time-dependent reduced coordinates whose dimension is several orders of magnitude smaller than the number of mesh points. If this is achieved, one can efficiently transmit a snapshot of turbulence data at any time by transferring only the low-dimensional latent vector to the receiver, who needs only a "skeleton" ShapeNet and a single linear decoder layer (i.e., the last layer of ParameterNet) on a local device to decode the latent vector into a continuously differentiable spatial field. It is important to note that the last layer of ParameterNet is a very wide linear layer, with width on the order of the total number of weights and biases of ShapeNet.
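The transmit-and-decode idea above can be sketched end to end. In this hypothetical NumPy example the latent dimension, the tiny 1-8-8-1 ShapeNet skeleton, and the random decoder weights are all illustrative assumptions; the point is only that the receiver stores a fixed wide linear layer plus an architecture, and reconstructs a continuous field from a few transmitted floats.

```python
import numpy as np

rng = np.random.default_rng(1)

r = 3                      # latent code dimension (illustrative)
widths = (1, 8, 8, 1)      # tiny ShapeNet skeleton (illustrative)
n_theta = sum(i * o + o for i, o in zip(widths[:-1], widths[1:]))  # 97 here

# Receiver-side state: the wide linear last layer of ParameterNet (A, c)
# and the ShapeNet architecture; the sender transmits only z (r floats).
A = 0.1 * rng.standard_normal((n_theta, r))
c = 0.1 * rng.standard_normal(n_theta)

def decode_field(z, x):
    """Map a latent code z to ShapeNet weights, then evaluate the field at x."""
    theta = A @ z + c                  # last (linear) layer of ParameterNet
    out, i = x, 0
    layers = list(zip(widths[:-1], widths[1:]))
    for k, (d_in, d_out) in enumerate(layers):
        W = theta[i:i + d_in * d_out].reshape(d_in, d_out); i += d_in * d_out
        b = theta[i:i + d_out]; i += d_out
        out = out @ W + b
        if k < len(layers) - 1:        # hidden-layer nonlinearity
            out = np.tanh(out)
    return out

z = rng.standard_normal(r)             # what the sender actually transmits
x = np.linspace(-1, 1, 5).reshape(-1, 1)   # pointwise spatial queries
u = decode_field(z, x)
assert u.shape == (5, 1)
```

Transmitting `z` (3 floats) instead of `theta` (97 floats) or the full field is where the compression comes from; in the paper the ratio is far larger because ShapeNet is much bigger than any latent code.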
As an illustrative example, we use a temporally evolving (20 snapshots) spatial velocity field of homogeneous isotropic turbulence (HIT) from Johns Hopkins University. Details of data preparation are given in section B.4. It should be highlighted that, unlike CAE or SVD, the model complexity of NIF is not directly tied to the resolution at which the data is stored, but rather to the intrinsic spatial complexity of the data itself. In this example, ShapeNet has 4 ResNet-like blocks as hidden layers with width 200, while ParameterNet has 1 ResNet-like block with width 50, followed by a linear bottleneck layer and a linear layer that maps the latent vector to all the weights and biases of ShapeNet. The total number of trainable parameters is 1,297,365, all of which are inside ParameterNet (the weights of ShapeNet are generated rather than trained). For training, we use the Adam optimizer with a learning rate of 1e-5 and a batch size of 1600.
First, we test our model by comparing the first component of the ground-truth velocity field against the reconstruction from NIF. As displayed in fig. 11, NIF reconstructs the ground-truth velocity, including small-scale structures, very well. Since most modern studies of turbulence are descriptive from a statistical viewpoint, it is important to verify that the PDFs are well preserved after compression. As shown in fig. 12, the PDFs of various quantities are approximately preserved. For a more stringent test, we verify the model performance by visually comparing iso-contours of the Q-criterion and the vorticity magnitude. As displayed in fig. 13, these higher-order quantities are well preserved by the model, with only small visual differences. Lastly, it is important to highlight that the original dataset requires storing a full space-time array, whereas the number of trained parameters is only a small fraction of that. Furthermore, the compression ratio can be improved with recent neural network pruning techniques (Han et al., 2015) and with more inputs to ParameterNet. We leave this exciting topic for future study.
When querying large scientific datasets, typically from PDE solvers, we perform many more queries in space than in time or parameter space. For example, spatial statistics in the homogeneous directions are often collected in the analysis of turbulent physics, and visualization of vortex indicators requires intensive spatial querying, e.g., spatial differentiation. The primary reason is that the spatial degrees of freedom are typically far more numerous than the temporal or parametric degrees of freedom.
Since the structure of NIF isolates the spatial complexity from all other factors, we can efficiently obtain the spatial field without unnecessary repeated computation related to those factors, such as time or system parameters. In contrast, it can be wasteful to use a single feedforward neural network with multiple SIREN layers (Sitzmann et al., 2020) that takes all information as input and outputs the field of interest (hereafter referred to simply as "SIREN"), because such a single network must mix and learn all of the complexity, e.g., multi-scale behavior, chaos, and bifurcations. The number of spatial queries is typically on the order of 10,000 in 2D and 1 million in 3D, so one must repeatedly perform temporal and/or parametric computation for different spatial points when adopting a single network for spatial-query-intensive tasks, e.g., learning representations of video (Sitzmann et al., 2020) or temporally evolving volumetric fields (Lu et al., 2021c). This can lead to a larger network with longer inference time at the same level of accuracy.
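The amortization argument above can be made concrete with a back-of-the-envelope cost model. All widths, depths, and query counts below are illustrative assumptions loosely echoing the experiment, not measured numbers.

```python
# A rough cost model (illustration only, not a benchmark): a dense network of
# depth L and width w costs ~ L * w**2 multiply-adds per query.
def mlp_cost(width, depth):
    return depth * width ** 2

n_x, n_t = 10_000, 100   # spatial queries per snapshot, number of snapshots

# Single SIREN-style network taking space-time input: every spatial point
# pays the full-network cost at every time step.
cost_single = n_x * n_t * mlp_cost(150, 4)

# NIF: ParameterNet runs once per snapshot; a slimmer ShapeNet handles the
# spatially intensive part.
cost_nif = n_t * mlp_cost(64, 4) + n_x * n_t * mlp_cost(105, 4)

assert cost_nif < cost_single   # amortizing time/parameter work pays off
```

The ParameterNet term is negligible (it is multiplied by `n_t`, not `n_x * n_t`), so the saving comes entirely from the ShapeNet being narrower than the all-in-one network at equal accuracy.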
Once the output of ParameterNet is given, spatial inference only requires running ShapeNet. We use the same HIT data (see section 3.3). The model setup is the same as before, except that the widths of ShapeNet and of the SIREN vary over 36, 52, 75, 105, and 150, and the bottleneck width of ParameterNet increases from 3 to 10. As shown in the top left of fig. 14, NIF uses a smaller network for spatial evaluation than its SIREN counterpart at the same level of reconstruction error. This is because the SIREN takes time as an additional input, so part of its capacity is spent learning temporal variation.
To further compare performance, we measure the CPU runtime and memory consumption of the spatial-query part of the code with the standard Python package time and the Fil memory profiler (Turner-Trauring, 2019). We denote the spatial query of the three velocity components as a forward computation and the evaluation of their spatial derivatives as a backward computation. For a fair comparison, for NIF we also include the inference time of ParameterNet (although it contributes less than 1% of the total computational time here). Note that in the top right and bottom left subfigures of fig. 14, there is little difference in scaling between NIF and SIREN, which simply reflects the similarity between ShapeNet and the SIREN: the first layer of the SIREN takes space-time as input, while that of NIF takes only space. We also note that the backward computation requires nearly twice the computational cost of the forward computation. However, at the same accuracy, NIF requires a smaller network than SIREN; hence, given the same reconstruction error, the required widths for NIF and SIREN can be determined. The bottom right of fig. 14 indicates that NIF yields 27% less network width, 30% less CPU time in the forward and backward passes, and 26%/27% less memory consumption in the forward/backward computations for the task of querying homogeneous isotropic turbulence data.
Next, we test the NIF-based framework by learning a linear subspace for a classical modal analysis problem in fluid mechanics: vortex shedding behind a cylinder. As shown in fig. 7, the simulation is performed with AMR, which is frequently used in computational fluid dynamics for efficient simulation and data storage across a wide range of flows containing moving sharp gradients (e.g., unsteady shocks, flames, and vortices). We first collect 100 snapshots of a spatial field consisting of the two velocity components, together with the cell-center coordinates of each cell and time. We then follow eq. 6 to extract a rank-10 linear subspace from the spatio-temporal field. NIF with SIREN is employed in both ParameterNet and ShapeNet. Details of the data preparation and the reconstruction performance are given in section B.3. Given the learned latent 10-dimensional time series shown in fig. 7, we perform a standard dynamic mode decomposition (DMD) (Schmid, 2022) to extract isolated spatio-temporal modes with distinct frequencies. The DMD mode shapes in fig. 7 agree qualitatively well with existing studies (Pan et al., 2021; Chen et al., 2012). Note that the latent representation is a function of time and the spatial modes are functions of the spatial coordinates; thus, at the postprocessing stage, one can easily evaluate these functions at any time and/or spatial location, at any desired resolution.
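The DMD step applied to the learned latent time series is standard; here is a compact exact-DMD sketch (after Tu et al.), verified on a synthetic two-dimensional oscillation standing in for one shedding frequency. The synthetic data and rank are our assumptions, not the paper's latent trajectories.

```python
import numpy as np

def dmd(Z, rank):
    """Exact DMD of a (dim, time) snapshot matrix Z.

    Returns discrete-time eigenvalues and the corresponding spatial modes.
    """
    X, Y = Z[:, :-1], Z[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, V = U[:, :rank], s[:rank], Vh[:rank].conj().T
    A_tilde = U.conj().T @ Y @ V / s            # rank x rank projected operator
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (Y @ V / s) @ W                     # exact DMD modes
    return eigvals, modes

# sanity check on a synthetic 2D latent oscillation (one frequency)
dt, omega = 0.1, 2 * np.pi * 0.5
t = np.arange(200) * dt
Z = np.vstack([np.cos(omega * t), np.sin(omega * t)])
eigvals, modes = dmd(Z, rank=2)
# a pure oscillation maps to unit-magnitude eigenvalues exp(+/- 1j*omega*dt)
assert np.allclose(np.abs(eigvals), 1.0)
assert np.allclose(np.sort(np.angle(eigvals)), [-omega * dt, omega * dt])
```

In the paper's setting, `Z` would be the 10-dimensional latent time series, and each latent DMD mode is then lifted to a continuous spatial field through the learned decoder.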
As shown in fig. 8, we apply the above NIF-based framework to reconstruct and predict sea surface temperature data (Reynolds et al., 2002) given sparse localized measurements. To compare with the state of the art, we use a setup similar to that of Manohar et al. (2018). We take snapshots from 1990 to 2006 as training data and those of the following 15 years, until 2021, as testing data. As mentioned in the previous subsection, we first use POD-QDEIM on the training data to find the best sensor locations on the sea. As shown in the top left of fig. 15, we empirically find that POD-QDEIM performs best with a surprisingly low 5 sensors. To perform an exhaustive evaluation of the NIF-based framework, we vary the number of sensors from 5 to 600. Due to the multi-scale structure in sea surface temperature, we use NIF with SIREN as in section 2.2. For the smaller sensor counts, we take 2 ResNet-like blocks in ShapeNet with width 60 and 2 blocks in ParameterNet with width 60; for the larger counts, we keep the ShapeNet blocks at width 60 and widen ParameterNet. For all cases, we set the latent dimension equal to the number of sensors, in analogy to the equality between rank and number of sensors in POD-QDEIM. We use the Adam optimizer for mini-batch training with a learning rate of 1e-5 and a batch size of 7000 for 150 epochs. Details of data preparation are in section B.5.
To further evaluate our framework, we make a side-by-side comparison with the state-of-the-art POD-QDEIM in both reconstruction and prediction of sea surface temperature, using the same information from the best sparse sensor locations selected by POD-QDEIM. As displayed in the top right of fig. 15, the space-time mean squared error on training data decreases for both the NIF-based framework and POD-QDEIM as the number of sensors increases, but the error of our framework decays much more quickly. The approximate decay rates are shown in the bottom left of fig. 15: our NIF-based framework shows a consistent decay rate as the number of sensors increases, whereas POD-QDEIM only begins to decrease its training error at a faster rate beyond a certain number of sensors. It is also interesting to note that as more and more sensors are used, the generalization of POD-QDEIM becomes worse and worse, while our framework in general performs better and better. As shown in the bottom right of fig. 15, over the sensor counts considered in this work, our NIF-based model surpasses the best POD-QDEIM model once 200 sensors are used, which corresponds to only a small fraction of all possible sensor locations on the sea. Finally, as mentioned before, the most generalizable POD-QDEIM model uses 5 sensors and results in a space-time MSE of 0.71 (additional parameter sweeps are shown in the top left of fig. 15), while the NIF-based model with the smallest testing error uses 600 sensors and results in a space-time MSE of 0.46, which is 34% smaller than the best POD-QDEIM model.
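The decay rates reported above are slopes on log-log axes; a minimal sketch of how such a rate can be estimated (the sensor counts and the exact power law below are synthetic, chosen only so the fit has a known answer):

```python
import numpy as np

def decay_rate(n_sensors, errors):
    """Estimate alpha in E ~ C * p**(-alpha) by least squares on log-log axes."""
    slope, _ = np.polyfit(np.log(n_sensors), np.log(errors), 1)
    return -slope

# synthetic check: an error curve decaying exactly like p**(-1.5)
p = np.array([5.0, 25.0, 100.0, 200.0, 600.0])
e = 2.0 * p ** (-1.5)
alpha = decay_rate(p, e)
assert abs(alpha - 1.5) < 1e-6
```

Applied to the measured training errors of each method, this yields the per-method rates plotted in the bottom left of fig. 15.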
Apart from comparing the space-time mean squared error, we also compute the standard deviation of the spatially averaged squared error along the time axis as an indicator of the robustness of model performance (see error bars in fig. 15). Note that the vertical axis of the top right figure in fig. 15 is in log scale. For the range of sensor counts considered, the training error bar of our framework is significantly smaller than that of POD-QDEIM, indicating that our framework can faithfully reconstruct the training data with higher confidence. This is particularly important for hyper-reduction in projection-based ROMs (Carlberg et al., 2011). The testing error bar of our framework is also smaller than that of POD-QDEIM, which means our framework is more robust in predicting unseen data as well. Additional discussions are in section C.3.
High-dimensional spatial complexity is a major bottleneck for computational and storage costs in many physical science and engineering fields where the physics is governed by complex partial differential equations (PDEs). Existing frameworks, such as SVD and CAE, suffer from various challenges arising from the complex data structures of real-world scientific computing. In this work, a mesh-agnostic representation for parametric spatio-temporal datasets, called Neural Implicit Flow (NIF), is proposed. The key feature of NIF is its ability to separate spatial complexity from other factors, such as temporal and parametric complexity, which naturally follows the philosophy of manifold-based learning of PDEs. Compared to existing SVD and CAE frameworks, whose performance is restricted by unnecessary projection, linearity, intractable memory scaling for large-scale 3D datasets, or inefficiency on adaptive meshes, NIF enables scalable, mesh-agnostic, nonlinear dimensionality reduction with improved performance. Table 2 summarizes the capabilities and challenges of SVD, CAE, and NIF. Specifically, we demonstrate the advantages of NIF through four derived mesh-agnostic frameworks: data-fit surrogate modeling for parametric PDEs, compressed representations of multi-scale spatio-temporal data, learning linear representations, and nonlinear sparse sensing. To the best of our knowledge, nonlinear dimensionality reduction of 3D turbulence with over 2 million cells, and of complex multi-scale dynamics from an AMR solver, is achieved for the first time.
For spatially complex data, one requires a correspondingly large ShapeNet to accommodate the spatial structures, and larger networks take longer to converge. As a result, one has to manually decide the sizes of ShapeNet and ParameterNet for each specific problem, whereas SVD and CAE are much easier to configure.
Unlike SVD and CAE, for which best practices are well established (Maulik and Mengaldo, 2021; He et al., 2019), training a hypernetwork of a deep neural network still requires some trial and error. A typical SVD takes seconds to minutes, and training a CAE usually takes around half an hour to reach decent accuracy, while NIF usually takes from over an hour to days, depending on the data complexity, which in turn affects the model size. First, because the input of NIF is pointwise, the total number of training samples can be huge for 3D datasets. Second, connecting the two networks requires expensive tensor operations between a 1D tensor and a 2D tensor (the reshaped output of ParameterNet); both tensors are batch-dependent in NIF, which can degrade cache performance. For example, in the 3D turbulence case, it takes up to 1.3 hours to process a single epoch and almost 5 days on a 16GB GPU to obtain good accuracy.
The use of a hypernetwork leads to a more complex network topology than a standard feedforward neural network with comparable model size. As a result, it creates a larger memory footprint, which unfortunately limits the maximal batch size and leads to longer training times.
Despite some promising results (Wang et al., 2022), it is generally more challenging to embed invariances into coordinate-input neural networks than into graph-based approaches (Li et al., 2020a). This lack of invariance may worsen the generalization of data-fit regression, especially when the amount of training data is small.
On the flip side, model complexity does not scale with the "superficial" complexity of the data, e.g., the number of mesh points. A finer mesh only provides more training data, while the model complexity can remain the same. Mesh-based models (e.g., CAE), in contrast, suffer from growing model complexity as the number of mesh points increases.
Since NIF is mesh-agnostic in nature, it becomes very convenient to fuse information from different data sources. For example, PIV data are mostly on a perfectly uniform Cartesian mesh, whereas CFD data are commonly on a well-designed body-fitted mesh; different types of CFD solvers may also use different meshes, e.g., a multi-level Cartesian mesh for LBM/IBM (Taira and Colonius, 2007).
Thanks to the decoupling of spatial complexity, we obtain a direct analogue of SVD that is mesh-agnostic, nonlinear, and scalable to 3D datasets. It is straightforward to extend established ROM frameworks (Carlberg et al., 2011; Bruna et al., 2022) and modal analysis (Schmid, 2022; Taira et al., 2020; Rowley et al., 2009) to more realistic datasets with complex meshes.
Postprocessing of PDE data, e.g., turbulence (Li et al., 2008), often involves far more intensive spatial queries than temporal or parametric queries. Our design of NIF leads to a compact network for spatial complexity, which improves the efficiency of massive spatial queries of either point values or spatial derivatives.
NIF has the potential to extend existing data-driven modeling paradigms based on SVD and CAE to real-world scientific computing, where raw spatio-temporal data can be three-dimensional, large-scale, and stored on arbitrary adaptive meshes. High-fidelity large-scale data from modern scientific computing codes (Almgren et al., 2010; Nonaka et al., 2012) can be reduced to effective low-dimensional spaces, where existing modal analysis (Taira et al., 2020; Towne et al., 2018; McKeon and Sharma, 2010), data-driven flow control (Duriez et al., 2017), and surrogate-based modeling/optimization (Koziel and Leifsson, 2013) can be performed seamlessly on arbitrary adaptive meshes. However, as is typical for nonlinear dimensionality reduction methods (e.g., CAE), the latent representation is more difficult to interpret than its linear counterpart (e.g., SVD), and the latent representation can differ from one training run to another. This raises new questions on how to guide the latent representation to be interpretable to experts. Another exciting direction is to construct projection-based ROMs in such a latent space by minimizing the residual of known governing equations. This has been demonstrated very recently for the material point method (Chen et al., 2021) and for high-dimensional PDEs with active learning
(Bruna et al., 2022). Besides dimensionality reduction, NIF can be used to design efficient "decoders" for transferring large spatial scientific datasets: it essentially trades reduction in storage/transfer and ease of data fusion across heterogeneous meshes for offline training of a NIF model plus additional function calls to ShapeNet.

| Property/Model | SVD | CAE | Neural Implicit Flow |
| --- | --- | --- | --- |
| Resolution | strong: convergence to discrete data | weak: requires uniform mesh | strong: continuous field |
| Variable geometry | strong | weak | strong |
| Scalability | strong: efficient randomized SVD available | fair/weak: affordable in 2D; resolution-restricted in 3D | strong: pointwise mini-batch training; number of parameters scales only with intrinsic complexity |
| Parametric/temporal variation of domain discretization | weak: requires the same mesh throughout | weak: requires the same uniform mesh throughout | strong: arbitrary mesh |
| Ease of training | strong | fair/strong | fair |
| Interpretability of representation | strong | weak | fair: last-layer parameterization learns a linear subspace |
| Expressiveness | weak: linear, not ideal for advection-dominated flows | fair: nonlinear but cannot capture multi-scale efficiently | strong: nonlinear with multi-scale capability |
The authors thank Kevin Carlberg, Karthik Duraisamy, Zongyi Li, and Lu Lu for valuable discussions. The authors acknowledge funding support from the Air Force Office of Scientific Research (AFOSR FA9550-19-1-0386). The authors acknowledge funding from the National Science Foundation AI Institute in Dynamic Systems grant number 2112085. We also appreciate the generous support from Prof. Karthik Duraisamy on using the following computing resources, which were provided by the NSF via the grant “MRI: Acquisition of ConFlux, A Novel Platform for Data-Driven Computational Physics”. This work also used the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges-2 system, which is supported by NSF award number ACI-1928147, at the Pittsburgh Supercomputing Center (PSC).
As shown in fig. 16, state-of-the-art methods for dimensionality reduction of parametric spatio-temporal data rely on SVD and CAE. Dimensionality reduction via SVD requires the data to live on a single fixed mesh: first, the spatial data at each time is flattened into a long column vector; second, these column vectors are stacked over time; finally, SVD is performed on the stacked matrix, and the right singular vectors give the latent representation. The CAE instead treats the flowfield as an image: first, one performs a uniform pixelation; second, one feeds the processed images into a convolutional autoencoder, whose bottleneck layer in fig. 16 forms the latent representation.
We use the ETDRK4 method (Kassam and Trefethen, 2005) to solve the 1D Kuramoto-Sivashinsky equation in eq. 9, with periodic boundary conditions and a fixed initial condition. Spatial discretization is performed with 1024 points; the raw dataset is then subsampled 4 times in space and 1000 times in time. Note that the K-S equation has rich dynamics as its parameter changes (Papageorgiou and Smyrlis, 1991); to make the problem well posed, we choose a regime in which the system is not chaotic. Parametric variation of the solution is displayed in fig. 17. Training data is prepared with 20 uniformly sampled parameter values, and testing data with a finer uniform sampling of the same parameter range. Each data point has 4 components (the three inputs and the solution value), and each component is normalized to zero mean and unit standard deviation.
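A compact sketch of the ETDRK4 scheme of Kassam and Trefethen (2005) for the K-S equation is given below. The domain length, parameter value, grid size, and time step here are illustrative defaults, not the exact settings used for the dataset.

```python
import numpy as np

def ks_etdrk4(u0, nu=0.25, L=8 * np.pi, dt=0.05, n_steps=400):
    """ETDRK4 for u_t = -u*u_x - u_xx - nu*u_xxxx with periodic BCs.

    After Kassam & Trefethen (2005); nu, L, dt, n_steps are illustrative.
    """
    n = u0.size
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)          # wavenumbers
    Lin = k**2 - nu * k**4                              # linear operator (Fourier)
    E, E2 = np.exp(dt * Lin), np.exp(dt * Lin / 2)
    # phi-functions evaluated by complex contour averaging (avoids cancellation)
    M = 16
    rts = np.exp(1j * np.pi * (np.arange(1, M + 1) - 0.5) / M)
    LR = dt * Lin[:, None] + rts[None, :]
    Q = dt * np.real(np.mean((np.exp(LR / 2) - 1) / LR, axis=1))
    f1 = dt * np.real(np.mean((-4 - LR + np.exp(LR) * (4 - 3 * LR + LR**2)) / LR**3, axis=1))
    f2 = dt * np.real(np.mean((2 + LR + np.exp(LR) * (-2 + LR)) / LR**3, axis=1))
    f3 = dt * np.real(np.mean((-4 - 3 * LR - LR**2 + np.exp(LR) * (4 - LR)) / LR**3, axis=1))
    g = -0.5j * k                                       # for N(u) = -0.5*(u^2)_x
    nonlin = lambda v: g * np.fft.fft(np.real(np.fft.ifft(v)) ** 2)
    v = np.fft.fft(u0)
    for _ in range(n_steps):                            # one ETDRK4 step per loop
        Nv = nonlin(v)
        a = E2 * v + Q * Nv
        Na = nonlin(a)
        b = E2 * v + Q * Na
        Nb = nonlin(b)
        c = E2 * a + Q * (2 * Nb - Nv)
        Nc = nonlin(c)
        v = E * v + Nv * f1 + 2 * (Na + Nb) * f2 + Nc * f3
    return np.real(np.fft.ifft(v))

x = np.linspace(0, 8 * np.pi, 128, endpoint=False)
u = ks_etdrk4(np.cos(x))
assert np.isfinite(u).all()
assert abs(u.mean()) < 1e-8   # the spatial mean is conserved exactly
```

The contour averaging for `Q`, `f1`, `f2`, `f3` is the key trick of the reference: evaluating those phi-functions directly suffers severe cancellation for small `dt * Lin`.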
Rayleigh-Taylor instability occurs when a heavy fluid sits above a light fluid in a gravitational field. The interface between the heavy and light fluid is unstable, and "mushroom"-like multi-scale structures grow, with small-scale structures growing much faster than large-scale ones. To simulate the system efficiently, adaptive mesh refinement is necessary. The goal of dimensionality reduction of a parametric fluid system is to reduce the spatial complexity so that one may perform modeling, design, or control in the reduced coordinates efficiently. However, without knowing the range of system parameters in the data, it is often difficult to make a statement about the true dimensionality of a flow. To simplify the discussion, we choose to simulate only a single realization of R-T with fixed system parameters, for which the dimensionality can be reduced to one.
Using CASTRO, we consider one level of AMR with an initial Cartesian grid of 256 points along the vertical axis and 128 points along the horizontal axis, which efficiently captures the evolution of vortex generation and the advection of density fronts. The AMR refinement ratio is 2 in each direction, and we regrid every 2 steps. The simulation domain is a rectangle 0.5 wide and 1.0 tall. The top and bottom boundary conditions are slip walls, while the left and right boundaries are periodic. The time span of the simulation is from 0 to 3.5. The density of the heavy fluid is 2 and that of the light fluid is 1; they are separated by an interface that is trigonometric horizontally and tanh-shaped vertically, which perturbs the unstable system. The exact expression follows equations (46-47) in the original CASTRO paper (Almgren et al., 2010).
We save the simulation data every 10 steps. Training data runs from the 83rd to the 249th snapshot (corresponding to times 1.004 to 2.879), skipping the even snapshots: 83, 85, ..., 247, 249. Testing data runs from the 86th to the 248th snapshot with an increment of six: 86, 92, ..., 242, 248. Training and testing data do not overlap, but they have no distribution shift. In total, the training data has 84 snapshots and the testing data 28 snapshots. The number of adaptive mesh points ranges from 40,976 to 72,753 across time. For POD and CAE, the original data on the adaptive mesh is projected onto three different Cartesian grids using the filter ResampleToImage in ParaView (Ahrens et al., 2005). Finally, the original data is also projected onto a very fine Cartesian mesh as "ground truth" in order to make quantitative comparisons among the predictions of the different models.
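The snapshot bookkeeping above is easy to verify programmatically; a small sketch of the split, using the stated index ranges and strides:

```python
# Snapshot indices as stated: training skips even snapshots, testing
# interleaves between them with a stride of six.
train_ids = list(range(83, 250, 2))    # 83, 85, ..., 247, 249
test_ids = list(range(86, 249, 6))     # 86, 92, ..., 242, 248

assert len(train_ids) == 84            # matches the 84 training snapshots
assert len(test_ids) == 28             # matches the 28 testing snapshots
# no overlap (train indices are all odd, test indices all even),
# yet the test snapshots are interleaved in time with the training ones
assert set(train_ids).isdisjoint(test_ids)
```

This interleaving is what makes the test a pure in-distribution interpolation task, consistent with the "no distribution shift" remark.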
As for NIF, first note that we use a feedforward neural network encoder with sparse sensor measurements as input. As shown in the left of fig. 18, we collect measurements from 32 equally spaced sensors along the vertical axis in the middle of the flowfield. The input data fed to the encoder at test time is displayed in the right of fig. 18, where we can see a clear density front moving towards the bottom. Second, since NIF takes pointwise data, each data point is a 35-dimensional row vector: the first 32 entries are the sensor values at that time, and the remaining three are the spatial coordinates and the density at that location. Because we go through all grid points in the adaptive mesh at each time, there are as many data points (with the 32 sensor columns repeated) as the total number of points in the adaptive mesh at that time; in fact, we end up with 4,190,928 rows of 35 columns for the training data. Third, since we use NIF with SIREN in our ShapeNet, we normalize the spatial coordinates to between -1 and 1, while the remaining 33 columns are normalized to zero mean and unit standard deviation.
We generated the data using PeleLM (Nonaka et al., 2012) to solve the incompressible Navier-Stokes equations. The cylinder geometry is represented using an embedded boundary method. The setup is given in our GitHub repo. We center the cylinder at the origin. The computational domain extends from -0.02 to 0.1 in the streamwise direction and from -0.04 to 0.04 in the transverse direction. The boundary conditions are Dirichlet at the inflow, Neumann at the outflow, and periodic on the sides. The inflow velocity (m/s), viscosity (Pa s), density (kg/m^3), and cylinder radius (m) are set as in the repo; together they determine the Reynolds number. The simulation is performed over a fixed time interval with uniform temporal sampling. The flow is initialized with a uniform streamwise flow superimposed with a small side velocity in order to break the symmetry so that the flow quickly falls onto the limit cycle. To remove the initial transient, we collect the last 100 snapshots. The AMR is based on vorticity: we enable a finer mesh with half the grid size wherever the magnitude of the monitored local vorticity exceeds 3000. Overall, we collect the data pointwise (which is easy to do for AMR data using ParaView) after the system falls onto the limit cycle and sample the last 100 snapshots. Moreover, in order to remove the effect of the exit boundary and focus only on the region of interest, we only sample cell-centered data within the domain of interest shown in fig. 19. Since we are using NIF with SIREN (appendix D), we normalize the spatial coordinates so that they are approximately uniformly distributed between -1 and 1. The cell area is rescaled by a constant factor since we are using single precision in TensorFlow. For the output velocity components, we first shift them to zero mean; we then scale the streamwise component to unit variance and scale the transverse component by the same factor. Finally, we arrange the collected pointwise data into a big matrix with 966,514 rows and 6 columns, where a wide tilde denotes a normalized variable.
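A minimal sketch of this shared-factor velocity normalization on synthetic samples; our reading is that the shared scale is the standard deviation of the first zero-meaned component:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(loc=2.0, scale=3.0, size=966)  # hypothetical streamwise samples
v = rng.normal(loc=0.1, scale=0.5, size=966)  # hypothetical transverse samples

# Zero-mean both components, then divide BOTH by the std of the first one so
# the relative magnitude between the two components is preserved.
u0, v0 = u - u.mean(), v - v.mean()
scale = u0.std()
u_n, v_n = u0 / scale, v0 / scale
print(u_n.std())  # ~1.0 by construction
```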
We use the forced isotropic turbulence dataset from the JHU Turbulence Database (Li et al., 2008), with Taylor-scale Reynolds number around 433. The simulation is computed by solving the 3D incompressible Navier-Stokes equations with a pseudo-spectral method. To sustain the turbulence, energy is injected at the large scales. After the system reaches the equilibrium state, the original dataset consists of snapshots collected every 0.02 nondimensional time units. Here we collect a uniformly subsampled subset of 20 snapshots in total. We then slice a fixed-resolution block out of the original HIT field for each of the 20 snapshots. Since we are using NIF with SIREN (appendix D), we normalize the range of the spatial coordinates so that the input signal is uniformly distributed between -1 and 1. The target velocity components are simply normalized to zero mean and unit variance.
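The coordinate and velocity normalization for a sliced sub-block might look like the following sketch; the 32^3 block size and all arrays are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical sub-block of the HIT field with 3 velocity components.
n = 32
u = np.random.default_rng(2).normal(size=(n, n, n, 3))

# Normalize grid coordinates to [-1, 1] for the SIREN input.
axis = np.linspace(-1.0, 1.0, n)
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
xyz = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)  # (n^3, 3) input coordinates

# Zero-mean, unit-variance targets, per velocity component.
uvw = u.reshape(-1, 3)
uvw = (uvw - uvw.mean(axis=0)) / uvw.std(axis=0)
print(xyz.shape, uvw.shape)  # (32768, 3) (32768, 3)
```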
We obtain the weekly averaged sea surface temperature data from 1990 to the present from the NOAA website (https://downloads.psl.noaa.gov/Datasets/noaa.oisst.v2/sst.wkmean.1990-present.nc). At the time we worked on this problem, there were in total 1,642 snapshots (1990-2021), each corresponding to a weekly averaged temperature field. Each temperature snapshot contains 180 uniformly spaced latitude and 360 uniformly spaced longitude coordinates. It should be noted that coordinates corresponding to land are masked (https://downloads.psl.noaa.gov/Datasets/noaa.oisst.v2/lsmask.nc). Thus, each masked snapshot contains 44,219 points. We use the temperature fields of the first 832 weeks (16 years) as training data, with a total of 36,790,208 data points. Each data point is a row vector whose dimension is the number of sensors plus three, where the three correspond to the spatial coordinates and the temperature at that location. Locations of sensors are obtained via POD-QDEIM, or equivalently data-driven sensor placement (Manohar et al., 2018). Temperature fields for the remaining weeks are used as testing data. As before, the spatial coordinates are normalized between -1 and 1, and the target temperature is normalized to zero mean and unit variance.
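For context, the sensor-placement step can be sketched as a column-pivoted QR factorization applied to the leading POD modes, which is the core of POD-QDEIM (Manohar et al., 2018); the snapshot matrix here is synthetic and all sizes are hypothetical:

```python
import numpy as np
from scipy.linalg import qr, svd

# Synthetic "temperature" snapshot matrix: n_space points x n_time snapshots.
rng = np.random.default_rng(3)
n_space, n_time, r = 200, 50, 5
X = rng.normal(size=(n_space, n_time))

# Leading r POD modes from the SVD of the snapshot matrix.
U, s, Vt = svd(X, full_matrices=False)
Psi = U[:, :r]

# Column-pivoted QR of Psi^T: the first r pivot indices are the sensor
# locations (rows of the state that best condition the reconstruction).
_, _, piv = qr(Psi.T, pivoting=True)
sensor_idx = piv[:r]
print(sorted(sensor_idx.tolist()))
```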
Recall that the advantages of NIF were demonstrated in section 3.1. To further compare model performance, we compute the RMSE at each parameter value for four configurations: (1) NIF-Swish, (2) MLP-Swish, (3) NIF-tanh, (4) MLP-tanh. As displayed in fig. 20, NIF with Swish generalizes better than the other configurations, especially in the parameter regime that, as seen in section B.1, corresponds to traveling wave solutions with a transition time that increases with the parameter. Meanwhile, NIF-tanh generalizes slightly better than MLP-tanh but similarly to MLP-Swish. Hence, we only compare NIF-Swish with MLP-Swish in the rest of the paper.
Here we further vary the size of the training data to verify that NIF-Swish still performs better than MLP-Swish. As shown in fig. 21, we prepare three additional datasets with 15, 24, and 29 sampled parameter values, corresponding to simulations of the K-S system at those parameters. Figure 21 shows the same phenomenon as before: NIF (Swish) generalizes better than MLP (Swish) in the same parameter regime.
Model | ShapeNet/Vanilla MLP | ParameterNet | Number of trainable parameters |
--- | --- | --- | --- |
NIF (Swish)-1 | 1-30-30-30-1 | 2-30-30-2-1951 | 6,935 |
MLP (Swish)-1 | 3-58-58-58-1 | — | 7,135 |
NIF (Swish)-2 | 1-38-38-38-1 | 2-29-29-2-3079 | 10,254 |
MLP (Swish)-2 | 3-70-70-70-1 | — | 10,291 |
NIF (Swish)-3 | 1-56-56-56-1 | 2-30-30-2-6553 | 20,741 |
MLP (Swish)-3 | 3-100-100-100-1 | — | 20,701 |
NIF (Swish)-4 | 1-60-60-60-1 | 2-47-47-2-7501 | 24,996 |
MLP (Swish)-4 | 3-110-110-110-1 | — | 24,971 |
NIF (Swish)-5 | 1-70-70-70-1 | 2-60-60-2-10151 | 34,415 |
MLP (Swish)-5 | 3-130-130-130-1 | — | 34,711 |
Finally, we verify the above phenomenon by varying the number of trainable parameters away from the baseline of section 3.1, as described in table 3. Again, we train 4 converged runs and compute the average error. Figure 22 shows the same phenomenon, again highlighting the difference between NIF (Swish) and MLP (Swish).
The encoder first uses three consecutive stride-2 convolutional layers, each followed by batch normalization and a Swish activation, with 16, 32, and 64 output channels. The output is then passed through a flatten layer and a linear dense layer whose number of output units equals the latent dimension. These latent units are fed to the decoder, which first contains a linear dense layer followed by a reshape layer and two consecutive dilated transposed convolutional layers, each followed by batch normalization and a Swish activation, with 32 and 16 output channels. Finally, the last layer is a dilated transposed convolutional layer that maps back to the same size as the input snapshot. For the best results, we do not use any pooling layers throughout.

Layer (type) | Output Shape | Parameter Count |
--- | --- | --- |
Convolution 2D | (None, 128, 64, 16) | 160 |
Batch Normalization | (None, 128, 64, 16) | 64 |
Swish Activation | (None, 128, 64, 16) | 0 |
Convolution 2D | (None, 64, 32, 32) | 4640 |
Batch Normalization | (None, 64, 32, 32) | 128 |
Swish Activation | (None, 64, 32, 32) | 0 |
Convolution 2D | (None, 32, 16, 64) | 18496 |
Batch Normalization | (None, 32, 16, 64) | 256 |
Swish Activation | (None, 32, 16, 64) | 0 |
Flatten | (None, 32768) | 0 |
Dense | (None, 8) | 262152 |
Dense | (None, 32768) | 294912 |
Reshape | (None, 32, 16, 64) | 0 |
ConvolutionTranspose 2D | (None, 64, 32, 32) | 18464 |
Batch Normalization | (None, 64, 32, 32) | 128 |
Swish Activation | (None, 64, 32, 32) | 0 |
ConvolutionTranspose 2D | (None, 128, 64, 16) | 4624 |
Batch Normalization | (None, 128, 64, 16) | 64 |
Swish Activation | (None, 128, 64, 16) | 0 |
ConvolutionTranspose 2D | (None, 256, 128, 1) | 145 |
As mentioned before, we can quantitatively compare the performance of the three dimensionality-reduction methods by projecting their outputs onto a very fine mesh with 256×512 resolution. Recall that the shape of the POD output is 128×256, and the shape of the CAE output varies among 32×64, 64×128, and 128×256. Hence, we use nearest-neighbor interpolation to obtain the projection onto the 256×512 mesh using the transform.resize function in the scikit-image package (Van der Walt et al., 2014). For NIF, we simply perform a spatial query at the coordinates of the cell nodes x_i, y_j for i = 1, ..., 256 and j = 1, ..., 512, which can be generated in Python with numpy.linspace(2.5e-7, 0.5, 256, endpoint=True) and numpy.linspace(5e-7, 1, 512, endpoint=True).
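A pure-NumPy sketch of this nearest-neighbor up-projection, equivalent in spirit to transform.resize with order=0; the coarse field here is synthetic:

```python
import numpy as np

# Synthetic coarse Cartesian field (e.g., a 128x256 POD/CAE output).
coarse = np.arange(128 * 256, dtype=float).reshape(128, 256)

# Fine-mesh cell-node coordinates, as in the text.
x_fine = np.linspace(2.5e-7, 0.5, 256, endpoint=True)
y_fine = np.linspace(5e-7, 1.0, 512, endpoint=True)

# Map each fine-mesh coordinate to the nearest coarse cell index.
ix = np.clip(np.round(x_fine / 0.5 * (128 - 1)).astype(int), 0, 127)
iy = np.clip(np.round(y_fine / 1.0 * (256 - 1)).astype(int), 0, 255)
fine = coarse[np.ix_(ix, iy)]
print(fine.shape)  # (256, 512)
```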
Unless otherwise highlighted, the following data fields are projected onto the 256×512 mesh and become 2D arrays varying with time t, so the value of a field at each spatial coordinate and time can be indexed directly.
ground-truth data from the adaptive mesh,
ground-truth data sampled on the 32×64 Cartesian mesh,
ground-truth data sampled on the 64×128 Cartesian mesh,
ground-truth data sampled on the 128×256 Cartesian mesh,
the output prediction from POD on the 128×256 Cartesian mesh,
the output prediction from CAE trained on the 32×64 Cartesian mesh,
the output prediction from CAE trained on the 64×128 Cartesian mesh,
the output prediction from CAE trained on the 128×256 Cartesian mesh,
the output prediction from NIF evaluated on the 256×512 mesh.
Without loss of generality, let us take the CAE trained on one of the Cartesian meshes as an example. Let $u$ denote the ground truth from the adaptive mesh, $u^{c}$ the ground truth sampled on the Cartesian training mesh, and $\hat{u}^{c}$ the CAE prediction, all projected onto the 256×512 mesh. Notice that the total difference decomposes as

$$u - \hat{u}^{c} = (u - u^{c}) + (u^{c} - \hat{u}^{c}).$$ (10)

Hence, we define the following three error metrics at each time $t$:

fitting error: spatially averaged mean square of the fitting difference,

$$e_{\mathrm{fit}}(t) = \left\langle (u^{c} - \hat{u}^{c})^{2} \right\rangle,$$ (11)

projection error: spatially averaged mean square of the projection difference,

$$e_{\mathrm{proj}}(t) = \left\langle (u - u^{c})^{2} \right\rangle,$$ (12)

total error: spatially averaged mean square of the total difference,

$$e_{\mathrm{total}}(t) = \left\langle (u - \hat{u}^{c})^{2} \right\rangle,$$ (13)

where $\langle \cdot \rangle$ denotes the spatial average over the 256×512 mesh.
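The three metrics can be computed in a few lines; the fields below are synthetic stand-ins for the projected quantities at one time step:

```python
import numpy as np

rng = np.random.default_rng(4)
u_true = rng.normal(size=(256, 512))                      # adaptive-mesh truth, projected
u_cart = u_true + 0.01 * rng.normal(size=u_true.shape)    # truth sampled on coarse mesh
u_pred = u_cart + 0.05 * rng.normal(size=u_true.shape)    # CAE reconstruction

err_fit = np.mean((u_cart - u_pred) ** 2)    # fitting error, eq. (11)
err_proj = np.mean((u_true - u_cart) ** 2)   # projection error, eq. (12)
err_total = np.mean((u_true - u_pred) ** 2)  # total error, eq. (13)
print(err_fit, err_proj, err_total)
```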
We can define the same metrics for the other models. The evolution of the above three error metrics for all models on training and testing time steps is displayed in the first row of fig. 23. The fitting error (red) contributes the most for POD, while the projection error is more than two orders of magnitude smaller; this implies that the lack of expressiveness of POD leads to its inferior performance. In contrast, the projection error (green) contributes the most for CAE, while the fitting error remains essentially unchanged as the projection resolution varies from 32×64 to 128×256. This indicates that, in this example, CAE is mainly restricted by the projection error introduced during preprocessing. Meanwhile, NIF learns the latent representation directly from the raw data on the adaptive mesh, so its fitting error equals its total error. Moreover, we observe that the projection error grows near-exponentially in time. This is due to the nature of the R-T instability: energy is transferred from large-scale structures to ever smaller scales as time advances, and small-scale vortices generate even smaller vortices. Eventually, the Cartesian grid becomes unable to resolve the flow, which results in a significant projection error. As shown in the rest of fig. 23, this phenomenon persists even when the rank is changed. The fact that the intrinsic dimensionality of this R-T data is one is consistent with fig. 23, where only POD improves as the rank is increased.
A comparison of the temporal variation of the spatially averaged mean-squared error between our framework and POD-QDEIM (Manohar et al., 2018) for different numbers of sensors is shown in fig. 24. We can visually confirm that the error variation is much higher for POD-QDEIM than for our models, which in turn means the model predictions from POD-QDEIM should be accompanied by larger uncertainties. As the number of sensors increases, the training error of both models decays drastically; meanwhile, the testing error of POD-QDEIM increases, while that of the NIF-based framework visually stays at around the same level.
A comparison of the contours of sea surface temperature among the ground truth, our framework, and POD-QDEIM for the very first week of 1990 (in the training dataset) is displayed in fig. 25. We can roughly see that the inferior performance of POD-QDEIM comes from at least two sources:
POD-QDEIM overshoots the average temperature in the middle of the Pacific Ocean.
POD-QDEIM misses small-scale structures, while the NIF-based framework captures them well.
As the number of sensors increases, both models perform increasingly well on the training data.
To further analyze model generalization on the sea surface temperature data, we compute the temporal average of the squared error on unseen testing data (2006 to 2021) and plot its spatial distribution for a varying number of sensors in fig. 26. Evidently, except at 5 sensors, where NIF-SS shows a testing error similar to that of POD-QDEIM, NIF-SS generalizes much better than POD-QDEIM regardless of the number of sensors. As the number of sensors increases, the error distributions of both frameworks tend to contain more small-scale patches.
An inspection of the error distribution from the NIF-based framework shows interesting correlations between the regions that are difficult to predict and ocean circulation patterns. Focusing on NIF-SS with the error magnified 5 times (third column) in fig. 26, we see that the regions most difficult for NIF-SS to predict occur mostly at the locations of the Gulf Stream, the North Pacific gyre, and the Norwegian Current.
We adopt SIREN (Sitzmann et al., 2020), a standard MLP with $\omega_0$-scaled sine activations, as our ShapeNet. In this work we fix $\omega_0$ to a single value and find it to be sufficient for all cases. The initialization of SIREN is given in section D.1. On top of SIREN, we also borrow a ResNet-like structure (Lu et al., 2021c) to further improve training performance, as presented in section D.2. Section D.3 describes how we connect ParameterNet with ShapeNet. Finally, practical advice on training NIF with SIREN is given in section D.4.
First of all, one should normalize each component of the input for SIREN to be approximately $U(-1,1)$, where $U(a,b)$ denotes the uniform distribution on the interval $[a,b]$. For example, in eq. 14, we normalize one coordinate component $x$ as

$$\tilde{x} = 2\,\frac{x - x_{\min}}{x_{\max} - x_{\min}} - 1.$$ (14)

For the output, we choose the standard normalization (zero mean and unit variance).
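A sketch of both normalizations on synthetic data:

```python
import numpy as np

# Min-max normalization of a coordinate component to [-1, 1] (as in eq. 14).
x = np.linspace(0.0, 0.1, 1000)  # raw coordinate samples (synthetic)
x_n = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0

# Standard normalization (zero mean, unit variance) of the output.
u = np.random.default_rng(5).normal(3.0, 2.0, size=1000)
u_n = (u - u.mean()) / u.std()
print(x_n.min(), x_n.max())  # -1.0 1.0
```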
Second, SIREN requires a special initialization of the weights and biases to achieve superior performance. Without loss of generality, consider a standard MLP with $\omega_0$-scaled sine activation functions and layer widths $n_i$-$n_h$-$n_h$-$n_h$-$n_o$, where $n_i$/$n_o$ denote the dimensions of the input/output layers and $n_h$ denotes the width of the hidden layers. We initialize the weights and biases of the input layer component-wise in eq. 15 as

$$W^{(1)}_{jk},\; b^{(1)}_{j} \sim U\!\left(-\frac{1}{n_i},\; \frac{1}{n_i}\right),$$ (15)

where the subscript $jk$ denotes the $j$-th row, $k$-th column component. We initialize the weights and biases of all other layers following eq. 16,

$$W^{(l)}_{jk},\; b^{(l)}_{j} \sim U\!\left(-\frac{1}{\omega_0}\sqrt{\frac{6}{n_h}},\; \frac{1}{\omega_0}\sqrt{\frac{6}{n_h}}\right).$$ (16)

Note that the $\omega_0$-scaled sine activation is defined as $\sigma(x) = \sin(\omega_0 x)$.
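The two-rule initialization can be sketched in NumPy as follows; the value omega_0 = 30 is the common SIREN default and is our assumption here, not necessarily the paper's choice:

```python
import numpy as np

def siren_init(n_i, n_h, n_o, omega_0=30.0, seed=0):
    """SIREN-style initialization sketch for an MLP with widths n_i-n_h-n_h-n_o."""
    rng = np.random.default_rng(seed)
    dims = [n_i, n_h, n_h, n_o]
    params = []
    for layer, (fan_in, fan_out) in enumerate(zip(dims[:-1], dims[1:])):
        if layer == 0:
            bound = 1.0 / fan_in                      # first layer, eq. (15)
        else:
            bound = np.sqrt(6.0 / fan_in) / omega_0   # remaining layers, eq. (16)
        W = rng.uniform(-bound, bound, size=(fan_in, fan_out))
        b = rng.uniform(-bound, bound, size=(fan_out,))
        params.append((W, b))
    return params

params = siren_init(2, 64, 1)
print([W.shape for W, _ in params])  # [(2, 64), (64, 64), (64, 1)]
```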
The final step is to connect the output of ParameterNet to ShapeNet. Note that only ParameterNet contains trainable parameters, while the parameters of ShapeNet are determined by the output of ParameterNet. Therefore, we only need to carefully design the initialization of the last-layer weights and biases of ParameterNet in order to be consistent with the requirement in section D.1. Recall that the network structure of NIF is a special case of a hypernetwork for an MLP that focuses on decoupling the spatial complexity from all other factors. We adopt the hypernetwork initialization scheme from SIREN (Sitzmann et al., 2020): simply multiply the initial weights of the last layer by a small factor while keeping the initial biases of the last layer matching the distribution in section D.1.
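A heavily hedged sketch of this last-layer initialization; the small factor 1e-2, the layer widths, and the bias bound are all our assumptions for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(6)
n_last, n_shape = 60, 10151  # hypothetical: last hidden width, # ShapeNet params
small_factor = 1e-2          # "small factor"; exact value assumed

# Last-layer weights: a standard uniform fan-in init shrunk by the small
# factor, so the hypernetwork output is dominated by the biases at init.
bound_w = np.sqrt(6.0 / n_last)
W_last = small_factor * rng.uniform(-bound_w, bound_w, size=(n_last, n_shape))

# Last-layer biases: drawn to match the target SIREN hidden-layer weight
# bound for an assumed ShapeNet width n_h = 70 and omega_0 = 30.
n_h, omega_0 = 70, 30.0
bound_b = np.sqrt(6.0 / n_h) / omega_0
b_last = rng.uniform(-bound_b, bound_b, size=(n_shape,))
print(W_last.shape, b_last.shape)
```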
We empirically found that successful and stable training of NIF with SIREN using the Adam optimizer requires (1) a small learning rate and (2) a large batch size, often as large as the GPU memory allows. When debugging, it is recommended to monitor the training error every 5 epochs and plot the prediction at the same time. If the training error (e.g., mean-squared error) is stuck at a relatively high value within the first 40 epochs and the prediction is completely useless, one should further decrease the learning rate or increase the batch size. We also found that NIF with SIREN tends to fit large-scale structures first, then small scales.
However, it is not always possible to increase the batch size given limited GPU memory, especially in the fully 3D case. We found that if we fully parameterize the weights and biases of ShapeNet (as opposed to only parameterizing the last layer as in section 2.3), the memory footprint of NIF becomes relatively large due to the tensor multiplication with einsum in the implementation. Therefore, the maximal affordable batch size can be limited, which can lead to excessively long training times or, in the worst case, unstable training even with a very low learning rate.
If one does not have access to good GPU resources, a more economical remedy on a single GPU is small-batch training, which is well known to generalize better than large-batch training (Masters and Luschi, 2018). Here, however, we are only looking for stable training of NIF with SIREN at batch sizes small enough to fit on a single GPU. We empirically found that the L4 optimizer (Rolinek and Martius, 2018) trains stably in the small-batch regime compared with Adam at the same batch size; it reduces the minimal batch size required for stable training by at least 10 times compared with Adam, with only a slight increase in training cost. Thus, one can increase the capacity of NIF by 10 times without sacrificing much in overall training time and performance. Such capacity can be crucial for training on large-scale, fully 3D turbulence datasets with limited GPU resources.
For all the results shown in this paper, we used Nvidia Tesla P100 (16 GB), Nvidia GeForce RTX 2080 (12 GB), and Nvidia A6000 (48 GB) GPUs. If available, the simplest remedy is data parallelism across multiple GPUs. We have implemented data-parallel training for NIF in our GitHub repo (https://github.com/pswpswpsw/nif), and it scales well. We leave a complete study of distributed learning with NIF to future work.
Paraview: an end-user tool for large data visualization. The Visualization Handbook, 717(8).
Prediction of aerodynamic flow fields using convolutional neural networks. Computational Mechanics, 64(2), 525–545.
Nonlinear model reduction via discrete empirical interpolation. SIAM Journal on Scientific Computing, 32(5), 2737–2764.
On the eigenvector bias of Fourier feature networks: from regression to solving multi-scale PDEs with physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering, 384, 113938.