1 Introduction
Predictive mathematical models, derived from first principles and symmetry
arguments and validated in experiments, are of key importance for the scientific
understanding of natural phenomena. While this approach has been particularly successful in
describing spatiotemporal dynamical systems in physics and engineering,
it has not seen the same degree of success in other scientific fields, such as neuroscience,
biology, finance, and ecology. This is because the underlying first principles in these areas remain
largely elusive. Nevertheless, modeling in those areas has seen increasing use and relevance to help
formulate simplified mathematical equations where sufficient observational data are available for
validation Mogilner et al. (2006); Sbalzarini (2013); Tomlin and Axelrod (2007); Duffy (2013); Adomian (1995). In biology, modern high-throughput technologies have enabled the collection of
large-scale data sets, ranging from genomics, proteomics, and metabolomics data,
to microscopy images and videos of cells and tissues. These data sets
are routinely used to infer parameters in hypothesized models, or to perform model selection
among a finite number of alternative hypotheses Donoho (2017); Barnes et al. (2011); Asmus et al. (2017).
The amount and quality of biological data, as well as the performance of computing hardware and
computational methods have now reached a level that
promises direct inference of mathematical models of biological processes from the available
experimental data. Such data-driven approaches seem particularly valuable in cell and developmental
biology, where first principles are hard to come by, but large-scale imaging data are available,
along with an accepted consensus of which phenomena a model could possibly entail.
In such scenarios, data-driven modeling approaches have the potential to uncover the unknown first principles underlying the observed biological dynamics.
Biological dynamics can be formalized at different scales, from discrete molecular processes to the
continuum mechanics of tissues. Here, we consider the macroscopic, continuum scale where spatiotemporal
dynamics are modeled by Partial Differential Equations (PDEs) over coarse-grained state variables
Kitano (2002); Tomlin and Axelrod (2007).
PDE models have been used to successfully address a range of biological problems
from embryo patterning Gregor et al. (2005) to modeling gene-expression
networks Chen et al. (1999); Ay and Arnosti (2011) to predictive models of cell
and tissue mechanics during growth and development Prost et al. (2015).
They have shown their potential to recapitulate experimental observables in cases
where the underlying physical phenomena are known or have been postulated
Münster et al. (2019). In many biological systems, however, the governing
PDEs are not (yet) known, which limits progress in discovering the underlying
physical principles. Thus it is desirable to verify existing models or even discover new models by extracting governing laws directly from measured spatiotemporal data.
For given observable spatiotemporal dynamics with no governing PDE known, several proposals have been put forward to learn mathematically and physically interpretable PDE models. The earliest work in this direction Voss et al. (1998) frames the problem of “PDE learning” as a multivariate nonlinear regression problem where each component in the design matrix consists of space and time differential operators and low-order nonlinearities computed directly from data. Then, the alternating conditional expectation (ACE) algorithm Breiman and Friedman (1985)
is used to compute both optimal element-wise nonlinear transformations of each component and their associated coefficients. In
Bär et al. (1999), the problem is formulated as a linear regression problem with a fixed, predefined set of space and time differential operators and polynomial transformations that are computed directly from data. Then, backward elimination is used to identify a compact set of PDE components by minimizing the least-squares error of the full model and then removing the terms that reduce the fit the least. In the statistical literature
Xun et al. (2013); Raissi et al. (2017), the PDE learning problem is formulated as a Bayesian nonparametric estimation problem where the observed dynamics are learned via nonparametric approximation, and a PDE representation serves as a prior distribution to compute the posterior estimates of the PDE coefficients. Recent influential work revived the idea of jointly learning the structure
and the coefficients of PDE models from data in discrete space and time using sparse regression Brunton et al. (2016); Rudy et al. (2017); Schaeffer (2017). Approaches such as SINDy Brunton et al. (2016) and PDE-FIND Rudy et al. (2017) compute a large pre-assembled dictionary of possible PDE terms from data and identify the most promising components via penalized linear regression formulations. For instance, PDE-FIND is able to learn different types of PDEs from simulated spatiotemporal data, including the Burgers, Kuramoto-Sivashinsky, reaction-diffusion, and Navier-Stokes equations. PDE-FIND's performance was evaluated on noise-free simulated data as well as on data with up to 1% additive noise, and it showed a critical dependency on the proper setting of the tuning parameters, which are typically unknown in practice. Recent approaches attempt to alleviate this dependence by using Bayesian sparse regression schemes for model uncertainty quantification Zhang and Lin (2018) or information criteria for tuning-parameter selection Mangan et al. (2019). There is also a growing body of work that considers deep neural networks for PDE learning
Long et al. (2017); Raissi and Karniadakis (2018); Raissi et al. (2019). For instance, a deep feed-forward network formulation, PDE-Net Long et al. (2017), directly learns a computable discretized form of the underlying governing PDEs for forecasting Long et al. (2017, 2018). PDE-Net exploits the connection between differential operators and orders of sum rules of convolution filters Dong et al. (2017) in order to constrain network layers to learn valid discretized differential operators. The forecasting capability of this approach was numerically demonstrated for predefined linear differential operator templates. A compact and interpretable symbolic identification of the PDE structure is, however, not available with this approach.

Here, we ask whether and how it is possible to extend the class of sparse-regression inference methods to work on real, limited, and noisy experimental data. As a first step, we present a statistical learning framework, PDE-STRIDE (STability-based Robust IDEntification of PDEs), to robustly infer PDE models from noisy spatiotemporal data without requiring manual tuning of learning parameters, such as regularization constants. PDE-STRIDE is based on the statistical principle of stability selection Meinshausen and Bühlmann (2010); Shah and Samworth (2013), which provides an interpretable criterion for any component's inclusion in the learned PDE in a data-driven manner. Stability selection can be combined with any sparsity-promoting regression method, including LASSO Tibshirani (1996); Meinshausen and Bühlmann (2010), iterative hard thresholding (IHT) Blumensath and Davies (2008), Hard Thresholding Pursuit (HTP) Foucart (2011), or Sequential Thresholding Ridge Regression (STRidge) Rudy et al. (2017). PDE-STRIDE therefore provides a drop-in solution to render existing inference tools more robust, while reducing the need for parameter tuning. In the benchmarks presented herein, the combination of stability selection with debiased iterative hard thresholding (IHTd) empirically showed the best performance and the highest consistency with respect to perturbations of the dictionary matrix and sampling of the data.

This paper is organized as follows: Section 2 provides the mathematical formulation of the sparse regression problem and discusses how the design matrix is assembled. We also review the concepts of regularization paths and stability selection and discuss how they are combined in the proposed method. The numerical results in Section 3 highlight the performance and robustness of PDE-STRIDE for recovering different PDEs from noise-corrupted simulated data. We also perform an achievability analysis of the PDE-STRIDE+IHTd inference scheme for consistency and convergence of the recovery probability with increasing sample size. Section 4 demonstrates that the robustness of the proposed method is sufficient for real-world applications. We consider learning a PDE model from noisy biological microscopy images of membrane protein dynamics in a C. elegans zygote. Section 5 provides a summary of our results and highlights future challenges for data-driven PDE learning.

2 Problem Formulation and Optimization
We outline the problem formulation underlying the data-driven PDE inference considered here. We review important sparse regression techniques and introduce the concept of stability selection used in PDE-STRIDE.
2.1 Problem formulation for PDE learning
We propose a framework for stable estimation of the structure and parameters of the governing equations of continuous dynamical systems from discrete spatiotemporal measurements or observations. Specifically, we consider PDEs for the multi-dimensional state variable $u(\boldsymbol{x}, t)$ of the form shown in Eq. (2.1.1), composed of polynomial nonlinearities (e.g., $u^2$), spatial derivatives (e.g., $\partial u/\partial x$), and a parametric dependence modeled through the coefficient vector $\xi$.
$$\frac{\partial u}{\partial t} = F\!\left(u,\, u^2,\, \ldots,\, \partial_x u,\, \partial_x^2 u,\, \ldots;\; \xi\right) \qquad (2.1.1)$$
Here, $F$ is the function map that models the spatiotemporal nonlinear dynamics of the system. For simplicity, we limit ourselves to forms of the function map that can be written as linear combinations of polynomial nonlinearities, spatial derivatives, and combinations of both. For instance, for a one-dimensional state variable $u(x, t)$, the function map can take the form:
$$\frac{\partial u}{\partial t} = \xi_0 + \xi_1 u + \xi_2 u^2 + \xi_3\, \partial_x u + \xi_4\, u\, \partial_x u + \xi_5\, \partial_x^2 u + \xi_6\, u\, \partial_x^2 u + \ldots \qquad (2.1.2)$$
where $\xi_i$ are the coefficients of the components of the PDE for $u$. The continuous PDE of the form described in Eq. (2.1.2), with appropriate coefficients $\xi$, holds true for all continuous space and time points in the domain of the model. Numerical solutions of the PDE aim to satisfy the equality relation in Eq. (2.1.2), thereby reconstituting the nonlinear dynamics of the system at discrete space and time points $(x_i, t_j)$. We assume that we have access to noisy observational data $u^{\sigma}(x_i, t_j)$ of the state variable over the discrete space and time points. The measurement errors are independent and identically distributed and are assumed to follow a normal distribution with mean zero and variance $\sigma^2$.

We follow the formulation put forward in Bär et al. (1999); Rudy et al. (2017); Schaeffer (2017) and construct a large dictionary of PDE components using discrete approximations of the dynamics from data. For instance, for the one-dimensional example in Eq. (2.1.2), the discrete approximation with $p$ PDE terms can be written in vectorized form as a linear regression problem:
$$\mathbf{u}_t = \boldsymbol{\Theta}\,\xi \qquad (2.1.3)$$
Here, the left-hand side $\mathbf{u}_t \in \mathbb{R}^{n}$ is a discrete
approximation to the time derivatives of $u$ at the $n$ discretization points and
represents the response or outcome vector in the linear regression framework.
Each column of the dictionary or design matrix $\boldsymbol{\Theta} \in \mathbb{R}^{n \times p}$ represents the discrete approximation of one PDE component, i.e., one of the $p$
terms in Eq. (2.1.2), at the discrete points in space and time.
Each column is interpreted as a potential
predictor of the response vector $\mathbf{u}_t$. The vector $\xi \in \mathbb{R}^{p}$ is the vector of
unknown PDE coefficients, i.e., the prefactors of the terms in Eq.
(2.1.2).
Both $\mathbf{u}_t$ and $\boldsymbol{\Theta}$ need to be constructed from numerical
approximations of the temporal and spatial derivatives of the observed state
variables. There is a rich literature in numerical analysis on this topic (see,
e.g., Chartrand (2011); Stickel (2010)).
Here, we approximate the time derivatives by
first-order forward finite differences from $u^{\sigma}$ after initial denoising of
the data. Similarly, the spatial derivatives are computed by applying
second-order central finite differences. Details about the denoising are given in Section 3 and the Supplementary Material.
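As a concrete illustration of this construction, the following sketch denoises a space-by-time data matrix by truncated SVD and then builds the derivative approximations with the finite-difference stencils just described. It is numpy-based, and the grid spacings `dx`, `dt` and the layout of `U` (space along rows, time along columns) are assumptions of this sketch:

```python
import numpy as np

def denoise_svd(U, rank):
    """Denoise a space-by-time data matrix by truncated SVD.

    In practice the cutoff rank would be chosen at the elbow of the
    singular-value curve (see Section 3); here it is a free parameter.
    """
    P, S, Qt = np.linalg.svd(U, full_matrices=False)
    return (P[:, :rank] * S[:rank]) @ Qt[:rank, :]

def time_derivative(U, dt):
    """First-order forward finite differences along the time axis."""
    return (U[:, 1:] - U[:, :-1]) / dt

def space_derivatives(U, dx):
    """Second-order central finite differences along the space axis."""
    u_x = (U[2:, :] - U[:-2, :]) / (2 * dx)
    u_xx = (U[2:, :] - 2 * U[1:-1, :] + U[:-2, :]) / dx**2
    return u_x, u_xx
```

From these arrays, each dictionary column (e.g., $u$, $u\,\partial_x u$, $\partial_x^2 u$) would be obtained by element-wise products of the suitably cropped fields, flattened into vectors.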
Given the general linear regression ansatz in Eq. (2.1.3), we formulate the data-driven PDE inference problem as a regularized optimization problem of the form:
$$\hat{\xi} = \underset{\xi \in \mathbb{R}^p}{\arg\min}\; f(\xi) + \lambda\, R(\xi) \qquad (2.1.4)$$
where $\hat{\xi}$ is the minimizer of the objective function, $f$ a smooth convex data-fitting function, $R$ a regularization or penalty function, and $\lambda > 0$ a scalar tuning parameter that balances data fitting and regularization. The function $R$ is not necessarily convex or differentiable. We follow previous work Bär et al. (1999); Rudy et al. (2017); Schaeffer (2017) and consider the standard least-squares objective
$$f(\xi) = \tfrac{1}{2}\,\lVert \mathbf{u}_t - \boldsymbol{\Theta}\,\xi \rVert_2^2 \qquad (2.1.5)$$
The choice of the penalty function $R$ influences the properties of the coefficient estimates $\hat{\xi}$. Following Brunton et al. (2016); Rudy et al. (2017); Schaeffer (2017), we seek to identify a small set of PDE components among the $p$ possible components that accurately predict the time evolution of the state variables. This implies that we want to identify a sparse coefficient vector $\hat{\xi}$, thus resulting in an interpretable PDE model. This can be achieved through sparsity-promoting penalty functions $R$. We next consider different choices for $R$ that enforce sparsity in the coefficient vector and review algorithms that solve the associated optimization problems.
2.2 Sparse optimization for PDE learning
The least-squares loss in Eq. (2.1.5) can be combined with different sparsity-inducing penalty functions $R$. The prototypical example is the $\ell_1$ norm, leading to the LASSO formulation of sparse linear regression Tibshirani (1996):
$$\hat{\xi} = \underset{\xi \in \mathbb{R}^p}{\arg\min}\; \tfrac{1}{2}\,\lVert \mathbf{u}_t - \boldsymbol{\Theta}\,\xi \rVert_2^2 + \lambda\, \lVert \xi \rVert_1 \qquad (2.2.1)$$
The LASSO objective comprises a convex smooth loss and a convex nonsmooth regularizer. For this class of problems, efficient optimization algorithms exist that can exploit the properties of the functions and come with convergence guarantees. Important examples include coordinate-descent algorithms Wu and Lange (2008); Friedman et al. (2010) and proximal algorithms, including the Douglas-Rachford algorithm Eckstein and Bertsekas (1992) and the projected (or proximal) gradient method, also known as the Forward-Backward algorithm Combettes and Pesquet (2011). In signal processing, the latter schemes are termed iterative shrinkage-thresholding (ISTA) algorithms (see Beck and Teboulle (2009) and references therein), and they can be extended to nonconvex penalties. Although the LASSO has been previously used for PDE learning in Schaeffer (2017), the statistical performance of the LASSO estimates is known to deteriorate if certain conditions on the design matrix are not met. For example, the studies in Meinshausen et al. (2006); Zhao and Yu (2006) provide sufficient and necessary conditions, called the irrepresentable conditions, for consistent variable selection using the LASSO, essentially excluding too strong correlations of the predictors in the design matrix. The conditions are, however, difficult to check in practice, as they require knowledge of the true components of the model. One way to relax these conditions is via randomization. The Randomized LASSO Meinshausen and Bühlmann (2010) considers the following objective:
$$\hat{\xi} = \underset{\xi \in \mathbb{R}^p}{\arg\min}\; \tfrac{1}{2}\,\lVert \mathbf{u}_t - \boldsymbol{\Theta}\,\xi \rVert_2^2 + \lambda \sum_{j=1}^{p} \frac{\lvert \xi_j \rvert}{W_j} \qquad (2.2.2)$$
where each $W_j$ is an i.i.d.
random variable uniformly distributed over
$[\alpha, 1]$ with $\alpha \in (0, 1]$. For $\alpha = 1$, the Randomized LASSO reduces to the standard LASSO. The Randomized LASSO has been shown to successfully overcome limitations of the LASSO in handling correlated components in the dictionary Meinshausen and Bühlmann (2010), while simultaneously preserving the overall convexity of the objective function. As part of our PDE-STRIDE framework, we will evaluate the performance of the Randomized LASSO in the context of PDE learning using cyclical coordinate descent Friedman et al. (2010).

The sparsity-promoting property of the (weighted) $\ell_1$ norm comes at the expense of considerable bias in the estimation of the nonzero coefficients Zhao and Yu (2006), thus leading to reduced variable selection performance in practice. This drawback can be alleviated by using nonconvex penalty functions Fan and Li (2001); Zhang (2010), allowing near-optimal variable selection performance at the cost of needing to solve a nonconvex optimization problem. For instance, using the $\ell_0$ "norm" (which counts the number of nonzero elements of a vector) as the regularizer leads to the NP-hard problem:
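To make the randomization concrete, here is a minimal sketch (not the authors' implementation) that reduces the Randomized LASSO to a standard LASSO on rescaled dictionary columns, solved with plain ISTA; the solver settings are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding operator of the l1 proximal step."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(Theta, y, lam, n_iter=5000):
    """Plain ISTA solver for (1/2)||y - Theta xi||^2 + lam ||xi||_1."""
    L = np.linalg.norm(Theta, 2) ** 2          # Lipschitz constant of the gradient
    xi = np.zeros(Theta.shape[1])
    for _ in range(n_iter):
        grad = Theta.T @ (Theta @ xi - y)
        xi = soft_threshold(xi - grad / L, lam / L)
    return xi

def randomized_lasso(Theta, y, lam, alpha=0.5, rng=None):
    """Randomized LASSO: draw i.i.d. weights W_j ~ U[alpha, 1] and penalize
    |xi_j| / W_j, which is equivalent to a LASSO on columns scaled by W_j."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.uniform(alpha, 1.0, size=Theta.shape[1])
    beta = lasso_ista(Theta * W, y, lam)       # LASSO in the rescaled variables
    return beta * W                            # map back to the original scale
```

In stability selection, the weights are redrawn on every subsample, so components that survive the random down-weighting across many draws accumulate high importance.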
$$\hat{\xi} = \underset{\xi \in \mathbb{R}^p}{\arg\min}\; \tfrac{1}{2}\,\lVert \mathbf{u}_t - \boldsymbol{\Theta}\,\xi \rVert_2^2 + \lambda\, \lVert \xi \rVert_0 \qquad (2.2.3)$$
This formulation has found widespread applications in compressive sensing and signal processing. Algorithms that deliver approximate solutions to the objective function in Eq. (2.2.3) include greedy optimization strategies Tropp (2004), Compressive Sampling Matching Pursuit (CoSaMP) Needell and Tropp (2009), and subspace pursuit Dai and Milenkovic (2009). We here consider the Iterative Hard Thresholding (IHT) algorithm Blumensath and Davies (2008, 2009), which belongs to the class of ISTA algorithms. Given the design matrix $\boldsymbol{\Theta}$ and the measurement vector $\mathbf{u}_t$, IHT computes sparse solutions by applying a nonlinear shrinkage (thresholding) operator to gradient-descent steps in an iterative manner. One step of the iterative scheme reads:
$$\xi^{k+1} = H_{\lambda}\!\left(\xi^{k} + \boldsymbol{\Theta}^{\top}\!\left(\mathbf{u}_t - \boldsymbol{\Theta}\,\xi^{k}\right)\right) \qquad (2.2.4)$$
The operator $H_{\lambda}$ is the nonlinear hard-thresholding operator. Convergence of the
above IHT algorithm is guaranteed if the spectral norm of the design matrix satisfies $\lVert \boldsymbol{\Theta} \rVert_2 < 1$.
Specifically, under this condition, the IHT algorithm is guaranteed to not increase the cost function
in Eq. (2.2.3) (Lemma 1 in Blumensath and Davies (2008)). The above IHT
algorithm can also be viewed as a thresholded version of the classical Landweber iteration
Herrity et al. (2006). The fixed points of the nonlinear
operator in Eq. (2.2.4) are local minima of the cost function in Eq. (2.2.3)
(Lemma 3 in Blumensath and Davies (2008)). Under the same condition on the design
matrix, i.e., $\lVert \boldsymbol{\Theta} \rVert_2 < 1$, the optimal solution of the cost function in
Eq. (2.2.3) thus belongs to the set of fixed points of the IHT algorithm
(Theorem 2 in Blumensath and Davies (2008) and Theorem 12 in
Tropp (2006)). Although the IHT algorithm comes with theoretical convergence
guarantees, the resulting fixed points are not necessarily sparse
Blumensath and Davies (2008).
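A minimal numpy sketch of the IHT iteration in Eq. (2.2.4) may help fix ideas; it also includes a least-squares refit on the active support, anticipating the debiasing idea discussed next. The threshold value $\sqrt{\lambda}$, the iteration counts, and the scaling assumption $\lVert \boldsymbol{\Theta} \rVert_2 < 1$ are illustrative choices of this sketch:

```python
import numpy as np

def hard_threshold(z, t):
    """Set all entries with magnitude below t to zero."""
    out = z.copy()
    out[np.abs(out) < t] = 0.0
    return out

def iht_debiased(Theta, y, lam, n_iter=200, debias_steps=10):
    """IHT with a debiasing refit on the active support (a sketch of IHTd).

    Assumes the design matrix is scaled so that ||Theta||_2 < 1, as required
    by the convergence guarantee of IHT.
    """
    xi = np.zeros(Theta.shape[1])
    for _ in range(n_iter):
        # standard IHT step: gradient step followed by hard thresholding
        xi = hard_threshold(xi + Theta.T @ (y - Theta @ xi), np.sqrt(lam))
        # debiasing: a few Landweber steps restricted to the current support
        S = np.flatnonzero(xi)
        if S.size:
            for _ in range(debias_steps):
                r = y - Theta[:, S] @ xi[S]
                xi[S] += Theta[:, S].T @ r
    return xi
```

The refit drives the coefficients on the active support toward their least-squares values, so spuriously included components shrink and are removed by the next thresholding step.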
Here, we propose a modification of the IHT algorithm that will prove to be particularly suited for solving PDE learning problems with PDE-STRIDE. Following a proposal in Foucart (2011) for the Hard Thresholding Pursuit (HTP) algorithm, we equip the standard IHT algorithm with an additional debiasing step. At each iteration, we solve a least-squares problem restricted to the support obtained from the IHT iteration. We refer to this form of IHT as Iterative Hard Thresholding with debiasing (IHTd). In this two-stage algorithm, the standard IHT step serves to extract the explanatory variables, while the debiasing step approximately debiases (or refits) the coefficients restricted to the currently active support Figueiredo et al. (2007). Rather than solving the least-squares problem to optimality, we use gradient-descent steps until a loose upper bound on the least-squares refit is satisfied. This prevents overfitting by attributing low confidence to large supports, and it reduces computational overhead. The complete IHTd procedure is detailed in Algorithm 1 in the Supplementary Material. In PDE-STRIDE, we will compare IHTd with a heuristic iterative algorithm, Sequential Thresholding of Ridge regression (STRidge), that also uses $\ell_0$ penalization and is available in PDE-FIND Rudy et al. (2017).

2.3 Stability selection
The practical performance of sparse optimization techniques for PDE learning critically hinges on the proper selection of the regularization parameter $\lambda$ that balances model fit and complexity of the coefficient vector. In model discovery tasks on real experimental data, a wrong choice of the regularization parameter can result in incorrect PDE model selection even if true model discovery would have been, in principle, achievable. In statistics, a large number of tuning-parameter selection criteria are available, ranging from cross-validation approaches Kohavi (1995) to information criteria Schwarz (1978), or formulations that allow joint learning of model coefficients and tuning parameters Lederer and Müller (2015); Bien et al. (2018). Here, we advocate stability-based model selection Meinshausen and Bühlmann (2010) for robust PDE learning. The statistical principle of stability Yu (2013) has been put forward as one of the pillars of modern data science and statistics and provides an intuitive approach to model selection Meinshausen and Bühlmann (2010); Shah and Samworth (2013); Liu et al. (2010).

In the context of sparse regression, stability selection Meinshausen and Bühlmann (2010) proceeds as follows (see also Figure 1 for an illustration). Given a design matrix $\boldsymbol{\Theta}$ and the measurement vector $\mathbf{u}_t$, we generate $B$ random subsample index sets $I^{b}$ of equal size (typically $\lfloor n/2 \rfloor$) and produce reduced sub-designs $\boldsymbol{\Theta}^{b}$ and $\mathbf{u}_t^{b}$ by choosing rows according to the index set $I^{b}$. For each of the resulting subproblems, we apply a sparse regression technique and systematically record the recovered support $\hat{S}^{b}(\lambda)$ as a function of $\lambda$ over a regularization path $\Lambda = [\lambda_{\min}, \lambda_{\max}]$. The values of $\lambda_{\max}$ are data-dependent and are easily computable for generalized linear models with convex penalties Friedman et al. (2010). Similarly, the critical parameter $\lambda_{\max}$ for the nonconvex problem in Eq. (2.2.3) can be evaluated from optimality conditions (Theorem 12 in Tropp (2006) and Theorem 1 in Blumensath and Davies (2008)). The lower bound of the regularization path is set to $\lambda_{\min} = \epsilon\,\lambda_{\max}$ for a fixed default value of $\epsilon$. The $\lambda$-dependent stability (or importance) measure $\Pi_j(\lambda)$ of each coefficient is then computed as:
$$\Pi_j(\lambda) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\!\left[\, j \in \hat{S}^{b}(\lambda) \,\right] \qquad (2.3.1)$$
where $b = 1, \ldots, B$ indexes the independent random subsamples. The stability of each coefficient can be plotted across the $\lambda$-path, resulting in a component stability profile (see Figure 1F for an illustration). This visualization provides an intuitive overview of the importance of the different model components. Different from the original stability selection proposal Meinshausen and Bühlmann (2010), we define the stable components of the model as follows:
$$\hat{S}^{\mathrm{stable}} = \left\{\, j \;:\; \max_{\lambda \in \Lambda} \Pi_j(\lambda) \geq \pi_{\mathrm{th}} \,\right\} \qquad (2.3.2)$$
Here, $\pi_{\mathrm{th}}$ denotes the critical stability threshold parameter and can be set following Meinshausen and Bühlmann (2010). The default setting is $\pi_{\mathrm{th}} = 0.8$. In an exploratory data analysis mode, the threshold can also be set through visual inspection of the stability plots, thereby allowing the principled exploration of several alternative PDE models. The stability measures also provide an interpretable criterion for a model component's importance, thereby guiding the user to build the right model with high probability. As we will show in the numerical experiments, stability selection ensures robustness against varying dictionary size, different types of data sampling, noise in the data, and variability of the suboptimal solutions when nonconvex penalties are used. All of these properties are critical for consistent and reproducible model learning in real-world applications. Under certain conditions, stability selection also provides an upper bound on the expected number of false positives. Such guarantees are not generally assured by any sparsity-promoting regression method in isolation Shah and Samworth (2013). For instance, stability selection combined with the Randomized LASSO (Eq. (2.2.2) with $\alpha < 1$) is consistent for variable selection even when the irrepresentable condition is violated Meinshausen and Bühlmann (2010).
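As an illustration, the subsampling loop behind Eqs. (2.3.1) and (2.3.2) can be sketched as follows; `fit_sparse` stands for any sparsity-promoting regression routine, and the subsample fraction and default values are illustrative assumptions:

```python
import numpy as np

def stability_selection(Theta, y, fit_sparse, lams, B=100, frac=0.5,
                        pi_th=0.8, rng=None):
    """Stability selection (a minimal sketch): run a sparse regression
    fit_sparse(Theta_sub, y_sub, lam) on B random row subsamples, record
    the selected support along the lambda path, and keep the components
    whose maximal selection probability exceeds pi_th."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = Theta.shape
    counts = np.zeros((len(lams), p))
    for _ in range(B):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        for k, lam in enumerate(lams):
            xi = fit_sparse(Theta[idx], y[idx], lam)
            counts[k] += (xi != 0)
    pi = counts / B                               # importance Pi_j(lambda)
    stable = np.flatnonzero(pi.max(axis=0) >= pi_th)
    return stable, pi
```

Any of the regression methods discussed in Section 2.2 can be plugged in as `fit_sparse`, which is precisely the drop-in property exploited by PDE-STRIDE.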
3 Numerical experiments on simulation data
We present numerical experiments in order to benchmark the performance and robustness of PDE-STRIDE combined with different sparsity-promoting regression methods to infer PDEs from spatiotemporal data. In order to provide comparisons and benchmarks, we first use simulated data obtained by numerically solving known ground-truth PDEs, before applying our method to a real-world data set from biology. The benchmark experiments on simulation data are presented in four subsections that demonstrate different aspects of the inference framework: Subsection 3.1 demonstrates the use of different sparsity-promoting regression methods in our framework on a simple 1D Burgers problem. Subsection 3.2 then compares their performance in order to choose the best regression method, IHTd. In Subsection 3.3, stability selection is combined with IHTd to recover 2D vorticity-transport and 3D reaction-diffusion PDEs from limited, noisy simulation data. Subsection 3.4 reports achievability results to quantify the robustness of stability selection to perturbations in dictionary size, sample size, and noise levels.
STability-based Robust IDEntification of PDEs (PDE-STRIDE)

Given: the noise-corrupted data $\mathbf{u}^{\sigma}$ and a choice of regression method, e.g., (randomized) LASSO, IHT, HTP, IHTd, STRidge.

1. Apply any required denoising method to the noisy data and compute the spatial derivatives and nonlinearities to construct the design matrix $\boldsymbol{\Theta}$ and the time-derivative vector $\mathbf{u}_t$ for suitable sample size $n$ and dictionary size $p$, respectively.

2. Build the subsamples $\boldsymbol{\Theta}^{b}$ and $\mathbf{u}_t^{b}$, for $b = 1, \ldots, B$, by uniformly randomly subsampling a fixed fraction (typically half) of the rows of the design matrix and the corresponding rows of $\mathbf{u}_t$. For every subsample $b$, standardize the sub-design matrix so that each column has zero mean and unit scale. The corresponding measurement vector $\mathbf{u}_t^{b}$ is centered to zero mean.

3. Apply the sparsity-promoting regression method independently to each subsample $b$ to construct the supports $\hat{S}^{b}(\lambda)$ for values of $\lambda$ along the regularization path, as discussed in Section 2.3.

4. Compute the importance measures $\Pi_j(\lambda)$ of all dictionary components along the discretized $\lambda$-path, as discussed in Section 2.3. Select the stable support set $\hat{S}^{\mathrm{stable}}$ by applying the threshold $\pi_{\mathrm{th}}$ to all $\Pi_j(\lambda)$. Solve a linear least-squares problem restricted to the support $\hat{S}^{\mathrm{stable}}$ to identify the coefficients of the learned model.
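The standardization in step 2 may be sketched as follows; the exact scaling convention (columns centered and scaled to norm $\sqrt{n_b}$, i.e., unit empirical variance) is an assumption of this sketch, not taken verbatim from the algorithm:

```python
import numpy as np

def standardize_subsample(Theta_b, ut_b):
    """Center each dictionary column to zero mean and scale it so that its
    norm equals sqrt(n_b) (unit empirical variance); center the response."""
    n_b = Theta_b.shape[0]
    Tc = Theta_b - Theta_b.mean(axis=0)
    Tc = Tc / (np.linalg.norm(Tc, axis=0) / np.sqrt(n_b))
    yc = ut_b - ut_b.mean()
    return Tc, yc
```

Standardizing each subsample separately puts all dictionary columns on a comparable scale, so that a single threshold or penalty acts uniformly across components of very different physical magnitudes.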
Adding noise to the simulation data
Let $\mathbf{u}$ be the vector of clean simulation data sampled in both space and time. This vector is corrupted with additive Gaussian noise to $\mathbf{u}^{\sigma} = \mathbf{u} + \boldsymbol{\eta}$, such that $\boldsymbol{\eta} \sim \mathcal{N}\!\left(\mathbf{0}, (\sigma\,\hat{\sigma}_{\mathbf{u}})^2\,\mathbb{I}\right)$ is the additive Gaussian noise, with $\hat{\sigma}_{\mathbf{u}}$ the empirical standard deviation of the entries in the vector $\mathbf{u}$, and $\sigma$ the level of Gaussian noise added.

Computing the data vector $\mathbf{u}_t$
The data vector $\mathbf{u}_t$ contains numerical approximations to the time derivatives of the state variable at different points in space and time. We compute these by first-order forward finite differences (i.e., the explicit Euler scheme) from $\mathbf{u}^{\sigma}$ after initial denoising of the data. Similarly, the spatial derivatives are computed by applying second-order central finite differences directly to the denoised data. For denoising, we use truncated singular value decomposition (SVD) with a cutoff at the elbow of the singular-values curve, as shown in Supplementary Figures S2 and S3.

Fixing the parameters for stability selection
We propose that PDE-STRIDE combined with IHTd provides a parameter-free PDE learning method. Therefore, all stability selection parameters described in Section 2.3 are fixed throughout our numerical experiments. The choice of these statistical parameters is well discussed in the literature Meinshausen and Bühlmann (2010); Bühlmann et al. (2014); Friedman et al. (2010). We thus fix the repetition number $B$, the regularization-path parameter $\epsilon$, the path size, and the importance threshold $\pi_{\mathrm{th}} = 0.8$. Using these standard choices, the method works robustly across all tests presented hereafter, and it is parameter-free in that sense. In both stability and regularization plots, we show the normalized value of the regularization constant $\lambda/\lambda_{\max}$. Although the stable component set is evaluated at $\pi_{\mathrm{th}}$ as in Eq. (2.3.2), the entire stability profile of each component from $\lambda_{\min}$ to $\lambda_{\max}$ is shown in all our stability plots. This way, we gain additional graphical insight into how each component evolves along the $\lambda$-path.
3.1 Case study with 1D Burgers equation and different sparsity-promoters
We show that stability selection can be combined with any sparsity-promoting penalized regression to learn PDE components from noisy and limited spatiotemporal data. We use simulated data of the 1D Burgers equation
$$\frac{\partial u}{\partial t} = -u\,\frac{\partial u}{\partial x} + 0.1\,\frac{\partial^2 u}{\partial x^2} \qquad (3.1.1)$$
with identical boundary and initial conditions as used in Rudy et al. (2017) to provide a fair comparison between methods: periodic boundaries in space and a Gaussian initial condition.
The simulation domain is divided uniformly into 256 Cartesian grid points in space and 1000 time points.
The numerical solution is visualized in space-time in Figure 4. The numerical solution was obtained using a parabolic method based on finite differences, with time-stepping by the explicit Euler method with a fixed step size.
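For orientation, a simulation in this spirit can be written in a few lines; the domain extents, initial condition, and viscosity $\nu = 0.1$ are assumptions modeled on the setup of Rudy et al. (2017), not values taken verbatim from the text:

```python
import numpy as np

def simulate_burgers(nx=256, nt=1000, L=16.0, T=10.0, nu=0.1):
    """Explicit-Euler finite-difference solver for the 1D Burgers equation
    u_t = -u u_x + nu u_xx with periodic boundaries and a Gaussian bump
    initial condition. Returns the grid x and the space-by-time data U."""
    dx = L / nx
    dt = T / nt                                  # satisfies nu*dt/dx^2 < 1/2
    x = np.linspace(-L / 2, L / 2, nx, endpoint=False)
    u = np.exp(-(x + 2.0) ** 2)                  # Gaussian initial condition
    U = np.empty((nx, nt))
    for n in range(nt):
        U[:, n] = u
        ux = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)
        uxx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
        u = u + dt * (-u * ux + nu * uxx)
    return x, U
```

The `np.roll` calls implement the periodic boundary conditions; the chosen step sizes keep the explicit diffusion step within its stability limit.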
We test the combinations of stability selection with the three sparsity-promoting regression techniques described in Section 2.2: randomized LASSO, STRidge, and IHTd. The top row of Figure 2 shows the regularization paths for randomized LASSO, STRidge, and IHTd. The bottom row of Figure 2 shows the corresponding stability profiles for each component in the dictionary. The colored solid lines correspond to the advection and diffusion terms of Eq. (3.1.1). Thresholding the importance measure at $\pi_{\mathrm{th}} = 0.8$, all three sparsity-promoting regression methods are able to distinguish the correct components in the stability plots (solid colored lines) from the noise variables (dashed black lines).
3.2 Comparison between sparsity-promoting techniques
Although stability selection can be used in conjunction with any $\ell_1$- or $\ell_0$-based sparsity-promoting regression method, the question arises whether a particular choice of regression algorithm is especially well suited for PDE learning. We therefore perform a systematic comparison between LASSO,
STRidge, and IHTd for recovering the 1D Burgers equation under perturbations to the sample size $n$, the dictionary size $p$, and the noise level $\sigma$. An experiment for a particular triple $(n, p, \sigma)$ is considered a success if there exists a $\lambda \in \Lambda$ (see Section 2.3) for which the true PDE components are recovered. In Figure 3, the success frequencies over 30 independent repetitions with uniformly random data subsamples are shown.
A first observation from Figure 3 is that the $\ell_0$-based solutions (here with STRidge and IHTd) are better than the relaxed $\ell_1$ solutions (here with LASSO). We also observe that IHTd performs better than LASSO and STRidge for large dictionary sizes $p$, high noise levels, and small sample sizes $n$.
Large dictionaries with higher-order derivatives computed from discrete data cause grouping (correlations between variables), for which LASSO tends to select one variable from each group, ignoring the others Wang et al. (2011). Thus, LASSO fails to identify the true support consistently. STRidge shows good recovery for large dictionary sizes with clean data, but it breaks down in the presence of noise on the data. Of all three methods, IHTd shows the best robustness to both noise and changes in the design. We note a decrease in inference power with increasing sample size $n$, especially for large dictionaries and high noise levels. This again can be attributed to correlations and groupings in the dictionary, which become more prominent with increasing sample size $n$.
Based on these results, we use IHTd as the sparsity-promoting regression method in conjunction with PDE-STRIDE for model selection in the remaining sections.
3.3 Stability-based model inference
We present benchmark results of PDE-STRIDE for PDE recovery with IHTd as the sparse regression method. This combination of methods is used to recover PDEs from limited noisy data obtained by numerical solution of the 1D Burgers, 2D vorticity-transport, and 3D Gray-Scott equations. Once the support $\hat{S}^{\mathrm{stable}}$ of the PDE model is learned by PDE-STRIDE with IHTd, the actual coefficient values of the nonzero components are determined by solving the linear least-squares problem restricted to the recovered support. More sophisticated methods could be used to estimate the parameters of a known PDE structure from limited noisy data, as in Raissi et al. (2017); Xun et al. (2013). This is beyond the scope of this paper, however, since in all cases considered the sample size significantly exceeds the cardinality of the recovered support, for which the linear least-squares fit provides good estimates of the PDE coefficients.
3.3.1 1D Burgers equation
We again consider the 1D Burgers equation from Eq. (3.1.1), using the same simulated data as in Section 3.1, to quantify the performance and robustness against noise of the PDE-STRIDE+IHTd method. The results are shown in Figure 4. Even on this small data set, with a sample size $n$ comparable to the dictionary size $p$, our method recovers the correct model up to the highest tested noise level, although the least-squares fits of the coefficient values gradually deviate from their exact values (see Table 1).
For comparison, the corresponding stability plots for PDE-STRIDE+STRidge are shown in Supplementary Figure S4. With STRidge regression, the algorithm produces many false positives even at mild noise levels.
Table 1: Least-squares refitted coefficients of the recovered 1D Burgers model for clean data (top row) and for increasing noise levels (top to bottom). The two columns correspond to the advection and diffusion terms of Eq. (3.1.1).

clean   1.0008   0.1000
        0.9971   0.1016
        0.9932   0.0997
        0.9842   0.0976
        0.9728   0.0984
        0.9619   0.0967
3.3.2 2D vorticity transport equation
This section discusses results for the recovery of the 2D vorticity transport equation using PDE-STRIDE. The vorticity transport equation is obtained by taking the curl of the Navier-Stokes equations and imposing the divergence-free constraint that enforces incompressibility. This form of the Navier-Stokes equations has found extensive application in oceanography and climate modeling Temam (2001). For the numerical solution of the transport equation, we impose no-slip boundary conditions on the left, right, and bottom sides of the domain, and a shear-flow boundary condition on the top side. The simulation code was written using the OpenFPM framework Incardona et al. (2019) with explicit time stepping on a regular space grid. A Poisson problem was solved at every time step to project the velocities onto a divergence-free field. The viscosity of the simulated fluid was kept constant. In Figure 5, we show a single time snapshot of the velocities and the vorticity field inside the square domain of the lid-driven cavity experiment.
\partial_t \omega + u\,\partial_x \omega + v\,\partial_y \omega = \nu \left( \partial_{xx} \omega + \partial_{yy} \omega \right)    (3.3.1)
In Figure 6, the PDE-STRIDE results for the recovery of the 2D vorticity transport equation are shown. The results demonstrate consistent recovery of the true support of the PDE across different noise levels. The stable components recovered correspond to the true components of the 2D vorticity equation, Eq. (3.3.1). In Table 2, we show the refitted coefficients for the recovered PDE components. Note that the separation between the true components (colored solid lines) and the noisy ones (black dotted lines) becomes less distinct with increasing noise level. In Supplementary Figure S5, we also report the STRidge-based stability selection results for the same design and stability selection parameters. STRidge struggles to recover the true support even at small noise levels.
Table 2: Least-squares refitted coefficients of the recovered 2D vorticity transport components for clean data (top row) and for increasing noise levels (top to bottom). The columns correspond to the two diffusion terms and the two advection terms of Eq. (3.3.1).

clean   0.02504   0.02502   0.9994   1.0025
        0.02501   0.02504   0.9997   1.0006
        0.02492   0.0250    1.0003   0.9944
        0.0247    0.0250    1.004    0.9841
        0.0245    0.0251    1.0091   0.9748
        0.0242    0.0251    1.0083   0.9586
3.3.3 3D Gray-Scott equations
In this section, we report the recovery capabilities of PDE-STRIDE for the 3D Gray-Scott reaction-diffusion equations, Eqs. (3.3.2). Reaction and diffusion of chemical species can produce a variety of patterns reminiscent of those often observed in nature. Such processes form an essential basis of morphogenesis in biology Meinhardt (1982) and may even describe skin patterning and pigmentation Manukyan et al. (2017). We chose this example because it involves coupled variables and because its dynamics are similar to those of the real-world example discussed in Section 4. The reaction-diffusion dynamics described by Eqs. (3.3.2) are commonly used to model the nonlinear interactions between two chemical species u and v.
\partial_t u = D_u \nabla^2 u - u v^2 + F (1 - u)    (3.3.2a)
\partial_t v = D_v \nabla^2 v + u v^2 - (F + k) v    (3.3.2b)
We simulate the above equations using the OpenFPM framework. A snapshot of the cubic 3D simulation box, along with the concentration field, is shown in Figure 7. A finite-difference scheme with uniform grid spacing was used to discretize the dynamics described by Eqs. (3.3.2) in space, and explicit time stepping with a fixed step size was used for temporal integration. The 3D Gray-Scott model parameters were held constant throughout the simulation.
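To give a flavor of the simulation, a single explicit Euler step of the Gray-Scott dynamics can be written as follows. This is a 2D sketch with periodic boundaries and illustrative parameter names; the paper's 3D OpenFPM simulation is of course more involved.

```python
import numpy as np

def gray_scott_step(u, v, Du, Dv, F, k, dt, dx):
    """One explicit Euler step of the Gray-Scott dynamics
        u_t = Du*lap(u) - u*v^2 + F*(1 - u)
        v_t = Dv*lap(v) + u*v^2 - (F + k)*v
    on a periodic 2D grid (2D is used here only to keep the sketch short)."""
    def lap(a):  # 5-point Laplacian with periodic boundaries
        return (np.roll(a, 1, axis=0) + np.roll(a, -1, axis=0) +
                np.roll(a, 1, axis=1) + np.roll(a, -1, axis=1) - 4.0 * a) / dx**2
    uvv = u * v**2  # the nonlinear reaction term shared by both species
    return (u + dt * (Du * lap(u) - uvv + F * (1.0 - u)),
            v + dt * (Dv * lap(v) + uvv - (F + k) * v))
```

Starting from a uniform state with a locally perturbed square, repeated application of this step produces the familiar Gray-Scott patterns for suitable parameter choices.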
Given the large number of degrees of freedom of the 3D problem, we sample data for learning only from a small cube in the middle of the domain. In Figure 7, we show that the PDE-STRIDE method correctly identifies the true PDE components of the dynamics of the first species, given by Eq. (3.3.2a), for different noise levels; the clear separation between the true and noisy PDE components is apparent in the stability plots. We show results for different noise levels with few samples relative to the dictionary size. Similar plots for the inference of the second species' dynamics are shown in Figure S1. Although perfect recovery was not possible there, owing to the small diffusivity of the second species, consistent and stable recovery of the reaction (interaction) terms can be seen. The refitted coefficients of the recovered PDEs for the two species are reported in Table 3 and Table S1, respectively. Comparison plots of PDE-STRIDE with STRidge for the 3D Gray-Scott recovery are shown in Supplementary Figures S6 and S7. STRidge is able to recover the complete form of Eqs. (3.3.2) for both species in the noise-free case, but it fails to recover both PDEs in the noisy case. This comparison demonstrates that PDE-STRIDE+IHT-d clearly outperforms PDE-STRIDE+STRidge for inference from noisy data sets.

Table 3: Least-squares refitted coefficients of the recovered PDE components for the first species of the 3D Gray-Scott system, for clean data (top row) and for increasing noise levels (top to bottom).

clean   0.0140   0.0140   1.0000
        0.0142   0.0143   0.9915
        0.0144   0.0146   0.9795
        0.0150   0.0153   0.9843
In all cases presented above, the learned stable components coincide with the true components of the underlying PDE model, and the PDE-STRIDE framework is able to learn them from few data points. We therefore conclude that our PDE-STRIDE+IHT-d framework is able to robustly learn partial differential equations from limited noisy data. In the next section, we discuss the consistency and robustness of PDE-STRIDE+IHT-d under perturbations of the design parameters, namely the sample size, the dictionary size, and the noise level.
3.4 Achievability results
Achievability results are a compact way to check the robustness and consistency of a model selection method for varying design parameters. They also provide an approximate means of revealing the sample complexity of any sparsity-promoting technique, i.e., the number of data points required to recover the model with full probability. Specifically, given a sparsity-promoting method, a dictionary size, a sparsity level, and a noise level, we are interested in how the sample size must scale for the recovery probability to converge to one. The study Wainwright (2009) reports a sharp phase transition from failure to success for LASSO-based sparse solutions on Gaussian random designs with increasing sample size, and provides sufficient lower bounds on the sample size for full recovery probability. In this section, we ask whether sparse model selection with PDE-STRIDE (paired with IHT-d) exhibits a similar sharp phase-transition behavior. Since the dictionary components in our case are compiled from derivatives and nonlinearities computed from noisy data, it is interesting to observe whether full recovery probability is achieved and maintained with increasing sample size. In the particular context of PDE learning, increasing the dictionary size by including higher-order nonlinearities and higher-order derivatives has the potential to introduce strongly correlated components, which can negatively impact the inference power, as evidenced by the results and discussion in Section 3.2.

In Figures 8 and 9, the achievability results for the 1D Burgers system and the 3D Gray-Scott reaction-diffusion system are shown. Each point corresponds to 20 repetitions of an experiment with a given design under random data subsampling. An experiment is counted as a success if the true PDE support is recovered by PDE-STRIDE+IHT-d after thresholding the importance measures. In Figures 8 and 9, we see strong consistency and robustness to the design parameters for both the Burgers and Gray-Scott systems, and we observe a sharp phase transition from failure to success, with the recovery probability converging to one with increasing sample size. This evidence suggests that PDE-STRIDE not only enhances the inference power of the IHT-d method, but also ensures consistency. In addition, the sharp phase-transition behavior points toward a strict lower bound on the sample complexity below which full recoverability is not attainable, as studied in Wainwright (2009).
Following this, achievability plots can be used to read out approximate estimates of the sample complexity of the learned dynamical systems. For both the Burgers and the 3D Gray-Scott systems, full success probability is achieved with remarkably few data points in the noise-free case, and with only moderately more points in the noisy cases, across the different designs tested.
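The empirical procedure behind such achievability curves can be sketched as follows. This is a generic utility, not the paper's exact experiment: `solver` stands for any sparse regression routine, and the repetition count mirrors the 20 repetitions per design described above.

```python
import numpy as np

def recovery_probability(Theta, y, true_support, solver, sizes,
                         n_repeats=20, seed=0):
    """Empirical achievability curve: for each sample size N, run the
    sparse solver on n_repeats random row subsamples and record the
    fraction of runs that recover exactly the true support.
    `solver(Theta, y)` is any routine returning a coefficient vector."""
    rng = np.random.default_rng(seed)
    probs = []
    for N in sizes:
        hits = 0
        for _ in range(n_repeats):
            idx = rng.choice(len(y), size=N, replace=False)
            w = solver(Theta[idx], y[idx])
            hits += set(np.flatnonzero(w)) == set(true_support)
        probs.append(hits / n_repeats)
    return probs
```

Plotting the returned probabilities against the sample sizes reveals the phase transition; the smallest size at which the curve reaches one gives the empirical sample-complexity estimate.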
4 Data-driven PDE inference on real experimental data to explain C. elegans zygote patterning
We showcase the applicability of PDE-STRIDE with IHT-d to real experimental data. We use microscopy images to infer a PDE model that explains early C. elegans embryo patterning, and we use it to confirm a previous hypothesis about the physics driving this biological process. Earlier studies of this process proposed a mechanochemical mechanism for PAR protein polarization on the cell membrane Etemad-Moghadam et al. (1995); Goehring et al. (2011); Gross et al. (2019), and systematically showed that cortical flows provide sufficient perturbations to trigger polarization Goehring et al. (2011). The experiments conducted in Goehring et al. (2011) measured the concentration of the anterior PAR complex (aPAR), the concentration of the posterior PAR complex (pPAR), and the cortical flow field (v) as functions of time, as shown in Figure 10 A, B, D. The concentration and velocity fields were acquired on a regular space-time grid. These experimental data sets were used to validate the mechanochemical model developed in the studies Etemad-Moghadam et al. (1995); Goehring et al. (2011). Here, we challenge the PDE-STRIDE+IHT-d framework to learn a PDE model of the regulatory network (Eq. 4.0.1) of the interacting membrane PAR proteins in a purely data-driven manner from the experimental data set. Given the noisy nature of the data, our analysis is limited to the first SVD mode of the data, as shown in Figure 10 C. We also focus on the temporal regime after the advection trigger, when the early domains of the PAR proteins are already formed. PDE-STRIDE is then tasked with learning an interpretable model from the data that evolves the early protein domains into the fully developed patterns shown in Figure 10 A.
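Restricting the analysis to the first SVD mode amounts to a simple low-rank truncation of the space-time data matrix. The sketch below is a generic utility illustrating this step, not the paper's exact preprocessing pipeline:

```python
import numpy as np

def svd_denoise(data, rank=1):
    """Keep only the leading `rank` SVD modes of a space-time data matrix
    (rows: time points, columns: spatial samples) as a simple low-rank
    denoising step."""
    U, s, Vt = np.linalg.svd(data, full_matrices=False)
    # reconstruct from the leading modes only
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]
```

With `rank=1`, the reconstruction retains the dominant spatiotemporal mode of the measured concentration fields while suppressing measurement noise spread over the remaining modes.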
The reaction kinetics of the PAR proteins can be formulated as,
(4.0.1) 
Here, the reactant and product stoichiometries and the reaction rate parameterize each reaction, and the two variables correspond to the aPAR and pPAR protein species.
In designing the dictionary, the maximum allowed stoichiometry for both reactant and product is restricted to 2. The PDE-STRIDE+IHT-d results for the regulatory reaction network learned from the data, including the stable components of the models for the aPAR and pPAR proteins, are shown in Figure 10 E, F. In Figure 10 G, achievability tests are conducted to show the consistency and robustness of the learned models across different sample sizes; the learned model achieves full recovery probability for sufficiently large sample sizes. Our preliminary models, inferred in a purely data-driven manner, exhibit very good qualitative agreement with the experiments and also recapitulate the mutually inhibitory nature of the PAR proteins. The parameters of the learned models are then computed by least-squares refitting and are tabulated in Table 4.
Table 4: Least-squares refitted parameters of the learned reaction models for the pPAR and aPAR proteins.

pPAR   0.00019769   0.01073594   0.00027887
aPAR   0.0041325    0.00216077   0.00014699
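The reaction dictionary with bounded stoichiometry described above can be illustrated as the set of mass-action monomials up to the allowed order. This is a sketch with hypothetical names `A` and `P` for the aPAR and pPAR concentration fields sampled over space-time:

```python
import numpy as np
from itertools import product

def reaction_dictionary(A, P, max_order=2):
    """Dictionary of candidate mass-action monomials A^i * P^j with
    stoichiometries i, j <= max_order (hypothetical helper; A and P
    denote the aPAR and pPAR concentration fields)."""
    columns, labels = [], []
    for i, j in product(range(max_order + 1), repeat=2):
        columns.append((A**i * P**j).ravel())  # candidate reaction term
        labels.append(f"A^{i} P^{j}")
    return np.stack(columns, axis=1), labels
```

Sparse regression over such a dictionary then selects the few monomials that make up the regulatory reaction network, as in Figure 10 E, F.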
In Figures 10 H and 10 I, we overlay the numerical solution of the learned model with the denoised experimental data at selected time snapshots for both the aPAR and pPAR models, for quantitative comparison. This very simple PDE model is able to describe the temporal evolution of the PAR protein domains on the C. elegans zygote. Although there is a good match in the spatial scales (PAR domain sizes) for the two proteins, a non-negligible discrepancy between simulation and experiment remains in the time scale of the pPAR evolution. This difference can be attributed to the advection by cortical flows, which is not included in our simple ODE-type model, as shown in Figures 10 E and 10 I. We believe that including higher modes of the SVD decomposition, and using structured sparsity to enforce symmetry arguments through grouping Ward Jr (1963), could further mature our data-driven models to include the mechanical aspects of the PAR system. In Supplementary Figure S8 (left), we already show that, for one particular design, the advection and diffusion components of the aPAR protein model carry importance measures significant enough to be included in the stable set; this is not the case for the pPAR components, as shown in Figure S8 (right). The preferential advective displacement of the aPARs to the anterior side, modeled by the advection term, is also in line with the observations of the experimental studies Goehring et al. (2011). However, such models with advection and diffusion components are inconsistent under varying sample size, in contrast to our simple ODE-type model, as illustrated in Figure 10 G.
5 Conclusion and Discussion
We have addressed two key issues that have thus far limited the application of sparse regression methods for automated PDE inference from noisy and limited data: the need for manual parameter tuning and the high sensitivity to noise in the data. We have shown that stability selection combined with any sparsity-promoting regression technique provides an appropriate level of regularization for consistent and robust recovery of the correct PDE model. Our numerical benchmarks suggested that iterative hard thresholding with debiasing (IHT-d) is an ideal companion of stability selection, together forming a robust and parameter-free framework (PDE-STRIDE) for PDE learning. This combination of methods outperformed all other tested algorithmic approaches with respect to identification performance, amount of data required, and robustness to noise. The resulting stability-based PDE-STRIDE method was tested for robust recovery of the 1D Burgers equation, the 2D vorticity transport equation, and the 3D Gray-Scott reaction-diffusion equations from simulation data corrupted with up to 6% additive Gaussian noise. The achievability studies demonstrated the consistency and robustness of the PDE-STRIDE method, with full recovery probability of the model achieved with increasing sample size for varying dictionary sizes and noise levels.
In addition, we note that achievability plots provide a natural estimate of the sample complexity of the underlying nonlinear dynamical system. However, this empirical estimate of the sample complexity depends on the choice of model selection algorithm and on how the data are sampled.
We demonstrated the capabilities of the PDE-STRIDE+IHT-d framework by applying it to learn a PDE model of embryo polarization directly from fluorescence microscopy images of C. elegans zygotes. The model recovered the regulatory reaction network of the involved proteins and their spatial transport dynamics in a purely data-driven manner, with no knowledge of the underlying physics or symmetries. The data-derived PDE model thus learned correctly predicted the spatiotemporal dynamics of the embryonic polarity system, from the early spatial domains to the fully developed patterns observed in the polarized C. elegans zygote. The model we inferred from image data using our method confirms both the structure and the mechanisms of popular physics-derived cell polarity models. Interestingly, the mutually inhibitory interactions between the involved protein species, previously discovered through extensive biochemical experimentation, were automatically extracted from the data.
Besides rendering sparse inference methods more robust to noise and parameter-free, stability selection has the important conceptual benefit of also providing interpretable probabilistic importance measures for all model components. This enables modelers to construct their models with high fidelity, and to gain intuition about correlations and sensitivities. Graphical inspection of the stability paths provides additional freedom for user intervention in semi-automated model discovery from data.
We expect that statistical learning methods have the potential to enable robust, consistent, and reproducible discovery of predictive and interpretable models directly from observational data. Our parameter-free PDE-STRIDE framework provides a first step toward this goal, but many open issues remain. First, numerically approximating time and space derivatives in noisy data is challenging for noise levels above a few percent. This limits the noise robustness of the overall method, regardless of how robust the subsequent regression step is. The impact of noise becomes even more severe when exploring models with higher-order derivatives or stronger nonlinearities. Future work should focus on formulations that are robust to the choice of discretization method, while providing the necessary freedom to impose structure on the coefficients. Second, a principled way to constrain the learning process with physical priors, such as conservation laws and symmetries, is lacking. Exploiting such structural knowledge about the dynamical system is expected to greatly improve learning performance. It should therefore be explored whether, e.g., structured sparsity or grouping constraints Kutz et al. (2017); Ward Jr (1963) can be adopted for this purpose. Especially in coupled systems, like the Gray-Scott reaction-diffusion system and the PAR polarity model, known symmetries could be enforced through structured grouping constraints.
In summary, we believe that data-driven model discovery has tremendous potential to provide novel insights into complex processes, particularly in biology. It provides an effective and complementary alternative to hypothesis-driven approaches. We hope that the stability-based model selection method PDE-STRIDE presented here will contribute to the further development and adoption of such approaches in the sciences.
Code and data availability: The GitHub repository containing the code and data can be found at https://github.com/SuryanarayanaMK/PDESTRIDE.git.
Acknowledgements
This work was in parts supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC2068 – 390729961. We are grateful to the Grill lab at MPI-CBG / TU Dresden for providing the spatiotemporal PAR concentration and flow field data and for allowing us to use them in our showcase. We thank Nathan Kutz (University of Washington) and his group for making their code and data public.
References
 Mogilner et al. [2006] Alex Mogilner, Roy Wollman, and Wallace F Marshall. Quantitative modeling in cell biology: what is it good for? Developmental cell, 11(3):279–287, 2006.
 Sbalzarini [2013] Ivo F Sbalzarini. Modeling and simulation of biological systems from image data. Bioessays, 35(5):482–490, 2013.
 Tomlin and Axelrod [2007] Claire J Tomlin and Jeffrey D Axelrod. Biology by numbers: mathematical modelling in developmental biology. Nature reviews genetics, 8(5):331, 2007.
 Duffy [2013] Daniel J Duffy. Finite Difference methods in financial engineering: a Partial Differential Equation approach. John Wiley & Sons, 2013.
 Adomian [1995] George Adomian. Solving the mathematical models of neurosciences and medicine. Mathematics and computers in simulation, 40(12):107–114, 1995.
 Donoho [2017] David Donoho. 50 years of data science. Journal of Computational and Graphical Statistics, 26(4):745–766, 2017.
 Barnes et al. [2011] Chris P Barnes, Daniel Silk, Xia Sheng, and Michael P H Stumpf. Bayesian design of synthetic biological systems. Proceedings of the National Academy of Sciences of the United States of America, 108(37):15190–15195, 2011. ISSN 00278424. doi: 10.1073/pnas.1017972108.
 Asmus et al. [2017] Josefine Asmus, Christian L Müller, and Ivo F Sbalzarini. Lp-Adaptation: Simultaneous design centering and robustness estimation of electronic and biological systems. Scientific Reports, 7(1):6660, 2017.
 Kitano [2002] Hiroaki Kitano. Computational systems biology. Nature, 420(6912):206, 2002.
 Gregor et al. [2005] Thomas Gregor, William Bialek, Rob R de Ruyter van Steveninck, David W Tank, and Eric F Wieschaus. Diffusion and scaling during early embryonic pattern formation. Proceedings of the National Academy of Sciences, 102(51):18403–18407, 2005.
 Chen et al. [1999] Ting Chen, Hongyu L He, and George M Church. Modeling gene expression with differential equations. In Biocomputing’99, pages 29–40. World Scientific, 1999.
 Ay and Arnosti [2011] Ahmet Ay and David N Arnosti. Mathematical modeling of gene expression: a guide for the perplexed biologist. Critical reviews in biochemistry and molecular biology, 46(2):137–151, 2011.
 Prost et al. [2015] Jacques Prost, Frank Jülicher, and Jean-François Joanny. Active gel physics. Nature Physics, 11(2):111, 2015.
 Münster et al. [2019] Stefan Münster, Akanksha Jain, Alexander Mietke, Anastasios Pavlopoulos, Stephan W Grill, and Pavel Tomancak. Attachment of the blastoderm to the vitelline envelope affects gastrulation of insects. Nature, page 1, 2019.
 Voss et al. [1998] H Voss, MJ Bünner, and Markus Abel. Identification of continuous, spatiotemporal systems. Physical Review E, 57(3):2820, 1998.
 Breiman and Friedman [1985] Leo Breiman and Jerome H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 1985. ISSN 1537274X. doi: 10.1080/01621459.1985.10478157.
 Bär et al. [1999] Markus Bär, Rainer Hegger, and Holger Kantz. Fitting partial differential equations to space-time dynamics. Physical Review E, 59(1):337, 1999.
 Xun et al. [2013] Xiaolei Xun, Jiguo Cao, Bani Mallick, Arnab Maity, and Raymond J. Carroll. Parameter estimation of partial differential equation models. Journal of the American Statistical Association, 108(503):1009–1020, 2013. ISSN 01621459. doi: 10.1080/01621459.2013.794730.
 Raissi et al. [2017] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Machine learning of linear differential equations using gaussian processes. Journal of Computational Physics, 348:683–693, 2017.
 Brunton et al. [2016] Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, page 201517384, 2016.
 Rudy et al. [2017] Samuel H Rudy, Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Data-driven discovery of partial differential equations. Science Advances, 3(4):e1602614, 2017.
 Schaeffer [2017] Hayden Schaeffer. Learning partial differential equations via data discovery and sparse optimization. Proc. R. Soc. A, 473(2197):20160446, 2017.
 Zhang and Lin [2018] Sheng Zhang and Guang Lin. Robust data-driven discovery of governing physical laws with error bars. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 474(2217):20180305, 2018. doi: 10.1098/rspa.2018.0305.
 Mangan et al. [2019] N M Mangan, T Askham, S L Brunton, J N Kutz, and J L Proctor. Model selection for hybrid dynamical systems via sparse regression. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 475(2223):1–22, 2019. ISSN 14712946. doi: 10.1098/rspa.2018.0534.
 Long et al. [2017] Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. PDE-Net: Learning PDEs from data. arXiv preprint arXiv:1710.09668, 2017.
 Raissi and Karniadakis [2018] Maziar Raissi and George Em Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125–141, 2018.

 Raissi et al. [2019] M Raissi, P Perdikaris, and G E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. doi: 10.1016/j.jcp.2018.10.045.
 Long et al. [2018] Zichao Long, Yiping Lu, and Bin Dong. PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. arXiv preprint arXiv:1812.04426, 2018.
 Dong et al. [2017] Bin Dong, Qingtang Jiang, and Zuowei Shen. Image restoration: Wavelet frame shrinkage, nonlinear evolution pdes, and beyond. Multiscale Modeling & Simulation, 15(1):606–660, 2017.
 Meinshausen and Bühlmann [2010] Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
 Shah and Samworth [2013] Rajen D Shah and Richard J Samworth. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):55–80, 2013.
 Tibshirani [1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 Blumensath and Davies [2008] Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 14(5–6):629–654, 2008.
 Foucart [2011] Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.
 Gross et al. [2019] Peter Gross, K Vijay Kumar, Nathan W Goehring, Justin S Bois, Carsten Hoege, Frank Jülicher, and Stephan W Grill. Guiding selforganized pattern formation in cell polarity establishment. Nature Physics, 15(3):293, 2019.
 Chartrand [2011] Rick Chartrand. Numerical differentiation of noisy, nonsmooth data. ISRN Applied Mathematics, 2011, 2011.
 Stickel [2010] Jonathan J Stickel. Data smoothing and numerical differentiation by a regularization method. Computers & chemical engineering, 34(4):467–475, 2010.
 Wu and Lange [2008] Tong Tong Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1):224–244, 2008. ISSN 19326157. doi: 10.1214/07AOAS147.
 Friedman et al. [2010] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010.
 Eckstein and Bertsekas [1992] Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1–3):293–318, 1992.
 Combettes and Pesquet [2011] Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering, pages 185–212. Springer, 2011.
 Beck and Teboulle [2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 Meinshausen et al. [2006] Nicolai Meinshausen, Peter Bühlmann, et al. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
 Zhao and Yu [2006] Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine learning research, 7(Nov):2541–2563, 2006.
 Fan and Li [2001] Jianqing Fan and Runze Li. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001. ISSN 01621459. doi: 10.1198/016214501753382273.
 Zhang [2010] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
 Tropp [2004] Joel A Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information theory, 50(10):2231–2242, 2004.
 Needell and Tropp [2009] Deanna Needell and Joel A Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and computational harmonic analysis, 26(3):301–321, 2009.
 Dai and Milenkovic [2009] Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE transactions on Information Theory, 55(5):2230–2249, 2009.
 Blumensath and Davies [2009] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265–274, 2009.
 Herrity et al. [2006] Kyle K Herrity, Anna C Gilbert, and Joel A Tropp. Sparse approximation via iterative thresholding. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 3, pages III–III. IEEE, 2006.
 Tropp [2006] Joel A Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE transactions on information theory, 52(3):1030–1051, 2006.
 Figueiredo et al. [2007] Mário AT Figueiredo, Robert D Nowak, and Stephen J Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of selected topics in signal processing, 1(4):586–597, 2007.

 Kohavi [1995] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence, 1995.
 Schwarz [1978] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978. doi: 10.1214/aos/1176344136.
 Lederer and Müller [2015] Johannes Lederer and Christian L. Müller. Don’t fall for tuning parameters: Tuningfree variable selection in high dimensions with the TREX. In Proceedings of the TwentyNinth AAAI Conference on Artificial Intelligence (AAAI 2015), pages 2729—2735. AAAI Press, 2015.
 Bien et al. [2018] Jacob Bien, Irina Gaynanova, Johannes Lederer, and Christian L. Müller. NonConvex Global Minimization and False Discovery Rate Control for the TREX. Journal of Computational and Graphical Statistics, 27(1):23–33, 2018. ISSN 15372715. doi: 10.1080/10618600.2017.1341414. URL http://arxiv.org/abs/1604.06815.
 Yu [2013] Bin Yu. Stability. Bernoulli, 19(4):1484–1500, 2013. ISSN 13507265. doi: 10.3150/13BEJSP14. URL http://projecteuclid.org/euclid.bj/1377612862.
 Liu et al. [2010] Han Liu, Kathryn Roeder, and Larry Wasserman. Stability Approach to Regularization Selection (StARS) for high dimensional graphical models. Advances in Neural Information Processing Systems, 24(2):1432–1440, 2010.
 Bühlmann et al. [2014] Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1:255–278, 2014.
 Wang et al. [2011] Sijian Wang, Bin Nan, Saharon Rosset, and Ji Zhu. Random lasso. The annals of applied statistics, 5(1):468, 2011.
 Temam [2001] Roger Temam. NavierStokes equations: theory and numerical analysis, volume 343. American Mathematical Soc., 2001.
 Incardona et al. [2019] Pietro Incardona, Antonio Leo, Yaroslav Zaluzhnyi, Rajesh Ramaswamy, and Ivo F Sbalzarini. OpenFPM: A scalable open framework for particle and particlemesh codes on parallel computers. Computer Physics Communications, 2019.
 Meinhardt [1982] Hans Meinhardt. Models of biological pattern formation. New York, 1982.
 Manukyan et al. [2017] Liana Manukyan, Sophie A Montandon, Anamarija Fofonjka, Stanislav Smirnov, and Michel C Milinkovitch. A living mesoscopic cellular automaton made of skin scales. Nature, 544(7649):173, 2017.
 Wainwright [2009] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
 EtemadMoghadam et al. [1995] Bijan EtemadMoghadam, Su Guo, and Kenneth J Kemphues. Asymmetrically distributed PAR3 protein contributes to cell polarity and spindle alignment in early C. elegans embryos. Cell, 83(5):743–752, 1995.
 Goehring et al. [2011] Nathan W Goehring, Philipp Khuc Trong, Justin S Bois, Debanjan Chowdhury, Ernesto M Nicola, Anthony A Hyman, and Stephan W Grill. Polarization of PAR proteins by advective triggering of a pattern-forming system. Science, 334(6059):1137–1141, 2011.
 Ward Jr [1963] Joe H Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
 Kutz et al. [2017] J Nathan Kutz, Samuel H Rudy, Alessandro Alla, and Steven L Brunton. Data-driven discovery of governing physical laws and their parametric dependencies in engineering, physics and biology. In 2017 IEEE 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 1–5. IEEE, 2017.
 Donoho and Gavish [2013] David L Donoho and Matan Gavish. The optimal hard threshold for singular values is 4/√3, 2013.
 Hansen [1990] Per Christian Hansen. Truncated singular value decomposition solutions to discrete ill-posed problems with ill-determined numerical rank. SIAM Journal on Scientific and Statistical Computing, 11(3):503–518, 1990.
6 Supplementary Material
6.1 Algorithm
In the above algorithm, the two step sizes (learning rates) correspond to the IHT and debiasing steps, respectively. The IHT learning rate is computed as the inverse of the Lipschitz constant of the gradient of the square-loss function in Eq. (2.1.5). In similar fashion, the learning rate of the debiasing step is computed as the inverse of the Lipschitz constant of the gradient of the square-loss function restricted to the support set.
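The step-size rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the fixed sparsity level k, and the iteration counts are assumptions; only the learning-rate computation (inverse Lipschitz constant of the square-loss gradient, and its support-restricted analogue for debiasing) follows the text.

```python
# Sketch of iterative hard thresholding (IHT) with a debiasing step for the
# square loss f(xi) = 0.5 * ||y - Theta @ xi||^2. Hypothetical names throughout.
import numpy as np

def iht_debias(Theta, y, k, n_iter=100):
    n, p = Theta.shape
    # Lipschitz constant of grad f is the largest eigenvalue of Theta^T Theta,
    # i.e. the squared largest singular value of Theta.
    L = np.linalg.norm(Theta, 2) ** 2
    mu = 1.0 / L  # IHT learning rate: inverse Lipschitz constant
    xi = np.zeros(p)
    for _ in range(n_iter):
        # Gradient step on the full problem, then keep the k largest entries.
        xi = xi - mu * (Theta.T @ (Theta @ xi - y))
        mask = np.zeros(p, dtype=bool)
        mask[np.argsort(np.abs(xi))[-k:]] = True
        xi[~mask] = 0.0
        # Debiasing: gradient steps restricted to the current support, with
        # step size equal to the inverse Lipschitz constant of the loss
        # restricted to the support columns.
        Theta_S = Theta[:, mask]
        mu_S = 1.0 / (np.linalg.norm(Theta_S, 2) ** 2)
        for _ in range(5):
            xi[mask] -= mu_S * (Theta_S.T @ (Theta_S @ xi[mask] - y))
    return xi
```

The debiasing step uses a larger admissible step size than the full IHT step, since the restricted Lipschitz constant is never larger than the full one, which speeds up convergence of the coefficients on the identified support.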

[Table: numerical results; column headers not recoverable from the extracted text]
clean   0        0.0669   0.9999
        0        0.0666   0.9910
        0.0001   0.0667   0.9840
        0.0001   0.0657   0.9677