Compressive sensing (CS) started as a breakthrough technique in signal processing around a decade ago [Candes2006a, Donoho2006a]. It has since been broadly employed to recover sparse solutions of underdetermined linear systems, where the number of unknowns exceeds the number of data measurements. Such a system has an infinite number of solutions, and CS seeks the sparsest solution: the solution with the fewest non-zero components, minimizing its $\ell_0$-norm. This $\ell_0$-minimization is an NP-hard problem, however, and a simpler convex relaxation minimizing the $\ell_1$-norm is often used as an approximation, which can be shown to recover the $\ell_0$ solution under certain conditions [Donoho2006]. Several variants of $\ell_1$-sparse recovery are thus frequently studied:
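The four formulations, reconstructed here in standard notation (the symbols $A$, $y$, $x$, and the tolerance names $\epsilon$ and $\tau$ are assumptions consistent with the surrounding text):

\begin{align}
&\text{(BP)} && \min_{x} \|x\|_1 \quad \text{s.t.} \quad A x = y, \\
&\text{(BPDN)} && \min_{x} \|x\|_1 \quad \text{s.t.} \quad \|A x - y\|_2 \le \epsilon, \\
&\text{(LASSO)} && \min_{x} \|A x - y\|_2^2 \quad \text{s.t.} \quad \|x\|_1 \le \tau, \\
&\text{(uLASSO)} && \min_{x} \|A x - y\|_2^2 + \lambda \|x\|_1 .
\end{align}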
where $A \in \mathbb{R}^{M \times N}$ ($M$ is the number of data samples, and $N$ the number of basis functions), $\epsilon$ and $\tau$ are scalar tolerances, and $\lambda$ is a scalar regularization parameter. Equation 1 (BP) is known as the basis pursuit problem; Equation 2 (BPDN) is the basis pursuit denoising; Equation 3 (LASSO) is the (original) least absolute shrinkage and selection operator; and Equation 4 (uLASSO) is the unconstrained LASSO (also known as the Lagrangian BPDN, or simply $\ell_1$-regularized least squares). These variants are closely related to each other, and in fact Equations 4, 3, and 2 can be shown to arrive at the same solution for appropriate choices of $\epsilon$, $\tau$, and $\lambda$, although their relationships are difficult to determine a priori [VandenBerg2009a].
The connections among these optimization statements do not imply they are equally easy or difficult to solve, and different algorithms and solvers have been developed to specifically target each form. For example, least angle regression (LARS) [Efron2004] accommodates Equation 3 by iteratively expanding a basis following an equiangular direction of the residual; primal-dual interior-point methods (e.g., [Chen2001a]) can be used to solve Equations 2 and 1 by reformulating them as “perturbed” linear programs; $\ell_1$-magic [Candes2006a] recasts Equation 2 as a second-order cone program and leverages efficient log-barrier optimization; and many more. Greedy algorithms such as orthogonal matching pursuit (OMP) [Davis1997] rely on heuristics, do not target any particular optimization formulation, and have been demonstrated to work very well in many situations.
We focus on the uLASSO problem Equation 4. This unconstrained setup is of significant interest, being related to convex quadratic programming, for which many algorithms and solvers have already been developed [VandenBerg2009a]. The uLASSO problem also shares a close connection with Bayesian statistics, where the minimizer can be interpreted as the posterior mode corresponding to a likelihood encompassing additive Gaussian noise on a linear model and a log-prior of the $\ell_1$ form (i.e., a Laplace prior). Bayesian compressive sensing (BCS) methods [Ji2008, Babacan2009]
take advantage of this perspective, and leverage Bayesian inference methods to explore the posterior distribution, which can be useful in assessing solution robustness around the mode. We do not investigate BCS in this paper, and focus only on non-Bayesian approaches for the time being. Various algorithms have been developed to directly target Equation 4 relying on different mathematical principles, and they may perform differently in practice depending on the problem structure. It is thus valuable to investigate and compare them under problems and scenarios we are interested in. In this study, we perform numerical experiments using several off-the-shelf solvers: l1_ls [Kim2007, Koh2008], sparse reconstruction by separable approximation (SpaRSA) [Wright2009, Wright2009a], conjugate gradient iterative shrinkage/thresholding (CGIST) [Goldstein2010, Goldstein2011], fixed point continuation with active set (FPC_AS) [Wen2010, Yin2010], and alternating direction method of multipliers (ADMM) [Boyd2010, Boyd2011a].
The l1_ls algorithm transforms Equation 4 to a convex quadratic problem with linear inequality constraints. The resulting formulation is solved using a primal interior-point method with logarithmic barrier functions while invoking iterations of conjugate gradient (CG) or truncated Newton. SpaRSA takes advantage of a sequence of related optimization subproblems with quadratic terms and diagonal Hessians, whose special structure leads to rapid solutions and convergence. CGIST is based on forward-backward splitting with adaptive step size and parallel tangent (partan) CG acceleration. The eventual stabilization of the active set from the splitting updates renders the problem quadratic, at which point partan CG produces convergence in a finite number of steps. FPC_AS alternates between two stages: establishing a working index set using iterative shrinkage and line search, and solving a smooth subproblem defined by the (usually lower dimensional) working index set through a second-order method such as CG. Finally, ADMM combines techniques from dual descent and the method of multipliers, and naturally decouples the $\ell_1$ term from the least-squares term, requiring only iterations of small local subproblems where analytical updates can be obtained from ridge regression and soft-thresholding formulas.
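As a concrete illustration of the ADMM decomposition just described, the following is a minimal sketch, not the benchmarked implementation; the objective scaling, penalty `rho`, and fixed iteration count are illustrative assumptions:

```python
import numpy as np

def soft_threshold(v, kappa):
    # Elementwise soft-thresholding: the proximal operator of the l1 norm.
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, y, lam, rho=1.0, n_iter=200):
    """Minimize 0.5*||A x - y||_2^2 + lam*||x||_1 via ADMM (illustrative)."""
    M, N = A.shape
    x = np.zeros(N)
    z = np.zeros(N)   # sparse copy of x enforced by the consensus constraint
    u = np.zeros(N)   # scaled dual variable (running constraint residual)
    # Cache the Cholesky factor used by the ridge-regression x-update.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(N))
    Aty = A.T @ y
    for _ in range(n_iter):
        # x-update: ridge regression with offset rho*(z - u).
        x = np.linalg.solve(L.T, np.linalg.solve(L, Aty + rho * (z - u)))
        # z-update: soft thresholding induces sparsity.
        z = soft_threshold(x + u, lam / rho)
        # Dual update accumulates the disagreement between x and z.
        u = u + x - z
    return z
```

The x-update is exactly the "small local subproblem" with an analytical ridge-regression solution, and the z-update is the closed-form soft-thresholding step, mirroring the split described above.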
We further target the use of CS for polynomial chaos expansion (PCE) construction. PCE is a spectral expansion for random variables, and offers an inexpensive surrogate modeling alternative for representing probabilistic input-output relationships. It is a valuable tool for enabling computationally feasible uncertainty quantification (UQ) analysis of expensive engineering and science applications (e.g., [Ghanem1991, Najm2009, Xiu2009, LeMaitre2010]). The number of PCE basis terms grows drastically with the parameter dimension and polynomial order, while the number of available model evaluations is often small due to high simulation costs. Consequently, constructing sparse PCEs is both desirable and necessary, especially for high-dimensional problems, and research efforts are growing across several fronts to tackle this challenge.
In general, sampling efficiency is a major topic of interest, mainly due to the potential computational expense of each isolated model evaluation. Rauhut and Ward [Rauhut2012] demonstrated advantages of Chebyshev sampling for Legendre PCEs while Hampton and Doostan [Hampton2015]
proposed coherence-optimal sampling for a general orthonormal basis using Markov chain Monte Carlo. Recently, Jakeman, Narayan, and Zhou [Jakeman2017] illustrated the advantages of sampling with respect to a weighted equilibrium measure followed by solving a preconditioned $\ell_1$-minimization problem. Fajraoui, Marelli, and Sudret [Fajraoui2017] also adopted linear optimal experimental design practices to iteratively select samples maximizing metrics based on the Fisher information matrix. Another strategy involves tailoring the objective function directly, such as using weighted $\ell_1$-minimization [Peng2014] for more targeted recovery of PCE coefficients, which are often observed to decay in physical phenomena. Eldred et al. [Eldred2015] described a different perspective that combined CS in a multifidelity framework, which can achieve high overall computational efficiency by trading off between accuracy and cost across different models. All these approaches, however, maintain a static set of basis functions (i.e., regressors or features). A promising avenue of advancement involves adapting the basis dictionary iteratively based on specified objectives. For example, works by Sargsyan et al. [Sargsyan2014] and Jakeman, Eldred, and Sargsyan [Jakeman2015] incorporated the concept of “front-tracking”, where the basis set is iteratively pruned and enriched according to criteria reflecting the importance of each term. In our study, we investigate different numerical aspects of several CS solvers under standard sampling strategies and fixed basis sets.
By promoting sparsity, CS is designed to reduce overfitting. An overfit solution is observed when the error on the training set (i.e., the data used to define the underdetermined linear system) is much smaller than the error on a separate validation set, and when the use of a different training set could lead to entirely different results. Such a solution has poor predictive capability and is thus unreliable. However, CS is not always successful in preventing overfitting, such as when the emphases on fitting the data and on regularization are not properly balanced, or when there is simply too little data. In the context of sparse PCE fitting, Blatman and Sudret [Blatman:2011] developed a LARS-based algorithm combined with a model selection score utilizing leave-one-out cross-validation, which helped both to avoid overfitting and to inform the sampling strategy. In this paper, we also explore approaches for addressing these challenges while focusing on solvers aimed at the uLASSO problem. Specifically, we use techniques to help mitigate overfitting on two levels. First, for a given set of data points, we employ the cross-validation (CV) error to reflect the degree of overfitting of solutions obtained under different $\lambda$, and choose the solution that minimizes the CV error. However, when the sample size is too small, the solutions could be overfit no matter what $\lambda$ is used. A minimally informative sample size is problem-dependent, difficult to determine a priori, and may be challenging to even define and detect in real applications where noise and modeling errors are large. Second, we provide a practical procedure to use information from existing data to help guide decisions as to whether additional samples would be worthwhile to obtain, i.e., a stop-sampling strategy.
While previous work focused on rules based on the stabilization of solution versus sample size [Malioutov2010], we take a goal-oriented approach to target overfitting, and devise a strategy using heuristics based on CV error levels and their rates of improvement.
The main objectives and contributions of this paper are as follows.
We conduct numerical investigations to compare several CS solvers that target the uLASSO problem Equation 4. The scope of study involves using solver implementations from the algorithm authors, and focusing on assessments of linear systems that emerge from PCE constructions. The solvers employ their default parameter settings.
We develop techniques to help mitigate overfitting through (1) an automated selection of the regularization constant $\lambda$ based on the CV error, and (2) a heuristic strategy to guide the stop-sampling decision.
We demonstrate the overall methodology in a realistic engineering application, where a 24-dimensional PCE is constructed with expensive large eddy simulations of supersonic turbulent jet-in-crossflow (JXF). We examine performance in terms of recovery error and timing.
This paper is outlined as follows. Section 2 describes the numerical methodology used to solve the overall CS problem. Section 3 provides a brief introduction to PCE, which is the main form of linear systems we focus on. Numerical results on cases of increasing complexity are presented in Section 4. The paper then ends with conclusions and future work in Section 5.
We target the uLASSO problem Equation 4 using several existing solvers, as outlined in Section 1. Additionally, we aim to reduce overfitting through two techniques: (1) selecting the regularization parameter $\lambda$ via the CV error, and (2) a stop-sampling strategy for efficient sparse recovery.
2.1 Selecting $\lambda$ via cross-validation
Consider Equation 4 for a fixed system $A$ and $y$ (and thus fixed $M$ and $N$). The non-negative constant $\lambda$ sets the relative weight between the two terms, with the first term reflecting how well the training data are fit, and the second imposing sparsity via regularization. A large $\lambda$ heavily penalizes nonzero components of the solution vector, forcing them toward zero (underfitting); a small $\lambda$ emphasizes fitting the training data, and may lead to solutions that are not sparse and that fit only the training data but otherwise do not predict well (overfitting). A useful solution thus requires a careful choice of $\lambda$, which is a problem-dependent and nontrivial task.
The selection of $\lambda$ can be viewed as a model selection problem, where different models are parameterized by $\lambda$. For example, when a Bayesian perspective is adopted, the Bayes factor (e.g., [Kass1995, Wasserman2000]) is a rigorous criterion for model selection, but it generally does not have closed forms for non-Gaussian (e.g., $\ell_1$-form) priors on these linear systems. Quantities simplified from the Bayes factor, such as the Akaike information criterion (AIC) [Akaike1974] and the Bayesian information criterion (BIC) [Schwarz1978], further reduce to formulas involving the maximum likelihood, parameter dimension, and data sample size. In fact, a fully Bayesian approach would assimilate the model-selection problem into the inference procedure altogether, and treat the parameter $\lambda$ (equivalent to the ratio between prior and likelihood “standard deviations”) as a hyper-parameter that is inferred from the data as well.
In this study, we utilize a criterion that more directly reflects and addresses our concern of overfitting: the CV error (e.g., see Chapter 7 in [Hastie2009]), in particular the $K$-fold CV error. The procedure involves first partitioning the full set of training points into $K$ equal (or approximately equal) subsets. For each of the $K$ subsets, a reduced version of the original CS problem is solved:
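A reconstruction of the reduced problem in assumed notation, with $A_{-k}$ and $y_{-k}$ denoting the system with the $k$th subset removed:

\[
x^{(k)}(\lambda) = \arg\min_{x} \; \left\| A_{-k}\, x - y_{-k} \right\|_2^2 + \lambda \left\| x \right\|_1 .
\]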
where $A_{-k}$ denotes $A$ but with the rows corresponding to the $k$th subset removed, $y_{-k}$ is $y$ with the elements corresponding to the $k$th subset removed, and $x^{(k)}$ is the solution vector from solving this reduced CS problem. The residual from validation using the $k$th subset that was left out is therefore
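In assumed notation, with $A_k$ and $y_k$ denoting the rows and elements of the held-out $k$th subset, the validation residual reads:

\[
r_k = A_k\, x^{(k)} - y_k .
\]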
where $A_k$ denotes the submatrix of $A$ containing only the rows corresponding to the $k$th subset, and $y_k$ is $y$ containing only the elements corresponding to the $k$th subset. Combining the residuals from all $K$ subsets, we arrive at the (normalized) $K$-fold CV error:
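A reconstruction in assumed notation, combining the held-out residuals over all $K$ folds and normalizing by the data norm:

\[
e_{\mathrm{CV}}(\lambda) = \frac{\sqrt{\sum_{k=1}^{K} \left\| A_k\, x^{(k)}(\lambda) - y_k \right\|_2^2}}{\left\| y \right\|_2} .
\]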
The CV error thus provides an estimate of the validation error using only the training data set at hand, without needing additional validation points, and reflects the predictive capability of solutions generated by a given solver. The CS problem with $\lambda$ selection through CV error is thus
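A reconstruction in assumed notation, with $e_{\mathrm{CV}}(\lambda)$ denoting the normalized $K$-fold CV error:

\[
\lambda^{*} = \arg\min_{\lambda \ge 0} \; e_{\mathrm{CV}}(\lambda) .
\]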
Note that solving Equation 9 does not require the solution from the full CS system, only the reduced systems.
In practice, solutions to Equations 8 and 5 are evaluated numerically and approximately, and so they depend not only on the problem (i.e., the linear system, the sizes $M$ and $N$, etc.) but also on the numerical solver. Hence, $\lambda^{*}$ should also depend on the solver (encompassing the algorithm, implementation, solver parameters and tolerances, etc.), and consequently so do the resulting solution and its CV error. The numerical results presented later on will involve numerical experiments comparing several different CS solvers.
For simplicity, we solve the minimization problem Equation 9 for $\lambda$ using a simple grid search across a discretized log-$\lambda$ space. More sophisticated optimization methods, such as bisection, linear programming, and gradient-based approaches, are certainly possible. One might be tempted to take advantage of the fact that the solution to Equation 4 is piecewise linear in $\lambda$ [Efron2004]. However, the CV error, especially when numerical solvers are involved, generally no longer enjoys such guarantees. The gain in optimization efficiency would also be small for this one-dimensional optimization problem, especially compared with the much more expensive application simulations and other computational components. Therefore, we do not pursue more sophisticated search techniques.
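The $K$-fold CV selection with a log-$\lambda$ grid search can be sketched as follows. This is a minimal illustration, not the benchmarked setup: a simple ISTA iteration stands in for the off-the-shelf CS solvers, and all names and iteration counts are assumptions:

```python
import numpy as np

def ista(A, y, lam, n_iter=500):
    # Simple iterative shrinkage-thresholding, a stand-in CS solver
    # for 0.5*||A x - y||_2^2 + lam*||x||_1.
    t = 1.0 / np.linalg.norm(A, 2) ** 2   # step size from the spectral norm
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - t * (A.T @ (A @ x - y))   # gradient step on the quadratic
        x = np.sign(g) * np.maximum(np.abs(g) - lam * t, 0.0)  # shrinkage
    return x

def kfold_cv_error(A, y, lam, K=5, solver=ista):
    # Normalized K-fold CV error for one lambda: fit on K-1 folds,
    # accumulate squared residuals on the held-out fold.
    folds = np.array_split(np.arange(len(y)), K)
    sq_err = 0.0
    for idx in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[idx] = False
        x_k = solver(A[mask], y[mask], lam)
        sq_err += np.sum((A[idx] @ x_k - y[idx]) ** 2)
    return np.sqrt(sq_err) / np.linalg.norm(y)

def select_lambda(A, y, lams, K=5):
    # Grid search for the lambda minimizing the K-fold CV error.
    errs = [kfold_cv_error(A, y, lam, K) for lam in lams]
    return lams[int(np.argmin(errs))], errs
```

In use, `lams` would be a log-spaced grid, e.g. `np.logspace(-4, 0, 6)`; only the reduced systems are ever solved, mirroring the remark about Equation 9 above.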
2.2 Stop-sampling strategy for sparse recovery
When the sample size is small, the solutions could be overfit no matter what $\lambda$ is used. Indeed, past work on precise undersampling theorems [Donoho2010], phase-transition diagrams [Donoho2009], and various numerical experiments suggests a drastic improvement in the probability of successful sparse reconstruction when a critical sample size is reached. However, this critical quantity depends on the true solution sparsity, the correlation structure of the matrix $A$, and the numerical methods being used (e.g., the CS solver); predicting it a priori is therefore difficult, especially in the presence of noise and modeling error. Instead, we introduce a heuristic procedure that uses CV error trends from currently available data to reflect the rate of improvement, and to guide decisions on whether additional samples would be worthwhile to obtain.
- Initial sample size
- Sample increment size
- Total sample size thus far
- Available parallel batch size for data gathering
- Number of basis functions (regressors)
- Moving window size for estimating the log-CV error slope
- Tolerance for the log-CV error slope rebound fraction
- Slope criterion activation tolerance
- Tolerance for absolute CV error
- Number of folds in the $K$-fold CV
- Number of grid points in discretizing $\lambda$
- Discretized grid points of $\lambda$
- $\lambda$ value that produces the lowest CV error
- Variable reflecting the choice of CS solver
Before describing our procedure, we first summarize in Table 1 the parameters relevant for stop-sampling and the eventual algorithm we propose in this section. We start with a small initial sample size, and a decision is made whether to obtain an additional batch of new samples or to stop sampling. In the spirit of using only currently available data, we base the decision criteria on the $\lambda$-optimal CV error at the current sample size. More specifically, multiple stop-sampling criteria are evaluated. Our primary criterion is the slope of the log-CV error with respect to the sample size, as an effective error decay rate. A simple approximation involves using a moving window of the past values of the log-CV error and estimating its slope through ordinary least squares. We stop sampling when the current slope estimate rebounds to a certain fraction of the steepest slope estimate encountered so far, which indicates crossing the critical sample size of sharp performance improvement. Since samples are iteratively appended, their nestedness helps produce smoother CV error trends than if entirely new sample sets were generated at each sample size. Nonetheless, oscillations may still be present from numerical computations, and larger choices of the moving window and sample increment can further help with smoothing, but at the cost of lower resolution in sample size and increased influence of non-local behavior. To guard against premature stop-sampling due to oscillations, we activate the slope criterion only after the CV error drops below an activation threshold. Additionally, we also stop sampling if the CV error drops below some absolute tolerance. This is useful for cases where the drop occurs at very small sample sizes and is not captured from the starting sample size.
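The stop-sampling logic above can be sketched as follows. This is a minimal illustration under assumed parameter names and placeholder threshold values, not the paper's algorithm verbatim:

```python
import numpy as np

def should_stop(n_hist, cv_hist, window=4, rebound_frac=0.5,
                activate_tol=1.0, abs_tol=1e-3):
    """Heuristic stop-sampling check from a CV-error history.

    n_hist, cv_hist: sample sizes and lambda-optimal CV errors so far.
    All default thresholds are illustrative placeholders.
    """
    # Absolute-tolerance criterion: CV error already small enough.
    if cv_hist[-1] < abs_tol:
        return True
    # Slope criterion is inactive until enough history exists and the
    # CV error has dropped below the activation threshold.
    if len(n_hist) < window or cv_hist[-1] > activate_tol:
        return False
    log_cv = np.log(cv_hist)
    # Moving-window ordinary-least-squares slopes of log-CV error vs. n.
    slopes = [np.polyfit(n_hist[i:i + window], log_cv[i:i + window], 1)[0]
              for i in range(len(n_hist) - window + 1)]
    steepest = min(slopes)   # most negative (steepest) slope seen so far
    # Stop once the current slope rebounds to a fraction of the steepest,
    # indicating the sharp-improvement regime has been crossed.
    return steepest < 0 and slopes[-1] > rebound_frac * steepest
```

A larger `window` smooths oscillations at the cost of resolution, matching the trade-off discussed above.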
2.3 Overall method and implementation
The pseudocode for the overall method is presented in Algorithm 1, and parameter descriptions can be found in Table 1. We provide some heuristic guidelines below for setting its parameters. These are also the settings used for numerical examples in this paper.
For the number of folds $K$ in the $K$-fold CV error, a small $K$ tends toward low variance and high bias, while a large $K$ tends toward low bias and high variance, as well as higher computational costs since more instances of the reduced problem must be solved [Hastie2009]. For problem sizes encountered in this paper, $K$ between 20 and 50 appears to work well. We revert to leave-one-out CV when the sample size is smaller than $K$.
A reasonable sample increment is 5% of the number of basis functions, which, together with a matching initial sample size, corresponds to a growing uniform grid of 20 nodes over the sample sizes. In practice, sample acquisition is expected to be much more expensive than the CS solves. We then recommend adopting the finest resolution permitted by the available parallel batch size for data gathering (e.g., incrementing by a single sample for serial setups).
Reasonable defaults for the stop-sampling parameters provide a good starting point, though our settings form a somewhat conservative combination (i.e., less likely to stop sampling prematurely, and more likely to arrive at larger sample sizes). More relaxed values may be used, especially for more difficult problems where noise and modeling error are present, and where the CV error may not exhibit a sharp dropoff versus the sample size.
The discretized $\lambda$ grid should cover a good range on the logarithmic scale, for example through 15 log-spaced points. The upper bound can also be set to the theoretical $\lambda$ value at which the solution first becomes identically zero. Experience also indicates that cases with small values of $\lambda$ take substantially longer for the CS solvers to return a solution. We thus set $\lambda$ ranges in our numerical experiments based on some initial trial and error. This could be improved with grid adaptation techniques, so that the $\lambda$-grid can extend beyond its initial bounds when these are not wide enough to capture the CV-optimal $\lambda$, and also refine its resolution to zoom in on that optimum.
3 Polynomial chaos expansion (PCE)
We now introduce the CS problem emerging from constructing sparse PCEs. PCE offers an inexpensive surrogate modeling alternative for representing probabilistic input-output relationships, and is a valuable tool for enabling computationally feasible UQ analysis of expensive engineering and science applications. As we shall see below, the number of columns, $N$, in these PCE-induced linear systems becomes very large under high-dimensional and nonlinear settings. Therefore, it is important to find sparse PCEs, and CS provides a useful framework under which this is possible. We present a brief description of the PCE construction below, and refer readers to several references for further detailed discussions [Ghanem1991, Najm2009, Xiu2009, LeMaitre2010].
A PCE for a real-valued, finite-variance random vector $u$ can be expressed in the form [Ernst2012]
\[
u = \sum_{\alpha} c_{\alpha} \Psi_{\alpha}(\xi),
\]
where $c_{\alpha}$ are the expansion coefficients, $\alpha = (\alpha_1, \ldots, \alpha_d)$ is a multi-index, $d$ is the stochastic dimension (often conveniently set equal to the number of uncertain model inputs), $\xi = (\xi_1, \ldots, \xi_d)$ are a chosen set of independent random variables, and $\Psi_{\alpha}$ are multivariate polynomials of the product form
\[
\Psi_{\alpha}(\xi) = \prod_{i=1}^{d} \psi_{\alpha_i}(\xi_i),
\]
with $\psi_{\alpha_i}$ being degree-$\alpha_i$ polynomials orthonormal with respect to the probability density function $p_i$ of $\xi_i$:
\[
\int \psi_j(\xi_i)\, \psi_k(\xi_i)\, p_i(\xi_i)\, d\xi_i = \delta_{jk}.
\]
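As a concrete sketch of how such a PCE basis produces the regression matrix $A$ used in the CS problem, the following assumes uniform random variables on $[-1, 1]$ with normalized Legendre polynomials and a total-order truncation; all function names are illustrative:

```python
import itertools
import numpy as np
from numpy.polynomial import legendre

def total_order_multi_indices(d, p):
    # All multi-indices alpha with alpha_1 + ... + alpha_d <= p.
    return [a for a in itertools.product(range(p + 1), repeat=d)
            if sum(a) <= p]

def legendre_1d(k, xi):
    # Degree-k Legendre polynomial, rescaled to be orthonormal
    # with respect to the uniform density on [-1, 1].
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    return legendre.legval(xi, coef) * np.sqrt(2 * k + 1)

def pce_basis_matrix(xi_samples, p):
    # Rows: samples of xi; columns: multivariate basis functions Psi_alpha
    # built as products of 1D orthonormal polynomials.
    M, d = xi_samples.shape
    alphas = total_order_multi_indices(d, p)
    A = np.ones((M, len(alphas)))
    for j, alpha in enumerate(alphas):
        for i, k in enumerate(alpha):
            A[:, j] *= legendre_1d(k, xi_samples[:, i])
    return A, alphas
```

The number of columns is $\binom{d+p}{p}$, which illustrates the rapid growth with dimension and order discussed above; for $d = 2$, $p = 3$ there are already 10 basis terms.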