where $A$ is a linear operator taking unknown parameters $x$ to observations $b$. Problem (1) is also known as a Morozov formulation (in contrast to Ivanov or Tikhonov [17]). The operator $A$ can include a transformation to another domain, such as wavelet, Fourier, or curvelet coefficients, as well as compositions of these transforms with other linear operators, such as restriction in interpolation problems. The parameter
$\sigma$ controls the error budget, and is based on an estimate of the noise level in the data.
and machine learning [11, 10], as well as to applied domains including MRI [16]. Seismic data is a key use case [3, 15, 9], where acquisition is prohibitively expensive and interpolation techniques are used to fill in data volumes by promoting parsimonious representations in the Fourier or curvelet domains. Matricization of the data leads to low-rank interpolation schemes [3, 15, 9, 24].
While BPDN uses nonsmooth regularizers (including the $\ell_1$ norm, nuclear norm, and elastic net), the inequality constraint is ubiquitously smooth, and often taken to be the $\ell_2$ norm as in (1). Prior work, including [23, 3, 9, 2], exploits the smoothness of the inequality constraint in developing algorithms for this problem class. Smooth constraints work well when errors are Gaussian, but this assumption fails for seismic data and is often violated in general.
Contributions. The main contribution of this paper is a fast, easily adaptable algorithm for general level-set formulations, including BPDN, with nonsmooth and nonconvex data constraints; we illustrate the efficacy of the approach on large-scale interpolation and denoising problems. To do this, we extend the universal regularization framework of [26] to level-set formulations with nonsmooth/nonconvex constraints. We develop a convergence theory for the optimization approach, and illustrate the practical performance of the new formulations for data interpolation and denoising in both sparse recovery and low-rank matrix factorization.
Roadmap. The paper proceeds as follows. Section II develops the general relaxation framework and approach. Section III specializes this framework to the BPDN setting with nonsmooth, nonconvex constraints. In Section IV we apply the approach to a sparse signal recovery problem and to sparse curvelet reconstruction. In Section V, we extend the approach to a low-rank interpolation framework, which embeds matrix factorization within the BPDN constraint. In Section VI we test the low-rank extension using synthetic examples and data extracted from a full 5D dataset simulated on the complex SEG/EAGE overthrust model.
II Nonsmooth, nonconvex level-set formulations.
We consider the following problem class:
where $\varphi$ and $\psi$ may be nonsmooth and nonconvex, but must have well-defined proximity and projection operators:
Here, $C$ is typically a linear operator that maps $x$ to some transform domain, while $A$ is a linear observation operator also acting on $x$. In the context of interpolation, $A$ is often a restriction operator.
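As a concrete sketch of the two building blocks the method assumes, consider the common choices $\varphi = \|\cdot\|_1$ and an $\ell_2$-ball constraint (illustrative only; any pair of functions with computable proximity and projection operators fits the framework):

```python
import numpy as np

def prox_l1(v, t):
    """Proximity operator of t*||.||_1: componentwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proj_l2_ball(r, sigma):
    """Euclidean projection of r onto the l2 ball of radius sigma."""
    nrm = np.linalg.norm(r)
    return r.copy() if nrm <= sigma else (sigma / nrm) * r
```

Both operators cost a single pass over the vector, which is what makes the overall scheme scalable.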
This setting significantly extends that of [2], who assume $\varphi$ and $\psi$ are convex and use the value function
to solve (2) using root-finding. Variational properties of the value function are fully understood only in the convex setting, and its efficient evaluation requires $\psi$ to be smooth, so that efficient first-order methods are applicable.
Here, we develop an approach to solve any problem of type (2), including problems with nonsmooth and nonconvex $\varphi$ and $\psi$, using only matrix-vector products with $A$ and $C$ and simple nonlinear operators. In special cases, the approach can also use equation solves to gain significant speedups.
with auxiliary variable $w$ and relaxation parameter $\eta > 0$. In contrast to prior relaxation approaches, we use a continuation scheme to force $\eta \downarrow 0$, in order to solve the original formulation (2). Thus the only external algorithmic parameter the scheme requires is $\sigma$, which controls the error budget.
where the indicator function takes the value $0$ if its argument satisfies the constraint $\psi(\cdot) \le \sigma$, and $+\infty$ otherwise. Problem (4) can now be written as
Applying the prox-gradient descent iteration with an appropriate step size
gives the coordinate updates in Algorithm 1.
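For intuition, here is a minimal prox-gradient (ISTA-style) loop for the regularized least squares model that problems of this type reduce to; the $\ell_1$ regularizer and the step size $1/\|A\|_2^2$ are illustrative choices rather than the paper's exact coordinate update:

```python
import numpy as np

def prox_gradient(A, b, lam, iters=500):
    """Prox-gradient for min_x 0.5*||A x - b||^2 + lam*||x||_1.
    Uses step size 1/L with L = ||A||_2^2 (Lipschitz constant of the
    gradient of the quadratic term)."""
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)          # gradient of the smooth part
        v = x - grad / L                  # gradient step
        x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)  # prox step
    return x
```

With $A = I$ a single step already lands on the soft-thresholded solution, which is a useful sanity check.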
Prox-gradient has been analyzed in the general nonconvex setting [4]. However, Problem (5) is the sum of a convex quadratic and a nonconvex regularizer. The rate of convergence for this problem class can be quantified, and [26, Theorem 2], reproduced below, will be very useful here.
Theorem II.1 (Prox-gradient for Regularized Least Squares).
Consider the least squares objective
with a regularizer that is bounded below, and potentially nonsmooth, nonconvex, and non-finite valued. With an appropriate step size, the iterates (6) satisfy
where the displayed quantity is a subgradient (generalized gradient) of the objective at each iterate.
Plugging this expression back in gives a regularized least squares problem in $x$ alone:
Prox-gradient applied to the value function in (7) with step gives the iteration
This iteration, as written, requires forming and applying the system in (7) at each iteration. In practice we compute the update on the fly, as detailed in Algorithm 2. The equivalence of Algorithm 2 to iteration (8) comes from the following derivative formula for value functions [5]:
In order to compute the relevant Lipschitz constant, and apply Theorem II.1, we first prove the following lemma:
Lemma II.3 (Bound on ).
The operator norm is bounded above by .
Considering the function
we know that the gradient is given by , and any Lipschitz bound gives
which means . On the other hand, we can write the right hand side as
Using Theorem 1 of [5], we have that the value function
is differentiable, with . Therefore
is also differentiable, with
This immediately gives the result. ∎
Corollary II.4 (Convergence of Algorithm 2).
The convergence statement comes directly from plugging the estimate of iteration (8) into Theorem II.1. The equivalence of Algorithm 3 with Algorithm 2 is obtained by plugging the step size into each line of Algorithm 2.
An important consequence of Corollary II.4 is that the convergence rate of Algorithm 2 does not depend on $A$ or $C$, in contrast to Algorithm 1, whose rate depends on both matrices (Corollary II.2). The rates of both algorithms are affected by $\eta$. We use continuation in $\eta$, driving $\eta$ to $0$, and warm-starting each subproblem at the previous solution. A convergence theory that takes this continuation into account is left to future work.
[Figure 1: BPDN with random linear operator.]
II-A Inexact Least-Squares Solves.
Algorithm 3 has a provably faster rate of convergence than Algorithm 1. The practical performance of these algorithms is compared in Figure 1, which shows results for a BPDN problem with a nonsmooth regularizer and a nonsmooth constraint. We see a huge performance difference in practice as well as in theory: the proximal gradient descent of Algorithm 1 yields slower cost function decay than solving exactly for $x$ as in Algorithm 3. Indeed, Algorithm 3 admits the fastest cost function decay, as shown in Corollary II.4, albeit at the expense of more operations per iteration; in particular, fully solving the least squares problem in Line 5 is not tractable for large-scale problems. Hence, we implement Algorithm 3 inexactly, using the Conjugate Gradient (CG) algorithm. Figure 1 shows the results when we use 1, 5, and 20 CG iterations. Each CG iteration is implemented using matrix-vector products, and at 20 iterations the results are indistinguishable from those of Algorithm 3 with full solves. Even at 5 iterations, the performance is remarkably close to that of Algorithm 3 with full solves. Algorithm 3 also has a natural warm-start strategy, with the solution from each previous iteration used to initialize the subsequent least squares solve. Using CG with a bounded number of iterations therefore gives fast convergence while saving computational time. This approach is used in the subsequent experiments.
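A minimal matrix-free CG routine of the kind used here can be sketched as follows (a generic textbook CG; the operator, the warm start `x0`, and the iteration cap are the only knobs the inexact scheme needs):

```python
import numpy as np

def cg(matvec, b, x0, iters):
    """Conjugate gradient for the SPD system matvec(x) = b,
    warm-started at x0 and capped at `iters` iterations."""
    x = x0.copy()
    r = b - matvec(x)          # residual
    p = r.copy()               # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < 1e-12:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Only products with the system operator are required, so the normal-equations matrix never needs to be formed explicitly.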
III Application to Basis Pursuit De-noise Models
The Basis Pursuit De-noise problem can be formulated as
where $\varphi$ is classically taken to be the $\ell_1$-norm. In this problem, $x$ represents unknown coefficients that are sparse in a transform domain, while $A$ is a composition of the observation operator with a transform matrix; popular examples of transform domains include discrete cosine transforms, wavelets, and curvelets. The observed, noisy data $b$ resides in the temporal/spatial domain, and $\sigma$ is the misfit tolerance. This problem was famously solved with the SPGL1 algorithm [23] for the smooth $\ell_2$-norm constraint. When the observed data is affected by large sparse noise, a smooth constraint is ineffective. A nonsmooth variant of (9) is very difficult for approaches such as SPGL1, which solves subproblems of the form
Algorithm 3 is simple to implement. The least squares update in step 4 can be computed efficiently using either a direct factorization with the Woodbury identity, or an iterative method in cases where the system is too large to store. For the Woodbury approach, we have
For moderately sized systems, we can store a Cholesky factor of the small system in (10) and use it to implement step 4. However, in the seismic/curvelet experiment described below, the left-hand side of Equation (10) is too large to store in memory, but is positive definite. Hence, we solve the resulting linear system in step 4 of Algorithm 3 with CG, using matrix-vector products. The coefficient update is implemented via the $\ell_1$ proximal operator (soft thresholding), while the residual update requires a projection onto the $\psi$ ball. The projectors used in our experiments are collected in Table II.
The least squares solve for $x$ is trivial when $C$ is an orthogonal matrix or tight frame, so that $C^\top C = I$; this is the case for Fourier transforms, wavelets, and curvelets. When $A$ is a restriction operator, as in many data interpolation problems, $A^\top A$ is a diagonal matrix with zeros and ones on the diagonal, and hence $A^\top A + C^\top C$ is a diagonal matrix with entries either $1$ or $2$; the least squares problem for the $x$ update is then trivial.
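As a sketch of this special case, consider the generic update $\min_x \|Ax - w\|^2 + \|Cx - z\|^2$ with $A$ a restriction (so $A^\top A$ is a 0/1 diagonal) and $C$ orthogonal (so $C^\top C = I$); the variable names and the exact objective scaling are illustrative. The normal equations $(A^\top A + I)x = A^\top w + C^\top z$ reduce to an entrywise divide:

```python
import numpy as np

def x_update(mask, w, C, z):
    """Closed-form minimizer of ||A x - w||^2 + ||C x - z||^2 when
    A restricts to the entries where mask is True (A^T A is 0/1
    diagonal) and C is orthogonal (C^T C = I)."""
    rhs = C.T @ z
    rhs[mask] += w                            # A^T w: scatter w to observed slots
    return rhs / (mask.astype(float) + 1.0)   # diagonal solve, entries 1 or 2
```

No linear system is ever formed; the update is a scatter plus a pointwise divide.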
IV Basis Pursuit De-Noise Experiments
In this application, we consider two examples: the first is a small-scale BPDN problem that illustrates proof of concept for our technique, while the second is an application to de-noising a common source gather extracted from a seismic line simulated using a 2D BG Compass model. The data set has a temporal sampling interval of 4 ms and a spatial sampling of 10 m. For this example, we use curvelets as the sparsifying transform domain. The first example considers the same model as in (9), where we want to enforce sparsity on $x$ while constraining the data misfit. The variable $x$ is a vector with nonzero values on a random subset of its entries and zeros everywhere else; it represents a spike train that we observe using a linear operator $A$, generated with independent standard Gaussian entries, and $b$ is the observed data with large, sparse noise. The noise is generated by placing large values on 10% of the observations and assuming everything else is observed cleanly (i.e., no noise). Here, we test the efficacy of different norms on the residual constraint. With the addition of large, sparse noise to the data, smooth norms on the residual constraint cannot effectively deal with such outlier residuals. With our adaptable formulation, it is easy to enforce sparsity both in the transform domain and in the residuals. Other formulations, such as SPGL1, do not have this capability.
This noise is depicted as the bottom black dashed line in Figure 2. The results are shown in Figure 3 and in Table I. From these, we can clearly see that the $\ell_2$ norm is not effective for sparse noise, even at the correct error budget $\sigma$.
Our approach is resilient to different types of noise since we can easily change the residual ball projection.
This is seen in the almost exact recovery under the $\ell_1$ and $\ell_0$ constraints, with SNRs of 33 and 45, respectively.
The next test of the BPDN formulation is on a common source gather where entries are both omitted and corrupted with synthetic noise. Here, the objective looks for sparsity in the curvelet domain, while the residual constraint seeks to match observed data within a tolerance $\sigma$. First, we note that interpolation alone, without added noise, yields an SNR of approximately 13 for all formulations and algorithms; that is, for all norms in Algorithm 4 and for SPGL1. Here, we again want to enforce sparsity both in the curvelet domain and in the data residual, which SPGL1 and other algorithms lack the capacity to do.
Following the first experiment, we add large sparse noise to a handful of data points; in this case, we add large values to a random 1% of observations (not including omitted entries). The noise added is approximately 120 in amplitude, while the observed data ranges from 0 to 30. The interpolation and denoising results are shown in Figure 4 and Table III. Large, sparse noise cannot be filtered effectively by a smooth norm constraint, using either Algorithm 4 or SPGL1. However, the $\ell_1$ and $\ell_0$ norms effectively handle such noise, and can be optimized using our approach. The SNRs for these implementations are approximately 10 and 11, respectively, approaching that of the noiseless case mentioned above.
[Table: 4D monochromatic interpolation; SPGL1 row: 1.4594, -, 52.5 (early stoppage).]
V Extension to Low-Rank Models
Treating the data as a matrix gives additional regularization tools; in particular, low-rank structure in particular domains. The BPDN formulation for residual-constrained low-rank interpolation is given by
where $\mathcal{A}$ is a linear masking operator from the full data $X$ to the observed (noisy) data $b$, and $\sigma$ is the misfit tolerance. The nuclear norm $\|X\|_*$ is the $\ell_1$ norm of the singular values of $X$. Solving problem (11) requires a decision variable the size of the data, as well as updates to this variable that require SVDs at each iteration. It is much more efficient to model $X$ as a product of two matrices $L$ and $R$, given by
where $L$ and $R$ are the low-rank factors of the data. The solution is then guaranteed to be at most rank $k$, the number of columns of $L$ and $R$; in addition, the regularizer $\frac{1}{2}(\|L\|_F^2 + \|R\|_F^2)$ is an upper bound for $\|LR^\top\|_*$, the sum of singular values of $LR^\top$, further penalizing rank by proxy. The decision variables then have combined dimension $(n+m)k$ for $n \times m$ data, which is much smaller than the $nm$ variables required by convex formulations. When $\psi$ is smooth, such problems are solved using a continuation that interchanges the roles of the objective and constraint, solving a sequence of problems in which the misfit is minimized over a regularizer ball using projected gradient; an approach we call SPGLR below.
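The factorized regularizer's bound on the nuclear norm can be sanity-checked numerically on an arbitrary example (the inequality $\frac{1}{2}(\|L\|_F^2 + \|R\|_F^2) \ge \|LR^\top\|_*$ holds for any factor pair; the matrices below are arbitrary):

```python
import numpy as np

# Check that (||L||_F^2 + ||R||_F^2)/2 upper-bounds the nuclear norm
# of X = L R^T for one fixed factor pair.
L = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
R = np.array([[2.0, 0.0], [1.0, 1.0]])
X = L @ R.T
nuclear = np.linalg.norm(X, 'nuc')   # sum of singular values of X
bound = 0.5 * (np.linalg.norm(L, 'fro')**2 + np.linalg.norm(R, 'fro')**2)
assert nuclear <= bound + 1e-10
```

Equality is attained when $L$ and $R$ come from a balanced SVD-based factorization of $X$, which is why minimizing the factor norms penalizes rank by proxy.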
When $\psi$ is not smooth, SPGLR does not work, and there are no available implementations for (12). Nonsmooth $\psi$ arise when we want the residual to lie in, for example, an $\ell_1$-norm ball, so that we are robust to outliers in the data and can exactly fit inliers.
We now extend Algorithm 3 to this case. For any $\psi$ (smooth or nonsmooth), we introduce a latent variable for the data matrix, and solve
with $\eta$ a parameter that controls the degree of relaxation; as $\eta \downarrow 0$, we recover (12). The relaxation allows a simple block-coordinate descent, detailed in Algorithm 4.
Algorithm 4 is also simple to implement. It requires two least squares solves (for $L$ and $R$), which are inherently parallelizable. It also requires a projection of the updated data matrix estimate onto the $\sigma$-level set of the misfit penalty $\psi$. This step is detailed below.
For unobserved data, the update of the latent data matrix is simply the current prediction $LR^\top$. For observed data, the update step is given by solving
Using a simple substitution, replacing the latent observed block by its residual against the data, we get
which is precisely the projection of the residual onto the $\sigma$-level set of $\psi$. We use the same projectors for $\psi$ as in Section IV; see Table II. The convergence criterion for Algorithm 4 is based on the optimality of the quadratic subproblems in $L$ and $R$ and on a feasibility measure for the latent data matrix, though in practice we compare algorithms on a fixed computational budget. This block-coordinate descent scheme converges to a stationary point of Equation (13) by [21, Theorem 4.1].
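The projection step above can be sketched as follows, assuming $\psi = \|\cdot\|_1$ (one of the penalties in Table II); the names `b_obs` and `pred_obs`, denoting the observed data and the corresponding entries of the current prediction $LR^\top$, are illustrative:

```python
import numpy as np

def proj_l1_ball(v, sigma):
    """Euclidean projection of v onto {u : ||u||_1 <= sigma},
    via sorting (Duchi-style thresholding)."""
    if np.abs(v).sum() <= sigma:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, v.size + 1) > css - sigma)[0][-1]
    tau = (css[k] - sigma) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def w_update(b_obs, pred_obs, sigma):
    """Observed-block update: substitute the residual r = b - w,
    project r onto the sigma level set of ||.||_1, then map back."""
    r = proj_l1_ball(b_obs - pred_obs, sigma)
    return b_obs - r
```

The substitution makes the constrained quadratic subproblem a pure ball projection, so changing $\psi$ only changes the projector.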
Implementing block-coordinate descent on these subproblems until convergence produces the completed low-rank matrix. We iterate until the relaxation penalty is sufficiently small or a maximum number of iterations is reached. In the next section, we develop an application of this method to seismic interpolation and denoising.
VI 4D Matrix Completion with De-noising
There are two main requirements when using the rank-minimization framework for seismic data interpolation and denoising: (i) the underlying seismic data should exhibit low-rank structure (fast-decaying singular values) in some transform domain, and (ii) subsampling and noise should destroy that low-rank structure (slow the singular value decay) in the same domain. To exploit the low-rank structure during interpolation and denoising, we follow the matricization strategy proposed in [14]. The (source-x, source-y) matricization, i.e., placing both source coordinates along the columns (Figure 6(a)), gives slow decay of the singular values (Figure 5(a)), while the (source-x, receiver-x) matricization (Figure 6(c)) gives fast decay of the singular values (Figure 5(b)). To understand the effect of subsampling on the low-rank structure, we remove 50% of the sources. Subsampling destroys the fast singular value decay in the (source-x, receiver-x) matricization, but not in the (source-x, source-y) matricization. This is because missing sources are missing columns in the (source-x, source-y) matricization, but missing sub-blocks in the (source-x, receiver-x) matricization (Figure 6(b)). The latter is therefore more effective for low-rank interpolation.
Similar to the BPDN experiments, we want to show that nonsmooth constraints on the data residual are effective for dealing with large, sparse noise. The smooth $\ell_2$ norm most common in BPDN problems fails in such examples, so non-smooth norms on the residuals lead to better data estimates. Thus, the goal of the experiments below is to show that enforcing sparsity in the singular values (i.e., low rank) together with a sparse residual constraint handles large, sparse noise better than the smooth residual constraints used by most contemporary algorithms.
VI-A Experiment Description
This example demonstrates the efficacy of the proposed approach using data from a 5D dataset simulated on the complex SEG/EAGE overthrust model [1]. The model is discretized on a regular grid, and the simulated data contains receivers sampled at 50 m and sources sampled at 100 m. We apply the Fourier transform along the time axis and extract a frequency slice at 10 Hz, as shown in Figure 7(a), which is a 4D object (source-x, source-y, receiver-x, and receiver-y). We eliminate 80% of the sources and add large sparse outliers drawn from a Gaussian distribution (mean zero and variance on the order of the largest value in the corresponding source). The 10 generated values with the highest magnitudes are kept, and these are randomly added to observations in the remaining sources (Figure 7(f)). The largest value of our dataset is approximately 40, while the smallest is close to zero. Thus, we are essentially increasing or decreasing 1% of the entries by several orders of magnitude, which contaminates the data significantly, especially when the original entry was nearly zero. The objective is to recover missing sources and eliminate noise from observed data. We use a fixed rank for the factors $L$ and $R$, and run all algorithms for 150 iterations, using a fixed computational budget. We perform three experiments on the same dataset: 1) de-noising only (Figure 7(c)); 2) interpolation only (Figure 7(d)); and 3) combined interpolation and de-noising (Figure 7(f)). Since we have ground truth, we pick $\sigma$ to be the exact misfit between the generated noisy data and the true data; for the $\ell_0$ norm, a cardinality measure, $\sigma$ is set to the number of noisy points added.
Tables IV-VI display SNR values for the different algorithms and formulations for the three experiments, and Figures 8-10 display the results for a randomly selected set of sources. Even a small number of outliers can greatly impact the quality of low-rank de-noising and interpolation for the standard, smoothly residual-constrained algorithms. The de-noising only results (Figure 8, Table IV) show that all methods perform well when all sources are available. The interpolation only results (Figure 9, Table V) show that all constraints perform well in interpolating the missing data. This makes sense, as all algorithms simply favor the low-rank structure of the data. However, the combined de-noising and interpolation experiment shows that the $\ell_1$ norm approach does far better than any smooth norm in comparable time. Table VI shows that when data for similar sources is absent, the smoothly-constrained formulations fail completely. When noise is added to the low-amplitude section of the observed data, the smoothly-constrained norms fail drastically, while the $\ell_1$ norm effectively removes the errors. This is starkly evident in Figures 10(a)-10(e), where all except Figure 10(e) are essentially noise; the result is supported by the SNR values in Table VI. While Figures 10(a)-10(d) mostly capture the structure of the data where there were large values (i.e., where the seismic wave is observed in the upper left corner of each source), only the $\ell_1$ norm captures the areas of lower-energy data.
[Tables: SNR results for 4D monochromatic de-noising, interpolation, and combined de-noising & interpolation.]
We proposed a new approach for level-set formulations, including basis pursuit denoise and residual-constrained low-rank formulations. The approach is easily adapted to a variety of nonsmooth and nonconvex data constraints. The resulting problems are solved using Algorithms 2 and 4, which require only that the penalty $\psi$ has an efficient projector. The algorithms are simple, scalable, and efficient. Sparse curvelet de-noising and low-rank interpolation of a monochromatic slice from a 4D seismic data volume demonstrate the potential of the approach.
A particular quality of the seismic denoising and interpolation problem is that the amplitudes of the signal have significant spatial variation; errors in the data are a much larger problem for low-amplitude regions. This makes it very difficult to obtain reasonable results using Gaussian misfits and constraints. Nonsmooth exact formulations (particularly the $\ell_1$ and $\ell_0$ constraints) appear to be extremely well-suited for this magnified heteroscedastic issue.
The authors acknowledge support from the Department of Energy Computational Science Graduate Fellowship, which is provided under grant number DE-FG02-97ER25308, and the Washington Research Foundation Data Science Professorship.
-  F. Aminzadeh, N. Burkhard, L. Nicoletis, F. Rocca, and K. Wyatt. SEG/EAEG 3-D modeling project: 2nd update. The Leading Edge, 13(9):949–952, 1994.
-  A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy. Level-set methods for convex optimization. To appear in Mathematical Programming, Series B., 2018.
-  A. Y. Aravkin, R. Kumar, H. E. Mansour, B. Recht, and F. J. Herrmann. Fast methods for denoising matrix completion formulations, with applications to robust seismic data interpolation. SIAM J. Scientific Computing, 36, 2014.
-  H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
-  B. M. Bell and J. V. Burke. Algorithmic differentiation of implicit functions and optimal values. In Advances in Automatic Differentiation, pages 67–77. Springer, 2008.
-  E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies. IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
-  S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
-  C. Da Silva and F. J. Herrmann. Optimization on the hierarchical Tucker manifold: applications to tensor completion. Linear Algebra and its Applications, 481:131–173, 2015.
-  D. Davis and W. Yin. Convergence rate analysis of several splitting schemes. In Splitting Methods in Communication, Imaging, Science, and Engineering, pages 115–163. Springer, 2016.
-  B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
-  F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455–1480, 1998.
-  F. J. Herrmann and G. Hennenfent. Non-parametric seismic data recovery with curvelet frames. Geophysical Journal International, 173(1):233–248, 2008.
-  A. Kadu and R. Kumar. Decentralized full-waveform inversion. Submitted to EAGE on January 15, 2018, 2018.
-  R. Kumar, C. Da Silva, O. Akalin, A. Y. Aravkin, H. Mansour, B. Recht, and F. J. Herrmann. Efficient matrix completion for seismic data reconstruction. Geophysics, 80(5):V97–V114, 2015.
-  R. Kumar, O. López, D. Davis, A. Y. Aravkin, and F. J. Herrmann. Beating level-set methods for 5-d seismic data interpolation: A primal-dual alternating approach. IEEE Transactions on Computational Imaging, 3(2):264–274, June 2017.
-  M. Lustig, D. Donoho, and J. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58:1182–95, 2007.
-  L. Oneto, S. Ridella, and D. Anguita. Tikhonov, Ivanov and Morozov regularization for support vector machine learning. Machine Learning, 103(1):103–136, 2016.
-  B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3):471–501, Aug. 2010.
-  M. D. Sacchi, T. J. Ulrych, and C. J. Walker. Interpolation and extrapolation using a high-resolution discrete Fourier transform. IEEE Transactions on Signal Processing, 46(1):31–38, Jan 1998.
-  J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1050, March 2006.
-  P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109(3):475–494, 2001.
-  E. Van Den Berg and M. P. Friedlander. Probing the pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2008.
-  E. van den Berg and M. P. Friedlander. Probing the pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput., 31(2):890–912, Nov. 2008.
-  A. Yurtsever, M. Udell, J. A. Tropp, and V. Cevher. Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage. ArXiv e-prints, Feb. 2017.
-  P. Zheng and A. Aravkin. Fast methods for nonsmooth nonconvex minimization. ArXiv e-prints, Feb. 2018.
-  P. Zheng, T. Askham, S. L. Brunton, J. N. Kutz, and A. Y. Aravkin. A Unified Framework for Sparse Relaxed Regularized Regression: SR3. ArXiv e-prints, July 2018.