1 Introduction
This paper revolves around the Variable Projection technique in nonsmooth optimization. One of the prominent early references on the topic [11] concentrates on separable least-squares problems, which have numerous applications in chemistry, mechanical systems, neural networks, and telecommunications; see the surveys [12] and [18], and references therein. Setting the stage, consider a separable least-squares problem
\[
\min_{x,\,y}\ \ \tfrac{1}{2}\,\|A(y)\,x - b\|^{2}, \tag{1}
\]
where the matrix-valued map $y \mapsto A(y)$ is smooth in the parameter $y$. One may formally eliminate the variable $x$ by rewriting the problem as
\[
\min_{y}\ \ \widetilde f(y), \qquad\text{where}\quad \widetilde f(y) \;:=\; \min_{x}\ \tfrac{1}{2}\,\|A(y)\,x - b\|^{2}. \tag{2}
\]
Though $\widetilde f$ is implicitly defined, it can be explicitly evaluated, since the objective of (1) is a convex quadratic in $x$. The variable projection technique for this class of problems, going back to [11], then aims to solve the original problem (1) by instead running a nonlinear optimization solver on $\widetilde f$. The authors of [27] showed that when the Gauss-Newton method for (1) converges superlinearly, so do Gauss-Newton variants for (2); moreover, empirical evidence suggests that the latter schemes outperform the former.
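To make the idea concrete, the following sketch evaluates the reduced function of (2) by a linear least-squares solve in $x$ and hands it to an off-the-shelf solver in $y$. The two-exponential model $A(y)$, the synthetic data, and the derivative-free outer solver are hypothetical choices made only for illustration; they are not the Gauss-Newton variants of [27].

```python
# A minimal sketch of variable projection for a separable least-squares model
# b ~ A(y) x, in the spirit of (1)-(2). The two-exponential model A(y) and the
# synthetic data below are hypothetical choices made only for illustration.
import numpy as np
from scipy.optimize import minimize

t = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(0)
A = lambda y: np.exp(-np.outer(t, y))                # columns exp(-y_j * t)
b = A(np.array([1.0, 4.0])) @ np.array([2.0, -1.0]) + 0.01 * rng.standard_normal(t.size)

def f_reduced(y):
    """Evaluate the projected function (2): eliminate x by a linear LS solve."""
    Ay = A(y)
    x, *_ = np.linalg.lstsq(Ay, b, rcond=None)
    r = Ay @ x - b
    return 0.5 * r @ r

# Any nonlinear solver can now be run on the reduced problem in y alone.
res = minimize(f_reduced, np.array([0.5, 2.0]), method="Nelder-Mead")
print(res.x)   # estimates of the nonlinear parameters y
```

A Gauss-Newton-type outer solver would exploit the same reduced residual; the derivative-free choice above simply keeps the sketch short.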
The underlying principle is much broader than the class of nonlinear least-squares problems. For example, the authors of [5, 4] consider the class of problems
\[
\min_{x,\,\theta}\ \ f(x, \theta), \tag{3}
\]
where $f$ is a smooth function, the vector $x$ models the primary set of parameters, while $\theta$ is a secondary set of nuisance parameters (such as variance, tuning, or regularization). Applying the variable projection technique requires approximate evaluation of the function
\[
\widetilde f(x) \;:=\; \min_{\theta}\ f(x, \theta),
\]
along with its first (and potentially second) derivatives; such formulas are readily available under mild conditions [5].

There is significant practical interest in nonsmooth functions $f$, generalizing (3). Nonsmooth regularization in both $x$ and $\theta$ occurs often in high-dimensional and signal-processing settings, as well as in PDE-constrained optimization and machine learning problems. In many cases, problem-specific projection-based algorithms have been developed. Our goal is to provide a unifying framework for a wide range of problems, to illustrate its current uses in a range of fields, and to explore new applications.
The outline of the paper is as follows. In Section 2, we present important applications of nonsmooth variable projection algorithms. Throughout, we emphasize the subtleties involved, and in several cases propose novel techniques. In Section 3, we justify some technical claims made in Section 2, using basic techniques of variational analysis. Section 4 presents promising numerical results with applications to exponential fitting, sparse deconvolution, and robust learning formulations. Conclusions complete the paper.
2 Nonsmooth Variable Projection in Applications
As alluded to in the introduction, the variable projection technique has been widely used in different application domains. In this section, we describe a number of notable examples: PDE-constrained optimization, exponential data fitting, robust inference, and multiple kernel learning. We emphasize that despite the variety of these applications, the underlying algorithmic techniques all fall within the same paradigm. However, there are subtle modifications that may be required to make the scheme work, which we emphasize when appropriate. As a warm-up for the general technique considered, we begin with ODE- or PDE-constrained problems.
2.1 PDE-constrained optimization
Optimization problems with ODE and PDE constraints occur often when modeling physical systems; see e.g. optimal control [10, 22] and inverse problems in geophysics [13, 6], astrophysics [14], and medical imaging [1]. Setting notation, consider the problem
\[
\min_{y,\,u}\ \ h(y, u) \quad\text{subject to}\quad A(y)\,u = q(y), \tag{4}
\]
where $u$ is the state variable (a vector field), the system $A(y)\,u = q(y)$ is an ODE or PDE with parameters of interest $y$, and $h$ is an objective that we aim to minimize (e.g., a data misfit). Here we assume that $h$ is lower bounded and convex in $u$. Upon discretization, $u$ becomes a high-dimensional vector. For the current exposition, we will assume that the PDE is linear. Abusing notation slightly, we assume that the system (4) is already discretized, and is in the form
\[
\min_{y,\,u}\ \ h(y, u) \quad\text{subject to}\quad A(y)\,u = q(y), \tag{5}
\]
where $A(y)$ is a square invertible matrix depending smoothly on $y$, the function $q(y)$ is smooth, and the function $h(y, \cdot)$ is convex for each $y$.

In large-scale applications, it is impractical to store and update the entire vector $u$. Instead, adjoint-state methods compute $u$ on the fly and optimize in $y$ alone. Observe that the problem is equivalent to minimizing the smooth function $\widetilde f(y) := h\bigl(y, A(y)^{-1} q(y)\bigr)$, and it is well known [13, 23] that the gradient $\nabla \widetilde f(y)$ is given by
\[
\nabla \widetilde f(y)_i \;=\; \partial_{y_i} h(y, u) \;+\; v^{\top}\bigl(\partial_{y_i} q(y) - \partial_{y_i} A(y)\,u\bigr),
\]
where the pair $(u, v)$ satisfies the equations
\[
A(y)\,u = q(y), \qquad A(y)^{\top} v = \nabla_u h(y, u). \tag{6}
\]
A variational viewpoint elucidates the formula. Indeed, basic convex duality (e.g. [7, Corollary 3.3.11]) implies that $\widetilde f(y)$ can equivalently be written as
\[
\widetilde f(y) \;=\; \min_{u}\ \max_{v}\ \; h(y, u) + \bigl\langle v,\ q(y) - A(y)\,u \bigr\rangle. \tag{7}
\]
Now observe that equation (6) simply says that $(u, v)$ is a saddle point of (7), while the description of $\nabla \widetilde f(y)$ is the partial gradient with respect to $y$ of the inner function in (7), evaluated at $(u, v)$.
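For readers who prefer code, here is a minimal sketch of the adjoint-state gradient computation in (6): one forward solve, one adjoint solve, and an accumulation of the $y$-derivatives of $A(y)$. The small dense matrices, the quadratic misfit $h$, and the $y$-independent right-hand side $q$ are hypothetical stand-ins for a discretized PDE.

```python
# A minimal sketch of the adjoint-state gradient for f(y) = h(u(y)) with
# A(y) u(y) = q, as in (5)-(6). The small dense A(y), the quadratic misfit h,
# and the y-independent right-hand side q are hypothetical stand-ins.
import numpy as np

n = 20
rng = np.random.default_rng(1)
B = [rng.standard_normal((n, n)) for _ in range(2)]   # dA/dy_i, held fixed here
q, d = rng.standard_normal(n), rng.standard_normal(n)
A = lambda y: np.eye(n) + y[0] * B[0] + y[1] * B[1]

def value_and_gradient(y):
    u = np.linalg.solve(A(y), q)                      # forward solve: A(y) u = q
    v = np.linalg.solve(A(y).T, u - d)                # adjoint solve: A(y)^T v = grad_u h(u)
    # df/dy_i = -v^T (dA/dy_i) u   (q does not depend on y in this sketch)
    g = np.array([-v @ (Bi @ u) for Bi in B])
    return 0.5 * np.sum((u - d) ** 2), g

val, grad = value_and_gradient(np.array([0.1, -0.2]))
```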
Penalty methods for PDE-constrained optimization
In some applications, it may be desirable to relax the constraint and formulate a penalized problem, introduced in [35],
\[
\min_{y,\,u}\ \ h(y, u) + \lambda\, \rho\bigl(A(y)\,u - q(y)\bigr),
\]
where $\lambda > 0$ is a penalty parameter and $\rho$ is a strongly convex, smooth penalty function (e.g. $\rho = \tfrac12\|\cdot\|^{2}$). The variable projection technique is immediate here. An easy application of [25, Theorem 10.58] shows that the function
\[
\widetilde f_{\lambda}(y) \;:=\; \min_{u}\ \ h(y, u) + \lambda\, \rho\bigl(A(y)\,u - q(y)\bigr) \tag{8}
\]
is differentiable, and its derivative is simply the partial derivative in $y$ of the inner function in (8), evaluated at the minimizer $u(y)$. For an extended explanation, see Section 3. Notice that in the case of a least-squares objective $h$, only one equation solve is needed to compute $u(y)$ and hence the derivative $\nabla \widetilde f_{\lambda}(y)$. This is in contrast to the adjoint method, where another equation solve is needed to compute the adjoint variable $v$. For promising numerical results, see [35]. Hence the variable projection technique, even in this rudimentary setting, can yield surprising computational benefits.
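A minimal sketch of this computation follows, assuming the least-squares objective $h(u) = \tfrac12\|P u - d\|^{2}$ and the penalty $\rho = \tfrac12\|\cdot\|^{2}$ mentioned above: a single regularized linear solve yields $u(y)$, and the gradient of $\widetilde f_{\lambda}$ is the partial $y$-derivative of the inner objective at $u(y)$. The matrices below are hypothetical stand-ins.

```python
# A minimal sketch of the penalized formulation (8) with h(u) = 0.5*||P u - d||^2
# and rho = 0.5*||.||^2: one regularized linear solve yields u(y), and the
# gradient of the projected function is the partial y-derivative of the inner
# objective at u(y). A(y), P, q, d are hypothetical stand-ins.
import numpy as np

n, lam = 20, 10.0
rng = np.random.default_rng(2)
B = [rng.standard_normal((n, n)) for _ in range(2)]
P = rng.standard_normal((5, n))
q, d = rng.standard_normal(n), rng.standard_normal(5)
A = lambda y: np.eye(n) + y[0] * B[0] + y[1] * B[1]

def penalty_value_and_gradient(y):
    Ay = A(y)
    u = np.linalg.solve(P.T @ P + lam * Ay.T @ Ay, P.T @ d + lam * Ay.T @ q)
    r_pde = Ay @ u - q                                    # PDE residual at the minimizer u(y)
    val = 0.5 * np.sum((P @ u - d) ** 2) + 0.5 * lam * r_pde @ r_pde
    g = np.array([lam * r_pde @ (Bi @ u) for Bi in B])    # partial y-derivative at u(y)
    return val, g

val, grad = penalty_value_and_gradient(np.array([0.1, -0.2]))
```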
2.2 Exponential data-fitting
In this section, we present the general class of exponential data-fitting problems – one of the prime applications of variable projection [21]. The general formulation of these problems starts with a signal model of the form
\[
b_i \;=\; \sum_{j=1}^{n} x_j\, \phi_j(t_i;\, y), \qquad i = 1, \dots, m,
\]
where $x_j$ are unknown weights, $b_i$ are the measurements, and $\phi_j$ are given functions that depend on an unknown parameter $y$. Some examples of this class are given in Table 1.
problem                 | known parameters | unknown parameters
pharmacokinetic         | time             | decay rate
signal classification   | distances        | direction
radial basis functions  | locations        | center and scale

Table 1: Examples of exponential data-fitting problems.
Introducing the matrix $A(y)$ with entries $A_{ij}(y) = \phi_j(t_i;\, y)$, we express the noisy measurements as
\[
b \;=\; A(y)\,x + \epsilon.
\]
A natural approach is then to formulate a nonlinear least-squares problem,
\[
\min_{x,\,y}\ \ \tfrac12\,\|A(y)\,x - b\|^{2},
\]
which is readily solved using the classic variable projection algorithm. In practice, however, we may not know the number of terms, $n$, to include. We can of course overparametrize the problem, choosing a rather large value for $n$, but this may lead to unreliable estimates. By including a sparsity prior on $x$, we force a sparser solution. We arrive at the formulation
\[
\min_{x,\,y}\ \ \tfrac12\,\|A(y)\,x - b\|^{2} + \lambda\,\|x\|_{1}.
\]
Note also that we can replace the data-misfit term to account for the statistics of the noise. Additional constraints on $x$ may be added as well [19, 33]. The variable projection approach allows us to efficiently solve this problem by projecting out $x$, using any suitable method to solve LASSO problems (we use IPsolve [3]), and using a nonlinear optimization method to find the optimal $y$. Indeed, we will see in Section 3 that, provided $A(y)$ has full column rank for all $y$, the function
\[
\widetilde f(y) \;:=\; \min_{x}\ \ \tfrac12\,\|A(y)\,x - b\|^{2} + \lambda\,\|x\|_{1} \tag{9}
\]
is differentiable with gradient $\nabla \widetilde f(y)_i = \bigl(A(y)\,x(y) - b\bigr)^{\top} \partial_{y_i} A(y)\,x(y)$, where $x(y)$ achieves the minimum. If $A(y)$ does not have full column rank, we can add a quadratic regularizer in $x$ and obtain an analogous result. We note that numerical optimization in both variables $(x, y)$, without variable projection, performs poorly in comparison to the reduced approach; see Section 4 and Figure 2.
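The following sketch illustrates the reduced approach for (9): the inner LASSO in $x$ is solved by a plain proximal-gradient (ISTA) loop — a stand-in for a solver such as IPsolve — and the outer gradient in $y$ uses the formula above at the computed minimizer. The exponential model $A(y)$ is again a hypothetical choice.

```python
# A minimal sketch of nonsmooth variable projection for (9): the inner LASSO in
# x is solved by a simple proximal-gradient (ISTA) loop, and the gradient in y
# uses the minimizer x(y). The model A_{ij} = exp(-y_j t_i) is hypothetical.
import numpy as np

t, lam = np.linspace(0.0, 1.0, 100), 0.1
A = lambda y: np.exp(-np.outer(t, y))

def dA_col(y, j):
    """Derivative of the j-th column of A(y) with respect to y_j."""
    return -t * np.exp(-t * y[j])

def inner_lasso(Ay, b, iters=500):
    x, L = np.zeros(Ay.shape[1]), np.linalg.norm(Ay, 2) ** 2
    for _ in range(iters):
        z = x - Ay.T @ (Ay @ x - b) / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
    return x

def reduced_value_and_grad(y, b):
    Ay = A(y)
    x = inner_lasso(Ay, b)
    r = Ay @ x - b
    val = 0.5 * r @ r + lam * np.sum(np.abs(x))
    g = np.array([(r @ dA_col(y, j)) * x[j] for j in range(y.size)])
    return val, g
```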
2.3 Trimmed Robust Formulations in Machine Learning
Data contamination is a significant challenge for a range of inference problems. An alternative to robust penalties is the trimmed approach [15, 28], where one solves an inference problem by minimizing over the $h$ smallest residuals. Suppose that our inference problem can be written as
\[
\min_{x}\ \ \sum_{i=1}^{m} g_i(x) + r(x),
\]
where $g_i(x)$ represents the error or the negative log-likelihood of the $i$'th component, while $r$ is any regularizer. The trimmed approach aims to only use the $h$ “best” components, treating the rest as “outliers” to be excluded. However, correct classification of these residuals can only be done once $x$ is known. A natural approach is to use the $h$ smallest residuals at each iteration of a numerical scheme. Recently, such a scheme was used to develop robust graphical models [37], using a sparsifying regularizer $r$.

Consider the reformulation
\[
\min_{x}\ \Bigl[\,\min_{w \in \Delta_h \cap B_\infty}\ \langle w,\, g(x)\rangle \Bigr] + r(x), \tag{10}
\]
where $\Delta_h := \{w \geq 0 : \sum_i w_i = h\}$ is a scaled simplex, and $B_\infty := \{w : \|w\|_\infty \leq 1\}$ is the infinity-norm unit ball. Variable projection naively seems directly applicable, since the inner function simply selects the $h$ smallest $g_i(x)$'s. However, it is easy to see that this function is nonsmooth in the simplest of cases; see Fig. 1.
Instead, we propose a regularized formulation:
\[
\min_{x}\ \Bigl[\,\min_{w \in \Delta_h \cap B_\infty}\ \langle w,\, g(x)\rangle + \tfrac{\mu}{2}\,\|w\|^{2} \Bigr] + r(x). \tag{11}
\]
We then seek to solve the problem $\min_{x}\ \widetilde f(x) + r(x)$, where
\[
\widetilde f(x) \;:=\; \min_{w \in \Delta_h \cap B_\infty}\ \langle w,\, g(x)\rangle + \tfrac{\mu}{2}\,\|w\|^{2}.
\]
Here we use the notation $g(x) := \bigl(g_1(x), \dots, g_m(x)\bigr)$. A standard computation shows that $\widetilde f$ is indeed differentiable with gradient $\nabla \widetilde f(x) = \sum_i \bar w_i\, \nabla g_i(x)$, where $\bar w$ is the nearest point to $-g(x)/\mu$ in the set of interest $\Delta_h \cap B_\infty$. Projection onto this set has been recently considered and implemented [36]. Hence standard methods are directly applicable to solve (11), including prox-gradient for nonsmooth $r$, provided $\nabla \widetilde f$ is Lipschitz continuous. In the numerical section of this paper, we illustrate the approach on a logistic regression model with data contamination; a minimal computational sketch of the smoothed inner step is given below. In Section 3, we observe that the regularization technique just described fits into a broader paradigm.
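The following sketch evaluates the smoothed projected function $\widetilde f$ and its gradient, given the residuals $g(x)$ and their gradients. The projection onto the capped simplex $\Delta_h \cap B_\infty$ is computed here by a simple bisection on the dual shift; any dedicated implementation could be substituted.

```python
# A minimal sketch of the smoothed inner problem in (11): the weights are the
# projection of -g(x)/mu onto the capped simplex {w : 0 <= w <= 1, sum(w) = h},
# and the gradient is the weighted sum of the residual gradients stated above.
import numpy as np

def project_capped_simplex(v, h, iters=60):
    lo, hi = v.min() - 1.0, v.max()                   # bracket for the shift tau
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        lo, hi = (tau, hi) if np.clip(v - tau, 0.0, 1.0).sum() > h else (lo, tau)
    return np.clip(v - 0.5 * (lo + hi), 0.0, 1.0)

def trimmed_value_and_grad(g_vals, g_grads, h, mu):
    """g_vals: (m,) residuals g_i(x); g_grads: (m, d) rows are grad g_i(x)."""
    w = project_capped_simplex(-g_vals / mu, h)       # minimizer of the smoothed inner problem
    return w @ g_vals + 0.5 * mu * w @ w, g_grads.T @ w
```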
2.4 Multiple Kernel Learning
Kernel methods are a powerful technique in classification and prediction [32, 34]. In such problems we are given a set of samples $a_1, \dots, a_m$ and corresponding labels $y_1, \dots, y_m \in \{-1, +1\}$, and the goal is to classify new samples. To do so, we search for a function $f(a) = \langle w, \phi(a)\rangle$, with $\phi$ a given map from the sample space to a specified (possibly infinite-dimensional) function class (a Reproducing Kernel Hilbert Space [29, 31]), such that $\operatorname{sign}\bigl(f(a_i)\bigr) = y_i$. We can then use $f$ to classify new samples. This problem can be formalized as
\[
\min_{w,\,\xi}\ \ \tfrac12\,\|w\|^{2} + C \sum_{i=1}^{m} \xi_i \qquad \text{s.t.}\quad y_i\,\langle w, \phi(a_i)\rangle \geq 1 - \xi_i,\ \ \xi_i \geq 0.
\]
The dual to this problem is always a finite-dimensional quadratic program:
\[
\max_{0 \,\leq\, \alpha \,\leq\, C}\ \ \mathbf 1^{\top}\alpha \;-\; \tfrac12\,\alpha^{\top}\operatorname{diag}(y)\, K\, \operatorname{diag}(y)\,\alpha, \tag{12}
\]
where $K$ is the kernel matrix given by $K_{ij} = \langle \phi(a_i), \phi(a_j)\rangle$. Once the dual is solved, $w$ is recovered via $w = \sum_i \alpha_i\, y_i\, \phi(a_i)$.
The choice of kernel is an art form; there are many options available, and different kernels perform better on different problems. To develop a disciplined approach in this setting, the multiple kernel learning framework has been proposed. In this framework, we suppose that we have a choice of kernel functions $k_1, \dots, k_p$ with corresponding kernel matrices $K_1, \dots, K_p$. We can now consider the linear combination
\[
K(\eta) \;=\; \sum_{k=1}^{p} \eta_k\, K_k,
\]
where $\eta_k \geq 0$ are some weights. A natural question is to find the best weighted combination; such an approach has been proposed in [24]. Specifically, requiring $\eta$ to lie in the unit simplex $\Delta$ yields the problem
\[
\min_{\eta \in \Delta}\ \max_{0 \,\leq\, \alpha \,\leq\, C}\ \ \mathbf 1^{\top}\alpha \;-\; \tfrac12\,\alpha^{\top}\operatorname{diag}(y)\, K(\eta)\, \operatorname{diag}(y)\,\alpha.
\]
This problem was solved in [24] by using variable projection, with the outer problem solved by prox-gradient. Notice that each iteration requires a complete solve of a quadratic program.
Our novel approach switches which variable is projected out. Analogously to the trimmed regression case, a naive approach is given below:
\[
\max_{0 \,\leq\, \alpha \,\leq\, C}\ \ \widetilde D(\alpha), \tag{13}
\]
where
\[
\widetilde D(\alpha) \;:=\; \min_{\eta \in \Delta}\ \ \mathbf 1^{\top}\alpha \;-\; \tfrac12\,\alpha^{\top}\operatorname{diag}(y)\, K(\eta)\, \operatorname{diag}(y)\,\alpha. \tag{14}
\]
As in trimmed regression, this approach does not work, as $\widetilde D$ can be nonsmooth. We use the same smoothing modification, considering a modified function
\[
\widetilde D_{\mu}(\alpha) \;:=\; \min_{\eta \in \Delta}\ \ \mathbf 1^{\top}\alpha \;-\; \tfrac12\,\alpha^{\top}\operatorname{diag}(y)\, K(\eta)\, \operatorname{diag}(y)\,\alpha \;+\; \tfrac{\mu}{2}\,\|\eta\|^{2}, \tag{15}
\]
where $\mu > 0$ is a smoothing parameter. Now, $\widetilde D_{\mu}$ is smooth, with $\nabla \widetilde D_{\mu}(\alpha) = \mathbf 1 - \operatorname{diag}(y)\, K(\bar\eta)\, \operatorname{diag}(y)\,\alpha$, where $\bar\eta$ is the nearest point to the vector with entries $\tfrac{1}{2\mu}\,\alpha^{\top}\operatorname{diag}(y)\, K_k\, \operatorname{diag}(y)\,\alpha$ in the unit simplex. This projection has also been studied and implemented [9]. With the smoothing extension, standard approaches can be applied to solve (13) with the modified $\widetilde D_{\mu}$. In our numerical experiments, we demonstrate the efficiency of this novel formulation.
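As an illustration of the smoothing, the sketch below evaluates $\widetilde D_{\mu}$ and its gradient for a given $\alpha$, assuming the inner objective takes exactly the linear-in-$\eta$ form written in (14)–(15); the exact dual used in practice may differ. The unit-simplex projection uses the standard sort-based method, in the spirit of [9]; labels and kernel matrices are assumed given.

```python
# A minimal sketch of evaluating the smoothed projected dual (15) and its
# gradient in alpha, assuming the inner objective
#   1^T alpha - 0.5 * sum_k eta_k * alpha^T Q_k alpha + (mu/2)*||eta||^2,
# with Q_k = diag(y) K_k diag(y).
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the unit simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1.0), 0.0)

def smoothed_dual_and_grad(alpha, y, Ks, mu):
    Qs = [np.diag(y) @ K @ np.diag(y) for K in Ks]
    q = np.array([alpha @ Q @ alpha for Q in Qs])      # per-kernel quadratic forms
    eta = project_simplex(q / (2.0 * mu))              # minimizer of the smoothed inner problem
    Q_eta = sum(e * Q for e, Q in zip(eta, Qs))
    val = alpha.sum() - 0.5 * alpha @ Q_eta @ alpha + 0.5 * mu * eta @ eta
    grad = np.ones_like(alpha) - Q_eta @ alpha         # differentiate at the fixed minimizer eta
    return val, grad
```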
3 Theory
In this section, we discuss some of the claimed derivative formulas in the aforementioned applications. To this end, recall that a function $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable at $\bar y$ if there exists a vector $v$, called the gradient and denoted $\nabla f(\bar y)$, satisfying
\[
f(y) \;=\; f(\bar y) + \langle v,\, y - \bar y\rangle + o(\|y - \bar y\|) \qquad \text{as } y \to \bar y.
\]
On the other hand, a function that is differentiable at $\bar y$ is strictly differentiable at $\bar y$ if it satisfies the slightly stronger property
\[
f(y) \;=\; f(y') + \langle \nabla f(\bar y),\, y - y'\rangle + o(\|y - y'\|) \qquad \text{as } y, y' \to \bar y.
\]
The following fundamental result [25, Theorem 10.58] is key for establishing derivative formulas for $\widetilde f_{\lambda}$ in (8) and for $\widetilde f$ in (9). Strict differentiability, in particular, implies local Lipschitz continuity of the function, in contrast to conventional differentiability [25, Theorem 9.18].
Theorem 1 (Derivative of the projected function).
Consider a function $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ satisfying the following properties:
1. $f$ is continuous;
2. $f(y, u)$ is level-bounded in $u$ locally uniformly in $y$, meaning that for any $\alpha \in \mathbb{R}$ and any compact set $C \subset \mathbb{R}^n$, the union of sublevel sets $\bigcup_{y \in C}\{u : f(y, u) \leq \alpha\}$ is bounded;
3. the gradient $\nabla_y f(y, u)$ exists for all $(y, u)$ and depends continuously on $(y, u)$.
Define now the projected function and the minimizing set:
\[
p(y) \;:=\; \min_{u}\ f(y, u) \qquad\text{and}\qquad U(y) \;:=\; \operatorname*{argmin}_{u}\ f(y, u).
\]
Then $p$ is strictly differentiable at any point $\bar y$ for which the set $U(\bar y)$ is a singleton $\{\bar u\}$, and in this case the equality $\nabla p(\bar y) = \nabla_y f(\bar y, \bar u)$ holds.
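As a quick sanity check, the following one-dimensional instance (a hypothetical example, not one of the applications above) can be worked out by hand; the inner minimization is a standard soft-thresholding step:
\[
f(y, x) = \tfrac12 (x - y)^2 + |x|, \qquad
x(y) = \operatorname{sign}(y)\,\max(|y| - 1,\, 0), \qquad
p(y) =
\begin{cases}
\tfrac12 y^{2}, & |y| \leq 1,\\[2pt]
|y| - \tfrac12, & |y| > 1.
\end{cases}
\]
Properties 1–3 clearly hold, the minimizer $x(y)$ is unique for every $y$, and indeed $\nabla p(y) = \nabla_y f\bigl(y, x(y)\bigr) = y - x(y)$, even though $f$ is nonsmooth in the inner variable.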
Let us see how this theorem applies. Consider the function $\widetilde f_{\lambda}$ in the formulation (8). Clearly the inner function satisfies properties 1 and 3 of Theorem 1. Property 2 follows quickly from strong convexity of $\rho$ and the assumptions that $h$ is lower bounded and $A(y)$ is invertible. Then the set $U(y)$ is clearly a singleton, and the claimed derivative formula follows (even if $h$ is nonsmooth).
Next, consider the function $\widetilde f$ in (9). Again, properties 1 and 3 of Theorem 1 hold. To justify property 2, we must assume that $A(y)$ has full column rank for all $y$. Then property 2 follows immediately. Since the objective is then strongly convex in $x$, the set $U(y)$ is a singleton and the derivative formula follows.
Next, note that there are important limitations to Theorem 1. We already saw that the naive projected function in equation (10) can easily be nonsmooth; see Figure 1. The same occurs for the naive formulation (14). The difficulty is that the parameter that is projected out varies over a constrained set, and hence properties 1 and 3 decisively fail.
As we saw in Sections 2.3 and 2.4, we can circumvent these issues by smoothing the projected problem while still preserving the structure of the solution set (feasibility, sparsity, etc.). This technique seems promising much more generally for problems of the form
\[
\min_{x}\ \ r(x) + \min_{w \in C}\ \langle w,\, g(x)\rangle + \tfrac{\mu}{2}\,\|w\|^{2},
\]
where $r$, $g$, and $C$ satisfy appropriate conditions. Such a smoothing technique in nonsmooth optimization appears to be new, and will be the subject of further investigation.
4 Numerical Illustrations
Direction-of-arrival estimation (DOA)
A classical problem in array-based signal processing is the estimation of the direction of arrival. In a typical setup, incoming signals from multiple sources (e.g., electromagnetic or acoustic waves) are recorded by an array of receivers. The goal is to identify the directions from which various contributions originated. This problem falls in the class of exponential data fitting, as an incoming plane wave can be described by a complex exponential. The signal model is
\[
b \;=\; A(\theta)\,x + \epsilon,
\]
where $b$ denotes the signal, $A(\theta)$ is a matrix with elements given by complex exponentials of the form $e^{\,\imath\, z_j \sin\theta_k}$, $x$ are the amplitudes of the individual components, $\theta_k$ are the directions, and $z_j$ are the locations of the receivers.
The classical way of estimating the parameters is the MUSIC algorithm [30, 8]. This algorithm proceeds to estimate the DOA as follows. First, we treat both $x$ and $\epsilon$ as random variables and define the signal correlation matrix
\[
R \;=\; \mathbb{E}\,\bigl[(A x)(A x)^{H}\bigr],
\]
which is an $m \times m$ matrix with rank $n$. We let $N$ be the matrix whose columns span the null space of $R$. The MUSIC pseudospectrum is now defined as
\[
P(\theta) \;=\; \frac{1}{\|N^{H} a(\theta)\|^{2}},
\]
where $a(\theta)$ is the column of complex exponentials corresponding to the direction $\theta$. If $a(\theta)$ is nearly in the range of $R$, its projection onto the null space is nearly zero, thus causing a large peak in the pseudospectrum. The components in the signal can then be detected by taking the largest peaks in the pseudospectrum.
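As a point of reference for the comparison below, here is a minimal sketch of the MUSIC estimator just described, using the sample correlation matrix discussed next; the uniform line array, the synthetic snapshots, and the assumption that the model order $k$ is known are all hypothetical choices.

```python
# A minimal sketch of the MUSIC pseudospectrum with a hypothetical uniform line
# array, synthetic snapshots, and known model order k.
import numpy as np

rng = np.random.default_rng(3)
m, k, T = 16, 3, 200
z = np.arange(m)                                           # receiver positions
theta_true = np.deg2rad([-20.0, 5.0, 40.0])
steer = lambda th: np.exp(1j * np.pi * np.outer(z, np.sin(th)))

S = rng.standard_normal((k, T)) + 1j * rng.standard_normal((k, T))
B = steer(theta_true) @ S + 0.1 * (rng.standard_normal((m, T)) + 1j * rng.standard_normal((m, T)))

R_hat = B @ B.conj().T / T                                 # sample correlation matrix
_, _, Vh = np.linalg.svd(R_hat)
N = Vh[k:].conj().T                                        # columns span the noise subspace

grid = np.deg2rad(np.linspace(-90.0, 90.0, 721))
pseudo = 1.0 / np.sum(np.abs(N.conj().T @ steer(grid)) ** 2, axis=0)
doa_est = np.rad2deg(grid[np.argsort(pseudo)[-k:]])        # crude pick of the k largest peaks
```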
In practice, we do not have access to the signal correlation. Instead, we construct the sample average
\[
\widehat R \;=\; \frac{1}{T} \sum_{t=1}^{T} b_t\, b_t^{H}.
\]
If the noise $\epsilon$ is i.i.d. Gaussian with variance $\sigma^2$ and independent of $x$, we have $\mathbb{E}\,\widehat R = R + \sigma^2 I$, in which case we construct $N$ from the singular vectors of $\widehat R$ with singular values close to $\sigma^2$. Proper identification of the null space requires an estimate of the noise level.

MUSIC vs. Nonsmooth DOA (full and reduced)
An example is shown in Figure 2. Note that the observed data contains arrivals from only 3 distinct directions. We solved the LASSO problems using an interior-point method (IPsolve [3]) while using a quasi-Newton method to solve the outer optimization problem. We compare this variable projection approach to solving the full optimization problem in $(x, \theta)$ jointly, using a nonsmooth quasi-Newton method [16].
The observed and fitted data are shown in Figure 2(a), and the true and estimated DOAs and amplitudes, as well as the MUSIC spectrum, are shown in Figure 2(b). The MUSIC spectrum clearly picks up the strongest modes but fails to recover the weaker mode. Both the full and variable projection approaches correctly estimate both the directions and amplitudes of the 3 arrivals. The full approach, however, fails to converge, as can be seen in Figure 2(c).
We repeated the above experiment for 100 different initial guesses and summarize the results in terms of the number of iterations required to reach the desired tolerance on optimality (relative norm of the gradient) in Figure 3. We see that the reduced approach almost always reaches the desired tolerance and requires fewer iterations to do so. In contrast, the full approach fails to reach the desired tolerance in roughly 35/100 cases.
Blind sparse deconvolution: full vs. reduced
In this example, we discuss a parametrized blind deconvolution problem. We are given a noisy image which we want to decompose into a number of parametrized basis functions. Such problems occur frequently in astrophysics [33], image deblurring [20, 2] and seismology [26, 17].
In this case, we use the interior-point method IPsolve [3] to solve the LASSO problems with positivity constraints. The outer optimization over the basis-function parameters is done using a quasi-Newton method. We compare the VP approach to a full optimization over the joint space. The results are shown in Figure 4. The ground-truth image consists of four blobs; the initial guess for the parameters is obtained by picking the 50 largest peaks in the noisy image. The reconstructions obtained by the reduced and full approaches look similar, but we see that the full approach has trouble getting rid of the superfluous points. The convergence plots, shown in Figure 5, show that the full approach does not fit the data as well as the reduced approach and converges very slowly.
Robust logistic regression using variable projection
Logistic regression is a popular alternative to support vector machines that seeks to find the best hyperplane to separate two classes of points. It can be formulated as an optimization problem as follows:
\[
\min_{\theta}\ \ \sum_{i=1}^{m} \log\bigl(1 + \exp(-y_i\, \langle a_i, \theta\rangle)\bigr) + r(\theta), \tag{16}
\]
where $a_i$ represents the $i$'th feature, $y_i \in \{-1, +1\}$ is a class label, $\theta$ describes the hyperplane we seek, and $r$ is a regularizer; in this example, we use a small multiple of a standard penalty as the regularizer. In this section, we consider a classification problem where a portion of the data has been contaminated. We pick 2,000 features, and we replace 10% of them (200 features) with random noise, with entries on average ten times larger than those of actual features. The smoothed trimmed approach for this problem is given by
\[
\min_{\theta}\ \Bigl[\,\min_{w \in \Delta_h \cap B_\infty}\ \sum_{i=1}^{m} w_i \log\bigl(1 + \exp(-y_i\, \langle a_i, \theta\rangle)\bigr) + \tfrac{\mu}{2}\,\|w\|^{2} \Bigr] + r(\theta), \tag{17}
\]
as described in Section 2.3. We then compare three scenarios:

1. Data has not been contaminated; standard logistic regression used.
2. Data has been contaminated; standard logistic regression used.
3. Data has been contaminated; robust logistic regression (17) used.
Note that the algorithm does not use information about the exact number of contaminated features. Rather, it assumes that at least half of the data is good. The results were nearly identical without using the regularization $r$. We used a fixed value of $\mu$ for the smoothing; a minimal computational sketch of the resulting scheme is given below.
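The sketch below runs plain gradient descent on $\theta$ with the smoothed trimming weights of (17) recomputed at every step. The synthetic data, step size, $h$, and $\mu$ are hypothetical choices, and the regularizer $r$ is omitted for brevity.

```python
# A minimal sketch of smoothed trimmed logistic regression (17): at each step
# the weights w are the projection of -g(theta)/mu onto the capped simplex
# {0 <= w <= 1, sum(w) = h}, and theta takes a gradient step on sum_i w_i*loss_i.
import numpy as np
from scipy.special import expit

def project_capped_simplex(v, h, iters=60):
    lo, hi = v.min() - 1.0, v.max()                     # bracket for the dual shift tau
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        lo, hi = (tau, hi) if np.clip(v - tau, 0, 1).sum() > h else (lo, tau)
    return np.clip(v - 0.5 * (lo + hi), 0.0, 1.0)

rng = np.random.default_rng(4)
m, d = 400, 10
X = rng.standard_normal((m, d))
theta_star = rng.standard_normal(d)
labels = np.sign(X @ theta_star + 0.1 * rng.standard_normal(m))
X[:40] = 10.0 * rng.standard_normal((40, d))            # contaminate 10% of the samples

h, mu, step = m // 2, 1e-2, 0.5 / m                     # keep the h "best" samples
theta = np.zeros(d)
for _ in range(500):
    margins = labels * (X @ theta)
    g = np.logaddexp(0.0, -margins)                     # per-sample logistic losses
    w = project_capped_simplex(-g / mu, h)              # smoothed trimming weights
    grad = X.T @ (w * (-labels * expit(-margins)))      # gradient of sum_i w_i * loss_i
    theta -= step * grad
```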
Scenario                 | Fraction Correct | Standard Deviation
No contamination         | 0.95             | 0.006
Contam. and standard LR  | 0.74             | 0.02
Contam. and robust LR    | 0.86             | 0.02
Fast MKL using variable projection
In this example, we show how variable projection can be used for Multiple Kernel Learning. We solve the smoothed dual problem (15) using SQP with an approximate Hessian. We consider two parametrized classes of kernel functions, polynomial and Gaussian.
As an illustrative example, we consider classifying points lying to the left and to the right of an elliptic curve. In Figure 6, we show the results for various kernels. In Table 3, we show the number of iterations required for each kernel.
In Figure 7 we show the result using a total of 12 kernels: 5 polynomial kernels and 7 Gaussian kernels, with varying parameters. Our approach quickly hones in on one of the Gaussian kernels and takes a total of 107 iterations.
kernel  1  2  3  4  5  6 
iterations  30  33  21  94  57  45 
5 Conclusions
Variable projection has been successfully used in a variety of contexts; the popularity of the approach is largely due to its superior numerical performance when compared to joint optimization schemes. In this paper, we considered a range of nonsmooth applications, illustrating the use of variable projection for sparse deconvolution and direction-of-arrival estimation. We showed that differentiability of the projected function can be understood using basic variational analysis, and that this approach has limitations and can fail in interesting cases. In particular, for robust formulations and multiple kernel learning, we showed that the projected function can fail to be differentiable. To circumvent this difficulty, we proposed a novel smoothing technique for the inner problem that preserves feasibility and structure of the solution set, while guaranteeing differentiability of the projected function. Numerical examples in all cases showed that variable projection schemes, when appropriately applied, are highly competitive for a wide range of nonsmooth applications.
Acknowledgment.
Research of A. Aravkin was partially supported by the Washington Research Foundation Data Science Professorship. Research of D. Drusvyatskiy was partially supported by the AFOSR YIP award FA9550-15-1-0237. The research of T. van Leeuwen was in part financially supported by the Netherlands Organisation for Scientific Research (NWO) as part of research programme 613.009.032.
References
 [1] G. S. Abdoulaev, K. Ren, and A. H. Hielscher. Optical tomography as a PDE-constrained optimization problem. Inverse Problems, 21(5):1507–1530, oct 2005.
 [2] M. S. C. Almeida and M. A. T. Figueiredo. Parameter estimation for blind and non-blind deblurring using residual whiteness measures. IEEE Transactions on Image Processing, 22(7):2751–2763, 2013.
 [3] A. Y. Aravkin, J. V. Burke, and G. Pillonetto. Sparse/robust estimation and Kalman smoothing with nonsmooth log-concave densities: Modeling, computation, and theory. Journal of Machine Learning Research, 14:2689–2728, 2013.
 [4] A. Y. Aravkin and T. van Leeuwen. Estimating nuisance parameters in inverse problems. Inverse Problems, 28(11):115016, nov 2012.
 [5] B. Bell and J. Burke. Algorithmic differentiation of implicit functions and optimal values. In C. H. Bischof, H. M. Bücker, P. D. Hovland, U. Naumann, and J. Utke, editors, Advances in Automatic Differentiation, pages 67–77. Springer, 2008.
 [6] G. Biros and O. Ghattas. Inexactness issues in the Lagrange-Newton-Krylov-Schur method for PDE-constrained optimization. In Large-Scale PDE-Constrained Optimization, pages 93–114. Springer, 2003.
 [7] J. Borwein and A. Lewis. Convex analysis and nonlinear optimization. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 3. Springer, New York, second edition, 2006. Theory and examples.
 [8] M. Cheney. The linear sampling method and the MUSIC algorithm. Inverse Problems, 17(4):591–595, 2001.
 [9] L. Condat. Fast Projection onto the Simplex and the l1 Ball. Technical report, 2014.
 [10] J. Dennis, M. Heinkenschloss, and L. Vicente. Trust-region interior-point SQP algorithms for a class of nonlinear programming problems. SIAM Journal on Control and Optimization, 36(5):1750–1794, 1998.
 [11] G. Golub and V. Pereyra. The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM J. Numer. Anal., 10(2):413–432, 1973.
 [12] G. Golub and V. Pereyra. Separable nonlinear least squares: the variable projection method and its applications. Inverse Problems, 19(2):R1, 2003.
 [13] E. Haber, U. M. Ascher, and D. Oldenburg. On optimization techniques for solving nonlinear inverse problems. Inverse Problems, 16(5):1263–1280, oct 2000.
 [14] S. M. Hanasoge. Full waveform inversion of solar interior flows. The Astrophysical Journal, 797(1):23, nov 2014.

 [15] R. Koenker and G. Bassett Jr. Regression quantiles. Econometrica: Journal of the Econometric Society, pages 33–50, 1978.
 [16] A. S. Lewis and M. L. Overton. Nonsmooth optimization via quasi-Newton methods. Mathematical Programming, 141(1-2):135–163, oct 2013.
 [17] T. T. Y. Lin and F. J. Herrmann. Robust estimation of primaries by sparse inversion via one-norm minimization. Geophysics, 78(3):R133–R150, 2013.
 [18] M. Osborne. Separable least squares, variable projection, and the Gauss-Newton algorithm. Electronic Transactions on Numerical Analysis, 28(2):1–15, 2007.
 [19] D. P. O’Leary and B. W. Rust. Variable projection for nonlinear least squares problems. Computational Optimization and Applications, 54(3):579–593, apr 2013.
 [20] S. U. Park, N. Dobigeon, and A. O. Hero. Semi-blind sparse image reconstruction with application to MRFM. IEEE Transactions on Image Processing, 21(9):3838–3849, 2012.
 [21] V. Pereyra and G. Scherer, editors. Exponential Data Fitting and its Applications. Bentham Science and Science Publishers, mar 2012.

 [22] P. Philip. Optimal Control and Partial Differential Equations. Technical report, 2009.
 [23] R.-E. Plessix. A review of the adjoint-state method for computing the gradient of a functional with geophysical applications. Geophysical Journal International, 167(2):495–503, 2006.
 [24] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In Proceedings of the 24th international conference on Machine learning, pages 775–782. ACM, 2007.
 [25] R. Rockafellar and R. Wets. Variational Analysis, volume 317. Springer, 1998.
 [26] A. A. Royer, M. G. Bostock, and E. Haber. Blind deconvolution of seismograms regularized via minimum support. Inverse Problems, 28(12):125010, dec 2012.
 [27] A. Ruhe and P. Wedin. Algorithms for separable nonlinear least squares problems. SIAM Rev., 22(3):318–337, 1980.
 [28] D. Ruppert and R. J. Carroll. Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75(372):828–838, 1980.
 [29] S. Saitoh. Theory of reproducing kernels and its applications, volume 189 of Pitman Research Notes in Mathematics Series. Longman Scientific & Technical, Harlow; co-published in the United States with John Wiley & Sons, Inc., New York, 1988.
 [30] R. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3):276–280, mar 1986.
 [31] B. Schölkopf, R. Herbrich, and A. Smola. A generalized representer theorem. In Computational learning theory (Amsterdam, 2001), volume 2111 of Lecture Notes in Comput. Sci., pages 416–426. Springer, Berlin, 2001.
 [32] B. Schölkopf and A. J. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT press, 2002.
 [33] P. Shearer and A. C. Gilbert. A generalization of variable elimination for separable inverse problems beyond least squares. Inverse Problems, 29(4):045003, Apr. 2013.
 [34] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and computing, 14(3):199–222, 2004.
 [35] T. van Leeuwen and F. J. Herrmann. A penalty method for PDE-constrained optimization in inverse problems. Technical Report 1, jan 2016.
 [36] W. Wang and M. A. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.
 [37] E. Yang and A. C. Lozano. Robust Gaussian graphical modeling with the trimmed graphical lasso. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2584–2592. Curran Associates, Inc., 2015.