# Robust Nonparametric Regression via Sparsity Control with Application to Load Curve Data Cleansing

Nonparametric methods are widely applicable to statistical inference problems, since they rely on a few modeling assumptions. In this context, the fresh look advocated here permeates benefits from variable selection and compressive sampling, to robustify nonparametric regression against outliers - that is, data markedly deviating from the postulated models. A variational counterpart to least-trimmed squares regression is shown closely related to an L0-(pseudo)norm-regularized estimator, that encourages sparsity in a vector explicitly modeling the outliers. This connection suggests efficient solvers based on convex relaxation, which lead naturally to a variational M-type estimator equivalent to the least-absolute shrinkage and selection operator (Lasso). Outliers are identified by judiciously tuning regularization parameters, which amounts to controlling the sparsity of the outlier vector along the whole robustification path of Lasso solutions. Reduced bias and enhanced generalization capability are attractive features of an improved estimator obtained after replacing the L0-(pseudo)norm with a nonconvex surrogate. The novel robust spline-based smoother is adopted to cleanse load curve data, a key task aiding operational decisions in the envisioned smart grid system. Computer simulations and tests on real load curve data corroborate the effectiveness of the novel sparsity-controlling robust estimators.

## Authors

• 13 publications
• 98 publications
• ### Robust PCA as Bilinear Decomposition with Outlier-Sparsity Regularization

Principal component analysis (PCA) is widely used for dimensionality red...
11/08/2011 ∙ by Gonzalo Mateos, et al. ∙ 0

• ### High-dimensional inference robust to outliers with l1-norm penalization

This paper studies inference in the high-dimensional linear regression m...
12/28/2020 ∙ by Jad Beyhum, et al. ∙ 0

• ### Inference robust to outliers with l1-norm penalization

This paper considers the problem of inference in a linear regression mod...
06/04/2019 ∙ by Jad Beyhum, et al. ∙ 0

• ### Minimum Distance Estimation for Robust High-Dimensional Regression

We propose a minimum distance estimation method for robust regression in...
07/11/2013 ∙ by Aurélie C. Lozano, et al. ∙ 0

• ### GEP-MSCRA for computing the group zero-norm regularized least squares estimator

This paper concerns with the group zero-norm regularized least squares e...
04/26/2018 ∙ by Shujun Bi, et al. ∙ 0

• ### Robust censored regression with l1-norm regularization

This paper considers inference in a linear regression model with random ...
10/05/2021 ∙ by Jad Beyhum, et al. ∙ 0

• ### How Entropic Regression Beats the Outliers Problem in Nonlinear System Identification

System identification (SID) is central in science and engineering applic...
05/16/2019 ∙ by Abd AlRahman R. AlMomani, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Consider the classical problem of function estimation, in which an input vector is given, and the goal is to predict the real-valued scalar response . Function is unknown, to be estimated from a training data set . When

is assumed to be a member of a finitely-parameterized family of functions, standard (non-)linear regression techniques can be adopted. If on the other hand, one is only willing to assume that

belongs to a (possibly infinite dimensional) space of “smooth” functions , then a nonparametric approach is in order, and this will be the focus of this work.

Without further constraints beyond , functional estimation from finite data is an ill-posed problem. To bypass this challenge, the problem is typically solved by minimizing appropriately regularized criteria, allowing one to control model complexity; see, e.g., [12, 34]. It is then further assumed that has the structure of a reproducing kernel Hilbert space (RKHS), with corresponding positive definite reproducing kernel function , and norm denoted by . Under the formalism of regularization networks, one seeks as the solution to the variational problem

 minf∈H[N∑i=1V(yi−f(x))+μ∥f∥2H] (1)

where

is a convex loss function, and

controls complexity by weighting the effect of the smoothness functional . Interestingly, the Representer Theorem asserts that the unique solution of (1) is finitely parametrized and has the form , where can be obtained from ; see e.g.,  [29, 38]. Further details on RKHS, and in particular on the evaluation of , can be found in e.g., [38, Ch. 1]. A fundamental relationship between model complexity control and generalization capability, i.e., the predictive ability of beyond the training set, was formalized in [37].

The generalization error performance of approaches that minimize the sum of squared model residuals [that is in (1)] regularized by a term of the form , is degraded in the presence of outliers. This is because the least-squares (LS) part of the cost is not robust, and can result in severe overfitting of the (contaminanted) training data [21]. Recent efforts have considered replacing the squared loss with a robust counterpart such as Huber’s function, or its variants, but lack a data-driven means of selecting the proper threshold that determines which datum is considered an outlier [43]; see also [27]. Other approaches have instead relied on the so-termed

-insensitive loss function, originally proposed to solve function approximation problems using support vector machines (SVMs)

[37]. These family of estimators often referred to as support vector regression (SVR), have been shown to enjoy robustness properties; see e.g., [32, 28, 26] and references therein. In [8], improved performance in the presence of outliers is achieved by refining the SVR solution through a subsequent robust learning phase.

The starting point here is a variational least-trimmed squares (VLTS) estimator, suitable for robust function approximation in (Section II). It is established that VLTS is closely related to an (NP-hard) -(pseudo)norm-regularized estimator, adopted to fit a regression model that explicitly incorporates an unknown sparse vector of outliers [17]. As in compressive sampling (CS) [35], efficient (approximate) solvers are obtained in Section III by replacing the outlier vector’s -norm with its closest convex approximant, the -norm. This leads naturally to a variational M-type estimator of , also shown equivalent to a least-absolute shrinkage and selection operator (Lasso) [33] on the vector of outliers (Section III-A). A tunable parameter in Lasso controls the sparsity of the estimated vector, and the number of outliers as a byproduct. Hence, effective methods to select this parameter are of paramount importance.

The link between -norm regularization and robustness was also exploited for parameter (but not function) estimation in [17] and [22]; see also [40]

for related ideas in the context of face recognition, and error correction codes

[4]. In [17] however, the selection of Lasso’s tuning parameter is only justified for Gaussian training data; whereas a fixed value motivated by CS error bounds is adopted in the Bayesian formulation of [22]. Here instead, a more general and systematic approach is pursued in Section III-B, building on contemporary algorithms that can efficiently compute all robustifaction paths of Lasso solutions, i.e., for all values of the tuning parameter [11, 16, 41]. In this sense, the method here capitalizes on but is not limited to sparse settings, since one can examine all possible sparsity levels along the robustification path. An estimator with reduced bias and improved generalization capability is obtained in Section IV, after replacing the -norm with a nonconvex surrogate, instead of the -norm that introduces bias [33, 44]. Simulated tests demonstrate the effectiveness of the novel approaches in robustifying thin-plate smoothing splines [10] (Section V-A), and in estimating the sinc function (Section V-B) – a paradigm typically adopted to assess performance of robust function approximation approaches [8, 43].

In this context, it is critical to effectively reject outliers, and replace the contaminated data with ‘healthy’ load predictions, i.e., to cleanse the load data. While most utilities carry out this task manually based on their own personnel’s know-how, a first scalable and principled approach to load profile cleansing which is based on statistical learning methods was recently proposed in [6]; which also includes an extensive literature review on the related problem of outlier identification in time-series. After estimating the regression function

via either B-spline or Kernel smoothing, pointwise confidence intervals are constructed based on

. A datum is deemed as an outlier whenever it falls outside its associated confidence interval. To control the degree of smoothing effected by the estimator, [6] requires the user to label the outliers present in a training subset of data, and in this sense the approach therein is not fully automatic. Here instead, a novel alternative to load curve cleansing is developed after specializing the robust estimators of Sections III and IV, to the case of cubic smoothing splines (Section V-C). The smoothness-and outlier sparsity-controlling parameters are selected according to the guidelines in Section III-B; hence, no input is required from the data analyst. The proposed spline-based method is tested on real load curve data from a government building.

Concluding remarks are given in Section VI, while some technical details are deferred to the Appendix.

Notation: Bold uppercase letters will denote matrices, whereas bold lowercase letters will stand for column vectors. Operators , and will denote transposition, matrix trace and expectation, respectively; will be used for the cardinality of a set and the magnitude of a scalar. The norm of vector is for ; and is the matrix Frobenious norm. Positive definite matrices will be denoted by . The identity matrix will be represented by , while will denote the vector of all zeros, and .

## Ii Robust Estimation Problem

The training data comprises noisy samples of taken at the input points (also known as knots in the splines parlance), and in the present context they can be possibly contaminated with outliers. Building on the parametric least-trimmed squares (LTS) approach [31], the desired robust estimate can be obtained as the solution of the following variational (V)LTS minimization problem

 minf∈H[s∑i=1r2[i](f)+μ∥f∥2H] (2)

where is the -th order statistic among the squared residuals , and . In words, given a feasible , to evaluate the sum of the cost in (2) one: i) computes all squared residuals , ii) orders them to form the nondecreasing sequence ; and iii) sums up the smallest terms. As in the parametric LTS [31], the so-termed trimming constant (also known as coverage) determines the breakdown point of the VLTS estimator, since the largest residuals do not participate in (2). Ideally, one would like to make equal to the (typically unknown) number of outliers in the training data. For most pragmatic scenaria where is unknown, the LTS estimator is an attractive option due to its high breakdown point and desirable theoretical properties, namely -consistency and asymptotic normality [31].

The tuning parameter in (2) controls the tradeoff between fidelity to the (trimmed) data, and the degree of “smoothness” measured by . In particular,

can be interpreted as a generalized ridge regularization term penalizing more those functions with large coefficients in a basis expansion involving the eigenfunctions of the kernel

.

Given that the sum in (2) is a nonconvex functional, a nontrivial issue pertains to the existence of the proposed VLTS estimator, i.e., whether or not (2) attains a minimum in . Fortunately, a (conceptually) simple solution procedure suffices to show that a minimizer does indeed exist. Consider specifically a given subsample of training data points, say , and solve

 minf∈H[s∑i=1r2i(f)+μ∥f∥2H].

A unique minimizer of the form is guaranteed to exist, where is used here to denote the chosen subsample, and the coefficients can be obtained by solving a particular linear system of equations [38, p. 11]. This procedure can be repeated for each subsample (there are of these), to obtain a collection of candidate solutions of (2). The winner(s) yielding the minimum cost, is the desired VLTS estimator. A remark is now in order.

###### Remark 1 (VLTS complexity)

Even though conceptually simple, the solution procedure just described guarantees existence of (at least) one solution, but entails a combinatorial search over all subsamples which is intractable for moderate to large sample sizes . In the context of linear regression, algorithms to obtain approximate LTS solutions are available; see e.g., [30].

### Ii-a Robust function approximation via ℓ0-norm regularization

Instead of discarding large residuals, the alternative approach proposed here explicitly accounts for outliers in the regression model. To this end, consider the scalar variables one per training datum, taking the value whenever datum adheres to the postulated nominal model, and otherwise. A regression model naturally accounting for the presence of outliers is

 yi=f(xi)+oi+εi,i=1,…,N (3)

where

are zero-mean independent and identically distributed (i.i.d.) random variables modeling the observation errors. A similar model was advocated under different assumptions in

[17] and [22], in the context of robust parametric regression; see also [4] and [40]. For an outlier-free datum , (3) reduces to ; hence, will be often referred to as the nominal noise. Note that in (3), both as well as the vector are unknown; thus, (3) is underdetermined. On the other hand, as outliers are expected to often comprise a small fraction of the training sample say, not exceeding 20% – vector is typically sparse, i.e., most of its entries are zero; see also Remark 3. Sparsity compensates for underdeterminacy and provides valuable side-information when it comes to efficiently estimating , identifying outliers as a byproduct, and consequently performing robust estimation of the unknown function .

A natural criterion for controlling outlier sparsity is to seek the desired estimate as the solution of

 (4)

where is a preselected threshold, and denotes the -norm of , which equals the number of nonzero entries of its vector argument. Sparsity is directly controlled by the selection of the tuning parameter . If the number of outliers were known a priori, then should be selected equal to . Unfortunately, analogously to related -norm constrained formulations in compressive sampling and sparse signal representations, problem (4) is NP-hard. In addition, (4) can be recast to an equivalent (unconstrained) Lagrangian form; see e.g., [3]

 (5)

where the tuning Lagrange multiplier plays a role similar to in (4), and the -norm sparsity encouraging penalty is added to the cost.

To further motivate model (3) and the proposed criterion (5) for robust nonparametric regression, it is worth checking the structure of the minimizers of the cost in (5). Consider for the sake of argument that is given, and its value is such that , for some . The goal is to characterize , as well as the positions and values of the nonzero entries of . Note that because , the last term in (5) is constant, hence inconsequential to the minimization. Upon defining , it is not hard to see that the entries of satisfy

 ^oi={0,|^ri|≤√λ0^ri,|^ri|>√λ0,i=1,…,N (6)

at the optimum. This is intuitive, since for those the best thing to do in terms of minimizing the overall cost is to set , and thus null the corresponding squared-residual terms in (5). In conclusion, for the chosen value of it holds that squared residuals effectively do not contribute to the cost in (5).

To determine the support of and , one alternative is to exhaustively test all admissible support combinations. For each one of these combinations (indexed by ), let be the index set describing the support of , i.e., if and only if ; and . By virtue of (6), the corresponding candidate minimizes

 minf∈H⎡⎣∑i∈Sjr2i(f)+μ∥f∥2H⎤⎦

while is the one among all that yields the least cost. The previous discussion, in conjunction with the one preceding Remark 1 completes the argument required to establish the following result.

Proposition 1: If minimizes (5) with chosen such that , then also solves the VLTS problem (2).

The importance of Proposition II-A is threefold. First, it formally justifies model (3) and its estimator (5) for robust function approximation, in light of the well documented merits of LTS regression [30]. Second, it further solidifies the connection between sparse linear regression and robust estimation. Third, the -norm regularized formulation in (5) lends itself naturally to efficient solvers based on convex relaxation, the subject dealt with next.

## Iii Sparsity Controlling Outlier Rejection

To overcome the complexity hurdle in solving the robust regression problem in (5), one can resort to a suitable relaxation of the objective function. The goal is to formulate an optimization problem which is tractable, and whose solution yields a satisfactory approximation to the minimizer of the original hard problem. To this end, it is useful to recall that the -norm of vector is the closest convex approximation of . This property also utilized in the context of compressive sampling [35], provides the motivation to relax the NP-hard problem (5) to

 (7)

Being a convex optimization problem, (7) can be solved efficiently. The nondifferentiable -norm regularization term controls sparsity on the estimator of

, a property that has been recently exploited in diverse problems in engineering, statistics and machine learning. A noteworthy representative is the least-absolute shrinkage and selection operator (Lasso)

[33], a popular tool in statistics for joint estimation and continuous variable selection in linear regression problems. In its Lagrangian form, Lasso is also known as basis pursuit denoising in the signal processing literature, a term coined by [7] in the context of finding the best sparse signal expansion using an overcomplete basis.

It is pertinent to ponder on whether problem (7) has built-in ability to provide robust estimates in the presence of outliers. The answer is in the affirmative, since a straightforward argument (details are deferred to the Appendix) shows that (7) is equivalent to a variational M-type estimator found by

 minf∈H[N∑i=1ρ(yi−f(xi))+μ∥f∥2H] (8)

where is a scaled version of Huber’s convex loss function [21]

 ρ(u):={u2,|u|≤λ1/2λ1|u|−λ21/4,|u|>λ1/2. (9)
###### Remark 2 (Regularized regression and robustness)

Existing works on linear regression have pointed out the equivalence between -norm regularized regression and M-type estimators, under specific assumptions on the distribution of the outliers (-contamination) [17, 23]. However, they have not recognized the link with LTS through the convex relaxation of (5), and the connection asserted by Proposition II-A. Here, the treatment goes beyond linear regression by considering nonparametric functional approximation in RKHS. Linear regression is subsumed as a special case, when the linear kernel is adopted. In addition, no assumption is imposed on the outlier vector.

It is interesting to compare the - and -norm formulations [cf. (5) and (7), respectively] in terms of their equivalent purely variational counterparts in (2) and (8), that entail robust loss functions. While the VLTS estimator completely discards large residuals, still retains them, but downweighs their effect through a linear penalty. Moreover, while (8) is convex, (2) is not and this has a direct impact on the complexity to obtain either estimator. Regarding the trimming constant in (2), it controls the number of residuals retained and hence the breakdown point of VLTS. Considering instead the threshold in Huber’s function , when the outliers’ distribution is known a-priori, its value is available in closed form so that the robust estimator is optimal in a well-defined sense [21]

. Convergence in probability of M-type cubic smoothing splines estimators – a special problem subsumed by (

8) – was studied in [9].

### Iii-a Solving the convex relaxation

Because (7) is jointly convex in and , an alternating minimization (AM) algorithm can be adopted to solve (7), for fixed values of and . Selection of these parameters is a critical issue that will be discussed in Section III-B. AM solvers are iterative procedures that fix one of the variables to its most up to date value, and minimize the resulting cost with respect to the other one. Then the roles are reversed to complete one cycle, and the overall two-step minimization procedure is repeated for a prescribed number of iterations, or, until a convergence criterion is met. Letting denote iterations, consider that is fixed in (7). The update for at the -th iteration is given by

 f(k):=argminf∈H[N∑i=1((yi−o(k−1)i)−f(xi))2+μ∥f∥2H] (10)

which corresponds to a standard regularization problem for functional approximation in  [12], but with outlier-compensated data . It is well known that the minimizer of the variational problem (10) is finitely parameterized, and given by the kernel expansion  [38]. The vector is found by solving the linear system of equations

 [K+μIN]β(k)=y−o(k−1) (11)

where , and the matrix has entries .

In a nutshell, updating is equivalent to updating vector as per (11), where only the independent vector variable changes across iterations. Because the system matrix is positive definite, the per iteration systems of linear equations (11) can be efficiently solved after computing once, the Cholesky factorization of .

For fixed in (7), the outlier vector update at iteration is obtained as

 o(k):=argmino∈RN[N∑i=1(r(k)i−oi)2+λ1∥o∥1] (12)

where . Problem (12) can be recognized as an instance of Lasso for the so-termed orthonormal case, in particular for an identity regression matrix. The solution of such Lasso problems is readily obtained via soft-thresholding [15], in the form of

 o(k)i:=S(r(k)i,λ1/2),i=1,…,N (13)

where is the soft-thresholding operator, and denotes the projection onto the nonnegative reals. The coordinatewise updates in (13) are in par with the sparsifying property of the norm, since for “small” residuals, i.e., , it follows that , and the -th training datum is deemed outlier free. Updates (11) and (13) comprise the iterative AM solver of the -norm regularized problem (7), which is tabulated as Algorithm 1. Convexity ensures convergence to the global optimum solution regardless of the initial condition; see e.g., [3].

Algorithm 1 is also conceptually interesting, since it explicitly reveals the intertwining between the outlier identification process, and the estimation of the regression function with the appropriate outlier-compensated data. An additional point is worth mentioning after inspection of (13) in the limit as . From the definition of the soft-thresholding operator , for those “large” residuals exceeding in magnitude, when , and otherwise. In other words, larger residuals that the method identifies as corresponding to outlier-contaminated data are shrunk, but not completely discarded. By plugging back into (7), these “large” residuals cancel out in the squared error term, but still contribute linearly through the -norm regularizer. This is exactly what one would expect, in light of the equivalence established with the variational -type estimator in (8).

Next, it is established that an alternative to solving a sequence of linear systems and scalar Lasso problems, is to solve a single instance of the Lasso with specific response vector and (non-orthonormal) regression matrix.

Proposition 2: Consider defined as

 ^oLasso:=argmino∈RN∥Xμy−Xμo∥22+λ1∥o∥1 (14)

where

 Xμ:=[IN−K(K+μIN)−1(μK)1/2(K+μIN)−1]. (15)

Then the minimizers of (7) are fully determined given , as and , with .

###### Proof:

For notational convenience introduce the vectors and , where is the minimizer of (7). Next, consider rewriting (7) as

 mino∈RN[minf∈H∥(y−o)−f∥22+μ∥f∥2H]+λ1∥o∥1. (16)

The quantity inside the square brackets is a function of , and can be written explicitly after carrying out the minimization with respect to . From the results in [38], it follows that the vector of optimum predicted values at the points is given by ; see also the discussion after (10). Similarly, one finds that . Having minimized (16) with respect to , the quantity inside the square brackets is

 minf∈H[∥(y−o)−f∥22+μ∥f∥2H] =∥∥(y−o)−^f∥∥22+μ∥^f∥2H =∥∥(y−o)−KΓμ(y−o)∥∥22+μ(y−o)′ΓμKΓμ(y−o) =∥∥(IN−KΓμ)y−(IN−KΓμ)o∥∥22+μ(y−o)′ΓμKΓμ(y−o). (17)

After expanding the quadratic form in the right-hand side of (III-A), and eliminating the term that does not depend on , problem (16) becomes

 mino∈RN[∥∥(IN−KΓμ)y−(IN−KΓμ)o∥∥22−2μy′ΓμKΓμo+μo′ΓμKΓμo+λ1∥o∥1].

Completing the square one arrives at

which completes the proof.

The result in Proposition 1 opens the possibility for effective methods to select . These methods to be described in detail in the ensuing section, capitalize on recent algorithmic advances on Lasso solvers, which allow one to efficiently compute for all values of the tuning parameter . This is crucial for obtaining satisfactory robust estimates , since controlling the sparsity in by tuning is tantamount to controlling the number of outliers in model (3).

### Iii-B Selection of the tuning parameters: robustification paths

As argued before, the tuning parameters and in (7) control the degree of smoothness in and the number of outliers (nonzero entries in

), respectively. From a statistical learning theory standpoint,

and

control the amount of regularization and model complexity, thus capturing the so-termed effective degrees of freedom

[19]. Complex models tend to have worse generalization capability, even though the prediction error over the training set may be small (overfitting). In the contexts of regularization networks [12] and Lasso estimation for regression [33], corresponding tuning parameters are typically selected via model selection techniques such as cross-validation, or, by minimizing the prediction error over an independent test set, if available [19]. However, these simple methods are severely challenged in the presence of multiple outliers. For example, the swamping effect refers to a very large value of the residual corresponding to a left out clean datum , because of an unsatisfactory model estimation based on all data except ; data which contain outliers.

The idea here offers an alternative method to overcome the aforementioned challenges, and the possibility to efficiently compute for all values of , given . A brief overview of the state-of-the-art in Lasso solvers is given first. Several methods for selecting and are then described, which differ on the assumptions of what is known regarding the outlier model (3).

Lasso amounts to solving a quadratic programming (QP) problem [33]; hence, an iterative procedure is required to determine in (14) for a given value of . While standard QP solvers can be certainly invoked to this end, an increasing amount of effort has been put recently toward developing fast algorithms that capitalize on the unique properties of Lasso. The LARS algorithm [11] is an efficient scheme for computing the entire path of solutions (corresponding to all values of ), sometimes referred to as regularization paths. LARS capitalizes on piecewise linearity of the Lasso path of solutions, while incurring the complexity of a single LS fit, i.e., when . Coordinate descent algorithms have been shown competitive, even outperforming LARS when is large, as demonstrated in [16]; see also [42, 15], and the references therein. Coordinate descent solvers capitalize on the fact that Lasso can afford a very simple solution in the scalar case, which is given in closed form in terms of a soft-thresholding operation [cf. (13)]. Further computational savings are attained through the use of warm starts [15], when computing the Lasso path of solutions over a grid of decreasing values of . An efficient solver capitalizing on variable separability has been proposed in [41].

Consider then a grid of values of in the interval , evenly spaced in a logarithmic scale. Likewise, for each consider a similar type of grid consisting of values of , where is the minimum value such that  [16], and in (14). Typically, with , say. Note that each of the values of gives rise to a different grid, since depends on through . Given the previously surveyed algorithmic alternatives to tackle the Lasso, it is safe to assume that (14) can be efficiently solved over the (nonuniform) grid of values of the tuning parameters. This way, for each value of one obtains samples of the Lasso path of solutions, which in the present context can be referred to as robustification path. As decreases, more variables enter the model signifying that more of the training data are deemed to contain outliers. An example of the robustification path is given in Fig. 3.

Based on the robustification paths and the prior knowledge available on the outlier model (3), several alternatives are given next to select the “best” pair in the grid .

Number of outliers is known: When is known, by direct inspection of the robustification paths one can determine the range of values for , for which has exactly nonzero entries. Specializing to the interval of interest, and after discarding outliers which are now fixed and known, -fold cross-validation methods can be applied to determine .

Variance of the nominal noise is known:

Supposing that the variance

of the i.i.d. nominal noise variables in (3) is known, one can proceed as follows. Using the solution obtained for each pair on the grid, form the sample variance matrix with -th entry

 [¯Σ]ij:=∑u|^oLasso,u=0^r2u/^No=∑u|^oLasso,u=0(yu−^f(xu))2/^No (18)

where stands for the number of nonzero entries in . Although not made explicit, the right-hand side of (18) depends on through the estimate , and . The entries correspond to a sample estimate of , without considering those training data that the method determined to be contaminated with outliers, i.e., those indices for which . The “winner” tuning parameters are such that

 [i∗,j∗]:=argmini,j|[¯Σ]ij−σ2ε| (19)

which is an absolute variance deviation (AVD) criterion.

Variance of the nominal noise is unknown: If is unknown, one can still compute a robust estimate of the variance , and repeat the previous procedure (with known nominal noise variance) after replacing with in (19). One option is based on the median absolute deviation (MAD) estimator, namely

 ^σε:=1.4826×mediani(|^ri−medianj(^rj)|) (20)

where the residuals are formed based on a nonrobust estimate of , obtained e.g., after solving (7) with and using a small subset of the training dataset . The factor

provides an approximately unbiased estimate of the standard deviation when the nominal noise is Gaussian. Typically,

in (20) is used as an estimate for the scale of the errors in general M-type robust estimators; see e.g., [9] and [27].

###### Remark 3 (How sparse is sparse)

Even though the very nature of outliers dictates that is typically a small fraction of – and thus in (3) is sparse – the method here capitalizes on, but is not limited to sparse settings. For instance, choosing along the robustification paths allows one to continuously control the sparsity level, and potentially select the right value of for any given . Admittedly, if is large relative to , then even if it is possible to identify and discard the outliers, the estimate may not be accurate due to the lack of outlier-free data.

## Iv Refinement via Nonconvex Regularization

Instead of substituting in (5) by its closest convex approximation, namely , letting the surrogate function to be non-convex can yield tighter approximations. For example, the -norm of a vector was surrogated in [5]

by the logarithm of the geometric mean of its elements, or by

. In rank minimization problems, apart from the nuclear norm relaxation, minimizing the logarithm of the determinant of the unknown matrix has been proposed as an alternative surrogate [14]. Adopting related ideas in the present nonparametric context, consider approximating (5) by

 (21)

where is a sufficiently small positive offset introduced to avoid numerical instability.

Since the surrogate term in (21) is concave, the overall problem is nonconvex. Still, local methods based on iterative linearization of , around the current iterate , can be adopted to minimize (21). From the concavity of the logarithm, its local linear approximation serves as a global overestimator. Standard majorization-minimization algorithms motivate minimizing the global linear overestimator instead. This leads to the following iteration for (see e.g., [25] for further details)

 [f(k),o(k)] (22) w(k)i :=(|o(k−1)i|+δ)−1,i=1,…,N. (23)

It is possible to eliminate the optimization variable from (22), by direct application of the result in Proposition 1. The equivalent update for at iteration is then given by

 o(k):=argmino∈RN[∥Xμy−Xμo∥22+λ0N∑i=1w(k)i|oi|] (24)

which amounts to an iteratively reweighted version of (14). If the value of is small, then in the next iteration the corresponding regularization term has a large weight, thus promoting shrinkage of that coordinate to zero. On the other hand when is significant, the cost in the next iteration downweighs the regularization, and places more importance to the LS component of the fit. For small , analysis of the limiting point of (24) reveals that

 λ0w∗i|o∗i|≈{λ0,|o∗i|≠00,|o∗i|=0

and hence, .

A good initialization for the iteration in (24) and is , which corresponds to the solution of (14) [and (7)] for and . This is equivalent to a single iteration of (24) with all weights equal to unity. The numerical tests in Section V will indicate that even a single iteration of (24) suffices to obtain improved estimates , in comparison to those obtained from (14). The following remark sheds further light towards understanding why this should be expected.

###### Remark 4 (Refinement through bias reduction)

Uniformly weighted -norm regularized estimators such as (7) are biased [44], due to the shrinkage effected on the estimated coefficients. It will be argued next that the improvements due to (24) can be leveraged to bias reduction. Several workarounds have been proposed to correct the bias in sparse regression, that could as well be applied here. A first possibility is to retain only the support of (14) and re-estimate the amplitudes via, e.g., the unbiased LS estimator [11]. An alternative approach to reducing bias is through nonconvex regularization using e.g., the smoothly clipped absolute deviation (SCAD) scheme [13]. The SCAD penalty could replace the sum of logarithms in (21), still leading to a nonconvex problem. To retain the efficiency of convex optimization solvers while simultaneously limiting the bias, suitably weighted -norm regularizers have been proposed instead [44]. The constant weights in [44] play a role similar to those in (23); hence, bias reduction is expected.

## V Numerical Experiments

### V-a Robust thin-plate smoothing splines

To validate the proposed approach to robust nonparametric regression, a simulated test is carried out here in the context of thin-plate smoothing spline approximation [10, 39]. Specializing (7) to this setup, the robust thin-plate splines estimator can be formulated as

 (25)

where denotes the Frobenius norm of the Hessian of . The penalty functional

 J[f]:=∫R2∥∇2f∥2Fdx=∫R2⎡⎣(∂2f∂x21)2+2(∂2f∂x1∂x2)2+(∂2f∂x22)2⎤⎦dx (26)

extends to the one-dimensional roughness regularization used in smoothing spline models. For , the (non-unique) estimate in (25) corresponds to a rough

function interpolating the outlier compensated data; while as

the estimate is linear (cf. ). The optimization is over , the space of Sobolev functions, for which is well defined [10, p. 85]. Reproducing kernel Hilbert spaces such as , with inner-products (and norms) involving derivatives are studied in detail in [38].

Different from the cases considered so far, the smoothing penalty in (26) is only a seminorm, since first-order polynomials vanish under . Omitting details than can be found in [38, p. 30], under fairly general conditions a unique minimizer of (25) exists. The solution admits the finitely parametrized form , where in this case

is a radial basis function. In simple terms, the solution as a kernel expansion is augmented with a member of the null space of

. The unknown parameters are obtained in closed form, as solutions to a constrained, regularized LS problem; see [38, p. 33]. As a result, Proposition 1 still holds with minor modifications on the structure of .

###### Remark 5 (Bayesian framework)

Adopting a Bayesian perspective, one could model in (3) as a sample function of a zero mean Gaussian stationary process, with covariance function  [24]. Consider as well that are mutually independent, while and in (3) are i.i.d. Gaussian and Laplace distributed, respectively. From the results in [24] and a straightforward calculation, it follows that setting and in (25) yields estimates (and ) which are optimal in a maximum a posteriori sense. This provides yet another means of selecting the parameters and , further expanding the options presented in Section III-B.

The simulation setup is as follows. Noisy samples of the true function comprise the training set . Function is generated as a Gaussian mixture with two components, with respective mean vectors and covariance matrices given by

 μ1=[0.22950.4996],Σ1=[2.24310.45770.45771.0037],μ2=[2.45662.9461],Σ2=[2.90690.52360.52361.7299].

Function is depicted in Fig. 4 (a). The training data set comprises examples, with inputs

drawn from a uniform distribution in the square

. Several values ranging from to of the data are generated contaminated with outliers. Without loss of generality, the corrupted data correspond to the first training samples with , for which the response values are independently drawn from a uniform distribution over . Outlier-free data are generated according to the model , where the independent additive noise terms

are Gaussian distributed, for

. For the case where , the data used in the experiment is shown in Fig. 2. Superimposed to the true function are black points corresponding to data drawn from the nominal model, as well as red outlier points.

For this experiment, the nominal noise variance is assumed known. A nonuniform grid of and values is constructed, as described in Section III-B. The relevant parameters are , and . For each value of , the grid spans the interval defined by and , where . Each of the robustification paths corresponding to the solution of (14) is obtained using the SpaRSA toolbox in [41], exploiting warm starts for faster convergence. Fig. 3 depicts an example with and . With the robustification paths at hand, it is possible to form the sample variance matrix [cf. (18)], and select the optimum tuning parameters based on the criterion (19). Finally, the robust estimates are refined by running a single iteration of (24) as described in Section IV. The value was utilized, and several experiments indicated that the results are quite insensitive to the selection of this parameter.

The same experiment was conducted for a variable number of outliers , and the results are listed in Table I. In all cases, a outlier identification success rate was obtained, for the chosen value of the tuning parameters. This even happened at the first stage of the method, i.e., in (14) had the correct support in all cases. It has been observed in some other setups that (14) may select a larger support than , but after running a few iterations of (24) the true support was typically identified. To assess quality of the estimated function , two figures of merit were considered. First, the training error was evaluated as

 ¯err=1N−NoN∑i=No(yi−^f(xi))2

i.e., the average loss over the training sample after excluding outliers. Second, to assess the generalization capability of , an approximation to the generalization error was computed as

 ErrT=E[(y−^f(x))2|T]≈1~N~N∑i=1(~yi−^f(~xi))2 (27)

where is an independent test set generated from the model . For the results in Table I, was adopted corresponding to a uniform rectangular grid of points in