Adaptive multi-penalty regularization based on a generalized Lasso path

10/11/2017, by Markus Grasmair, et al. (NTNU, Simula Research Lab)

For many algorithms, parameter tuning remains a challenging and critical task, which becomes tedious and infeasible in a multi-parameter setting. Multi-penalty regularization, successfully used for solving underdetermined sparse regression problems of unmixing type, where signal and noise are additively mixed, is one such example. In this paper, we propose a novel algorithmic framework for an adaptive parameter choice in multi-penalty regularization with a focus on correct support recovery. Building upon the theory of regularization paths and algorithms for single-penalty functionals, we extend these ideas to a multi-penalty framework by providing an efficient procedure for the construction of regions containing structurally similar solutions, i.e., solutions with the same sparsity and sign pattern, over the whole range of parameters. Combining this with a model selection criterion, we can choose regularization parameters in a data-adaptive manner. Another advantage of our algorithm is that it provides an overview of the solution stability over the whole range of parameters. This can be further exploited to obtain additional insights into the problem of interest. We provide a numerical analysis of our method and compare it to state-of-the-art single-penalty algorithms for compressed sensing problems in order to demonstrate the robustness and power of the proposed algorithm.


1. Introduction

1.1. Multi-penalty regularization for unmixing problems

Support recovery of a sparse high-dimensional signal remains a challenging task, from both theoretical and practical perspectives, for a variety of applications in harmonic analysis, signal processing, and compressed sensing; see [1, 2] and references therein. Indeed, provided with the signal's support, the signal entries can easily be recovered with optimal statistical rate [3]. Therefore, support recovery has been a topic of active and fruitful research in recent years [4, 5, 6]. One typically considers linear observation models of the form

(1)   $y = A u^\dagger + \eta$,

where $u^\dagger \in \mathbb{R}^N$ is the unknown signal with only a few non-zero entries, $A \in \mathbb{R}^{m \times N}$ is the linear measurement matrix, $\eta \in \mathbb{R}^m$ is a noise vector affecting the measurements, and $y \in \mathbb{R}^m$ is the result of the measurement, usually with $m \ll N$. However, in a more realistic scenario, it is very common that noise is present not only in the measurements but also in the signal itself. In this context we study the impact of the noise folding phenomenon, a situation commonly encountered in compressed sensing [7, 8] due to subsampling in the presence of signal noise. Mathematically, we can formalize this scenario by the model

(2)   $y = A(u^\dagger + v^\dagger) + \eta$,

where $v^\dagger \in \mathbb{R}^N$ is the signal noise and $\eta$ is again a measurement noise. The impact of signal noise on the exact support recovery of the original signal was reported and analysed in [7, 8]. Essentially, the authors of [8] claim that exact support recovery is possible only when the number of measurements scales linearly with the ambient dimension $N$, leading to poor compression performance in such cases. These negative results can be circumvented by using, for instance, decoding procedures based on multi-penalty regularization, as proposed in [9, 10, 11]. There, the authors provided theoretical and numerical evidence of the improved performance of multi-penalty regularization schemes for the solution of (2), especially with respect to support identification, as compared to their single-penalty counterparts. Inspired by these results, in this paper we consider the following minimization problem:

(3)   $\min_{u,v} \; \|A(u+v) - y\|_2^2 + \alpha \|u\|_1 + \beta \|v\|_2^2$,

where $\alpha, \beta > 0$ are regularization parameters and $\|\cdot\|_p$ denotes the $\ell_p$-norm. We will denote any solution of (3) by $(u_{\alpha,\beta}, v_{\alpha,\beta})$. The functional (3) uses a non-smooth $\ell_1$ term for promoting sparsity of $u$ and a quadratic penalty term for modeling the noise $v$.

One of the main ingredients for the optimal performance of multi-penalty regularization is an appropriate choice of the multiple regularization parameters, ideally in a data-adaptive manner. This issue has not been studied so far for this type of problem. Instead, earlier works rely on brute-force approaches to choosing the parameters.

To circumvent difficulties related to the choice of regularization parameters, in this paper we propose a two-step procedure for reconstructing the support of the signal of interest $u^\dagger$. First, we compute in a cost-efficient way all possible sufficiently sparse supports and sign patterns of the signal attainable from $y$ by means of (3) for different regularization parameters. Then we employ standard regression methods for estimating a vector that provides the best explanation of the datum $y$ for each found support and sign pattern. As such, we have a complete overview of the solution behaviour over the range of parameters, without imposing any a priori assumptions. This allows us to choose regularization parameters in a data-adaptive manner.
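To make the two-step idea concrete, here is a minimal Python sketch, assuming the set of candidate supports has already been enumerated (as developed in Sections 3-5); the function name `two_step_recovery` and the residual-plus-sparsity score are our own illustrative choices, not the selection criterion of the paper.

```python
# Hypothetical illustration of the two-step procedure: (1) refit by least
# squares on each candidate support, (2) keep the best-scoring candidate.
# The scoring rule below is a stand-in, not the paper's criterion.
import numpy as np

def two_step_recovery(A, y, candidate_supports, sparsity_penalty=1e-2):
    best, best_score = None, np.inf
    for S in candidate_supports:
        S = list(S)
        u_S, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)  # refit on support
        score = (np.linalg.norm(A[:, S] @ u_S - y) ** 2
                 + sparsity_penalty * len(S))              # toy criterion
        if score < best_score:
            best, best_score = (S, u_S), score
    return best
```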

1.2. Related work

The formulation of multi-penalty functionals of the type (3) is not novel. Especially in image and signal analysis, multi-penalty regularization has been presented and analysed in seminal papers by Meyer [12] and Donoho [13, 14]. We refer to [10, 15] for a thorough overview of the work on multi-penalty regularization in the image and signal processing communities.

The multi-penalty regularization functional (3) is equivalent to Huber-norm regularization [16], which is commonly used in image super-resolution and computer-graphics problems. The recent work [17] provides an upper bound on the statistical risk of Huber-regularized linear models by means of the Rademacher complexity. It also provides empirical evidence that the support vector machine with the Huber regularizer outperforms its single-penalty counterparts and the Elastic Net [18] on a wide range of benchmark datasets. In this paper, we pursue a different goal: we focus on support recovery rather than classification or regression problems. In this context, systematic theoretical and numerical investigations have only been undertaken recently [9, 10, 11].

At the same time, one of the key ingredients for optimal regularization is an adaptive parameter choice. This is generally a challenging task, especially so in a multi-parameter setting, and it has not yet been studied in the context of sparse support recovery. Grasmair and Naumova [11] present a solution that is conceptually closest to the present paper and also provide bounds on admissible regularization parameters. However, these bounds require accurate knowledge about the nature of the signal and the noise, which is rarely available in practice.

1.3. Contributions

The paper provides an efficient algorithm for the identification of parameter regions leading to structurally similar solutions of the multi-penalty functional (3), i.e., solutions with the same sparsity and sign pattern. The main advantage of the proposed algorithm is that, without strong a priori assumptions on the solution, it provides an overview of the solution stability over the whole parameter range. This information can be exploited for obtaining additional insights into the problem of interest. Furthermore, by combining the algorithm with a simple region selection criterion, the regularization parameters can be chosen in a fully adaptive, data-driven manner. In Section 6, we provide extensive numerical evidence that our combined algorithm shows better performance in terms of support recovery when compared to its closest counterparts, such as the Lasso and the pre-conditioned Lasso (pLasso), while still having a reasonable computational complexity. It is worthwhile to stress, however, that our method recovers the solutions $u_{\alpha,\beta}$ and $v_{\alpha,\beta}$ for the entire range of parameters rather than only at specific parameter combinations.

1.4. Organization of the paper

The rest of the paper is organized as follows. Section 2 contains the complete problem set-up, explains multi-penalty regularization, and provides an in-depth overview of the relevant existing literature. In Section 3, we introduce tilings as a conceptual framework for studying the structure of the support of the solution. Section 4 presents an algorithmic approach to the construction of the complete support tiling. In Section 5 we then discuss the actual realization of our algorithm, together with its complexity analysis. Numerical experiments are carried out in Section 6. Finally, Section 7 offers a snapshot of the main contributions and points out open questions and directions for future work.

1.5. Notation

We provide a short overview of the standard notation used in this paper. The true solution $u^\dagger$ of the unmixing problem (2) is called $k$-sparse if it has at most $k$ non-zero entries, i.e., $|\operatorname{supp}(u^\dagger)| \le k$, where $\operatorname{supp}(u) := \{i : u_i \neq 0\}$ denotes the support of $u$. For a matrix $A$, we denote its transpose by $A^T$. The restriction of the matrix $A$ to the columns indexed by an index set $S$ is denoted by $A_S$, i.e., $A_S = (a_i)_{i \in S}$. The matrix $\mathrm{Id}$ denotes the identity matrix of the relevant size. The sign function $\operatorname{sign} \colon \mathbb{R} \to 2^{\mathbb{R}}$ is interpreted as the set-valued function $\operatorname{sign}(t) = \{1\}$ if $t > 0$, $\operatorname{sign}(t) = \{-1\}$ if $t < 0$, and $\operatorname{sign}(t) = [-1, 1]$ if $t = 0$, applied componentwise to the entries of a vector.

2. Multi-parameter regularization

In this section, after formally introducing multi-penalty regularization for solving (2) and the relevant known results, we discuss a possible parameter choice for the multi-penalty functional (3) based on the Lasso path [19]. We show that an extension of the single-penalty Lasso path to the multi-parameter case by partial discretization of the parameter space can have difficulties in capturing the exact support, especially when the solution is very sensitive with respect to parameter changes.

Inspired by recent theoretical results [11], we propose to solve the unmixing problem (2) using the multi-penalty Tikhonov functional (3), where $\alpha > 0$ and $\beta > 0$ are regularization parameters. The starting point for the approach proposed in this paper is the observation that (3) reduces to standard $\ell_1$-regularization, but with the measurement matrix and the datum additionally transformed in a way governed by the second regularization parameter $\beta$. As has been shown in [11], this modification leads to superior performance over standard sparsity-promoting regularization. The main result to that end is the following.

Lemma 1.

The pair $(u_{\alpha,\beta}, v_{\alpha,\beta})$ solves (3) if and only if

$v_{\alpha,\beta} = (A^T A + \beta\,\mathrm{Id})^{-1} A^T (y - A u_{\alpha,\beta})$

and $u_{\alpha,\beta}$ solves the optimisation problem

(4)   $\min_u \; \|\tilde{A}_\beta u - \tilde{y}_\beta\|_2^2 + \alpha \|u\|_1$

with $\tilde{A}_\beta := \sqrt{\beta}\,(A A^T + \beta\,\mathrm{Id})^{-1/2} A$ and $\tilde{y}_\beta := \sqrt{\beta}\,(A A^T + \beta\,\mathrm{Id})^{-1/2} y$.
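As a quick illustration of Lemma 1, the following sketch (not the authors' code) solves (3) for fixed $(\alpha, \beta)$ by running an off-the-shelf Lasso solver on the transformed data and recovering $v$ from the closed-form expression; the rescaling of scikit-learn's `alpha` accounts for its $1/(2m)$ objective normalization.

```python
# A minimal numerical sketch of the reduction in Lemma 1 (our own notation):
# for fixed beta, run a standard Lasso solver on the transformed data
# (A_tilde, y_tilde) and recover v from u via the closed-form expression.
import numpy as np
from scipy.linalg import sqrtm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, N, alpha, beta = 20, 50, 0.1, 1.0
A = rng.standard_normal((m, N)) / np.sqrt(m)
u_true = np.zeros(N); u_true[:3] = [1.5, -2.0, 1.0]
y = A @ (u_true + 0.05 * rng.standard_normal(N))

# Transformed single-penalty problem (4); sqrtm of an SPD matrix.
W = np.real(sqrtm(beta * np.linalg.inv(A @ A.T + beta * np.eye(m))))
A_tilde, y_tilde = W @ A, W @ y

# sklearn minimizes (1/(2m))||Xw - y||^2 + lam||w||_1, so rescale alpha.
lasso = Lasso(alpha=alpha / (2 * m), fit_intercept=False)
u = lasso.fit(A_tilde, y_tilde).coef_

# Closed-form noise component from Lemma 1.
v = np.linalg.solve(A.T @ A + beta * np.eye(N), A.T @ (y - A @ u))
```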

In the following, we will assume that (3) always has a unique solution within the considered parameter range.

Assumption 1.

For all parameters $\alpha > 0$ and $\beta > 0$ considered in the following, the optimization problem (3) has a unique solution $(u_{\alpha,\beta}, v_{\alpha,\beta})$.

As a consequence of this assumption, we have the following:

Lemma 2.

If Assumption 1 holds, then $u_{\alpha,\beta}$ and $v_{\alpha,\beta}$ depend continuously on the parameters $\alpha$ and $\beta$.

Proof.

Since $u_{\alpha,\beta}$ is a solution of the single-parameter problem (4), it is uniquely characterized by the Karush-Kuhn-Tucker (KKT) conditions

(5)   $\tilde{A}_\beta^T (\tilde{y}_\beta - \tilde{A}_\beta u) \in \frac{\alpha}{2} \operatorname{sign}(u).$

Now note that $\tilde{A}_\beta$ and $\tilde{y}_\beta$ depend continuously on $\beta$, which implies that the KKT conditions also change continuously with both $\alpha$ and $\beta$. Because of the uniqueness of $u_{\alpha,\beta}$, its continuous dependence on the parameters follows; the continuity of $v_{\alpha,\beta}$ then follows from the explicit formula in Lemma 1. ∎

The behaviour of $\ell_1$-regularization is by now fairly well understood [20], and there are well-established approaches, especially in the statistical learning literature, for calculating the regularized solution as a function of $\alpha$ [19], i.e., for finding the solution for all $\alpha > 0$. The existing algorithms are based on the observation that the solution path of the $\ell_1$-regularized or, as it is called in statistics, Lasso problem is piecewise linear in each component of the solution [21]. Our goal in this paper is to provide an efficient procedure for tracking the behaviour of $u_{\alpha,\beta}$ in terms of its support and sign pattern as a function of both regularization parameters. We will assume that the regularization parameter $\beta$ is restricted to a finite interval, which ideally should be chosen depending on the problem at hand. For the sake of self-containedness, we present the Lasso path algorithm from a conceptual point of view, and we refer to [19] for rigorous arguments for its correctness.

The Lasso path

The algorithm is essentially given by an iterative, or inductive, verification of the optimality conditions (5), starting, for fixed $\beta$, with a sufficiently large $\alpha$ so that the solution of (4) is trivial. As the parameter $\alpha$ decreases, the algorithm computes a solution that is piecewise linear and continuous as a function of $\alpha$. Each knot in this path corresponds to an iteration of the algorithm, in which the update direction of the solution is altered in order to satisfy the optimality conditions.

The support of $u_{\alpha,\beta}$ changes only at the knot points: as $\alpha$ decreases, each knot corresponds to entries being added to or deleted from the support of the solution. Such a model is attractive and efficient because it allows one to generate the whole regularization path simply by sequentially finding the knots and calculating the solution at those points. For fixed $\beta$, these knots can be computed explicitly; see Lemma 3 below.
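For the single-penalty case, the knots and the piecewise-linear path can be computed with scikit-learn's LARS-based implementation; below is a minimal sketch with synthetic data (our own toy example):

```python
# Sketch of the single-penalty Lasso path: lars_path returns the knot
# values of alpha and the coefficients at the knots; between consecutive
# knots the path is linear, so support and sign pattern stay constant.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
m, N = 20, 50
A = rng.standard_normal((m, N))
u_true = np.zeros(N); u_true[[2, 7, 11]] = [1.0, -0.5, 2.0]
y = A @ u_true + 0.01 * rng.standard_normal(m)

alphas, active, coefs = lars_path(A, y, method="lasso")
for a, u in zip(alphas, coefs.T):      # one line per knot
    print(f"alpha={a:.4f}, support={np.flatnonzero(u)}")
```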

The discretized Lasso path for multi-penalty regularization

A proper choice of $\beta$ is essential for a good performance of (3). The most straightforward way of extending the Lasso path algorithm to the multi-parameter setting is to discretize the range of the parameter $\beta$ and then solve the single-penalty problem (4) obtained for each discretization point.

Specifically, we choose an upper bound $\beta_{\max}$ and a lower bound $\beta_{\min}$ of possible values of $\beta$, and discretization points $\beta_{\min} = \beta_1 < \beta_2 < \dots < \beta_n = \beta_{\max}$. For each $\beta_i$ we can use the Lasso path algorithm in order to compute the solutions of the multi-penalty regularization problem with $\beta = \beta_i$ and all parameters $\alpha$ up to a predefined sparsity level. That is, we start for each $\beta_i$ with some sufficiently large $\alpha$ such that the corresponding index set is $S = \emptyset$ and the sign pattern is empty. Then we inductively compute the knot points where, as discussed above, indices either enter or leave the support of the solution $u_{\alpha,\beta_i}$. This allows us to update the current support and sign pattern according to the following result:
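A hedged sketch of this discretized strategy, combining the grid over $\beta$ with the transformation of Lemma 1 (helper and variable names are ours):

```python
# Sweep a grid of beta values, run the transformed Lasso path for each,
# and collect all observed (support, sign pattern) pairs up to a given
# sparsity level. A simplified stand-in for the discretized approach.
import numpy as np
from scipy.linalg import sqrtm
from sklearn.linear_model import lars_path

def supports_on_grid(A, y, betas, max_support=10):
    m = A.shape[0]
    patterns = set()
    for beta in betas:
        W = np.real(sqrtm(beta * np.linalg.inv(A @ A.T + beta * np.eye(m))))
        alphas, _, coefs = lars_path(W @ A, W @ y, method="lasso")
        for u in coefs.T:
            s = np.flatnonzero(u)
            if 0 < len(s) <= max_support:
                patterns.add((tuple(s), tuple(np.sign(u[s]).astype(int))))
    return patterns
```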

Lemma 3 (Lasso path algorithm for (4), see [19]).

Assume that the knot $\alpha_k$ has already been computed and that the solution has the support $S$ and sign pattern $\sigma$ for values of $\alpha$ slightly smaller than $\alpha_k$. We define for every index $i$

(6)

Here the first case in (6) applies unless $i$ was contained in the support of the solution at the previous step; in this case, we take the sign opposite to that of the numerator of the first case in (6).

Then we have, with

that

(7)

The updated index set is formed from $S$ by adding all indices $i \notin S$ for which the maximum in (7) is attained, or by removing all indices $i \in S$ for which the maximum in (7) is attained. In addition, the signs corresponding to the newly added indices are precisely the corresponding values from (6).

Proof.

This is nothing else than the usual Lasso path algorithm applied to the equivalent single parameter problem (4). See [19] for more details. ∎

Provided that the discretization of the parameter $\beta$ is sufficiently fine, this method is capable of computing a reasonable approximation of the complete tiling over the parameter space and, in particular, of identifying all possible supports and sign patterns of the regularized solution $u_{\alpha,\beta}$.

However, for certain configurations it can happen that the support of the correct solution is attained only for a small range of $\beta$-parameters. An example of such a situation is presented in Figure 1. Thus, a very fine discretization of the parameter range would be needed in order to capture all possible supports. Moreover, in practice, often even the support size is unknown and must be chosen according to some heuristic principle. In this situation, having an overview of the solution behaviour over the whole range of parameters, rather than the solution at discrete points, can provide additional hints for such a choice.

Figure 1. Part of the parameter space detailing the different solutions. Each of the different tiles corresponds to a different support or sign pattern of the solution $u_{\alpha,\beta}$. Note in particular the small details indicating that certain supports can only be obtained for very specific parameter values. In this particular case, the yellow tile in the center of the zoomed region describes the parameters leading to exact support recovery.
Remark 4.

It is worthwhile to mention that multi-penalty regularization can be interpreted as an interpolation between the standard and the pre-conditioned [22] Lasso. The latter computes approximations of the solution of the equation $y = Au$ by solving the optimization problem

$\min_u \; \|F(y - Au)\|_2^2 + \alpha \|u\|_1,$

where $F = U \Sigma^+ U^T$ and $A$ has the singular value decomposition $A = U \Sigma V^T$, where $\Sigma^+$ denotes the pseudo-inverse of $\Sigma$, that is, the diagonal matrix with non-zero entries equal to the inverses of the non-zero entries of $\Sigma$.

Indeed, as $\beta \to 0$ (with $\alpha$ rescaled proportionally to $\beta$), the solution of the multi-penalty regularization converges to the solution of the pre-conditioned Lasso. This can be seen by noting that (4) is equivalent to the following minimization problem

(8)   $\min_u \; \|\hat{A}_\beta u - \hat{y}_\beta\|_2^2 + \frac{\alpha}{\beta}\|u\|_1$

with $\hat{A}_\beta := (A A^T + \beta\,\mathrm{Id})^{-1/2} A$ and $\hat{y}_\beta := (A A^T + \beta\,\mathrm{Id})^{-1/2} y$.

As $\beta \to 0$ we have $\hat{A}_\beta \to FA$ and $\hat{y}_\beta \to Fy$. By the convergence of the functionals, it follows that the minimizers also converge to the pre-conditioned Lasso solution. Conversely, for $\beta \to \infty$, multi-penalty regularization (3) is equivalent to a standard Lasso.

Therefore, with a properly chosen parameter $\beta$, one can expect that multi-penalty regularization combines the advantages of both regularization methods and mitigates their drawbacks. In particular, it is known [22] that the pLasso is less robust to measurement noise and to ill-conditioned measurement operators than the standard Lasso, whereas the standard Lasso has problems identifying the correct support for problems of unmixing type, as exemplified in Section 6.
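A small sketch of the pre-conditioned Lasso as we read [22], assuming $A$ has full row rank; the helper name `plasso` is ours:

```python
# Pre-conditioned Lasso sketch: whiten the row space of A via its SVD
# before applying a standard Lasso solver. Assumes full row rank.
import numpy as np
from sklearn.linear_model import Lasso

def plasso(A, y, alpha):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    F = U @ np.diag(1.0 / s) @ U.T          # F = U Sigma^+ U^T
    m = A.shape[0]
    lasso = Lasso(alpha=alpha / (2 * m), fit_intercept=False)
    return lasso.fit(F @ A, F @ y).coef_
```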

3. Support tiling

Instead of reconstructing the solution at fixed parameters $(\alpha, \beta)$, we approach the extension of the Lasso path algorithm to multi-penalty regularization by constructing regions, so-called tiles, containing structurally similar solutions of (3), i.e., solutions with the same sparsity and sign pattern, while still preserving simplicity in the calculations. In this section, we introduce this concept, providing some geometrical interpretations for ease of understanding, together with the necessary notation.

We denote by $S(\alpha, \beta) := \operatorname{supp}(u_{\alpha,\beta})$ the support of the regularized solution $u_{\alpha,\beta}$, and by $\sigma(\alpha, \beta) := \operatorname{sign}(u_{\alpha,\beta})$ the sign pattern of $u_{\alpha,\beta}$. Given a support $S$ and a sign pattern $\sigma$, we then define

$T(S, \sigma) := \{(\alpha, \beta) : S(\alpha, \beta) = S \text{ and } \sigma(\alpha, \beta) = \sigma\},$

that is, $T(S, \sigma)$ contains all parameters that lead to the same support and sign pattern for the reconstruction of $u$. Additionally, we denote by $T(\alpha, \beta)$ the connected component of the set $T(S(\alpha,\beta), \sigma(\alpha,\beta))$ that contains $(\alpha, \beta)$. Then the family of all such connected components forms a tiling of the parameter domain.
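To make the tiling concrete, the following sketch (our own hypothetical representation, not the authors' implementation) fixes a data structure for a tile and a helper that extracts the support/sign label of a solution; the graph fields anticipate the parent/child structure introduced later in this section.

```python
# Hypothetical tile representation: parameters within one tile share a
# support and sign pattern; parent/child links encode the directed
# multigraph on the tiling described below.
from dataclasses import dataclass, field
import numpy as np

@dataclass(eq=False)                # identity-based comparison
class Tile:
    support: tuple                  # common support S(T)
    signs: tuple                    # common sign pattern sigma(T)
    beta_range: tuple               # (beta_-(T), beta_+(T))
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)
    complete: bool = False          # lower boundary known everywhere?

def tile_label(u, tol=1e-10):
    """Support and sign pattern of a solution vector u."""
    s = np.flatnonzero(np.abs(u) > tol)
    return tuple(s), tuple(np.sign(u[s]).astype(int))
```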

Given a tile $T$, we denote by $S(T)$ and $\sigma(T)$ the common support and sign pattern of the reconstructions of $u$ on the tile $T$. Moreover, we denote by $\beta_-(T)$ and $\beta_+(T)$ the left and right hand side borders of the tile $T$. We note that, because of the connectedness of the tiles, for every $\beta \in (\beta_-(T), \beta_+(T))$ there exists some $\alpha$ such that $(\alpha, \beta) \in T$. In the following lemma, we will show that the set of these parameters $\alpha$ is actually an interval (see also [19, Proof of Lemma 6]).

Lemma 5.

Assume that and . Then the set of parameters with is a non-empty interval.

Proof.

Assume that are such that for , . Then satisfy the conditions

for , , and

As a consequence, for every , we have that satisfies the same conditions with replaced by . Since and also the sign patterns on the support are the same, this implies that and thus . ∎

Next we define for every tile $T$ and every $\beta \in (\beta_-(T), \beta_+(T))$ the functions

$\alpha_+(\beta; T) := \sup\{\alpha : (\alpha, \beta) \in T\} \quad \text{and} \quad \alpha_-(\beta; T) := \inf\{\alpha : (\alpha, \beta) \in T\}.$

Provided these functions are continuous, a criterion for which will be given in the next lemma, their graphs form precisely the upper and lower boundaries of the tile $T$. We note that the conditions in the next lemma are satisfied for almost all matrices $A$.

Lemma 6.

Assume that there exist no parameter $\beta$, no subset $S$ of indices, and no sign pattern $\sigma$ such that the equations

and

are simultaneously satisfied. Then, for all tiles $T$, the functions $\alpha_+(\cdot\,; T)$ and $\alpha_-(\cdot\,; T)$ are continuous.

Proof.

Assume to the contrary that, for some tile $T$, there exists $\bar\beta$ such that the function $\alpha_-(\cdot\,; T)$ is discontinuous at $\bar\beta$. Moreover, denote

Then there exists an index and , as well as a sequence such that

for all and some . As a consequence, the KKT conditions (5) imply that

for all and . Taking the limit and recalling the continuous dependence of $u_{\alpha,\beta}$ on both $\alpha$ and $\beta$ (see Lemma 2), it follows that also

(9)

On the other hand, the continuous dependence of $u_{\alpha,\beta}$ on both $\alpha$ and $\beta$ also implies that for all .

Let now . Then for all , and thus, because of (9), we have

for all . Because for all these , it follows that both

and

which contradicts the assumption that these two terms cannot be simultaneously equal to zero. ∎

Graph structure of the tiling

In the following we discuss the structure of the tiling in more detail in order to formulate an algorithm for its computation. To that end, we introduce the structure of a directed multigraph on the tiling by including an edge from a tile $T$ to another tile $T'$ for each maximal open and non-empty interval $I$ on which

(10)   $\alpha_-(\beta; T) = \alpha_+(\beta; T') \quad \text{for all } \beta \in I.$

Two tiles are thus connected by an edge whenever they share a common boundary, and we define the edge to run from the tile with larger values of $\alpha$ to the tile with smaller values of $\alpha$. Moreover, we denote by $I(e)$ the maximal interval on which (10) holds.

Figure 2. Example of part of a tiling (left) and the corresponding directed multigraph (right). Note that one tile is connected to another via two different edges. The ticks on the $\beta$-axis are the left and right borders of the tiles, i.e., the values $\beta_\pm(T)$.

We note here that it is possible that two tiles are connected by several edges, if their common boundary consists of disjoint intervals, as depicted in Figure 2 (left panel).

Now let $T$ be any tile with edges $e_1, \dots, e_k$ starting at $T$. Then each edge $e_j$ corresponds to a maximal open subinterval $I(e_j)$ of $(\beta_-(T), \beta_+(T))$, and these subintervals are pairwise disjoint. Therefore we can define an order on the set of edges starting in $T$ by setting $e_i < e_j$ if the interval $I(e_i)$ lies to the left of $I(e_j)$. Also, we can order the edges leading into $T$ in an analogous manner.

In order to find simultaneously the edges and the functions $\alpha_-(\cdot\,; T)$, one can employ an adaptation of the Lasso path algorithm to the multi-parameter problem, as presented in Lemma 3.

Lemma 7.

Assume that $e$ is an edge leading from the tile $T$ into the tile $T'$, and that $I(e)$ is the interval on which (10) holds. For $\beta \in I(e)$, denote the quantities defined in (6), and let

(11)

Then for each $\beta \in I(e)$ we have

(12)

If additionally the maximum in (12) is attained at a single index, then the corresponding boundary value equals the parameter in the definition of (6).

Remark 8.

The previous lemma provides a method for simultaneously computing the functions $\alpha_-(\cdot\,; T)$ and finding the edges leading out of $T$: In the special case where, for some given $\beta$, the maximum in (12) is attained at a single index $i$, the neighboring tile has the support $S(T) \cup \{i\}$ if $i \notin S(T)$, and $S(T) \setminus \{i\}$ otherwise. Also, the boundary between $T$ and the neighboring tile is, in a neighborhood of $\beta$, given by the corresponding function from (6).

In order to simplify the further discussion, we introduce the notion of parents and children. Given a tile $T$, we call its parents all tiles with edges leading to $T$, and its children all tiles with edges starting from $T$. Using the order of the edges between a tile and its children, we can in particular define the youngest child and the oldest child in the following way: the youngest child of $T$ is the child corresponding to the minimal edge starting in $T$, and the oldest child is the child corresponding to the maximal edge. Because there might be multiple edges between $T$ and any of its children, it can happen that the youngest and the oldest child coincide even though the tile $T$ has multiple children. In an analogous manner, we define the youngest and the oldest parent of $T$ as the tiles corresponding to the minimal and maximal edges leading into $T$.

Finally, we recall that for each parameter $\beta$ and sufficiently large $\alpha$, we have $u_{\alpha,\beta} = 0$. As a consequence, the directed graph describing the tiling is actually rooted, with the root given by the tile corresponding to the zero solution with support set $S = \emptyset$. This root is the only tile that does not have any parents, and all other tiles are its descendants. Figure 2 depicts the directed graph (right panel) representing the tiling (left panel).

4. Algorithmic approach

In the following, we discuss an algorithmic approach to the construction of the complete support tiling, that is, of all tiles $T$ together with the corresponding supports $S(T)$ and sign patterns $\sigma(T)$, and the boundaries of the tiles, which are given by $\beta_\pm(T)$ and the functions $\alpha_\pm(\cdot\,; T)$. Starting with the root tile, we construct the lower boundary and the children of the current tile using equation (12). To that end, we first subdivide the interval $(\beta_-(T), \beta_+(T))$ into subintervals on which the index set in (11) is constant. On each of these subintervals, the indices that are either added to or removed from the current support can be identified as the maximizers in (12). In order to simplify the algorithm, we will assume that these maxima are attained at a single index for almost all parameters $\beta$. We note that this is not a severe restriction, as it holds for almost all matrices $A$. Also, a similar restriction has been used in the original Lasso path algorithm [23].

Assumption 2.

For each tile $T$ there are at most finitely many values of $\beta$ such that the maximum in (12) is attained simultaneously at different indices $i$.

Since the functions defining the maximum in (12) are rational in $\beta$, this is equivalent to the assumption that all these functions are distinct. As a consequence, the assumption is very weak and can safely be assumed to hold in all practical situations.

Each tile $T$ processed by the algorithm is defined by means of its support and sign structure $S(T)$, $\sigma(T)$, together with a range of $\beta$ values and all of its parents with edges defined over these values. In particular, this means that its upper boundary $\alpha_+(\cdot\,; T)$ is well-defined for the given range. However, we do not necessarily assume that all the children of $T$ have already been constructed, and thus the lower boundary $\alpha_-(\cdot\,; T)$ need not be well-defined everywhere. In such a case, we say that the tile $T$ is incomplete. See Figure 3 for a sketch of a situation where this occurs.

In each step of the algorithm, we choose an incomplete tile $T$. We then compute all of its children together with the function $\alpha_-(\cdot\,; T)$, and thus complete the tile. After its completion, we merge the newly created children with previously established tiles, if necessary.
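A high-level sketch (ours, deliberately simplified) of this loop, reusing the hypothetical `Tile` structure from Section 3; `compute_children` and `merge_with_neighbors` stand for the procedures of Sections 4.1 and 4.2:

```python
# Worklist loop: complete one incomplete tile at a time. We assume
# compute_children(tile) completes `tile` and returns its (preliminary)
# children, and merge_with_neighbors returns an already-known tile
# whenever the child merges into one.
def build_tiling(root, compute_children, merge_with_neighbors):
    incomplete, tiles = [root], [root]
    while incomplete:
        tile = incomplete.pop()
        for child in compute_children(tile):
            child = merge_with_neighbors(child, tiles)
            if child not in tiles:      # identity check (eq=False above)
                tiles.append(child)
                incomplete.append(child)
        tile.complete = True
    return tiles
```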

4.1. Computation of children

Figure 3. Exemplary extracts of support tilings where incomplete ranges occur upon first discovery. (a): Coming from either parent, the first discovery of the tile reveals only its left or its right part. (b): In Step 1) we discover a part of one tile and the complete neighboring tile. In Step 2), we discover a further part of the first tile and subsequently the rest of it. Hence we search for children of a tile before having discovered it completely, and this situation can repeat itself.

Assume that we are given an incomplete tile $T$ with $\beta$-range $(\beta_-(T), \beta_+(T))$ and a current (incomplete) set of children. We also keep track of the set of edges connecting $T$ with its children, and of the subset of the $\beta$-range on which the lower boundary is already known.

We next compute a subdivision of the remaining range into subintervals on each of which the index set in (11) is constant. Given a connected component of this range, any of the functions defined in (6) either lies below $\alpha_-(\cdot\,; T)$ on the whole interval, lies above it on the whole interval, or has discontinuities within it. In the latter case, because of the definition of the functions in (6), these discontinuities can only occur at the parameters $\beta$ for which either

(13)

or

In the following result, we will show that, actually, only the first of these cases can occur.

Lemma 9.

For all $\beta$ we have

where the quantity is defined as in (6).

Proof.

The optimality conditions (5) imply that

(14)

for all indices. Moreover,

Inserting this into (14), we obtain the inequality

(15)

for all indices.

Assume now that the index was not contained in the support of the solution in the previous step of the Lasso path algorithm. Then, see Lemma 3,

If additionally , then (15) reduces to an inequality of the form

with , which is impossible.

Conversely, if the index was contained in the support of the solution in the previous step of the Lasso path algorithm, then the upper boundary of the tile at this point is given by

Thus

If in this case , then we obtain from (15) the inequality

for all , which is again impossible. ∎

As a consequence of Lemma 9, if we subdivide the remaining range into intervals

(16)

whose boundary points are either original boundary points of that range or points for which the condition (13) holds for some index, then the index set in (11) remains constant over each of the subintervals.

Choose now one of these subintervals and denote by $S$ the corresponding constant index set. Then we can compute a preliminary set of all children of $T$ within this subinterval as follows (a code sketch follows the list):

  1. Start at one end point of the subinterval.
  2. Compute the index at which the maximum in (12) is attained for values of $\beta$ slightly inside the subinterval.
  3. Add a child to the preliminary set:
  if the index is not contained in the current support then
               add it to the support, keeping the signs of the remaining entries.
  else
               remove it from the support, keeping the signs of the remaining entries.
  end if
  4. Compute the next solution of any of the equations in (13):
  if it lies within the current subinterval then
               move $\beta$ to this solution and repeat from step 2.
  else
               proceed to the end point of the subinterval and stop.
  end if
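The following schematic sketch (ours) mirrors steps 1-4 for a single subinterval; sign-pattern bookkeeping is elided, and `argmax_index` and `next_crossing` stand for the explicit formulas behind (12) and (13), assuming `next_crossing` always returns a strictly larger $\beta$:

```python
# Scan one subinterval [b_start, b_end]: at each current beta, find the
# index attaining the maximum in (12), record the corresponding child,
# and jump to the next crossing point of the functions in (6).
def children_on_subinterval(tile, b_start, b_end, argmax_index, next_crossing):
    children, b = [], b_start
    while b < b_end:
        i = argmax_index(tile, b)                # step 2
        if i not in tile.support:                # step 3: index enters
            support = tuple(sorted(tile.support + (i,)))
        else:                                    # step 3: index leaves
            support = tuple(j for j in tile.support if j != i)
        b_next = min(next_crossing(tile, b), b_end)  # step 4
        children.append((support, (b, b_next)))
        b = b_next
    return children
```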

Now assume that all subintervals have been processed in this manner. Then the tile $T$ is completed in the sense that its lower boundary is defined everywhere, and a preliminary set of children of $T$ is defined on the whole range of values of $\beta$ in $(\beta_-(T), \beta_+(T))$. However, it is possible that some of these preliminary children should actually be merged, because they point to the same tile. This will occur at the boundaries of the subintervals, as all of the subintervals have been processed independently of each other.

In order to merge children, we scan through all preliminary children of $T$ from oldest to youngest. If we find two adjacent children with the same support $S$ and sign pattern $\sigma$, we merge them in the sense that we treat them as a single child. The corresponding range for this new child is the union of the ranges of the preliminary children.
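A small sketch (ours) of this merging scan, assuming each preliminary child is recorded as a `(support, signs, beta_range)` triple ordered by its $\beta$-range:

```python
# Merge adjacent preliminary children sharing support and sign pattern;
# the merged child's beta-range is the union of the individual ranges.
def merge_adjacent(children):
    merged = []
    for support, signs, (lo, hi) in children:
        prev = merged[-1] if merged else None
        if prev and prev[0] == support and prev[1] == signs and prev[2][1] >= lo:
            merged[-1] = (support, signs, (prev[2][0], max(prev[2][1], hi)))
        else:
            merged.append((support, signs, (lo, hi)))
    return merged
```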

4.2. Merging procedure

After a tile $T$ has been processed as described in Section 4.1, the children of this tile will be defined on the whole interval $(\beta_-(T), \beta_+(T))$. However, the children defined at the boundaries of this interval, that is, the oldest and the youngest child of $T$, might coincide with children of neighboring tiles to the left or right, which again might require the merging of these children. In order to perform this merging, we make use of the ordered graph structure we imposed on the tiling.

We first consider the youngest child. Starting with $T$, we trace back the line of youngest parents until we reach an ancestor for which the branch leading to $T$ does not start with its youngest child. In case no such tile exists, the tile has no possible merging partners. Now denote by the child of that ancestor in the line leading to $T$, and let be its next younger sibling. Then the potential merging partners for