Robust method for finding sparse solutions to linear inverse problems using an L2 regularization

01/03/2017 ∙ by Gonzalo H. Otazu, et al.

We analyzed the performance of a biologically inspired algorithm called the Corrected Projections Algorithm (CPA) when a sparseness constraint is required to unambiguously reconstruct an observed signal using atoms from an overcomplete dictionary. By changing the geometry of the estimation problem, CPA gives an analytical expression for a binary variable that indicates the presence or absence of a dictionary atom using an L2 regularizer. The regularized solution can be implemented using an efficient real-time Kalman-filter type of algorithm. The smoother L2 regularization of CPA makes it very robust to noise, and CPA outperforms other methods in identifying known atoms in the presence of strong novel atoms in the signal.


1 Introduction

Representation of time-varying signals in terms of a sparse set of atoms from an overcomplete dictionary is important in machine learning, and it has been proposed as one of the fundamental computations of the nervous system. Reconstruction algorithms create an estimate of the observed signal using appropriately weighted atoms of an overcomplete dictionary. The overcompleteness of the dictionary creates a situation where multiple sets of dictionary atoms reconstruct the signal equally well. Under these conditions, a solution that minimizes the number of dictionary atoms that make a non-zero contribution to the signal is preferred. Directly finding the solution that minimizes the number of atoms used from the dictionary (the L0 norm) is an NP-hard problem, making it intractable even for moderately sized dictionaries. However, it has been shown that under very general conditions the solution that minimizes the sum of the absolute values of the contributions of the dictionary atoms (the L1 norm) can be used to identify the sparsest solution [Don06]. Although there is no analytical expression for the minimal L1 norm solution, there are effective algorithms for finding L1 minimal solutions, which has triggered an explosion of interest in the use of the L1 norm for sparse reconstructions. On the other hand, minimization of the sum of the squares of the contributions (the L2 norm) yields an analytical solution. However, the minimal L2 norm solutions are, in general, not sparse.

(This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.)

In this contribution we show that an algorithm described as a model for sound identification in the mammalian auditory system [OL11], the Corrected Projections Algorithm (CPA), can be used for finding sparse representations of temporally uncorrelated atoms chosen from overcomplete dictionaries. CPA yields an optimization problem that directly infers the presence or absence of all the dictionary atoms by way of a least-squares minimization with an L2 constraint on the estimated parameters. The L2 regularization adds robustness in the presence of strong noise, outperforming standard sparse representation methods in novel situations.

In section 2 we present the problem and the algorithm and show how the algorithm identifies the atoms present in the overdetermined case. In section 3 we show that the algorithm can be extended to an overcomplete situation where the number of temporal observations is insufficient for an unambiguous reconstruction of the signal, and show that the change in the geometry of the problem allows for a reconstruction using the L2 norm. In section 4 we show how the L2 regularized solution can be implemented using a Kalman filter. In section 5 we compare the performance of CPA with other sparse representation algorithms. We show that CPA produces weak dense representations for novel atoms, whereas other sparse representation algorithms create sparse, amplitude-dependent representations of novel atoms. We also show that CPA outperforms other methods in identifying known atoms in the presence of strong novel atoms.

2 Corrected Projections Algorithm

2.1 Problem statement

At each time $t$, an observation vector $x_t$ of dimension $N \times 1$ is produced by the combination of a few atoms of dimension $N \times 1$ from a dictionary of $M$ possible atoms. That is:

$$x_t = \sum_{i=1}^{M} a_i(t)\, d_i \qquad (1)$$

where $d_i$ is the i-th dictionary atom, which is an $N \times 1$ vector, and $a_i(t)$ is the contribution of the i-th dictionary atom at time $t$. We will represent any $N \times 1$ vector using an arrow symbol (e.g., $\vec{x}$). The general assumption is that the observation is generated by a few dictionary atoms that belong to an active set $S$ of atoms, whereas the contributions of the other atoms are equal to zero:

$$a_i(t) = 0 \quad \text{for } i \notin S \qquad (2)$$

In order to determine the dictionary atoms that contributed to the observed signal, an estimate $\hat{x}_t$ of each temporal observation is calculated by combining the dictionary atoms by way of estimates $\hat{a}_i(t)$ of the contributions of each dictionary atom, that is:

$$\hat{x}_t = \sum_{i=1}^{M} \hat{a}_i(t)\, d_i \qquad (3)$$

The estimates of the contributions are found by minimizing a cost function defined as the Euclidean distance between the estimate and the observation, that is:

$$C = \left\| x_t - \hat{x}_t \right\|^2 \qquad (4)$$

We will represent the L2 norm of a column vector $v$ as $\|v\|$. In general, for large dictionaries, there might be multiple solutions that minimize the cost function. Therefore, current algorithms complement the cost function with a penalty associated with the value of the estimated contribution of the dictionary atoms, that is:

$$C = \left\| x_t - \hat{x}_t \right\|^2 + \lambda \sum_{i=1}^{M} \left| \hat{a}_i(t) \right| \qquad (5)$$

where $\lambda$ is a parameter that determines the balance between the estimation error and the degree of sparseness of the solution. Single measurement algorithms such as Matching Pursuit [MZ93] calculate a time-varying estimate of the contribution for each individual observation while trying to reduce the number of dictionary atoms that make a non-zero contribution. Conversely, multiple measurement vector (MMV) versions of these algorithms [CRKEKD05] take into account all the temporal observations available of the signal and use the sparseness constraint to minimize the number of dictionary atoms that make any contribution at any point in time, that is:

$$C = \sum_{t=1}^{T} \left\| x_t - \hat{x}_t \right\|^2 + \lambda \sum_{i=1}^{M} \left\| \hat{a}_i \right\| \qquad (6)$$

where $\hat{a}_i = [\hat{a}_i(1), \dots, \hat{a}_i(T)]^T$. The use of multiple observations allows for better estimates compared to using individual observations.

2.2 Corrected Projections Algorithm

CPA is similar to multiple measurement vector algorithms in that it uses the information from multiple observations to identify a time-invariant binary variable $u_i$ for each dictionary atom that indicates whether or not the corresponding atom was present in any of the temporal observations of the signal. These binary variables, called the presence parameters, can be written as a column vector:

$$u = [u_1, u_2, \dots, u_M]^T \qquad (7)$$

In order to create the estimate of the observed signal, CPA combines a time-varying rough estimate $\tilde{a}_i(t)$ of the contribution of an individual dictionary atom corrected by the corresponding time-invariant presence parameter $u_i$, that is:

$$\hat{x}_t = \sum_{i=1}^{M} u_i\, \tilde{a}_i(t)\, d_i \qquad (8)$$

The rough estimate at time $t$ of the contribution of a given dictionary atom is given by the scalar product between the current observation and the dictionary atom, that is:

$$\tilde{a}_i(t) = x_t^T d_i \qquad (9)$$

We can define an $N \times M$ projection matrix $P_t$ at time $t$, where each column corresponds to a dictionary atom weighted by the rough estimate of its contribution to the signal:

$$(P_t)_{ji} = \tilde{a}_i(t)\, d_{ij} \qquad (10)$$

where $d_{ij}$ is the j-th component of the i-th dictionary atom. We can arrange all the available observations into a long $NT \times 1$ vector $X$:

$$X = \left[ x_1^T, x_2^T, \dots, x_T^T \right]^T \qquad (11)$$

We can also arrange all the projection matrices (one for each temporal observation) into one large $NT \times M$ matrix $P$:

$$P = \left[ P_1^T, P_2^T, \dots, P_T^T \right]^T \qquad (12)$$

Using this notation, we can write the CPA estimate of the observation vector as:

$$\hat{X} = P u \qquad (13)$$

CPA estimates the presence parameter vector $u$ as the one that minimizes the square error between the observation vector $X$ and its estimate $\hat{X}$. The presence parameter vector is given by:

$$u = \left( P^T P \right)^{-1} P^T X \qquad (14)$$
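To make the construction concrete, the following sketch builds the stacked projection matrix and solves equation 14 with an ordinary least-squares routine. This is an illustration of the equations above, not the authors' released code; the function and variable names are ours, and we assume unit-norm atoms as in the rest of the paper.

```python
import numpy as np

def cpa_presence(X_obs, D):
    """Least-squares CPA presence parameters (equation 14).

    X_obs: T x N array of observations, one row per time step.
    D:     N x M dictionary with one unit-norm atom per column.
    """
    T, N = X_obs.shape
    M = D.shape[1]
    # Stack the T projection matrices P_t into one NT x M matrix (equation 12).
    P = np.zeros((T * N, M))
    for t in range(T):
        a_rough = X_obs[t] @ D                  # rough contributions (equation 9)
        P[t * N:(t + 1) * N, :] = D * a_rough   # column i scaled by its rough estimate (equation 10)
    X = X_obs.reshape(-1)                       # long NT x 1 observation vector (equation 11)
    # u = (P^T P)^{-1} P^T X, computed stably via least squares (equation 14).
    u, *_ = np.linalg.lstsq(P, X, rcond=None)
    return u
```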

Here we provide a simple proof that the solution of equation 14 identifies the atoms present in the signal.

Theorem 1.

The solution for the presence parameters given by equation 14 is $u_i = 1$ for the atoms that are present and $u_i = 0$ for the atoms that are not present, as long as:

  1. the dictionary atoms that are simultaneously present in the signal are orthogonal to each other, and

  2. the $M \times M$ matrix $P^T P$ is invertible.

Proof.

We will assume that the observed signal originates from an active set $S$ of dictionary atoms that are mutually orthogonal, that is, the observations are given by:

$$x_t = \sum_{i \in S} a_i(t)\, d_i \qquad (15)$$

where the atoms present obey $d_i^T d_j = \delta_{ij}$ for $i, j \in S$.

If we replace this signal into equations 8 and 9 we obtain:

$$\hat{x}_t = \sum_{i=1}^{M} u_i \left( \sum_{k \in S} a_k(t)\, d_k^T d_i \right) d_i \qquad (16)$$

Using the orthogonality condition for the atoms in the active set, we can simplify our estimate as:

$$\hat{x}_t = \sum_{i \in S} u_i\, a_i(t)\, d_i + \sum_{i \notin S} u_i \left( \sum_{k \in S} a_k(t)\, d_k^T d_i \right) d_i \qquad (17)$$

If we replace the following solution for the presence parameters

$$u_i = 1 \ \text{for } i \in S, \qquad u_i = 0 \ \text{for } i \notin S \qquad (18)$$

into equation 17, we obtain:

$$\hat{x}_t = \sum_{i \in S} a_i(t)\, d_i = x_t \qquad (19)$$

With this choice of the presence parameters, our estimate and the actual observation are identical for all temporal observations, so the mean square error is zero. Therefore, no other solution can produce a smaller value of the squared error. The solution is also unique because we have assumed that the matrix $P^T P$ is invertible. Notice that this solution for the presence parameters is independent of the contribution $a_i(t)$ of an atom to the observed signal, being either 1 or 0. This contrasts with algorithms that directly determine the amplitude of the contribution, where the estimated variables would be larger for larger contributions. ∎

The theorem depends on the orthogonality of the atoms present in the observed signal. Although this condition seems restrictive, the restricted isometry property (RIP) [BD08] states that any set of $k$ atoms of a random dictionary of size $M$ will approximate orthogonality, as long as the number of dimensions of the dictionary atoms satisfies:

$$N \geq C\, k \log(M/k) \qquad (20)$$

Given that the maximum dictionary size satisfying this bound grows exponentially with the number of dimensions $N$ of the signal (inverting equation 20 gives $M \leq k\, e^{N/(Ck)}$), CPA can handle large dictionaries with a moderately sized number of dimensions $N$.
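Theorem 1 can also be checked numerically in the overdetermined regime ($NT > M$): when the active atoms are exactly orthonormal, the least-squares presence parameters come out as ones and zeros regardless of the amplitudes. A small sketch reusing the cpa_presence function from the earlier listing (the sizes are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, T, k = 30, 60, 50, 3
D = rng.standard_normal((N, M))
D /= np.linalg.norm(D, axis=0)                # unit-norm atoms
Q, _ = np.linalg.qr(rng.standard_normal((N, k)))
D[:, :k] = Q                                  # make the k active atoms exactly orthonormal
A = rng.standard_normal((T, k))               # arbitrary amplitudes
X_obs = A @ D[:, :k].T
u = cpa_presence(X_obs, D)
print(np.round(u[:k], 3))                     # ~1 for the atoms present
print(np.abs(u[k:]).max())                    # ~0 for the atoms not present
```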

3 CPA finds a sparse solution using the L2 regularization for the presence parameters

In order to identify the dictionary atoms that are present, CPA requires that the matrix $P^T P$ has an inverse. For conditions where $NT < M$, this requirement is not satisfied. Therefore, CPA needs to use regularization to find a sparse solution. Surprisingly, CPA does not need to use the L1 regularization but can use the L2 regularization, or Tikhonov regularization, to identify sparse solutions. The modified cost function is:

$$C = \sum_{t=1}^{T} \left\| x_t - \hat{x}_t \right\|^2 + \lambda \left\| u \right\|^2 \qquad (21)$$

where the CPA estimate $\hat{x}_t$ is described by equations 8 and 9. We can also write this cost function as:

$$C = \left\| X - P u \right\|^2 + \lambda \left\| u \right\|^2 \qquad (22)$$

In contrast to the minimization problems 5 and 6 that use the L1 norm, there is an analytical solution for this minimization problem, which is given by:

$$u = \left( P^T P + \lambda I \right)^{-1} P^T X \qquad (23)$$

where $I$ is the $M \times M$ identity matrix. The L2 regularization does not provide a sparse representation when it is used to directly determine the contributions of the dictionary atoms, that is, when the cost function is:

$$C = \left\| x_t - \hat{x}_t \right\|^2 + \lambda \sum_{i=1}^{M} \hat{a}_i(t)^2 \qquad (24)$$

Therefore, in general, the L1 norm has to be used for determining sparse representations.
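The closed form of equation 23 is an ordinary ridge regression on the stacked projection matrix. A minimal sketch, reusing the construction of P and X from the earlier listing (the function name and the default value of the regularization constant are ours):

```python
import numpy as np

def cpa_presence_l2(X_obs, D, lam=1.0):
    """L2-regularized CPA presence parameters (equation 23)."""
    T, N = X_obs.shape
    M = D.shape[1]
    P = np.zeros((T * N, M))
    for t in range(T):
        P[t * N:(t + 1) * N, :] = D * (X_obs[t] @ D)  # equations 9-10
    X = X_obs.reshape(-1)
    # u = (P^T P + lam I)^{-1} P^T X; solve() avoids forming an explicit inverse.
    return np.linalg.solve(P.T @ P + lam * np.eye(M), P.T @ X)
```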

Figure 1: A. Standard methods require the L1 norm to find sparse solutions. B. The change in geometry introduced by CPA permits sparse solutions to be found using the L2 norm.

How does the L2 minimization of the presence parameters in CPA determine a sparse distribution for the dictionary atoms? The reason lies in the change in the geometry of the manifold of possible solutions (see Figure 1). In order to gain some intuition, we will use a simple signal that is generated from a single dictionary atom, yielding a single temporal observation:

$$x = a_1\, d_1 \qquad (25)$$

We will assume a small dictionary of two atoms, $d_1$ and $d_2$. In standard approaches, we would estimate the contributions $\hat{a}_1$ and $\hat{a}_2$. All the solutions to the equation $x = \hat{a}_1 d_1 + \hat{a}_2 d_2$ describe a line in the space $(\hat{a}_1, \hat{a}_2)$. The sparsest solution is the intersection of this solution line with the horizontal axis, that is:

$$(\hat{a}_1, \hat{a}_2) = (a_1, 0) \qquad (26)$$

We can show that the slope of the solution line is given by:

$$m = -\frac{1}{d_1^T d_2} \qquad (27)$$

On the other hand, the family of circles

$$\hat{a}_1^2 + \hat{a}_2^2 = r^2 \qquad (28)$$

are the loci with an equal L2 norm. The minimum L2 norm among the solutions is given by the radius of the largest circle that does not cross the solution line, and the tangent point is the L2 minimal solution. We notice that the steeper the slope of the solution line, the closer the minimum L2 norm solution is to the sparsest solution. Given this slope, the L2 minimal solution would be, in general, far from the sparsest solution. In contrast, the family of squares

$$|\hat{a}_1| + |\hat{a}_2| = c \qquad (29)$$

represents the loci with an equal L1 norm. The largest such square that does not cross the solution line touches it at a vertex, and this vertex corresponds to the sparsest solution. Conversely, in the space described by the CPA presence parameters $(u_1, u_2)$, the slope of the solution line is:

$$m = -\frac{1}{\left( d_1^T d_2 \right)^2} \qquad (30)$$

The slope of the solution line in CPA has increased by a factor of $1/(d_1^T d_2)$ compared to other methods. For a large enough number of dimensions $N$, the dictionary atoms are close to orthogonal to each other, and this slope increase is very large. Having a large slope for the line makes the L2 minimal solution very close to the sparsest solution, meaning that the values of the presence parameters of the atoms present are much larger than the values of the presence parameters of the atoms that are not present. We can formalize this idea, and we will show that the L2 regularized solution can deal with very large dictionaries, where $M$, the total number of dictionary atoms, grows exponentially with $N$, the number of dimensions.
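The slope argument can be probed numerically. For the two-atom setup of equation 25, the minimum-L2 point of a line $w \cdot v = b$ is $(b/\|w\|^2)\, w$, so the standard and CPA minimum-L2 solutions can be compared directly. In the sketch below (all names ours, unit signal amplitude assumed), the standard solution lies roughly a distance $|d_1^T d_2|$ from the sparse solution, while the CPA solution lies roughly a distance $(d_1^T d_2)^2$ away:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
d1 = rng.standard_normal(N); d1 /= np.linalg.norm(d1)
d2 = rng.standard_normal(N); d2 /= np.linalg.norm(d2)
c = d1 @ d2                          # small for large N (near-orthogonal atoms)

# Standard coordinates: solution line a1 + c*a2 = 1 (slope -1/c, equation 27).
std_min = (1.0 / (1 + c**2)) * np.array([1.0, c])
# CPA coordinates: solution line u1 + c^2*u2 = 1 (slope -1/c^2, equation 30).
cpa_min = (1.0 / (1 + c**4)) * np.array([1.0, c**2])

print("coherence c:", c)
print("standard distance to sparse solution:", np.linalg.norm(std_min - np.array([1.0, 0.0])))
print("CPA distance to sparse solution:     ", np.linalg.norm(cpa_min - np.array([1.0, 0.0])))
```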

Theorem 2.

Assuming that:

  1. the dictionary atoms that are simultaneously present are orthogonal to each other, and

  2. the amplitude modulations of the atoms that are present are uncorrelated with each other in time,

then the regularized L2 solution of CPA will find an average presence parameter for $i \in S$ that is larger than the presence parameters of the atoms that are not present, as long as the number of dimensions is:

(31)

where $k$ is the number of simultaneously present dictionary atoms.

Proof.

The expected square error between the observation and its CPA estimate is given by:

$$E = \left\langle \left\| x_t - \sum_{i=1}^{M} u_i\, \tilde{a}_i(t)\, d_i \right\|^2 \right\rangle \qquad (32)$$

where the brackets $\langle \cdot \rangle$ indicate the average over time. The values of the presence parameters that minimize the expected square error can be calculated by setting the derivatives with respect to all the presence parameters $u_i$ to zero:

$$\frac{\partial E}{\partial u_i} = 0, \quad i = 1, \dots, M \qquad (33)$$

If we define $S$ as the set of dictionary atoms that are present in a signal, we can simplify the equations defined in 33 (for details on the derivation, see [OL11]), using assumptions 1 and 2, to the following equations:

(34)

where

(35)

This set of M equations is satisfied by the following solution, which identifies the atoms present in a signal:

$$u_i = 1 \ \text{for } i \in S, \qquad u_i = 0 \ \text{for } i \notin S \qquad (36)$$

However, if $P^T P$ does not have an inverse, solution 36 is just one of multiple possible solutions. We will show that the solution with the minimum L2 norm of the presence parameters identifies the sources present. In order to find the solution with the minimum L2 norm, we will use the Lagrange multiplier method. We will define as a cost function the sum of the squares of the values of the presence parameters. We add as constraints the equations defined in 34, multiplied by factors $\lambda_k$, which are the Lagrange multipliers. The new cost function is:

(37)

If we take the derivatives with respect to $u_i$ for the atoms in the active set ($i \in S$) and set them equal to zero, we obtain:

(38)

We can express the presence parameter $u_i$ as a function of the Lagrange multipliers, yielding:

(39)

If we take the derivatives with respect to $u_j$ for the atoms outside the active set ($j \notin S$) and set them equal to zero, we obtain:

(40)

We can also express the presence parameter $u_j$ as a function of the Lagrange multipliers, yielding:

(41)

The presence parameters that belong to the active set are related to the presence parameters that are not part of the active set by way of the Lagrange multipliers. We can use these relationships to infer the relative sizes of the presence parameters. If we calculate the sum of all the Lagrange multipliers $\lambda_k$, we obtain:

(42)

We can find an upper bound for this sum, given by:

(43)

where $\mu$ is the mutual coherence [MZ93] of the dictionary matrix, that is:

$$\mu = \max_{i \neq j} \left| d_i^T d_j \right| \qquad (44)$$

We can use the average to obtain the following inequality:

(45)

The last inequality holds as long as:

(46)

For a random dictionary and a large number of dimensions $N$, the mutual coherence converges to [CJ11]:

$$\mu \rightarrow 2\sqrt{\frac{\log M}{N}} \qquad (47)$$

So our inequality 46 will become:

(48)

Therefore, the number of dimensions that guarantees that the average presence parameter $\langle u_i \rangle$ for $i \in S$ is larger than the presence parameters of the atoms that are not present is given by:

(49)
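The coherence scaling that drives this bound is easy to probe empirically: the mutual coherence of a random Gaussian dictionary shrinks with the dimension $N$ roughly like $\sqrt{\log M / N}$, up to a constant. A quick sketch (sizes ours; the printed comparison is only meant to show the trend, not the exact constant of equation 47):

```python
import numpy as np

def mutual_coherence(D):
    """Maximum absolute inner product between distinct unit-norm atoms (equation 44)."""
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(1)
M = 500
for N in (50, 200, 800):
    D = rng.standard_normal((N, M))
    D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
    print(N, mutual_coherence(D), np.sqrt(np.log(M) / N))
```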

Figure 2: A. Addition of noise can cause the minimal L1 norm solution to switch to a different sparse solution set, represented by the intersection of the solution line with a different vertex of the L1 ball. B. Addition of noise causes less perturbation of the minimal L2 norm solution in CPA, represented by the intersection of the solution line with the L2 ball.

The number of dimensions required by the regularized CPA is increased by a factor of order $k$ compared to the limit provided by the RIP for reconstructions using the L1 regularization [CT05]. However, the use of the L2 norm, as opposed to an L1 norm, allows an analytical expression to be found for the solution. In addition, using the smooth L2 norm confers robustness to CPA in the presence of noise (see Figure 2). Intuitively, the solution of CPA using the L2 regularization is given by the intersection of a hyperplane and a hypersphere. Added noise would change the hyperplane, but given the smooth nature of the hypersphere, the new intersection would still be close to the original solution. On the other hand, for sparse representation methods, the solution is given by the intersection of a hyperplane and a high-dimensional polyhedron. Perturbing the hyperplane could radically change the intersection point. While still giving a sparse solution, this would yield a different sparse set than the original solution.

4 Efficient implementation of the L2 regularized CPA

Figure 3: A. Example of the responses of the three algorithms to a signal composed of 20 atoms of dimension $N$. The three algorithms had access to a dictionary of $M$ atoms, which included the 20 atoms that generated the signal. B. The performance of CPA for complex signals of more than 80 elements diminished compared to the performances of the other algorithms.

4.1 Iterative solution of CPA

The CPA estimate is linear with respect to the presence parameters $u$, and these can be estimated using equation 14. This equation requires the inversion of the $M \times M$ matrix $P^T P$. For a large $M$, matrix inversion is not numerically stable. It has been shown [OL11] that the solution to equation 14 can be found in an iterative, numerically stable manner that is more appropriate for a real-time application. Here we will show that a similar set of iterative equations can be used in the L2 regularized case, that is, to solve equation 23.

In the iterative version of CPA (or iCPA) the temporal observations from $t = 1$ to $t = T$ are processed to calculate a presence parameter set $u_T$. Upon arrival of a new observation $x_{T+1}$, this estimate of the presence parameters is updated to $u_{T+1}$. The update on the presence parameter set is proportional to the estimation error:

$$u_{T+1} = u_T + R_{T+1}\, P_{T+1}^T \left( x_{T+1} - \hat{x}_{T+1} \right) \qquad (50)$$

where $P_{T+1}$ is the $N \times M$ projection matrix as defined in equation 10, calculated using the current observation $x_{T+1}$. The current estimate $\hat{x}_{T+1}$ is derived from the projection matrix and the previous estimate of the presence parameters as follows:

$$\hat{x}_{T+1} = P_{T+1}\, u_T \qquad (51)$$

The proportionality factor $R_{T+1}$ converts the estimation error into the update of the presence parameters and is given by:

$$R_{T+1} = R_T - R_T P_{T+1}^T \left( I_N + P_{T+1} R_T P_{T+1}^T \right)^{-1} P_{T+1} R_T \qquad (52)$$

where $I_N$ is the $N \times N$ identity matrix. The proportionality factor is in fact the $M \times M$ matrix that is calculated by matrix inversion in equation 14, that is:

$$R_T = \left( P^T P \right)^{-1} \qquad (53)$$

Equation 52 defines an iterative relation where the new $R_{T+1}$ is calculated using the previous value $R_T$ combined with the new observation through the projections $P_{T+1}$. The real-time implementation is computationally more stable, as it requires the inversion of an $N \times N$ matrix as opposed to an $M \times M$ matrix.

4.2 Iterative solution of CPA with L2 regularization

In the case of the L2 regularized CPA, we need to find the inverse of $P^T P + \lambda I$ to calculate the presence parameters using equation 23. We will show that we can also use the iterative equation 52 to calculate this inverse in an efficient manner. The L2 regularized version of $R_T$ is:

$$R_T = \left( P^T P + \lambda I \right)^{-1} \qquad (54)$$

At $T = 0$, the estimate of $R$ becomes:

$$R_0 = \left( \lambda I \right)^{-1} = \frac{1}{\lambda}\, I \qquad (55)$$

Therefore, if we initialize $R_0$ to an $M \times M$ identity matrix times $1/\lambda$, the iterative equation 52 calculates the L2 regularized solution with $\lambda$ as the regularization constant. By having a large initialization value of $R_0$, we favor an exact reconstruction over a sparse representation. Using this initialization together with equations 50, 51, and 52 produces the L2 regularized solution for CPA in a computationally efficient manner.
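Putting equations 50–52 and the initialization of equation 55 together gives a compact recursive implementation. The sketch below assumes the standard recursive-least-squares form implied by the text, in which the matrix inversion lemma reduces each step to an $N \times N$ inverse; the function and variable names are ours:

```python
import numpy as np

def icpa_l2(X_obs, D, lam=1.0):
    """Iterative L2-regularized CPA (equations 50-52, initialized per equation 55)."""
    T, N = X_obs.shape
    M = D.shape[1]
    u = np.zeros(M)
    R = np.eye(M) / lam                       # R_0 = (lam I)^{-1} (equation 55)
    for t in range(T):
        x = X_obs[t]
        P_t = D * (x @ D)                     # projection matrix (equations 9-10)
        # Matrix inversion lemma: only an N x N matrix is inverted (equation 52).
        S = np.eye(N) + P_t @ R @ P_t.T
        R = R - R @ P_t.T @ np.linalg.inv(S) @ P_t @ R
        x_hat = P_t @ u                       # current estimate (equation 51)
        u = u + R @ P_t.T @ (x - x_hat)       # error-driven update (equation 50)
    return u
```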

5 CPA outperforms other sparse representation algorithms in the presence of strong novel atoms

CPA can identify the atoms present in a signal, and its performance is comparable to that of other sparse representation algorithms (see Figure 3). We compared CPA against two other MMV methods [CRKEKD05]: basic matching pursuit (M-BMP), a greedy algorithm that approximates the L1 regularization, and M-FOCUSS. When a large number of dictionary atoms were simultaneously present, CPA's performance was lower than that of the other sparse representation algorithms. Interestingly, the performance was better than what we would have expected given the number of dimensions in equation 49, indicating that CPA performs even better than our estimate predicts.

The main advantage of CPA was in the handling of novel atoms. CPA did not produce a sparse representation for a novel atom (see Figure 4). Instead, it represented the novel atom with a dense set of presence parameters of small amplitude. The amplitudes of this dense set of presence parameters were not very sensitive to the input signal amplitude and remained small even as the contribution of the novel atom increased. This dense representation was different from the sparse representation of a signal composed of known atoms.

Figure 4: CPA and the other algorithms generated sparse representations for an atom that belonged to the dictionary (left column). The CPA representation of a novel atom was denser and more amplitude-invariant compared to the other algorithms (left and right columns).
Figure 5: A. Example of the responses of the three algorithms to a signal composed of two atoms from the dictionary, in the absence (left column) and presence (right column) of a novel atom with a strong contribution. The CPA presence parameters for the two atoms from the dictionary were very salient, whereas for the other two algorithms the two atoms were effectively masked by the novel atom's contribution. B. The performance of CPA in detecting atoms that were in the dictionary was comparable to that of the other algorithms in the absence of strong novel atoms (left column). CPA performance for complex signals was more robust to a novel atom than that of the other algorithms (right column).

M-FOCUSS and M-BMP did represent a novel atom with a sparse representation whose magnitude depended on the contribution of the novel atom to the signal. This sparse representation of a novel atom was indistinguishable from a representation of a signal composed of multiple known atoms.

In a complex scene that consisted of a strong novel atom and other atoms that belonged to the dictionary, CPA identified the dictionary atoms, with the presence of the strong novel atom adding only small amounts of noise distributed uniformly across the presence parameters (see Figure 5). In contrast, the representations produced by the sparse dictionary algorithms were dominated by the novel atom representation, which masked the known atoms. We explored several values of the regularization constant of M-FOCUSS (see Figure 6), but we could not find a parameter regime that increased the performance, indicating that the deleterious effects of strong novel atoms cannot be overcome by parameter adjustment but are intrinsic to the L1 regularization approach.

Figure 6: Performance of M-FOCUSS for detecting dictionary atoms in a signal, as a function of the regularization parameter $\lambda$, in the absence (left column) and presence (right column) of a novel atom. The performance of M-FOCUSS cannot be improved by changing $\lambda$ over six orders of magnitude.

6 Discussion

We have extended CPA, a biologically inspired algorithm, to the case of overcomplete dictionaries. We found that, by changing the geometry of the solution space, CPA finds sparse solutions using the L2 norm. We presented a Kalman filter implementation of the L2 regularized solution that provides a numerically stable implementation of this MMV algorithm. CPA outperformed other sparse representation algorithms in identifying sources in the presence of strong novel atoms.

CPA is particularly suitable when we are interested in identifying previously acquired independent atoms that might be present in a signal while novel atoms that are not part of the dictionary may be masking them. This makes CPA suitable for online applications where the current dictionary does not contain all the possible atoms that may appear in a signal and we are still learning the dictionary. In contrast, algorithms that specifically minimize the L1 norm will find a sparse solution even if the atom generating the observed signal is not part of the dictionary. In this case the estimated amplitudes would be correlated in time and would not constitute independent sources.

We found that CPA is computationally more expensive than the other algorithms tested. Evaluation of $R_{T+1}$ using equation 52 is the most computationally expensive calculation, requiring the multiplication of three $M \times M$ matrices; that is, CPA is $O(M^3)$. However, all operations in CPA are matrix multiplications, so CPA could be optimized for implementation on GPUs.

CPA does not produce a sparse representation for novel atoms. The lack of a sparse representation in the presence of an input might be used as an indication that a new atom should be incorporated into the dictionary.

7 Appendix

7.1 Dictionary atoms

All simulations were performed using Matlab. For all simulations we used a dictionary of $M$ atoms, each of dimension $N$. Initially, the atoms in the dictionary were independently generated from a Gaussian distribution of zero mean and a variance of 1. Each dictionary atom was then normalized such that $\|d_i\| = 1$. An identical procedure was used to generate the novel atoms that were not part of the dictionary.

7.2 Signal generation

The time-varying amplitudes of the atoms present, $a_i(t)$, were taken independently from a Gaussian distribution of zero mean and a standard deviation of 1. We used $T$ observed samples of the signal. The signal generated by combining the dictionary atoms was corrupted by additive Gaussian noise whose standard deviation was a fixed fraction of the standard deviation of the atom-generated signal. The amplitude of the novel atom in Figures 5 and 6 was taken from a Gaussian distribution of zero mean and a standard deviation of 10.
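A sketch of the generation procedure just described (the dictionary size, signal length, number of active atoms, and noise fraction below are placeholders of ours, since the exact values are not recoverable from this copy):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, T, k = 40, 500, 120, 20        # placeholder sizes

# Dictionary: Gaussian atoms, normalized to unit L2 norm.
D = rng.standard_normal((N, M))
D /= np.linalg.norm(D, axis=0)

# Active set and unit-variance Gaussian amplitudes, independent across time.
active = rng.choice(M, size=k, replace=False)
A = rng.standard_normal((T, k))
X_obs = A @ D[:, active].T           # T x N clean signal

# Additive Gaussian noise scaled to a fraction of the signal's standard deviation.
noise_frac = 0.1                     # placeholder fraction
X_obs += noise_frac * X_obs.std() * rng.standard_normal((T, N))
```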

7.3 Corrected Projections Algorithm

We implemented the iterative version of CPA using equations 50, 51, and 52. We report the presence parameters after the last sample is processed. For all the simulations, we used a fixed value of the regularization constant $\lambda$. The code for implementing CPA is available at https://github.com/gotazu/CPA.

7.4 M-FOCUSS

We used the regularized M-FOCUSS as described in [CRKEKD05], with the code available at https://sites.google.com/site/researchbyzhang/software. We used a fixed regularization constant, a threshold for stopping the iterations, a threshold for pruning small gammas, and a maximum number of iterations of 500. We used the same set of parameters for all simulations, except for Figure 6, where we systematically changed the regularization constant over six orders of magnitude.

7.5 M-BMP

We used M-BMP as described in [CRKEKD05]. We selected 200 as the maximal number of iterations. The same set of parameters was used for all the simulations.

7.6 Algorithms performance assessment

In order to evaluate the performance of the algorithms we used the F-measure, a more appropriate measure for sparse representations than ROC analysis. It is defined as $F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. Precision is the fraction of detected atoms that were actually present in the signal. Recall is the fraction of atoms present in the signal that were detected. A value of $F = 1$ indicates perfect detection, i.e., that all the atoms present were detected, and only those atoms were detected. For each of the three algorithms tested (CPA, M-FOCUSS, and M-BMP), we calculated a detection threshold for their output that maximized the F-measure. For each condition, we repeated the simulation 10 times and reported the average of the optimal F-measure.
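The evaluation step amounts to sweeping a detection threshold over each algorithm's output and keeping the best F-measure. A sketch (names ours):

```python
import numpy as np

def best_f_measure(scores, active_mask):
    """Maximum F-measure over detection thresholds.

    scores:      per-atom output magnitudes (e.g., |u| for CPA).
    active_mask: boolean array, True for the atoms actually present.
    """
    best = 0.0
    for thr in np.unique(scores):
        detected = scores >= thr
        tp = np.sum(detected & active_mask)
        if tp == 0:
            continue
        precision = tp / detected.sum()
        recall = tp / active_mask.sum()
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```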

References

  • [BD08] R Baraniuk and M Davenport. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 2008.
  • [CJ11] T Tony Cai and Tiefeng Jiang. Limiting laws of coherence of random matrices with applications to testing covariance structure and construction of compressed sensing matrices. The Annals of Statistics, 39(3):1496–1525, 2011.
  • [CRKEKD05] S.F. Cotter, B.D. Rao, Kjersti Engan, and K. Kreutz-Delgado. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Transactions on Signal Processing, 53(7):2477–2488, 7 2005.
  • [CT05] E.J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 12 2005.
  • [Don06] David L. Donoho. For most large underdetermined systems of linear equations the minimal L1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 6 2006.
  • [MZ93] S.G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
  • [OL11] Gonzalo H Otazu and Christian Leibold. A corticothalamic circuit model for sound identification in complex scenes. PloS one, 6(9):e24270, 1 2011.