We begin with the likelihood model
where is a dictionary of unit
-norm basis vectors,is a vector of unknown coefficients we would like to estimate, is the observed signal, and is noise distributed as (later we consider more general likelihood models). In many practical situations where large numbers of features are present relative to the signal dimension, the problem of estimating given becomes ill-posed. A Bayesian framework is intuitively appealing for formulating these types of problems because prior assumptions must be incorporated, whether explicitly or implicitly, to regularize the solution space.
Recently, there has been a growing interest in models that employ sparse priors to encourage solutions with mostly small or zero-valued coefficients and a few large or unrestricted values, i.e., we are assuming the generative is a sparse vector. Such solutions can be favored by using
, and generalized Gaussian distributions. Roughly speaking, the ‘more concave’, the more sparse we expect to be. For example, with , we recover a Gaussian, which is not sparse at all, while gives a Laplacian distribution, with characteristic heavy tails and a sharp peak at zero.
All sparse priors of the form (2) can be conveniently framed in terms of a collection of non-negative latent variables or hyperparameters for purposes of optimization, approximation, and/or inference. The hyperparameters dictate the structure of the prior via
is some non-negative function that is sometimes treated as a hyperprior, although it will not generally integrate to one. For the purpose of obtaining sparse point estimates of, which will be our primary focus herein, models with latent variable sparse priors are frequently handled in one of two ways. First, the latent structure afforded by (3) offers a very convenient means of obtaining (possibly local) maximum a posteriori (MAP) estimates of by iteratively solving
where and is commonly referred to as a Type I estimator. Examples include minimum -norm approaches [4, 11, 16], Jeffreys prior-based methods sometimes called FOCUSS [7, 6, 9], algorithms for computing the basis pursuit (BP) or Lasso solution [6, 16, 18], and iterative reweighted methods .
where and . Once is obtained, the conditional distribution is Gaussian, and a point estimate for naturally emerges as the posterior mean
Pertinent examples include sparse Bayesian learning and the relevance vector machine (RVM) , automatic relevance determination (ARD) , methods for learning overcomplete dictionaries , and large-scale experimental design .
While initially these two approaches may seem vastly different, both can be directly compared using a dual-space view  of the underlying cost functions. In brief, this involves expressing both the Type I and Type II objective solely in terms of either or as reviewed in Section 2. The dual-space view is advantageous for several reasons, such as establishing connections between algorithms, developing efficient update rules, or handling more general (non-Gaussian) likelihood functions. In Section 3, we utilize -space cost functions to develop a principled method for choosing the trade-off parameter (which accompanies the Gaussian likelihood model and essentially balances sparsity and data fit) and demonstrate its effectiveness via simulations. Section 4 then derives a new Type II-inspired algorithm in -space that can compute maximally sparse (minimal norm) solutions even with highly coherent dictionaries, proving a result for clustered dictionaries that previously has only been shown empirically . Finally, Section 5 leverages duality to address Type II methods with generalized likelihood functions that previously were rendered untenable because of intractable integrals. In general, some tasks and analyses are easier to undertake in -space (Section 3), while others are more transparent in -space (Sections 4 and 5). Here we consider both with the goal of advancing the proper understanding and full utilization of the sparse linear model.
2 Dual-Space View of the Sparse Linear Model
Type I is based on a natural cost function in -space, , while Type II involves an analogous function in -space, . The dual-space view defines a corresponding -space cost function for Type I and a -space cost function for Type II to complete the symmetry.
Type II in -Space: Using the relationship
This reformulation of Type II in -space is revealing for multiple reasons (Sections 4 and 5 will address additional reasons in detail). For many applications of the sparse linear model, the primary goal is simply a point estimate that exhibits some degree of sparsity, meaning many elements of near zero and a few relatively large coefficients. This requires a penalty function that is concave and non-decreasing in . In the context of Type I, any prior expressible via (2) will satisfy this condition by definition; such priors are said to be strongly super-Gaussian
and will always have positive kurtosis. Regarding Type II, because the associated -space penalty (9
) is represented as a minimum of upper-bounding hyperplanes with respect to(and the slopes are all non-negative given ), it must therefore be concave and non-decreasing in .
For compression, interpretability, or other practical reasons, it is sometimes desirable to have exactly sparse point estimates, with many (or most) elements of equal to exactly zero. This then necessitates a penalty function that is concave and non-decreasing in , a much stronger condition. In the case of Type I, if is concave and non-decreasing in , then satisfies this condition. The Type II analog, which emerges by further inspection of (9) stipulates that if
is a concave and non-decreasing function of , then will be a concave, non-decreasing function of . For this purpose it is sufficient, but not necessary, that be a concave and non-decreasing function. Note that this is a somewhat stronger criteria than Type I since the first term on the righthand side of (10) (which is absent from Type I) is actually convex in . Regardless, it is now very transparent how Type II may promote sparsity akin to Type I.
The dual-space view also leads to efficient, convergent algorithms such as iterative reweighted minimization and its variants as discussed in . However, building on these ideas, we can demonstrate here that it also elucidates the original, widely applied update procedures developed for implementing the relevance vector machine (RVM), a popular Type II method for regression and classification that assumes 
. In fact these updates, which were inspired by a fixed-point heuristic from
, have been widely used for a number of Bayesian inference tasks without any formal analyses or justification.111Although a more recent, step-wise variant of the RVM has been shown to be substantially faster , the original version is still germane since it can easily be extended to handle more general structured sparsity problems. The step-wise method cannot without introducing additional approximations . The dual-space formulation can be leveraged to show that these updates are in fact executing a coordinate-wise, iterative min-max procedure in search of a saddle point. Specifically we have the following result (all proofs are in the supplementary material):
Type I in -Space: Similar methodology and the expansion of can be used to express the Type I optimization problem in -space, which serves several useful purposes. Let , with
Then the Type I coefficients obtained from (4) satisfy
Section 3 will use -space cost functions to derive well-motivated approaches for learning the trade-off parameter .
3 Choosing the Trade-off Parameter
The trade-off parameter is crucial for obtaining good estimates of . In general, if is too large, ; too small and is overfitted to the noise. In practice, either expensive cross-validation or some heuristic procedure is often required. However, because can be interpreted as a variance, it is useful to address its estimation in -space, in which existing unknowns (i.e., ) are also variances.
Learning with Type I: Consider the Type I cost function . The data-dependent term can be shown to be a convex, non-increasing function of , which encourages each element to be large. The second term is a penalty factor that regulates the size of . It is here that a convenient regularizer for can be incorporated.
This can be accomplished as follows. First we expand via , where denotes the -th column of and is a column vector of zeros with a ‘’ in the -th location. Thus we observe that is embedded in the data-dependent term in the exact same fashion as each . This motivates a penalty on with similar correspondence, leading to the objective
While admittedly simple, this construction is appealing because, regardless of how each is penalized, is penalized in a proportional manner, so both and have a properly balanced chance of explaining the observed data. This is important because the optimal will be highly dependent on both the true noise level, and crucially, the particular sparse prior assumed (as reflected by ).
For analysis or implementational purposes, we may convert back to -space, with -dependency now removed. It can then be shown that solving (4), with fixed to the value that minimizes (14), is equivalent to solving
If and minimize (15), then we can demonstrate using  that the corresponding estimate, which also minimizes (14), is given by evaluated at . Note that if we were just performing maximum likelihood estimation of given , the optimal value would reduce to simply , with no influence from the prior on . This is a fundamental weakness.
Learning with Type II: The same procedure can be adopted for Type II yielding the cost function
where we note that, unlike in the Type I case above, the -based term is already naturally balanced between and by virtue of the symmetric embedding in . It is important to stress that this Type II prescription for learning is not the same as originally proposed in the literature for Type II models of this genre. In this context, is interpreted a hyperprior on , and an equivalent distribution is assumed on the noise variance . Importantly, these assumptions leave out the factor of in (16), and so an asymmetry is created.
Simulation Examples: Empirical tests help to illustrate the efficacy of this procedure. As in many applications of sparse reconstruction, here we are only concerned with accurately estimating , whose nonzero entries may have physical significance (e.g., source localization , compressive sensing , etc.), as opposed to predicting new values of . Therefore, automatically learning the value of is particularly relevant, since cross-validation is often not possible.222For example, in non-stationary environments, the value of both and may be completely different for any new , which then necessitates that we estimate both jointly. Simulations are helpful for evaluation purposes since we then have access to the true sparse generating vector.
Figure 1 compares the estimation performance obtained by minimizing (15) with two different selections for : , with and . Data generation proceeds as follows: We create a random dictionary , with -normalized, iid Gaussian columns. is randomly generated with 10 unit Gaussian nonzero elements. We then compute , where is iid Gaussian noise producing an SNR of dB. To determine what values lead to optimal performance we solve (4) with the appropriate over a range of fixed values ( to ) and then compute the error between and . The minimum of this curve reflects the best performance we can hope to achieve when learning blindly. In Figure 1 (Top) we plot these curves for both Type I methods averaged over 1000 independent trials.
Next we solve (15), which produces an estimate of both and . We mark with an ‘+’ the learned versus the corresponding error of . In both cases the learned ’s (averaged across trials) perform just as well as if we knew the optimal value a priori. Results using other noise levels, problem dimensions and , sparsity levels , and sparsity penalties are similar. See the supplementary material for more examples.
Figure 1 (Bottom) shows the average sparsity of estimates , as quantified by the norm , across values ( returns a count of the number of nonzero elements in ). The ‘+’ indicates the average sparsity of each for the learned as before. In general, the penalty produces a much sparser estimate, very near the true value of at the optimal . The penalty, which is substantially less concave/sparsity-inducing, still sets some elements to exactly zero, but also substantially shrinks nonzero coefficients in achieving a similar overall reconstruction error. This highlights the importance of learning a via a penalty that is properly matched to the prior on : if we instead tried to force a particular sparsity value (in this case 10), then the solution would be very suboptimal. Finally we note that maximum likelihood (ML) estimation of performs very poorly (not shown), except in the special case where the ML estimate is equivalent to solving (14) as occurs when (see ). The proposed method can be viewed as adding a principled hyperprior on , properly matched to , that compensates for this shortcoming of standard ML.
Type II estimation has been explored elsewhere for the special case where , which renders the factor of in (16) irrelevant; however, for other selections we have found this factor to improve performance (not shown). For space considerations we have focused our attention here on Type I, which has frequently been noted for not lending itself well to estimation (or related parameters) [6, 13]. In fact, the symmetry afforded by the dual-space perspective reveals that Type I is just as natural a candidate for this task as Type II, and may be preferred in high-dimensional settings where computational resources are at a premium.
4 Maximally Sparse Estimation
With the advent of compressive sensing and other related applications, there has been growing interest in finding maximally sparse signal representations from redundant dictionaries () [3, 5]. The canonical form of this problem involves solving
While (17) is NP-hard, whenever the dictionary satisfies a restricted isometry property (RIP)  or a related structural assumption, meaning that each columns of are sufficiently close to orthonormal (i.e., mutually uncorrelated), then replacing with in (17
) leads to a convex problem with an equivalent global solution. Unfortunately however, in many situations (e.g., feature selection, source localization) these RIP equivalence conditions are grossly violated, implying that thesolution may deviate substantially from .
An alternative is to instead replace (17) with minimization of (8) and then take the limit as . (Note that the extension to the noisy case with is straightforward, but analysis is more difficult.) In this regime the optimization problem reduces to
If is concave, then (18) can be minimized using reweighted minimization. With initial weight vector , the -th iteration involves computing
With , iterating (19) will provably lead to an estimate of that is as good or better than the solution , in particular when has highly correlated columns. Additionally, the assumption leads to a closed-form expression for the weights . Let
where denotes a diagonal matrix with -th diagonal entry given by . Then can be computed via . It remains unclear however in what circumstances this type of update can lead to guaranteed improvement nor if the functions are even the optimal choice. We will now demonstrate that for certain selections of and , we can guarantee that reweighted using is guaranteed to recover exactly if is drawn from what we call a clustered dictionary model.
Clustered Dictionary Model: Let denote any dictionary such that minimization succeeds in solving (17) for all . Let denote any dictionary obtained by replacing each column of with a “cluster” of basis vectors such that the angle between any two vectors within a cluster is less than some . We also define the cluster support as the set of cluster indices whereby has at least one nonzero element. Finally, we assume that the resulting is such that every submatrix is full rank.
For any sparse vector and any dictionary obtained from the clustered dictionary model with sufficiently small, reweighted minimization using weights with some and sufficiently small will recover exactly provided that , , and within each cluster the coefficients do not sum to zero.
Theorem 2 implies that even though may fail to find the maximally sparse because of severe RIP violations (high correlations between groups of dictionary columns as dictated by lead directly to a poor RIP), a Type II-inspired method can still be successful. Moreover, because whenever does succeed, Type II will always succeed as well (assuming a reweighted implementation), the converse (RIP violation leading to Type II failure but not failure) can never happen. Recent work from  has argued that Type II may be useful for addressing the sparse recovery problem with correlated dictionaries, and empirical evidence is provided showing vastly superior performance on clustered dictionaries. However, we stress that no results proving global convergence to the correct, maximally sparse solution have been shown before in the case of structured dictionaries (except in special cases with strong, unverifiable constraints on coefficient magnitudes ). Moreover, the proposed weighting strategy accomplishes this without any particular tuning to the clustered dictionary model under consideration and thus likely holds in many other cases as well.
5 Generalized Likelihood functions
Type I methods naturally accommodate alternative likelihood functions. We simply must replace the quadratic data fit term from (4) with some preferred function and then coordinate-wise optimization may proceed provided we have an efficient means of computing a weighted -norm penalized solution. In contrast, generalizing Type II is substantially more complicated because it is no longer possible to compute the marginalization (5) or the posterior distribution . Therefore, to obtain a tractable estimate
additional heuristics are required. For example, the RVM classifier from employs a Laplace approximation for this purpose; however, it is not clear what cost function is being minimized nor rigorous properties of the estimated solutions.
Fortunately, the dual -space view provides a natural mechanism for generalizing the basic Type II methodology to address alternative likelihood functions in a more principled manner. In the case of classification problems, we might want to replace the Gaussian likelihood implied by (1
) with a multivariate Bernoulli distributionwhere is the function
Here and , with denoting the -th row of . This function may be naturally substituted into the -space Type II cost function (8
) giving us the candidate penalized logistic regression function
Importantly, recasting Type II classification using -space in this way, with its attendant well-specified cost function, facilitates more concrete analyses (see below) regarding properties of global and local minima that were previously rendered inaccessible because of intractable integrals and compensatory approximations. Moreover, we retain a tight connection with the original Type II marginalization process as follows.
Consider the strict upper bound on the function (obtained by a Taylor series approximation and a Hessian bound) given by
where with . This bound holds for all with equality when . Using this result we obtain the lower bound on the marginal likelihood given by . The dual-space framework can then be used to derive the following result:
Thus we may conclude that (22) provides a principled approximation to (5) when a Bernoulli likelihood function is used for classification purposes. In empirical tests on benchmark data sets (see supplementary material) using , it performs nearly identically to the original RVM (which also implicitly assumes ), but nonetheless provides a more solid theoretical justification for Type II classifiers because of the underlying similarities and identical generative model. But while the RVM and its attendant approximations are difficult to analyze, (22) is relatively transparent. Additionally, for other sparse priors, or equivalently other selections for , we can still perform optimization and analyze cost functions without any conjugacy requirements on the implicit .
Despite the practical success of the RVM and related Bayesian techniques, and empirical evidence of sparse solutions, there is currently no proof that the standard variants of these classification methods will always produce exactly sparse estimates. Thus Theorem 4 provides some analytical validation of these types of classifiers.
Finally, if we take (22) as our starting point, we may naturally consider modifications tailored to specific sparse classification tasks (that may or may not retain an explicit connection with the original Type II probabilistic model). For example, suppose we would like to obtain a maximally sparse classifier, where regularization is provided by a penalty. Direct optimization is combinatorial because of what we call the global zero attraction property: Whenever any individual coefficient goes to zero, we are necessarily at a local minimum with respect to this coefficient because of the infinite slope (discontinuity) of the norm at zero. However, (22) can be modified to approximate the without this property as follows.
Consider the Type II-inspired minimization problem
which is equivalent to (22) with when . For some and sufficiently small (but not necessarily equal), the support333Support refers to the index set of the nonzero elements. of will match the support of . Moreover, (25) does not satisfy the global zero attraction property.
Thus Type II affords the possibility of mimicking the norm in the presence of generalized likelihoods but with the advantageous potential for drastically fewer local minima. This is a direction for future research. Additionally, while here we have focused our attention on classification via logistic regression, these ideas can presumably be extended to other likelihood functions provided certain conditions are met. To the best of our knowledge, while already demonstrably successful in an empirical setting, Type II classifiers and other related Bayesian generalized likelihood models have never been analyzed in the context of sparse estimation as we have done in this section.
The dual-space view of sparse linear or generalized linear models naturally allows us to transition -space ideas originally developed for Type I and apply them to Type II, and conversely, apply -space techniques from Type II to Type I. The resulting symmetry promotes a mutual understanding of both methodologies and helps ensure that they are not underutilized.
-  S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
-  E. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Information Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.
-  E. Candès, M. Wakin, and S. Boyd, “Enhancing sparsity by reweighted minimization,” J. Fourier Anal. Appl., vol. 14, no. 5, pp. 877–905, 2008.
-  R. Chartrand and W. Yin, “Iteratively reweighted algorithms for compressive sensing,” Proc. Int. Conf. Accoustics, Speech, and Signal Proc., 2008.
-  D.L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via minimization,” Proc. National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, March 2003.
-  M.A.T. Figueiredo, “Adaptive sparseness using Jeffreys prior,” Advances in Neural Information Processing Systems 14, pp. 697–704, 2002.
C. Févotte and S.J. Godsill,
“Blind separation of sparse sources using Jeffreys inverse prior and
the EM algorithm,”
Proc. 6th Int. Conf. Independent Component Analysis and Blind Source Separation, Mar. 2006.
-  M. Girolami, “A variational method for learning sparse and overcomplete representations,” Neural Computation, vol. 13, no. 11, pp. 2517–2532, 2001.
-  I.F. Gorodnitsky and B.D. Rao, “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm,” IEEE Transactions on Signal Processing, vol. 45, no. 3, pp. 600–616, March 1997.
-  S. Ji, D. Dunson, and L. Carin, “Multi-task compressive sensing,” IEEE Trans. Signal Processing, vol. 57, no. 1, pp. 92–106, Jan 2009.
-  K. Kreutz-Delgado, J. F. Murray, B.D. Rao, K. Engan, T.-W. Lee, and T.J. Sejnowski, “Dictionary learning algorithms for sparse representation,” Neural Computation, vol. 15, no. 2, pp. 349–396, February 2003.
“Bayesian interpolation,”Neural Computation, vol. 4, no. 3, pp. 415–447, 1992.
-  J. Mattout, C. Phillips, W.D. Penny, M.D. Rugg, and K.J. Friston, “MEG source localization under multiple constraints: An extended Bayesian framework,” NeuroImage, vol. 30, pp. 753–767, 2006.
Bayesian Learning for Neural Networks, Springer-Verlag, New York, 1996.
-  J.A. Palmer, D.P. Wipf, K. Kreutz-Delgado, and B.D. Rao, “Variational EM algorithms for non-Gaussian latent variable models,” Advances in Neural Information Processing Systems 18, pp. 1059–1066, 2006.
-  B.D. Rao, K. Engan, S. F. Cotter, J. Palmer, and K. Kreutz-Delgado, “Subset selection in noise based on diversity measure minimization,” IEEE Trans. Signal Processing, vol. 51, no. 3, pp. 760–770, March 2003.
-  M. Seeger and H. Nickisch, “Large scale Bayesian inference and experimental design for sparse linear models,” SIAM J. Imaging Sciences, vol. 4, no. 1, pp. 166–199, 2011.
-  R. Tibshirani, “Regression shrinkage and selection via the Lasso,” Journal of the Royal Statistical Society, vol. 58, no. 1, pp. 267–288, 1996.
“Sparse Bayesian learning and the relevance vector machine,”
Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.
M.E. Tipping and A.C. Faul,
“Fast marginal likelihood maximisation for sparse Bayesian models,”
Ninth Int. Workshop. Artificial Intelligence and Statistics, Jan. 2003.
-  D.P. Wipf “Sparse estimation with structured dictionaries,” Advances in Nerual Information Processing 24, 2011.
-  D.P. Wipf, B.D. Rao, and S. Nagarajan, “Latent variable Bayesian models for promoting sparsity,” IEEE Trans. Information Theory, vol. 57, no. 9, Sept. 2011.