# An MDL framework for sparse coding and dictionary learning

The power of sparse signal modeling with learned over-complete dictionaries has been demonstrated in a variety of applications and fields, from signal processing to statistical inference and machine learning. However, the statistical properties of these models, such as under-fitting or over-fitting given sets of data, are still not well characterized in the literature. As a result, the success of sparse modeling depends on hand-tuning critical parameters for each data and application. This work aims at addressing this by providing a practical and objective characterization of sparse models by means of the Minimum Description Length (MDL) principle -- a well established information-theoretic approach to model selection in statistical inference. The resulting framework derives a family of efficient sparse coding and dictionary learning algorithms which, by virtue of the MDL principle, are completely parameter free. Furthermore, such framework allows to incorporate additional prior information to existing models, such as Markovian dependencies, or to define completely new problem formulations, including in the matrix analysis area, in a natural way. These virtues will be demonstrated with parameter-free algorithms for the classic image denoising and classification problems, and for low-rank matrix recovery in video applications.

## Authors

• 5 publications
• 71 publications
• ### Low-rank data modeling via the Minimum Description Length principle

Robust low-rank matrix estimation is a topic of increasing interest, wit...
09/28/2011 ∙ by Ignacio Ramírez, et al. ∙ 0

• ### Astronomical Image Denoising Using Dictionary Learning

Astronomical images suffer a constant presence of multiple defects that ...
04/12/2013 ∙ by Simon Beckouche, et al. ∙ 0

• ### Learning Dictionaries with Bounded Self-Coherence

Sparse coding in learned dictionaries has been established as a successf...
05/28/2012 ∙ by Christian D. Sigg, et al. ∙ 0

• ### Robust Kronecker-Decomposable Component Analysis for Low-Rank Modeling

Dictionary learning and component analysis are part of one of the most w...
03/22/2017 ∙ by Mehdi Bahri, et al. ∙ 0

• ### Estimating the distribution of Galaxy Morphologies on a continuous space

The incredible variety of galaxy shapes cannot be summarized by human de...
06/29/2014 ∙ by Giuseppe Vinci, et al. ∙ 0

• ### On the Reconstruction Risk of Convolutional Sparse Dictionary Learning

Sparse dictionary learning (SDL) has become a popular method for adaptiv...
08/29/2017 ∙ by Shashank Singh, et al. ∙ 0

• ### Simple, Efficient, and Neural Algorithms for Sparse Coding

Sparse coding is a basic task in many fields including signal processing...
03/02/2015 ∙ by Sanjeev Arora, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

A sparse model is one in which signals of a given type can be represented accurately as sparse linear combinations of the columns (atoms) of a learned dictionary , where by accurate we mean that (in some norm), and by sparse we mean that the number of non-zero elements in , denoted by , is small compared to its dimension . These concepts will be formalized in the next section.

Such models, especially when is learned from training samples, are by now a well established tool in a variety of fields and applications, see [1, 2, 3] for recent reviews.

When sparsity is a modeling device and not an hypothesis about the nature of the analyzed signals, parameters such as the desired sparsity in the solutions, or the size of the dictionaries to be learned, play a critical role in the effectiveness of sparse models for the data and tasks at hand. However, lacking theoretical guidelines for such parameters, published applications based on learned sparse models often rely on either cross-validation or ad-hoc methods for determining such critical parameters (an exception for example being the Bayesian approach, e.g., [4]). Clearly, such techniques can be impractical and/or ineffective in many cases. This in turn hinders the further application of such models to new types of data and applications, or their evolution into different, possibly more sophisticated, models.

At the bottom of the aforementioned problem lie fundamental questions such as: How rich or complex is a sparse model? How does this depend on the required sparsity of the solutions, or the size of the dictionaries? What is the best model for a given data class and a given task?

The general problem of answering such questions and, in particular, the latter, is known as model selection. Popular model selection techniques such as Akaike’s Information Criterion (AIC) [5], Bayes Information Criterion (BIC) [6], and the Minimum Description Length principle (MDL) [7, 8, 9], work by building a cost function which balances a measure of goodness of fit with one of model complexity, and search for a model that minimizes such cost. In this sense, these tools can be regarded as practical implementations of the Occam’s razor principle, which states that, given two (equally accurate) descriptions for a given phenomenon, the simpler one is usually the best.

In the Minimum Description Length principle, given a family or model class of candidate models indexed by a parameter , and a data sample , the best model is the one that can be used to describe completely (including the parameters themselves) with the fewest number of bits,

 ^M=argminM∈ML(y,M), (1)

where is a codelength assignment function which defines the theoretical codelength required to describe uniquely, and which is a key component of any MDL-based framework. The underlying idea of MDL is that compressibility is a good indirect way of measuring the ability of a model to capture regularity from the data. Common practice in MDL uses the Ideal Shannon Codelength Assignment [10, Chapter 5] to define in terms of a probability assignment as (all logarithms will be assumed on base hereafter). In this way, the problem of choosing

becomes one of choosing a suitable probability model for

. Note here how MDL considers probability models not as a statement about the true nature of the data, but only as a modeling tool. If we now write , we obtain the more familiar penalized likelihood form,

 ^M=argminM∈M−logP(y|M)−logP(M), (2)

with representing the model complexity, or model cost, term.

The use of MDL for sparse signal modeling has been explored for example in the context of wavelet-based denoising (where , and is fixed) of images corrupted by additive white Gaussian noise (AWGN) [11, 12, 13, 14, 15]. In [11, 12, 13], the data is described using (2) with assumed to be solely due to noise, and an term which exploits sparsity,

 ^a=argmina∈M12σ2e∥y−Da∥22+L(a). (3)

Here the first term corresponds to the ideal codelength, up to a constant, of an IID Gaussian sequence of zero mean and known variance

. The difference between [11, 12, 13] lies in the definition of . The line of work [14, 15] follows the modern MDL approach by using sophisticated tools from coding theory, the so called one-part universal codes, which encodes jointly, and reduces the arbitrariness in defining . However, such tools can only be applied for certain choices of and . In the case of [14, 15], the choice is to use continuous Gaussian models for both. As Gaussian models are not well suited to the typically observed statistical properties of such data, the performance of the resulting denoisers for example is very poor compared to the current state-of-the-art.

The present work extends and/or improves on the aforementioned work in the following ways:111This paper extends preliminary results reported in [16]. In particular, new dictionary learning algorithms are developed which include atom regularization, forward and backward dictionary size adaptation. We also develop a new model for the low-rank matrix approximation problem.

• MDL-based sparse coding is extended to the case of non-orthonormal, possibly over-complete and learned dictionaries . As we will see in Section V, this extension, critical to deal with modern, very successful sparse modeling approaches, poses not only new design problems but also significant computational challenges compared to the orthonormal case.

• Efficient codelengths (probability distributions) for the different components to encode (error, coefficients, dictionary) are obtained by

applying universal coding schemes to priors that are suited to the typically observed statistics of such data.

• As a particular point of the above item, systematic model-fit deviations are naturally taken into account in

. The resulting fitting terms fall into the category of robust estimators (see

[17]), thus marrying robust statistics with information theory and with sparse modeling (dictionary learning).

• We comply with the basic MDL sanity check, meaning, that the theoretical codelengths obtained are smaller than a “raw” description of the data. We do so by including quantization in our models, and treating its effect rigorously.

• Dictionary learning within the MDL framework allows us to optimize both the number of atoms , as well as their structure, resulting in a natural and objective form of regularization for .

• Structure is naturally added to the sparse models in the form of Markovian dependencies between adjacent data samples. We also show an extension of the model to the problem of low-rank matrix completion.

As a result of the above features, we obtain for the first time an MDL-based, parameter-free framework for signal modeling that is able to yield state-of-the-art results.

At the theoretical level, this brings us a step closer to the fundamental understanding of learned sparse models and brings a different perspective, that of MDL, into the sparse modeling world.

The remainder of this paper is organized as follows. Sparse models, and the associated notation, are described in detail in Section II. Section III introduces MDL, and its application to sparse models. In Section IV we present the probability models used to assign codelengths to different parts of the encoded data, while sections V and VI describe the actual sparse coding and dictionary learning algorithms developed. Experimental results follow in Section VII, and the paper is concluded in Section VIII.

## Ii Background on sparse modeling

Assume we are given -dimensional data samples ordered as columns of a matrix . Consider a linear model for , where is an dictionary consisting of atoms, is a matrix of coefficients where each -th column specifies the linear combination of columns of that approximates , and is a matrix of approximation errors.

We define the support, or active set, of a vector

as . Let . We also represent the support of as a binary vector such that for , and otherwise. We refer to the sub-vector in of non-zero elements of as either or . Both conventions are extended to refer to sets of columns of matrices, for example, is the matrix formed by the columns of indexed by . We will use the pseudo-norm to denote the number of non-zero elements of . We say that the model is sparse if we can achieve and simultaneously for all or most .

The result of quantizing a real-valued variable to precision is denoted by . This notation is extended to denote element-wise quantization of vector (e.g., ) and matrix operands (e.g., ).

### Ii-a Sparse coding

One possible form of expressing the sparse coding problem is given by

 ^aj=argminu∈Rp∥∥yj−Du∥∥2s.t.∥u∥0≤γ, (4)

where indicates the desired sparsity level of the solution. Since problem (4) is non-convex and NP-hard, approximate solutions are sought. This is done either by using greedy methods such as Matching Pursuit (MP) [18], or by solving a convex approximation to (4), commonly known as the lasso [19],

 ^aj=argminu∈Rp12∥∥yj−Du∥∥2s.t.∥u∥1≤τ. (5)

There exists a body of results showing that, under certain conditions on and , the problem (4) can be solved exactly via (5) or MP (see for example [1, 20]). In other cases, the objective is not to solve (4), but to guarantee some property of the estimated . For example, in the above mentioned case of AWGN denoising in the wavelets domain, the parameter can be chosen so that the resulting estimators are universally optimal with respect to some class of signals [21]. However, if is arbitrary, no such choice exists. Also, if is orthonormal, the problem (5) admits a closed form solution obtained via the so-called soft thresholding [21]. However, again, for general , no such solution exists, and the search for efficient algorithms has been a hot topic recently, e.g., [22, 23, 24].

### Ii-B Dictionary learning

When is an optimization variable, we refer to the resulting problem as dictionary learning:

 (^A,^D)=argminA,Dn∑j=112∥∥yj−Daj∥∥22s.t.∥∥aj∥∥r≤τ∀j,∥dk∥2≤1∀k, (6)

with . The constraint , is necessary to avoid an arbitrary decrease of the cost function by setting , , for any . The cost function in (6) is non-convex in , so that only local convergence can be guaranteed. This is usually achieved using alternate optimization in and . See for example [25, 26] and references therein.

### Ii-C Issues with traditional sparse models: a motivating example

Consider the K-SVD-based [25] sparse image restoration framework [27]. This is an -based dictionary learning framework, which approximates (6) for the case by alternate minimization. In the case of image denoising, the general procedure can be summarized as follows:

1. An initial, global dictionary is learned using training samples for the class of data to be processed (in this case small patches of natural images). The user must supply a patch width , a dictionary size and a value for .

2. The noisy image is decomposed into overlapping patches (one patch per pixel of the image), and its noisy patches are used to further adapt using the following denoising variant of (6),

 (^D,^A)=argminD,An∑j=1∥∥aj∥∥0,s.t. 12∥∥yj−Daj∥∥22≤Cσ2, ∥dk∥2=1,k=1,…,p. (7)

Here the user must further supply a constant (in [27], it is ), the noise variance , and the number of iterations of the optimization algorithm, which is usually kept small to avoid over-fitting (the algorithm is not allowed to converge).

3. The final image is constructed by assembling the patches in into the corresponding original positions of the image. The final pixel value at each location is an average of all the patches to which it belongs, plus a small fraction of the original noisy pixels ( in [27]).

Despite the good results obtained for natural images, several aspects of this method are not satisfactory:

• Several parameters (, , , , , ) need to be tuned. There is no interpretation, and therefore no justifiable choice for these parameters, other than maximizing the empirical performance of the algorithm (according to some metric, in this case PSNR) for the data at hand.

• The effect of such parameters on the result is shadowed by the effects of later stages of the algorithm and their associated parameters (e.g. overlapping patch averaging). There is no fundamental way to optimize each stage separately.

As a partial remedy to the first problem, Bayesian sparse models were developed (e.g., [4]) where these parameters are assigned prior distributions which are then learned from the data. However, this approach still does not provide objective means to compare different models (with different priors, for example). Further, the Bayesian technique implies having to repeatedly solve possibly costly optimization problems, increasing the computational burden of the application.

As mentioned in the introduction, this work proposes to address the above practical issues, as well as to provide a new angle into dictionary learning, by means of the MDL principle for model selection. The details on how this is done are the subject of the following sections.

## Iii Sparse model selection and MDL

Given data , a maximum support size and a dictionary size , traditional sparse modeling provides means to estimate the best model for within the set defined as

 M(γ,p):={(A,D):∥∥aj∥∥0≤γ,j=1,…,n,D∈Rm×p}. (8)

We call such set a sparse model class with hyper-parameters . Such classes are nested in the following way: first, for a fixed dictionary size we have . Also, for fixed , if we consider to be a particular case of where the -th atom is all-zeroes and , then we also have that .

If one wants to choose the best model among all possible classes , the problem becomes one of model selection. The general objective of model selection tools is to define an objective criterion for choosing such model. In particular, MDL model selection uses codelength as such criterion. More specifically, this means first computing the best model within each family as

 (A(γ,p),D(γ,p))=argmin{L(Y,A,D):(A,D)∈M(γ,p)},

and then choosing .

When is fixed, which is the case of sparse coding, the only model parameter is , and we have possible classes, , one for each . If each data sample from is encoded independently, then, as with traditional sparse coding (the framework can also be extended to collaborative models), the model selection problem can be broken into sub-problems, one per sample, by redefining the model class accordingly as . Clearly, in the latter case, the optimum can vary from sample to sample.

Compared to the algorithm in Section II-C, we have a fundamental, intrinsic measure of the quality of each model, the codelength , to guide our search through the models, and which is unobscured from the effect of possible later stages of the application. In contrast, there is no obvious intrinsic measure of quality for models learned through (8), making comparisons between models learned for different parameters (patch width , regularization parameter , norm , constants ) possible only in terms of the observed results of the applications where they are embedded. The second advantage of this framework is that it allows to select, in a fundamental fashion, the best model parameters automatically, thus resulting in parameter-free algorithms.222For the case of image processing, the patch width is also a relevant parameter that could be automatically learned with the same MDL-based framework presented here. However, since it is specific to image processing, and due to space constraints and for clarity of the exposition, it will not be considered as part of the model selection problem hereafter.

Such advantages will be of practical use only if the resulting computational algorithms are not orders of magnitude slower than the traditional ones, and efficient algorithms are a critical component of this framework, see Section V.

### Iii-a A brief introduction to MDL

For clarity of the presentation, in this section we will consider fixed, and a single data sample to be encoded. The Minimum Description Length principle was pioneered by Rissanen [7] in what is called “early MDL,” and later refined by himself [8] and other authors to form what is today known as “modern MDL” (see [28] for an up-to-date extensive reference on the subject). The goal of MDL is to provide an objective criterion to select the model , out of a family of competing models , that gives the best description of the given data . In this case of sparse coding with fixed dictionary we have .

The main idea of MDL is that, the best model for the data at hand is the one that is able to capture more regularity from it. The more regularity a model captures, the more succinct the description of the data will be under that model (by avoiding redundancy in the description). Therefore, MDL will select the best model as the one that produces the shortest (most efficient) description of the data, which in our case is given by .

As mentioned in Section I, MDL translates the problem of choosing a codelength function to one of choosing probability models by means of the ideal Shannon codelength assignment

. It is common to extend such ideal codelength to continuous random variables

as , by assuming that they will be quantized with sufficient precision so that

 P([x]δ)≈p(x)δ, (9)

and disregarding the constant term in , as it is inconsequential for model selection. However, in our framework, the optimum quantization levels will often be large enough so that such approximations are no longer valid.

To produce a complete description of the data , the best model parameters used to encode need to be included in the description as well. If the only thing we know is that belongs to a given class , then the cost of this description will depend on how large and complex is. MDL will penalize more those models that come from larger (more complex) classes. This is summarized in one of the fundamental results underlying MDL [8], which establishes that the minimum number of bits required for encoding any data vector using a model from a class has the form where is called the stochastic complexity, which depends only on the particular instance of being encoded, and is an unavoidable parametric complexity term, which depends solely on the structure, geometry, etc., of the model class .

In the initial version of MDL [7], the parameter was first encoded separately using bits, and then was described given using bits, so that the complete description of required bits. This is called a two-parts code. An asymptotic expression of this MDL was developed in [7] which is equivalent to the BIC criterion [6], only in the asymptotic regime. As we will see next, modern MDL departs significantly from this two-parts coding scheme.

### Iii-B Modern MDL and universal coding

The main difference between “early” [7] and “modern” [8, 9] MDL is the introduction of universal codes as the main building blocks for computing codelengths. In a nutshell, universal coding can be regarded as an extension of the original Shannon theory to the case where the probability model of the data to be encoded is not fully specified, but only known to belong to a certain class of candidate probability models (recall that classic Shannon theory assumes that is perfectly known). For example, can be a family of parametric distributions indexed by some parameter . Akin to Shannon theory, for an encoding scheme to be called universal, the codelengths it produces need to be optimal, in some sense, with respect to the codelengths produced by all the models in .

Various universality criteria exist. For example, consider the codelength redundancy of a model , In words, this is the codelength overhead obtained with for describing an instance , compared to the best model in that could be picked for , with hindsight of . For example, if is a parametric family, such model is given by the maximum likelihood (ML) estimator of . A model is called minimax universal, if it minimizes the worst case redundancy, One of the main techniques in universal coding is one-part coding, where the data and the best class parameter are encoded jointly. Such codes are used in the line of work of “MDL denoising” due to Rissanen and his collaborators [14, 15]. However, applying one-part codes at this level restricts the probability models to be used.333In particular, those used in [14, 15] are based on the Normalized Maximum Likelihood (NML) universal model [29], which requires closed-form MLE estimators for its evaluation, something that cannot be obtained for example with a Laplacian prior on and non-orthogonal dictionaries. As a consequence, the results obtained with this approach in such works are not competitive with the state-of-the-art. Therefore, in this work, we maintain a two-parts encoding scheme (or three parts, if is to be encoded as well), where we separately describe , , and given . We will however use universal codes to describe each of these parts as efficiently as possible. Details on this are given in the next section.

## Iv Encoding scheme

We now define the models and encoding schemes used to describe each of the parts that comprise a sparse model for a data sample ; that is, the dictionary , the coefficients , and the approximation error (which can include both the noise and the model deviation), which can be regarded as the conditional description of given the model parameters . The result will be a cost function of the form (note that can be fully recovered from ),

 L(y,a,D)=L(e|a,D)+L(a|D)+L(D).

While computing each of these parts, three main issues need to be dealt with:

1. Define appropriate probability models. Here, it is fundamental to incorporate as much prior information as possible, so that no cost is paid in learning (and thereby coding) already known statistical features of the data. Examples of such prior information include sparsity itself, invariance to certain transformations or symmetries, and (Markovian) dependencies between coefficients.

2. Deal with unknown parameters. We will use universal encoding strategies to encode data efficiently in terms of families of probability distributions.

3. Model the effect of quantization. All components, need to be quantized to some precisions, respectively , in order to obtain finite, realistic codelengths for describing (when the precision variable is obvious from the argument to which it is applied, we drop it to simplify the notation, for example, we will write as ). This quantization introduces several complications, such as optimization over discrete domains, increase of sparsity by rounding to zero, increase of approximation error, and working with discrete probability distributions.

All such issues need to be considered with efficiency of computation in mind. The discussion will focus first on the traditional, single-signal case where each sample is encoded separately from the rest. At the end of this section, we will also discuss the extension of this framework to a multi-signal case, which has several algorithmic and modeling advantages over the single-signal case, and which forms the basis for the dictionary learning algorithms described later.

### Iv-a Encoding the sparse coefficients

Probability model: Each coefficient in is modeled as the product of three (non-independent) random variables, , where implies , , and is the absolute value of corrected for the fact that when . 444Note that it is necessary to encode and separately, instead of considering as one random variable, so that the sign of can be recovered when .

We model as a Bernoulli variable with . Conditioned on , with probability , so no encoding is needed.555We can easily extend the proposed model beyond and consider a distribution for when . This will naturally appear as part of the coding cost. This extends standard sparse coding to the case where the non-sparse component of the vector are not necessarily zero. Conditioned on , we assume , and to be a (discretized) exponential, . With these choices, is a (discretized) Laplacian distribution, which is a standard model for transform (e.g., DCT, Wavelet) coefficients. This encoding scheme is depicted in Figure 1(a,b). The resulting model is a particular case of the “spike and slab” model used in statistics (see [30] and references therein). A similar factorization of the sparse coefficients is used in the Bayesian framework as well [4].

Unknown parameters: According to the above model, the resulting encoding scheme for the coefficients (sparse code) is a three-parts code: . The support is described using the enumerative two-parts code [31], which first describes its size, , using bits, and then the particular arrangement of the ones in using bits. The total codelength for coding is then This is a universal encoding scheme, and as such is more efficient than those used previously in [11, 13]. Then, bits are needed to encode , the actual signs of the non-zero coefficients. Finally, we need to encode the magnitudes of the non-zero coefficients,

. We do so by considering it first as a sequence of exponentially-distributed continuous random variables, to which quantization is applied later. Since the parameter

of the exponential is unknown,666This parameter is related to the sparsity level, and as discussed in Section II-C, is usually assumed known or determined via cross-validation. Following [32], here we use tools from universal modeling, which permit to also automatically handle the non-stationarity of this parameter and its expected variability for different non-zero entries of . we use a universal model for the class of continuous exponential distributions instead. We obtain such universal model via a convex mixture, one of the standard techniques for this,

 qV(V;κa,βa) =∫+∞0Γ(θ;κa,βa)θ2e−θ|V|dθ, (10)

where the mixing function is the Gamma density function of (non-informative) shape and scale parameters and . With this choice, (10) has a closed form expression, and the degenerate cases and are given zero weight. The resulting Mixture of Exponentials (MOE) density function , is given by (see [32] for details),

 qV(V;βa,κa)=κaβκaa(V+βa)−(κa+1),V∈R+.

Note that the universality of this mixture model does not depend on the values of the parameters , and guided by [32], we set and . The ideal Shannon codelength for this density function distribution is given by This function, shown in Figure 1(c), is non-convex, however continuous and differentiable for .

Quantization: On one hand, quantizing the coefficients to a finite precision increases the approximation/modeling error, from to . This additional error, , will clearly increase with . On the other hand, larger will reduce the description length of the non-zero values of the coefficients, . In practice, for reasonable quantization steps, the error added by such quantization is negligible compared to the approximation error. For example, for describing natural images divided in patches of pixels, our experiments indicate that there is no practical advantage in using a value smaller than . Consequently, our current algorithms do not attempt to optimize the codelength on this parameter, and we have kept this value fixed throughout all the experiments of Section VII.

### Iv-B Encoding the error

Probability model: Most sparse coding frameworks, including all the mentioned MDL-based ones, assume the error to be solely due to measurement noise, typically of the AWGN type. However, actually contains a significant component which is due to a systematic deviation of the model from the clean data. Following this, we model the elements of as samples of an IID random variable which is the linear combination of two independent variables, . Here represents random measurement noise in . We assume the noise variance known, as it can be easily and reliably estimated from the input data (for example, taking the minimum empirical variance over a sufficient number of sub-samples). The distribution of the second variable, is a Laplacian of unknown parameter , which represents the error component due to the model itself. The resulting continuous distribution , which we call “LG,” is the convolution of the distributions of both components (see  [33] for details on the derivation),

 pE(E;σ2e,θe)= ∫+∞ζ=−∞1√2πσ2ee−ζ12σ2e12θee−|E−ζ|θedζ = 14θeeσ2e2θ2e[eE/θeerfc(E+σ2e/θe√2σe)+e−E/θeerfc(−E+σ2e/θe√2σe)], (11)

where is the complimentary Gauss error function. The ideal codelength, , is shown in Figure 2(a) for various parameter values. This function is convex and differentiable on , which is nice for optimization purposes. Figure 2(b) shows its derivative, or so called “influence function” in robust statistics. It can be verified that behaves like a Laplacian with parameter for large values of

. Further, since its derivative is bounded, the influence of outliers is diminished. In fact,

is easily verified to be a -type M-estimator, a family of functions used in robust statistics (see [17]). Thus, using this model, we obtain an information-theoretic robust estimator, which is consistent with the motivations leading to its use in our framework, and which has a significant practical impact in the experimental results.

Unknown parameters: Since is unknown, encoding efficiently calls for the use of universal codes. In this case, again, we employ a mixture model. Since the parameter comes from the underlying Laplacian component, we again use a Gamma for the mixing function,

 qE(E;σ2e,κe,βe) =∫+∞0Γ(θ;κe,βe)pE(E;σ2e,θ)dθ. (12)

We call this model MOEG. As with the MOE model, the universality of this model is guaranteed by the theory for the choice of its underlying mixing function, for any (non-informative) and . In this case, we use and . Also, we know from the discussion above that can be easily and reliably estimated from the data. Thus, we can say that the model for is parameter-free in this case as well. Figure 2(c) shows the numerical evaluation of the ideal Shannon codelength , which is non-convex. However, it is twice differentiable everywhere, again a desirable property for optimization purposes (more on this in sections V and VI). As with the LG distribution, is an -type M-estimator, in this case, a redescending M-estimator, since its derivative (Figure 2(d)) vanishes to at . As such, , derived from the universal model corresponding to , can reject outliers even more aggressively than , again marrying robust statistics with information theory in a natural way.

Quantization: To losslessly encode finite-precision input data such as digital images, the quantization step of the error coefficients needs not be more than that of the data itself, , and we simply quantize the error coefficients uniformly with step . For example, for -bit digital images, we set .

### Iv-C Model for the dictionary

Probability model: Dictionary learning practice shows that learned atoms, unsurprisingly, present features that are similar to those of the original data. For example, the piecewise smoothness of small image patches is to be expected in the atoms of learned dictionaries for such data. This prior information, often neglected in dictionary learning algorithms, needs to be taken into account for encoding such atoms efficiently.

We embody such information in the form of predictability. This is, we will encode an atom as a sequence of causal prediction residuals, , , a function of the previously encoded elements in . In particular, if we restrict to be a linear function, the residual vector can be written as , where is lower triangular due to the causality constraint (this aspect has important efficiency consequences in the algorithms to be developed in Section VI). This is depicted in Figure 3, along with the specific prediction scheme that we adopted for the image processing examples in Section VII. In this case we consider an atom to be an image patch, and use a causal bi-linear predictor where the prediction of each pixel in the dictionary atom is given by .

As a general model for linear prediction residuals, we assume to be a sequence of IID Laplacian samples of parameter . In principle, is also unknown. However, describing is only meaningful for dictionary learning purposes, and, in that case, is updated iteratively, so that when computing an iterate of , we can use to estimate and fix via ML (more on this later in Section VI). Thus, we consider to be known.

Quantization: When is fixed during a dictionary learning iteration (which consists of an alternate descent between and ), we can view as input-output training pairs, and as the ML estimator of the linear coefficients describing such mapping via . Based on this, we use the quantization step , which is an optimal step for encoding the ML parameter in two-part codes, as described in [8, Theorem 1].

Computation: Computing is only relevant for learning purposes. In general, since , and , we have that , and the error of using the approximation (9) is not significant,

 L(D)= p∑k=1L(dk)≈p∑k=1{−logp(Wdk;θd)−mlogδd}=θdp∑k=1∥Wdk∥1+mp2logn+c, (13)

where is the IID Laplacian distribution over the th atom prediction residual vector , and is a fixed constant. For fixed (we will later see how to learn the dictionary size as well), the above expression is simply an penalty on the atom prediction residual coefficients. As we will see in Section VI, this allows us to use efficient convex optimization tools to update the atoms.

### Iv-D Extension to sequential (collaborative) coding

One natural assumption that we can often make on the set of data samples is that, besides all being sparsely representable in terms of the learned dictionary , they share other statistical properties. For example, we can assume that the underlying unknown model parameters, , , , , are the same for all columns of the sparse data decomposition (, ).

Under such assumption, if we encode each column of sequentially, we can learn statistical information from the ones already encoded and apply it to estimate the unknown parameters of the distributions used for encoding the following ones. The general idea is depicted in Figure 4(a). Concretely, suppose we have already encoded samples. We can then use to estimate , and to estimate and , and “plug-in” these parameters to encode the -th sample. This justifies the name of this encoding method, which is known in the coding literature as sequential plug-in encoding. This encoding strategy has several advantages: 1) For common parameter estimators such as ML, this method can be shown to be universal; 2) Since all distribution parameters are fixed (pre-estimated) when encoding the -th sample, we can use the “original,” non-universal distributions assumed for modeling (LG) and (Laplacian), which have closed forms and are usually faster to compute (together with (9)) than their universal mixture counterparts; 3) Furthermore, these original distributions are convex, so that in this case, given a fixed support, we are able to exactly minimize the codelength over the non-zero coefficient values; 4) With many samples available for parameter estimation, we can potentially afford more complex models.

Residual: We estimate in two steps. First, since the random variable is an independent sum of two random variables, , we have that . Now, since is Laplacian, we have that . Combining both equations we have that . With the noise variance assumed known, and using the standard unbiased variance estimator, , we obtain

 ^θe=0.5√max{(p(j−1))−1∥∥E[1,…,(j−1)]∥∥2F−σ2e,0},

where the maximization guarantees that .

Coefficients: In the case of , we have in principle two unknown parameters, the probability of an element being non-zero, , and the scale parameter of the Laplacian governing the non-zero values, (both previously handled with universal models). Here, however, we extend the model, drawing from the well known empirical fact that coefficients associated to different atoms can have very different statistics, both in frequency and variance. This is typical of DCT coefficients for example (see [34]), and has been consistently observed for learned dictionaries as well [32]. Therefore, we will consider a separate set of parameters for each row of , . We update such parameters from the coefficients observed in the respective row for the already-computed samples, , and encode each -th coefficient in (more specifically, in , and ), as the -th sample of the respective row. Concretely, let be the number of non-zero coefficients observed so far in the -th row. For , we use the Krichevsky-Trofimov (KT) estimator [35],

 ^ρka=nk1+0.5j, (14)

which is a universal plug-in encoding scheme for Bernoulli sequences of unknown parameter. For encoding , we apply the ML estimator for the exponential family to the non-zero coefficients observed so far in the -th row. Recalling that , the resulting estimator is given by

 ^θka=∑(j−1)j′=1max{|akj′|−δa,0}nk1.

Markovian dependencies: In many applications, spatially/temporally adjacent samples are statistically dependent. For example, we may assume that an atom is more likely to occur for a sample if it has been used by, say, the -th sample (see also [36]). In that case, we may consider different estimations of depending on the value of , , and . In particular, for the image processing results of Section VII, we use a Markovian model which depends on three previous samples, corresponding to the (causal) neighboring west, north, and northwest patches of the one being encoded. Thus, for each atom we will have possible parameters, , where each value of indicates a possible Markov state in which a sample may occur. This is depicted in Figure 4(b). For each state , we estimate using (14), with the average taken over the samples which occur in the same state .

## V MDL based sparse coding

For the encoding problem, is fixed (it has already been learned), and we consider encoding a single data sample . The model selection problem here is that of choosing the model (indexed by the sparse code ) among all the models belonging to the nested family of model classes , that yields the smallest codelength for describing . In principle, this calls for finding the best model within each model class , and then selecting . However, in order to be computationally efficient, and as with most sparse coding and model selection algorithms, several simplifications and approximations are needed. Let us first consider the problem of finding ,

 a(γ) :=argmina∈M(γ)L(y,a)=argmina∈Rp−logPE(y−Da)−logP(z)−logP(s|z)−logPV(a|s,z) (15)

For quantized , this is an optimization problem over a discrete, infinite domain, with a non-convex (in the continuous domain) constraint, and a non-differentiable cost function in . Based on the literature on sparse coding, at least two alternatives can be devised at this point. One way is to use a pursuit technique, e.g.,  [18]. Another option is to use a convex relaxation of the codelength function, e.g.,  [37]. For the sake of brevity, here we will describe an algorithm loosely based on the first alternative. Details on the convex relaxation method for MDL-based sparse coding will be published elsewhere.

The pursuit-like algorithm, which we call COdelength-Minimizing Pursuit Algorithm (COMPA), is summarized in Algorithm 1. This is a non-greedy cross-breed between Matching Pursuit (MP) and Forward Stepwise Selection (FSS) [38]. As with those methods, COMPA starts with the empty solution , and updates the value of one single coefficient at each iteration. Then, given the current correlation between the dictionary atoms and the current residual, each -th coefficient in is tentatively incremented (or decremented) by , and a candidate codelength is computed in each case. The coefficient that produces the smallest is updated to produce .

The logic behind this procedure is that the codelength cost of adding a new coefficient to the support is usually very high, so that adding a new coefficient only makes sense if its contribution is high enough to produce some noticeable effect in the other parts of the codelength. A variant of this algorithm was also implemented where, for each candidate , the value of the increment was refined in order to minimize . However, this variant turned out to be significantly slower, and the compression gains where below bits per sample (uncompressed codelength is bits per sample). Assuming that is unimodal, the algorithm stops if the codelength of a new iterate is larger than the previous one. To assess the validity of this assumption, we also implemented a variant which stops, as MP or FSS, when the residual-coefficients correlation is no longer significant, which typically requires many more iterations. With this variant we obtained a negligible improvement of bits per sample, while increasing the computational cost about three times due to the extra iterations required.

6

6

6

6

6

6

## Vi MDL based dictionary learning

Given that our sparse coding algorithm in Section V can select the best support size for each sample in , the definition of the model class given in Section III, which assumes the same for all samples in , is no longer appropriate (we could of course add -weight coefficients to make equal for all data). Instead, for dictionary learning, we consider the model class family , where is the model class family of sparse codes based on a fixed dictionary defined in Section V, with the dependency on made explicit. It is easy to see that the model classes are nested. We now need to solve

 (A(p),D(p))=argmin(A,D)∈M(p)L(E,A,D), (16)

for , and then choose with the optimal dictionary size

 ^p=argminp{L(E,A(p),D(p)):p=0,1,…}.

As with sparse coding, here we exploit the nested nature of the model classes to speed up the model selection. For this, we propose a forward-selection algorithm, described in Algorithm 2, which starts from (the empty dictionary), and then approximates the best model in by adding a new atom to the dictionary computed for and then invoking Algorithm  3, which is discussed in depth in the next subsection.

A backward-selection algorithm was also developed which first learns the model for via (3), where is a given maximum dictionary size, and then prunes the less frequently used atoms until no further decrease in codelength is observed. This algorithm allows us to provide especially-constructed initial dictionaries for Algorithm (3), e.g., an (overcomplete) DCT frame, which can be critical for finding good local minima of the non-convex problem (16). We do this for example to learn a dictionary for the whole class of natural images, see Section VII.