Efficient Dictionary Learning with Sparseness-Enforcing Projections

by   Markus Thom, et al.
Universität Ulm

Learning dictionaries suitable for sparse coding instead of using engineered bases has proven effective in a variety of image processing tasks. This paper studies the optimization of dictionaries on image data where the representation is enforced to be explicitly sparse with respect to a smooth, normalized sparseness measure. This involves the computation of Euclidean projections onto level sets of the sparseness measure. While previous algorithms for this optimization problem had at least quasi-linear time complexity, here the first algorithm with linear time complexity and constant space complexity is proposed. The key for this is the mathematically rigorous derivation of a characterization of the projection's result based on a soft-shrinkage function. This theory is applied in an original algorithm called Easy Dictionary Learning (EZDL), which learns dictionaries with a simple and fast-to-compute Hebbian-like learning rule. The new algorithm is efficient, expressive and particularly simple to implement. It is demonstrated that despite its simplicity, the proposed learning algorithm is able to generate a rich variety of dictionaries, in particular a topographic organization of atoms or separable atoms. Further, the dictionaries are as expressive as those of benchmark learning algorithms in terms of the reproduction quality on entire images, and result in an equivalent denoising performance. EZDL learns approximately 30 faster than the already very efficient Online Dictionary Learning algorithm, and is therefore eligible for rapid data set analysis and problems with vast quantities of learning samples.


page 5

page 12

page 13


Efficient and Parallel Separable Dictionary Learning

Separable, or Kronecker product, dictionaries provide natural decomposit...

Astronomical Image Denoising Using Dictionary Learning

Astronomical images suffer a constant presence of multiple defects that ...

Learning Simple Thresholded Features with Sparse Support Recovery

The thresholded feature has recently emerged as an extremely efficient, ...

Learning Deep Analysis Dictionaries – Part II: Convolutional Dictionaries

In this paper, we introduce a Deep Convolutional Analysis Dictionary Mod...

A Split-and-Merge Dictionary Learning Algorithm for Sparse Representation

In big data image/video analytics, we encounter the problem of learning ...

Learning Multiplication-free Linear Transformations

In this paper, we propose several dictionary learning algorithms for spa...

Metrics for Multivariate Dictionaries

Overcomplete representations and dictionary learning algorithms kept att...

1 Introduction

In a great variety of classical machine learning problems, sparse solutions are attractive because they provide more efficient representations compared to non-sparse solutions. There is an overwhelming evidence that mammalian brains respect the sparseness principle

(Laughlin and Sejnowski, 2003), which holds true especially for the mammalian visual cortex (Hubel and Wiesel, 1959; Olshausen and Field, 1996, 1997). It suggests itself that sparseness be a fundamental prior to a variety of signal processing tasks. In particular, this includes low-level image processing since natural images can be represented succinctly using structural primitives (Olshausen and Field, 1997; Mairal et al., 2009b). Interesting and biologically plausible sparse representations were discovered through computer simulations on natural images (Olshausen and Field, 1996, 1997). Related representations can be obtained by analysis of temporal image sequences (van Hateren and Ruderman, 1998; Olshausen, 2003), stereo image pairs and images with chromatic information (Hoyer and Hyvärinen, 2000), or by enforcing a topographic organization (Hyvärinen et al., 2001; Kavukcuoglu et al., 2009).

Sparseness alleviates the effects of random noise in a natural way since it prevents arbitrary combinations of measured signals (Donoho, 1995; Hyvärinen, 1999; Elad, 2006). In fact, methods based on sparse representations were shown to achieve state-of-the-art performance for image denoising (Mairal et al., 2009b). Further notable image processing applications that benefit from the efficiency gained through sparseness are as diverse as deblurring (Dong et al., 2011)

, super-resolution

(Yang et al., 2010, 2012; Dong et al., 2011), compression (Skretting and Engan, 2011; Horev et al., 2012)

, and depth estimation

(Tošić et al., 2011).

1.1 Dictionary Learning and Sparseness Measures

Each of these tasks needs a model capable of reproducing the signals to be processed. In a linear generative model for sparse coding, a sample with features should be expressed approximately as a linear combination of only a few atoms of a larger dictionary:

Here, is the dictionary which is fixed for all samples, and is a sparse code word that depends on the concrete sample. The columns of represent the atoms, which are also called bases or filters. This sparse coding framework is well-suited for overcomplete representations where . The dictionary can be generated by wavelets, for example, or adapted to a specific task by solving an optimization problem on measurement data. The latter is also called dictionary learning in this context.

Sparseness acts as a regularizer. If was not constrained to sparseness, then trivial choices of would suffice for perfect reproduction capabilities. But when is sparse, then can be represented by additive superposition of only a small number of bases, thus preventing trivial solutions.

A fundamental problem when working with sparse representations is how to decide on a function that formally assesses the sparseness of a vector. The


simply counts the nonzero entries of its argument. It is a poor choice since it is non-continuous, prone to random noise and fails to fulfill desirable properties of meaningful sparseness measures (Hurley and Rickard, 2009).

Figure 1: Visualization of Hoyer’s sparseness measure . The abscissa and the ordinate specify the entries of a two-dimensional vector, the obtained sparseness degree is color coded. The dashed lines are contour levels at intervals of .

Throughout this paper, we will use the smooth, normalized sparseness measure proposed by Hoyer (2004):

Here, and denote the Manhattan norm and the Euclidean norm, respectively. The normalization has been designed such that attains values between zero and one. When satisfies , then all entries of but one vanish. Conversely, when , then all the entries of are equal. The function interpolates smoothly between these extremes, see Fig. 1. Moreover, it is scale-invariant so that the same sparseness degree is obtained when a vector is multiplied with a nonzero number. Hence if a quantity is given in other units, for example in millivolts instead of volts, no adjustments whatsoever have to be made.

The sparseness degree with respect to does not change much if a small amount is added to all entries of a vector, whereas the pseudo-norm would indicate that the new vector is completely non-sparse. These properties render Hoyer’s sparseness measure intuitive, especially for non-experts. It has been employed successfully for dictionary learning (Hoyer, 2004; Potluru et al., 2013), and its smoothness results in improved generalization capabilities in classification tasks compared to when the pseudo-norm is used (Thom and Palm, 2013).

1.2 Explicit Sparseness Constraints and Projections

A common approach to dictionary learning is the minimization of the reproduction error between the original samples from a learning set and their approximations provided by a linear generative model under sparseness constraints (Olshausen and Field, 1996, 1997; Kreutz-Delgado et al., 2003; Mairal et al., 2009a). It is beneficial for practitioners and end-users to enforce explicit sparseness constraints by demanding that all the code words in a generative model possess a target sparseness degree of . This leads to optimization problems of the form

Here, the objective function is the reproduction error implemented as Euclidean distance. Implicit sparseness constraints, on the other hand, augment the reproduction error with an additive penalty term, yielding optimization problems such as

Here, is a trade-off constant and the Manhattan norm is used to penalize non-sparse code words as convex relaxation of the pseudo-norm (Donoho, 2006).

Trading off the reproduction error against an additive sparseness penalty is non-trivial since the actual resulting code word sparseness cannot easily be predicted. Explicit constraints guarantee that the adjusted sparseness degree is met, making tuning of intransparent scale factors such as in the example above obsolete. This way, one can concentrate on the actual application of the theory rather than having to develop an intuition of the meaning of each and every parameter.

The mathematical tool to achieve explicit sparseness is a sparseness-enforcing projection operator. This is a vector-valued function which maps any given point in Euclidean space to its nearest point that achieves a pre-set target sparseness degree. One use case of this theory is projected gradient methods (Bertsekas, 1999), where a given objective function should be optimized subject to hard side conditions. Replacing the parameters with their best approximations lying in a certain set after each update step ensures that the constraints are satisfied during optimization progress.

1.3 Contributions of this Paper and Related Work

This paper studies dictionary learning under explicit sparseness constraints with respect to Hoyer’s sparseness measure . A major part of this work is devoted to the efficient algorithmic computation of the sparseness-enforcing projection operator, which is an integral part in efficient dictionary learning. Several algorithms were proposed in the past to solve the projection problem (Hoyer, 2004; Theis et al., 2005; Potluru et al., 2013; Thom and Palm, 2013). Only Thom and Palm (2013) provided a complete and mathematically satisfactory proof of correctness for their algorithm. Moreover, all known algorithms have at least quasi-linear time complexity in the dimensionality of the vector that should be projected.

In this paper, we first derive a characterization of the sparseness projection and demonstrate that its computation is equivalent to finding the root of a real-valued auxiliary function, which constitutes a much simpler problem. This result is used in the proposition of an algorithm for the projection operator that is asymptotically optimal in the sense of complexity theory, that is the time complexity is linear and the space complexity is constant in the problem dimensionality. We show through experiments that when run on a real computing machine the newly proposed algorithm is far superior in its computational demands to previously proposed techniques, even for small problem dimensionalities.

Existing approaches to dictionary learning that feature explicit sparseness constraints can be categorized into ones that use Hoyer’s and ones that employ the pseudo-norm. Hoyer (2004) and Potluru et al. (2013) considered matrix factorization frameworks subject to constraints, with space requirements linear in the number of learning samples which prevents processing large data sets. Thom and Palm (2013) designed sparse code words as the result of the sparseness-enforcing projection operator applied to the product of the dictionary with the samples. This requires the computation of the projection’s gradient during learning, which is feasible yet difficult to implement and has non-negligible adverse effects on the execution time.

The following approaches consider explicit pseudo-norm constraints: Aharon et al. (2006), Skretting and Engan (2010) and Coates and Ng (2011) infer sparse code words in each iteration compatible to the data and dictionary by employing basis pursuit or matching pursuit algorithms, which has a negative impact on the processing time. Zelnik-Manor et al. (2012) consider block-sparse representations, here the signals are assumed to reside in the union of few subspaces. Duarte-Carvajalino and Sapiro (2009) propose to simultaneously learn the dictionary and the sensing matrix from example image data, which results in improved reconstruction results in compressed sensing scenarios.

In addition to the contributions on sparseness projection computation, this paper proposes the Easy Dictionary Learning (EZDL) algorithm. Our technique aims at dictionary learning under explicit sparseness constraints in terms of Hoyer’s sparseness measure using a simple, fast-to-compute and biologically plausible Hebbian-like learning rule. For each presented learning sample, the sparseness-enforcing projection operator has to be carried out. The ability to perform projections efficiently makes the proposed learning algorithm particularly efficient: less training time is required in comparison to the optimized Online Dictionary Learning method of Mairal et al. (2009a).

Extensions of Easy Dictionary Learning facilitate alternative representations such as topographic atom organization or atom sparseness, which includes for example separable filters. We furthermore demonstrate the competitiveness of the dictionaries learned with our algorithm with those computed with alternative sophisticated dictionary learning algorithms in terms of reproduction and denoising quality of natural images. Since other tasks, such as deblurring or super-resolution, build upon the same optimization problem in the application phase as reproduction and denoising, it can be expected that EZDL dictionaries will exhibit no performance degradations in those tasks either.

The remainder of this paper is structured as follows. Section 2 derives a linear time and constant space algorithm for the computation of sparseness-enforcing projections. In Sect. 3, the Easy Dictionary Learning algorithm for explicitly sparseness-constrained dictionary learning is proposed. Section 4 reports experimental results on the performance of the newly proposed sparseness projection algorithm and the Easy Dictionary Learning algorithm. Section 5 concludes the paper with a discussion, and the appendix contains technical details of the mathematical statements from Sect. 2.

2 Efficient Sparseness-Enforcing Projections

This section proposes a linear time and constant space algorithm for computation of projections onto level sets of Hoyer’s . Formally, if denotes a target sparseness degree and is an arbitrary point, the point from the level set that minimizes the Euclidean distance to is sought. The function that computes is also called sparseness-enforcing projection operator since the situation where is of particular interest.

Due to symmetries of , the above described optimization problem can be reduced to finding the projection of a vector with non-negative coordinates onto the set

where and are target norms that should be chosen such that (Hoyer, 2004; Thom and Palm, 2013). With this choice, constantly attains the value of on the entire set , hence is a subset of .

Thom and Palm (2013) proved that is the intersection of a scaled canonical simplex and a hypercircle. They further demonstrated that projections onto can be computed with a finite number of alternating projections onto these two geometric structures. The proof of correctness of this method could not rely on the classical alternating projection method (see for example Deutsch, 2001) due to the lacking convexity of , rendering the proof arguments quite complicated.

The remainder of this section proceeds as follows. First, a succinct characterization of the sparseness projection is given. It is then shown that computing sparseness projections is equivalent to finding the zero of a monotone real-valued function. We prove that this can be achieved with optimal asymptotic complexity by proposing a new algorithm that solves the projection problem.

2.1 Representation of the Projection

The main theoretical result of this paper is a representation theorem that characterizes the projection onto . This was gained through an analysis of the intermediate points that emerge from the alternating projections onto a simplex and a hypercircle. A closed-form expression can then be provided by showing that the intermediate points in the projection algorithm satisfy a loop-invariant, a certain property fulfilled after each step of alternating projections.

Since a mathematical rigorous treatment of this result is very technical, it is deferred to the appendix. We assume here that the vector to be projected is given so that its projection is unique. Since it is guaranteed that for all points except for a null set there is exactly one projection (Theis et al., 2005, Theorem 2.6), we can exclude points with non-unique projections in our considerations, which is no restriction in practice. We may now state a characterization of the projection outcome:

Representation Theorem

Let and let denote the unique projection of onto . Then there is exactly one real number with

Here, is the -dimensional vector where all entries are unity. If the indices of the positive coordinates of are known, then can be computed directly from with an explicit expression.

In words, the projection is the point rescaled to obtain an norm of . The vector is computed from the input vector by subtracting the scalar from all the entries, and afterwards setting negative values to zero. It is remarkable that the projection admits such a simple representation although the target set for the projection is non-convex and geometrically quite complicated.

The function that maps a scalar to for a constant offset is called the soft-shrinkage function. If , it is also called the rectifier function. Because the central element of the projection representation is a soft-shrinkage operation applied entry-wise to the input vector, carrying out projections onto level sets of Hoyer’s sparseness measure can be interpreted as denoising operation (Donoho, 1995; Hyvärinen, 1999; Elad, 2006).

2.2 Auxiliary Function for the Sparseness Projection

The projection problem can hence be reduced to determining the soft-shrinkage offset , which is a one-dimensional problem on the real line. Further, it is reasonable that must be smaller than the maximum entry of since otherwise would be the null vector, which is absurd. In the usual case where we can further conclude that must be non-negative. Otherwise, the result of the projection using the representation theorem would be less sparse than the input, which is impossible. Therefore, the projection problem can be further reduced to finding a number from the bounded interval with .

Next, a method for efficiently deciding whether the correct offset has been found is required. Similar to the projection onto a canonical simplex (Liu and Ye, 2009), we can design a real-valued function that vanishes exactly at the wanted offset. The properties of Hoyer’s allow us to formulate this function in an intuitive way. We call

the auxiliary function for the projection onto .

The rationale for the concrete definition of is as follows. Due to the representation theorem we know that the projection onto is merely a scaled version of the point . Moreover , and due to the scale-invariance of Hoyer’s follows . The essence of is the ratio of the norm to the norm of its argument. Hence here the scale constants that make normalized to the interval are omitted and the target norms are used instead. We therefore have that , and thus . As is unique, we can conclude that no other offset of the soft-shrinkage operation leads to a correct solution. In fact, if and when we write , then we have that and therefore .

Figure 2: Plot of the auxiliary function and its derivative for a random vector . The derivative was scaled using a positive number for improved visibility. The steps in are exactly the places where coincides with an entry of . It is enough to find an with and for the neighboring entries and in , because then the exact solution can be computed with a closed-form expression.

An exemplary plot of is depicted in Fig. 2. Clearly, the auxiliary function is continuous, differentiable except for isolated points and strictly decreasing except for its final part where it is constant. Since the constant part is always negative and starts at the offset equal to the second-largest entry of , the feasible interval can be reduced even more. The step discontinuities in coincide exactly with the entries of the input vector . These appealing analytical properties greatly simplify computation of the zero of , since standard root-finding algorithms such as bisection or Newton’s method (see for example Traub, 1964) can be employed to numerically find .

2.3 Linear Time and Constant Space Projection Algorithm

We can improve on merely a numerical solution by exploiting the special structure of to yield an analytical solution to the projection problem. This is due to the closed-form expression for which requires the indices of the coordinates in which the result of the projection is positive to be known. The other way around, when is known, these indices are exactly the ones of the coordinates in which is greater than . When an offset sufficiently close to can be determined numerically, the appropriate index set can be determined and thus the exact value of .

The decision if a candidate offset is close enough to the true can be made quite efficiently, since is coupled via index sets to individual entries of the input vector . Hence, it suffices to find the right-most left neighbor of in the entries of , and analogously the left-most right neighbor (see Fig. 2 for an example). Whenever and , the zero of must be located between and for continuity reasons. But then all values in greater than are exactly the values greater than or equal to , which can be determined by simply scanning through the entries of .

Figure 3: Flowchart of the proposed algorithm for computing sparseness-enforcing projections. The algorithm starts with the box at the upper left and terminates with the box at the lower left.

Based upon these considerations, a flowchart of our proposed method for computing sparseness-enforcing projections is depicted in Fig. 3. The algorithm performs bisection and continuously checks for sign changes in the auxiliary function . As soon as this is fulfilled, is computed and the result of the projection is determined in-place. A complete formal presentation and discussion of our proposed algorithm is given in the appendix.

Each intermediate step of the algorithm (that is, determination of the second-largest entry of , evaluation of the auxiliary function, projection result computation by application of soft-shrinkage and scaling) can be carried out in a time complexity linear in the problem dimensionality and a space complexity completely independent of .

Therefore, to show the overall algorithm possesses the same asymptotic complexity, it has to be proved that the number of bisection iterations required for finding is independent of . The length of the initial interval is upper-bounded by as discussed earlier. Bisection is moreover guaranteed to stop at the latest once the interval length is smaller than the minimum pairwise distance between distinct entries of ,

where . This is due to the fact that it is enough to find a sufficiently small range to deduce an analytical solution. The required number of bisection iterations is then less than , see Liu and Ye (2009). This number is bounded from above regardless of the dimensionality of the input vector since is upper-bounded and is lower-bounded due to finite machine precision (Goldberg, 1991).

Hence our sparseness-enforcing projection algorithm is asymptotically optimal in the sense of complexity theory. There may still be hidden constants in the asymptotic notation that render the proposed algorithm less efficient than previously known algorithms for a small input dimensionality. Experiments in Sect. 4 demonstrate that this is not the case.

3 Explicitly Sparseness-Constrained Dictionary Learning

This section proposes the Easy Dictionary Learning (EZDL) algorithm for dictionary learning under explicit sparseness constraints. First, we introduce an ordinary formulation of the learning algorithm. Sparse code word inference with the sparseness projection algorithm proposed in Sect. 2 renders EZDL efficient and particularly simple to implement. We further discuss extensions that allow concentrating on different aspects of the data set under investigation, such as topographic organization of the atoms and atom sparseness. Only little implementation effort is required for these extensions and their computational demands are low. A description of the comprehensive EZDL learning algorithm is accompanied with several strategies that improve the optimization performance.

3.1 EZDL—Easy Dictionary Learning

Conventional dictionary learning algorithms operate in an alternating fashion, where code words are inferred with a costly optimization procedure before updating the dictionary. EZDL learns a dictionary by first yielding sparse code words through simple inference models and then tuning the dictionary with a simple update step. An inference model is a function which accepts filter responses in the form of the product of the dictionary with a learning sample , , and produces a representation with certain desirable properties. Here, the combination with the sparseness-enforcing projection operator is particularly interesting since it provides a natural method to fulfill explicit sparseness constraints.

We call the choice the ordinary inference model, where denotes the Euclidean projection onto the set of all vectors that achieve a sparseness degree of with respect to Hoyer’s sparseness measure . This computation can also be interpreted as the trivial first iteration of a projected Landweber procedure for sparse code word inference (Bredies and Lorenz, 2008).

The dictionary is adapted to new data by minimizing the deviation between a learning sample and its approximation through the linear generative model. The goodness of the approximation is here assessed with a differentiable similarity measure , so that the EZDL optimization problem becomes:

Note that is not a variable of the optimization problem since it is defined as output of an inference model. For the same reason, does not need to be constrained further as it inherently satisfies explicit sparseness constraints.

Although a wide range of similarity measures is possible, we decided to use the (Pearson product-moment) correlation coefficient since it is invariant to affine-linear transformations

(Rodgers and Nicewander, 1988). In the context of visual data set analysis, this corresponds to invariance with respect to DC components and gain factors. Moreover, differentiability of this similarity measure facilitates gradient-based learning (Thom and Palm, 2013).

If is the transposed derivative of with respect to its first argument and denotes its value in , the gradient of for the dictionary is given by

Therefore, when is tuned with gradient descent a simple and biologically plausible Hebbian-like learning rule (Hyvärinen et al., 2009) results:

This update step can be implemented efficiently with simple rank-1 updates (Blackford et al., 2002).

We here assumed to be constant and neglected that it actually depends on . Without this assumption, the gradient would have to be extended with an additive term which comprises the gradient of the inference model (Thom and Palm, 2013, Proposition 39). This, in turn, requires the computation of the sparseness-enforcing projection operator’s gradient. This gradient can be computed with simple vector operations as we show in the appendix. However, this constitutes a significant computational burden compared to the simple and fast-to-compute EZDL update step. Section 4 demonstrates through experiments that this simplification is still perfectly capable of learning dictionaries.

3.2 Topographic Atom Organization and Atom Sparseness

Topographic organization of the dictionary’s atoms similar to Self-organizing Maps

(Kohonen, 1990)

or Topographic Independent Component Analysis

(Hyvärinen et al., 2001) can be achieved in a straightforward way with EZDL using alternative inference models proposed in this section. For this, the dictionary atoms are interpreted to be arranged on a two-dimensional grid. A spatial pooling operator subject to circular boundary conditions can then be used to incorporate interactions between atoms located adjacent on the grid. This can, for example, be realized by averaging each entry of the filter responses in a neighborhood on the grid. This convolution operation can be expressed as linear operator applied to the vector , represented by a sparsely populated matrix . With containing all atom interactions, we call the topographic inference model.

An alternative realization of topographic organization is enforcing structure sparseness

on the filter responses reshaped as matrix to account for the grid layout. Structure sparseness can be assessed by application of a sparseness measure to the vector with the singular values of a matrix. When the

pseudo-norm is used, then this is the rank of a matrix. The matrix rank can also be measured more robustly through the ratio of the Schatten -norm to the Schatten -norm, which is essentially Hoyer’s sparseness measure applied to the singular values (Lopes, 2013).

Clearly, any matrix where can be reshaped to a vector with the vectorization operator that stacks all columns of on top of another (Neudecker, 1969). This linear operation can be inverted provided the shape of the original matrix is known, . In the following, let denote the projection onto the set of all matrices that possess a target rank

. A classical result states that any matrix can be projected onto the set of low-rank matrices by computation of its Singular Value Decomposition, applying an

pseudo-norm projection to the vector of singular values thus retaining only the largest ones, and finally computing the reconstruction using the modified singular values (Eckart and Young, 1936).

Based on these considerations, let and with be natural numbers describing the topographic grid layout dimensionalities. The rank- topographic inference model is then defined by the composition

In words, this inference model operates as follows. It first approximates the filter responses with a sparsely populated vector that attains a sparseness of with respect to Hoyer’s . These sparsified filter responses are then laid out on a grid and replaced with those best approximating filter responses that meet a lower rank of . Finally, the grid layout is reshaped to a vector since this should be the output type of each inference model.

If , then there are two vectors and such that , that is the filter responses can be expressed as a dyadic product when reshaped to a grid. We will demonstrate with experiments that this results in an interesting topographic atom organization similar to Independent Subspace Analysis (Hyvärinen and Hoyer, 2000).

Analogous to Non-negative Matrix Factorization with Sparseness Constraints (Hoyer, 2004)

, the atoms can be enforced to be sparse as well. To achieve this, it is sufficient to apply the sparseness projection to each column of the dictionary after each learning epoch:

where is the -th canonical basis vector that selects the -th column from , and is the target degree of atom sparseness.

In the situation where the learning samples resemble image patches, the atoms can also be restricted to fulfill structure sparseness using a low-rank approximation. Suppose the image patches possess pixels, then the following projection can be carried out after each learning epoch:

Here, denotes the target rank of each atom when reshaped to pixels. When the dictionary is used to process images through convolution, small values of are beneficial since computations can then be speeded up considerably. For example, if each atom can be expressed as outer product of vectors from and , respectively. The convolutional kernel is separable in this case, and two one-dimensional convolutions lead to the same result as one two-dimensional convolution, but require only a fraction of operations (Hoggar, 2006).

3.3 Learning Algorithm

In the previous sections, a simple Hebbian-like learning rule was derived which depends on abstract inference models. The core of the inference models is the sparseness-enforcing projection operator discussed in Sect. 2, which guarantees that the code words always attain a sparseness degree easily controllable by the user. On presentation of a learning sample , an update step of the dictionary hence consists of

  1. determining the filter responses (),

  2. feeding the filter responses through the inference model involving a sparseness projection (),

  3. computation of the approximation to the original learning sample () using the linear generative model,

  4. assessing the similarity between the input sample and its approximation () and the similarity measure’s gradient (), and eventually

  5. adapting the current dictionary using a rank-1 update ().

All these intermediate steps can be carried out efficiently. The sparseness projection can be computed using few resources, and neither its gradient nor the gradient of the entire inference model is required. After each learning epoch it is possible to impose additional constraints on the atoms with atom-wise projections. In doing so, several aspects of the data set structure can be made more explicit since the atoms are then forced to become more simple feature detectors.

An example of an individual learning epoch is given in Algorithm 1, where the topographic inference model and low-rank structure sparseness are used. In addition, the explicit expressions for the correlation coefficient and its gradient are given (Thom and Palm, 2013, Proposition 40). The outcome of this parameterization applied to patches from natural images is depicted in Fig. (b)b and will be discussed in Sect. 4.2.3.

Input: The algorithm accepts the following parameters:
  • An existing dictionary with atoms.

  • Random access to a learning set with samples from through a function "".

  • Step size for update steps .

  • Number of samples to be presented .

  • For the topographic inference model:

    • Target degree of dictionary sparseness .

    • Topography matrix describing the interaction between adjacent atoms.

  • To obtain low-rank structure sparseness:

    • Number of pixels so that .

    • Target rank .

Output: Updated dictionary .
// Present samples and adapt dictionary.
1 repeat times
      // Pick a random sample.
2       ;
       // (a) Compute filter responses.
3       ;
       // (b) Evaluate topographic inference model.
4       ;
       // (c) Compute approximation to .
5       ;
       // (d) Compute correlation coefficient and its gradient . Here, is the all-ones vector.
6       ;
7       ;
8       ;
9       ;
       // (e) Adapt dictionary with rank-1 update.
10       ;
12 end
// Atom-wise projection for low-rank filters.
13 for to do
      // Extract -th atom and normalize.
14       ;
       // Ensure atom has rank when reshaped to pixels.
15       ;
       // Store modified atom back in dictionary.
16       ;
18 end for
Algorithm 1 One epoch of the Easy Dictionary Learning algorithm, illustrated for the case of a topographic inference model and low-rank structure sparseness.

The efficiency of the learning algorithm can be improved with simple modifications described in the following. To reduce the learning time compared to when a randomly initialized dictionary was used, the dictionary should be initialized by samples randomly chosen from the learning set (Mairal et al., 2009a; Skretting and Engan, 2010; Coates and Ng, 2011; Thom and Palm, 2013)

. When large learning sets with millions of samples are available, stochastic gradient descent has been proven to result in significantly faster optimization progress compared to when the degrees of freedom are updated only once for each learning epoch

(Wilson and Martinez, 2003; Bottou and LeCun, 2004). The step size schedule , where denotes the initial step size and is the learning epoch counter, is beneficial when the true gradient is not available but rather an erroneous estimate (Bertsekas, 1999). After each learning epoch, all the atoms of the dictionary are normalized to unit scale. Since the learning rule is Hebbian-like, this prevents the atoms from growing arbitrarily large or becoming arbitrarily small (Hyvärinen et al., 2009). This normalization step is also common in a multitude of alternative dictionary learning algorithms (Kreutz-Delgado et al., 2003; Hoyer, 2004; Mairal et al., 2009a; Skretting and Engan, 2010).

Since the dictionary is modified with a rank-1 update where one of the factors is the result of an inference model and hence sparse, only the atoms that induce significant filter responses are updated. This may result in atoms that are never updated when the target sparseness degree

is large. This behavior can be alleviated by adding random noise to the inference model’s output prior to updating the dictionary, which forces all the atoms to be updated. We used random numbers sampled from a zero-mean normal distribution, where the standard deviation was multiplicatively annealed after each learning epoch. As optimization proceeds, the atoms are well-distributed in sample space such that the lifetime sparseness is approximately equal to the population sparseness

(Willmore and Tolhurst, 2001), and randomness is not needed anymore.

4 Experimental Results

This section reports experimental results on the techniques proposed in this paper. We first evaluate alternative solvers for the sparseness projection’s auxiliary function and show that our algorithm for the sparseness-enforcing projection operator is significantly faster than previously known methods. We then turn to the application of the projection in the Easy Dictionary Learning algorithm. First, the morphology of dictionaries learned on natural image patches is analyzed and put in context with previous methods. We further show that the resulting dictionaries are well-suited for the reproduction of entire images. They achieve a reproduction quality equivalent to dictionaries trained with alternative, significantly slower algorithms. Eventually, we analyze the performance of the dictionaries when employed for the image denoising task and find that, analogous to the reproduction experiments, no performance degradation can be observed.

4.1 Performance of Sparseness-Enforcing Projections

Our proposed algorithm for sparseness projections was implemented as C++ program using the GNU Scientific Library (Galassi et al., 2009). We first analyzed whether solvers other than bisection for locating the zero of the auxiliary function would result in an improved performance. Since is differentiable except for isolated points and its derivatives can be computed quite efficiently, Newton’s method and Halley’s method are straightforward to apply. We further verified whether Newton’s method applied to a slightly transformed variant of the auxiliary function,

where the minuend and the subtrahend were squared, would behave more efficiently. The methods based on derivatives were additionally safeguarded with bisection to guarantee new positions are always located within well-defined intervals (Press et al., 2007). This impairs the theoretical property that only a constant number of steps is required to find a solution, but in practice a significantly smaller number of steps needs to be made compared to plain bisection.

To evaluate which solver would provide maximum efficiency in practical scenarios, one thousand random vectors for each problem dimensionality were sampled and the corresponding sparseness-enforcing projections for were computed using all four solvers. We have used the very same random vectors as input for all solvers, and counted the number of times the auxiliary function needed to be evaluated until the solution was found. The results of this experiment are depicted in Fig. 4.

Figure 4: Number of auxiliary function evaluations needed to find the final interval for the projection onto using four different solvers. The error bars indicate standard deviation distance from the mean value. Since Newton’s method applied to is consistently outperforming the other solvers, it is the method of choice for all practical applications.

The number of function evaluations required by plain bisection grew about linearly in . Because the expected minimum difference of two distinct entries from a random vector gets smaller when the dimensionality of the random vector is increased, the expected number of function evaluations bisection needs increases with problem dimensionality. In either case, the length of the interval that has to be found is always bounded from below by the machine precision such that the number of function evaluations with bisection is bounded from above regardless of .

The solvers based on the derivative of or , respectively, always required less function evaluations than bisection. They exhibited a growth clearly sublinear in . For , Newton’s method required  function evaluations in the mean, Halley’s method needed  iterations, and Newton’s method applied to found the solution in only  iterations. Therefore, in practice always the latter solver should be employed.

In another experiment, the time competing algorithms require for computing sparseness projections on a real computing machine was measured for a comparison. For this, the algorithms proposed by Hoyer (2004), Potluru et al. (2013), and Thom and Palm (2013) were implemented using the GNU Scientific Library (Galassi et al., 2009) by means of C++ programs. An Intel Core i7-4960X processor was used and all algorithms were run in a single-threaded environment. Random vectors were sampled for each problem dimensionality and initially projected to attain a sparseness of with respect to Hoyer’s . This initial projection better reflects the situation in practice where not completely random vectors have to be processed. Next, all four algorithms were used to compute projections with a target sparseness degree of and their run time was measured. The original algorithm of Hoyer (2004) was the slowest, so by taking the ratio of the run times of the other algorithms to the run time of that slowest algorithm the relative speed-up was obtained.

Figure 5: Speed-ups relative to the original algorithm of Hoyer (2004) obtained using alternative algorithms. The linear time algorithm proposed in this paper is far superior in terms of execution time. The time complexity of the competing methods is at least quasi-linear, which becomes noticeable especially for large problem dimensionalities.

Figure 5 visualizes the results of this experiment. For all tested problem dimensionalities, the here proposed linear time algorithm dominated all previously described methods. While the speed-up of the algorithms of Potluru et al. (2013) and Thom and Palm (2013) relative to the original algorithm of Hoyer (2004) were already significant, especially for small to medium dimensionalities, they got relatively slow for very large vectors. This is not surprising, because both methods start with sorting the input vector and have to store that permutation to be undone in the end.

The algorithm proposed in this paper is based on root-finding of a monotone function and requires no sorting. Only the left and right neighbors of a scalar in a vector have to be found. This can be achieved by scanning linearly through the input vector, which is particularly efficient when huge vectors have to be processed. For , the proposed algorithm was more than  times faster than the methods of Hoyer (2004), Potluru et al. (2013) and Thom and Palm (2013). Because of this appealing asymptotic behavior, there is now no further obstacle to applying smooth sparseness methods to large-scale problems.

4.2 Data Set Analysis with EZDL

We used the Easy Dictionary Learning algorithm as a data analysis tool to visualize the structure of patches from natural images under different aspects, which facilitates qualitative statements on the algorithm performance. For our experiments, we used the McGill calibrated color image database (Olmos and Kingdom, 2004), where the images had either pixels or pixels. The images were desaturated (Annex B.4 SMPTE RP 177-1993) and quantized to -bit precision. We extracted  patches with  pixels from each of the

 images of the "Foliage" collection. The patches were extracted at random positions. Patches with vanishing variance were omitted since they carry no information. In total, we obtained

 million samples for learning.

We learned dictionaries on the raw pixel values of this learning set and on whitened image patches. The learning samples were normalized to zero mean and unit variance as the only pre-processing for the raw pixel experiments. The whitened data was obtained using the canonical pre-processing from Section 5.4 of Hyvärinen et al. (2009) with  principal components. EZDL was applied using each proposed combination of inference model and atom constraints. We presented randomly selected learning samples in each of one thousand epochs, which corresponds to about ten sweeps through the entire learning set. The initial step size was set to .

A noisy version of the final dictionary could already be observed after the first epoch, demonstrating the effectiveness of EZDL for quick analysis. Since the optimization procedure is probabilistic in the initialization and selection of training samples, we repeated the training five times for each parameter set. We observed only minor variations between the five dictionaries for each parameter set.

In the following, we present details of dictionaries optimized on raw pixel values and analyze their filter characteristics. Then, we consider dictionaries learned with whitened data and dictionaries with separable atoms trained on raw pixel values and discuss the underlying reasons for their special morphology.

4.2.1 Raw Pixel Values

Using the ordinary inference model, we trained two-times overcomplete dictionaries with atoms on the normalized pixel values, where the dictionary sparseness degree was varied between and

. This resulted in the familiar appearance of dictionaries with Gabor-like filters, which resemble the optimal stimulus of visual neurons

(Olshausen and Field, 1996, 1997). While for small values of the filters were mostly small and sharply bounded, high sparseness resulted in holistic and blurry filters.

We used the methods developed by Jones and Palmer (1987) and Ringach (2002) for a more detailed analysis. In doing so, a two-dimensional Gabor function as defined in Equation (1) of Ringach (2002), that is a Gaussian envelope multiplied with a cosine carrier wave, was fit to each dictionary atom using the algorithm of Nelder and Mead (1965). We verified that these Gabor fits accurately described the atoms, which confirmed the Gabor-like nature of the filters.

The differences in the dictionaries due to varying sparseness degrees became apparent through analysis of the parameters of the fitted Gabor functions. Figure 6 shows the mean values of three influential Gabor parameters in dependence on . All parameters change continuously and monotonically with increasing sparseness. The spatial frequency, a factor in the cosine wave of the Gabor function, constantly decreases for increasing dictionary sparseness. The width of the Gaussian envelope, here broken down into the standard deviation in both principal orientations and , is monotonically increasing. The envelope width in direction increases earlier, but in the end the envelope width in direction is larger.

Figure 6: Mean values of three influential parameters of Gabor functions fitted to the atoms of the dictionaries learned with EZDL, in dependence on the target degree of dictionary sparseness . An increase in sparseness results in filters with a reduced spatial frequency, but with a significant increase in the width of the Gaussian envelope.

It can be concluded that more sparse code words result in filters with lower frequency but larger envelope. Since an increased sparseness reduces the model’s effective number of degrees of freedom, it prevents the constrained dictionaries from adapting precisely to the learning data. Similar to Principal Component Analysis, low-frequency atoms here minimize the reproduction error best when only very few effective atoms are allowed

(Hyvärinen et al., 2009; Olshausen and Field, 1996).

Histograms of the spatial phase of the filters, an additive term in the cosine wave of the Gabor function, are depicted in Fig. 7. For , there is a peak at

which corresponds to odd-symmetric filters

(Ringach, 2002). The distribution is clearly bimodal for , with peaks at and corresponding to even-symmetric and odd-symmetric filters, respectively. While the case of small matches the result of ordinary sparse coding, the higher dictionary sparseness results in filters with the same characteristics as the optimal stimulus of macaque visual neurons (Ringach, 2002).

Figure 7: Histograms of the spatial phase of fitted Gabor functions for (left) and (right). Low sparseness yields dictionaries where odd-symmetric filters dominate, whereas high dictionary sparseness results in a significant amount of both even-symmetric and odd-symmetric filters.

This analysis proves that EZDL’s minimalistic learning rule is capable of generating biologically plausible dictionaries, which constitute a particularly efficient image coding scheme. To obtain dictionaries with diverse characteristics, it is enough to adjust the target degree of dictionary sparseness on a normalized scale. A major component in the learning algorithm is the sparseness projection, enforcing local competition among the atoms (Rozell et al., 2008) due to its absolute order-preservation property (Thom and Palm, 2013, Lemma 12).

4.2.2 Whitened Image Patches

Whitening as a pre-processing step helps to reduce sampling artifacts and decorrelates the input data (Hyvärinen et al., 2009). It also changes the intuition of the similarity measure in EZDL’s objective function, linear features rather than single pixels are considered where each feature captures a multitude of pixels in the raw patches. This results in differences in the filter structure, particularly in the emergence of more low-frequency filters.

The dictionary depicted in Fig. (a)a was learned using the topographic inference model with average-pooling of neighborhoods. We set the dictionary sparseness degree to and the number of atoms to , arranged on a grid. The dictionary closely resembles those gained through Topographic Independent Component Analysis (Hyvärinen et al., 2001, 2009) and Invariant Predictive Sparse Decomposition (Kavukcuoglu et al., 2009). It should be noted that here the representation is two times overcomplete due to the whitening procedure. Overcomplete representations are not inherently possible with plain Independent Component Analysis (Bell and Sejnowski, 1997; Hyvärinen et al., 2009) which limits its expressiveness, a restriction that does not hold for EZDL.

The emergence of topographic organization can be explained by the special design of the topographic inference model. The pooling operator acts as spatial low-pass filter on the filter responses, so that each smoothed filter response carries information on the neighboring filter responses. Filter response locality is retained after the sparseness projection, hence adjacent atoms receive similar updates with the Hebbian-like learning rule. Hence, there are only small differences between filters within the same vicinity. The learning process is similar to that of Self-organizing Maps (Kohonen, 1990), where only the atom with the maximum filter response and the atoms in its direct surrounding are updated. However, EZDL simultaneously updates multiple clusters since the sparse code words laid out on the grid are multimodal.

Further, it is notable that we achieved a topographic organization merely through a linear operator by simple average-pooling. This stands in contrast to the discussion from Bauer and Memisevic (2013) where the necessity of a nonlinearity such as multiplicative interaction or pooling with respect to the Euclidean norm was assumed. Our result that a simple linear operator plugged into the ordinary inference model already produces smooth topographies proves that linear interactions between the atoms are sufficient despite their minimality.

(a) Topographic inference model.
(b) Rank-1 topographic inference model.
Figure 8: Two times overcomplete dictionaries trained with EZDL on whitened natural image patches.

Figure (b)b shows a dictionary obtained with the rank-1 topographic inference model, using a sparseness degree of and filters on a grid. Here, the sparse code words reshaped to a matrix are required to have unit rank which results in a specially organized filter layout. For example, rows three and four almost exclusively contain low-frequency filters, and the filters in the sixth column are all oriented vertically. The grouping of similar atoms into the same rows and columns is related to Independent Subspace Analysis, which yields groups of Gabor-like filters where all filters in a group have approximately the same frequency and orientation (Hyvärinen and Hoyer, 2000).

The rank-1 topographic inference model guarantees that code words can be expressed as the dyadic product of two vectors, where the factors themselves are sparsely populated because the code words are sparse. This causes the code words to possess a sparse rectangular structure when reshaped to account for the grid layout, that is non-vanishing activity is always concentrated in a few rows and columns. The Hebbian-like learning rule therefore induces similar updates to atoms located in common rows or columns, which explains the obtained group layout.

(a) Ordinary inference model.
(b) Topographic inference model.
Figure 9: Dictionaries trained with EZDL on raw pixels, the atoms were replaced by their best rank-1 approximation after each epoch.

4.2.3 Rank-1 Filters on Raw Pixel Values

Enforcing the filters themselves to have rank one by setting resulted in bases similar to Discrete Cosine Transform (Watson, 1994). This was also observed recently by Hawe et al. (2013)

who considered a tensor decomposition of the dictionary, and by

Rigamonti et al. (2013) who minimized the Schatten -norm of the atoms. Note that EZDL merely replaces the atoms after each learning epoch by their best rank-1 approximation. The computational complexity of this operation is negligible compared to an individual learning epoch using tens of thousands of samples and hence does not slow down the actual optimization.

If no ordering between the filters was demanded, a multitude of checkerboard-like filters and vertical and horizontal contrast bars was obtained, see Fig. (a)a for a dictionary with atoms and a sparseness degree of . Atoms with diagonal structure cannot be present at all in such constrained dictionaries because diagonality requires a filter rank greater than one. Although all filters were paraxial due to the rank-1 constraint, they still resembled contrast fields because of their grating-like appearance. This is due to the representation’s sparseness, which is known to induce this appearance.

A similar filter morphology was obtained with a topographic filter ordering using a grid, though the filters were blurrier, as shown in Fig. (b)b. Here, checkerboard-like filters are located in clusters, and filters with vertical and horizontal orientation are spatially adjacent. This is as expected, because with a sparse topography there are only a few local blobs of active entries in each code word, causing similar atoms to be grouped together and dissimilar atoms to be distant from each other. In the space of rank-1 filters, there can be either vertical or horizontal structures, or checkerboards combining both principal orientations.

The lack of fine details can be explained by the restriction of topographic organization, which reduces the effective number of degrees of freedom. Analogously to the situation in Sect. 4.2.1 where the dictionary sparseness was increased, the reproduction error can be reduced by a large amount using few filters with low frequencies. For the same reason, there are few such checkerboard-like filters: Minimization of the reproduction error is first achieved with principal orientations before checkerboards can be used to encode details in the image patches.

In conclusion, these results show that the variety of the dictionaries produced by EZDL can be vast simply by adapting easily interpretable parameters. The algorithm covers and reproduces well-known phenomena from the literature, and allows a precise visualization of a data set’s structure. A first impression of the final dictionaries can already be obtained after one learning epoch, which takes no more than one second on a modern computer.

4.3 Experiments on the Reproduction of Entire Images

(a) Reproduction quality in dependence on the number of Landweber iterations carried out for inference. The four curves show different combinations of dictionary sparseness and inference sparseness .
(b) Achieved reproduction quality in dependence on the inference sparseness for four different dictionary sparseness degrees . Large values of yield the best performance for large .
Figure 10: Results on the reproduction of entire images using Easy Dictionary Learning dictionaries. LABEL:sub@fig:landweber_iteration_plot_256 While one Landweber iteration suffices for learning, better reproduction performance is yielded if the Landweber procedure is run until convergence. LABEL:sub@fig:inference_sparseness_plot_256 The relation between the dictionary sparseness and the inference sparseness is influential. Hardly any information loss can be observed for , while the case shows dictionaries should be trained for very sparse code words when demanding such a sparseness for reproduction.

It was demonstrated in Sect. 4.2 that the Easy Dictionary Learning algorithm produces dictionaries that correspond to efficient image coding schemes. We now analyze the suitability of EZDL dictionaries trained on raw pixel values for the reproduction of entire images. This is a prerequisite for several image processing tasks such as image enhancement and compression, since here essentially the same optimization problem should be solved as during reproduction with a given dictionary.

Our analysis allows quantitative statements as the original images and their reproduction through sparse coding can be numerically compared on the signal level. Further, the EZDL dictionaries are compared with the results of the Online Dictionary Learning algorithm by Mairal et al. (2009a) and the Recursive Least Squares Dictionary Learning Algorithm by Skretting and Engan (2010) in terms of reproduction quality and learning speed.

The evaluation methodology was as follows. First, dictionaries with different parameter sets were trained on million pixels natural image patches extracted from the "Foliage" collection of the McGill calibrated color image database (Olmos and Kingdom, 2004). As in Sect. 4.2, patches were picked from random positions from each of the desaturated and quantized images. All patches were normalized to zero mean and unit variance before training. The dictionaries were designed for a four times overcomplete representation, they had atoms for samples with  pixels. After training, dictionary quality was measured using the  images of the "Flowers" collection from the same image database. This evaluation set was disjoint from the learning set.

Each single image was subsequently divided into non-overlapping blocks with pixels. All blocks were normalized to zero mean and unit variance, and then the optimized dictionaries were used to infer a sparse code word for each block. The resulting code word was then fed through a linear generative model using the currently investigated dictionary, and the mean value and variance were restored to match that of the original block.

Sparse code word inference was achieved with a projected Landweber procedure (Bredies and Lorenz, 2008), essentially the same as projected gradient descent starting with the null vector. Using the sparseness-enforcing projection operator after each iteration, the code words were tuned to attain an inference sparseness

, which need not necessarily be equal to the sparseness degree used for dictionary learning. The correlation coefficient was used as the similarity measure for inference to benefit from invariance to shifting and scaling. Automatic step size control was carried out with the bold driver heuristic

(Bishop, 1995). We note that due to the representation theorem on the sparseness projection this process can also be understood as Iterative Soft-Thresholding (Bredies and Lorenz, 2008).

A reproduced image is yielded by applying this procedure to all distinct blocks of the original image using a single dictionary. The deviation between this output image and the original image was assessed with the SSIM index (Wang and Bovik, 2009), which yielded qualitatively similar results as the peak signal-to-noise ratio. The SSIM index is, however, normalized to values between zero and one, and it respects the local structure of images since it examines

pixels neighborhoods through convolution. Because it measures the visual similarity between images, it is more intuitive than measures based on the pixel-wise squared error

(Wang and Bovik, 2009). This evaluation method yielded one number from the interval for each image and dictionary. Each parameterization for dictionary learning was used to train five dictionaries to compensate probabilistic effects, and here the mean of the resulting  SSIMs is reported.

4.3.1 EZDL Results

Figure 10 visualizes the results obtained for dictionaries produced by Easy Dictionary Learning using dictionary sparseness degrees . One thousand learning epochs were carried out, presenting samples in each epoch, and using an initial step size of . During dictionary learning, only the first trivial iteration of a Landweber procedure is carried out for sparse code word inference by computing for the ordinary inference model.

(a) Results of the Online Dictionary Learning (ODL) algorithm by Mairal et al. (2009a), where trades off between sparseness and reproduction capabilities in a model with implicit sparseness constraints.
(b) Results of the Recursive Least Squares Dictionary Learning Algorithm (RLS-DLA) by Skretting and Engan (2010). Here, denotes the target pseudo-norm of the code words for inference.
Figure 11: Experimental results on image reproduction where alternative dictionary learning algorithms were used. The dictionaries behave similarly to Easy Dictionary Learning dictionaries if the dictionary sparseness parameter and the inference sparseness are varied.

We first analyzed the impact of carrying out more Landweber iterations during the image reproduction phase, and the effect of varying the inference sparseness degree . A huge performance increase can be obtained by using more than one iteration for inference (Fig. (a)a). Almost the optimal performance is achieved after ten iterations, and after one hundred iterations the method has converged. For , the performance of dictionaries with and is about equal yielding the maximum SSIM of one. This value indicates that there is no visual difference between the reproduced images and the original images.

When the inference sparseness is increased to , a difference in the choice of the dictionary sparseness becomes noticeable. For , performance already degrades, and the degradation is substantial if . It is more intuitive when and are put in relation. Almost no information is lost in the reproduction by enforcing a lower inference sparseness than dictionary sparseness (). Performance is worse if which is plausible because the dictionary was not adapted for a higher sparseness degree. In the natural case the performance mainly depends on their concrete value, where higher sparseness results in worse reproduction capabilities.

To further investigate this behavior, we set the number of Landweber iterations to and varied smoothly in the interval for the four different dictionary sparseness degrees. The results of this experiment are depicted in Fig. (b)b. For there is hardly any difference in the reproduction quality irrespective of the value of . The difference when training dictionaries with different sparseness degrees first becomes visible for large values of . Performance is here better using dictionaries where very sparse code words were demanded during learning. Hence, tasks that require high code word sparseness should use dictionaries trained with high values of the dictionary sparseness.

4.3.2 Comparison with Alternative Dictionary Learning Algorithms

For a comparison, we conducted experiments using the Online Dictionary Learning (ODL) algorithm of Mairal et al. (2009a) and the Recursive Least Squares Dictionary Learning Algorithm (RLS-DLA) by Skretting and Engan (2010). For inference of sparse code words, ODL minimizes the reproduction error under implicit sparseness constraints. RLS-DLA uses an external vector selection algorithm for inference, hence explicit constraints such as a target pseudo-norm can be demanded easily. Both algorithms update the dictionary after each sample presentation.

ODL does not require any step sizes to be adjusted. The crucial parameter for dictionary sparseness here is a number , which controls the trade-off between reproduction capabilities and code word sparseness. We trained five dictionaries for each using atoms and presenting learning samples in each of one thousand epochs. Then, the same evaluation methodology as before with was used to assess the reproduction capabilities. The results are shown in Fig. (a)a. The choice of is not as influential compared to in EZDL. There are only small performance differences for , and for hardly any difference can be observed. Similar to the EZDL experiments, large values of result in better performance for large and worse performance for small .

RLS-DLA provides a forgetting factor parameter which behaves analogously to the step size in gradient descent. We used the default forgetting factor schedule which interpolates between and using a cubic function over one thousand learning epochs. For inference, we chose an optimized Orthogonal Matching Pursuit variant (Gharavi-Alkhansari and Huang, 1998) where a parameter controls the number of non-vanishing coordinates in the code words. We trained five dictionaries with atoms for each . Thirty thousand randomly drawn learning samples were presented in each learning epoch. The resulting reproduction performance is shown in Fig. (b)b. Again, for large values of the dictionaries with the most sparse code words during learning perform best, that is those trained with small values of .

To compare all three dictionary learning algorithms independent of the concrete dictionary sparseness, we took the mean SSIM value that belonged to the best performing dictionaries for each feasible value of . This can also be interpreted as using the convex hull of the results shown in Fig. (b)b and Fig. 11. This yielded one curve for each dictionary learning algorithm, depicted in Fig. 12. There is only a minor difference between the curves over the entire range of . It can hence be concluded that the algorithms learn dictionaries which are equally well-suited for the reproduction of entire images. Although EZDL uses a very simple learning rule, this is sufficient enough to achieve a performance competitive with that of two state-of-the-art dictionary learning algorithms.

4.3.3 Comparison of Learning Speed

We moreover compared the learning speed of the methods and investigated the influence of the initial step size on the EZDL dictionaries. In doing so, we carried out reproduction experiments on entire images using dictionaries provided by the learning algorithms after certain numbers of learning epochs. In each of one thousand overall epochs,  learning samples where input to all three algorithms. The dictionary sparseness parameters were set to for EZDL, to for Online Dictionary Learning, and to for the Recursive Least Squares Dictionary Learning Algorithm.

Figure 12: Comparison of dictionary performance when the dependency on the dictionary sparseness degrees , and was eliminated, for ODL, RLS-DLA and EZDL, respectively. All three algorithms produce dictionaries equally well-suited to the reproduction of entire images.
Figure 13: Reproduction quality in dependence on the number of samples presented to Online Dictionary Learning, the Recursive Least Squares Dictionary Learning Algorithm, and Easy Dictionary Learning, where the initial step size was varied for the latter. All approaches achieved the same performance after  learning epochs.

Timing measurements on an Intel Core i7-4960X processor have shown that one EZDL learning epoch takes approximately less time than a learning epoch of ODL. RLS-DLA was roughly  times slower than EZDL. Although the employed vector selection method was present as part of an optimized software library, only a fairly unoptimized implementation of the actual learning algorithm was available. Therefore, a comparison of this algorithm with regard to execution speed would be inequitable.

The inference sparseness was set to and one hundred Landweber iterations were carried out for sparse code word inference. The SSIM index was eventually used to assess the reproduction performance. Figure 13 visualizes the results of the learning speed analysis, averaged over all images from the evaluation set and five dictionaries trained for each parameterization to compensate probabilistic effects. Online Dictionary Learning is free of step sizes, dictionary performance increased instantly and constantly with each learning epoch. Only small performance gains could be observed after  learning epochs.

The performance of dictionaries trained with RLS-DLA was initially better than that of the ODL dictionaries, but improved more slowly. After a hundred epochs, however, it was possible to observe a significant performance gain. Although tweaking the forgetting factor schedule may have led to better early reproduction capabilities, RLS-DLA achieved a performance equivalent to that of ODL after  epochs.

If EZDL’s initial step size was set to unity, performance first degraded, but started to improve after the fifth epoch. After  epochs, the performance was identical to ODL’s, and all three methods obtained equal reproduction capabilities after  epochs. Reduction of the initial step size caused the dictionaries to perform better after few sample presentations. The quality of Online Dictionary Learning was achieved after  epochs for , and for the EZDL dictionaries were always better in the mean until the th epoch. There is hardly any quality difference after  epochs, a very small initial step size resulted however in a slightly worse performance.

The choice of the initial step size for EZDL is hence not very influential. Although caused small overfitting during the very first epochs, the dictionaries quickly recovered so that no significant difference in reproduction capabilities could be observed after  epochs. Since an EZDL epoch is faster than an Online Dictionary Learning epoch, our proposed method produces better results earlier on an absolute time scale.

4.4 Image Denoising Experiments

We have demonstrated in Sect. 4.3 that dictionaries learned with Easy Dictionary Learning are as good as those obtained from the Online Dictionary Learning algorithm of Mairal et al. (2009a) and the Recursive Least Squares Dictionary Learning Algorithm by Skretting and Engan (2010) in terms of the reproduction quality of entire images. In a final experiment, we investigated the suitability of EZDL dictionaries for image denoising using the image enhancement procedure proposed by Mairal et al. (2009b).

This method carries out semi-local block matching and finds sparse code words by imposing a group sparseness penalty on the Euclidean reproduction error. In doing so, a pre-learned dictionary is used which explains the appearance of uncorrupted images and further helps to resolve ambiguities if block matching fails to provide large enough groups. A linear generative model is finally used to find estimators of denoised image patches from the sparse code words. This procedure is quite robust if the input data is noisy, since sparseness provides a strong prior which well regularizes this ill-posed inverse problem (Kreutz-Delgado et al., 2003; Foucart and Rauhut, 2013).

The denoising approach of Mairal et al. (2009b) also provides the possibility of dictionary adaptation while denoising concrete input data. We did not use this option as it would hinder resilient statements on a dictionary’s eligibility if it would be modified with another dictionary learning algorithm during denoising.

The methodology of the experiment was as follows. We used the four-times overcomplete dictionaries trained on the pixels image patches from the "Foliage" collection of the McGill calibrated color image database (Olmos and Kingdom, 2004) as models for uncorrupted images. The dictionary sparseness parameters were for EZDL, for ODL and for RLS-DLA. For evaluation, we used the  images of the "Animals" collection from the McGill database, and converted them to -bit grayscales as in the previous experiments. The images were synthetically corrupted with additive white Gaussian noise. Five noisy images were generated for each original image and each standard deviation from .

These images were then denoised using the candidate dictionaries, where the window size for block matching was set to  pixels. Then, the similarity between the reconstructions and the original images was measured with the peak signal-to-noise ratio. We also evaluated the SSIM index, which led to qualitatively similar results.

Figure 14: Denoising performance in terms of the peak signal-to-noise ratio (PSNR) using the denoising method proposed by Mairal et al. (2009b). There is only a small performance difference between dictionaries trained with ODL, RLS-DLA and EZDL.

The results are depicted in Fig. 14. Denoising performance degrades if the noise strength is increased. There is hardly any difference between the dictionaries trained with the three algorithms. The RLS-DLA and EZDL dictionaries perform slightly better for small synthetic noise levels, but this improvement is visually imperceptible. This result is not surprising, since Sect. 4.3 demonstrated that all three learning algorithms produce dictionaries equally well-suited for the reproduction of entire images.

The denoising procedure of Mairal et al. (2009b) aims at reproduction capabilities as well, with the modification of employing noisy samples as input. Image enhancement and compression applications such as those proposed by Yang et al. (2010, 2012), Dong et al. (2011), Skretting and Engan (2011) and Horev et al. (2012) which also use problem formulations based on the reproduction error can hence be expected to benefit from more efficiently learned dictionaries as well.

5 Conclusions

This paper proposed the EZDL algorithm which features explicit sparseness constraints with respect to Hoyer’s smooth sparseness measure . Pre-defined sparseness degrees are ensured to always be attained using a sparseness-enforcing projection operator. Building upon a succinct representation of the projection, we proved that the projection problem can be formulated as a root-finding problem. We presented a linear time and constant space algorithm for the projection which is superior to previous approaches in terms of theoretical computational complexity and execution time on real computing machines.

EZDL adapts dictionaries to measurement data through simple rank-1 updates. The sparseness projection serves as foundation for sparse code word inference. Due to the projection efficiency and since no complicated gradients are required, our proposed learning algorithm is significantly faster than even the optimized ODL algorithm. Topographic atom organization and atom sparseness can be realized with very simple extensions, allowing for versatile sparse representations of data sets. Its simplicity and efficiency does not hinder EZDL from producing dictionaries competitive with those generated by ODL and RLS-DLA in terms of the reproduction and denoising performance on natural images. Alternative image processing methods based on sparse representations rely on dictionaries subject to the same criteria, and can thus be expected to benefit from EZDL’s advantages as well.

The authors are grateful to Heiko Neumann, Florian Schüle, and Michael Gabb for helpful discussions. We would like to thank Julien Mairal and Karl Skretting for making implementations of their algorithms available. Parts of this work were performed on the computational resource bwUniCluster funded by the Ministry of Science, Research and Arts and the Universities of the State of Baden-Württemberg, Germany, within the framework program bwHPC. This work was supported by Daimler AG, Germany.

Appendix: Technical Details and Proofs for Section 2

This appendix studies the algorithmic computation of Euclidean projections onto level sets of Hoyer’s in greater detail, and in particular proves the correctness of the algorithm proposed in Sect. 2.

For a non-empty subset of the Euclidean space and a point , we call

the set of Euclidean projections of onto (Deutsch, 2001). Since we only consider situations in which is a singleton, we may also write .

Without loss of generality, we can compute for a vector within the non-negative orthant instead of for an arbitrary point to yield sparseness-enforcing projections, where and are as defined in Sect. 2. First, the actual scale is irrelevant as we can simply re-scale the result of the projection (Thom and Palm, 2013, Remark 5). Second, the constraint that the projection lies in the non-negative orthant can easily be handled by flipping the signs of certain coordinates (Thom and Palm, 2013, Lemma 11). Finally, all entries of can be assumed non-negative with Corollary 19 from Thom and Palm (2013).

We note that is non-convex because of the constraint. Moreover, for all target sparseness degrees which we show here by construction (see also Remark 18 in Thom and Palm (2013) for further details): Let and , then the point lies in , where denotes the -th canonical basis vector. Since all coordinates of are positive, always contains points with an pseudo-norm of . If we had used the pseudo-norm to measure sparseness, then would have the same sparseness degree as, for example, the vector with all entries equal to unity. If, however, is close to one, then there is only one large value in and all the other entries equaling are very small but positive. This simple example demonstrates that in situations where the presence of noise cannot be eliminated, Hoyer’s is a much more robust sparseness measure than the pseudo-norm.

Representation Theorem

Before proving a theorem on the characterization of the projection onto , we first fix some notation. As above, let denote the -th canonical basis vector and let furthermore be the vector where all entries are one. We note that if a point resides in the non-negative orthant, then . Subscripts to vectors denote the corresponding coordinate, except for and . For example, we have that . We abbreviate with when it is clear that is a real number. When is an index set with elements, say , then the unique matrix with for all is called the slicing operator. A useful relation between the pseudo-norm, the Manhattan norm and the Euclidean norm is for all points .

We are now in a position to formalize the representation theorem:

Theorem .1

Let and be unique. Then there is exactly one number such that

where is a scaling constant. Moreover, if , and , denotes the set of the coordinates in which does not vanish, then


It is possible to prove this claim either constructively or implicitly, where both variants differ in whether the set of all positive coordinates in the projection can be computed from or must be assumed to be known. We first present a constructive proof based on a geometric analysis conducted in Thom and Palm (2013), which contributes to deepening our insight into the involved computations. As an alternative, we also provide a rigorous proof using the method of Lagrange multipliers which greatly enhances the unproven analysis of Potluru et al. (2013, Section 3.1).

We first note that when there are and so that we have , then is determined already through because it holds that . We now show that the claimed representation is unique, and then present the two different proofs for the existence of the representation.

Uniqueness: It is , since would violate that and is impossible because . We first show that there are two distinct indices with . Assume this was not the case, then , say, for all . Let be the index of the smallest coordinate of which has its index in . Let and be numbers and define . Then like in the argument where we have shown that is non-empty. Because of and , it follows that . Hence is at least as good an approximation to as , violating the uniqueness of . Therefore, it is impossible that the set is a singleton.

Now let with and such that . Clearly and as . It is , thus holds. Moreover, , hence . Finally, we have that , which yields , and hence the representation is unique.

Existence (constructive): Let

be the hyperplane on which all points in the non-negative orthant have an

norm of and let be a scaled canonical simplex. Further, let be a circle on , and for an arbitrary index set let be a subset of where all coordinates not indexed by vanish. With Theorem 2 and Appendix D from Thom and Palm (2013) there exists a finite sequence of index sets with for such that is the result of the finite sequence

All intermediate projections yield unique results because was restricted to be unique. The index sets contain the indices of the entries that survive the projections onto , for . In other words, can be computed from by alternating projections, where the sets and are non-convex for all . The expressions for the individual projections are given in Lemma 13, Lemma 17, Proposition 24, and Lemma 30, respectively, in Thom and Palm (2013).

Let for completeness, then we can define the following constants for . Let be the number of relevant coordinates in each iteration, and let