1 Introduction
In a great variety of classical machine learning problems, sparse solutions are attractive because they provide more efficient representations than non-sparse solutions. There is overwhelming evidence that mammalian brains respect the sparseness principle
(Laughlin and Sejnowski, 2003), which holds true especially for the mammalian visual cortex (Hubel and Wiesel, 1959; Olshausen and Field, 1996, 1997). It suggests itself that sparseness be a fundamental prior in a variety of signal processing tasks. In particular, this includes low-level image processing, since natural images can be represented succinctly using structural primitives (Olshausen and Field, 1997; Mairal et al., 2009b). Interesting and biologically plausible sparse representations were discovered through computer simulations on natural images (Olshausen and Field, 1996, 1997). Related representations can be obtained by analysis of temporal image sequences (van Hateren and Ruderman, 1998; Olshausen, 2003), stereo image pairs and images with chromatic information (Hoyer and Hyvärinen, 2000), or by enforcing a topographic organization (Hyvärinen et al., 2001; Kavukcuoglu et al., 2009).
Sparseness alleviates the effects of random noise in a natural way since it prevents arbitrary combinations of measured signals (Donoho, 1995; Hyvärinen, 1999; Elad, 2006). In fact, methods based on sparse representations were shown to achieve state-of-the-art performance for image denoising (Mairal et al., 2009b). Further notable image processing applications that benefit from the efficiency gained through sparseness are as diverse as deblurring (Dong et al., 2011), super-resolution (Yang et al., 2010, 2012; Dong et al., 2011), compression (Skretting and Engan, 2011; Horev et al., 2012), and depth estimation (Tošić et al., 2011).
1.1 Dictionary Learning and Sparseness Measures
Each of these tasks needs a model capable of reproducing the signals to be processed. In a linear generative model for sparse coding, a sample x ∈ ℝⁿ with n features should be expressed approximately as a linear combination of only a few atoms of a larger dictionary:
x ≈ Wh.
Here, W ∈ ℝⁿˣᵏ is the dictionary, which is fixed for all samples, and h ∈ ℝᵏ is a sparse code word that depends on the concrete sample. The columns of W represent the atoms, which are also called bases or filters. This sparse coding framework is well-suited for overcomplete representations where k > n. The dictionary can be generated by wavelets, for example, or adapted to a specific task by solving an optimization problem on measurement data. The latter is also called dictionary learning in this context.
Sparseness acts as a regularizer. If h were not constrained to be sparse, then trivial choices of W would suffice for perfect reproduction capabilities. But when h is sparse, then x can be represented by an additive superposition of only a small number of bases, thus preventing trivial solutions.
A fundamental problem when working with sparse representations is how to decide on a function that formally assesses the sparseness of a vector. The ℓ₀ pseudo-norm simply counts the non-zero entries of its argument. It is a poor choice since it is non-continuous, prone to random noise, and fails to fulfill desirable properties of meaningful sparseness measures (Hurley and Rickard, 2009).
Throughout this paper, we will use the smooth, normalized sparseness measure proposed by Hoyer (2004):
σ : ℝⁿ \ {0} → [0, 1], σ(x) := (√n − ‖x‖₁ / ‖x‖₂) / (√n − 1).
Here, ‖x‖₁ and ‖x‖₂ denote the Manhattan norm and the Euclidean norm of x, respectively. The normalization has been designed such that σ attains values between zero and one. When x satisfies σ(x) = 1, then all entries of x but one vanish. Conversely, when σ(x) = 0, then all the entries of x are equal. The function σ interpolates smoothly between these extremes, see Fig. 1. Moreover, it is scale-invariant, so that the same sparseness degree is obtained when a vector is multiplied by a non-zero number. Hence, if a quantity is given in other units, for example in millivolts instead of volts, no adjustments whatsoever have to be made.
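Hoyer's measure is the normalized ratio of the Manhattan norm to the Euclidean norm, so it is simple to compute. The following minimal NumPy sketch (illustrative, not the paper's code) implements it:

```python
import numpy as np

def hoyer_sparseness(x):
    """Hoyer's (2004) normalized sparseness measure.

    Returns a value in [0, 1]: 1 for vectors with a single non-zero
    entry, 0 for vectors whose entries are all equal.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()          # Manhattan norm
    l2 = np.sqrt((x * x).sum())   # Euclidean norm
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)
```

The scale invariance mentioned above follows because the ratio of the two norms is unchanged when the argument is multiplied by a non-zero number.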
The sparseness degree with respect to σ does not change much if a small amount is added to all entries of a vector, whereas the ℓ₀ pseudo-norm would indicate that the new vector is completely non-sparse. These properties render Hoyer's sparseness measure intuitive, especially for non-experts. It has been employed successfully for dictionary learning (Hoyer, 2004; Potluru et al., 2013), and its smoothness results in improved generalization capabilities in classification tasks compared to when the ℓ₀ pseudo-norm is used (Thom and Palm, 2013).
1.2 Explicit Sparseness Constraints and Projections
A common approach to dictionary learning is the minimization of the reproduction error between the original samples from a learning set and their approximations provided by a linear generative model under sparseness constraints (Olshausen and Field, 1996, 1997; Kreutz-Delgado et al., 2003; Mairal et al., 2009a). It is beneficial for practitioners and end-users to enforce explicit sparseness constraints by demanding that all the code words in a generative model possess a target sparseness degree of σ* ∈ [0, 1]. This leads to optimization problems of the form
min over W, h₁, …, h_M of Σᵢ ‖xᵢ − Whᵢ‖₂², subject to σ(hᵢ) = σ* for all i.
Here, the objective function is the reproduction error implemented as Euclidean distance. Implicit sparseness constraints, on the other hand, augment the reproduction error with an additive penalty term, yielding optimization problems such as
min over W, h₁, …, h_M of Σᵢ ‖xᵢ − Whᵢ‖₂² + λ‖hᵢ‖₁.
Here, λ > 0 is a trade-off constant, and the Manhattan norm is used to penalize non-sparse code words as a convex relaxation of the ℓ₀ pseudo-norm (Donoho, 2006).
Trading off the reproduction error against an additive sparseness penalty is non-trivial, since the actual resulting code word sparseness cannot easily be predicted. Explicit constraints guarantee that the adjusted sparseness degree is met, making the tuning of opaque scale factors such as the trade-off constant in the example above obsolete. This way, one can concentrate on the actual application of the theory rather than having to develop an intuition for the meaning of each and every parameter.
The mathematical tool to achieve explicit sparseness is a sparseness-enforcing projection operator. This is a vector-valued function which maps any given point in Euclidean space to its nearest point that achieves a preset target sparseness degree. One use case of this theory is projected gradient methods (Bertsekas, 1999), where a given objective function should be optimized subject to hard side conditions. Replacing the parameters with their best approximations lying in a certain set after each update step ensures that the constraints are satisfied during optimization progress.
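The projected gradient scheme just described can be sketched in a few lines; here `project` stands for any projection onto the feasible set, for instance a sparseness-enforcing projection (illustrative code, not from the paper):

```python
import numpy as np

def projected_gradient_step(params, grad, project, step_size=0.1):
    """One step of projected gradient descent: move along the negative
    gradient, then map the result back onto the feasible set.

    `project` is any function returning the closest feasible point,
    e.g. a sparseness-enforcing projection operator.
    """
    return project(params - step_size * grad)
```

The constraint is satisfied after every step by construction, which is exactly the property exploited later for dictionary learning.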
1.3 Contributions of this Paper and Related Work
This paper studies dictionary learning under explicit sparseness constraints with respect to Hoyer's sparseness measure σ. A major part of this work is devoted to the efficient algorithmic computation of the sparseness-enforcing projection operator, which is an integral part of efficient dictionary learning. Several algorithms were proposed in the past to solve the projection problem (Hoyer, 2004; Theis et al., 2005; Potluru et al., 2013; Thom and Palm, 2013). Only Thom and Palm (2013) provided a complete and mathematically satisfactory proof of correctness for their algorithm. Moreover, all known algorithms have at least quasi-linear time complexity in the dimensionality of the vector that is to be projected.
In this paper, we first derive a characterization of the sparseness projection and demonstrate that its computation is equivalent to finding the root of a real-valued auxiliary function, which constitutes a much simpler problem. This result is used to propose an algorithm for the projection operator that is asymptotically optimal in the sense of complexity theory, that is, its time complexity is linear and its space complexity is constant in the problem dimensionality. We show through experiments that, when run on a real computing machine, the newly proposed algorithm is far superior in its computational demands to previously proposed techniques, even for small problem dimensionalities.
Existing approaches to dictionary learning that feature explicit sparseness constraints can be categorized into ones that use Hoyer's σ and ones that employ the ℓ₀ pseudo-norm. Hoyer (2004) and Potluru et al. (2013) considered matrix factorization frameworks subject to σ constraints, with space requirements linear in the number of learning samples, which prevents processing large data sets. Thom and Palm (2013) designed sparse code words as the result of the sparseness-enforcing projection operator applied to the product of the dictionary with the samples. This requires the computation of the projection's gradient during learning, which is feasible yet difficult to implement and has non-negligible adverse effects on the execution time.
The following approaches consider explicit ℓ₀ pseudo-norm constraints: Aharon et al. (2006), Skretting and Engan (2010) and Coates and Ng (2011) infer sparse code words compatible with the data and dictionary in each iteration by employing basis pursuit or matching pursuit algorithms, which has a negative impact on the processing time. Zelnik-Manor et al. (2012) consider block-sparse representations, where the signals are assumed to reside in the union of a few subspaces. Duarte-Carvajalino and Sapiro (2009) propose to simultaneously learn the dictionary and the sensing matrix from example image data, which results in improved reconstruction results in compressed sensing scenarios.
In addition to the contributions on sparseness projection computation, this paper proposes the Easy Dictionary Learning (EZDL) algorithm. Our technique aims at dictionary learning under explicit sparseness constraints in terms of Hoyer's sparseness measure σ, using a simple, fast-to-compute and biologically plausible Hebbian-like learning rule. For each presented learning sample, the sparseness-enforcing projection operator has to be carried out. The ability to perform projections efficiently makes the proposed learning algorithm particularly efficient: less training time is required in comparison to the optimized Online Dictionary Learning method of Mairal et al. (2009a).
Extensions of Easy Dictionary Learning facilitate alternative representations such as topographic atom organization or atom sparseness, the latter including, for example, separable filters. We furthermore demonstrate that dictionaries learned with our algorithm are competitive, in terms of the reproduction and denoising quality of natural images, with those computed by sophisticated alternative dictionary learning algorithms. Since other tasks, such as deblurring or super-resolution, build upon the same optimization problem in the application phase as reproduction and denoising, it can be expected that EZDL dictionaries will exhibit no performance degradations in those tasks either.
The remainder of this paper is structured as follows. Section 2 derives a linear time and constant space algorithm for the computation of sparseness-enforcing projections. In Sect. 3, the Easy Dictionary Learning algorithm for explicitly sparseness-constrained dictionary learning is proposed. Section 4 reports experimental results on the performance of the newly proposed sparseness projection algorithm and the Easy Dictionary Learning algorithm. Section 5 concludes the paper with a discussion, and the appendix contains technical details of the mathematical statements from Sect. 2.
2 Efficient Sparseness-Enforcing Projections
This section proposes a linear time and constant space algorithm for the computation of projections onto level sets of Hoyer's σ. Formally, if σ* ∈ (0, 1) denotes a target sparseness degree and x ∈ ℝⁿ is an arbitrary point, the point p from the level set {s ∈ ℝⁿ | σ(s) = σ*} that minimizes the Euclidean distance to x is sought. The function that computes p is also called the sparseness-enforcing projection operator, since the situation where σ(x) < σ* is of particular interest.
Due to symmetries of σ, the above described optimization problem can be reduced to finding the projection of a vector with non-negative coordinates onto the set
D := { s ∈ ℝⁿ | s ≥ 0, ‖s‖₁ = λ₁ and ‖s‖₂ = λ₂ },
where λ₁ and λ₂ are target norms that should be chosen such that σ* = (√n − λ₁/λ₂) / (√n − 1) (Hoyer, 2004; Thom and Palm, 2013). With this choice, σ constantly attains the value of σ* on the entire set D, hence D is a subset of the level set.
Thom and Palm (2013) proved that D is the intersection of a scaled canonical simplex and a hypercircle. They further demonstrated that projections onto D can be computed with a finite number of alternating projections onto these two geometric structures. The proof of correctness of this method could not rely on the classical alternating projection method (see for example Deutsch, 2001) due to the lacking convexity of D, rendering the proof arguments quite complicated.
The remainder of this section proceeds as follows. First, a succinct characterization of the sparseness projection is given. It is then shown that computing sparseness projections is equivalent to finding the zero of a monotone real-valued function. We prove that this can be achieved with optimal asymptotic complexity by proposing a new algorithm that solves the projection problem.
2.1 Representation of the Projection
The main theoretical result of this paper is a representation theorem that characterizes the projection onto D. It was gained through an analysis of the intermediate points that emerge from the alternating projections onto a simplex and a hypercircle. A closed-form expression can then be provided by showing that the intermediate points in the projection algorithm satisfy a loop invariant, a certain property fulfilled after each step of alternating projections.
Since a mathematically rigorous treatment of this result is very technical, it is deferred to the appendix. We assume here that the vector x to be projected is given such that its projection is unique. Since it is guaranteed that for all points except for a null set there is exactly one projection (Theis et al., 2005, Theorem 2.6), we can exclude points with non-unique projections from our considerations, which is no restriction in practice. We may now state a characterization of the projection outcome:
Representation Theorem
Let x ∈ ℝⁿ and let p denote the unique projection of x onto D. Then there is exactly one real number α with
p = λ₂ · max(x − α·1ₙ, 0) / ‖max(x − α·1ₙ, 0)‖₂.
Here, 1ₙ is the n-dimensional vector where all entries are unity, and the maximum is taken entry-wise. If the indices of the positive coordinates of p are known, then α can be computed directly from x with an explicit expression.
In words, the projection is the point max(x − α·1ₙ, 0) rescaled to obtain an ℓ₂ norm of λ₂. This vector is computed from the input vector x by subtracting the scalar α from all the entries and afterwards setting negative values to zero. It is remarkable that the projection admits such a simple representation, although the target set for the projection is non-convex and geometrically quite complicated.
The function that maps a scalar t to max(t − α, 0) for a constant offset α is called the soft-shrinkage function. If α = 0, it is also called the rectifier function. Because the central element of the projection representation is a soft-shrinkage operation applied entry-wise to the input vector, carrying out projections onto level sets of Hoyer's sparseness measure can be interpreted as a denoising operation (Donoho, 1995; Hyvärinen, 1999; Elad, 2006).
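Given the offset α, the candidate projection from this representation is a soft-shrinkage followed by rescaling to the target Euclidean norm. A hedged NumPy sketch (function and parameter names are ours, not the paper's):

```python
import numpy as np

def shrink_and_rescale(x, alpha, l2_target):
    """Candidate projection from the representation theorem:
    soft-shrink by `alpha` (subtract, clip negatives to zero),
    then rescale to the target Euclidean norm."""
    s = np.maximum(x - alpha, 0.0)
    return l2_target * s / np.linalg.norm(s)
```

The only unknown in the whole projection problem is thus the scalar offset `alpha`, which motivates the auxiliary function discussed next.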
2.2 Auxiliary Function for the Sparseness Projection
The projection problem can hence be reduced to determining the soft-shrinkage offset α, which is a one-dimensional problem on the real line. Further, it is clear that α must be smaller than the maximum entry of x, since otherwise max(x − α·1ₙ, 0) would be the null vector, which is absurd. In the usual case where σ(x) < σ*, we can further conclude that α must be non-negative. Otherwise, the result of the projection given by the representation theorem would be less sparse than the input, which is impossible. Therefore, the projection problem can be further reduced to finding a number from the bounded interval [0, x_max), where x_max denotes the maximum entry of x.
Next, a method for efficiently deciding whether the correct offset has been found is required. Similar to the projection onto a canonical simplex (Liu and Ye, 2009), we can design a real-valued function that vanishes exactly at the wanted offset. The properties of Hoyer's σ allow us to formulate this function in an intuitive way. We call
Ψ(α) := ‖max(x − α·1ₙ, 0)‖₁ / ‖max(x − α·1ₙ, 0)‖₂ − λ₁/λ₂
the auxiliary function for the projection onto D.
The rationale for the concrete definition of Ψ is as follows. Due to the representation theorem, we know that the projection onto D is merely a scaled version of the point max(x − α·1ₙ, 0). Moreover, p lies in D, and due to the scale-invariance of Hoyer's σ it follows that σ(max(x − α·1ₙ, 0)) = σ*. The essence of σ is the ratio of the ℓ₁ norm to the ℓ₂ norm of its argument. Hence, in Ψ the scale constants that make σ normalized to the interval [0, 1] are omitted and the target norms are used instead. We therefore have that ‖max(x − α·1ₙ, 0)‖₁ / ‖max(x − α·1ₙ, 0)‖₂ = λ₁/λ₂, and thus Ψ(α) = 0. As p is unique, we can conclude that no other offset of the soft-shrinkage operation leads to a correct solution. In fact, if β is an offset with Ψ(β) = 0 and we write q := λ₂ · max(x − β·1ₙ, 0) / ‖max(x − β·1ₙ, 0)‖₂, then we have that q ∈ D, and therefore β must coincide with α.
An exemplary plot of Ψ is depicted in Fig. 2. Clearly, the auxiliary function is continuous, differentiable except for isolated points, and strictly decreasing except for its final part, where it is constant. Since the constant part is always negative and starts at the offset equal to the second-largest entry of x, the feasible interval can be reduced even further. The step discontinuities in the derivative of Ψ coincide exactly with the entries of the input vector x. These appealing analytical properties greatly simplify the computation of the zero of Ψ, since standard root-finding algorithms such as bisection or Newton's method (see for example Traub, 1964) can be employed to find it numerically.
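A purely numerical sketch of the whole projection is bisection on the auxiliary function. This illustrative implementation (not the paper's code) assumes a non-negative input that is less sparse than the target, rescales to unit Euclidean norm, and omits the closed-form finish derived in Sect. 2.3:

```python
import numpy as np

def sparseness_projection(x, sigma_target, iters=100):
    """Project a non-negative vector onto a level set of Hoyer's
    sparseness measure via bisection on the auxiliary function.
    Assumes sigma(x) < sigma_target, so the offset lies in [0, max(x)).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Ratio of l1 to l2 norm implied by the target sparseness degree.
    ratio_target = np.sqrt(n) - sigma_target * (np.sqrt(n) - 1.0)

    def ratio(alpha):
        s = np.maximum(x - alpha, 0.0)
        return np.abs(s).sum() / np.linalg.norm(s)

    lo, hi = 0.0, float(x.max())
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # ratio(alpha) decreases in alpha, so the sign brackets the zero.
        if ratio(mid) > ratio_target:
            lo = mid
        else:
            hi = mid
    s = np.maximum(x - 0.5 * (lo + hi), 0.0)
    # Rescale to unit Euclidean norm (any target l2 norm works).
    return s / np.linalg.norm(s)
```

The result is non-negative, soft-shrunk, and attains the requested sparseness degree up to numerical precision.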
2.3 Linear Time and Constant Space Projection Algorithm
We can improve on a merely numerical solution by exploiting the special structure of Ψ to yield an analytical solution to the projection problem. This is due to the closed-form expression for α, which requires the indices of the coordinates in which the result of the projection is positive to be known. The other way around, when α is known, these indices are exactly those of the coordinates in which x is greater than α. When an offset sufficiently close to α can be determined numerically, the appropriate index set can be determined, and thus the exact value of α.
The decision whether a candidate offset is close enough to the true α can be made quite efficiently, since α is coupled via index sets to individual entries of the input vector x. Hence, it suffices to find the rightmost left neighbor a of the candidate offset in the entries of x, and analogously the leftmost right neighbor b (see Fig. 2 for an example). Whenever Ψ(a) > 0 and Ψ(b) < 0, the zero of Ψ must be located between a and b for continuity reasons. But then all values in x greater than α are exactly the values greater than or equal to b, and these can be determined by simply scanning through the entries of x.
Based upon these considerations, a flowchart of our proposed method for computing sparseness-enforcing projections is depicted in Fig. 3. The algorithm performs bisection and continuously checks for sign changes of the auxiliary function Ψ between adjacent entries of x. As soon as such a sign change is found, α is computed in closed form and the result of the projection is determined in place. A complete formal presentation and discussion of our proposed algorithm is given in the appendix.
Each intermediate step of the algorithm (that is, determination of the second-largest entry of x, evaluation of the auxiliary function, and computation of the projection result by application of soft-shrinkage and scaling) can be carried out with a time complexity linear in the problem dimensionality n and a space complexity completely independent of n.
Therefore, to show that the overall algorithm possesses the same asymptotic complexity, it has to be proved that the number of bisection iterations required for finding α is independent of n. The length of the initial interval is upper-bounded by the maximum entry x_max of x, as discussed earlier. Bisection is moreover guaranteed to stop, at the latest, once the interval length is smaller than the minimum pairwise distance between distinct entries of x,
δ := min { |xᵢ − xⱼ| : i, j ∈ {1, …, n} and xᵢ ≠ xⱼ }.
This is due to the fact that it is enough to find a sufficiently small range to deduce an analytical solution. The required number of bisection iterations is then less than ⌈log₂(x_max / δ)⌉, see Liu and Ye (2009). This number is bounded from above regardless of the dimensionality of the input vector, since x_max is upper-bounded and δ is lower-bounded due to finite machine precision (Goldberg, 1991).
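The bound on the number of bisection steps is easy to evaluate in practice; for instance, with an initial interval of length 1 and a minimum gap of 2⁻¹⁰ between distinct entries, at most ten iterations are needed (illustrative helper, not from the paper):

```python
import math

def bisection_iterations(interval_length, min_gap):
    """Upper bound on bisection steps needed to shrink the initial
    interval below the minimum gap between distinct entries."""
    return math.ceil(math.log2(interval_length / min_gap))
```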
Hence our sparsenessenforcing projection algorithm is asymptotically optimal in the sense of complexity theory. There may still be hidden constants in the asymptotic notation that render the proposed algorithm less efficient than previously known algorithms for a small input dimensionality. Experiments in Sect. 4 demonstrate that this is not the case.
3 Explicitly Sparseness-Constrained Dictionary Learning
This section proposes the Easy Dictionary Learning (EZDL) algorithm for dictionary learning under explicit sparseness constraints. First, we introduce the ordinary formulation of the learning algorithm. Sparse code word inference with the sparseness projection algorithm proposed in Sect. 2 renders EZDL efficient and particularly simple to implement. We further discuss extensions that allow concentrating on different aspects of the data set under investigation, such as topographic organization of the atoms and atom sparseness. Only little implementation effort is required for these extensions, and their computational demands are low. The description of the comprehensive EZDL learning algorithm is accompanied by several strategies that improve the optimization performance.
3.1 EZDL—Easy Dictionary Learning
Conventional dictionary learning algorithms operate in an alternating fashion, where code words are inferred with a costly optimization procedure before updating the dictionary. EZDL learns a dictionary by first yielding sparse code words through simple inference models and then tuning the dictionary with a simple update step. An inference model is a function which accepts filter responses in the form of the product Wᵀx of the dictionary W with a learning sample x, and produces a representation with certain desirable properties. Here, the combination with the sparseness-enforcing projection operator is particularly interesting, since it provides a natural method to fulfill explicit sparseness constraints.
We call the choice h := π(Wᵀx) the ordinary inference model, where π denotes the Euclidean projection onto the set of all vectors that achieve a sparseness degree of σ* with respect to Hoyer's sparseness measure σ. This computation can also be interpreted as the trivial first iteration of a projected Landweber procedure for sparse code word inference (Bredies and Lorenz, 2008).
The dictionary is adapted to new data by minimizing the deviation between a learning sample and its approximation Wh through the linear generative model. The goodness of the approximation is here assessed with a differentiable similarity measure s, so that the EZDL optimization problem becomes:
maximize s(Wh, x) over the dictionary W.
Note that h is not a variable of the optimization problem, since it is defined as the output of an inference model. For the same reason, h does not need to be constrained further, as it inherently satisfies explicit sparseness constraints.
Although a wide range of similarity measures is possible, we decided to use the (Pearson product-moment) correlation coefficient, since it is invariant to affine-linear transformations (Rodgers and Nicewander, 1988). In the context of visual data set analysis, this corresponds to invariance with respect to DC components and gain factors. Moreover, the differentiability of this similarity measure facilitates gradient-based learning (Thom and Palm, 2013).
If g denotes the transposed derivative of s with respect to its first argument, evaluated at the approximation Wh, the gradient of s with respect to the dictionary is given by
∂s(Wh, x) / ∂W = g · hᵀ.
Therefore, when W is tuned with gradient descent, a simple and biologically plausible Hebbian-like learning rule (Hyvärinen et al., 2009) results:
W ← W + η · g · hᵀ,
where η > 0 denotes the step size. This update step can be implemented efficiently with simple rank-1 updates (Blackford et al., 2002).
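The rank-1 update can be sketched as follows. For illustration we substitute the gradient of the squared reconstruction error for the correlation coefficient used in the paper, so `ezdl_update` is a simplification of the EZDL rule rather than its exact form:

```python
import numpy as np

def ezdl_update(W, x, h, step_size=0.01):
    """Hebbian-like dictionary update as a rank-1 (outer product)
    operation. Illustrative: uses the squared-error gradient; the
    paper assesses similarity with the correlation coefficient."""
    residual = x - W @ h          # approximation error of the model
    return W + step_size * np.outer(residual, h)
```

Since `h` is sparse, the outer product touches only the columns (atoms) with non-zero code word entries, which is what makes the update cheap.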
We here assumed h to be constant and neglected that it actually depends on W. Without this assumption, the gradient would have to be extended with an additive term which comprises the gradient of the inference model (Thom and Palm, 2013, Proposition 39). This, in turn, requires the computation of the sparseness-enforcing projection operator's gradient. This gradient can be computed with simple vector operations, as we show in the appendix. However, it constitutes a significant computational burden compared to the simple and fast-to-compute EZDL update step. Section 4 demonstrates through experiments that this simplification is still perfectly capable of learning dictionaries.
3.2 Topographic Atom Organization and Atom Sparseness
Topographic organization of the dictionary's atoms similar to Self-organizing Maps (Kohonen, 1990) or Topographic Independent Component Analysis (Hyvärinen et al., 2001) can be achieved in a straightforward way with EZDL using the alternative inference models proposed in this section. For this, the dictionary atoms are interpreted as being arranged on a two-dimensional grid. A spatial pooling operator subject to circular boundary conditions can then be used to incorporate interactions between atoms located adjacently on the grid. This can, for example, be realized by averaging each entry of the filter responses in a neighborhood on the grid. This convolution operation can be expressed as a linear operator applied to the vector Wᵀx, represented by a sparsely populated matrix P. With P containing all atom interactions, we call h := π(P·Wᵀx) the topographic inference model.
An alternative realization of topographic organization is enforcing structure sparseness
on the filter responses reshaped as a matrix to account for the grid layout. Structure sparseness can be assessed by applying a sparseness measure to the vector of singular values of a matrix. When the ℓ₀ pseudo-norm is used, then this is the rank of the matrix. The matrix rank can also be measured more robustly through the ratio of the Schatten 1-norm to the Schatten 2-norm, which is essentially Hoyer's sparseness measure applied to the singular values (Lopes, 2013).
Clearly, any matrix A ∈ ℝᵃˣᵇ can be reshaped to a vector with the vectorization operator vec that stacks all columns of A on top of one another (Neudecker, 1969). This linear operation can be inverted provided the shape of the original matrix is known, yielding vec⁻¹. In the following, let ρ denote the projection onto the set of all matrices that possess a target rank of at most r. A classical result states that any matrix can be projected onto the set of low-rank matrices by computing its Singular Value Decomposition, applying an ℓ₀ pseudo-norm projection to the vector of singular values, thus retaining only the r largest ones, and finally computing the reconstruction using the modified singular values (Eckart and Young, 1936).
Based on these considerations, let a and b with a·b = k be natural numbers describing the topographic grid layout dimensionalities. The rank-r topographic inference model is then defined by the composition
h := vec(ρ(vec⁻¹(π(Wᵀx)))).
In words, this inference model operates as follows. It first approximates the filter responses with a sparsely populated vector that attains a sparseness of σ* with respect to Hoyer's σ. These sparsified filter responses are then laid out on a grid and replaced with the best approximating filter responses that meet a lower rank of r. Finally, the grid layout is reshaped to a vector, since this should be the output type of each inference model.
If r = 1, then there are two vectors u ∈ ℝᵃ and v ∈ ℝᵇ such that the reshaped filter responses equal u·vᵀ, that is, the filter responses can be expressed as a dyadic product when laid out on the grid. We will demonstrate with experiments that this results in an interesting topographic atom organization similar to Independent Subspace Analysis (Hyvärinen and Hoyer, 2000).
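The low-rank step of this inference model is the classical Eckart-Young truncation. A NumPy sketch (function names are ours) of the rank-r topographic inference, applied to responses that have already been sparsified:

```python
import numpy as np

def rank_projection(H, r):
    """Project a matrix onto the set of matrices of rank at most r
    (Eckart-Young): keep only the r largest singular values."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    s[r:] = 0.0
    return (U * s) @ Vt

def topographic_inference(responses, grid_shape, r=1):
    """Sketch of the rank-r topographic inference model: lay the
    (sparsified) filter responses out on the grid, replace them by
    the closest rank-r matrix, and flatten back to a vector."""
    H = responses.reshape(grid_shape)   # vec^{-1}
    return rank_projection(H, r).ravel()  # vec
```

For r = 1 the grid of responses collapses to a dyadic product, matching the discussion above.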
Analogous to Non-negative Matrix Factorization with Sparseness Constraints (Hoyer, 2004), the atoms can be enforced to be sparse as well. To achieve this, it is sufficient to apply the sparseness projection to each column of the dictionary after each learning epoch:
W·eᵢ ← π(W·eᵢ) for all i ∈ {1, …, k},
where eᵢ is the ith canonical basis vector that selects the ith column from W, and the projection is carried out with respect to a target degree of atom sparseness.
In the situation where the learning samples resemble image patches, the atoms can also be restricted to fulfill structure sparseness using a low-rank approximation. Suppose the image patches possess a × b pixels; then the following projection can be carried out after each learning epoch:
W·eᵢ ← vec(ρ(vec⁻¹(W·eᵢ))) for all i ∈ {1, …, k}.
Here, r denotes the target rank of each atom when reshaped to a × b pixels. When the dictionary is used to process images through convolution, small values of r are beneficial, since computations can then be sped up considerably. For example, if r = 1 each atom can be expressed as the outer product of vectors from ℝᵃ and ℝᵇ, respectively. The convolution kernel is separable in this case, and two one-dimensional convolutions lead to the same result as one two-dimensional convolution but require only a fraction of the operations (Hoggar, 2006).
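The saving from separable kernels is easy to verify: for a rank-1 kernel, filtering the columns and then the rows reproduces the 2-D result. A small self-contained sketch (using correlation rather than flipped convolution, which does not affect the argument; names are ours):

```python
import numpy as np

def conv2d_valid(img, k):
    """Plain 2-D valid filtering (correlation, for brevity)."""
    kh, kw = k.shape
    H, Wd = img.shape
    out = np.zeros((H - kh + 1, Wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return out

def separable_conv(img, col, row):
    """For a rank-1 kernel k = col @ row^T, two 1-D passes give the
    same result as one 2-D pass, at a fraction of the cost."""
    tmp = conv2d_valid(img, col[:, None])   # filter columns first
    return conv2d_valid(tmp, row[None, :])  # then filter rows
```

For an a × b kernel, the cost per output pixel drops from a·b multiplications to a + b.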
3.3 Learning Algorithm
In the previous sections, a simple Hebbian-like learning rule was derived which depends on abstract inference models. The core of the inference models is the sparseness-enforcing projection operator discussed in Sect. 2, which guarantees that the code words always attain a sparseness degree easily controllable by the user. On presentation of a learning sample x, an update step of the dictionary hence consists of
determining the filter responses Wᵀx,
feeding the filter responses through the inference model involving a sparseness projection, yielding the code word h,
computing the approximation Wh to the original learning sample using the linear generative model,
assessing the similarity s(Wh, x) between the input sample and its approximation as well as the similarity measure's gradient g, and eventually
adapting the current dictionary using a rank-1 update W ← W + η·g·hᵀ.
All these intermediate steps can be carried out efficiently. The sparseness projection can be computed using few resources, and neither its gradient nor the gradient of the entire inference model is required. After each learning epoch it is possible to impose additional constraints on the atoms with atom-wise projections. In doing so, several aspects of the data set structure can be made more explicit, since the atoms are then forced to become simpler feature detectors.
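The update steps above can be condensed into one epoch loop. This sketch again substitutes the squared-error gradient for the correlation coefficient and takes the projection as a parameter, so it illustrates the structure of EZDL rather than the paper's exact algorithm:

```python
import numpy as np

def ezdl_epoch(W, samples, project, step_size=0.01):
    """One illustrative EZDL epoch: for each sample, infer a sparse
    code word from the filter responses, then apply the Hebbian-like
    rank-1 update. Atoms are renormalized at the end of the epoch."""
    for x in samples:
        h = project(W.T @ x)                        # sparse code word
        W = W + step_size * np.outer(x - W @ h, h)  # rank-1 update
    # Normalize every atom (column) to unit Euclidean norm.
    return W / np.linalg.norm(W, axis=0, keepdims=True)
```

The per-epoch atom normalization anticipates the strategy discussed below for keeping Hebbian updates well-scaled.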
An example of an individual learning epoch is given in Algorithm 1, where the topographic inference model and low-rank structure sparseness are used. In addition, the explicit expressions for the correlation coefficient and its gradient are given (Thom and Palm, 2013, Proposition 40). The outcome of this parameterization applied to patches from natural images is depicted in panel (b) of the corresponding figure and will be discussed in Sect. 4.2.3.
The efficiency of the learning algorithm can be improved with the simple modifications described in the following. To reduce the learning time compared to a randomly initialized dictionary, the dictionary should be initialized with samples randomly chosen from the learning set (Mairal et al., 2009a; Skretting and Engan, 2010; Coates and Ng, 2011; Thom and Palm, 2013). When large learning sets with millions of samples are available, stochastic gradient descent has proven to yield significantly faster optimization progress than updating the degrees of freedom only once per learning epoch (Wilson and Martinez, 2003; Bottou and LeCun, 2004). The step size schedule , where denotes the initial step size and is the learning epoch counter, is beneficial when only an erroneous estimate of the true gradient is available (Bertsekas, 1999). After each learning epoch, all the atoms of the dictionary are normalized to unit scale. Since the learning rule is Hebbian-like, this prevents the atoms from growing arbitrarily large or becoming arbitrarily small (Hyvärinen et al., 2009). This normalization step is also common in a multitude of alternative dictionary learning algorithms (Kreutz-Delgado et al., 2003; Hoyer, 2004; Mairal et al., 2009a; Skretting and Engan, 2010).

Since the dictionary is modified with a rank-1 update in which one of the factors is the result of an inference model and hence sparse, only the atoms that induce significant filter responses are updated. This may result in atoms that are never updated when the target sparseness degree is large. This behavior can be alleviated by adding random noise to the inference model's output prior to updating the dictionary, which forces all the atoms to be updated. We used random numbers sampled from a zero-mean normal distribution, where the standard deviation was multiplicatively annealed after each learning epoch. As optimization proceeds, the atoms become well-distributed in sample space such that the lifetime sparseness is approximately equal to the population sparseness
(Willmore and Tolhurst, 2001), and randomness is not needed anymore.

4 Experimental Results
This section reports experimental results on the techniques proposed in this paper. We first evaluate alternative solvers for the sparseness projection's auxiliary function and show that our algorithm for the sparseness-enforcing projection operator is significantly faster than previously known methods. We then turn to the application of the projection in the Easy Dictionary Learning algorithm. First, the morphology of dictionaries learned on natural image patches is analyzed and put in context with previous methods. We further show that the resulting dictionaries are well-suited for the reproduction of entire images: they achieve a reproduction quality equivalent to dictionaries trained with alternative, significantly slower algorithms. Finally, we analyze the performance of the dictionaries when employed for the image denoising task and find that, analogous to the reproduction experiments, no performance degradation can be observed.
4.1 Performance of Sparseness-Enforcing Projections
Our proposed algorithm for sparseness projections was implemented as a C++ program using the GNU Scientific Library (Galassi et al., 2009). We first analyzed whether solvers other than bisection for locating the zero of the auxiliary function would result in improved performance. Since is differentiable except for isolated points and its derivatives can be computed quite efficiently, Newton's method and Halley's method are straightforward to apply. We further investigated whether Newton's method applied to a slightly transformed variant of the auxiliary function, where the minuend and the subtrahend were squared, would behave more efficiently. The methods based on derivatives were additionally safeguarded with bisection to guarantee that new positions are always located within well-defined intervals (Press et al., 2007). This impairs the theoretical property that only a constant number of steps is required to find a solution, but in practice a significantly smaller number of steps needs to be taken compared to plain bisection.
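The safeguarding strategy can be sketched for a generic monotone function as follows: take a Newton step whenever it stays inside the current bracket, and fall back to bisection otherwise. The function and its derivative are abstracted as arbitrary callables; this is an illustration of the standard technique, not the paper's exact implementation.

```python
def safeguarded_newton(f, df, lo, hi, tol=1e-12, max_iter=200):
    """Root of a monotone function on [lo, hi]: Newton steps safeguarded
    with bisection so that iterates never leave the bracket
    (cf. Press et al., 2007)."""
    flo, fhi = f(lo), f(hi)
    assert flo * fhi <= 0.0, "root must be bracketed"
    x = 0.5 * (lo + hi)
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        # shrink the bracket around the sign change
        if flo * fx <= 0.0:
            hi, fhi = x, fx
        else:
            lo, flo = x, fx
        d = df(x)
        newton_ok = d != 0.0 and lo < x - fx / d < hi
        x = x - fx / d if newton_ok else 0.5 * (lo + hi)
    return x
```

Near the root, the Newton branch dominates and convergence is quadratic, which explains the small iteration counts reported below.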
To evaluate which solver provides maximum efficiency in practical scenarios, one thousand random vectors for each problem dimensionality were sampled and the corresponding sparseness-enforcing projections for were computed using all four solvers. We used the very same random vectors as input for all solvers and counted the number of times the auxiliary function needed to be evaluated until the solution was found. The results of this experiment are depicted in Fig. 4.
The number of function evaluations required by plain bisection grew approximately linearly in . Because the expected minimum difference between two distinct entries of a random vector becomes smaller as the dimensionality of the random vector increases, the expected number of function evaluations bisection needs increases with problem dimensionality. In any case, the length of the interval that has to be found is always bounded from below by the machine precision, such that the number of function evaluations with bisection is bounded from above regardless of .
The solvers based on the derivative of or , respectively, always required fewer function evaluations than bisection. They exhibited a growth clearly sublinear in . For , Newton's method required function evaluations in the mean, Halley's method needed iterations, and Newton's method applied to found the solution in only iterations. Therefore, the latter solver should always be employed in practice.
In another experiment, the time competing algorithms require for computing sparseness projections on a real computing machine was measured for comparison. For this, the algorithms proposed by Hoyer (2004), Potluru et al. (2013), and Thom and Palm (2013) were implemented as C++ programs using the GNU Scientific Library (Galassi et al., 2009). An Intel Core i7-4960X processor was used and all algorithms were run in a single-threaded environment. Random vectors were sampled for each problem dimensionality and initially projected to attain a sparseness of with respect to Hoyer's . This initial projection better reflects the situation in practice, where not completely random vectors have to be processed. Next, all four algorithms were used to compute projections with a target sparseness degree of and their run time was measured. The original algorithm of Hoyer (2004) was the slowest, so the relative speedup was obtained by relating the run times of the other algorithms to the run time of that slowest algorithm.
Figure 5 visualizes the results of this experiment. For all tested problem dimensionalities, the linear-time algorithm proposed here dominated all previously described methods. While the speedups of the algorithms of Potluru et al. (2013) and Thom and Palm (2013) relative to the original algorithm of Hoyer (2004) were already significant, especially for small to medium dimensionalities, they became relatively slow for very large vectors. This is not surprising, because both methods start by sorting the input vector and have to store the sorting permutation to undo it in the end.
The algorithm proposed in this paper is based on root-finding of a monotone function and requires no sorting. Only the left and right neighbors of a scalar in a vector have to be found. This can be achieved by scanning linearly through the input vector, which is particularly efficient when huge vectors have to be processed. For , the proposed algorithm was more than times faster than the methods of Hoyer (2004), Potluru et al. (2013) and Thom and Palm (2013). Because of this appealing asymptotic behavior, there is now no further obstacle to applying smooth sparseness methods to large-scale problems.
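The neighbor search just mentioned can be sketched as a single linear pass: for a given scalar, find the closest vector entry from below and the closest entry from above, without ever sorting.

```python
def neighbors(v, t):
    """Return (left, right): the largest entry of v not exceeding t and
    the smallest entry of v exceeding t, in one linear scan (no sorting).
    Either result is None if no such entry exists."""
    left = right = None
    for x in v:
        if x <= t and (left is None or x > left):
            left = x
        if x > t and (right is None or x < right):
            right = x
    return left, right
```

This runs in linear time and constant extra space, which is the key to the asymptotic advantage over sorting-based projections.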
4.2 Data Set Analysis with EZDL
We used the Easy Dictionary Learning algorithm as a data analysis tool to visualize the structure of patches from natural images under different aspects, which facilitates qualitative statements on the algorithm's performance. For our experiments, we used the McGill calibrated color image database (Olmos and Kingdom, 2004), where the images had either pixels or pixels. The images were desaturated (Annex B.4 of SMPTE RP 177-1993) and quantized to bit precision. We extracted patches with pixels at random positions from each of the images of the "Foliage" collection. Patches with vanishing variance were omitted since they carry no information. In total, we obtained million samples for learning.

We learned dictionaries on the raw pixel values of this learning set and on whitened image patches. The learning samples were normalized to zero mean and unit variance as the only preprocessing for the raw pixel experiments. The whitened data was obtained using the canonical preprocessing from Section 5.4 of Hyvärinen et al. (2009) with principal components. EZDL was applied using each proposed combination of inference model and atom constraints. We presented randomly selected learning samples in each of one thousand epochs, which corresponds to about ten sweeps through the entire learning set. The initial step size was set to .
A noisy version of the final dictionary could already be observed after the first epoch, demonstrating the effectiveness of EZDL for quick analysis. Since the optimization procedure is probabilistic in the initialization and selection of training samples, we repeated the training five times for each parameter set. We observed only minor variations between the five dictionaries for each parameter set.
In the following, we present details of dictionaries optimized on raw pixel values and analyze their filter characteristics. Then, we consider dictionaries learned with whitened data and dictionaries with separable atoms trained on raw pixel values and discuss the underlying reasons for their special morphology.
4.2.1 Raw Pixel Values
Using the ordinary inference model, we trained two-times overcomplete dictionaries with atoms on the normalized pixel values, where the dictionary sparseness degree was varied between and . This resulted in the familiar appearance of dictionaries with Gabor-like filters, which resemble the optimal stimulus of visual neurons (Olshausen and Field, 1996, 1997). While for small values of the filters were mostly small and sharply bounded, high sparseness resulted in holistic and blurry filters.

We used the methods developed by Jones and Palmer (1987) and Ringach (2002) for a more detailed analysis. In doing so, a two-dimensional Gabor function as defined in Equation (1) of Ringach (2002), that is, a Gaussian envelope multiplied with a cosine carrier wave, was fit to each dictionary atom using the algorithm of Nelder and Mead (1965). We verified that these Gabor fits accurately described the atoms, which confirmed the Gabor-like nature of the filters.
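The fitted model and its objective can be sketched as follows. The parameterization (amplitude, center, orientation, envelope widths, frequency, phase) is illustrative and not necessarily identical to the one in Ringach (2002); the squared-residual objective is what would be handed to a Nelder-Mead minimizer.

```python
import math

def gabor(x, y, params):
    """2-D Gabor function: Gaussian envelope times cosine carrier.
    Illustrative parameters: amplitude A, center (x0, y0), orientation
    theta, envelope widths (sx, sy), spatial frequency f, phase phi."""
    A, x0, y0, theta, sx, sy, f, phi = params
    # rotate coordinates into the Gabor's principal axes
    xr = (x - x0) * math.cos(theta) + (y - y0) * math.sin(theta)
    yr = -(x - x0) * math.sin(theta) + (y - y0) * math.cos(theta)
    envelope = math.exp(-xr ** 2 / (2 * sx ** 2) - yr ** 2 / (2 * sy ** 2))
    carrier = math.cos(2 * math.pi * f * xr + phi)
    return A * envelope * carrier

def fit_error(atom, size, params):
    """Sum of squared residuals between an atom (a size-by-size patch
    stored row-major) and the Gabor model; this is the objective a
    derivative-free method such as Nelder-Mead would minimize."""
    return sum((atom[y * size + x] - gabor(x, y, params)) ** 2
               for y in range(size) for x in range(size))
```

A perfect fit drives `fit_error` to zero, which is the criterion used to judge whether an atom is genuinely Gabor-like.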
The differences in the dictionaries due to varying sparseness degrees became apparent through analysis of the parameters of the fitted Gabor functions. Figure 6 shows the mean values of three influential Gabor parameters as functions of . All parameters change continuously and monotonically with increasing sparseness. The spatial frequency, a factor in the cosine wave of the Gabor function, decreases steadily with increasing dictionary sparseness. The width of the Gaussian envelope, here broken down into the standard deviation in both principal orientations and , increases monotonically. The envelope width in direction increases earlier, but in the end the envelope width in direction is larger.
It can be concluded that sparser code words result in filters with lower frequency but larger envelope. Since an increased sparseness reduces the model's effective number of degrees of freedom, it prevents the constrained dictionaries from adapting precisely to the learning data. Similar to Principal Component Analysis, low-frequency atoms minimize the reproduction error best when only very few effective atoms are allowed (Hyvärinen et al., 2009; Olshausen and Field, 1996).

Histograms of the spatial phase of the filters, an additive term in the cosine wave of the Gabor function, are depicted in Fig. 7. For , there is a peak at which corresponds to odd-symmetric filters (Ringach, 2002). The distribution is clearly bimodal for , with peaks at and corresponding to even-symmetric and odd-symmetric filters, respectively. While the case of small matches the result of ordinary sparse coding, the higher dictionary sparseness results in filters with the same characteristics as the optimal stimulus of macaque visual neurons (Ringach, 2002).

This analysis demonstrates that EZDL's minimalistic learning rule is capable of generating biologically plausible dictionaries, which constitute a particularly efficient image coding scheme. To obtain dictionaries with diverse characteristics, it is enough to adjust the target degree of dictionary sparseness on a normalized scale. A major component in the learning algorithm is the sparseness projection, which enforces local competition among the atoms (Rozell et al., 2008) due to its absolute order-preservation property (Thom and Palm, 2013, Lemma 12).
4.2.2 Whitened Image Patches
Whitening as a preprocessing step helps to reduce sampling artifacts and decorrelates the input data (Hyvärinen et al., 2009). It also changes the intuition behind the similarity measure in EZDL's objective function: linear features rather than single pixels are considered, where each feature captures a multitude of pixels in the raw patches. This results in differences in the filter structure, particularly in the emergence of more low-frequency filters.
The dictionary depicted in Fig. (a) was learned using the topographic inference model with average-pooling of neighborhoods. We set the dictionary sparseness degree to and the number of atoms to , arranged on a grid. The dictionary closely resembles those gained through Topographic Independent Component Analysis (Hyvärinen et al., 2001, 2009) and Invariant Predictive Sparse Decomposition (Kavukcuoglu et al., 2009). It should be noted that here the representation is two times overcomplete due to the whitening procedure. Overcomplete representations are not inherently possible with plain Independent Component Analysis (Bell and Sejnowski, 1997; Hyvärinen et al., 2009), which limits its expressiveness, a restriction that does not hold for EZDL.
The emergence of topographic organization can be explained by the special design of the topographic inference model. The pooling operator acts as a spatial low-pass filter on the filter responses, so that each smoothed filter response carries information on the neighboring filter responses. Filter response locality is retained after the sparseness projection, hence adjacent atoms receive similar updates with the Hebbian-like learning rule, and there are only small differences between filters within the same vicinity. The learning process is similar to that of Self-organizing Maps (Kohonen, 1990), where only the atom with the maximum filter response and the atoms in its direct surroundings are updated. However, EZDL simultaneously updates multiple clusters since the sparse code words laid out on the grid are multimodal.
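The low-pass effect of the pooling operator can be sketched as average-pooling each filter response with its grid neighbors. The 4-neighborhood used here is an assumption for illustration; the actual neighborhood shape and size are design parameters of the inference model.

```python
def pooled_responses(resp, rows, cols):
    """Average-pool each filter response with its 4-neighborhood on the
    topographic grid (row-major layout), so each smoothed response
    carries information about adjacent atoms. Border cells simply
    average over fewer neighbors."""
    out = []
    for r in range(rows):
        for c in range(cols):
            acc, cnt = 0.0, 0
            for dr, dc in ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    acc += resp[rr * cols + cc]
                    cnt += 1
            out.append(acc / cnt)
    return out
```

Because this operator is linear, it demonstrates that no multiplicative interaction is needed to spread activity to neighboring atoms.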
Further, it is notable that we achieved a topographic organization merely through a linear operator, namely simple average-pooling. This stands in contrast to the discussion by Bauer and Memisevic (2013), where the necessity of a nonlinearity such as multiplicative interaction or pooling with respect to the Euclidean norm was assumed. Our result that a simple linear operator plugged into the ordinary inference model already produces smooth topographies shows that linear interactions between the atoms are sufficient despite their minimality.
Figure (b) shows a dictionary obtained with the rank-1 topographic inference model, using a sparseness degree of and filters on a grid. Here, the sparse code words reshaped to a matrix are required to have unit rank, which results in a specially organized filter layout. For example, rows three and four almost exclusively contain low-frequency filters, and the filters in the sixth column are all oriented vertically. The grouping of similar atoms into the same rows and columns is related to Independent Subspace Analysis, which yields groups of Gabor-like filters where all filters in a group have approximately the same frequency and orientation (Hyvärinen and Hoyer, 2000).
The rank-1 topographic inference model guarantees that code words can be expressed as the dyadic product of two vectors, where the factors themselves are sparsely populated because the code words are sparse. This causes the code words to possess a sparse rectangular structure when reshaped to account for the grid layout, that is, non-vanishing activity is always concentrated in a few rows and columns. The Hebbian-like learning rule therefore induces similar updates to atoms located in common rows or columns, which explains the obtained group layout.
4.2.3 Rank-1 Filters on Raw Pixel Values
Forcing the filters themselves to have rank one by setting resulted in bases similar to the Discrete Cosine Transform (Watson, 1994). This was also observed recently by Hawe et al. (2013), who considered a tensor decomposition of the dictionary, and by Rigamonti et al. (2013), who minimized the Schatten norm of the atoms. Note that EZDL merely replaces the atoms after each learning epoch by their best rank-1 approximation. The computational complexity of this operation is negligible compared to an individual learning epoch using tens of thousands of samples, and hence it does not slow down the actual optimization.

If no ordering between the filters was demanded, a multitude of checkerboard-like filters and vertical and horizontal contrast bars was obtained; see Fig. (a) for a dictionary with atoms and a sparseness degree of . Atoms with diagonal structure cannot be present at all in such constrained dictionaries because diagonality requires a filter rank greater than one. Although all filters were paraxial due to the rank-1 constraint, they still resembled contrast fields because of their grating-like appearance. This is due to the representation's sparseness, which is known to induce this appearance.
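Replacing an atom by its best rank-1 approximation amounts to keeping only the leading term of its singular value decomposition. A minimal sketch using power iteration instead of a full SVD routine (an implementation choice assumed here for self-containedness) could look like this:

```python
import math

def best_rank1(M, iters=100):
    """Best rank-1 approximation of a matrix M (list of rows), computed
    by power iteration on the leading right singular vector; equivalent
    to truncating the SVD after its first term."""
    rows, cols = len(M), len(M[0])
    v = [1.0] * cols
    for _ in range(iters):
        # one step of power iteration: u <- M v, v <- M^T u, normalize v
        u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(M[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        nv = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / nv for x in v]
    u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]
```

Applied once per epoch to every atom reshaped as a patch, the cost is indeed negligible compared to the epoch itself.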
A similar filter morphology was obtained with a topographic filter ordering using a grid, though the filters were blurrier, as shown in Fig. (b). Here, checkerboard-like filters are located in clusters, and filters with vertical and horizontal orientation are spatially adjacent. This is as expected, because with a sparse topography there are only a few local blobs of active entries in each code word, causing similar atoms to be grouped together and dissimilar atoms to be distant from each other. In the space of rank-1 filters, there can be either vertical or horizontal structures, or checkerboards combining both principal orientations.
The lack of fine details can be explained by the restriction of topographic organization, which reduces the effective number of degrees of freedom. Analogously to the situation in Sect. 4.2.1 where the dictionary sparseness was increased, the reproduction error can be reduced by a large amount using few filters with low frequencies. For the same reason, there are few such checkerboard-like filters: minimization of the reproduction error is first achieved with principal orientations before checkerboards can be used to encode details in the image patches.
In conclusion, these results show that EZDL can produce a vast variety of dictionaries simply by adapting easily interpretable parameters. The algorithm covers and reproduces well-known phenomena from the literature, and allows a precise visualization of a data set's structure. A first impression of the final dictionaries can already be obtained after one learning epoch, which takes no more than one second on a modern computer.
4.3 Experiments on the Reproduction of Entire Images
It was demonstrated in Sect. 4.2 that the Easy Dictionary Learning algorithm produces dictionaries that correspond to efficient image coding schemes. We now analyze the suitability of EZDL dictionaries trained on raw pixel values for the reproduction of entire images. This is a prerequisite for several image processing tasks such as image enhancement and compression, since these tasks essentially solve the same optimization problem as reproduction with a given dictionary.
Our analysis allows quantitative statements as the original images and their reproduction through sparse coding can be numerically compared on the signal level. Further, the EZDL dictionaries are compared with the results of the Online Dictionary Learning algorithm by Mairal et al. (2009a) and the Recursive Least Squares Dictionary Learning Algorithm by Skretting and Engan (2010) in terms of reproduction quality and learning speed.
The evaluation methodology was as follows. First, dictionaries with different parameter sets were trained on million pixels natural image patches extracted from the "Foliage" collection of the McGill calibrated color image database (Olmos and Kingdom, 2004). As in Sect. 4.2, patches were picked from random positions from each of the desaturated and quantized images. All patches were normalized to zero mean and unit variance before training. The dictionaries were designed for a four times overcomplete representation; they had atoms for samples with pixels. After training, dictionary quality was measured using the images of the "Flowers" collection from the same image database. This evaluation set was disjoint from the learning set.
Each single image was subsequently divided into non-overlapping blocks with pixels. All blocks were normalized to zero mean and unit variance, and then the optimized dictionaries were used to infer a sparse code word for each block. The resulting code word was then fed through a linear generative model using the currently investigated dictionary, and the mean value and variance were restored to match those of the original block.
Sparse code word inference was achieved with a projected Landweber procedure (Bredies and Lorenz, 2008), essentially the same as projected gradient descent starting with the null vector. Using the sparseness-enforcing projection operator after each iteration, the code words were tuned to attain an inference sparseness , which need not necessarily be equal to the sparseness degree used for dictionary learning. The correlation coefficient was used as the similarity measure for inference to benefit from invariance to shifting and scaling. Automatic step size control was carried out with the bold driver heuristic (Bishop, 1995). We note that due to the representation theorem on the sparseness projection, this process can also be understood as Iterative Soft-Thresholding (Bredies and Lorenz, 2008).

A reproduced image is obtained by applying this procedure to all distinct blocks of the original image using a single dictionary. The deviation between this output image and the original image was assessed with the SSIM index (Wang and Bovik, 2009), which yielded qualitatively similar results as the peak signal-to-noise ratio. The SSIM index is, however, normalized to values between zero and one, and it respects the local structure of images since it examines pixel neighborhoods through convolution. Because it measures the visual similarity between images, it is more intuitive than measures based on the pixel-wise squared error (Wang and Bovik, 2009). This evaluation method yielded one number from the interval for each image and dictionary. Each parameterization for dictionary learning was used to train five dictionaries to compensate for probabilistic effects, and here the mean of the resulting SSIMs is reported.

4.3.1 EZDL Results
Figure 10 visualizes the results obtained for dictionaries produced by Easy Dictionary Learning using dictionary sparseness degrees . One thousand learning epochs were carried out, presenting samples in each epoch, and using an initial step size of . During dictionary learning, only the first trivial iteration of a Landweber procedure is carried out for sparse code word inference by computing for the ordinary inference model.
We first analyzed the impact of carrying out more Landweber iterations during the image reproduction phase, and the effect of varying the inference sparseness degree . A huge performance increase can be obtained by using more than one iteration for inference (Fig. (a)). Nearly optimal performance is achieved after ten iterations, and after one hundred iterations the method has converged. For , the performance of dictionaries with and is about equal, yielding the maximum SSIM of one. This value indicates that there is no visual difference between the reproduced images and the original images.
When the inference sparseness is increased to , a difference in the choice of the dictionary sparseness becomes noticeable. For , performance already degrades, and the degradation is substantial if . It is more intuitive when and are put in relation. Almost no information is lost in the reproduction by enforcing a lower inference sparseness than dictionary sparseness (). Performance is worse if , which is plausible because the dictionary was not adapted for a higher sparseness degree. In the natural case the performance mainly depends on their concrete value, where higher sparseness results in worse reproduction capabilities.
To further investigate this behavior, we set the number of Landweber iterations to and varied smoothly in the interval for the four different dictionary sparseness degrees. The results of this experiment are depicted in Fig. (b). For , there is hardly any difference in the reproduction quality irrespective of the value of . The difference when training dictionaries with different sparseness degrees first becomes visible for large values of . Performance is better here when using dictionaries for which very sparse code words were demanded during learning. Hence, tasks that require high code word sparseness should use dictionaries trained with high values of the dictionary sparseness.
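The inference procedure used throughout these experiments, projected gradient descent with the bold driver heuristic, can be sketched as follows. This is a simplification under stated assumptions: a squared-error objective stands in for the correlation coefficient, and the sparseness projection is passed in as an arbitrary callable (in the test a keep-largest projection substitutes for Hoyer's sparseness-enforcing projection).

```python
def projected_landweber(D, x, project, iters=100, eta=1.0):
    """Projected Landweber inference: start from the null code word,
    take gradient steps on the squared reproduction error, apply a
    sparseness projection after every step, and control the step size
    with the bold driver heuristic (grow on success, shrink on failure)."""
    n, k = len(D), len(D[0])

    def err(h):
        return sum((x[i] - sum(D[i][j] * h[j] for j in range(k))) ** 2
                   for i in range(n))

    h = [0.0] * k
    e = err(h)
    for _ in range(iters):
        r = [x[i] - sum(D[i][j] * h[j] for j in range(k)) for i in range(n)]
        g = [sum(D[i][j] * r[i] for i in range(n)) for j in range(k)]
        cand = project([h[j] + eta * g[j] for j in range(k)])
        ec = err(cand)
        if ec <= e:            # bold driver: accept step and grow eta
            h, e, eta = cand, ec, eta * 1.1
        else:                  # reject step and shrink eta
            eta *= 0.5
    return h
```

Because the projection is applied after every gradient step, the returned code word attains the requested sparseness by construction.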
4.3.2 Comparison with Alternative Dictionary Learning Algorithms
For a comparison, we conducted experiments using the Online Dictionary Learning (ODL) algorithm of Mairal et al. (2009a) and the Recursive Least Squares Dictionary Learning Algorithm (RLSDLA) by Skretting and Engan (2010). For inference of sparse code words, ODL minimizes the reproduction error under implicit sparseness constraints. RLSDLA uses an external vector selection algorithm for inference, hence explicit constraints such as a target pseudo-norm can be demanded easily. Both algorithms update the dictionary after each sample presentation.
ODL does not require any step sizes to be adjusted. The crucial parameter for dictionary sparseness here is a number , which controls the trade-off between reproduction capabilities and code word sparseness. We trained five dictionaries for each using atoms and presenting learning samples in each of one thousand epochs. Then, the same evaluation methodology as before with was used to assess the reproduction capabilities. The results are shown in Fig. (a). The choice of is not as influential as that of in EZDL. There are only small performance differences for , and for hardly any difference can be observed. Similar to the EZDL experiments, large values of result in better performance for large and worse performance for small .
RLSDLA provides a forgetting factor parameter which behaves analogously to the step size in gradient descent. We used the default forgetting factor schedule, which interpolates between and using a cubic function over one thousand learning epochs. For inference, we chose an optimized Orthogonal Matching Pursuit variant (Gharavi-Alkhansari and Huang, 1998) where a parameter controls the number of non-vanishing coordinates in the code words. We trained five dictionaries with atoms for each . Thirty thousand randomly drawn learning samples were presented in each learning epoch. The resulting reproduction performance is shown in Fig. (b). Again, for large values of the dictionaries with the sparsest code words during learning perform best, that is, those trained with small values of .
To compare all three dictionary learning algorithms independent of the concrete dictionary sparseness, we took the mean SSIM value that belonged to the best performing dictionaries for each feasible value of . This can also be interpreted as using the convex hull of the results shown in Fig. (b) and Fig. 11. This yielded one curve for each dictionary learning algorithm, depicted in Fig. 12. There is only a minor difference between the curves over the entire range of . It can hence be concluded that the algorithms learn dictionaries that are equally well-suited for the reproduction of entire images. Although EZDL uses a very simple learning rule, it suffices to achieve a performance competitive with that of two state-of-the-art dictionary learning algorithms.
4.3.3 Comparison of Learning Speed
We moreover compared the learning speed of the methods and investigated the influence of the initial step size on the EZDL dictionaries. In doing so, we carried out reproduction experiments on entire images using dictionaries provided by the learning algorithms after certain numbers of learning epochs. In each of one thousand overall epochs, learning samples were input to all three algorithms. The dictionary sparseness parameters were set to for EZDL, to for Online Dictionary Learning, and to for the Recursive Least Squares Dictionary Learning Algorithm.
Timing measurements on an Intel Core i7-4960X processor have shown that one EZDL learning epoch takes approximately less time than a learning epoch of ODL. RLSDLA was roughly times slower than EZDL. However, although the employed vector selection method was available as part of an optimized software library, only a fairly unoptimized implementation of the actual learning algorithm was available. Therefore, a comparison with this algorithm in regard to execution speed would be inequitable.
The inference sparseness was set to and one hundred Landweber iterations were carried out for sparse code word inference. The SSIM index was eventually used to assess the reproduction performance. Figure 13 visualizes the results of the learning speed analysis, averaged over all images from the evaluation set and over five dictionaries trained for each parameterization to compensate for probabilistic effects. Since Online Dictionary Learning is free of step sizes, dictionary performance increased immediately and steadily with each learning epoch. Only small performance gains could be observed after learning epochs.
The performance of dictionaries trained with RLSDLA was initially better than that of the ODL dictionaries, but improved more slowly. After a hundred epochs, however, a significant performance gain could be observed. Although tweaking the forgetting factor schedule might have led to better early reproduction capabilities, RLSDLA achieved a performance equivalent to that of ODL after epochs.
If EZDL’s initial step size was set to unity, performance first degraded, but started to improve after the fifth epoch. After epochs, the performance was identical to ODL’s, and all three methods obtained equal reproduction capabilities after epochs. Reducing the initial step size caused the dictionaries to perform better after few sample presentations. The quality of Online Dictionary Learning was achieved after epochs for , and for the EZDL dictionaries were always better in the mean until the th epoch. There is hardly any quality difference after epochs; however, a very small initial step size resulted in slightly worse performance.
The choice of the initial step size for EZDL is hence not very influential. Although caused slight overfitting during the very first epochs, the dictionaries quickly recovered, so that no significant difference in reproduction capabilities could be observed after epochs. Since an EZDL epoch is faster than an Online Dictionary Learning epoch, our proposed method produces better results earlier on an absolute time scale.
4.4 Image Denoising Experiments
We have demonstrated in Sect. 4.3 that dictionaries learned with Easy Dictionary Learning are as good as those obtained from the Online Dictionary Learning algorithm of Mairal et al. (2009a) and the Recursive Least Squares Dictionary Learning Algorithm by Skretting and Engan (2010) in terms of the reproduction quality of entire images. In a final experiment, we investigated the suitability of EZDL dictionaries for image denoising using the image enhancement procedure proposed by Mairal et al. (2009b).
This method carries out semi-local block matching and finds sparse code words by imposing a group sparseness penalty on the Euclidean reproduction error. In doing so, a pre-learned dictionary is used which explains the appearance of uncorrupted images and further helps to resolve ambiguities if block matching fails to provide large enough groups. A linear generative model is finally used to compute estimators of denoised image patches from the sparse code words. This procedure is quite robust if the input data is noisy, since sparseness provides a strong prior which regularizes this ill-posed inverse problem well (Kreutz-Delgado et al., 2003; Foucart and Rauhut, 2013).
The denoising approach of Mairal et al. (2009b) also provides the possibility of adapting the dictionary while denoising concrete input data. We did not use this option, as it would preclude reliable statements about a dictionary’s suitability if it were modified by another dictionary learning algorithm during denoising.
The methodology of the experiment was as follows. We used the four-times overcomplete dictionaries trained on the pixels image patches from the "Foliage" collection of the McGill calibrated color image database (Olmos and Kingdom, 2004) as models for uncorrupted images. The dictionary sparseness parameters were for EZDL, for ODL and for RLSDLA. For evaluation, we used the images of the "Animals" collection from the McGill database and converted them to bit grayscale images as in the previous experiments. The images were synthetically corrupted with additive white Gaussian noise. Five noisy images were generated for each original image and each standard deviation from .
These images were then denoised using the candidate dictionaries, with the window size for block matching set to pixels. The similarity between the reconstructions and the original images was then measured with the peak signal-to-noise ratio (PSNR). We also evaluated the SSIM index, which led to qualitatively similar results.
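The noise synthesis and quality evaluation described above can be sketched as follows. The function names are ours, and the peak intensity of 255 is an assumption corresponding to 8-bit grayscale data:

```python
import numpy as np

def psnr(original, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in decibels between an original image
    and its (denoised) reconstruction."""
    mse = np.mean((original.astype(np.float64)
                   - reconstruction.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def add_gaussian_noise(image, sigma, rng):
    """Corrupt an image with additive white Gaussian noise of standard
    deviation sigma, clipping back to the valid intensity range."""
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 255.0)
```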
The results are depicted in Fig. 14. Denoising performance degrades as the noise strength increases. There is hardly any difference between the dictionaries trained with the three algorithms. The RLSDLA and EZDL dictionaries perform slightly better for small synthetic noise levels, but this improvement is visually imperceptible. This result is not surprising, since Sect. 4.3 demonstrated that all three learning algorithms produce dictionaries equally well-suited for the reproduction of entire images.
The denoising procedure of Mairal et al. (2009b) aims at reproduction capabilities as well, with the modification of employing noisy samples as input. Image enhancement and compression applications such as those proposed by Yang et al. (2010, 2012), Dong et al. (2011), Skretting and Engan (2011) and Horev et al. (2012) which also use problem formulations based on the reproduction error can hence be expected to benefit from more efficiently learned dictionaries as well.
5 Conclusions
This paper proposed the EZDL algorithm, which features explicit sparseness constraints with respect to Hoyer’s smooth sparseness measure. Predefined sparseness degrees are guaranteed to be attained through a sparseness-enforcing projection operator. Building upon a succinct representation of the projection, we proved that the projection problem can be formulated as a root-finding problem. We presented a linear time and constant space algorithm for the projection which is superior to previous approaches in terms of theoretical computational complexity and execution time on real computing machines.
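For reference, the smooth sparseness measure referred to above is commonly stated, in Hoyer's formulation for a nonzero $x \in \mathbb{R}^n$, as

```latex
\sigma(x) \;=\; \frac{\sqrt{n} \;-\; \lVert x \rVert_1 / \lVert x \rVert_2}{\sqrt{n} \;-\; 1} \;\in\; [0, 1],
```

which equals $0$ when all entries of $x$ have the same magnitude and $1$ when exactly one entry is nonzero.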
EZDL adapts dictionaries to measurement data through simple rank-1 updates. The sparseness projection serves as the foundation for sparse code word inference. Owing to the projection’s efficiency and because no complicated gradients are required, our proposed learning algorithm is significantly faster than even the optimized ODL algorithm. Topographic atom organization and atom sparseness can be realized with very simple extensions, allowing for versatile sparse representations of data sets. Its simplicity and efficiency do not prevent EZDL from producing dictionaries competitive with those generated by ODL and RLSDLA in terms of reproduction and denoising performance on natural images. Alternative image processing methods based on sparse representations rely on dictionaries subject to the same criteria, and can thus be expected to benefit from EZDL’s advantages as well.
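A generic stochastic rank-1 dictionary update can be sketched as follows. This is an illustration only, assuming a gradient step on the reproduction error with subsequent atom normalization; EZDL's actual update rule is the one derived in the paper:

```python
import numpy as np

def rank1_update(W, x, a, eta):
    """Illustrative rank-1 dictionary update (not EZDL's exact rule).
    For one sample x with sparse code a, take a gradient step on the
    reproduction error ||x - W a||^2, which is a rank-1 correction of W,
    and renormalize the atoms (columns) to unit Euclidean norm."""
    residual = x - W @ a
    W = W + eta * np.outer(residual, a)     # rank-1 correction
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    return W
```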
Acknowledgements.
The authors are grateful to Heiko Neumann, Florian Schüle, and Michael Gabb for helpful discussions. We would like to thank Julien Mairal and Karl Skretting for making implementations of their algorithms available. Parts of this work were performed on the computational resource bwUniCluster funded by the Ministry of Science, Research and Arts and the Universities of the State of Baden-Württemberg, Germany, within the framework program bwHPC. This work was supported by Daimler AG, Germany.
Appendix: Technical Details and Proofs for Section 2
This appendix studies the algorithmic computation of Euclidean projections onto level sets of Hoyer’s sparseness measure in greater detail, and in particular proves the correctness of the algorithm proposed in Sect. 2.
For a nonempty subset of the Euclidean space and a point , we call
the set of Euclidean projections of onto (Deutsch, 2001). Since we only consider situations in which is a singleton, we may also write .
Without loss of generality, we can compute for a vector within the nonnegative orthant instead of for an arbitrary point to yield sparseness-enforcing projections, where and are as defined in Sect. 2. First, the actual scale is irrelevant as we can simply rescale the result of the projection (Thom and Palm, 2013, Remark 5). Second, the constraint that the projection lies in the nonnegative orthant can easily be handled by flipping the signs of certain coordinates (Thom and Palm, 2013, Lemma 11). Finally, all entries of can be assumed nonnegative with Corollary 19 from Thom and Palm (2013).
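The sign-flipping reduction mentioned above can be sketched as follows, assuming a routine `project_nonneg` that realizes the projection for nonnegative inputs (the function names are ours):

```python
import numpy as np

def project_with_signs(x, project_nonneg):
    """Reduce a general projection to one in the nonnegative orthant
    (in the spirit of Thom and Palm, 2013, Lemma 11): project the
    magnitudes |x| and restore the original signs afterwards."""
    signs = np.where(x < 0.0, -1.0, 1.0)
    return signs * project_nonneg(np.abs(x))
```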
We note that is non-convex because of the constraint. Moreover, for all target sparseness degrees, which we show here by construction (see also Remark 18 in Thom and Palm (2013) for further details): Let and , then the point lies in , where denotes the th canonical basis vector. Since all coordinates of are positive, always contains points with an pseudo-norm of . If we had used the pseudo-norm to measure sparseness, then would have the same sparseness degree as, for example, the vector with all entries equal to unity. If, however, is close to one, then there is only one large value in and all the other entries are very small but positive. This simple example demonstrates that in situations where the presence of noise cannot be eliminated, Hoyer’s is a much more robust sparseness measure than the pseudo-norm.
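A small numerical experiment illustrates this robustness. The helper below restates Hoyer's measure for nonzero vectors (a sketch under our naming, not code from the paper): both vectors have full pseudo-norm, yet the measure clearly separates the nearly one-sparse vector from the uniform one.

```python
import numpy as np

def sigma(x):
    """Hoyer's sparseness measure of a nonzero vector x."""
    n = x.size
    return (np.sqrt(n) - np.abs(x).sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1.0)

dense = np.ones(4)                          # all entries equal
spiky = np.array([1.0, 1e-6, 1e-6, 1e-6])   # one large, three tiny entries

# Both vectors have l0 pseudo-norm 4, yet their sparseness differs greatly:
print(np.count_nonzero(dense), sigma(dense))   # 4 0.0
print(np.count_nonzero(spiky), sigma(spiky))   # 4 and a value close to 1
```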
Representation Theorem
Before proving a theorem on the characterization of the projection onto , we first fix some notation. As above, let denote the th canonical basis vector, and let furthermore be the vector in which all entries are one. We note that if a point resides in the nonnegative orthant, then . Subscripts to vectors denote the corresponding coordinate, except for and . For example, we have that . We abbreviate with when it is clear that is a real number. When is an index set with elements, say , then the unique matrix with for all is called the slicing operator. A useful relation between the pseudo-norm, the Manhattan norm and the Euclidean norm is for all points .
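The relation elided above is presumably the standard bound obtained from the Cauchy–Schwarz inequality restricted to the nonzero coordinates:

```latex
\lVert x \rVert_1 \;=\; \sum_{i \,:\, x_i \neq 0} 1 \cdot \lvert x_i \rvert
\;\le\; \sqrt{\lVert x \rVert_0} \, \Bigl( \sum_i x_i^2 \Bigr)^{1/2}
\;=\; \sqrt{\lVert x \rVert_0} \, \lVert x \rVert_2
\quad \text{for all } x \in \mathbb{R}^n .
```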
We are now in a position to formalize the representation theorem:
Theorem .1
Let and be unique. Then there is exactly one number such that
where is a scaling constant. Moreover, if , and , denotes the set of the coordinates in which does not vanish, then
Proof
It is possible to prove this claim either constructively or implicitly, where both variants differ in whether the set of all positive coordinates in the projection can be computed from or must be assumed to be known. We first present a constructive proof based on a geometric analysis conducted in Thom and Palm (2013), which contributes to deepening our insight into the involved computations. As an alternative, we also provide a rigorous proof using the method of Lagrange multipliers which greatly enhances the unproven analysis of Potluru et al. (2013, Section 3.1).
We first note that when there are and so that we have , then is determined already through because it holds that . We now show that the claimed representation is unique, and then present the two different proofs for the existence of the representation.
Uniqueness: It is , since would violate that and is impossible because . We first show that there are two distinct indices with . Assume this was not the case, then , say, for all . Let be the index of the smallest coordinate of which has its index in . Let and be numbers and define . Then like in the argument where we have shown that is nonempty. Because of and , it follows that . Hence is at least as good an approximation to as , violating the uniqueness of . Therefore, it is impossible that the set is a singleton.
Now let with and such that . Clearly and as . It is , thus holds. Moreover, , hence . Finally, we have that , which yields , and hence the representation is unique.
Existence (constructive): Let
be the hyperplane on which all points in the nonnegative orthant have an
norm of and let be a scaled canonical simplex. Further, let be a circle on , and for an arbitrary index set let be a subset of where all coordinates not indexed by vanish. With Theorem 2 and Appendix D from Thom and Palm (2013) there exists a finite sequence of index sets with for such that is the result of the finite sequence.
All intermediate projections yield unique results because was restricted to be unique. The index sets contain the indices of the entries that survive the projections onto , for . In other words, can be computed from by alternating projections, where the sets and are non-convex for all . The expressions for the individual projections are given in Lemma 13, Lemma 17, Proposition 24, and Lemma 30, respectively, in Thom and Palm (2013).
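The alternating projections can be sketched with the classical iterative scheme in the spirit of Hoyer (2004); the paper's own contribution replaces such a loop by the faster root-finding algorithm of Sect. 2. A minimal sketch, assuming a nonnegative target with prescribed Manhattan norm `l1` and Euclidean norm `l2` (the two norms must be feasible, i.e. `l2 <= l1 <= sqrt(n) * l2`):

```python
import numpy as np

def hoyer_project(x, l1, l2=1.0, max_iter=100):
    """Alternating-projection sketch: find a nonnegative vector close to x
    with prescribed l1 and l2 norms. Illustration only; not the linear-time
    root-finding method proposed in the paper."""
    n = x.size
    s = x + (l1 - x.sum()) / n          # project onto the l1 hyperplane
    active = np.ones(n, dtype=bool)
    for _ in range(max_iter):
        # midpoint of the simplex face spanned by the active coordinates
        m = np.where(active, l1 / active.sum(), 0.0)
        d = s - m
        # scale d so that ||m + alpha * d||_2 = l2 (quadratic in alpha)
        a = (d ** 2).sum()
        b = 2.0 * (m * d).sum()
        c = (m ** 2).sum() - l2 ** 2
        alpha = (-b + np.sqrt(max(b * b - 4.0 * a * c, 0.0))) / (2.0 * a)
        s = m + alpha * d
        if (s >= 0.0).all():
            return s                     # both norm constraints satisfied
        # clip negative entries and re-project onto the hyperplane
        active &= (s > 0.0)
        s[~active] = 0.0
        s[active] += (l1 - s.sum()) / active.sum()
    return np.clip(s, 0.0, None)
```

Each pass through the loop projects onto the hyperplane of fixed Manhattan norm and then onto the sphere of fixed Euclidean norm, zeroing out coordinates that become negative; this mirrors the finite sequence of index sets described above.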
Let for completeness, then we can define the following constants for . Let be the number of relevant coordinates in each iteration, and let