I Introduction
Sparse representations [1] have become one of the most widely used and successful models for inverse problems in signal processing, image processing, and computational imaging. The reconstruction of a signal from a sparse representation with respect to dictionary matrix is linear, i.e. , but computing the sparse representation given the signal, referred to as sparse coding, usually involves solving an optimization problem^{1}^{1}1We do not consider the analysis form [2] of sparse representations in this work, focusing instead on the more common synthesis.. When solving problems involving images of any significant size, these representations are typically independently applied to sets of overlapping image patches due to the intractability of learning an unstructured dictionary matrix
mapping to a vector space with the dimensionality of the number of pixels in an entire image.
The convolutional form of sparse representations replaces the unstructured dictionary with a set of linear filters . In this case the reconstruction of from representation is , where can be an entire image instead of a small image patch. This form of representation was first introduced some time ago under the label translationinvariant sparse representations [3], but has recently enjoyed a revival of interest as convolutional sparse representations, inspired by deconvolutional networks [4] (see [5, Sec. II]). This interest was spurred by the development of more efficient methods for the computationallyexpensive convolutional sparse coding (CSC) problem [6, 7, 8, 9], and has led to a number of applications in which the convolutional form provides stateoftheart performance [10, 11, 12, 13, 14].
The current leading CSC algorithms [8, 9, 15] are all based on the Alternating Direction Method of Multipliers (ADMM) [16], which decomposes the problem into two subproblems, one of which is solved by softthresholding, and the other having a very efficient noniterative solution in the DFT domain [8]. The design of convolutional dictionary learning (CDL) algorithms is less straightforward. These algorithms adopt the usual approach for standard dictionary learning, alternating between a sparse coding step that updates the sparse representation of the training data given the current dictionary, and a dictionary update step that updates the current dictionary given the new sparse representation. It is the inherent computational cost of the latter update that makes the CDL problem more difficult than the CSC problem.
Most recent batchmode^{2}^{2}2We do not consider the very recent online CDL algorithms [17, 18, 19] in this work. CDL algorithms share the structure introduced in [7] (and described in more detail in [20]), the primary features of which are the use of Augmented Lagrangian methods and the solution of the most computationally expensive subproblems in the frequency domain. Earlier algorithms exist (see [5, Sec. II.D] for a thorough literature review), but since they are less effective, we do not consider them here, focusing on subsequent methods:

Proposed a number of dictionary update methods that lead to CDL algorithms with better performance than that of [7].

Proposed a CDL algorithm that allows the inclusion of a spatial mask in the data fidelity term by exploiting the mask decoupling technique [23].

Proposed an alternative masked CDL algorithm that has much lower memory requirements than that of [9] and that converges faster in some contexts.
Unfortunately, due to the absence of any thorough performance comparisons between them (for example, [22] provides comparisons with [7] but not [5]), as well as due to the absence of a careful exploration of the optimum choice of algorithm parameters in most of these works, it is quite difficult to determine which of these methods truly represents the state of the art in CDL.
Three other very recent methods are not considered here. The algorithm of [24] addresses a variant of the CDL problem that is customized for neural signal processing and not relevant to most imaging applications, and [25, 26] appeared while we were finalizing this paper, so that it was not feasible to include them in our comparisons.
The main contributions of the present paper are:

Demonstrating that two of the algorithms proposed in [22], with very different derivations, are in fact closely related and fall within the same class of algorithm.

Proposing a new approach for the CDL problem without a spatial mask that outperforms all existing methods in a serial processing context.

Proposing new approaches for the CDL problem with a spatial mask that respectively outperform existing methods in serial and parallel processing contexts.

Carefully examining the sensitivity of the considered CDL algorithms to their parameters, and proposing simple heuristics for parameter selection that provide close to optimal performance.
Ii Convolutional Dictionary Learning
CDL is usually posed in the form of the problem
(1) 
where the constraint on the norms of filters is required to avoid the scaling ambiguity between filters and coefficients^{3}^{3}3The constraint is frequently used instead of . In practice this does not appear to make a significant difference to the solution.. The training images are considered to be dimensional vectors, where is the number of pixels in each image, and we denote the number of filters and the number of training images by and respectively. This problem is nonconvex in both variables and , but is convex in with constant, and vice versa. As in standard (nonconvolutional) dictionary learning, the usual approach to minimizing this functional is to alternate between updates of the sparse representation and the dictionary. The design of a CDL algorithm can therefore be decomposed into three components: the choice of sparse coding algorithm, the choice of dictionary update algorithm, and the choice of coupling mechanism, including how many iterations of each update should be performed before alternating, and which of their internal variables should be transferred across when alternating.
Iia Sparse Coding
While a number of greedy matching pursuit type algorithms were developed for translationinvariant sparse representations [5, Sec. II.C], recent algorithms have largely concentrated on a convolutional form of the standard Basis Pursuit DeNoising (BPDN) [27] problem
(2) 
This form, which we will refer to as Convolutional BPDN (CBPDN), can be written as
(3) 
If we define such that , and
(4) 
we can rewrite the CBPDN problem in standard BPDN form Eq. (2). The Multiple Measurement Vector (MMV) version of CBPDN, for multiple images, can be written as
(5) 
where is the image, and is the coefficient map corresponding to the image and dictionary filter. By defining
(6) 
we can rewrite in the MMV form of standard BPDN,
(7) 
The most effective solution for solving Eq. (5) is currently based on ADMM [16]^{4}^{4}4It is worth noting, however, that a solution based on FISTA with the gradient computed in the frequency domain, while generally less effective than the ADMM solution, exhibits a relatively small performance difference for the larger values typically used for CDL [5, Sec. IV.B]., which solves problems of the form
(8) 
by iterating over the steps
(9)  
(10)  
(11) 
where penalty parameter is an algorithm parameter that plays an important role in determining the convergence rate of the iterations, and is the dual variable corresponding to the constraint . We can apply ADMM to problem Eq. (7) (working with this form of the problem instead of Eq. (5) simplifies the notation, but the reader should keep in mind that , , and denote the specific blockstructured matrices defined above) by variable splitting, introducing an auxiliary variable that is constrained to be equal to the primary variable , leading to the equivalent problem
(12) 
for which we have the ADMM iterations
(13)  
(14)  
(15) 
Step Eq. (15) involves simple arithmetic, and step Eq. (14) has a closedform solution
(16) 
where is the softthresholding function
(17) 
with and of a vector considered to be applied elementwise, and denoting elementwise multiplication. The most computationally expensive step is Eq. (13), which requires solving the linear system
(18) 
Since is a very large matrix, it is impractical to solve this linear system using the approaches that are effective when is not a convolutional dictionary. It is possible, however, to exploit the FFT for efficient implementation of the convolution via the DFT convolution theorem. Transforming Eq. (18) into the DFT domain gives
(19) 
where denotes the DFT of variable . Due to the structure of , which consists of concatenated diagonal matrices , linear system Eq. (19) can be decomposed into a set of independent linear systems [7], each of which has a left hand side consisting of a diagonal matrix plus a rankone component, which can be solved very efficiently by exploiting the ShermanMorrison formula [8].
IiB Dictionary Update
The dictionary update requires solving the problem
(20) 
which is a convolutional form of Method of Optimal Directions (MOD) [28] with a constraint on the filter normalization. As for CSC, we will develop the algorithms for solving this problem in the spatial domain, but will solve the critical subproblems in the frequency domain. We want solve for with a relatively small support, but when computing convolutions in the frequency domain we need to work with
that have been zeropadded to the common spatial dimensions of
and . The most straightforward way of dealing with this complication is to consider the to be zeropadded and add a constraint that requires that they be zero outside of the desired support. If we denote the projection operator that zeros the regions of the filters outside of the desired support by , we can write a constraint set that combines this support constraint with the normalization constraint as(21) 
and write the dictionary update as
(22) 
Introducing the indicator function of the constraint set , where the indicator function of a set is defined as
(23) 
allows Eq. (22) to be written in unconstrained form [29]
(24) 
Defining such that and
(25) 
this problem can be expressed as
(26) 
or, defining
(27) 
as
(28) 
Algorithms for solving this problem will be discussed in Sec. III. A common feature of most of these methods is the need to solve a linear system that includes the data fidelity term . As in the case of the step Eq. (13) for CSC, this problem can be solved in the frequency domain, but there is a critical difference: is composed of independent components of rank instead of rank 1, so that the very efficient Sherman Morrison solution cannot be directly exploited. It is this property that makes the dictionary update inherently more computationally expensive than the sparse coding stage, complicating the design of algorithms, and leading to the present situation in which there is far less clarity as to the best choice of dictionary learning algorithm than there is for the choice of the sparse coding algorithm.
IiC Update Coupling
Both sparse coding and dictionary update stages are typically solved via iterative algorithms, and many of these algorithms have more than one working variable that can be used to represent the current solution. The major design choices in coupling the alternating optimization of these two stages are therefore:

how many iterations of each subproblem to perform before switching to the other subproblem, and

which working variable from each subproblem to pass across to the other subproblem.
Since these issues are addressed in detail in [21], we only summarize the conclusions here:

When both subproblems are solved by ADMM algorithms, most authors have coupled the subproblems via the primary variables (corresponding, for example, to in Eq. (12)) of each ADMM algorithm.

This choice tends to be rather unstable, and requires either multiple iterations of each subproblem before alternating, or very large penalty parameters, which can lead to slow convergence.

The alternative strategy of coupling the subproblems via the auxiliary variables (corresponding, for example, to in Eq. (12)) of each ADMM algorithm tends to be more stable, not requiring multiple iterations before alternating, and converging faster.
Iii Dictionary Update Algorithms
Since the choice of the best CSC algorithm is not in serious dispute, the focus of this work is on the choice of dictionary update algorithm.
Iiia ADMM with Equality Constraint
The simplest approach to solving Eq. (28) via an ADMM algorithm is to apply the variable splitting
(29) 
for which the corresponding ADMM iterations are
(30)  
(31)  
(32) 
Step Eq. (31) is of the form
(33) 
It is clear from the geometry of the problem that
(34) 
or, if the normalisation is desired instead,
(35) 
Step Eq. (30) involves solving the linear system
(36) 
which can be expressed in the DFT domain as
(37) 
As in Eq. (19), this linear system can be decomposed into a set of independent linear systems, but in contrast to Eq. (19), each of these has a left hand side consisting of a diagonal matrix plus a rank component, which precludes direct use of the ShermanMorrison formula [5].
We consider three different approaches to solving these linear systems:
IiiA1 Conjugate Gradient
An obvious approach to solving Eq. (37) without having to explicitly construct the matrix is to apply an iterative method such as Conjugate Gradient (CG). The experiments reported in [5] indicated that solving this system to a relative residual tolerance of or better is sufficient for the dictionary learning algorithm to converge reliably. The number of CG iterations required can be substantially reduced by using the solution from the previous outer iteration as an initial value.
IiiA2 Iterated ShermanMorrison
Since the independent linear systems into which Eq. (37) can be decomposed have left hand side consisting of a diagonal matrix plus a rank component, one can iteratively apply the ShermanMorrison formula to obtain a solution [5]. This approach is very effective for small to moderate , but performs poorly for large since the computational cost is .
IiiA3 Spatial Tiling
IiiB Consensus Framework
In this section it is convenient to introduce different blockmatrix and vector notation for the coefficient maps and dictionary, but we overload the usual symbols to emphasize their corresponding roles. We define as in Eq. (25), but define
(38) 
where is distinct copy of dictionary filter corresponding to training image .
As proposed in [22], we can pose problem Eq. (28) in the form of an ADMM Consensus problem [16, Ch. 7]
(39) 
The constraint can be written in standard ADMM form as
(40) 
where
(41) 
The corresponding ADMM iterations are
(42)  
(43)  
(44) 
Since is block diagonal, Eq. (42) can be solved as the independent problems
(45) 
each of which can be solved via the same efficient DFTdomain ShermanMorrison method used for Eq. (13). Subproblem Eq. (43) can be expressed as [16, Sec. 7.1.1]
(46)  
(47) 
which has the closedform solution
(48) 
IiiC 3D / Frequency Domain Consensus
Like spatial tiling (see Sec. IIIA3), the “3D” method proposed in [22] maps the dictionary update problem with to an equivalent problem for which . The “3D” method achieves this by considering an array of 2D training images as a single 3D training volume. The corresponding dictionary filters are also inherently 3D, but the constraint is modified to require that they are zero other than in the first 3D slice (this can be viewed as an extension of the constraint that the spatiallypadded filters are zero except on their desired support) so that the final results is a set of 2D filters, as desired.
While ADMM consensus and “3D” were proposed as two entirely distinct methods [22], it turns out they are closely related: the “3D” method is ADMM consensus with the data fidelity term and constraint expressed in the DFT domain. Since the notation is a bit cumbersome, the point will be illustrated for the case, but the argument is easily generalised to arbitrary .
When , the dictionary update problem can be expressed as
(49) 
which can be rewritten as the equivalent problem^{5}^{5}5Equivalence when the constraints are satisfied is easily verified by multiplying out the matrixvector product in the data fidelity term in Eq. (50).
(50) 
where the constraint can also be written as
(51) 
The general form of the matrix in Eq. (50) is a blockcirculant matrix constructed from the blocks . Since the multiplication of the dictionary block vector by the blockcirculant matrix is equivalent to convolution in an additional dimension, this equivalent problem represents the “3D” method.
Now, define the unnormalised block DFT matrix operating in this extra dimension as
(52) 
and apply it to the objective function and constraint, giving
(53) 
Since the DFT diagonalises a circulant matrix, this is
(54) 
In this form the problem is an ADMM consensus problem in variables
(55) 
IiiD Fista
The Fast Iterative ShrinkageThresholding Algorithm (FISTA) [30], an accelerated proximal gradient method, has been used for CSC [6, 5, 18], and in a recent online CDL algorithm [17], but has not previously been considered for the dictionary update of a batchmode dictionary learning algorithm.
The FISTA iterations for solving Eq. (28) are
(56)  
(57)  
(58) 
where and is a parameter controlling the gradient descent step size. Parameter can be computed adaptively by using a backtracking step size rule [30], but in the experiments reported here we use a constant for simplicity. We compute the gradient of the data fidelity term in Eq. (56) in the DFT domain
(59) 
as advocated in [5] for the FISTA solution of the CSC problem.
Iv Masked Convolutional Dictionary Learning
When we wish to learn a dictionary from data with missing samples, or have reason to be concerned about the possibility of boundary artifacts resulting from the circular boundary conditions associated with the computation of the convolutions in the DFT domain, it is useful to introduce a variant of Eq. (1) that includes a spatial mask [9], which can be represented by a diagonal matrix
(60) 
As in Sec. II, we separately consider the sparse coding and dictionary updates components of this functional.
Iva Sparse Coding
A masked form of the MMV CBPDN problem Eq. (7) can be expressed as the problem^{6}^{6}6For simplicity, the notation presented here assumes a fixed mask across all columns of and , but the algorithm is easily extended to handle a different for each column .
(61) 
There are two different methods for solving this problem. The one, proposed in [9], exploits the mask decoupling technique [23], involving applying an alternative variable splitting to give the ADMM problem
(62) 
where the constraint can also be written as
(63) 
The corresponding ADMM iterations are
(64)  
(65)  
(66)  
(67)  
(68) 
The functional minimised in Eq. (64) is of the same form as Eq. (13), and can be solved via the same frequency domain method, the solution to Eq. (65) is as in Eq. (16), and the solution to Eq. (66) is given by
(69) 
The other method for solving Eq. (61) involves appending an impulse filter to the dictionary and solving the problem in a way that constrains the coefficient map corresponding to this filter to be zero where the mask is unity, and to be unconstrained where the mask is zero [31, 15]. Both approaches provide very similar performance [15], the major difference being that the former is a bit more complicated to implement, while the latter is restricted to addressing problems where has only zero or one entries. We will use the former mask decoupling approach for the experiments reported here since it does not require any restrictions on the form of .
IvB Dictionary Update
The dictionary update requires solving the problem
(70) 
Algorithms for solving this problem are discussed in the following section.
V Masked Dictionary Update Algorithms
Va BlockConstraint ADMM
Problem Eq. (70) can be solved via the splitting [9]
(71) 
where the constraint can also be written as
(72) 
This problem has the same structure as Eq. (62), the only difference being the replacement of the norm with the indicator function of the constraint set. Hence, the ADMM iterations are largely the same as Eq. (64)(68), except that the norm in Eq. (65) is replaced with the indicator function of the constraint set, and that the step corresponding to Eq. (64) is more computationally expensive to solve, just as Eq. (30) is more expensive than Eq. (13).
VB Extended Consensus Framework
In this section we reuse the variant notation introduced in Sec. IIIB. The masked dictionary update Eq. (70) can be solved via a hybrid of the mask decoupling and ADMM consensus approaches, which can be formulated as
(73) 
where the constraint can also be written as
(74) 
or, expanding the block components of , , and ,
(75) 
The corresponding ADMM iterations are
(76)  
(77)  
(78)  
(79)  
(80) 
Steps Eq. (76), (77), and (79) have the same form, and can be solved in the same way, as steps Eq. (42), (43), and (44) respectively of the ADMM algorithm in Sec. IIIB, and steps Eq. (78) and (80) have the same form, and can be solved in the same way, as the corresponding steps in the ADMM algorithm of Sec. VA.
VC Fista
Problem Eq. (70) can be solved via FISTA as described in Sec. IIID, but the calculation of the gradient term is complicated by the presence of the spatial mask. This difficulty can be handled by transforming back and forth between spatial and frequency domains so that the convolution operations are computed efficiently in the frequency domain, while the masking operation is computed in the spatial domain, i.e.
(81) 
where and represent the DFT and inverse DFT transform operators, respectively.
Vi Results
In this section we compare the computational performance of the various approaches that have been discussed, carefully selecting optimal parameters for each algorithm to ensure a fair comparison.
Via Dictionary Learning Algorithms
Before proceeding to performance results, we summarize the dictionary learning algorithms that will be compared. Instead of using the complete dictionary learning algorithm proposed in each prior work, we consider the primary contribution of these works to be in the dictionary update method, which is incorporated into the CDL algorithm structure that was demonstrated in [21] to be most effective: auxiliary variable coupling with a single iteration for each subproblem^{7}^{7}7In some cases, slightly better time performance can be obtained by performing a few iterations of the sparse coding update followed by a single dictionary update, but we do not consider this complication here. before alternating. Since the sparse coding stages are the same, the algorithm naming is based on the dictionary update algorithms.
The following CDL algorithms are considered for problem Eq. (1) without a spatial mask
The CDL algorithm is as proposed in [5].
The CDL algorithm is as proposed in [5].
The CDL algorithm uses the dictionary update proposed in [22], but the more effective variable coupling and alternation strategy discussed in [21].
The CDL algorithm uses the dictionary update proposed in [22], but the more effective variable coupling and alternation strategy discussed in [21].
The algorithm is the same as Cns, but with a parallel implementation of both the sparse coding and dictionary update stages^{8}^{8}8Šorel and Šroubek [22] observe that the ADMM Consensus problem is inherently parallelizable [16, Ch. 7], but do not actually implement the corresponding CDL algorithm in parallel form to allow the resulting computational gain to be quantified empirically.. All steps of the CSC stage are completely parallelizable in the training image index , as are the and steps of the dictionary update, the only synchronization point being in the step, Eq. (48
), where all the independent dictionary estimates are averaged to update the consensus variable that all the processes share.
The CDL algorithm uses the dictionary update proposed in [22], but the more effective variable coupling and alternation strategy discussed in [21].
Not previously considered for this problem.
The following dictionary learning algorithms are considered for problem Eq. (60) with a spatial mask
Not previously considered for this problem.
The CDL algorithm is as proposed in [15].
The CDL algorithm is based on a new dictionary update constructed as a hybrid of the dictionary update methods proposed in [9] and [22].
Not previously considered for this problem.
In addition to the algorithms listed above, we considered the Stochastic Averaging ADMM (SAADMM) [32], as proposed for CDL in [10]. Our implementation of a CDL algorithm based on this method was found to have promising computational cost per iteration, but its convergence was not competitive with some of the other methods considered here. However, since there are a number of algorithm details that are not provided in [10] (CDL is not the primary topic of that work), it is possible that our implementation omits some critical components. These results are therefore not included here in order to avoid making an unfair comparison.
We do not compare with the dictionary learning algorithm in [7] because the algorithms of both [9] and [22] were both reported to be substantially faster. Similarly, we do not compare with the dictionary learning algorithm in [9] because the alternative approach considered in [15] was found to be faster, and to avoid the very high memory requirements, resulting from the strategy of caching matrix factorizations, that make it difficult to apply this method to training images of even moderate size.
ViB Experiments
We use training sets (one of each size) of 5, 10, 20, and 40 images. These sets are nested in the sense that all images in a set are also present in all the larger sets. The parent set of 40 images consists of greyscale images of size 256 256 pixels, cropped and rescaled from a set of images selected from the MIRFLICKR1M dataset^{9}^{9}9The actual image data contained in this dataset is of very low resolution since the dataset is primarily targeted at image classification tasks. The high resolution images from which those used here were derived were obtained by downloading the original images that were used to derive the MIRFLICKR1M images. [36]. An additional set of 20 images, of the same size and from the same source, is used as a test set to allow comparison of generalization performance, taking into account possible differences in overfitting effects between the different methods. These 8 bit greyscale images are divided by 255 so that pixel values are within the interval [0,1], and are highpass filtered (a common approach for convolutional sparse representations [37, 38, 5][39, Sec. 3]) by subtracting a lowpass component computed by Tikhonov regularization with a gradient term [35, pg. 3], with regularization parameter .
ViC Optimal Penalty Parameters
To ensure a fair comparison between the methods, the optimal penalty parameters for each method and training image set were selected via a grid search, of CDL functional values obtained after 100 iterations, over values for the ADMM dictionary updates, and over values for the FISTA dictionary updates. The grid resolutions were

10 logarithmically spaced points in

15 logarithmically spaced points in

15 logarithmically spaced points in
The best set of or for each method i.e. the ones yielding the lowest value of the CDL functional at 100 iterations, was selected as a center for a finer grid search, of CDL functional values obtained after 200 iterations, with 10 logarithmically spaced points in and 10 logarithmically spaced points in or 10 logarithmically spaced points in . The optimal parameters for each method were taken as those yielding the lowest value of the CDL functional at 200 iterations in this finer grid. This procedure was repeated for sets of 5, 10 and 20 images. As an indication of the sensitivities of the different methods to their parameters, results for the coarse grid search for the 20 image set can be found in Sec. SII in the Supplementary Material.
Since computing grid searches for the case of is very expensive, we extrapolated the optimal parameters found in the smaller sized problems and performed a coarser resolution grid search around those values. We used 3 logarithmically spaced points in and 3 logarithmically spaced points in or 3 logarithmically spaced points in and selected the or combination yielding the lowest value of the CBPDN functional at 200 iterations as the optimal parameters.
The optimal parameters determined via these grid searches are summarized in Table I.
Parameter  Parameter  

Method  Method  
5  3.59  4.08  3.59  5.99  
CG  10  3.59  12.91  MCG  3.59  7.74 
20  2.15  24.48  2.15  7.74  
40  2.56  62.85  2.49  11.96  
5  3.59  4.08  3.59  5.99  
ISM  10  3.59  12.91  MISM  3.59  7.74 
20  2.15  24.48  2.15  7.74  
40  2.56  62.85  2.49  11.96  
5  3.59  7.74  
Tiled  10  3.59  12.91  
20  3.59  40.84  
40  3.59  72.29  
5  3.59  1.29  3.59  1.13  
Cns  10  3.59  1.29  MCns  3.59  0.68 
20  3.59  2.15  3.59  1.13  
40  3.59  1.08  3.59  1.01  
5  3.59  7.74  
3D  10  3.59  12.91  
20  3.59  40.84  
40  3.59  72.29  
5  3.59  48.14  
FISTA  10  3.59  92.95  
20  3.59  207.71  
40  3.59  400.00 
ViD Performance Comparisons
We compare the performance of the methods in learning a dictionary of 64 filters of size 8 8 for sets of 5, 10, 20 and 40 images, setting the sparsity parameter , and using the parameters determined by the grid searches for each method. To avoid complicating the comparisons, we used fixed penalty parameters and without any adaptation methods [5, Sec. III.D], and did not apply relaxation methods [16, Sec. 3.4.3][5, Sec. III.D] in any of the ADMM algorithms. Performance is compared in Figs. 1 – 3 in terms of the convergence rate, with respect to both iterations and computation time, of the CDL functional. The time scaling of all the methods is summarized in Fig. 7(a).
For the case, all the methods have quite similar performance in terms of functional value convergence with respect to iterations. For the larger training set sizes, CG and ISM have somewhat better performance with respect to iterations, but ISM has very poor performance with respect to time. CG has substantially better time scaling, depending on the relative residual tolerance. We ran our experiments for CG with a fixed tolerance of , resulting in computation times that are comparable with those of the other methods. A smaller tolerance leads to better convergence with respect to iterations, but substantially worse time performance.
Both parallel (CnsP) and regular consensus (Cns) have the same evolution of the CBPDN functional, Eq. (5), with respect to iterations, but the former requires much less computation time, and is the fastest method overall. Moreover, parallel consensus yields almost ideal speedups, with some overhead for , but scaling linearly for , and with very competitive computation times. FISTA is also very competitive, achieving good results in less time than any of the other serial methods, and almost matching the time scaling with of parallel consensus (see Fig. 7(a)).
As expected from the relationship established in Sec. IIIC, 3D behaves similarly to ADMM consensus, but has a larger memory footprint. The spatial tiling method (Tiled), on the other hand, tends to have slower convergence with respect to both iterations and time than the other methods. We do not further explore the performance of these methods since they do not provide substantial advantages over the other ones discussed.
All experiments with algorithms that include a spatial mask set the mask to the identity () to allow comparison with the performance of the algorithms without a spatial mask. Plots comparing the evolution of the masked CBPDN functional, Eq. (61), during 1000 iterations and problem sizes of are displayed in Figs. 4 – 6, respectively. The time scaling of all the masked methods is summarized in Fig. 7(b).
It is clear that, despite the additional FFTs required for computing the masked version of the FISTA algorithm, this is a more timeefficient procedure than the mask decoupling ADMM variants. It is also clear that the hybrid mask decoupling/consensus dictionary update (MCns) yields a better convergence with iterations and competitive time scaling. While we have not implemented a parallel version of this method, the overall algorithm structure is sufficiently similar to that of the consensus dictionary update to support the conclusion that it would exhibit similar performance gains from parallelization, which would give it the best time performance of any of the methods considered for the CDL problem with mask decoupling.
In contrast with the corresponding maskfree variants, MCG and MISM have worse performance in terms of both time and iterations. This suggests that MCG requires a value for the relative residual tolerance smaller than to produce good results, but this would be at the expense of much longer computation times.
With the exception of CG, for which the cost of computing the masked version increases for , the computation time for the masked versions is only slightly worse than the maskfree variants (Fig. 7). In general, using the masked versions leads to a marginal decrease in convergence rate with respect to iterations and a small increase in computation time.
ViE Evaluation on the Test Set
To enable a comparison that takes into account any possible differences in overfitting and generalization properties of the dictionaries learned by the different methods, we ran experiments over a 20 image test set that was not used during learning. For all the methods discussed, we saved the dictionaries at 50 iteration intervals (including the final one obtained at 1000 iterations) while training. These dictionaries were used to sparse code the images in the test set and report the evolution of the CBPDN functional for . Results for the dictionaries learned while training with and images are shown in Figs. 8 and 9 respectively, and corresponding results for the algorithms with a spatial mask are shown in Figs. 10 and 11 respectively. Note that the time axis in these plots refers to the run time of the dictionary learning code used to generate the relevant dictionary, and not to the run time of the sparse coding on the test set. Therefore, at any given instant we know which method is giving the best test performance, which would be particularly important when learning a dictionary subject to a run time constraint.
As expected, independent of the method, the dictionaries obtained for training with 40 images exhibit better performance than the ones trained with 20 images. Overall, performance on training is a good predictor of performance in testing, which suggests that the functional value on a sufficiently large training set is a reliable indicator of dictionary quality.
ViF Penalty Parameter Selection
The grid searches performed for determining optimal parameters ensure a fair comparison between the methods, but they are not convenient as a general approach to parameter selection. In this section we show that it is possible to construct heuristics that provide reliable parameter selections for the best performing CDL methods considered here.
ViF1 Parameter Scaling Properties
In Sec. SIII in the Supplementary Material we derive the scaling properties with respect to of the different algorithms discussed here by considering the simplified case in which the training set size is changed by replication of the same training data. For the CDL problem without a spatial mask, these scaling properties are derived for the sparse coding problem and for the different dictionary learning updates: ADMM with equality constraint, ADMM with consensus constraint and FISTA. These results show that the scaling of the penalty parameter for the convolutional sparse coding is , the scaling of the penalty parameter for the dictionary learning update is for the ADMM with equality constraint and for consensus ADMM, and the scaling of the step size for FISTA is . Analytic derivations for Tiled and 3D methods do not lead to a simple scaling relationship, and are not included.
Likewise, for the masked versions we derive the scaling properties of the parameters for the sparse coding, dictionary learning with blockconstraint ADMM and dictionary learning with consensus framework. The scaling of the penalty parameter for the masked version of the convolutional sparse coding is , the scaling of the penalty parameter for the dictionary learning update in the masked consensus framework is , while there is no simple rule of the scaling in the blockconstraint ADMM of Sec. VA.
ViF2 Penalty Parameter Sensitivity
While the analytic derivations of parameter scaling properties represent a useful guide to estimating optimal parameters for a training set given those of a training set of a different size, they do not provide any indication of the stability of these parameters across different training subsets. We therefore provide an empirical evaluation of the sensitivity of the penalty parameters across different training image subsets, for the most effective method in each representative class in the dictionary learning update: conjugate gradient (for ADMM with equality constraint), consensus (for ADMM with consensus constraint) and FISTA (for iterative gradient descent). For masked versions, we include an empirical evaluation for the hybrid consensus method, which, besides MFISTA, is the only one with a simple parameter scaling behaviour. We do not compute a separate empirical sensitivity analysis for the masked version of FISTA because it has the same functional evolution as the maskfree version for the identity mask .
For each training set size we randomly selected 20 different image subsets from the 40 image training set. We performed grid searches, of CDL functional values obtained after 100 iterations, over values for the ADMM dictionary updates, and over values at 200 iterations for the FISTA dictionary updates. We used the following resolutions: for CG, 10 logarithmically spaced points in for and ; for Cns, 10 logarithmically spaced points in for and 10 logarithmically spaced points in for ; and for FISTA, 10 logarithmically spaced points in for , 10 logarithmically spaced points in for with and 10 logarithmically spaced points in for with
. The ranges of these were set in agreement with the results obtained in the first grid searches. We normalized the results for each subset by dividing by the minimum of the functional for that subset, and computed statistics over these normalized values. These statistics are reported as box plots (median, quartiles and spread) for all the representative methods in Sec.
SIV in the Supplementary Material.Parameter  Method  Rule 

All  
CG, ISM  
Cns, MCns  
FISTA 
Fig. 12 displays a summary of the scaling of the penalty parameters. For each of the problem sizes , we computed the median of the normalized CDL values for each point in the search grids of the random subsets, and we plotted the penalty parameters corresponding to the minima of the median. We also determined the part of the grid that is in the 2% of the minima of the normalized CDL values, and plotted it as spread bars per . The figure includes plots for and for CG, consensus and consensus with spatial mask; and and for FISTA. Results for ISM are the same as for CG and are not shown. For the methods that lead to a simple scaling relationship (for which analytic derivations are provided in the Supplementary Material), we combined the theoretical expected scaling with the results of the experimental sensitivity analysis, to construct guidelines for setting the parameters. These guidelines are plotted in black, and are also summarized in Table II.
Note that , the inverse of the gradient step size in FISTA, has to be greater than or equal to the Lipschitz constant of the gradient of the functional to guarantee convergence of the algorithm, but this constant is not always computable [30]. Empirically, we find that smaller values of
yield a worse final value of the CBPDN functional, and therefore, very skewed sensitivity plots, with smooth changes of the functional above the optimal value, but large jumps below it, and medians closer to the minima of the 2% interval. A similar but less marked behavior can be seen in
, the penalty parameter for the CSC problem, with slower degradation for values above the optimal value, more rapid degradation for values below the optimum, and medians closer to the minima of the 2% interval.The parameter selection guidelines presented in this section should only be expected to be reliable for training data with similar characteristics to those used in our experiments, i.e. natural images scaled to pixel values in the range and preprocessed via the same or similar lowpass filtering. Nevertheless, since the scaling properties derived in Sec. SIII of the Supplementary Material remain valid, it is reasonable to expect that similar heuristics, albeit with different parameters (e.g. constants in the case of scaling and slopes in the case of scaling), would hold for different image content or different choices of lowpass filtering.