and computer vision [3, 4]. In sparse coding, each data sample is represented as a weighted combination of a few atoms from an over-complete dictionary learned from the data. Despite its popularity, sparse coding cannot capture shifted local patterns that are common in image samples. Often, it has to first extract overlapping image patches, which is analogous to manually convolving the dictionary with the samples. As each sample element (e.g., an image pixel) is contained in multiple overlapping patches, the separately learned representations may not be consistent. Moreover, the resultant representation is highly redundant.
Convolutional sparse coding (CSC) addresses this problem by learning a shift-invariant dictionary composed of many filters. Local patterns at translated positions of the samples are easily extracted by convolution, which eliminates the need for generating overlapping patches. Each sample is approximated by the sum of a set of filters convolved with the corresponding codes. The learned representations are consistent as they are obtained together. CSC has been used successfully in various image processing applications such as super-resolution image reconstruction, high dynamic range imaging, image denoising and inpainting. It is also popular in biomedical applications, e.g., cell identification, calcium image analysis, tissue histology classification and segmentation of curvilinear structures. CSC has also been used in audio processing applications such as piano music transcription.
A number of approaches have been proposed to solve the optimization problem in CSC. In the pioneering deconvolutional network (DeconvNet), simple gradient descent is used. As convolution is slow in the spatial domain, fast convolutional sparse coding (FCSC) formulates CSC in the frequency domain, and the alternating direction method of multipliers (ADMM) is used to solve the resultant optimization problem. Its most expensive operation is the inversion of a convolution-related linear operator. To alleviate this problem, convolutional basis pursuit denoising (CBPDN) exploits a special structure of the dictionary, while the global consensus ADMM (CONSENSUS) utilizes the matrix inverse lemma to simplify computations. Fast and flexible convolutional sparse coding (FFCSC) further introduces mask matrices so as to handle incomplete samples that are common in image/video inpainting and demosaicking applications. Note that all these algorithms operate in the batch mode (i.e., all the samples/codes have to be accessed in each iteration). Hence they can become expensive, in terms of both space and time, on large data sets.
In general, online learning has been commonly used to improve the scalability of machine learning algorithms [18, 19]. While batch learning algorithms train the model after the arrival of the whole data set, online learning algorithms observe the samples sequentially and update the model incrementally. Moreover, data samples need not be stored after being processed. This can significantly reduce the algorithm's time and space complexities. In the context of sparse coding, an efficient online algorithm has been proposed. In each iteration, the information necessary for the dictionary update is summarized in fixed-sized history matrices. The space complexity of the algorithm is thus independent of the sample size. Recently, this has also been extended to large-scale matrix factorization.
However, though CSC is similar to sparse coding, this online sparse coding algorithm cannot be directly used. This is because convolution in CSC needs to be performed in the frequency domain for efficiency. Moreover, the sizes of the history matrices depend on the dimensionality of the sparse codes, which is much larger in CSC than in sparse coding. Storing the resultant history matrices can be infeasible.
In this paper, we propose a scalable online CSC algorithm for large data sets. The algorithm, which will be called Online Convolutional Sparse Coding (OCSC), is inspired by the online sparse coding algorithm described above. It avoids the above-mentioned problems by reformulating the CSC objective so that convolution can be handled easily in the frequency domain and much smaller history matrices are needed. We use ADMM to solve the resultant optimization problem. It will be shown that the ADMM subproblems have efficient closed-form solutions. Consequently, to process a given number of samples, OCSC has the same time complexity as state-of-the-art batch CSC methods but requires much less space. Empirically, as OCSC updates the dictionary after coding each sample, it converges much faster than batch CSC methods. Theoretical analysis shows that the learned dictionary converges to a stationary point of the optimization problem. Extensive experiments show that the proposed method converges much faster than existing batch CSC methods and also achieves better reconstruction performance. Moreover, while existing CSC algorithms can only run on a small number of images, the proposed method can handle at least ten times more images.
The rest of the paper is organized as follows. Section II briefly reviews online sparse coding, the ADMM, and batch CSC methods. Section III describes the proposed online convolutional sparse coding algorithm. Experimental results are presented in Section IV, and the last section gives some concluding remarks.
Notation: For vector , its th element is denoted , its norm is , its norm is , and reshapes to a diagonal matrix with elements 's. Given another vector , the convolution is a vector , with . For matrix with elements 's, stacks the columns of to a vector. Given another matrix , the Hadamard product is . The identity matrix is denoted , and the conjugate transpose is denoted .
The Fourier transform that maps from the spatial domain to the frequency domain is denoted, and is the inverse Fourier transform. For a variable in the spatial domain, its corresponding variable in the frequency domain is denoted .
II Related Works
II-A Online Sparse Coding
Given samples , where each , sparse coding learns an over-complete dictionary of atoms and sparse codes . It can be formulated as the following optimization problem:
where and . Many efficient algorithms have been developed for solving (1). Examples include K-SVD and the active set method. However, they require storing all the samples, which can become infeasible when is large.
To solve this problem, an online learning algorithm for sparse coding that processes samples one at a time is proposed in . After observing the th sample , the sparse code is obtained as
where is the dictionary obtained at the th iteration. After obtaining , is updated as
Each column in (4) can be obtained by coordinate descent. and can also be updated incrementally as
Using and , one does not need to store all the samples and codes to update . The whole algorithm is shown in Algorithm 1.
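The role of the history matrices can be illustrated with a small NumPy sketch. The names `A`, `B`, `update_history` and the plain running-sum form are assumptions made for illustration (the source may, for instance, normalize by the iteration count). The sketch only shows that the dictionary-dependent part of the squared loss can be evaluated from two fixed-size matrices, without revisiting past samples or codes.

```python
import numpy as np

def update_history(A, B, x, z):
    """Fold one (sample, code) pair into the fixed-size history matrices."""
    A = A + np.outer(z, z)   # K x K, accumulates z z^T
    B = B + np.outer(x, z)   # P x K, accumulates x z^T
    return A, B

def dict_objective_from_history(D, A, B):
    """sum_i 0.5*||x_i - D z_i||^2, minus the term that does not depend on D."""
    return 0.5 * np.trace(D.T @ D @ A) - np.trace(D.T @ B)

rng = np.random.default_rng(0)
P, K, N = 6, 4, 50                   # sample dimension, atoms, samples (illustrative)
X = rng.standard_normal((N, P))
Z = rng.standard_normal((N, K))      # stand-in codes (not sparse in this toy example)

A, B = np.zeros((K, K)), np.zeros((P, K))
for x, z in zip(X, Z):
    A, B = update_history(A, B, x, z)    # (x, z) could be discarded after this line

D = rng.standard_normal((P, K))
full = 0.5 * np.sum((X - Z @ D.T) ** 2)  # needs every sample and code
const = 0.5 * np.sum(X ** 2)             # independent of D
summarized = dict_objective_from_history(D, A, B) + const
```

Here `full` and `summarized` agree up to floating-point error, while the storage for `A` and `B` does not grow with the number of samples.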
The following assumptions are made in .
Samples are generated i.i.d. from some distribution with bounded.
The code is unique w.r.t. data .
The objective in (4) is strictly convex with lower-bounded Hessians.
II-B Alternating Direction Method of Multipliers (ADMM)
ADMM has been widely used for solving optimization problems of the form
where are convex functions, and (resp. ) are constant matrices (resp. vector). It first constructs the augmented Lagrangian of problem (8)
where is the dual variable, and is a penalty parameter.
At the th iteration, the values of and (denoted as and ) are updated by minimizing (9) w.r.t. and in an alternating manner. Define the scaled dual variable . The ADMM updates can be written as
The above procedure converges to the optimal solution at a rate of , where is the number of iterations.
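As a concrete instance of the alternating updates with a scaled dual variable, the sketch below applies ADMM to a small lasso problem, split as minimize 0.5*||Ax - b||^2 + lam*||z||_1 subject to x = z. The problem, the function names, and the parameter values (`lam`, `rho`, `n_iter`) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_lasso(A, b, lam, rho=10.0, n_iter=1000):
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)   # u is the scaled dual variable
    lhs = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(lhs, Atb + rho * (z - u))  # minimize over x
        z = soft_threshold(x + u, lam / rho)           # minimize over z
        u = u + x - z                                  # dual update
    return x, z, u

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
lam = 0.5
x, z, u = admm_lasso(A, b, lam)

# Optimality check for the lasso: every entry of A^T (A z - b) lies in [-lam, lam].
grad = A.T @ (A @ z - b)
```

At convergence the two primal blocks agree (`x` equals `z` up to numerical error), and `grad` satisfies the lasso optimality condition.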
II-C Convolutional Sparse Coding
Convolutional sparse coding (CSC) learns a dictionary composed of filters, each of length , that can capture the same local pattern at different translated positions of the samples. This is achieved by replacing the multiplication between dictionary and code by convolution. While each in sparse coding is represented by a single code , each in CSC is represented by codes stored together in the matrix .
The dictionary and codes are obtained by solving the following optimization problem:
where denotes convolution in the spatial domain.
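The model can be made concrete with a one-dimensional toy example in which a sample is the sum of filters convolved with sparse codes. All names (`filters`, `codes`, `beta`) and sizes are assumptions for illustration. The sketch also assumes full linear convolution with codes of length P - M + 1, which is only one possible boundary convention and may differ from the one used in the paper.

```python
import numpy as np

def csc_reconstruct(filters, codes):
    """Sum over k of (filter k convolved with code k)."""
    return sum(np.convolve(d, z, mode="full") for d, z in zip(filters, codes))

def csc_objective(x, filters, codes, beta):
    """Squared reconstruction error plus an l1 penalty on the codes."""
    residual = x - csc_reconstruct(filters, codes)
    l1 = sum(np.abs(z).sum() for z in codes)
    return 0.5 * residual @ residual + beta * l1

M, P, K = 3, 10, 2                        # filter length, sample length, number of filters
filters = [np.array([1.0, 2.0, 1.0]), np.array([1.0, 0.0, -1.0])]
codes = [np.zeros(P - M + 1) for _ in range(K)]
codes[0][2] = 1.0                         # filter 0 placed at position 2
codes[1][5] = -2.0                        # filter 1, scaled by -2, placed at position 5
x = csc_reconstruct(filters, codes)
```

Each nonzero code entry places a scaled copy of its filter at one position, so the same local pattern is captured at different translations without extracting overlapping patches.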
Convolution can be accelerated in the frequency domain via the convolution theorem : , where
is first zero-padded to-dimensional. Hence, recent CSC methods [5, 8, 16, 17] choose to operate in the frequency domain. Let , and . (12) is reformulated as
where the factor in the objective comes from Parseval's theorem, and is the linear operation that removes the extra dimensions in .
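The two facts used in this reformulation can be checked numerically. The sketch below zero-pads a short filter to the sample length, verifies that an element-wise product in the frequency domain matches circular convolution in the spatial domain, and verifies the scaling in Parseval's theorem. The variable names and sizes are assumptions, and the 1/P scaling shown is specific to NumPy's unnormalized FFT convention.

```python
import numpy as np

rng = np.random.default_rng(0)
P, M = 16, 4
d = rng.standard_normal(M)        # short filter
z = rng.standard_normal(P)        # code, same length as the sample

d_pad = np.concatenate([d, np.zeros(P - M)])   # zero-pad the filter to length P

# Frequency domain: element-wise product followed by the inverse FFT.
conv_fft = np.fft.ifft(np.fft.fft(d_pad) * np.fft.fft(z)).real

# Spatial domain: circular convolution computed directly in O(P * M).
conv_direct = np.array(
    [sum(d[m] * z[(p - m) % P] for m in range(M)) for p in range(P)]
)

# Parseval with an unnormalized FFT: ||v||^2 == ||fft(v)||^2 / P.
v = rng.standard_normal(P)
energy_spatial = np.sum(v ** 2)
energy_freq = np.sum(np.abs(np.fft.fft(v)) ** 2) / P
```

The FFT route costs O(P log P) per filter regardless of the filter length, which is why the frequency-domain formulation is attractive.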
Given , can be obtained as
After obtaining and , the sparse codes can be recovered as for , and the dictionary filters as .
The above algorithms all need in space. They differ mainly in how they solve the linear system involved with in the ADMM subproblems. FCSC directly solves the subproblem, which takes time. CBPDN exploits a special structure in the dictionary and reduces the time complexity to , which is efficient for small . The CONSENSUS algorithm utilizes the matrix inverse lemma to reduce the time complexity to . The state-of-the-art is FFCSC, which incorporates various linear algebra techniques (such as Cholesky factorization and cached factorization) to reduce the time complexity to .
III Online Convolutional Sparse Coding
Existing CSC algorithms operate in the batch mode, and need to store all the samples and codes, which cost space. This becomes infeasible when the data set is large. In this section, we will scale up CSC by using online learning. (Footnote 1: After the initial arXiv posting of our paper, we became aware of some very recent independent works that also consider CSC in the online setting [27, 28]. These will be discussed in Section III-E.)
After observing the th sample , online CSC considers the following optimization problem which is analogous to (12):
To solve problem (15), some naive approaches are first considered in Section III-A. The proposed online convolutional sparse coding algorithm is then presented in Section III-B. It has the same time complexity for one data pass as state-of-the-art batch CSC algorithms, but a much lower space complexity (Section III-C). The convergence properties of the proposed algorithm are discussed in Section III-D.
III-A Naive Approaches
As in batch CSC, problem (15) can be solved by alternating minimization w.r.t. the codes and dictionary (as in Section II-C). Given the dictionary, the codes are updated as in (13). Given the codes, the dictionary is updated by solving the following optimization subproblem analogous to (14):
Alternatively, the objective in (16) can be rewritten as
and . This is of the form in (3). Hence, we may attempt to reuse the online sparse coding procedure in Algorithm 1, and thus avoid storing all the samples and codes. However, recall that online learning of the dictionary is possible because one can summarize the observed samples into the history matrices in (5) and (6). For (17), the history matrices become
In typical CSC applications, the number of image pixels is at least in the tens of thousands, and may only be in the thousands. Hence, the space required for storing and is even higher than the space required for batch methods.
III-B Proposed Algorithm
Note that in (18) is composed of a number of diagonal matrices. By utilizing this special structure, the following Proposition rewrites the objective in (16) so that much smaller history matrices can be used.
The objective in (16) is equivalent to the following apart from a constant:
The dictionary and codes can then be efficiently updated in an alternating manner as follows.
III-B1 Updating the Dictionary
With the codes fixed, using Proposition 2, the dictionary can be updated by solving the following optimization problem:
This can be solved using ADMM. At the th ADMM iteration, let be the ADMM dual variable. The following shows the update equations for and .
Updating : From (20), can be updated by solving the following subproblem:
where . Note that . Hence, . The rows of can then be obtained separately as
where . With , we do not need to store .
Computing the matrix inverse takes time. This can be simplified by noting from (21) that is the sum of rank-1 matrices and a (scaled) identity matrix. Using the Sherman-Morrison formula (Footnote 2: Given an invertible square matrix and vectors , .), we have
This takes , instead of , time.
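The saving from the Sherman-Morrison formula can be seen in a generic rank-1 update of a stored inverse. The sketch below is not the paper's exact update, whose symbols are not recoverable from the text here. The function name, the size `K`, and the value of `rho` are assumptions. It starts from a scaled identity, adds rank-1 terms one at a time, and keeps the inverse current in O(K^2) time per term instead of recomputing it in O(K^3).

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return inv(A + u v^H) given A_inv = inv(A), using O(K^2) operations."""
    Au = A_inv @ u                      # A^{-1} u
    vA = v.conj() @ A_inv               # v^H A^{-1}
    denom = 1.0 + v.conj() @ Au         # 1 + v^H A^{-1} u
    return A_inv - np.outer(Au, vA) / denom

rng = np.random.default_rng(0)
K, rho = 5, 0.1
A = rho * np.eye(K, dtype=complex)           # scaled identity
A_inv = np.eye(K, dtype=complex) / rho       # its inverse is known in closed form
for _ in range(3):                           # add three rank-1 terms z z^H
    z = rng.standard_normal(K) + 1j * rng.standard_normal(K)
    A = A + np.outer(z, z.conj())
    A_inv = sherman_morrison_update(A_inv, z, z)
```

Complex vectors are used because the quantities involved live in the frequency domain. The matrix `A` is kept only to allow a comparison against a direct inverse; an online method would store `A_inv` alone.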
Updating : From (10), each column can be updated as
It has the following closed-form solution.
Proposition 3.
, where .
Finally, the dual variable is updated as in (11). The whole dictionary update procedure (DictOCSC) is shown in Algorithm 2. As (23) is convex, convergence to the globally optimal solution is guaranteed.
In batch CSC methods, the dictionary update in (14) is also based on ADMM (Section II-C2). However, our dictionary update step first reformulates the objective as in (20). This enables each ADMM subproblem to be solved with a much lower space complexity ( vs for the state-of-the-art ) but still with the same iteration time complexity (i.e., ).
III-B2 Updating the Code
Given the dictionary, as the codes for different samples are independent, they can be updated one by one as in batch CSC methods (Section II-C1).
The whole algorithm, which will be called Online Convolutional Sparse Coding (OCSC), is shown in Algorithm 3.
III-C Complexity Analysis
In Algorithm 3, the space requirement is dominated by , which takes space. For one data pass which processes samples, updating and takes time, dictionary update takes time, code update takes time, and FFT/inverse FFT take time. Hence, one data pass takes a total of time.
A comparison with the existing batch CSC algorithms is shown in Table I. As can be seen, the proposed algorithm has the same time complexity for one data pass as the state-of-the-art FFCSC algorithm, but a much lower space complexity ( instead of ).
Table I: Comparison with existing batch CSC algorithms. Columns: batch/online, convolution operation, space, time for one data pass.
where is a linear operator which maps a vector to its associated Toeplitz matrix. Specifically,
The number of columns in is equal to the dimension of (i.e., ).
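The Toeplitz construction can be sketched as follows: each column holds a shifted copy of the vector, so that multiplying by the matrix reproduces convolution. The function name `toeplitz_of` and the choice of full linear convolution are assumptions for illustration; the operator in the paper may handle boundaries differently.

```python
import numpy as np

def toeplitz_of(d, n):
    """Matrix T with n columns such that T @ z == np.convolve(d, z) when len(z) == n."""
    m = len(d)
    T = np.zeros((m + n - 1, n))
    for j in range(n):
        T[j:j + m, j] = d          # column j holds d shifted down by j rows
    return T

d = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 0.0, -1.0, 2.0])
T = toeplitz_of(d, len(z))         # one column per entry of z
conv_via_matrix = T @ z
```

Writing convolution as a matrix-vector product is what allows a convolutional objective to be expressed in the same form as an ordinary sparse coding objective, at the cost of a much larger matrix.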
Let , be the inverse operator of which maps back to , and . Problem (15) can be rewritten as
Thus, the objective (27) is of the same form as that in (3). However, a direct use of Algorithm 1 is not feasible. First, the feasible region in (28) is more complex, and coordinate descent cannot be used as there is no simple projection to . Second, as is of length , the corresponding history matrices (analogous to those in (5)) require space.
Though a direct application of Algorithm 1 is not practical, Theorem 1 still holds. Indeed, Theorem 1 can be further extended by relaxing its feasible region on . As discussed in , can be, for example, . It is mentioned in  that has to be a union of independent constraints on each column of . However, this only serves to facilitate the use of coordinate descent (step 6 in Algorithm 1), but is not required in the proof. In general, Theorem 1 holds when is bounded, convex, and .
in (28) is bounded, convex, and a subset of .
Here, we discuss the very recent works of [27, 28], which also consider online learning of the dictionary in CSC. Additional experiments in Section B show that our method is much faster than both.
In the online convolutional dictionary learning (OCDL) algorithm, convolution is performed in the spatial domain. It starts from the observation that convolution is commutative. Hence, for the summation in (15), we have
where with and . (15) can then be rewritten as
where (as each is repeated times in the Toeplitz matrix ). Using (30), the history matrices are constructed as
Recall that is the length of filter when CSC is solved in the spatial domain. The space complexity of  is dominated by (which takes space) or (which takes space), depending on the relative sizes of and . Though this is comparable to our space requirement, its time complexity is much larger. For one data pass, convolution in the spatial domain takes time and updating the history matrices above takes time. The dictionary update in total takes time. In contrast, the proposed algorithm takes time. In the experiments,