1 Introduction
Convolutional constraints arise in response to many natural phenomena. They provide a means of expressing an intuition that the relationships between variables should be both spatially or temporally localized, and redundant across the support of the signal.
Coupled with sparsity, one can also enforce selectivity, i.e. that the patterns observed in the signal stem from some structured underlying basis whose phase alignment is arbitrary.
In this paper, we discuss a number of optimization methods designed to solve problems involving convolutional constraints and sparse regularization in the form of the convolutional sparse coding algorithm,¹ and show through examples that the algorithm is well suited to a wide variety of problems that arise in computer vision.

¹ Convolutional sparse coding has also been coined Shift Invariant Sparse Coding (SISC); however, the authors believe the term "convolutional" is more representative of the algorithm's properties.
Consider the signal in Fig. 1 from (Lewicki and Sejnowski, 1999). The signal is composed of two distinct modes, appearing at multiple intervals (you could consider the modes to be expressions that are repeated over the course of a conversation, features that appear multiple times within an image, or particular motifs in a musical passage or speech). We wish to discover these modes and their occurrences in the signal in an unsupervised manner.
If the signal is segmented into blocks, the latent structure is obscured, and the basis learned must have the capacity to reconstruct each block in isolation. In effect, we learn a basis that is higher rank than the true basis due to the artificial constraints placed on the temporal alignment of the bases.
Convolutional sparse coding makes no such assumptions, allowing shiftable basis functions to discover a lower rank structure, and should therefore be preferred in situations where the basis alignment is not known a priori. As we aim to demonstrate in this paper, this arises in a large number of practical applications.
The notion of translation invariant optimization stems from the thought experiments of Simoncelli et al. (Simoncelli et al., 1992), whose motivation for considering translation invariance came from the context of wavelet transforms for signal processing. Amongst others, they had observed that block-based wavelet algorithms were sensitive to translation and scaling of the input signal.
As an example, they chose an input signal to be one of the wavelet basis functions (yielding a single reconstruction coefficient), then perturbed that signal slightly to produce a completely dense set of coefficients. The abrupt change in representation in the wavelet domain due to a small change in the input illustrated the wavelet transform's unsuitability for higher-level summarization.
Olshausen and Field showed that sparsity alone is a sufficient driver for learning structured overcomplete representations from signals, and used it to learn a basis for natural image patches (Olshausen and Field, 1996). The resulting basis, featuring edges at different scales and orientations, was similar to the receptive fields observed in the primary visual cortex.
The strategy of sampling patches from natural images has come under fire, however, since many of the learned basis elements are simple translations of each other - an artefact of having to reconstruct individual patches, rather than entire image scenes (Kavukcuoglu et al., 2010). Removing the artificial assumption that image patches are independent - by modelling interactions in a convolutional objective - results in more expressive basis elements that better explain the underlying mechanics of the signal.
Lewicki and Sejnowski (Lewicki and Sejnowski, 1999) made the first steps towards this realization, by finding a set of sparse coefficients (value and temporal position) that reconstructed the signal with a fixed basis. They remarked on the spike-like responses observed, and the small number of coefficients needed to achieve satisfactory reconstructions.
Introducing sparsity brought with it a set of computational challenges that made the resulting objectives difficult to optimize. (Olshausen and Field, 1996) explored coefficients drawn from a Cauchy distribution (a smooth heavy-tailed distribution) and a Laplacian distribution (a non-smooth heavy-tailed distribution), citing that in both cases they favour, "among activity states with equal variance, those with fewest non-zero coefficients". (Olshausen and Field, 1996) inferred the coefficients as the equilibrium solution to a differential equation. (Lewicki and Sejnowski, 1999) assumed Laplacian distributed coefficients, and noted that due to the high sparsity of the desired response, it would be sufficient to replace exact inference with a procedure for guessing the values and temporal locations of the non-zero coefficients, then refining the results through a modified conjugate gradient local search.

Tibshirani (Tibshirani, 1996) introduced a convex form of the sparse inference problem - estimating Laplacian distributed coefficients which minimize a least-squares reconstruction error - using $\ell_1$ norm regularization, and presented a method for solving it with existing quadratic programming techniques.

The full convolutional sparse coding algorithm culminated in the work of (Grosse et al., 2007). Fundamentally, Grosse extended Olshausen and Field's sparse coding algorithm to include convolutional constraints, and generalized Lewicki and Sejnowski's convolutional sparse inference to 2D. Algorithmically, Grosse drew on the work of Tibshirani (Tibshirani, 1996) in expressing Laplacian distributed coefficients as $\ell_1$ norm regularization, and used the feature sign search minimization algorithm proposed by his colleagues in the same year (Lee et al., 2007) to solve it efficiently. The form he introduced is the now canonical bilinear convolutional sparse coding algorithm.
Convolutional sparse coding has found application in learning Gabor-like bases that reflect the receptive fields of the primary visual cortex (Olshausen and Field, 1997), elemental motifs of visual (Zeiler et al., 2010), speech (Grosse et al., 2007; Lewicki and Sejnowski, 1999) and musical (Grosse et al., 2007; Mørup et al., 2008) perception, a basis for human motion and articulation (Zhu and Lucey, 2014), mid-level discriminative patches (Sohn and Jung, 2011), and unsupervised learning of hierarchical generative models (Lee et al., 2009; Zeiler et al., 2010).

The large-scale nature of the latter applications has placed great demands on the computational efficiency of the underlying algorithms. Coupled with the steady advances in machine learning and computing, this has given rise to a range of optimization approaches for convolutional sparse coding. (Chalasani and Principe, 2012) introduced a convolutional extension to the FISTA algorithm for sparse inference (Beck and Teboulle, 2009), and (Bristow et al., 2013) introduced a Fourier method based on the closely related ADMM (Boyd, 2010). Convolutional basis learning has largely been relegated to gradient descent; however, (Grosse et al., 2007) introduced a Fourier domain adaptation that minimized the inequality constrained problem via the Lagrange dual.
In this paper, we make the following contributions:
- We provide a brief discussion detailing the works that influenced the development of the convolutional sparse coding algorithm and how it stems from natural processes.
- We argue that convolutional sparse coding makes a set of assumptions that are more appropriate in many tasks where block or sampled sparse coding is currently used.
- We discuss practical optimization of convolutional sparse coding, including speed, memory usage and assumptions on boundary conditions.
- We show through a number of examples the applicability of convolutional sparse coding to a wide range of problems that arise in computer vision, and when particular problems benefit from different optimization methods.
It is worth emphasising that the algorithms presented in this work are not new. Detailed treatments of the algorithms can be found in their respective works (Bristow et al., 2013; Chalasani and Principe, 2012; Grosse et al., 2007). Rather, this paper aims to present the available algorithms in a unified framework and act as a resource for practitioners wishing to understand their properties.
1.1 Outline
We begin in §2 with an introduction of the convolutional sparse coding formulation. We discuss optimization methods for convolutional sparse inference first in §3, as they are of most algorithmic interest, followed by methods for basis/filter updates in §4 and stopping criteria in §5. In §6 we present a number of applications of convolutional sparse coding, each with a discussion of the preferred optimization method and implication of boundary effects. Proofs and implementation details are deferred until the appendices.
1.2 Notation
Matrices are written in bold uppercase (e.g. $\mathbf{A}$), vectors in bold lowercase (e.g. $\mathbf{a}$) and scalars in regular typeface (e.g. $a$). Convolution is represented by the $*$ operator, and correlation by the $\star$ operator. We treat all signals as vectors - higher-dimensional spatial or spatio-temporal signals are implicitly vectorized - however operations on these signals are performed in their original domain. We adopt this notation because (i) the domain of the signal is largely irrelevant in our analysis and the algorithms we present, and (ii) for any linear transform in the original domain, there exists an equivalent transform in the vectorized domain.

A hat applied to any vector denotes the Discrete Fourier Transform (DFT) of a vectorized signal, such that $\hat{x} = \mathcal{F}(x) = F x$, where $\mathcal{F}$ is the Fourier transform operator and $F$ is the matrix of complex Fourier bases.

2 Convolutional Sparse Coding
The convolutional sparse coding problem consists of minimizing a convolutional model-fitting term $f(\cdot)$ and a sparse regularizer $g(\cdot)$,

(1)  $\arg\min_{d,z} \; f(x - d * z) + \beta\, g(z)$

where $d$ is the convolutional kernel, $z$ is the set of sparse coefficients, and $\beta$ controls the trade-off between reconstruction error and sparsity of representation. The input $x$ can be reconstructed via the convolution,

(2)  $x = d * z$
Assuming Gaussian distributed noise and Laplacian distributed coefficients, Eqn. 1 can be written more formally as,

(3)  $\arg\min_{d,z} \; \tfrac{1}{2}\|x - d * z\|_2^2 + \beta\|z\|_1 \quad \text{s.t.} \quad \|d\|_2^2 \le 1$
The remainder of our analysis is based around efficient methods of optimizing this objective. The objective naturally extends to multiple images and filters,

(4)  $\arg\min_{d,z} \; \sum_{i=1}^{N} \Big( \tfrac{1}{2}\big\|x_i - \textstyle\sum_{k=1}^{K} d_k * z_{ik}\big\|_2^2 + \beta \textstyle\sum_{k=1}^{K} \|z_{ik}\|_1 \Big) \quad \text{s.t.} \quad \|d_k\|_2^2 \le 1 \;\; \forall k$

however this form quickly becomes unwieldy, so we only use it when the number of filters or images needs to be emphasized.
Contrast the objective of Eqn. 3 with that of conventional sparse coding,

(5)  $\arg\min_{D,Z} \; \tfrac{1}{2}\|X - DZ\|_F^2 + \beta\|Z\|_1 \quad \text{s.t.} \quad \|d_k\|_2^2 \le 1 \;\; \forall k$
Here, we are solving for a set of basis vectors and sparse coefficients in alternation that reconstruct patches or samples of the signal in isolation. This is an important distinction to make, and forms the fundamental difference between convolutional and patchbased sparse coding.
In the limit when $X$ contains every patch from the full image, the two sparse coding algorithms behave equivalently; however, the patch-based algorithm must store a redundant amount of data, and cannot take advantage of fast methods for evaluating the inner product of the basis with each patch (i.e. convolution).
The objective of Eqn. 3 is bilinear - solving for each variable whilst holding the other fixed yields a convex subproblem, however the objective is not jointly convex in both. We optimize the objective in alternation, iterating until convergence. There are no guarantees that the final minimum reached is the global minimum, however in practice multiple trials reach minima of comparable quality, even if the exact bases and distribution of coefficients learned are slightly different.
In the sections that follow, we introduce a range of algorithms that can solve for the filters and coefficients. Since the alternation strategy treats each independently, we largely consider the algorithms in isolation, however some care must be taken in matching appropriate boundary condition assumptions.
2.1 Other Formulations
It should be noted that the algorithm of Eqn. 3 is not the only conceivable formulation of convolutional sparse coding. In particular, there has been growing interest in non-convex sparse coding with hyper-Laplacian and other exotic priors (Zhang, 2010; Wipf et al., 2011). Whilst these methods have not yet been extended to convolutional sparse coding, there is no fundamental barrier to doing so.
3 Solving for Coefficients
It is natural to begin with methods for convolutional sparse inference, since this is where most of the research effort has been focussed. Solving for the coefficients involves optimizing the unconstrained objective,
(6)  $\arg\min_{z} \; \tfrac{1}{2}\|x - d * z\|_2^2 + \beta\|z\|_1$

where $\beta$ controls the trade-off between sparsity and reconstruction error. This objective is difficult to solve because (i) it involves a non-smooth regularizer, (ii) the least-squares system cannot be solved directly (forming explicit convolutional matrices for large inputs is infeasible), and (iii) the system involves a large number of variables, especially if working with megapixel imagery, etc.
In the case of multiple images and filters,

(7)  $\arg\min_{z} \; \sum_{i=1}^{N} \Big( \tfrac{1}{2}\big\|x_i - \textstyle\sum_{k=1}^{K} d_k * z_{ik}\big\|_2^2 + \beta \textstyle\sum_{k=1}^{K} \|z_{ik}\|_1 \Big)$

and given that the objective is separable across images - each $z_i$ appears in only one term of the sum, so the minimum of the sum is the sum of the per-image minima,

(8)  $\min_{z} \sum_{i=1}^{N} h_i(z_i) = \sum_{i=1}^{N} \min_{z_i} h_i(z_i)$

the coefficients for each image can be inferred separately,

(9)  $\arg\min_{z_i} \; \tfrac{1}{2}\big\|x_i - \textstyle\sum_{k=1}^{K} d_k * z_{ik}\big\|_2^2 + \beta \textstyle\sum_{k=1}^{K} \|z_{ik}\|_1$
The two methods that we introduce are both based around partitioning the objective into the smooth modelfitting term and nonsmooth regularizer that can then be handled separately.
3.1 ADMM Partitioning
The alternating direction method of multipliers (ADMM) was proposed jointly by (Glowinski and Marroco, 1975; Gabay and Mercier, 1976), though the idea can be traced back as early as Douglas-Rachford splitting in the mid-1950s. (Boyd, 2010) presents a thorough overview of ADMMs and their properties. ADMMs solve problems of the form,

(10)  $\min_{z, y} \; f(z) + g(y) \quad \text{s.t.} \quad z = y$

To express convolutional sparse inference in this form, we perform the (somewhat unintuitive) substitution,

(11)  $f(z) = \tfrac{1}{2}\|x - d * z\|_2^2$
(12)  $g(y) = \beta\|y\|_1$

to obtain,

(13)  $\arg\min_{z, y} \; \tfrac{1}{2}\|x - d * z\|_2^2 + \beta\|y\|_1 \quad \text{s.t.} \quad z = y$
By introducing a proxy $y$, the loss function can be treated as a sum of functions of two independent variables, and with the addition of the equality constraint, the minimum of the new constrained objective is the same as that of the original unconstrained objective.
Whilst the individual functions may be easier to optimize, there is added complexity in enforcing the equality constraints. Taking the Lagrangian of the augmented objective (here in scaled form, where $\lambda$ denotes the scaled dual variable),

(14)  $\mathcal{L}_\rho(z, y, \lambda) = \tfrac{1}{2}\|x - d * z\|_2^2 + \beta\|y\|_1 + \tfrac{\rho}{2}\|z - y + \lambda\|_2^2 - \tfrac{\rho}{2}\|\lambda\|_2^2$

the solution is to minimize over the primal variables, and maximize over the dual variables,

(15)  $\arg\max_{\lambda} \min_{z, y} \; \mathcal{L}_\rho(z, y, \lambda)$
Optimizing in alternation (which accounts for the term alternating direction) yields a strategy for updating the variables, each update involving a function of a single variable plus a proximal term binding the variables together,

(16)  $z^{t+1} = \arg\min_{z} \; f(z) + \tfrac{\rho}{2}\|z - y^t + \lambda^t\|_2^2$
(17)  $y^{t+1} = \arg\min_{y} \; g(y) + \tfrac{\rho}{2}\|z^{t+1} - y + \lambda^t\|_2^2$
(18)  $\lambda^{t+1} = \lambda^t + z^{t+1} - y^{t+1}$
The intuition behind this strategy is to find a set of model parameters which don't deviate too far from the regularization parameters, then vice versa, to find a set of regularization parameters which are close to the model parameters. The Lagrange variables impose a linear descent direction that forces the two primal variables towards equality over time.
For this strategy to be effective, the sum of the function $f$ or $g$ and an isotropic least-squares term must be easy to minimize.
Substituting the convolutional sparse inference objective into Eqn. 16,

(19)  $z^{t+1} = \arg\min_{z} \; \tfrac{1}{2}\|x - d * z\|_2^2 + \tfrac{\rho}{2}\|z - y^t + \lambda^t\|_2^2$
(20)  $y^{t+1} = \arg\min_{y} \; \beta\|y\|_1 + \tfrac{\rho}{2}\|z^{t+1} - y + \lambda^t\|_2^2$

The $z$ update takes the form of generalized Tikhonov regularization. In the $y$ update, the regularizer can now be treated independently of the model-fitting term, and importantly the added proximal term is an isotropic Gaussian, so each element of $y$ can be updated independently,

(21)  $y_i^{t+1} = \arg\min_{y_i} \; \beta|y_i| + \tfrac{\rho}{2}(v_i - y_i)^2, \qquad v = z^{t+1} + \lambda^t$

the solution to which is the soft thresholding operator,

(22)  $y^{t+1} = \mathrm{sign}(v) \odot \max\big(|v| - \tfrac{\beta}{\rho},\, 0\big)$
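As a concrete sketch, the three updates above can be written in a few lines of NumPy for the simplest setting - a single filter, a single 1D signal, and circular boundary conditions so the z-update diagonalizes in the Fourier domain (function names are ours, not from any particular implementation):

```python
import numpy as np

def soft_threshold(v, tau):
    # Elementwise soft thresholding: the proximal operator of tau * ||.||_1 (Eqn. 22).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def conv_lasso_admm(x, d, beta, rho=1.0, iters=300):
    """ADMM for  min_z 0.5||x - d*z||^2 + beta||z||_1  with one filter and
    circular boundaries. Returns the exactly sparse iterate y."""
    n = len(x)
    dh = np.fft.fft(d, n)                 # filter zero-padded to signal length
    xh = np.fft.fft(x)
    y = np.zeros(n)                       # proxy variable
    lam = np.zeros(n)                     # scaled dual variable
    for _ in range(iters):
        # z-update (Eqn. 19): elementwise least-squares solve per frequency
        zh = (np.conj(dh) * xh + rho * np.fft.fft(y - lam)) / (np.abs(dh) ** 2 + rho)
        z = np.real(np.fft.ifft(zh))
        # y-update (Eqn. 20): soft thresholding of z + lam (Eqn. 22)
        y = soft_threshold(z + lam, beta / rho)
        # dual update (Eqn. 18)
        lam = lam + z - y
    return y
```

Note that, per the remark below, the returned $y$ is exactly sparse while $z$ is only approximately sparse.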
In the ADMM, the $y$ minimizer is exactly sparse, whilst the $z$ minimizer is only close to sparse. If exact sparsity is a concern in the problem domain, $y$ should be retained at the point of convergence.
3.2 FISTA/Proximal Gradient
Proximal gradient methods are a close parallel to ADMMs. They generalize the problem of projecting a point onto a convex set, and often admit closed-form solutions. (Parikh and Boyd, 2013) present a thorough review of proximal algorithms and their relation to ADMMs. The convolutional Lasso problem introduces the splitting,

(23)  $f(z) = \tfrac{1}{2}\|x - d * z\|_2^2, \qquad g(z) = \beta\|z\|_1$

with gradient and proximal operator,

(24)  $\nabla f(z) = d \star (d * z - x), \qquad \mathrm{prox}_{g/L}(v) = \mathrm{sign}(v) \odot \max\big(|v| - \tfrac{\beta}{L},\, 0\big)$

where the proximal operator is the soft thresholding operator of Eqn. 22. In this case, the proximal algorithm finds the closest minimizer to the convolutional least-squares model-fitting term that projects onto the $\ell_1$ ball (with radius proportional to $\beta$).
The FISTA algorithm of (Beck and Teboulle, 2009) presents an efficient update strategy for solving this problem by incorporating an optimal firstorder method in the gradient computation (discussed in §4.2).
FISTA and ADMMs both have similar computational complexity, each requiring updates to a leastsquares problem and evaluation of a soft thresholding operator. One disadvantage of FISTA is the requirement of gradientbased updates to the functional term. ADMMs, on the other hand, make a more general set of assumptions, requiring only that the objective value in each iteration is reduced. In cases where solving the objective is similar in complexity to evaluating the gradient, ADMMs may converge faster.
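A minimal FISTA sketch for the same single-filter, circular-boundary setting as above may make the comparison concrete (the Lipschitz bound and momentum schedule follow Beck and Teboulle; the function name is ours):

```python
import numpy as np

def fista_conv_lasso(x, d, beta, iters=300):
    """FISTA for  min_z 0.5||x - d*z||^2 + beta||z||_1  (single filter,
    circular boundaries). L = max|d_hat|^2 bounds the Lipschitz constant
    of the smooth term's gradient."""
    n = len(x)
    dh = np.fft.fft(d, n)
    xh = np.fft.fft(x)
    L = np.max(np.abs(dh)) ** 2
    z = np.zeros(n)
    v = z.copy()                          # extrapolated (momentum) point
    t = 1.0
    for _ in range(iters):
        # gradient of the smooth term at v (Eqn. 24): d correlated with the residual
        grad = np.real(np.fft.ifft(np.conj(dh) * (dh * np.fft.fft(v) - xh)))
        w = v - grad / L
        z_new = np.sign(w) * np.maximum(np.abs(w) - beta / L, 0.0)  # prox step
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        v = z_new + ((t - 1.0) / t_new) * (z_new - z)               # momentum
        z, t = z_new, t_new
    return z
```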
3.3 Iterative Optimization
The classical approach to convolution is to assume Dirichlet boundary conditions, i.e. that values outside the domain of consideration are zero. In such a case, for a signal $x \in \mathbb{R}^n$ and filter $d \in \mathbb{R}^m$, the coefficients satisfy $z \in \mathbb{R}^{n-m+1}$ so that the full convolution $d * z$ has length $n$. Another approach is to take only the convolution between the fully overlapping portions of the signals - 'valid' convolution - which results in $z \in \mathbb{R}^{n+m-1}$.
In both approaches, the signals must be convolved in the spatial domain, which is an $\mathcal{O}(nm)$ operation.
One way to alleviate the computational cost is to assume periodic extension of the signal, where convolution is then diagonalized by the Fourier transform,

(25)  $x = d * z = F^{-1}\big(\hat{d} \odot \hat{z}\big)$

where $\hat{d} = Fd$ and $\hat{z} = Fz$ (see §4.5 for the correct method of padding $d$ to size). This method of convolution has cost $\mathcal{O}(n \log n)$.

Both ADMMs and FISTA can take advantage of iterative methods, by taking accelerated (proximal) gradient steps. In the next section we show how solving the entire system of equations in the Fourier domain is only slightly more complex than performing a single gradient step, and can lead to faster overall convergence of the ADMM. An important distinction between FISTA and ADMMs is that FISTA is constrained to use gradient updates - it cannot take advantage of direct solvers.
3.4 Direct Optimization
Given Dirichlet boundary assumptions, a filter kernel can be represented (in 2D) as a block-Toeplitz-Toeplitz matrix $D_k$, such that $f(z)$ of Eqn. 23 becomes,

(26)  $f(z) = \tfrac{1}{2}\|x - Dz\|_2^2$

where,

(27)  $D = [D_1, D_2, \ldots, D_K], \qquad z = [z_1^T, z_2^T, \ldots, z_K^T]^T$

which is the canonical least-squares problem. There are a plethora of direct solvers for this problem; however, unlike Toeplitz matrices, there are no known fast methods for inverting block-Toeplitz-Toeplitz matrices, so general purpose solvers must be used. This requires constructing the full $D$. For an $N \times N$ input, and $K$ filters each of support $M \times M$, the matrix $D$ will have $\mathcal{O}(KM^2N^2)$ nonzero values.
One way to alleviate the computational cost is to assume periodic extension of the signal, where convolution is then diagonalized by the Fourier transform. This provides an efficient strategy for inverting the system,

(28)  $\hat{f}(\hat{z}) = \tfrac{1}{2}\|\hat{x} - \hat{D}\hat{z}\|_2^2$

where,

(29)  $\hat{D} = [\mathrm{diag}(\hat{d}_1), \mathrm{diag}(\hat{d}_2), \ldots, \mathrm{diag}(\hat{d}_K)]$

The equivalence between Eqn. 23 and Eqn. 28 relies upon Parseval's theorem,

(30)  $\|z\|_2^2 = c\,\|\hat{z}\|_2^2$

where $c$ is a constant scaling factor between the domains. Since the $\ell_2$ norm is rotation invariant, the minimizer of Eqn. 28 is the Fourier transform of the minimizer of Eqn. 23.
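The scaling constant in Eqn. 30 depends on the DFT normalization convention; with NumPy's unnormalized transform it is $c = 1/n$, which is easy to verify:

```python
import numpy as np

z = np.random.default_rng(0).standard_normal(64)
zh = np.fft.fft(z)
# With numpy's unnormalized DFT, the constant in Eqn. 30 is c = 1/n:
assert np.isclose(np.sum(z ** 2), np.sum(np.abs(zh) ** 2) / len(z))
```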
The system of Eqn. 19 can be solved directly in an efficient manner in the Fourier domain by observing that each frequency band in a single image is related only to the same frequency band in each of the filters and coefficients. Permuting the variables so that terms belonging to the same frequency are grouped, the problem decomposes into independent linear systems, one per frequency band $\omega$,

(31)  $\big(\tilde{d}\tilde{d}^H + \rho I\big)\tilde{z} = \tilde{d}\hat{x}(\omega) + \rho\big(\tilde{y} - \tilde{\lambda}\big)$

where $\tilde{d}, \tilde{z}, \tilde{y}, \tilde{\lambda} \in \mathbb{C}^K$ collect the values of the $K$ filters and coefficient maps at frequency $\omega$. Finding the optima of $\tilde{z}$ across all channels for frequency band $\omega$ now involves inverting,

(32)  $\tilde{d}\tilde{d}^H + \rho I$

Note that $\tilde{d}\tilde{d}^H$ is a rank-1 matrix. Thus from the matrix inversion lemma (Appendix B),

(33)  $\big(\tilde{d}\tilde{d}^H + \rho I\big)^{-1} = \tfrac{1}{\rho}\Big(I - \tfrac{\tilde{d}\tilde{d}^H}{\rho + \tilde{d}^H\tilde{d}}\Big)$

where $\tilde{d}^H\tilde{d}$ is a scalar. The solution for each frequency band can thus be computed as,

(34)  $\tilde{z} = \tfrac{1}{\rho}\Big(\tilde{b} - \tilde{d}\,\tfrac{\tilde{d}^H\tilde{b}}{\rho + \tilde{d}^H\tilde{d}}\Big), \qquad \tilde{b} = \tilde{d}\hat{x}(\omega) + \rho\big(\tilde{y} - \tilde{\lambda}\big)$

which is just a series of elementwise multiplications, so a solution can be found without any explicit inversion.
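The rank-1 inversion of Eqns. 33-34 can be sanity-checked against a general-purpose solver for a single (hypothetical) frequency band:

```python
import numpy as np

rng = np.random.default_rng(1)
K, rho = 5, 0.5
d = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # filter values at one frequency
b = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # right-hand side

# Direct solve of (d d^H + rho I) a = b  -- cubic in K
A = np.outer(d, np.conj(d)) + rho * np.eye(K)
a_direct = np.linalg.solve(A, b)

# Matrix inversion lemma (Eqn. 33) -- linear in K
a_fast = (b - d * (np.conj(d) @ b) / (rho + np.real(np.conj(d) @ d))) / rho

assert np.allclose(a_direct, a_fast)
```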
The assumption of periodic extension is sometimes not indicative of the structure observed in the signal of interest, and can cause artefacts along the boundaries. In such a case, a more sensible assumption is to assume Neumann boundary conditions, or symmetric reflection across the boundary. This assumption tends to minimize boundary distortion for small displacements. The resulting matrix can be diagonalised by the discrete cosine transform (DCT). There exists a generalization of Parseval’s theorem that extends to the DCT, however some care must be taken with understanding the nuances between the 16 types of DCT. (Martucci, 1993) provides a comprehensive exposé on the topic.
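The boundary distortion argument above can be illustrated numerically: for a clearly non-periodic signal (a linear ramp), symmetric (Neumann) extension distorts a smoothed boundary far less than periodic wrap-around. This sketch uses NumPy's `pad` modes rather than the DCT machinery itself:

```python
import numpy as np

x = np.arange(32, dtype=float)        # a linear ramp: clearly non-periodic
d = np.ones(3) / 3.0                  # centered averaging filter

# "Ground truth": smooth the linearly extrapolated signal
ref = np.convolve(np.concatenate(([-1.0], x, [32.0])), d, mode='valid')

periodic = np.convolve(np.pad(x, 1, mode='wrap'), d, mode='valid')
neumann = np.convolve(np.pad(x, 1, mode='symmetric'), d, mode='valid')

err_periodic = np.max(np.abs(periodic - ref))
err_neumann = np.max(np.abs(neumann - ref))
assert err_neumann < err_periodic     # reflection distorts the boundary far less
```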
4 Solving for Filters
Solving for the filters involves optimizing,

(35)  $\arg\min_{d} \; \tfrac{1}{2}\|x - d * z\|_2^2 \quad \text{s.t.} \quad \|d\|_2^2 \le 1$

This is a classical least-squares objective with norm inequality constraints. Norm constraints on the bases are necessary since there always exists a linear transformation of $d$ and $z$ which keeps $d * z$ unchanged whilst making $\|z\|_1$ approach zero (scaling $d$ up and $z$ down correspondingly). Inequality constraints are sufficient since deflation of $d$ will always cause the solution to lie on the constraint boundary (whilst forming a convex set).
Eqn. 35 is a quadratically constrained quadratic program (QCQP), which is difficult to optimize in general. Each of the following methods relaxes this form in one way or another to make it more tractable to solve.
The convolutional form of the least squares term also poses some challenges for optimization, since (i) the filter has smaller support than the image it is being convolved with, and (ii) forming an explicit multiplication between the filters and a convolutional matrix form of the images is prohibitively expensive (as per §3.4).
In the case of multiple images and filters,

(36)  $\arg\min_{d} \; \sum_{i=1}^{N} \tfrac{1}{2}\big\|x_i - \textstyle\sum_{k=1}^{K} d_k * z_{ik}\big\|_2^2 \quad \text{s.t.} \quad \|d_k\|_2^2 \le 1 \;\; \forall k$

one can see that all filters are jointly involved in reconstructing the inputs, and so must be updated jointly.
4.1 Gradient Descent
The gradient of the objective with respect to filter $d_k$ is given by,

(37)  $\nabla_{d_k} f = \sum_{i=1}^{N} z_{ik} \star \Big(\sum_{j=1}^{K} d_j * z_{ij} - x_i\Big)$

This involves collecting the correlation statistics across the coefficient maps and images. Since the filters are of smaller support than the coefficient maps and images, we collect only "valid" statistics, or regions that don't incur boundary effects. In the autocorrelation of $z$, one of the arguments must be zero padded to the appropriate size.
Given the gradient direction, updating the filters involves,

(38)  $d^{t+1} = d^t - \eta \nabla f(d^t)$

Computing the step size, $\eta$, can be done either via line search or by solving a 1D optimization problem which minimizes the reconstruction error in the gradient direction,

(39)  $\eta = \arg\min_{\eta} \; \sum_{i=1}^{N} \tfrac{1}{2}\big\|x_i - \textstyle\sum_{k=1}^{K} (d_k - \eta g_k) * z_{ik}\big\|_2^2, \qquad g_k = \nabla_{d_k} f$

the closed-form solution to which is,

(40)  $\eta = \frac{\sum_k \|g_k\|_2^2}{\sum_i \big\|\sum_k g_k * z_{ik}\big\|_2^2}$

and involves evaluating only convolutions of the gradient with the coefficient maps (fewer if the residual term has been previously computed as part of a stopping criterion, etc.). In the case of a large number of inputs, stochastic gradient descent is typically used (Mairal et al., 2009).

Typically after each gradient step, if the new iterate exists outside the unit ball, the result is projected back onto the ball. This is not strictly the correct way to enforce the norm constraints, however it works without side-effects in practice. The ADMM (§4.3) and Lagrange dual (§4.4) methods, on the other hand, both solve for the norm constraints exactly.
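A sketch of this update in NumPy, simplified to one filter, one 1D signal, circular boundaries, and a full-length filter (so the small-support bookkeeping is elided); the function name is ours:

```python
import numpy as np

def filter_update(x, z, d0, iters=100):
    """Gradient descent on  f(d) = 0.5||x - d*z||^2  with the closed-form
    optimal step of Eqn. 40. Returns the filter and the objective history."""
    n = len(x)
    zh = np.fft.fft(z)
    xh = np.fft.fft(x)
    d = d0.copy()
    hist = []
    for _ in range(iters):
        dh = np.fft.fft(d, n)
        res = np.real(np.fft.ifft(dh * zh - xh))                 # residual d*z - x
        hist.append(0.5 * np.sum(res ** 2))
        g = np.real(np.fft.ifft(np.conj(zh) * np.fft.fft(res)))  # gradient (Eqn. 37)
        s = np.real(np.fft.ifft(zh * np.fft.fft(g)))             # g * z
        denom = np.sum(s ** 2)
        if denom == 0.0:
            break
        eta = np.sum(g ** 2) / denom                             # optimal step (Eqn. 40)
        d = d - eta * g
    # project back onto the unit l2 ball (here once at the end; in practice
    # the projection is applied after each step, as discussed above)
    d = d / max(1.0, np.linalg.norm(d))
    return d, hist
```

With exact line search the objective is non-increasing, which makes the behaviour easy to verify.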
4.2 Nesterov’s Accelerated Gradient Descent
Prolific mathematician Yurii Nesterov introduced an optimal² first-order method for solving smooth convex functions (Nesterov, 1983). Convolutional least-squares objectives require a straightforward application of accelerated proximal gradient (APG).

² Optimal in the sense that it has a worst-case convergence rate that cannot be improved whilst remaining first-order.
Without laboring on the details introduced by Nesterov, the method involves iteratively placing an isotropic quadratic tangent to the current gradient direction, then shifting the iterate to the minima of the quadratic. The curvature of the quadratic is computed by estimating the Lipschitz smoothness of the objective.
Checking for Lipschitz smoothness feasibility using backtracking requires two projections per iteration. This can be costly since each projection involves multiple convolutions. In practice we find this method no faster than regular gradient descent with optimal stepsize calculation.
4.3 ADMM Partitioning
As per §3.1, treating convolution in the Fourier domain can lead to efficient direct optimization. Unlike solving for the coefficients, however, the learned filters must be constrained to be small support, and there is no way to do this explicitly via Fourier convolution.
An approach to handling this is via ADMMs again,

(41)  $\arg\min_{\hat{d}, s} \; \tfrac{1}{2}\|\hat{x} - \hat{Z}\hat{d}\|_2^2 \quad \text{s.t.} \quad \hat{d} = \Phi s, \;\; \|s\|_2^2 \le 1$

where,

(42)  $\hat{Z} = [\mathrm{diag}(\hat{z}_1), \mathrm{diag}(\hat{z}_2), \ldots, \mathrm{diag}(\hat{z}_K)]$

and $\Phi$ is a submatrix of the Fourier matrix that corresponds to a small spatial support transform.
Intuitively, we are trying to learn a set of filters that minimize reconstruction error in the Fourier domain and have small support in the spatial domain.
Taking the augmented Lagrangian of the objective and optimizing over $\hat{d}$ and $s$ in alternation yields the update strategy,

(43)  $\hat{d}^{t+1} = \arg\min_{\hat{d}} \; \tfrac{1}{2}\|\hat{x} - \hat{Z}\hat{d}\|_2^2 + \tfrac{\rho}{2}\|\hat{d} - \Phi s^t + \lambda^t\|_2^2$
(44)  $s^{t+1} = \arg\min_{\|s\|_2^2 \le 1} \; \tfrac{\rho}{2}\|\hat{d}^{t+1} - \Phi s + \lambda^t\|_2^2$
(45)  $\lambda^{t+1} = \lambda^t + \hat{d}^{t+1} - \Phi s^{t+1}$
In a similar fashion to Eqn. 32, each frequency component of $\hat{d}$ can be solved for independently, jointly across all filters, via a variable reordering that produces dense $K \times K$ systems of equations.
Solving for $s$ appears more involved, however the unconstrained loss function can be minimized in closed form,

(46)  $s^* = \Phi^H\big(\hat{d}^{t+1} + \lambda^t\big)$

since $\Phi$ is an orthonormal matrix, and thus $\Phi^H\Phi = I$. Further, the matrix multiplication can instead be replaced by the inverse Fourier transform, followed by a selection operator $M$ which keeps only the small support region,

(47)  $s^* = M\, F^{-1}\big(\hat{d}^{t+1} + \lambda^t\big)$

Handling the inequality constraints is now trivial. Since $\Phi$ is orthonormal (implying an isotropic regression problem), projecting the optimal solution to the unconstrained problem onto the $\ell_2$ ball,

(48)  $s^{t+1} = \frac{s^*}{\max\big(1, \|s^*\|_2\big)}$

is equivalent to solving the constrained problem.
4.4 Lagrange Dual
Rather than using gradient descent with iterative projection, Eqn. 35 can be solved by taking the Lagrange dual in the Fourier domain,

(49)  $\mathcal{L}(\hat{d}, \lambda) = \tfrac{1}{2}\|\hat{x} - \hat{Z}\hat{d}\|_2^2 + \sum_{k=1}^{K} \lambda_k\big(\|\hat{d}_k\|_2^2 - c\big)$

where $\lambda_k \ge 0$ are the dual variables, and $c$ is the constraint radius. Finding the optimum involves minimizing over the primal variables, and maximizing over the dual variables,

(50)  $\max_{\lambda \ge 0} \min_{\hat{d}} \; \mathcal{L}(\hat{d}, \lambda)$

The closed-form solution for the primal variables is,

(51)  $\hat{d}^* = \big(\hat{Z}^H\hat{Z} + \Lambda\big)^{-1}\hat{Z}^H\hat{x}, \qquad \Lambda = \mathrm{diag}(\lambda) \otimes I$

Substituting this expression for $\hat{d}$ into the Lagrangian, we analytically derive the dual optimization problem.
4.5 Convolving with Small Support Filters in the Fourier Domain
In order to convolve two signals in the Fourier domain, their lengths must agree. This involves padding the shorter signal to the length of the longer. Some care must be taken to avoid introducing phase shifts into the response, however. Given a 2D filter $d$, we can partition it into four blocks,

(52)  $d = \begin{bmatrix} d_1 & d_2 \\ d_3 & d_4 \end{bmatrix}$

where, in the case of odd-sized filters, the blocks are partitioned above and to the left of the central point. Given a 2D image whose size is to be matched, the padded representation can thus be formed by placing the blocks at the corners of an otherwise zero array,

(53)  $d_{\mathrm{pad}} = \begin{bmatrix} d_4 & 0 & d_3 \\ 0 & 0 & 0 \\ d_2 & 0 & d_1 \end{bmatrix}$
This transform is illustrated in Fig. 4. For a comprehensive guide to Fourier domain transforms and identities, including appropriate handling of boundary effects and padding, see (Oppenheim et al., 1996).
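In NumPy the block shuffle of Eqn. 53 amounts to a circular shift of the zero-padded filter so its central tap lands at the origin; the sketch below (function names ours) checks the resulting Fourier-domain filtering against a direct centered circular convolution:

```python
import numpy as np

def pad_filter(d, shape):
    """Zero-pad a small odd-sized 2D filter to `shape`, circularly shifting
    so the filter's central tap lands at the origin (Eqn. 53); this avoids
    introducing a phase shift into the Fourier-domain product."""
    p = np.zeros(shape)
    h, w = d.shape
    p[:h, :w] = d
    return np.roll(p, (-(h // 2), -(w // 2)), axis=(0, 1))

def circ_filter(x, d):
    # Circular 'same'-size filtering via the padded Fourier representation (Eqn. 25).
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(pad_filter(d, x.shape))))
```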
5 Stopping Criteria
For the gradient-based algorithms - gradient descent, APG and FISTA - a sufficient stopping criterion is to threshold the residual between two iterates,

(54)  $\|v^{t+1} - v^t\|_2 \le \epsilon$

where $v$ is the variable being minimized.
Estimating convergence of the ADMM-based methods is more involved, including deviation from primal feasibility,

(55)  $\|z^{t+1} - y^{t+1}\|_2 \le \epsilon_{\mathrm{primal}}$

and dual feasibility,

(56)  $\|\rho\big(y^{t+1} - y^t\big)\|_2 \le \epsilon_{\mathrm{dual}}$
However, in practice it is usually sufficient to measure only primal feasibility, since if the iterates have reached primal feasibility, dual feasibility is unlikely to improve (this is doubly true when using a strategy for increasing $\rho$).
6 Applications
6.1 Example 1  Image and Video Compression
Many image and video coding algorithms such as JPEG (Wallace, 1991) and H.264 (Sullivan, 2004) discretize each image into blocks which are transformed, quantized and coded independently.
Using convolutional sparse coding, an entire image can be coded onto a basis $d$ with sparse reconstruction coefficients $z$. Quality and size can be controlled via the $\beta$ parameter. The basis can either be specific to the image, or a generic basis (such as Gabor) which is part of the coding spec and need not be transferred with the image data.
Since the coefficients of $z$ are exactly sparse by virtue of the soft-thresholding operator, the representation can make effective use of run-length and Huffman entropy coding techniques.³

³ JPEG also uses run-length coding, but its efficiency is a function of the quantization artefacts.
To reconstruct the image, the decoder simply convolves the bases with the transmitted coefficients,

(57)  $x = \sum_{k=1}^{K} d_k * z_k$
This matches the media model well: media is consumed more frequently than it is created, so encoding can be costly (in this case an inverse inference problem) but decoding should be fast – convolutional primitives are hardwareaccelerated on almost all modern chipsets.
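The decoder of Eqn. 57 really is this small; the sketch below reconstructs a signal from a hypothetical transmitted payload of two filters and their sparse coefficient maps (all values are illustrative, not from any codec):

```python
import numpy as np

# Hypothetical transmitted payload: two small filters and their sparse maps
filters = [np.array([0.5, 1.0, 0.5]), np.array([1.0, -1.0])]
n = 16
coeffs = [np.zeros(n - len(f) + 1) for f in filters]
coeffs[0][3] = 2.0      # one activation of the first basis element
coeffs[1][10] = -1.5    # one activation of the second

# Decoder (Eqn. 57): a sum of convolutions, no inference required
x_rec = sum(np.convolve(d, z) for d, z in zip(filters, coeffs))
```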
6.2 Example 2  A basis for Natural Images
The receptive fields of simple cells in the mammalian primary visual cortex can be characterized as being spatially localized, oriented and bandpass. (Olshausen and Field, 1996) hypothesized that such fields could arise spontaneously from an unsupervised strategy for maximizing the sparsity of the representation. Sparsity can be interpreted biologically as a metabolic constraint: firing only a few neurons in response to a stimulus is clearly more energy efficient than firing a large number.
(Olshausen and Field, 1996) used traditional patch-based sparse coding to solve for a set of basis functions. The famous result is that the basis functions resemble Gabor filters. However, a large number of the bases learned are translations of others - an artefact of sampling and treating each image patch independently.
It is well understood that the statistics of natural images are translation invariant (Hyvärinen et al., 2009), i.e. the covariance between any two pixel locations $u$ and $v$ depends only on the distance between them,

(58)  $\mathrm{cov}\big(x(u), x(v)\big) = \Sigma\big(\|u - v\|\big)$

Thus it makes sense to code natural images in a manner that does not depend on the exact position of the stimulus. Convolutional sparse coding permits this, and as a result produces a more varied range of basis elements than simple Gabor filters when coding natural images. The sparsity pattern of convolutional coefficients also has a mapping onto the receptive fields of active neurons in V1.⁴

⁴ Unlike many higher regions within the visual cortex, V1 is retinotopic - the spatial location of stimulus in the visual world is highly correlated with the spatial location of active neurons within V1.
6.3 Example 3  Structure from Motion
Trajectory basis NonRigid Structure from Motion (NRSfM) refers to the process of reconstructing the motion of 3D points of a nonrigid object from only their 2D projected trajectories.
Reconstruction relies on two inherently conflicting factors: (i) the condition of the composed camera and trajectory basis matrix, and (ii) whether the trajectory basis has enough degrees of freedom to model the 3D point trajectory. Typically, (i) is improved with a lowrank basis, and (ii) is improved with a higherrank basis.
The Discrete Cosine Transform (DCT) basis has traditionally been used as a generic basis for encoding motion trajectories, however choosing the correct rank has been a difficult problem. (Zhu and Lucey, 2014) proposed the use of convolutional sparse coding to learn a compact basis that could model the trajectories taken from a corpus of training data.
Learning the basis proceeds as per usual,

(59)  $\arg\min_{d,z} \; \sum_{i=1}^{N} \tfrac{1}{2}\big\|x_i - \textstyle\sum_{k=1}^{K} d_k * z_{ik}\big\|_2^2 + \beta \textstyle\sum_{i,k} \|z_{ik}\|_1 \quad \text{s.t.} \quad \|d_k\|_2^2 \le 1$

where each $x_i$ is a 1D trajectory of arbitrary length, $d$ the trajectory basis being learned, and $z$ the sparse reconstruction coefficients.
Given the convolutional trajectory basis $d$, reconstructing the sparse coefficients for a trajectory from observations involves,

(60)  $\arg\min_{z} \; \tfrac{1}{2}\big\|w - P\textstyle\sum_{k=1}^{K} d_k * z_k\big\|_2^2 + \beta\|z\|_1$

where $w$ are the observations of the points that have been imaged by the camera matrices $P_f$, one for each frame in the trajectory,

(61)  $w_f = P_f x_f$
In single view reconstruction, backprojection is typically enforced as a constraint, and the objective is to minimize the number of nonzero coefficients in the reconstructed trajectory that satisfy this constraint.
A convolutional sparse coded basis produces less reconstruction error than previously explored bases, including one learned from patchbased sparse coding, and a generic DCT basis. This illustrates convolutional sparse coding’s ability to learn low rank structure from misaligned trajectories stemming from the same underlying dynamics (e.g. articulated human motion).
6.4 Example 4 – Mid-level Generative Parts
Zeiler et al. (2010) show how a cascade of convolutional sparse coders can be used to build robust, unsupervised mid-level representations, beyond the edge primitives of §6.2.
The convolutional sparse coder at each level of the hierarchy can be defined as,

(62)  \arg\min_{\mathbf{d}^{(l)},\mathbf{z}^{(l)}} \frac{1}{2} \Big\| \mathbf{x}^{(l)} - \sum_{k=1}^{K} \mathbf{d}^{(l)}_k * \mathbf{z}^{(l)}_k \Big\|_2^2 + \beta \sum_{k=1}^{K} \| \mathbf{z}^{(l)}_k \|_1, \quad \text{s.t.} \ \| \mathbf{d}^{(l)}_k \|_2^2 \le 1,

where the superscript (l) on each element indicates the layer to which it is native.
The inputs to each layer are the sparse coefficients of the previous layer, after being passed through a pooling/subsampling operation: \mathbf{x}^{(l+1)} = \operatorname{pool}(\mathbf{z}^{(l)}). For the first layer, \mathbf{x}^{(1)} = \mathbf{x}, i.e. the input image.
The idea behind this coding strategy is that structure within the signal is aggregated at progressively higher levels: initially edge primitives, then mergers of these primitives into line segments, and eventually recurrent object parts.
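The layer-to-layer plumbing described above can be sketched briefly. In the sketch below, both choices are illustrative assumptions: the pooling operator is taken to be non-overlapping max pooling (the text leaves it abstract), and the per-layer coder is replaced by a crude correlate-and-threshold stand-in where a real implementation would solve (62).

```python
import numpy as np

def pool(z, width=2):
    """Non-overlapping 1-D max pooling/subsampling (one common choice)."""
    n = (len(z) // width) * width
    return z[:n].reshape(-1, width).max(axis=1)

def toy_code(x, filters, thresh=0.5):
    """Stand-in for a full convolutional sparse coder: correlate with each
    filter and hard-threshold the responses."""
    return [np.where(np.abs(c) > thresh, c, 0.0)
            for c in (np.correlate(x, d, mode="same") for d in filters)]

def cascade(x, layer_filters):
    """x^(1) is the image; the pooled coefficients of layer l, x^(l+1) =
    pool(z^(l)), become the input signal of layer l+1."""
    codes = []
    for filters in layer_filters:
        z = toy_code(x, filters)
        codes.append(z)
        x = pool(np.concatenate(z))  # coefficients of all maps, pooled
    return codes
```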
Layers of convolutional sparse coders have also been used to produce high-quality latent representations for convolutional neural networks (Lee et al., 2009; Kavukcuoglu et al., 2010; Chen et al., 2013), though fully-supervised backpropagation across layers has become popular more recently (Krizhevsky et al., 2012).
6.5 Example 5 – Single Image Super-Resolution
Single-Image Super-Resolution (SISR) is the process of reconstructing a high-resolution image from an observed low-resolution image. SISR can be cast as the inverse problem,

(63)  \mathbf{y} = \mathbf{D}(\mathbf{b} * \mathbf{x}),

where \mathbf{x} is the latent high-resolution image that we wish to recover, \mathbf{b} is an anti-aliasing filter, \mathbf{D} is a downsampling matrix and \mathbf{y} is the observed low-resolution image. The system is underdetermined, so there exist infinitely many solutions for \mathbf{x}. A strategy for performing SISR is via a straightforward convolutional extension of (Yang et al., 2010),
(64)  \arg\min_{\mathbf{d}^h, \mathbf{d}^l, \mathbf{z}} \frac{1}{2} \Big\| \mathbf{x} - \sum_{k=1}^{K} \mathbf{d}^h_k * \mathbf{z}_k \Big\|_2^2 + \frac{1}{2} \Big\| \mathbf{y} - \sum_{k=1}^{K} \mathbf{d}^l_k * \mathbf{z}_k \Big\|_2^2 + \beta \sum_{k=1}^{K} \| \mathbf{z}_k \|_1, \quad \text{s.t.} \ \| \mathbf{d}^h_k \|_2^2 \le 1, \ \| \mathbf{d}^l_k \|_2^2 \le 1,

where \mathbf{x} and \mathbf{y} are a high-resolution and derived low-resolution training pair, related through the anti-aliasing filter and downsampling of Eqn. (63), and \mathbf{z} is a common set of coefficients that ties the two representations together. The dictionaries \mathbf{d}^h and \mathbf{d}^l learn a mapping between low- and high-resolution image features.
Given a new low-resolution input image \mathbf{y}^*, the sparse coefficients are first inferred with respect to the low-resolution basis,

(65)  \arg\min_{\mathbf{z}} \frac{1}{2} \Big\| \mathbf{y}^* - \sum_{k=1}^{K} \mathbf{d}^l_k * \mathbf{z}_k \Big\|_2^2 + \beta \sum_{k=1}^{K} \| \mathbf{z}_k \|_1,

and then convolved with the high-resolution basis to reconstruct the high-resolution image,

(66)  \hat{\mathbf{x}} = \sum_{k=1}^{K} \mathbf{d}^h_k * \mathbf{z}_k.
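The inference and synthesis steps (65)–(66) can be sketched in 1D. The function below is a hypothetical helper, not the paper's implementation: it infers shared coefficients against the low-resolution basis via ISTA and then synthesises with the paired high-resolution basis, keeping the coefficient maps at a single common resolution for simplicity (a real SISR pipeline would also handle the resolution change).

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def super_resolve(y, d_low, d_high, beta=0.01, n_iter=400):
    """1-D sketch of (65)-(66): ISTA inference of z against the
    low-resolution filters d_low, then synthesis with the paired
    high-resolution filters d_high. Filters assumed odd-length."""
    K, n = len(d_low), len(y)
    z = [np.zeros(n) for _ in range(K)]
    L = sum(np.abs(d).sum() ** 2 for d in d_low) + 1e-8  # Lipschitz bound
    for _ in range(n_iter):
        resid = sum(np.convolve(zk, d, mode="same")
                    for zk, d in zip(z, d_low)) - y
        for k, d in enumerate(d_low):
            grad = np.correlate(resid, d, mode="same")
            z[k] = soft_threshold(z[k] - grad / L, beta / L)
    # (66): reconstruct with the paired high-resolution filters
    return sum(np.convolve(zk, d, mode="same") for zk, d in zip(z, d_high))
```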
6.6 Example 6 – Visualizing Object Detection Features
(Vondrick et al., 2013) presented a method for visualizing HOG features via the inverse mapping,

(67)  \arg\min_{\mathbf{x}} \big\| \Phi(\mathbf{x}) - \boldsymbol{\phi} \big\|_2^2,

where \mathbf{x} is the image to recover, and \Phi(\cdot) is the mapping of the image into HOG space. Direct optimization of this objective is difficult, since the HOG operator \Phi is highly nonlinear and many-to-one, i.e. multiple distinct images can map to the same HOG representation.
One possible approach to approximating this objective is through paired dictionary learning, in a similar manner to §6.5.
Given an image \mathbf{x} and its representation \boldsymbol{\phi} in the HOG domain, we wish to find two basis sets, \mathbf{d}^x in the image domain and \mathbf{d}^{\phi} in the HOG domain, and a common set of sparse reconstruction coefficients \mathbf{z}, such that,

(68)  \mathbf{x} \approx \sum_{k=1}^{K} \mathbf{d}^x_k * \mathbf{z}_k, \qquad \boldsymbol{\phi} \approx \sum_{k=1}^{K} \mathbf{d}^{\phi}_k * \mathbf{z}_k.
Intuitively, the common reconstruction coefficients force the basis sets to represent the same appearance information albeit in different domains. The basis pairs thus provide a mapping between the domains.
Optimizing this objective is a straightforward extension of the patch-based sparse coding used by (Vondrick et al., 2013),

(69)  \arg\min_{\mathbf{d}^x, \mathbf{d}^{\phi}, \mathbf{z}} \frac{1}{2} \Big\| \mathbf{x} - \sum_{k=1}^{K} \mathbf{d}^x_k * \mathbf{z}_k \Big\|_2^2 + \frac{1}{2} \Big\| \boldsymbol{\phi} - \sum_{k=1}^{K} \mathbf{d}^{\phi}_k * \mathbf{z}_k \Big\|_2^2 + \beta \sum_{k=1}^{K} \| \mathbf{z}_k \|_1, \quad \text{s.t.} \ \| \mathbf{d}^x_k \|_2^2 \le 1, \ \| \mathbf{d}^{\phi}_k \|_2^2 \le 1.
Because we are optimizing over entire images rather than independently sampled patches, the learned bases will (i) produce a less ambiguous mapping between pixel features and HOG features (since translations of features are not separately represented), and (ii) be more expressive for any given basis set size, as a direct result of (i).
Image-scale optimization also reduces blocking artefacts in the image reconstructions, leading to more faithful and plausible representations with potentially finer-grained detail.
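Assuming paired bases have been learned via (69), inverting a feature descriptor follows an infer-then-synthesise pattern: code the descriptor against the feature-domain basis, then reconstruct with the paired image-domain basis. The 1D sketch below is a hypothetical simplification (real HOG features are 2D maps; all names are ours).

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def invert_features(phi, d_feat, d_img, beta=0.01, n_iter=400):
    """Infer shared coefficients z against the feature-domain basis via
    ISTA, then synthesise the image-domain signal with the paired basis
    (cf. (68)). Filters assumed odd-length."""
    K, n = len(d_feat), len(phi)
    z = [np.zeros(n) for _ in range(K)]
    L = sum(np.abs(d).sum() ** 2 for d in d_feat) + 1e-8  # Lipschitz bound
    for _ in range(n_iter):
        resid = sum(np.convolve(zk, d, mode="same")
                    for zk, d in zip(z, d_feat)) - phi
        for k, d in enumerate(d_feat):
            grad = np.correlate(resid, d, mode="same")
            z[k] = soft_threshold(z[k] - grad / L, beta / L)
    return sum(np.convolve(zk, d, mode="same") for zk, d in zip(z, d_img))
```

The shared coefficients are what carry appearance information across domains; the quality of the inversion rests entirely on how well the paired bases were coupled during training.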
References
 Beck and Teboulle [2009] Amir Beck and Marc Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 Boyd [2010] Stephen Boyd. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2010.
 Bristow et al. [2013] Hilton Bristow, Anders Eriksson, and Simon Lucey. Fast Convolutional Sparse Coding. Computer Vision and Pattern Recognition (CVPR), 2013.
 Chalasani and Principe [2012] Rakesh Chalasani and JC Principe. A fast proximal method for convolutional sparse coding. International Joint Conference on Neural Networks (IJCNN), 2012.
 Chen et al. [2013] B Chen, Gungor Polatkan, Guillermo Sapiro, and David Blei. Deep Learning with Hierarchical Convolutional Factor Analysis. Pattern Analysis and Machine Intelligence (PAMI), pages 1–30, 2013.
 Gabay and Mercier [1976] D Gabay and B Mercier. A Dual Algorithm for the Solution of Nonlinear Variational Problems via Finite Element Approximations. Computers and Mathematics with Applications, 1976.
 Glowinski and Marroco [1975] R Glowinski and A Marroco. Sur L’Approximation, par Elements Finis d’Ordre Un, et la Resolution, par PenalisationDualite, d’une Classe de Problemes de Dirichlet non Lineares. Revue Francaise d’Automatique, 1975.
 Grosse et al. [2007] Roger Grosse, Rajat Raina, Helen Kwong, and Andrew Y Ng. Shift-invariant sparse coding for audio classification. UAI, 2007.
 Hyvärinen et al. [2009] Aapo Hyvärinen, Jarmo Hurri, and Patrik O. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Computational Imaging and Vision, 39, 2009.
 Kavukcuoglu et al. [2010] Koray Kavukcuoglu, Pierre Sermanet, Ylan Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. Advances in Neural Information Processing Systems (NIPS), (1):1–9, 2010.
 Krizhevsky et al. [2012] Alex Krizhevsky, I Sutskever, and G Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), pages 1–9, 2012.
 Lee et al. [2007] Honglak Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19:801, 2007.
 Lee et al. [2009] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. International Conference on Machine Learning (ICML), pages 1–8, 2009.
 Lewicki and Sejnowski [1999] MS Lewicki and TJ Sejnowski. Coding time-varying signals using sparse, shift-invariant representations. NIPS, 1999.
 Mairal et al. [2009] Julien Mairal, Francis Bach, and J Ponce. Online dictionary learning for sparse coding. International Conference on Machine Learning (ICML), 2009.
 Martucci [1993] SA Martucci. Symmetric convolution and the discrete sine and cosine transforms: Principles and applications. PhD thesis, Georgia Institute of Technology, 1993.
 Mørup et al. [2008] Morten Mørup, Mikkel N Schmidt, and Lars Kai Hansen. Shift invariant sparse coding of image and music data. Journal of Machine Learning, pages 1–14, 2008.
 Nesterov [1983] Y Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 1983.
 Olshausen and Field [1996] B.A. Olshausen and D.J. Field. Emergence of simplecell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
 Olshausen and Field [1997] Bruno A Olshausen and David J Field. Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research, 37(23):3311–3325, 1997.
 Oppenheim et al. [1996] Alan V Oppenheim, Alan S Willsky, and S. Hamid Nawab. Signals and Systems (2nd Edition). Prentice Hall, 1996.
 Parikh and Boyd [2013] Neal Parikh and Stephen Boyd. Proximal Algorithms. Foundations and Trends in Optimization, 1(3):1–108, 2013.
 Simoncelli et al. [1992] Eero Simoncelli, William Freeman, Edward Adelson, and David Heeger. Shiftable multiscale transforms. IEEE Transactions on Information Theory, 1992.
 Sohn and Jung [2011] K Sohn and DY Jung. Efficient learning of sparse, distributed, convolutional feature representations for object recognition. International Conference on Computer Vision (ICCV), 2011.
 Sullivan [2004] GJ Sullivan. The H.264/AVC advanced video coding standard: Overview and introduction to the fidelity range extensions. In Conference on Applications of Digital Image Processing, pages 1–22, 2004.
 Tibshirani [1996] R Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58(1):267–288, 1996.
 Vondrick et al. [2013] Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, and Antonio Torralba. HOGgles: Visualizing Object Detection Features. International Conference on Computer Vision (ICCV), 2013.
 Wallace [1991] GK Wallace. The JPEG Still Picture Compression Standard. Communications of the ACM, pages 1–17, 1991.
 Wipf et al. [2011] DP Wipf, BD Rao, and Srikantan Nagarajan. Latent Variable Bayesian Models for Promoting Sparsity. IEEE Transactions on Information Theory, 57(9):6236–6255, 2011.
 Yang et al. [2010] Jianchao Yang, John Wright, Thomas Huang, and Yi Ma. Image SuperResolution via Sparse Representation. IEEE Transactions on Image Processing, 19(11):1–13, November 2010.
 Zeiler et al. [2010] MD Zeiler, Dilip Krishnan, and GW Taylor. Deconvolutional networks. Computer Vision and Pattern Recognition (CVPR), 2010.
 Zhang [2010] CunHui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics (arXiv), 38(2):894–942, April 2010.
 Zhu and Lucey [2014] Yingying Zhu and Simon Lucey. Convolutional Sparse Coded Filters for Nonrigid Structure From Motion. Pattern Analysis and Machine Intelligence (PAMI), 2014.
Appendix A Log Likelihood Interpretation
Minimizing the sparse convolutional objective has an equivalent maximum likelihood interpretation,

(70)  \arg\max_{\mathbf{d},\mathbf{z}} P(\mathbf{x} \mid \mathbf{d}, \mathbf{z}) \, P(\mathbf{z}) .

Replacing the product of the two probability distributions with the sum of their logs, and negating the expression, yields,

(71)  \arg\min_{\mathbf{d},\mathbf{z}} \; -\log P(\mathbf{x} \mid \mathbf{d}, \mathbf{z}) - \log P(\mathbf{z}) .

Taking the likelihood to be Gaussian in the reconstruction error, and the prior on the coefficients to be i.i.d. Laplacian,

(72)  P(\mathbf{x} \mid \mathbf{d}, \mathbf{z}) \propto \exp\Big( -\frac{1}{2\sigma^2} \Big\| \mathbf{x} - \sum_{k=1}^{K} \mathbf{d}_k * \mathbf{z}_k \Big\|_2^2 \Big) ,

(73)  P(\mathbf{z}) \propto \exp\Big( -\lambda \sum_{k=1}^{K} \| \mathbf{z}_k \|_1 \Big) .

After choosing appropriate values for the noise variance \sigma^2 and the prior scale \lambda (absorbing both into the single weight \beta = \lambda \sigma^2), we recover the familiar objective,

(74)  \arg\min_{\mathbf{d},\mathbf{z}} \frac{1}{2} \Big\| \mathbf{x} - \sum_{k=1}^{K} \mathbf{d}_k * \mathbf{z}_k \Big\|_2^2 + \beta \sum_{k=1}^{K} \| \mathbf{z}_k \|_1 .
Appendix B Inversion of a Rank-1 + Scaled Identity Matrix
The Woodbury matrix identity (matrix inversion lemma) states,

(75)  (\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}\big(\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U}\big)^{-1}\mathbf{V}\mathbf{A}^{-1} .

Substituting \mathbf{A} = \rho\mathbf{I}, \mathbf{U} = \mathbf{d}, \mathbf{C} = \mathbf{I} and \mathbf{V} = \mathbf{d}^T, where \mathbf{d} is a rank-1 column matrix, gives,

(76)  \big(\rho\mathbf{I} + \mathbf{d}\mathbf{d}^T\big)^{-1} = \frac{1}{\rho}\Big( \mathbf{I} - \frac{\mathbf{d}\mathbf{d}^T}{\rho + \mathbf{d}^T\mathbf{d}} \Big) .
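The closed-form inverse of a rank-1 plus scaled-identity matrix is easy to verify numerically; the short check below compares it against a direct matrix inverse for a random column vector d (dimensions and seed are arbitrary).

```python
import numpy as np

# Check: (rho*I + d d^T)^{-1} == (1/rho) * (I - d d^T / (rho + d^T d))
rng = np.random.default_rng(0)
n, rho = 5, 2.0
d = rng.standard_normal((n, 1))          # rank-1 column matrix
direct = np.linalg.inv(rho * np.eye(n) + d @ d.T)
closed = (np.eye(n) - (d @ d.T) / (rho + float(d.T @ d))) / rho
assert np.allclose(direct, closed)
```

The closed form replaces an O(n^3) inverse with an O(n^2) outer-product correction, which is what makes it useful inside iterative solvers.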