In recent years, dictionary learning and convolutional sparse coding techniques (CSC) have been successfully applied in a wide range of topics, including image classification (mairal2009supervised; huang2007sparse), image restoration (aharon2006rm), and signal processing (mairal2010online). The main idea behind these representations is to conjointly learn a dictionary containing the patterns observed in the signal, and sparse activations that encode the temporal or spatial locations where these patterns occur. Previous works have mainly focused on the study of univariate signals or images (garcia2017convolutional), solving the dictionary learning problem using sparsity constraint on atoms activations.
In many applications ranging from videos to neuroimaging, data are multivariate and therefore better encoded by a tensor structure (zhou2013tensor; cichocki2015tensor). This has led in the recent years to an increased interest in adapting statistical learning methods to the tensor framework, in order to efficiently capture multilinear relationships (su2012multivariate). In particular, the low rank setting has been the subject of many previous works in the past few years, as an efficient tool to exploit the structural information of the data, particularly in regression problems (zhou2013tensor; rabusseau2016low; li2017near; he2018boosted)
. However, traditional low rank approaches for regression problems are difficult in this setting, as estimating or finding the rank of a tensor with order at least 3 is significantly more complex than in the traditional matrix case(haastad1990tensor), and requires new algorithms (li2017near). More recently, (he2018boosted) have proposed a new approach –inspired by the SVD decomposition – that combines both low rank and sparsity constraints in the multivariate regression setting.
In this article, we propose to extend the classical CSC model to multivariate data by introducing a new tensor CSC model that combines both low-rank and sparse activation constraints. This model that we call Kruskal Convolutional Sparse Coding (K-CSC) (or Low-Rank Multivariate Convolutional Sparse Coding) offers the advantages of 1) taking into account the underlying structure of the data and 2) using fewer activations to decompose the data, resulting in an improved summary (dictionary) and a better reconstruction of the original multivariate signal.
Our main contributions are as follows. First, we show that under mild assumptions, this new problem can be rewritten as a Kruskal tensor regression, from which we discuss interesting properties. Then, we provide an efficient algorithm, based on an alternating procedure, that solves the K-CSC problem and scales with the number of activation parameters. Finally, we evaluate our algorithm experimentally using a large range of simulated and real tensor data, illustrating the advantages and drawbacks of our model.
2 Low-Rank Multivariate Convolutional Sparse Coding
2.1 Preliminaries on tensors
We first introduce the notation used in this paper and briefly recall some element of tensor algebra (see (kruskal1977three; kolda2006multilinear; kolda2009tensor; sidiropoulos2017tensor) for a more in-depth introduction).
Across the paper, we use calligraphy font for tensors () bold uppercase letters for matrices (
) bold lowercase letters for vectors () and lowercase letter for scalars (). Let and respectively denotes the Frobenius and the norms, and let be the scalar product associated with . The symbol refers to the the outer product, to the Kronecker product, to the Khatri-Rao product, and denotes the -product. The symbol refers to the multidimensional discrete convolutional operator between scalar valued functions where the subscript indices are the dimension involved (see Figure 1). When the signal is unidimensional, reduces to the 1-D discrete convolutional operator.
(CP Decomposition.) For any , and, , , , such that
where . The smallest for which such decomposition exists is called the Canonical Polyadic rank of ( or rank for short), and in this case (1) is referred to as the CP decomposition of .
(Kruskal operator.) With the notation of Proposition 1,the Kruskal operator is defined as
2.2 Model formulation
Let be a multidimensional signal and in a collection of multidimensional atoms such that . The Kruskal Convolutional Sparse Coding model (K-CSC) is defined as
where A) , (with ) are sparse activation tensors with CP-rank lower than , with small, and B) is an additive (sub)gaussian noise, whose every component are independent and centered.
Advantages of low-rank tensor. The addition of a low rank constraint offers two main advantages. First, it reduces the number of unknown activation parameters from (unconstrained model) to . For instance, in the regression case of typical RGB images of size -by--by-, the number of parameters decrease to instead of , resulting in a better scalability of the problem. Second, it exploits the structural information of , and has already been proved to be effective in various contexts (e.g. (guo2012tensor; liu2013tensor)). For example, previous works have shown that the vectorization of an image removes the inherent spatial structure of it while low rank tensor regression produces more interpretable results (zhou2013tensor).
Relation with separable convolution. The low-rank constraint imposes that each activation has a and writes as the sum of at most separable filters (product of multiple one dimensional filters). Problem (2) is therefore a separable convolution problem, which allows to use a FFT resolution and to significantly speed up the calculus of the convolution. As an example, the complexity of filtering an -by- image with a -by- non-separable filter is – instead of for a separable filter.
For simplicity, we assume in this paper that, , . However, it is straightforward to extend to a model where each tensor have a different CP-rank (i.e. ).
2.3 Kruskal CSC as Tensor Regression
CSC models are often rewritten as a regression problem by using a circulant dictionary matrix (garcia2018convolutional). In this section, we adopt the same strategy. We show that the K-CSC model (2) can be reformulated as an instance of the rank- Generalized Linear tensor regression Models introduced in (zhou2013tensor) – also known as Kruskal tensor regression model – (zhou2013tensor) and then use this new formulation to discuss the properties of the model. First we introduce the notion of circulant tensor, which is a generalization of the circulant matrix.
(Circulant tensor.) Let and . We define the (quasi)-circular tensor generated by , , as follows
In other word, contains the atom which is translated in every directions. The following proposition shows the equivalence between K-CSC model and Kruskal regression. The consequence of this relation are studied in the next Section.
(K-CSC as Kruskal regression.) Let in be a collection of multidimensional atoms. Then, (2) is equivalent to
where is in such that and is the linear map defined by
The proposition is a consequence of the following equality
3 Model estimation
In order to solve (2), we minimize the associated negative log-likelihood with an additional sparsity constraint – a common tool in CSC – on the activation tensors. This leads to the following minimization problem:
where and i.f.f.
Notice that the low rank constraint is embedded in by the use of the Kruskal operator and the fact that .
This use of the Kruskal operator can be seen as a generalization of the Burer-Monteiro heuristic
Burer-Monteiro heuristicfor matrix (burer2003nonlinear).
Multiple CP Decompositions.
It should be noted that the CP decomposition is known to be unique when it satisfies the Kruskal condition (kruskal1989rank), but only up to permutation of the normalized factor matrices. Therefore, the that solve (4) may not be unique, but they are isolated equivalent minimizers, and thus this problem does not negatively impact the optimization.
In (4), the sparsity constraint enforces the sparsity of each element of the CP-decomposition for every activation tensors independently. Hence, the sparsity of each mode can be controlled. In addition, it is possible to add a specific ridge penalization (i.e. , ) to (4) to limit numerical instabilities, decrease risks of overfitting and ensure the unity of the solution (zhou2013tensor).
Sparse Low Rank Regression.
A consequence of Proposition 2 is that (4) can be seen as an instance of a low-rank sparse regression problem. It has been shown that while both properties are desirable, a balance between the two constraints has to be found, as the two regularizations may have adversarial influence on the minimizer (richard2012estimation). This is achieved in our setting by using a Ivanov regularization for the rank (CP-rank ) and a Tykhonov regularization for the sparsity (): the solution should be as sparse as possible while having a CP-rank lower than .
Solving the optimization problem with AK-CSC
The non-convex problem (4)(due to the rank constraint) is convex with respect to each block and . Hence, we use a block coordinate strategy to minimize it. Our algorithm, called Alternated K-CSC (AK-CSC), splits the main non-convex problem into several convex subproblems; 1) by freezing and all except one block at a time (-step) 2) by freezing only the activation tensor (-step). Algorithm 1 presents this process.
Activations update, -step. In order to solve (4) with respect to , we proceed as follows: we assume that the dictionary is fixed and we iteratively solve the problem where all mode except the -th one of each activation tensor are constant, for varying between and . In other words, for each value of , we solve the problem
where the is added to improve the minimization process, as previously discussed. Without any loss of generality, we set in the rest of this section, as the other values of can be treated similarly.
The first term of minimization problem (5) can be rewritten as
with , and .
In the following, for all , we denote by the tensor where we add on each dimension to reach the one of .
From the previous proposition, it is clear that in (5), each subproblem is a CSC with multichannel dictionary filters and single-channel activation maps (wohlberg2016convolutional), i.e. we need to solve
Therefore, this Z block step can be solved using standard multi-channel CSC algorithms (see (garcia2018convolutional) for a complete review).
Dictionary update, -step. Given the activation tensors , the dictionary update aims at improving how the model reconstructs by solving
This step presents no significant difference with existing methods. This problem is smooth and convex and can be solved using classical algorithms (mairal2010online; yellin2017blood; chalasani2013fast).
4 Related work
Related models. With specific choices on the parameters or on the dimension values, the K-CSC model reduces to well-known CSC problems. In the following, we enumerate some of them which also use the (multidimensional) convolutional product.
Univariate CSC: For vector-valued atoms and signals (), our model reduces to the -D Convolutional Dictionary Learning model (CDL). The sparse coding step – i.e. the minimization on – is commonly referred as Convolutional Basis Pursuit DeNoising where the two leading approaches are based on the the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) (beck2009fast) and on ADMM (boyd2011distributed). The dictionary update step – i.e. the minimization on – is referred as Method of Optimal Directions (with a constraint on the filter normalization) where, again, the most efficient solutions are based on FISTA and ADMM (garcia2018convolutional).
Multivariate CSC: If and (i.e. no low-rank constraint), our model reduces to the multivariate CDL model also referred as multi-channel CDL. To the best of our knowledge, the only reference to multivariate CSC is (wohlberg2016convolutional), where the author proposes two models for -channel images. More recently, (garcia2018convolutional) propose scalable algorithms for hyper-spectral images (tensor of order with a high dimension on the third mode).
Multivariate CSC with rank-1 constraint: If , and (i.e. vector-valued atoms), our model reduces to the one recently presented in (la2018multivariate). In this work, the authors strongly rely on the rank one constraint to solve the CSC problem and only consider spatio-temporal data.
Differences with recent tensor based approaches. Some previous works have proposed different approaches to CSC in a tensor setting, albeit without a low rank setting – a key component of our approach. In (jiang2018efficient), the authors introduce a new formulation based on a t-linear combination (related to the t-product). In (bibi2017high) they propose a generic CSC model based on tensor factorization strategy called t-SVD which also use the t-product. Notice that, while this new formulation reduces back to 2-D CSC when the tensor order is set to particular sizes, the notion of low-rank on the activations is not considered. Other tensor-based CSC models enforce the low-rank constraint on the dictionary instead of the activations using a Tucker or a CP decomposition (zhang2017tensor; tan2015tensor) or tensor factorization techniques (huang2015convolutional).
In this section we evaluate our tensor-based dictionary learning framework K-CSC on both synthetic and real-world datasets. We compare our algorithm AK-CSC to state-of-the-art dictionary learning algorithm based on ADMM. Results are for tensor-structured output regression problems on which we report -distance. All experiments are conducted on a linux laptop with -core GHz Intel CPUs using Tensorly (JMLR:v20:18-277), Sporco (wohlberg2017sporco) and standard python libraries.
-step solver. As shown in the previous section, the main problem can be rewriten as a regression problem. Hence, to solve the -step, it is possible to use standard tensor regression solvers (zhou2013tensor; li2017near; he2018boosted). However, it necessitates the construction of an enormous circulant tensor which is not tractable in practice due to memory limitation. In section 3, we show that the -step necessitates to solve multi-channel CSC problem with a large amount of channels (). While this problem has received only limited attention for many more than three channels, in our experiment, we used the algorithm proposed in (garcia2018convolutional) which turns out to be scalable regarding the value of .
Initialization. Each activation subproblem is regularized with a -norm, which induces sparsity on a particular mode. As an example, consider the update of the block during the -step (see equation (5)). From tibshirani2015statistical, we know that there exists a value above which the subproblem solution is always zeros. As depends on the "atoms" and on the multichannel signal , its value changes after a complete loop. In particular, its value might change a lot between the initialization and the first loop. This is problematic since we cannot use a regularization above this initial
. The standard strategy to initialize univariate CSC methods is to generate Gaussian white noise atoms. However, as these atoms generally poorly correlate with the signals, the initial value ofare low compared to the following ones. To fix this problem, we use the strategy propose in (la2018multivariate)
and initialize the dictionary with random parts of the signal. Furthermore, in all the experiments, a zero-padding is added to the initial tensor.
To highlight the behavior of AK-CSC and the influence of the hyperparameters, we consider two different scenarios; 1) The rank is unknown 2) The rank is known. In both cases, the true dictionary is given and only the CSC task is evaluated. We generate random tensor atoms of size
where entries follow a Gaussian distribution with mean
and standard deviation in. Each atom is associated to a sparse activation tensor of CP-rank . The third order tensor generated by model (2) is in . First, we illustrate the convergence of AK-CSC by plotting the loss (4) as a function of the rank parameter (Figure 2 A)). Convergence is reached after a reasonable number of full loops. It appears that choosing a rank greater than the true one permits a faster convergence. However, as depicted in Figure 2 B), an over estimation of the rank does not increase nor decrease the reconstruction error. Also note that, as expected, the error after convergence drastically decreases when we reach the true rank . The research of the best set of hyperparameters for is done in . Figure 2 C) illustrates how the three hyperparameters related to the sparsity influence the -distance, i.e. the reconstruction of . We see that our method is robust to small modification of the hyperparameters. This is an important property which facilitates the search of the best set of hyperparameters.
Color Animated Pictures.
We consider a RGB-animated picture composed of images of size that make up the animation of Mario running. Hence, the animated picture is a -th order tensor of size . The objective is to learn a maximum of RGB-animated atoms in in order to reconstruct . We compare the reconstruction of our algorithm AK-CSC when to the ADMM approach without low-rank constraint, which is a classical and efficient CSC solver (see comparative section (wohlberg2016efficient)). Figure 3 A) shows that for the same level of sparsity our method always uses fewer non-zero coefficients and yet provides a better reconstruction. Indeed, the rank constraint on the activations allows to choose more accurate non-zero coefficients. For instance, Figure 3B) shows that even with times less non-zero coefficients, AK-CSC provides a visual better reconstruction.
Functional Magnetic Resonance Imaging dataset.
In a last example, we analyze functional Magnetic Resonance Imaging data (fMRI) from the Autism Brain Imaging Data Exchange (ABIDE) (di2008functional). The fMRI data is a third-order tensor in . We learn atoms of size and of rank . As in the previous example, we compare the reconstruction of our algorithm AK-CSC to the ADMM approach. The resulting reconstruction is displayed on Figure 4 and the learned atoms are in the supplementary materials. It appears that for the same level of sparsity, the reconstruction performance is more efficient with AK-CSC. Furthermore, the learned atoms are more informative than those with the classical method.
In this paper, we introduced K-CSC, a new multivariate convolutional sparse coding model using tensor algebra that enforces both element-wise sparsity and low-rankness of the activations tensors. We provided an efficient algorithm based on alternate minimization called AK-CSC to solve this model, and we showed that K-CSC can be rewritten as a Kruskal Tensor regression problem. Finally, we showed that AK-CSC achieves good performances on both synthetic and real data.
7 Detailed proof
We detail the proof of the following proposition from the main paper.
The first term of minimization problem () can be rewritten as
with , and .
In the following, for all , we denote by the tensor where we add on each dimension to reach the one of .
8 Additional results
8.1 Synthetic data
We provide the full heatmap obtained in the Synthetic data section on Figure 5. This Figure shows how the three hyperparameters related to the sparsity influence the -distance, i.e. the reconstruction of . We see that our method is robust to small modification of the hyperparameters. This is an important property which facilitates the search of the best set of hyperparameters.
8.2 Functional Magnetic Resonance Imaging dataset
The atoms learned from the main article for functional Magnetic Resonance Imaging (fMRI) are exposed Figure LABEL:fig:fMRI_atoms. We can see that our method learns interesting atoms (on the left) while standard methods failed to use the full dictionary (on the right). Indeed, more than half of the atoms learned by ADMM remain noise.
8.3 Color Animated Picture
To highlight the behavior of AK-CSC with regard to its non-convexity, we performed trials of the same minimization program. Results of such an experiments is provided on Figure LABEL:fig:converge_mario. This illustration shows that the convergence is globally identical on each trials. Hence, our algorithm seems to be robust.
8.4 Black and White Animated Picture
In this section, we focus on a black and white version of the animated pictures of Mario. Interestingly, the number of used atoms and of activations is much more smaller for our method. Indeed, it appears that only important atoms are kept as for the other one, the resulting dictionary is redundant. Quantitatively, for a similar error of reconstruction, the number of activations is much smaller. Figure 8 shows one example of reconstruction with associated important atoms.