## 1 Introduction

Neural activity recorded via measurements of the electrical potential over the scalp by electroencephalography (EEG), or magnetic fields by magnetoencephalography (MEG), is central for our understanding of human cognitive processes and certain pathologies. Such recordings consist of dozens to hundreds of simultaneously recorded signals, for duration going from minutes to hours. In order to describe and quantify neural activity in such multi-gigabyte data, it is classical to decompose the signal in predefined representations such as the Fourier or wavelet bases. It leads to canonical frequency bands such as theta (4–8 Hz), alpha (8–12 Hz), or beta (15–30 Hz) (buzsaki2006rhythms), in which signal power can be quantified. While such linear analyses have had significant impact in neuroscience, there is now a debate regarding whether neural activity consists more of transient bursts of isolated events rather than rhythmically sustained oscillations (van2018neural). To study the transient events and the morphology of the waveforms (mazaheri2008asymmetric; cole2017brain), which matter in cognition and for our understanding of pathologies (jones2016brain; Cole4830), there is a clear need to go beyond traditionally employed signal processing methodologies (cole2018cycle). For instance, a classic Fourier analysis fails to distinguish alpha-rhythms from mu-rhythms, which have the same peak frequency at around 10 Hz, but whose waveforms are different (cole2017brain; hari2017meg)

. The key to many modern statistical analyses of complex data such as natural images, sounds or neural time series is the estimation of data-driven representations. Dictionary learning is one family of techniques, which consists in learning atoms (or patterns) that offer sparse data approximations. When working with long signals in which events can happen at any instant, one idea is to learn

*shift-invariant*atoms. They can offer better signal approximations than generic bases such as Fourier or wavelets, since they are not limited to narrow frequency bands. Multiple approaches have been proposed to solve this shift-invariant dictionary learning problem, such as MoTIF jost2006motif, the sliding window matching gips2017discovering, the adaptive waveform learning hitziger2017adaptive, or the learning of recurrent waveform brockmeier2016learning, yet they all have several limitations, as discussed in Jas2017. A more popular approach, especially in image processing, is the convolutional sparse coding (CSC) model Jas2017; pachitariu2013extracting; kavukcuoglu2010learning; zeiler2010deconvolutional; heide2015fast; wohlberg2016efficient; vsorel2016fast; Grosse-etal:2007; mailhe2008shift

. The idea is to cast the problem as an optimization problem, representing the signal as a sum of convolutions between atoms and activation signals. The CSC approach has been quite successful in several fields such as computer vision

kavukcuoglu2010learning; zeiler2010deconvolutional; heide2015fast; wohlberg2016efficient; vsorel2016fast, biomedical imaging Jas2017; pachitariu2013extracting, and audio signal processing Grosse-etal:2007; mailhe2008shift, yet it was essentially developed for univariate signals. Interestingly, images can be multivariate such as color or hyper-spectral images, yet most CSC methods only consider gray scale images. To the best of our knowledge, the only reference to multivariate CSC is wohlberg2016convolutional, where the author proposes two models well suited for 3-channel images. In the case of EEG and MEG recordings, neural activity is instantaneously and linearly spread across channels, due to Maxwell’s equations hari2017meg. The same temporal patterns are reproduced on all channels with different intensities, which depend on each activity’s location in the brain. To exploit this property, we propose to use a rank-1 constraint on each multivariate atom. This idea has been mentioned in barthelemy2012shift; barthelemy2013multivariate, but was considered less flexible than the full-rank model. Moreover, their proposed optimization techniques are not specific to shift-invariant models, and not scalable to long signals.##### Contribution

In this study, we develop a multivariate model for CSC, using a rank-1 constraint on the atoms to account for the instantaneous spreading of an electromagnetic source over all the channels. We also propose efficient optimization strategies, namely a locally greedy coordinate descent (LGCD) Moreau2018a, and precomputation steps for faster gradient computations. We provide multiple numerical evaluations of our method, which show the highly competitive running time on both univariate and multivariate models, even when working with hundreds of channels. We also demonstrate the estimation performance of the multivariate model by recovering patterns on low signal-to-noise ratio (SNR) data. Finally, we illustrate our method with atoms learned on multivariate MEG data, that thanks to the rank-1 model can be localized in the brain for clinical or cognitive neuroscience studies.

##### Notation

A multivariate signal with time points in is noted , while is a univariate signal. We index time with brackets , while is the channel in

. For a vector

we define the norm as , and for a multivariate signal , we define the time-wise norm as . The transpose of a matrix is denoted by . For a multivariate signal , is obtained by reversal of the temporal dimension, . The convolution of two signals and is denoted by . For , is obtained by convolving every row of by . For , is obtained by summing the convolution between each row of and : . We define as .## 2 Multivariate Convolutional Sparse Coding

In this section, we introduce the convolutional sparse coding (CSC) models used in this work. We focus on 1D-convolution, although these models can be naturally extended to higher order signals such as images by using the proper convolution operators.

##### Univariate CSC

The CSC formulation adopted in this work follows the shift-invariant sparse coding (SISC) model from Grosse-etal:2007. It is defined as follows:

(1) |

where are observed signals, is the regularization parameter, are the temporal atoms we aim to learn, and are signals of activations aka the code associated with . This model assumes that the coding signals are sparse, in the sense that only few entries are nonzero in each signal. In this work, we will also assume that the entries of are positive, which means that the temporal patterns are present each time with the same polarity.

##### Multivariate CSC

The multivariate formulation uses an additional dimension on the signals and on the atoms, since the signal is recorded over channels (mapping to space locations):

(2) |

where are observed multivariate signals, are the spatio-temporal atoms, and are the sparse activations associated with .

##### Multivariate CSC with rank-1 constraint

This model is similar to the multivariate case but it adds a rank-1 constraint on the dictionary, , with being the pattern over channels and the pattern over time. The optimization problem boils down to:

(3) |

The rank-1 constraint is consistent with Maxwell’s equations and the physical model of electrophysiological signals like EEG or MEG, where each source is linearly spread instantaneously over channels with a constant topographic map hari2017meg. Using this assumption, one aims to improve the estimation of patterns under the presence of independent noise over channels. Moreover, it can help separating overlapping sources which are inherently rank-1 but whose sum is generally of higher rank. Finally, as explained below, several computations can be factorized to speed up computational time.

## 3 Model estimation

Problems (1), (2) and (3) share the same structure. They are convex in each variable but not jointly convex. The resolution is done by using a block coordinate descent approach which minimizes alternatingly the objective function over one block of the variables. In this section, we describe this approach on the multivariate with rank-1 constraint case (3), updating iteratively the activations , the spatial patterns , and the temporal pattern .

### 3.1 -step: solving for the activations

Given *fixed* atoms and a regularization parameter , the -step aims to retrieve the activation signals associated to the signals by solving the following -regularized optimization problem:

(4) |

This problem is convex in and can be efficiently solved. In Chalasani2013, the authors proposed an algorithm based on FISTA beck2009fast to solve it. bristow2013fast introduced a method based on ADMM boyd2011distributed to compute efficiently the activation signals . These two methods are detailed and compared by wohlberg2016efficient

, which also made use of the fast Fourier transform (FFT) to accelerate the computations. Recently,

Jas2017 proposed to use L-BFGS byrd1995limited to improve on first order methods. Finally, kavukcuoglu2010learning adapted the greedy coordinate descent (GCD) to solve this convolutional sparse coding problem. However, for long signals, these techniques can be quite slow due the computation of the gradient (FISTA, ADMM, L-BFGS) or the choice of the best coordinate to update in GCD, which are operations that scale linearly in . A way to alleviate this limitation is to use a locally greedy coordinate descent (LGCD) strategy, presented recently in Moreau2018a. Note that problem (missing) is independent for each signal . The computation of each can thus be parallelized, independently of the technique selected to solve the optimization (Jas2017). Therefore, we omit the superscript in the following subsection to simplify the notation.##### Coordinate descent (CD)

The key idea of coordinate descent is to update our estimate of the solution one coordinate at a time. For (missing), it is possible to compute the optimal value of one coordinate given that all the others are fixed. Indeed, the problem (missing) restricted to one coordinate has a closed-form solution given by:

(5) |

where is the canonical basis vector with value 1 at index and 0 elsewhere. When updating the coefficient to the value , is updated with:

(6) |

The term is zero for . Thus, only coefficients of need to be changed (kavukcuoglu2010learning). The CD algorithm updates at each iteration a coordinate to this optimal value. The coordinate to update can be chosen with different strategies, such as the cyclic strategy which iterates over all coordinates (Friedman2007), the randomized CD (Nesterov2010; richtarik2014iteration) which chooses a coordinate at random for each iteration, or the greedy CD (Osher2009) which chooses the coordinate the farthest from its optimal value.

##### Locally greedy coordinate descent (LGCD)

The choice of a coordinate selection strategy results of a tradeoff between the computational cost of each iteration and the improvement it provides. For cyclic and randomized strategies, the iteration complexity is as the coordinate selection can be performed in constant time. The greedy selection of a coordinate is more expensive as it is linear in the signal length . However, greedy selection is more efficient iteration-wise (Nutini2015). Moreau2018a proposed to consider a locally greedy selection strategy for CD. The coordinate to update is chosen greedily in one of subsegments of the signal, at iteration , the selected coordinate is:

(7) |

with . With this strategy, the coordinate selection complexity is linear in the length of the considered subsegment . By choosing , the complexity of update is the same as the complexity of random and cyclic coordinate selection, . We detail the steps of LGCD in Algorithm 1. This algorithm is particularly efficient when the are sparser. Indeed, in this case, only few coefficients need to be updated in the signal, resulting in a low number of iterations. Computational complexities are detailed in Table 1.

### 3.2 -step: solving for the atoms

Given *fixed* activation signals , associated to signals , the -step aims to update the spatial patterns and temporal patterns , by solving:

(8) |

The problem (missing) is convex in each block of variables and , but not jointly convex. Therefore, we optimize first , then , using in both cases a projected gradient descent with an Armijo backtracking line-search wright1999numerical to find a good step size. These steps are detailed in Algorithm 2.

##### Gradient relative to and

The gradient of relatively to and

can be computed using the chain rule. First, we compute the gradient relatively to a full atom

:(9) |

where we reordered this expression to define and . These terms are both constant during a -step and can thus be precomputed to accelerate the computation of the gradients and the cost function . We detail these computations in the supplementary materials (see Section A.1).
Computational complexities are detailed in Table 1. Note that the dependence in is present *only* in the precomputations, which makes the following iterations very fast.
Without precomputations, the complexity of *each* gradient computation in the -step would be .

### 3.3 Initialization

The activations sub-problem (-step) is regularized with a -norm, which induces sparsity: the higher the regularization parameter , the higher the sparsity. Therefore, there exists a value above which the sub-problem solution is always zeros (Hastie2015). As depends on the atoms and on the signals , its value changes after each -step. In particular, its value might change a lot between the initialization and the first -step. This is problematic since we cannot use a regularization above this initial , even though the following might be higher.

Step | Computation | Computed | Rank-1 | Full-rank |

-step | initialization | once | ||

-step | Precomputation | once | ||

-step | M coordinate updates | multiple times | ||

-step | precomputation | once | ||

-step | precomputation | once | ||

-step | Gradient evaluation | multiple times | ||

-step | Function evaluation | multiple times |

The standard strategy to initialize CSC methods is to generate random atoms with Gaussian white noise. However, as these atoms generally poorly correlate with the signals, the initial value of

is low compared to the following ones. For example, on the MEG dataset described later on, we found that the initial is about of the following ones in the univariate case, with . On the multivariate case, it is even more problematic as with , we could have an initial as low asof the following ones. To fix this problem, we propose to initialize the dictionary with random chunks of the signal, projecting each chunk on a rank-1 approximation using singular value decomposition (SVD). We noticed on the MEG dataset that the initial

was then about the same values as the following ones, which allows to use higher regularization parameters. We used this scheme in all our experiments.## 4 Experiments

We evaluated our model on several experiments, using both synthetic and empirical data. First, using MEG data, we demonstrate the speed performance of our algorithm on both univariate and multivariate signals, compared with state-of-the-art CSC methods. We also show that the algorithm scales well with the number of channels. Then, we used synthetic data to show that the multivariate model with rank-1 constraint is more robust to low SNR signals than the univariate models. Finally, we illustrate how our unsupervised model is able to extract simultaneously prototypical waveforms and the corresponding topographic maps of a multivariate MEG signal.

##### Speed performance

To illustrate the performance of our optimization strategy, we monitored its convergence speed on a real MEG dataset.
The somatosensory dataset from the MNE software gramfort2013meg; gramfort2014mne contains responses to median nerve stimulation.
We consider only the gradiometers channels ^{3}^{3}3These channels measure the gradient of the magnetic field and we used the following parameters: , , , and .
First we compared our strategy against three state-of-the-art *univariate* CSC solvers available online. The first was developed by garcia2017convolutional and is based on ADMM. The second and third were developed by Jas2017, and are respectively based on FISTA and L-BFGS.
All solvers shared the same objective function, but as the problem is non-convex, the solvers are not guaranteed to reach the same local minima, even though we started from the same initial settings. Hence, for a fair comparison, we computed the convergence curves relative to each local minimum, and averaged them over 10 different initializations. The results, presented in Figure 1(a, b), demonstrate the competitiveness of our method, for reasonable choices of . Indeed, a higher regularization parameter leads to sparser activations , on which the LGCD algorithm is particularly efficient.
Then, we also compared our method against a multivariate ADMM solver developed by wohlberg2016convolutional. As this solver was quite slow on these long signals, we limited our experiments to channels. The results, presented in Figure 1(c, d), show that our method is faster than the competing method for large .
More benchmarks are available in the supplementary materials.

##### Scaling with the number of channels

The multivariate model involves an extra dimension but its impact on the computational complexity of our solver is limited. Figure 2 shows the average running times of the -step and the -step. Timings are normalized the timings for a single channel. The running times are computed using the same signals from the somatosensory dataset, with the following parameters: , , , . We can see that the scaling of these three operations is sub-linear in . For the -step, only the initial computations for the first and the constants depend linearly on so that the complexity increase is limited compared to the complexity of solving the optimization problem (missing). For the -step, the scaling to compute the gradients is linear with . However, the most expensive operations here are the computation of the constant , which does not on .

##### Finding patterns in low SNR signals

Since the multivariate model has access to more data, we would expect it to perform better compared to the univariate model especially for low SNR signals. To demonstrate this, we compare the two models when varying the number of channels and the SNR of the data. The original dictionary contains two patterns, a square and a triangle, presented in Figure (a). The signals are obtained by convolving the atoms with activation signals , where the activation locations are sampled uniformly in with non-zero activations, and the amplitudes are uniformly sampled in

. Then, a Gaussian white noise with variance

is added to the signal. We fixed , and for our simulated signals. We can see in Figure (a) the temporal patterns recovered for using only one channel and using 5 channels. While the patterns recovered with one channel are very noisy, the multivariate model with rank-1 constraint recovers the original atoms accurately. This can be expected as the univariate model is ill-defined in this situation, where some atoms are superimposed. For the rank-1 model, as the atoms have different spatial maps, the problem is easier. Then, we evaluate the learned temporal atoms. Due to permutation and sign ambiguity, we compute the -norm of the difference between the temporal pattern and the ground truths, or , for all permutations(10) |

Multiple values of were tested and the best loss is reported in Figure (b) for varying noise levels . We observe that independently of the noise level, the multivariate rank-1 model outperforms the univariate one. This is true even for good SNR, as using multiple channels disambiguates the separation of overlapping patterns.

*a*) Patterns recovered with and . The signals were generated with the two simulated temporal patterns and with . (

*b*) Evolution of the recovery loss with for different values of . Using more channels improves the recovery of the original patterns.

##### Examples of atoms in real MEG signals:

We will now show the results of our algorithm on experimental data, using the MNE somatosensory dataset gramfort2013meg; gramfort2014mne. Here we first extract trials from the data. Each trial lasts 6 s with a sampling frequency of 150 Hz (). We selected only gradiometer channels, leading to channels. The signals were notch-filtered to remove the power-line noise, and high-pass filtered at 2 Hz to remove the low-frequency trend. The purpose of the temporal filtering is to remove low frequency drift artifacts which contribute a lot to the variance of the raw signals. Figure (a) shows a recovered non-sinusoidal brain rhythm which resembles the well-known mu-rhythm. The mu-rhythm has been implicated in motor-related activity hari2006action and is centered around 9–11 Hz. Indeed, while the power is concentrated in the same frequency band as the alpha, it has a very different spatial topography (Figure (b)). In Figure (c), the power spectral density (PSD) shows two components of the mu-rhythm – one at around 9 Hz., and a harmonic at 18 Hz as previously reported in (hari2006action). Based on our analysis, it is clear that the 18 Hz component is simply a harmonic of the mu-rhythm even though a Fourier-based analysis could lead us to falsely conclude that the data contained beta-rhythms. Finally, due to the rank-1 nature of our atoms, it is straightforward to fit an equivalent current dipole tuomisto1983studies to interpret the origin of the signal. Figure (d) shows that the atom does indeed localize in the primary somatosensory cortex, or the so-called S1 region with a 59.3% goodness of fit. For results on more MEG datasets, see Section B.2. In notably includes mu-shaped atoms from S2.

## 5 Conclusion

Many neuroscientific debates today are centered around the morphology of the signals under consideration. For instance, are alpha-rhythms asymmetric (mazaheri2008asymmetric)? Are frequency specific patterns the result of sustained oscillations or transient bursts (van2018neural)? In this paper, we presented a multivariate extension to the CSC problem applied to MEG data to help answer such questions. In the original CSC formulation, the signal is expressed as a convolution of atoms and their activations. Our method extends this to the case of multiple channels and imposes a rank-1 constraint on the atoms to account for the instantaneous propagation of electromagnetic fields. We demonstrate the usefulness of our method on publicly available multivariate MEG data. Not only are we able to recover neurologically plausible atoms, but also we are able to find temporal waveforms which are non-sinusoidal. Empirical evaluations show that our solvers are significantly faster compared to existing CSC methods even for the univariate case (single channel). The algorithm scales sublinearly with the number of channels which means it can be employed even for dense sensor arrays with 200-300 sensors, leading to better estimation of the patterns and their origin in the brain. We will release our code online upon publication.

## Acknowledgment

This work was supported by the ERC Starting Grant SLAB ERC-YStG-676943 and by the ANR THALAMEEG ANR-14-NEUC-0002-01

## References

## Appendix A Optimization details

In this section, we give more details about the optimization procedures used to speed-up both -step and -step.

### a.1 Details on the -step optimization

First, let’s recall the objective function, as introduced in Section 3.2:

(A.1) |

which we optimize under the constraints and . To compute the gradient of relatively to a full atom , we introduce some constants and , which are constant during the entire -step:

(A.2) |

Indeed, we have:

(A.3) | ||||

(A.4) | ||||

(A.5) | ||||

(A.6) | ||||

(A.7) | ||||

(A.8) |

where are computed with:

(A.9) |

and where are computed with:

(A.10) |

Note that in the last equation (A.10), the sum only concerns the defined terms, . The computational complexities of and are respectively and . Then, the gradients relative to and are obtained using the chain rule,

(A.11) | ||||

(A.12) |

and can be computed, up to a constant term , with the following

(A.13) |

Algorithm 2 details the different step used in our algorithm to update and .

### a.2 Details on the -step optimization

#### a.2.1 The coordinate update

###### Proposition 1.

The optimal update of the coefficient is given by

with and where is the canonical vector in with value 1 in and value 0 elsewhere.

###### Proof.

For , we will denote the cost difference between our current solution estimate and the signal where the coefficient has been replaced by ,

Let for all . This quantity denotes the residual when is set to 0. It is important to note that it can be re-written as,

and thus, . The cost difference is,

Using this result, we can derive the optimal value to update the coefficient as the solution of the following optimization problem:

(A.14) |

Simple computations show the desired result,

. ∎

#### a.2.2 The update

###### Proposition 2.

When updating the coefficient to the value , is updated with:

(A.15) |

###### Proof.

The value of is independent of the value of . Indeed, the term cancel the contribution of in the convolution . Thus, when updating the value of the coefficient , is not updated. We denote the activation signal where the coefficient as been updated to , ,

For ,

With this relation, it is possible to keep up to date with few operation after each coordinate update. ∎

#### a.2.3 Precomputation for

Similarly to the -step precomputations, we can precompute to speed up the LGCD iterations during the -step. We have:

(A.16) |

In the case of the rank-1 constraint model, we can factorize the computation with:

(A.17) |

The computational complexities are respectively and .

## Appendix B Additional Experiments

### b.1 Speed performance

We present here more benchmarks as described in section 4, yet with different settings. First we used shorter atoms of length instead of , and results are presented in Figure B.1. They confirm the competitiveness of our method, especially when using large regularization parameters. On these problems, the maximum possible regularization was around 90.