1 Introduction
Magnetoencephalography (MEG) measures the components of the magnetic field surrounding the head, while Electroencephalography (EEG) measures the electric potential at the surface of the scalp. Both can do so with a temporal resolution of less than a millisecond. Localizing the underlying neural activity on a high resolution grid of the cortex, a problem known as source imaging, is inherently an “illposed” linear inverse problem: Indeed, the number of potential sources is larger than the number of MEG and EEG sensors, which implies that, even in the absence of noise, different neural activity patterns could result in the same electromagnetic field measurements. To limit the set of possible solutions, prior hypotheses on the nature of the source distributions are necessary. The minimumnorm estimates (MNE) for instance are based on Tikhonov regularization which leads to a linear solution [11]. An norm penalty was also proposed by [34], modeling the underlying neural pattern as a sparse collection of focal dipolar sources, hence their name “Minimum Current Estimates” (MCE). These methods have inspired a series of contributions in source localization techniques relying on noise normalization [6, 28] to correct for the depth bias [1] or blocksparse norms [30, 10] to leverage the spatiotemporal dynamics of MEG signals. While such techniques have had some success, source estimation in the presence of complex multidipole configurations remains a challenge. In this work we aim to leverage the anatomical and functional diversity of multisubject datasets to improve localization results.
Related work.
This idea of using multisubject information to improve statistical estimation has been proposed before in the neuroimaging literature. In [20] it is showed that different anatomies across subjects allow for point spread functions that agree on a main activation source but differ elsewhere. Averaging across subjects thereby increases the accuracy of source localization. On fMRI data, [35] proposed a probabilistic dictionary learning model to infer activation maps jointly across a cohort of subjects. A similar idea led [19] to introduce a Bayesian framework to account for functional intersubject variability. To our knowledge, the only contribution formulating the problem as a multitask regression model employs a Group Lasso with an block sparse norm [21]. Yet this forces every potential neural source to be either active for all subjects or for none of them.
Contribution.
The assumption of identical functional activity across subjects is clearly not realistic. Here we investigate several multitask regression models that relax this assumption. One of them is the multitask Wasserstein (MTW) model [15]
. MTW is defined through an Unbalanced Optimal Transport (UOT) metric that promotes support proximity across regression coefficients. However, applying MTW to group level data assumes that the signaltonoise ratio is the same for all subjects. We propose to build upon MTW and alleviate this problem by inferring estimates of both sources and noise variance for each subject. To do so, we follow similar ideas that lead to the concomitant Lasso
[27, 31, 25] or the multitask Lasso [24]. This paper is organized as follows. Section 2 introduces the multitask regression source imaging problem. Section 3 presents some background on UOT metrics and explains how MWE are carried out. Section 4 presents the results of our experiments on both simulated and MEG datasets.Notation.
We denote by
the vector of ones in
and by the set for any integer . The set of vectors in with nonnegative (resp. positive) entries is denoted by (resp. ). On matrices, , and the division operator are applied elementwise. We use for the elementwise multiplication between matrices or vectors. If is a matrix, denotes its row and its column. We define the KullbackLeibler (KL) divergence between two positive vectors by with the continuous extensions and . We also make the convention . The entropy of is defined as . The same definition applies for matrices with an elementwise double sum.2 Source imaging as a multitask regression problem
We formulate in this section the inverse problem of interest in this paper, and recall how a multitask formulation can be useful to carry out a joint estimation of all these parameters through regularization.
Source modeling.
Using a volume segmentation of the MRI scan of each subject, the positions of potential sources are constructed as a set of coordinates uniformly distributed on the cortical surface of the gray matter. Moreover, synchronized currents in the apical dendrites of cortical pyramidal neurons are thought to be mostly responsible for MEG signals
[26]. Therefore, the dipole orientations are usually constrained to be normal to the cortical surface. We model the current density as a set of focal current dipoles with fixed positions and orientations. The purpose of source localization is to infer their amplitudes. The ensemble of possible candidate dipoles forms the source space.Forward modeling.
Let denote the number of sensors (EEG and/or MEG) and the number of sources. Following Maxwell’s equations, at each time instant, the measurements are a linear combination of the current density . However, we observe noisy measurements given by:
(1) 
where is the noise vector. The linear forward operator is called the leadfield or gain matrix, and can be computed by solving Maxwell’s equations using the Boundary element method [12]. Up to a whitening preprocessing step,
can be assumed Gaussian distributed
.Source localization.
Source localization consists in solving in the inverse problem (1) which can be cast as a least squares problem:
(2) 
Since , problem (2) is illposed and additional constraints on the solution are necessary. When analyzing evoked responses, one can promote source configurations made of a few focal sources, e.g. using the norm. This regularization leads to problem (3
) called minimum current estimates (MCE), also known in the machine learning community as the Lasso
[32].(3) 
where
is a tuning hyperparameter.
Common source space.
Here we propose to go beyond the classical pipeline and carry out source localization jointly for subjects. First, dipole positions (features) must correspond to each other across subjects. To do so, the source space of each subject is mapped to a high resolution average brain using morphing where the sulci and gyri patterns are matched in an auxiliary spherical inflating of each brain surface [8]. The resulting leadfields have therefore the same shape with aligned columns.
Multitask framework.
Jointly estimating the current density of each subject can be expressed as a multitask regression problem where some coupling prior is assumed on through a penalty :
(4) 
Following the work of [15], we propose to define using an UOT metric.
3 Minimum Wassertein Estimates
We start this section with background material on UOT. Consider the finite metric space where each element of corresponds to a vertex of the source space. Let be the matrix where corresponds to the geodesic distance between vertices and . Kantorovich [16]
defined a distance for normalized histograms (probability measures) on
. However, it can easily be extended to nonnormalized measures by relaxing marginal constraints [4].Marginal relaxation.
Let be two normalized histograms on . Assuming that transporting a fraction of mass from to is given by , the total cost of transport is given by . Minimizing this total cost with respect to must be carried out on the set of feasible transport plans with marginals and . The (normalized) WassersteinKantorovich distance reads:
(5) 
In practice, if and are positive and normalized current densities, WK will quantify the geodesic distance between their supports along the curved geometry of the cortex. This property makes OT metrics adequate for assessing the proximity of functional patterns across subjects. To allow to be nonnormalized, the marginal constraints in (5) can be relaxed using a KL divergence:
(6) 
where is a hyperparameter that enforces a fit to the marginals.
Entropy regularization.
Entropy regularization was introduced by [5]
to propose a faster and more robust alternative to the direct resolution of the linear programming problem (
5). Formally, this amounts to minimizing the loss whereis a tuning hyperparameter. This penalized loss function can be written:
up to a constant [3]. Combining entropy regularization with marginal relaxation in (6), we get the unbalanced Wasserstein distance as introduced by [4]:(7) 
Generalized Sinkhorn.
Problem (7) can be solved as follows. Let and . Starting from two vectors set to and iterating the scaling operations , until convergence, the minimizer of (7) can be computed as . This algorithm is a generalization of the Sinkhorn algorithm [18]. Since it involves matrixmatrix operations, it benefits from parallel hardware, such as GPUs.
Extension to .
We extend next the Wasserstein distance to signed measures. We adopt a similar idea to what was suggested in [23, 29, 15] using a decomposition into positive and negative parts, where and . For any vectors , we define the generalized Wasserstein distance as:
(8) 
Note that (see [15] for a proof), thus on positive measures . For the sake of convenience, we refer to in (8) by the Wasserstein distance, even though it does not verify indiscernability. In practice, this extension allows to compare current dipoles across subjects according to their polarity which could be either towards the deep or superficial layers of the cortex.
The MTW model.
The multitask Wasserstein model is the specific case of (4) with a penalty promoting both sparsity and supports’ proximity:
(9) 
where are tuning hyperparameters. The OT term in (9) can be seen as a spatial variance. Indeed, the minimizer corresponds to the Wasserstein barycenter with respect to the distance .
Minimum Wasserstein Estimates.
One of the drawbacks of MTW is that is common to all subjects. Indeed, the loss considered in MTW implicitly assumes that the level of noise is the same across subjects. Following the work of [25]
on the smoothed concomitant Lasso, we propose to extend MTW by inferring the specific noise standard deviation
along with the regression coefficient of each subject. This allows to scale the weight of the according to the level of noise. The Minimum Wasserstein Estimates (MWE) model reads:(10) 
where is a predefined constant. This lower bound constraint avoids numerical issues when and therefore the standard deviation estimate also tends to 0. In practice can be set for example using prior knowledge on the variance of the data or as a small fraction of the initial estimate of the standard deviation . In practice we adopt the second option and set .
Algorithm.
By combining (7), (8) and (10), we obtain an objective function taking as arguments . This function restricted to all parameters except is jointly convex [15]. Moreover, each is only coupled with the variable . The restriction on every pair is also jointly convex [25]. Thus the problem is jointly convex in all its variables. We minimize it by alternating optimization. To justify the convergence of such an algorithm, one needs to notice that the nonsmooth norms in the objective are separable [33]. The update with respect to each is given by solving the first order optimality condition (Fermat’s rule):
(11) 
which also corresponds to the empirical estimator of the standard deviation when the constraint is not active. To update the remaining variables, we follow the same optimization procedure laid out in [15] and adapted to MWE in Algorithm 1. Briefly, let (resp. ), when minimizing with respect to one (resp. ), the resulting problem can be written (dropping the exponents for simplicity):
(12) 
which can be solved using proximal coordinate descent [7]. Note that the additional inference of a specific for each subject allows to scale the Lasso penalty depending on their particular level of noise. The final update with respect to can be cast as two Wasserstein barycenter problems, carried out using generalized Sinkhorn iterations [4]. Note that we do not need to compute the transport plans since inferring every source estimate only requires the knowledge of the left marginal which does not require storing in memory.
4 Experiments
Benchmarks: Dirty models and Multilevel Lasso.
As discussed in introduction, standard sparse source localization solvers are based on an norm regularization, applied to the data of each subject independently. We use the independent Lasso estimator as a baseline. We compare MWE to the GroupLasso estimator [37, 2] which was proposed in this context to promote functional consistency across subjects [21]. It falls in the multitask framework of (4) where the joint penalty is defined through an mixed norm where . We also evaluate the performance of more flexible block sparse models where only a fraction of the source estimates are shared across all tasks: Dirty models [14] and the multivel lasso [22]. In Dirty models source estimates are written as a sum of two parts which are penalized with different norms. One common to all subjects (penalty ) and one specific for each subject (penalty ). The Multilevel Lasso (MLL) [22] applies the same idea using instead a product decomposition and a Lasso penalty on both parts. We also compare MWE with MTW to evaluate the benefits of inferring noise levels adaptively.
Simulation data and MEG/fMRI datasets.
We use the public dataset DS117 [36] which provides MEG, EEG and fMRI data of 16 healthy subjects to whom were presented images of famous, unfamiliar and scrambled faces. Using the MRI scan of each subject, we compute a source space and its associated leadfield comprising around 2500 sources per hemisphere [9]. Keeping only MEG gradiometer channels, we have observations per subject. For realistic data simulation, we use the actual leadfields from all subjects, yet restricted to the left hemisphere. We thus have 16 leadfields with . We simulate an inverse solution with sources (sparse vector) by randomly selecting one source per label among predefined labels using the aparc.a2009s parcellation of the Destrieux atlas. To model functional consistency, 50% of the subjects share sources at the same locations, the remaining 50% have sources randomly generated in the same labels (see Figure 1
). Their amplitudes are taken uniformly between 20 and 30 nAm. Their sign is taken at random with a Bernoulli distribution (0.5) for each label. We simulate
using the forward model with a variance matrix . We set so as to have an average signaltonoise ratio across subjects equal to 4 (SNR).We evaluate the performance of all models knowing the ground truth by comparing the best estimates in terms of three metrics: the mean squared error (MSE) to quantify accuracy in amplitude estimation, AUC and a generalized Earth mover distance (EMD) to assess supports estimation. We generalize the PRAUC (Area under the curve Precisionrecall) by defining AUC where PRAUC is computed between the estimated coefficients and the true supports. We compute EMD between normalized values of sources: EMD. Since is expressed in centimeters, WK can be seen as an expectation of the geodesic distance between sources. The mean across subjects is reported for all metrics.
Simulation results.
We set the number of sources to 3 and vary the number of subjects under two conditions: (1) using one leadfield for all subjects, (2) using individual leadfields. Each model is fitted on a grid of hyperparameters and the best AUC/MSE/EMD scores are reported. We perform 30 different trials (with different true activations and noise, different common leadfield for condition (1)) and report the mean within a 95% confidence interval in Figure
2. Various observations can be made. The Group Lasso performs poorly – even compared to independent Lasso – which is expected since sources are not common for all subjects. Nonconvexity allows MLL to be very effective with less than 24 subjects. Its performance yet degrades with more subjects. OTbased models (MWE and MTW) however benefit from the presence of more subjects by leveraging spatial proximity. They reduce the average error distance from 4 cm (Lasso) to less than 1 cm and reach an AUC of 0.9. One can also observe that the estimation of the noise standard deviation in the MTW model does improve performance. Finally, we can appreciate the improvement of multitask models when increasing the number subjects, especially when using different leadfield matrices. We argue that the different folding patterns of the cortex across subjects lead to different dipole orientations thereby increasing the chances of source identification.Results on MEG/fMRI data
The fusiform face area specializes in facial recognition and activates around 170ms after stimulus
[17, 13]. To study this response, we perform MEG source localization using Lasso and MWE. We pick the time point with the peak response for each subject within the interval 150200 ms after visual presentation of famous faces. For both models, we select the smallest tuning parameter for which less than 10 sources are active for each subject. Figure 3 shows how UOT regularization favors activation in the ventral pathway of the visual cortex.The Lasso solutions in Figure 3 show significant differences between subjects. Since no ground truth exists, one could argue that MWE promotes consistency at the expense of individual signatures. To address this concern we compute the standardized fMRI Zscore of the conditions famous vs scrambled faces. We compare Lasso, MWE and fMRI by computing for each subject the ratio largest value in fusiform gyrus / largest absolute value. We report the mean across all subjects in Figure 4. Note that for all subjects, the fMRI Zscore reaches its maximum in the fusiform gyrus, and that MWE regularization leads to more agreement between MEG and fMRI. Figure 5 shows MEG with MWE and fMRI results for subject 2.
Conclusion
We proposed in this work a novel approach to promote functional consistency through a convex model defined using an Unbalanced Optimal Transport regularization. Using a public MEG and fMRI dataset, we presented experiments demonstrating that MWE outperform multitask sparse models in both amplitude and support estimation. We have shown in these experiments that MWE can close the gap between MEG and fMRI source imaging by gathering data from different subjects.
References
 [1] Ahlfors, S.P., Ilmoniemi, R.J., Hämäläinen, M.S.: Estimates of visually evoked cortical currents. Electroencephalography and Clinical Neurophysiology 82(3), 225–236 (2018/11/20 1992)
 [2] Argyriou, A., Evgeniou, T., Pontil, M.: Multitask feature learning. In: NIPS (2007)
 [3] Benamou, J., Carlier, G., Cuturi, M., Nenna, L., Peyré, G.: Iterative Bregman Projections For Regularized Transportation Problems. Society for Industrial and Applied Mathematics (2015)
 [4] Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.X.: Scaling Algorithms for Unbalanced Transport Problems. arXiv:1607.05816 [math.OC] (2017)
 [5] Cuturi, M.: Sinkhorn Distances: Lightspeed Computation of Optimal Transport. NIPS (2013)
 [6] Dale, A.M., Liu, A.K., Fischl, B.R., Buckner, R.L., Belliveau, J.W., Lewine, J.D., Halgren, E.: Dynamic statistical parametric mapping. Neuron 26(1), 55–67 (2000)
 [7] Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. SIAM Journal on Optimization 25, 1997–2023 (2015)
 [8] Fischl, B., Sereno, M.I., Dale, A.M.: Cortical surfacebased analysis: Ii: Inflation, flattening, and a surfacebased coordinate system. NeuroImage 9, 195 – 207 (1999), mathematics in Brain Imaging
 [9] Gramfort, A., Luessi, M., Larson, E., Engemann, D.A., Strohmeier, D., Brodbeck, C., Parkkonen, L., Hämäläinen, M.: MNE software for processing MEG and EEG data. NeuroImage 86 (10 2013)
 [10] Gramfort, A., Strohmeier, D., Haueisen, J., Hämäläinen, M., Kowalski, M.: Timefrequency mixednorm estimates: Sparse M/EEG imaging with nonstationary source activations. NeuroImage 70(0), 410 – 422 (2013)
 [11] Hämäläinen, M.S., Ilmoniemi, R.J.: Interpreting magnetic fields of the brain: minimum norm estimates. Medical & Biological Engineering & Computing 32(1), 35–42 (Jan 1994)
 [12] Hämäläinen, M.S., Sarvas, J.: Feasibility of the homogeneous head model in the interpretation of neuromagnetic fields. Physics in Medicine and Biology 32(1), 91 (1987)
 [13] Henson, R.N., Wakeman, D.G., Litvak, V., Friston, K.J.: A parametric empirical bayesian framework for the EEG/MEG inverse problem: Generative models for multisubject and multimodal integration. Frontiers in human neuroscience 5, 76; 76–76 (08 2011)
 [14] Jalali, A., Ravikumar, P., Sanghavi, S., Ruan, C.: A Dirty Model for Multitask Learning. NIPS (2010)
 [15] Janati, H., Cuturi, M., Gramfort, A.: Wasserstein regularization for sparse multitask regression. Arxiv. preprint (2018)
 [16] Kantorovic, L.: On the translocation of masses. C.R. Acad. Sci. URSS (1942)
 [17] Kanwisher, N., McDermott, J., Chun, M.M.: The fusiform face area: A module in human extrastriate cortex specialized for face perception. Journal of Neuroscience 17(11), 4302–4311 (1997)
 [18] Knopp, P., Sinkhorn, R.: Concerning nonnegative matrices and doubly stochastic matrices. . Pacific Journal of Mathematics 1(2), 343–348 (1967)
 [19] Kozunov, V.V., Ossadtchi, A.: Gala: group analysis leads to accuracy, a novel approach for solving the inverse problem in exploratory analysis of group MEG recordings. Frontiers in Neuroscience 9, 107 (2015)
 [20] Larson, E., Maddox, R.K., Lee, A.K.C.: Improving spatial localization in meg inverse imaging by leveraging intersubject anatomical differences. Frontiers in Neuroscience 8, 330 (2014)
 [21] Lim, M., Ales, J., Cottereau, B.M., Hastie, T., Norcia, A.M.: Sparse eeg/meg source estimation via a group lasso. PLOS (2017)
 [22] Lozano, A., Swirszcz, G.: Multilevel Lasso for Sparse Multitask Regression. ICML (2012)
 [23] Mainini, E.: A description of transport cost for signed measures. Journal of Mathematical Sciences 181(6), 837–855 (Mar 2012)
 [24] Massias, M., Fercoq, O., Gramfort, A., Salmon, J.: Generalized concomitant multitask lasso for sparse multimodal regression. Proceedings of Machine Learning Research, vol. 84, pp. 998–1007. PMLR (09–11 Apr 2018)
 [25] Ndiaye, E., Fercoq, O., Gramfort, A., Leclère, V., Salmon, J.: Efficient smoothed concomitant lasso estimation for high dimensional regression. Journal of Physics: Conference Series 904(1), 012006 (2017)
 [26] Okada, Y.: Empirical bases for constraints in currentimaging algorithms. Brain Topography p. 373–377

[27]
Owen, A.B.: A robust hybrid of lasso and ridge regression. Contemporary Mathematics
443, 59–72 (2007)  [28] PascualMarqui, R.: Standardized lowresolution brain electromagnetic tomography (sloreta): technical details. Methods Find Exp Clin Pharmacol 24, D:5–12 (2002)
 [29] Profeta, A., Sturm, K.T.: Heat flow with dirichlet boundary conditions via optimal transport and gluing of metric measure spaces (2018)
 [30] Strohmeier, D., Bekhti, Y., Haueisen, J., Gramfort, A.: The iterative reweighted mixednorm estimate for spatiotemporal MEG/EEG source reconstruction. IEEE Transactions on Medical Imaging 35(10), 2218–2228 (Oct 2016)

[31]
Sun, T., Zhang, C.H.: Scaled sparse linear regression. Biometrika
99, 879–898 (2012)  [32] Tibshirani, R.: Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society 58(1), 267–288 (1996)
 [33] Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)
 [34] Uutela, K., Hämäläinen, M.S., Somersalo, E.: Visualization of magnetoencephalographic data using minimum current estimates. NeuroImage 10(2), 173–180 (1999)
 [35] Varoquaux, G., Gramfort, A., Pedregosa, F., Michel, V., Thirion, B.: Multisubject dictionary learning to segment an atlas of brain spontaneous activity. In: Information Processing in Medical Imaging. vol. 6801, pp. 562–573. Springer (2011)
 [36] Wakeman, D., Henson, R.: A multisubject, multimodal human neuroimaging dataset. Scientific Data 2(150001) (2015)
 [37] Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society 68(1), 49–67 (2006)