In many important statistical applications, the number of parameters is much larger than the number of observations
. A popular approach to tackle linear regression problems in such high dimension scenarios is to consider convex-type penalties, as popularized by Tibshirani (1996). The use of these penalties relies on a regularization parameter trading data fidelity versus sparsity. Unfortunately, statistical analysis reveals that the optimal should be proportional to the noise level (Bickel et al., 2009), which is rarely known in practice. To tackle this issue, one can jointly estimate the noise level and the regression coefficients. Such a concomitant estimation (Huber and Dutter, 1974; Huber, 1981) has recently been adapted for sparse regression by Owen (2007) and analyzed under several names such as Square root Lasso (Belloni et al., 2011) or scaled Lasso (Sun and Zhang, 2012).
In these latter works, the noise parameter consists of a single variance parameter. However, in various applied settings, mixing data of different natures or coming from different sources is customary to increase the number of observations. This often leads to heteroscedasticity: the data may be contaminated with non-uniform noise levels (differing across features or samples). This is the case for magneto-electroencephalographic (M/EEG) data, where observations come from three different types of sensors (gradiometers, magnetometers and electrodes), leading to very different amplitudes, noise levels and noise covariance matrices. Attempts to cope with heteroscedasticity were analyzed in this context by Daye et al. (2012); Wagener and Dette (2012); Kolar and Sharpnack (2012); Dalalyan et al. (2013). Moreover, fast algorithms relying on smoothing techniques from the optimization community (Nesterov, 2005; Beck and Teboulle, 2012) have been extended to heteroscedastic regression in a multi-task setting, through the Smooth Generalized Concomitant Lasso (SGCL, Massias et al. (2018)). The SGCL is designed to jointly estimate the regression coefficients and the noise co-standard deviation matrix111the square root of the noise covariance matrix
. However, in certain applications, such as with M/EEG data, the number of parameters in the co-standard deviation matrix () is typically equal to the number of observations, making it statistically impossible to estimate accurately.
When observations are contaminated with a strong noise and the signal-to-noise ratio (SNR) is too low, provided measurements can be repeated, a natural idea is to average them. Indeed, under the assumption that the signal of interest is corrupted by some additive independant noise realizations, averaging different measurements divides the noise variance by the number of repetitions. This is classically done in experimental sciences from chemistry, to physics or neuroscience as it generally allows to visually inspect signals. This effect is illustrated in Figure 1 for electroencephalography (EEG) data. By averaging from 5 to 50 repetitions of the electrical response of the brain to a stimulus one can reveal a so-called evoked brain response around 100 ms after stimulation. It is usually this type of averaged data which is plugged into optimization solvers, hence discarding individual observations that could be used to better characterize the noise and improve the statistical estimation (Gramfort et al., 2013; Ou et al., 2009).
In this work, we propose the (), an estimator designed to exploit all available measurements collected during repetitions of experiments. The proposed concomitant formulation of the optimization problem derived has two strong benefits: first, the noise covariance is an explicit parameter of the model, on which it is easy to add structural constraints (block diagonality) and second, smoothing theory leads to a cost function that can be minimized using efficient proximal coordinate descent techniques. By estimating the regression coefficients and the noise structure, this estimator demonstrates improvements in suppport identification compared to estimators using averaged data or assuming homoscedastic noise.
2 Heteroscedastic concomitant estimation
For a matrix its column ( row) is denoted (). Let be the number of repetitions of the experiment. The observations matrices are denoted with being the number of sensors/samples and corresponds, for example, to a number of tasks or a number of time samples. The mean over the repetitions of the observations matrices is written . Let be the design matrix, with features stored column-wise: . The matrix contains the coefficients of the linear regression model. We write (
) for the standard Euclidean norm (inner product) on vectors and matrices,for the norm, for any . For a matrix , () is its row-wise norm, and for any , we write for the Schatten -norm (the
norm of the singular values of). The notation () stands for the set of positive semi-definite matrices (positive definite matrices). For and , if , and if . For a square matrix , represents the trace of and is the Mahalanobis norm induced by . For , we denote , and . The block soft-thresholding operator at level , is denoted , and reads for any vector , .
For , let us write for the associated unit
ball. The identity matrix of sizeis denoted , and is the set of integers from to .
2.1 Model and proposed estimator
We consider observations where repetitions of the same experiment are performed, leading to measurements . We assume each measurement follows the same linear model:
where the entries of are
standard normal distributions, thes are independent and is the co-standard deviation matrix, the square root of the noise covariance matrix.222since we impose , it is uniquely defined Note that even if the observations are different because of the noise , the true parameter and the noise structure are shared across repetitions.
2.2 Connections with previous estimator
In low SNR settings, the standard way to deal with strong noise is to use the averaged observation instead of the raw observations. The associated model reads:
with and has entries drawn from a standard normal distribution. The SNR of the average is multiplied by , yet the number of available samples to characterize the noise goes from to . This explains why averaging is not sufficient to estimate complex heteroscedastic noise models, when is far from a scalar matrix. Our introduced is a generalization of the (Massias et al., 2018) which only targets averaged observations: [, ] estimates the parameters of model (3), by solving:
Note that estimates , while estimates . Since we impose the constraint , we adapt the scaling so that in (4) for future comparisons.
is a particular case of : for and , the solution of is the same as the one of .
The justification for the introduction of is the following: using the quadratic loss , the parameters of model (1) can be estimated by using either or as a data-fitting term. It turns out that in such a case, the two alternatives yield the same solutions (as the two terms are equal up to constants in ). Hence, apart from averaging, the quadratic loss does not leverage the multiple repetitions, and ignores the noise structure. Using the data-fitting term of allows to incorporate this structure.
2.3 Smoothing of the nuclear norm
In this section we shed some light on the properties of , especially with respect to smoothing theory (Nesterov, 2005; Beck and Teboulle, 2012). The following proposition relates the data-fitting term used in and showing the link with the smoothing of the Schatten 1-norm (the trace norm or the nuclear norm). For that, we introduce the following smoothing function:
and the inf-convolution of functions and , defined as .
The -smoothing of the Schatten-1 norm, the function , is the solution of the following (smooth) optimization problem:
[Clipped Square Root] For , let us define the Clipped Square Root operator:
where and is orthogonal.
Note that is the projection of the square root of onto the affine cone .
We can now state explicitly the connection between the , and the Schatten 1-norm.
[Smoothing properties of ]
The solution of the problem in Equation 2, is a solution of:
where and .
Following Section 2.2, similar properties can be obtained for and letting and substituting to in the former proposition.
Ideas similar to the ones of Sections 2.3 and 2.3 can be traced to van de Geer (2016, Lemma 3.4, p. 37), where the following formulation was introduced to prove oracle inequalities for the multivariate square-root Lasso333defined as the solution of Equation 9 with :
In the present contribution, the problem formulation in Section 2.3 is motivated by computational aspects, as it helps to address the combined non-differentiability of the data-fitting term and the penalty term.
Other alternatives to exploit the multiple repetitions without simply averaging them, would consist in investigating other Schatten -norms:
Without smoothing, problems of the form given in Equation 9 have the drawback of having no smooth term, and solvers have to resort to proximal splitting algorithms (the ones by Douglas and Rachford (1956) or Chambolle and Pock (2011)) that can handle the sum of two non-smooth components.
Even if the non-smooth Schatten 1-norm is replaced by the formula in (8), numerical challenges remain: can approach 0 arbitrarily, hence, the gradient of the data-fitting term is not Lipschitz over the optimization domain. A similar problem was raised by Ndiaye et al. (2017) for the concomitant Lasso and leads to the introduction of smoothing techniques to address it.
3 Properties of
We detail the principal results needed to solve Problem (2) numerically in this section, leading to the efficient implementation proposed in Algorithm 1. We first recall some technical results from alternate minimization to optimize composite problems.
3.1 Alternate minimization
is jointly convex in and so minimizing the objective alternatively in and in (see Algorithm 1) converges to a global minimum.
is jointly convex over , see Boyd and Vandenberghe (2004, Example 3.4). This means that is jointly convex in , moreover is linear in , thus is jointly convex in , meaning that is jointly convex in . Moreover the constraint set is convex and thus solving is a convex problem. ∎
As the next proposition shows, optimizing over with being fixed involves the Clipped Square Root operator of Section 2.3. [Minimization in ] Let be fixed. The minimization of with the constraint admits the closed-form solution:
Minimizing in amounts to solving
with . The solution is (see Massias et al. (2018, Appendix A2)), and . ∎
We can now state the update of the term in the alternate minimization process. Here, because of the row-wise separability of , we use block updates over rows, whose closed-form are provided by the next proposition:
Each step of the block minimization of in the line of admits a closed-form solution:
The function to minimize is the sum of a smooth term and a non-smooth but separable term, , whose proximal operator can be computed:
is -smooth with respect to , with partial gradient ,
is row-wise separable over , with .
Hence, proximal block-coordinate descent converges (Tseng and Yun, 2009), and the update reads like Equation 12. The closed-form formula arises since the smooth part of the objective is quadratic and isotropic . ∎
3.2 Critical parameter, duality gap and stopping criterion
As for the Lasso we show that there is a critical parameter, there exists such that whenever is greater than this value, the coefficients recovered vanish: [Critical regularization parameter] For the estimator we have the following property: with ,
First notice that if , then .
Fermat’s rules states that
To ensure the convergence of our algorithm, we use the duality gap as a stopping criterion (as it guarantees a targeted sub-optimality level). For its computation we therefore need to explicitely state the dual optimization problem of Problem (2):
With , the dual formulation of Problem (2) is
In Algorithm 1 the dual point at iteration is obtained through the relations (with the current primal iterate), and then projected on .
Once the quantity has been pre-computed, the cost of updating in does not depend on , hence is the same as working with averaged data. Indeed, being defined as , computing can be done in (details are in Section B.3).
3.3 Statistical comparison
In this section, we show the statistical interest of using all repetitions of the experiments instead of using a mere averaging as would do (remind that the later is equivalent to with and , see Section 2.2).
Let us introduce , the true covariance matrix of the noise ( with our notation). In and alternate minimization consists in a succession of estimations of and (more precisely is estimated along the process). In this section we explain why the estimation of provided by has some more interesting statistical properties with respect to one obtained by . For that, we can compare the estimates of one would obtain provided that the true parameter is known by both and . In such “ideal” scenario, the associated estimators of could be written:
Section 3.3 states that and
are unbiased estimators ofbut our newly introduced , improves the estimation of the covariance structure by a factor , the number of repetitions performed.
Empirically444in that case we plug (resp. ) in Section 3.3, we have also observed that has larger eigenvalues than , leading to a less biased estimation of after clipping the singular values.
The code is released as an open source package and can be found here https://github.com/QB3/CLaR. The implementation is in Python, with Numba compilation (Lam et al., 2015) to increase the speed of the algorithm.
Comparison with other estimators
We compare to other estimators: , the Multi-Task Lasso (, Obozinski et al. 2010) and a version of the MTL with repetitions: . Recall that solves Problem (2) and that solves Problem (4); we also remind the definition of and :
With , first solves:
then the estimator is defined as , with .
The cost of an epoch of coordinate descent and the cost of computing the gap for each for each algorithm are summarized in Table 1.
4.1 Support recovery
Here we demonstrate the ability of our estimator to recover the support the ability to identify the predictive features. There are observations, features, tasks. The design is random with Toeplitz-correlated features with parameter (correlation between and is ), and its columns have unit Euclidean norm. The true coefficient has non-zeros rows whose entries are independent and normally centered distributed. is a Toeplitz matrix with parameter . The SNR is fixed and constant across all repetitions
For Figures 5, 4, 3 and 2, fhe figure of merit is the ROC curve, the true positive rate against the false positive rate. For the four estimators, the ROC curve is obtained by varying the value of the regularization parameter on a geometric grid of points, starting from (specific to each algorithm) to , the latest being also specific and chosen to obtain large enough false positive rates.
On Figure 2 we can see that when the SNR is high (top left), all curves reach the (0, 1) point. This means that for each algorithm, there exists a such that the estimated support is exactly the true one. However, when the SNR decreases (top right, bottom left), the performance of , and starts to drop, while that of remains stable. Finally, when the SNR is too low (bottom right), all algorithms perform poorly, but still performs better. This highlights the capacity of to leverage multiple repetitions of measurements to finely estimate the heteroscedastic noise.
Noise structure influence
Figure 3 represents the ROC curves of , , and for different values of . As increases, the noise becomes less and less heteroscedastic: from top left to bottom right, the performance of and increases as they are designed to exploit correlations in the noise, while the performance of and decreases, as their homoscedastic model becomes less and less valid.
The same idea is presented in a different manner in Figure 4 which shows the cases where performs well and poorly. For instance when the structure of the correlation matrix of the noise is close to identity, performs poorly. This a limitation of (and ) as if the noise turns out to be homoscedastic, just adds model-useless parameters that are likely to fit the noise.
Influence of the number of repetitions
Figure 5 shows ROC curves of , , and for different values of , starting from , where and are equivalent (as well as and ), to . It can be seen that even with outperforms the other estimators, and that with benefits more than any of the other algorithms from the large number of repetitions.
4.2 Real data
We now evaluate our estimators on real magneto and electroencephalography (M/EEG) data. The M/EEG recordings measure the electrical potential and magnetic field induced by the active neurons. Data are time series of lengthwith sensors and sources (locations in the brain). Because the propagation of the electromagnetic fields is driven by the linear Maxwell equations, one can assume that the relation between the measurements and the amplitudes of sources in the brain is linear. The M/EEG inverse problem consists in identifying . Because of the limited number of sensors (a few hundred in practice), as well as the physics of the problem, the M/EEG inverse problem is severely ill-posed and needs regularization that provides plausible biological solutions. Because the experiments are usually short (less than 1 s.) and focused on specific cogntive functions, the number of active sources is expected to be small, is assumed to be row-sparse. Thus the M/EEG inverse problem fits the framework of Section 2.
We use the sample dataset from MNE (Gramfort et al., 2014). The experimental conditions are here auditory stimulations in the right or left ears or visual stimulations in the right or left visual fields, leading to different active locations in the brain (different for each type of stimulation). For this experiment, we keep only the gradiometer magnetic channels, and we reject one of them due to strong artifacts. This leads to gradiometers, signals. The length of the temporal series is , and the data contains repetitions. We choose a source space of size (oct-3 resolution). The orientation is fixed, and normal to the cortical mantle.
To generate the semi-real data we use the real design matrix and the real co-standard deviation matrix , estimated on pre-stimulation data. We then generate plausible MEG signals with MNE. The signals being contaminated with correlated noise, if one wants to use homoscedastic solvers it is necessary to whiten the data first (and thus to have an estimation of the covariance matrix, the later often being poor or not known). In this experiment we demonstrate that without this whitening process, the homoscedastic solver fails, whereas succeeds (we dropped here since it performed poorly on synthetic data and was way to slow to converge).
However, it would not make sense to apply our solver directly on the design matrix . Here we describe the preprocessing applied to and , using information from only. First we pre-whiten the data and the design matrix by multiplying from the left by the whitener matrix . Then, we rescale each column of the design matrix to have a (Euclidean) norm of . Finally, as in Section 4.1, Figure 6 is obtained by varying the estimator-specific regularization parameter from to on a geometric grid.
Results of this experiment are in Figure 6. With (top) active sources in the brain, auditory left and auditory right (), the estimator performs poorly and does not recover the full support before reaching a false positive rate (FPR) of . It makes sense since the is not designed to cope with heteroscedastic noise, and the data is not whitened in this experiment. also performs poorly, because of its statistical limitations: the number of observations used to estimate the covariance matrix in is too low (here observations to estimate parameters). performs better and is able to recover the true sources with a lower FPR. It is worth noting that in observations are used to estimate the parameters of the covariance matrix.
Figure 6 (bottom) shows an experiment with active sources, auditory left, auditory right and visual left, can recover sources among with a low FPR, whereas and do not identify the sources before at least a FPR of .
This work introduces , a sparse regression estimator designed to handle heteroscedastic noise in the context of repeated observations, a standard framework in applied sciences such as neuroimaging. The resulting optimization problem can be solved efficiently with standard tools, and the algorithmic cost is the same as for single repetition data. The theory of smoothing connects to the Schatten 1-Lasso in a principled manner, which opens the way for the smoothing of other Schatten norms. The benefits of for support recovery in heteroscedastic context were extensively investigated both on simulations and on real data.
This work was funded by ERC Starting Grant SLAB ERC-YStG-676943.
- Beck  A. Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.
- Beck and Teboulle  A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM J. Optim., 22(2):557–580, 2012.
- Belloni et al.  A. Belloni, V. Chernozhukov, and L. Wang. Square-root Lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
- Bickel et al.  P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4):1705–1732, 2009.
- Boyd and Vandenberghe  S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
- Chambolle and Pock  A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120–145, 2011.
- Dalalyan et al.  A. S. Dalalyan, M. Hebiri, K. Meziani, and J. Salmon. Learning heteroscedastic models by convex programming under group sparsity. In ICML, 2013.
- Daye et al.  J. Daye, J. Chen, and H. Li. High-dimensional heteroscedastic regression with an application to eQTL data analysis. Biometrics, 68(1):316–326, 2012.
- Douglas and Rachford  J. Douglas and H. H. Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American mathematical Society, 82(2):421–439, 1956.
- Gramfort et al.  A. Gramfort, D. Strohmeier, J. Haueisen, M. S. Hämäläinen, and M. Kowalski. Time-frequency mixed-norm estimates: Sparse M/EEG imaging with non-stationary source activations. NeuroImage, 70:410–422, 2013.
- Gramfort et al.  A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. NeuroImage, 86:446–460, 2014.
- Huber  P. J. Huber. Robust Statistics. John Wiley & Sons Inc., 1981.
- Huber and Dutter  P. J. Huber and R. Dutter. Numerical solution of robust regression problems. In Compstat 1974 (Proc. Sympos. Computational Statist., Univ. Vienna, Vienna, 1974), pages 165–172. Physica Verlag, Vienna, 1974.
- Kolar and Sharpnack  M. Kolar and J. Sharpnack. Variance function estimation in high-dimensions. In ICML, pages 1447–1454, 2012.
- Lam et al.  S. K. Lam, A. Pitrou, and S. Seibert. Numba: A LLVM-based Python JIT Compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pages 1–6. ACM, 2015.
- Massias et al.  M. Massias, O. Fercoq, A. Gramfort, and J. Salmon. Generalized concomitant multi-task lasso for sparse multimodal regression. In AISTATS, volume 84, pages 998–1007, 2018.
- Ndiaye et al.  E. Ndiaye, O. Fercoq, A. Gramfort, V. Leclère, and J. Salmon. Efficient smoothed concomitant lasso estimation for high dimensional regression. Journal of Physics: Conference Series, 904(1), 2017.
- Nesterov  Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152, 2005.
- Obozinski et al.  G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.
- Ou et al.  W. Ou, M. Hämaläinen, and P. Golland. A distributed spatio-temporal EEG/MEG inverse solver. NeuroImage, 44(3):932–946, Feb 2009.
A. B. Owen.
A robust hybrid of lasso and ridge regression.Contemporary Mathematics, 443:59–72, 2007.
Parikh et al. 
N. Parikh, S. Boyd, E. Chu, B. Peleato, and J. Eckstein.
Foundations and Trends in Machine Learning, 1(3):1–108, 2013.
- Sun and Zhang  T. Sun and C.-H. Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.
- Tibshirani  R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996.
- Tseng and Yun  P. Tseng and S. Yun. Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. J. Optim. Theory Appl., 140(3):513, 2009.
van de Geer 
S. van de Geer.
Estimation and testing under sparsity, volume 2159 of
Lecture Notes in Mathematics.
Lecture notes from the 45th Probability Summer School held in Saint-Four, 2015, École d’Été de Probabilités de Saint-Flour.
- Wagener and Dette  J. Wagener and H. Dette. Bridge estimators and the adaptive Lasso under heteroscedasticity. Math. Methods Statist., 21:109–126, 2012.
Appendix A Smoothing
a.1 Basic properties of inf-convolution
Let be a closed proper convex function and let be a convex function with Lipschitz gradient. Let .
The following holds (see Parikh et al. [2013, p. 136]):
a.2 Smoothing of Schatten norms
In all this section, the variable is a matrix . Let . Let be the Hölder conjugate of , . For the choice , the following holds true:
a.3 Schatten 1-norm (nuclear/trace norm)
For the choice , then for any the following holds true: