# Generalized Linear Model Regression under Distance-to-set Penalties

Estimation in generalized linear models (GLM) is complicated by the presence of constraints. One can handle constraints by maximizing a penalized log-likelihood. Penalties such as the lasso are effective in high dimensions, but often lead to unwanted shrinkage. This paper explores instead penalizing the squared distance to constraint sets. Distance penalties are more flexible than algebraic and regularization penalties, and avoid the drawback of shrinkage. To optimize distance penalized objectives, we make use of the majorization-minimization principle. Resulting algorithms constructed within this framework are amenable to acceleration and come with global convergence guarantees. Applications to shape constraints, sparse regression, and rank-restricted matrix regression on synthetic and real data showcase strong empirical performance, even under non-convex constraints.

## Authors

• 20 publications
• 16 publications
• 16 publications
07/18/2019

### Least Angle Regression in Tangent Space and LASSO for Generalized Linear Model

We propose sparse estimation methods for the generalized linear models, ...
02/08/2013

### Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models

It has been shown that AIC-type criteria are asymptotically efficient se...
09/02/2020

### Extensions to the Proximal Distance of Method of Constrained Optimization

The current paper studies the problem of minimizing a loss f(x) subject ...
12/05/2018

### Jeffreys' prior, finiteness and shrinkage in binomial-response generalized linear models

This paper studies the finiteness properties of a reduced-bias estimator...
04/29/2021

### Generalized Linear Models with Structured Sparsity Estimators

In this paper, we introduce structured sparsity estimators in Generalize...
09/09/2021

### A Proximal Distance Algorithm for Likelihood-Based Sparse Covariance Estimation

This paper addresses the task of estimating a covariance matrix under a ...
01/10/2014

### Lasso and equivalent quadratic penalized models

The least absolute shrinkage and selection operator (lasso) and ridge re...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction and Background

In classical linear regression, the response variable

follows a Gaussian distribution whose mean

depends linearly on a parameter vector

through a vector of predictors . Generalized linear models (GLMs) extend classical linear regression by allowing to follow any exponential family distribution, and the conditional mean of to be a nonlinear function of mccullagh1989

. This encompasses a broad class of important models in statistics and machine learning. For instance, count data and binary classification come within the purview of generalized linear regression.

In many settings, it is desirable to impose constraints on the regression coefficients. Sparse regression is a prominent example. In high-dimensional problems where the number of predictors exceeds the number of cases , inference is possible provided the regression function lies in a low-dimensional manifold fan2010 . In this case, the coefficient vector is sparse, and just a few predictors explain the response . The goals of sparse regression are to correctly identify the relevant predictors and to estimate their effect sizes. One approach, best subset regression, is known to be NP hard. Penalizing the likelihood by including an penalty (the number of nonzero coefficients) is a possibility, but the resulting objective function is nonconvex and discontinuous. The convex relaxation of regression replaces by the norm . This LASSO proxy for restores convexity and continuity tibshirani1996

. While LASSO regression has been a great success, it has the downside of simultaneously inducing both sparsity and parameter shrinkage. Unfortunately, shrinkage often has the undesirable side effect of including spurious predictors (false positives) with the true predictors.

Motivated by sparse regression, we now consider the alternative of penalizing the log-likelihood by the squared distance from the parameter vector to the constraint set. If there are several constraints, then we add a distance penalty for each constraint set. Our approach is closely related to the proximal distance algorithm lange2016 ; lange2015 and proximity function approaches to convex feasibility problems byrne2001 . Neither of these prior algorithm classes explicitly considers generalized linear models. Beyond sparse regression, distance penalization applies to a wide class of statistically relevant constraint sets, including isotonic constraints and matrix rank constraints. To maximize distance penalized log-likelihoods, we advocate the majorization-minimization (MM) principle BecYanLan1997 ; LanHunYan2000 ; lange2016 . MM algorithms are increasingly popular in solving the large-scale optimization problems arising in statistics and machine learning mairal2015 . Although distance penalization preserves convexity when it already exists, neither the objective function nor the constraints sets need be convex to carry out estimation. The capacity to project onto each constraint set is necessary. Fortunately, many projection operators are known. Even in the absence of convexity, we are able to prove that our algorithm converges to a stationary point. In the presence of convexity, the stationary points are global minima.

In subsequent sections, we begin by briefly reviewing GLM regression and shrinkage penalties. We then present our distance penalty method and a sample of statistically relevant problems that it can address. Next we lay out in detail our distance penalized GLM algorithm, discuss how it can be accelerated, summarize our convergence results, and compare its performance to that of competing methods on real and simulated data. We close with a summary and a discussion of future directions.

### GLMs and Exponential Families:

In linear regression, the vector of responses

is normally distributed with mean vector

and covariance matrix . A GLM preserves the independence of the responses but assumes that they are generated from a shared exponential family distribution. The response is postulated to have mean , where is the th row of a design matrix , and the inverse link function is smooth and strictly increasing mccullagh1989 . The functional inverse of is called the link function. The likelihood of any exponential family can be written in the canonical form

 p(yi|θi,τ)=c1(yi,τ)exp{yθi−ψ(θi)c2(τ)}. (1)

Here is a fixed scale parameter, and the positive functions and are constant with respect to the natural parameter . The function is smooth and convex; a brief calculation shows that . The canonical link function is defined by the condition . In this case, , and the log-likelihood is concave in . Because and are not functions of , we may drop these terms and work with the log-likelihood up to proportionality. We denote this by . The gradient and second differential of amount to

 ∇L=m∑i=1[yi−ψ′(xtiβ)]xiandd2L=−m∑i=1ψ′′(xtiβ)xixti. (2)

As an example, when and , the density (1) is the Gaussian likelihood, and GLM regression under the identity link coincides with standard linear regression. Choosing and

corresponds to logistic regression under the canonical link

with inverse link . GLMs unify a range of regression settings, including Poisson, logistic, gamma, and multinomial regression.

### Shrinkage penalties:

The least absolute shrinkage and selection operator (LASSO) friedman2010 ; tibshirani1996 solves

 (3)

where is a tuning constant that controls the strength of the penalty. The relaxation is a popular approach to promote a sparse solution, but there is no obvious map between and the sparsity level . In practice, a suitable value of is found by cross-validation. Relying on global shrinkage towards zero, LASSO notoriously leads to biased estimates. This bias can be ameliorated by re-estimating under the model containing only the selected variables, known as the relaxed LASSO meinshausen2007 , but success of this two-stage procedure relies on correct support recovery in the first step. In many cases, LASSO shrinkage is known to introduce false positives su2017 , resulting in spurious covariates that cannot be corrected. To combat these shortcomings, one may replace the LASSO penalty by a non-convex penalty with milder effects on large coefficients. The smoothly clipped absolute deviation (SCAD) penalty fan2001 and minimax concave penalty (MCP) zhang2010 are even functions defined through their derivatives

 q′γ(βi,λ)=λ[1{|βi|≤λ}+(γλ−|βi|)+(γ−1)λ1{|βi|>λ}]andq′γ(βi,λ)=λ(1−|βi|λγ)+

for

. Both penalties reduce bias, interpolate between hard thresholding and LASSO shrinkage, and significantly outperform the LASSO in some settings, especially in problems with extreme sparsity. SCAD, MCP, as well as the relaxed lasso come with the disadvantage of requiring an extra tuning parameter

to be selected.

## 2 Regression with distance-to-constraint set penalties

As an alternative to shrinkage, we consider penalizing the distance between the parameter vector and constraints defined by sets . Penalized estimation seeks the solution

 ^β=argminβ[12∑ividist(β,Ci)2−1mm∑j=1L(β∣yj,xj)]:=argminβf(β), (4)

where the are weights on the distance penalty to constraint set . The Euclidean distance can also be written as

 dist(β,Ci)=∥β−PCi(β)∥2,

where denotes the projection of onto . The projection operator is uniquely defined when is closed and convex. If is merely closed, then may be multi-valued for a few unusual external points . Notice the distance penalty is 0 precisely when . The solution (4) represents a tradeoff between maximizing the log-likelihood and satisfying the constraints. When each is convex, the objective function is convex as a whole. Sending all of the penalty constants to produces in the limit the constrained maximum likelihood estimate. This is the philosophy behind the proximal distance algorithm lange2016 ; lange2015 . In practice, it often suffices to find the solution (4) under fixed large. The reader may wonder why we employ squared distances rather than distances. The advantage is that squaring renders the penalties differentiable. Indeed, whenever is single valued. This is almost always the case. In contrast, is typically nondifferentiable at boundary points of even when is convex. The following examples motivate distance penalization by considering constraint sets and their projections for several important models.

### Sparse regression:

Sparsity can be imposed directly through the constraint set Projecting a point onto is trivially accomplished by setting all but the largest entries in magnitude of equal to , the same operation behind iterative hard thresholding algorithms. Instead of solving the -relaxation (3), our algorithm approximately solves the original -constrained problem by repeatedly projecting onto the sparsity set . Unlike LASSO regression, this strategy enables one to directly incorporate prior knowledge of the sparsity level in an interpretable manner. When no such information is available, can be selected by cross validation just as the LASSO tuning constant is selected. Distance penalization escapes the NP hard dilemma of best subset regression at the cost of possible convergence to a local minimum.

### Shape and order constraints:

As an example of shape and order restrictions, consider isotonic regression barlow1972 . For data , isotonic regression seeks to minimize subject to the condition that the are non-decreasing. In this case, the relevant constraint set is the isotone convex cone . Projection onto is straightforward and efficiently accomplished using the pooled adjacent violators algorithm barlow1972 ; chi2014 . More complicated order constraints can be imposed analogously: for instance, might be required of all edges in a directed graph model. Notably, isotonic linear regression applies to changepoint problems wu2001 ; our approach allows isotonic constraints in GLM estimation. One noteworthy application is Poisson regression where the intensity parameter is assumed to be nondecreasing with time.

### Rank restriction:

Consider GLM regression where the predictors and regression coefficients are matrix-valued. To impose structure in high-dimensional settings, rank restriction serves as an appropriate matrix counterpart to sparsity for vector parameters. Prior work suggests that imposing matrix sparsity is much less effective than restricting the rank of in achieving model parsimony zhou2014 . The matrix analog of the LASSO penalty is the nuclear norm penalty. The nuclear norm of a matrix

is defined as the sum of its singular values

. Notice is a convex relaxation of . Including a nuclear norm penalty entails shrinkage and induces low-rankness by proxy.

Distance penalization of rank involves projecting onto the set for a given rank . Despite sacrificing convexity, distance penalization of rank is, in our view, both more natural and more effective than nuclear norm penalization. Avoiding shrinkage works to the advantage of distance penalization, which we will see empirically in Section 4. According to the Eckart-Young theorem, the projection of a matrix onto

is achieved by extracting the singular value decomposition of

and truncating all but the top singular values. Truncating the singular value decomposition is a standard numerical task best computed by Krylov subspace methods golub2012 .

### Simple box constraints, hyperplanes, and balls:

Many relevant set constraints reduce to closed convex sets with trivial projections. For instance, enforcing non-negative parameter values is accomplished by projecting onto the non-negative orthant. This is an example of a box constraint. Specifying linear equality and inequality constraints entails projecting onto a hyperplane or half-space, respectively. A Tikhonov or ridge penalty constraint

requires spherical projection.

Finally, we stress that it is straightforward to consider combinations of the aforementioned constraints. Multiple norm penalties are already in common use. To encourage selection of correlated variables zou2005 , the elastic net includes both and regularization terms. Further examples include matrix fitting subject to both sparse and low-rank matrix constraints richard2012 and LASSO regression subject to linear equality and inequality constraints gaines2016 . In our setting the relative importance of different constraints can be controlled via the weights .

## 3 Majorization-minimization

To solve the minimization problem (4), we exploit the principle of majorization-minimization. An MM algorithm successively minimizes a sequence of surrogate functions majorizing the objective function around the current iterate . See Figure 1. Forcing downhill automatically drives downhill as well lange2016 ; mairal2015

. Every expectation-maximization (EM) algorithm

dempster1977 for maximum likelihood estimation is an MM algorithm. Majorization requires two conditions: tangency at the current iterate , and domination for all . The iterates of the MM algorithm are defined by

 βk+1:=argminβg(β∣βk)

although all that is absolutely necessary is that . Whenever this holds, the descent property

 f(βk+1)≤g(βk+1∣βk)≤g(βk∣βk)=f(βk)

follows. This simple principle is widely applicable and converts many hard optimization problems (non-convex or non-smooth) into a sequence of simpler problems.

To majorize the objective (4), it suffices to majorize each distance penalty . The majorization is an immediate consequence of the definitions of the set distance and the projection operator chi2014 . The surrogate function

 g(β∣βk) = 12∑ivi∥β−PCi(βk)∥22−1mm∑j=1L(β∣yj,xj).

 ∇g(β∣βk) = ∑ivi[β−PCi(βk)]−1mm∑j=1∇L(β∣yj,xj)

and second differential

 d2g(β∣βk) = (∑ivi)In−1mm∑j=1d2L(β∣yj,xj):=Hk. (5)

The score and information appear in equation (2). Note that for GLMs under canonical link, the observed and expected information matrices coincide, and their common value is thus positive semidefinite. Adding a multiple of the identity to the information matrix is analogous to the Levenberg-Marquardt maneuver against ill-conditioning in ordinary regression more1978 . Our algorithm therefore naturally benefits from this safeguard.

Since solving the stationarity equation is not analytically feasible in general, we employ one step of Newton’s method in the form

 βk+1=βk−ηkd2g(βk∣βk)−1∇f(βk),

where is a stepsize multiplier chosen via backtracking. Note here our application of the gradient identity , valid for all smooth surrogate functions. Because the Newton increment is a descent direction, some value of is bound to produce a decrease in the surrogate and therefore in the objective. The following theorem, proved in the Supplement, establishes global convergence of our algorithm under simple Armijo backtracking for choosing :

###### Theorem 3.1.

Consider the algorithm map

 M(β)=β−ηβH(β)−1∇f(β),

where the step size has been selected by Armijo backtracking. Assume that is coercive in the sense . Then the limit points of the sequence are stationary points of . Moreover, the set of limit points is compact and connected.

We remark that stationary points are necessarily global minimizers when is convex. Furthermore, coercivity of

is a very mild assumption, and is satisfied whenever either the distance penalty or the negative log-likelihood is coercive. For instance, the negative log-likelihoods of the Poisson and Gaussian distributions are coercive functions. While this is not the case for the Bernoulli distribution, adding a small

penalty restores coerciveness. Including such a penalty in logistic regression is a common remedy to the well-known problem of numerical instability in parameter estimates caused by a poorly conditioned design matrix park2007 . Since is concave in , the compactness of one or more of the constraint sets is another sufficient condition for coerciveness.

### Generalization to Bregman divergences:

Although we have focused on penalizing GLM likelihoods with Euclidean distance penalties, this approach holds more generally for objectives containing non-Euclidean measures of distance. As reviewed in the Supplement, the Bregman divergence generated by a convex function provides a general notion of directed distance bregman1967 . The Bregman divergence associated with the choice , for instance, is the squared Euclidean distance. One can rewrite the GLM penalized likelihood as a sum of multiple Bregman divergences

 (6)

The first sum in equation (6) represents the distance penalty to the constraint sets . The projection denotes the closest point to in measured under . The second sum generalizes the GLM log-likelihood term where . Every exponential family likelihood uniquely corresponds to a Bregman divergence generated by the conjugate of its cumulant function polson2015 . Hence, is proportional to . The functional form (6) immediately broadens the class of objectives to include quasi-likelihoods and distances to constraint sets measured under a broad range of divergences. Objective functions of this form are closely related to proximity function minimization in the convex feasibility literature byrne2001 ; CenElf1994 ; censor2005 ; xu2016 . The MM principle makes possible the extension of the projection algorithms of censor2005 to minimize this general objective.

Our MM algorithm for distance penalized GLM regression is summarized in Algorithm 1. Although for the sake of clarity the algorithm is written for vector-valued arguments, it holds more generally for matrix-variate regression. In this setting the regression coefficients and predictors are matrix valued, and response has mean . Here the operator stacks the columns of its matrix argument. Thus, the algorithm immediately applies if we replace by and by . Projections requiring the matrix structure are performed by reshaping into matrix form. In contrast to shrinkage approaches, these maneuvers obviate the need for new algorithms in matrix regression zhou2014 .

### Acceleration:

Here we mention two modifications to the MM algorithm that translate to large practical differences in computational cost. Inverting the -by- matrix naively requires flops. When the number of cases , invoking the Woodbury formula requires solving a substantially smaller linear system at each iteration. This computational savings is crucial in the analysis of the EEG data of Section 4. The Woodbury formula says

 (vIn+UV)−1=v−1In−v−2U(Im+v−1VU)−1V

when and are and matrices, respectively. Inspection of equations (2) and (5) shows that takes the required form. Under Woodbury’s formula the dominant computation is the matrix-matrix product , which requires only flops. The second modification to the MM algorithm is quasi-Newton acceleration. This technique exploits secant approximations derived from iterates of the algorithm map to approximate the differential of the map. As few as two secant approximations can lead to orders of magnitude reduction in the number of iterations until convergence. We refer the reader to zhou2011 for a detailed description of quasi-Newton acceleration and a summary of its performance on various high-dimensional problems.

## 4 Results and performance

We first compare the performance of our distance penalization method to leading shrinkage methods in sparse regression. Our simulations involve a sparse length coefficient vector with nonzero entries. Nonzero coefficients have uniformly random effect sizes. The entries of the design matrix are Gaussian random deviates. We then recover from undersampled responses following Poisson and Bernoulli distributions with canonical links. Figure 2 compares solutions obtained using our distance penalties (MM) to those obtained under MCP, SCAD, and LASSO penalties. Relative errors (left) with cases clearly show that LASSO suffers the most shrinkage and bias; MM seems to outperform MCP and SCAD. For a more detailed comparison, the right side of the figure plots mean squared error (MSE) as a function of the number of cases averaged over trials. All methods significantly outperform LASSO, which is omitted for scale, with MM achieving lower MSE than competitors, most noticeably in logistic regression. As suggested by an anonymous reviewer, similar results from additional experiments for Gaussian (linear) regression with comparison to relaxed lasso are included in the Supplement.

For underdetermined matrix regression, we compare to the spectral regularization method developed by Zhou and Li zhou2014 . We generate their cross-shaped true signal and in all trials sample responses

. Here the design tensor

is generated with standard normal entries. Figure 3 demonstrates that imposing sparsity alone fails to recover and that rank-set projections visibly outperform spectral norm shrinkage as varies. The rightmost panel also shows that our method is robust to over-specification of the rank of the true signal to an extent.

We consider two real datasets. We apply our method to count data of global temperature anomalies relative to the 1961-1990 average, collected by the Climate Research Unit jones2016 . We assume a non-decreasing solution, illustrating an instance of isotonic regression. The fitted solution displayed in Figure 4 has mean squared error , clearly obeys the isotonic constraint, and is consistent with that obtained on a previous version of the data wu2001 . We next focus on rank constrained matrix regression for electroencephalography (EEG) data, collected by zhang1995 to study the association between alcoholism and voltage patterns over times and channels. The study consists of individuals with alcoholism and controls, providing binary responses indicating whether subject has alcoholism. The EEG measurements are contained in predictor matrices ; the dimension is thus greater than . Further details about the data appear in the Supplement.

Previous studies apply dimension reduction li2010 and propose algorithms to seek the optimal rank solution hung2013 . These methods could not handle the size of the original data directly, and the spectral shrinkage approach proposed in zhou2014 is the first to consider the full EEG data. Figure 4 shows that our regression solution is qualitatively similar to that obtained under nuclear norm penalization zhou2014 , revealing similar time-varying patterns among channels 20-30 and 50-60. In contrast, ignoring matrix structure and penalizing the norm of yields no useful information, consistent with findings in zhou2014 . However, our distance penalization approach achieves a lower misclassification error of . The lowest misclassification rate reported in previous analyses is by hung2013 . As their approach is strictly more restrictive than ours in seeking a rank solution, we agree with zhou2014

in concluding that the lower misclassification error can be largely attributed to benefits from data preprocessing and dimension reduction. While not visually distinguishable, we also note that shrinking the eigenvalues via nuclear norm penalization

zhou2014 fails to produce a low-rank solution on this dataset.

We omit detailed timing comparisons throughout since the various methods were run across platforms and depend heavily on implementation. We note that MCP regression relies on the MM principle, and the LQA and LLA algorithms used to fit models with SCAD penalties are also instances of MM algorithms fan2010 . Almost all MM algorithms share an overall linear rate of convergence. While these require several seconds of compute time on a standard laptop machine, coordinate-descent implementations of LASSO outstrip our algorithm in terms of computational speed. Our MM algorithm required 31 seconds to converge on the EEG data, the largest example we considered.

## 5 Discussion

GLM regression is one of the most widely employed tools in statistics and machine learning. Imposing constraints upon the solution is integral to parameter estimation in many settings. This paper considers GLM regression under distance-to-set penalties when seeking a constrained solution. Such penalties allow a flexible range of constraints, and are competitive with standard shrinkage methods for sparse and low-rank regression in high dimensions. The MM principle yields a reliable solution method with theoretical guarantees and strong empirical results over a number of practical examples. These examples emphasize promising performance under non-convex constraints, and demonstrate how distance penalization avoids the disadvantages of shrinkage approaches.

Several avenues for future work may be pursued. The primary computational bottleneck we face is matrix inversion, which limits the algorithm when faced with extremely large and high-dimensional datasets. Further improvements may be possible using modifications of the algorithm tailored to specific problems, such as coordinate or block descent variants. Since the linear systems encountered in our parameter updates are well conditioned, a conjugate gradient algorithm may be preferable to direct methods of solution in such cases. The updates within our algorithm can be recast as weighted least squares minimization, and a re-examination of this classical problem may suggest even better iterative solvers. As the methods apply to a generalized objective comprised of multiple Bregman divergences, it will be fruitful to study penalties under alternate measures of distance, and settings beyond GLM regression such as quasi-likelihood estimation.

While our experiments primarily compare against shrinkage approaches, an anonymous referee points us to recent work revisiting best subset selection using modern advances in mixed integer optimization bertsimas2016 . These exciting developments make best subset regression possible for much larger problems than previously thought possible. As bertsimas2016 focus on the linear case, it is of interest to consider how ideas in this paper may offer extensions to GLMs, and to compare the performance of such generalizations. Best subsets constitutes a gold standard for sparse estimation in the noiseless setting; whether it outperforms shrinkage methods seems to depend on the noise level and is a topic of much recent discussion hastie2017 ; mazumder2017 . Finally, these studies as well as our present paper focus on estimation, and it will be fruitful to examine variable selection properties in future work. Recent work evidences an inevitable trade-off between false and true positives under LASSO shrinkage in the linear sparsity regime su2017 . The authors demonstrate that this need not be the case with methods, remarking that computationally efficient methods which also enjoy good model performance would be highly desirable as and approaches possess one property but not the other su2017 . Our results suggest that distance penalties, together with the MM principle, seem to enjoy benefits from both worlds on a number of statistical tasks.

## 6 Acknowledgements

We would like to thank Hua Zhou for helpful discussions about matrix regression and the EEG data. JX was supported by NSF MSPRF #1606177.

## References

• (1) Barlow, R. E., Bartholomew, D. J., Bremner, J., and Brunk, H. D. Statistical inference under order restrictions: The theory and application of isotonic regression. Wiley New York, 1972.
• (2) Becker, M. P., Yang, I., and Lange, K. EM algorithms without missing data. Stat. Methods Med. Res., 6:38–54, 1997.
• (3) Bertsimas, D., King, A., Mazumder, R., et al. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016.
• (4) Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
• (5) Byrne, C. and Censor, Y. Proximity function minimization using multiple Bregman projections, with applications to split feasibility and Kullback–Leibler distance minimization. Annals of Operations Research, 105(1-4):77–98, 2001.
• (6) Censor, Y. and Elfving, T. A multiprojection algorithm using Bregman projections in a product space. Numerical Algorithms, 8(2):221–239, 1994.
• (7) Censor, Y., Elfving, T., Kopf, N., and Bortfeld, T. The multiple-sets split feasibility problem and its applications for inverse problems. Inverse Problems, 21(6):2071–2084, 2005.
• (8) Chi, E. C., Zhou, H., and Lange, K. Distance majorization and its applications. Mathematical Programming Series A, 146(1-2):409–436, 2014.
• (9) Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), pages 1–38, 1977.
• (10) Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
• (11) Fan, J. and Lv, J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101, 2010.
• (12) Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
• (13) Gaines, B. R. and Zhou, H. Algorithms for fitting the constrained lasso. arXiv preprint arXiv:1611.01511, 2016.
• (14) Golub, G. H. and Van Loan, C. F. Matrix computations, volume 3. JHU Press, 2012.
• (15) Hastie, T., Tibshirani, R., and Tibshirani, R. J. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017.
• (16) Hung, H. and Wang, C.-C. Matrix variate logistic regression model with application to EEG data. Biostatistics, 14(1):189–202, 2013.
• (17) Jones, P., Parker, D., Osborn, T., and Briffa, K. Global and hemispheric temperature anomalies–land and marine instrumental records. Trends: a compendium of data on global change, 2016.
• (18) Lange, K., Hunter, D. R., and Yang, I. Optimization transfer using surrogate objective functions (with discussion). J. Comput. Graph. Statist., 9:1–20, 2000.
• (19) Lange, K. Optimization. Springer Texts in Statistics. Springer-Verlag, New York, 2nd edition, 2013.
• (20) Lange, K. MM Optimization Algorithms. SIAM, 2016.
• (21) Lange, K. and Keys, K. L. The proximal distance algorithm. arXiv preprint arXiv:1507.07598, 2015.
• (22) Li, B., Kim, M. K., and Altman, N. On dimension folding of matrix-or array-valued statistical objects. The Annals of Statistics, pages 1094–1121, 2010.
• (23) Mairal, J. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
• (24) Mazumder, R., Radchenko, P., and Dedieu, A. Subset selection with shrinkage: Sparse linear modeling when the SNR is low. arXiv preprint arXiv:1708.03288, 2017.
• (25) McCullagh, P. and Nelder, J. A. Generalized linear models, volume 37. CRC press, 1989.
• (26) Meinshausen, N. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007.
• (27) Moré, J. J. The Levenberg-Marquardt algorithm: implementation and theory. In Numerical analysis, pages 105–116. Springer, 1978.
• (28) Park, M. Y. and Hastie, T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Methodological), 69(4):659–677, 2007.
• (29) Polson, N. G., Scott, J. G., and Willard, B. T. Proximal algorithms in statistics and machine learning. Statistical Science, 30(4):559–581, 2015.
• (30) Richard, E., Savalle, P.-a., and Vayatis, N. Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1351–1358, 2012.
• (31) Su, W., Bogdan, M., and Candes, E. False discoveries occur early on the lasso path. The Annals of Statistics, 45(5), 2017.
• (32) Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), pages 267–288, 1996.
• (33) Wu, W. B., Woodroofe, M., and Mentz, G. Isotonic regression: Another look at the changepoint problem. Biometrika, pages 793–804, 2001.
• (34) Xu, J., Chi, E. C., Yang, M., and Lange, K. A majorization-minimization algorithm for split feasibility problems. arXiv preprint arXiv:1612.05614, 2016.
• (35) Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, 38(2):894–942, 2010.
• (36) Zhang, X. L., Begleiter, H., Porjesz, B., Wang, W., and Litke, A. Event related potentials during object recognition tasks. Brain Research Bulletin, 38(6):531–538, 1995.
• (37) Zhou, H., Alexander, D., and Lange, K. A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing, 21:261–273, 2011.
• (38) Zhou, H. and Li, L. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Methodological), 76(2):463–483, 2014.
• (39) Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Methodological), 67(2):301–320, 2005.

## 7 Proof of Convergence

We repeat the statement of Theorem 3.1 below:

###### Theorem 7.1.

Consider the algorithm map

 M(β)=β−ηβH(β)−1∇f(β),

where the step size has been selected by Armijo backtracking. Assume that is coercive: . Then the limit points of the sequence are stationary points of . Moreover, this set of limit points is compact and connected.

Our algorithm selects the step-size according to the Armijo condition: suppose is a descent direction at in the sense that . The Armijo condition chooses a step size such that

 f(β+tv)≤f(β)+αηdf(β)v,

for a constant . Before proving the statement, the following lemma follows an argument in Chapter 12 of Lan2013 to show that step-halving under the Armijo condition requires finitely many steps.

###### Lemma 7.2.

Given and , there exists an integer such that

 f(β+σsv)≤f(β)+ασsdf(β)v,

where .

###### Proof.

Since is coercive by assumption, its sublevel sets are compact. Namely, the set is compact. Smoothness of the GLM likelihood and squared distance penalty ensure continuity of and . Together with coercivity, this implies that there exist positive constants and such that

for all . Together with the fact that the Euclidean distance to a closed set is a Lipschitz function with Lipschitz constant 1, we produce the inequality

 f(β+ηv)≤f(β)+ηdf(β)v+12η2L∥v∥2 (7)

where . The squared term appearing at the end of (7) can be bounded by

 ∥v∥2=∥H(β)−1∇f(β)∥2≤a2∥∇f(β)∥2.

We next identify a bound for :

 ∥∇f(β)∥2=∥H(β)1/2H(β)−1/2∇f(β)∥2≤∥H(β)1/2∥2∥H(β)−1/2∇f(β)∥2≤bdf(β)H(β)−1∇f(β)=−bdf(β)v. (8)

Combining inequalities (7) and (8) yields

 f(β+ηv)≤f(β)+η(1−a2bL2t)df(β)v.

Thus, the Armijo condition is guaranteed to be satisfied as soon as , where

 s⋆=1lnσln(2(1−α)a2bL).

Of course, in practice a much lower value of may suffice.

We are now ready to prove the original theorem:

###### Proof.

Consider the iterates of the algorithm . Since is continuous, attains its infimum over , and therefore the monotonically decreasing sequence is bounded below. This implies that converges to 0. Let denote the number of backtracking steps taken at the th iteration under the Armijo stopping rule. By Lemma 7.2, is finite, and thus

 f(βk)−f(βk+1) ≥ −ασskdf(βk)vk = ασskdf(βk)H(βk)−1∇f(β) ≥ ασskβ∥∇f(βk)∥2 ≥ ασs⋆+1β∥∇f(βk)∥2.

This inequality implies that converges to 0, and therefore all the limit points of the sequence are stationary points of . Further, taking norms of the update yields the inequality

 ∥βk+1−βk∥ = σsk∥H(βk)−1∇f(βk)∥ ≤ σska∥∇f(βk)∥.

Thus, the iterates are a bounded sequence such that tends to 0, allowing us to conclude that the limit points form a compact and connected set by Propositions 12.4.2 and 12.4.3 in Lan2013 . ∎

## 8 Bregman Divergences

Let be a strictly convex function defined on a convex domain differentiable on the interior of . The Bregman divergence bregman1967 between and with respect to is defined as

 Dϕ(v,u) = ϕ(v)−ϕ(u)−dϕ(u)(v−u). (9)

Note that the Bregman divergence (9) is a convex function of its first argument , and measures the distance between and a first order Taylor expansion of about evaluated at . While the Bregman divergence is not a metric as it is not symmetric in general, it provides a natural notion of directed distance. It is non-negative for all and equal to zero if and only if . Instances of Bregman divergences abound in statistics and machine learning, many useful measures of closeness.

Recall exponential family distributions takes the canonical form

 p(y|θ,τ)=C1(y,τ)exp{yθ−ψ(θ)C2(τ)}.

Each distribution belonging to an exponential family shares a close relationship to a Bregman divergence, and we may explicitly relate GLMs as a special case using this connection. Specifically, the conjugate of its cumulant function , which we denote , uniquely generates a Bregman divergence that represents the exponential family likelihood up to proportionality polson2015 . With denoting the link function, the negative log-likelihood of can be written as its Bregman divergence to the mean:

 −lnp(y|θ,τ)=Dζ(y,g−1(θ))+C(y,τ).

As an example, the cumulant function in the Poisson likelihood is , whose conjugate produces the relative entropy

 Dζ(p,q)=pln(p/q)−p+q.

Similarly, recall that the Bernoulli likelihood in logistic regression has cumulant function . Its conjugate is given by , and generates

 Dζ(p,q)=plnpq+(1−p)ln1−p1−q.

This relationship implies that maximizing the likelihood in an exponential family is equivalent to minimizing a corresponding Bregman divergence between the data and the regression coefficients

. Notice this is a different statement than the well-known equivalence between maximizing the likelihood and minimizing the Kullback-Leibler divergence between the empirical and parametrized distributions. The gradients of the Bregman projection take the form

 ∇Dϕ(PϕCi(βk),β)=d2ϕ(β)(β−PϕCi(βk)).

Further, the notion of Bregman divergence naturally applies to matrices:

 Dϕ(V,U)=ϕ(V)−ϕ(U)−⟨∇ϕ(U),V−U⟩

where denotes the inner product. For instance, the squared Frobenius distance between is generated by the choice of . The MM algorithm therefore applies analogously to objective functions consisting of multiple Bregman divergences.

## 9 EEG dataset

The dataset we consider using rank restricted matrix regression seeks to study the association between alcoholism and the voltage patterns over times and channels from EEG data. The data are collected by zhang1995 , who provide further details of the experiment, and measures subjects over trials. The study consists of individuals with alcoholism and controls. For each subject, channels of electrodes were placed across the scalp, and voltages are recorded at 256 time points sampled at 256 Hz over one second. This is repeated over trials with three different stimuli. Following the practice of previous studies of the data by li2010 ; hung2013 ; zhou2014 , we consider covariates representing the average over all trials of voltages recorded from each electrode. Other than averaging over trials, no data preprocessing is applied. is thus a matrix whose th entries represent the voltage at time in channel or electrode , averaged over the 120 trials. The binary responses indicate whether subject has alcoholism.

As mentioned in the main text, the study by li2010 focuses on reduction of the data via dimension folding, and the matrix-variate logistic regression algorithm proposed by hung2013 is also applied to preprocessed data using a generic dimension reduction technique. The nuclear norm shrinkage proposed by zhou2014 is the first to consider matrix regression on the full, unprocessed data (apart from averaging over the 120 trials). The authors zhou2014 point out that previous methods nonetheless attain better classification rates, likely due to the fact that preprocessing and tuning were chosen to optimize predictive accuracy. Indeed, the lowest misclassification rate reported in previous analyses is by hung2013 , yet the authors show that their method is equivalent to seeking the best rank approximation to the true coefficient matrix in terms of Kullback-Leibler divergence. Since this approach is strictly more restrictive than ours, which attains an error of , we agree with zhou2014 in concluding that the lower misclassification error achieved by previous studies can be largely attributed to benefiting from removal of noise via data preprocessing and dimension reduction.

## 10 Additional Gaussian regression comparison

We consider an analogous simulation study including a comparison to the two-stage relaxed LASSO procedure, implemented in R package relaxo. The author’s implementation is limited to the Gaussian case, and we consider linear regression with dimension as the number of samples varies, with nonzero true coefficients. We consider a reduced experiment due to runtime considerations of relaxo, repeating only over trials and varying by increments of . Though timing is heavily dependent on implementations, the average total runtimes (across all values of ) of the experiment across trials for MM, MCP, SCAD, and relaxed lasso are seconds, respectively. We see that relaxed LASSO is effective toward removing the bias induced by standard LASSO, and overall results are similar to those included in the main text.