Robust and Sparse Regression in GLM by Stochastic Optimization

The generalized linear model (GLM) plays a key role in regression analyses. In high-dimensional data, the sparse GLM has been used but it is not robust against outliers. Recently, the robust methods have been proposed for the specific example of the sparse GLM. Among them, we focus on the robust and sparse linear regression based on the γ-divergence. The estimator of the γ-divergence has strong robustness under heavy contamination. In this paper, we extend the robust and sparse linear regression based on the γ-divergence to the robust and sparse GLM based on the γ-divergence with a stochastic optimization approach in order to obtain the estimate. We adopt the randomized stochastic projected gradient descent as a stochastic optimization approach and extend the established convergence property to the classical first-order necessary condition. By virtue of the stochastic optimization approach, we can efficiently estimate parameters for very large problems. Particularly, we show the linear regression, logistic regression and Poisson regression with L_1 regularization in detail as specific examples of robust and sparse GLM. In numerical experiments and real data analysis, the proposed method outperformed comparative methods.

Authors

• 4 publications
• 16 publications
04/22/2016

Robust and Sparse Regression via γ-divergence

In high-dimensional data, many sparse regression methods have been propo...
06/01/2020

Universal Robust Regression via Maximum Mean Discrepancy

Many datasets are collected automatically, and are thus easily contamina...
10/06/2021

Robust Generalized Method of Moments: A Finite Sample Viewpoint

For many inference problems in statistics and econometrics, the unknown ...
07/22/2019

A collection of U (∈N) data vectors is called a U-tuple, and the assoc...
07/18/2017

Global optimization for low-dimensional switching linear regression and bounded-error estimation

The paper provides global optimization algorithms for two particularly d...
02/03/2021

Query Complexity of Least Absolute Deviation Regression via Robust Uniform Convergence

Consider a regression problem where the learner is given a large collect...
11/18/2020

A Discussion on Practical Considerations with Sparse Regression Methodologies

Sparse linear regression is a vast field and there are many different al...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The regression analysis is a fundamental tool in data analysis. The Generalized Linear Model (GLM)

(Nelder and Wedderburn 1972; McCullagh and Nelder 1989) is often used and includes many important regression models, including linear regression, logistic regression and Poisson regression. Recently, the sparse modeling has been popular in GLM to treat high-dimensional data and, for some specific examples of GLM, the robust methods have also been incorporated (linear regression: Khan et al. (2007); Alfons et al. (2013), logistic regression: Bootkrajang and Kabán (2013); Chi and Scott (2014) ).

Kawashima and Fujisawa (2017) proposed a robust and sparse regression based on the -divergence (Fujisawa and Eguchi 2008), which has a strong robustness that the latent bias can be sufficiently small even under heavy contamination. The proposed method showed better performances than the past methods by virtue of strong robustness. A coordinate descent algorithm with Majorization-Minimization algorithm was constructed as an efficient estimation procedure for linear regression, but it is not always useful for GLM. To overcome this problem, we propose a new estmation procedure with a stochastic optimization approach, which largely reduces the computational cost and is easily applicable to any example of GLM. In many stochastic optimization approaches, we adopt the randomized stochastic projected gradient descent (RSPG) proposed by Ghadimi et al. (2016). In particular, when we consider the Poisson regression with

-divergence, although the loss function includes a hypergeometric series and demands high computational cost, the stochastic optimization approach can easily overcome this difficulty.

In Section 2, we review the robust and sparse regression via -divergence. In Section 3, the RSPG is explained with regularized expected. In Section 4, an online algorithm is proposed for GLM and the robustness of online algorithm is described with some typical examples of GLM. In Section 5, the convergence property of the RSPG is extended to the classical first-order necessary condition. In Sections 6 and 7, numerical experiments and real data analysis are illustrated to show better performances than comparative methods. Concluding remarks are given in Section 8.

2 Regression via γ-divergence

2.1 Regularized Empirical risk minimization

We suppose

is the underlying probability density function and

is a parametric probability density function. Let us define the -cross entropy for regression given by

 dγ(g(y|x),f(y|x);g(x)) =−1γlog∫∫g(y|x)f(y|x)γdy(∫f(y|x)1+γdy)γ1+γg(x)dx =−1γlog∫∫f(y|x)γ(∫f(y|x)1+γdy)γ1+γg(x,y)dxdy, =−1γlogEg(x,y)⎡⎢⎣f(y|x)γ(∫f(y|x)1+γdy)γ1+γ⎤⎥⎦.

The -divergence for regression is defined by

 Dγ(g(y|x),f(y|x);g(x)) =−dγ(g(y|x),g(y|x);g(x))+dγ(g(y|x),f(y|x);g(x)).

The main idea of robustness in the -divergence is based on density power weight which gives a small weight to the terms related to outliers. Then, the parameter estimation using the -divergence becomes robust against outliers and it is known for having a strong robustness, which implies that the latent bias can be sufficiently small even under heavy contamination. More details about robust properties were investigated by Fujisawa and Eguchi (2008), Kanamori and Fujisawa (2015) and Kawashima and Fujisawa (2017).

Let be the parametric probability density function with parameter . The target parameter can be considered by

 θ∗γ =argminθDγ(g(y|x),f(y|x;θ);g(x)) =argminθdγ(g(y|x),f(y|x;θ);g(x)).

Moreover, we can also consider the target parameter with a convex regularization term, given by

 θ∗γ,pen =argminθDγ(g(y|x),f(y|x;θ);g(x))+λP(θ) =argminθdγ(g(y|x),f(y|x;θ);g(x))+λP(θ), (2.1)

where is a convex regularization term for parameter and is a tuning parameter. As an example of convex regularization term, we can consider (Lasso, Tibshirani 1996), elasticnet (Zou and Hastie 2005), the indicator function of a closed convex set (Kivinen and Warmuth 1995; Duchi et al. 2008) and so on. In what follows, we refer to the regression based on the -divergence as the -regression.

Let be the observations randomly drawn from the underlying distribution . The -cross entropy can be empirically estimated by

 ¯dγ(f(y|x;θ))=−1γlog1nn∑i=1f(yi|xi)γ(∫f(y|xi)1+γdy)γ1+γ.

By virtue of (2.1), the sparse -estimator can be proposed by

 ^θγ,pen =argminθ¯dγ(f(y|x;θ))+λP(θ). (2.2)

To obtain the minimizer, we solve a non-convex and non-smooth optimization problem. Iterative estimation algorithms for such a problem can not easily achieve numerical stability and efficiency.

2.2 MM algorithm for γ-regression

Kawashima and Fujisawa (2017) proposed the iterative estimation algorithm for (2.2) by Majorization-Minimization algorithm (MM algorithm) (Hunter and Lange 2004). It has a monotone decreasing property, i.e., the objective function monotonically decreases at each iterative step, which property leads to numerical stability and efficiency. In particular, the linear regression with penalty was deeply considered.

Here, we explain the idea of MM algorithm briefly. Let be the objective function. Let us prepare the majorization function satisfying

 hMM(η(m)|η(m)) =h(η(m)), hMM(η|η(m)) ≥h(η)   for all η,

where is the parameter of the -th iterative step for . MM algorithm optimizes the majorization function instead of the objective function as follows:

 η(m+1)=argminηhMM(η|η(m)).

Then, we can show that the objective function monotonically decreases at each iterative step, because

 h(η(m)) =hMM(η(m)|η(m)) ≥hMM(η(m+1)|η(m)) ≥h(η(m+1)).

Note that is not necessary to be the minimizer of . We only need

 hMM(η(m)|η(m))≥hMM(η(m+1)|η(m)).

The problem on MM algorithm is how to make a majoraization function .

In Kawashima and Fujisawa (2017), the following majorization function was proposed by using Jensen’s inequality:

 hMM(θ|θ(m))=−1γn∑i=1α(m)ilog⎧⎪⎨⎪⎩f(yi|xi;θ)γ(∫f(y|xi;θ)1+γdy)γ1+γ⎫⎪⎬⎪⎭+λP(θ), (2.3)

where

 α(m)i=f(yi|xi;θ(m))γ(∫f(y|xi;θ(m))1+γdy)γ1+γ∑nl=1f(yl|xl;θ(m))γ(∫f(y|xl;θ(m))1+γdy)γ1+γ.

Moreover, for linear regression with regularization, the following majorization function and iterative estimation algorithm based on a coordinate descent method were obtained:

 hMM, linear(θ|θ(m)) =12(1+γ)logσ2+12n∑i=1α(m)i(yi−β0−xTiβ)2σ2+λ||β||1, β(m+1)0 =n∑i=1α(m)i(yi−xiTβ(m)), β(m+1)j =S(∑ni=1α(m)i(yi−β(m+1)0−r(m)i,−j)xij, σ2(m)λ)(∑ni=1α(m)ix2ij)  (j=1,…,p), σ2(m+1) =(1+γ)n∑i=1α(m)i(yi−β(m+1)0−xTiβ(m+1))2,

where and .

2.3 Sparse γ-Poisson regression case

Typical GLMs are a linear regression, logistic regression and Poisson regression: The former two regressions are easily treated with the above coordinate descent algorithm, but the Poisson regression has a problem as described in the following Here, we consider a Poisson regression with a regularization term. Let be the conditional density with , given by

 f(y|x;θ)=exp(−μx(θ))y!μx(θ)y,

where . By virtue of (2.3), we can obtain the majorization function for Poisson regression with a regularization term, given by

 hMM, poisson(θ|θ(m))=−n∑i=1α(m)ilogexp(−μxi(θ))yi!μxi(θ)yi (2.4)

The second term contains the hypergeometric series, and then we can not obtain a closed form on the MM algorithm with respect to the parameters although this series converges (see Sect. 4.3). Therefore, we can not derive an efficient iterative estimation algorithm based on a coordinate descent method in a similar way to in Kawashima and Fujisawa (2017). Other sparse optimization methods which use a linear approximation on the loss function, e.g., proximal gradient descent (Nesterov 2007; Duchi and Singer 2009; Beck and Teboulle 2009), can solve (2.3). However, these methods require at least sample size times of an approximate calculation for the hypergeometric series at each iterative step in sub-problem . Therefore, it requires high computation cost, especially for very large problems. We need another optimization approach to overcome such problems. In this paper, we consider minimizing the regularized expected risk (2.1) directly by a stochastic optimization approach. In what follows, we refer to the sparse -regression in GLM as the sparse -GLM.

3 Stochastic optimization approach for regularized expected risk minimization

The regularized expected risk minimization is generally the following form:

 Ψ∗\coloneqqminθ∈Θ{Ψ(θ)\coloneqqE(x,y)[l((x,y);θ)]+λP(θ)}, (3.1)

where is a closed convex set in , is a loss function with a parameter and is bounded below over by . Stochastic optimization approach solves (3.1) sequentially. More specifically, we draw a sequence of i.i.d. paired samples and, at -th time, update the parameter based on the latest paired sample and the previous updated parameter . Therefore, it requires low computational complexity per iteration and stochastic optimization can scale well for very large problems.

The stochastic gradient descent (SGD) is one of popular stochastic optimization approaches and is widely used in machine learning community

(Bottou 2010). The SGD takes the form

 (3.2)

where is a step size parameter. For some important examples, e.g., regularization, (3.2) can be solved in a closed form.

When a loss function is convex (possibly non-differentiable) and is set to be appropriate, e.g., , under some mild conditions, the convergence property was established for the average of the iterates, i.e., as follows (see, e.g., Bubeck (2015)):

 E[Ψ(¯θT)]−Ψ∗≤O(1√T),

where the expectation is taken with respect to past paired samples . Moreover, for some variants of SGD, e.g., RDA (Xiao 2010), Mirror descent (Duchi et al. 2010), Adagrad (Duchi et al. 2011), the convergence property was established under similar assumptions.

These methods assume that a loss function is convex to establish the convergence property, but the loss function is non-convex in our problem (2.1). Then, we can not adopt these methods directly. Recently, for non-convex loss function with convex regularization term, randomized stochastic projected gradient (RSPG) was proposed by Ghadimi et al. (2016). Under some mild conditions, the convergence property was established. Therefore, we consider applying the RSPG to our problem (2.1).

First, we explain the RSPG, following Ghadimi et al. (2016). The RSPG takes the form

 (3.3)

where is the size of mini-batch at -th time, is the -th mini-batch sample at -th time and

 V(a,b)=w(a)−w(b)−⟨∇w(b),a−b⟩,

where is continuously differentiable and -strongly convex function satisfying for . When , i.e., , (3.3) is almost equal to (3.2).

Here, we denote two remarks on RSPG as a difference from the SGD. One is that the RSPG uses the mini-batch strategy, i.e., taking multiple samples at -th time. The other is that the RSPG randomly select a final solution from

according to a certain probability distribution instead of taking the average of the iterates. This is because for non-convex stochastic optimization, later iterates does not always gather around local minimum and the average of the iterates can not work in such a convex case.

Next, we show the implementation of the RSPG, given by Algorithm1. However, Algorithm 1 has a large deviation of output because the only one final output is selected via some probability mass function . Therefore, Ghadimi et al. (2016) also proposed the two phase RSPG (2-RSPG) which has the post-optimization phase. In the post-optimization phase, multiple outputs are selected and these are validated to determine the final output, as shown in Algorithm 2.

This can be expected to achieve a better complexity result of finding an , i.e., Prob, where is some convergence criterion, for some and . For more detailed descriptions and proofs, we refer to the Sect.4 in Ghadimi et al. (2016).

4 Online robust and sparse GLM

In this section, we show the sparse -GLM with the stochastic optimization approach on three specific examples; linear regression, logistic regression and Poisson regression with regularization. In what follows, we refer to the sparse -GLM with the stochastic optimization approach as the online sparse -GLM.

In order to apply the RSPG to our methods (2.1), we prepare the monotone transformation of the -cross entropy for regression in (2.1) as follows

 argminθ∈ΘEg(x,y)⎡⎢⎣−f(y|x;θ)γ(∫f(y|x;θ)1+γdy)γ1+γ⎤⎥⎦+λP(θ), (4.1)

and we suppose that is or closed ball with sufficiently large radius. Then, we can apply the RSPG to (4.1) and by virtue of (3.3), the update formula takes the form

 (4.2)

More specifically, we suppose that because the update formula can be obtained in closed form for some important sparse regularization terms, e.g., regularization, elasticnet. We illustrate the update algorithms based on Algorithm 1 for three specific examples. The update algorithms based on Algorithm 2 are obtained in a similar manner.

In order to implement our methods, we need to determine some tuning parameters, e.g., the step size , mini-batch size . In Sect. 5, we discuss how to determine some tuning parameters in detail.

4.1 Online sparse γ-linear regression

Let be the conditional density with , given by

 f(y|x;θ)=ϕ(y;β0+xTβ,σ2),

where is the normal density with mean parameter

and variance parameter

. Suppose that is the regularization . Then, by virtue of (4.2), we can obtain the update formula given by

 (β(t+1)0,β(t+1),σ2(t+1)) =argminβ0,β,σ2ξ1(β(t)0)β0+⟨ξ2(β(t)),β⟩+ξ3(σ2(t))σ2 +λ∥β∥1+12ηt∥β0−β(t)0∥22+12ηt∥β−β(t)∥22+12ηt∥σ2−σ2(t)∥22, (4.3)

where

 ξ1(β(t)0) =−1mtmt∑i=1⎡⎢⎣γ(yt,i−β(t)0−xt,iTβ(t))σ2(t)(1+γ2πσ2(t))γ2(1+γ)exp⎧⎨⎩−γ(yt,i−β(t)0−xt,iTβ(t))22σ2(t)⎫⎬⎭⎤⎥⎦, ξ2(β(t)) =−1mtmt∑i=1⎡⎢⎣γ(yt,i−β(t)0−xt,iTβ(t))σ2(t)(1+γ2πσ2(t))γ2(1+γ)exp⎧⎨⎩−γ(yt,i−β(t)0−xt,iTβ(t))22σ2(t)⎫⎬⎭xt,i⎤⎥⎦, ξ3(σ2(t)) =1mtmt∑i=1⎡⎢⎣γ2(1+γ2πσ2(t))γ2(1+γ)⎧⎨⎩1(1+γ)σ2(t)−(yt,i−β(t)0−xt,iTβ(t))2σ4(t)⎫⎬⎭ exp⎧⎨⎩−γ(yt,i−β(t)0−xt,iTβ(t))22σ2(t)⎫⎬⎭⎤⎦.

Consequently, we can obtain the update algorithm, as shown in Algorithm 3.

Here, we briefly show the robustness of online sparse -linear regression. For simplicity, we consider the intercept parameter . Suppose that the is an outlier at -th time. The conditional probability density can be expected to be sufficiently small. We see from and (4.1) that

 β(t+1)0 =argminβ0−1mt∑1≤i≠k≤mt⎡⎢⎣γ(yt,i−β(t)0−xt,iTβ(t))σ2(t)(1+γ2πσ2(t))γ2(1+γ)exp⎧⎨⎩−γ(yt,i−β(t)0−xt,iTβ(t))22σ2(t)⎫⎬⎭⎤⎥⎦×β0 −1mtγ(yt,k−β(t)0−xt,kTβ(t))σ2(t)(1+γ2πσ2(t))γ2(1+γ)exp⎧⎨⎩−γ(yt,k−β(t)0−xt,kTβ(t))22σ2(t)⎫⎬⎭×β0 +12ηt∥β0−β(t)0∥22 =argminβ0−1mt∑1≤i≠k≤mt⎡⎢⎣γ(yt,i−β(t)0−xt,iTβ(t))σ2(t)(1+γ2πσ2(t))γ2(1+γ)exp⎧⎨⎩−γ(yt,i−β(t)0−xt,iTβ(t))22σ2(t)⎫⎬⎭⎤⎥⎦×β0 −1mtγ(1+γ)γ2(1+γ)(yt,k−β(t)0−xt,kTβ(t))σ2(t)(2πσ2(t))γ22(1+γ)f(yt,k|xt,k;θ(t))γ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––≈0×β0 +12ηt∥β0−β(t)0∥22

Therefore, the effect of an outlier is naturally ignored in (4.1). Similarly, we can also see the robustness for parameters and .

4.2 Online sparse γ-logistic regression

Let be the conditional density with , given by

 f(y|x;β0,β)=F(~xTθ)y(1−F(~xTθ))(1−y),

where and . Then, by virtue of (4.2), we can obtain the update formula given by

 (β(t+1)0,β(t+1)) argminβ0,βν1(β(t)0)β0+⟨ν2(β(t)),β⟩+λ||β||1+12ηt∥β0−β(t)0∥22+12ηt∥β−β(t)∥22, (4.4)

where

 ν1(β(t)0) ν2(β(t))

Consequently, we can obtain the update algorithm as shown in Algorithm 4.

In a similar way to online sparse -linear regression, we can also see the robustness for parameters and in online sparse -logistic regression (4.2).

4.3 Online sparse γ-Poisson regression

Let be the conditional density with , given by

 f(y|x;θ)=exp(−μx(θ))y!μx(θ)y,

where . Then, by virtue of (4.2), we can obtain the update formula given by

 (β(t+1)0,β(t+1)) =argminβ0,βζ1(β(t)0)β0+⟨ζ2(β(t)),β⟩+λ||β||1+12ηt∥β0−β(t)0∥22+12ηt∥β−β(t)∥22, (4.5)

where

 ζ1(β(t)0) =1mtmt∑i=1⎡⎢ ⎢ ⎢ ⎢⎣γf(yt,i|xt,i;θ(t))γ{∑∞y=0(y−yt,i)f(y|xt,i;θ(t))1+γ}{∑∞y=0f(y|xt,i;θ(t))1+γ}1+2γ1+γ⎤⎥ ⎥ ⎥ ⎥⎦, ζ2(β(t)) =1mtmt∑i=1⎡⎢ ⎢ ⎢ ⎢⎣γf(yt,i|xt,i;θ(t))γ{∑∞y=0(y−yt,i)f(y|xt,i;θ(t))1+γ}{∑∞y=0f(y|xt,i;θ(t))1+γ}1+2γ1+γxt,i⎤⎥ ⎥ ⎥ ⎥⎦.

In (4.3), two types hypergeometric series exist. Here, we prove a convergence of and . First, let us consider and we denote -th term that . Then, we use the dalembert ratio test for :

 limn→∞∣∣∣Sn+1Sn<