# A Random Block-Coordinate Douglas-Rachford Splitting Method with Low Computational Complexity for Binary Logistic Regression

In this paper, we propose a new optimization algorithm for sparse logistic regression based on a stochastic version of the Douglas-Rachford splitting method. Our algorithm sweeps the training set by randomly selecting a mini-batch of data at each iteration, and it allows us to update the variables in a block coordinate manner. Our approach leverages the proximity operator of the logistic loss, which is expressed with the generalized Lambert W function. Experiments carried out on standard datasets demonstrate the efficiency of our approach w.r.t. stochastic gradient-like methods.


## 1 Introduction

Sparse classification algorithms have gained much popularity in the context of supervised learning, thanks to their ability to discard irrelevant features during the training stage. Such algorithms aim at learning a weighted linear combination of basis functions that fits the training data, while encouraging as many weights as possible to be equal to zero. This amounts to solving an optimization problem that involves a loss function plus a sparse regularization term. Different types of classifiers arise by varying the loss function, the most popular being the hinge and the logistic losses Rosasco2004 ; Bartlett2006 .

In the context of supervised learning, sparse regularization traces back to the work of Bradley and Mangasarian Bradley1998 , who showed that the $\ell_1$-norm can efficiently perform feature selection by shrinking small coefficients to zero. Other forms of regularization have also been studied, such as the $\ell_0$-norm Weston2002 , the $\ell_p$-norm with $p<1$ Liu2007 , the $\ell_\infty$-norm Zou2008 , and other nonconvex terms Laporte2014 . Mixed norms have been investigated as well, due to their ability to impose a more structured form of sparsity Krishnapuram2005 ; Meier2008 ; Duchi2009 ; Obozinski2010 ; Yuan2010 ; Rosasco2013 .

Many efficient learning algorithms exist in the case of quadratic regularization, by benefiting from the advantages brought by Lagrangian duality Blondel2014 . This is unfortunately not true for sparse regularization, because the dual formulation is as difficult to solve as the primal one. Consequently, sparse linear classifiers are usually trained through the direct resolution of the primal optimization problem. Among the possible approaches, one can resort to linear programming Wang2007 , gradient-like methods Krishnapuram2005 ; Mairal2013 , proximal algorithms Laporte2014 ; Chierchia2015 ; Barlaud2015 , and other optimization techniques Tan2010 .

Nowadays, it is well known that random updates can significantly reduce the computational time when a quadratic regularization is used Hsieh2008 ; Lacoste2013 . Therefore, a great deal of attention has been paid recently to stochastic approaches capable of handling a sparse regularization Pereyra2016 . The list of investigated techniques includes block-coordinate descent strategies Shalev2011 ; Blondel2013 ; Fercoq2015 ; Lu2015 , stochastic forward-backward iterations Langford2009 ; Xiao2010 ; Richtarik2014 ; Rosasco2014 ; Combettes2016 , random Douglas-Rachford splitting methods Combettes2015 , random primal-dual proximal algorithms Pesquet2015 , and stochastic majorization-minimization methods Mairal2013b ; Chouzenoux2015 .

In this paper, we propose a random-sweeping block-coordinate Douglas-Rachford splitting method. In addition to its stochastic behavior, it presents three distinctive features with respect to related approaches Briceno2011 ; Bot2013 ; Combettes2015 . Firstly, the matrix to be inverted at the initial step is block-diagonal, whereas in the concurrent approaches it does not present any specific structure. The block-diagonal property implies that the inversion step actually amounts to inverting a set of smaller matrices. Secondly, the proposed algorithm can explicitly take advantage of a strong convexity property possibly fulfilled by some of the functions involved in the optimization problem. Finally, the dual variables appear explicitly in the proposed scheme, making it possible to use clever block-coordinate descent strategies Perekrestenko2017 .

Moreover, the proposed algorithm appears to be well tailored to binary logistic regression with sparse regularization. In contrast to gradient-like methods, our approach deals with the logistic loss through its proximity operator. This results in an algorithm that is not tied to the Lipschitz constant of the loss function, possibly leading to larger updates per iteration. In this regard, our second contribution is to show that the proximity operator of the binary logistic loss can be expressed in closed form using the generalized Lambert W function Mezo2014 ; Maignan2016 . We also provide comparisons with state-of-the-art stochastic methods using benchmark datasets.

The paper is organized as follows. In Section 2, we derive the proposed Douglas-Rachford algorithm. In Section 3, we introduce sparse logistic regression, along with the proximal operator of the logistic loss. In Section 4, we evaluate our approach on standard datasets, and compare it to three sparse classification algorithms proposed in the literature Xiao2010 ; Rosasco2014 ; Chierchia2016 . Finally, conclusions are drawn in Section 5.

Notation: $\Gamma_0(\mathcal{H})$ denotes the set of proper, lower semicontinuous, convex functions from a real Hilbert space $\mathcal{H}$ to $\left]-\infty,+\infty\right]$. Let $f\in\Gamma_0(\mathcal{H})$. For every $w\in\mathcal{H}$, the subdifferential of $f$ at $w$ is $\partial f(w)$, the proximity operator of $f$ at $w$ is $\operatorname{prox}_f(w)$, and the conjugate of $f$ is $f^*\in\Gamma_0(\mathcal{H})$. The adjoint of a bounded linear operator $A$ from $\mathcal{H}$ to a real Hilbert space $\mathcal{G}$ is denoted by $A^*$. Let $(\Omega,\mathcal{F},\mathsf{P})$ be the underlying probability space; the $\sigma$-algebra generated by a family $\Phi$ of random variables is denoted by $\sigma(\Phi)$.

## 2 Optimization method

Throughout this section, $\mathcal{H}_1,\ldots,\mathcal{H}_B$ and $\mathcal{G}_1,\ldots,\mathcal{G}_L$ are separable real Hilbert spaces. In addition, $\mathcal{H}=\mathcal{H}_1\oplus\cdots\oplus\mathcal{H}_B$ denotes the Hilbertian sum of $\mathcal{H}_1,\ldots,\mathcal{H}_B$. Any vector $w\in\mathcal{H}$ can thus be uniquely decomposed as $w=(w_b)_{1\le b\le B}$ where, for every $b\in\{1,\ldots,B\}$, $w_b\in\mathcal{H}_b$. In the following, a similar notation will be used to denote vectors in any product space (in bold) and their components.

We will now aim at solving the following problem.

###### Problem 1

For every $b\in\{1,\ldots,B\}$ and for every $\ell\in\{1,\ldots,L\}$, let $f_b\in\Gamma_0(\mathcal{H}_b)$, let $h_\ell\colon\mathcal{G}_\ell\to\mathbb{R}$ be a differentiable convex function with $\rho_\ell^{-1}$-Lipschitz gradient, for some $\rho_\ell\in\left]0,+\infty\right[$, and let $A_{\ell,b}$ be a linear bounded operator from $\mathcal{H}_b$ to $\mathcal{G}_\ell$. The problem is to

$$\operatorname*{minimize}_{w\in\mathcal{H}}\;\sum_{b=1}^{B}f_b(w_b)+\sum_{\ell=1}^{L}h_\ell\Big(\sum_{b=1}^{B}A_{\ell,b}w_b\Big),$$

under the assumption that the set of solutions is nonempty.

In order to address Problem 1, we propose to employ the random-sweeping block-coordinate version of the Douglas-Rachford splitting method with stochastic errors described in Algorithm 1. Let us define

$$(\forall b\in\{1,\ldots,B\})\quad C_b=\Big(\mathrm{Id}+\tau_b\sum_{\ell=1}^{L}\frac{\gamma_\ell}{1+\gamma_\ell\rho_\ell}A_{\ell,b}^*A_{\ell,b}\Big)^{-1}\colon\mathcal{H}_b\to\mathcal{H}_b,\tag{1}$$

where $\tau_b$ and $\gamma_\ell$ are the positive constants introduced in Algorithm 1.
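Because each $C_b$ acts on $\mathcal{H}_b$ alone, the inversion in (1) reduces to $B$ independent solves of small linear systems, one per block. Below is a minimal numerical sketch for the finite-dimensional case, assuming NumPy; the function name and the shapes of the operators $A_{\ell,b}$ are illustrative only:

```python
import numpy as np

def precompute_Cb(A_blocks, tau_b, gamma, rho):
    """Form C_b = (I + tau_b * sum_l gamma_l/(1+gamma_l*rho_l) A_{l,b}^T A_{l,b})^{-1}.

    A_blocks   : list over l of the matrices A_{l,b}, each of shape (m_l, n_b)
    tau_b      : step size for block b
    gamma, rho : sequences of the constants gamma_l and rho_l
    """
    n_b = A_blocks[0].shape[1]
    M = np.eye(n_b)
    for A_lb, g, r in zip(A_blocks, gamma, rho):
        M += tau_b * (g / (1.0 + g * r)) * (A_lb.T @ A_lb)
    # M is symmetric positive definite, so the inverse exists
    return np.linalg.inv(M)
```

In practice one would store a Cholesky factorization of the matrix instead of its explicit inverse; the point is only that each block is a small $n_b\times n_b$ system, independent of the others.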

The next result establishes the convergence of the proposed algorithm.

###### Proposition 1

For every , let , and be -valued random variables and, for every , let and be -valued random variables and let be -valued random variables. In addition, let be identically distributed -valued random variables and in Algorithm 1 assume that

1. and are independent;

2. ;

3. ;

4. and .

Then, the sequence $(w^{[i]})_{i\in\mathbb{N}}$ generated by Algorithm 1 converges weakly $\mathsf{P}$-a.s. to a random variable taking values in the set of solutions to Problem 1.

###### Proof.

Problem 1 can be reformulated as minimizing $f+h\circ A$, where

$$f\colon\mathcal{H}\to\left]-\infty,+\infty\right]\colon w\mapsto\sum_{b=1}^{B}f_b(w_b)\tag{2}$$

$$A\colon\mathcal{H}\to\mathcal{G}\colon w\mapsto\big(A_{\ell,1}w_1,\ldots,A_{\ell,B}w_B\big)_{1\le\ell\le L}\tag{3}$$

$$h\colon\mathcal{G}\to\mathbb{R}\colon v\mapsto\sum_{\ell=1}^{L}h_\ell(\Lambda_\ell v_\ell)\tag{4}$$

$$(\forall\ell\in\{1,\ldots,L\})\quad\Lambda_\ell\colon\mathcal{G}_\ell^{B}\to\mathcal{G}_\ell\colon v_\ell\mapsto\sum_{b=1}^{B}v_{\ell,b}\tag{5}$$

and $v$ denotes a generic element of $\mathcal{G}$ with $v_\ell=(v_{\ell,b})_{1\le b\le B}$ for every $\ell\in\{1,\ldots,L\}$. Since $h$ is real-valued, from (Bauschke2017, Theorem 16.47(i)), Problem 1 is equivalent to

$$\text{find}\;w\in\mathcal{H}\;\text{such that}\;0\in\partial f(w)+A^*\nabla h(Aw),\tag{6}$$

which, from (Briceno2011, Proposition 2.8), is also equivalent to

$$\text{find}\;(w,v)\in\mathcal{H}\times\mathcal{G}\;\text{such that}\;(0,0)\in N(w,v)+S(w,v),\tag{7}$$

where $N\colon(w,v)\mapsto\partial f(w)\times\partial h^*(v)$ is maximally monotone and $S\colon(w,v)\mapsto(A^*v,-Aw)$ is a skewed linear operator. Note that $S^*=-S$ and, from (2), (4) and (Bauschke2017, Proposition 13.30 and Proposition 16.9), $\partial f(w)=\partial f_1(w_1)\times\cdots\times\partial f_B(w_B)$ and $\partial h^*(v)=\partial(h_1\circ\Lambda_1)^*(v_1)\times\cdots\times\partial(h_L\circ\Lambda_L)^*(v_L)$. Since, for every $\ell\in\{1,\ldots,L\}$, $h_\ell\circ\Lambda_\ell$ is convex differentiable with a Lipschitzian gradient, it follows from the Baillon-Haddad theorem (Bauschke2017, Corollary 18.17) that $\nabla(h_\ell\circ\Lambda_\ell)$ is cocoercive and, hence, $\partial(h_\ell\circ\Lambda_\ell)^*$ is strongly monotone. Therefore, for every $\ell\in\{1,\ldots,L\}$, $(h_\ell\circ\Lambda_\ell)^*$ is strongly convex. By defining

$$(\forall\ell\in\{1,\ldots,L\})\quad\varphi_\ell=(h_\ell\circ\Lambda_\ell)^*-\rho_\ell\|\cdot\|^2/2,\tag{8}$$

it follows from (Bauschke2017, Proposition 10.8) that, for every $\ell\in\{1,\ldots,L\}$, $\varphi_\ell\in\Gamma_0(\mathcal{G}_\ell^B)$ and, hence, $\partial\varphi_\ell$ is maximally monotone. Consequently, Problem 1 can be rewritten equivalently as

$$\text{find}\;(w,v)\in\mathcal{H}\times\mathcal{G}\;\text{such that}\;\begin{cases}(\forall b\in\{1,\ldots,B\})\quad 0\in\partial f_b(w_b)+B_b(w,v)\\(\forall\ell\in\{1,\ldots,L\})\quad 0\in\partial\varphi_\ell(v_\ell)+B_\ell(w,v),\end{cases}\tag{9}$$

which, for strictly positive constants $(\tau_b)_{1\le b\le B}$ and $(\gamma_\ell)_{1\le\ell\le L}$, is equivalent to

$$\text{find}\;(w,v)\in\mathcal{H}\times\mathcal{G}\;\text{such that}\;\begin{cases}(\forall b\in\{1,\ldots,B\})\quad 0\in\tau_b\partial f_b(w_b)+\tau_b B_b(w,v)\\(\forall\ell\in\{1,\ldots,L\})\quad 0\in\gamma_\ell\partial\varphi_\ell(v_\ell)+\gamma_\ell B_\ell(w,v),\end{cases}\tag{10}$$

where

$$\begin{cases}B_b\colon(w,v)\mapsto\sum_{\ell=1}^{L}A_{\ell,b}^*v_{\ell,b}\\B_\ell\colon(w,v)\mapsto-(A_{\ell,1}w_1,\ldots,A_{\ell,B}w_B)+\rho_\ell v_\ell.\end{cases}\tag{11}$$

Since $(w,v)\mapsto(A^*v,-Aw)$ and $D\colon v\mapsto(\rho_\ell v_\ell)_{1\le\ell\le L}$ are linear and monotone operators in $\mathcal{H}\times\mathcal{G}$ and $\mathcal{G}$, respectively, the operator

$$B\colon(w,v)\mapsto(A^*v,-Aw+Dv)=\big((B_b(w,v))_{1\le b\le B},(B_\ell(w,v))_{1\le\ell\le L}\big)$$

is maximally monotone in $\mathcal{H}\times\mathcal{G}$. Therefore, by defining the strongly positive diagonal linear operator

$$U\colon\mathcal{H}\times\mathcal{G}\to\mathcal{H}\times\mathcal{G}\colon(w,v)\mapsto(Tw,\Gamma v),\tag{12}$$

where $T\colon w\mapsto(\tau_b w_b)_{1\le b\le B}$ and $\Gamma\colon v\mapsto(\gamma_\ell v_\ell)_{1\le\ell\le L}$, the operator

$$UB\colon(w,v)\mapsto(TA^*v,-\Gamma Aw+\Gamma Dv)=\big((\tau_b B_b(w,v))_{1\le b\le B},(\gamma_\ell B_\ell(w,v))_{1\le\ell\le L}\big)\tag{13}$$

is maximally monotone in $(\mathcal{H}\times\mathcal{G},\|\cdot\|_{U^{-1}})$, where

$$(\forall(w,v)\in\mathcal{H}\times\mathcal{G})\quad\|(w,v)\|_{U^{-1}}=\sqrt{\sum_{b=1}^{B}\tau_b^{-1}\|w_b\|^2+\sum_{\ell=1}^{L}\gamma_\ell^{-1}\|v_\ell\|^2}.\tag{14}$$

Note that the renormed product space $(\mathcal{H}\times\mathcal{G},\|\cdot\|_{U^{-1}})$ is the Hilbert sum of the spaces $(\mathcal{H}_b)_{1\le b\le B}$ and $(\mathcal{G}_\ell^B)_{1\le\ell\le L}$, endowed with the norms $\tau_b^{-1/2}\|\cdot\|$ and $\gamma_\ell^{-1/2}\|\cdot\|$, respectively. Therefore, since $\partial f_b$ and $\partial\varphi_\ell$ are maximally monotone in $\mathcal{H}_b$ and $\mathcal{G}_\ell^B$, respectively, we conclude that (10) is a particular case of the primal inclusion in (Combettes2015, Proposition 5.1).

Now we write Algorithm 1 as a particular case of the random block-coordinate Douglas-Rachford splitting proposed in (Combettes2015, Proposition 5.1), applied to (10) in $(\mathcal{H}\times\mathcal{G},\|\cdot\|_{U^{-1}})$. Given $(t,s)\in\mathcal{H}\times\mathcal{G}$, set $(w,v)=(\mathrm{Id}+UB)^{-1}(t,s)$. It follows from (13) that

$$\begin{cases}w=t-TA^*v\\v=(\mathrm{Id}+\Gamma D)^{-1}(s+\Gamma Aw),\end{cases}\tag{15}$$

and, hence,

$$w=\big(\mathrm{Id}+TA^*(\mathrm{Id}+\Gamma D)^{-1}\Gamma A\big)^{-1}\big(t-TA^*(\mathrm{Id}+\Gamma D)^{-1}s\big).\tag{16}$$

In order to derive an explicit formula for the inversion in (16), set $z=t-TA^*(\mathrm{Id}+\Gamma D)^{-1}s$. We have $w=(\mathrm{Id}+TA^*(\mathrm{Id}+\Gamma D)^{-1}\Gamma A)^{-1}z$ and, since $A$ is given by (3) and $\Gamma D$ is diagonal, we obtain

$$TA^*(\mathrm{Id}+\Gamma D)^{-1}\Gamma A\colon w\mapsto\Big(\tau_b\sum_{\ell=1}^{L}\frac{\gamma_\ell A_{\ell,b}^*A_{\ell,b}w_b}{1+\gamma_\ell\rho_\ell}\Big)_{1\le b\le B},$$

and, hence,

$$(\forall b\in\{1,\ldots,B\})\quad w_b=\Big(\mathrm{Id}+\tau_b\sum_{\ell=1}^{L}\frac{\gamma_\ell A_{\ell,b}^*A_{\ell,b}}{1+\gamma_\ell\rho_\ell}\Big)^{-1}z_b=C_bz_b.\tag{17}$$

Therefore, (16) can be written equivalently as

$$(\forall b\in\{1,\ldots,B\})\quad w_b=C_b\Big(t_b-\tau_b\sum_{\ell=1}^{L}\frac{A_{\ell,b}^*s_{\ell,b}}{1+\gamma_\ell\rho_\ell}\Big)=C_b(t_b-\tau_bu_b),\tag{18}$$

where

$$(\forall b\in\{1,\ldots,B\})\quad u_b=\sum_{\ell=1}^{L}\frac{A_{\ell,b}^*s_{\ell,b}}{1+\gamma_\ell\rho_\ell}.\tag{19}$$

Moreover, from (15), we deduce that

$$(\forall\ell\in\{1,\ldots,L\})\quad v_\ell=\frac{s_\ell+\gamma_\ell(A_{\ell,b}w_b)_{1\le b\le B}}{1+\gamma_\ell\rho_\ell},\tag{20}$$

and, hence, we have $(w,v)=\big((Q_b(t,s))_{1\le b\le B},(Q_\ell(t,s))_{1\le\ell\le L}\big)$, where

$$\begin{cases}(\forall b\in\{1,\ldots,B\})\quad Q_b\colon(t,s)\mapsto C_b(t_b-\tau_bu_b)\\(\forall\ell\in\{1,\ldots,L\})\quad Q_\ell\colon(t,s)\mapsto\dfrac{s_\ell+\gamma_\ell(A_{\ell,b}Q_b(t,s))_{1\le b\le B}}{1+\gamma_\ell\rho_\ell}.\end{cases}\tag{21}$$

Now, it follows from (Bauschke2017, Proposition 16.44) that, for every $\ell\in\{1,\ldots,L\}$ and $z_\ell\in\mathcal{G}_\ell^{B}$, we have

$$r_\ell=\operatorname{prox}_{\gamma_\ell\varphi_\ell}z_\ell\;\Leftrightarrow\;\frac{z_\ell-r_\ell}{\gamma_\ell}\in\partial\varphi_\ell(r_\ell)\;\Leftrightarrow\;\frac{z_\ell-r_\ell}{\gamma_\ell}\in\partial(h_\ell\circ\Lambda_\ell)^*(r_\ell)-\rho_\ell r_\ell\;\Leftrightarrow\;z_\ell-(1-\gamma_\ell\rho_\ell)r_\ell\in\gamma_\ell\partial(h_\ell\circ\Lambda_\ell)^*(r_\ell).\tag{22}$$

Therefore, if $\gamma_\ell\rho_\ell<1$, we have that (22) is equivalent to

$$r_\ell=\operatorname{prox}_{\frac{\gamma_\ell}{1-\gamma_\ell\rho_\ell}(h_\ell\circ\Lambda_\ell)^*}\Big(\frac{z_\ell}{1-\gamma_\ell\rho_\ell}\Big)\tag{23}$$

and, from Moreau's decomposition formula (Bauschke2017, Theorem 14.13(ii)), we obtain

$$r_\ell=\frac{1}{1-\gamma_\ell\rho_\ell}\Bigg(z_\ell-\gamma_\ell\operatorname{prox}_{\frac{1-\gamma_\ell\rho_\ell}{\gamma_\ell}(h_\ell\circ\Lambda_\ell)}\Big(\frac{z_\ell}{\gamma_\ell}\Big)\Bigg).\tag{24}$$
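The Moreau decomposition invoked to pass from (23) to (24), $\operatorname{prox}_{\lambda g^*}(x)=x-\lambda\operatorname{prox}_{g/\lambda}(x/\lambda)$, can be sanity-checked numerically on a simple pair of conjugate functions. A quick sketch with $g=|\cdot|$, for which $\operatorname{prox}_{\mu g}$ is soft-thresholding and $g^*$ is the indicator of $[-1,1]$, whose proximity operator is the projection (helper names are illustrative):

```python
import numpy as np

def soft_threshold(x, mu):
    # prox of mu*|.|, applied componentwise
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def prox_scaled_conjugate(x, lam, prox_g):
    # Moreau decomposition: prox_{lam*g^*}(x) = x - lam * prox_{g/lam}(x / lam)
    return x - lam * prox_g(x / lam, 1.0 / lam)

x = np.linspace(-3.0, 3.0, 13)
p = prox_scaled_conjugate(x, 0.7, soft_threshold)
# for g = |.|, prox of lam*g^* is the projection onto [-1, 1], for any lam > 0
```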

Noting that, for every $\ell\in\{1,\ldots,L\}$, $\Lambda_\ell\Lambda_\ell^*=B\,\mathrm{Id}$, from (Bauschke2017, Proposition 24.14), (22) and (24), we deduce that

$$(\forall b\in\{1,\ldots,B\})\quad r_{\ell,b}=\frac{1}{B(1-\gamma_\ell\rho_\ell)}\Bigg(\sum_{d=1}^{B}z_{\ell,d}-\gamma_\ell\operatorname{prox}_{\frac{B(1-\gamma_\ell\rho_\ell)}{\gamma_\ell}h_\ell}\Big(\frac{1}{\gamma_\ell}\sum_{d=1}^{B}z_{\ell,d}\Big)\Bigg)\tag{25}$$

and, hence,

$$\operatorname{prox}_{\gamma_\ell\varphi_\ell}z_\ell=\Bigg(\frac{1}{B(1-\gamma_\ell\rho_\ell)}\Bigg(\sum_{d=1}^{B}z_{\ell,d}-\gamma_\ell\operatorname{prox}_{\frac{B(1-\gamma_\ell\rho_\ell)}{\gamma_\ell}h_\ell}\Big(\frac{1}{\gamma_\ell}\sum_{d=1}^{B}z_{\ell,d}\Big)\Bigg)\Bigg)_{1\le b\le B}.\tag{26}$$

Therefore, by defining, for every $i\in\mathbb{N}$ and $\ell\in\{1,\ldots,L\}$, the error term $e_\ell^{[i]}$ via

$$(\forall\ell\in\{1,\ldots,L\})\quad e_\ell^{[i]}=\Big(\frac{-\gamma_\ell}{B(1-\gamma_\ell\rho_\ell)}d_\ell^{[i]}\Big)_{1\le b\le B},\tag{27}$$

we deduce that Algorithm 1 can be written equivalently as

For $i=0,1,\ldots$

$$\left\lfloor\begin{array}{l}\text{For }b=1,\ldots,B\\\left\lfloor\begin{array}{l}w_b^{[i+1]}=w_b^{[i]}+\varepsilon_b^{[i]}\big(Q_b(t^{[i]},s^{[i]})-w_b^{[i]}\big)\\t_b^{[i+1]}=t_b^{[i]}+\varepsilon_b^{[i]}\mu^{[i]}\big(\operatorname{prox}_{\tau_bf_b}(2w_b^{[i+1]}-t_b^{[i]})+a_b^{[i]}-w_b^{[i+1]}\big)\end{array}\right.\\\text{For }\ell=1,\ldots,L\\\left\lfloor\begin{array}{l}v_\ell^{[i+1]}=v_\ell^{[i]}+\varepsilon_{B+\ell}^{[i]}\big(Q_\ell(t^{[i]},s^{[i]})-v_\ell^{[i]}\big)\\s_\ell^{[i+1]}=s_\ell^{[i]}+\varepsilon_{B+\ell}^{[i]}\mu^{[i]}\big(\operatorname{prox}_{\gamma_\ell\varphi_\ell}(2v_\ell^{[i+1]}-s_\ell^{[i]})+e_\ell^{[i]}-v_\ell^{[i+1]}\big).\end{array}\right.\end{array}\right.$$

Defining, for every $i\in\mathbb{N}$, the $\sigma$-algebra $\chi^{[i]}=\sigma\big((w^{[j]},t^{[j]},v^{[j]},s^{[j]})_{0\le j\le i}\big)$, we have

$$\begin{aligned}\sum_{i\in\mathbb{N}}\sqrt{\mathsf{E}\big(\|a^{[i]}\|_{U^{-1}}^{2}\,\big|\,\chi^{[i]}\big)}&=\sum_{i\in\mathbb{N}}\sqrt{\sum_{b=1}^{B}\tau_b^{-1}\mathsf{E}\big(\|a_b^{[i]}\|^{2}\,\big|\,\chi^{[i]}\big)+\sum_{\ell=1}^{L}\gamma_\ell^{-1}\mathsf{E}\big(\|e_\ell^{[i]}\|^{2}\,\big|\,\chi^{[i]}\big)}\\&\le\sum_{b=1}^{B}\tau_b^{-1/2}\sum_{i\in\mathbb{N}}\sqrt{\mathsf{E}\big(\|a_b^{[i]}\|^{2}\,\big|\,\chi^{[i]}\big)}+\sum_{\ell=1}^{L}\gamma_\ell^{-1/2}\sum_{i\in\mathbb{N}}\sqrt{\mathsf{E}\big(\|e_\ell^{[i]}\|^{2}\,\big|\,\chi^{[i]}\big)}<+\infty,\end{aligned}\tag{28}$$

where the last inequality follows from (ii), (iii), (27) and

$$\sum_{i\in\mathbb{N}}\sqrt{\mathsf{E}\big(\|e_\ell^{[i]}\|^{2}\,\big|\,\chi^{[i]}\big)}=\frac{\gamma_\ell}{\sqrt{B}\,(1-\gamma_\ell\rho_\ell)}\sum_{i\in\mathbb{N}}\sqrt{\mathsf{E}\big(\|d_\ell^{[i]}\|^{2}\,\big|\,\chi^{[i]}\big)}<+\infty.\tag{29}$$

Altogether, since the operator $\big((Q_b)_{1\le b\le B},(Q_\ell)_{1\le\ell\le L}\big)$ is weakly sequentially continuous because it is continuous and linear, the result follows from (Combettes2015, Proposition 5.1) when the error term in the computation of this operator is zero. ∎
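To give a concrete feel for the structure of these updates, the following toy sketch runs a much simplified, single-block Douglas-Rachford iteration with Bernoulli coordinate activation on an $\ell_1$-regularized least-squares problem. It is not Algorithm 1 itself (there are no dual blocks and no error terms), and all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 10
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
lam, tau = 0.5, 1.0          # regularization weight and step size

# prox of tau*lam*||.||_1: componentwise soft-thresholding
def prox_f(z):
    return np.sign(z) * np.maximum(np.abs(z) - tau * lam, 0.0)

# prox of tau*(1/2)||A w - y||^2: a linear solve with a fixed SPD matrix
H = np.eye(n) + tau * A.T @ A
def prox_h(z):
    return np.linalg.solve(H, z + tau * A.T @ y)

t = np.zeros(n)
for i in range(5000):
    w = prox_h(t)
    # random sweeping: each coordinate of the auxiliary variable t is
    # updated only when its Bernoulli activation variable equals 1
    eps = rng.random(n) < 0.8
    t = np.where(eps, t + prox_f(2.0 * w - t) - w, t)
w = prox_h(t)   # w approaches a minimizer of lam*||w||_1 + 0.5*||A w - y||^2
```

At a solution, the subgradient optimality condition $-A^\top(Aw-y)\in\lambda\partial\|w\|_1$ holds, which gives a convenient convergence check.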

###### Remark 1
1. In Proposition 1, the binary variables $\varepsilon_b^{[i]}$ and $\varepsilon_{B+\ell}^{[i]}$ signal whether the variables $(w_b^{[i+1]},t_b^{[i+1]})$ and $(v_\ell^{[i+1]},s_\ell^{[i+1]})$ are activated or not at iteration $i$. Assumption (iv) guarantees that each of the latter variables is activated with a nonzero probability at each iteration. In particular, it must be pointed out that the variables $Q_b(t^{[i]},s^{[i]})$ and $Q_\ell(t^{[i]},s^{[i]})$ only need to be computed when the corresponding activation variable is equal to 1.

2. Note that Algorithm 1 may look similar to the stochastic approach proposed in (Combettes2015, Corollary 5.5) (see also (Briceno2011, Remark 2.9), and Bot2013 for deterministic variants). It exhibits, however, three key differences. Most importantly, the operator inversions performed at the initial step amount to inverting a set of positive definite self-adjoint operators defined on the spaces $(\mathcal{H}_b)_{1\le b\le B}$. We will see in our application that this reduces to inverting a set of small-size symmetric positive definite matrices. Another advantage is that the smoothness of the functions $(h_\ell)_{1\le\ell\le L}$ is taken into account, and a last one is that the dual variables appear explicitly in the iterations.

3. If, for every , and , Algorithm 1 simplifies to Algorithm 2, where unnecessary indices have been dropped and we have set

$$(\forall i\in\mathbb{N})\quad\begin{cases}\tilde{u}^{[i]}=-\tau u^{[i]}\\(\forall\ell\in\{1,\ldots,L\})\quad\tilde{s}_\ell^{[i]}=-\tau s_\ell^{[i]}.\end{cases}\tag{30}$$

In this case,

$$C=\Big(\mathrm{Id}+\tau\sum_{\ell=1}^{L}\gamma_\ell A_\ell^*A_\ell\Big)^{-1}.\tag{31}$$

When , it turns out that this algorithm is exactly the same as the one resulting from a direct application of (Combettes2015, Corollary 5.5); see Chierchia2017 .

4. The situation when, for a given $\ell\in\{1,\ldots,L\}$, $h_\ell$ is not Lipschitz-differentiable can be seen as the limit case when $\rho_\ell\to0$. It can then be shown that Algorithm 1 remains valid by setting $\rho_\ell=0$.

## 3 Sparse logistic regression

The proposed algorithm can be applied in the context of binary linear classification. A binary linear classifier can be modeled as a function that predicts the output $y\in\{-1,+1\}$ associated to a given input $x\in\mathbb{R}^N$. This prediction is defined through a linear combination of the input components, yielding the decision variable

$$d_w(x)=\operatorname{sign}(x^\top w),\tag{32}$$

where $w\in\mathbb{R}^N$ is the weight vector to be estimated. In supervised learning, this weight vector is determined from a set of input-output pairs

$$\mathcal{S}=\big\{(x_\ell,y_\ell)\in\mathbb{R}^N\times\{-1,+1\}\;\big|\;\ell\in\{1,\ldots,L\}\big\},\tag{33}$$

which is called the training set. More precisely, the learning task can be defined as the trade-off between fitting the training data and reducing the model complexity, leading to an optimization problem expressed as

$$\operatorname*{minimize}_{w\in\mathbb{R}^N}\;f(w)+\sum_{\ell=1}^{L}h(y_\ell x_\ell^\top w),\tag{34}$$

where $f$ is a regularization function and $h$ stands for the loss function. In the context of sparse learning, a popular choice for the regularization is the $\ell_1$-norm. Although many choices for the loss function are possible, we are primarily interested in the logistic loss, which is detailed in the next section.
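For concreteness, the objective in (34) with the $\ell_1$ regularizer and the logistic loss of the next section can be coded directly. A small sketch, assuming NumPy, with the loss evaluated through `logaddexp` for numerical stability (function names are illustrative):

```python
import numpy as np

def logistic_loss(v):
    # h(v) = log(1 + exp(-v)), computed stably as logaddexp(0, -v)
    return np.logaddexp(0.0, -v)

def objective(w, X, y, lam):
    # lam * ||w||_1 + sum_l h(y_l * x_l^T w), with rows of X as the inputs x_l
    return lam * np.abs(w).sum() + logistic_loss(y * (X @ w)).sum()

def predict(X, w):
    # decision rule (32): d_w(x) = sign(x^T w)
    return np.sign(X @ w)
```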

### 3.1 Logistic regression

Logistic regression aims at maximizing the posterior probability density function of the weights given the training data, here assumed to be a realization of statistically independent input-output random variables. This leads us to

$$\operatorname*{maximize}_{w\in\mathbb{R}^N}\;\varphi(w)\prod_{\ell=1}^{L}\pi(y_\ell\mid x_\ell,w)\,\theta_\ell(x_\ell\mid w),\tag{35}$$

where $\varphi$ is the weight prior probability density function and, for every $\ell\in\{1,\ldots,L\}$, $\theta_\ell(x_\ell\mid w)$ is the conditional data likelihood of the $\ell$-th input knowing the weight values, while $\pi(y_\ell\mid x_\ell,w)$ is the conditional probability of the $\ell$-th output knowing the $\ell$-th input and the weights. Let us model this conditional probability with the sigmoid function defined as

$$\pi(y_\ell\mid x_\ell,w)=\frac{1}{1+\exp(-y_\ell x_\ell^\top w)},\tag{36}$$

and assume that the inputs and the weights are statistically independent and that $\varphi$ is log-concave. Then, the negative logarithm of the energy in (35) yields an instance of Problem (34) in which

$$(\forall v\in\mathbb{R})\quad h(v)=\log\big(1+\exp(-v)\big)\tag{37}$$

and, for every $w\in\mathbb{R}^N$, $f(w)=-\log\varphi(w)$. (The term $\theta_\ell(x_\ell\mid w)$ can be discarded, since the inputs and the weights are assumed statistically independent.) The function in (37) is called the logistic loss. For completeness, note that other loss functions, leading to different kinds of classifiers, are the hinge loss Cortes1995

$$(\forall v\in\mathbb{R})\quad h_{\mathrm{hinge}}(v)=\max\{0,1-v\},\tag{38}$$

and the Huber loss Martins2016

$$(\forall v\in\mathbb{R})\quad h_{\mathrm{huber}}(v)=\begin{cases}0&\text{if }v\ge1\\-v&\text{if }v\le-1\\\tfrac{1}{4}(1-v)^2&\text{otherwise.}\end{cases}$$
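As announced in Section 1, the proximity operator of the logistic loss (37) admits a closed form in terms of the generalized Lambert W function. Since that special function is not available in standard numerical libraries, the sketch below instead computes $p=\operatorname{prox}_{\gamma h}(x)$ by applying Newton's method to the scalar optimality condition $p-x-\gamma/(1+e^{p})=0$; this is a numerical stand-in under that assumption, not the authors' closed-form expression:

```python
import numpy as np

def prox_logistic(x, gamma, iters=50):
    """prox of gamma*h with h(v) = log(1 + exp(-v)), componentwise.

    Solves p - x - gamma/(1 + exp(p)) = 0 by Newton's method; the
    left-hand side is strictly increasing in p, so the root is unique.
    """
    x = np.asarray(x, dtype=float)
    p = x.copy()
    for _ in range(iters):
        e = np.exp(-np.abs(p))                                   # stable building block
        sig = np.where(p >= 0, 1.0 / (1.0 + e), e / (1.0 + e))   # sigma(p)
        F = p - x - gamma * (1.0 - sig)      # gamma/(1+e^p) = gamma*(1 - sigma(p))
        dF = 1.0 + gamma * sig * (1.0 - sig)
        p = p - F / dF
    return p
```

Each evaluation costs a handful of scalar Newton steps per sample, which is what makes the prox-based treatment of the logistic loss practical in the algorithm above.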