# An efficient stochastic Newton algorithm for parameter estimation in logistic regressions

Logistic regression is a well-known statistical model which is commonly used in the situation where the output is a binary random variable. It has a wide range of applications including machine learning, public health, social sciences, ecology and econometry. In order to estimate the unknown parameters of logistic regression with data streams arriving sequentially and at high speed, we focus our attention on a recursive stochastic algorithm. More precisely, we investigate the asymptotic behavior of a new stochastic Newton algorithm. It enables to easily update the estimates when the data arrive sequentially and to have research steps in all directions. We establish the almost sure convergence of our stochastic Newton algorithm as well as its asymptotic normality. All our theoretical results are illustrated by numerical experiments.

## Authors

• 6 publications
• 7 publications
• 3 publications
06/23/2020

### An efficient Averaged Stochastic Gauss-Newton algorithm for estimating parameters of non linear regressions models

Non linear regression models are a standard tool for modeling real pheno...
06/23/2020

### An efficient Averaged Stochastic Gauss-Newtwon algorithm for estimating parameters of non linear regressions models

Non linear regression models are a standard tool for modeling real pheno...
11/19/2020

### On the asymptotic rate of convergence of Stochastic Newton algorithms and their Weighted Averaged versions

The majority of machine learning methods can be regarded as the minimiza...
11/07/2018

### Interpreting the Ising Model: The Input Matters

The Ising model is a widely used model for multivariate binary data. It ...
09/15/2021

### Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Streaming Data

Motivated by the high-frequency data streams continuously generated, rea...
07/08/2021

### Identification and Adaptation with Binary-Valued Observations under Non-Persistent Excitation Condition

Dynamical systems with binary-valued observations are widely used in inf...
10/03/2015

### Distributed Parameter Map-Reduce

This paper describes how to convert a machine learning problem into a se...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Logistic regression is a well-known statistical model which is commonly used in the situation where the output is a binary random variable. It has a wide range of applications including machine learning [1], public health [11], social sciences, ecology [9] and econometry [17]. In what follows, we will consider a sequence of random variables taking values in , and we will assume that

is a sequence of independent and identically distributed random vectors such that, that for all

, the conditional distribution of knowing [8]. More precisely, let be the unknown parameter belonging to of the logistic regression. For all , we denote and we assume that

 L(Yn|Φn)=B(π(θTΦn))whereπ(x)=exp(x)1+exp(x).

Our goal is the estimation of the vector of parameters . For that purpose, let be the convex positive function defined, for all , by

 G(h) = E[−log(π(hTΦ)Y(1−π(hTΦ))1−Y)], = E[log(1+exp(hTΦ))−hTΦY]

where and shares the same distribution as . We clearly have . Hence, one can easily check that the unknown parameter satisfies

 (1.1) ∇G(θ)=E[Φ(π(θTΦ)−Y)]=0.

Consequently, under some standard convexity assumptions on ,

 (1.2) θ=argminh∈Rd+1G(h).

Since there is no explicit solution of the equation , it is necessary to make use of an approximation algorithm in order to estimate . Usually, when the sample size is fixed, we approximate the solution with the help of a Newton root-finding numerical algorithm. However, when the data streams arrive sequentially and at high speed, it is much more appropriate and efficient to treat them with the help of stochastic gradient algorithms. We refer the reader to the seminal paper [15] and to its averaged version [14, 16], as well as to the more recent contributions on the logistic regression [1, 6, 5]. One can observe that in these last references, the conditional distribution is the Rademacher distribution, instead of the usual Bernoulli one.

In this paper, we propose an alternative strategy to stochastic gradient algorithms, in the spirit of the Newton algorithm, in the sense that the step sequence of stochastic gradient algorithms is replaced by recursive estimates of the inverse of the Hessian matrix of the function we are minimizing. This strategy enables us to properly deal with the situation where the Hessian matrix has eigenvalues with significantly different absolute values. Indeed, in that case, it can be necessary to adapt automatically the step of the algorithm in all directions. To be more precise, we propose to estimate the unknown parameter

with the help of a stochastic Newton algorithm given, for all , by

 an = π(θTn−1Φn)(1−π(θTn−1Φn)), S−1n = S−1n−1−an(1+anΦTnS−1n−1Φn)−1S−1n−1ΦnΦTnS−1n−1, θn = θn−1+S−1nΦn(Yn−π(θTn−1Φn))

where the initial value is a bounded vector of which can be arbitrarily chosen and is a positive definite and deterministic matrix. For the sake of simplicity and in all the sequel, we take where

stands for the identity matrix of order

. One can observe that

 Sn=n∑k=1akΦkΦTk+Id+1

Moreover, is updated recursively, thanks to Riccati’s equation ([4], page 96) which enables us to avoid the useless inversion of a matrix at each iteration of the algorithm. Furthermore, the matrix is an estimate of the Hessian matrix at the unknown value . In order to ensure the convergence of the stochastic Newton algorithm, a modified version of this algorithm is provided. We shall prove its asymptotic efficiency by establishing its almost sure convergence and its asymptotic normality.
This algorithm is closely related to the iterative one used to estimate the unknown vector

of a linear regression model satisfying, for all

, . As a matter of fact, the updating of the least squares estimator of the parameter is given by

 S−1n = S−1n−1−(1+ΦTnS−1n−1Φn)−1S−1n−1ΦnΦTnS−1n−1 θn = θn−1+S−1nΦn(Yn−θTn−1Φn)

where the initial value can be arbitrarily chosen and is a positive definite deterministic matrix. This algorithm can be considered as a Newton stochastic algorithm since the matrix is an estimate of the Hessian matrix of the least squares criterion .
To the best of our knowledge and apart from the least squares estimate mentioned above, stochastic Newton algorithms are hardly ever used and studied since they often require the inversion of a matrix at each step, which can be very expensive in term of time calculation. An alternative to the stochastic Newton algorithm is the BFGS algorithm [12, 10, 2] based on the recursive estimation of a matrix whose behavior is closed to the one of the inverse of the Hessian matrix. Nevertheless, this last estimate does not converge to the exact inverse of the Hessian matrix. Consequently, the estimation of the unknown vector is not satisfactory.
The paper is organized as follows. Section 2 describes the framework and our main assumptions. In Section 3, we introduce our new stochastic Newton algorithm. Section 4 is devoted to its almost sure convergence as well as its asymptotic normality. Our theoretical results are illustrated by numerical experiments in Section 5. Finally, all technical proofs are postponed to Sections 6 and 7.

## 2. Framework

In what follows, we shall consider a couple of random variables taking values in where is a positive integer, and such that

with and where is the unknown parameter to estimate. We recall that is a minimizer of the convex function defined, for all , by

 (2.1) G(h)=E[−log(π(hTΦ)Y(1−π(hTΦ))1−Y)]=E[g(Φ,Y,h)].

In all the sequel, we assume that the following assumptions are satisfied.

• The vector

has a finite moment of order

and the matrix is positive definite.

• The Hessian matrix is positive definite.

These assumptions ensure that is the unique minimizer of the functional . Assumption (A1) enables us to find a first lower bound for the smallest eigenvalue of the estimates of the Hessian matrix, while assumptions (A1) and (A2) give the unicity of the minimizer of and ensure that the functional is twice continuously differentiable. More precisely, for all , we have

 (2.2) ∇G(h) = (2.3) ∇2G(h) = E[∇2hg(Φ,Y,h)]=14E[1(cosh(hTΦ/2))2ΦΦT].
###### Remark 2.1.

In the previous literature, it is more usual to consider a variable taking values in , which means that is the Rademacher distribution [1, 6, 5]. In this context, is a minimizer of the functional defined, for all , by

 G(h)=E[ln(1+exp(−YhTΦ))].

Under assumptions, the functional is twice continuously differentiable and, for all ,

 ∇G(h) = −E[exp(−YhTΦ)1+exp(−YhTΦ)YΦ], ∇2G(h) = 14E[1(cosh(hTΦ/2))2ΦΦT].

One can observe that the Hessian matrix remains the same. It ensures that the algorithm introduced in Section 3 can be adapted to this functional and that all the results given in Section 4 still hold in this case.

## 3. Stochastic Newton algorithm

In order to deal with massive data acquired online, let us recall that the stochastic Newton algorithm presented in the introduction is given, for all , by

 (3.1) an = π(θTn−1Φn)(1−π(θTn−1Φn)), (3.2) S−1n = S−1n−1−an(1+anΦTnS−1n−1Φn)−1S−1n−1ΦnΦTnS−1n−1, (3.3) θn = θn−1+S−1nΦn(Yn−π(θTn−1Φn))

where the initial value is a bounded vector of which can be arbitrarily chosen and . Unfortunately, we were not able to prove that converges almost surely to the Hessian matrix , as well as to establish the almost sure convergence of to . This is the reason why we slightly modify our strategy by proposing a truncated version of previous estimates given, for all , by

 (3.4) ˆan = π(ˆθTn−1Φn)(1−π(ˆθTn−1Φn)) (3.5) ˆθn = ˆθn−1+S−1n−1Φn(Yn−π(ˆθTn−1Φn)) (3.6) S−1n = S−1n−1−αn(1+αnΦTnS−1n−1Φn)−1S−1n−1ΦnΦTnS−1n−1

where the initial value is a bounded vector of which can be arbitrarily chosen, and is a sequence of random variable defined, for some positive constant , by

 (3.7) αn=max{ˆan,cαnβ}=max⎧⎨⎩14(cosh(ΦTnˆθn−1/2))2,cαnβ⎫⎬⎭

with . From now on and for the sake of simplicity, we assume that . It immediately implies that, for all , . However, the proofs remains true for any . We already saw in Section 1 that coincides with the exact inverse of the weighted matrix given, for all , by

 (3.8) Sn=n∑k=1αkΦkΦTk+Id+1.

Moreover, we will see in Section 4 that, even with this truncation of the estimate of the Hessian matrix, converges almost surely to the Hessian matrix . Consequently, we will still have an optimal asymptotic behavior of the estimator of .

## 4. Main results

Our first result deals with the almost sure convergence of our estimates of and the Hessian matrix . For all , denote

 ¯¯¯¯Sn=1nSn.
###### Theorem 4.1.

Assume that (A1) and (A2) are satisfied. Then, we have the almost sure convergences

 (4.1) limn→∞ˆθn=θa.s
 (4.2)

We now focus on the almost sure rates of convergence of our estimate of .

###### Theorem 4.2.

Assume that (A1) and (A2) are satisfied. Then, we have for all ,

 (4.3) ∥∥ˆθn−θ∥∥2=o((logn)1+γn)a.s

Moreover, suppose the random vector has a finite moment of order . Then, we have

 (4.4) ∥∥ˆθn−θ∥∥2=O(lognn)a.s

The almost sure rates of convergence of our estimate of the Hessian matrix and its inverse are as follows.

###### Theorem 4.3.

Assume that (A1) and (A2) are satisfied and that the random vector has a finite moment of order . Then, we have for all ,

 (4.5) ∥∥¯¯¯¯Sn−∇2G(θ)∥∥2=O(1n2β)a.s

 (4.6) ∥∥¯¯¯¯S−1n−(∇2G(θ))−1∥∥2=O(1n2β)a.s
###### Proof.

The proofs of Theorems 4.1, 4.2 and 4.3 are given in Section 6. ∎

###### Remark 4.1.

One can observe that we do not obtain the parametric rate for these estimates. This is due to the truncation which slightly modifies our estimation procedure. However, without this truncation, we were not able to establish the almost sure convergence of any estimate. Finally, the last result (4.6) ensures that our estimation procedure performs pretty well and that the estimator has an optimal asymptotic behavior.

###### Theorem 4.4.

Assume that (A1) and (A2) are satisfied and that the random vector has a finite moment of order . Then, we have the asymptotic normality

 (4.7) √n(ˆθn−θ)L−−−→n→∞N(0,(∇2G(θ))−1).
###### Proof.

The proof of Theorem 4.4 is given in Section 7. ∎

###### Remark 4.2.

We deduce from (4.2) and (4.7) that

 (4.8) (ˆθn−θ)TSn(ˆθn−θ)L−−−→n→∞χ2(d+1).

Convergence (4.8) allows us to build confidence regions for the parameter . Moreover, for any vector different from zero, we also have

 (4.9) wT(ˆθn−θ)√wTS−1nwL−−−→n→∞N(0,1).

Confidence intervals and significance tests for the components of can be designed from (4.9). One can observe that our stochastic Newton algorithm has the same asymptotic behavior as the averaged version of a stochastic gradient algorithm [5, 7, 13].

## 5. Numerical experiments

The goal of this section is to illustrate the asymptotic behavior of the truncated stochastic Newton algorithm (TSN) defined by equation (3.5). For that purpose, we will focus on the model introduced in [3] and used for comparing several gradient algorithms. We shall compare the numerical performances of the TSN algorithm with those obtained with three different algorithms : the Stochastic Newton (SN) algorithm given by equation (3.3), the stochastic gradient algorithm (SG), and the averaged stochastic gradient algorithm (ASG). Let us mention that simulations were carried out using the statistical software R.

### 5.1. Experiment model

We focus on the model introduced in [3], defined by

 L(Y|Φ)=B(π(θTΦ))

where and is a random vector of with

with independent coordinates uniformly distributed on the interval

. Moreover the unknown parameter . This model is particularly interresting since it leads to a Hessian matrix with eigenvalues of different order sizes. Indeed, one can see in Table 1 that the smallest eigenvalue of is close to 4.422 while its largest eigenvalue is close to 0.1239.

### 5.2. Comparison of the different algorithms

Our comparisons are based on the mean squared error (MSE) defined, for all estimate of , by

 E[∥∥ˆθn−θ∥∥2].

We simulate samples wth a maximum number of iterations . For each sample, we estimate the unknown parameter using the four algorithms (TSN, SN, SG, ASG) which are initialized identically by choosing the initial value uniformly in a compact subset containing the true value . For the TSN and SN algorithms, we take . In addition, for the TSN algorithm, we choose the truncation term defined by and . Finally, to be fairplay, we choose the best step sequence for the SG algorithm with the help of a cross validation method. Figure 1 shows the decreasing behavior of the MSE, calculated for the four algorithms, as the number of iterations grows from to .

It is clear that the stochastic Newton algorithms perform much more better than the stochastic gradient algorithms The bad behavior of the stochastic gradient algorithms is certainly due to the fact that the eigenvalues of the Hessian matrix are at different scales. One can also observe that it is quite useless to average the SG algorithm.
On can find in Figure 2 the boxplots of the values of the squared error computed for the TSN and SN algorithms, as well as for the deterministic Newton-Raphson algorithm (NR).

### 5.3. Some comments concerning the truncation.

To close this section, let us make some comments concerning the truncation term introduced in the TSN algorithm. This short numerical experiment tends to show that the use of the truncation is artificial and useless. Indeed, one can take the constant in (3.7) as small as possible and see that the TSN and SN algorithms match. Finally, an inappropriate choice of can lead to a poor numerical behavior of the TSN algorithm.

## 6. Proofs of the almost sure convergence results

### 6.1. Two technical lemmas

We start the proofs of the almost sure convergence results with two technical lemmas.

###### Lemma 6.1.

Assume that the random vector has a finite moment of order . Then, we have the almost sure convergence for all ,

 (6.1) limn→∞1∑nk=1k−βn∑k=1k−βΦkΦTk=E[ΦΦT]a.s
###### Remark 6.1.

We obtain from (3.7) together with (3.8) that for all ,

 (6.2) Sn⩾cαn∑k=1k−βΦkΦTk.

Denote by the minimum eigenvalue of the positive definite matrix . We immediately obtain from (6.1) and (6.2) that for large enough

 λmin(Sn)⩾λcα2n∑k=1k−β⩾λcα2n1−β% a.s

where stands for the minimum eigenvalue of the positive definite deterministic matrix . Consequently, we have under assumption (A1) that for large enough,

 (λmin(Sn))−2⩽4λ2c2α1n2(1−β)a.s

Therefore, as soon as ,

 (6.3) ∞∑n=1(λmin(Sn))−2<∞a.s
###### Proof.

It follows from a straightforward Abel transform calculation that

 (6.4) n∑k=1k−βΦkΦTk = n∑k=1k−β(Σk−Σk−1), = n−βΣn+n−1∑k=1(k−β−(k+1)−β)Σk, = n−βΣn+n−1∑k=1bkk−1Σk

where and for all ,

 Σn=n∑k=1ΦkΦTkandbn=n(n−β−(n+1)−β).

On the one hand, we obtain from the standard strong law of large numbers that

 (6.5) limn→∞1nΣn=E[ΦΦT]a.s

On the other hand,

 n∑k=1bk=n∑k=1k−β−n1−β

which implies that

 limn→∞1n1−βn∑k=1bk=β1−β.

Then, we deduce from Toeplitz’s lemma given e.g. in ([4], page 54) that

 (6.6) limn→∞1n1−βn−1∑k=1bkk−1Σk=β1−βE[ΦΦT]a.s

Consequently, we obtain from (6.4) together with (6.5) and (6.6) that

 limn→∞1n1−βn∑k=1k−βΦkΦTk=11−βE[ΦΦT]a.s

which immediately leads to (6.1). ∎

Our second lemma concerns a useful Lipschitz property of the function defined, for all , by

 (6.7) α(h,ℓ)=π(hTℓ)(1−π(hTℓ))=14(cosh(hTℓ/2))2.
###### Lemma 6.2.

For all , we have

 (6.8) ∣∣α(h,ℓ)−α(h,ℓ′)∣∣⩽112√3∥h∥∥∥ℓ−ℓ′∥∥.
###### Proof.

Let be the function defined, for all , by

 φ(x)=π(x)(1−π(x))=exp(x)(1+exp(x))2=14(cosh(x/2))2.

We clearly have

 φ′(x) = exp(x)(1−exp(x))(1+exp(x))3, φ′′(x) = exp(x)((exp(x))2−4exp(x)+1)(1+exp(x))4.

It is not hard to see that for all ,

 (6.9) |φ′(x)|⩽16√3.

Hence, it follows from (6.9) together with the mean value theorem that for all ,

 (6.10) |φ(x)−φ(y)|⩽16√3|x−y|.

Consequently, we obtain from (6.10) that for all ,

 ∣∣α(h,ℓ)−α(h,ℓ′)∣∣⩽112√3∣∣hT(ℓ−ℓ′)∣∣⩽112√3∥h∥∥∥ℓ−ℓ′∥∥

which completes the proof of Lemma 6.2. ∎

### 6.2. Proof of Theorem 4.1.

We are now in the position to proceed to the proof of the almost sure convergence (4.1). By a Taylor expansion of the twice continuously differentiable functional , there exists such that

 (6.11) G(ˆθn+1)=G(ˆθn)+∇G(ˆθn)T(ˆθn+1−ˆθn)+12(ˆθn+1−ˆθn)T∇2G(ξn)(ˆθn+1−ˆθn).

We clearly have from (2.3) that

Hence, we obtain from (3.5) together with (6.11) that

 G(ˆθn+1) ⩽ G(ˆθn)+∇G(ˆθn)T(ˆθn+1−ˆθn)+18E[∥Φ∥2]∥∥ˆθn+1−ˆθn∥∥2, = G(ˆθn)−∇G(ˆθn)TS−1nZn+1+18E[∥Φ∥2]∥∥S−1nZn+1∥∥2, ⩽ G(ˆθn)−∇G(ˆθn)TS−1nZn+1+18E[∥Φ∥2](λmin(Sn))−2∥∥Zn+1∥∥2

where . Since , it implies that

 (6.12) G(ˆθn+1)⩽G(ˆθn)−∇G(ˆθn)TS−1nZn+1+18E[∥Φ∥2](λmin(Sn))−2∥∥Φn+1∥∥2.

Let be the filtration given, for all , by . We clearly have . Consequently, we obtain from (6.12) that

 (6.13)

Our goal is now to apply the Robbins-Siegmund theorem ([4], page 18) to the three positive sequences , and given by ,

It clearly follows from (6.13) that

 E[Vn+1|Fn]⩽Vn+An−Bna.s.

Moreover, we already saw from (6.3) that

 (6.14) ∞∑n=1An<∞a.s

Consequently, we can deduce from the Robbins-Siegmund theorem that convergences almost surely to a finite random variable and

 (6.15) ∞∑n=1Bn<∞a.s

Furthermore, since , we get from (6.15) that

 (6.16)

In addition, we obtain from (3.7) together with (3.8) that

 λmax(Sn)⩽1+14λmax(n∑k=1ΦkΦTk)

since, for all , . Therefore, (6.5) ensures that for large enough

 λmax(Sn)⩽Λna.s

where is for the maximum eigenvalue of the positive definite deterministic matrix . It implies that

 (6.17) ∞∑n=1(λmax(Sn))−1=+∞a.s

Hence, it follows from the conjunction of (6.16) and (6.17) that converges to almost surely. It means that converges almost surely to the unique zero of the gradient, which is exactly what we wanted to prove. It remains to prove the almost sure convergence (4.2). We infer from (3.8) that

 (6.18) ¯¯¯¯Sn=1nn∑k=1(αk−α(Φk,ˆθk−1))ΦkΦTk+1nn∑k=1α(Φk,ˆθk−1)ΦkΦTk+1nId+1.

We now give the convergence of the two terms on the right-hand side of (6.18). For the first one, one can observe that as soon as . Consequently,

It implies that

However, one can easily check from Lemma 6.1 that

 limn→∞1nn∑k=11kβ∥∥Φk∥∥2=0a.s

It means that the first term on the right-hand side of (6.18) goes to almost surely. We now study the convergence of the second term on the right-hand side of (6.18) which can be rewritten as

 1nn∑k=1α(Φk,ˆθk−1)ΦkΦTk=1nn∑k=1α(Φk,θ)ΦkΦTk.+1nn∑k=1(α(Φk,ˆθk−1)−α(Φk,θ))ΦkΦTk

On the one hand, thanks to the standard strong law of large numbers, we clearly have

 (6.19)

On the other hand, denote by the remainder

We can split into two terms where, for some positive constant ,

 Pn = n∑k=1(α(Φk,ˆθk−1)−α(Φk,θ))ΦkΦTkI{∥Φk∥⩽M} Qn = n∑k=1(α(Φk,ˆθk−1)−α(Φk,θ))ΦkΦTkI{∥Φk∥>M}.

It follows from the Lipschitz property of the function given is Lemma 6.2 that

Hence, we deduce from (4.1) together with (6.5) that

 (6.20) limn→∞1nPn=0a.s

Furthermore, we also have

 ∥∥1nQn∥∥⩽12nn∑k=1∥Φk∥2I{∥Φk∥>M}.

We deduce once again from the strong law of large numbers that

which implies via (6.20) that for any positive constant ,

 (6.21) limsupn→∞∥∥1nRn∥∥⩽12E[∥Φ∥2I{∥Φ∥>M}]a.s

Nonetheless, we obtain from the Lebesgue dominated convergence theorem that

Consequently, we find from (6.21)

 limn→∞1nRn=0a.s

Finally, (4.2) follows from (6.18) and (6.19), which achieves the proof of Theorem 4.1.

### 6.3. Proof of Theorem 4.2.

It follows from equation (3.5) that for all ,

 ˆθn+1−θ=ˆθn−θ−1n(¯¯¯¯S−1n−S−1)Zn+1−1nS−1Zn+1

where and . Consequently,

 (6.22) ˆθn+1−θ=ˆθn−θ−1n(¯¯¯¯S−1n−S−1)Zn+1−1nS−1(∇G(ˆθn)+εn+1)

where . We already saw that which clearly implies that is a martingale difference sequence, . Denote by the remainder of the Taylor’s expansion of the gradient

 δn=∇G(ˆθn)−∇2G(θ)(ˆθn−θ)=∇G(ˆθn)−S(ˆθn−θ).

We deduce from (6.22) that for all ,

 (6.23) ˆθn+1−θ=(1