# Logarithmic Regret for parameter-free Online Logistic Regression

We consider online optimization procedures in the context of logistic regression, focusing on the Extended Kalman Filter (EKF). We introduce a second-order algorithm close to the EKF, named Semi-Online Step (SOS), for which we prove an $O(\log n)$ regret in the adversarial setting, paving the way to similar results for the EKF. This regret bound on SOS is the first for such a parameter-free algorithm in adversarial logistic regression. For the EKF in constant dynamics, we prove an $O(\log n)$ regret in expectation and in the well-specified logistic regression model.


## 1 Introduction

In the convex online optimization literature (Hazan, 2016), a crucial issue is the tuning of parameters. Our aim is to develop parameter-free algorithms in the context of logistic regression. One observes $y_t \in \{-1, 1\}$ recursively through time $t = 1, 2, \ldots$. At each instance $t$, the objective is to construct a prediction of the next value $y_t$. In hand, we have explanatory variables $X_t$ in $\mathbb{R}^d$ along with the past pairs $(X_s, y_s)_{1 \le s \le t-1}$. We reduce the prediction to a $d$-dimensional optimization problem thanks to the logistic loss function

$$\ell_t(y_t, \hat\theta_t) = \log\big(1 + \exp(-y_t \hat\theta_t^T X_t)\big), \qquad t = 1, 2, \ldots,$$

where the estimates $\hat\theta_t \in \mathbb{R}^d$, $t \ge 1$, are provided by a recursive algorithm. The aim of online convex optimization is to provide regret bounds on the cumulative losses for algorithms whose recursive update step is of constant complexity.
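For concreteness, the logistic loss and its gradient $-y_t X_t / (1 + e^{y_t \theta^T X_t})$ can be evaluated without numerical overflow. The following is a minimal sketch in plain Python; the function names `logistic_loss` and `logistic_gradient` are illustrative choices and do not come from the paper.

```python
import math

def logistic_loss(y, theta, x):
    """Logistic loss log(1 + exp(-y * theta^T x)) for a label y in {-1, +1}."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    # Stable form: log(1 + e^{-m}) = max(-m, 0) + log(1 + e^{-|m|}).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def logistic_gradient(y, theta, x):
    """Gradient of the loss in theta: -y * x / (1 + exp(y * theta^T x))."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    e = math.exp(-abs(margin))
    # weight = 1 / (1 + e^{margin}), written so that exp never overflows.
    weight = e / (1.0 + e) if margin >= 0 else 1.0 / (1.0 + e)
    return [-y * weight * xi for xi in x]
```

At $\theta = 0$ the loss equals $\log 2$ whatever the label, which gives a quick sanity check of any implementation.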

The logistic loss is exp-concave, a property that guarantees the existence of online procedures achieving $O(\log n)$ regret in the adversarial setting. The seminal paper of Hazan et al. (2007) proposed two such algorithms, Online Newton Step and Follow The Approximate Leader, that achieve this rate of convergence. Both methods require the knowledge of some constants unknown in practice, namely the constant of exp-concavity and an upper bound on the gradients of the losses. They also require a projection step on a convex set of finite diameter. We consider these methods as localized ones because they use the strongly convex paraboloid local approximation of any exp-concave function stated in Lemma 3 in Hazan et al. (2007).

On the contrary, some recent papers (Bach and Moulines, 2013; Gadat and Panloup, 2017; Godichon-Baggioni, 2018) propose global algorithms in the stochastic setting. Bach and Moulines (2013) provide sharp regret bounds for a two-step procedure where the crucial step is the averaging of a Stochastic Gradient Descent (SGD) with a constant learning rate that has to be tuned. In Gadat and Panloup (2017) and Godichon-Baggioni (2018), the authors propose non-asymptotic regret bounds, with large constants, for the averaging of an SGD with more robust learning rates that do not need to be tuned. Our results have the same flavor for a very popular online algorithm, the Extended Kalman Filter (EKF), whose non-asymptotic properties have not yet been studied.

For linear regression, Kalman filters as originally described in Kalman and Bucy (1961) present a Bayesian perspective. The idea is to estimate the conditional expectation of the future state and its variance, given a prior on the initial state and past observations that follow a dynamic model. The Kalman recursion is exactly the ridge regression estimator, see Diderrich (1985), so the Kalman filter achieves an $O(\log n)$ regret for quadratic losses in the adversarial setting. Note that the global strong convexity of the loss is crucial in the analysis of the regret in Cesa-Bianchi and Lugosi (2006).

The Extended Kalman Filter (EKF) of Fahrmeir (1992) yields an online parameter-free algorithm for logistic regression. More generally, the EKF works in any misspecified Generalized Linear Model as defined in Rigollet (2012). Recently, the equivalence between Kalman filtering under constant dynamics and Online Natural Gradient has been noticed by Ollivier (2018). It is our belief that Kalman filtering offers an optimal way to choose the step-size in an online gradient descent algorithm. To our knowledge, regret bounds have been derived only for the batch Maximum Likelihood Estimator, also called Follow The Leader (FTL) in the online learning literature. The complexity of this batch algorithm is prohibitive, see the discussion in Hazan et al. (2007). In our paper, we view the EKF as an approximation of FTL in order to derive a regret bound.

As an intermediate step, we prove an $O(\log n)$ regret in the logistic regression problem for a second-order algorithm between FTL and EKF. We name it the Semi-Online Step (SOS) algorithm as it requires $O(t)$ computations at step $t$, i.e. its complexity is quadratic in the number of iterations. Despite its inefficiency, the analysis of SOS is interesting as the non-asymptotic guarantee is valid in any adversarial setting. One can also interpret the extra computations per iteration compared to the EKF as the cost of the estimation of the local strong convexity constant of the paraboloid approximation.

The EKF is the natural online approximation of SOS. It is efficient (constant time per iteration) and we prove an $O(\log n)$ regret, in expectation and in the well-specified logistic regression setting only. The analysis of the regret splits into two steps. When the algorithm is close to the optimum, its regret is logarithmic with high probability. This logarithmic rate is due to the nice martingale properties of the gradients of the losses: the conditional expectation of the gradient is proportional to its quadratic variation. The logarithmic regret bound follows from the local paraboloid approximation of Hazan et al. (2007). The other phase, when the algorithm explores the optimization space, is much more problematic to analyze because the local paraboloid approximation does not apply uniformly. To circumvent this issue, we appeal to more robust potential arguments as in Gadat and Panloup (2017). We obtain a logarithmic control on the number of iterations spent in this first, exploratory phase in expectation only. It is an open question whether this number of iterations can be controlled with high probability.

The paper is organized as follows. In Section 2, we introduce the SOS algorithm and we give its regret in Theorem 1 followed by its proof. In Theorem 6 of Section 3, we present our result in expectation for the EKF. We present the main steps of the proof of Theorem 6 in Section 4. Finally we discuss the results and future work in Section 5.

## 2 Semi-Online Step algorithm

In Section 2.1, we introduce the SOS algorithm as a semi-online approximation of the batch FTL algorithm

$$\theta_t^* \in \arg\min_{\theta} \sum_{s=1}^{t-1} \ell_s(y_s, \theta). \tag{1}$$

We see in Section 2.2 that SOS is also very close to the EKF but with higher complexity. Then we prove a bound on the regret of SOS in Section 2.3.

### 2.1 Construction of the SOS algorithm

The Semi-Online Step is described in Algorithm 1. We derive it from the Taylor approximation

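$$\frac{\partial}{\partial\theta}\Big[\sum_{s=1}^{t}\ell_s(y_s,\theta)\Big] \approx \frac{\partial}{\partial\theta}\Big[\sum_{s=1}^{t}\ell_s(y_s,\theta)\Big]\Big|_{\theta=\theta_t^*} + \frac{\partial^2}{\partial\theta^2}\Big[\sum_{s=1}^{t}\ell_s(y_s,\theta)\Big]\Big|_{\theta=\theta_t^*}(\theta-\theta_t^*),$$

which transforms the first-order condition of the optimization problem (1) realized by $\theta_{t+1}^*$ into

$$\frac{\partial}{\partial\theta}\Big[\sum_{s=1}^{t}\ell_s(y_s,\theta)\Big]\Big|_{\theta=\theta_t^*} + \frac{\partial^2}{\partial\theta^2}\Big[\sum_{s=1}^{t}\ell_s(y_s,\theta)\Big]\Big|_{\theta=\theta_t^*}(\theta_{t+1}^*-\theta_t^*) \approx 0.$$

Using the definition of $\theta_t^*$ we have

$$\frac{\partial}{\partial\theta}\Big[\sum_{s=1}^{t-1}\ell_s(y_s,\theta)\Big]\Big|_{\theta=\theta_t^*} = 0.$$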

Combining this identity and the definition of the derivatives of the logistic loss we obtain

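$$\begin{aligned}
\frac{\partial}{\partial\theta}\Big[\sum_{s=1}^{t}\ell_s(y_s,\theta)\Big]\Big|_{\theta=\theta_t^*} &= \frac{\partial}{\partial\theta}\ell_t(y_t,\theta)\Big|_{\theta=\theta_t^*} = \frac{-y_t X_t}{1+e^{y_t (\theta_t^*)^T X_t}},\\
\frac{\partial^2}{\partial\theta^2}\Big[\sum_{s=1}^{t}\ell_s(y_s,\theta)\Big]\Big|_{\theta=\theta_t^*} &= \sum_{s=1}^{t}\frac{X_s X_s^T}{\big(1+e^{(\theta_t^*)^T X_s}\big)\big(1+e^{-(\theta_t^*)^T X_s}\big)}.
\end{aligned}$$

Therefore $\theta_{t+1}^*$ satisfies approximately

$$\Big(\sum_{s=1}^{t}\frac{X_s X_s^T}{\big(1+e^{(\theta_t^*)^T X_s}\big)\big(1+e^{-(\theta_t^*)^T X_s}\big)}\Big)(\theta_{t+1}^*-\theta_t^*) \approx \frac{y_t X_t}{1+e^{y_t (\theta_t^*)^T X_t}}.$$

If the Hessian matrix were invertible, we would obtain

$$\theta_{t+1}^* \approx \theta_t^* + \Big(\sum_{s=1}^{t}\frac{X_s X_s^T}{\big(1+e^{(\theta_t^*)^T X_s}\big)\big(1+e^{-(\theta_t^*)^T X_s}\big)}\Big)^{-1}\frac{y_t X_t}{1+e^{y_t (\theta_t^*)^T X_t}}.$$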

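This relation, approximately satisfied by the sequence of optima $(\theta_t^*)$, motivates the introduction of the SOS algorithm as defined in Algorithm 1. The computation of $\tilde P_{t+1}$ relies on the Sherman-Morrison formula: if $A \in \mathbb{R}^{d \times d}$ is invertible and $u, v \in \mathbb{R}^d$ satisfy $1 + v^T A^{-1} u \neq 0$,

$$(A + u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}. \tag{2}$$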
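Formula (2) is easy to check numerically. The sketch below, in plain Python with nested lists as matrices, applies it to a small 2x2 example; the helper name `sherman_morrison` and the example values are illustrative assumptions, not part of the paper.

```python
def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} computed from A^{-1}, following formula (2)."""
    d = len(u)
    A_inv_u = [sum(A_inv[i][k] * u[k] for k in range(d)) for i in range(d)]   # A^{-1} u
    vT_A_inv = [sum(v[k] * A_inv[k][j] for k in range(d)) for j in range(d)]  # v^T A^{-1}
    denom = 1.0 + sum(v[i] * A_inv_u[i] for i in range(d))                    # 1 + v^T A^{-1} u
    if abs(denom) < 1e-12:
        raise ValueError("A + u v^T is (numerically) singular")
    return [[A_inv[i][j] - A_inv_u[i] * vT_A_inv[j] / denom for j in range(d)]
            for i in range(d)]

# Example: A = diag(2, 4), so A^{-1} = diag(0.5, 0.25).
A_inv = [[0.5, 0.0], [0.0, 0.25]]
u, v = [1.0, 2.0], [3.0, 1.0]
updated_inv = sherman_morrison(A_inv, u, v)

# A + u v^T written out by hand, to verify that updated_inv is its inverse.
B = [[5.0, 1.0], [6.0, 6.0]]
product = [[sum(B[i][k] * updated_inv[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
```

Each update costs $O(d^2)$ operations, against $O(d^3)$ for a fresh inversion, which is what makes the rank-one recursions of this section affordable.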

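We introduce the regularization matrix $\tilde P_1$ which guarantees the positive definiteness of $\tilde P_t$ in Algorithm 1. A good choice is for instance $\tilde P_1 = p_1 I_d$, $p_1 > 0$. SOS then corresponds to the approximation

$$\tilde\theta_t \approx \arg\min_{\theta}\Big(\sum_{s=1}^{t-1}\ell_s(y_s,\theta) + \frac{1}{2p_1}\|\theta\|^2\Big), \qquad t = 1, 2, \ldots.$$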
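Algorithm 1 is not reproduced in this text, so the following is only a sketch of one SOS step reconstructed from the relations above: the regularized Hessian is rebuilt at the current estimate through repeated Sherman-Morrison updates, then a Newton-like step is taken. The names `sos_step`, `add_rank_one` and `hessian_weight`, and the choice $\tilde P_1 = p_1 I_d$, are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def hessian_weight(theta, x):
    """1 / ((1 + e^z)(1 + e^{-z})) with z = theta^T x, in an overflow-free form."""
    e = math.exp(-abs(dot(theta, x)))
    return e / (1.0 + e) ** 2

def add_rank_one(P, x, alpha):
    """(P^{-1} + alpha * x x^T)^{-1} for symmetric P, via Sherman-Morrison."""
    d = len(x)
    Px = [dot(row, x) for row in P]
    denom = 1.0 + alpha * dot(x, Px)
    return [[P[i][j] - alpha * Px[i] * Px[j] / denom for j in range(d)]
            for i in range(d)]

def sos_step(theta, xs, y_t, p1=1.0):
    """One SOS step at time t; xs holds X_1, ..., X_t and y_t is the new label.

    All t Hessian terms are re-evaluated at the current theta, hence O(t d^2) work.
    """
    d = len(theta)
    P = [[p1 if i == j else 0.0 for j in range(d)] for i in range(d)]  # P_1 = p1 * I
    for x_s in xs:
        P = add_rank_one(P, x_s, hessian_weight(theta, x_s))
    x_t = xs[-1]
    margin = y_t * dot(theta, x_t)
    e = math.exp(-abs(margin))
    w = e / (1.0 + e) if margin >= 0 else 1.0 / (1.0 + e)  # 1 / (1 + e^{margin})
    Px = [dot(row, x_t) for row in P]
    return [theta[i] + y_t * w * Px[i] for i in range(d)], P
```

In dimension one with $\tilde\theta_1 = 0$, $p_1 = 1$, $X_1 = 1$ and $y_1 = 1$, the Hessian weight is $1/4$, so $\tilde P_2 = 1/(1 + 1/4) = 0.8$ and $\tilde\theta_2 = 0.8 \times 0.5 = 0.4$.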

### 2.2 Comparison with EKF

The Extended Kalman Filter was introduced by Fahrmeir (1992) for any Dynamic Generalized Linear Model. For constant dynamics, the EKF is shown to be equivalent to the Online Natural Gradient algorithm in Ollivier (2018), yielding the recursion

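$$\begin{aligned}
P_{t+1}^{-1} &= P_t^{-1} + \frac{X_t X_t^T}{\big(1+e^{\hat\theta_t^T X_t}\big)\big(1+e^{-\hat\theta_t^T X_t}\big)},\\
\hat\theta_{t+1} &= \hat\theta_t - P_{t+1}\,\frac{\partial}{\partial\theta}\ell_t(y_t,\theta)\Big|_{\theta=\hat\theta_t}.
\end{aligned}$$

This EKF recursion departs from SOS in the update of the matrix $P_{t+1}$, which satisfies

$$P_{t+1} = \Big(P_1^{-1} + \sum_{s=1}^{t}\frac{X_s X_s^T}{\big(1+e^{\hat\theta_s^T X_s}\big)\big(1+e^{-\hat\theta_s^T X_s}\big)}\Big)^{-1}, \qquad t = 1, 2, \ldots.$$

In the EKF, we add a rank-one matrix to get $P_{t+1}^{-1}$ from $P_t^{-1}$ in order to update the matrix efficiently. On the contrary, the matrix $\tilde P_{t+1}$ in SOS is recomputed at each step because the Hessian has to be computed at the current estimate $\tilde\theta_t$. Despite the similarity between $P_t$ and $\tilde P_t$ we were not able to control their difference. Our analyses of SOS and EKF are distinct and the obtained regret bounds are different in nature.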

Thanks to the Sherman-Morrison formula (2), we describe the EKF in Algorithm 2 avoiding any inversion of matrices. The spatial complexity of the two algorithms is $O(d^2)$, due to the storage of the matrices $\tilde P_t$ and $P_t$. In terms of running time, at step $t$ of the SOS algorithm we have to recompute $\tilde P_{t+1}$ recursively over $s = 1, \ldots, t$ and then $\tilde\theta_{t+1}$. Each recursion on $s$ requires the computation of a rank-one matrix (a vector-vector product) and its addition to the sum, at a cost of $O(d^2)$. Thus, the complexity of step $t$ in SOS is $O(t d^2)$. As a comparison, the EKF updates $P_t$ online and therefore requires only $O(d^2)$ operations at each step.
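Algorithm 2 is not reproduced in this text; the sketch below implements the EKF recursion of this section with a single Sherman-Morrison update per step, which is where the $O(d^2)$ cost per iteration comes from. The names `ekf_step` and `run_ekf`, the starting point $\hat\theta_1 = 0$ and the choice $P_1 = p_1 I_d$ are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def ekf_step(theta, P, x, y):
    """One EKF step: rank-one update of P at theta_t, then the gradient step."""
    d = len(theta)
    z = dot(theta, x)
    e = math.exp(-abs(z))
    alpha = e / (1.0 + e) ** 2                 # 1 / ((1 + e^z)(1 + e^{-z}))
    Px = [dot(row, x) for row in P]            # P_t X_t (P_t is symmetric)
    denom = 1.0 + alpha * dot(x, Px)
    P_next = [[P[i][j] - alpha * Px[i] * Px[j] / denom for j in range(d)]
              for i in range(d)]
    # Since y is +1 or -1, |y z| = |z| and e can be reused for the gradient weight.
    w = e / (1.0 + e) if y * z >= 0 else 1.0 / (1.0 + e)   # 1 / (1 + e^{y z})
    P_next_x = [dot(row, x) for row in P_next]
    theta_next = [theta[i] + y * w * P_next_x[i] for i in range(d)]
    return theta_next, P_next

def run_ekf(data, d, p1=1.0):
    """Run the EKF over (x, y) pairs, starting from theta_1 = 0 and P_1 = p1 * I."""
    theta = [0.0] * d
    P = [[p1 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for x, y in data:
        theta, P = ekf_step(theta, P, x, y)
    return theta, P
```

Unlike the SOS step, nothing from the past is revisited: the Hessian term of time $s$ stays evaluated at $\hat\theta_s$, which is exactly the difference between $P_t$ and $\tilde P_t$ discussed above.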

### 2.3 The regret bound for SOS and its proof

In what follows, we denote

$$D_X = \max_{1 \le t \le n}\|X_t\|, \qquad D_\theta = \max_{1 \le t \le n}\|\tilde\theta_t\|, \qquad D = \max_{1 \le t \le n}|\tilde\theta_t^T X_t|.$$

SOS has the advantage of being easier to analyze than the EKF. We prove a regret bound on SOS in Theorem 1. Note that the leading constant is the inverse square of the exp-concavity constant times $d\sqrt{d}$. The localized algorithms of Hazan et al. (2007) satisfy finer regret bounds with the inverse of the exp-concavity constant times $d$ as the leading constant. We believe that Theorem 1 could be improved to get a constant proportional to the inverse of the exp-concavity constant instead of the square inverse, see the end of the proof of Lemma 2 where we use a very loose bound. To our knowledge, SOS is the first parameter-free algorithm that achieves an $O(\log n)$ regret bound in the adversarial logistic regression setting.

###### Theorem 1.

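Starting from $\tilde\theta_1 \in \mathbb{R}^d$ and $\tilde P_1 = p_1 I_d$ with $p_1 > 0$, for any sequence $(X_t, y_t)_{1 \le t \le n}$ and any $\theta \in \mathbb{R}^d$, the SOS algorithm achieves the regret bound

$$\sum_{t=1}^{n}\big(\ell_t(y_t,\tilde\theta_t)-\ell_t(y_t,\theta)\big) \le \Big(\frac{\sqrt{d}\,D_X(D_\theta+\|\theta\|)(1+e^{D})}{4}+1\Big)\frac{1+e^{D}}{2}\,d\log\big(1+(n-1)p_1 D_X^2\big) + \frac{\|\tilde\theta_1\|^2+\|\theta\|^2}{2p_1} + D_X(D_\theta+\|\theta\|), \qquad n \ge 1.$$

###### Proof.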

We first apply a telescopic sum argument

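$$\begin{aligned}
\sum_{t=1}^{n}\big(\ell_t(y_t,\tilde\theta_t)-\ell_t(y_t,\theta)\big)
&= \sum_{t=1}^{n}\Big(\sum_{s=1}^{t}\ell_s(y_s,\tilde\theta_t)-\sum_{s=1}^{t-1}\ell_s(y_s,\tilde\theta_t)-\ell_t(y_t,\theta)\Big)\\
&= \sum_{t=1}^{n-1}\Big(\sum_{s=1}^{t}\ell_s(y_s,\tilde\theta_t)-\sum_{s=1}^{t}\ell_s(y_s,\tilde\theta_{t+1})\Big)+\sum_{s=1}^{n}\big(\ell_s(y_s,\tilde\theta_n)-\ell_s(y_s,\theta)\big)\\
&= \sum_{t=1}^{n-1}\Big(\sum_{s=1}^{t}\ell_s(y_s,\tilde\theta_t)+\frac12\tilde\theta_t^T\tilde P_1^{-1}\tilde\theta_t-\sum_{s=1}^{t}\ell_s(y_s,\tilde\theta_{t+1})-\frac12\tilde\theta_{t+1}^T\tilde P_1^{-1}\tilde\theta_{t+1}\Big)\\
&\quad+\sum_{s=1}^{n}\ell_s(y_s,\tilde\theta_n)+\frac12\tilde\theta_n^T\tilde P_1^{-1}\tilde\theta_n-\sum_{s=1}^{n}\ell_s(y_s,\theta)-\frac12\theta^T\tilde P_1^{-1}\theta\\
&\quad+\frac12\theta^T\tilde P_1^{-1}\theta-\frac12\tilde\theta_1^T\tilde P_1^{-1}\tilde\theta_1.
\end{aligned}$$

Then, defining $S_t(\theta) = \sum_{s=1}^{t-1}\frac{\partial \ell_s(y_s,\theta)}{\partial\theta} + \tilde P_1^{-1}\theta$, we use the convexity of $\theta \mapsto \sum_{s=1}^{t}\ell_s(y_s,\theta) + \frac12\theta^T\tilde P_1^{-1}\theta$ to obtain linear bounds:

$$\begin{aligned}
\sum_{t=1}^{n}\big(\ell_t(y_t,\tilde\theta_t)-\ell_t(y_t,\theta)\big) \le{}& \sum_{t=1}^{n-1}\Big(S_t(\tilde\theta_t)+\frac{\partial \ell_t(y_t,\theta)}{\partial\theta}\Big|_{\tilde\theta_t}\Big)^T(\tilde\theta_t-\tilde\theta_{t+1})\\
&+\Big(S_n(\tilde\theta_n)+\frac{\partial \ell_n(y_n,\theta)}{\partial\theta}\Big|_{\tilde\theta_n}\Big)^T(\tilde\theta_n-\theta)\\
&+\frac12\theta^T\tilde P_1^{-1}\theta-\frac12\tilde\theta_1^T\tilde P_1^{-1}\tilde\theta_1.
\end{aligned}$$

We apply another telescopic argument in order to get

$$\sum_{t=1}^{n-1}S_t(\tilde\theta_t)^T(\tilde\theta_t-\tilde\theta_{t+1}) = \sum_{t=1}^{n-1}\big(S_{t+1}(\tilde\theta_{t+1})-S_t(\tilde\theta_t)\big)^T\tilde\theta_{t+1} + S_1(\tilde\theta_1)^T\tilde\theta_1 - S_n(\tilde\theta_n)^T\tilde\theta_n.$$

As $S_1(\tilde\theta_1) = \tilde P_1^{-1}\tilde\theta_1$, we sum up our findings to achieve the regret bound

$$\begin{aligned}
\sum_{t=1}^{n}\big(\ell_t(y_t,\tilde\theta_t)-\ell_t(y_t,\theta)\big) \le{}& \sum_{t=1}^{n-1}\big(S_{t+1}(\tilde\theta_{t+1})-S_t(\tilde\theta_t)\big)^T\tilde\theta_{t+1} - S_n(\tilde\theta_n)^T\theta + \frac12\tilde\theta_1^T\tilde P_1^{-1}\tilde\theta_1 + \frac12\theta^T\tilde P_1^{-1}\theta\\
&+\sum_{t=1}^{n-1}\Big(\frac{\partial \ell_t(y_t,\theta)}{\partial\theta}\Big|_{\tilde\theta_t}\Big)^T(\tilde\theta_t-\tilde\theta_{t+1}) + \Big(\frac{\partial \ell_n(y_n,\theta)}{\partial\theta}\Big|_{\tilde\theta_n}\Big)^T(\tilde\theta_n-\theta).
\end{aligned} \tag{3}$$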

Next we use the following Lemma proved in Appendix A.

###### Lemma 2.

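For any $1 \le t \le n-1$, we have

$$\big\|S_{t+1}(\tilde\theta_{t+1})-S_t(\tilde\theta_t)\big\| \le \frac{\sqrt{d}\,D_X(1+e^{D})}{4}\,\frac{X_t^T\tilde P_{t+1}X_t}{\big(1+e^{y_t\tilde\theta_t^T X_t}\big)^2}.$$

Applying Lemma 2 to the norm of the first term in the previous regret bound (3), we get

$$\Big\|\sum_{t=1}^{n-1}\big(S_{t+1}(\tilde\theta_{t+1})-S_t(\tilde\theta_t)\big)^T\tilde\theta_{t+1}\Big\| \le \sum_{t=1}^{n-1}\big\|S_{t+1}(\tilde\theta_{t+1})-S_t(\tilde\theta_t)\big\|\,\|\tilde\theta_{t+1}\| \le \frac{\sqrt{d}\,D_X D_\theta(1+e^{D})}{4}\sum_{t=1}^{n-1}\frac{X_t^T\tilde P_{t+1}X_t}{\big(1+e^{y_t\tilde\theta_t^T X_t}\big)^2}.$$

Similarly, we estimate the second term of the regret bound (3) as

$$\big\|S_n(\tilde\theta_n)^T\theta\big\| \le \sum_{t=1}^{n-1}\big\|S_{t+1}(\tilde\theta_{t+1})-S_t(\tilde\theta_t)\big\|\,\|\theta\| \le \frac{\sqrt{d}\,D_X\|\theta\|(1+e^{D})}{4}\sum_{t=1}^{n-1}\frac{X_t^T\tilde P_{t+1}X_t}{\big(1+e^{y_t\tilde\theta_t^T X_t}\big)^2}.$$

Finally, we easily control the last two terms of (3) as we identify

$$\sum_{t=1}^{n-1}\Big(\frac{\partial \ell_t(y_t,\theta)}{\partial\theta}\Big|_{\tilde\theta_t}\Big)^T(\tilde\theta_t-\tilde\theta_{t+1}) = \sum_{t=1}^{n-1}\frac{X_t^T\tilde P_{t+1}X_t}{\big(1+e^{y_t\tilde\theta_t^T X_t}\big)^2},$$

and we use the upper bound

$$\Big\|\Big(\frac{\partial \ell_n(y_n,\theta)}{\partial\theta}\Big|_{\tilde\theta_n}\Big)^T(\tilde\theta_n-\theta)\Big\| \le \Big\|\frac{\partial \ell_n(y_n,\theta)}{\partial\theta}\Big|_{\tilde\theta_n}\Big\|\big(\|\tilde\theta_n\|+\|\theta\|\big) \le D_X(D_\theta+\|\theta\|).$$

Therefore,

$$\sum_{t=1}^{n}\big(\ell_t(y_t,\tilde\theta_t)-\ell_t(y_t,\theta)\big) \le \Big(\frac{\sqrt{d}\,D_X(D_\theta+\|\theta\|)(1+e^{D})}{4}+1\Big)\sum_{t=1}^{n-1}\frac{X_t^T\tilde P_{t+1}X_t}{\big(1+e^{y_t\tilde\theta_t^T X_t}\big)^2} + \frac12\tilde\theta_1^T\tilde P_1^{-1}\tilde\theta_1 + \frac12\theta^T\tilde P_1^{-1}\theta + D_X(D_\theta+\|\theta\|).$$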

In order to conclude, we follow ideas from Cesa-Bianchi and Lugosi (2006) (in particular Lemma 11.11) to prove in Appendix A the following proposition which yields the result of Theorem 1.

###### Proposition 3.

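For any sequence $(X_t, y_t)_{1 \le t \le n}$ we have

$$\sum_{t=1}^{n-1}\frac{X_t^T\tilde P_{t+1}X_t}{\big(1+e^{y_t\tilde\theta_t^T X_t}\big)^2} \le \frac{1+e^{D}}{2}\,d\log\big(1+(n-1)p_1 D_X^2\big).$$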
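To make the final step explicit, here is how the pieces combine, assuming the initialization $\tilde P_1 = p_1 I_d$ suggested in Section 2.1, so that $\frac12\theta^T\tilde P_1^{-1}\theta = \|\theta\|^2/(2p_1)$. The shorthand $\Sigma_n$ for the sum controlled by Proposition 3 is introduced here only for readability.

```latex
\begin{aligned}
\Sigma_n &:= \sum_{t=1}^{n-1}\frac{X_t^T\tilde P_{t+1}X_t}{\big(1+e^{y_t\tilde\theta_t^T X_t}\big)^2}
  \;\le\; \frac{1+e^{D}}{2}\,d\log\big(1+(n-1)p_1D_X^2\big),\\
\sum_{t=1}^{n}\big(\ell_t(y_t,\tilde\theta_t)-\ell_t(y_t,\theta)\big)
 &\le \Big(\frac{\sqrt d\,D_X(D_\theta+\|\theta\|)(1+e^{D})}{4}+1\Big)\Sigma_n
   +\frac{\|\tilde\theta_1\|^2+\|\theta\|^2}{2p_1}+D_X(D_\theta+\|\theta\|).
\end{aligned}
```

Substituting the first line into the second gives the bound of Theorem 1, whose leading term grows like $d\sqrt{d}\,e^{2D}\log n$.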

## 3 Extended Kalman Filter

We were not able to bound the regret of the EKF algorithm in the adversarial setting as we did not control the difference between the matrices $P_t$ and $\tilde P_t$. Thus, our EKF regret analysis holds in a restrictive well-specified stochastic setting.

### 3.1 Discussion on the assumptions

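We assume that the stochastic sequence $(X_t, y_t)_{t \ge 1}$ follows the logistic regression model: there exists $\theta_{\mathrm{true}} \in \mathbb{R}^d$ such that

$$p(y_t \mid X_t, \theta_{\mathrm{true}}) = \frac{1}{1+e^{-y_t\theta_{\mathrm{true}}^T X_t}}, \qquad t = 1, 2, \ldots. \tag{4}$$

We do not make any assumption on the dependence of the stochastic process $(X_t)$ so far. We consider the regret in terms of the expected loss conditionally on the information available at time $t$: for any random variable $Z$, we denote by $\mathbb{E}_t[Z] = \mathbb{E}[Z \mid X_1, y_1, \ldots, X_{t-1}, y_{t-1}, X_t]$ the conditional expectation (we know the past pairs $(X_s, y_s)_{1 \le s \le t-1}$ along with the explanatory variables $X_t$ at time $t$). We first observe that for any $t$, $\theta \mapsto \mathbb{E}_t[\ell_t(y_t,\theta)]$ is a convex function minimized at $\theta_{\mathrm{true}}$. Even if $(\hat\theta_t)$ is a stochastic sequence, we apply a convexity argument on the expected losses in order to bound the regret by a linear regret

$$\sum_{t=1}^{n}\big(\mathbb{E}_t[\ell_t(y_t,\hat\theta_t)]-\mathbb{E}_t[\ell_t(y_t,\theta_{\mathrm{true}})]\big) \le \sum_{t=1}^{n}\mathbb{E}_t\Big[\frac{y_t X_t^T}{1+e^{y_t\hat\theta_t^T X_t}}\Big](\theta_{\mathrm{true}}-\hat\theta_t).$$

All the regret bounds on the EKF provided hereafter actually come from identical bounds on the linear regret. The expectations appearing in the linear regret are, up to sign, the expected gradients: $\mathbb{E}_t\big[\partial \ell_t(y_t,\theta)/\partial\theta\,|_{\theta=\hat\theta_t}\big] = -\mathbb{E}_t\big[y_t X_t/(1+e^{y_t\hat\theta_t^T X_t})\big]$. We observe a key property satisfied by the logistic gradients, proved in Appendix B.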

###### Proposition 4.

For any , there exists satisfying and

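$$\mathbb{E}_{y\sim p(y\mid X,\theta_{\mathrm{true}})}\Big[\frac{y X^T(\theta_{\mathrm{true}}-\theta)}{1+e^{y\theta^T X}}\Big] = c\,\frac{(\theta_{\mathrm{true}}-\theta)^T X X^T(\theta_{\mathrm{true}}-\theta)}{\big(1+e^{\theta^T X}\big)\big(1+e^{-\theta^T X}\big)}.$$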

Such Bernstein’s type conditions yield fast rates of convergence. However, the constant in Proposition 4 is relative to the error and the fast rate holds only locally: If there exists some so that for any then an application of Corollary 7 and Theorem 8 (with and so that ) yields the following regret bound

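$$\sum_{t=\tau+1}^{n+\tau}\mathbb{E}_t\Big[\frac{y_t X_t^T}{1+e^{y_t\hat\theta_t^T X_t}}\Big](\theta_{\mathrm{true}}-\hat\theta_t) \le 30\Big(20(1+e^{D})\log\Big(\frac{1}{\delta}\Big) + \frac{1+e^{D}}{4}\,d\log\big(1+n p_1 D_X^2\big) + \frac{1}{2p_1}\|\theta_{\mathrm{true}}\|^2\Big),$$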

with probability at least , .

In order to get the global regret bound, we need two extra assumptions on the law of :

###### Assumption 1.

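There exists $m_1 > 0$ such that for any $t$,

$$\frac{m_1 I}{t} \prec \mathbb{E}\big[P_{t+1}X_t X_t^T \mid X_1, y_1, \ldots, X_{t-1}, y_{t-1}\big].$$

###### Assumption 2.

There exists $M_2 > 0$ such that for any $t$,

$$\mathbb{E}\big[X_t^T P_{t+1}^2 X_t\big] \le \frac{M_2}{t^2}.$$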

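One checks these assumptions under the invertibility of the matrix $\mathbb{E}[X_t X_t^T]$ for bounded iid $(X_t)$:

###### Proposition 5.

In the iid case, if the smallest eigenvalue $\lambda_{\min}$ of $\mathbb{E}[X_t X_t^T]$ is positive and if $\|X_t\| \le D_X$ a.s. then we have

$$\begin{aligned}
\frac{\lambda_{\min}}{t(1+D_X^2)^2} &\le \lambda_{\min}\Big(\mathbb{E}\big[P_{t+1}X_t X_t^T \mid X_1, y_1, \ldots, X_{t-1}, y_{t-1}\big]\Big),\\
\lambda_{\max}\big(\mathbb{E}[P_{t+1}^2]\big) &\le \frac{16(1+e^{D})^2}{\lambda_{\min}^2 t^2}\Big(1+\frac{1}{t^2}\,d\,e^{-3}\,\frac{(3D_X^4+D_X^2\lambda_{\min}/2)^3}{(\lambda_{\min}^2/8)^2}\Big).
\end{aligned}$$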

The results of Proposition 5 imply Assumption 1 and Assumption 2. Proposition 5 is proved in Appendix B.

### 3.2 Regret bound in expectation for the EKF

In what follows we assume that

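$$D_X \ge \max_{1 \le t \le n}\|X_t\|, \qquad D_\theta \ge \max\Big(\max_{1 \le t \le n}\|\hat\theta_t\|,\ \|\theta_{\mathrm{true}}\|\Big) \qquad \text{and} \qquad D \ge \max_{1 \le t \le n}|\hat\theta_t^T X_t| \quad \text{a.s.}$$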

It is important to note that these constants are not used in the EKF Algorithm 2, making it parameter-free.

Assume that