1 Introduction
In the online convex optimization literature (Hazan, 2016), a crucial issue is the tuning of parameters. Our aim is to develop parameter-free algorithms in the context of logistic regression. One observes a sequence of values recursively through time. At each instance, the objective is to construct a prediction of the next value from the explanatory variables at hand and the past pairs of observations. We reduce the prediction to a finite-dimensional optimization problem thanks to the logistic loss function, whose successive parameters are provided by a recursive algorithm. The aim of online convex optimization is to provide regret bounds on the cumulative losses for algorithms whose recursive update step is of constant complexity.
The logistic loss is exp-concave, a property that guarantees the existence of online procedures achieving logarithmic regret in the adversarial setting. The seminal paper of Hazan et al. (2007) proposed two such algorithms, Online Newton Step and Follow The Approximate Leader, that achieve this rate of convergence. Both methods require the knowledge of constants unknown in practice, namely the exp-concavity constant and an upper bound on the gradients of the losses. They also require a projection step onto a convex set of finite diameter. We consider these methods as localized ones because they use the strongly convex paraboloid local approximation of any exp-concave function stated in Lemma 3 of Hazan et al. (2007).
On the contrary, some recent papers (Bach and Moulines, 2013; Gadat and Panloup, 2017; Godichon-Baggioni, 2018) propose global algorithms in the stochastic setting. Bach and Moulines (2013) provide sharp regret bounds for a two-step procedure whose crucial step is the averaging of a Stochastic Gradient Descent (SGD) with a constant learning rate that has to be tuned. In Gadat and Panloup (2017) and Godichon-Baggioni (2018), the authors prove non-asymptotic regret bounds, with large constants, for the averaging of an SGD with more robust learning rates that do not need to be tuned. Our results have the same flavor for a very popular online algorithm, the Extended Kalman Filter (EKF), whose non-asymptotic properties have not yet been studied.

For linear regression, Kalman filters as originally described in Kalman and Bucy (1961) offer a Bayesian perspective. The idea is to estimate the conditional expectation of the future state and its variance, given a prior on the initial state and past observations that follow a dynamic model. The Kalman recursion is exactly the ridge regression estimator, see Diderrich (1985), so the Kalman filter achieves a logarithmic regret for quadratic losses in the adversarial setting. Note that the global strong convexity of the loss is crucial in the regret analysis of Cesa-Bianchi and Lugosi (2006). The Extended Kalman Filter (EKF) of Fahrmeir (1992) yields an online parameter-free algorithm for logistic regression. More generally, the EKF works in any misspecified Generalized Linear Model as defined in Rigollet (2012). Recently, the equivalence between Kalman filtering under constant dynamics and the Online Natural Gradient was noticed by Ollivier (2018). It is our belief that Kalman filtering offers an optimal way to choose the step size in an online gradient descent algorithm. To the best of our knowledge, regret bounds have been derived only for the batch Maximum Likelihood Estimator, also called Follow The Leader (FTL) in the online learning literature. The complexity of this batch algorithm is prohibitive; see the discussion in Hazan et al. (2007). In our paper, we view the EKF as an approximation of FTL in order to derive a regret bound.
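The identity between the Kalman recursion for a constant state and the ridge estimator can be checked numerically. The minimal Python sketch below runs the recursive least-squares form of the filter and compares it against the batch ridge solution; the prior P_0 = I / lam, the start theta_0 = 0, and the synthetic data are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 1.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Kalman recursion for a constant state with quadratic losses,
# started from theta_0 = 0 and prior covariance P_0 = I / lam.
theta = np.zeros(d)
P = np.eye(d) / lam
for x_t, y_t in zip(X, y):
    # rank-one (Sherman-Morrison) update of P = (lam I + sum_s x_s x_s^T)^{-1}
    Px = P @ x_t
    P -= np.outer(Px, Px) / (1.0 + x_t @ Px)
    # Kalman gain step: correct the estimate with the prediction error
    theta += P @ x_t * (y_t - x_t @ theta)

# The batch ridge estimator with the same regularization coincides
ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
assert np.allclose(theta, ridge)
```

The recursion maintains P as the inverse of the regularized Gram matrix, so each step costs O(d^2) while reproducing the batch estimator exactly.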
As an intermediate step, we prove a logarithmic regret bound in the logistic regression problem for a second-order algorithm between FTL and the EKF. We name it the Semi-Online Step (SOS) algorithm, as it requires O(t) computations at step t, i.e. its complexity is quadratic in the number of iterations. Despite its inefficiency, the analysis of SOS is interesting, as the non-asymptotic guarantee is valid in any adversarial setting. One can also interpret the extra computations per iteration, compared to the EKF, as the cost of estimating the local strong convexity constant of the paraboloid approximation.
The EKF is the natural online approximation of SOS. It is efficient (constant time per iteration), and we prove a logarithmic regret bound, in expectation and in the well-specified logistic regression setting only. The analysis of the regret splits into two phases. When the algorithm is close to the optimum, its regret is logarithmic with high probability. This logarithmic rate is due to the nice martingale properties of the gradients of the losses: the conditional expectation of the gradient is proportional to its quadratic variation. The logarithmic regret bound then follows from the local paraboloid approximation of Hazan et al. (2007). The other phase, when the algorithm explores the optimization space, is much harder to analyze because the local paraboloid approximation does not apply uniformly. To circumvent this issue, we appeal to more robust potential arguments as in Gadat and Panloup (2017). We obtain a logarithmic control on the number of iterations spent in the first phase, in expectation only. Whether this number of iterations can be controlled with high probability remains an open question.

The paper is organized as follows. In Section 2, we introduce the SOS algorithm and give its regret bound in Theorem 1, followed by its proof. In Theorem 6 of Section 3, we present our result in expectation for the EKF. We present the main steps of the proof of Theorem 6 in Section 4. Finally, we discuss the results and future work in Section 5.
2 Semi-Online Step algorithm
In Section 2.1, we introduce the SOS algorithm as a semi-online approximation of the batch FTL algorithm

$\hat{\theta}_t \in \arg\min_{\theta} \sum_{s=1}^{t} \ell_s(\theta)$,  (1)

where $\ell_s$ denotes the logistic loss incurred at time $s$.
We see in Section 2.2 that SOS is also very close to the EKF, but with a higher complexity. Then we prove a bound on the regret of SOS in Section 2.3.

Algorithm 1 (SOS)
Initialization: any positive definite matrix and any initial parameter.
Iteration: at each time step:
  1. Compute the matrix.
  2. Update the parameter.
2.1 Construction of the SOS algorithm
The Semi-Online Step algorithm is described in Algorithm 1. We derive it from a Taylor approximation, which transforms the first-order condition of the optimization problem (1) into an approximate recursion. Using the definition of the optimum, combined with the expressions of the derivatives of the logistic loss, we obtain a relation approximately satisfied by the sequence of optima. If the Hessian matrix were invertible, we would obtain an explicit recursive update.
This relation, approximately satisfied by the sequence of optima, motivates the introduction of the SOS algorithm as defined in Algorithm 1. The computation of the matrix relies on the Sherman-Morrison formula: if $A \in \mathbb{R}^{d \times d}$ is invertible and $u, v \in \mathbb{R}^d$ satisfy $1 + v^\top A^{-1} u \neq 0$, then

$(A + u v^\top)^{-1} = A^{-1} - \dfrac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}$.  (2)
We introduce a regularization matrix which guarantees the positive definiteness of the matrices in Algorithm 1. A good choice is, for instance, a positive multiple of the identity matrix. SOS then corresponds to this approximation.
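The Sherman-Morrison formula (2) is easy to check numerically. The sketch below compares the O(d^2) rank-one update against a direct O(d^3) inversion; the helper name and the random test matrices are illustrative.

```python
import numpy as np

def sherman_morrison_inverse(A_inv, u, v):
    """Given A^{-1}, return (A + u v^T)^{-1} via Sherman-Morrison:
    A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u),
    valid whenever the denominator is nonzero.  Cost: O(d^2)."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

# Check against direct inversion on a random positive definite matrix;
# taking v = u keeps the denominator strictly greater than 1.
rng = np.random.default_rng(1)
d = 4
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)
u = rng.normal(size=d)
updated = sherman_morrison_inverse(np.linalg.inv(A), u, u)
assert np.allclose(updated, np.linalg.inv(A + np.outer(u, u)))
```

In SOS and the EKF, the update vectors are proportional to the explanatory variables and the matrices are positive definite, so the denominator remains at least one and the formula always applies.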
2.2 Comparison with EKF
The Extended Kalman Filter was introduced by Fahrmeir (1992) for general Dynamic Generalized Linear Models. For constant dynamics, the EKF is shown to be equivalent to the Online Natural Gradient algorithm in Ollivier (2018). The EKF recursion departs from SOS in the update of the matrix. In the EKF, a rank-one matrix is added at each step so that the matrix is updated efficiently. On the contrary, the matrix in SOS is recomputed at each step because the Hessian has to be evaluated at the current estimate. Despite the similarity between the two matrices, we were not able to control their difference. Our analyses of SOS and the EKF are therefore distinct, and the regret bounds obtained are different in nature.
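A minimal Python sketch of this EKF recursion for static-state logistic regression is given below, written with the standard 0/1 convention and default initialization theta_0 = 0, P_0 = I. It is an illustration of the recursion just described under those stated assumptions, not a verbatim transcription of Algorithm 2, whose exact formulas are elided here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ekf_logistic(X, y, P0=None):
    """Extended Kalman Filter for logistic regression with a constant
    state.  Each step costs O(d^2): a rank-one Sherman-Morrison update
    of the matrix P followed by a corrected gradient step."""
    n, d = X.shape
    theta = np.zeros(d)
    P = np.eye(d) if P0 is None else P0.copy()
    for x_t, y_t in zip(X, y):
        mu = sigmoid(x_t @ theta)   # predicted probability of y_t = 1
        alpha = mu * (1.0 - mu)     # local curvature of the logistic loss
        Px = P @ x_t
        P -= alpha * np.outer(Px, Px) / (1.0 + alpha * x_t @ Px)
        theta += P @ x_t * (y_t - mu)   # step in the negative gradient direction
    return theta

# Well-specified synthetic data: the estimate should approach theta_star
rng = np.random.default_rng(2)
n, d = 2000, 3
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = (rng.random(n) < sigmoid(X @ theta_star)).astype(float)
theta_hat = ekf_logistic(X, y)
```

The matrix P plays the role of an adaptive step size: directions observed often receive smaller corrections, which is the natural-gradient interpretation of Ollivier (2018).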

Algorithm 2 (EKF)
Initialization: any positive definite matrix and any initial parameter.
Iteration: at each time step:
  1. Update the matrix with a rank-one correction.
  2. Update the parameter.
Thanks to the Sherman-Morrison formula (2), we describe the EKF in Algorithm 2 avoiding any matrix inversion. The spatial complexity of the two algorithms is due to the storage of the matrices, thus O(d^2) where d is the dimension of the parameter. In terms of running time, at each step t of the SOS algorithm we have to recompute the matrix recursively over the past observations and then update the parameter. Each recursion requires the computation of a rank-one matrix (a vector-vector product) and its addition to the running sum; its complexity is O(d^2). Thus, the complexity of step t of SOS is O(t d^2). As a comparison, the EKF updates its matrix online and therefore requires only O(d^2) operations at each step.

2.3 The regret bound for SOS and its proof
In what follows, we denote
SOS offers the advantage of being easier to analyse than the EKF. We prove a logarithmic regret bound for SOS in Theorem 1. Note that the leading constant is proportional to the inverse square of the exp-concavity constant. The localized algorithms of Hazan et al. (2007) satisfy finer regret bounds, with a leading constant proportional to the inverse of the exp-concavity constant. We believe that Theorem 1 could be improved to get a constant proportional to the inverse of the exp-concavity constant instead of its squared inverse; see the end of the proof of Lemma 2, where we use a very loose bound. To the best of our knowledge, SOS is the first parameter-free algorithm achieving a logarithmic regret bound in the adversarial logistic regression setting.
Theorem 1.
Starting from the initialization above, the SOS algorithm achieves the following regret bound.
Proof.
We first apply a telescoping sum argument. Then we use the convexity of the losses to obtain linear bounds. Applying another telescoping argument and summing up our findings, we achieve the regret bound
(3) 
Next, we use the following lemma, proved in Appendix A.
Lemma 2.
For any , we have
Applying Lemma 2 to the norm in the first term of the regret bound (3), we get
Similarly, we estimate the second term of the regret bound (3) as
Finally, we easily control the last two terms of (3) as we identify
and we use the upper bound
Therefore,
In order to conclude, we follow ideas from Cesa-Bianchi and Lugosi (2006) (in particular Lemma 11.11) to prove in Appendix A the following proposition, which yields the result of Theorem 1.
Proposition 3.
For any sequence we have
∎
3 Extended Kalman Filter
We were not able to bound the regret of the EKF algorithm in the adversarial setting, as we could not control the difference between the matrices used by SOS and the EKF. Thus, our EKF regret analysis holds in a restrictive, well-specified stochastic setting.
3.1 Discussion on the assumptions
We assume that the stochastic sequence follows the logistic regression model: there exists a true parameter such that
(4) 
We do not make any assumption on the dependence structure of the stochastic process so far. We consider the regret in terms of the expected loss conditionally on the past: for any random variable, we denote its conditional expectation given the past pairs along with the explanatory variables at the current time. We first observe that the conditional expected loss is a convex function minimized at the true parameter. Even though the observations form a stochastic sequence, we apply a convexity argument on the expected losses in order to bound the regret by a linear regret. All the regret bounds on the EKF provided hereafter actually come from identical bounds on the linear regret. We identify the expected gradients appearing in the linear regret, and we observe a key property satisfied by the logistic gradients, proved in Appendix B.
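For concreteness, the expected gradient admits a closed form in the well-specified model (4). The computation below assumes the standard 0/1 convention for the logistic loss with sigmoid link $\sigma$; since the paper's exact notation is elided above, the symbols are illustrative.

```latex
% Logistic loss with y_t in {0,1}:
% ell_t(theta) = -y_t log sigma(X_t^T theta) - (1 - y_t) log(1 - sigma(X_t^T theta))
\nabla \ell_t(\theta) = \bigl(\sigma(X_t^\top \theta) - y_t\bigr) X_t,
\qquad
\mathbb{E}\bigl[y_t \mid X_t\bigr] = \sigma(X_t^\top \theta^\star)
\;\Longrightarrow\;
\mathbb{E}\bigl[\nabla \ell_t(\theta) \mid X_t\bigr]
= \bigl(\sigma(X_t^\top \theta) - \sigma(X_t^\top \theta^\star)\bigr) X_t.
```

In particular, the expected gradient vanishes at the true parameter, consistent with the conditional expected loss being minimized there.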
Proposition 4.
For any , there exists satisfying and
Such Bernstein-type conditions yield fast rates of convergence. However, the constant in Proposition 4 is relative to the error, and the fast rate holds only locally: if the estimates remain in a neighbourhood of the optimum at every step, then an application of Corollary 7 and Theorem 8 yields the following regret bound
with probability at least , .
In order to get a global regret bound, we need two extra assumptions on the law of the observations:
Assumption 1.
There exists such that for any ,
Assumption 2.
There exists such that for any ,
One checks these assumptions, for bounded i.i.d. observations, under an invertibility condition on the second moment matrix of the explanatory variables:
Proposition 5.
In the i.i.d. case, if the invertibility condition holds and the explanatory variables are bounded a.s., then we have
3.2 Regret bound in expectation for the EKF
In what follows we assume that
It is important to note that these constants are not used by the EKF in Algorithm 2, making it parameter-free.
Theorem 6.
Assume that