## 1 Introduction

Online learning has been extensively studied over the last decades in game theory and statistics (see CBL2006, and references therein). The problem can be described as a sequential game: at each time step, a blackbox reveals a new value, and the forecaster predicts the next value based on past observations and possibly other available information. Unlike the classical statistical framework, the sequence is not assumed to be a realization of some stochastic process. Research efforts in online learning began in the framework of prediction with expert advice. In this setting, the forecaster has access to the predictions of a finite set of experts (such as deterministic physical models, or stochastic decisions). Predictions made by the forecaster and the experts are assessed with a loss function.

The goal is to build a sequence of forecasts which is nearly as good as the predictions of the best expert in the first time rounds, *i.e.*, which satisfies, uniformly over any sequence of observations, a regret bound of the form

where the remainder term should be as small as possible, and in particular sublinear in the number of rounds. When the set of experts is finite and the loss is bounded and convex in its first argument, an explicit sublinear remainder term is given by CBL2006. The optimal forecaster is then obtained by forming an exponentially weighted average of all experts' predictions. For similar results, we refer the reader to LW1994 and CFH1997.
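To fix ideas, the exponentially weighted average forecaster can be sketched in a few lines. This is a minimal illustration (constant experts, squared loss, a hand-picked learning rate `eta`), not the tuned procedure of CBL2006:

```python
import numpy as np

def ewa_weights(cum_losses, eta):
    # Weight each expert by exp(-eta * cumulative loss), normalized;
    # subtracting the min keeps the exponentials numerically stable.
    w = np.exp(-eta * (cum_losses - cum_losses.min()))
    return w / w.sum()

rng = np.random.default_rng(0)
expert_preds = np.array([0.2, 0.8])   # two constant "experts"
cum_losses = np.zeros(2)
eta = 2.0
for _ in range(200):
    w = ewa_weights(cum_losses, eta)
    forecast = float(w @ expert_preds)        # convex combination of experts
    y = 0.25 + 0.1 * rng.standard_normal()    # observations favour expert 0
    cum_losses += (expert_preds - y) ** 2     # squared loss of each expert
```

After a few rounds, the weight of the better expert dominates and the forecast tracks its prediction.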

Online learning techniques have also been applied to the regression framework. In particular, sequential ridge regression has been studied by Vov2001. We now assume that observations are real-valued and come with side information: at each time $t$, the forecaster gives a prediction of the next value $y_t$, using only the newly revealed side information $x_t \in \mathbb{R}^d$ and the past observations. Let $\langle \cdot, \cdot \rangle$ denote the scalar product in $\mathbb{R}^d$. A possible goal is to build a forecaster whose performance is nearly as good as that of the best linear forecaster, *i.e.*, such that uniformly over all sequences,

$$\sum_{t=1}^{T} \left( \widehat{y}_t - y_t \right)^2 \le \inf_{u \in \mathbb{R}^d} \sum_{t=1}^{T} \left( \langle u, x_t \rangle - y_t \right)^2 + \Delta_T, \qquad (1)$$

where the last term is a remainder term. This setting has been addressed by numerous contributions to the literature. In particular, AW2001 and Vov2001 each provide an algorithm close to ridge regression, with a remainder term growing logarithmically in the time horizon. Other contributions have investigated the Gradient Descent algorithm (CLW1996; KW1997) and Exponentiated Gradient forecasters (KW1997; CB1999). Ger2011 generalized (1) to linear combinations of a dictionary of base forecasters. In the so-called high-dimensional setting, where the size of the dictionary may be much larger than the time horizon, a sparsity regret bound with a remainder term growing logarithmically with the dictionary size and the time horizon is proved by Ger2011.

The purpose of the present work is to generalize the aforementioned framework to the clustering problem, which has attracted attention from the machine learning and streaming communities. As an example, GMM2003, BF2008 and LSS2015 study the so-called data streaming clustering problem, which amounts to clustering online data into a fixed number of groups in a single pass, or a small number of passes, while using little memory. From a machine learning perspective, CM2012 aggregate online clustering algorithms with a fixed number of centers. The present paper investigates a more general setting, since we aim to perform online clustering with an unfixed and changing number of centers. To the best of our knowledge, this is the first attempt of the sort in the literature. Let us stress that our approach only requires an upper bound on the number of centers, which can be either a constant or an increasing function of the time horizon.

Our approach strongly relies on a quasi-Bayesian methodology. The use of quasi-Bayesian estimators is especially advocated by the PAC-Bayesian theory, which originates in the machine learning community in the late 1990s in the seminal works of STW1997 and McA1998; McA1999 (see also See2002; See2003). In the statistical learning community, the PAC-Bayesian approach has been extensively developed by Cat2004; Cat2007, Aud2004b and Alq2006, and later on adapted to the high-dimensional setting by DT2007; DT2008, AL2011, AB2013, GA2013, GR2015 and AG2016. In a parallel effort, the online learning community has contributed to the PAC-Bayesian theory in the online regression setting (KW1999). Aud2009 and Ger2011 are the first attempts to merge both lines of research. Note that our approach is *quasi-Bayesian* rather than PAC-Bayesian, since we derive regret bounds (on quasi-Bayesian predictors) instead of PAC oracle inequalities.

Our main contribution is to generalize algorithms suited for supervised learning to the unsupervised setting. Our online clustering algorithm is adaptive, in the sense that it does not require the knowledge of the time horizon to be used and studied. The regret bounds that we obtain have a sublinear remainder term, and we prove that they are asymptotically minimax optimal.

The quasi-posterior which we derive is a complex distribution and direct sampling from it is not available. In Bayesian and quasi-Bayesian frameworks, the use of Markov Chain Monte Carlo (MCMC) algorithms is a popular way to compute estimates from posterior or quasi-posterior distributions; we refer to the comprehensive monograph RG2004 for an introduction to MCMC methods. For its ability to cope with transdimensional moves, we focus on the Reversible Jump MCMC (RJMCMC) algorithm from Gre1995, coupled with ideas from the Subspace Carlin and Chib algorithm proposed by DFN2002 and PD2012. MCMC procedures for quasi-Bayesian predictors were first considered by Cat2004 and dalalyan2012sparse. AB2013, GA2013 and GR2015 were the first to investigate the RJMCMC and Subspace Carlin and Chib techniques in this context, and we show in the present paper that this scheme is well suited to the clustering problem.

The paper is organised as follows. Section 2 introduces our notation and our online clustering procedure. Section 3 contains our mathematical claims, consisting of regret bounds for our online clustering algorithm. Remainder terms which are sublinear in the time horizon are obtained for a model selection-flavored prior. We also prove that these remainder terms are minimax optimal. We then discuss in Section 4 the practical implementation of our method, which relies on an adaptation of the RJMCMC algorithm to our setting. In particular, we prove its convergence towards the target quasi-posterior. The performance of the resulting algorithm, called PACBO, is evaluated on synthetic data. For the sake of clarity, proofs are postponed to Section 5. Finally, Appendix A contains an extension of our work to the case of a multivariate Student prior, along with additional numerical experiments.

## 2 A quasi-Bayesian perspective to online clustering

Let $(x_t)_{t \ge 1}$ be a sequence of data, where $x_t \in \mathbb{R}^d$. Our goal is to learn a time-dependent parameter and a partition of the observed points into a suitable number of cells, at any time $t$. To this aim, the output of our algorithm at time $t$ is a vector $\hat{c}_t$ of $K_t$ centers in $\mathbb{R}^d$, depending on the past information. A partition is then created by assigning any point in $\mathbb{R}^d$ to its closest center. When $x_t$ is newly revealed, the instantaneous loss is computed as

$$\ell(\hat{c}_t, x_t) = \min_{1 \le k \le K_t} \left\| x_t - \hat{c}_{t,k} \right\|_2^2, \qquad (2)$$
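The instantaneous loss of a codebook, *i.e.*, the squared distance from the newly revealed point to its closest center, can be sketched as follows (a minimal illustration, not the package implementation):

```python
import numpy as np

def instantaneous_loss(x, centers):
    # Squared l2 distance from the new point x to its closest center.
    return min(float(np.sum((x - c) ** 2)) for c in centers)

x = np.array([1.0, 0.0])
centers = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
loss = instantaneous_loss(x, centers)  # closest center is the origin
```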

where $\| \cdot \|_2$ is the $\ell_2$-norm in $\mathbb{R}^d$. In what follows, we investigate regret bounds for cumulative losses. Given a measurable space $E$ (equipped with its Borel $\sigma$-algebra), we let $\mathcal{P}(E)$ denote the set of probability distributions on $E$, and for some reference measure $\pi$, we let $\mathcal{P}(E, \pi)$ be the set of probability distributions absolutely continuous with respect to $\pi$. For any probability distributions $\rho, \pi \in \mathcal{P}(E)$, the Kullback-Leibler divergence $\mathcal{KL}(\rho, \pi)$ is defined as

$$\mathcal{KL}(\rho, \pi) = \begin{cases} \displaystyle\int_E \log \left( \frac{\mathrm{d}\rho}{\mathrm{d}\pi} \right) \mathrm{d}\rho & \text{if } \rho \in \mathcal{P}(E, \pi), \\ + \infty & \text{otherwise.} \end{cases}$$

Note that for any bounded measurable function $h \colon E \to \mathbb{R}$ and any probability distribution $\rho \in \mathcal{P}(E, \pi)$,

$$- \log \int_E \exp(-h) \, \mathrm{d}\pi = \inf_{\rho \in \mathcal{P}(E, \pi)} \left\{ \int_E h \, \mathrm{d}\rho + \mathcal{KL}(\rho, \pi) \right\}. \qquad (3)$$

This result, which may be found in C75 and Cat2004, is critical to our scheme of proofs. Further, the infimum is achieved at the so-called Gibbs quasi-posterior $\hat{\rho}$, defined by

$$\hat{\rho}(\mathrm{d}x) = \frac{\exp(-h(x))}{\int_E \exp(-h) \, \mathrm{d}\pi} \, \pi(\mathrm{d}x).$$
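On a finite candidate set, the Gibbs quasi-posterior reduces to weights proportional to the prior times $\exp(-h)$. A minimal sketch (the shift by the maximum log-weight is only for numerical stability):

```python
import numpy as np

def gibbs_weights(log_prior, h):
    # Quasi-posterior weights proportional to prior * exp(-h),
    # computed in the log domain for numerical stability.
    logw = log_prior - h
    logw -= logw.max()
    w = np.exp(logw)
    return w / w.sum()

log_prior = np.log(np.array([0.5, 0.3, 0.2]))
h = np.array([2.0, 0.1, 5.0])     # e.g. scaled cumulative losses
w = gibbs_weights(log_prior, h)   # candidate with smallest loss dominates
```

Sampling a candidate then amounts to `np.random.default_rng().choice(len(w), p=w)`.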

We now introduce the notation of our online clustering setting. Let $K \ge 1$ be an integer bounding the number of cells. We denote by $(q_k)_{k=1}^{K}$ a discrete probability distribution on the set $\{1, \dots, K\}$. For any $k$, let $\pi_k$ denote a probability distribution on $(\mathbb{R}^d)^k$. For any vector of cluster centers $c$, we define the prior $\pi$ as

$$\pi(\mathrm{d}c) = \sum_{k=1}^{K} q_k \, \pi_k(\mathrm{d}c) \, \mathbb{1}\left\{ c \in (\mathbb{R}^d)^k \right\}. \qquad (4)$$

Note that (4) may be seen as a distribution over the set of partitions of $\mathbb{R}^d$: any $c$ corresponds to a partition of $\mathbb{R}^d$ with at most $K$ cells. In the sequel, we identify a vector of centers with the associated partition, and $\pi$ serves as a prior over this set. Let $\lambda > 0$ be some (inverse temperature) parameter. At each time $t$, we observe $x_t$ and a random partition is sampled from the Gibbs quasi-posterior

$$\hat{\rho}_{t+1}(\mathrm{d}c) \propto \exp\left( -\lambda S_t(c) \right) \pi(\mathrm{d}c), \qquad (5)$$

where $S_t$ is a cumulative loss defined below.

This quasi-posterior distribution allows us to sample partitions with respect to the prior defined in (4), bent to fit past observations through a cumulative loss, where an additional term is introduced in order to cope with the non-convexity of the loss and is essential to make the *online variance inequality* hold true (as discussed in Aud2009, Section 4.2). This quantity consists of the cumulative loss in the first time rounds and a term that controls the variance of the next prediction. Note that since no probabilistic model is assumed on the data, no likelihood is attached to our approach, hence the terms "quasi-posterior" and "quasi-Bayesian" for our global approach. The resulting estimate is a realization of the quasi-posterior, with a random number of cells. This scheme is described in Algorithm 1. Note that this algorithm is an instantiation of Audibert's online SeqRand algorithm (Aud2009, Section 4) to the special case of the loss defined in (2). However, SeqRand does not accommodate adaptive rates, as discussed in the next section.

## 3 Minimax regret bounds

Let $\mathbb{E}$ stand for the expectation with respect to the randomized procedure (abbreviated where no confusion is possible). We start with the following pivotal result.

###### Proposition 1

For any sequence , for any prior distribution and any , the procedure described in Algorithm 1 satisfies

Proposition 1 is a straightforward consequence of Aud2009 applied to the loss function defined in (2), to our partitions, and to any prior.

### 3.1 Preliminary regret bounds

In the following, we instantiate the regret bound of Proposition 1. The discrete distribution in (4) is chosen as

(6)

With this choice, the larger the number of cells, the smaller the probability mass. Further, each distribution $\pi_k$ in (4) is chosen as a product of $k$ independent uniform distributions on $\ell_2$-balls in $\mathbb{R}^d$:

$$\pi_k(c) = \prod_{j=1}^{k} \frac{\Gamma(d/2 + 1)}{\pi^{d/2} R^d} \, \mathbb{1}\left\{ c_j \in \mathcal{B}_d(0, R) \right\}, \qquad (7)$$

where $\Gamma$ is the Gamma function and

$$\mathcal{B}_d(c_0, R) = \left\{ x \in \mathbb{R}^d \colon \| x - c_0 \|_2 \le R \right\} \qquad (8)$$

is an $\ell_2$-ball in $\mathbb{R}^d$, centered at $c_0$ with radius $R > 0$. Finally, we introduce a last piece of notation used in the statements below.
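As an aside, the uniform distribution on an $\ell_2$-ball, the building block of this prior, can be sampled exactly: draw an isotropic direction and a radius proportional to $U^{1/d}$. A minimal sketch (illustrative, not the package implementation):

```python
import numpy as np

def sample_l2_ball(center, radius, rng):
    # Uniform draw from the l2-ball: isotropic Gaussian direction,
    # radius scaled by U**(1/d) so that mass is uniform in volume.
    d = len(center)
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    r = radius * rng.uniform() ** (1.0 / d)
    return center + r * direction

rng = np.random.default_rng(0)
draws = [sample_l2_ball(np.zeros(3), 2.0, rng) for _ in range(1000)]
```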

###### Corollary 1

For any sequence and any , consider defined by (4), (6) and (7) with and . If , the procedure described in Algorithm 1 satisfies

Note that the first term is a non-increasing function of the number of cells, while the penalty term increases linearly with it. An additional term is induced by the complexity of the prior. A suitable calibration of the inverse temperature yields a sublinear remainder term in the following corollary.

###### Corollary 2

Under the previous notation with and , the procedure described in Algorithm 1 satisfies

(9)

Let us assume that the sequence is generated from a distribution with a fixed number of clusters. We then define the expected cumulative loss (ECL) as the expected sum of instantaneous losses incurred by our procedure, and the oracle cumulative loss (OCL) as the expected cumulative loss of the oracle codebook with that number of cells.

Then Corollary 2 yields

(10)

where the leading constant depends on the quantities introduced above. In (10), the regret of our randomized procedure, defined as the difference between ECL and OCL, is sublinear in the time horizon. However, when the number of clusters is allowed to grow, an additional term emerges and the bound in Corollary 2 deteriorates.

Finally, note that the dependency on the number of clusters in the right-hand side of (9) may be improved by a different choice of parameters, under an additional assumption. This allows us to achieve an optimal dependency in (9) and (10), *i.e.*,

However, this assumption may appear unrealistic in the online clustering setting, as the number of clusters may grow with the time horizon at a faster rate than it allows. The resulting dependency in (10) is the price to pay for a general framework.

### 3.2 Adaptive regret bounds

The time horizon is usually unknown in practice, prompting us to choose a time-dependent inverse temperature parameter. We thus propose a generalization of Algorithm 1, described in Algorithm 2.

This adaptive algorithm is supported by the following more involved regret bound.

###### Theorem 1

For any sequence , any prior distribution on , if is a non-increasing sequence of positive numbers, then the procedure described in Algorithm 2 satisfies

If the inverse temperature in Proposition 1 is chosen constant in time, the only difference between Proposition 1 and Theorem 1 lies in the last term of the regret bound. This term is larger in the adaptive setting than in the simpler non-adaptive setting, since the sequence of inverse temperatures is non-increasing. In other words, this is the price to pay for the adaptivity of our algorithm. However, a suitable choice of the sequence allows, again, for a refined result.

###### Corollary 3

For any deterministic sequence , if and in (4) are taken respectively as in (6) and (7) with and , if for any and , then the procedure described in Algorithm 2 satisfies

Therefore, the price to pay for not knowing the time horizon (a much more realistic assumption in online learning) is a multiplicative factor in front of the leading term. This does not degrade the rate of convergence.

### 3.3 Minimax regret

This section is devoted to the study of the minimax optimality of our approach. The rate in the regret bound of Corollary 3 is not a surprising result: many online learning problems give rise to similar bounds, depending also on the properties of the loss function. However, in the online clustering setting, it is legitimate to wonder whether the upper bound is tight, and more generally whether other algorithms can achieve smaller regret. The sequel answers both questions in a minimax sense.

Let us first denote by the number of cells for a partition . We also introduce the following assumption.

Assumption : Let and . There exists an index such that , where

and .

Note that several partitions may achieve this infimum. In that case, we adopt the convention that the oracle partition is any such partition with the smallest number of cells. The assumption means that the sequence can be well summarized by a moderate number of cells, since the infimum is reached for the corresponding partition. We introduce the set

For Algorithm 2, we have from Corollary 3 that

Then for any , , our goal is to obtain a lower bound of the form

The first infimum is taken over all distributions whose support is , where is defined in (8). Next, we obtain

(11)

where the observations are i.i.d. with a common distribution and the expectation is taken with respect to their joint distribution. Unfortunately, in (11), since the infimum is taken over any distribution, there is no restriction on the number of cells of each partition. The left-hand side of (11) could then be arbitrarily small, or even negative, and the lower bound would not match the upper bound of Corollary 3. To handle this, we introduce a penalty term in the loss function which accounts for the number of cells of each partition. The upcoming theorem provides minimax results for an augmented value defined as

(12)

In (12), we add a term which penalizes the number of cells of each partition. To capture the asymptotic behavior of , we derive an upper bound for the penalized loss in (12). This is done in the following theorem which combines an upper and lower bound for the regret, hence proving that it is minimax optimal.

###### Theorem 2

Let , such that

(13)

where represents the largest integer that is smaller than . If satisfies , then

(14)

The lower bound is asymptotically of the same order as the upper bound. Note that BLL1998 obtained a less satisfying rate, which however holds with no restriction on the number of cells retained in the partition, whereas our claim has to comply with (13). This is the price to pay for our additional factor. Note however that this price is mild, as the constraint (13) no longer upper bounds the number of cells whenever the dimension or the time horizon grows to infinity, casting our procedure onto the online setting where the time horizon is not assumed finite and the number of clusters may evolve over time.

As a conclusion to the theoretical part of the manuscript, let us summarize our results. Regret bounds for Algorithm 1 are produced for our specific choice of prior (Corollary 1) and with an involved choice of (Corollary 2). For the adaptive version Algorithm 2, the pivotal result is Theorem 1, which is instantiated for our prior in Corollary 3. Finally, the lower bound is stated in Theorem 2, proving that our regret bounds are minimax whenever the number of cells retained in the partition satisfies (13). We now move to the implementation of our approach.

## 4 The Pacbo algorithm

Since direct sampling from the Gibbs quasi-posterior is usually not possible, we focus in this section on a stochastic approximation, called PACBO (available in the companion eponymous R package, see R-PACBO). Both the implementation and the convergence (towards the Gibbs quasi-posterior) of this scheme are discussed. This section also includes a short numerical experiment on synthetic data to illustrate the potential of PACBO compared to other popular clustering methods.

### 4.1 Structure and links with RJMCMC

In Algorithm 1 and Algorithm 2, it is required to sample at each time step from the Gibbs quasi-posterior. Since the quasi-posterior is defined on a massive and complex-structured space (recall that it is a union of heterogeneous spaces), direct sampling is not an option and is rather an algorithmic challenge. Our approach consists in approximating the quasi-posterior through MCMC, under the constraint of favouring local moves of the Markov chain. To do so, we resort to Reversible Jump MCMC (Gre1995), adapted with ideas from the Subspace Carlin and Chib algorithm proposed by DFN2002 and PD2012. Since sampling is similar at every time step, the time index is omitted in what follows for the sake of brevity.

Consider the states of the Markov chain of interest. At each RJMCMC iteration, only local moves are possible from the current state to a proposal state, in the sense that the proposal should differ from the current state by at most one component. Hence, the current and proposal states may lie in spaces of different dimensions. Two auxiliary vectors are then needed to compensate for this dimensional difference, *i.e.*, satisfying the dimension matching condition introduced by Gre1995, such that the augmented pairs are of matching dimension. This condition is a preliminary to the detailed balance condition that ensures that the Gibbs quasi-posterior is the invariant distribution of the Markov chain. The structure of PACBO is presented in Figure 1.

Let denote the multivariate Student distribution on

(15)

where the first factor is a normalizing constant. Let us now detail the proposal mechanism. First, a local move is proposed by selecting a neighbouring dimension with some probability. Next, we sample an auxiliary vector from the multivariate Student distribution in (15). Finally, the proposal pair is obtained by applying a one-to-one mapping with continuous first-order derivatives. The resulting RJMCMC acceptance probability is

since the determinant of the Jacobian matrix of the mapping equals one. The resulting PACBO algorithm is described in Algorithm 3.
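To fix ideas, here is a heavily simplified, hypothetical birth/death Metropolis sketch on codebooks. It illustrates between-model moves (changing the number of centers) versus within-model moves (perturbing the centers) against a toy log quasi-posterior, but it omits the Student proposals, the auxiliary vectors and the Jacobian correction of the actual PACBO algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(c):
    # Toy log quasi-posterior on a (k, d) codebook: rewards centers
    # near the origin and penalizes a large number of cells.
    return -float(np.sum(c ** 2)) - 0.5 * c.shape[0]

def rjmcmc_step(c, k_max=5, scale=0.3):
    # One move: "birth"/"death" change the number of cells (between-model),
    # anything else perturbs all centers (within-model).
    move = rng.choice(["within", "birth", "death"])
    if move == "birth" and c.shape[0] < k_max:
        prop = np.vstack([c, rng.standard_normal((1, c.shape[1]))])
    elif move == "death" and c.shape[0] > 1:
        prop = np.delete(c, rng.integers(c.shape[0]), axis=0)
    else:
        prop = c + scale * rng.standard_normal(c.shape)
    # Metropolis accept/reject (proposal asymmetries ignored in this toy).
    if np.log(rng.uniform()) < log_target(prop) - log_target(c):
        return prop
    return c

chain = [np.zeros((2, 2))]
for _ in range(500):
    chain.append(rjmcmc_step(chain[-1]))
```

The chain visits codebooks of varying sizes while keeping the ambient dimension of each center fixed, which is the transdimensional behavior RJMCMC is designed for.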

### 4.2 Convergence of Pacbo towards the Gibbs quasi-posterior

We prove that Algorithm 3 builds a Markov chain whose invariant distribution is precisely the Gibbs quasi-posterior as the number of iterations goes to infinity. To do so, we need to prove that the chain is $\varphi$-irreducible, aperiodic and Harris recurrent; see RG2004 and RR2006.

Recall that at each RJMCMC iteration in Algorithm 3, the chain is said to propose a "between-model move" if the proposal changes the number of cells, and a "within-model move" otherwise. The following result gives a sufficient condition for the chain to be Harris recurrent.

###### Lemma 1

Let be the event that no "within-model move" is ever accepted and be the support of . Then the chain generated by Algorithm 3 satisfies

for any and .

Lemma 1 states that the chain must eventually accept a "within-model move". It remains true for other proposal choices in Algorithm 3, provided that the stationarity of the Gibbs quasi-posterior is preserved.

###### Theorem 3

Consider the support of the Gibbs quasi-posterior. Then the chain generated by Algorithm 3 is $\varphi$-irreducible, aperiodic and Harris recurrent.

Theorem 3 legitimates our approximation PACBO to perform online clustering, since it asymptotically mimics the behavior of the computationally unavailable Gibbs quasi-posterior. To the best of our knowledge, this kind of guarantee is original in the PAC-Bayesian literature.

Finally, let us stress that obtaining an explicit rate of convergence is beyond the scope of the present work. However, in most cases the chain converges rather quickly in practice, as illustrated by Figure 2. At each time step, we advocate initializing the chain at the output of the previous round, as a warm start.

### 4.3 Numerical study

This section is devoted to the illustration of the potential of our quasi-Bayesian approach on synthetic data. Let us stress that all experiments are reproducible, thanks to the PACBO R package (R-PACBO). We do not claim to be exhaustive here but rather show the (good) behavior of our implementation on a toy example.

#### 4.3.1 Calibration of parameters and mixing properties

We set the radius to be the maximum $\ell_2$-norm of the observations. Note that too small a value yields acceptance ratios close to zero and degrades the mixing of the chain. The inverse temperature is set as advised by the theory: recall that large values enforce the quasi-posterior to account more for past data, whereas small values keep the quasi-posterior close to the prior. We illustrate in Figure 2 the mixing behavior of PACBO. The convergence occurs quickly, and the default length of the RJMCMC runs is set to 500 in the PACBO package: this was a ceiling value in all our simulations.