# Quasi-Bayes properties of a recursive procedure for mixtures

Bayesian methods are attractive and often optimal, yet the current pressure for fast computation, especially with streaming data and online learning, has renewed interest in faster, although possibly sub-optimal, solutions. To what extent these algorithms approximate a Bayesian solution is a problem of interest that is not always solved. Against this background, we revisit a sequential procedure proposed by Smith and Makov (1978) for unsupervised learning and classification in finite mixtures, and developed by M. Newton and Zhang (1999) for nonparametric mixtures. Newton's algorithm is simple and fast, and theoretically intriguing. Although originally proposed as an approximation of the Bayesian solution, its quasi-Bayes properties remain unclear. We propose a novel methodological approach. We regard the algorithm as a probabilistic learning rule that implicitly defines an underlying probabilistic model, and we find this model. We can then prove that it is, asymptotically, a Bayesian, exchangeable mixture model. Moreover, while the algorithm only offers a point estimate, our approach allows us to obtain an asymptotic posterior distribution and asymptotic credible intervals for the mixing distribution. Our results also provide practical hints for tuning the algorithm and obtaining desirable properties, as we illustrate in a simulation study. Beyond mixture models, our study suggests a theoretical framework that may be of interest for recursive quasi-Bayes methods in other settings.


## 1 Introduction

Bayesian methods have always been attractive for their internal coherence, their rigorous way of quantifying uncertainty through probability, and their optimal properties in many problems. Analytic difficulties have been overcome by efficient computational methods, and Bayesian procedures are nowadays widely and successfully used in many fields. However, fast computation remains a challenge that hampers an even wider application of Bayesian methods among practitioners, the more so with streaming data and online learning, where inference and prediction have to be continuously updated as new data become available. In the modern trade-off between information and computational efficiency, slightly misspecified but computationally more tractable methods receive renewed interest as a reasonable compromise. Popular algorithms, such as Approximate Bayesian Computation (ABC) and variational Bayes, arise as approximations of an optimal Bayesian solution. Indeed, one could expect that a method which performs well is at least approximately Bayes. For a Bayesian statistician, the capacity of a learning scheme to be, at least approximately, a Bayesian learning scheme should be a minimal requirement for its validation.

This is the underlying theme of this paper, which we study and illustrate in a specific, important case, namely sequential learning in mixture models. We revisit a recursive algorithm initially proposed by Smith and Makov (1978) for unsupervised sequential learning and classification in finite mixtures, and extended by M. Newton and collaborators (Newton et al. (1998), Newton and Zhang (1999), Quintana and Newton (2000), Newton (2002)) to provide a fast recursive approximation of the computationally expensive Bayesian estimate of the mixing density in nonparametric mixture models. Recent interesting developments are in Hahn et al. (2018). Convergence results have validated these algorithms as consistent frequentist estimators (Newton and Zhang (1999), Martin and Ghosh (2008), Ghosh and Tokdar (2006), Tokdar et al. (2009)). However, to what extent they provide an approximation of a Bayesian procedure, as in the original motivation, is not fully understood. In this paper we address this question, shedding light on their quasi-Bayes properties. Beyond mixture models, we believe that our results may offer a methodological viewpoint of interest for quasi-Bayes recursive computations in other settings.

Let us start by recalling the motivating problem in Smith and Makov (1978): sequential unsupervised learning and classification by mixtures. One wants to recursively classify observations $X_1, X_2, \ldots$ into one of $k$ populations (e.g., pattern types, or signal sources, etc.), with no feedback about correctness of previous classifications. A finite mixture model for this task assumes

$$X_n \mid \pi \overset{i.i.d.}{\sim} \sum_{j=1}^{k} \pi_j f_j(x). \tag{1}$$

Here the mixture components are known (extensive studies may be available on the specific components), but the mixing proportions are unknown. The classical Bayesian solution assigns a Dirichlet prior distribution on the unknown proportions, , and proceeds by Bayes rule. Learning is solved through the posterior distribution and classification through the predictive probabilities that , given . Unfortunately, computations, especially sequential computations, are involved.

The above model is a special case, with a discrete mixing distribution having atoms and unknown masses , of a general mixture model

$$X_i \mid \tilde G \overset{i.i.d.}{\sim} f_{\tilde G}(x) = \int f(x \mid \theta)\, d\tilde G(\theta). \tag{2}$$

Here, a problem of interest is recursive estimation of the mixing distribution . The mixture model (2) can be equivalently expressed in terms of a latent exchangeable sequence ), such that the are conditionally independent, given , with , and the are a random sample from , that is, . Then, the Bayesian estimate of with respect to quadratic loss, , coincides with the predictive distribution of given the data. Thus, the problem can be rephrased as recursive prediction.

In a Bayesian nonparametric approach, a prior with large support is assigned on the random mixing distribution ; typically, a Dirichlet process (DP), with parameters and , . DP mixtures have expanded in a flurry of applications in many areas, and although many extensions have been developed, they remain a basic reference in Bayesian nonparametrics. Computational difficulties can be addressed by MCMC methods, or by variational Bayes or ABC approximations. If the observations arrive sequentially, one may resort to sequential Monte Carlo methods, sequential importance sampling (MacEachern et al. (1999)), more recent sequential variational Bayes methods (Lin (2013), Broderick et al. (2013)), or combinations of them (Naesseth et al. (2018)). Still, these methods have a computational cost (for example, in the optimization steps), or only have a heuristic derivation; simple and fast recursive algorithms remain attractive.

Motivated by computations in DP mixture models, M. Newton and collaborators (Newton and Zhang (1999); Newton et al. (1998), Quintana and Newton (2000), Newton (2002)) proposed a simple recursive estimate for , that starts at an initial guess and for recursively computes the estimate

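$$G_n(A) = (1-\alpha_n)\, G_{n-1}(A) + \alpha_n\, \frac{\int_A f(x_n \mid \theta_n)\, dG_{n-1}(\theta_n)}{\int_\Theta f(x_n \mid \theta_n)\, dG_{n-1}(\theta_n)}, \tag{3}$$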

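where $(\alpha_n)$ is a sequence of real numbers in $(0,1)$ and it is usually assumed that $\alpha_n \to 0$ as $n \to \infty$, with $\sum_n \alpha_n = \infty$ and $\sum_n \alpha_n^2 < \infty$. A standard choice, in analogy with DP mixtures, is $\alpha_n = 1/(\alpha+n)$ with $\alpha > 0$. For finite mixtures, as in (1), the rule (3) corresponds to the sequential procedure of Smith and Makov (1978).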
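For the finite-mixture case, rule (3) reduces to a convex combination of the current proportions and the classification probabilities of the new observation. The following is a minimal sketch, not the authors' code: the Gaussian components, the helper names `normal_pdf` and `recursive_proportions`, and the weights $\alpha_n = 1/(\alpha+n)$ are all assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return np.exp(-0.5 * z * z) / (sd * np.sqrt(2.0 * np.pi))

def recursive_proportions(xs, means, pi0, alpha=1.0):
    """Rule (3) for a discrete mixing distribution with known atoms `means`.

    pi0 is the initial guess for the proportions. The weights
    alpha_n = 1 / (alpha + n) are an assumed choice for this sketch.
    """
    pi = np.asarray(pi0, dtype=float).copy()
    for n, x in enumerate(xs, start=1):
        a_n = 1.0 / (alpha + n)
        lik = normal_pdf(x, means)           # f_j(x_n) for each component j
        post = pi * lik / np.sum(pi * lik)   # classification probabilities
        pi = (1.0 - a_n) * pi + a_n * post   # convex combination, as in (3)
    return pi

# Hypothetical data: three known Gaussian components, unknown proportions.
rng = np.random.default_rng(0)
means = np.array([-2.0, 0.0, 3.0])
true_pi = np.array([0.5, 0.2, 0.3])
labels = rng.choice(len(means), size=2000, p=true_pi)
xs = rng.normal(means[labels], 1.0)
pi_hat = recursive_proportions(xs, means, pi0=np.full(3, 1.0 / 3.0))
print(pi_hat)
```

Each update touches every component once, so a pass over the data costs one likelihood evaluation per component per observation, and no past observation is revisited.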

Newton et al. (1998) propose the recursive rule (3) in applications to interval censored data and mixtures of Markov chains, further developed and extended, together with theoretical properties, by Newton and Zhang (1999). Theoretical properties have been studied from a frequentist viewpoint: that is, regarding $G_n$ as an estimator of the mixing distribution, under the assumption that the data are independent and identically distributed (i.i.d.) according to a true (identifiable) mixture model . Smith and Makov (1978) prove frequentist consistency of their recursive estimator, for finite mixtures, using stochastic approximation techniques. Martin and Ghosh (2008) shed light on the connection with stochastic approximation, thus relating frequentist consistency of the algorithm to the convergence properties of stochastic approximation sequences. They prove consistency of $G_n$ for a discrete mixing distribution with known atoms, and extend to the case of mixture kernels with an unknown common parameter. Ghosh and Tokdar (2006) and Tokdar et al. (2009) prove frequentist weak consistency of the estimator (3), under conditions on the mixture kernels, and give results on convergence in probability for a permutation-invariant version of it.

Again, these results regard Newton’s algorithm (3) as a frequentist estimator. Its theoretical properties as an approximation of the computationally expensive Bayesian solution remain largely unresolved. One could argue that, when consistent, Newton’s estimator will asymptotically agree, almost surely with respect to the law of i.i.d. observations from a true (-a.s.), with a consistent Bayesian estimator for . But it will also agree with any other consistent estimator for , which flattens their different nature. These results are important, and Newton’s recursive estimator has the advantage of being computationally faster than other consistent estimators for the mixing distribution; but its Bayesian motivation is lost.

We take another, we believe more properly Bayesian, approach. Our starting point is that, when using (3), a researcher is changing the learning rule on ; therefore, implicitly, using a probabilistic model that is different from the Bayesian exchangeable model (2). What is this model? Is it quasi-Bayes? This remark potentially applies to other approximation algorithms which, as well, more or less implicitly, use a probabilistic model different from the stated Bayesian one.

To address these questions, we need to formalize a notion of a quasi-Bayes procedure. The term quasi-Bayes is used with many meanings in the literature (see e.g. Li et al. (2018)). We borrow this term from Smith and Makov (1978), and formalize its meaning as follows. Details are given in Section 3. Mixture models specify an exchangeable probability law for the infinite sequence . A quasi-Bayes learning rule for mixtures should preserve this invariance property, at least asymptotically. This holds more generally, beyond mixture models. The first requirement we ask of a probabilistic learning rule, for it to be a quasi-Bayes approximation of a Bayesian exchangeable learning rule, is asymptotic exchangeability. Informally, the observations’ labels do not matter if we look at far enough pieces of the sequence . Yet, asymptotic exchangeability is only a minimal property. A refinement refers to a specific exchangeable law. Let be an exchangeable probability law for . We say that is a quasi-Bayes approximation of if it is asymptotically exchangeable, and the exchangeable limit sequence has probability law .

On this basis, we address the following questions:

1. If a researcher uses the recursive procedure (3) as a probabilistic learning rule, that is, she uses (3) as the predictive distribution for the latent parameter , given , what probabilistic model is she implicitly assuming for the observable ? Is it an approximation, at least asymptotically, of a Bayesian, exchangeable, mixture model?

We prove that the probabilistic model underlying the recursive rule (3) is indeed asymptotically exchangeable. More precisely, it implies that the sequence is conditionally identically distributed (c.i.d.) (Kallenberg (1988), Berti et al. (2004)); roughly speaking, for any , future observations , are identically distributed, given . For stationary sequences, the c.i.d. property is equivalent to exchangeability; in general, a c.i.d. sequence is not exchangeable, but it is asymptotically so. This first result says that a researcher using (3) as the predictive rule is implicitly assuming some form of non-stationarity in the data that tends to vanish in the long run. A c.i.d. model could actually be the appropriate model in situations where exchangeability is broken by competition, selection or other forms of non-stationarity, but the system converges to a stationary, exchangeable steady state. If, instead, it is used as an approximation of an honest exchangeable model, it guarantees the minimal property of being asymptotically exchangeable, that is, convergence in distribution to an exchangeable sequence, say . See Section 3.

Asymptotic exchangeability alone does not explicitly provide the statistical model of the limit sequence ; which is also, by the above result, the asymptotic statistical model for the c.i.d. sequence . We thus refine the result by finding such a model. Namely, we prove that there exists a random distribution , which is the almost sure weak limit of the sequence , such that the are conditionally i.i.d. given , according to a distribution with density of the form (2). Therefore, so are the , asymptotically; roughly speaking, for large, where means approximately i.i.d. In this sense, Newton’s recursive learning rule arises from a quasi-Bayes mixture model.

These results shed light on an open question by Martin and Ghosh (2008). Although Newton’s recursive rule is motivated by computations in DP mixture models, they show two examples where the Bayesian estimate of with a DP prior and the recursive estimate have different performance. Thus, they pose an open question: If Newton’s recursive algorithm is not an approximation of the DP prior Bayes estimate, for what prior does the recursive estimate approximate the corresponding Bayes estimate? Our results explain the reason why the random distribution may have a probability law far apart from a DP. The recursive rule (3) underlies a sequence which is not exchangeable; thus, there is no random mixing distribution such that (one might rather think, in a state-space fashion, of a sequence of random distributions , such that ; see Section 3). However, we show that such exists asymptotically, and for large. The probability law of (“the prior”) is implicitly determined by the c.i.d. sequence , through its so-called directing random measure ; and, if the mixture is identifiable, it is unique. (Notice that we are denoting with the same symbol a distribution function (d.f.) and the corresponding measure). Results on the explicit law of the directing random measure of c.i.d. sequences are known only in limited, simple cases. We do obtain that the directing random measure for is a mixture of the form , but we do not have explicit results on the distribution of . However, we can prove that, under fairly mild conditions, the random distribution is absolutely continuous, a.s.; thus, its law is rather a smoothing of a Dirichlet process, the latter a.s. selecting discrete distributions. In fact, Newton’s algorithm was originally given in terms of densities, assuming that the unknown mixing distribution is absolutely continuous, with density . 
Notice that, if is absolutely continuous with respect to a sigma-finite measure on (denoted ), then also , a.s., for every , and its density satisfies the recursive equations

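$$g_n(\theta) = (1-\alpha_n)\, g_{n-1}(\theta) + \alpha_n\, \frac{g_{n-1}(\theta)\, f(x_n \mid \theta)}{\int f(x_n \mid \theta')\, g_{n-1}(\theta')\, d\lambda(\theta')}. \tag{4}$$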

We have given Newton’s rule (3) for the distributions, as they are better defined objects when we read the algorithm as a probabilistic learning rule.
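The density recursion (4) is straightforward to evaluate on a grid. The sketch below is illustrative only: the Gaussian location kernel, the grid, the uniform initial guess, the weights $\alpha_n = 1/(\alpha+n)$, and the helper names `trapezoid` and `newton_density` are assumptions, and the normalizing integral is approximated by a trapezoid rule.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoid-rule approximation of the integral of y over the grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def newton_density(xs, grid, g0, alpha=1.0, sd=1.0):
    """Recursion (4) on a fixed grid, with an assumed N(theta, sd^2) kernel."""
    g = np.asarray(g0, dtype=float).copy()
    for n, x in enumerate(xs, start=1):
        a_n = 1.0 / (alpha + n)                       # assumed weights
        lik = np.exp(-0.5 * ((x - grid) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
        num = g * lik                                 # g_{n-1}(theta) f(x_n | theta)
        denom = trapezoid(num, grid)                  # normalizing integral
        g = (1.0 - a_n) * g + a_n * num / denom
    return g

# Hypothetical data: location mixture with a two-point mixing distribution.
rng = np.random.default_rng(1)
theta = rng.choice([-2.0, 2.0], size=500)
xs = rng.normal(theta, 1.0)

grid = np.linspace(-6.0, 6.0, 241)
g0 = np.ones_like(grid)
g0 /= trapezoid(g0, grid)                             # uniform initial guess
g_hat = newton_density(xs, grid, g0)
print(trapezoid(g_hat, grid))
```

Because the trapezoid rule is linear, each update preserves the grid-approximated total mass of the initial guess, so no renormalization is needed along the way.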

2. The second main question we address is as follows. As an algorithm, Newton’s recursive rule (3) only gives a point estimate of the mixing distribution . Can one provide a more complete description of the uncertainty? A Bayesian approach would fully describe the uncertainty through the posterior distribution. Can one enrich Newton’s algorithm, by providing a posterior distribution on ?

Our key to addressing this question is, again, to regard Newton’s rule (3) as a probabilistic learning rule. We can then formally prove that , thus, it is, indeed, the point estimate of under quadratic loss; moreover, although the prior distribution of is only implicitly defined, we can approximate the posterior distribution of that results from such an implicit prior. More specifically, we provide convergence rates and an asymptotic Gaussian approximation of the finite dimensional conditional distribution , for any fixed measurable sets , . Thus, one not only has a quasi-Bayes point estimate, but may also provide asymptotic credible regions.

These results illuminate the probabilistic model implied by Newton et al.’s recursive rule. Interestingly, this understanding also gives new insights into the role of the weights and of other main parameters of the model. As we illustrate in Section 5, it provides practical advice about tuning the algorithm to obtain desirable properties; notably, for attenuating the effect of the lack of exchangeability on the estimates, a problem usually addressed, with higher computational cost, by taking averages over permutations of the sample.

Section 2 recalls the basic structure of DP mixture models, and introduces our viewpoint on Newton et al.’s recursive algorithm as a probabilistic learning rule. We prove its quasi-Bayes properties in Section 3. Section 4 provides rates and asymptotic results, together with asymptotic credible intervals. A simulation study for location mixtures of Gaussian distributions, in Section 5, illustrates how the undesirable sensitivity of the estimates to the ordering of the observations can be attenuated by tuning the weights and the main parameters of the model. All the proofs are collected in Section 6. Directions for developments are finally discussed.
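The dependence of the recursive estimate on the order of the observations, and the permutation-averaging remedy mentioned above, can be illustrated on a toy case. This is a hedged sketch under assumed settings (two fixed atoms, a unit-variance Gaussian kernel, weights $\alpha_n = 1/(\alpha+n)$); the helper names `recursive_weights` and `permutation_average` are hypothetical.

```python
import numpy as np

def recursive_weights(xs, atoms, w0, alpha=1.0):
    """Rule (3) for a discrete mixing distribution on fixed atoms.

    Assumes a N(atom, 1) kernel and weights alpha_n = 1 / (alpha + n).
    """
    w = np.asarray(w0, dtype=float).copy()
    for n, x in enumerate(xs, start=1):
        a_n = 1.0 / (alpha + n)
        lik = np.exp(-0.5 * (x - atoms) ** 2)  # kernel up to a constant factor
        post = w * lik / np.sum(w * lik)
        w = (1.0 - a_n) * w + a_n * post
    return w

def permutation_average(xs, atoms, w0, n_perm=50, seed=0):
    """Average the recursive estimate over random orderings of the sample."""
    rng = np.random.default_rng(seed)
    runs = [recursive_weights(rng.permutation(xs), atoms, w0)
            for _ in range(n_perm)]
    return np.mean(runs, axis=0)

atoms = np.array([-1.0, 1.0])
w0 = np.array([0.5, 0.5])
xs = np.array([2.0, -2.0])

forward = recursive_weights(xs, atoms, w0)         # data in the given order
backward = recursive_weights(xs[::-1], atoms, w0)  # same data, reversed
averaged = permutation_average(xs, atoms, w0)
print(forward, backward, averaged)
```

The two orderings give different estimates because later observations receive smaller weights; averaging over orderings removes this asymmetry at the price of `n_perm` passes over the data.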

## 2 Dirichlet process mixtures and a new look at Newton’s sequential procedure

We first recall the basic structure of Bayesian inference for DP mixture models, in order to motivate in more detail the recursive rule (3), and to introduce some further notation.

Again, the DP mixture model has a hierarchical formulation in terms of a latent exchangeable sequence

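$$
\begin{aligned}
X_i \mid \theta_i &\overset{indep}{\sim} f(x \mid \theta_i) \\
\theta_i \mid \tilde G &\overset{i.i.d.}{\sim} \tilde G, \qquad \tilde G \sim DP(\alpha G_0),
\end{aligned}
\tag{5}
$$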

where (5) is a short notation for , for every , and is a density with respect to a sigma-finite measure on the sample space . Integrating the out, one has the mixture model (2), with a DP prior on . We denote by the probability law on the process so defined. We assume that the and the are, respectively, random variables with values in and , equipped with their Borel sigma-fields and ; but the results hold more generally, for Polish spaces. Throughout the paper, we refer to conditional distributions as regular versions. We use the short notation , and for . Unless explicitly stated, convergence of distributions is in the topology of weak convergence. Weak convergence of to is denoted by .

Inference on in a DP mixture model moves from the conditional distribution

$$\tilde G \mid \theta_{1:n}, x_{1:n} \sim DP\Big(\alpha G_0 + \sum_{i=1}^{n} \delta_{\theta_i}\Big),$$

to get the posterior distribution as a mixture of DPs

$$\tilde G \mid x_{1:n} \sim \int DP\Big(\alpha G_0 + \sum_{i=1}^{n} \delta_{\theta_i}\Big)\, d\tilde P(\theta_{1:n} \mid x_{1:n}). \tag{6}$$

The Bayesian point estimate of , with respect to quadratic loss, is the conditional expectation of , and coincides with the predictive distribution of , given . By the Pólya urn structure characterizing the Dirichlet process

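$$\tilde P(\theta_{n+1} \in \cdot \mid \theta_{1:n}, x_{1:n}) = \frac{\alpha G_0(\cdot) + \sum_{i=1}^{n} \delta_{\theta_i}(\cdot)}{\alpha + n}, \tag{7}$$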

therefore

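$$
\begin{aligned}
\tilde G^{(\mathrm{Bayes})}_n(\cdot) &= E(\tilde G(\cdot) \mid x_{1:n}) = \tilde P(\theta_{n+1} \in \cdot \mid x_{1:n}) \\
&= \frac{\alpha G_0(\cdot) + \sum_{i=1}^{n} \tilde P(\theta_i \in \cdot \mid x_{1:n})}{\alpha + n} \\
&= \frac{\alpha + n - 1}{\alpha + n}\, \frac{\alpha G_0(\cdot) + \sum_{i=1}^{n-1} \tilde P(\theta_i \in \cdot \mid x_{1:n})}{\alpha + n - 1} + \frac{1}{\alpha + n}\, \tilde P_{\tilde G^{(\mathrm{Bayes})}_{n-1}}(\theta_n \in \cdot \mid x_n),
\end{aligned}
\tag{8}
$$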

where

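$$\tilde P_{\tilde G^{(\mathrm{Bayes})}_{n-1}}(\theta_n \in A \mid x_n) = \frac{\int_A f(x_n \mid \theta)\, d\tilde G^{(\mathrm{Bayes})}_{n-1}(\theta)}{\int_\Theta f(x_n \mid \theta)\, d\tilde G^{(\mathrm{Bayes})}_{n-1}(\theta)}. \tag{9}$$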

In the Bayesian estimate, as a new observation becomes available, the information on all the past , is updated. This efficiently exploits the sample information, but is computationally expensive. Instead, Newton’s algorithm (3) does not update the estimate , and only enters in inference on , in an empirical Bayes flavor. The two estimates coincide only for . However, even for , Newton’s rule makes a simplification of the posterior distribution of , replacing the mixture of Dirichlet processes , as from (6), with a DP (. For , Newton’s estimate loses in efficiency, not fully exploiting the sample information, but, on the other hand, is very fast; if one evaluates (3) on a grid of points and calculates the integral that appears at the denominator using, say, a trapezoid rule, then the computational complexity is .
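For reference, the Pólya urn predictive rule (7) that drives the exact DP computations can be simulated directly for the latent sequence. The sketch below is only an illustration of (7): the standard normal base measure and the helper name `polya_urn` are assumptions, not part of the source.

```python
import numpy as np

def polya_urn(n, alpha, base_sampler, rng):
    """Draw theta_1, ..., theta_n from the Polya urn scheme in (7).

    With i past draws, the next value is a fresh draw from G0 with
    probability alpha / (alpha + i), and otherwise a uniformly chosen
    copy of one of the past values.
    """
    thetas = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            thetas.append(base_sampler(rng))        # new value from G0
        else:
            thetas.append(thetas[rng.integers(i)])  # repeat a past value
    return np.array(thetas)

# Assumed base measure G0 = N(0, 1), for illustration only.
rng = np.random.default_rng(0)
thetas = polya_urn(500, alpha=2.0, base_sampler=lambda r: r.normal(), rng=rng)
print(len(np.unique(thetas)), "distinct values among", len(thetas))
```

The ties among the simulated values reflect the discreteness of Dirichlet process draws, in contrast with the grid-based density that recursion (3) maintains.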

### 2.1 Newton’s algorithm as a probabilistic learning rule

As underlined, in DP mixture models, the point estimate corresponds to the predictive distribution of . Our point is that, similarly, Newton’s recursive rule should be regarded as a probabilistic learning rule for model (5), that expresses a different, computationally simpler, predictive distribution for the :

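$$
\begin{aligned}
X_n \mid \theta_n &\overset{indep}{\sim} f(x \mid \theta_n) \\
\theta_{n+1} \mid x_{1:n} &\sim G_n(\cdot), \qquad n \geq 1,
\end{aligned}
\tag{10}
$$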

with and given by (3).

A known result in probability theory is that the predictive rule for a sequence of random variables characterizes the probability law of the sequence (Ionescu-Tulcea Theorem). On this basis, a researcher using the predictive rule (3) is implicitly using a different probabilistic model for the sequence , in place of the exchangeable mixture model (2). Our aim is to make this model explicit. Such a model may be of autonomous interest in some experimental circumstances. When we regard Newton’s sequential procedure as a probabilistic learning rule, it provides the formal framework for understanding its quasi-Bayes properties.

Let us denote by a probability law on that is consistent with assumptions (10). Newton’s recursive formulae can now be read as probabilistic implications, under the law . The estimate can be written in a prediction-error correction form

$$G_n(\cdot) = G_{n-1}(\cdot) + \alpha_n \big[ P(\theta_n \in \cdot \mid x_{1:n}) - P(\theta_n \in \cdot \mid x_{1:n-1}) \big],$$

where the correction term can now be interpreted as a difference between predictive distributions, computed according to . Moreover, we can appreciate the different information conveyed in the predictive rule, with respect to DP mixtures. Simple computations show that one can write as

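$$G_n(\cdot) = \frac{\alpha G_0(\cdot) + \sum_{k=1}^{n} \gamma_k\, P(\theta_k \in \cdot \mid x_{1:k})}{\alpha + \sum_{k=1}^{n} \gamma_k}, \qquad n \geq 1, \tag{11}$$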

where ; , and for . For , one has for all . In this case, one has a direct comparison with the corresponding formula (2) for DP mixtures. The latter originates from the Pólya urn structure characterizing the Dirichlet process. This suggests that Newton’s recursions are based on a different urn scheme, possibly urns of distributions; see Quintana and Newton (2000).
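As a check on the prediction-error correction form of the estimate given earlier in this subsection, the following worked step shows how it follows from (3). It is a sketch that assumes, as in (10), that $\theta_n \mid x_{1:n-1} \sim G_{n-1}$ and that $X_n$ has density $f(\cdot \mid \theta_n)$ given $\theta_n$ and the past observations.

```latex
\begin{align*}
P(\theta_n \in A \mid x_{1:n-1}) &= G_{n-1}(A), \\
P(\theta_n \in A \mid x_{1:n})
  &= \frac{\int_A f(x_n \mid \theta)\, dG_{n-1}(\theta)}
          {\int_\Theta f(x_n \mid \theta)\, dG_{n-1}(\theta)}, \\
G_n(A) &= (1-\alpha_n)\, G_{n-1}(A) + \alpha_n\, P(\theta_n \in A \mid x_{1:n}) \\
  &= G_{n-1}(A) + \alpha_n \bigl[ P(\theta_n \in A \mid x_{1:n})
     - P(\theta_n \in A \mid x_{1:n-1}) \bigr].
\end{align*}
```

The second line is Bayes' rule with $G_{n-1}$ acting as the prior for $\theta_n$, and the third line is rule (3) rewritten with that posterior.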

A further, immediate implication of the assumptions (10) is that, under the law , the , and consequently the , are no longer exchangeable. In fact, we show that Newton’s model (10) replaces exchangeability with a weaker form of dependence; namely, with being a conditionally identically distributed sequence. This is an interesting stochastic dependence structure, noticed by Kallenberg (1988), and developed by Berti et al. (2004). It implies that the are asymptotically exchangeable. Before proceeding, let us recall some basic definitions and properties of c.i.d. sequences.

### 2.2 Conditionally identically distributed sequences

Kallenberg (1988) (Proposition 2.1) proves that a stationary sequence that satisfies

$$(X_1, \ldots, X_n, X_{n+2}) \overset{d}{=} (X_1, \ldots, X_n, X_{n+1}), \qquad n \geq 1 \tag{12}$$

is exchangeable. Clearly, the converse is true, thus condition (12) is equivalent to exchangeability for stationary sequences. Notice that (12) implies that , for any and ; informally, for any ,

$$X_{n+k} \mid X_{1:n} \overset{d}{=} X_{n+1} \mid X_{1:n}, \qquad \text{for any } k \geq 1,$$

where means equal in distribution. Berti et al. (2004) extend this notion and introduce the term conditionally identically distributed sequences with respect to a filtration.

###### Definition 2.1

Let be a filtration. A sequence of random variables is conditionally identically distributed with respect to the filtration (-c.i.d.) if it is adapted to and, a.s.,

$$E[h(X_{n+k}) \mid \mathcal{F}_n] = E[h(X_{n+1}) \mid \mathcal{F}_n],$$

for all , and all bounded measurable functions

Less formally, a sequence adapted to is -c.i.d. if the are identically distributed and . When is the natural filtration of , the sequence is said to be c.i.d. An -c.i.d. sequence is also c.i.d. C.i.d. sequences preserve main properties of exchangeable sequences. In particular, as for exchangeable sequences, the sequence of the empirical distributions and the sequence of the predictive distributions converge a.s., to the same random distribution, say. If is c.i.d. with probability law , then

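$$\hat F_n \equiv \frac{\sum_{i=1}^{n} \delta_{X_i}}{n} \Rightarrow F \quad \text{and} \quad P_n \equiv P(X_{n+1} \in \cdot \mid X_{1:n}) \Rightarrow F, \qquad P\text{-a.s.} \tag{13}$$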

For exchangeable sequences, the limit is called the directing random measure (the statistical model, in Bayesian inference) and the probability law of is the de Finetti measure (the prior distribution). The limit in (13) is referred to as the directing random measure for c.i.d. sequences, too.

An exchangeable sequence is clearly c.i.d., but the reverse is not generally true. However, c.i.d. sequences are asymptotically exchangeable.

###### Definition 2.2

A sequence of random variables is asymptotically exchangeable, with directing measure , if

$$(X_{n+1}, X_{n+2}, \ldots) \overset{d}{\to} (Z_1, Z_2, \ldots)$$

for an exchangeable sequence , with directing random measure .

For a sequence , convergence of the predictive distributions to a random probability measure, say, implies that the sequence is asymptotically exchangeable, with directing random measure (Aldous (1985) Lemma 8.2). Thus, by (13), a c.i.d. sequence is asymptotically exchangeable, with directing random measure . Informally, , for large .

Fundamental results and limit theorems for -c.i.d. sequences are given by Berti et al. (2004). Applications in Bayesian nonparametric inference include Bassetti et al. (2010) and the c.i.d. hierarchical model proposed by Airoldi et al. (2014).

## 3 Quasi-Bayes properties of Newton’s model

Newton’s model (10) does not fully specify the probability law of the process , as it only assigns the predictive distribution of conditionally on the observable , while not enough restrictions are made on the conditional distribution of given and . Still, model (10) has interesting implications, which we study in this section. Clearly, an obvious way to obtain a full specification is to additionally assume that is conditionally independent of , given . This stronger assumption might be motivated by the non-stationary nature of the sequence in (10), and would considerably simplify the analysis; yet, it is not necessary, and our results are developed under the assumptions (10) only.

The first remark is that Newton’s model (10) implies that the are no longer exchangeable. We will show that is c.i.d. In some applications, exchangeability may actually be broken by forms of non-stationarity, and a c.i.d. model may offer a sensible description of the phenomena.

Example. As a simple example, suppose that data are observed over time, after an intervention that affects the population under investigation, but whose effect is ‘unpredictable’ and tends to vanish. One still assumes that , but a disequilibrium is introduced, so that the sequence is no longer exchangeable. One may rather envisage, in a state-space fashion, a sequence of random distributions, say, such that . Unpredictability of the dynamics may be expressed by further assuming that, for any , the conditional law of the new random distribution , given , only depends on and is Dirichlet process, centered on the current estimate

$$\tilde G_n \mid X_{1:n-1}, \theta_{1:n-1} \sim DP\big((\alpha + n - 1)\, G_{n-1}\big). \tag{14}$$

Then, as becomes available, Newton’s one-step-ahead update (3) is exact, that is, it is the Bayesian estimate of from the DP prior (14)

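$$P(\theta_{n+1} \in \cdot \mid \theta_{1:n}, x_{1:n}) = E\big(P(\theta_{n+1} \in \cdot \mid \tilde G_n) \mid \theta_{1:n}, x_{1:n}\big) = E(\tilde G_n \mid x_{1:n}) = G_n.$$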

As we will show, the sequence converges a.s. to a random distribution . By properties of the DP, this implies that the conditional law (14) of converges a.s. to a measure degenerate on . This fact expresses the idea that the disequilibrium tends to vanish. Intuitively, for large. We will indeed prove that the are asymptotically exchangeable.

In a more standard setting, stationarity holds; the researcher would regard the as exchangeable, judging that the order of the observations does not matter. The lack of exchangeability implied by assumptions (10) is thus a misspecification, only motivated by the need for fast computations. Then a minimal requirement is that is at least asymptotically exchangeable; informally, the order does not matter if we look at , for large enough. Moreover, one would want a mixture model of the form , at least asymptotically. To prove the latter property, we start by showing that such exists, and is the a.s. limit of the sequence of the predictive distributions . Furthermore, as for exchangeable sequences, is the expectation, given , of such limit distribution . The proof of the theorem below, as well as the proofs of all the following results, are collected in Section 6.

###### Theorem 3.1

Let the process have probability law that satisfies assumptions (10). Then

• the sequence converges to a random probability measure , a.s.;

• for every and measurable set , , for all .

An immediate consequence of the weak convergence of to is that a.s. for any continuous and bounded function on . We prove that the convergence can be extended to functions that are integrable with respect to .

###### Proposition 3.1

Let satisfy the assumptions (10), and let be a measurable function on , such that a.s. Then, for ,

$$\int h(\theta)\, dG_n(\theta) \to \int h(\theta)\, dG(\theta), \qquad P\text{-a.s.}$$

The condition a.s. holds, in particular, if is measurable and .

The following theorem proves that Newton’s learning rule (10) implies that the sequence is c.i.d., thus asymptotically exchangeable, and that its directing random measure has a mixture density of the form . In this sense, Newton’s model is a quasi-Bayes mixture model.

###### Theorem 3.2

Let satisfy the assumptions (10). Then

• The sequence () is c.i.d.;

• The sequence of predictive densities converges in to , -a.s., where is the a.s. weak limit of ;

• is asymptotically exchangeable, and its directing random measure has density with respect to .

Informally, the above results say that , for large. Notice that plays the role of the (infinite-dimensional) parameter of the directing random measure of , and as such, it is a function of . If the mixture is identifiable, uniquely determines . Moreover, by properties of c.i.d. sequences, is also the a.s. weak limit of the sequence of empirical distributions . Therefore, it is measurable with respect to the tail sigma-field of .

Intuitively, asymptotic exchangeability of the sequence implies that is also asymptotically exchangeable. The following theorem provides a formal proof. In fact, if we assume the additional condition that is independent of , given , then is easily proved to be c.i.d., thus asymptotically exchangeable. Again, the theorem below does not use this assumption.

###### Theorem 3.3

If the mixture is identifiable, then Newton’s learning scheme (10) implies that the sequence is asymptotically exchangeable, with directing random measure corresponding to the a.s. limit of the sequence .

### 3.1 On the prior distribution of G

In its original derivation, Newton’s algorithm, although moving from DP mixtures, is proposed as a recursive estimate of the mixing density, implicitly assuming that the unknown mixing distribution is absolutely continuous, with density . Regarding Newton’s algorithm as a probabilistic learning rule, we have shown that it implies a quasi-Bayes mixture model, where, asymptotically, the are a random sample from a mixture . Yet, compared with DP mixtures, Newton’s model changes the prior on , which is no longer (in general) a DP. Giving explicit results on the law of the limit is difficult. In the literature on c.i.d. processes, there are very few results, for very simple cases. Even for exchangeable Bayesian nonparametric methods, giving explicit results on the prior implied by predictive constructions is often a nontrivial problem. Although we cannot make explicit the prior distribution on implied by Newton’s model (10), we can prove that, under fairly mild sufficient conditions, is absolutely continuous, a.s. Moreover, although the prior is only implicitly defined, in the next section we give an asymptotic Gaussian approximation of the (finite-dimensional) posterior distribution that results from such a prior.

As noted, if is absolutely continuous with respect to a sigma-finite measure on , , then , and its density satisfies Newton’s recursive rule (4). It is immediate to verify that, for any fixed , the sequence is a martingale. Since is non-negative, there exists a function such that, for every , converges to a.s. However, this fact is not sufficient to conclude that . The latter property requires that converges in , or, equivalently, converges to in total variation. This fact is made precise in Lemma 3.1 below. Then, Lemma 3.2 gives sufficient conditions for in total variation, and , a.s. The two lemmas are extensions of Theorems 1 and 4 in Berti et al. (2013), the main difference being that we do not assume that the sequence of random variables involved is c.i.d. To be -c.i.d., a sequence has to be adapted to the filtration , and this condition is not satisfied, in general, by the sequence in Newton’s model. Still, one can fairly easily extend the proofs of Theorems 1 and 4 in Berti et al. (2013), by directly requiring the martingale property of the sequence of random measures , which is otherwise implied by the -c.i.d. property, to obtain the two lemmas below, whose proofs are therefore omitted. The proof of Theorem 3.4 is instead provided in Section 6.
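The recursion on the mixing density mentioned above can be illustrated numerically. The following is a minimal sketch, assuming the usual form of Newton's update, in which the current density is mixed with its one-observation Bayes update using weight $\alpha_n$. The Gaussian location kernel with unit variance, the uniform grid, the weights $1/(1+n)$, the initial guess and the names `gaussian_kernel` and `newton_update` are illustrative assumptions, not taken from the paper. The data here come from a fixed two-component mixture, so the sketch only shows the mechanics of the update (in particular, that total mass is preserved), not the martingale property under the model.

```python
import numpy as np

def gaussian_kernel(x, theta, sigma=1.0):
    """Density f(x | theta) of N(theta, sigma^2), evaluated on a grid of theta."""
    return np.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def newton_update(g, x, alpha, theta, dtheta, sigma=1.0):
    """One step of the recursion on a grid approximation of the mixing density."""
    lik = gaussian_kernel(x, theta, sigma)
    predictive = np.sum(lik * g) * dtheta      # Riemann sum for the predictive density at x
    posterior = lik * g / predictive           # one-observation Bayes update of g
    return (1.0 - alpha) * g + alpha * posterior

rng = np.random.default_rng(0)
theta = np.linspace(-6.0, 6.0, 601)            # hypothetical grid on the parameter space
dtheta = theta[1] - theta[0]
g = np.exp(-0.5 * (theta / 2.0) ** 2)
g /= np.sum(g) * dtheta                        # normalized initial guess (illustrative)

# Illustrative data: two-component Gaussian location mixture.
data = rng.normal(loc=rng.choice([-2.0, 2.0], size=300), scale=1.0)
for n, x in enumerate(data, start=1):
    g = newton_update(g, x, alpha=1.0 / (1.0 + n), theta=theta, dtheta=dtheta)

total_mass = np.sum(g) * dtheta
print(f"total mass after {len(data)} updates: {total_mass:.6f}")
```

Because the same Riemann sum is used for the predictive density and for the normalization of the Bayes update, each step is a convex combination of two grid densities, so the total mass stays equal to one up to rounding.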

###### Lemma 3.1

Let be a sigma-finite measure on a Polish space . For any , let be a random measure on such that the sequence is a measure-valued martingale with respect to a filtration , and let be its limit. Then a.s. if and only if, a.s., for every and converges to in total variation.

###### Lemma 3.2

Let and be as in Lemma 3.1. Assume that a.s. for every , with density . Then a.s. if and only if for every compact such that , is, a.s., a function on uniformly integrable with respect to , where is the restriction of on .
In particular, a.s. if, for every compact, there exists such that, a.s.,

$$\sup_n \int_K q_n(x)^p \, d\lambda(x) < \infty. \tag{15}$$

A sufficient condition for (15) is

$$\sup_n E\left(\int_K q_n(x)^p \, d\lambda(x)\right) < \infty.$$

We can now provide sufficient conditions for the a.s. absolute continuity of the limit mixing distribution in Newton’s model.

###### Theorem 3.4

Let be the a.s. limit of the sequence of predictive rules defined by (3). If the following conditions hold

$$\sum_n \alpha_n^2 < \infty; \qquad \int_K g_0(\theta)^2 \, d\lambda(\theta) < \infty, \quad \text{for every } K \text{ compact}; \tag{16}$$

and

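$$\sup_{\theta_1, \theta_2 \in K} \int \frac{f(x \mid \theta_1)^2}{f(x \mid \theta_2)} \, d\mu(x) < \infty, \quad \text{for every } K \text{ compact such that } \lambda(K) < \infty, \tag{17}$$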

then , a.s. Moreover, a.s., converges in to .

Assumptions (16) are quite natural. They hold, for example, if and is continuous or bounded. Assumption (17) is more delicate. It holds, for example, if is a Poisson density, a Gaussian density with fixed variance, or a Gamma density with fixed shape parameter. A similar assumption is considered in Tokdar et al. (2009).
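For the Gaussian case, the integral in (17) can be checked directly. Reading the integrand as the ratio $f(x \mid \theta_1)^2 / f(x \mid \theta_2)$ and taking $f(\cdot \mid \theta)$ to be the $N(\theta, \sigma^2)$ density, completing the square gives the value $\exp\{(\theta_1 - \theta_2)^2 / \sigma^2\}$, which is bounded when $\theta_1, \theta_2$ range over a compact set. The sketch below compares this closed form with a Riemann sum; the function names, the integration grid and the parameter values are illustrative assumptions.

```python
import math
import numpy as np

def log_normal_pdf(x, theta, sigma):
    """Log density of N(theta, sigma^2)."""
    return -0.5 * ((x - theta) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def ratio_integral(theta1, theta2, sigma, half_width=40.0, num=80001):
    """Riemann-sum approximation of the integral of f(x|theta1)^2 / f(x|theta2) over x."""
    x = np.linspace(-half_width, half_width, num)
    dx = x[1] - x[0]
    # Work on the log scale so that the far tails underflow to 0 instead of producing 0/0.
    log_integrand = 2.0 * log_normal_pdf(x, theta1, sigma) - log_normal_pdf(x, theta2, sigma)
    return float(np.sum(np.exp(log_integrand)) * dx)

theta1, theta2, sigma = 1.0, -1.0, 1.0
numeric = ratio_integral(theta1, theta2, sigma)
closed_form = math.exp((theta1 - theta2) ** 2 / sigma ** 2)
print(numeric, closed_form)
```

The closed form grows with the distance between the two locations, which is why the supremum in (17) is taken over compact sets only.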

In the next subsection, we give a simple comparison between the DP and the law of arising from Newton’s recursions, through a simulation study.

### 3.2 Empirical study

DP mixtures are widely used for clustering, based on the random partition of the induced by the DP. For model (10), the asymptotic mixing distribution is absolutely continuous under fairly mild conditions; thus, one rather has multiple shrinkage effects in the estimation of , governed by the modes of the mixing density . Nevertheless, one may expect that, if provides strong information on , then Newton’s predictive rule implies a negligible loss of information with respect to the predictive rule corresponding to a DP mixture (compare (2) and (11), with ). Consequently, in this case, the law of should be close to the DP.

This behavior is illustrated in Figure 1. We consider a location mixture of Gaussian distributions , with known. In this case, the assumptions of Theorem 3.4 are satisfied, and is absolutely continuous, a.s. The prior guess is . The weights in Newton’s recursions are , with . Were , then, for any set , the law of would be a Beta distribution, with parameters . The panels in the first column of Figure 1 compare such Beta density (gray dotted line), for , with a Monte Carlo approximation of the law of arising from the c.i.d. model. The Monte Carlo sample is obtained by generating samples , from the c.i.d. model (10), with . For each sample, we compute , which we take as a fairly reasonable proxy of the realization of the random . The first row in Figure 1 corresponds to . In the second row, . Clearly, the results do not contradict the conjecture that, if is (very) small, the law of is close to the DP. The shrinkage effect that replaces the DP random partition of is illustrated in the right panels of Figure 1, which show the density estimate in the two cases. Clearly, for , closely describes a random partition of the .
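A comparison of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reproduction of Figure 1: it takes a Gaussian location kernel with standard deviation `sigma`, weights $1/(\alpha + n)$, a discretized Gaussian prior guess on a grid, the set $A = \{\theta < 0\}$, and generates each new observation from the current predictive mixture before applying Newton's update. All numerical values and the name `simulate_limit_mass` are hypothetical. For reference, under a Dirichlet process with precision $\alpha$ and base measure $G_0$, the mass of $A$ has a Beta law with mean $G_0(A)$ and variance $G_0(A)(1 - G_0(A))/(\alpha + 1)$.

```python
import numpy as np

def simulate_limit_mass(n_steps, alpha, sigma, theta, w0, in_A, rng):
    """Generate n_steps observations from the predictive scheme; return the final mass of A."""
    w = w0.copy()                                         # grid weights of the current mixing law
    for n in range(1, n_steps + 1):
        w = w / w.sum()                                   # guard against rounding drift
        theta_star = rng.choice(theta, p=w)               # latent draw from the current mixing law
        x = rng.normal(theta_star, sigma)                 # new observation from the predictive
        lik = np.exp(-0.5 * ((x - theta) / sigma) ** 2)   # Gaussian kernel, constant cancels
        post = w * lik / np.sum(w * lik)                  # one-observation Bayes update
        a_n = 1.0 / (alpha + n)
        w = (1.0 - a_n) * w + a_n * post                  # Newton's update
    return w[in_A].sum()

rng = np.random.default_rng(1)
theta = np.linspace(-6.0, 6.0, 160)       # even number of points, so 0 is not on the grid
w0 = np.exp(-0.5 * (theta / 2.0) ** 2)
w0 /= w0.sum()                            # discretized prior guess (illustrative)
in_A = theta < 0.0
p0 = w0[in_A].sum()
alpha, sigma = 1.0, 0.3                   # hypothetical values

draws = np.array([simulate_limit_mass(100, alpha, sigma, theta, w0, in_A, rng)
                  for _ in range(200)])
beta_var = p0 * (1.0 - p0) / (alpha + 1.0)
print(f"Monte Carlo mean {draws.mean():.3f} vs prior guess mass {p0:.3f}")
print(f"Monte Carlo variance {draws.var():.3f} vs Beta variance {beta_var:.3f}")
```

The Monte Carlo mean should stay close to the prior guess mass, since the mass of a fixed set is a martingale under the predictive scheme; how close the variance is to the Beta variance depends on the kernel spread, in line with the conjecture discussed above.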

## 4 Asymptotic posterior laws

By part (ii) of Theorem 3.1, Newton’s rule can be regarded as a point estimate, with respect to quadratic loss, of the limit mixing distribution , in a quasi-Bayes mixture model. Yet, a Bayesian mixture model would offer more than a point estimate; it would give a description of the uncertainty through the posterior distribution of . The probabilistic framework we have provided for Newton’s learning rule allows us to go beyond point estimation, studying the conditional distribution of , given . We first obtain an asymptotic Gaussian approximation of the conditional distribution of , given , for any measurable set . We then extend the results to the joint conditional distribution of a random vector , for any measurable .

### 4.1 Asymptotic posterior distribution and credible intervals

Let us recall that is a probability law for , consistent with the assumptions (10). Here we give an asymptotic Gaussian approximation of the conditional law , for a measurable set . The almost sure conditional convergence involved is a strong form of convergence (Crimaldi, 2009) that implies stable convergence (Rényi, 1963; Aldous, 1985; Häusler and Luschgy, 2015) and convergence in distribution of the unconditional law.

Notice that, although they have a similar flavor, these results differ from Bernstein-von Mises type theorems, which are stated a.s. with respect to the law . The law describes an evolutionary process, and the results describe the rate of convergence of to the limit distribution , for any in a set of -probability one. Bernstein-von Mises results are beyond the scope of this paper, but we give some hints in the simulation study in Section 5.

Let us denote by the Gaussian law with mean and variance , and by its d.f. evaluated at . A law is interpreted as the law degenerate at zero. The -dimensional Gaussian law will be denoted by , with d.f. . Without loss of generality, we can assume that for every . This implies

$$f_{G_n}(x) \neq 0 \quad \text{for every } x \in \mathbb{X} \text{ and } n \geq 0. \tag{18}$$

Our first result finds a sequence such that the conditional distribution of , given , is asymptotically a zero-mean Gaussian law, with variance

$$V_A = \int_{\{x : f_G(x) \neq 0\}} P_G(A \mid x)^2 \, dF_G(x) - G(A)^2. \tag{19}$$

We recall the notation , for any d.f. on .

Before stating the theorem, we give the following Lemma. Let us define, for any and ,

$$V_{A,n} = \int_{\mathbb{X}} P_{G_n}(A \mid x)^2 \, dF_{G_n}(x) - G_n(A)^2. \tag{20}$$

Notice that can be written as , expressing the prior-to-posterior variability, given , when plays the role of the prior and of the posterior.

###### Lemma 4.1

For any , converges to a.s. as .
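The quantity in (20) can be evaluated numerically once the current estimate of the mixing distribution is available. The following sketch assumes a discrete approximation of that estimate on a grid of parameter values, a Gaussian location kernel with known standard deviation, and a Riemann sum over the sample space; the function name `variance_term`, the grids and the weights are illustrative assumptions. The grid for $x$ must be chosen so that the mixture density is numerically positive everywhere on it, in the spirit of (18).

```python
import numpy as np

def variance_term(w, theta, in_A, sigma, x_grid):
    """Grid approximation of (20) for a discrete mixing law with weights w on theta.

    Returns the pair (V, G_A): the variance term and the mass assigned to A.
    """
    dx = x_grid[1] - x_grid[0]
    # Kernel matrix: rows index x, columns index theta.
    lik = np.exp(-0.5 * ((x_grid[:, None] - theta[None, :]) / sigma) ** 2)
    lik /= sigma * np.sqrt(2.0 * np.pi)
    f_mix = lik @ w                                   # mixture density at each x
    post_A = (lik[:, in_A] @ w[in_A]) / f_mix         # posterior probability of A given x
    G_A = w[in_A].sum()
    V = np.sum(post_A ** 2 * f_mix) * dx - G_A ** 2   # second moment minus squared mean
    return float(V), float(G_A)

theta = np.linspace(-6.0, 6.0, 121)
w = np.exp(-0.5 * (theta / 2.0) ** 2)
w /= w.sum()                                          # illustrative mixing weights
in_A = theta < 0.0
x_grid = np.linspace(-12.0, 12.0, 1201)

V, G_A = variance_term(w, theta, in_A, sigma=1.0, x_grid=x_grid)
print(f"V = {V:.4f}, G(A) = {G_A:.4f}, upper bound G(A)(1 - G(A)) = {G_A * (1 - G_A):.4f}")
```

Since the posterior probability of $A$ lies in $[0, 1]$ and averages to the prior mass of $A$ under the mixture, the computed value is the variance of a bounded random variable and must lie between zero and $G(A)(1 - G(A))$, which provides a quick sanity check.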

We can then give the main results of this section.

###### Theorem 4.1

Let satisfy and and let be a monotone sequence of positive numbers such that as . If

$$\sqrt{r_n} \, \sup_{k \geq n} \alpha_k \to 0 \tag{21}$$

and

$$\sum_{k \geq 1} r_k^2 \, \alpha_{k+1}^4 < \infty, \tag{22}$$

then, for every ,

$$P\left(\sqrt{r_n}\,\big(G(A) - G_n(A)\big) \leq t \mid X_{1:n}\right) \to \Phi(t \mid 0, V_A), \tag{23}$$

with as in (19). If with and , then (23) holds with .

###### Remark 4.1

Assumptions (21) and (22) hold for most choices of satisfying and . In particular, if is eventually decreasing, then (21) is a consequence of (22). A sufficient condition for (22) is

$$\alpha_n = (n b_n)^{-1}$$

for a sequence that is eventually non-increasing. Indeed, in this case

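$$\limsup_{n \to \infty} r_{n-1} \alpha_n = \limsup_{n \to \infty} \frac{(n b_n)^{-1}}{\sum_{k \geq n} (k b_k)^{-2}} \leq \limsup_{n \to \infty} \frac{(n b_n)^{-1}}{b_n^{-2} \sum_{k \geq n} k^{-2}} \leq \limsup_{n \to \infty} b_n < \infty.$$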

In turn, this implies that , for large enough, and therefore (22).
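The bound in this remark can be checked numerically in the simplest case. The sketch below assumes a constant sequence $b_n = b$ (trivially non-increasing), so that $\alpha_n = 1/(nb)$, and takes the normalizing sequence to be exactly the reciprocal of the tail sum of squared weights, $r_{n-1} = 1 / \sum_{k \geq n} \alpha_k^2$, which is how the display above is read here. Both choices, and the function names, are assumptions made for illustration.

```python
import math

def tail_sum_inverse_squares(n):
    """Sum of k^{-2} over k >= n, computed as pi^2 / 6 minus the first n - 1 terms."""
    return math.pi ** 2 / 6.0 - sum(1.0 / k ** 2 for k in range(1, n))

def r_prev_times_alpha(n, b):
    """Product r_{n-1} * alpha_n for alpha_k = 1 / (k * b) and r_{n-1} = 1 / sum_{k >= n} alpha_k^2."""
    alpha_n = 1.0 / (n * b)
    tail = tail_sum_inverse_squares(n) / b ** 2
    return alpha_n / tail

b = 2.0  # hypothetical constant value of b_n
values = {n: r_prev_times_alpha(n, b) for n in (1, 10, 100, 1000)}
for n, v in values.items():
    print(f"n = {n:4d}: r_(n-1) * alpha_n = {v:.6f} (bound b = {b})")
```

In this case the product equals $b / (n \sum_{k \geq n} k^{-2})$; since $n \sum_{k \geq n} k^{-2} \geq 1$ and tends to one, the product stays below $b$ and approaches it, so the squared product times $\alpha_n^2$ is summable.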

###### Remark 4.2

If , then Theorem 4.1 gives convergence to a degenerate distribution on zero. From the definition of , it is immediate to see that if and only if , which happens if and only if is zero or one.

In Theorem 4.1, the limit variance is unknown, since it depends on . By Lemma 4.1, a convergent estimator is provided by . Replacing the random with its consistent estimate, so as to obtain an asymptotic distribution from which asymptotic credible intervals for can be computed (as is done through the Cramér-Slutsky theorem in standard i.i.d. settings), is delicate here, given the kind of convergence we are studying. Yet, we can prove the following theorem.

###### Theorem 4.2

Let . Then, with defined as in Theorem 4.1 and under the same assumptions, for almost all such that ,

$$P\left(\sqrt{r_n}\,\frac{G(A) - G_n(A)}{\sqrt{V_{A,n}}} \leq t \,\Big|\, x_{1:n}\right) \to \Phi(t \mid 0, 1). \tag{24}$$

If with and , then (24) holds with .
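The standardized limit in (24) suggests a symmetric interval centered at the current estimate of the mass of $A$, with half-width equal to a standard Gaussian quantile times $\sqrt{V_{A,n}/r_n}$. The following is a minimal sketch of that computation; the function name `credible_interval`, its argument names and the numerical inputs are hypothetical, and the values of $G_n(A)$, $V_{A,n}$ and $r_n$ are assumed to have been computed elsewhere.

```python
import math
from statistics import NormalDist

def credible_interval(g_n_A, v_A_n, r_n, gamma=0.05):
    """Asymptotic (1 - gamma) interval: g_n_A +/- z * sqrt(v_A_n / r_n)."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if v_A_n < 0.0 or r_n <= 0.0:
        raise ValueError("v_A_n must be non-negative and r_n positive")
    z = NormalDist().inv_cdf(1.0 - gamma / 2.0)   # standard Gaussian quantile
    half_width = z * math.sqrt(v_A_n / r_n)
    return g_n_A - half_width, g_n_A + half_width

# Hypothetical inputs: estimate 0.4, variance term 0.09, normalizing constant 100.
lower, upper = credible_interval(0.4, 0.09, 100.0)
print(f"95% interval: ({lower:.4f}, {upper:.4f})")
```

The interval is not clipped to $[0, 1]$ in this sketch; when the variance term is zero it collapses to a single point, which matches the degenerate limit described for that case.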

Theorems 4.1 and 4.2 allow us to obtain asymptotic credible intervals. For a fixed set , Theorem 4.2 gives that, for almost all such that ,

$$P\big(G_n(A) - z_{1-\gamma/2}\sqrt{V_{A,n}/r_n} \;\cdots$$

where $z_{1-\gamma/2}$ is the $(1-\gamma/2)$-quantile of the standard Gaussian distribution. If , then Theorem 4.1 implies that the limit distribution is degenerate on zero, therefore, for any