 # An Online Sample Based Method for Mode Estimation using ODE Analysis of Stochastic Approximation Algorithms

The mode is a measure of central tendency that can offer a better representation of, and more interesting insights into, data than the other common measures, the mean and the median. When the analytical form of the density function is known, the mode is an argument of the maximum of the density, and standard optimization techniques can be applied to find it. In many practical applications, however, the analytical form of the density is not known and only samples from the distribution are available. Most techniques proposed in the literature for estimating the mode from samples assume that all the samples are available beforehand; moreover, some employ computationally expensive operations such as sorting. In this work we provide a computationally efficient, online iterative algorithm that estimates the mode of a unimodal smooth density given only the samples generated from the density. We establish asymptotic convergence of the proposed algorithm using an ordinary differential equation (ODE) based analysis, and we prove the stability of the estimates by utilizing the concept of regularization. Experimental results further demonstrate the effectiveness of the proposed algorithm.


## I Introduction

There are many metrics used to represent the central tendency of data; the popular ones are the mean, the median and the mode. The mean is extensively studied due to its simplicity, linearity and ease of estimation via sample averages. However, the mean is susceptible to outliers: when estimating it from a finite number of samples, one bad outlier can shift the estimate far away from the true mean. Moreover, in some applications the mean may not be the desired quantity to analyze. For example, it is often more informative to know the income that the majority of the population of a country earns rather than the average income, which is a skewed quantity.

In this work, we focus on finding the mode of a distribution whose analytical form is not known. That is, we are given only samples from the distribution and need to estimate the mode from them. We first discuss some of the work reported in the literature on mode estimation. The problem of estimating the mode of a unimodal density was first considered by Parzen, where a sequence of density functions is iteratively constructed from the samples and the respective modes are computed as maximum likelihood estimates. These estimates of the mode are shown to converge in probability to the actual mode.

We can broadly classify the solution techniques of mode estimation problem into two groups. The first group comprises a non-parametric way of estimating the mode where the mode is estimated directly from the sample data without constructing the density function. The second group of methods comprises the parametric way of estimation in which the density function is approximately constructed and the mode is computed using optimization techniques.

Grenander proposed a non-parametric estimator (popularly known as Grenander's estimate) for estimating the mode directly from the samples. Later developments in mode estimation are based on the idea that the mode is situated at the center of an interval of suitably chosen length that contains the majority of the observed points. A sequence of nested intervals and intermediate mode estimates is constructed based on this idea, and the mode is taken to be the point to which these intermediate estimates converge. Different ways of selecting the interval lengths, along with their convergence properties, are studied in [3, 4]. A variant of this idea selects the interval of shortest length containing a desired number of points instead of fixing the interval lengths; the estimation methods in [5, 6, 7] are based on it. The above techniques, along with their robustness properties, are surveyed extensively in [8, 9].

A parametric method of estimating the mode fits the samples to a normal distribution; the mode is then estimated as the mean of the fitted normal distribution. This idea has recently been extended to find the multivariate mode. The multivariate mean shift algorithm estimates the multivariate mode from the samples by iteratively shifting the estimated mode towards the actual mode using Gaussian kernels. The minimum volume peeling method estimates the multivariate mode by iteratively constructing subsets of the set of samples with minimum volume and discarding the remaining points; the mode is then computed by averaging the points in the constructed subset. This is based on the observation that the mode is generally situated in the minimum volume set containing a given fixed number of samples. An effective way of selecting the subset of points is also discussed in the literature.

Most of the algorithms considered in the literature so far assume that all the samples are available upfront. Such techniques cannot be extended to the case of streaming data, where the samples arrive online, one at a time. Moreover, the non-parametric techniques discussed above require the samples to be in sorted order.

Our proposed algorithm is fundamentally different from the above algorithms in the sense that ours is an online algorithm that works with the data as it becomes available. This enables us to work with online samples without storing them in the memory. Also, we do not resort to any computationally expensive operations like sorting. In addition, our algorithm works for both univariate and multivariate distributions. We provide a convergence analysis of our proposed technique and show the robustness of our technique using simulation results.

Our work is closest to that of Tsybakov, where a gradient-like recursive procedure for estimating the mode is proposed and convergence is established using the theory of martingales. In our work, to mitigate the lack of an analytical form of the density, we construct a kernel based representation of the density function (see Section II) and use stochastic gradient ascent to compute the mode. Our work differs in the following ways.

• Our proposed algorithm is based on different assumptions. Moreover, we prove the stability of the mode estimates by introducing the concept of regularization.

• We demonstrate the effectiveness of our algorithm by providing empirical evaluation on well-known distributions.

• Our convergence proof utilizes the well-known ODE based analysis of stochastic approximation algorithms. To the best of our knowledge, ours is the first work that makes use of ODE based analysis in the context of mode estimation.

The rest of the paper is organized as follows. Section II explains the necessary background and preliminaries. Section III provides the motivation for our work and presents our algorithm. Section IV describes the convergence analysis. Section V presents the results of our numerical experiments. Finally, Section VI presents conclusions and future work.

## II Background and Preliminaries

To begin, suppose we have a probability space (Ω, ℱ, P) and a random vector X taking values in ℝ^p with a smooth density function f. A mode of the random vector X is defined as an argument of the maximum of f. Suppose we have a unique mode m∗, i.e., f has a unique maximizer; then the mode is a measure of central tendency of the distribution of X. A natural problem that arises is the estimation of the mode given a sequence of independent and identically distributed (i.i.d.) samples of X, denoted by X₁, X₂, …. In what follows we provide the necessary formal definitions and prove some properties that are utilized in the motivation and convergence analysis of our proposed algorithm to solve this problem.

### II-A Approximation of the Identity

###### Definition 1

The convolution of two functions u and v on ℝ^p is defined as

 (u∗v)(x) := ∫_{ℝ^p} u(x−t) v(t) dt,   (x ∈ ℝ^p).
###### Definition 2

Given a function K and ϵ > 0, we define

 Kϵ(x) := ϵ^{−p} K(x/ϵ).

For a given function K, the family of functions {Kϵ : ϵ > 0} is called an approximation of the identity.
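As a quick numerical illustration (not part of the paper's experiments), the sketch below builds the scaled family Kϵ for the univariate Gaussian kernel K(x) = e^{−x²}/√π and checks the two properties established in Lemma 1 below: the integral of Kϵ is independent of ϵ, and the mass outside any fixed ball vanishes as ϵ → 0.

```python
import numpy as np

def K(x):
    # Univariate Gaussian kernel K(x) = exp(-x^2)/sqrt(pi); its integral over R is 1.
    return np.exp(-x**2) / np.sqrt(np.pi)

def K_eps(x, eps, p=1):
    # Scaled kernel K_eps(x) = eps^{-p} K(x/eps).
    return K(x / eps) / eps**p

# Riemann-sum checks on a fine grid.
x = np.linspace(-50.0, 50.0, 2_000_001)
dx = x[1] - x[0]
for eps in (1.0, 0.5, 0.1):
    total = (K_eps(x, eps) * dx).sum()   # stays ~1 for every eps
    tail = (np.where(np.abs(x) > 0.5, K_eps(x, eps), 0.0) * dx).sum()
    print(f"eps={eps}: integral~{total:.6f}, mass outside |x|>0.5 = {tail:.2e}")
```

As ϵ shrinks, the kernel concentrates its mass near the origin while its total integral stays fixed, which is exactly what makes the convolution u ∗ Kϵ approximate u.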

###### Lemma 1

Suppose K ∈ L¹(ℝ^p). Then:

1. ∫_{ℝ^p} Kϵ = ∫_{ℝ^p} K for every ϵ > 0;

2. ∫_{∥x∥>δ} |Kϵ(x)| dx → 0 as ϵ → 0, for any fixed δ > 0.

For statement 1, choose y = x/ϵ, so dy = ϵ^{−p} dx. Then we have

 ∫_{ℝ^p} Kϵ = ϵ^{−p} ∫_{ℝ^p} K(x/ϵ) dx = ∫_{ℝ^p} K(y) dy = ∫_{ℝ^p} K.

Again, for statement 2, let δ > 0 and choose y = x/ϵ. Now

 ∫_{∥x∥>δ} |Kϵ(x)| dx = ϵ^{−p} ∫_{∥x∥>δ} |K(x/ϵ)| dx = ∫_{∥y∥>δ/ϵ} |K(y)| dy.

Since K ∈ L¹(ℝ^p) and δ/ϵ → ∞ as ϵ → 0, the right-hand side tends to zero, and the proof is complete.

###### Theorem 1

Let uϵ := u ∗ Kϵ, where u ∈ L¹(ℝ^p), ∫_{ℝ^p} K = 1 and |K(x)| ≤ μ(x)∥x∥^{−p} with μ(x) → 0 as ∥x∥ → ∞. Then uϵ(x) → u(x) as ϵ → 0 at each point of continuity x of u.

Let u be continuous at x. By the definition of continuity, given η > 0 there exists δ > 0 such that |u(x−t) − u(x)| < η whenever ∥t∥ < δ. Since ∫ Kϵ(t) dt = 1 from Lemma 1 and the hypothesis, we have

 |uϵ(x) − u(x)| = |∫ u(x−t)Kϵ(t)dt − u(x)∫ Kϵ(t)dt|
 ≤ |∫_{∥t∥<δ} (u(x−t) − u(x))Kϵ(t)dt| + |∫_{∥t∥≥δ} (u(x−t) − u(x))Kϵ(t)dt|
 ≤ η ∫_{∥t∥<δ} |Kϵ(t)|dt + ∫_{∥t∥≥δ} |u(x−t) − u(x)| |Kϵ(t)|dt
 ≤ η∥K∥₁ + ∫_{∥t∥≥δ} |u(x−t)| |Kϵ(t)|dt + |u(x)| ∫_{∥t∥≥δ} |Kϵ(t)|dt.

The third term approaches zero as ϵ → 0 by Lemma 1, so it is enough to show that the second term approaches zero. From the hypothesis, |Kϵ(t)| ≤ μ(t/ϵ)∥t∥^{−p} for some non-negative μ with μ(y) → 0 as ∥y∥ → ∞. We have

 ∫_{∥t∥≥δ} |u(x−t)| |Kϵ(t)|dt ≤ ∫_{∥t∥≥δ} |u(x−t)| μ(t/ϵ)∥t∥^{−p} dt
 ≤ δ^{−p} {sup_{∥t∥≥δ} μ(t/ϵ)} ∫_{∥t∥≥δ} |u(x−t)|dt
 ≤ δ^{−p} {sup_{∥t∥≥δ} μ(t/ϵ)} ∥u∥₁.

Again from the hypothesis, sup_{∥t∥≥δ} μ(t/ϵ) → 0 as ϵ → 0. So uϵ(x) → u(x) as ϵ → 0. This concludes the proof.

###### Corollary 1

Let vϵ := v ∗ Kϵ, where each component of v : ℝ^p → ℝ^p is in L¹(ℝ^p), ∫_{ℝ^p} K = 1 and |K(x)| ≤ μ(x)∥x∥^{−p} with μ(x) → 0 as ∥x∥ → ∞. Then vϵ(x) → v(x) as ϵ → 0 at each point of continuity x of v. Here the convolution v ∗ Kϵ is performed component-wise.

The result is obtained by applying Theorem 1 to each component of v.

### II-B Stochastic Optimization

Stochastic optimization deals with the study of iterative algorithms of the type

 x_{k+1} = x_k + a_k [∇h(x_k) + N_{k+1}]. (1)

Here x_k are the parameters updated according to (1), h is an underlying objective function whose maximum we are interested in finding, {a_k} is a prescribed step-size sequence, and {N_k} constitute the noise terms. We state here a theorem that is utilized in the convergence analysis of our algorithm. Consider the following assumptions.

1. (A1) The step-sizes {a_k} satisfy the requirements:

 a_k > 0 ∀k, ∑_k a_k = ∞, ∑_k a_k² < ∞.

2. (A2) The sequence {N_k} is a martingale difference sequence with respect to the following increasing sequence of sigma fields:

 F_k := σ{x₀, N₁, ⋯, N_k}, k ≥ 0.

Thus, in particular, for all k ≥ 0,

 E[N_{k+1} | F_k] = 0 a.s.

Further, the N_k are square integrable and

 E[∥N_{k+1}∥² | F_k] ≤ C(1 + ∥x_k∥²) a.s.

for a given constant C > 0.

3. (A3) The function ∇h is Lipschitz continuous.

4. (A4) The functions h_c(x) := ∇h(cx)/c, c ≥ 1, satisfy h_c(x) → h_∞(x) as c → ∞, uniformly on compacts, for some function h_∞. Furthermore, the o.d.e.

 ẋ(t) = h_∞(x(t)) (2)

has the origin as its unique globally asymptotically stable equilibrium.

Consider the ordinary differential equation

 ẋ(t) = ∇h(x(t)). (3)

Let H denote the compact set of asymptotically stable equilibrium points of the ODE (3).

###### Theorem 2

Under (A1)-(A4), sup_k ∥x_k∥ < ∞ a.s. (stability). Further, x_k converges almost surely, as k → ∞, to the compact set of asymptotically stable equilibrium points of the ODE (3).

This follows as a consequence of Theorems 2 and 7 of .

## III Motivation and Algorithm

In this section we motivate and present our iterative algorithm for estimating the mode of a unimodal density. The idea of computing the mode is described below. Let f denote the unimodal density function. As the mode is the maximizer of the density function, we can estimate the mode by gradient ascent as follows:

 m_{n+1} = m_n + a_n ∇f(m_n), (4)

where a_n and m_n are the step-size and the current mode estimate, respectively, at time n.

We introduce a function g defined as follows:

 g(m) = f(m) − (λ/2)∥m∥², (5)

where λ > 0 is the regularization coefficient. The idea is to find an m that maximizes the function g. This is done to maintain the stability of the estimates in our algorithm (see Proposition 3 in Section IV). Therefore the gradient ascent update is performed as follows:

 m_{n+1} = m_n + a_n ∇g(m_n) (6)
     = m_n + a_n (∇f(m_n) − λm_n). (7)

It remains to be shown that the solution obtained using the update equation (7) converges to the mode obtained using the update equation (4) as λ → 0. Let

 ^m(λ) := argmax_m g(m)

and

 m∗ := argmax_m f(m).

It is easy to see that

 ^m(0) = m∗.

From the continuity of ^m(·) given by the Maximum Theorem, we have, as λ → 0,

 ^m(λ) → ^m(0) = m∗.
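This limiting behavior can be checked numerically. The sketch below (an illustration under assumed settings, not the paper's experiment) takes f to be the N(2, 1) density, so m∗ = 2, and maximizes the regularized objective over a grid for decreasing λ; the maximizer ^m(λ) moves towards m∗.

```python
import numpy as np

def f(m):
    # Unimodal density N(2, 1); the true mode is m* = 2.
    return np.exp(-(m - 2.0) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

m_grid = np.linspace(-5.0, 5.0, 100001)
m_hats = []
for lam in (0.1, 0.01, 0.001):
    g = f(m_grid) - 0.5 * lam * m_grid ** 2   # regularized objective (5)
    m_hats.append(m_grid[np.argmax(g)])
print(m_hats)   # increases towards the true mode 2 as lam decreases
```

The regularizer biases the maximizer towards the origin, but the bias vanishes as λ → 0, which is the trade-off the stability argument exploits.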

The update equation (7), however, needs ∇f, which is not known. We therefore make use of the ideas in Section II to estimate ∇f as follows. Applying Corollary 1 to ∇f with the kernel Kϵ, we get, for small ϵ,

 ∇f(m) ≈ ∇fϵ(m) = ∫_{ℝ^p} ∇f(m−t) Kϵ(t) dt. (8)

By the properties of convolution

 ∫_{ℝ^p} ∇f(m−t) Kϵ(t) dt = ∫_{ℝ^p} ∇Kϵ(m−t) f(t) dt = E_{X∼f}[∇Kϵ(m−X)]. (9)
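Identity (9) is what makes a sample-based update possible: the unknown ∇fϵ(m) is an expectation that can be estimated from samples alone. A small Monte Carlo sanity check (with an assumed N(2, 1) density and a univariate Gaussian kernel, purely for illustration) compares the sample average of ∇Kϵ(m − X) with the closed form of ∇fϵ(m), which for this choice is the derivative of an N(2, 1 + ϵ²/2) density.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.3
X = rng.normal(2.0, 1.0, 1_000_000)      # samples from f = N(2, 1)

def grad_K_eps(u):
    # Derivative of the Gaussian kernel K_eps(u) = exp(-u^2/eps^2)/(eps*sqrt(pi)).
    return -2.0 * u / (eps ** 3 * np.sqrt(np.pi)) * np.exp(-(u ** 2) / eps ** 2)

m = 1.0
mc = grad_K_eps(m - X).mean()            # sample estimate of E[grad K_eps(m - X)]

# Closed form: f * K_eps is the N(2, 1 + eps^2/2) density, so grad f_eps is its derivative.
s2 = 1.0 + eps ** 2 / 2.0
exact = -(m - 2.0) / s2 * np.exp(-(m - 2.0) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
print(mc, exact)                         # the two agree to Monte Carlo accuracy
```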

Note that there are several valid choices of the function K (also called a kernel) that yield an approximation of the identity.

Now, (7) can be re-written using stochastic gradient ascent as follows:

 m_{n+1} = m_n + a_n (∇Kϵ(m_n − X_{n+1}) − λm_n), (10)

where X_{n+1} is the sample obtained at time n+1.

Table I indicates some of the popular kernels.

It is easily verified that the kernels defined in Table I satisfy the hypotheses of Corollary 1. The full algorithm for estimating the mode from online streaming data is described in Algorithm 1.

Let m₀ denote an initial mode estimate and ϵ a small positive constant. The algorithm works as follows. At time n, the algorithm takes as input the current mode estimate m_n and the sample X_{n+1}. It then computes the direction d_{n+1} = ∇Kϵ(m_n − X_{n+1}) − λm_n and updates the current approximation of the mode along d_{n+1} as shown in step 2. The output of the algorithm is the updated mode estimate m_{n+1} computed from the samples obtained so far, i.e., X₁, …, X_{n+1}. We prove the convergence of the algorithm in the next section.
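A minimal univariate sketch of this procedure is given below. The kernel, step sizes, λ and ϵ are illustrative choices of ours, not the paper's exact settings.

```python
import math
import numpy as np

def grad_K_eps(u, eps):
    # Gradient of the univariate Gaussian kernel K_eps(u) = exp(-u^2/eps^2)/(eps*sqrt(pi)).
    return -2.0 * u / (eps ** 3 * math.sqrt(math.pi)) * math.exp(-(u * u) / (eps * eps))

def estimate_mode(stream, m0, eps=1.0, lam=1e-3):
    # Online update (10): m <- m + a_n * (grad K_eps(m - X_{n+1}) - lam * m).
    m = m0
    for n, x in enumerate(stream):
        a_n = 1.0 / (n + 1)              # Robbins-Monro step sizes, satisfying (A1)
        m += a_n * (grad_K_eps(m - x, eps) - lam * m)
    return m

rng = np.random.default_rng(1)
samples = rng.normal(10.0, 1.0, 200_000)   # unimodal density with mode 10
m0 = samples[:1000].mean()                 # initial estimate from the first 1000 points
result = estimate_mode(samples.tolist(), m0)
print(result)                              # settles near the true mode 10
```

Each sample is used once and discarded, so the memory footprint is constant regardless of the stream length.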

## IV Convergence Analysis

Let F_k := σ{m₀, X₁, ⋯, X_k}, k ≥ 0, be a sequence of sigma fields. Observe that {F_k} forms a filtration. Let d_{k+1} := ∇Kϵ(m_k − X_{k+1}) − λm_k. Note that d_{k+1} is F_{k+1}-measurable. Moreover, d_{k+1} is integrable, i.e., E[∥d_{k+1}∥] < ∞, under the assumption that ∇Kϵ is Lipschitz continuous. Now the basic algorithm (10) can be written as

 m_{k+1} = m_k + a_k d_{k+1} = m_k + a_k (E[d_{k+1} | F_k] + N_{k+1}), (11)

where N_{k+1} := d_{k+1} − E[d_{k+1} | F_k] is a mean zero term. Also, {N_k} constitutes a martingale difference sequence (see the proof of Proposition 1). Here the expectation is taken with respect to the density f.

Our convergence analysis rests on Theorem 2. Our algorithm is in the form of the general iterative scheme (1) with x_k = m_k and ∇h(x_k) = E[d_{k+1} | F_k]. We choose the Gaussian kernel for our analysis of the algorithm; a similar analysis can be carried out for other choices of kernels. We validate the assumptions of Theorem 2 and apply it below.

A standard step-size choice such as a_k = 1/k assures assumption (A1). The following proposition validates assumption (A2).

###### Proposition 1

{N_k} is a martingale difference sequence with

 E[∥N_{k+1}∥² | F_k] ≤ C(1 + ∥m_k∥²),

for all k ≥ 0 and for some C > 0.

It is easy to see that E[N_{k+1} | F_k] = 0 a.s., so clearly {N_k} is a martingale difference sequence. Now

 E[∥N_{k+1}∥² | F_k] ≤ 2(E[∥d_{k+1}∥² | F_k] + E[∥E[d_{k+1} | F_k]∥² | F_k]) ≤ 4E[∥d_{k+1}∥² | F_k].

The first inequality follows from the simple identity ∥a + b∥² ≤ 2(∥a∥² + ∥b∥²), while the second inequality follows from a simple application of Jensen's inequality. Since the higher derivatives, and in particular the Hessian, of Kϵ are bounded, it follows that ∇Kϵ is Lipschitz continuous. We have

 ∥∇Kϵ(m)∥ − ∥∇Kϵ(0)∥ ≤ ∥∇Kϵ(m) − ∇Kϵ(0)∥ ≤ L∥m∥,

where L is the Lipschitz constant. Hence for all m,

 ∥∇Kϵ(m)∥ ≤ C₀(1 + ∥m∥), (12)

where C₀ := max{∥∇Kϵ(0)∥, L}. Therefore,

 E[∥N_{k+1}∥² | F_k] ≤ 4E[∥d_{k+1}∥² | F_k]
 ≤ 8E[∥∇Kϵ(m_k − X_{k+1})∥² + ∥λm_k∥² | F_k]
 ≤ 8E[C₀²(1 + ∥m_k − X_{k+1}∥)² + λ²∥m_k∥² | F_k]
 ≤ 8E[2C₀² + (4C₀² + λ²)∥m_k∥² + 4C₀²∥X_{k+1}∥² | F_k]
 = 8(2C₀² + (4C₀² + λ²)∥m_k∥² + 4C₀²C₁)
 ≤ C(1 + ∥m_k∥²),

where C₁ := E[∥X_{k+1}∥²] < ∞ and C := 8 max{2C₀² + 4C₀²C₁, 4C₀² + λ²}. This completes the proof. The following lemma is useful in proving assumption (A3) (see Proposition 2).

###### Lemma 2

E[d_{k+1} | F_k] = ∇fϵ(m_k) − λm_k.

Now E[d_{k+1} | F_k] = E[∇Kϵ(m_k − X_{k+1}) | F_k] − λm_k. Also, the Gaussian kernel Kϵ is analytic and ∇Kϵ has a power series expansion around m_k. Using the power series of ∇Kϵ, the linearity of the expectation and the independence of X_{k+1} from F_k, we obtain E[∇Kϵ(m_k − X_{k+1}) | F_k] = E_{X∼f}[∇Kϵ(m_k − X)] = ∇fϵ(m_k), where the last equality follows from (9). Owing to Lemma 2, our iterative update (11) transforms into

 m_{k+1} = m_k + a_k (∇fϵ(m_k) − λm_k + N_{k+1}),

and we validate assumption (A3) below.

###### Proposition 2

The map m ↦ ∇fϵ(m) − λm is Lipschitz continuous.

Now for any x and y,

 ∥∇fϵ(x) − ∇fϵ(y) − λ(x − y)∥ ≤ ∥E[∇Kϵ(x − X)] − E[∇Kϵ(y − X)]∥ + λ∥x − y∥
 ≤ E∥∇Kϵ(x − X) − ∇Kϵ(y − X)∥ + λ∥x − y∥
 ≤ (L + λ)∥x − y∥,

where L is the Lipschitz constant of ∇Kϵ. The following proposition verifies assumption (A4).

###### Proposition 3

The ODE ẋ(t) = h_∞(x(t)) has the origin as its unique globally asymptotically stable equilibrium point.

From the definition of h_∞ (see assumption (A4)), we have

 h_∞(m) = lim_{c→∞} (∇fϵ(cm) − λcm)/c
 = lim_{c→∞} E[∇Kϵ(cm − X)]/c − λm
 = lim_{c→∞} (1/c) ∫_{ℝ^p} (2(x − cm)/(ϵ³√π)) e^{−∥cm−x∥²/ϵ²} f(x) dx − λm
 = lim_{c→∞} (1/c) ∫_{ℝ^p} (2x/(ϵ³√π)) e^{−∥cm−x∥²/ϵ²} f(x) dx
  − lim_{c→∞} ∫_{ℝ^p} (2m/(ϵ³√π)) e^{−∥cm−x∥²/ϵ²} f(x) dx − λm
 = −λm.

Here

 ∥(1/c) ∫_{ℝ^p} (2x/(ϵ³√π)) e^{−∥cm−x∥²/ϵ²} f(x) dx∥ ≤ (1/c) ∫_{ℝ^p} (2∥x∥/(ϵ³√π)) e^{−∥cm−x∥²/ϵ²} f(x) dx
 ≤ (1/c) ∫_{ℝ^p} (2∥x∥/(ϵ³√π)) f(x) dx → 0 as c → ∞,

where the facts that e^{−∥cm−x∥²/ϵ²} ≤ 1 and E∥X∥ < ∞ are utilized. By an application of the Dominated Convergence Theorem, we have

 lim_{c→∞} ∫_{ℝ^p} (2m/(ϵ³√π)) e^{−∥cm−x∥²/ϵ²} f(x) dx = (2m/(ϵ³√π)) ∫_{ℝ^p} lim_{c→∞} e^{−∥cm−x∥²/ϵ²} f(x) dx = 0.

So we have that h_∞(m) = −λm. Now for the system ẋ = −λx, clearly the origin is an equilibrium point. Also, for any initial point x₀, x(t) = x₀e^{−λt} is the solution of the system and x(t) → 0 as t → ∞. Therefore the origin is the unique globally asymptotically stable equilibrium point. This concludes the proof.

###### Remark 1

Note that the regularization coefficient λ plays a key role in establishing the stability of the mode estimates. To see the effect of the regularization term, consider the iterates

 m_{n+1} = m_n + a_n ∇fϵ(m_n).

The h_∞ corresponding to this update equation is identically 0, thereby violating assumption (A4).

Consider now the following ODE:

 ẋ(t) = ∇fϵ(x(t)) − λx(t). (13)

Let H̃ denote the set of asymptotically stable equilibrium points of (13).

###### Remark 2

From our assumptions, the asymptotically stable equilibria of (13) approach the true mode m∗ as ϵ, λ → 0.

We have the following as our main result.

###### Theorem 3

The sequence {m_n} obtained from Algorithm 1 converges almost surely to the set of asymptotically stable equilibrium points of the ODE (13).

The result follows from the foregoing and Theorem 2.

## V Experiments

In this section, we discuss the numerical performance of our algorithm. We implement our algorithm on well-known distributions: we collect samples from a known distribution and apply our algorithm for estimating the mode. The initial mode estimate is taken to be the average of the first 1000 points. We use the Gaussian kernel for univariate distributions and the multivariate Gaussian kernel for the bivariate normal and Dirichlet distributions (see Table I) in our experiments. The regularization coefficient λ is chosen to be small and ϵ is set to 1. We perform 100 runs of the experiment, and the estimated mode is computed as the mean of the modes obtained over the 100 runs. Table II illustrates the performance of our algorithm (estimated mode) on standard distributions, along with the actual mode of each distribution. The code for our experiments can be found at .

It is interesting to note that, though the Exponential and Weibull densities are not smooth and do not satisfy our assumptions, the estimated mode obtained by our algorithm is close to the actual mode.

In Figure 1, we indicate the performance of our algorithm with different initial points. For this purpose, we select a Normal distribution with mean 10. We run our algorithm with initial points 5, 10 and 15 and plot the estimated mode over the first 50,000 iterations. We observe that the estimates of the mode in all three cases converge towards the actual mode of 10 as the number of iterations increases. This shows that the proposed algorithm is not very sensitive to the initial mode estimate. These results confirm the practical utility of our algorithm.
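A sketch of this robustness check is given below, with illustrative settings of our own: a wider kernel (ϵ = 3), λ = 10⁻⁴ and slower-decaying step sizes so that far-off initial points can travel to the mode; these are assumptions, not the paper's exact configuration.

```python
import math
import numpy as np

def grad_K_eps(u, eps):
    # Gradient of the Gaussian kernel K_eps(u) = exp(-u^2/eps^2)/(eps*sqrt(pi)).
    return -2.0 * u / (eps ** 3 * math.sqrt(math.pi)) * math.exp(-(u * u) / (eps * eps))

def run(stream, m0, eps=3.0, lam=1e-4):
    # Online update (10), started from the given initial point m0.
    m = m0
    for n, x in enumerate(stream):
        a_n = (n + 1) ** -0.55            # satisfies (A1): sum a_n = inf, sum a_n^2 < inf
        m += a_n * (grad_K_eps(m - x, eps) - lam * m)
    return m

rng = np.random.default_rng(2)
samples = rng.normal(10.0, 1.0, 300_000).tolist()   # actual mode = 10
finals = [run(samples, m0) for m0 in (5.0, 10.0, 15.0)]
print(finals)   # all three runs end up near the actual mode 10
```

The wider kernel matters here: with a narrow kernel, an initial point far from the bulk of the samples sees a nearly flat gradient and the regularizer dominates, so the kernel width trades off gradient reach against bias.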

## VI Conclusions and Future Work

In this paper, we proposed an online, computationally efficient algorithm for estimating the mode from samples of an unknown density. We provided proofs of the stability of the iterates and the convergence of our algorithm. We then performed numerical experiments on standard distributions and showed the effectiveness of our algorithm in practice.

In the future, we wish to explore second order updates using, say, Newton's method in place of the gradient ascent update, as Newton's method is known to converge faster than gradient methods. This estimation of the mode may also lead to new ideas in Bayesian statistics; for example, the core idea of the Maximum A Posteriori (MAP) estimate is to find the mode of the posterior distribution from the given samples. Another interesting future direction is to obtain finite sample error bounds and rate of convergence results for our algorithm.

## References

•  E. Parzen, “On estimation of a probability density function and mode,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.
•  U. Grenander, “Some direct estimates of the mode,” The Annals of Mathematical Statistics, pp. 131–138, 1965.
•  H. Chernoff, “Estimation of the mode,” Annals of the Institute of Statistical Mathematics, vol. 16, no. 1, pp. 31–41, 1964.
•  E. J. Wegman, “A note on the estimation of the mode,” The Annals of Mathematical Statistics, pp. 1909–1915, 1971.
•  T. Dalenius, “The mode–a neglected statistical parameter,” Journal of the Royal Statistical Society. Series A (General), pp. 110–117, 1965.
•  J. Venter, “On estimation of the mode,” The Annals of Mathematical Statistics, pp. 1446–1455, 1967.
•  T. Robertson and J. D. Cryer, “An iterative procedure for estimating the mode,” Journal of the American Statistical Association, vol. 69, no. 348, pp. 1012–1016, 1974.
•  D. R. Bickel, “Robust estimators of the mode and skewness of continuous data,” Computational Statistics & Data Analysis, vol. 39, no. 2, pp. 153–163, 2002.
•  D. R. Bickel and R. Frühwirth, “On a fast, robust estimator of the mode: comparisons to other robust estimators with applications,” Computational Statistics & Data Analysis, vol. 50, no. 12, pp. 3500–3530, 2006.
•  D. R. Bickel, “Robust and efficient estimation of the mode of continuous data: the mode as a viable measure of central tendency,” Journal of Statistical Computation and Simulation, vol. 73, no. 12, pp. 899–912, 2003.
•  C.-Y. Hsu and T.-J. Wu, “Efficient estimation of the mode of continuous multivariate data,” Computational Statistics & Data Analysis, vol. 63, pp. 148–159, 2013.
•  L. D. Griffin and M. Lillholm, “A multiscale mean shift algorithm for mode estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005.
•  T. Kirschstein, S. Liebscher, G. C. Porzio, and G. Ragozini, “Minimum volume peeling: A robust nonparametric estimator of the multivariate mode,” Computational Statistics & Data Analysis, vol. 93, pp. 456–468, 2016.
•  A. B. Tsybakov, “Recursive estimation of the mode of a multivariate distribution,” Problemy Peredachi Informatsii, vol. 26, no. 1, pp. 38–45, 1990.
•  A. N. Tikhonov, A. Goncharsky, V. Stepanov, and A. G. Yagola, Numerical methods for the solution of ill-posed problems.   Springer Science & Business Media, 2013, vol. 328.
•  L. Ljung, “Analysis of recursive stochastic algorithms,” IEEE Transactions on Automatic Control, vol. 22, no. 4, pp. 551–575, 1977.
•  H. Kushner and G. G. Yin, Stochastic approximation and recursive algorithms and applications.   Springer Science & Business Media, 2003, vol. 35.
•  V. S. Borkar, Stochastic approximation: a dynamical systems viewpoint.   Springer, 2009, vol. 48.
•  S. Bhatnagar, H. Prasad, and L. Prashanth, Stochastic recursive algorithms for optimization: simultaneous perturbation methods.   Springer, 2012, vol. 434.
•  M. W. Hirsch, S. Smale, and R. L. Devaney, Differential equations, dynamical systems, and an introduction to chaos.   Academic Press, 2012.
•  C. Berge, Topological Spaces: including a treatment of multi-valued functions, vector spaces, and convexity.   Courier Corporation, 1997.
•  R. L. Wheeden, Measure and integral: an introduction to real analysis.   CRC Press, 2015, vol. 308.