DeepAI

# Nonparametric Estimation of Band-limited Probability Density Functions

In this paper, a nonparametric maximum likelihood (ML) estimator for band-limited (BL) probability density functions (pdfs) is proposed. The BLML estimator is consistent and computationally efficient. To compute the BLML estimator, three approximate algorithms are presented: a binary quadratic programming (BQP) algorithm for medium scale problems, a Trivial algorithm for large-scale problems that yields a consistent estimate if the underlying pdf is strictly positive and BL, and a fast implementation of the Trivial algorithm that exploits the band-limited assumption and the Nyquist sampling theorem ("BLMLQuick"). All three BLML estimators outperform kernel density estimation (KDE) algorithms (adaptive and higher order KDEs) with respect to the mean integrated squared error for data generated from both BL and infinite-band pdfs. Further, the BLMLQuick estimate is remarkably faster than the KD algorithms. Finally, the BLML method is applied to estimate the conditional intensity function of a neuronal spike train (point process) recorded from a rat's entorhinal cortex grid cell, for which it outperforms state-of-the-art estimators used in neuroscience.

• 2 publications
• 41 publications
• 2 publications
04/23/2018

### Positive data kernel density estimation via the logKDE package for R

Kernel density estimators (KDEs) are ubiquitous tools for nonparametric ...
10/25/2017

### Asymptotic properties and approximation of Bayesian logspline density estimators for communication-free parallel methods

In this article we perform an asymptotic analysis of Bayesian parallel d...
04/23/2018

### Log-transformed kernel density estimation for positive data

Kernel density estimators (KDEs) are ubiquitous tools for nonpara- metri...
06/20/2012

### Fast Nonparametric Conditional Density Estimation

Conditional density estimation generalizes regression by modeling a full...
11/10/2020

### Efficient Interpolation of Density Estimators

We study the problem of space and time efficient evaluation of a nonpara...
05/31/2019

### Targeted Estimation of L2 Distance Between Densities and its Application to Geo-spatial Data

We examine the integrated squared difference, also known as the L2 dista...
10/03/2022

### Predictive density estimators with integrated L_1 loss

This paper addresses the problem of an efficient predictive density esti...

## 1 Significance Statement

Nonparametric estimation of probability densities has became increasingly important to make predictions about processes where parametrization is difficult. However, unlike parametric approaches, nonparametric approaches are often not optimal in the sense of maximizing likelihood, which would ensure several optimality properties of the estimator. In this study, a novel nonparametric density estimation technique that maximizes likelihood and that converges to the true pdf is presented. Simulations show that this technique outperforms and outruns sophisticated kernel density estimation techniques, and yields the fastest convergence rate to date.

When making inferences from experimental data, it is often required to model random phenomena and estimate a pdf. A common approach is to assume that the pdf belongs to a class of parametric functions (e.g., Gaussian, Poisson), and then estimate the parameters by maximizing the data likelihood function. Parametric models have several advantages. First, they are often efficiently computable. Second, the parameters may be related back to physiological and environmental variables. Finally, ML estimates have nice asymptotic properties when the actual distribution lies in the assumed parametric class. However, if the true pdf does not lie in the assumed class of functions, large errors may occur, potentially resulting in misleading inferences.

When little is known a priori about the pdf, nonparametric estimation is an option. However, maximizing the likelihood function yields spurious solutions as the dimensionality of the problem typically grows with the number of data samples, [1]. To deal with this, several nonparametric approaches penalize the likelihood function by adding a smoothness constraint. Such penalty functions have nice properties of having unique maxima that can be computed. However, when smoothness conditions are applied, the asymptotic properties of ML estimates are typically lost [1].

Other methods for nonparametric estimation assume that the pdf is a linear combination of scaled and stretched versions of a single kernel function [2, 3, 4, 5]. These methods fall under kernel density (KD) estimation, which have been studied for decades. However, choosing an appropriate kernel is still a tricky and often an arbitrary process [6]. Additionally, even the best KD estimators [6, 7, 8, 9] have slower convergence rates ( , for the second and sixth-order Gaussian kernels, respectively) than parametric ML estimation () with respect to the mean integrated squared error (MISE)[10].

Finally, some approaches require searching over nonparametric sets for which a maximum likelihood estimate exists. Some cases are discussed in [11, 12], wherein the authors construct maximum likelihood estimators for unknown but Lipschitz continuous pdfs. Although Lipschitz functions display desirable continuity properties, they can be non-differentiable. Therefore, such estimates can be non-smooth, but perhaps more importantly, they are not efficiently computable as a closed-form solution cannot be derived [11, 12].

This paper presents a case where a nonparametric ML estimator exists, is efficiently computable, consistent and results in a smooth pdf. The pdf is assumed to be band-limited (BL), which has a finite-support in the Fourier domain. The BL assumption essentially can be thought of as a smoothness constraint. However, the proposed method does not require penalizing the likelihood function to guarantee the existence of a global maximum, and therefore may preserve the asymptotic properties of ML estimators (i.e. consistency, asymptotic normality and efficiency). The BLML method is first applied to surrogate data generated from both BL and infinite-band pdfs, where in both cases it outperformes all tested KD estimators (including higher order kernels) both in convergence rate and computational time. Then the BLML estimator is applied to the neuronal data recorded from a rat’s entorhinal cortex “grid cell” and is shown to outperform state-of-the-art estimators used in neuroscience.

## 2 The BLML Estimator

We begin with a description of the BLML estimator in the following theorem.

Consider independent samples of an unknown BL pdf, , with assumed cut-off frequency Then the BLML estimator of is given as:

 ^f(x)=(1nn∑i=1^cisin(πfc(x−xi))π(x−xi))2, (1)

where and

 ^c=argmaxρn(c)=0(n∏i=11c2i). (2)

Here and
.

Proof: See supporting information (SI).

The system of equations, in (2) is monotonic, i.e., , with discontinuities at each . Therefore, there are

solutions, with each solution located in each orthant, identified by the orthant vector

. Each solution corresponds to a local maximum of the likelihood function which is also its maximum value in that orthant. Hence, the global maximum always exists and can be found by finding the maximum of these maxima. However, it is computationally exhaustive to solve (2), which entails finding the solutions of and then comparing values of for each solution.

Therefore, to compute the BLML estimator, three approximate algorithms are proposed: a binary quadratic programming (BLML-BQP) algorithm, a BLMLTrivial algorithm, and its quicker implementation - BLMLQuick. Both theory and simulations show that the BLML-BQP algorithm is appropriate when the sample size is and no additional knowledge is known other than the pdf is BL. However, in cases when and the underlying pdf is strictly positive, the BLMLTrivial and BLMLQuick algorithms are more appropriate as they are guaranteed to yield a consistent estimate (see Theorems 4, 5 and 6) and converge at a rate (), which is faster than all tested KD estimates. Further, the BLMLQuick algorithm shows a remarkable improvement in computational speed over tested KD methods.

### 2.1 Consistency of the BLML Estimator

Proving consistency of the BLML estimator is not trivial as it requires a solution to (2). However, if then consistency of BLML estimator can be established. To show this, first an asymptotic solution to is constructed (Theorem 3). Then, consistency is established by plugging into (1) to show that the ISE and hence the MISE between the resulting density, and is 0 (Theorem 4). Then, it is shown that the KL-divergence between and is also and hence is a solution to (2), which makes the BLML estimator (Theorem 5). Theorems 2-5 and their proofs are presented in SI.

### 2.2 Generalization of the BLML Estimators to Joint Pdfs

Consider the joint pdf ,

such that its Fourier transform

has the element-wise cut off frequencies in vector . Then the BLML estimator is of the following form:

 ^f(x)=(1nn∑i=1^cisincfc(x−xi))2 (3)

where, is the assumed cutoff frequency, vector are the data samples, and the vector , is given by

 ^c=argmaxρn(c)=0(∏1c2i). (4)

Here .

The multidimensional result can be derived in a very similar way as the one-dimensional result as described in SI.

### 2.3 Computing the BLML Estimator

The three algorithms, BLMLTrivial, BLMLQuick and BLML-BQP are described next.

##### BLMLTrivial Algorithm.

It is a one-step algorithm that first selects an orthant in which the global maximum may lie, and then solves in that orthant. As is monotonic, it is computationally efficient to solve in any given orthant.

As stated in Theorem 6 (see SI), the asymptotic solution of (2) lies in the orthant with indicator vector if is BL and . Therefore, the BLMLTrivial algorithm selects the orthant vector , and then is solved in that orthant to compute . It is important to note that when is indeed BL and strictly positive, then the BLMLTrivial estimator converges to BLML estimator asymptotically.

Due to its simplicity, the computational complexity of the BLMLTrivial method is very similar to KDE with complexity , where is the number of points where the value of pdf is estimated [13]); but the BLMLTrivial method has an extra step of solving equation This equation can be solved in , using gradient descent or Newton algorithms. Therefore, the computational complexity of BLMLTrivial estimator is .

##### BLMLQuick Algorithm.

The BL assumption of the true pdf allows for a quick implementation of the BLMLTrivial estimator - “BLMLQuick”. For details, see SI. Briefly, BLMLQuick first groups the observed samples into bins of size . Then, it constructs the BLMLTrivial estimator of the discrete pdf (or the probability mass function, pmf) that generated the binned data. The true pmf for the binned data has infinite-bandwidth. Hence, under the required conditions, the BLMLTrivial estimate constructed using the Nyquist frequency, converges to the continuous pdf , from which the pmf is obtained via sampling. can be made arbitrarily close to the true pdf by choosing smaller and smaller bins. In fact, if the bin size reduces as then the ISE between and is of . Therefore, the MISE for BLMLQuick is plus the MISE of the BLMLTrivial estimator. Since the MISE of the BLMLTrivial estimator has to be greater than , the BLMLQuick algorithm is as efficient as the BLMLTrivial algorithm. Specifically, the computational complexity of BLMLQuick is
, where governs the behavior of the tail of the true pdf.

##### BLML-BQP Algorithm.

To derive the BLML-BQP algorithm, it is first noted that the solutions of are equivalent to the local solutions of:

 ~c=arglocalmaxcTSc=n2(∏ic2i). (5)

here is a matrix with th element being . Now, if is an orthant indicator vector and is such that , then (5) implies:

 ∏i~c2i≥λ2n⇒∏i1~c2i≤(cT0Sc0)nn2n. (6)

Finally, the orthant where the solution of (2) lies is found by maximizing the upper bound using the following BQP:

 ^c0=argmaxc0∈{−1,1}n(cT0Sc0). (7)

BQP problems are known to be NP-hard [14]

, and hence a heuristic algorithm implemented in the gurobi toolbox

[15] in MATLAB is used to find an approximate solution in polynomial time. Once a reasonable estimate for the orthant is obtained, is solved in that orthant to find an estimate for . To further improve the estimate, the solutions to in all nearby orthants (Hamming distance equal to one) of the orthant are obtained and subsequently is evaluated in these orthants. The neighboring orthant with the largest is set as and the process was repeated. This iterative process is continued until in all nearby orthants is no greater than that of the current orthant. The BLML-BQP is computationally expensive, with complexity where is the computational complexity of solving BQP problem of size Hence, the BLML-BQP algorithm can only be used on data samples .

## 3 Results

A comparison of BLMLTrivial and BLML-BQP algorithms on surrogate data generated from known pdfs is presented first. Then, the performance of the BLMLTrivial and BLMLQuick algorithms is compared to several KD estimators. Finally, we show the application results of BLML, KD and GLM methods to neuronal spiking data.

### 3.1 Performance of BLMLTrivial versus Blml-Bqp on Surrogate Data

In Figure 1, BLMLTrivial and BLML-BQP estimates are presented assuming that the true pdfs are BL by . Panels (A, C) and (B, D) use surrogate data generated from a non-strictly positive pdf and strictly positive pdf respectively. Both pdfs are BL from . In Panels A and B, the BLML estimates () are plotted using both algorithms, and the true pdfs are overlayed for comparison. In Panels C and D, the MISE is plotted as a function of sample size for both algorithms and both pdfs. For each , data were generated 100 times to generate 100 estimates from each algorithm. The mean of the ISE was then taken over these 100 estimates to generate the MISE plots.

As expected from theory, the BLML-BQP algorithm works best for the non-strictly positive pdf, whereas the BLMLTrivial algorithm is marginally better for the strictly positive pdf. Note that as increases beyond the BLML-BQP algorithm becomes computationally expensive, therefore the BLMLTrivial and BLMLQuick algorithms are used in the remainder of this paper with the assumption that the true pdf is strictly positive.

### 3.2 BLML and KDE on Surrogate Data

The performance of the BLMLTrivial and BLMLQuick estimates is compared with adaptive KD estimators which are the fastest known nonparametric estimators with convergence rates of , and for 2nd-order Gaussian (KDE2nd), 6th-order Gaussian (KDE6th) and sinc (KDEsinc) kernels, respectively [16, 9]. Panels A and B of Figure 2 plot the MISE of the BLML estimators using the BLMLTrivial, BLMLQuick, and the adaptive KD approaches for cases in the presence of BL or non-BL pdf, respectively. In the BL case, the true pdf is strictly positive and is the same as used above, and for the infinite-band case, the true pdf is normal. For the BLMLTrivial, BLMLQuick and sinc KD estimates, and are used for the BL and infinite-band cases, respectively. For the 2nd and 6th-order KD estimates, the optimal bandwidths ( and respectively) are used. The constant ensures that MISEs are matched for .

It can be seen from the Figure that for both the BL and infinite-band cases, BLMLTrivial and BLMLQuick outperform KD methods. In addition, the BLML estimators seem to achieve a convergence rate that is as fast as the KDEsinc, which is known to have a convergence rate of . Figure 2 C plots the MISE as function of the cut-off frequency for the BL pdf. BLMLTrivial and BLMLQuick seem to be most sensitive to the correct knowledge of , as it shows larger errors when which quickly dip as approaches . When the MISE increases linearly and the BLML methods have smaller MISE as compared to KD methods.

Finally, Figure 2D plots the computational time of the BLML and KD estimators. All algorithms were implemented in MATLAB, and in-built MATLAB 2013a algorithms were used to compute the 2nd and 6th-order adaptive Gaussian KD and sinc KD estimators. The results concur with theory and illustrate that BLMLTrivial is slower than KD approaches for large number of observations, however, the BLMLQuick algorithm is remarkably quicker than all KD approaches and BLMLTrivial for both small and large .

### 3.3 BLML Applied to Neuronal Spiking Data

Neurons generate action potentials in response to external stimuli and intrinsic factors, including the activity of its neighbors. The sequence of action potentials over time can be abstracted as a point process, where the timing of action potentials or “spikes” carry important information. The stochastic point process is characterized by a conditional intensity function (CIF), denoted as  [17].

Here, the BLML, KD, and GLM methods are applied to estimate the CIF of a “grid cell” from spike train data. In the experimental set up, the Long-Evans rat was freely foraging in an open field arena of radius of 1m for a period of 30-60 minutes. Custom microelectrode drives with variable numbers of tetrodes were implanted in the rat’s medial entorhinal cortex and dorsal hippocampal CA1 area. Spikes were acquired with a sampling rate of 31.25 kHz and filter settings of 300 Hz-6 kHz. Two infrared diodes alternating at 60 Hz were attached to the drive of each animal for position tracking. Spike sorting was accomplished using a custom manual clustering program (Xclust, M.A. Wilson). All procedures were approved by the MIT Institutional Animal Care and Use Committee.

The spiking activity of grid cell is known to be place-modulated, whose peak firing locations define a grid-like array covering much of the 2-dimensional arena. A spike histogram of the selected cell as a function of the rat’s position is shown in Figure  3A, which plots the coordinates of the rat’s position when the cell generate spikes (red dots) and the rat’s trajectory inside the arena (blue trajectory). The CIF was then estimated as a function of the rat’s position (stimuli) and the neuron’s spiking history (intrinsic factors):

where is the rat’s position over time inside an arena, and the vector consists of spiking history covariates at time as in [18, 19, 20, 21, 22].

Baye’s rule [23] allows one to use nonparametric approaches to estimate as follows [24]:

 λ(t|x,y,h)≃NTf(x,y,h|spike in time Δt)f(x,y,h) (8)

where , is the total number of spikes within time interval , which is the total duration of the spike train observation. and are densities which are estimated using both KDE2nd (higher order kernels were too slow for estimation) and BLMLQuick methods. The use of the logarithm allows for a smoother dependence of on which in turn allows for capturing high frequency components in the CIF due to refractoriness (i.e., sharp decrease in after a spike) and bursting.

The bandwidths and cut-off frequencies used are and for KDE2nd and BLMLQuick respectively. These bandwidths and cut-off frequencies were chosen after testing different combinations, and the frequencies and bandwidths that best fits the test data (i.e., had the lowest KS statistics, see below) for each method were used. Since the rat cannot leave the circular arena, the estimates and are normalized to integrate to 1 within the arena. The nonparametric estimates of are also compared with two popular GLMs:

1. Gaussian GLM (GLMgauss)

 log(λ(x,y,H))=α1+α2x+α3y+α4x2+α5y2+α6xy+25∑i=1βihi (9)
2. Zernike GLM (GLMzern)

 log(λ(x,y,H))=∑γi,jχ(i,j)+25∑i=1βihi (10)

where are the parameters estimated from the data. where are the number of spikes in and are Zernike polynomials of 3rd order.

The goodness-of-fit of the four estimates of are computed using the time rescaling theorem and the Kolmogorov-Smirnov (KS) statistic [25]. Briefly, 80% of the data is used to estimate and then the empirical CDF of rescaled spike times is computed using the remaining 20% test data, which should follow a uniform CDF if the estimate of is accurate. The similarity between the two CDFs is quantified using the normalized KS-statistic and visualized using the KS-plot. A value of KS indicates that the estimated is too extreme () to generate the test data. The closer the normalized KS-statistic is to , the better the estimate.

Figure 3 B, shows the KS-plots [25] for each estimator. It is clear from the figure, that the BLMLQuick estimate is the only model for which the KS-plot remains inside the confidence bounds with KS. The KS for KDE2nd, GLM Gaussian and GLM Zernike methods were , and , respectively.

Finally, Figure 3C, plots the CIF estimates of using the four methods for a sample period of The BLMLQuick method allows for a sharper and taller , than the GLM and KDE2nd methods and it successfully captures the known behaviour of refractoriness and bursting in the neuronal activity. Although, in this instance the BLMLQuick method outperforms the KD and GLM methods, it is not yet clear whether this will always be the case. Several other model structures for the GLM methods must be tested, and there certainly will be receptive fields of neurons where the relationship to external covariates is more Gaussian-like and the GLM Gaussian or GLM Zernike may do better than BLMLQuick. However, for neurons like the grid cell, where the receptive field’s dependence on external covariates does not follow any particular known profile, nonparametric methods like BLMLQuick or KDE methods may be more appropriate.

## 4 Discussion

In this paper, a nonparametric ML estimator for BL densities is developed and its consistency is proved. In addition, three heuristic algorithms that allow for quick computation of the BLML estimator are presented. Although these algorithms are not guaranteed to generate the BLML estimate, we show that for strictly positive pdfs, the BLMLTrivial and BLMLQuick estimates converge to the BLML estimate asymptotically. Further, BLMLQuick is remarkably quicker than all tested KD methods, while maintaining convergence rates of BLML estimators. Even further, using surrogate data, it is shown that both the BLMLTrivial and BLMLQuick estimators have an apparent convergence rate of for MISE, which is equal to that of parametric methods. Finally, BLML is applied to spiking data, where it outperforms state-of-the-art estimation techniques used in neuroscience.

The BLML estimators may be motivated by quantum mechanics. The function in the development of BLML estimate (see SI) is analogous to the wave function [26] in quantum mechanics, where the square of the absolute value of both are probability density functions. In addition, in quantum mechanics the wave function of momentum is the Fourier transform of the wave function of position. Therefore, if the momentum wave function has finite support, then the position wave function is BL and vice versa. Such occurrences are frequent in the single or double slit experiment, where one observes bandlimted ( and respectively) profile for the probability of finding a particle at a distance

from the center. Also, in the thought experiment of a particle in a box: the wave function for position has finite support, making the momentum wave function BL. We suspect that a large number pdfs in the nature are BL because macro world phenomenon are a sum of quantum level phenomenon and pdfs at quantum level are shown to be BL (single and double slit experiments). Furthermore, the set of BL pdfs is complete, i.e. the sum of two random variables that each have a BL pdf is a random variable whose pdf is a convolution of original pdfs, and hence is BL. Therefore, if macro level phenomenon is a linear combination of different quantum level phenomenon with BL pdfs, then the macro level phenomenon will also generate a BL pdf. In fact, we see this at macro level where we observe Gaussian pdfs of various processes. The Gaussian pdf is almost BL, with cutoff frequency

(% of its power lies outside this band). In fact, given finite data, it is impossible to distinguish if the data is generated by a Gaussian or BL pdf.

### 4.1 Choosing A Cut-off Frequency for the BLML Estimator

The BLML method requires selecting a cut-off frequency of the unknown pdf. One strategy for estimating the true cut-off frequency is to first fit a Gaussian pdf using the data via ML estimation. Once an estimate for standard deviation is obtained, one can estimate the cut-off frequency using the formula

as this will allow most power of the true pdf to lie within the assumed band if the true pdf has Gaussian-like tails.

Another strategy is to increase the assumed cut-off frequency of BLML estimator as a function of the sample size. For such a strategy, the BLML estimator may converge even when the true pdf has an infinite frequency band, provided that the increase in cut-off frequency is slow enough and the cut-off frequency approaches infinity asymptotically, e.g. .

A more sophisticated strategy would be to look at the mean normalized log-likelihood (MNLL), as a function of assumed cut-off frequency . Figure 4 plots MNLL (calculated using BLMLTrivial algorithm) is plotted for samples from a strictly positive true pdf along with . Note that where . We see that the MNLL rapidly increases until reaches after which the rate of increase sharply declines. There is a clear “knee” in both MNLL and curves at . Therefore, can be inferred from such a plot. A more complete mathematical analysis of this “knee” is left for future work.

### 4.2 Making BLMLQuick Even Faster

There are several faster implementation of KD approaches such as those presented in [13, 27]. These approaches use numerical techniques to evaluate the sum of kernels over given points. Such techniques may also be incorporated while calculating the BLMLQuick estimator to make it even faster. Exploration of this idea will be done in a future study.

### 4.3 Asymptotic Properties of the BLML Estimator

Although, this paper proves that the BLML estimate is consistent, it is not clear whether it is asymptotically normal and efficient (i.e., achieving a Cramer-Rao-like bound). Studying asymptotic normality and efficiency is nontrivial for BLML estimators as one would need to first redefine asymptotic normality and extend the concepts of Fisher information and the Cramer-Rao lower bound to the nonparametric case. Therefore, we leave this to a future study. However, we postulate here that the curvature of MNLL plot might be related to Fisher information in the BLML case. In addition, although under simulations, the BLML estimator seems to achieve a convergence rate similar to its parametric counterparts () it is not proved theoretically.

###### Acknowledgements.
We want to acknowledge Dr. Mesrob Ohannessian for reading our initial manuscript, and Dr. Munther Dahleh, Dr. Rene Vidal and Ben Haeffele for valuable discussions. We thank Prof. M. A. Wilson and Dr. F. Kloosterman for providing the experimental data. S. V. Sarma was supported by the US National Science Foundation (NSF) Career Award 1055560, the Burroughs Wellcome Fund CASI Award 1007274 and NSF EFRI-M3C. Z. Chen was supported by the NSF-CRCNS Award 1307645.

## References

• [1] Montricher GFD T, Thompson JR (1975) Nonparametric maximum likelihood estimation of probability densities by penalty function methods. The Annals of Statistics 3:1329–48.
• [2] Rosenblatt M (1956) Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics 27:832–837.
• [3] Parzen E (1962) On estimation of a probability density function and mode. Annals of Mathematical Statistics 33:1065–1076.
• [4] Peristera P, Kostaki A (2008) An evaluation of the performance of kernel estimators for graduating mortality data. Journal of Population Research 22:185–197.
• [5] Scaillet O (2004) Density estimation using inverse and reciprocal inverse gaussian kernels. Nonparametric Statistics 16:217–226.
• [6] ParK B U, Marron J S (1990) Comparison of data-driven bandwidth selector. Journal of the American Statistical Society 85:66–72.
• [7] Park B U, Turlach B A (1992) Practical performance of several data driven bandwidth selectors (with discussion). Computational Statistics 7:251–270.
• [8] Hall P, Sheather S J, Jones M C, Marron J S (1991) On optimal data-based bandwidth selection in kernel density estimation. Biometrika 78:263–269.
• [9] Jones MC, Marron JS, Sheather SJ (1996) A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91:401–407.
• [10] Kanazawa Y (1993) Hellinger distance and kullback-leibler loss for the kernel density estimator. Statistics & Probability Letters 18:315–321.
• [11] Carandoa D, Fraimanb R, Groismana P (2009) Nonparametric likelihood based estimation for a multivariate lipschitz density.

Journal of Multivariate Analysis

100:981–992.
• [12] Coleman TP, Sarma SV (2010) A computationally efficient method for nonparametric modeling of neural spiking activity with point processes. Neural Computation 22:2002–2030.
• [13] Raykar VC, Duraiswami R, Zhao LH (2010) Fast computation of kernel estimators. Journal of Computational and Graphical Statistics 19:205–20.
• [14] Merz P, Freisleben B (2002) Greedy and local search heuristics for unconstrained binary quadratic programming. Journal of Heuristics 8:197–213.
• [15] Taylor J (2011) First look - gurobi optimization. Technical report, Decision Management Soluion.
• [16] Hall P, Marron JS (1987) Choice of kernel order in density estimation. Annals of Statistics 16:161–73.
• [17] Cox DR, Isham V (2000) Point Processes (Chapman and Hall/CRC).
• [18] Kass RE, Ventura V (2001) A spike train probability model. Neural Comput 13:1713–20.
• [19] Sarma SV, et al. (2012) The effects of cues on neurons in the basal ganglia in parkinson’s disease. Front Integr Neurosci 6.
• [20] Sarma S V, et al. (2010) Using point process models to compare neural spiking activity in the sub-thalamic nucleus of parkinson’s patients and a healthy primate. IEEE TBME 57:1297–305.
• [21] Santaniello S, Montgomery Jr EB, Gale JT, Sarma SV (2012) Non-stationary discharge patterns in motor cortex under subthalamic nucleus deep brain stimulation. Front Itegr Neurosci 6.
• [22] Kahn K, Sheiber M, Thakor N, Sarma SV (2011) Neuron selection for decoding dexterous finger movements. In Proceedings of the 33rd IEEE EMBS Conference.
• [23] Gelman A, Carlin JB, Stern HS, Rubin DB (2003) Bayesian Data Analysis (CRC Press).
• [24] Kloosterman F, Layton S, Chen Z, Wilson MA (2013) Bayesian decoding of unsorted spikes in the rat hippocampus. J Neurophysiol 111:217–27.
• [25] Brown EN, Barbieri R, Ventura V, Kass RE, Frank LM (2002) The time-rescaling theorem and its application to neural spike train data analysis. Neural Comput 14:325–46.
• [26] Rae AIM (2008) Quantum Mechanics, 5th edition (Taylor & Francis Group).
• [27] Silverman B (1982) Algorithm as 176: Kernel density estimation using the fast fourier transform. Applied Statistics 31:93–97.

## 5 Supporting Information

### 5.1 Preliminaries and Formulation of the BLML Estimator

Consider a pdf, , of a random variable with Fourier transform . Let be the set of band-limited pdfs with frequency support in i.e., is the cut-off frequency of the Fourier transform of the pdf. Then,

 U(ωc)={f:R→R+ |∫f(x)dx=1,F(ω)=0 ∀|ω|>ωc} (SI1)

Since each element is a pdf, it can be written as , where defined as:

 V(ωc)={g:R→R | g(x)=√f(x), f∈U(ωc)} (SI2)

Finally, can be defined as the set of all Fourier transforms of elements in :

 W(ωc)={G:R→C | G(ω)=∫g(x)e−iωxdx, g∈V(ωc)} (SI3)

Note that since is band limited, will also be band limited in . Therefore, . Finally, and are Hilbert spaces with the inner product defined as , , respectively. The norm is defined for both spaces. Further, note that for all elements in and , .

##### The Likelihood Function for Band-limited Pdfs

Now consider a random variable, , with unknown pdf and its independent realizations . The likelihood of observing is then:

 L(x1,⋯,xn) =n∏i=1f(xi)=n∏i=1g2(xi), g∈V(ωc) (SI4a) =n∏i=1(12π∫G(ω)ejωxidω)2, G∈W(ωc) (SI4b)

Defining:

 bi(ω)≜{e−jωxi∀ ω∈(−ωc2,ωc2)0o.w.} (SI5)

gives:

 L(x1,⋯,xn)=n∏i=1()2≜L[G]. (SI6)

Further, consider which maximizes the likelihood function:

 ^G=argmaxG∈W(ωc)(L[G]). (SI7)

Then the BLML estimator is:

 ^f(x)=(12π∫^G(ω)ejωxdω)2. (SI8)

## 6 Proof of Theorem 1

Proof: Because of (SI5), (SI7) is equivalent to

 ^G(ω)=argmaxG:R→C,||G||22=1(L[G]). (SI9)

Note that Parseval’s equality [1] is applied to get the constraint . Now, the Lagrange multiplier [2] is used to convert (SI9) into the following unconstrained problem:

 ^G(ω)=argmaxG:R→C(L[G]+λ(1−||G||22)). (SI10)

can be computed by differentiating the above equation with respect to using calculus of variations [3] and equating it to zero. This gives:

 ^G(ω) =1nn∑i=1cibi(ω) (SI11a) ci =nλ∏j≠i(<^G(ω),bj(ω)>)2<^G(ω),bi(ω)> for  i=1⋯n (SI11b)

To solve for the value of is substituted back from (SI11)a into (SI11)b and both sides are multiplied by to get:

 ci n∑j=1cj=n2k % for i=1⋯n, (SI12a) k ≜1n2nλ⎛⎝n∏j=1(n∑i=1ci)2⎞⎠ (SI12b) =1n2nλ⎛⎝n∏j=1(n∑i=1cisij)2⎞⎠ (SI12c)

To go from (SI12)b to (SI12)c, observe that (here ). Now by defining,

 S≜⎡⎢ ⎢⎣s11⋯s1n⋮⋱⋮sn1⋯snn⎤⎥ ⎥⎦, (SI13)

and using (SI11)a and the constraint , one can show that . Also, summing up all constraints in (SI12)a gives , hence . Now, substituting the value of into (SI12)a and rearranging terms gives the following constraints:

 1nn∑j=1cjsij−1ci=ρni(c)=0  for  i=1⋯n. (SI14)

As mentioned in the main text, the above system of equations () is monotonic, i.e., , but with discontinuities at each . Therefore, there are solutions, with each solution located in each orthant, identified by the orthant vector . Each of these solutions can be found efficiently by choosing a starting point in a given orthant and applying numerical methods from convex optimization theory to solve for (SI14). Thus, each of these solutions corresponds to a local maximum of the likelihood functional . The global maximum of can then be found by evaluating the likelihood for each solution of (SI14). The likelihood value at each local maximum can be computed efficiently by using the following expression:

 L(c)=∏i(1n∑jcjsij)2=∏i1c2i. (SI15)

This expression is derived by substituting (SI11)a into (SI6) and then substituting (SI14) into the result. Now the global maximum can be found by solving (2). Once the global maximum is computed, we can put together (SI5),(SI8) and (SI11)a to write our solution as (1). Hence proved.

### 6.1 Consistency of the BLML Estimator

#### 6.1.1 Bounds on Bandlimited PDF

In this section the following theorem is first stated and proved.

For all .

Proof: Above theorem can be proven by finding:

 y=maxf∈U(ωc)maxx∈Rf(x). (SI16)

Because a shift in the pdf domain (e.g. ) does not change the magnitude or bandwidth of , without loss of generality one can assume that and write the above equation as

 y =maxf∈U(ωc)(f(0)) (SI17a) =maxg∈V(ωc)((g2(0)) (SI17b) =maxG∈W(ωc)((∫ωc−ωcG(ω)dω)2) (SI17c) =max||G||2=1((∫∞−∞G(ω)b(ω)dω)2) (SI17d) =max((∫G(ω)b(ω)dω)2+λ(||G||2−1)) (SI17e)

Here and is otherwise. Now by differentiating (SI17)e and subsequently setting the result equal to , gives . Therefore , which gives .

Corollary: By the definition of , one can apply Theorem 6.1.1 and show that for all .

#### 6.1.2 Sequence ¯cnj:

Now a sequence is defined and some of its properties are stated and proved. These properties will be used to prove Theorems 6.1.7 and 6.1.7 below.

 ¯cnj ≜ng(xj)2fc(√1+4nfcg2(xj)−1) ∀1≤j≤n (SI18a)

#### 6.1.3 Properties of ¯cnj

has following properties:

 (P1) 1¯cnj−¯cnjfcn=g(xj) (SI19a) (P2) ¯cnj=1g(xj)(1+O(1ng2(xj)))  for  ng2(xj)>fc (SI19b) (P3) ¯c2nj=nfc(1−¯cnjg(xj)) (SI19c) (P4) √3/2−√5/2fc≤|¯cnj|≤√nfc (SI19d) (P5) 0≤1−¯cnjg(xj)≤1 (SI19e) (P6) 1−1nn∑j=1¯cnjg(xj)0∀x (SI19f) (P7) 1n∑j≠i(sij¯cnj)=1n∑jsij¯cnj−O(1√n) =g(xi)+ϵnia.s.→g(xi) % simultaneously ∀i if g(x)>0∀x (SI19g) (P8) ¯c∞j≜limn→∞¯cnj≥¯cnj ∀n (SI19h)

#### 6.1.4 Proofs for Properties of ¯cnj

(P1) can be proved by direct substitution of into left hand side (LHS). (P2) can be derived through binomial expansion of . (P3) can again be proved by substituting and showing LHS=RHS. (P4) and (P5) can be proved by using the fact that both and are monotonic in since and . Therefore, the minimum and maximum values of and , can be found in by plugging in the minimum and maximum values of (note , from Thm 6.1.1 ).

(P6) is proved by using Kolmogorov’s sufficient criterion [4] for almost sure convergence of sample mean. Clearly, from (P5) which establish almost sure convergence. Now, let . Then multiplying each side of equations in (P1) by , respectively, adding them and the normalizing the sum by gives:

 1n∑1¯cnjg(xj) =1+1n∑¯cnjfcng(xj) (SI20a) ⇒1β ≤1+bn (SI20b) ⇒β≥ 11+bn (SI20c)

Above . To go from (SI20)a to (SI20)b, the result (Arithmetic Mean Harmonic Mean) is used. Now it can be shown that , as following:

 bn= ∑ifc¯cnin2g(xi) (SI21a) ≤ √fcn√n∑i1g(xi) (SI21b) a.s→ √fcnE(1g(xi)) (SI21c) = Oa.