1 Significance Statement
Nonparametric estimation of probability densities has became increasingly important to make predictions about processes where parametrization is difficult. However, unlike parametric approaches, nonparametric approaches are often not optimal in the sense of maximizing likelihood, which would ensure several optimality properties of the estimator. In this study, a novel nonparametric density estimation technique that maximizes likelihood and that converges to the true pdf is presented. Simulations show that this technique outperforms and outruns sophisticated kernel density estimation techniques, and yields the fastest convergence rate to date.
When making inferences from experimental data, it is often required to model random phenomena and estimate a pdf. A common approach is to assume that the pdf belongs to a class of parametric functions (e.g., Gaussian, Poisson), and then estimate the parameters by maximizing the data likelihood function. Parametric models have several advantages. First, they are often efficiently computable. Second, the parameters may be related back to physiological and environmental variables. Finally, ML estimates have nice asymptotic properties when the actual distribution lies in the assumed parametric class. However, if the true pdf does not lie in the assumed class of functions, large errors may occur, potentially resulting in misleading inferences.
When little is known a priori about the pdf, nonparametric estimation is an option. However, maximizing the likelihood function yields spurious solutions as the dimensionality of the problem typically grows with the number of data samples, [1]. To deal with this, several nonparametric approaches penalize the likelihood function by adding a smoothness constraint. Such penalty functions have nice properties of having unique maxima that can be computed. However, when smoothness conditions are applied, the asymptotic properties of ML estimates are typically lost [1].
Other methods for nonparametric estimation assume that the pdf is a linear combination of scaled and stretched versions of a single kernel function [2, 3, 4, 5]. These methods fall under kernel density (KD) estimation, which have been studied for decades. However, choosing an appropriate kernel is still a tricky and often an arbitrary process [6]. Additionally, even the best KD estimators [6, 7, 8, 9] have slower convergence rates ( , for the second and sixthorder Gaussian kernels, respectively) than parametric ML estimation () with respect to the mean integrated squared error (MISE)[10].
Finally, some approaches require searching over nonparametric sets for which a maximum likelihood estimate exists. Some cases are discussed in [11, 12], wherein the authors construct maximum likelihood estimators for unknown but Lipschitz continuous pdfs. Although Lipschitz functions display desirable continuity properties, they can be nondifferentiable. Therefore, such estimates can be nonsmooth, but perhaps more importantly, they are not efficiently computable as a closedform solution cannot be derived [11, 12].
This paper presents a case where a nonparametric ML estimator exists, is efficiently computable, consistent and results in a smooth pdf. The pdf is assumed to be bandlimited (BL), which has a finitesupport in the Fourier domain. The BL assumption essentially can be thought of as a smoothness constraint. However, the proposed method does not require penalizing the likelihood function to guarantee the existence of a global maximum, and therefore may preserve the asymptotic properties of ML estimators (i.e. consistency, asymptotic normality and efficiency). The BLML method is first applied to surrogate data generated from both BL and infiniteband pdfs, where in both cases it outperformes all tested KD estimators (including higher order kernels) both in convergence rate and computational time. Then the BLML estimator is applied to the neuronal data recorded from a rat’s entorhinal cortex “grid cell” and is shown to outperform stateoftheart estimators used in neuroscience.
2 The BLML Estimator
We begin with a description of the BLML estimator in the following theorem.
Consider independent samples of an unknown BL pdf, , with assumed cutoff frequency Then the BLML estimator of is given as:
(1) 
where and
(2) 
Here and
.
Proof: See supporting information (SI).
The system of equations, in (2) is monotonic, i.e., , with discontinuities at each . Therefore, there are
solutions, with each solution located in each orthant, identified by the orthant vector
. Each solution corresponds to a local maximum of the likelihood function which is also its maximum value in that orthant. Hence, the global maximum always exists and can be found by finding the maximum of these maxima. However, it is computationally exhaustive to solve (2), which entails finding the solutions of and then comparing values of for each solution.Therefore, to compute the BLML estimator, three approximate algorithms are proposed: a binary quadratic programming (BLMLBQP) algorithm, a BLMLTrivial algorithm, and its quicker implementation  BLMLQuick. Both theory and simulations show that the BLMLBQP algorithm is appropriate when the sample size is and no additional knowledge is known other than the pdf is BL. However, in cases when and the underlying pdf is strictly positive, the BLMLTrivial and BLMLQuick algorithms are more appropriate as they are guaranteed to yield a consistent estimate (see Theorems 4, 5 and 6) and converge at a rate (), which is faster than all tested KD estimates. Further, the BLMLQuick algorithm shows a remarkable improvement in computational speed over tested KD methods.
2.1 Consistency of the BLML Estimator
Proving consistency of the BLML estimator is not trivial as it requires a solution to (2). However, if then consistency of BLML estimator can be established. To show this, first an asymptotic solution to is constructed (Theorem 3). Then, consistency is established by plugging into (1) to show that the ISE and hence the MISE between the resulting density, and is 0 (Theorem 4). Then, it is shown that the KLdivergence between and is also and hence is a solution to (2), which makes the BLML estimator (Theorem 5). Theorems 25 and their proofs are presented in SI.
2.2 Generalization of the BLML Estimators to Joint Pdfs
Consider the joint pdf ,
such that its Fourier transform
has the elementwise cut off frequencies in vector . Then the BLML estimator is of the following form:(3) 
where, is the assumed cutoff frequency, vector are the data samples, and the vector , is given by
(4) 
Here .
The multidimensional result can be derived in a very similar way as the onedimensional result as described in SI.
2.3 Computing the BLML Estimator
The three algorithms, BLMLTrivial, BLMLQuick and BLMLBQP are described next.
BLMLTrivial Algorithm.
It is a onestep algorithm that first selects an orthant in which the global maximum may lie, and then solves in that orthant. As is monotonic, it is computationally efficient to solve in any given orthant.
As stated in Theorem 6 (see SI), the asymptotic solution of (2) lies in the orthant with indicator vector if is BL and . Therefore, the BLMLTrivial algorithm selects the orthant vector , and then is solved in that orthant to compute . It is important to note that when is indeed BL and strictly positive, then the BLMLTrivial estimator converges to BLML estimator asymptotically.
Due to its simplicity, the computational complexity of the BLMLTrivial method is very similar to KDE with complexity , where is the number of points where the value of pdf is estimated [13]); but the BLMLTrivial method has an extra step of solving equation This equation can be solved in , using gradient descent or Newton algorithms. Therefore, the computational complexity of BLMLTrivial estimator is .
BLMLQuick Algorithm.
The BL assumption of the true pdf allows for a quick implementation of the BLMLTrivial estimator  “BLMLQuick”. For details, see SI. Briefly, BLMLQuick first groups the observed samples into bins of size . Then, it constructs the BLMLTrivial estimator of the discrete pdf (or the probability mass function, pmf) that generated the binned data. The true pmf for the binned data has infinitebandwidth.
Hence, under the required conditions, the BLMLTrivial estimate constructed using the Nyquist frequency, converges to the continuous pdf , from which the pmf is obtained via sampling. can be made arbitrarily close to the true pdf by choosing smaller and smaller bins. In fact, if the bin size reduces as then the ISE between and is of . Therefore, the MISE for BLMLQuick is plus the MISE of the BLMLTrivial estimator. Since the MISE of the BLMLTrivial estimator has to be greater than , the BLMLQuick algorithm is as efficient as the BLMLTrivial algorithm. Specifically, the computational complexity of BLMLQuick is
, where governs the behavior of the tail of the true pdf.
BLMLBQP Algorithm.
To derive the BLMLBQP algorithm, it is first noted that the solutions of are equivalent to the local solutions of:
(5) 
here is a matrix with th element being . Now, if is an orthant indicator vector and is such that , then (5) implies:
(6) 
Finally, the orthant where the solution of (2) lies is found by maximizing the upper bound using the following BQP:
(7) 
BQP problems are known to be NPhard [14]
, and hence a heuristic algorithm implemented in the gurobi toolbox
[15] in MATLAB is used to find an approximate solution in polynomial time. Once a reasonable estimate for the orthant is obtained, is solved in that orthant to find an estimate for . To further improve the estimate, the solutions to in all nearby orthants (Hamming distance equal to one) of the orthant are obtained and subsequently is evaluated in these orthants. The neighboring orthant with the largest is set as and the process was repeated. This iterative process is continued until in all nearby orthants is no greater than that of the current orthant. The BLMLBQP is computationally expensive, with complexity where is the computational complexity of solving BQP problem of size Hence, the BLMLBQP algorithm can only be used on data samples .3 Results
A comparison of BLMLTrivial and BLMLBQP algorithms on surrogate data generated from known pdfs is presented first. Then, the performance of the BLMLTrivial and BLMLQuick algorithms is compared to several KD estimators. Finally, we show the application results of BLML, KD and GLM methods to neuronal spiking data.
3.1 Performance of BLMLTrivial versus BlmlBqp on Surrogate Data
In Figure 1, BLMLTrivial and BLMLBQP estimates are presented assuming that the true pdfs are BL by . Panels (A, C) and (B, D) use surrogate data generated from a nonstrictly positive pdf and strictly positive pdf respectively. Both pdfs are BL from . In Panels A and B, the BLML estimates () are plotted using both algorithms, and the true pdfs are overlayed for comparison. In Panels C and D, the MISE is plotted as a function of sample size for both algorithms and both pdfs. For each , data were generated 100 times to generate 100 estimates from each algorithm. The mean of the ISE was then taken over these 100 estimates to generate the MISE plots.
As expected from theory, the BLMLBQP algorithm works best for the nonstrictly positive pdf, whereas the BLMLTrivial algorithm is marginally better for the strictly positive pdf. Note that as increases beyond the BLMLBQP algorithm becomes computationally expensive, therefore the BLMLTrivial and BLMLQuick algorithms are used in the remainder of this paper with the assumption that the true pdf is strictly positive.
3.2 BLML and KDE on Surrogate Data
The performance of the BLMLTrivial and BLMLQuick estimates is compared with adaptive KD estimators which are the fastest known nonparametric estimators with convergence rates of , and for 2ndorder Gaussian (KDE2nd), 6thorder Gaussian (KDE6th) and sinc (KDEsinc) kernels, respectively [16, 9]. Panels A and B of Figure 2 plot the MISE of the BLML estimators using the BLMLTrivial, BLMLQuick, and the adaptive KD approaches for cases in the presence of BL or nonBL pdf, respectively. In the BL case, the true pdf is strictly positive and is the same as used above, and for the infiniteband case, the true pdf is normal. For the BLMLTrivial, BLMLQuick and sinc KD estimates, and are used for the BL and infiniteband cases, respectively. For the 2nd and 6thorder KD estimates, the optimal bandwidths ( and respectively) are used. The constant ensures that MISEs are matched for .
It can be seen from the Figure that for both the BL and infiniteband cases, BLMLTrivial and BLMLQuick outperform KD methods. In addition, the BLML estimators seem to achieve a convergence rate that is as fast as the KDEsinc, which is known to have a convergence rate of . Figure 2 C plots the MISE as function of the cutoff frequency for the BL pdf. BLMLTrivial and BLMLQuick seem to be most sensitive to the correct knowledge of , as it shows larger errors when which quickly dip as approaches . When the MISE increases linearly and the BLML methods have smaller MISE as compared to KD methods.
Finally, Figure 2D plots the computational time of the BLML and KD estimators. All algorithms were implemented in MATLAB, and inbuilt MATLAB 2013a algorithms were used to compute the 2nd and 6thorder adaptive Gaussian KD and sinc KD estimators. The results concur with theory and illustrate that BLMLTrivial is slower than KD approaches for large number of observations, however, the BLMLQuick algorithm is remarkably quicker than all KD approaches and BLMLTrivial for both small and large .
3.3 BLML Applied to Neuronal Spiking Data
Neurons generate action potentials in response to external stimuli and intrinsic factors, including the activity of its neighbors. The sequence of action potentials over time can be abstracted as a point process, where the timing of action potentials or “spikes” carry important information. The stochastic point process is characterized by a conditional intensity function (CIF), denoted as [17].
Here, the BLML, KD, and GLM methods are applied to estimate the CIF of a “grid cell” from spike train data. In the experimental set up, the LongEvans rat was freely foraging in an open field arena of radius of 1m for a period of 3060 minutes. Custom microelectrode drives with variable numbers of tetrodes were implanted in the rat’s medial entorhinal cortex and dorsal hippocampal CA1 area. Spikes were acquired with a sampling rate of 31.25 kHz and filter settings of 300 Hz6 kHz. Two infrared diodes alternating at 60 Hz were attached to the drive of each animal for position tracking. Spike sorting was accomplished using a custom manual clustering program (Xclust, M.A. Wilson). All procedures were approved by the MIT Institutional Animal Care and Use Committee.
The spiking activity of grid cell is known to be placemodulated, whose peak firing locations define a gridlike array covering much of the 2dimensional arena. A spike histogram of the selected cell as a function of the rat’s position is shown in Figure 3A, which plots the coordinates of the rat’s position when the cell generate spikes (red dots) and the rat’s trajectory inside the arena (blue trajectory). The CIF was then estimated as a function of the rat’s position (stimuli) and the neuron’s spiking history (intrinsic factors):
where is the rat’s position over time inside an arena, and the vector consists of spiking history covariates at time as in [18, 19, 20, 21, 22].
Baye’s rule [23] allows one to use nonparametric approaches to estimate as follows [24]:
(8) 
where , is the total number of spikes within time interval , which is the total duration of the spike train observation. and are densities which are estimated using both KDE2nd (higher order kernels were too slow for estimation) and BLMLQuick methods. The use of the logarithm allows for a smoother dependence of on which in turn allows for capturing high frequency components in the CIF due to refractoriness (i.e., sharp decrease in after a spike) and bursting.
The bandwidths and cutoff frequencies used are and for KDE2nd and BLMLQuick respectively. These bandwidths and cutoff frequencies were chosen after testing different combinations, and the frequencies and bandwidths that best fits the test data (i.e., had the lowest KS statistics, see below) for each method were used. Since the rat cannot leave the circular arena, the estimates and are normalized to integrate to 1 within the arena. The nonparametric estimates of are also compared with two popular GLMs:

Gaussian GLM (GLMgauss)
(9) 
Zernike GLM (GLMzern)
(10)
where are the parameters estimated from the data. where are the number of spikes in and are Zernike polynomials of 3rd order.
The goodnessoffit of the four estimates of are computed using the time rescaling theorem and the KolmogorovSmirnov (KS) statistic [25]. Briefly, 80% of the data is used to estimate and then the empirical CDF of rescaled spike times is computed using the remaining 20% test data, which should follow a uniform CDF if the estimate of is accurate. The similarity between the two CDFs is quantified using the normalized KSstatistic and visualized using the KSplot. A value of KS indicates that the estimated is too extreme () to generate the test data. The closer the normalized KSstatistic is to , the better the estimate.
Figure 3 B, shows the KSplots [25] for each estimator. It is clear from the figure, that the BLMLQuick estimate is the only model for which the KSplot remains inside the confidence bounds with KS. The KS for KDE2nd, GLM Gaussian and GLM Zernike methods were , and , respectively.
Finally, Figure 3C, plots the CIF estimates of using the four methods for a sample period of The BLMLQuick method allows for a sharper and taller , than the GLM and KDE2nd methods and it successfully captures the known behaviour of refractoriness and bursting in the neuronal activity. Although, in this instance the BLMLQuick method outperforms the KD and GLM methods, it is not yet clear whether this will always be the case. Several other model structures for the GLM methods must be tested, and there certainly will be receptive fields of neurons where the relationship to external covariates is more Gaussianlike and the GLM Gaussian or GLM Zernike may do better than BLMLQuick. However, for neurons like the grid cell, where the receptive field’s dependence on external covariates does not follow any particular known profile, nonparametric methods like BLMLQuick or KDE methods may be more appropriate.
4 Discussion
In this paper, a nonparametric ML estimator for BL densities is developed and its consistency is proved. In addition, three heuristic algorithms that allow for quick computation of the BLML estimator are presented. Although these algorithms are not guaranteed to generate the BLML estimate, we show that for strictly positive pdfs, the BLMLTrivial and BLMLQuick estimates converge to the BLML estimate asymptotically. Further, BLMLQuick is remarkably quicker than all tested KD methods, while maintaining convergence rates of BLML estimators. Even further, using surrogate data, it is shown that both the BLMLTrivial and BLMLQuick estimators have an apparent convergence rate of for MISE, which is equal to that of parametric methods. Finally, BLML is applied to spiking data, where it outperforms stateoftheart estimation techniques used in neuroscience.
The BLML estimators may be motivated by quantum mechanics. The function in the development of BLML estimate (see SI) is analogous to the wave function [26] in quantum mechanics, where the square of the absolute value of both are probability density functions. In addition, in quantum mechanics the wave function of momentum is the Fourier transform of the wave function of position. Therefore, if the momentum wave function has finite support, then the position wave function is BL and vice versa. Such occurrences are frequent in the single or double slit experiment, where one observes bandlimted ( and respectively) profile for the probability of finding a particle at a distance
from the center. Also, in the thought experiment of a particle in a box: the wave function for position has finite support, making the momentum wave function BL. We suspect that a large number pdfs in the nature are BL because macro world phenomenon are a sum of quantum level phenomenon and pdfs at quantum level are shown to be BL (single and double slit experiments). Furthermore, the set of BL pdfs is complete, i.e. the sum of two random variables that each have a BL pdf is a random variable whose pdf is a convolution of original pdfs, and hence is BL. Therefore, if macro level phenomenon is a linear combination of different quantum level phenomenon with BL pdfs, then the macro level phenomenon will also generate a BL pdf. In fact, we see this at macro level where we observe Gaussian pdfs of various processes. The Gaussian pdf is almost BL, with cutoff frequency
(% of its power lies outside this band). In fact, given finite data, it is impossible to distinguish if the data is generated by a Gaussian or BL pdf.4.1 Choosing A Cutoff Frequency for the BLML Estimator
The BLML method requires selecting a cutoff frequency of the unknown pdf. One strategy for estimating the true cutoff frequency is to first fit a Gaussian pdf using the data via ML estimation. Once an estimate for standard deviation is obtained, one can estimate the cutoff frequency using the formula
as this will allow most power of the true pdf to lie within the assumed band if the true pdf has Gaussianlike tails.Another strategy is to increase the assumed cutoff frequency of BLML estimator as a function of the sample size. For such a strategy, the BLML estimator may converge even when the true pdf has an infinite frequency band, provided that the increase in cutoff frequency is slow enough and the cutoff frequency approaches infinity asymptotically, e.g. .
A more sophisticated strategy would be to look at the mean normalized loglikelihood (MNLL), as a function of assumed cutoff frequency . Figure 4 plots MNLL (calculated using BLMLTrivial algorithm) is plotted for samples from a strictly positive true pdf along with . Note that where . We see that the MNLL rapidly increases until reaches after which the rate of increase sharply declines. There is a clear “knee” in both MNLL and curves at . Therefore, can be inferred from such a plot. A more complete mathematical analysis of this “knee” is left for future work.
4.2 Making BLMLQuick Even Faster
There are several faster implementation of KD approaches such as those presented in [13, 27]. These approaches use numerical techniques to evaluate the sum of kernels over given points. Such techniques may also be incorporated while calculating the BLMLQuick estimator to make it even faster. Exploration of this idea will be done in a future study.
4.3 Asymptotic Properties of the BLML Estimator
Although, this paper proves that the BLML estimate is consistent, it is not clear whether it is asymptotically normal and efficient (i.e., achieving a CramerRaolike bound). Studying asymptotic normality and efficiency is nontrivial for BLML estimators as one would need to first redefine asymptotic normality and extend the concepts of Fisher information and the CramerRao lower bound to the nonparametric case. Therefore, we leave this to a future study. However, we postulate here that the curvature of MNLL plot might be related to Fisher information in the BLML case. In addition, although under simulations, the BLML estimator seems to achieve a convergence rate similar to its parametric counterparts () it is not proved theoretically.
Acknowledgements.
We want to acknowledge Dr. Mesrob Ohannessian for reading our initial manuscript, and Dr. Munther Dahleh, Dr. Rene Vidal and Ben Haeffele for valuable discussions. We thank Prof. M. A. Wilson and Dr. F. Kloosterman for providing the experimental data. S. V. Sarma was supported by the US National Science Foundation (NSF) Career Award 1055560, the Burroughs Wellcome Fund CASI Award 1007274 and NSF EFRIM3C. Z. Chen was supported by the NSFCRCNS Award 1307645.References
 [1] Montricher GFD T, Thompson JR (1975) Nonparametric maximum likelihood estimation of probability densities by penalty function methods. The Annals of Statistics 3:1329–48.
 [2] Rosenblatt M (1956) Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics 27:832–837.
 [3] Parzen E (1962) On estimation of a probability density function and mode. Annals of Mathematical Statistics 33:1065–1076.
 [4] Peristera P, Kostaki A (2008) An evaluation of the performance of kernel estimators for graduating mortality data. Journal of Population Research 22:185–197.
 [5] Scaillet O (2004) Density estimation using inverse and reciprocal inverse gaussian kernels. Nonparametric Statistics 16:217–226.
 [6] ParK B U, Marron J S (1990) Comparison of datadriven bandwidth selector. Journal of the American Statistical Society 85:66–72.
 [7] Park B U, Turlach B A (1992) Practical performance of several data driven bandwidth selectors (with discussion). Computational Statistics 7:251–270.
 [8] Hall P, Sheather S J, Jones M C, Marron J S (1991) On optimal databased bandwidth selection in kernel density estimation. Biometrika 78:263–269.
 [9] Jones MC, Marron JS, Sheather SJ (1996) A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91:401–407.
 [10] Kanazawa Y (1993) Hellinger distance and kullbackleibler loss for the kernel density estimator. Statistics & Probability Letters 18:315–321.

[11]
Carandoa D, Fraimanb R, Groismana P (2009) Nonparametric likelihood based
estimation for a multivariate lipschitz density.
Journal of Multivariate Analysis
100:981–992.  [12] Coleman TP, Sarma SV (2010) A computationally efficient method for nonparametric modeling of neural spiking activity with point processes. Neural Computation 22:2002–2030.
 [13] Raykar VC, Duraiswami R, Zhao LH (2010) Fast computation of kernel estimators. Journal of Computational and Graphical Statistics 19:205–20.
 [14] Merz P, Freisleben B (2002) Greedy and local search heuristics for unconstrained binary quadratic programming. Journal of Heuristics 8:197–213.
 [15] Taylor J (2011) First look  gurobi optimization. Technical report, Decision Management Soluion.
 [16] Hall P, Marron JS (1987) Choice of kernel order in density estimation. Annals of Statistics 16:161–73.
 [17] Cox DR, Isham V (2000) Point Processes (Chapman and Hall/CRC).
 [18] Kass RE, Ventura V (2001) A spike train probability model. Neural Comput 13:1713–20.
 [19] Sarma SV, et al. (2012) The effects of cues on neurons in the basal ganglia in parkinson’s disease. Front Integr Neurosci 6.
 [20] Sarma S V, et al. (2010) Using point process models to compare neural spiking activity in the subthalamic nucleus of parkinson’s patients and a healthy primate. IEEE TBME 57:1297–305.
 [21] Santaniello S, Montgomery Jr EB, Gale JT, Sarma SV (2012) Nonstationary discharge patterns in motor cortex under subthalamic nucleus deep brain stimulation. Front Itegr Neurosci 6.
 [22] Kahn K, Sheiber M, Thakor N, Sarma SV (2011) Neuron selection for decoding dexterous finger movements. In Proceedings of the 33rd IEEE EMBS Conference.
 [23] Gelman A, Carlin JB, Stern HS, Rubin DB (2003) Bayesian Data Analysis (CRC Press).
 [24] Kloosterman F, Layton S, Chen Z, Wilson MA (2013) Bayesian decoding of unsorted spikes in the rat hippocampus. J Neurophysiol 111:217–27.
 [25] Brown EN, Barbieri R, Ventura V, Kass RE, Frank LM (2002) The timerescaling theorem and its application to neural spike train data analysis. Neural Comput 14:325–46.
 [26] Rae AIM (2008) Quantum Mechanics, 5th edition (Taylor & Francis Group).
 [27] Silverman B (1982) Algorithm as 176: Kernel density estimation using the fast fourier transform. Applied Statistics 31:93–97.
5 Supporting Information
5.1 Preliminaries and Formulation of the BLML Estimator
Consider a pdf, , of a random variable with Fourier transform . Let be the set of bandlimited pdfs with frequency support in i.e., is the cutoff frequency of the Fourier transform of the pdf. Then,
(SI1) 
Since each element is a pdf, it can be written as , where defined as:
(SI2) 
Finally, can be defined as the set of all Fourier transforms of elements in :
(SI3) 
Note that since is band limited, will also be band limited in . Therefore, . Finally, and are Hilbert spaces with the inner product defined as , , respectively. The norm is defined for both spaces. Further, note that for all elements in and , .
The Likelihood Function for Bandlimited Pdfs
Now consider a random variable, , with unknown pdf and its independent realizations . The likelihood of observing is then:
(SI4a)  
(SI4b) 
Defining:
(SI5) 
gives:
(SI6) 
Further, consider which maximizes the likelihood function:
(SI7) 
Then the BLML estimator is:
(SI8) 
6 Proof of Theorem 1
(SI9) 
Note that Parseval’s equality [1] is applied to get the constraint . Now, the Lagrange multiplier [2] is used to convert (SI9) into the following unconstrained problem:
(SI10) 
can be computed by differentiating the above equation with respect to using calculus of variations [3] and equating it to zero. This gives:
(SI11a)  
(SI11b) 
To solve for the value of is substituted back from (SI11)a into (SI11)b and both sides are multiplied by to get:
(SI12a)  
(SI12b)  
(SI12c) 
(SI13) 
and using (SI11)a and the constraint , one can show that . Also, summing up all constraints in (SI12)a gives , hence . Now, substituting the value of into (SI12)a and rearranging terms gives the following constraints:
(SI14) 
As mentioned in the main text, the above system of equations () is monotonic, i.e., , but with discontinuities at each . Therefore, there are solutions, with each solution located in each orthant, identified by the orthant vector . Each of these solutions can be found efficiently by choosing a starting point in a given orthant and applying numerical methods from convex optimization theory to solve for (SI14). Thus, each of these solutions corresponds to a local maximum of the likelihood functional . The global maximum of can then be found by evaluating the likelihood for each solution of (SI14). The likelihood value at each local maximum can be computed efficiently by using the following expression:
(SI15) 
This expression is derived by substituting (SI11)a into (SI6) and then substituting (SI14) into the result. Now the global maximum can be found by solving (2). Once the global maximum is computed, we can put together (SI5),(SI8) and (SI11)a to write our solution as (1). Hence proved.
6.1 Consistency of the BLML Estimator
6.1.1 Bounds on Bandlimited PDF
In this section the following theorem is first stated and proved.
For all .
Proof: Above theorem can be proven by finding:
(SI16) 
Because a shift in the pdf domain (e.g. ) does not change the magnitude or bandwidth of , without loss of generality one can assume that and write the above equation as
(SI17a)  
(SI17b)  
(SI17c)  
(SI17d)  
(SI17e) 
Here and is otherwise. Now by differentiating (SI17)e and subsequently setting the result equal to , gives . Therefore , which gives .
Corollary: By the definition of , one can apply Theorem 6.1.1 and show that for all .
6.1.2 Sequence :
Now a sequence is defined and some of its properties are stated and proved. These properties will be used to prove Theorems 6.1.7 and 6.1.7 below.
(SI18a) 
6.1.3 Properties of
has following properties:
(SI19a)  
(SI19b)  
(SI19c)  
(SI19d)  
(SI19e)  
(SI19f)  
(SI19g)  
(SI19h) 
6.1.4 Proofs for Properties of
(P1) can be proved by direct substitution of into left hand side (LHS). (P2) can be derived through binomial expansion of . (P3) can again be proved by substituting and showing LHS=RHS. (P4) and (P5) can be proved by using the fact that both and are monotonic in since and . Therefore, the minimum and maximum values of and , can be found in by plugging in the minimum and maximum values of (note , from Thm 6.1.1 ).
(P6) is proved by using Kolmogorov’s sufficient criterion [4] for almost sure convergence of sample mean. Clearly, from (P5) which establish almost sure convergence. Now, let . Then multiplying each side of equations in (P1) by , respectively, adding them and the normalizing the sum by gives:
(SI20a)  
(SI20b)  
(SI20c) 
Above . To go from (SI20)a to (SI20)b, the result (Arithmetic Mean Harmonic Mean) is used. Now it can be shown that , as following:
(SI21a)  
(SI21b)  
(SI21c)  