1 Introduction
This work is concerned with estimation and tracking of dynamic probability density functions in real time, motivated by a nanoscience application. The introduction of
in situ transmission electron microscope (TEM) technology [zheng2009observation] allows the growth of nanoparticles to be captured in real time and has the potential to enable precise control in nanoparticle selfassembly processes. Part of the underlying nanoscience problem is framed into a learning problem with the following characteristics [qian2019fast]: (1) Estimation and tracking of a timevarying probability density function that reflects the collective changes across ensembles of the nano objects. (2) It seems inevitable to adopt a nonparametric approach in the density tracking, because there is no settled parametric density function that can adequately describe growth mechanisms in a multistage nanoparticle growth process
[woehl2014direct, zheng2009observation]. (3) In order to be useful for inprocess decision making, the density estimation and tracking needs to be conducted in real time. By "realtime" we mean that the learning and computation speed ought to be fast enough relative to the imaging rate (or the data arrival rate in general), which is frames per second (fps) in [zheng2009observation]. While the research is motivated by the dynamic nano imaging, we believe that the aforementioned characteristics are rather common in many types of dynamic streaming data, brought forth in various applications by fastpace data collection capability. The objective of this research is to present one competitive solution for dynamic density estimation and tracking.On the subject of density estimation, kernel density estimator has had great success in accuracy for density estimation of static datasets [scott2015multivariate]. Direct adaptation of kernel density estimator to dynamic density estimation [kristan2010online] is infeasible as the memory and computation cost constantly scale along with the total number of incoming data points. [hang2018kernel] further shows that even with unlimited computation and storage resources, a traditional kernel density estimator will only be a consistent estimator for a few specific dynamic systems. [qian2019fast] also shows that traditional kernel density estimation falls short in practice in dynamic density estimation due to limited data availability.
To address the disadvantages of traditional kernel density estimator in dynamic density estimation, most researchers resort to the "sliding window" mechanism [zhou2003m, heinz2008cluster, qahtan2016kde]. For example, [zhou2003m] proposed the Mkernel algorithm, where the contribution of each data point in the "sliding window" is approximated as an additional weight added to the kernel density at the closest grid point. This approach manages to keep the memory and computation costs within budget despite the growth of the total number of data points. However, with a poor choice of grid points, it can suffer from either overfitting or underfitting. [heinz2008cluster] employed cluster kernel and resampling technique to improve the merger performance. This approach uses the exponentially decaying weight scheme to capture the dynamic of the true density. [boedihardjo2008framework]
proposed the local region kernel density estimator (LRKDE), where the kernel bandwidth varies in different regions. The regions are divided such that the sum of data variances in each region is minimized. LRKDE also uses a "sliding window" to capture the dynamic of the true density.
[qahtan2016kde]further improved upon the previous works by using linear interpolation with kernel densities at grid points to approximate the kernel density estimator and then updating the kernel densities at the grid points with data points within a "sliding window".
The "sliding window" kernel density estimators do not only use the data points at the current time stamp, and they take into account older data points for inferring the current distribution. Intuitively, this mechanism provides two improvements that allow the kernel density estimator to work well in dynamic density estimation. First, defining a window size according to the computation and memory limit of the learning machine can alleviate the scalability issue of the kernel density estimator as old data points that are irrelevant to the current distribution can be discarded. Second, including older data points in the window can help alleviate the low data volume issue for most streaming data applications. However, to the best of our knowledge, all "sliding window" kernel density estimators proposed so far focus on modifying the kernel density estimator itself, and less attention has been given to the "sliding window" mechanism. As the only component that addresses the "dynamic" part of dynamic density estimation, there is no answer regarding how this mechanism affects the performance of the estimation.
We note that there also exists another line of works that model the dynamic density transition using a dynamical system with a fixed number of parameters. One class of frameworks is based on Bayesian learning [caron2007generalized, rodriguez2008bayesian, mena2016dynamic], which models the prior with an evolving Dirichlet process called dependent Dirichlet process, where the dependence between a class of Dirichlet processes is defined by a covariate. When using the covariate to describe time, the dependent Dirichlet process can be used to model the evolution of the dynamic distribution. The computation and memory costs are also maintained at a constant level. Another approach [qian2019fast]
couples Bspline with Kalman filter to capture the density evolution with a state space model. It imposes space continuity with Bspline smoothing and time continuity with Kalman filter to develop a fast density estimator for realtime process control. However, these estimators always need a normalization process with numerical operations to return a proper density function. For realtime density estimation tasks that require a model update cycle in the order of subsecond, these methods may not be ideal as we will later show in simulations.
In this paper, we propose the temporal adaptive kernel density estimator (TAKDE), a novel kernel density estimator for realtime dynamic density estimation that is theoretically optimal in terms of the worstcase asymptotic mean integrated squared error (AMISE). For the first time, we derive the AMISE upper bound for the "sliding window" kernel density estimator in a dynamic density estimation context. The minimizer of the upper bound entails a novel sequence for bandwidth selection and data weighting, which forms the basis of TAKDE. We provide numerical experiments on synthetic datasets to support our theoretical claim, and we then use three realworld datasets to show that TAKDE outperforms other stateoftheart fast dynamic density estimators, such as the Bspline Kalman Filter [qian2019fast] and KDEtrack [qahtan2016kde] in terms of mean test loglikelihood metric. Interestingly, TAKDE also dominates these algorithms in terms of achieving smaller runtime.
The organization of the paper is as follows. We present in Section 2 the preliminaries, including definitions and notations used throughout the paper. In Section 3, we present the details for TAKDE design, which addresses three important questions, i.e., the selection of window size, bandwidth and the data weights. We provide in Section 4 numerical experiments with synthetic and real datasets to demonstrate the performance of TAKDE. Finally, we draw conclusions and discuss the potential and limitations of TAKDE in Section 5.
2 Preliminaries
2.1 Kernel Density Estimation: A Brief Overview
The kernel density estimator for a given set of data points is as follows
(1) 
where is the kernel function with the bandwidth . Throughout this paper, denotes a standard kernel function with a unit kernel bandwidth. We have that . We further impose the following mild assumptions on the kernel function .
Assumption 1.
[wand1994kernel] The bandwidth sequence (the subscript n shows the dependence of with respect to the number of data points) has the following properties
(2)  
which implies that the bandwidth decays slower than and converges to . The standard kernel function
is a bounded, symmetric probability density function with a zero first moment and a finite second moment. That is, the following properties hold
(3)  
The convergence to for bandwidth is rather intuitive, in that when we have infinitely many data points at hand, our estimator can be as flexible as possible without having to be concerned about overfitting. It is easy to verify that many commonly used kernels (e.g., the Gaussian kernel ) satisfy (3).
2.2 Problem Formulation
In dynamic estimation, the density evolves over time. The evolution might be continuous in nature, but we only observe samples from time to time. Here, we consider the case where the streaming data comes in batches. We first define the dynamic streaming dataset, where we observe one new batch of data points at a new time stamp . This data structure applies to most realworld streaming datasets. An important example is estimating density information in video datasets [qian2019fast] like the dynamic nano imaging problem mentioned in the introduction. An image processing tool extracts the sizes of nanoparticles as the sample points for estimating the normalized particle size distribution (NSPD), which is an indicator to anticipate and detect phase change in nanoparticle growth. This data structure further applies to many timeseries datasets [UCRArchive2018]. For the cases where streaming data comes in on a per point basis, one can convert those types of data into our defined structure through combining consecutive data points into batches.
We assume data points are generated independently from . Also, the data points and in different time stamps () are independent from each other. We impose the following assumption on the true density function.
Assumption 2.
The true density function at any time stamp is twice differentiable, and its second derivative is continuous and square integrable.
Assumption 2 is commonly used for continuous density functions [wand1994kernel]. We further define the difference between density functions as
(4) 
where the index in the superscript is the time stamp of the true density we would like to estimate.
Following (1), we write the traditional kernel density estimator of the density as follows
(5) 
The "sliding window" kernel density estimator, popularly used in dynamic density estimation [zhou2003m, heinz2008cluster, qahtan2016kde], takes the following form
(6) 
where represents the set of batches within the moving window (memory), is defined following (5), and is a nonnegative weight sequence that satisfies , to ensure that the output is a proper density function. The window size is , i.e. , so that can be naturally written as . The superscripts on and are omitted hereafter for the presentation clarity.
In order to develop a fast realtime estimator, we need to address the following three problems.
Problem 1.
How do we choose the set to have a good enough "memory" for estimating the density at time ?
Problem 2.
How do we design the weight sequence in (6)?
Problem 3.
How do we devise a kernel bandwidth selector in (5) for realtime processing? For the TEM video applications, this means that the density estimation is done in the order of subsecond.
3 Algorithm Design
In this section, we derive the AMISE upper bound for the general "sliding window" kernel density estimator in (6). We then present a novel weight and bandwidth sequence, entailed by the upper bound minimizer (Problems 23). We use these sequences to design the TAKDE algorithm.
3.1 Asymptotic Mean Integrated Squared Error Upper Bound
AMISE is a popular metric used to theoretically evaluate the performance of a density estimator [wand1994kernel]. For a given density estimator of a density function , the mean integrated squared error (MISE) is defined as follows
(7)  
where the expectation is taken with respect to the randomness of data points. MISE is the integration of the mean squared error of the density estimator over the support. [wand1994kernel] shows that the asymptotic expression (with respect to the sample size ) of the MISE for a standard kernel density estimator with kernel bandwidth is
(8) 
where
(9)  
We can see that the conditions in (2) guarantee that AMISE converges to zero as . In the following theorem, we generalize the theoretical upper bound of AMISE for the "sliding window" kernel density estimator given in (6) in the context of dynamic density estimation, where the underlying density evolves over time.
Theorem 1.
The asymptotic integrated mean squared error of a "sliding window" kernel density estimator at time with window size , weight sequence , and bandwidth sequence has the following upper bound
(10)  
The proof of Theorem 1 is provided in Appendix A. Let us call the three lines in the right hand side of (10) as term , term , and term , respectively. Term is due to the variance of the estimator, and terms and are the bias terms. Terms and are asymptotically vanishing in the sense that when , they both go to zero per condition (2). We can make several observations about the upper bound expression (10). First, the dynamic density estimation with "sliding window" kernel density estimators will have a nonvanishing error term , induced by keeping biased distributions in the memory. Second, when term is small, which is to say that the distribution evolution is mild, there is a theoretical advantage in including previous samples in the memory to reduce term and term . Later simulations will show that the advantage in keeping the biased samples is significant in practice. Third, when the previous distributions are very different from the current distribution, it is desirable to only keep one batch (the current batch) in the memory, i.e., and . In this case, by definition (4) and the upper bound (10) exactly recovers the AMISE for the traditional kernel density estimator in (8).
3.2 Window Generator
In the existing literature, kernel density estimators are modified using arbitrary "sliding windows" to adapt to the dynamic estimation. This approach performs better than the traditional kernel density estimator, as a static kernel density estimator works poorly for dynamic density estimation [qian2019fast]. However, this heuristic approach lacks a theoretical justification. In fact, based on the theoretical upper bound of AMISE (10), it is intuitive that the window size should depend on the density evolution to keep the AMISE small. For example, when the true density changes drastically, it is ideal to decrease the window size to adapt to the fast density change. Therefore, we propose a histogrambased window size generator that will allow the kernel density estimator to be adaptive to dynamic changes.
We observe in (10) that compared to the static AMISE, the worstcase AMISE for dynamic density estimation depends on one more quantity, namely the difference function . In principle, we can use this quantity as an indicator to adapt the dynamic kernel density estimator to the changes in the underlying density function.
We define a cutoff threshold to determine the number of batches (sliding window size) to be kept in the memory of the dynamic kernel density estimator. In doing so, we first define the temporal adaptive (TA) distance between two density functions. Here, we use histograms to approximate the density functions as true density functions are unavailable. We denote the number of bins in the histograms by , set using the Sturges’ rule [sturges1926choice]
(11) 
where is the smallest batch size among all batches in the current memory. Sturges’ rule is a widely adopted, simple binning algorithm in the literature.
The temporal adaptive distance between two histograms and is then expressed as
(12) 
where denotes the norm and
is the probability mass vector on bins in batch
, i.e., . This TA distance serves as a measure proportional to , the approximation of in (10), i.e.,(13) 
To control the bias, one can set a cutoff threshold for the TA distance. Upon receiving batch , the number of batches to be kept in the memory can be set as that satisfies the following two inequalities
(14) 
Note that from a practical standpoint, the cutoff threshold should not be the only criterion for memory window selection, because when the true density goes through a long static period, it is possible that (14) will induce a large memory window that exceeds the computational limit for realtime density estimation. Therefore, there should exist a hard cap to account for computational limits. Combining both considerations, the actual number of batches in the memory should be set as .
Remark 1.
We will see in the experiments that choosing a good cutoff value does affect the performance of the kernel density estimator. However, synthetic data simulation suggests the performance difference is not too sensitive to the cutoff value, so one can heuristically choose it in favor of fast processing rather than through intensive crossvalidation.
3.3 Bandwidth and Weight Generator
The dynamic nature of the underlying true density makes it practically impossible to understand the actual difference functions and the second derivative of the true densities. However, using the AMISE upper bound in Theorem 1, we can find theoretically optimal sequences for kernel bandwidths and weights, which in turn helps in the algorithm design. In view of Theorem 1, we present the following corollary.
Corollary 2.
The optimal sequences of weights and bandwidths that minimize the AMISE upper bound of the dynamic kernel density estimator are as follows
(15)  
where the sequence (with superscript omitted) is such that
(16) 
Remark 2.
Corollary 2 provides some insights concerning the bandwidth and weight choices.

The bandwidth sequence suggests that we should make the kernel more flexible as more batches of data points are included in the estimation. This aligns with the intuition from the traditional kernel density estimator, where the estimator can be more flexible with more sample points.

The weight sequence provides the following insights. First, when the number of data points at a particular batch is considerably large, we should assign more weight to that batch with the hope of extracting more information to infer the current density. Second, the quantity provides a countermeasure to prevent us from assigning a large weight to data points coming from a very different distribution compared to the current batch. Third, we should assign more weights to the batch with a larger kernel width, which means we are favoring smoother estimators in principle.
3.4 Kernel Bandwidth Selector
The bandwidth sequence in Corollary 2 presents a principle for choosing the kernel bandwidth. However, the quantity is unknown in practice, and we still need to find a kernel bandwidth selector to calculate the actual kernel bandwidth values. There exist extensive studies for the choice of bandwidth in traditional kernel density estimation. One popular choice is the crossvalidation approach [bowman1984alternative, hall1991optimal, robert1976choice, rudemo1982empirical]. However, the computational complexity of crossvalidation prohibits its application in highfrequency density estimation as every new batch of data points needs to be crossvalidated for a new kernel bandwidth.
Minimizing AMISE in (8) reveals a simple expression for the optimal kernel bandwidth. [wand1994kernel] characterized the optimal kernel bandwidth based on (8) as follows
(17) 
We notice that (17) coincides with the optimal kernel bandwidth sequence we derived in Corollary 2 but differs by a factor of . This relationship allows us to directly adopt any existing kernel bandwidth selection methods for optimal AMISE.
Expression (17) is still dependent on the unknown , but there exist a number of studies that explore different methods for estimating . For example, [shimazaki2010kernel] approximates the AMISE objective function assuming the density is Poisson and then proceeds to find the minimizer as the optimal kernel bandwidth. However, this method is not applicable in realtime dynamic density estimation as the optimization process is expensive. [qahtan2016kde] provides an iterative update framework by estimating through , which is the numerical square integration of the second derivative of the density estimator. This approach does not impose any strict assumption on the underlying distribution, which offers a robust estimation of . However, the iterative algorithm still requires numerical operations like numerical derivatives and numerical integration, which may not be efficient enough for realtime density estimation.
In TAKDE, we adopt the normal rule introduced in [silverman2018density]. Assuming the true density is Gaussian, the optimal kernel bandwidth can be approximated as follows
(18) 
where is the smoothness parameter depending on the kernel function and the underlying true density, and
is the sample standard deviation of the data points. The normal rule is particularly appealing for the design of TAKDE due to its simple structure, which allows a direct plugin of smoothness parameter
and enables realtime processing.There are two commonly used recommendations for the smoothness parameter in (18). The first choice given in [wand1994kernel] is as follows
(19) 
where is the estimated standard deviation assuming the true density is normal. The smoothness parameter of Gaussian Kernel in this setting is .
The second recommendation [terrell1990maximal] comes from the upper bound of the AMISEoptimal kernel bandwidth using or triweight density function, that is,
(20) 
This bandwidth provides an oversmoothed density estimator that might not perform well with respect to metrics like loglikelihood or MSE. However, an oversmoothed density estimator is often preferred for realworld applications, because the results are visually plausible. In this case, the smoothness parameter of Gaussian Kernel is .
Remark 3.
The only reason for adopting the normal rule in TAKDE is its computation simplicity. We must note that the weight sequence given in Corollary 2 is compatible with any existing approximation method.
3.5 Algorithm Design
In this subsection, we present the final form of TAKDE. The algorithm requires as input a cutoff value , a hard cap , a smoothness parameter , and a kernel function . Upon receiving the batch of data points at time , the window generator decides the set of batches to be used for the density estimation. The window generator will also return the sequence of approximated as in (13) for all batches in the memory. Then, all batches within the memory will be fed into the bandwidth generator to generate the sequence of kernel bandwidths as in Corollary 2. Then, the approximated and bandwidth sequence will be fed into the weight generator to generate the sequence as in Corollary 2. Finally, all parameters will be put together to generate a proper kernel density estimator for estimating the density at time . Fig. 1 illustrates the workflow of TAKDE. The algorithmic presentation of TAKDE is outlined in Algorithm 1.
Input: Kernel function , cutoff value , hard cap , smoothness parameter .
For
(21) 
(22) 
(23) 
(24) 
(25) 
(26)  
Output: The Temporal Adaptive Kernel Density Estimator given as
(27)  
4 Experiment
We now present numerical experiments to verify the efficiency of TAKDE both on synthetic data and realworld data. All experimental results established in this section are based on Gaussian kernel function.
4.1 Algorithm Design Evaluation
Before we compare TAKDE with other established benchmark algorithms, we evaluate the design of TAKDE on synthetic data. The specific question that we aim to address is that whether our proposed weighting scheme, derived from the AMISE upper bound, outperforms other heuristic weight sequences such as uniform (or average) weighting and exponentially decaying weighting.
4.1.1 Synthetic Dataset Design
We create a synthetic dataset to test the performance of TAKDE in dynamic density estimation. We design the synthetic dataset following some general principles.

The true densities involved in the generation of the dataset need to have analytical forms and have already been established in the literature.

Each batch of data points has a size in the range of , so that the dataset does not differ too drastically in the scale of the data amount.

The number of testing points for all batches should be the same for comparison purposes.

The dynamics of the underlying densities varies for different batches.
Following the above principles, we adopt the Gaussian mixture densities recommended by [marron1992exact] as the baseline densities for our synthetic dataset design. The densities are shown in Fig. 2.
To design the true density, we first consider sections, where each section consists of multiple batches. Let us denote the Gaussian mixtures with and represent the sections with , where equals to the total number of batches in the dataset. To be specific, section has consecutive batches of data points in it, and the first batch of data in section will start after the last batch in section . For batch in section , where , the density function is defined as follows
(28) 
To be consistent with our previous notation, for . Notice that in section , the
th Gaussian mixture linearly transforms to the
th Gaussian mixture. After we move on to section , none of previous Gaussian mixtures will appear in the section. Given the density of batch in section , we sample a random number between to as the number of training points and for testing points to perform the comparison. To account for the randomness in partitioning the batches into sections and the randomness in samples, we generate synthetic datasets for MonteCarlo simulations.4.1.2 TAKDE Evaluation
We now compare the weight generator in TAKDE with two heuristic approaches in the literature. One approach is to assign uniform weights to the batches, assuming older data points are of the same importance as the new data points, and the other one is to assign exponentially decaying weights, assuming the new points are much more important [heinz2008cluster, qahtan2016kde]. To ensure a fair comparison, we only change the weight generator of TAKDE to uniform and exponential weighting, and we keep the other components of the algorithm unchanged. The uniform weight sequence is set as follows
(29) 
The exponential weight sequence is set as follows
(30) 
and , where is the decay ratio. We compare the above to correspond to the expression in (26). In our simulation, in general yields the best result under different settings; therefore, the decay ratio for exponential weight sequence is set to .
Our comparison is performed under several kernel bandwidth selectors including the normal selector and oversmooth selector mentioned in Section 3.4 and under various cutoff values.
First, we consider normal bandwidth selector (19) and oversmooth bandwidth selector (20). For each bandwidth selector, we conduct the comparison with datasets having from to batches of data to reflect different underlying dynamics. Notice that for the data with batches, the dynamic change is more drastic than that of the data with batches.
The simulation result is shown in Fig. 3. We can observe that TAKDE with AMISEbased weight sequence dominates the uniform and exponential weight sequences. We also see that when using the heuristic weight sequences, increasing the memory window (i.e., larger cutoff value) exacerbates the density estimation performance. The results show that the performance difference between TAKDE and other two versions is larger when the total number of batches is smaller. This suggests that TAKDE with AMISEbased weight sequence is better at adapting to more drastic dynamic changes. The smaller differences in batch simulations are consistent with our theoretical results where the weighting sequence in Corollary 2 gets closer to uniform weighting as converges to , equivalent to a static density estimation. We observe that changes in the cutoff value do not have a significant effect on TAKDE performance compared to others. This verifies our discussion in Remark 1.
Second, we conduct the comparison using a synthetic dataset with 100 batches of data for different bandwidth selectors, i.e., varying the smoothness parameter in (18). The simulation results are shown in Fig. 4. Again, we observe the same performance trend for the algorithms. These simulations empirically verify that the performance advantage of our proposed weight sequence against the heuristic weight sequences is robust to different kernel bandwidths and different memory windows.
4.2 Comparison with Benchmark Algorithms
Next, we compare TAKDE with three density estimation methods on realworld datasets. We compare both the mean test loglikelihood and the runtime to show the advantages of TAKDE.
4.2.1 Benchmark Algorithms

Kernel Density Estimator (KDE): The first benchmark algorithm is the traditional kernel density estimator. The main reason to include kernel density estimator in the comparison is to show why a traditional density estimator is not ideal for dynamic density estimation. The kernel density estimator is formulated as (1). The bandwidth selector is
(31) where we use crossvalidation to choose rather than the actual bandwidth for easy comparison with TAKDE.

Bspline Kalman Filter (BKF) [qian2019fast]: Bspline Kalman filter models the underlying density function as a count measure defined on the partitions of the density support. The density estimator is defined as
(32) where is the normalization constant calculated with numerical integration, is the number of partitions, and is the Bspline basis. The algorithm updates its states using a Bspline matrix evaluated on the centers of the density support partitions and the count vector at each batch.

KDEtrack [qahtan2016kde]: KDEtrack partitions the support of the density using a collection of grid points. The set of grid points and the density values at the grid points are updated after each new batch of data points is received and evaluated. The density evaluation at a test point will be the linear interpolation at the test point using the closest grid points.
Remark 4.
We do not include the Mkernel and LRKDE methods since [qahtan2016kde] has showed that KDEtrack is superior to these two methods.
4.2.2 Datasets

In situ TEM video data: The first dataset we use is in situ TEM dataset introduced in Section 1. It is the second in situ TEM video published in [zheng2009observation]. It has a total of frames of images and particle counts in each frame.

CinCECGTorso (ECG) data: CinCECGTorso dataset is an ECG dataset taken from multiple torso surface sites of four patients from the Computers in Cardiology challenges. This dataset is available on UCR timeseries data archive [UCRArchive2018].
The dataset consists of ECG measurements of four patients. We use the ECG signal sequence of one person to highlight the density dynamic over time. Note that simulations on all four patients yield similar results. There are ECG signals (data points) available at each batch, and there are a total of batches of data points over time. The batches are collected at kHz frequency, which requires the density estimator to be updated times per second. For each batch of data points at a time stamp, we randomly sample to data points to train and use the rest of the data points to evaluate the algorithms. The number of training data points at each batch is determined only once throughout all the MonteCarlo simulations. However, the set of training points are sampled randomly in each MonteCarlo simulation.

Wafer data: Wafer dataset is a collection of sensor readings in a semiconductor wafer manufacturing process over time, available on UCR timeseries data archive [UCRArchive2018]. Unlike the previous two datasets, a wafer manufacturing process is a rather slow process that could span over weeks. However, this dataset is still illustrative for evaluating the accuracy of TAKDE. We use the readings in the normal state wafer manufacturing process to conduct our analysis. There are readings (data points) available at each batch, and there are a total of batches of data points over time. Again, we adopt the same train test split approach as in the ECG dataset.
4.2.3 Experimental Settings
In comparing across different density estimators, we only present the best performance achieved for Bspline Kalman filter, where the noise prior parameters are crossvalidated using a grid search with an interval size of . For the traditional kernel density estimator, we report its best performance, but even that is significantly inferior to other density estimators. For KDEtrack and TAKDE, we report the best performance settings of them in terms of smoothness parameter and cutoff value . Notice we do not adopt the iterative bandwidth update in KDEtrack for the computation reason explained in Section 3.4, but instead we use the same bandwidth generator as in TAKDE. All the simulations are conducted over
MonteCarlo simulations for random training testing split to generate the standard errors of the performance. The performance metric is the mean test loglikelihood of the test points.
4.2.4 Performance
The parameter settings leading to respective best performance for all benchmark algorithms are shown in Table I. These settings are crossvalidated using the first batches of each dataset ( for Wafer dataset).
Algorithm  Noise Parameter  Noise Parameter  Smoothness Parameter  Cutoff Value (Window Size) 
TEM  
KDE        
Bspline      
KDEtrack      
TAKDE      
ECG  
KDE        
Bspline      
KDEtrack      
TAKDE      
Wafer  
KDE        
Bspline      
KDEtrack      
TAKDE     
Algorithm  TEM  ECG  Wafer 

KDE  
Bspline Kalman Filter  
KDEtrack  
TAKDE(normal)  
TAKDE(cor)  
TAKDE 
The results are tabulated in Table II. TAKDE tagged with "(normal)" represents the performance achieved with smoothness parameter recommended in equation (19) (normal bandwidth selector) and the optimal cutoff in Table I. TAKDE tagged with "(cor)" represents the performance achieved by TAKDE under KDEtrack best settings in terms of cutoff value and smoothness parameter. As we can observe, TAKDE dominates all other benchmark algorithms in terms of test loglikelihood by a large margin. TAKDE is also robust with respect to different cutoff values and different smoothness parameters, as it dominates all other benchmark algorithms even under the best settings for KDEtrack. The only exception is TAKDE with normal bandwidth selector on the TEM dataset. The underlying reason is that the low data volume available at different batch (training and testing combined) forces the "true" density distribution at each time stamp to an average of Dirac measures, which is far from the normal assumption of the normal bandwidth selector.
Algorithm  TEM  ECG  Wafer 

Bspline Kalman Filter  
KDEtrack  
TAKDE 
The runtime comparisons are shown in Table III. The values represent the time consumption for executing the density estimation for all test data points in all batches. We can observe that despite being more accurate than the benchmark algorithms, TAKDE is also much faster in speed as it requires little calculation in addition to kernel density evaluation. The computation advantage makes a huge difference, for the ECG dataset in particular as the other two benchmark algorithms do not run nearly fast enough to catch up with the kHz data collection rate.
4.3 Visual Examination
In this subsection, we visualize the previously compared density estimators. We pick the time stamps in batches of data in the TEM dataset for visualization. The results are shown in Fig. 5. As we can observe, TAKDE at its optimal setting (test loglikelihood) yields a more flexible model compared to other algorithms. TAKDE with normal smoothness parameter yields the smoothest model among all. Our results in Table II also show that the normal smoothness parameter can achieve estimation performance close to the optimal setting while yielding smooth density functions that facilitate easy interpretation. For this reason, in most realworld applications that do not place estimation accuracy as their first priority, we do recommend using the normal smoothness parameter (19) as the preset to avoid crossvalidation.
5 Conclusion
In this paper, we established a theoretical AMISE upper bound expression for the "sliding window" kernel density estimator in dynamic density estimation. We proposed the temporal adaptive kernel density estimator that maintains the fast processing advantage of the "sliding window" kernel density estimator, while being theoretically optimal under the worstcase AMISE. We provided extensive numerical simulations to verify that TAKDE is superior to stateoftheart realtime dynamic density estimators in terms of the mean test loglikelihood. TAKDE also dominated these algorithms in terms of achieving smaller runtime.
The proposed weight sequence is reminiscent of the attention mechanism in a transformer neural network for sequence reweighting
[vaswani2017attention]. Considering the massive success of transformers in different fields, one of the future research directions is to see whether learning the weight sequence through the attention mechanism can result in better performance.Note that TAKDE in its current state only works for univariate density estimation. Thus, another future direction is to extend it to multivariate density cases.
Appendix A Proof of Theorem 1
We omit the superscript for weight and bandwidth for the presentation clarity. First, recall the definition of from (5)(6), where we have
(33) 
The bias of the estimator can be written as
(34)  
where denotes the convolution, and is the true density of batch .
Using to denote the variance operator, the estimator variance can be calculated as
(35) 
due to the independence, where
(36) 
The decomposition of the MSE of the "sliding window" estimator is as follows
(37)  
Integrating above over , we have
(38) 
Given the expressions of bias (34) and variance (36), to calculate AMISE, we need to derive the Taylor approximations of the following quantities
(39)  
First, we have
(40)  
where we note that holds, because as . We also have that
(41)  
where we used the assumptions that and . Given the above asymptotic characterization of the quantities, we can rewrite the bias term (34) as
(42) 
We can also write the variance (36) as
(43) 
We can now simplify the MSE (37) as
(44)  
Disregarding the terms that converge to zero and taking integral over , we can derive an upper bound for AMISE as
(45)  
where the last two lines follow from the CauchySchwarz inequality for the terms in the square. Note that by definition. Observing that completes the proof of Theorem 1.
Appendix B Proof of Corollary 2
(45) shows that the upper bound on AMISE depends on the weight sequence and the bandwidth sequence . Therefore, we can minimize the upper bound with respect to both of these parameters.
Differentiating with respect to yields the following (optimal) sequence
(46) 
We can find the optimal sequence of weights by simply solving the minimization of Lagrangian of (45) with the constraint and incorporating (46). This will result in the following expression for the sequence
(47)  
which completes the proof of Corollary 2.
Acknowledgments
The authors gratefully acknowledge the support of NSF Award #2038625 as part of the NSF/DHS/DOT/NIH/USDANIFA CyberPhysical Systems Program.