# Modeling a Hidden Dynamical System Using Energy Minimization and Kernel Density Estimates

In this paper we develop a kernel density estimation (KDE) approach to modeling and forecasting recurrent trajectories on a compact manifold. For the purposes of this paper, a trajectory is a sequence of coordinates in a phase space defined by an underlying hidden dynamical system. Our work is inspired by earlier work on the use of KDE to detect shipping anomalies using high-density, high-quality automated information system (AIS) data as well as our own earlier work in trajectory modeling. We focus specifically on the sparse, noisy trajectory reconstruction problem in which the data are (i) sparsely sampled and (ii) subject to an imperfect observer that introduces noise. Under certain regularity assumptions, we show that the constructed estimator minimizes a specific energy function defined over the trajectory as the number of samples obtained grows.

## Authors

• 1 publication
• 1 publication
• 7 publications
07/27/2021

### Kernel Density Estimation by Stagewise Algorithm with a Simple Dictionary

This study proposes multivariate kernel density estimation by stagewise ...
07/13/2016

### Kernel Density Estimation for Dynamical Systems

We study the density estimation problem with observations generated by c...
12/20/2012

### Nonparametric ridge estimation

We study the problem of estimating the ridges of a density function. Rid...
12/09/2011

### Green's function based unparameterised multi-dimensional kernel density and likelihood ratio estimator

This paper introduces a probability density estimator based on Green's f...
12/09/2019

### Conditional Kernel Density Estimation Considering Autocorrelation for Renewable Energy Probabilistic Modeling

Renewable energy is essential for energy security and global warming mit...
10/19/2020

### DAN – An optimal Data Assimilation framework based on machine learning Recurrent Networks

Data assimilation algorithms aim at forecasting the state of a dynamical...
01/27/2021

### Parameter and density estimation from real-world traffic data: A kinetic compartmental approach

The main motivation of this work is to assess the validity of a LWR traf...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

In this paper we propose an algorithm for modeling and forecasting a sparse, noisy, recurrent trajectory that lies entirely on a manifold embedded in an arbitrary dimensional Euclidean space. By sparse, we mean the signal may be subject to long gaps in observation; by noisy we mean the signal is sampled by an (unknown) imperfect observer. We will define recurrent precisely in the context of the underlying mathematical model, however in general we mean the trajectory visits a neighborhood (or collection of neighborhoods) infinitely often. Examples of these trajectories include vehicle (ship, plane and car) tracks, migration data (e.g., in birds, whales and sharks), and some economics data subject to seasonality (e.g., detrended annual sales).

Our approach uses a combination of kernel density estimation (KDE) and energy minimizing inference for sparse trajectory reconstruction prior to model learning. Our goal in using a KDE is to construct distribution estimators, rather than pointwise estiamtors with confidence intervals. That is, rather than using a traditional pointwise timeseries forecasting method, our objective is to generate a sequence of probability distributions that can be used to generate an optimized pointwise estimator on demand. The methods proposed in this paper will generalize to compact manifolds in arbitrary dimensions, however we will focus specifically on examples from compact subsets of

and the -sphere as a representation of the Earth.

### i.1 Related Work

Our work extends and is related to the basic statistical problem of time series modeling. Linear and non-linear time series modeling is a well established field of statistics Box et al. (2008) and statistical process control del Castillo (2002)

. Basic linear regression

Hogg and Tanis (2006) (Chapter 8) and non-linear regression Serber and Wild (2003) (Chapter 2) attempt to model observations as functions of a variable . In one dimension, Autoregressive Integrated Moving Average Models (ARIMA) extend these notions by allowing the model to vary as a function of past values and past shocks Box et al. (2008) (Chapters 3 and 7). Seasonal ARIMA models extend this notion by adding seasonal periodicity Ghysels and Osborn (2001) (Chapter 5). Fractional ARIMA Hosking (1981) models add short and long range dependence, not expressible with classical ARIMA techniques. In particular, these non-linear models are better able to express persistence and anti-persistence. Finally (Generalized) Autoregressive Heteroskedastic Models (ARCH/GARCH) add heteroskedastic behavior to the error components of the time series allowing globally stationary and locally non-stationary error terms to be analyzed Engle (2001)

. Many of these techniques can be extended to vector valued functions (of the type we consider). In particular, Vector Autoregressive models (VAR and VARIMA)

Hacker and Hatemi-J (2008) can be used to model time series of vector valued functions. The most general models are the stochastic differential and difference equations that use Weiner and Lévy processes to model stochasticityOksendal (2007) (Chapter 1).

Grid-based methods that approximate the trajectory can be employed when the space of the time series is continuous but can be partitioned into a collection of discrete grid points and the trajectory modeled as a time series of grid points. The work in Martinerie and Forster (1992); Rabiner (1989); Streitt and Barrett (1990)

describes methods of using hidden Markov models (and in the case of

Martinerie and Forster (1992) dynamic programming) to identify optimal estimators for the behavior of trajectories passing through the discretized state space. Griffin et al. (2011) uses a multi-resolution grid model and a continuous time model to construct a hybrid track estimator that attempts to retain the simplicity of a grid-based model without sacrificing the accuracy of a continuous model. We note that the approaches discussed heretofore are all designed to generate pointwise forecasts with confidence regions, while our objective in using a KDE-based approach is to generate a sequence of probability distributions, which can be used to generate a pointwise forecast.

Forecasting dynamical systems, especially non-linear dynamical systems, is a well known problem in physics with applications to noise reduction and experimental data smoothing. Kostelich and Yorke (1988); Kostelich and Schreiber (1993); Schreiber (1993) consider noise reduction in dynamical systems with Schreiber (1993)

providing a fitting approach that is qualitatively similar to the work presented in this paper. Anomaly detection is considered in

Cao et al. (2004) with states goals similar to those in Pallotta et al. (2013), but applied to one-dimensional chaotic signals. Forecasting and non-linear modeling is considered in McSharry and Smith (1999); Voss (2001); Danforth and Yorke (2006); de S. Cavalcante et al. (2013). In addition to this work, Li and Cheng (2010) applies stochastic hidden Markov models to fuzzy time series forecasting. Fuzzy time series forecasting is also considered in Huarng et al. (2007). Han et al. (2018) considers the problem non-uniform state space reconstruction of chaotic time series. Chaotic time series forecasting is also considered in Shen et al. (2013), which uses an ant colony optimization algorithm to optimally embed a time series in an appropriate space. Joint continuous and discrete forecasting is considered in Ramasso et al. (2013)

, while outlier detection of time series is considered in

Rasheed and Alhajj (2014)

. More recent work has applied multi-layer perceptron neural networks to time series forecasting

Rahman et al. (2016).

Using a KDE for the purpose of modeling and forecasting recurrent trajectories has been studied by other authors in more restricted contexts. Pallotta, Vespe, and Bryan Pallotta et al. (2013) use a kernel density estimation technique to model shipping routes using Automatic Identification System (AIS) data. They use the resulting distributions to identify anomalous behavior in ship routes. As they note, AIS data is exceptionally dense, and can be used in real-time to track ships.

Additionally, it is well known that a kernel density estimate can be used as a convolutional filter on noisy data. This was done in Lampe and Hauser (2011) in order to visualize streaming data from an aircraft. This is a trajectory in , although Lampe and Hauser (2011) only considers the projection onto . In both Pallotta et al. (2013) and Lampe and Hauser (2011), the data are high density with minimal noise. This is not realistic in antagonistic situations or in cases where the trajectory cannot be observed with high fidelity. This occurs naturally when biologists observe animals in their natural habitat (e.g., see OC1 (2018)). This paper considers situations in which the sampled trajectory is neither dense nor exhibits high signal-to-noise ratio (SNR).

### i.2 Paper Organization

The remainder of this paper is organized as follows: In Section II we introduce notation and the underlying mathematical model to be used throughout the rest of the paper. In Section III we discuss our proposed modeling and forecasting algorithms. Theoretical results on the algorithms are provided in Section IV. We present empirical results using synthetic and real-world data sets in Section V. Finally conclusions and future directions are presented in Section VI.

## Ii Notation and Preliminaries

### ii.1 Notation and Assumptions

Let be a (hidden) dynamical system describing the motion of an (autonomous) particle in a compact manifold embedded in Euclidean space. Throughout this paper, boldface symbols will indicate positions on the manifold in an appropriate coordinate system; e.g., and . We assume the manifold is endowed with an appropriate metric. For example, when , the standard Euclidean metric is used; when (the 2-sphere), the Haversine metric is applicable. We denote distance between two points in as . For notation used through the paper, we use to denote equality by definition rather than derivation.

The dynamical system we study is hybrid in the following sense: We assume there is a finite set . At any time , either:

1. There are positions and the function defines a sub-trajectory in so that

 (1)

where is a hidden energy function (Lagrangian) and are hidden (possibly time parameterized) constraints.

2. We have . At some time , a new is chosen (possibly at random).

We assume the dynamical system is recurrent in the sense that the choice on

is governed by a (hidden) Markov chain with no transient states and is either ergodic or periodic. Therefore, if

, there is some so that .

In the problem we discuss, all relevant information about the dynamics, including , the Lagrangian and some (perhaps all) of the constraints are hidden. The assumption that is constructed from piecewise optimal paths is used to justify our method of inferring missing information in sparsely sampled data. For simplicity, in the remainder of this paper, we will assume that are time invariant and denote the constraint functions by . In the sequel, we assume data are sampled discretely and with noise to produce a sparse noisy signal:

 xi=φ(ti)+ϵti (2)

here the are i.i.d. mean-zero noise vectors with covariance matrix . We note, our approach is a parameter-free approximation method and our focus is not on estimating

unlike (e.g.) in the traditional Kalman filter estimate (see

Kalman (1960)).

In addition to the recurrence assumption, we assume that is piecewise smooth, and in particular at an instantaneous velocity can be constructed using initial conditions. In practice velocity is numerically approximated by a difference quotient. Finally, since we assume that obeys a set of externally imposed constraints defined by in Eq. 1, we assume there is a feasible region defined by and for all , . In particular, may be convex as a set in an appropriate Euclidean space, but may not be, making the problem more challenging.

### ii.2 Techniques

We provide a brief overview of Kernel Density Estimates (KDE), which are a foundational element of our proposed algorithm. The interested reader may consult Scott (2015) for a more detailed overview. The KDE is a non-parametric estimate of the probability distribution of a set of data . Formally, the univariate KDE is given by

 ^f(x)=1NhN∑i=1K(x−xih),

with the requirement that

is a valid probability density function (PDF). Intuitively, one may think of a KDE as a mean of many probability distribution functions with a normalizing constant,

. The parameter is referred to as the bandwidth and there are many well established rules-of-thumb for choosing given arbitrary data (see, e.g., Jones et al. (1996)

). The bandwidth controls the variance of kernel and acts as a smoothing control on the resulting KDE.

The KDE can be generalized to arbitrary dimensions. Let be data points in , possibly on a manifold . Define . The KDE generalizes to:

 ^f(x)=1nh1h2⋯hdN∑i=1(d∏j=1K(xj−xijhj)),

where is a valid PDF on and is the bandwidth parameter for the coordinate . We will write the bandwidth vector .

KDE on the Sphere: In Section V, we apply the proposed algorithm to cruise ships traveling on the surface of the earth. While for this paper we consider a small enough region to approximate by a plane, on larger scales this may lead to significant distortion. In such a case one might wish to approximate the Earth as a 2-sphere

. In this case, one could use use the Kent distribution, the analogue to the Gaussian distribution on a sphere, as the choice of kernel. Let

be the longitude in degrees, and the the latitude in degrees. The general formulation of the Kent distribution in spherical coordinates is

 g(Ω,ϕ)=c(κ,β)−1exp(κcosΩ+βsin2Ωcos2ϕ).

Here is a parameter representing the concentration of the distribution, is an analogue of covariance, which Kent describes as the “ovalness,” and is a normalizing constant given by

 c(κ,β)=π∫02π∫0exp(κcosx+βsin2xcos2y)sinxdydx.

If we choose , in accordance with our simplifying assumption that covariance of is , then we have . A full description of the Kent distribution can be found in Kent (1982).

KDE in the Plane: In , which is a sufficient for approximation of the earth’s surface over small regions, one natural choice of kernel is the Epanechnikov kernel. The Epanechnikov kernel is generalizable to arbitrary dimensions; in two dimensions, it is given by the product kernel:

 ^f(x,y)=916nh1h2n∑i=1(1−(x−xi)2h21)(1−(y−yi)2h22).

Epanechnikov Epanechnikov (1969) shows that the Epanechnikov kernel minimizes relative global error in the following sense: Let be the true distribution and assume that is analytic. Suppose is approximated by using Epanechnikov kernel. Then minimizes the functional:

 ∬R2E[^f(x)−f(x)]2dx. (3)

The preceding result is derived using the calculus of variations with constraints and is independent both of distribution (subject to the requirement of analyticity) and constraints on the kernel function.

Eq. 3 defines the Mean Integrated Square Error (MISE). Therefore, the Epanechnikov kernel minimize MISE. Scott Scott (2015) notes “…the MISE error criterion has two different, though equivalent, interpretations: it is a measure of both the average global error and the accumulated point-wise error.” The fact that the Epanechnikov kernel is a MISE minimizer makes it the natural choice of non-Gaussian kernel, especially since it also trivial to extend it to the case where the inferred distribution must have compact support.

In particular, we may constrain the magnitude of by requiring the KDE has a compact support. The Gaussian kernel does not respect such a constraint, as it gives non-zero probability to the whole of the manifold , rather than restricting the non-zero probability to a region that satisfies velocity constraints. As we also show in Section IV, a kernel with bounded support can also be useful to discuss feasible regions.

### ii.3 Deriving a Point Estimator and Uncertainty Regions with a KDE

Given a PDF (potentially a KDE) , there are several ways to find a point estimator. The most straightforward method is to compute , the expectation. However, if is multimodal on and there are constraints on the dynamics (see Eq. 1), then it is possible , where is the feasible subset of . In this case it is more useful to to compute:

 ~x=maxx∈Ωf(x), (4)

as the point estimator. Depending on the numerical complexity, one can also define the conditional distribution and take . We note that Eq. 4 may be a non-convex optimization problem. In this case, it might be simpler to compute the unconstrained optimization and either (i) accept that or (ii) project onto the boundary of the feasible region. We discuss the limits of this approach in Section IV.

To find uncertainty regions, we compute the highest density region, following the Monte Carlo technique established in Hyndman (1996). The highest density region (HDR) of a distribution function is the set of corresponding to the preimage of a horizontal “slice” of where the slice includes the maximum of the PDF, and continues extending down until a certain percentage of the support is above a given threshold.

For a PDF with support define:

 R(c)={x∈X:f(x)≥c}. (5)

The highest density region is the set where:

 cα=argmaxc∫⋯∫R(c)f(x)dx≥(1−α). (6)

Hyndman Hyndman (1996) proposes a Monte Carlo algorithm which samples the computed PDF (in our case the KDE) and uses the quantile as an estimator for . We use this algorithm in the sequel.

Lastly, since we consider a sequence of KDE’s which advance in time, there is an implicit inclusion of the velocity of the object. Thus, we do not need to actually compute explicitly while generating a forecast.

## Iii Algorithm Description

In this section we motivate Algorithm 1 to forecast a finite sequence of triples where is an estimator of , is a distribution used to construct the HDR for that acts as an uncertainty region, and . The algorithm takes as input the observation set , the time-indexed sequence of observed positions. Recall from Eq. 2, is observed with noise. Note that we do not require for the points in .

The algorithm is broken into four stages:

1. Identify a collection of points in similar (in an appropriately defined metric) to the last known position of the particle.

2. Extract sub-trajectories of corresponding to each identified point in that (i) begin at the identified point and (ii) span an input forecast time. This set of sub-trajectories is denoted .

3. Densify the observed sub-trajectories in to obtain using a line integral minimization on an estimator of . Each densified trajectory in is composed of points on that are equally spaced in time.

4. Let:

 ¯¯¯¯¯¯P={{¯¯¯x(j)i}j1+¯qi=j1:j∈{1,…,|H|}}

be the densified trajectories. For each (time index) use the set:

 Xi={¯¯¯x(j)i:j∈{1,…,|H|}}

to construct , the KDE. Use the KDE to construct and an associated HDR representing an uncertainty region.

Metrics and Tolerances: In Step 1, we have already defined the distance on the manifold . Let

be the corresponding metric tensor on the manifold. Given two velocity vectors

and , let the angle metric to be:

 δ(v1,v2)Δ=1−g(v1,v2)∥v1∥∥v2∥

This is just the standard cosine distance when . In the absence of velocity data, we can use a finite difference method so that (e.g.) the velocity vector of the last observed point in is:

 vN≈d(xN,xN−1)tN−tN−1

As input to Algorithm 1, we take two parameters , which is a a tolerance on and , which is a tolerance on . These function as hyper-parameters in our proposed algorithm.

Forecast Duration: Define to be the forward time restriction beginning with and including all points so that . That is:

 P(xi,T)={xk∈P:k≥i∧tk−ti≤T}

In Algorithm 1, is the duration of the forecast and is an input parameter.

Sampling Period: For sparse track reconstruction and forecast generation, we require a sampling frequency. That is, a value so that if is a densified trajectory corresponding to some subtrajectory , then for all : , where . This sampling period gives resolution to intermediate points of prediction but does not affect predictions made at any given point. It is now straightforward to see that .

Track Densification: Suppose and must be densified; i.e., there is some pair so that so that . (Note: if is too dense, it is trivial to downsample it to make it sparser.) If approximations and are available, then it is trivial to solve (numerically):

 ⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩minγ∫tj+1tj^L(γ,˙γ,t)dts.t.g(γ,˙γ)≤0γ(tj)=xj,γ(tj+1)=xj+1 (7)

The resulting solution can be used to provide an estimated track of arbitrary density. If is not available, then it is straight-forward to define a Gaussian well function:

 ^LG(x)=N∑j=1⎛⎜ ⎜⎝1−1√2πσ2iexp(−d(x,xi)22σ2i)⎞⎟ ⎟⎠ (8)

or a least squares function:

 (9)

and solve the constrained line integral minimization problem:

 ⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩minγ∫γ^L∗(x)dxs.t.g(γ,˙γ)≤0γ(tj)=xj,γ(tj+1)=xj+1 (10)

Using Eq. 8 in Eq. 7 causes inferred trajectories to follow historical trends, since the center of the Gaussian well yields the minimal energy, while using Eq. 9 causes the path to minimize the square error. One benefit to Eq. 9 is it has simpler theoretical properties as we show in the sequel. On the other hand, when using Eq. 8, if is an increasing function of , then this is a continuous variant of pheromone routing Sim and Sun (2003); Hu et al. (2010).

Constraint inference is more difficult. In practical situations (e.g., ship tracks) there are obvious constraints in play, like land avoidance (see Section V for examples). For the remainder of this paper, we assume that the constraint function (and hence the feasible region ) is known (or at least an approximation is known) and consider constraint estimation as future work. Algorithm 1 shows the pseudo-code for the proposed algorithm.

## Iv Theoretical Results

If the Lagrangian is stationary, we can show that the optimal solution to the problem given in Eq. 10 is asymptotically when and as the sampling rate increases. Assume . Further assume we have observations from the path connecting with denoted as . We are considering the asymptotic case when the sampling rate is infinite, so:

 x(i)(t)=φx0(t−t0)+ϵ(i)t=φ(t)+ϵ(i)t (11)

Assuming we use as our estimation for then we solve:

 minγ∫tft0M∑i=1d(γ(t),x(i)(t))2∥∥γ′(t)∥∥dt,

subject to any appropriate constraints. Here must represent the norm in the manifold . At any time instant , the value minimizing is:

 γ∗(t)=1MM∑i=1x(i)(t)=φ(t)+1MM∑i=1ϵ(i)t

from Eq. 11. (To see this, note the integrand is simply the energy function for a mechanical equilibrium point.) We assumed the noise was i.i.d with mean zero. Therefore, as ,

 1MM∑i=1ϵ(i)t→0

and . We ignored constraints only because we can see that must satisfy these constraints and therefore, asymptotically so will . A similar argument can be made for , but it is not as clean, owing to the additional exponential function in the Gaussian.

Using the above results, we see that the proposed technique for filling in missing information in our discretely sampled noisy signal is (in some sense) an optimal one, assuming a stationary Lagrangian. In the case of non-stationarity, the problem is more difficult, and hence the use of heteroskedastic variances (see Eq. 8) related to the time of the observation.

The inferred point predictor given in Eq. 4 is simple to implement but does not take constraints into consideration. Therefore, we quantify how far outside the feasible region a point predictor could be expected to be. This can be used to determine whether should be considered.

As before, let be the feasible region for the trajectory . Without loss of generality, assume is a proper subset of , so that the feasible region is non-trivial. Let be the infeasible or forbidden region. Denote by the topological closure of with boundary .

We show that feasible regions (and hence forbidden regions) are (partially) inferred as a part of Algorithm 1. To do this, define the point-to-set metric by

 ρ(S,x)=inf{d(x,y):y∈Y}

That is, this is the distance from to the closest point in the set .

Let be a fixed forbidden region with closure and boundary denoted as above. Assume the prediction point is computed with the unconstrained form of Eq. 4 and the Epanechnikov kernel with bandwidth vector . Let be the dimension of , and be the Euclidean metric on . Then:

In other words, any prediction will be at most distance inside of the forbidden region. If this is trivial, so we consider the case when .

From Eq. 4, is a local maximum of a KDE . Let be the shifted Epanechnikov kernel

 Kh(⟨x1,x2,...,xd⟩)=K(x1−xj1h1,x2−xj2h2,...,xd−xjdhd)

centered at . Then the support of is the parallelepiped:

 [xj1−h1,xj1+h1]×[xj2−h2,xj2+h2]×...×[xjd−hd,xjd+hd].

The support of is the union of the supports of the individual kernels centered at for , hence there is some point such that . We know since . To maximize

 supy∈(supp^fi)∩Yρ(Ω,y),

that is, to have conditions which allow to be as far inside as possible, we need to minimize . The closest that could be to without being in is if , the boundary of . This of course assumes that is open - if is closed, then is strictly positive and our argument still holds. We now see that .

It should be noted that the essence of this argument extends to any Kernel whose support is bounded. We also note that if the topological diameter of a forbidden region is smaller than , then this property does not prohibit from being at any point of .

## V Experimental Results

We discuss two experiments to test Algorithm 1. In the first experiment, we forecast two cruise ships over several days (Carnival line’s Freedom and Dream ) to evaluate the performance of Algorithm 1 in a real world context. In the second experiment, we evaluate the efficacy of Algorithm 1 with a synthetic data set. Use of a synthetic data set allows us to more closely control the underlying dynamical system and provides a method for exploring potential limitations of the proposed technique.

We use two error metrics to evaluate the algorithm: absolute pointwise error (), and percent in highest density region (). Let be the true position of particle at time . We create a forecast using Algorithm 1 (including information prior to ), and obtain prediction point for time . The function is defined as the distance . As noted, we can construct HDR at . Let:

 χRi(x)={1x∈Ri0else

be an indicator function, then define

 %HDR(F)Δ=1¯¯¯q−1N+¯q∑i=N+1χRi(pi).

tells us how far off the pointwise forecast is while

tells if the true position is the derived uncertainty region. We compute mean and standard deviation of

for an entire forecast, to give a global error metric for a forecast as a whole.

### v.1 Ship Track Forecasts

Cruise ships exhibit highly recurrent behavior as they travel from port to port, according to a list of destinations which appeal to tourists. Cruise ships also use AIS to give their positions at a high sampling rate with low noise. This makes them excellent subjects on which to test our algorithm, as we can downsample a portion of a known track and generate noise for use as training data. After creating a forecast from this sparse, noisy training data, we can then use the remainder of the track as high-resolution ground truth for an error analysis of the forecast.

In order to test our algorithm with this data, we used (approximately) two-years of positional data on the Carnival line cruise ships Dream (from December 2011 - July 2012) and Freedom (from March 2012 - June 2013). The data was taken from sai , under fair use. This data was densely sampled, giving a location for each ship on average about once every fifteen minutes. The first 80% of the historical trajectory of each ship was used as the historical data for “training” the KDE model and the last 20% was used as an unseen track on which to test.

We downsampled the training data (the first 80% for each ship) to give one position every day with exactly 24 hour intervals, while retaining the resolution of the unseen track. Since there were usually not AIS positions at exactly 24 hours of time-difference, we linearly interpolated between the nearest known position before the 24-hour mark and nearest known position after the 24-hour mark. The choice of linear interpolation is “wrong” in the sense that we are working on an oblate spheroid as the manifold, but served the purpose of introducing noise into the training data. The result was a sparse noisy track; this was the desired condition for our historical data.

We generated several forecasts with different parameters. In particular, we considered forecast windows of one week with 15 minute resolution, and with input search radii of 10NM, 20NM, and 40 NM. In each case we chose a bandwidth of 1.5 degrees of latitude/longitude, or 90 nautical miles. The bandwidth was chosen by trial-and-error to yield a smooth forecast. There is certainly room for improvement in this approach, but our results confirm that it is a valid approach.

After performing the energy minimization step of Algorithm 1, we have densely sampled paths. Error metrics for forecasting are are plotted in Fig. 1. The trajectories of cruise ships change frequently (e.g., as a result of stop over at ports of call). By way of comparison, we note that course change dynamics would have to be known a priori when using (e.g.) a Kalman Filter.

Table 1 shows summary statistics for the cruise ship forecasting experiments. For Dream, as we increase the search radius, we see an increase in the average error and standard deviation of error, and a decrease in percent in HDR. For Freedom, the average error went down from 10NM to 20NM to 40NM. The standard deviation also only marginally increased and %HDR increased significantly.

At a higher level, this example provides relevant information about the proposed method. First, there is not necessarily a single best initial search radius - it is context dependent; i.e., a parameter of the model that must be fit. Secondly, it validates our choice of HDR as uncertainty region, because our worst prediction (Freedom, 10NM), was still within the HDR of the time. Moreover, Table 1 shows that greater error and standard deviation of error corresponds to a lower . It is true that the HDR depends on the bandwidth parameter, but with a properly chosen bandwidth, we have a reasonable uncertainty region.

Another positive aspect of our forecast is how it treats land. For the most part, the forecast respects the fact that it must remain in the water, and in the few cases were the forecast does go over land, it is only over small islands or tips of peninsulas, which may not be considered by the energy minimization step 111This step was carried out using a gridded representation of the Earth using a numerical approximation of the line integral.. This is important to note because we did not give as an input the location of landmasses in the statistical model. Not only does it respect navigational constraints in this manner, but we also see two interesting patterns in the Freedom forecast. When the ship goes around the Bahamas, the true track goes west of the islands, while the forecast goes mostly east of the islands. In this sense, we have a valid navigational pattern given by the forecast. A similar occurrence happens as the ship proceeds southeasterly, north of Puerto Rico. The forecast goes south in between Hispaniola and Puerto Rico avoiding both landmasses, before proceeding northwest at which point the error goes down to around 50NM. While the forecast was wrong, it gave valid outputs without explicit knowledge of the location of land.

Examining the plots in Fig. 1 more closely, we do not see a general trend in error with respect to time. For Dream the worst error occurs within the first two days of prediction and is almost entirely temporal. For Freedom the worst error is in the last two days and is almost entirely spatial. A traditional forecaster such as a Kalman or particle filter would be expected to have increasing error over time. Finally, we note that the error seems to be periodic. More specifically, it seems to cycle daily. It is possible that this is an artifact of our daily downsampling while preprocessing the historical data.

To summarize: Algorithm 1 gives a very reasonable forecast, with only minimal information about the manifold of interest. It respects navigational constraints, and does not appear to loose accuracy over time in any general way.

### v.2 Synthetic Data Forecasts

In order to have more control over the data set (thereby guaranteeing the data meets our assumptions), we created a synthetic history of trajectories on consisting of 10,000 data points with Gaussian noise, shown in Fig. 2. The generated track included six loiter points, including a bifurcating trajectory where, after reaching the loiter point located at , the particle uniformly choses to go to or . The purpose of the loiter points is to understand how the algorithm treats speed implicitly, while the purpose of the bifurcation is to see how the forecasting algorithm deals with such phenomena.

We generated a length trajectory from the same dynamical system as the first data points, with . In reporting (as in Table 2) we consider each index to be a half-unit of time, so that corresponds to reporting time and corresponds to time . We note that was not used to train the model (in the sense of contributing to the data used to build a KDE). This track is shown in blue in Fig. 3, where the points represent the actual and the lines simply connect the dots.

After generating training data and a ground truth trajectory for comparison to predictions, we compute a prediction . We plot the prediction in red in Fig. 3.

From Fig. 3, we see that the overal shape of the prediction and ground truth are roughly similar. In particular, the predicted path and ground truth are very close up through the predictions made after the ground truth leaves the loiter region.

The average is with a standard deviation of units. It is clear that this relatively large deviation comes from only . Looking only from to , we see the is less than units, and is as low as units at . Moreover, every observation is within a 70% HDR.

The first notable failure in the proposed method can be observed between times and . The predictions linger at the loiter point for a slightly longer time than ground truth. After there is finally a prediction made outside of the loiter region (), the distance to from is much larger than the distance between any two points on the test path.

To some extent the observed behavior is positive because the algorithm self-corrects, in the sense that it catches up where it should be, rather than just proceeding on from the loiter point at a constant velocity. If it did proceed from the loiter point at a constant velocity, it would then be temporally behind the correct position regardless of how accurate the prediction was spatially. On the other hand, in some cases this may be traveling too fast; i.e., the resulting predicted trajectory may not be physically realizable.

The second notable failure occurs , where the prediction back-tracks toward the loiter point, rather than remaining on its present trajectory. This is not consistent with any behavior exhibited by the ground-truth particle at any point on the trajectory. Again this has both a positive and negative interpretation, as the error at this point is quite low (), but it is not physically faithful to any pattern occurring in the data.

At this point, we note that the predicted path correctly used the right fork of the bifurcating path. This is a random choice. The underlying model chose uniformly to move to the right or left. The prediction points happened to follow the correct path because there was a slight skew toward the correct fork and then the test path also went right. We note that the underlying data used to train the KDE contained information on the bifurcation and the uncertainty included both possible paths. We note this by way of indicating that the uncertainty regions are far more important than specific pointwise predictors.

The third notable failure occurs at time . We see the prediction is made at the loiter point corresponding to the left fork. This makes sense because the distance between the loiter point at and is less than the distance between and . In other words, at the time which a particle following the left fork would be at the loiter point, a particle following the right fork would be somewhere roughly along the line between and . The concentration of mass of historical points (corresponding to the greatest peak in the KDE) at this time should then be expected to occur, as is the case, at , because the historical points on the right fork are more spread out. We can see this shown in the contour plot of the KDE at time in Fig. 4. This phenomenon also occurs (switching the role of the two forks) at time , where we see the prediction is made near , another loiter point.

This is perhaps the biggest failure of our approach. We would prefer an approach with the property that once a decision was made at a bifurcation point, the prediction would stay on one path - even if it was the wrong one. Not only is this wrong because it switches forks, it is also wrong because it creates a situation where the particle is traveling unreasonably fast. Correcting these issues in the algorithm is a subject of future work.

With the exception of the three errors, the predictions made were quite good. The prediction was close in space (if not always in time) to the ground truth prior to the loiter point at . Considering that we never explicitly compute speed, this is good performance. We also see that near the loiter points, the error becomes quite low. This makes sense, because we are most sure of the position when it is at a loiter point. Because of the relative sterility of our data, we accept that all the ground truth occurs within the 70% HDR, but note that it would be wrong to call it a “confidence region” in the strict sense of the term.

## Vi Conclusions and Future Directions

The previous literature on modeling and forecasting using Kernel Density Estimates has focused on well sampled data. In this paper, we develop Algorithm 1 to generate forecasts with the cases when we have a sparse, noisy history of a recurrent trajectory. This is an improvement on existing methods, due to the fact that our approach requires significantly less data. The energy minimization technique we use allows us to “connect the dots” between existing data points in an intelligent way, so that we can sample data as finely as we wish.

Algorithm 1 has several elegant features. Traditional forecasters use some approximation or knowledge of speed, which is then projected forward in time. Our forecasting algorithm treats speed implicitly as we develop time-indexed probability distributions over the manifold. This simplifies the computation, and makes the algorithm path-independent. In practice, this allows us to pick specific times at which to provide a forecast position and uncertainty function , rather than having the requirement of computing intermediate steps to predict the whole path.

The algorithm also has nice theoretical properties. In particular, the optimal solution of a minimal energy trajectory is the asymptotic limit of the prediction as the sampling rate goes to infinity. Moreover, with a reasonable choice of parameters, it respects the existence of forbidden regions, albeit with a “fuzzy” boundary.

When we assume but cannot guarantee that the governing energy function is the same, (as in the cruise ship forecasts) we still get reasonably good results. When we can guarantee that the underlying energy function is the same (as in our synthetic forecast) the algorithm performs quite well. It gives a reasonable forecast, with error that is mostly temporal, rather than spatial. When the error is spatial, it is as a result of a valid path which was not taken. In both cases of spatial and temporal error we still have an uncertainty region for each point, given by the highest density region of the KDE , and experimental evidence supports our claim that this makes sense as an uncertainty region.

While the algorithm performs well, there are certainly improvements which can be made. In particular, we note that often times the error is temporal rather than spatial. That is, the predicted point is not off because the predicted trajectory has high error in space, but rather because the time is off. A next step would be to consider how to improve this weakness in the algorithm. Additionally, for our ship forecast, we heuristically tuned the bandwidth to give a smooth path. It would be desirable to find a non-heuristic way to find the appropriate bandwidth, as the existing bandwidth selection rules (particularly Scott’s and Silverman’s Rules of Thumb) give a track with high error. Finally, improving our handling of bifurcating tracks when constructing

may also improve the resulting forecasts.

## Vii Acknowledgements

This work was supported in part by the Office of Naval Research through Naval Sea Systems Command DO 0451 (Task 23875). We would like to thank Brady Bickel and Eric Rothoff for helpful conversations during the development of this paper.