I Introduction
In the summer of 2011 an array of floating ocean surface buoys (drifters) were deployed in the Sargasso Sea to assess the lateral diffusivity of oceanic processes [1]. Each drifter was equipped with a global positioning system (GPS) receiver recording locations every 30 minutes. Addressing the primary goal of understanding the physical processes controlling lateral diffusivity requires significant processing of the drifter positions, including removing the mean flow across all drifters, accounting for the large scale strain field, and analyzing the residual spectra for hints of a dynamical process. However, it quickly became clear that the GPS position data, which can have accurracies as low as a few meters [2], was contaminated by outliers with position jumps of hundreds of meters or more. Prior to analysis, the position data requires removing the outliers, and interpolating gaps to keep the position data synchronized in time across the drifter array.
The basic problem is ubiquitous: observations from GPS receivers return observed positions at times that differ from the true positions by some noise
with variance
. The primary goal of smoothing is to find the true position not contaminated by the noise, while the primary goal of interpolating is to find the true position between observation times.The approach taken here is to use smoothing splines. Our model for the ‘true’ path is specified using interpolating bsplines such that
(1) 
where is the order (degree ) of the spline. For observations we construct bsplines such that for appropriately chosen coefficients . To smooth the data we choose new coefficients that minimize the penalty function
(2) 
for some tension parameter . If then and because , but if then this forces the to a th order polynomial (e.g., when , the model is forced to be a straight line because it has no second derivative). The resulting path is known as a smoothing spline and was first introduced in modern form by [3], but according to [4] the idea dates back to [5]. Once and are chosen, the smoothing spline has one free parameter () and its optimal value can be found by minimizing the expected mean square error when the true value of is known [6].
As a practical matter there are three issues that must be addressed before smoothing splines are applied to GPS data:

how do we choose and —and how do these choices affect the recovered power spectrum?

how do we modify the spline fit to accommodate the nonGaussian errors of GPS receivers?

how do we identify and remove outliers?
To address these issues, but also serve as a practical guide to other practitioners, we start by reviewing Bsplines in section II and introduce the canonical interpolating spline that is used as the underlying model for path in (1). We also demonstrate the effect that choosing has on the highfrequency slope of the power spectrum of the interpolated fit.
Section III takes a broad look at smoothing splines and the assumptions they make on the underlying process. Many of the ideas presented in this section are known to the statistics community, so here we present these ideas from a more physical perspective. We show that the penalty function in (2
) can be formulated as a maximum likelihood problem and applying tension is equivalent to assuming a Gaussian distribution on the tensioned derivative of the underlying process.
Section IV uses ensembles from synthetic data designed to mimic the oceanographic data in order to test a number of choices that have to be made. We first establish that setting is a reasonable choice. We then show that the tension parameter can be chosen a priori (without optimization of the mean square error) when the effective sample size
(which we define later) can be estimated from the data. This estimate for the effective sample size can then be used to reduce the coefficients,
, in the spline fit without increasing mean square error. Finally, we show how the effective sample size of the fit establishes the highest resolved frequency.The second half of the manuscript addresses issues specific to GPS positions errors. In section V we discuss the assumptions of stationarity and isotropy required for bivariate smoothing splines. In section VI we show that the GPS errors are not Gaussian distributed, but distributed, and we show how to modify the technique for a distribution. Finally, section VII addresses how to modify the expected mean square error minimizer to make smoothing splines robust to outliers.
One of the major outcomes of this work is the implementation of Matlab classes for generating bsplines, interpolating splines, smoothing splines as well as a class specific to smoothing GPS data^{1}^{1}1https://github.com/JeffreyEarly/GLNumericalModelingKit. These classes are highlighted throughout the manuscript in their relevant sections.
Ii Interpolating Spline
Assume that we are given observations of a particle position with no errors. The simplest possible form of interpolation would be a nearest neighbor method that assigns the position of the particle to the nearest observations in time. The resulting interpolated function is a polynomial of order (piecewise constant), shown in the top row of Fig. 1. The next level of sophistication is to assume a constant velocity between any two observations and use that to interpolate positions between observations, second row of Fig. 1. This also means that we now have a piecewise constant function that represents the velocity of the particle, shown in the second row, second column of Fig. 1. This is a polynomial function of order .
It is slightly less obvious how to proceed to a polynomial of order . With data points we can construct a piecewise constant acceleration (the second derivative) using the independent accelerations computed from finite differencing, but where to place knot points that define the boundaries of the regions and how to maintain continuity is slightly less clear. The approach taken here is to use Bsplines.
Iia BSplines
A Bspline (or basis spline) of order (degree ) is a piecewise polynomial that maintains nonzero continuity across knot points. The knot points are a nondecreasing collection of points in time that we will denote with . The basic theory is well documented in [4], but here we will present a reduced version specifically tailored to our needs.
The th Bspline of order is defined as
(3) 
This is the rectangle function as shown in the first row, first column of Fig. 2. If we are given knot points, then we can construct Bsplines of order , although notice that if a knot point is repeated this will result in a spline that is zero everywhere. To represent an interpolating function for the observations of a particle position we define knot points as
(4) 
This will create independent basis functions that provide support for the region (provided the last spline is defined to include the last knot point). The interpolating function is defined as where the coefficients are found by solving . The result of this process is shown in Fig. 1 for 7 irregularly spaced data points.
All higher order Bsplines are defined by recursion,
(5) 
This recursion formula takes two neighboring lower order splines and ramps the left one up over its nonzero domain and ramps the right one down over its nonzero domain. The result of this process is to create splines that span across one additional knot point at each order, and maintain continuity across one more derivative. Examples are shown in Fig. 2.
Any knot points that are repeated times will result in a total of splines of order one that are everywhere zero. This has the effect of introducing discontinuities in the derivatives for higher order splines. For our purposes, we will only use this feature to prevent higher order splines from crossing the boundaries. For order splines we will use knot points at locations
(6) 
This creates a knot point at every observation point, but repeats the first and last knot point. This has the effect of terminating the first and last spline at the boundary and creating second order Bsplines, . Once again the interpolating function is defined as where the coefficients are found by solving . The second row of Fig. 1 shows an example.
This process can be continued to higher and higher order Bsplines. For splines that are of even order, we create knots points with
(7) 
and for splines that are odd order, we create knot points with
(8) 
The knot points are chosen specifically to create splines for the data points such that the interpolated function crosses all observations . The path is the canonical interpolating spline of order K. Examples are shown in Fig. 1.
The knot placements in (7) and (8) are equivalent to the notaknot boundary conditions described in [4] and used in the cubic spline implementation in Matlab. In the usual formulation of the notaknot boundary condition, the knot positions do not change as a function of spline order, and therefore additional constraints have to be added at each order—especially the requirement that the highest derivative maintain continuity near the boundaries. In the formulation here, these constraints are implicit in (7) and (8).
IiB Numerical implementation
The root class in our suite of Matlab classes is the BSpline class, which evaluates a complete Bspline basis set given a set of knot points. This class was used to generate Fig. 2.
The interpolating spline used to generate Fig. 1 is implemented in the InterpolatingSpline class—a sublcass of BSpline. This class generates interpolating splines of arbitrary order given a set of data points , thus generalizing the cubic spline command that is built in to Matlab.
IiC Synthetic Data
Throughout this manuscript we generate synthetic data for both the signal and the noise. The velocity of the signal is generated from a Gaussian process known as the Matérn [7]. The spectrum of the Matérn is given by
(9) 
with , which has finite amplitude at low frequencies and powerlaw fall off at high frequencies, two physically realistic properties observed, among other things, in ocean surface drifters [8].
For these experiments we choose values of so that the high frequency spectrum is proportional to , , . The Matérn is used to generate the velocity of the signal and integrated to get positions. Parameters are chosen such that the square root of velocity variance in each direction is m/s and the damping scale minutes. These choices resemble the data from the drifters. Fig. 3 shows an example velocity spectrum of the signal with .
The position data is contaminated with (white) Gaussian noise with meters, a value chosen to resemble GPS errors. In section VIA we consider noise generated from a distribution which more accurately reflects GPS errors.
For all of these experiments we use a range of strides
, that is, subsampled versions of the underlying process as input into the spline fits. A stride of 100 indicates that the signal is subsampled to 1 every 100 data points. This lets us evaluate the quality of fit against different sampling rates.
IiD Spline degree,
We first examine a synthetic signal uncontaminated by noise, to examine the role of spline degree, , on the interpolated fit. As noted in [6], the degree of the spline sets its roughness. In terms of the power spectrum, this corresponds to the high frequency slope as can be seen in Fig. 3 which shows fits with . Setting produces a high frequency fall off in the spline fit of . Although this would appear to be a desirable feature when fitting to a process with slope , the mean square error is consistently higher.
The bottom panel of Fig. 3 shows the coherence between the spline fit and the true signal. There is no discernible difference in coherence between spline fits with . The coherence quickly drops to near zero at the same frequency in all three cases. The implication here is that the spline fits are essentially producing noise at frequencies above the lossofcoherence. This is why the shallower slopes (with more variance at high, incoherent frequencies) have a larger mean square error than the steeper slopes (with less variance at high, incoherent frequencies). The conclusion here is that smoother is better: it is better to use an unnecessarily high order spline to avoid adding extra noise at high frequencies.
Iii Smoothing Spline
A typical starting point for maximum likelihood is to establish the probability distribution function (PDF) of the errors,
. The canonical example in onedimension (e.g., [9]) is to assume that the error in our position measurements are Gaussian i.i.d. and are therefore drawn from the following probability distribution(10) 
where
is the standard deviation. This assumption alone places no assumptions on the signal itself, only on the structure of the noise.
The probability of the observed data given model is
(11) 
where we have taken .
Maximizing the probability function in (11) is also the same as minimizing its argument—up to a constant this is the log likelihood, called the penalty function
(12) 
Stated in this way it is plain to see that this is the same as asking for the ‘leastsquares’ fit of the errors.
Iiia Smoothing spline penalty function
The model used here will be the canonical interpolating spline of order described in section II. Of course, we have chosen our knot points such that the model intersects the observations and this certainly maximizes (11) (and minimizes (12)) because all the errors are zero, but the resulting distribution of errors (a delta function at zero) does not look anything like the assumed Gaussian distribution. Thus, if we want the error distribution that we get out to look like that which we assumed, it is necessary to constrain the problem in some way.
The smoothing spline augments the penalty function of (12) by adding a global constraint on the th derivative of the resulting function as in (2). If then this reduces to the leastsquares fit in (12), but if then this forces the model to an th order polynomial.
To interpret the first term of (2), consider a motionless particle at true position . Using the relevant observations , the sample mean estimates the particle’s position . The unbiased sample variance estimates the variance of the noise, , and is given by , the expected value of which is .
Now consider the opposite extreme where the particle is moving so fast (or the observations are so sparse) that each observation is completely independent of its neighbors. In this case, each observation must be considered separately, so the sample mean at time is just (i.e., we are summing over the single relevant observation). In this scenario we cannot produce a sample variance, because there is only a single relevant observation at time .
In practice, the number of relevant observations can be anywhere between and . Here we use the term effective sample size, denoted by , to describe the typical number of observations being used to estimate either the particle’s position or the variance of the noise at any given time. In this context, the first term of (2) is proportional to an ensemble of multiple estimates of the sample variance
(13) 
which is expected to scale as
(14) 
where is our definition of the effective sample size as determined from the sample variance. Revisiting the limiting cases, as the sample variance matches the true variance, but as , the sample variance vanishes.
There is a very simple physical interpretation for the second term in (2). Consider the case where so that the smoothing spline is a constraint on velocity. When averaged over the integration time, the integral produces the root mean square velocity, , which means that the second term scales like . In general, where is the rootmeansquare of the th derivative, this means that scales like
(15) 
The interpretation of the smoothing spline is therefore that the two terms are balanced by a relative weighting of the sample variance of the noise and mean square of the th derivative of the physical process. As will be discussed in section IV, both and can be estimated a priori and therefore a good initial estimate for can be made.
IiiB Smoothing spline maximum likelihood
The penalty function for the smoothing spline in (2) can be restated in terms of maximum likelihood under some conditions (see also chapter 3.8 in [10]). Assume that in addition to knowing about how the measurement errors are distributed like in (11), that we also know how the velocity of underlying physical process is distributed. For example, in geophysical turbulence it has been shown that the velocity probability distribution function is like the Laplace distribution [11]. To recover the smoothing spline, we need to consider the case where the velocity PDF is Gaussian. Stated as maximum likelihood, this means that at any given instant (not just the times of observation) we expected the model velocity to look Gaussian. We can discretize the problem by sampling the velocity times , where and . The maximum likelihood is thus stated as
(16) 
which is simply the joint probability of the error distribution from (11) and the velocity distribution of the underlying physical process. We also include parameter for convenience in order to set the relative weighting between the two distributions, although it could be absorbed into the definition of . Writing (16) as a penalty function (after converting the product of exponentials into exponentials of sums), we have that
(17) 
where is a constant. Setting and renormalizing the penalty function by (which has no effect on the location of its minimum), (17) can be written as
(18) 
Apart from the discretization of the integral, (18) is the same as the penalty function for a smoothing spline (2).
There is an important special case when tension is applied at the same order as the spline, . In this case the spline is piecewise constant for with exactly unique values. The parameter and (16) can be simplified. This case is appealing because only the unique values of the derivative that can be computed from data points are being used for tension, which is not the case when .
This maximum likelihood perspective shows that adding tension to the penalty function is equivalent to assuming that one of the higher order derivatives in the model (e.g., velocity if ) is Gaussian. This is therefore making an assumption about the underlying physical process of the model. This is in contrast to the first term which is entirely a statement about measurement noise.
As an aside, writing the smoothing spline as a maximumlikelihood condition (16), suggests that if the underlying physical process has a nonzero mean value in tension, the fit will not behave as expected. However, smoothing splines can be easily modified to accommodate a mean value in tension, as shown in appendix A.
IiiC Optimal parameter estimation
For a given choice of and , the minimum solution to (2) can be found analytically (see [12] and our appendix A). Once the solution is found the smoothing matrix is defined as the matrix that takes the observations and maps them to their smooth values, .
The free parameter is a relative weighting between the two terms in (2) and choosing its optimal value can be done by minimizing the expected mean square error [6],
(19) 
where is the Euclidean norm and indicates the trace.
It is worth noting that a fair amount of the literature on smoothing splines is devoted to minimizing the mean square error when the variance, , is not known. For example, [6] and [13] use crossvalidation to estimate and minimize the mean square error. Recent work comparing different estimators shows that no single technique appears to be optimal [14]. For our application however, the errors in GPS data can be relatively easily established, as shown in section VI.
The mean square error in (19) is a combination of the sample variance and the variance of the mean. As already discussed in the context of the penalty function in section IIIA, the first term in (19) is an ensemble of sample variances, and therefore by combining (13), (14) and (19) we obtain
(20) 
The second term in (19
) is proportional to twice the squared standard error, i.e., the variance of the sample mean. As discussed in
[12], the quantity is the covariance matrix with the squared standard error along the diagonal and thus the mean squared standard error is given by . The variance of the sample mean is known to scale inversely with the number of samples being used to estimate the mean. Thus, we use this to define the effective sample size of the variance of the mean, with(21) 
Taking the measures of effective sample size as functions of , the mean square error can be expressed by combining (19)–(21) such that
(22) 
If one assumes that , then the expected mean square error from (19) is equal to . Although not shown here, in an empirical analysis we find that and are approximately equal, although becomes highly variable when approaches 1.
These measures of effective sample size can be used to estimate the value of
necessary for optimal tension without minimizing the expected mean square error. Note that the definition of effective sample size used here is related to, but not the same as, the notion of degreesoffreedom used in
[15] and references therein.Iv Spline order, tension order, and the spectrum
With a model path (1), a penalty function (2), and a minimization condition (19), we have all the primary pieces to create a smoothing spline interpolant to the data. However, there are a number of choices that still have to be made. In this section we use synthetically generated data to represent our physical process, and contaminate the process with Gaussian noise as described in section IIC. We use this synthetically generated data to test our ability to recover the signal and examine the effects of changing the spline and tension order on the mean square error and the resulting spectrum.
The results of this section are empirical, and it is important to acknowledge upfront that any conclusions reached may depend on our particular choice of physical model that generates the signal which has been chosen to resemble the oceanographic data of interest. Nevertheless, our expectation is that the conclusions here are ‘O(1)’ correct, and applicable, at least, to our GPS tracked drifter dataset.
Iva Tension degree,
Given a smoothing spline of degree , the tension in the penalty function (2) can be applied at any degree . We use the synthetic data for the three different slopes to empirically establish the relationship between the tension degree, and the spline degree, .
For and all we minimize the mean square error against the true values. The minimization is performed for 200 ensembles of noise and signal with three slopes (, , ) and 5 different strides. For a given slope, stride, and realization of noise, we identify the minimum mean square error across and and compare all other values of and as a percentage increase relative to that minimum. After aggregating across slopes, strides, and ensembles, the 68% confidence range is shown in table I.
T  

S  1  2  3  4  5 
1  33.880.3%  
2  14.075.1%  0.812.1%  
3  17.177.5%  1.013.1%  0.04.5%  
4  22.881.9%  1.014.5%  0.04.6%  0.06.3%  
5  27.691.4%  0.815.4%  0.04.6%  0.06.1%  0.012.8% 
IvB Loss of coherence
The lossofcoherence defines the time scale below which the smoothing spline is not providing useful information. A reasonable hypothesis is that this scale is related to the effective sample size, because the effective sample size indicates how many points are being used to estimate the true value. Therefore the lossofcoherence occurs at the effective Nyquist which we define as
(23) 
In practice, we use because it is less variable than for values near 1 and is the more direct measure of how many points are being used to estimate the model path. Fig. 4 shows the power spectrum and coherence of optimal tension fits for three different strides of the data. In all three cases (23) indicates almost exactly where the coherence drops below 0.5.
IvC Reduced spline coefficients
One practical consideration when working with large datasets is that the computational cost of creating the spline fit may be limited by the rate of solving for the spline coefficients. It is therefore beneficial to reduce knot points (and therefore total splines) where possible. A reasonable hypothesis is to suppose that when the effective sample size is large, as measured by (21), that we may be able to avoid placing a knot point at every data point—essentially ‘skipping’ data points.
To test this idea, we find the optimal fit over a range of different strides (which varies the effective sample size) and increase the number of knot points that are skipped until the mean square error starts to rise. We find that we can safely skip knot points without sacrificing any precision. In fact, as can be seen in table II, in some cases the optimal mean square error improves with fewer knot points. The ‘full dof’ column indicates a fit where one knot point is created for every observation point, whereas the ‘reduced dof’ indicates a fit where the number of knot points is reduced.
This means that when handling large datasets, we can reduce the number of splines being used if the effective sample size is large, and we can simply ‘chunk’ the data (split into multiple independent pieces) when the effective sample size is small.
IvD Interpolation condition
To estimate the value of from (15), we require an estimate of the mean square value of a derivative of the process, as well as an estimate of the effective sample size, . Assuming one can make an estimate of from the signal (see appendix C), we just need a method for estimating the effective sample size.
We argue that the effective sample size should vary based on the relative size of the measurement errors to the speed of motion. For example, if the position errors are only meter, but a particle typically travels meters between measurements, then it is hardly justifiable to increase the tension so that the smoothing spline misses the observation points by meter. There is not enough statistical evidence to suggest that the particle didn’t go right through the observation point. On the other hand, if the position errors are meter, but the particle typically travels centimeters between measurements, nearby measurements provide more information about the particle’s true position during that time, so our estimate of the particle’s true position is closer to a mean of the nearby observations.
This idea can be made more rigorous by noting that one would consider change in position, , statistically significant if it exceeds the position errors by some factor. Assuming the physical process has a characteristic velocity scale, , we use this concept to define as
(24) 
where is the typical time between observations. This argument suggests that the effective sample size should be proportional to , i.e.,
(25) 
where and are unknown constants, and we prevent the effective sample size from dropping below 1. Intuitively this means that as long as the particle does not move too far between observations, nearby observations help to estimate the true position of the particle.
To test the relationship between and the effective sample size, we compute the optimal smoothing spline for a range of values of (created by subsampling the signal) for the three different slopes (, , ). The value is computed from the optimal solution for 50 ensembles and shown in Fig. 5. The fits are remarkably good, but depend on the slope of process. Processes with shallower slopes (rougher trajectories) provide a smaller effective sample size for a given value of .
Using the interpolation condition to estimate the effective sample size, we set , the empirically determined best fit for slope . For all spline fits then, we use
(26) 
as an initial estimate for the optimal smoothing parameter where both in (26) and in (24) are estimated using the method described in appendix C.
The scaling law for can be found analytically. Let the position observations be given by where
(27) 
If the effective sample size is , then the particle changes position by between samples. Applying the twosample test, two positions will be considered different for where
(28) 
The power law in (28) matches the empirically derived power laws shown in Fig. 5 and suggests that in (25) should be . This also suggests that the coefficient in (25) can be related to , a measure of statistical significance.
IvE Optimal fits
stride  optimal mse  reduced dof  blind initial  expected mse  

1  8.6  11.5 m  0.1%  56.4%  7.4% 
2  4.9  20.4 m  0.0%  36.3%  2.8% 
4  2.9  34.2 m  0.1%  20.0%  1.7% 
8  1.7  55.9 m  0.0%  5.6%  1.0% 
16  1.2  81.8 m  0.0%  3.6%  0.5% 
1  12.5  7.64 m  0.1%  38.6%  6.4% 
2  7.1  13.4 m  0.1%  20.4%  3.5% 
4  4.1  23.5 m  0.0%  9.8%  2.2% 
8  2.3  41.8 m  0.0%  1.7%  1.2% 
16  1.4  67.9 m  0.0%  9.6%  0.6% 
1  15.6  5.69 m  0.1%  33.8%  7.9% 
2  9.0  10.5 m  0.1%  18.6%  5.1% 
4  5.0  18.6 m  0.0%  8.6%  2.4% 
8  2.8  33.2 m  0.0%  3.2%  1.5% 
16  1.6  57.6 m  0.0%  15.4%  0.8% 
Table II summarizes the key results of this section by applying a smoothing spline with with to the 200 ensembles of the noise and signal with three different slope (, , ) and five different strides. The second and third columns show the effective sample size and average mean square error when the smoothing spline is applied using the true values to minimize the mean square error—this is the lower bound. The fourth column shows average increase in mean square error when reducing the number of spline coefficient as documented in section IVC. There is almost no change in mean square error and therefore all subsequent methods (whether blind or unblind) use this technique. The fifth column uses (26) from section IVD to provide a (blind) initial guess of the tension parameter. Here the results are mixed—a typical increase in mean square error is about 3050% when the effective sample size is large. While this might seem large, this is a small fraction of the total variance of the noise, e.g., an optimal mean square error of m increase to m when the total variance is m. When the data sets are small (and computation time is not a limiting factor), nearly optimal fits can be found using (19), as shown in the last column of the table.
IvF Numerical implementation
The numerical implementation of the methods in this section are available in the SmoothingSpline class which subclasses BSpline. This class is initialized with three required parameters: a set of data points
and a distribution (specifically a normal distribution for the results in this section). The initial value of
is chosen using (26). The SmoothingSpline class implements a .minimize() method which takes any function of the spline as an argument (such as (19)), and minimizes the function by varying .V Bivariate smoothing splines and stationarity
Up to this point we have considered univariate data, , but GPS position data is fundamentally bivariate. The term ‘bivariate’ in the context of splines is often used to denote splines defined on two independent variables—however, in this context we define bivariate to mean two dependent variables (e.g., and ) and one independent variable (e.g., ).
The trivial approach to working with such bivariate data is to treat each direction independently—i.e., minimize and independently of each other. However, it is often the case that the underlying physical process is isotropic. In the context of the maximum likelihood formulation of smoothing splines (18), this means that we expect (the rms value of the tensioned variable) to be the same in all directions (invariant under rotation). This however does not mean that should necessarily equal . To be explicit, if
(29) 
then even if , the effective sample sizes and will not necessarily be equal if there is any mean velocity because, as shown in section IVD, the effective sample size depends on velocity.
Therefore to assume isotropy in and use a bivariate smoothing spline, the mean velocity from the underlying process must be removed. What qualifies as mean and fluctuation rarely has a clear answer, but a reasonable option is letting a polynomial of degree define the mean. This has the added benefit of removing a constant nonzero tension value, which as shown in section IIIB, changes the problem formulation.
It is worth noting that it is not actually isotropy that requires removing the mean velocity, but in fact stationarity. The effective sample size is shown to be dependent on rms velocity, so if the velocity varies in time, then the optimal effective sample size will need to vary as well. This means that not only do smoothing splines require stationarity in the tensioned variable as shown in section IIIB, but they also require stationarity in the velocity to be effective. This last requirement can be solved by either removing the mean (as we have suggested), or segmenting observations into pseudostationary chunks.
Va Assessing errors
Removing the mean or some other lowpassed version of the data means that the total smoothing matrix will be some combination of the lowpassed and highpassed smoothing matrices. Once this matrix is computed, it can be used to compute the standard errors.
We first create a low pass filter to capture the mean component of the flow using a simple polynomial fit,
(30) 
and then define the residual as our stationary part,
(31) 
We now compute the smoothing spline as usual on the residual,
(32) 
So the total, smoothed path is
(33) 
From this we can compute the covariance matrix and the standard error.
VB Numerical implementation
The BivariateSmoothingSpline class is initialized with data and a distribution. For a spline of degree , a spline of degree is used to remove the mean in each direction. In the case of a normal distribution, this is simply a least squares polynomial fit. By assumption, the residual data (, in the notation above) is stationary and isotropic, so the tension parameter is applied equally to spline fits in the two directions. Minimization is performed on the sum of the expected mean square error in both directions.
Vi GPS data set
The primary dataset considered here will be nine surface drifters that were deployed in the Sargasso Sea in the summer of 2011 [1]. In the past, such drifters used the Argos positioning system which has significantly poorer temporal coverage and position accuracy [16], but recently the majority of surface drifters have employed GPS receivers and transmitted their data back through Argos or Iridium satellites.
The GPS receiver sits on the surface drifter and collects position data, but because of atmospheric conditions or ocean waves, the receivers are sometimes unable to obtain a position, or when they do, it is highly inaccurate. Thus, despite nominal accuracies of a few meters, it is often the case that some positions are off by more than 1000 meters, as can be seen in Fig. 8. Applying a smoothing spline fit using the methodology in section III produces an extremely poor fit, with clear overshoots to bad data points.
Via GPS error distribution
We characterize the GPS errors by considering data from a motionless GPS receiver allowed to run for 12 hours. The specific GPS receiver used for this test was not the same as the one used for the drifters (because it was no longer available) but should produce errors similar enough for this analysis.
The position recorded by the motionless GPS are assumed to have isotropic errors with mean zero, which means that the positions themselves are the errors. The probability distribution function (PDF) of the combined and position errors are shown in Fig. 6.
The error distribution is first fit to a zeromean Gaussian PDF (10). The maximum likelihood fit is found by simply computing the standard deviation of the sample, which is found to be meters and shown as the gray line in Fig. 6. However, it is clear the error distribution shows much longer tails than the Gaussian PDF.
The Student distribution is a generalization of the Gaussian that produces longer tails and is defined as
(34) 
where the parameter scales the distribution width and the parameter sets the number of degrees of freedom. The variance is and only exists for . The distribution is equivalent to the Gaussian distribution when . We find the best fit distribution to the data by minimizing the AndersonDarling test. The best fit with parameters meters and is shown as the black line in Fig. 6. Different choices in GPS receivers and using the KolmogorovSmirnoff test results in very similar parameters, i.e., meters and .
The position error distributions also imply a combined distance error distribution by computing and is shown in the lower panel of Fig. 6. For two independent Gaussian distributions this results in a Rayleigh distribution,
(35) 
The distance distribution for two distributions is computed numerically and is shown in the bottom panel of Fig. 6 on top of the actual distance errors. Approximately 95% of distance errors are within meters.
Fig. 7 shows the autocorrelation function of the GPS position errors. We find a rough empirical fit to be where seconds and seconds, which reflects an initially rapid fall off in correlation, followed by a slower decline. The smallest sampling interval of the GPS drifters in question is 30 minutes and therefore it is safe to assume the errors are uncorrelated for our purposes. Although the drifter sampling rate allows us to avoid further discussion of the autocorrelation function of GPS errors, accounting for autocorrelation is a relatively easy extension (and in fact, already implemented in the code).
The smoothing spline algorithms described in section III are modified to use the distribution as described in section B. Table III shows that the conclusions reached for Gaussian data in section III still apply with distributed data.
stride  optimal mse  reduced dof  blind initial  expected mse  

1  8.2  11.8 m  0.3%  66.7%  7.7% 
2  4.7  20.9 m  0.3%  47.3%  6.6% 
4  2.8  38.0 m  0.1%  24.2%  4.4% 
8  1.6  66.3 m  0.0%  8.2%  9.3% 
16  1.2  101. m  0.0%  8.1%  3.7% 
1  12.1  7.51 m  0.1%  36.2%  8.8% 
2  6.8  13.4 m  0.1%  22.8%  7.0% 
4  3.9  26.0 m  0.0%  11.5%  3.8% 
8  2.2  47.5 m  0.0%  2.2%  3.2% 
16  1.3  82.5 m  0.0%  12.6%  8.5% 
1  14.9  6.01 m  0.2%  35.3%  9.0% 
2  8.6  10.5 m  0.2%  24.8%  7.0% 
4  4.8  19.1 m  0.1%  7.8%  4.6% 
8  2.7  36.4 m  0.0%  3.2%  2.7% 
16  1.6  69.1 m  0.0%  18.9%  11.5% 
Vii Minimization with Outliers
The goal here is to find a smooth solution in the presence of outliers—points that do not appear to be of the known error distribution for the GPS receiver shown in section VIA. These points are obviously problematic as can be seen in Fig. 8, where individual data points jump hundreds of meters and even several kilometers away from its neighbors. Errors of this size are inconsistent with the noise analysis of the preceding section, so the goal here is to find a model path robust to this uncharacterized noise. What makes outliers ‘obvious’ to the eye is that they appear as unexpectedly large motions, inconsistent with most of the other motion for that path. In this sense, the smoothing spline formulation is a good one as it assumes the motion at some order (e.g., acceleration) is Gaussian, as shown in section IIIB. Interestingly, in the nine drifters we are analyzing here, one drifter shows no obvious outliers, suggesting the issue may be related to how the antenna is configured. This particular drifter serves as a useful point of comparison.
Minimizing with the expected mean square error (19) produces a fit so poor that it is not worth showing. Because outliers add enormous amounts of variance, the expected mean square error vastly under tensions the spline—essentially chasing every outlier shown in Fig. 8. Because some of the noise is uncharacterized, this suggests using a method such as crossvalidation might be effective. The orange line in Fig. 8 uses a smoothing spline fit, assuming Student tdistributed errors, but minimized with crossvalidation. This fit performs relatively well, but compared with the drifter 7, it is clear that it still chases some outliers. The goal in this section is to develop a method robust to outliers in cases where we know something about the noise.
The basic problem formulation is as follows: we define a new ‘robust distribution’, , that includes the known noise distribution, , plus an unknown (or assumed) form of an outlier distribution, ,
(36) 
We consider a distribution for with parameters found from the GPS errors in section VIA. The distribution of is also set to be a distribution, but with and which roughly matches the total variance of the observed outliers. In our tests we varied from 0 up to , approximately the range of observed outliers from the drifter data sets.
Throughout our attempts to smooth the noisy GPS data we tried many different approaches to modifying smoothing splines for robustness to outliers, but ultimately found that enormous gains in accuracy are made by simply discarding outliers while minimizing the expected mean square error (19). The results of this approach are shown in section VIIA, but we also document our methodology to reliably estimate the outlier distribution in section VIIB.
Viia Robust minimization
The whole problem with outliers is that we do not know their distribution, so minimizing the expected mean square error using (19) with the expected variance from the robust distribution defined in (36) cannot possibly work. Outliers add extra variance, and will therefore cause the spline to be under tensioned ( too small). The key concept behind our method is to simply exclude the outliers from the calculation of (19), where outliers are defined as points unlikely to arise with the known noise distribution. The ranged expected mean square error thus replaces with,
(37) 
and discards all rows (and columns) of where or .
To test this approach we generate data as before, but now also let a certain percentage of outliers () be generated with an outlier distribution following (36). We consider five different values of as well as , which is just (19). Tests across a number of ensembles with outlier ratios we find that is overall the best choice.
ViiB Full tension solution and outlier distribution
The full tension solution is defined as the maximum allowable value of given the known noise distribution. That is, the spline fit is pulled away from the observations so that the distribution of observed errors () matches the expected distribution . In cases where the effective sample size is large, the full tension solution will approximately match the optimal (minimal mean square error) solution. In cases where the effective sample size is small, the full tension solution is more akin to a lowpass solution (because increasing is equivalent to decreasing ).
In the simplest case where there are no outliers, the full tension solution can be found by requiring that the sample variance match the variance of . When outliers are present, a more robust method of estimation is required. After some experimentation, we found that the most reliable method of achieving full tension is to minimize the AndersonDarling test of on the interquartile range of observed errors. In fact, we found that this method can be used to estimate the outlier distribution and further refine both the full tension solution and the range over which the expected mean square error is computed.
The outlier distribution is estimated in the following fashion. We first assume that the outlier distribution follows a distribution with and that . If the spline is in full tension, then the observed total variance can be used to find for the outlier distribution. From (36) it follows that,
(38) 
which, given some , can be solved for . Our method considers 100 different values of logarithmicaly spaced from to and chooses the value which minimizes the AndersonDarling test.
With an estimate for , the full tension solution can be refined by now minimizing the AndersonDarling test of on the interquartile range of observed errors. This iterative process converges quite quickly on a good estimate for the outlier distribution and the full tension solution.
ViiC Extension to bivariate data
The strategies in this section are relatively easily extended to bivariate data. All error distributions are assumed isotropic, and thus the outlier distribution can be estimated by including the errors from both independent directions. The ranged expected mean square error calculation defined in section VIIA uses the distance of the error for its cutoff in order to remain invariant under rotation.
Application of this methodology to one of the GPS drifters (drifter 6) is shown in Fig. 8. Although it is impossible to know exactly how well the smoothing spline fit performed, comparison with drifter 7 (with no apparent outliers) suggests that our methodology successfully avoids chasing outliers.
ViiD Numerical implementation
The GPSSmoothingSpline inherits from the BivariateSmoothingSpline class and assumes the errors follow the distribution found in section VIA. The class also projects latitude and longitude using a transverse Mercator projection with the central meridian set to the center of the dataset.
Viii Conclusions
The methodology manuscript solves our initial problem of finding smoothed, interpolated positions from our noisy GPS drifter dataset with outliers. For signals similar to the Matérn process, we found that

the spline degree should be set to a value higher than the high frequency slope of the process (section II)

the tension degree can be set to (section IV), and

the optimal tension parameter can be estimated a priori (also section IV).
For the GPS data in particular, there appear to be three key steps for using smoothing splines to achieve these results:

using a distribution for the noise (section VI),

removing the mean velocity to make the bivariate data stationary (section V), and

using the ranged expected mean square error for robustness to outliers (section VII).
The effective Nyquist identified in section IVB indicates that the power spectrum for the GPS drifters resulting from the smoothed fit is valid up to about half the nominal sampling rate.
Appendix A Numerical implementation
The Bsplines are generated using the algorithm described in [4] with knot points determined by (7) and (8). The matrix with components denotes the th Bspline at time
. In this notation the column vector
represents the coefficients of the splines such that positions at time are given by where .The smoothing spline condition given in (16) can be augmented to include a nonzero mean tension, ,
(39) 
where we have taken for this calculation. The discretized penalty function is
(40) 
where denotes the covariance matrix describing the measurement errors and we absorbed several constants into . To find the coefficients that minimize this function, we take the derivative with respect to , set it to zero, and solve for ,
(41) 
where is a vector of s. The operation essentially integrates the splines and results in a column vector with the integrated values.
We define the smoothing matrix as the linear operator that takes observations to their smoothed values ,
(42) 
From this definition and (41),
(43) 
when .
Appendix B Iteratively reweighted least squares
In practice it is challenging to use the distribution directly because it does not result in a linear solution for the coefficients as in (41). One method around this issue is to use a search algorithm to directly look for the maximum values. Alternatively, one can use the iteratively reweighted least squares (IRLS) method.
The idea with IRLS is to reweight the coefficients of the Gaussian, in (10), so that the resulting distribution looks like the desired distribution, e.g., (34). Recalling that , the minimization condition that , implies that
(44) 
for the Gaussian distribution, whereas for the distribution this implies that,
(45) 
This means that one can set
(46) 
to get a matching distribution. Of course, this is only true if is already known, which initially it is not. So the method becomes iterative—one starts with determined from the Gaussian fit, then determine a new after reweighting . This method iterates until stops changing. We can rewrite (46) as a function of ,
(47) 
From (47) it is clear that if then it will be reweighted to a smaller value, essentially making the observation point more strongly weighted. On the other hand, if , then its relative weighting will decrease, and it will be treated more as an outlier.
More generally, the weight function for a pdf is found by setting equal to of a Gaussian pdf where replaces , and then solving for . The result is that,
(48) 
Note that the same strategy could be used to reshape the pdf of a Gaussian to match the desired distribution, but here we simply match the minimization conditions of the pdfs.
As a point of reference, Tukey’s biweight is given by,
(49) 
which, as a weight function is,
(50) 
In a practical sense, the of (43) is replaced with the diagonal matrix populated with the reweighted values for each observation such that,
(51) 
This operator is again used to compute the standard error from the variances, , where the variance is assumed to be for each observation when using a distribution.
The reality is that the smoothing spline solution does depend on the initial value of used in the IRLS method. That said, we find that for uniform initial weightings (e.g., all values start with the square root of the variance), the differences are not statistically significant from other initial values.
Appendix C Estimating the variance of the signal
The method in this paper depends on good estimates of the rootmeansquare velocity, , of the signal in order to determine the effective degrees of freedom, as well as the variance of the tensioned derivative. The approach taken here is to compute the power spectrum of the signal at the derivative of interest, and sum the variance that is statistically significantly greater than the expected variance of the noise.
Given a process observed with values at times where , we estimate the mean of its th derivative by performing a least squares fit to the polynomial . The detrended time series is then defined as . The power spectrum of this time series is given by
(52) 
where the frequencies are given by . By Plancherel’s theorem,
(53) 
The power spectrum of the th derivative of the process is computed as
(54) 
Note that it is important to detrend the signal prior to computing the derivative because, by assumption, the signal is periodic and has no secular trend.
The noise, , has total variance . Because the noise is assumed to be uncorrelated, the variance distributes evenly across all frequency. The spectrum of the noise is therefore