1 Introduction
When continuous-time records are separated into natural consecutive time intervals, such as days, weeks, or years, for which reasonably similar behavior is expected, the resulting functions may be described as a time series. For this kind of functional time series, each observed trajectory is a random function. Researchers have proposed a variety of prediction methods for stationary functional time series. Besse, Cardot, and Stephenson (2000) proposed a nonparametric kernel predictor. Antoniadis and Sapatinas (2003) studied first-order functional autoregression curve prediction based on a linear wavelet method. Kargin and Onatski (2008) introduced the predictive factor method. Aue, Dubart Norinho, and Hörmann (2015) proposed to use multivariate techniques.
Functional data sometimes exhibit two types of variability: amplitude variability, which corresponds to the sizes of features of curves, and phase variability, which pertains to variation in the locations of curve features. For example, Figure 1 presents the smoothed curves of annual ocean surface temperatures for seven consecutive years, 1957 to 1963, in the Niño 1+2 region, which is between the International Date Line and
W. Each curve has a peak and a valley, corresponding to the hot season and the cold season. The timing of the hot and cold seasons varies from year to year, so it is important to account for phase variability in this case. However, existing research considers only the amplitude variability of functional time series, not the phase variability. An immediate consequence is that the predicted curve may not exhibit the common pattern of the population. When trajectories share a common pattern and meanwhile present phase variation, a typical technique is functional registration, which seeks to separate the total variability into two categories, amplitude variability and phase variability (see, e.g., Srivastava and Klassen (2016)). To the best of our knowledge, methods for prediction in functional data have not incorporated curve registration. To overcome this limitation, we develop a novel method for stationary functional time series whose trajectories share a common pattern. Our goal is not only to give competitive prediction in terms of mean squared error, but also to preserve the underlying pattern in the predicted curve.
The prediction method in this article involves the prediction of amplitude functions and warping functions. The major challenge is the prediction of warping functions, since they do not lie in a linear space, and thus ordinary linear models are not applicable. Warping functions must be monotonically increasing, and they are constrained to start and end at two fixed values. There are several ways to model warping functions. Generally speaking, all of these methods seek to apply linear models to nonlinear objects.
It is noted that warping functions share similar properties with probability distribution functions, and there are several papers on modeling probability density functions. A typical idea in this line of work is to use a transformation ensuring that the transformed objects lie in a linear space. Brumback (2004) proposed a self-modeling method for monotone functions involving the transformation of Jupp (1978), a bijective map from the space of monotone increasing vectors to Euclidean space. Gervini (2015) used the Jupp transformation to study warped functional regression. In these works, the authors apply the Jupp transformation to turn increasing vectors into unconstrained vectors. Petersen and Müller (2016) proposed the log quantile density transformation and the log hazard transformation to map a density function into a linear Hilbert space. Guégan and Iacopini (2018+) proposed a nonparametric prediction method for probability density functions using a centered-log transformation. Another approach is to study the manifold structure of warping functions, and there is some work on linear modeling for manifolds. Cheng and Wu (2012) used local linear regression models to study the regression problem in which the covariate lies on an unknown manifold. Dai and Müller (2018) studied spherical PCA: they proposed to apply fPCA to the tangent vectors at the Fréchet mean of the sphere, and then use the inverse exponential map to transform tangent vectors back into manifold objects. The square roots of slope functions (SRSFs) of warping functions have unit norm, and thus lie on an infinite-dimensional sphere, making it reasonable to apply spherical PCA to SRSFs.
However, all these methods have limitations. A common characteristic of the transformation-based methods is that the transformations all involve the logarithm, sometimes necessitating a further rescaling step. The major drawback of the log function is that it strongly compresses the image near zero, which makes that region nearly impossible to predict. Besides, density functions lie in a nonlinear space, and it is unnatural to use linear models on them directly. Regarding the manifold approach, since the SRSFs of warping functions form only the positive orthant of a sphere, it is impossible, without constraints, to find a linear model with homogeneous coefficients for prediction. One might also consider applying the functional linear mixed effects model (see Guo (2002)), in which each trajectory is a linear combination of shifted template functions. This method, however, cannot guarantee that the resulting functions share the same pattern. All of these problems motivate us to find a new methodology to predict the stochastic process composed of warping functions.
We develop a novel method that jointly predicts the amplitude and warping functions. The major advantage of our method is that it does not require any unnatural transformations, and it keeps the predicted warping functions strictly in their original nonlinear space. We first implement functional registration to obtain amplitude and warping functions. To predict warping functions, we propose a state-space model in which the states are driven by a Markov chain. Spherical k-means clustering is applied to reduce the dimension of the warping functions. In the model, we use finitely many prototypes to represent the nonlinear manifold of warping functions, and we assume each warping function can be expressed as the sum of its corresponding prototype and a random error. For the prediction of amplitude functions, we propose an FAR model with switching coefficient operators, in which the states of the warping functions determine the coefficient operators. The predicted warping functions and amplitude functions are combined to obtain the final prediction.
In this article, several other issues will be addressed:

Since the real states in the state-space model are unknown in practice, the transition probability matrix of the hidden Markov chain has to be estimated through the estimated states instead of the real states. What can be said about the large-sample behavior of the estimator?

For the prediction of amplitude functions, is the fFPE criterion proposed by Aue et al. (2015) still applicable?

How can we quantitatively justify that the proposed method can preserve the common pattern well?
We give the solutions in the remainder of the paper. We study the asymptotic properties of the least squares estimator of the stochastic matrix in the state-space model, with misclassification taken into account, and show that the fFPE criterion can still be applied for this method under some mild conditions. We identify the quantity for which the estimator is consistent, and derive the asymptotic distribution of the estimator. Based on the definition of shape space proposed by Srivastava and Klassen (2016), we define the functional shape space as a quotient space with respect to time warping, and we propose to use the amplitude distance to measure similarity in pattern/shape between two functions
and : where the superscript means “minimized”, is a warping function, and denotes the norm induced by the Fisher–Rao metric. The prediction error is also reported for comparison of prediction accuracy.
The rest of the paper is organized as follows. In Section 2, we describe the modeling procedure for the stochastic processes of warping functions and amplitude functions. In Section 3, we present the joint prediction algorithm and a method of order selection for the state-space model. In Section 4, we introduce the shape space framework and the reasoning for using the amplitude distance to measure shape similarity. In Section 5, we derive the asymptotic properties of the least squares estimator of the stochastic matrix in the state-space model. In Section 6, we report the results of a simulation study comparing the prediction performance of the new method and an amplitude-only prediction method. In Section 7, we report the results of a real data analysis of annual ocean surface temperatures. Additional simulation results and the technical proofs can be found in the appendix.
2 Stochastic Process of Phase and Amplitude Variability
2.1 Amplitude and phase variability framework
In what follows, let be an arbitrary stationary functional time series defined on a common probability space , where the time parameter is discrete and the parameter is continuous. We assume the following decomposition
The observations are elements of the Hilbert space equipped with the inner product , and the norm of each satisfies . Define the mean curve and covariance function pointwise through
The warping functions have the following property: , , is invertible, and both and are smooth. Let denote the set of all such functions. The square root of slope function (SRSF) of is defined as
and a SRSF can be transformed back into a warping function by applying to it
where is a bijective map, and is the firstorder derivative of . It can be shown that .
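As a small numerical sketch (in Python; the grid and the example warping are our own illustration, not from the paper), the map from a discretized warping function to its SRSF and back can be written as:

```python
import numpy as np

def srsf_of_warping(gamma, t):
    """SRSF of a warping function: s(t) = sqrt(gamma'(t))."""
    dgamma = np.gradient(gamma, t)
    return np.sqrt(np.maximum(dgamma, 0.0))

def warping_from_srsf(s, t):
    """Inverse map: gamma(t) = integral_0^t s(u)^2 du, by the trapezoid rule."""
    sq = s ** 2
    gamma = np.concatenate(([0.0], np.cumsum(0.5 * (sq[1:] + sq[:-1]) * np.diff(t))))
    return gamma / gamma[-1]  # renormalize so the warping ends at 1

t = np.linspace(0.0, 1.0, 501)
gamma = t ** 2                        # an illustrative warping of [0, 1]
s = srsf_of_warping(gamma, t)         # here s(t) = sqrt(2 t)
gamma_back = warping_from_srsf(s, t)  # recovers gamma up to discretization error
```

Because the integral of s(u)^2 equals gamma(1) - gamma(0) = 1, the SRSF of a warping function has unit norm, which is the sphere constraint exploited below.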
Remark: In practice, we only observe , so we need to apply a functional registration algorithm to obtain and . In the following, we assume both and have already been obtained. There are several available functional registration methods (e.g., Ramsay and Silverman (2005), Srivastava and Klassen (2016), and Chakraborty and Panaretos (2017)); we implemented the method of Srivastava and Klassen (2016) because it avoids the over-registration problem.
2.2 State-space model for warping functions
Since are not in a linear space, linear methods are not appropriate, and we need to take the manifold structure of warping functions into account. To do so, we study the SRSFs of , whose manifold is the positive orthant of a sphere. We propose a state-space model under the following assumptions.

The process is driven by a Markov chain, and each state of the Markov chain is associated with a fixed prototype warping function.

The hidden Markov chain is an irreducible and ergodic process with a finite number of states;

’s are random error functions with , and given , is independent of and , , and are such that the resulting functions are still warping functions.
Assume the Markov chain has states; then each state can be represented by a state-indicating vector . is a -dimensional vector satisfying and for . The state-space model is specified as follows:
here, are the prototype warping functions. These prototypes can be viewed as a set of basis functions for . is the stochastic transition probability matrix of the Markov chain.
2.2.1 Estimation of the state-space model
Since the hidden states and the transition probability matrix are unknown in practice, we first estimate the ’s and ’s, and then . We apply spherical k-means clustering to the SRSFs of the warping functions, and use the cluster centroids as estimators of the SRSFs of the ’s. The estimators of the ’s can be obtained by applying to the cluster centroids,
where is the centroid of the th cluster of SRSFs. The cluster assignments of are taken as the estimated states of . More details are discussed below.
The standard spherical k-means clustering aims to minimize
over all assignments of objects to clusters and over all SRSF representations of prototype warping functions ; the selection of is discussed below. A typical projection-and-minimization procedure is iterated to obtain the estimators of the unknown ’s and ’s. is a -dimensional vector in which only the th element is and the remaining elements are zero. We then estimate by the least squares method, with replaced by , that is,
The number of hidden states is unknown in practice, and we propose a cross-validation method in Section 3.4 to select it. We assume the selected is correct, and we will not distinguish between the selected and the real number of states. Note that the spherical k-means clustering algorithm can be implemented with the R package skmeans (see Hornik et al. (2012)). The estimation procedure is summarized as follows:
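The clustering step of this procedure can be sketched as follows (a toy spherical k-means in Python; in practice the paper points to the R package skmeans, and the data and initialization indices here are an illustrative choice):

```python
import numpy as np

def spherical_kmeans(X, K, init_idx, n_iter=50):
    """Toy spherical k-means: rows of X are discretized SRSFs (unit vectors).
    init_idx gives the indices of the initial centroids (illustrative choice)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = X[list(init_idx)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.argmax(X @ C.T, axis=1)      # assign by cosine similarity
        for k in range(K):
            members = X[labels == k]
            if len(members):
                m = members.sum(axis=0)
                C[k] = m / np.linalg.norm(m)     # renormalized mean stays on the sphere
    return labels, C
```

The centroid rows of `C` play the role of the estimated SRSFs of the prototype warping functions, and the label sequence plays the role of the estimated state sequence.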
2.3 FAR process for amplitude functions
Recall that the amplitude functions are defined in . The notation indicates that, for some , . By spectral decomposition, we have the following expression of the covariance operator of any ,
where
are the eigenvalues (in strictly descending order) and
are the corresponding normalized eigenfunctions, so that
and . To satisfy the conditions of Mercer’s theorem, we assume the covariance operator to be continuous. The set forms an orthonormal basis of . Then, by the Karhunen–Loève theorem, admits the representation
The coefficients in this expansion are called the fPC scores of .
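As an illustrative sketch (Python; the grid, curves, and truncation dimension are made up for the example, not taken from the paper), the empirical eigenelements and fPC scores can be obtained from the discretized sample covariance:

```python
import numpy as np

def fpca(Y, dt, d):
    """Empirical fPCA: rows of Y are curves on an equally spaced grid with step dt.
    Returns the leading d eigenvalues, eigenfunctions (unit L2 norm), scores, mean."""
    mu = Y.mean(axis=0)
    Yc = Y - mu
    K = (Yc.T @ Yc) / len(Y) * dt          # discretized covariance operator
    lam, V = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:d]
    phi = V[:, order] / np.sqrt(dt)        # normalize in L2 rather than in R^m
    scores = Yc @ phi * dt                 # inner products of centered curves with phi
    return lam[order], phi, scores, mu
```

The Karhunen–Loève truncation `mu + scores @ phi.T` then gives the finite-dimensional representation used for prediction.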
Without loss of generality, we assume the mean of the functions is zero. The higher-order FAR() model is defined by the stochastic recursion,
There are two basic assumptions: (1) is an sequence in with , and (2) the operators
are such that the above equation possesses a unique stationary and causal solution. Here, we adopt the procedure of Aue et al. (2015), who fit a vector autoregressive model (VAR()) to the empirical fPC score vectors , where the superscript “” means empirical. The algorithm can be summarized as follows:
The selected of is the minimizer of the fFPE (final functional prediction error) criterion function
where is the estimator of , say, , where are the prediction residuals.
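For concreteness, the VAR fit to the score vectors and an fFPE evaluation might look as follows (Python sketch; we write the fFPE as (n + pd)/(n − pd) times the trace of the residual covariance plus the sum of the discarded eigenvalues, which matches our reading of Aue et al. (2015) but should be checked against that paper before use):

```python
import numpy as np

def fit_var(S, p):
    """Least squares VAR(p) fit; rows of S are fPC score vectors over time."""
    n, d = S.shape
    X = np.hstack([S[p - j - 1:n - j - 1] for j in range(p)])   # lagged regressors
    Y = S[p:]
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B, Y - X @ B

def ffpe(S, p, tail_eigenvalues):
    """fFPE in the assumed form: penalized in-sample error plus discarded variance."""
    n, d = S.shape
    _, resid = fit_var(S, p)
    tr_sigma = np.trace(resid.T @ resid) / (n - p)
    return (n + p * d) / (n - p * d) * tr_sigma + float(np.sum(tail_eigenvalues))
```

Minimizing `ffpe` jointly over the order `p` and the number of retained scores `d` gives the data-driven tuning used for the amplitude predictor.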
3 Joint Prediction Methodology
After separating the phase and amplitude components, it is natural to consider how to predict the two components jointly, since the two sequences are not necessarily independent of each other. Here, we propose the shape-preserving (SP) method, a novel prediction algorithm that jointly predicts the amplitude and phase functions of future curves. Since warping functions and amplitude functions are defined in two different spaces, we need to find a common space for the two kinds of functions for the joint prediction.
3.1 Prediction of warping function
We convert the stochastic process of warping functions into a Markov chain by applying spherical k-means clustering to their corresponding SRSFs, as discussed in Section 2.2. In order to incorporate the correlation between phase and amplitude variability, we also assume the same kind of state-space model for the sequence of amplitude functions, and apply k-means clustering to estimate the hidden states of the amplitude functions. As before, the cluster assignments are treated as estimates of the hidden states. Figure 2 shows the framework.
where indicates the true state and indicates the estimated state, and the superscripts and refer to amplitude and phase variability, respectively. The two sequences could be correlated owing to the dependence between phase and amplitude variability. We combine the two categorical sequences to obtain a new sequence, , in which denotes the Kronecker product.
We propose to use the least squares method to estimate the transition matrix of this combined estimated Markov chain, where is a matrix, is the number of states of phase variability, and is the number of states of amplitude variability. When the sample size is small, ad hoc adjustments may be needed to ensure that the estimated matrix satisfies the constraints of a stochastic matrix. We can make the adjustment by solving the optimization problem
where is the set of all probability transition matrices, is the Frobenius norm, and is the original least squares estimator of . The predicted state is
Suppose is the predicted state-indicating vector of the next warping function, which is obtained from ; then the predicted warping function is
where ’s are the estimated prototype warping functions.
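The ad hoc adjustment above is a Euclidean projection onto the set of stochastic matrices; since the Frobenius norm separates across rows, it reduces to projecting each row of the least squares estimate onto the probability simplex. A sketch (Python; the sort-based routine is the standard simplex-projection algorithm, an outside tool rather than something specified in this paper):

```python
import numpy as np

def project_row_to_simplex(v):
    """Euclidean projection of a vector onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def project_to_stochastic(P):
    """Nearest transition matrix in Frobenius norm: project each row."""
    return np.vstack([project_row_to_simplex(row) for row in P])
```

A row that already satisfies the constraints is left unchanged, so the adjustment only activates when the raw least squares estimate is infeasible.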
3.2 Prediction of amplitude function
Without loss of generality, we assume the amplitude functions have mean zero. We propose an FAR model with switching coefficient operators for the prediction of amplitude functions. The coefficient operator is determined by the state of the previous warping function. Suppose is the hidden state of , then the proposed model has the representation
where are centered, independent and identically distributed innovations in , and are bounded linear operators such that the above equation has a unique stationary and causal solution.
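A minimal first-order sketch of this switching mechanism on fPC score vectors (Python; plain matrices stand in for the coefficient operators, and the state sequence is assumed to be already estimated):

```python
import numpy as np

def fit_switching_var1(scores, states, K):
    """Per-state least squares for a VAR(1) on fPC scores: the coefficient used
    at time k+1 is selected by the (estimated) warping state at time k."""
    d = scores.shape[1]
    B = np.zeros((K, d, d))
    for s in range(K):
        idx = np.where(states[:-1] == s)[0]
        if len(idx) >= d:
            B[s], *_ = np.linalg.lstsq(scores[idx], scores[idx + 1], rcond=None)
    return B

def predict_next_score(scores, states, B):
    """One-step-ahead prediction driven by the last observed state."""
    return scores[-1] @ B[states[-1]]
```

The full method in Section 3.2.1 works with operators estimated on a finite-dimensional sub-eigenspace; this sketch only illustrates how the state sequence switches the coefficients.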
3.2.1 Estimation and Prediction
The estimation procedure is inspired by Aue et al. (2015), with modifications suited to our purpose. We propose to split the total sum of squared errors according to the hidden states of the warping functions, and then minimize the sub-SSEs to obtain the sets of estimated coefficient operators. More details are given below.
We obtain the estimation of , by minimizing the objective function
By simple decomposition, we have
where is the number of functions whose preceding warping function is in state . Then we can minimize the following quantity to obtain the estimator of :
We apply the multivariate technique to estimate for each , that is, the functions
are projected onto a finite dimensional subeigenspace through fPCA, and the unknown operators are estimated in that finite dimensional subspace. Assume
is the estimator of , then the predictor of should be
Remark: The final expression is binary. In practice, we can also try the weighted predictor,
The weighted predictor has smaller variance but larger bias. The probabilities of the states
need to be estimated under some principle, for example, , where . When the warping functions can be well classified, we can adopt the binary predictor; otherwise, we can try the weighted predictor.
3.2.2 Parameter selection
It can be shown that the fFPE criterion of Aue et al. (2015) can still be used to select the order and the dimension of the sub-eigenspace for prediction. Since the eigenfunctions are orthogonal and the fPC scores are uncorrelated, the mean squared prediction error can be decomposed as
where denotes the norm. The decomposition reveals the trade-off between bias and variance. For the first summand, assume that follows a variate VAR() process with switching coefficient matrices, that is,
where is the error term. It can be shown (see, e.g., Lütkepohl 2006) that
where and is the least squares estimator of , and . Let be the predictor of if the estimated state of is . Assume the classification is correct, and we have
where is the hidden state of , is the number of observations in the th state, and means . Replacing with , where
is the pooled unbiased estimator of
, such that . Finally, we have
Thus the selection of can be performed with the modified fFPE criterion given by
Remark: This is a generalization of the fFPE criterion proposed by Aue et al. (2015). It is hard to find an unbiased estimator for because of misclassification, but when the misclassification probability is small, the bias tends to be negligible. In most cases, we do not know the real hidden states, so we cannot distinguish the real states from the estimated states.
3.3 Algorithm
The prediction algorithm proceeds in four steps. First, implement a functional registration algorithm to separate amplitude and phase variability. Assuming the numbers of hidden states of phase and amplitude variability, say resp. , are known a priori or estimated from the data, obtain the state-indicating vector of the estimated hidden state of each warping function by applying spherical k-means clustering to the SRSFs of the warping functions. Then apply k-means clustering to the amplitude functions and obtain the state-indicating vectors . Next, combine the two sequences to generate a new sequence , and estimate the transition probability matrix by the least squares method as
The one-step-ahead prediction of the state-indicating vector of the warping function is
where is the matrix
with , . The corresponding predicted warping function is
Next, fix the dimension and order , and fit a FAR() model with switching coefficient operators to predict the next amplitude function, say,
where is the predictor of when the estimated state of , say , is . The last step is to apply the predicted warping function to the predicted amplitude function, and the final predictor is
We summarize the algorithm as follows:
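The final composition step, applying the predicted warping to the predicted amplitude, can be sketched as (Python; the curves and warping here are illustrative placeholders):

```python
import numpy as np

def compose(amplitude, gamma, t):
    """Evaluate the predicted amplitude at the predicted warping,
    x_hat(t) = a_hat(gamma_hat(t)), by linear interpolation on the grid t."""
    return np.interp(gamma, t, amplitude)

t = np.linspace(0.0, 1.0, 1001)
a_hat = np.sin(2 * np.pi * t)          # predicted amplitude (illustrative)
gamma_hat = t ** 1.5                   # predicted warping (illustrative)
x_hat = compose(a_hat, gamma_hat, t)   # final predicted curve
```

With the identity warping, the composition returns the amplitude function unchanged, so the phase component only redistributes the predicted features along the time axis.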
3.4 Data-driven selection of the number of states
To the best of our knowledge, there is no widely accepted procedure for order selection in hidden Markov models. The selection of the number of states is a trade-off between bias and variance: a large number of states reduces bias but increases variance, since more parameters must be estimated. Considering that our purpose is prediction, we propose an approach based on prediction error. The prediction performance is evaluated by two metrics, namely the
distance and the amplitude distance. Suppose we had a large test dataset that is an independent copy of the dataset used for model fitting. We could, for example, use the first curves in to fit a model with states, predict the remaining curves with the fitted model, and then calculate the average distance and amplitude distance between the predicted curves and the curves to be predicted. These two average errors can then guide the order selection. In practice, the sample size may be small, and we cannot reserve a large fraction of the data for testing. In this case, we may apply the idea of Monte Carlo cross-validation (see Shao (1993)): choose a stretch of consecutive curves for training and use the remaining curves for testing, repeating the procedure multiple times with randomly chosen partitions. We choose a set of candidate state numbers, compute the two average errors for models with each candidate, and select the state numbers giving the lowest errors.
4 Shape Similarity
One of the main questions considered in this article is: what is a good measure of shape similarity? In order to compare the shapes of different trajectories, we need to formally define the functional shape space , and we need a distance to evaluate pattern similarity. We propose the principle that if a function can be warped into another, then the two functions are considered to be of the same shape. Here, we follow the convention that shape is independent of scale and location (Srivastava and Klassen (2016)), so we first rescale functions so that they have unit norm and start at the same value, and then study the shape differences within the resulting set. The resulting space is termed the preshape space. In the functional shape space, we unify the shape representations, that is, we identify all points in the preshape space that represent the same shape. The functional shape space is a quotient space of with respect to warpings.
4.1 Functional shape space
We define an equivalence relation on as follows: two elements and of the preshape space are equivalent if there exists a warping function such that . Then, for any element of the preshape space, the set of all warped versions of that function is considered an element of the functional shape space , that is,
where is the space of all warping functions. Based on our definition, the distance between two shape objects should be invariant to warpings. Before we give the distance for measuring shape similarity, we first briefly introduce the Fisher–Rao metric.
4.2 Fisher–Rao metric
The Fisher–Rao metric is fundamental to the registration algorithm of Srivastava and Klassen (2016). Let be the functional space we consider. For any and , where is the tangent space of at , the Fisher–Rao metric is defined as the inner product
One advantage of the Fisher–Rao metric over the Euclidean metric is that it avoids the over-registration problem (Srivastava and Klassen (2016)). Another important property of the Fisher–Rao metric is invariance under simultaneous warping: for any , . An immediate consequence of this property is that the registration between two functions is unique, which is important for defining a distance in the shape space.
Under the SRSF representation, the Fisher–Rao Riemannian metric on becomes the standard metric (see Srivastava and Klassen (2016), p. 105). Using this property, we can write the geodesic distance under the Fisher–Rao metric explicitly as
where , and are the SRSF representations of .
The Fisher–Rao metric is defined only on a subset of , but under the SRSF representation it can be extended to endowed with the metric. We call the metric on the SRSF representation space the extended Fisher–Rao metric.
4.3 Amplitude distance
We shall use the amplitude distance, which has been shown to be a proper distance on the functional shape space, to measure the similarity of pattern/shape,
or equivalently,
which makes a metric space. If two functions are of the same shape, then the amplitude distance is zero. The geodesic distance under the Fisher–Rao metric is invariant to simultaneous warpings. Therefore, the effect of phase variability will not influence the amplitude distance between two functions, say,
and thus the amplitude distance between two shape objects is unique. This is the main reason why we use amplitude distance to measure shape similarity.
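To make the definition concrete, here is a deliberately crude numerical sketch (Python): instead of minimizing over the full warping group, it searches a one-parameter toy family gamma_a(t) = t**a, so it only upper-bounds the amplitude distance. The functions, grid, and family are our own illustration.

```python
import numpy as np

def srsf(f, t):
    """SRSF of a generic function: sign(f') * sqrt(|f'|)."""
    df = np.gradient(f, t)
    return np.sign(df) * np.sqrt(np.abs(df))

def warped_srsf(q, gamma, t):
    """Group action on SRSFs: (q o gamma) * sqrt(gamma')."""
    return np.interp(gamma, t, q) * np.sqrt(np.maximum(np.gradient(gamma, t), 0.0))

def amplitude_distance_upper(f, g, t, exponents=np.linspace(0.5, 2.0, 31)):
    """Upper bound on the amplitude distance: minimize the L2 distance of SRSFs
    over the toy family gamma_a(t) = t**a instead of the full group."""
    dt = t[1] - t[0]
    qf, qg = srsf(f, t), srsf(g, t)
    best = np.inf
    for a in exponents:
        qw = warped_srsf(qg, t ** a, t)
        best = min(best, np.sqrt(dt * np.sum((qf - qw) ** 2)))
    return best
```

In practice, the infimum over all of the warping group is computed with a dynamic-programming registration routine such as the one in Srivastava and Klassen (2016); the toy family here only illustrates the invariance being exploited.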
Remark: In this paper, we use both the amplitude distance and the Euclidean distance to evaluate the predictions. Neither distance can evaluate the prediction well individually, as we consider both amplitude and phase variability.
5 Theoretical Results
We use the least squares method to estimate the unknown transition probabilities, and we aim to derive the asymptotic properties of the estimator. It is known that the least squares estimator of the stochastic matrix of a Markov chain is consistent and asymptotically normal (see van der Plas (1983)). However, since the real hidden states of the warping and amplitude functions must themselves be estimated, the least squares estimator of the transition matrix is not necessarily consistent for .
In order to establish the asymptotic property of the least squares estimator , we make the following assumptions.
Assumptions

The Markov chain is stationary and ergodic, and has finite states;

The estimated prototypes are obtained from an independent copy of observations, and thus the estimated state resp. is independent of resp. given resp. , where and ;

The number of states is known;

The misclassification probabilities are the same for all ;

The matrix is positive definite.
Remarks: Note that Assumption (A2) is compatible with the assumption on the state-space model’s error . Based on the model assumptions, the estimated state is related only to the real state and the random error , so the second assumption is a natural consequence of the assumption on . Assumption (A2) means that, given the corresponding real state, the estimated state is independent of all other states. This is reasonable: as the sample size grows large, the estimated prototype functions tend to be uncorrelated with any individual function, and we can assume that the estimated state is related only to the corresponding actual state.
Bayes’ theorem implies the following proposition.
Proposition 1.
Under Assumptions (A1)–(A4), the transition probabilities of the combined estimated process are given by
Remarks: Proposition 1 gives the transition probabilities of the estimated Markov chain. We show that the least squares estimator for the estimated Markov chain is consistent for
and asymptotically normal.
The least squares estimator is defined as the minimizer of the following quantity
where
By Proposition 1, we have
(51) 
Then we have the following theorem for the least squares estimator, which generalizes the result of van der Plas (1983). In that paper, the author considers aggregated Markov chains, but it is not necessary to assume that the process is a Markov chain; condition (51) is enough. First we state the following lemma from van der Plas (1983).
Lemma 1.
Let be a stationary and ergodic process with values in a Euclidean space . Let be a compact subset of some Euclidean space. Let be a real-valued measurable function on such that is a continuous function of for all . Define for all , and assume that ; then
a.s. uniformly for all .
Then we can derive our first result from the above lemma.
Theorem 1.
Let be the state-indicating vectors of the estimated Markov chain, and assume that Assumptions (A1)–(A4) hold. Then for each
there exists a random matrix
such that and
Remark: From Theorem 1, we know that the estimator does not converge to the real transition matrix , but to another stochastic matrix