Learning Time Series from Scale Information

03/18/2021 ∙ by Yuan Yang, et al. ∙ Harvard University

Sequentially obtained datasets usually exhibit different behavior at different data resolutions/scales. Instead of inferring from the data at each scale individually, it is often more informative to interpret the data as an ensemble of time series from different scales. This naturally motivates a new concept that we refer to as scale-based inference. The basic idea is that more accurate predictions can be made by exploiting the scale information of a time series. We first propose a nonparametric predictor based on k-nearest neighbors with an optimally chosen k for a single time series. Building on that, we focus on a specific but important type of scale information, the resolution/sampling rate of time series data. We then propose an algorithm to sequentially predict time series using past data at various resolutions. We prove that, under suitably chosen parameters, the algorithm asymptotically produces a mean prediction error no larger than that of the best possible algorithm at any single resolution. Finally, we establish general formulations for scale inference and provide further motivating examples. Experiments on both synthetic and real data illustrate the potential applicability of our approaches to a wide range of time series models.


1 Introduction

Time series modeling plays a key role in understanding and predicting sequences of data that arise naturally in fields such as engineering, biology, and finance. The scope of time series considered in this paper is not limited to data obtained sequentially in time; the data may also be sampled over space. In this work, we consider a novel methodology referred to as scale-based inference for predicting time series. The initial motivations are twofold: first, a realistic time series usually consists of complex dynamics and cannot be modeled as a single stationary or ergodic process; second, the data size is usually so massive that offline inference and model checking/selection are not computationally tractable. It is therefore preferable for an analyst to retrieve useful information of particular interest, such as trend, volatility, or one-step prediction, in a sequential manner. To address the aforementioned issues, we make the general assumption that, though a time series consists of distinct dynamics, those dynamics are ergodic. Thus, it is technically possible to infer the current piece of data by looking at historical data with behavioral resemblance. Here, "dynamics" is a general term that refers to any particular feature of a time series, for example its moments or its spectrum. In a parametric setting, it can be understood as different parameters of a particular model, or as different models. We give more concrete formulations in later parts of the paper.

Related Work: In the time series literature, a classical way to extract scale information such as trend and cycle is time series decomposition [cleveland1976decomposition, west1997time, baxter1999measuring], which decomposes the data into a trend/cycle component plus a stationary process. Multi-state autoregressive models impose a parametric hierarchy on mixture autoregressive (AR) models [RACS, Allerton]. They usually assume a Markov chain for the transition among a finite number of AR states. The state can be treated as the scale information, and the data at each time step are governed by the data generating process corresponding to the particular state. Autoregressive Conditional Heteroskedasticity (ARCH) models [engle1982autoregressive] have been widely applied to modeling financial time series that exhibit time-varying volatility clustering. The variances that change over time can be understood as the scale information. In the statistics and signal processing literature, the Least Absolute Shrinkage and Selection Operator (LASSO) [tibshirani1996regression] and recursive ℓ1-regularized least squares (SPARLS) [babadi2010sparls, sheikhattar2015recursive] are popular methods that aim at simultaneous model selection and inference; the two seemingly contradictory aspects are technically reconcilable because the imposed ℓ1-constraint tends to select a sparse set of covariates from the massive candidate set. They have been widely appreciated for their robust performance and computational tractability, especially compared with the classical Akaike information criterion or Bayesian information criterion, which require combinatorial tests. In view of that, they can be understood as a scale-based philosophy: extract a few significant covariates from a large scale while suppressing the remaining ones. In computer science, manifold learning methods [belkin2006manifold] are based on the idea that the dimension of many datasets is only artificially high and that the data actually lie on a lower dimensional manifold. In some manifold learning algorithms, such as the locally linear embedding approach, the modeling is based on local information. This shares some similarities with our scale-based inference for time series in that both aim to extract useful information inherent at each locality, which can be regarded as a scale parameter.

Notation: For two deterministic sequences $\{a_n\}$ and $\{b_n\}$, we use $a_n \sim b_n$ to represent $\lim_{n\to\infty} a_n/b_n = 1$. Let $o_p(1)$ denote any random variable that converges in probability to zero. We write $a_n = O(b_n)$ if $a_n \le c\, b_n$ for some constant $c > 0$ for all sufficiently large $n$. For a time series $\{x_t\}$ and a positive integer $d$, we let $X_{t,d}$ denote the column vector $(x_{t-d+1}, \ldots, x_t)^{\mathrm{T}}$; when there is no ambiguity, we simply write it as $X_t$. Let $\|v\|$ denote the Euclidean norm of a vector $v$. Let $\mathcal{N}(\mu, \Sigma)$ denote the normal distribution with mean $\mu$ and variance matrix $\Sigma$.

2 Prediction for One Time Series

In general, the criteria used to evaluate the performance of the retrieved information of interest may differ. If an analyst is interested in one-step-ahead prediction, then it is natural to measure the goodness of modeling via the expected prediction error. Suppose that the underlying data generating procedure is

$$x_{t+1} = f(x_t, \ldots, x_{t-d_0+1}) + \varepsilon_{t+1}, \qquad (1)$$

where $f$ is a continuous function and the $\varepsilon_t$ are independent and identically distributed noises with mean $0$ and variance $\sigma^2$ (we do not need to assume that the noises are Gaussian here, nor in the proof of Theorem 1). The one-step prediction error (loss) is defined as $e_t = E(x_{t+1} - \hat{x}_{t+1})^2$. In a typical classical approach, either parametric or nonparametric, an analyst first estimates the function $f$ by $\hat{f}$ and then applies the plug-in rule $\hat{x}_{t+1} = \hat{f}(x_t, \ldots, x_{t-d_0+1})$. The oracle situation is where $f$ is luckily obtained and $e_t = \sigma^2$. Otherwise, it is straightforward to prove that $e_t \ge \sigma^2$. Though the lower bound is practically not achievable, a "good" predictor is generally considered to be one such that $e_t \to \sigma^2$ as $t \to \infty$ (i.e., as more and more data are observed). A major problem with a typical approach is the following: if it is parametric (e.g., linear AR), there is a risk of mis-specifying the model class (e.g., the true functional relationship is nonlinear); if it is nonparametric (e.g., spline approximation), it usually requires massive computation, and it is not clear how to choose the kernels (e.g., the number and spacing of knots). We propose a simple but effective pattern-matching algorithm. It can be categorized as a nonparametric approach, but it is much easier to implement than kernel-based ones since fast algorithms can be employed for searching k-nearest neighbors. Our approach is sketched in Algorithm 1. Next, we provide a theoretical analysis of its performance. We make the following assumptions.

Input: data $x_1, \ldots, x_N$; postulated dimension $d$; number of neighbors $k$; distance measure $\rho$
Output: predictor $\hat{x}_{N+1}$
1:  Collect $X_{t_1,d}, \ldots, X_{t_k,d}$ ($t_j < N$), the $k$ nearest neighbors of $X_{N,d}$ under the distance measure $\rho$
2:  Let $\hat{x}_{N+1} = \frac{1}{k}\sum_{j=1}^{k} x_{t_j+1}$
Algorithm 1 Pattern Matching Algorithm
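To make the procedure concrete, the following minimal Python sketch implements the pattern-matching predictor as reconstructed above (Euclidean distance, average of the k successors). The function name, default parameters, and the toy series are our own illustrative choices, not taken from the paper.

import numpy as np

def pattern_match_predict(x, d=3, k=10):
    """Predict the next value of x by averaging the successors of the
    k nearest past length-d patterns (a sketch of Algorithm 1)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    query = x[N - d:]                      # most recent pattern of length d
    starts = np.arange(0, N - d)           # pattern occupies [s, s+d); its successor is x[s+d]
    patterns = np.stack([x[s:s + d] for s in starts])
    dists = np.linalg.norm(patterns - query, axis=1)   # Euclidean distance to the query
    nearest = np.argsort(dists)[:k]        # indices of the k closest patterns
    return x[starts[nearest] + d].mean()   # average of the values that followed them

# Toy usage on a noisy sine wave (illustrative only)
rng = np.random.default_rng(0)
t = np.arange(1000)
series = np.sin(0.3 * t) + rng.normal(scale=0.2, size=t.size)
print(pattern_match_predict(series, d=3, k=10))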

(A1) For a large enough constant $\tau$, the lagged vectors $X_{t,d}$ and $X_{t',d}$ are independent whenever $|t - t'| > \tau$.

The assumption is applicable in practice because, under mild conditions, many stationary processes such as autoregressions are strong-mixing [athreya1986note]. In other words, as long as $\tau$ is allowed to diverge at any speed, the lagged vectors become asymptotically independent.

(A2) $f$ is a Lipschitz continuous function.

A linear function is perhaps the simplest example of a Lipschitz continuous function, because of Cauchy's inequality. Thus, the linear autoregressive model is a special case of the setting addressed here.

(A3) The sequence of random vectors $X_{t,d}$ converges to a stationary distribution whose probability density function (PDF) is monotonically decreasing in the tail. In other words, for a positive integer $d$, the stationary PDF $p(v)$ is decreasing in $\|v\|$ for all $v$ whose norms are greater than a certain constant.

A special case in which (A3) holds is the usual setting where $f$ is a linear function and the noises are Gaussian.

Theorem 1

Assume (A1)-(A3), that $\rho$ is the Euclidean distance, and that $k \to \infty$ with $k/N \to 0$. If $d \ge d_0$, where $d_0$ is the true dimension of the time series as defined in (1), then for any bounded subspace $S$ and any $X_{N,d} \in S$, $\hat{x}_{N+1} - f(X_{N,d}) \to 0$ a.s. with a uniform convergence rate.

Our proof is based on constructing the nearest neighbor density estimator. Under assumption (A1), there are of order $N$ historic points that can be treated as independently distributed according to a probability density function $p$, the stationary distribution of $d$ consecutive points. Without loss of generality we write the number of such points as $n$, and suppose that $X_1, \ldots, X_n$ are such points. Let $\hat{p}_n(x) = k/(n V_k(x))$ be the nonparametric density estimator at a point $x$, where $V_k(x)$ is the volume of the ball whose radius is the distance between $x$ and its $k$th nearest neighbor. It is well known that

$$\sup_x |\hat{p}_n(x) - p(x)| \to 0 \quad \text{a.s.} \qquad (2)$$

given $k \to \infty$ and $k/n \to 0$ [devroye1977strong]. Given any bounded subspace $S$, assumption (A2) implies that there exists a constant $c_0 > 0$ such that $p(x) \ge c_0$ for all $x \in S$. In view of (2), there exists $n_0$ such that for all $n \ge n_0$ and $x \in S$, $\hat{p}_n(x) \ge c_0/2$, and hence $V_k(x) \le 2k/(n c_0)$. Thus, the distance between $x$ and its $k$th nearest neighbor converges uniformly to zero as $n$ tends to infinity.

If the data points and the additive noises after each of them are respectively denoted by $X_{t_1,d}, \ldots, X_{t_k,d}$ and $\varepsilon_{t_1+1}, \ldots, \varepsilon_{t_k+1}$, then the algorithm gives $\hat{x}_{N+1}$ which satisfies

$$\hat{x}_{N+1} = \frac{1}{k}\sum_{j=1}^{k} f(X_{t_j,d}) + \frac{1}{k}\sum_{j=1}^{k} \varepsilon_{t_j+1}.$$

Let $L$ denote the Lipschitz constant of $f$. It follows from

$$\Bigl|\frac{1}{k}\sum_{j=1}^{k} f(X_{t_j,d}) - f(X_{N,d})\Bigr| \le \frac{L}{k}\sum_{j=1}^{k} \|X_{t_j,d} - X_{N,d}\| \to 0$$

and the law of large numbers (applied to the noise average) that $\hat{x}_{N+1} - f(X_{N,d}) \to 0$ a.s. as $n$ goes to infinity.

Remark 1

Theorem 1 shows that for a wide range of $k$ the prediction error produced by Algo. 1 converges quickly to the theoretical lower bound. The choice of $k$ that achieves the optimal rate of convergence depends on the specific data generating process, which is usually unknown. However, from a number of synthetic-data experiments, we found that a moderate $k$ is generally a good choice. Algo. 1 will be used in the method introduced in the next section.

3 Sequential Prediction from Multiple Resolutions

In this section, we study a specific type of scale information: the data resolution. In different research domains, such as environmental science and finance, it has been pointed out that model predictability may depend heavily on which resolution of data is used for model fitting; in particular, high resolution (e.g., hourly data) is not necessarily better than low resolution (e.g., daily data) in terms of prediction [shen2015influence, wang2013accelerating, brooks2014introductory]. It is a natural idea to predict using each resolution and then combine the results properly in order to achieve a more reliable prediction. In some sense, a similar idea appears in model averaging methods [hoeting1999bayesian].

We propose Algo. 2, referred to as multiple resolution based sequential prediction and learning in time series (Mr-split). It is worth mentioning that we do not give a strict definition of "data at resolution $r$," as a practitioner may adopt various possible definitions. In the pseudo-code of Algo. 2, we use $y_t^{(r)}$ to denote the pre-processed data at resolution $r$ in general. For example, $y_t^{(r)}$ can be a subsampled series or a local average of $r$ consecutive observations. We provide more examples in the real data experiments. We note that $y_t^{(r)}$ is not necessarily a function of $x_t$, but can also include exogenous variables.

The basic idea of Algo. 2 is to linearly combine the predictors estimated from each resolution, weighted by their accumulated scores, and then update the scores according to a carefully chosen loss function $\ell$. The loss function takes the predictor and the revealed true value as arguments, and it is used to penalize those resolutions that do not perform well.

Algo. 2 is motivated by the exponential weight algorithm from the online learning literature. However, it differs from the classical one in terms of the ultimate goal. In the exponential weight algorithm, a loss function, which is convex and bounded, is pre-defined as the performance measure. In time series prediction, an analyst is more interested in the mean prediction error, which is usually unbounded. A straightforward remedy, used in (3) of Algo. 2, is to impose a constant upper bound $C$, which we refer to as the capacity parameter. The choice of $C$ here is essential; as we shall see from the proof, the optimal choice of the learning rate $\eta$ is tied to $C$. On the one hand, if $C$ is too large, then the scores vary little at each time step, which means that the learning procedure is slow. On the other hand, if $C$ is too small, then most of the time the loss for each resolution equals the constant $C$, which means that the learning procedure may not be effective in distinguishing which resolutions perform better than others. Under mild assumptions, we prove in the following theorem that Algo. 2 is asymptotically optimal under suitably scaled choices of $C$ and $\eta$.

Input: data $y_t^{(r)}$ sequentially observed ($t = 1, 2, \ldots$) at each resolution $r = 1, \ldots, m$; initial score $w_1^{(r)}$ for each resolution $r$; capacity parameter $C$; learning rate $\eta$
Output: predictor $\hat{x}_t$ at each $t$
1:  for $t = 1, 2, \ldots$ do
2:     For each resolution $r$, predict $\hat{x}_t^{(r)}$ from $\{y_s^{(r)}: s < t\}$ using Algo. 1
3:     Predict $\hat{x}_t = \sum_{r=1}^{m} w_t^{(r)} \hat{x}_t^{(r)} \big/ \sum_{r=1}^{m} w_t^{(r)}$
4:     Given the revealed true value $x_t$, update the score $w_{t+1}^{(r)} = w_t^{(r)} \exp\{-\eta\, \ell(\hat{x}_t^{(r)}, x_t)\}$, where
       $\ell(\hat{x}, x) = \min\{(\hat{x} - x)^2,\ C\}$   (3)
5:  end for
Algorithm 2 Multiple Resolution based Sequential Prediction and Learning in Time Series (Mr-split)
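The Python sketch below mirrors the structure reconstructed above: each resolution keeps its own pattern-matching predictor, the forecasts are combined with score weights, and the scores are updated with the capped squared loss. The definition of "data at resolution r" (here, subsampling every r-th point), the helper names, and the default values of C and eta are our own illustrative assumptions.

import numpy as np

def knn_predict(history, d=2, k=5):
    """One-step pattern-matching prediction (cf. Algorithm 1) on a 1-D history."""
    history = np.asarray(history, dtype=float)
    N = len(history)
    if N <= d + k:                        # not enough data yet: fall back to the last value
        return float(history[-1])
    query = history[N - d:]
    starts = np.arange(0, N - d)
    patterns = np.stack([history[s:s + d] for s in starts])
    nearest = np.argsort(np.linalg.norm(patterns - query, axis=1))[:k]
    return float(history[starts[nearest] + d].mean())

def mr_split(x, resolutions=(1, 2, 3, 4, 5), d=2, k=5, C=4.0, eta=0.1):
    """Sketch of Mr-split: exponentially weighted combination of per-resolution
    predictors with a capped squared loss, cf. Algorithm 2 and (3)."""
    x = np.asarray(x, dtype=float)
    m = len(resolutions)
    log_w = np.zeros(m)                   # log of the scores, initially equal
    predictions = []
    warmup = max(resolutions) * (d + k + 2)
    for t in range(warmup, len(x)):
        expert = np.empty(m)
        for i, r in enumerate(resolutions):
            # "data at resolution r": past points ..., x[t-2r], x[t-r], whose
            # next element in that sub-series is exactly x[t]
            expert[i] = knn_predict(x[t % r: t: r], d=d, k=k)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        predictions.append((t, float(w @ expert)))      # step 3: weighted combination
        loss = np.minimum((expert - x[t]) ** 2, C)      # capped loss, cf. (3)
        log_w -= eta * loss                             # step 4: exponential score update
    return predictions, np.exp(log_w - log_w.max())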

Before we proceed, we make the following assumption.

(A5) Assume that, for each resolution $r$, the prediction error $\hat{x}_t^{(r)} - x_t$ converges to a strictly stationary stochastic process with finite second moment as $t$ tends to infinity. Assume also that the fourth moment of $\hat{x}_t^{(r)} - x_t$ exists for each $r$.

We note that $\hat{x}_t$ denotes the predictor of Mr-split at time $t$, as shown in Algo. 2. By ergodicity, for each resolution $r$, the mean prediction error has a limit, denoted by $e^{(r)}$, as $t$ tends to infinity. Suppose that $r^\ast$ is the resolution that achieves the minimal prediction error among all the resolutions, i.e., $e^{(r^\ast)} = \min_{1 \le r \le m} e^{(r)}$.

The question is whether the predictor $\hat{x}_t$ produced by Algo. 2 can achieve the lowest error $e^{(r^\ast)}$.

Theorem 2

Under Assumption (A5), suppose that we choose the capacity $C$ and the learning rate $\eta$ as functions of the horizon $T$, scaled by constants that do not depend on the data. Then Algo. 2 achieves the lowest possible prediction error as $T$ tends to infinity, i.e.,

$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} E(\hat{x}_t - x_t)^2 \le e^{(r^\ast)}.$$

Remark 2

The result shows that the predictor $\hat{x}_t$ is asymptotically at least as good as the best of the $m$ single-resolution predictors. When the resolution $r^\ast$ corresponds to the true data generating model, equality holds. This is because the true model asymptotically achieves the theoretically optimal prediction error, so the limit above equals $e^{(r^\ast)}$.

4 General Framework

We emphasized a special type of scale information, the data resolution, in the last section. Here we propose a generic method for scale-based inference in Algo. 3. Below are some additional motivating examples, where $s$ denotes the scale and $\theta$ denotes the parameter of the data generating process.

Example 1

Consider $x_t = \sin(\omega_1 t) + \sin(\omega_2 t) + \varepsilon_t$ with $\omega_1 \ll \omega_2$, where the $\varepsilon_t$ are independent noises. It is equivalent to $x_t = s_t + \sin(\omega_2 t) + \varepsilon_t$, where $s_t = \sin(\omega_1 t)$ can be treated as the scale, as it varies much more slowly than the fast oscillation.

Usually, if the parametric model is (luckily) correctly specified, then the unknown parameters can be estimated via MLE; otherwise, methods such as curve fitting or time series decomposition may be used. We apply Algo. 1 to this type of data in the experiments.

Example 2

Consider $x_t = g(x_{t-1}, \ldots, x_{t-n}) + \varepsilon_t$, where $g$ is any linear or nonlinear continuous function. There is no scale in this case. This example is constructed for comparison with the next one.

Example 3

Consider $x_t = g_{s_t}(x_{t-1}, \ldots, x_{t-n}) + \varepsilon_t$, where $g_{s_t} \in \{g_1, \ldots, g_K\}$, a finite set of continuous functions. This is a mixture model, where the scale parameter $s_t$ has finite support and the model parameters depend on $s_t$.

Input: data $x_1, \ldots, x_t$; postulated scale parameter $s \in \mathcal{S}$ and model parameter $\theta \in \Theta$, where $\mathcal{S}$ and $\Theta$ are respectively the scale space and the parameter space
Output: the quantity of interest $q_t$
1:  Infer the parameters $s_t$ and $\theta_t$ at time $t$ from the past data
2:  Compute $q_t = q(s_t, \theta_t)$
Algorithm 3 A Generic Algorithm for Scale-based Inference
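Schematically, the two steps of Algorithm 3 can be expressed as the following Python interface; the class and method names are our own and only indicate the intended structure (infer the scale, infer the parameters given the scale, then compute the quantity of interest).

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class ScaleBasedInference:
    """A generic two-step scale-based inference loop (a sketch of Algorithm 3)."""
    infer_scale: Callable[[Sequence[float]], object]            # estimate s_t from past data
    infer_params: Callable[[Sequence[float], object], object]   # estimate theta_t given s_t
    quantity: Callable[[object, object], float]                 # q(s_t, theta_t), e.g. a one-step prediction

    def step(self, past_data: Sequence[float]) -> float:
        s = self.infer_scale(past_data)          # step 1: infer the scale
        theta = self.infer_params(past_data, s)  # step 1: infer the parameters given the scale
        return self.quantity(s, theta)           # step 2: compute the quantity of interest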

5 Synthetic Data Experiments

In this section, we present experimental results that demonstrate the theoretical results and the advantages of our methods on various synthetic datasets. The code and related data will be made publicly available online.

5.1 Mixed Oscillators

Figure 1: 1000 synthetic data points generated from a two-oscillator mixture model with i.i.d. unit-variance noise.

We first consider a case of Example 1: a two-oscillator mixture in which a fast oscillation rides on a slower periodic modulation, with noise terms that are independently and identically distributed with unit variance. This is a typical example of fast-changing data sitting on a slower-changing periodic modulation. Figure 1 shows 1000 data points generated from this mixed-oscillator model. We use the first 800 data points as the training set and the last 200 data points as the test set. If we luckily specify the correct parametric model, the unknown parameters can be estimated via methods such as MLE, and the prediction error then comes only from the noise term, with a mean squared error (MSE) converging to 1.
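The exact frequencies and amplitudes of the two oscillators are not shown above, so the following data-generation sketch uses hypothetical values chosen only to reproduce the qualitative setup (a slow periodic modulation, a fast oscillation, and i.i.d. unit-variance Gaussian noise) and the 800/200 train/test split.

import numpy as np

rng = np.random.default_rng(1)
t = np.arange(1000)
slow = 2.0 * np.sin(2 * np.pi * t / 500)   # hypothetical slow periodic modulation
fast = np.sin(2 * np.pi * t / 25)          # hypothetical fast oscillation
x = slow + fast + rng.normal(scale=1.0, size=t.size)   # i.i.d. unit-variance noise

train, test = x[:800], x[800:]             # 800 training / 200 test points, as in the text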

Figure 2: MSE of the estimates from Algorithm 1. The first 800 data points in Figure 1 are used as the source of matching, and the last 200 data points are used for testing. The horizontal axis is the number of past points used for pattern matching.

Generally, we don’t know the exact form of the model generating the data, specifying the model form incorrectly would lead to an inconsistent estimation. Now we use Algorithm 1

, the non-parametric model to make prediction. The first 800 data points are used as the training set, and the last 200 data points are used as the test set.

Figure 2 shows the error when different numbers of past points are used for pattern matching. The mean squared error (MSE) decreases as the number of past points increases and converges to 1.10. We can explain why the MSE converges to 1.10 in this example: among the first 800 training data points, there are only about 16 repeating patterns, so the number of repeating patterns is limited. When we find the k-nearest neighbors, we therefore have to limit k, and we choose k = 10 in this example. Thus the mean of the 10 past predictors has variance 1/10, which is the variance of the mean of 10 independently and identically distributed noise terms. The variance of our estimate comes from two parts: one from the variance of the estimator, and another from the noise in the test data, which is 1 in this case. Adding the variance from the test data noise, the total estimation variance is 1.10. When the data size increases, we can use a larger k when finding the k-nearest neighbors, the variance of the estimator in Algorithm 1 decreases accordingly, and it finally converges to 0. The MSE then comes only from the test data variance, so the MSE converges to 1 as the dataset size increases, as we proved in Theorem 1. We have verified this and omit the result here due to the page limit.
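Written out explicitly, with k = 10 neighbors and unit noise variance as stated above, the two variance contributions add up to the observed limit:

$$\mathrm{MSE} \;\approx\; \mathrm{Var}\Bigl(\frac{1}{k}\sum_{j=1}^{k}\varepsilon_{t_j+1}\Bigr) + \mathrm{Var}(\varepsilon_{N+1}) \;=\; \frac{\sigma^2}{k} + \sigma^2 \;=\; \frac{1}{10} + 1 \;=\; 1.10 .$$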

5.2 Nonlinear AR(n)

There is no standard way to predict a nonlinear AR model unless the exact form of the nonlinear dependence is known. We show here that our nonparametric approach gives consistent estimates in nonlinear AR cases. We consider a case of Example 2: a nonlinear AR(3) sequence whose noise terms independently and identically follow a distribution with unit variance. Figure 3 shows one instance of 1000 data points generated from the nonlinear AR(3) model; we use the first 800 data points as the training set and the last 200 data points as the test set. It is very difficult to tell by eye from the data plot what the form of the nonlinear dependence is, so we use a linear AR(n) model as a baseline without loss of generality.
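The exact nonlinear AR(3) map is not shown above, so the simulation sketch below uses a hypothetical bounded nonlinearity with unit-variance Gaussian noise purely to illustrate the setup.

import numpy as np

rng = np.random.default_rng(2)
N = 1000
x = np.zeros(N)
# Hypothetical nonlinear AR(3) recursion (not the paper's exact form):
# bounded nonlinear terms keep the simulated series stable.
for t in range(3, N):
    x[t] = (0.6 * np.tanh(x[t - 1]) - 0.4 * np.sin(x[t - 2])
            + 0.3 * np.cos(x[t - 3]) + rng.normal(scale=1.0))

train, test = x[:800], x[800:]   # 800 training / 200 test points, as in the text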

Figure 3: 1000 synthetic data points generated from the nonlinear AR(3) model, with noise terms that independently and identically follow a distribution with unit variance.

Figure 4: MSE of the estimates from Algorithm 1 and from linear AR(n) models. The first 800 data points in Figure 3 are used as the training (matching) set, and the last 200 data points are used for testing. The horizontal axis is the number of past points used for pattern matching. We repeat the experiment 100 times, and the MSE at each point is the average over the 100 experiments.

For each experiment, we use the first 800 data points as training data and the last 200 data points as test data, and we repeat the experiment 100 times. Figure 4 shows the average MSE over the 100 experiments. For the linear AR(n) model, the best MSE achieved is 1.05, at n = 4. For Algorithm 1, the one-scale nonparametric model, the error is minimized when we use the previous 3 data points as predictors. This is intuitively correct because the data are generated from a nonlinear AR(3) model. The best mean squared error (MSE) is 1.00, and this error converges to the noise variance, as we proved in Theorem 1. The case where the data themselves follow a linear AR(n) model is comparatively trivial, so we omit those results.

5.3 Multi-Resolution AR(n)

We consider a case of Example 3: a mixture of three AR(2) sequences with distinct coefficients, whose noise terms independently and identically follow a distribution with unit variance. The three sequences are interleaved such that if we sample every third data point from the mixed series, the sampled points follow one of the three AR(2) time series. Figure 5 shows one instance of 3000 data points generated by mixing three separate AR(2) sequences, each with 1000 data points. It is very difficult to tell from the data plot what the resolution of the data is, and we show that our Algorithm 2 framework finds the correct resolution and converges to the correct model.
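The AR(2) coefficients are not shown above, so the following sketch interleaves three hypothetical stationary AR(2) series with unit-variance Gaussian noise; only the interleaving pattern (every third point belongs to the same sub-series) is taken from the text.

import numpy as np

def simulate_ar2(phi1, phi2, n, rng):
    """Simulate an AR(2) series with unit-variance Gaussian noise."""
    y = np.zeros(n)
    for t in range(2, n):
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + rng.normal(scale=1.0)
    return y

rng = np.random.default_rng(3)
# Hypothetical coefficient pairs, each inside the AR(2) stationarity region.
coef_pairs = [(0.5, -0.3), (-0.4, 0.2), (0.8, -0.5)]
series = [simulate_ar2(p1, p2, 1000, rng) for p1, p2 in coef_pairs]

# Interleave so that sampling every third point of the mixture recovers one AR(2) series.
mixed = np.empty(3000)
for i, s in enumerate(series):
    mixed[i::3] = s

train, test = mixed[:2400], mixed[2400:]   # 2400 training / 600 test points, as in the text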

Figure 5: 3000 synthetic data points obtained by mixing three separate time series generated from three AR(2) models, whose noise terms independently and identically follow a distribution with unit variance. If we sample every third data point of the combined time series, the sampled points form one of the AR(2) time series.

Figure 6: Heatmap of the logarithm of the final weights of the different models in the exponential weighting algorithm. The resolution ranges from taking every data point to taking every fifth data point, and the number of predictors (lag order) also ranges from one to five, giving 25 models in total. The logarithm is shown because the weights differ greatly in magnitude. The redder the block, the larger the model's weight; the bluer the block, the smaller the model's weight.

Figure 7: Evolution of the model weights as we make predictions on the test data. The first 2400 data points in Figure 5 are used as the training (matching) set, and the last 600 data points are used for testing. We keep the lag order fixed at the true value 2 and only change the resolution from every data point to every fifth data point (five models in total), in order to show clearly how the weight of each model changes.

Figure 8: Evolution of the best model's weight as we make predictions on the test data, for different choices of the capacity C. The first 2400 data points in Figure 5 are used as the training (matching) set, and the last 600 data points are used for testing. We keep the lag order fixed at the true value 2 and only change the resolution from every data point to every fifth data point (five models in total), in order to show clearly how the weight of each model changes.

For each experiment, we use the first 2400 data points as the training set and the last 600 data points as the test set, and we repeat the experiment 100 times. Our algorithm does not assume knowledge of how many autoregressive time series are mixed, what the form of the autoregressive model is, or what the lag order is. We scan the resolution from 1 to 5 and the lag order from 1 to 5, giving 25 models in total. Figure 6 shows the logarithm of the final weight of each model at the end of prediction using Mr-split in one instance; the logarithm is taken because the relative weights of different models differ greatly in magnitude. It shows that the model with resolution 3, using the past 2 data points as predictors, has the largest weight. Figure 7 shows how the weights change as we make predictions in one instance. For the purpose of displaying the weight changes clearly, we keep the lag order fixed at the true value 2 and only change the resolution from every data point to every fifth data point. The weight of the correct model converges to 1 after roughly 400 prediction steps, and the weights of the other models converge to 0. Figure 8 compares the convergence rates of the best model's weight for different values of the capacity C; the convergence is fastest for a moderate choice of C. The average MSE over the 100 experiments is 1.00. As we proved in Theorem 2, the master predictor converges to the estimation error of the true model as the data size increases, which is 1 in this case.
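For concreteness, the 25 (resolution, lag) experts scanned above can be enumerated as below; each expert predicts x[t] by pattern matching on the sub-series of points spaced r steps apart that precedes x[t]. The helper name and indexing convention are our own.

import numpy as np

def resolution_history(x, t, r):
    """Sub-series ..., x[t-2r], x[t-r]: the points whose next element, spaced r
    steps ahead, is exactly x[t]; this is what the resolution-r expert predicts from."""
    return np.asarray(x, dtype=float)[t % r: t: r]

# Enumerate the 25 experts: resolutions 1..5 crossed with lag orders 1..5.
experts = [(r, d) for r in range(1, 6) for d in range(1, 6)]

# Example: the expert (r=3, d=2) matches length-2 patterns within
# resolution_history(x, t, 3), corresponding to the model that receives
# the largest weight in Figure 6.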

6 Real Data Experiment

In this section, we apply our algorithms to real S&P index data, and the advantages of our method are demonstrated. The stock index change cannot easily be modeled effectively by a single parametric model. We test the effect of using the multi-scale nonparametric approach to detect patterns in past data and predict future data. We adopt the Quandl finance & economics database for the evaluation, which is available from https://www.quandl.com/data/YAHOO/INDEX_SPY-SPDR-S-P-500-SPY.

6.1 Daily Index Change Prediction

Investors are more interested in predicting price changes than the absolute index value itself, so in this subsection we apply our method to predict the relative daily price change compared to the previous day. Figure 9 shows the relative daily change of the S&P 500 index from January 2006 to December 2015.

Figure 9: S&P 500 daily index change from Year 2006 to Year 2015.

Figure 10: S&P 500 daily index change estimation RMSE using Algorithm 1 (green) and threshold autoregression model (blue). The horizontal axis is the number of past days we use to predict the next day. Training data: Year 2006 – Year 2013, test data: Year 2014 – Year 2015.

Figure 11: Heatmap of the final weights of the different models in the exponential weighting algorithm. The resolution ranges from every data point to every fifth data point, and the lag order also ranges from one to five, giving 25 models in total. The redder the block, the larger the model's weight; the bluer the block, the smaller the model's weight.

Figure 12: Evolution of the model weights as we make predictions on the test data. We keep the lag order fixed at 5 and only change the resolution from every data point to every fifth data point (five models in total), in order to show clearly how the weight of each model changes.

It is evident from the data plot that the index change behaves very differently over different time periods. In some regions the values tend to be large and volatile, while in other regions they tend to be small and stable. The threshold autoregressive (TAR) model is widely used in economics to describe this phenomenon. We use a one-threshold autoregressive model as the baseline. Data from 2006 to 2013 form the training set, and data from 2014 and 2015 form the test set. The TAR model's prediction RMSE as a function of the number of predictors is shown by the blue line in Figure 10; the horizontal axis is the number of past days used to predict the next day. Its optimal RMSE is around 0.0081, achieved when using the past 5 days of data. The green line shows the RMSE of Algorithm 1. Our nonparametric model has a consistently smaller RMSE than the TAR model. This is because the nonparametric approach implicitly takes possible nonlinear dependence on past data into consideration, whereas TAR only considers linear dependence. The RMSE is minimized at 0.0074 when using the past 5 days of data.
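For reference, a minimal one-threshold (two-regime) autoregression of the kind used as the baseline above can be fit by grid-searching a threshold on the previous value and running ordinary least squares in each regime. The lag order, threshold grid, and choice of switching variable below are our own illustrative assumptions, and the demo series is a placeholder, not the S&P data.

import numpy as np

def fit_tar(x, p=5, thresholds=None):
    """Fit a two-regime TAR(p): the regime is determined by whether x[t-1]
    exceeds a threshold; AR(p) coefficients are fit by OLS in each regime."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    y = x[p:]                                                           # targets x[t], t = p..N-1
    lags = np.column_stack([x[p - j: N - j] for j in range(1, p + 1)])  # x[t-1], ..., x[t-p]
    switch = x[p - 1: N - 1]                                            # regime variable x[t-1]
    design = np.column_stack([np.ones(len(y)), lags])
    if thresholds is None:
        thresholds = np.quantile(switch, np.linspace(0.15, 0.85, 15))
    best = None
    for thr in thresholds:
        masks = (switch <= thr, switch > thr)
        if min(m.sum() for m in masks) < p + 2:                         # need enough points per regime
            continue
        rss, coefs = 0.0, []
        for m in masks:
            beta, *_ = np.linalg.lstsq(design[m], y[m], rcond=None)
            rss += float(np.sum((y[m] - design[m] @ beta) ** 2))
            coefs.append(beta)
        if best is None or rss < best[0]:
            best = (rss, float(thr), coefs)
    return best   # (residual sum of squares, threshold, per-regime coefficients)

# Placeholder usage on synthetic returns (illustrative only)
rng = np.random.default_rng(4)
demo_changes = rng.normal(scale=0.01, size=2000)
print(fit_tar(demo_changes, p=5)[1])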

Mr-split is used to take multiple possible resolutions into consideration. The resolution ranges from every data point to every fifth data point, and the lag order d also ranges from one to five, giving 25 models in total. Figure 11 shows the final weight of each model: a redder block corresponds to a larger model weight and a bluer block to a smaller one. The weights concentrate on resolution 1, which implies that the stock index is mostly affected by recent daily data. Within resolution 1, the weight is heaviest when using the past 5 days of data, so Mr-split converges to the best prediction. Figure 12 shows how the weights change as we make predictions on the daily index change. For the purpose of displaying the weight changes clearly, we keep the lag order fixed at 5, which is the lag order that minimizes the RMSE of Algorithm 1, and only change the resolution from every data point to every fifth data point. The resolution 1 model, namely the one using every data point, gains weight as we make predictions. The optimal RMSE we obtain is 0.0073. This is significantly better than the traditional TAR model and even slightly better than the single best model. The reason could be that some models perform better during some periods of time and others during other periods.

6.2 Monthly Average Index Change Prediction

In this subsection, we consider estimating the monthly average index change, defined as the average of the daily index changes over the first 20 trading days of the month. Figure 13 shows the monthly average index change of the S&P 500 from January 2006 to December 2015. We are interested in whether a low-resolution prediction (using past monthly average index changes) or a high-resolution prediction (for example, predicting every daily index change within a month and using their average as the monthly prediction) is better. A higher-resolution prediction uses more information from the data and intuitively appears better, but our algorithm indicates that this is not necessarily true.
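As a preprocessing sketch for the low-resolution series defined above, one can average blocks of 20 daily changes; the block-based grouping below is an approximation of "the first 20 trading days of each month", since exact calendar grouping depends on the trading-day dataset.

import numpy as np

def monthly_average_changes(daily_changes, days_per_month=20):
    """Average each non-overlapping block of `days_per_month` daily index changes,
    approximating the monthly average index change defined in the text."""
    daily_changes = np.asarray(daily_changes, dtype=float)
    n_months = len(daily_changes) // days_per_month
    trimmed = daily_changes[: n_months * days_per_month]
    return trimmed.reshape(n_months, days_per_month).mean(axis=1)

# Placeholder usage (illustrative only, not the S&P data)
rng = np.random.default_rng(5)
print(monthly_average_changes(rng.normal(scale=0.01, size=2520))[:3])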

Figure 13: S&P 500 monthly average index change from Year 2006 to Year 2015.

Figure 14: Heatmap of the final weights of the different models in the exponential weighting algorithm. The resolutions are 1-day data, 2-day average data, 4-day average data, weekly (5-day) average data, 10-day average data, and monthly (20-day) average data, i.e., six resolution levels (marked as resolution 1 to 6 in the figure); the lag order ranges from one to five, giving 30 models in total. The redder the block, the larger the model's weight; the bluer the block, the smaller the model's weight.

Figure 15: Evolution of the model weights as we make predictions on the test data. We keep the lag order fixed at 2 and only change the resolution among the six levels (1-day data, 2-day average data, 4-day average data, weekly (5-day) average data, 10-day average data, and monthly (20-day) average data), in order to show clearly how the weight of each model changes.

We include six resolution levels in our algorithm: 1-day data, 2-day average data, 4-day average data, weekly (5-day) average data, 10-day average data, and monthly (20-day) average data. The lag order d ranges from one to five, giving 30 models in total. Figure 14 shows the final weight of each model. It is clear from the weights that the model with resolution label 6 and lag order 2 has the largest weight, i.e., the model that uses monthly average data and predicts from the past two monthly averages. Models with the highest resolution (labelled as resolution 1) receive lower weights. Figure 15 shows how the weights change as we make predictions on the monthly average index change. For the purpose of displaying the weight changes clearly, we keep the lag order fixed at 2 and only change the resolution among the six levels. The resolution 6 model, namely the one using monthly (20-day) average data, gains weight as we make predictions. The optimal RMSE we obtain is 0.0011, while the optimal RMSE obtained from TAR is 0.0013. Mr-split thus gives better predictions than the traditional method.

7 Conclusion

We have designed a nonparametric approach for time series inference based on resolution/scale information. We proposed a nonparametric pattern-matching approach that gives consistent predictions for a general autoregressive model, as well as a sequential prediction algorithm that gives good estimates for complex models by combining models from multiple resolutions. Experiments on both synthetic and real data show that the approach is applicable to the inference of complex time series data.

8 Acknowledgements

The authors are especially grateful to Professor Finale Doshi-Velez for carefully reading the draft and providing valuable suggestions in terms of how to improve the quality of the paper. We also thank teaching fellows and classmates in the machine learning course at Harvard University for helpful discussions.

Appendix A Proof of Theorem 2

For $t = 1, \ldots, T$, resolution $r$ gives the predictor $\hat{x}_t^{(r)}$. The true value $x_t$ is then revealed. The $r$th resolution suffers loss $\ell_t^{(r)} = \ell(\hat{x}_t^{(r)}, x_t)$, where $\ell$ is defined as in (3) for some properly chosen capacity $C$. Our goal is to achieve the optimal prediction error for the combined predictor $\hat{x}_t$. Suppose that the $r^\ast$th resolution is the true data generating resolution. We define the cumulated losses $L_T^{(r)} = \sum_{t=1}^{T} \ell_t^{(r)}$ for each resolution and $\hat{L}_T = \sum_{t=1}^{T} \ell(\hat{x}_t, x_t)$ for the combined predictor.

Note that the weights can be defined equivalently by letting $w_1^{(r)} = 1$ and $w_{t+1}^{(r)} = \exp\{-\eta L_t^{(r)}\}$. Define $W_t = \sum_{r=1}^{m} w_t^{(r)}$. It follows that $W_1 = m$ and

$$\ln \frac{W_{T+1}}{W_1} \ge \ln \frac{w_{T+1}^{(r^\ast)}}{m} = -\eta L_T^{(r^\ast)} - \ln m. \qquad (4)$$

We define $A$ to be the event that none of the losses is clipped by the capacity $C$, $A^c$ the complement of $A$, and $p = P(A^c)$. Since the clipped loss coincides with the squared prediction error on $A$, we obtain

(5)

where we have used the definition of $\ell$ in (3) and the bound $\ell \le C$.

Conditioning on the event $A$, the cumulated loss $\hat{L}_T$ of the combined predictor can be bounded by

(6)
(7)

where we have applied Hoeffding's inequality (which states that $E[e^{sX}] \le \exp\{sE[X] + s^2(b-a)^2/8\}$ for any random variable $X$ taking values in $[a, b]$) together with (4), and Jensen's inequality to the first part of (6), since $\ell$ is convex in its first argument. Bringing (7) into (5), we obtain

(8)

Combining Inequalities (8) and (4), we get

(9)

where the last equality is achieved when

(10)

What we are truly interested in is whether the mean prediction error $\frac{1}{T}\sum_{t=1}^{T} E(\hat{x}_t - x_t)^2$ converges to the optimum $e^{(r^\ast)}$.

From our assumption and the Birkhoff Ergodic Theorem, the average in

(11)

converges to $e^{(r^\ast)}$, the mean prediction error under the stationary distribution of the prediction errors, as $T$ tends to infinity. We use $\mathbf{1}_{A^c}$ as the indicator random variable of the event $A^c$. Since

(12)
(13)
(14)

where Inequality (12) follows from Cauchy's inequality and the fact that an indicator equals its own square, (13) follows from Markov's inequality, and (14) follows from the fourth-moment bound in (A5) and the choice of the capacity $C$. Combining the results from (14), (9) and (11), we obtain

(15)

as $T$ tends to infinity, where the equal sign in (15) is achieved at

(16)

This further implies that $\limsup_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} E(\hat{x}_t - x_t)^2 \le e^{(r^\ast)}$, which completes the proof.

References