Nonparametric regression is a fundamental class of problems that has been studied for more than half a century in statistics and machine learning (Nadaraya, 1964; De Boor et al., 1978; Wahba, 1990; Donoho et al., 1998; Mallat, 1999; Scholkopf and Smola, 2001; Rasmussen and Williams, 2006). It solves the following problem:
Let $y_i = f(x_i) + \text{noise}$ for $i = 1, \dots, n$. How can we estimate a function $f$ using data points $(x_i, y_i)$ and the knowledge that $f$ belongs to a function class $\mathcal{F}$?
The function class typically imposes only weak regularity assumptions on the function, such as boundedness and smoothness, which makes nonparametric regression widely applicable to many real-life applications, especially those with unknown physical processes.
A recent and successful class of nonparametric regression techniques called trend filtering (Steidl et al., 2006; Kim et al., 2009; Tibshirani, 2014; Wang et al., 2014) was shown to have the property of local adaptivity (Mammen and van de Geer, 1997) in both theory and practice. We say a nonparametric regression technique is locally adaptive if it can cater to local differences in smoothness, hence allowing more accurate estimation of functions with varying smoothness and abrupt changes. For example, for functions with bounded total variation (when the function class is a total variation class), standard nonparametric regression techniques such as kernel smoothing and smoothing splines have a mean square error (MSE) of $O(n^{-1/2})$ while trend filtering has the optimal $O(n^{-2/3})$.
Trend filtering is, however, a batch learning algorithm where one observes the entire dataset ahead of time and makes inference about the past. This makes it inapplicable to the many time series problems that motivate the study of trend filtering in the first place (Kim et al., 2009). These include influenza forecasting, inventory planning, economic policy-making, financial market prediction and so on. In particular, it is unclear whether the advantage of trend filtering methods in estimating functions with heterogeneous smoothness (e.g., sharp changes) would carry over to the online forecasting setting. The focus of this work is on developing theory and algorithms for locally adaptive online forecasting, which predicts the immediate future value of a function with heterogeneous smoothness using only noisy observations from the past.
1.1 Problem Setup
We propose a model for nonparametric online forecasting as described in Figure 1. This model can be re-framed in the language of the online convex optimization model with three differences.
We consider only quadratic loss functions of the form.
The learner receives independent noisy gradient feedback, rather than the exact gradient.
The new criterion is called a dynamic regret because we are now comparing to a stronger dynamic baseline that chooses an optimal in every round. Of course, in general, the dynamic regret will be linear in (Jadbabaie et al., 2015). To make the problem non-trivial, we restrict our attention to sequences of that are regular, which makes it possible to design algorithms with sublinear dynamic regret. In particular, we borrow ideas from the nonparametric regression literature and consider sequences that are discretizations of functions in the continuous domain. Regularity assumptions emerge naturally as we consider canonical function classes such as the Holder class, Sobolev class and Total Variation classes (see, e.g., Tsybakov, 2008, for a review).
1.2 Assumptions
We consolidate all the assumptions used in this work and provide the necessary justifications for them.
(A1) The time horizon for the online learner is known to be .
(A2) The parameter of subgaussian noise in the observations is known.
(A3) The sequence has its total variation bounded by some known positive , i.e., we take to be the total variation class where is the discrete difference operator.
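To make assumption (A3) concrete, the following minimal sketch computes the discrete total variation of a sequence as the sum of absolute first differences, which is what applying the discrete difference operator and taking an L1 norm amounts to. The function names `total_variation` and `in_tv_class` and the example sequences are hypothetical and chosen only for illustration.

```python
def total_variation(theta):
    """Discrete total variation: sum of absolute first differences."""
    return sum(abs(b - a) for a, b in zip(theta, theta[1:]))


def in_tv_class(theta, budget):
    """True if the sequence respects the total variation budget."""
    return total_variation(theta) <= budget


# A sequence with one abrupt jump and a smooth ramp over the same range.
step = [0.0] * 5 + [1.0] * 5
ramp = [i / 9 for i in range(10)]

tv_step = total_variation(step)
tv_ramp = total_variation(ramp)
```

Both sequences have (up to rounding) the same total variation, which illustrates why a total variation budget admits sequences with abrupt changes as well as smooth ones.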
The knowledge of in assumption (A2) is primarily used to get the optimal dependence on in the minimax rate. This assumption can be relaxed in practice by using the Median Absolute Deviation estimator as described in Section 7.5 of Johnstone (2017) to estimate robustly. Assumption (A3) covers a large class of functions with a spatially inhomogeneous degree of smoothness. The functions residing in this class need not even be continuous. Our goal is to propose a policy that is locally adaptive and whose empirical mean squared error converges at the minimax rate for this function class. The knowledge of is used to get its optimal dependence in the regret. Assumption (A4) is very mild as it restricts only the first value of the sequence. This assumption controls the inevitable prediction error for the first point in the sequence.
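As an illustration of how the noise level might be estimated robustly, here is a minimal sketch of a Median Absolute Deviation (MAD) estimate computed from finest-scale Haar detail coefficients (scaled differences of adjacent pairs of observations). This is one common variant and not necessarily the exact recipe of the cited section; the function name `mad_sigma`, the synthetic signal and the seed are assumptions made for the example.

```python
import random
import statistics


def mad_sigma(y):
    """Robust noise-level estimate from finest-scale Haar details.

    For a slowly varying trend, (y[2i] - y[2i+1]) / sqrt(2) is roughly
    pure noise with the same standard deviation as the observation noise.
    """
    details = [(y[i] - y[i + 1]) / 2 ** 0.5 for i in range(0, len(y) - 1, 2)]
    center = statistics.median(details)
    mad = statistics.median([abs(d - center) for d in details])
    # 0.6745 is the 0.75 quantile of the standard normal distribution,
    # which makes the MAD consistent for the standard deviation under
    # Gaussian noise.
    return mad / 0.6745


# Synthetic check: piecewise-constant trend plus Gaussian noise, sigma = 2.
rng = random.Random(0)
truth = [0.0] * 2000 + [5.0] * 2000
y = [v + rng.gauss(0.0, 2.0) for v in truth]
sigma_hat = mad_sigma(y)
```

Because the estimate relies on medians, a handful of large detail coefficients caused by jumps in the trend has little influence on it.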
1.3 Our Results
The major contributions of this work are summarized below.
It is known that the minimax MSE for smoothing sequences in the TV class is . This implies a lower bound of for the dynamic regret in our setting. We present a policy Arrows (Adaptive Restarting Rule for Online averaging using Wavelet Shrinkage) with a nearly minimax dynamic regret and a run-time complexity of .
We also provide a more refined lower bound that characterizes the dependence on and , which certifies the optimality of Arrows in all regimes. The bound also reveals a subtle price to pay when we move from the smoothing problem to the forecasting problem, which indicates the separation of the two problems when , a regime where the forecasting problem is strictly harder (see Figure 3).
Lastly, we consider forecasting sequences in Sobolev classes and Holder classes and establish that Arrows can automatically adapt to the optimal regret of these simpler function classes as well, while OGD and MA cannot, unless we change their tuning parameter (to behave suboptimally on the TV class) (see more details in Table 2).
2 Related Work
The topic of this paper sits between two rich bodies of literature: nonparametric regression and online learning. Our results therefore contribute to both fields and hopefully will inspire more interplay between the two communities. Throughout this paper, when we refer to as the optimal regret, we assume the parameters of the problem are such that it is achievable (see Figure 3).
Nonparametric regression. As we mentioned before, our problem (online nonparametric forecasting) is motivated by the idea of using locally adaptive nonparametric regression for time series forecasting (Mammen and van de Geer, 1997; Kim et al., 2009; Tibshirani, 2014). It is more challenging than standard nonparametric regression because we do not have access to the data in the future. While our proof techniques make use of several components (e.g., universal shrinkage) from the seminal work in wavelet smoothing (Donoho et al., 1990, 1998), the way we use them to construct and analyze our algorithm is new and more generally applicable for converting nonparametric regression methods to forecasting methods.
Adaptive Online Learning. Our problem is also connected to a growing literature on adaptive online learning which aims at matching the performance of a stronger time-varying baseline (Zinkevich, 2003; Hall and Willett, 2013; Besbes et al., 2015; Chen et al., 2018b; Jadbabaie et al., 2015; Hazan and Seshadhri, 2007; Daniely et al., 2015; Yang et al., 2016; Zhang et al., 2018a, b; Chen et al., 2018a). Many of these settings are highly general and we can apply their algorithms directly to our problem, but to the best of our knowledge, none of them achieves the optimal dynamic regret.
In the remainder of this section, we focus our discussion on how to apply the regret bounds in non-stationary stochastic optimization (Besbes et al., 2015; Chen et al., 2018b) to our problem while leaving more elaborate discussion with respect to alternative models (e.g. the constrained comparator approach (Zinkevich, 2003; Hall and Willett, 2013), adaptive regret (Jadbabaie et al., 2015; Zhang et al., 2018a), competitive ratio (Bansal et al., 2015; Chen et al., 2018a)), as well as the comparison to the classical time series models to Appendix A.
Regret from Non-Stationary Stochastic Optimization. The problem of non-stationary stochastic optimization is more general than our model because instead of considering only the quadratic functions, , they work with the more general class of strongly convex functions and general convex functions. They also consider both noisy gradient feedback (stochastic first order oracle) and noisy function value feedback (stochastic zeroth order oracle).
In particular, Besbes et al. (2015) define a quantity which captures the total amount of “variation” of the functions using . [Footnote 1: The definition in Besbes et al. (2015) for strongly convex functions is a bit different: the is taken over the convex hull of minimizers. This creates some subtle confusion regarding our results, which we explain in detail in Appendix I.] Chen et al. (2018b) generalize the notion to for any where is the standard norm for functions. [Footnote 2: We define to be a factor of times bigger than the original scaling presented in Chen et al. (2018b) so that the results become comparable to those of Besbes et al. (2015).] Table 1 summarizes the known results under the non-stationary stochastic optimization setting.
|Noisy gradient feedback||Noisy function value feedback|
|Convex & Lipschitz||-||-|
|Strongly convex & Smooth|
Our assumption on the underlying trend can be used to construct an upper bound of this quantity of variation or . As a result, the algorithms in non-stationary stochastic optimization and their dynamic regret bounds in Table 1 will apply to our problem (modulo additional restrictions on bounded domain). However, our preliminary investigation suggests that this direct reduction does not, in general, lead to optimal algorithms. We illustrate this observation in the following example.
Example 1. Let be the set of all bounded sequences in the total variation class . It can be worked out that for all . Therefore the smallest regret from (Besbes et al., 2015; Chen et al., 2018b) is obtained by taking , which gives us a regret of . Note that we expect the optimal regret to be according to the theory of locally adaptive nonparametric regression.
In Example 1, we have demonstrated that one cannot achieve the optimal dynamic regret using known results in non-stationary stochastic optimization. We show in Section 3.1 that the “Restarting OGD” algorithm has a fundamental lower bound of on dynamic regret in the TV class.
Online nonparametric regression. As we finalize our manuscript, it comes to our attention that our problem of interest in Figure 1 can be cast as a special case of the “online nonparametric regression” problem (Rakhlin and Sridharan, 2014; Gaillard and Gerchinovitz, 2015a). The general result of Rakhlin and Sridharan (2014) implies a non-constructive algorithm that enjoys a regret for the TV class (see more details in Appendix A). To the best of our knowledge, our proposed algorithm remains the first polynomial time algorithm with regret and our results reveal more precise (optimal) upper and lower bounds on all parameters of the problem (see Section 3.4).
3 Main results
We are now ready to present our main results.
3.1 Limitations of Linear Forecasters
Restarting OGD, as discussed in Example 1, fails to achieve the optimal regret in our setting. A curious question to ask is whether it is the algorithm itself that fails or whether this is an artifact of a potentially suboptimal regret analysis. To answer this, let us consider the class of linear forecasters, i.e., estimators that output a fixed linear transformation of the observations. The following preliminary result shows that Restarting OGD is a linear forecaster. By the results of Donoho et al. (1998), linear smoothers are fundamentally limited in their ability to estimate functions with heterogeneous smoothness. Since forecasting is harder than smoothing, this limitation translates directly to the setting of linear forecasters.
Proposition 1. Online gradient descent with a fixed restart schedule is a linear forecaster. Therefore, it has a dynamic regret of at least .
Proof. First, observe that the stochastic gradient is of form where is what the agent played at time and is the noisy observation . By the online gradient descent strategy with the fixed restart interval and an inductive argument, is a linear combination of for any . Therefore, the entire vector of predictions is a fixed linear transformation of . The fundamental lower bound for linear smoothers from Donoho et al. (1998) implies that this algorithm will have a regret of at least . ∎
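The linearity argument can be checked numerically. The sketch below assumes the quadratic loss (x - theta_t)^2 with noisy gradient feedback 2 (x_t - y_t), a restart point of 0, and a step size of 1/(2k) on the k-th step of a batch; these specific choices, like the names `restarting_ogd` and `forecaster_matrix`, are illustrative assumptions rather than the exact configuration analyzed in the text. Under them, OGD reduces to running averages within each batch, and the map from observations to predictions is a fixed matrix.

```python
def restarting_ogd(y, batch):
    """Predictions of OGD with a fixed restart schedule on quadratic losses."""
    preds, x = [], 0.0
    for t, obs in enumerate(y):
        k = t % batch                    # steps taken since the last restart
        if k == 0:
            x = 0.0                      # restart from a fixed point
        preds.append(x)                  # prediction is made before seeing obs
        grad = 2.0 * (x - obs)           # noisy gradient of (x - theta_t)^2
        x = x - grad / (2.0 * (k + 1))   # step size 1 / (2k) on the k-th step
    return preds


def forecaster_matrix(n, batch):
    """Matrix M with predictions = M @ observations, built column by column."""
    cols = [restarting_ogd([1.0 if i == j else 0.0 for i in range(n)], batch)
            for j in range(n)]
    return [[cols[j][t] for j in range(n)] for t in range(n)]


M = forecaster_matrix(6, 3)

# Superposition check: f(2a + 3b) equals 2 f(a) + 3 f(b).
ya = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
yb = [0.5, -1.0, 3.0, 2.0, -2.0, 6.0]
lhs = restarting_ogd([2 * a + 3 * b for a, b in zip(ya, yb)], 3)
rhs = [2 * p + 3 * q
       for p, q in zip(restarting_ogd(ya, 3), restarting_ogd(yb, 3))]
is_linear = all(abs(u - v) < 1e-9 for u, v in zip(lhs, rhs))
```

The matrix is strictly lower triangular, since each prediction uses only past observations, and its rows within a batch are the weights of a running average. It does not depend on the data, which is exactly what makes the forecaster linear.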
The proposition implies that we will need fundamentally new nonlinear algorithmic components to achieve the optimal regret, if it is achievable at all!
In this section, we present our policy Arrows (Adaptive Restarting Rule for Online averaging using Wavelet Shrinkage). The following notation is introduced for describing the algorithm.
denotes the start time of the current bin and denotes the current time point.
denotes the average of the values for time steps indexed from to .
denotes the vector zero-padded at the end until its length is a power of 2, i.e., a re-centered and padded version of the observations.
where is a sequence of values, denotes the element-wise soft thresholding of the sequence with threshold
H denotes the orthogonal discrete Haar wavelet transform matrix of proper dimensions
Let where , a power of 2, is the length of . Then the vector can be viewed as a concatenation of contiguous blocks represented by . Each block at level contains coefficients.
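The notation above can be made concrete with a small sketch: re-centering and zero-padding a window of observations, applying an orthonormal Haar transform, soft-thresholding, and splitting the coefficients into per-level blocks. The function names and the coefficient layout (one scaling coefficient followed by blocks of sizes 1, 2, 4, ...) are assumptions made for this illustration; the exact indexing convention of the original notation did not survive in this copy.

```python
import math


def pad_recentered(y):
    """Subtract the mean, then zero-pad to the next power of 2."""
    mean = sum(y) / len(y)
    k = 1
    while k < len(y):
        k *= 2
    return [v - mean for v in y] + [0.0] * (k - len(y))


def haar(x):
    """Orthonormal Haar transform; len(x) must be a power of 2.

    Layout: [scaling coefficient, coarsest detail, ..., finest details].
    """
    approx, details = list(x), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Prepend, so coarser levels end up before finer ones.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


def soft_threshold(coeffs, lam):
    """Element-wise soft thresholding: shrink magnitudes by lam, floor at 0."""
    return [math.copysign(max(abs(c) - lam, 0.0), c) for c in coeffs]


def level_blocks(coeffs):
    """Split the detail coefficients into blocks of sizes 1, 2, 4, ..."""
    blocks, start, size = [], 1, 1
    while start < len(coeffs):
        blocks.append(coeffs[start:start + size])
        start += size
        size *= 2
    return blocks


padded = pad_recentered([1.0, 2.0, 3.0, 4.0, 5.0])
coeffs = haar(padded)
blocks = level_blocks(coeffs)
```

Since the padded vector sums to zero, the first (scaling) coefficient vanishes up to rounding, and the orthonormal transform preserves the squared norm of the input.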
Our policy is the byproduct of the following question: How can one lift a batch estimator that is minimax over the TV class to a minimax online algorithm?
Restarting OGD, when applied to our setting with squared error losses, reduces to partitioning the duration of the game into fixed-size chunks and outputting online averages. As described in Section 3.1, this leads to suboptimal regret. However, the notion of averaging is still a good idea to keep. If within a time interval, the Total Variation (TV) is adequately small, then outputting sample averages is reasonable for minimizing the cumulative squared error. Once we encounter a bump in the variation, a good strategy is to restart the averaging procedure. Thus we need to adaptively detect intervals with low TV. For accomplishing this, we communicate with an oracle estimator whose output can be used to construct a lower bound of TV within an interval. The decision to restart online averaging is based on the estimate of TV computed using this oracle. Such a decision rule introduces non-linearity and hence breaks free of the suboptimal world of linear forecasters.
The oracle estimator we consider here is a slightly modified version of the soft thresholding estimator from Donoho (1995). We capture the high level intuition behind steps (d) and (e) as follows. Computation of Haar coefficients involves smoothing adjacent regions of a signal and taking the difference between them. So we can expect to construct a lower bound of the total variation from these coefficients. The extra thresholding step in (d) is done to denoise the Haar coefficients computed from noisy data. In step (e), a weighted L1 norm of denoised coefficients is used to lower bound the total variation of the true signal. The multiplicative factors are introduced to make the lower bound tighter. We restart online averaging once we detect a large enough variation. The first coefficient is zero due to the re-centering caused by the operation. The hyper-parameter controls the degree to which we shrink the noisy wavelet coefficients. For sufficiently small , it is almost equivalent to the universal soft-thresholding of Donoho (1995). The optimal selection of is described in Theorem 1.
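The intuition that Haar coefficients yield a lower bound on the total variation can be illustrated in the noiseless case. A Haar coefficient whose support has length m equals sqrt(m)/2 times the difference between the means of the two halves of its support, and that difference is at most the total variation inside the support. Summing over the disjoint supports of one level therefore gives a valid bound for each level. The sketch below takes the maximum over levels; this per-level weighting is derived here for illustration and is not claimed to be the exact weighted L1 norm of step (e), whose multiplicative factors are not recoverable from this copy. The function names are hypothetical.

```python
import math


def haar_levels(x):
    """Orthonormal Haar detail coefficients of x (length a power of 2, >= 2).

    Returns a list where levels[l] holds 2**l coefficients, coarsest first.
    """
    approx, levels = list(x), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        levels.insert(0, [(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return levels


def tv_lower_bound(x):
    """Largest per-level bound 2 * 2**(l/2) / sqrt(k) * ||level l||_1."""
    k = len(x)
    return max(
        2.0 * 2 ** (l / 2) / math.sqrt(k) * sum(abs(c) for c in block)
        for l, block in enumerate(haar_levels(x))
    )


def total_variation(x):
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


step = [0.0] * 4 + [1.0] * 4
wiggly = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
bound_step, bound_wiggly = tv_lower_bound(step), tv_lower_bound(wiggly)
```

For the single-jump sequence the bound is tight, while for the oscillating sequence it is a strict underestimate. With noisy data, the coefficients would first be denoised by soft thresholding, as described in the text.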
We refer to the duration between two consecutive restarts inclusive of the first restart but exclusive of the second as a bin. The policy identifies several bins across time, whose width is adaptively chosen.
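To show how bins arise from an adaptive restart rule, here is a simplified, hedged sketch in the spirit of the policy described above. It is not the Arrows algorithm itself: the threshold `lam`, the restart level `restart_level`, the per-level weighting of the coefficients, and the choice to predict the previous observation at the start of a bin are all assumptions made for illustration, and the loop is written for clarity rather than speed.

```python
import math


def haar_levels(x):
    """Haar detail coefficients grouped by level, coarsest first."""
    approx, levels = list(x), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        levels.insert(0, [(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return levels


def adaptive_restart_forecast(y, lam, restart_level):
    """Online averaging that restarts when a denoised TV proxy is large."""
    preds, restarts, start, prev = [], [], 0, 0.0
    for t, obs in enumerate(y):
        past = y[start:t]
        # Predict the running average of the current bin; at the start of
        # a bin, fall back to the previous observation (0.0 initially).
        preds.append(sum(past) / len(past) if past else prev)
        prev = obs
        window = y[start:t + 1]
        mean = sum(window) / len(window)
        k = 1
        while k < len(window):
            k *= 2
        if k < 2:
            continue
        # Re-centered, zero-padded window; padding adds a boundary jump,
        # so the quantity below is only a proxy for the variation.
        padded = [v - mean for v in window] + [0.0] * (k - len(window))
        proxy = max(
            2.0 * 2 ** (l / 2) / math.sqrt(k)
            * sum(max(abs(c) - lam, 0.0) for c in block)  # soft thresholding
            for l, block in enumerate(haar_levels(padded))
        )
        if proxy > restart_level:
            restarts.append(t + 1)   # a new bin starts at the next time step
            start = t + 1
    return preds, restarts


signal = [0.0] * 16 + [10.0] * 16
preds, restarts = adaptive_restart_forecast(signal, lam=0.5, restart_level=1.0)
```

On this noiseless step signal the sketch restarts once, right after the jump is observed, so only the prediction at the jump itself is wrong. Recomputing the transform from scratch at every step is wasteful; the text notes that the sequential structure and the sparsity of the wavelet transform allow a much faster implementation.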
3.3 Dynamic Regret of Arrows
In this section, we provide bounds for non-stationary regret and run-time of the policy.
Theorem 1. Let the feedback be where is an independent, -subgaussian random variable. Let . If , then with probability at least , Arrows achieves a dynamic regret of where hides a logarithmic factor in and .
Proof sketch. Our policy is similar in spirit to restarting OGD but with an adaptive restart schedule. The key idea is to reduce the dynamic regret of our policy, in probability, roughly to the sum of the squared error of a soft thresholding estimator and the number of restarts. This was accomplished by using a Follow The Leader (FTL) reduction. To bound the squared error part of the sum, we modified the threshold value for the estimator in Donoho (1995) and proved high probability guarantees for the convergence of its empirical mean. To bound the number of times we restart, we first establish a connection between Haar coefficients and total variation. This is intuitive since computation of Haar coefficients can be viewed as smoothing the adjacent regions of a signal and taking their difference. Then we exploit a special condition called “uniform shrinkage” of the soft-thresholding estimator, which helps to optimally bound the number of restarts with high probability. ∎
3.4 A lower bound on the minimax regret
We now give a matching lower bound of the expected regret.
Assume and , there is a universal constant such that
The proof is deferred to Appendix I. The result shows that our result in Theorem 1 is optimal up to a logarithmic term in and for almost all regimes (modulo trivial cases of extremely small and ). [Footnote 3: When both and are moderately small relative to , the lower bound will depend on a little differently because the estimation error goes to faster than . We know the minimax risk exactly for that case as well but it is somewhat messy (see e.g. Wasserman, 2006). When they are both much smaller than , e.g., when , then outputting when we do not have enough information will be better than doing online averages.]
Remark 1 (The price of forecasting).
The lower bound implies that a term with is required even if , whereas a one-step look-ahead oracle (or the smoothing algorithm that sees all observations) does not incur this term. This implies that the total amount of variation that any algorithm can handle while producing a sublinear regret has dropped from to . Moreover, the term is meaningful only when . For the region where , the minimax regret is essentially proportional to . This is illustrated in Figure 3.
It is worth pointing out that knowledge of and in the policy is primarily used to get the optimal dependence of in the minimax regret. One can still get a regret that grows as even without the knowledge of these parameters if it is in the achievable regime by taking them to be a fixed constant in the restart criterion.
3.5 Fast computation
Last but not least, we remark that there is a fast implementation of Arrows that reduces the overall time-complexity for step from to .
The run time of Arrows is , where is the time horizon.
The proof exploits the sequential structure of our policy and sparsity in wavelet transforms.
See Appendix G for details.
3.6 The adaptivity of Arrows to Sobolev and Holder classes
|Minimax Algorithm||Arrows||
The discrete Sobolev class and the discrete Holder class are defined as
These classes feature sequences that are more spatially homogeneous than those in the TV class. The minimax cumulative error of nonparametric estimation in the discrete Sobolev class is (see e.g., Sadhanala et al., 2016, Theorem 5 and 6).
Let the feedback be where is an independent, -subgaussian random variable. Let . If , then with probability at least , Arrows with calibration achieves a dynamic regret of where hides a logarithmic factor in and .
Thus, despite the fact that Arrows is designed for the total variation class, it adapts to the optimal rates of forecasting sequences that are spatially regular. The chosen scaling of activates the embedding (see the illustration in Table 2) with both classes having the same minimax rate. This implies that Arrows is simultaneously minimax over and . Minimaxity in the Sobolev class implies minimaxity in the Holder class since it is known that a Holder ball is sandwiched between two Sobolev balls having the same minimax rate (see e.g., Tibshirani, 2015). A proof of the theorem and related experiments are presented in Appendices F and J.
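The embedding between the classes rests on elementary norm inequalities. Assuming the discrete Sobolev and Holder classes bound the L2 and L-infinity norms of the first differences respectively (the class definitions did not survive in this copy, so this is an assumption), a vector d of m differences satisfies ||d||_1 <= sqrt(m) ||d||_2 <= m ||d||_inf. A Holder-type bound thus implies a Sobolev-type bound, which in turn implies a total variation bound, after rescaling the radii. The sketch below checks the chain numerically on a hypothetical test sequence.

```python
import math


def difference_norms(theta):
    """L1, L2 and L-infinity norms of the first differences of theta."""
    d = [b - a for a, b in zip(theta, theta[1:])]
    l1 = sum(abs(v) for v in d)
    l2 = math.sqrt(sum(v * v for v in d))
    linf = max(abs(v) for v in d)
    return l1, l2, linf, len(d)


theta = [math.sin(0.3 * t) for t in range(64)]
l1, l2, linf, m = difference_norms(theta)

# ||d||_1 <= sqrt(m) * ||d||_2 (Cauchy-Schwarz) and ||d||_2 <= sqrt(m) * ||d||_inf.
chain_holds = (l1 <= math.sqrt(m) * l2 + 1e-12
               and math.sqrt(m) * l2 <= m * linf + 1e-12)
```

The first inequality is tight when all differences have equal magnitude, which is the spatially homogeneous case; sequences with a few large jumps make it loose, which is where the total variation class is strictly richer.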
3.7 Experimental Results
To empirically validate our results, we conducted a number of numerical simulations that compare the regret of Arrows, (Restarting) OGD and MA. Figure 2 shows the results on a function with heterogeneous smoothness (see the exact details and more experiments in Appendix B) with the hyperparameters selected according to their theoretical optimal choice for the TV class (see Theorems 3 and 4 for OGD and MA in Appendix C). The left panel illustrates that Arrows is locally adaptive to the heterogeneous smoothness of the ground truth. Red peaks in the figure signify restarts. During the initial and final duration, the signal varies smoothly and Arrows chooses a larger window size for online averaging. In the middle, the signal varies rather abruptly. Consequently, Arrows chooses a smaller window size. On the other hand, the linear smoothers OGD and MA use a constant width and cannot adapt to the different regions of the space. These differences are also reflected in the quantitative evaluation on the right, which clearly shows that OGD and MA have a suboptimal regret while Arrows attains the minimax regret!
4 Concluding Discussion
In this paper, we studied the problem of forecasting bounded variation sequences. We proposed a new forecasting policy Arrows, which we show to enjoy a dynamic regret of with runtime . We also derived a lower bound which matches the upper bound up to a logarithmic term, which certifies the optimality of Arrows in all parameters. Through a connection to linear estimation theory, we assert that no linear forecaster can achieve the optimal rate. Adapting to parameters like and generalizing to higher order TV classes and other convex loss functions are considered as future directions to pursue.
DB and YW were supported by a start-up grant from UCSB CS department and a gift from Amazon Web Services. The authors thank Yining Wang for a preliminary discussion that inspired the work, and Akshay Krishnamurthy and Ryan Tibshirani for helpful comments on an earlier version of the paper.
- Bansal et al. (2015) Nikhil Bansal, Anupam Gupta, Ravishankar Krishnaswamy, Kirk Pruhs, Kevin Schewior, and Cliff Stein. A 2-competitive algorithm for online convex optimization with switching costs. In LIPIcs-Leibniz International Proceedings in Informatics, volume 40. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2015.
- Baum and Petrie (1966) Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state markov chains. The annals of mathematical statistics, 37(6):1554–1563, 1966.
- Besbes et al. (2015) Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations research, 63(5):1227–1244, 2015.
- Bickel et al. (1981) PJ Bickel et al. Minimax estimation of the mean of a normal distribution when the parameter space is restricted. The Annals of Statistics, 9(6):1301–1309, 1981.
- Birge and Massart (2001) Lucien Birge and Pascal Massart. Gaussian model selection. Journal of the European Mathematical Society, 3(3):203–268, 2001.
- Box and Jenkins (1970) George EP Box and Gwilym M Jenkins. Time series analysis: forecasting and control. John Wiley & Sons, 1970.
- Chen et al. (2018a) Niangjun Chen, Gautam Goel, and Adam Wierman. Smoothed online convex optimization in high dimensions via online balanced descent. In Conference on Learning Theory (COLT-18), 2018a.
- Chen et al. (2018b) Xi Chen, Yining Wang, and Yu-Xiang Wang. Non-stationary stochastic optimization under lp, q-variation measures. Operations Research, to appear., 2018b.
- Daniely et al. (2015) Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411, 2015.
- De Boor et al. (1978) Carl De Boor. A practical guide to splines, volume 27. Springer-Verlag New York, 1978.
- Donoho et al. (1990) David Donoho, Richard Liu, and Brenda MacGibbon. Minimax risk over hyperrectangles, and implications. Annals of Statistics, 18(3):1416–1437, 1990.
- Donoho (1995) David L Donoho. De-noising by soft-thresholding. IEEE transactions on information theory, 41(3):613–627, 1995.
- Donoho et al. (1998) David L Donoho, Iain M Johnstone, et al. Minimax estimation via wavelet shrinkage. The annals of Statistics, 26(3):879–921, 1998.
- Gaillard and Gerchinovitz (2015a) Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonparametric regression. In Conference on Learning Theory, pages 764–796, 2015a.
- Gaillard and Gerchinovitz (2015b) Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonparametric regression. In COLT, 2015b.
- Hall and Willett (2013) Eric Hall and Rebecca Willett. Dynamical models and tracking regret in online convex programming. In International Conference on Machine Learning (ICML-13), pages 579–587, 2013.
- Hazan and Seshadhri (2007) Elad Hazan and Comandur Seshadhri. Adaptive algorithms for online decision problems. In Electronic colloquium on computational complexity (ECCC), volume 14, 2007.
- Hodrick and Prescott (1997) Robert J Hodrick and Edward C Prescott. Postwar us business cycles: an empirical investigation. Journal of Money, credit, and Banking, pages 1–16, 1997.
- Hutter and Rigollet (2016) Jan-Christian Hutter and Philippe Rigollet. Optimal rates for total variation denoising. In Conference on Learning Theory (COLT-16), 2016. to appear.
- Jadbabaie et al. (2015) Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pages 398–406, 2015.
- Johnstone (2017) Iain M. Johnstone. Gaussian estimation: Sequence and wavelet models. 2017.
- Kim et al. (2009) Seung-Jean Kim, Kwangmoo Koh, Stephen Boyd, and Dimitry Gorinevsky. ℓ1 trend filtering. SIAM Review, 51(2):339–360, 2009.
- Koolen et al. (2015) Wouter M Koolen, Alan Malek, Peter L Bartlett, and Yasin Abbasi. Minimax time series prediction. In Advances in Neural Information Processing Systems (NIPS’15), pages 2557–2565. 2015.
- Kotłowski et al. (2016) Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Online isotonic regression. In Annual Conference on Learning Theory (COLT-16), volume 49, pages 1165–1189. PMLR, 2016.
- Mallat (1999) Stéphane Mallat. A wavelet tour of signal processing. Elsevier, 1999.
- Mammen and van de Geer (1997) Enno Mammen and Sara van de Geer. Locally adaptive regression splines. Annals of Statistics, 25(1):387–413, 1997.
- Nadaraya (1964) Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
- Rakhlin and Sridharan (2014) Alexander Rakhlin and Karthik Sridharan. Online non-parametric regression. In Conference on Learning Theory, pages 1232–1264, 2014.
- Rakhlin and Sridharan (2015) Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression with general loss functions. CoRR, abs/1501.06598, 2015.
- Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. MIT Press, 2006.
- Sadhanala et al. (2016) Veeranjaneyulu Sadhanala, Yu-Xiang Wang, and Ryan Tibshirani. Total variation classes beyond 1d: Minimax rates, and the limitations of linear smoothers. Advances in Neural Information Processing Systems (NIPS-16), 2016.
- Scholkopf and Smola (2001) Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
- Steidl et al. (2006) Gabriel Steidl, Stephan Didas, and Julia Neumann. Splines in higher order TV regularization. International Journal of Computer Vision, 70(3):214–255, 2006.
- Tibshirani (2015) Ryan Tibshirani. Nonparametric Regression: Statistical Machine Learning, Spring 2015, 2015. URL: http://www.stat.cmu.edu/~larry/=sml/nonpar.pdf. Last visited on 2019/04/29.
- Tibshirani (2014) Ryan J Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. Annals of Statistics, 42(1):285–323, 2014.
- Tsybakov (2008) Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008.
- Wahba (1990) Grace Wahba. Spline models for observational data, volume 59. Siam, 1990.
- Wang et al. (2014) Yu-Xiang Wang, Alex Smola, and Ryan Tibshirani. The falling factorial basis and its statistical applications. In International Conference on Machine Learning (ICML-14), pages 730–738, 2014.
- Wasserman (2006) Larry Wasserman. All of Nonparametric Statistics. Springer, New York, 2006.
- Yang et al. (2016) Tianbao Yang, Lijun Zhang, Rong Jin, and Jinfeng Yi. Tracking slowly moving clairvoyant: optimal dynamic regret of online learning with true and noisy gradient. In International Conference on Machine Learning (ICML-16), pages 449–457, 2016.
- Zhang et al. (2018a) Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems (NeurIPS-18), pages 1323–1333, 2018a.
- Zhang et al. (2018b) Lijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning (ICML-18), pages 5877–5886, 2018b.
- Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
Appendix A Discussion on other related works
Regret from Adaptive Optimistic Mirror Descent.
In Jadbabaie et al. (2015), the authors propose the Adaptive Optimistic Mirror Descent (AOMD) algorithm that minimizes the dynamic regret against a comparator sequence . Their learning framework is the full information setting where the learner predicts for a convex set . Then a loss function is revealed to the learner. To capture the regularity of the comparator, they define a quantity . They capture the regularity of loss functions by incorporating some external knowledge about their gradients via a predictable sequence . They define: . Finally, to account for the temporal variability of , they introduce as discussed earlier. The final regret bound is expressed in terms of these three quantities. However, their algorithm is adaptive and requires no prior knowledge about them.
We note that our problem can be reduced to their framework if one considers loss functions . Then the expected dynamic regret against the comparator sequence is given by
where the expectation on the right-hand side is over the randomness of the forecasting strategy. Hence we observe that their regret bound can be directly applied to bound the dynamic regret of our problem. It can be shown that (see Appendix H) given a fixed total variation bound , then and can be proved to be with high probability. Plugging this into their regret bound yields an rate in probability. However, it is unclear whether AOMD is fundamentally limited by this rate or whether there is a potential suboptimality in their analysis of regret on our particular problem.
Other Dynamic Regret minimizing policies.
(Yang et al., 2016) defines a path variation budget that is equivalent to our to characterize the sequence of convex loss functions. However, under the noisy gradient feedback structure, they use a version of restarting OGD to get regret rate. This is very similar to the policy in (Besbes et al., 2015). Since OGD is a linear forecaster, it is sub-optimal for predicting bounded variation sequences under the squared error metric.
Koolen et al. (2015) consider minimizing the dynamic regret subject to . This constraint essentially defines the discrete Sobolev class. As shown in Appendix E, our policy is minimax for forecasting such sequences as well.
Chen et al. (2018a) consider the Smoothed Online Convex Optimization framework, in which they simultaneously minimize the hitting loss and a switching cost. They provide dynamic regret bounds on this composite cost in the setting that is known to the learner before making the prediction. If we consider , then the baseline they compare against reduces to the offline Trend Filtering (TF) estimate when . Then Theorem 10 of Chen et al. (2018a) states that the cumulative composite cost incurred by their proposed policy differs from that of the TF estimate by a term that is . However, this does not translate to a meaningful regret bound in our setting.
Hall and Willett (2013) propose the Dynamic Mirror Descent (DMD) algorithm, which makes use of a family of dynamical models for making a prediction at each time step. They achieve a dynamic regret bound of where the second term measures the quality of the dynamical models in predicting the ground truth.
Comparison to Online Isotonic Regression.
Kotłowski et al. (2016) consider the dynamic regret minimization problem,
where is a label revealed by the environment, is the prediction of the learner, and the comparator sequence should obey . Here is a fixed positive number. Note that their setting and our framework are mutually reducible in terms of regret guarantees via (3). They propose a minimax policy that achieves a dynamic regret of which translates to an in probability in our setting under the isotonic ground truth restriction.
We note that the isotonic comparator sequence belongs to a TV class with variation budget . By an argument similar to that in Appendix H, which involves converting to the deterministic noise setting and conditioning on a high-probability event, it can be shown that our policy is minimax out of the box, with high probability, in the isotonic framework.
Comparison to Online Non-Parametric regression methods.
We note that our problem falls into the more general framework of online nonparametric regression studied in Rakhlin and Sridharan (2015). We can reduce our dynamic regret minimization to their framework by a similar argument as above through (3). Since the bounded TV class is sandwiched between Besov spaces for the range , the discussion in Section 5.8 of Rakhlin and Sridharan (2015) establishes the minimax growth w.r.t. as in the online setting for the TV class. Thus our bounds match theirs up to logarithmic factors, though we also give the precise dependence on and . It is worthwhile to point out that while the bound in Rakhlin and Sridharan (2015) is non-constructive, we achieve the same bound via an efficient algorithm.
Gaillard and Gerchinovitz (2015b) propose a minimax policy for underlying ground truth functions that are Hölder continuous. In particular, for the Hölder class that satisfies , their algorithm yields a regret of . It is known (Tibshirani, 2015) that is sandwiched between two Sobolev balls having the same minimax rate. Since our policy is optimal for the Sobolev space (Appendix F), it is also optimal over the Hölder ball . The runtime of their policy for class is . It should be noted that the Sobolev and Hölder classes are arguably easier to tackle than the TV class, since both of them can be embedded inside a TV class.
Strongly Adaptive Regret.
Daniely et al. (2015) introduced the notion of Strongly Adaptive (SA) regret, where the online learner is guaranteed to have low static regret on any interval within the duration of the game. They also propose a meta-algorithm that can convert an algorithm with good static regret into one with good SA regret. However, low static regret on every interval does not help in our setting, because in each interval we are competing with a stronger dynamic adversary. A notion of SA dynamic regret would be an interesting topic to explore.
For minimizing dynamic regret, Zhang et al. (2018b) proposed a meta-policy that uses an algorithm with good SA regret as its subroutine. Hence we can use their framework with squared error loss functions as discussed above. They show that OGD has an SA regret of for strongly convex loss functions. Using OGD as the subroutine and applying Corollary 7 of their paper yields a bound . By a similar argument, one gets the same linear regret rate when Online Newton Step is used as the subroutine. However, we should note that their algorithm works without knowledge of the radius of the TV ball .
Classical time series forecasting models.
Finally, we note that our work is complementary to much of the classical work in time-series forecasting (e.g., the Box-Jenkins method/ARIMA (Box and Jenkins, 1970)). These methods aim at using dynamical systems to capture the recurrent patterns under a stationary stochastic process, while we focus on harnessing the nonstationarity. Our work is closer to the “trend smoothing” literature (e.g., the celebrated Hodrick-Prescott filter (Hodrick and Prescott, 1997) and trend filtering (Kim et al., 2009; Tibshirani, 2014; Hutter and Rigollet, 2016)).
Appendix B Additional Experiments
The function that we generated in Figure 2 is a hybrid function, which in the first half is a “discretized cubic spline” with more knots placed closely towards the end. In the second half it is a Doppler function with being the time horizon. We observe noisy data , where are i.i.d. normal variables with . The value of for is around 17. Hence for all , we are under the regime of .
The window size for moving averages and the partition width of OGD were tuned optimally for the TV class (see Appendix C for details). Figure 2 depicts the estimated signals and the dynamic regret averaged across 5 runs in a log-log plot. The left panel illustrates that Arrows is locally adaptive to the heterogeneous smoothness of the ground truth. Red peaks in the figure signify restarts. During the initial and final durations, the signal varies smoothly and Arrows chooses a larger window size for online averaging. In the middle, the signal varies rather abruptly, and consequently Arrows chooses a smaller window size. On the other hand, the linear smoothers OGD and MA attain a suboptimal regret.
In Figures 4 and 5 we plot the estimates and log-log regret for two more functions: a linear function, which is homogeneously smooth and less challenging, and a step function, which has an abrupt discontinuity that makes it more inhomogeneous than the linear function but less inhomogeneous than the hybrid signal considered in 3.7. Both OGD and MA were optimally tuned for the TV class as in Appendix C.
The red peaks correspond to restarts by Arrows. For the linear function, we can see that Arrows chooses inter-restart durations (bin widths) that are constant throughout. This is expected, as a linear trend is spatially homogeneous. For the step function, we see that Arrows restarts only once after the start. Further, notice that it quickly restarts once the bump is hit. For both of these functions, the necessary scaling is done so that we are in the regime quite early.
Appendix C Upper bounds of linear forecasters
In this section we compute the optimal batch size for Restarting OGD and optimal window size for moving averages to yield the regret rate.
Theorem 3. Let the feedback be where is an independent, -subgaussian random variable. Let . Restarting OGD with batch size of achieves an expected dynamic regret of .
Note that in our setting with squared error losses , the update rule of restarting OGD reduces to computing online averages. Thus OGD essentially divides the time horizon into fixed-size batches and outputs online averages within each batch. Our objective here is to compute the optimal batch size that minimizes the dynamic regret.
We will bound the expected regret. Let be the estimate of OGD at time . Let the batches be numbered as where is the fixed batch size. Let the total variation of the ground truth within batch be . The time interval of batch is denoted by . Due to the bias-variance decomposition within a batch, we have
with the convention and, at the start of a bin, our prediction is just the noisy realization of the previous data point.
Summing across all bins gives,
where we have used assumption (A4) to bound the bias of the first prediction. The above expression can be minimized by setting to yield a regret bound of ∎
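To make the reduction concrete, the following Python sketch implements restarting OGD for squared error losses as online averaging within fixed-size batches, following the description in the proof above. The function name `restarting_online_average`, the parameter `batch_size`, and the choice of predicting 0.0 at the very first time step are assumptions made for illustration; they are not taken from the paper.

```python
def restarting_online_average(y, batch_size):
    """Forecast each y[t] from past observations, restarting every batch_size steps.

    Within a batch the forecast is the running mean of the observations seen
    so far in that batch. At the first step of a batch the forecast is the
    previous observation; at t = 0 it is 0.0 (an assumed convention).
    """
    preds = []
    running_sum, count = 0.0, 0
    for t, obs in enumerate(y):
        if t % batch_size == 0:
            # Restart: discard everything learned in the previous batch.
            running_sum, count = 0.0, 0
        if count > 0:
            preds.append(running_sum / count)
        else:
            preds.append(float(y[t - 1]) if t > 0 else 0.0)
        # The observation y[t] is revealed only after the forecast is made.
        running_sum += obs
        count += 1
    return preds
```

A small batch size reacts quickly to changes in the ground truth but averages over few noisy observations, while a large batch size does the opposite; this is the bias-variance tradeoff that the choice of batch size in the proof balances.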
Under the same setup as in Theorem 3, moving averages with window size yields a dynamic regret of
Let the window size of moving averages be denoted by . Consider the prediction at a time . By bias variance decomposition we have,
By Jensen’s inequality,
Notice that the term will be multiplied by a factor in the above bias bound at time point , times in the next time point and so on. By summing this bias bound across the time points, we obtain
The squared bias for the initial points can be bounded by.
Summing the variance terms yields,
Thus the total MSE can be minimized by setting , and we obtain a dynamic regret bound of
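As an illustration of the forecaster analyzed above, here is a minimal Python sketch of a moving-average policy with a fixed window. The name `moving_average_forecast`, the parameter `window`, and the handling of the initial points (averaging whatever observations are available, and predicting 0.0 before any data arrives) are assumptions made for this sketch and may differ from the convention used in the analysis.

```python
def moving_average_forecast(y, window):
    """Forecast each y[t] as the mean of the (at most) `window` previous observations."""
    y = list(y)
    preds = []
    for t in range(len(y)):
        # Only observations strictly before time t are available to the forecaster.
        past = y[max(0, t - window):t]
        preds.append(sum(past) / len(past) if past else 0.0)
    return preds
```

Unlike the restarting policy, the window slides forward one step at a time, so every forecast after the initial points averages exactly `window` observations.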
Appendix D Proof of useful lemmas
We begin by recording an observation that follows directly from the policy.
For bin that spans the interval , discovered by the policy, let the lengths of and be and respectively. Then
In the next lemma, we record the uniform shrinkage property of soft-thresholding estimator.
For any interval , let and . Then with probability at least for each coordinate .
Consider a fixed bin with zero padded vector . Due to the sub-gaussian tail inequality, we have with probability at least . Consider the case . Then both scenarios and lead to shrinkage to a value that is smaller than in magnitude, due to soft-thresholding with the threshold set to . Now consider the case when . Again, soft-thresholding in both scenarios and leads to shrinkage to a value that is smaller than in magnitude. One can come up with a similar argument for the case where . Now applying a union bound across all coordinates and all bins, we get the statement of the lemma. ∎
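The deterministic core of this argument can be checked numerically. The Python sketch below defines the soft-thresholding operator and a helper that tests the shrinkage property for a single coordinate: whenever the noise magnitude is at most the threshold, the thresholded noisy coefficient is no larger in magnitude than the clean one. The names `soft_threshold`, `shrinkage_holds`, and `lam` are illustrative assumptions; the specific threshold used by the policy is not reproduced here.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; inputs in [-lam, lam] are mapped to 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def shrinkage_holds(clean, noise, lam, tol=1e-12):
    """Check |soft_threshold(clean + noise)| <= |clean|, assuming |noise| <= lam."""
    if abs(noise) > lam:
        raise ValueError("the property is only claimed when |noise| <= lam")
    return abs(soft_threshold(clean + noise, lam)) <= abs(clean) + tol
```

For example, `shrinkage_holds(0.25, 1.0, 1.0)` is true: the noisy value 1.25 is thresholded to 0.25, which does not exceed the clean coefficient in magnitude.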
The number of bins, , discovered by the policy is at most with probability at least .
Let be the mean subtracted and zero padded sequence values in bin discovered by our policy, and let be the corresponding mean subtracted and zero padded observations. Note that due to zero padding and some of the last values in the vector can be zeroes. Let denote the discrete wavelet coefficient vector. We can view the computation of the Haar coefficients as a recursion. At each level of the recursion, the entire length , is divided into intervals. Let the sample averages of the elements of in these intervals be denoted by the sequence . Let denote the vector of Haar coefficients at level .
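The recursive view of the Haar transform can be sketched in Python as follows. This is a generic orthonormal Haar pyramid on a zero padded input, not the paper's exact construction: the function names, the sign convention for the detail coefficients, and the ordering of levels (coarsest first) are assumptions made for illustration. At each step the running approximation holds rescaled interval sums, so the detail coefficients are scaled differences of averages over adjacent intervals.

```python
import math


def zero_pad_to_power_of_two(x):
    """Append zeros so that the length becomes the next power of two."""
    n = 1
    while n < len(x):
        n *= 2
    return list(x) + [0.0] * (n - len(x))


def haar_coefficients(x):
    """Orthonormal Haar transform of x after zero padding.

    Returns (scaling, details), where details[0] is the coarsest level and
    details[-1] is the finest.
    """
    approx = zero_pad_to_power_of_two([float(v) for v in x])
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail: scaled difference of neighbouring block sums.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        # Approximation: merged block sum, rescaled to keep the transform orthonormal.
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    details.reverse()
    return approx[0], details
```

Because the transform is orthonormal, the squared coefficients sum to the squared norm of the padded input, which is a convenient sanity check on any implementation.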
First note that the Haar coefficient