1 Introduction
In typical forecasting problems, we make probabilistic estimates of future outcomes based on previous observations. Recently, it has been shown that effective forecasting models can be complex nonconvex models Flunkert et al. (2017); Wen et al. (2017). Frequent updates of these models are desirable, since the relationship between the targets and outputs may change over time; however, retraining these models can be time consuming. Online learning is a method of updating the model on each pattern as it is observed, as opposed to batch learning, where training is performed over groups of patterns. It is a common technique for dynamically adapting to new patterns in the data, or for when training over the entire data set is infeasible. The literature on online learning is rich with interesting theoretical and practical applications, but it is usually limited to convex problems, where global optimization is computationally tractable Zinkevich (2003). On the other hand, it is NP-hard to compute the global minimum of nonconvex functions over a convex domain Hazan et al. (2017); Hsu et al. (2012).
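The contrast between online and batch updating can be made concrete with a minimal sketch. The scalar linear model, squared loss, and learning rate below are our own illustrative choices, not taken from the works cited here:

```python
import numpy as np

def online_fit(xs, ys, lr=0.01):
    """Online learning: one gradient update per observed pattern."""
    w = 0.0  # parameter of a hypothetical scalar model y ≈ w * x
    for x, y in zip(xs, ys):
        grad = 2.0 * (w * x - y) * x  # gradient of the squared loss (w*x - y)^2
        w -= lr * grad                # update immediately on this single pattern
    return w

rng = np.random.default_rng(0)
xs = rng.normal(size=500)
ys = 3.0 * xs + 0.1 * rng.normal(size=500)  # true slope is 3.0
w_online = online_fit(xs, ys)               # tracks the batch solution closely
```

In the batch setting, the same model would instead be refit on all 500 pairs at once; the online version touches each pattern exactly once as it arrives.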
Due to the intractability of nonconvex problems, various assumptions on the input have been used to design polynomial-time algorithms Arora et al. (2014); Hsu et al. (2012). However, these assumptions were too specific to particular models, and a more generic approach was needed. One way to achieve this is by replacing the “global optimality” requirement with the more modest requirement of stationarity Allen-Zhu and Hazan (2016).
The idea of online learning was borrowed from game theory, where an online player answers a sequence of questions. The true answers are unknown to the player at the time of each decision, and the player suffers a loss after committing to a decision. These losses are unknown to the player in advance, and the performance of the sequence of decisions is evaluated by the difference between the accumulated loss and the loss of the best fixed decision in hindsight. Most recently, Hazan et al. (2017) proposed a notion of gradient-based local regret for nonconvex games. Inspired by Hazan's approach and incorporating the notion of calibration, we introduce a novel gradient-based local regret for forecasting problems. Calibration is a well-studied concept in forecasting Foster and Vohra (1998). From a game-theoretic point of view, we call a forecasting procedure “calibrated” if the forecasts are consistent in hindsight. To the best of our knowledge, such a definition of regret is new. We show that the proposed regret has a logarithmic bound under certain conditions, and we provide insights into the proposed regret. We conjecture that more efficient algorithms can be developed that minimize our regret.
2 Setting
In online forecasting, our goal is to update the model parameters $x_t$ at each round $t$ in order to incorporate the most recently available information. Assume that $x_0, x_1, \dots, x_T$ represents a collection of consecutive forecast points, where $T$ is an integer and $x_0$ represents an initial forecast point. The losses $f_1, \dots, f_T$ are nonconvex loss functions on some convex subset $\mathcal{K} \subseteq \mathbb{R}^n$. To put it another way, $x_t$ represents the parameters of a machine learning model at time $t$, and $f_t(x_t)$ represents the loss computed using the data available at time $t$ given the model parameters $x_t$.

2.1 Regret Analysis
The performance of online learning algorithms is commonly evaluated by the regret, which is defined as the difference between the actual cumulative loss and the minimum cumulative loss of the best fixed decision across all $T$ rounds:

$$R(T) = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) \qquad (1)$$
If the regret grows linearly with $T$, we can conclude that the player is not learning. If, on the other hand, the regret grows sublinearly, the player is learning and its accuracy is improving. While such a definition of regret makes sense for convex optimization problems, it is not appropriate for nonconvex problems, due to the NP-hardness of nonconvex global optimization even in offline settings. Indeed, most research on nonconvex problems focuses on finding local optima. In the literature on nonconvex optimization algorithms, it is common to use the magnitude of the gradient to analyze convergence. Hazan et al. (2017) introduced a local regret measure: a new notion of regret that quantifies the objective of predicting points with small gradients on average. At each round of the game, the gradients of the loss functions from the $w$ most recent rounds of play are evaluated at the current forecast, and these gradients are averaged. Hazan et al. (2017)'s local regret is defined as the sum of the squared magnitudes of these averaged gradients.
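As a toy illustration of the standard regret in equation 1, consider quadratic losses with an online gradient descent player; both are hypothetical choices of ours, and the best fixed decision in hindsight for such losses is simply the mean of the per-round targets:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
centers = rng.normal(loc=1.0, scale=0.5, size=T)  # hypothetical per-round targets c_t

# Online gradient descent on f_t(x) = (x - c_t)^2 with a decaying step size
x, cum_loss = 0.0, 0.0
for t in range(T):
    cum_loss += (x - centers[t]) ** 2
    x -= (0.5 / np.sqrt(t + 1)) * 2.0 * (x - centers[t])  # gradient step

# Best fixed decision in hindsight for sum_t (x - c_t)^2 is the mean of the c_t
best_loss = float(np.sum((centers.mean() - centers) ** 2))
regret = cum_loss - best_loss  # average regret per round shrinks as T grows
```

Here the average regret per round is small, consistent with sublinear regret growth; a player that never updated $x$ would instead accumulate regret linearly in $T$.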
Definition 2.1.
(Hazan’s local regret) The local regret of an online algorithm is defined as:

$$\mathfrak{R}_w(T) \triangleq \sum_{t=1}^{T} \left\| \nabla F_{t,w}(x_t) \right\|^2 \qquad (2)$$

where $F_{t,w}(x) \triangleq \frac{1}{w} \sum_{i=0}^{w-1} f_{t-i}(x)$ and $f_t \equiv 0$ for all $t \le 0$. Hazan et al. (2017) proposed various gradient descent algorithms for which this regret is sublinear.
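A numerical sketch of Definition 2.1 as paraphrased above: window-averaged gradients are evaluated at the current forecast and their squared magnitudes are summed. The quadratic losses and drifting targets below are hypothetical choices of ours:

```python
import numpy as np

def hazan_local_regret(grad_at, xs, w):
    """sum_t || (1/w) * sum_{i=0}^{w-1} grad f_{t-i}(x_t) ||^2,
    where grad_at(t, x) is the gradient of f_t at x (zero for t <= 0)."""
    total = 0.0
    for t in range(len(xs)):
        avg = np.mean([grad_at(t - i, xs[t]) for i in range(w)], axis=0)
        total += float(np.dot(avg, avg))
    return total

# Toy losses f_t(x) = ||x - c_t||^2 with slowly drifting targets c_t
T, w = 200, 20
cs = np.linspace(0.0, 1.0, T).reshape(-1, 1)
grad_at = lambda t, x: 2.0 * (x - cs[t]) if t >= 0 else np.zeros_like(x)
xs = [c.copy() for c in cs]  # the player forecasts x_t = c_t each round
value = hazan_local_regret(grad_at, xs, w)
# Nonzero even though every x_t minimizes its own f_t, because older
# gradients are re-evaluated at the newest forecast
```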
2.2 Proposed Local Regret
In order to introduce the concept of calibration Foster and Vohra (1998), let’s consider the first-order Taylor series expansion of the cumulative loss:
(3) 
where the expansion holds for any sufficiently small perturbation. If the forecasts are well-calibrated, then perturbing the forecasts by any such perturbation cannot substantially reduce the cumulative loss. Hence, we can say that the sequence of forecasts is asymptotically calibrated with respect to a perturbation if:
(4) 
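The displayed equations (3) and (4) did not survive extraction, but the kind of first-order argument the surrounding text describes can be sketched in assumed notation ($x_t$ the forecasts, $f_t$ the losses, $u$ a small common perturbation, $w$ the window length); this is a hedged reconstruction, not the paper's exact display:

```latex
% First-order Taylor expansion of the windowed cumulative loss under a
% common perturbation u (assumed notation):
\frac{1}{w}\sum_{i=0}^{w-1} f_{t-i}(x_{t-i} + u)
  \;\approx\; \frac{1}{w}\sum_{i=0}^{w-1} f_{t-i}(x_{t-i})
  \;+\; \Big\langle \frac{1}{w}\sum_{i=0}^{w-1} \nabla f_{t-i}(x_{t-i}),\, u \Big\rangle
% Well-calibrated forecasts then require the inner-product term to vanish
% asymptotically for every direction u, i.e. the window-averaged gradient
% must go to zero on average.
```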
Definition 2.2.
(Proposed Regret) We propose a local regret as:

$$\mathfrak{R}^{\mathrm{cal}}_w(T) \triangleq \sum_{t=1}^{T} \left\| \frac{1}{w} \sum_{i=0}^{w-1} \nabla f_{t-i}(x_{t-i}) \right\|^2 \qquad (5)$$

where $f_t \equiv 0$ for $t \le 0$. To motivate equation 5, we use the following equality:
(6) 
which holds for interior points. Using our definition of regret, we effectively evaluate an online learning algorithm by computing the average of the loss gradients at the corresponding forecast values over a sliding window. Hazan et al. (2017)'s local regret, on the other hand, averages the gradients of the previous losses evaluated at the most recent forecast. We believe that our definition of regret is more applicable to forecasting problems, since evaluating today's forecast on previous loss functions can be misleading.
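The contrast drawn in this paragraph can be sketched numerically. With hypothetical quadratic losses, a player who plays each round's minimizer incurs zero regret when gradients are evaluated at each window member's own forecast, yet nonzero regret under Hazan-style evaluation at the most recent forecast:

```python
import numpy as np

def window_regret(grad, xs, w, at_own_forecast):
    """Sliding-window squared-gradient regret (a sketch).
    at_own_forecast=True:  evaluate grad f_{t-i} at x_{t-i} (this paper's style).
    at_own_forecast=False: evaluate grad f_{t-i} at x_t (Hazan-style)."""
    total = 0.0
    for t in range(len(xs)):
        pts = [(t - i, xs[t - i] if at_own_forecast else xs[t])
               for i in range(w) if t - i >= 0]
        avg = np.mean([grad(s, p) for s, p in pts], axis=0)
        total += float(np.dot(avg, avg))
    return total

# Toy drifting quadratic losses f_t(x) = ||x - c_t||^2 (hypothetical)
T, w = 100, 10
cs = np.linspace(0.0, 1.0, T).reshape(-1, 1)
grad = lambda t, x: 2.0 * (x - cs[t])
xs = [c.copy() for c in cs]  # play each round's exact minimizer
ours = window_regret(grad, xs, w, at_own_forecast=True)     # exactly zero
hazans = window_regret(grad, xs, w, at_own_forecast=False)  # strictly positive
```

The perfect-in-hindsight player is penalized under the Hazan-style evaluation only because old gradients are recomputed at today's forecast, which is the behavior the text argues is misleading for forecasting.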
3 Bound Analysis
We provide bounds for different scenarios for the proposed regret in equation 5 for interior points in the feasible set, under the following assumptions: ; ; the parameter update at is: where is the learning rate for some small . We consider three scenarios: (i) , is constant and , (ii) and , (iii) and is constant. We also note the following theorem, whose proof is provided in section 5.1.
Theorem 3.1.
where .
3.1 Scenario 1: , is constant and
Since , the update rule becomes ; in other words, no projection operator is necessary. Hence we can write:
Taking as a unit vector such that , we can write . Hence, the bound for the proposed regret becomes:

(8)

which can be made sublinear in $T$ if the window size $w$ is selected large enough.
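Scenario 1's remark that "no projection operator is necessary" refers to the projected-gradient update used throughout: take a gradient step, then project back onto the feasible set, which is a no-op for interior points. A sketch with a box-shaped feasible set as a hypothetical stand-in for the convex domain:

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n, a simple convex feasible set."""
    return np.clip(x, lo, hi)

def update(x, grad_t, eta):
    """One round: gradient step with learning rate eta, then projection."""
    return project_box(x - eta * grad_t)

# A step that leaves the box gets clipped back to the boundary ...
x_boundary = update(np.array([0.9, -0.9]), np.array([5.0, -5.0]), eta=0.5)
# ... while a step that stays interior is untouched by the projection
x_interior = update(np.array([0.1, 0.2]), np.array([0.2, -0.2]), eta=0.5)
```

For interior iterates the projection leaves the point unchanged, which is why the interior-point analysis in this section can drop the projection operator entirely.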
3.2 Scenario 2: and
Assuming is in the interior of the feasible set for all , and setting , we can write the result in Theorem 3.1 as:
(9)  
(10)  
(11) 
where is set to . Hence, we get:
(12) 
Summing this over all rounds yields:
(13) 
which establishes the logarithmic bound for the proposed regret for interior points in this scenario.
3.3 Scenario 3: and is constant
Similarly to Section 3.2, we can write:
(14) 
Summing this result across all rounds yields:
(15)  
(16) 
which is quadratic in but can be selected accordingly to make the upper bound sublinear.
4 Conclusion
We introduced a new definition of local regret to study nonconvex problems in forecasting. We used the concept of calibration and showed that our regret can be written as a local regret for interior points of the feasible set. Our regret differs from Hazan's regret in that it emphasizes today's reward as opposed to past rewards. We also showed that our definition of regret has a logarithmic bound under some constraints. As future work, we plan to study our regret for boundary points of the feasible set and to propose efficient machine learning algorithms for nonconvex online learning that are optimal in terms of our definition of regret.
References
 Allen-Zhu and Hazan [2016] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699–707, 2016.
 Arora et al. [2014] Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In Conference on Learning Theory, pages 779–806, 2014.
 Flunkert et al. [2017] Valentin Flunkert, David Salinas, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.
 Foster and Vohra [1998] Dean P Foster and Rakesh V Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 1998.
 Hazan et al. [2017] Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in nonconvex games. arXiv preprint arXiv:1708.00075, 2017.

 Hsu et al. [2012] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
 Wen et al. [2017] Ruofeng Wen, Kari Torkkola, and Balakrishnan Narayanaswamy. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
 Zinkevich [2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML03), pages 928–936, 2003.
5 Appendix
Lemma 5.1.
where , for any such that .
Proof.
Let and recall that . Then we have:
(17)  
The inequality in 17 can be justified by the geometric interpretation of projections, as shown in Figure 1.
Plugging , we have:
(18)  
Inequality 18 is a result of the triangle inequality, as drawn in Figure 1. Using the fact that in equation 18, we can write:
(20) 
where equation 20 is a result of . By rewriting as , we get:
(21)  
Note that by replacing with and with in Figure 1, we can see that . Since , we get:
(22) 
∎
Proof of Theorem 3.1:
As a result of Lemma 5.1, we can write the following inequality:
The first term can be rewritten as:
(24) 
The bound for the second term can be written as:
(25) 
as a result of . The bound for the third term can be rewritten as:
(30)  
(31) 
where equation 31 is a result of . Hence, we have:
(32)  
Now, let’s explore the bound for for any . By definition of , we can write:
(34)  
(35)  
(36)  
(37) 
Hence, . Taking and combining equations 32 and 37, we get:
(38)  