1 Introduction
Exponential smoothing (ES) methods model current and future time series observations as a weighted combinations of past observations, with more weight given to recent data. The word ‘exponential’ reflects the exponential decay of weights for older observations. ES methods have been around since the 1950s, and are still very popular forecasting methods used in business and industry, including supply chain forecasting [4], stock market analysis [15, 12, 3], weather prediction [16, 13], and electricity demand forecasting [14, 11].
In contrast to many techniques in machine learning, ES provides simple and interpretable models and forecasting capability by assuming a fixed structure for the evolution of the time series. For example, a simple (level only) model is
(1) 
where is an observation at time , and
is the estimate of
at time given . The forecast at is adjusted by a fraction of the error at time ; larger means greater adjustment. Iterating (1), we have(2) 
illustrating the exponential decay.
(a) Noisy series fit by HW (green), Robust 
HW (red transparent), and ES cell (red solid). 
(b) Level  (c) Trend 
(d) Seasonality  (e) Forecast (100 steps). 
The in (1) is fit using available data and used across the entire period of interest. More generally, a time series model also includes trend (longterm direction) and periodic seasonality components, with additional smoothing terms (
for these components. Classic ES methods use observed data to fit smoothing components, error variance, and initial state, and then use the quantities to provide point forecasts and quantify uncertainty. Every additive ES can be formulated using the compact single source of error (SSOE) representation
[9, 8]:(3)  
where is a linear measurement model, is a linear transition function and
is the vector of smoothing parameters. In (
1), we have , , and , and . More generally, tracks the deterministic components of the time series (level, trend, seasonality) while adjusts for stochastic disturbances.Flexibility of ES models
To show how ES models are constructed and transformed into (3), we compare the simple linear, Holt’s linear, and HoltWinters models. The simple linear model from (1) tracks the level using a zeroorder polynomial approximation:
(4) 
Holt’s Linear Model uses a first order (tangent) line approximation, and tracks both level and trend :
(5) 
Finally, the HoltWinters model adds a seasonality component with known periodicity , giving the augmented state , see (6).
(6) 
To write the HoltWinters model in form of (3), take
(7) 
(a) Local estimation by a single ES cell  (b) Overall approach: ESCells linked by a dynamic model. 
Limitations of classic ES
SSOE models are attractive in their simplicity, since they use a single parameter vector to model perturbations. However, the same must applied at every time point; this limits model flexibility and ensures estimates of are strongly affected by artifacts in the data, including noise and outliers. Consider a toy example (3) with two measurements; and are found by solving
(8) 
If is an outlier, it affects both and in the fit. Robust statistics are powerless for SSOE models: ignoring necessarily leaves a large , which affects . The propagation is recursive:
Each appears in all , , so an outlier at any point has to affect and , which control the entire time series. This phenomenon does not occur in other time series formulations, including AR models, where robust methods have been developed, see e.g. [10].
Earlier work in robust HoltWinters (RHW) uses Mestimators to filter the observations [5, 6]
, and then applies standard HW. In contrast, ESCells formulates a single problem to analyze the entire time series, simultaneously denoising, decomposing, and imputing.
The classic approach (8) is nonconvex, and has weak guarantees: stationarity conditions (such as ) do not imply global optimality, and solutions found by iterative methods depend on the initial point. As the level of noise and outliers increases, the ability of blackbox optimizers to get reasonable and breaks down. The ESCells approach uses a strongly convex formulation; it has a unique global minimum and no other stationary points.
We created a synthetic time series with trend and level shifts, as well as heteroscedastic noise and outliers in the observations. Figure
1 compares the performance of the HW model^{1}^{1}1Implemented in the standard HoltWinters R module [7] and RHW [6]^{2}^{2}2We implemented the approach. In[6, Eqs. 13,14], we take , and . Automated methods for obtaining smoothing parameters and failed in the presence of noise; so for the HW model, we use handtuned parameters , , , with the first 50 elements of the noisy . to the ESCells approach^{3}^{3}3https://github.com/UWAMO/TimeSeriesESCell. HW propagates outliers and is adversely affected by heteroscedastic noise, affecting the estimates of level, trend, and especially seasonality components (see Figure 1 (bd)). This lack of robustness gives a poor understanding of the overall time series (Figure 1 (a)) and leads to low forecasting accuracy, as corrupted errors are propagated in future times (see Figure 1 (e)). For RHW, automatic approaches to find failed, and we had to handtune parameters; the final result improves on HW but requires a long ‘burnin’ period, and still produces a somewhat noisy forecast. The ESCells approach captures and removes outliers and heteroscedastic noise, and correctly identifies the components.Time series estimation using ESCells
The ESCells approach is constructed from interconnected building blocks. The basic cell consists of local ES estimation over a fixed window, equipped with a convex regularization term (for denoising) and a robust loss function (to guard against outliers), see Fig.
2 (a). The cells are then linked together by the time series dynamics, but allowing discrepancies between and , see Fig. 2(b). These differences are treated as samples of, analyzed, and used to build forecasting confidence intervals. Fitting the entire ESCells model is a convex problem, and can be done at scale using efficient methods for dynamic optimization
[1, 2].1.1 ES cell model
First we formulate inference for a single ES cell. Given a time point and an integer , we take a window of size that includes all the points in the interval . Some measurements can be missing; and no time point outside has measurements. To model these cases, we introduce indicator variables
We also define a unimodal sequence of weights , with
The estimate depends only on observations in the times , and is obtained by propagating the estimate at the start of the window at time to the middle of the window at time , where :
(9) 
The estimate solves the optimization problem
(10) 
where , and extracts the difference of two seasonality components from the state . The objective function (10) extends the classic ES approach (8) in three respects.

The terms keep track of missing observations.

The loss used to compare to is robust to outliers.

The term adds total variation regularization for the seasonality components.
The objective function (10) is convex, as long as the loss and regularizer are both chosen to be convex.
1.2 Linking the ESCells
Each cell estimate only depends on local data. To connect and , we assume that the estimates satisfy
where, in contrast to the error term , are i.i.d. Gaussian errors. This is equivalent to adding the penalty
see (9).
This links together objectives of form (11) to generate a single objective
over the entire sequence
:
(12)  
The problem in (12) is nonsmooth but convex, and has dynamic structure. It has far more variables than the classic nonconvex ES formulation in (8). Nonetheless, it can be efficiently solved at scale using recent algorithms for generalized Kalman smoothing [1, 2]. Given , the final time series estimate is given by
(a) Level: Forecast + 99% CI.  (b) Trend: Forecast + 99% CI. 
(c) Seas.: Forecast + 99% CI.  (d) TS: Forecast + 99% CI . 
Time series forecasting using ESCells
ESCells capture two main sources of uncertainty that are important for forecasting future values of a time series: uncertainty in the residuals , and in the smoothing parameters .
ESCells track these two sources of uncertainty and can be used to create two separate confidence intervals: one representing the variability of each component of the signal, and the other capturing the structure of the residual.
Solving the full problem (12), we obtain the entire sequence , as well as corresponding estimates of residuals and smoothing parameters :
In order to obtain the prediction distribution, we simulate sample paths from the models, using the empirical distribution of and , and conditioned on the final state. This allows us to estimate any desired characteristics of the prediction distribution at a specific forecast horizon, and in particular to estimate confidence intervals that incorporate smoothing parameter and residual uncertainties. We can also incorporate modelbased residuals (instead of using the empirical distribution) by generating forecasted values from any given distribution.
To illustrate the ES forecasting framework, Figure 3 presents forecasts for the noisy synthetic model introduced in Figure 1. In particular, 100 step ahead forecasts and their 99% confidence intervals (using 10000 Monte Carlo runs) are shown for trend and seasonality (panels (a) and (b)). These are obtained by using the empirical distribution. The forecast for the full time series (and a zoomed plot) are shown in panels (c) and (d). The inner 99% CI (strictly inside the shaded region) takes into account only uncertainty in smoothing parameters , while the outer CI (the border for the shaded region) takes into account uncertainty in . Since the time series is contaminated by outliers, the outer CI is very wide in this case.
Real world Time Series : Twitter’s user engagement dataset
To test our algorithm we examine an anonymized time series representing user engagement on Twitter. This dataset is publicly available on its official blog^{4}^{4}4https://blog.twitter.com/2015/introducingpracticalandrobustanomalydetectioninatimeseries and is fully representative of the challenges tackled in this paper:

Distinct seasonal patterns due to user behavior across different geographies

An underlying trend which could be interpreted as organic growth (new people engaging with the social network)

Anomalies or outliers due to either special events surrounding holidays (christmas, breaking news) or unwanted behavior (bots or spammers)

Underlying heteroscedastic noise. embodying the variance of the signal.
(a) Classic HoltWinters Analysis and Forecast  (b) ESCells Analysis and Forecast. 
The dataset was originally put online to showcase a robust anomaly detection procedure. With the ESCells framework, we can go much further, decomposing the time series into interpretable components, and then forecasting both the components and the entire time series under uncertainty. The original aim (anomaly detection) is easily accomplished by studying the tail of the empirical residual distribution, as discovered by the approach.
The classic HoltWinters model fits outliers, forecasting sharp growth of engagement, which misses the observed trend (Figure 4(a)), and finds a very wide 99% confidence interval, In contrast, the ESCells approach (Figure 4(b)) avoids fitting the outliers; the average forecast correctly captures the decrease in the trend, and provides a much tighter asymmetric 99% CI. The improvement can be quantitatively assessed by looking at the Mean Absolute Percentage Error (MAPE) for the forecasted time series over a sliding window of 10 observations, Figure 5. Traditional HoltWinters has a higher MAPE than ESCell at every time point; moreover, the MAPE of the ESCells method is stable over time, while the MAPE of ES increases, illustrating its failure to robustly capture the long term trend of the time series. Trend, level and seasonality are shown in Figure 7. There is a clear decrease in level and trend, which are detected despite the large amounts of noise in the data.
1.3 Anomaly Detection
After fitting the ES procedure, we are left with a residual that we can analyze to understand anomalies in the time series. Figure 6
shows an example of outlier detection by looking at the 1.5% tails of the residual distribution.
(a) Trend, Forecast, & 99% CI. 
(b) Level, Forecast, & 99% CI. 
(c) Seas., Forecast, & 99% CI 
1.4 Robust auto completion of missing data
The ESCell algorithm is also robust to missing observations. Whether the data is missing at random, or in significant contiguous batches, it is automatically filled in by the ESCells algorithm. Since the problem is solved globally, nearby outliers do not affect the interpolated values, in contrast to local interpolation methods. Figure
9 shows the result obtained by removing two contiguous chunks of 100 observations each in two distinct parts of the time series. The data is automatically ‘infilled’ using the ESCell procedure.(a) Classic approach  (b) Global approach 

Discussion
ESCells is a new model for time series inference, capable of fitting and forecasting data in situations with high noise, frequent outliers, and large contiguous portions of missing data. These features are present in many realworld largescale datasets. The ESCells formulation differs from previous model in its global approach, as shown in Figure 8. We simultaneously denoise, impute, and decompose the time series by solving a single convex optimization problem with dynamic structure; then use samplingbased strategies for forecasting and uncertainty quantification. The results are illustrated on simulated and real data, where the proposed method yields a 5fold improvement in MAPE for the forecasting. The simplicity and versatility of the ESCells formulation makes it a superior alternative to HoltWinters and related time series models. Code for the approach and experiments is publicly available^{5}^{5}5https://github.com/UWAMO/TimeSeriesESCell.
Acknowledgements
The work of A. Aravkin was supported by the Washington Research Foundation Data Science Professorship.
References
 [1] A. Aravkin, J. V. Burke, L. Ljung, A. Lozano, and G. Pillonetto. Generalized kalman smoothing: Modeling and algorithms. arXiv preprint arXiv:1609.06369, Survey to appear in Automatica, 2017.
 [2] A. Y. Aravkin, J. V. Burke, and G. Pillonetto. Sparse/robust estimation and kalman smoothing with nonsmooth logconcave densities: Modeling, computation, and theory. Journal of Machine Learning Research, 14:2689–2728, 2013.
 [3] R. G. Brown and R. F. Meyer. The fundamental theorem of exponential smoothing. Operations Research, 9(5):673–685, 1961.
 [4] F. Chen, J. K. Ryan, and D. SimchiLevi. The impact of exponential smoothing forecasts on the bullwhip effect. Naval Research Logistics, 47(4):269–286, 2000.
 [5] T. Cipra. Robust exponential smoothing. Journal of Forecasting, 11(1):57–69, 1992.
 [6] S. Gelper, R. Fried, and C. Croux. Robust forecasting with exponential and holt–winters smoothing. Journal of forecasting, 29(3):285–300, 2010.
 [7] C. C. Holt. Forecasting seasonals and trends by exponentially weighted moving averages. International journal of forecasting, 20(1):5–10, 2004.
 [8] R. Hyndman, A. B. Koehler, J. K. Ord, and R. D. Snyder. Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media, 2008.
 [9] R. J. Hyndman, A. B. Koehler, R. D. Snyder, and S. Grose. A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting, 18(3):439–454, 2002.
 [10] R. Maronna, R. D. Martin, and V. Yohai. Robust statistics. John Wiley & Sons, Chichester. ISBN, 2006.
 [11] I. Moghram and S. Rahman. Analysis and evaluation of five shortterm load forecasting techniques. Power Systems, IEEE Transactions on, 4(4):1484–1491, 1989.
 [12] R. S. Pindyck. Risk, inflation, and the stock market. Technical report, National Bureau of Economic Research, 1983.
 [13] S. S. Soman, H. Zareipour, O. Malik, and P. Mandal. A review of wind power and wind speed forecasting methods with different time horizons. In North American Power Symposium (NAPS), 2010, pages 1–8. IEEE, 2010.
 [14] J. W. Taylor. Shortterm electricity demand forecasting using double seasonal exponential smoothing. Journal of the Operational Research Society, 54(8):799–805, 2003.
 [15] J. W. Taylor. Volatility forecasting with smooth transition exponential smoothing. International Journal of Forecasting, 20(2):273–286, 2004.
 [16] J. W. Taylor and R. Buizza. Neural network load forecasting with weather ensemble predictions. Power Systems, IEEE Transactions on, 17(3):626–632, 2002.
Comments
There are no comments yet.