Model Monitoring and Dynamic Model Selection in Travel Time-series Forecasting

by   Rosa Candela, et al.

Accurate travel products price forecasting is a highly desired feature that allows customers to take informed decisions about purchases, and companies to build and offer attractive tour packages. Thanks to machine learning (ML), it is now relatively cheap to develop highly accurate statistical models for price time-series forecasting. However, once models are deployed in production, it is their monitoring, maintenance and improvement which carry most of the costs and difficulties over time. We introduce a data-driven framework to continuously monitor and maintain deployed time-series forecasting models' performance, to guarantee stable performance of travel products price forecasting models. Under a supervised learning approach, we predict the errors of time-series forecasting models over time, and use this predicted performance measure to achieve both model monitoring and maintenance. We validate the proposed method on a dataset of 18K time-series from flight and hotel prices collected over two years and on two public benchmarks.




1 Introduction

Travel industry actors, such as airlines and hotels, nowadays use sophisticated pricing models to maximize their revenue, which results in highly volatile fares [Chen2015]. For customers, fluctuating prices are a source of worry due to the uncertainty of future price evolution. This situation has opened opportunities for new businesses, such as travel meta-search engines or online travel agencies, that provide decision-making tools to customers [Wohlfarth2011]. In this context, accurate price forecasting over time is a highly desired feature. Among many other benefits, it allows customers to make informed decisions about purchases, and companies to build and offer attractive tour packages while maximizing their revenue margin.

The exponential growth of computer power along with the availability of large datasets has led to a rapid progress in the machine learning (ML) field over the last decades. This has allowed the travel industry to benefit from the powerful ML machinery to develop and deploy accurate models for price time-series forecasting. Development and deployment, however, only represent the first steps of a ML system’s life cycle. Currently, it is the monitoring, maintenance and improvement of complex production-deployed ML systems which carry most of the costs and difficulties in time [Sculley2015, r2019overton]. Model monitoring refers to the task of constantly tracking a model’s performance to determine when it degrades, becoming obsolete. Once a degradation in performance is detected, model maintenance and improvement take place to update the deployed model by rebuilding it, recalibrating it or, in a more abstract way, by doing model selection.

Currently, this is a critical problem for our travel applications. While it is relatively easy and fast to develop ML-based methods for accurate price forecasting of different travel products, maintaining good performance over time faces multiple challenges. Firstly, price forecasting of travel products involves the analysis of multiple time-series which are modeled independently, i.e. a model per series rather than a single model for all. According to the 2019 World Air Transport Statistics report, almost 22K city pairs are directly connected by airlines through regular services [international2019world]. As each city pair is linked to a time-series, it is impossible to manually monitor the performance of every associated forecasting model. For scalability purposes, it is necessary to develop methods that can continuously and automatically monitor and maintain every deployed model. Secondly, time-series comprise time-evolving complex patterns, non-stationarities or, more generally, distribution changes over time, making forecasting models more prone to deteriorate over time [aiolfi_persistence_2006]. Poor estimations of a model’s degrading performance can lead to business losses, if detected too late, or to unnecessary model updates incurring system maintenance costs [Sculley2015], if detected too early. Efficient and timely approaches to model monitoring are therefore key to continuously accurate in-production forecasts. Finally, a model’s degrading performance also implies that the model becomes obsolete. As a result, a specific model might not always be the right choice for a given series. Since time-series forecasting can be addressed through a large set of different approaches, the task of choosing the most suitable forecasting approach requires finding systematic ways to carry out model selection efficiently. One of the most common ways to achieve all of this is cross-validation [arlot2010survey]. However, this approach is only valid at development time and cannot be used to monitor and maintain models in-production due to the absence of ground truth data.

In this work we introduce a data-driven framework to continuously monitor and maintain time-series forecasting models’ performance in-production, i.e. in the absence of ground truth, to guarantee continuously accurate performance of travel products price forecasting models. Under a supervised learning approach, we predict the forecasting error of time-series forecasting models over time. We hypothesize that the estimated forecasting error represents a surrogate measure of the model’s future performance. As such, we achieve continuous monitoring by using the predicted forecasting error as a measure to detect degrading performance. Simultaneously, the predicted forecasting error enables model maintenance by allowing us to rank multiple models based on their predicted performance, i.e. model comparison, and then select the one with the lowest predicted error measure, i.e. model selection. We refer to it as a model monitoring and model selection framework.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 reviews the fundamentals of time-series forecasting and performance assessment. Section 4 describes the proposed model monitoring and maintenance framework. Section 5 describes our datasets and presents the experimental setup. Experiments and results are discussed in Section 6. Finally, in Section 7 we summarize our work and discuss key findings.

2 Related Work

Maintainable industrial ML systems. Recent works from tech companies [Baylor2017, lin2012large, r2019overton] have discussed their strategies to deal with some of the so-called technical debts [Sculley2015] that ML systems can incur when in production. These works mainly focus on the hardware and software infrastructure used to mitigate these debts. Less emphasis is given to the specific methods put in place.
Concept drift. The phenomenon of time-evolving data patterns is known as concept drift. As time-series are not strictly stationary, it is a common problem of time-series forecasting usually addressed through regular model updates. Most works on concept drift for time-series forecasting have focused on its detection, which is equivalent to what we denote model monitoring, but do not perform model selection as they are typically limited to a single model [Ferreira:2014:DCT:2542820.2562373, 10.1007/978-3-642-34166-3_40].
Performance assessment without ground truth. An alternative to cross-validation is represented by information criteria. The rationale consists in quantifying the best trade-off between models’ goodness of fit and simplicity. Information criteria are mostly used to compare nested models, whereas the comparison of different models requires computing likelihoods on the same data. Being fully data-driven, our framework avoids any constraint regarding the candidate models, leading to a more general way to perform model selection. Specifically for time-series forecasting, Wagenmakers et al. [wagenmakers2006accumulative] achieve performance assessment in the absence of ground truth using a concept similar to ours. They estimate the forecasting error of a new single data point by adding previously estimated forecast errors, obtained from already observed data points. The use of the previous errors makes it sensitive to unexpected outlier behaviors of the time-series.

Meta-learning. Meta-learning has been proposed as a way to automatically perform model selection. Its performance has been recently demonstrated in the context of time-series forecasting. Both [ALI20189, RePEc:msh:ebswps:2018-6] formulate the problem as a supervised learning one, where the meta-learner receives a time-series and outputs the “best” forecasting model. The authors of [cerqueira2017arbitrated] share our idea that forecasting performance decays in time, thus they train a meta-learner to model the error incurred by the base models at each prediction step as a function of the time-series features. Differently from [RePEc:msh:ebswps:2018-6], our approach does not seek to select a different model family for each time-series, and it avoids model selection at each time step [cerqueira2017arbitrated], since these two represent expensive overheads for in-production maintenance. Instead, we maintain a fast forecasting procedure and select the best model for a given time period in the future, whose length can be relatively long (6-9 months, for instance).

3 Time-series forecasting and performance measures

A univariate time-series is a series of data points $\{y_1, \dots, y_T\}$, each one being an observation of a process measured at a specific time $t$. Univariate time-series contain a single variable at each time instant, while multivariate time-series record more than one variable at a time. Our application is concerned with univariate time-series, which are recorded at discrete points in time, e.g., monthly, daily, hourly. However, extension to the multivariate setting is straightforward.

Time-series forecasting is the task of using these past observations (or a subset thereof) to predict future values $y_{T+1}, \dots, y_{T+h}$, with $h$ indicating the forecasting horizon. The number of well-established methods to perform time-series forecasting is quite large. Methods range from classical statistical methods, such as Autoregressive Moving Average (ARMA) and Exponential smoothing, to more recent machine learning models which have shown outstanding performance in different tasks, including time-series forecasting.

The performance assessment of forecasting methods is commonly done using error measures. Despite decades of research on the topic, there is still no unanimous consensus on the best error measure to use among the multiple available options [HYNDMAN2006679]. Among the most used ones, we find the Symmetric Mean Absolute Percentage Error (sMAPE) and the Mean Absolute Scaled Error (MASE). These two have been adopted in recent time-series forecasting competitions [article].

4 Monitoring and model selection framework

Figure 1: Illustration of the proposed method. $X$ and $X^*$ contain multiple time-series, each of these composed of observations (green) and forecasts (red) estimated by a monitored model, $g$. $e$ represents the forecasting performance of the monitored model. It is computed using the true values (yellow). A monitoring model is trained to learn the function $f$ mapping $X$ to $e$. With the learned $f$, the monitoring model is able to predict $e^*$, the predicted forecasting performance of the monitored model given $X^*$.

Let us denote by $X = \{x^{(i)}\}$ the input training set. A given input $x^{(i)}$ is formed by the observed time-series and forecasted values, $x^{(i)} = \{y^{(i)}, \hat{y}^{(i)}\}$. The values in the set $\hat{y}^{(i)}$ are obtained by a given forecasting model which we hereby denote a monitored model, $g$. Let $e = \{e^{(i)}\}$ be a collection of performance measures assessing the accuracy of the forecasts estimated by $g$. A given performance measure $e^{(i)}$ is obtained by comparing the forecasts from $g$ to the true values.

Let us define a monitoring model as a model that is trained to learn a function $f$ mapping the input time-series $X$ to the target $e$. Given a new set of time-series $X^*$, formed by a time-series of observations, $y^*$, and forecasts $\hat{y}^*$ obtained by $g$, the learned monitoring model predicts $e^*$, i.e. the predicted performance measure of $g$ given $X^*$ (Figure 1).

The predicted performance measures $e^*$ represent a surrogate measure of the performance of a given $g$ within the forecasting horizon $h$. As such, they are used for two tasks: model monitoring and selection. Model monitoring is achieved by using $e^*$ as an alert signal. If the estimated performance measure of the monitored model is poor, this means the model has become stale. To achieve model selection, the values $e^*$ are used to rank multiple monitored models and choose the one with the best performance. If the two tasks are executed in a continuous fashion over time, it is possible to guarantee accurate forecasts in an automated way.
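As an illustration of the two tasks, the following minimal sketch trains a stand-in monitoring model on pairs of inputs and observed errors, then uses the predicted error both as an alert signal and as a ranking criterion. It is not the implementation used in this work: the nearest-neighbour regressor, the `summarise` features, the alert threshold of 0.3 and all numbers are hypothetical choices made only to keep the example self-contained.

```python
from statistics import mean


def summarise(observations, forecasts):
    """Hypothetical fixed-size features of one input (observations plus forecasts)."""
    return (
        mean(observations),
        max(observations) - min(observations),
        abs(mean(forecasts) - observations[-1]),
    )


class NearestNeighbourMonitor:
    """Stand-in monitoring model: returns the error of the closest training input."""

    def fit(self, inputs, errors):
        self.points = [summarise(obs, fc) for obs, fc in inputs]
        self.errors = list(errors)
        return self

    def predict(self, observations, forecasts):
        query = summarise(observations, forecasts)
        distances = [
            sum((q - p) ** 2 for q, p in zip(query, point)) for point in self.points
        ]
        return self.errors[distances.index(min(distances))]


def is_stale(predicted_error, threshold=0.3):
    """Model monitoring: raise an alert when the predicted error is too high.

    The threshold is an assumption; the text does not prescribe a value.
    """
    return predicted_error > threshold


def select_model(predicted_errors):
    """Model selection: pick the monitored model with the lowest predicted error."""
    return min(predicted_errors, key=predicted_errors.get)


# Toy training set: (observations, forecasts) pairs with their observed errors.
train_inputs = [([10, 11, 12], [12, 12]), ([10, 30, 5], [5, 5])]
train_errors = [0.05, 0.60]
monitor = NearestNeighbourMonitor().fit(train_inputs, train_errors)

predicted = monitor.predict([20, 21, 22], [22, 22])
print("alert:", is_stale(predicted))
print("selected:", select_model({"ses": 0.21, "holt": 0.33, "damp": 0.19}))
```

Any regressor could replace the nearest-neighbour stand-in, which is the sense in which the framework is generic.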

In the following, we describe the performance measure e that we use in our framework, as well as the monitoring and monitored models that we chose to validate our hypotheses.

4.1 Performance measure

As previously discussed, the accuracy of time-series forecasts is measured using error metrics. In this framework, we use the sMAPE. It is defined as:

$$\mathrm{sMAPE} = \frac{2}{h} \sum_{t=1}^{h} \frac{|y_t - \hat{y}_t|}{|y_t| + |\hat{y}_t|} \tag{1}$$

where $h$ is the number of forecasts (i.e. forecasting horizon), $y_t$ is the true value and $\hat{y}_t$ is the forecast.

In the literature, there are multiple definitions of the sMAPE. We choose the one introduced in [chen2004assessing] because it is bounded between 0 and 2; specifically, it has a maximum value of 2 when either $y_t$ or $\hat{y}_t$ is zero, and it is zero when the two values are identical. The sMAPE has two important drawbacks: it is undefined when both $y_t$ and $\hat{y}_t$ are zero, and it can be numerically unstable when the denominator in Eq. 1 is close to zero. In the context of our application, this is not a problem since it is unlikely to have prices with value zero or very close to it.
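The bounded sMAPE variant described in this subsection can be sketched in a few lines. The function name and the choice to raise an error in the undefined case are assumptions of this sketch, not part of the original method.

```python
def smape(y_true, y_pred):
    """sMAPE bounded in [0, 2]: (2 / h) * sum(|y - f| / (|y| + |f|))."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, f in zip(y_true, y_pred):
        denominator = abs(y) + abs(f)
        if denominator == 0:
            # Undefined case: both the true value and the forecast are zero.
            raise ValueError("sMAPE is undefined when both values are zero")
        total += abs(y - f) / denominator
    return 2.0 * total / len(y_true)


print(smape([100.0, 200.0], [100.0, 200.0]))  # identical values give 0.0
print(smape([100.0], [0.0]))                  # a zero forecast gives the maximum, 2.0
```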

4.2 Monitoring models

The formulation of our framework is generic in the sense that any supervised technique that can solve regression problems can be used as a monitoring model. In this work, we decided to focus on the latest advances in deep learning. We consider four monitoring models: Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNNs), Bayesian CNNs and Gaussian process (GP) regression. The latter two models differ from the former ones in that they also provide uncertainties around the predictions. This can enrich the output provided by the monitoring framework, in that whenever an alert is issued because of poor performance, it comes with information about its reliability. This section illustrates the basic ideas of each of the selected monitoring models.

Long Short-Term Memory networks. LSTM [hochreiter1997long] networks are a type of Recurrent Neural Network (RNN) that addresses the vanishing gradient problem present in the original RNN formulation. They achieve this by introducing a cell state into each hidden unit, which memorizes information. Like other RNNs, they are a well-established architecture to model sequential data. By construction, LSTMs can handle sequences of varying length, with no need for extra processing like padding. This is useful in our application, whereby time-series in the datasets have different lengths.

Convolutional Neural Networks. CNNs [lecun1998gradient] are a particular class of deep neural networks where the weights (filters) are designed to promote local information to propagate from the input to the output at increasing levels of granularity. We use the original LeNet [LeCun:1999:ORG:646469.691875] architecture, as it obtains generally good results in image recognition problems, while being considerably faster to train than more modern architectures. CNNs were not originally conceived to work with time-series data. We adapt the architecture to work with time-series by using 1D convolutional filters. Unlike RNNs, this model does not support inputs of variable size, so we resort to padding: where necessary, we append zeros to the time-series to make them uniform in length. We denote this model LeNet.
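The zero-padding step mentioned above can be sketched as follows. The function name and the choice of the longest series as the target length are assumptions of this sketch.

```python
def pad_to_uniform_length(series_list, pad_value=0.0):
    """Append pad_value to each series until all match the longest one."""
    target = max(len(series) for series in series_list)
    return [
        list(series) + [pad_value] * (target - len(series))
        for series in series_list
    ]


padded = pad_to_uniform_length([[1.0, 2.0, 3.0], [4.0]])
print(padded)
```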

Bayesian Convolutional Neural Networks. Bayesian CNNs [gal2016dropout] represent the probabilistic version of CNNs, used in applications where quantification of the uncertainty in predictions is needed. Here network parameters are assigned a prior distribution and then inferred using Bayesian inference techniques. Due to the intractability of the problem, the use of approximations is required. Here we choose Monte Carlo Dropout [gal2016dropout] as a practical way to carry out approximate inference in Bayesian CNNs. By applying dropout at test time we are able to sample from an approximate posterior distribution over the network weights. We use this technique on the LeNet CNN with 1D filters to produce probabilistic outputs. We denote this model Bayes-LeNet.
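To make the idea of test-time dropout concrete, the sketch below keeps dropout active during prediction on a tiny fully connected network and summarises the resulting samples by their mean and standard deviation. The network, its weights (`W_HIDDEN`, `W_OUT`) and the inverted-dropout scaling are hypothetical stand-ins; the actual Bayes-LeNet model is a 1D convolutional network.

```python
import random
from statistics import mean, pstdev

# Hypothetical weights of a small, already trained network (2 inputs, 4 hidden units).
W_HIDDEN = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8], [0.7, 0.1]]
W_OUT = [0.6, -0.4, 0.9, 0.2]


def forward_with_dropout(x, rate, rng):
    """One stochastic forward pass: each hidden unit is dropped with probability rate."""
    output = 0.0
    for w_in, w_out in zip(W_HIDDEN, W_OUT):
        hidden = max(0.0, sum(w * xi for w, xi in zip(w_in, x)))  # ReLU activation
        if rng.random() >= rate:
            # Kept units are rescaled so the expected output matches the full network.
            output += w_out * hidden / (1.0 - rate)
    return output


def mc_dropout_predict(x, rate=0.5, n_samples=100, seed=0):
    """Return the mean and standard deviation of n_samples stochastic predictions."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, rate, rng) for _ in range(n_samples)]
    return mean(samples), pstdev(samples)


prediction, uncertainty = mc_dropout_predict([1.0, 2.0])
print(prediction, uncertainty)
```

The standard deviation of the samples is what lets an alert carry information about its own reliability.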

Gaussian processes. GPs [Rasmussen:2005:GPM:1162254] form the basis of probabilistic nonparametric models. Given a supervised learning problem, GPs consider an infinite set of functions mapping input to output. These functions are defined as random variables with a joint Gaussian distribution, specified by a mean function and a covariance function, the latter encoding properties of the functions with respect to the input. One of the strengths of GP models is the ability to characterize uncertainty regardless of the size of the data. Similarly to CNNs, in this model input sequences must have the same length, so we resort to padding.

4.3 Monitored models

Similar to monitoring models, given the generic nature of the proposed framework, there is no constraint on the type of monitored models that can be used. Any time-series forecasting method is a candidate monitored model. For this proof of concept, we consider six different monitored models. We select five of them from the ten benchmarks provided in the M4 competition [article], a recent time-series forecasting challenge. These are: Simple Exponential Smoothing (ses), Holt’s Exponential Smoothing (holt), Damped Exponential Smoothing (damp), Theta (theta) and a combination of ses - holt - damp (comb). Besides these five methods, we included a simple Random Forest (rf), in order to enrich the benchmark with a machine learning-based model. We refer the reader to [breiman2001random, article] for further details on each of these approaches.
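For readers unfamiliar with the exponential smoothing family, the sketch below implements textbook versions of ses, holt and damp, and the comb benchmark as the arithmetic mean of the three forecasts. It is only an illustration: the smoothing parameters are fixed hypothetical values, whereas benchmark implementations typically optimise them, and the simple initialisation used here is an assumption. Theta and the Random Forest are omitted.

```python
def ses_forecast(series, horizon, alpha=0.3):
    """Simple exponential smoothing: a flat forecast at the last smoothed level."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


def damped_holt_forecast(series, horizon, alpha=0.3, beta=0.1, phi=1.0):
    """Holt's linear trend method; phi < 1 damps the trend, phi = 1 is plain Holt."""
    if len(series) < 2:
        raise ValueError("at least two observations are required")
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + phi * trend)
        trend = beta * (level - previous_level) + (1 - beta) * phi * trend
    forecasts, damping_sum = [], 0.0
    for step in range(1, horizon + 1):
        damping_sum += phi ** step
        forecasts.append(level + damping_sum * trend)
    return forecasts


def comb_forecast(series, horizon):
    """Average of the ses, holt and damp forecasts."""
    ses = ses_forecast(series, horizon)
    holt = damped_holt_forecast(series, horizon, phi=1.0)
    damp = damped_holt_forecast(series, horizon, phi=0.9)
    return [(a + b + c) / 3 for a, b, c in zip(ses, holt, damp)]


prices = [100.0, 102.0, 101.0, 105.0, 107.0]
print(comb_forecast(prices, horizon=3))
```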

5 Experimental setup

This section presents the data, provides details about the implementation of our methods to ease reproducibility, and concludes by describing the evaluation protocol followed in the experiments.

5.1 Data

Flights and hotels datasets. We focus on two travel products: direct flights between city pairs and hotels. Our data is an extract of prices for these two travel products obtained from the Amadeus for Developers API, an online web-service which enables access to travel-related data through several APIs. It was collected over a two-year and one-month period. Table 1 presents some descriptive features of the datasets.

Using the service’s Flight Low-fare Search API, we collected daily data for one-way flight prices of the top 15K most popular city pairs worldwide. The collection was done in two stages. A first batch, corresponding to the top 1.4K pairs (flights), was gathered for the whole collection period. The second batch, corresponding to the remaining pairs (flights-ext), was collected only over the second year. For hotels, we used the Hotel API to collect daily hotel prices for a two-night stay at every destination city contained in the top city pairs used for flight search. These represent 3.2K different time-series.

Both APIs provide information about the available offers for flights/hotels that meet the search criteria (origin-destination and date, for flights; city, date and number of nights, fixed to 2, for hotels) at the time of search. As such, it is possible to have multiple offers (flights or hotel rooms) for a given set of search criteria. When multiple offers were proposed, we averaged the different prices to have a daily average flight price for a given city pair, in the case of flights, or daily average hotel price for a given city, in the case of hotels. In the same way, it is possible to have no offers for a given set of search criteria. Days with no available offers were reported as missing data. Lack of offers can be caused by sell-outs, specific flight schedules (e.g. no daily flights for a city pair) or seasonal patterns (e.g. flights for a part of the year or seasonal hotel closures). More rarely, it could even be due to a failure in the query sent to the API. As a result, the number of available observations is smaller than the length of the collection period (see Table 1).
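The aggregation described above, averaging multiple offers per day and marking days without offers as missing, can be sketched as follows. The `(search_date, price)` input layout and the function name are assumptions of this sketch and do not reflect the actual API response format.

```python
from datetime import date, timedelta
from statistics import mean


def daily_average_prices(offers, start, end):
    """Build a daily series of average prices; days without offers are None."""
    prices_by_day = {}
    for day, price in offers:
        prices_by_day.setdefault(day, []).append(price)
    series = []
    day = start
    while day <= end:
        prices = prices_by_day.get(day)
        series.append((day, mean(prices) if prices else None))
        day += timedelta(days=1)
    return series


# Hypothetical offers: two on the first day, none on the second, one on the third.
offers = [
    (date(2019, 1, 1), 100.0),
    (date(2019, 1, 1), 120.0),
    (date(2019, 1, 3), 90.0),
]
for day, price in daily_average_prices(offers, date(2019, 1, 1), date(2019, 1, 3)):
    print(day, price)
```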

Public benchmarks. In addition to travel products data, we decided to include data coming from publicly available benchmarks. Benchmark data are typically curated and avoid problems present in real data, such as those previously discussed regarding missing data, allowing for an objective assessment and a more controlled setup for experimentation. We included two sets from the M4 time-series forecasting competition [article] dataset, yearly and weekly. Table 1 presents statistics on the number of time-series and the number of available observations per time-series for these two datasets. Here, the number of available observations is equivalent to the time-series length as no time-series contains missing values.

| Name | # time-series | min-obs | max-obs | mean-obs | std-obs |
|---|---|---|---|---|---|
| flights | 1,415 | 431 | 745 | 734 | 23 |
| flights-ext | 13,810 | 50 | 347 | 346 | 13 |
| hotels | 3,207 | 1 | 658 | 368 | 128 |
| yearly | 23,000 | 13 | 835 | 31 | 25 |
| weekly | 359 | 80 | 2,597 | 1,022 | 706 |

Table 1: Information about the number of time-series, and the minimum (min-obs), maximum (max-obs), mean (mean-obs) and standard deviation (std-obs) of the number of available time-series observations per dataset.

5.2 Implementation

The LSTM network was implemented in TensorFlow. It is composed of one hidden layer with 32 hidden nodes. It is a dynamic LSTM, in that it allows the input sequences to have variable lengths, by dynamically creating the graph during execution. The two CNN-based monitoring models use the LeNet architecture. We replaced the filters of both convolutional and pooling layers with 1D filters, given that the input of the model consists of one-dimensional sequences. We added dropout layers to limit overfitting. In the Bayesian CNN, we applied a dropout rate of 0.5, also at testing time, to obtain 100 Monte Carlo samples as an approximation of the true posterior distribution. The GP model used the implementation of Sparse GP Regression from GPy [gpy2014]. The inducing points [titsias2009variational] were initialized with $k$-means and were then fixed during optimization. We used a variable number of inducing points depending on the size of the input and an RBF kernel with Automatic Relevance Determination (ARD). In all experiments we used 75% of the data for training and 25% for testing, and the Adam optimizer with default learning rate [DBLP:journals/corr/KingmaB14]. Only for the flights-ext dataset did we use mini-batches to speed up the training. For the monitored models, we used the implementation available from the M4 competition benchmark GitHub repository and the Python sklearn package [scikit-learn] implementation of Random Forest.

5.3 Evaluation protocol

For flight and hotel data we set $h \in \{90, 180\}$, which means we are predicting the price for 90 and 180 days ahead. These are two commonly used values in travel, representing 3 and 6 months ahead of the planned trip, so it is important to have accurate predictions over those horizons. For the M4 competition datasets, we use the horizon given by the challenge organizers: $h = 6$ for yearly and $h = 13$ for weekly. For each dataset, we reserve the first $T_i$ data points of the $i$-th time-series, where $T_i$ depends on the time-series’s length, as input of the monitored models to obtain $h$ forecasts. Where missing values were found, in flights or hotels, these were replaced with the nearest non-missing value in the past. We build the training set $X$ and the test set $X^*$ by taking 75% and 25% of the total number of time-series, respectively. We then compute the forecasting errors using the sMAPE in Eq. 1 for the training set $X$. Finally, we predict the performance measure for the time-series in $X^*$, using the four monitoring models.
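The data preparation steps of this protocol can be sketched as below. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the protocol: the last $h$ points of each series are taken as the true values to forecast, and the 75%/25% split is made without shuffling.

```python
def forward_fill(series):
    """Replace each missing value (None) with the nearest past non-missing value."""
    filled, last_seen = [], None
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)  # leading gaps stay None: no past value exists
    return filled


def split_input_and_target(series, horizon):
    """Split a series into the monitored model's input and the values to forecast."""
    if len(series) <= horizon:
        raise ValueError("series must be longer than the forecasting horizon")
    return series[:-horizon], series[-horizon:]


def train_test_split(items, train_fraction=0.75):
    """Split a collection of time-series into training and test sets."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


history, future = split_input_and_target(forward_fill([10, None, 12, 13, None]), 2)
print(history, future)
```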

We compare our model monitoring and selection framework with the standard cross-validation method, which we here denote cv-baseline, where a model’s estimated performance is obtained “offline” at training time with the available data. Specifically, given $T$ observations, we use the last $h$ observations as validation set to evaluate the model. This reduces the number of observations available to train the forecasting models, which can be problematic when either $T$ is small or $h$ is large.
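A minimal sketch of this hold-out baseline is given below. The naive forecaster is a hypothetical stand-in for a monitored model, and the compact `smape` helper assumes no pair of values is simultaneously zero.

```python
def smape(y_true, y_pred):
    """Bounded sMAPE; assumes no pair of values is simultaneously zero."""
    return 2.0 / len(y_true) * sum(
        abs(y - f) / (abs(y) + abs(f)) for y, f in zip(y_true, y_pred)
    )


def naive_forecast(history, horizon):
    """Stand-in monitored model: repeats the last observed value."""
    return [history[-1]] * horizon


def cv_baseline_error(observations, horizon, forecaster=naive_forecast):
    """Hold out the last `horizon` observations and score forecasts made without them."""
    if len(observations) <= horizon:
        raise ValueError("not enough observations to hold out a validation set")
    train, validation = observations[:-horizon], observations[-horizon:]
    return smape(validation, forecaster(train, horizon))


print(cv_baseline_error([100, 100, 100, 110, 90], horizon=2))
```

The length check makes the limitation explicit: the estimate cannot be computed at all when the series is not longer than the horizon.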

6 Experiments and Results

We first study the proposed framework’s ability to achieve model monitoring (Sec 6.1). Then, we demonstrate how the predicted forecasting errors can be used to carry out model selection and how the framework compares with state-of-the-art methods doing the same task (Sec 6.2). In Sec 6.3, we illustrate the performance of the joint model monitoring and selection framework in our target application.

6.1 Model monitoring performance

We evaluate whether the monitoring models’ predicted sMAPEs can be used for model monitoring by checking whether the predicted measure represents a good estimate of a monitored model’s future forecasting performance. We assess the quality of the predicted forecasting errors by computing the root mean squared error (RMSE) between the predicted sMAPEs and the true sMAPEs, for every monitored model. The true sMAPE is obtained using the monitored model’s predictions and the time-series’ observations in the test set $X^*$ through Eq. 1. Figure 2 (left) summarizes the obtained results on all datasets.
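The evaluation metric of this experiment is a standard RMSE computed over test time-series, as in the short sketch below; the values are hypothetical.

```python
import math


def rmse(predicted, true):
    """Root mean squared error between predicted and true sMAPEs."""
    if len(predicted) != len(true) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true)
    )


# Hypothetical predicted and true sMAPEs for three test time-series.
print(rmse([0.20, 0.25, 0.31], [0.22, 0.24, 0.35]))
```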

The overall average error incurred by the monitoring models is low. This suggests that the forecasting error predictions are accurate, meaning that the predicted sMAPE is reliable enough to carry out model monitoring. When compared to the baseline method, the monitoring models consistently perform better than standard cross-validation when estimating the future performance of the monitored models. There is an exception to this when the monitored model is the Random Forest (rf). In this case, the cv-baseline is not the worst performing approach. However, it is still surpassed in performance by both LSTM and GP.

Figure 2: RMSE between predicted and observed forecasting error (sMAPE) on all datasets, using a log scale (left); on flights and flights-ext, one panel per forecasting horizon (top center, top right); and on hotels, one panel per forecasting horizon (bottom center, bottom right). The reported cv-baseline RMSE is obtained by comparing the estimated sMAPE at training with the observed values at testing.
| Monitoring | Flights, $h = 90$ | Flights, $h = 180$ | Hotels, $h = 90$ | Hotels, $h = 180$ |
|---|---|---|---|---|
| LSTM | 0.116 (0.017) | 0.151 (0.031) | 0.193 (0.021) | 0.182 (0.039) |
| LeNet | 0.117 (0.017) | 0.155 (0.031) | 0.209 (0.039) | 0.224 (0.062) |
| Bayes-LeNet | 0.084 (0.017) | 0.100 (0.035) | 0.135 (0.022) | 0.148 (0.044) |
| GP | 0.136 (0.007) | 0.126 (0.028) | 0.164 (0.014) | 0.165 (0.036) |
| cv-baseline | 0.119 (0.006) | 0.604 (0.328) | 0.190 (0.020) | 0.609 (0.302) |

Table 2: RMSE between predicted and true sMAPEs for flights and hotel time-series.

Figure 2, center and right, specifically presents the results obtained for flights and hotels time-series. Table 2 stratifies the results for travel product time-series in terms of the forecasting horizon. Results show that our approach outperforms the cv-baseline for large forecasting horizons, e.g. $h = 180$, while the methods get closer as the forecasting horizon decreases. This is consistent with our hypothesis that data properties change over time. Using a validation set composed of time points close to the unseen data gives consistent information about the model’s performance, because the two sets of data (validation and unseen data) have similar properties. However, increasing $h$ has the effect of pushing the validation time points away from the unseen data. In this case, it is better to rely on the forecast error prediction rather than on an error measure obtained during training.

6.2 Model selection performance

In this experiment, we assess the capacity of the proposed method to assist model selection in the absence of ground truth. Monitored models are ranked by estimating the average predicted sMAPE over a given time-series and ordering the resulting values in ascending order. In this way, we obtain a list of monitored models from the best to the worst one. The best performing monitored model is selected.
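The ranking step can be sketched as follows. The function name and the predicted values are hypothetical.

```python
from statistics import mean


def rank_models(predicted_smapes):
    """Order monitored models by average predicted sMAPE, best (lowest) first."""
    averages = {model: mean(values) for model, values in predicted_smapes.items()}
    return sorted(averages, key=averages.get)


# Hypothetical predicted sMAPEs of three monitored models over two time-series.
ranking = rank_models({
    "ses": [0.20, 0.30],
    "damp": [0.10, 0.20],
    "rf": [0.40, 0.40],
})
print(ranking, "selected:", ranking[0])
```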

We compare the ground truth ranking with the one obtained by each of the monitoring models and the cv-baseline. We apply a Wilcoxon test [wilcoxon1945individual] to the ranking results to verify if there are significant differences between each of the ranked monitored models. Table 3 presents the results obtained on hotels and flight data. We omit LeNet’s results for lack of space, as it gives the lowest accuracy among the four tested models. We include these results, as well as those obtained for all the datasets, as supplementary material.

In the tables below, LSTM, Bayes-LeNet and GPs are the monitoring models. Each cell reports the sMAPE with the value in parentheses as given in the original table.

hotels, forecasting horizon = 180

| Rank | Ground truth | sMAPE | LSTM | sMAPE | Bayes-LeNet | sMAPE | GPs | sMAPE | cv-baseline | sMAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | damp | 0.244 (0.153) | ses | 0.219 (0.032) | ses | 0.212 (0.087) | damp | 0.230 (0.119) | ses | 0.326 (0.202) |
| 2 | ses | 0.246 (0.164) | damp | 0.220 (0.033) | damp | 0.224 (0.130) | ses | 0.231 (0.121) | rf | 0.413 (0.333) |
| 3 | theta | 0.269 (0.217) | theta | 0.233 (0.059) | comb | 0.249 (0.166) | comb | 0.251 (0.149) | damp | 0.462 (0.391) |
| 4 | comb | 0.270 (0.207) | comb | 0.234 (0.057) | theta | 0.268 (0.234) | theta | 0.252 (0.160) | comb | 0.746 (0.569) |
| 5 | rf | 0.316 (0.300) | holt | 0.280 (0.124) | rf | 0.324 (0.329) | rf | 0.291 (0.207) | theta | 0.938 (0.620) |
| 6 | holt | 0.325 (0.277) | rf | 0.292 (0.210) | holt | 0.325 (0.162) | holt | 0.299 (0.190) | holt | 1.047 (0.660) |

hotels, forecasting horizon = 90

| Rank | Ground truth | sMAPE | LSTM | sMAPE | Bayes-LeNet | sMAPE | GPs | sMAPE | cv-baseline | sMAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | damp | 0.242 (0.175) | ses | 0.203 (0.022) | ses | 0.238 (0.088) | damp | 0.221 (0.137) | comb | 0.237 (0.166) |
| 2 | ses | 0.243 (0.174) | damp | 0.218 (0.026) | damp | 0.239 (0.122) | comb | 0.238 (0.155) | ses | 0.239 (0.177) |
| 3 | comb | 0.253 (0.189) | theta | 0.223 (0.022) | comb | 0.259 (0.108) | theta | 0.240 (0.151) | damp | 0.250 (0.194) |
| 4 | theta | 0.254 (0.190) | comb | 0.224 (0.030) | theta | 0.263 (0.132) | ses | 0.244 (0.180) | theta | 0.251 (0.201) |
| 5 | holt | 0.275 (0.217) | holt | 0.244 (0.052) | holt | 0.282 (0.190) | holt | 0.265 (0.185) | holt | 0.277 (0.235) |
| 6 | rf | 0.293 (0.285) | rf | 0.254 (0.103) | rf | 0.298 (0.176) | rf | 0.266 (0.191) | rf | 0.311 (0.296) |

flights, forecasting horizon = 90

| Rank | Ground truth | sMAPE | LSTM | sMAPE | Bayes-LeNet | sMAPE | GPs | sMAPE | cv-baseline | sMAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | comb | 0.174 (0.102) | comb | 0.154 (0.081) | comb | 0.151 (0.086) | damp | 0.159 (0.073) | ses | 0.187 (0.110) |
| 2 | damp | 0.175 (0.106) | damp | 0.155 (0.076) | damp | 0.161 (0.086) | comb | 0.160 (0.082) | theta | 0.188 (0.109) |
| 3 | theta | 0.176 (0.105) | theta | 0.157 (0.042) | theta | 0.163 (0.087) | theta | 0.162 (0.086) | damp | 0.189 (0.110) |
| 4 | ses | 0.177 (0.106) | ses | 0.158 (0.028) | holt | 0.188 (0.074) | ses | 0.163 (0.094) | comb | 0.190 (0.112) |
| 5 | holt | 0.179 (0.113) | holt | 0.159 (0.036) | ses | 0.212 (0.070) | holt | 0.171 (0.119) | holt | 0.195 (0.118) |
| 6 | rf | 0.232 (0.150) | rf | 0.200 (0.025) | rf | 0.287 (0.083) | rf | 0.210 (0.094) | rf | 0.207 (0.137) |
Table 3: Comparison between true and predicted model rankings, in ascending order of sMAPE. Underlined values indicate pairs of forecasting models not significantly different, according to Wilcoxon test. Results with LeNet are omitted for lack of space.
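As an illustration of the significance check mentioned for Table 3, the sketch below implements a two-sided Wilcoxon signed-rank test on paired per-series sMAPE values of two forecasting models. It is a minimal standard-library version under stated assumptions: it uses the normal approximation without tie or continuity corrections, and the function name `wilcoxon_signed_rank` and the example values are hypothetical, not taken from the paper.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    x, y: paired samples, e.g. per-series sMAPE of two forecasting models.
    Returns (W, p) where W is the smaller of the positive and negative
    rank sums. Zero differences are discarded; no tie or continuity
    correction is applied to the variance (a simplifying assumption).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average of positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation of the null distribution of W.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p

# Hypothetical per-series sMAPE values for two monitored models.
smape_damp = [0.21, 0.25, 0.19, 0.30, 0.27, 0.22, 0.24, 0.28]
smape_ses = [0.22, 0.24, 0.20, 0.31, 0.26, 0.23, 0.25, 0.27]
w, p = wilcoxon_signed_rank(smape_damp, smape_ses)
print(f"W = {w}, p = {p:.3f}")
```

A large p-value would indicate that the two models cannot be distinguished on these series, which is the situation the underlined pairs in Table 3 are meant to flag. For small samples an exact test would be preferable to the normal approximation used here.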

Overall, the obtained rankings are consistent with the ground truth, which supports the ability of the method to carry out model selection by identifying the model with the lowest error measure. Moreover, our framework largely outperforms the cv-baseline, whose ranking is very different from the true one. Even in predictions with a small forecasting horizon, the cv-baseline's ranking performance remains sub-optimal. The three monitoring models behave differently depending on the dataset. Specifically, GPs are slightly more reliable than Bayesian-LeNet, as the latter in some cases swapped the first and second models of the ranking. LSTM's performance is close to that of the two probabilistic models, although the latter two have a better overall performance in terms of RMSE (see Table 2).
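To make the notion of ranking consistency concrete, the following sketch ranks the models by sMAPE and compares a predicted ranking with the ground truth using the values reported in Table 3 for hotels at the 180 setting. The Spearman rank correlation is our own illustrative choice of agreement measure, not one used in the paper, and the helper names `rank_models` and `spearman_rho` are hypothetical.

```python
def rank_models(errors):
    """Return model names sorted by ascending error (best model first)."""
    return sorted(errors, key=errors.get)

def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two orderings of the same items.

    Assumes both orderings contain the same models and have no ties.
    """
    n = len(order_a)
    position_b = {model: i for i, model in enumerate(order_b)}
    d2 = sum((i - position_b[model]) ** 2 for i, model in enumerate(order_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

# sMAPE values from Table 3 (hotels, 180 setting).
ground_truth = {"damp": 0.244, "ses": 0.246, "theta": 0.269,
                "comb": 0.270, "rf": 0.316, "holt": 0.325}
gps = {"damp": 0.230, "ses": 0.231, "comb": 0.251,
       "theta": 0.252, "rf": 0.291, "holt": 0.299}
cv_baseline = {"ses": 0.326, "rf": 0.413, "damp": 0.462,
               "comb": 0.746, "theta": 0.938, "holt": 1.047}

true_rank = rank_models(ground_truth)
for name, errors in [("GPs", gps), ("cv-baseline", cv_baseline)]:
    predicted_rank = rank_models(errors)
    rho = spearman_rho(true_rank, predicted_rank)
    top1 = predicted_rank[0] == true_rank[0]
    print(f"{name}: rho = {rho:.3f}, best model matches = {top1}")
```

On these values the GPs ranking differs from the ground truth only by a swap of theta and comb, whereas the cv-baseline ranking misplaces several models, including the best one.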

Having shown the reliability of the rankings, we evaluate whether they can be effectively used to maintain accurate forecasts over time by performing model selection at fixed intervals. Specifically, given a forecasting horizon, we divide it into smaller periods. At each time point, we use the predicted forecasting error to rank the monitored models and perform model selection by picking the best ranked model. We use the public benchmark data to guarantee curated data, and we limit the experiments to the two best monitoring models, Bayesian-LeNet and GPs (Table 2). We compare our model selection with the results obtained by using the same monitored model along the whole forecasting horizon. Figure 3 (left) shows the average forecasting performance, measured through the real sMAPE, on the weekly dataset. The proposed model selection scheme yields the lowest forecasting errors, i.e. the best performance, along the whole forecasting horizon. Of the two monitoring models, GPs result in smoother curves.


Figure 3: Left: Measured average forecasting performance in terms of sMAPE using the predicted forecasting error to perform model selection in the weekly dataset. Results with Bayes-LeNet and GPs as monitoring models, and using fixed forecasting models over the whole horizon. Error bars denote standard deviation. Right: Worst (top) and best (bottom) model selection performances in comparison with ADE and FFORMS. GPs is used as monitoring model with six (GP-6) and ten monitored models (GP-10).
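The periodic selection scheme evaluated in Figure 3 (left) can be sketched as follows: the horizon is split into periods, and in each period the model with the lowest predicted error is used. This is a minimal illustration; the function names and all numeric values are hypothetical and do not come from the experiments.

```python
def select_per_period(predicted_errors):
    """Pick, for each period, the model with the lowest predicted sMAPE.

    predicted_errors: list with one dict per period, mapping
    model name -> sMAPE predicted by the monitoring model.
    """
    return [min(period, key=period.get) for period in predicted_errors]

def mean_realized_error(choices, true_errors):
    """Average true sMAPE obtained when following the per-period choices."""
    return sum(t[c] for c, t in zip(choices, true_errors)) / len(choices)

# Hypothetical predicted and true sMAPE over three periods.
predicted = [
    {"ses": 0.20, "damp": 0.22, "theta": 0.25},
    {"ses": 0.24, "damp": 0.21, "theta": 0.23},
    {"ses": 0.30, "damp": 0.27, "theta": 0.22},
]
true = [
    {"ses": 0.21, "damp": 0.23, "theta": 0.26},
    {"ses": 0.25, "damp": 0.22, "theta": 0.24},
    {"ses": 0.31, "damp": 0.26, "theta": 0.23},
]

choices = select_per_period(predicted)
dynamic = mean_realized_error(choices, true)

# Baseline: keep the same monitored model over the whole horizon.
fixed = {m: mean_realized_error([m] * len(true), true) for m in true[0]}

print("selected models:", choices)
print("dynamic selection sMAPE:", round(dynamic, 4))
print("fixed-model sMAPE:", {m: round(v, 4) for m, v in fixed.items()})
```

The selection is only as good as the predicted errors: dynamic selection matches or beats every fixed model whenever the predicted ranking agrees with the true one in each period, as it does in this toy example.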

Finally, we compare our approach, using its best performing monitoring model, with two state-of-the-art meta-learning methods: the arbitrated dynamic ensembler (ADE) [cerqueira2017arbitrated] and Feature-based FORecast-Model Selection (FFORMS) [RePEc:msh:ebswps:2018-6]. Both can be used to obtain good forecasting performance from a pool of base models. FFORMS uses 12 different base models, whereas ADE uses up to 40. To remain competitive with these two methods, which use a larger number of base models, we add three standard forecasting models, Arima (arima), Random Walk (rwf) and TBATS (tbats) [de2011forecasting], and a feed-forward neural network (nn), to our set of monitored models. Figure 3 (right) illustrates sMAPE results on two time-series from the weekly dataset: the one where our method performs worst and the one where it performs best. We show the results of our approach using both the original six monitored models and the enlarged set. With the original six monitored models, our performance is worse than that of the two meta-learning methods. However, with the enlarged set, our method performs better than FFORMS and achieves a performance comparable to ADE with far fewer monitored/base models.

6.3 Model monitoring and selection performance

We now illustrate the performance of the proposed model monitoring and selection framework by using it to guarantee continuous price forecasting accuracy for our two travel products: flights and hotels. In this context, the predicted sMAPE is used as a surrogate measure of the quality of the forecasts produced by the monitored models. When the predicted sMAPE surpasses a given threshold, model selection is performed. Otherwise, the monitored model is kept. We use the best performing monitoring model, GPs. Since this is a probabilistic method, in addition to a high predicted sMAPE we also require a low uncertainty in the prediction. In our experiments, we set the sMAPE threshold at 0.02 for flights and 0.01 for hotels. The uncertainty threshold was set at 0.01 for both. For this experiment, we removed rf from the pool of monitored models, as it is the method with the poorest performance. Importantly, and unlike in other approaches, removing a method from the pool of monitored models only requires that we stop generating forecasts with it. No re-training of the monitoring models is required.
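The triggering rule described above can be sketched as a small decision function. This is an illustration under stated assumptions rather than the authors' implementation: the `Prediction` container and function names are hypothetical, the uncertainty is treated as a single scalar per prediction, strict inequalities are used for both thresholds, and the replacement model is simply the one with the lowest predicted sMAPE.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    smape: float        # sMAPE predicted by the monitoring model
    uncertainty: float  # predictive uncertainty (scalar; exact definition assumed)

def monitor_step(current_model, predictions, smape_threshold,
                 uncertainty_threshold=0.01):
    """Return the model to use next.

    Model selection is triggered only when the predicted sMAPE of the
    current model exceeds the threshold AND the prediction is confident
    (low uncertainty). Otherwise the current model is kept.
    """
    current = predictions[current_model]
    degraded = current.smape > smape_threshold
    confident = current.uncertainty < uncertainty_threshold
    if degraded and confident:
        # Assumption: pick the model with the lowest predicted sMAPE.
        return min(predictions, key=lambda m: predictions[m].smape)
    return current_model

# Hypothetical predictions for one time point (hotel threshold 0.01).
predictions = {
    "ses": Prediction(smape=0.015, uncertainty=0.005),
    "damp": Prediction(smape=0.008, uncertainty=0.004),
    "rf": Prediction(smape=0.030, uncertainty=0.020),
}

# Removing a model from the pool only means no longer considering it.
pool = {m: p for m, p in predictions.items() if m != "rf"}

print(monitor_step("ses", pool, smape_threshold=0.01))   # degraded and confident
print(monitor_step("damp", pool, smape_threshold=0.01))  # below threshold, kept
```

Requiring low uncertainty avoids switching models on the basis of an unreliable error prediction, at the cost of possibly keeping a degraded model for longer.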

Figure 4 illustrates the results obtained in terms of the average performance (sMAPE) for hotels over the forecasting horizon. Our experiment here is quite restrictive, in the sense that no monitored model is re-trained during the forecasting period. In this way, we show that even under this restrictive setting the proposed framework is able to improve the performance of simple models. This suggests that the framework makes it possible to postpone the moment at which monitored models need to be re-trained, simply by using the ranking information to pick a new model. Delaying model re-training represents an important cost saving.

Figure 4: Average forecasting performance in terms of sMAPE using the proposed model monitoring and selection framework (GPs as monitoring model) and using forecasting fixed models over the whole horizon. Error bars denote the standard deviation.

7 Conclusions

In this paper we introduce a data-driven framework to continuously monitor and compare the performance of deployed time-series forecasting models, in order to guarantee accurate forecasts of travel products' prices over time. The proposed approach predicts the forecasting error of a forecasting model and uses it as a surrogate of the model's future performance. The estimated forecasting error is used to detect accuracy deterioration over time, and also to compare the performance of different models and carry out dynamic model selection, by simply ranking the forecasting models according to the predicted error measure and selecting the best one. In this work, we have chosen the sMAPE as the forecasting performance measure, since it is appropriate for our problem, but the framework is general enough that any other measure could be used instead.

The proposed framework has been designed to guarantee accurate price forecasts for different travel products, and it is conceived for travel applications that might already be deployed. As such, it was undesirable to propose a method that performs forecasting and monitoring together, as in meta-learning, since this would require deprecating already deployed models to implement a new system. Instead, thanks to the proposed fully data-driven approach, monitoring models are completely independent of those producing the forecasts, i.e. the monitored models, thus allowing a transparent implementation of the monitoring and selection framework.

Although our main objective is to guarantee continuously accurate price forecasts, the problem we address is relevant beyond our concrete application. Sculley et al. [Sculley2015] introduced the term hidden technical debt to formalize and help reason about the long-term costs of maintaining ML systems. In their terminology, the proposed model monitoring and selection framework addresses two problems: 1) the monitoring and testing of dynamic systems, which is the task of continuously assessing that a system is working as intended; and 2) the production management debt, which refers to the costs associated with the maintenance of a large number of models that run simultaneously. Our solution represents a simple, flexible and accurate way of addressing these problems.