The total turnover in the US restaurants sector is projected to reach in 2019, contributing to of the gross domestic product . Currently, the restaurant industry employs million people at one million locations across the United States. At the same time, the industry accounts for
million tons of food waste that constitutes an estimated value ofgoing to waste  every year.
Food waste is a critical socioeconomic problem considering that of the households in the U.S. suffered from food insecurity in 2018 . Furthermore, production, transportation, and disposal of unused food significantly impact the environment [1, 15]. On the other hand, food costs represent to of sales in restaurants. Hence, reducing food waste can be critical to boost profitability and to reduce the environmental impact of operating a restaurant. Furthermore, also food service contractors face similar challenges when providing meals at staff canteens, hospitals, etc.
Accurate demand forecasting is one of the key aspects for successfully managing restaurants and, from an environmental point of view, maintaining a low level of pre-consumer food waste. In this work, we are interested in predicting future values of the daily sold quantities of a given menu item. Hence, we deal with time series that show multiple strong seasonalities, trend changes, and outliers. Traditionally, judgmental forecasting techniques, based on the managers experience, are applied to estimate future demand at restaurants. However, producing high quality forecasts is a time consuming and challenging task, especially for inexperienced managers. Hence, we aim for a data-driven approach that supports restaurant managers in their decision making.
Nowadays, most restaurants and canteens use (electronic) Point of Sales
(POS) systems to keep track of all their sales. Clearly, the resulting data inventories are valuable sources for various data science applications such as forecasting future the sales numbers of menu items. In general, many industries rely on POS data to predict future demand. However, especially in the retailing sector, it is hard to make beneficial use of these estimates due to complicated supply chains and long lead times. In contrast to that, the restaurant industry is characterized by short lead times when ordering food stock and the absence of a complex supply chain.
Hence, we propose a forecasting approach that requires only POS data, i.e., weather information and special promotions are not considered. Moreover, in order to ensure acceptance of the approach among restaurant managers, the models ability for a straightforward human interpretation is critical.
The remainder of this paper is organized as follows. First, in Section 2, we provide a concise problem description. In Section 3, we briefly discuss related literature. Then, in Section 4, we propose our forecasting approach. In Section 5, we evaluate the prediction quality of our approach using real-world data of sales in a restaurant and a canteen. Finally, Section 6 concludes the paper and outlines future research directions.
2 Problem Description
This work is concerned with predicting the daily sold quantities of given menu items. This information, together with the recipes of the items, is essential to effectively manage the ordering of food stock. In our use-case, an (electronic) Point of Sales (POS) system is the exclusive data source. Typically, such systems create a new line item for each sold menu item together with a time stamp. This way, large data inventories, consisting of thousands of line items, are created. Each restaurant, let it be a small burger joint or a large company canteen, usually has many different products on its menu. While some products are offered all year long, others are only seasonally available. Moreover, the demand for certain products differs over the year.
Each menu item is usually identified through an unique ID. From the stored records, we can query the accumulated number of sales of the different menu items for each day. Hence, gathering the data required for our forecasting approach causes little to no additional administrative burden. In some cases, it is reasonable to do forecasting for product groups rather than for individual menu items. For example, a canteen may have a daily vegetarian option. Each day a different dish, identified by its own ID, is served.
For now, we assume that the sales of individual menu items are independent of each other. By design, a POS system only documents sales. The absence of recorded sales of a certain menu item on a given day may be due to several reasons:
The restaurant was closed. In this case, there is no recorded sale of any item at all.
The item was not on the menu that day by choice or due to lack of stock.
Nobody bought the product.
The opposite way, we assume that the restaurant was opened for business on a given day if there is at least one recorded sale of any menu item. Therefore, the sold quantity of all menu items for which there is no recorded sale (line item) is assumed to be zero.
Usually, data available from a POS system contains no records about when a product was added to and when it was removed from the menu. Therefore, we assume that an item has been removed from the menu if there is no recorded sale for or more days.
3 Related Work
In this section, we give a brief overview of related literature. We focus on forecasting approaches used in the restaurant industry as well as on the use of POS data.
A general view of forecasting methods for restaurant sales and customer demand is given in the recent review paper by Lasek et al. . Ryu and Sanchez  compare several methods (moving average, multiple regression, exponential smoothing) in order to estimate the daily dinner counts at a university dinning center. In their case, a study shows that multiple regression gives the most accurate predictions. Also, Reynolds et al.  use multiple regression to predict the annual sales volume of the restaurant industry (and of certain subsegments). Forst  applies ARIMA and exponential smoothing models in order to forecast the weekly sales (in USD) at a small campus restaurant. Hu et al.  use ARIMA methods to predict the customer count at casino buffets. Cranage and Andrew  use exponential smoothing to predict the monthly sales (in USD) at a restaurant. Bujisic et al.  analyze the effects of weather factors onto restaurant sales. The paper also contains a review of forecasting approaches for restaurant sales (with and without weather factors). However, the authors point out that most work from the literature is concerned with aggregated forecasts, i.e., only predictions for product categories or weekly sales numbers are given, rather than forecasting sales of individual menu items per day.
Moreover, Tanizaki et al. 
propose to use machine learning techniques in order to forecast the daily number of customers at restaurants. Their predictions are based on POS data that is enriched with external data, e.g., weather information and event data. Similarly,Kaneko and Yada 
present a deep learning approach to construct a prediction model for the sales at supermarkets using POS data.
So far, this short review shows that existing work in the literature is mostly concerned with predicting aggregated numbers rather than the sale of individual items. Moreover, it reveals that using POS data is common practice in many industries. However, our review exposes a gap in the existing literature as, to the best of our knowledge, there exists no work concerned with predicting sales of individual menu items at restaurants (based on POS data).
Hence, we extend our review onto prediction approaches dealing with data showing similar features than ours. The problem considered in this work is clearly a time series prediction problem. Its most dominant characteristics are: multiple strong seasonalities, trend changes, data gaps, and outliers. Thus, the application of classical time series models such as ARIMA models or exponential smoothing 
is of limited use. These models can only account for one seasonality and, moreover, require missing data points to be interpolated.
. It consists of three main components: trend, seasonality, and holidays. The model is designed in such a way that it allows for a straightforward human interpretation for each parameter allowing an analyst to adjust the model if necessary. The approach is concerned with time series having features such as multiple strong seasonalities, trend changes, outliers, and holiday effects. As a motivating example they use a time series describing the number of events that are created on facebook every day. However, this number is quite large in comparison to the daily sales in restaurants and canteens. Thus, for facebook’s example it is valid to assign a normal distribution to their target variable. However, this might be sub-optimal in our use case, since small target values are quite common. Note that the normal distribution
has a shape similar to the Poisson distributionfor large . In contrast to the normal distribution, the Poisson distribution is a classical distribution for count data.
In this section, we propose two Bayesian generalized additive models (GAMs) for demand prediction. The first one assumes a normally distributed target (also called response) , the second one assumes that the response
follows a negative binomial distribution. Both models include a trend functionand a seasonality function . Sparsity inducing priors [30, 7] are assigned to the parameters of and . This allows for a good model regularization, selection of significant trend changes or influential seasonal effects, and, finally, enhanced model interpretability.
Considering that restaurants and canteens commonly offer many different menu items and that new data is recorded on a daily basis, multiple models have to be trained on a frequent basis. For this reason, the training process should require only little computation time. Additionally, the inference should take place automatically without the drawback that a human analyst has to investigate the convergence of the procedure. Therefore, we consider the application of full Bayesian inference as inappropriate and focus on the comparatively easy to obtain mode of the posterior, the so-calledmaximum a posteriori (MAP) estimate, instead.
4.1 Generalized Additive Models
A generalized additive model writes as
where denotes the target variable corresponding to some exponential family distribution,
denote the predictors (also called covariates), denotes a link function (bijective and twice differentiable), and denote some smooth functions. Typically, the functions are defined as weighted sums of basis functions, i.e., Sparsity inducing priors can be used to perform variable selection within the set and, further, appropriate priors can be used to control the smoothness of the functions .
4.2 Trend Function
To model the trend of a time series (i.e., a series of data points ordered in time) we use a polynomial spline of degree , see Fahrmeir et al. . Note: A mapping is called polynomial spline of degree (order) with knot points at , (), if
is a polynomial of degree on each of the intervals , ,…, , and
is times continuously differentiable provided that .
Let denote the set of all -th order splines with knots given by . Equipped with the operations of adding two functions and taking real multiples
is a real vector space. One can show that each element ofcan be uniquely written as linear combination of the functions
For this reason, the functions build a basis of the spline space . is called the truncated power series basis (TP-basis). The TP-basis allows for a simple interpretation of the trend model
The trend model consists of a global polynomial of degree which changes at each knot . The amount of change at a given knot point is determined by the absolute value of the corresponding coefficient . Thus, the knots of the polynomial spline are interpreted as (possible) change points in the trend.
According to domain experts, for most menu items the trend (regarding the number of sales) is quite constant over long periods, while significant changes appear only occasionally. Hence, we decide to specify a knot point every -th day between the first and the last date with observations. A possible value for could be , assuming that the trend changes at maximum each month.
As a consequence, the assignment of a sparsity inducing prior to the trend coefficients would allow for an identification of truly significant trend changes. This increases the interpretability of the model and also regularizes the trend function.
4.2.1 Prior for the Trend
In this section, we propose two different priors for the trend model. The first one is mainly inspired by the Bayesian Lasso . Here, we assign independent Laplace priors to the coefficients of the trend model which do not belong to the global polynom. In case that
is assumed to be normally distributed with given variancethe priors read as
The conditioning on is important, because it guarantees a unimodal full posterior . Similarly, in case of a negative binomial response the priors are given by
Assigning Laplace priors to coefficients results in a sparse MAP estimate of these parameters. The amount of sparsity is controlled by the tuning parameter
. As a consequence, true change points can be automatically detected while wrongly proposed ones are ignored. Moreover, as in Bayesian ridge regression we assign independent normal priors to the coefficients:
Thus, in the MAP estimate each of these coefficients is shrunken towards zero according to its importance in a matter of regularization. Finally, the improper and non-informative prior is assigned to the intercept of the trend function.
Besides the usage of Laplace and normal priors we also propose to assign the horseshoe prior, see Carvalho et al. , to all the coefficients of the trend function except of the intercept. In case of a Gaussian target variable the prior reads as (see ):
denotes the half-Cauchy distribution and. If a negative binomial response is assumed, has to be removed from above specification. The horseshoe prior is a shrinkage prior which enforces a global scale on the one hand and, on the other hand, allows for individual adaptions of the degree of overall shrinkage. In particular, this prior shows a pole at zero (function values get arbitrarily large in each neighborhood of zero in absolute values) and has polynomial tails, which, according to Polson and Scott , are important properties for shrinkage priors. Dependent on the specification of the parameter different levels of sparsity can be accommodated. For large , the prior becomes very diffuse and induces only very little shrinkage, while for , all the coefficients will be shrunken to zero. The assignment of half-Cauchy priors to the scale parameters results in aggressive shrinkage of small coefficients, i.e., noise, and almost no shrinkage of sufficiently large coefficients. Indeed, this is one of the main differences to the Bayesian Lasso, or Bayesian ridge regression, where the shrinkage effect is uniform across all coefficients.
In our view, the horseshoe prior is a notable alternative to the Bayesian lasso prior for the following reason. We noticed that for some menu items the trend of the corresponding time series changes drastically in a short period of time. According to domain experts, this can be explained by the introduction of new menu items, the removal of existing items from the menu, price changes and so on. While the horseshoe approach can easily accommodate for drastic changes due to the possibility of learning individual shrinkage strengths, the Lasso prior suffers some serious problems in this case. Due to the uniform shrinkage across all coefficients this prior either over-shrinks the heavy changes, or it does not shrink insignificant trend changes sufficiently. In case of over-shrinkage, the model learns a wrong trend and in case of insufficient shrinkage the model overfits, such that in both cases one has to expect poor predictions. However, we noticed that computing the MAP estimate of a model with the horseshoe prior requires a very good initialization, i.e., finding initial values for the optimization. Otherwise, especially in case of a negative binomial distributed response , the optimization often gets stuck in local minima which is far from being optimal. Unfortunately, preliminary experiments on real-world datasets showed that there is no clear pattern for good initializations. For this reason, we could not determine an automatic procedure for the initialization task. Consequently, we recommend to use the Bayesian Lasso prior as standard approach, and let an expert apply the horseshoe prior if the first one fails, i.e., the Bayesian Lasso prior produces poor predictions due to drastic trend changes.
Obviously, the trend model requires the time to be given as numeric values. Hence, we use a function that maps all possible dates (time points, days) to real numbers. We assume that observations are available at the dates . Further, let denote the number of days that lie within interval . Then, maps to zero and each other date to the number of days that differs from divided by . Finally, are elements of the interval .
4.3 Seasonality Function
The seasonality function is used to model periodic changes of the daily sold quantities
. In order to take account of the seasonal effects, we introduce dummy variables (taking only the valuesand ) for weekdays, months, and days of month. For instance, for the weekday six dummy variables are introduced. The variable is defined as,
The remaining variables are defined analogously. Note that the number of required dummy variables is always given by the number of possible categories minus one. The influence of the category for which no dummy variable is introduced (the so-called base category) is captured by the intercept. Let denote the vector consisting of all dummy variables corresponding to the considered seasonal effects. The seasonality function is then given by
4.3.1 Prior for the Seasonality
Inspired by the Bayesian Lasso , we assign independent Laplace priors to the coefficients of the seasonality model. For a normally distributed target , the priors are specified as
and for a negative binomial target the priors read as
The sparsity of the parameter’s MAP estimate induced by the Laplace priors enables an automatic selection of influential seasonal effects. We do not consider the usage of the horseshoe prior  for the seasonal model. As already mentioned in Section 4.2.1 this prior results in a more complicated model optimization. Since we discovered empirically that the Bayesian Lasso prior performs quite well for the seasonal model it appears unnecessary to accept the additional complexity.
4.3.2 Deciding the Granularity of the Seasonality Model
Let the number of days with observations be denoted by . Clearly, the size of the number influences the usefulness of modeling diverse seasonal effects. Hence, we apply the following strategy:
: model only the effect of weekdays,
: additionally model the effect of months,
: model all above mentioned seasonal effects.
4.4 Prediction Models
As already mentioned at the beginning of Section 4, we propose two Bayesian GAMs for predicting the daily sold quantities of certain menu items in restaurants and canteens. Both models include a trend function (see Section 4.2) and a seasonality function (see Section 4.3).
4.4.1 Normal Model
In the so-called normal model, the daily sold quantity at time , denoted by , is modeled as
where the denote noise terms that are assumed to be i.i.d. according to . Additionally, the non-informative and scale-invariant prior is assigned to the unknown error variance. Suggestions for priors used to regularize and are described in Section 4.2.1 and Section 4.3.1. Assume that we have observations at the time points , then Equation (5) translates to
according to Section 4.3 and
according to Section 4.2.
4.4.2 Negative Binomial Model
As can be seen in Equation (6), the assumption of additive and normally distributed noise directly implies that the target also follows a normal distribution. In our use-case this assumption cannot be satisfied since the target is a non-negative integer. Indeed, this can even lead to predicting negative values, in case that the trend is a monotonically decreasing function. More appropriate distributions for the target variable are the Poisson distribution and the negative binomial distribution, which are both well-established distributions for regression with count data. However, the normal distribution with mean and variance can be viewed at as an approximation of the Poisson distribution with mean . Indeed, the distribution can be thought of as the sum of independent
distributions and, thus, is approximately normal by the central limit theorem. The quality of the approximation improves for increasing. Therefore, it is common practice to assign a normal distribution to a non-negative discrete target, as long as it takes large values on average. However, this is not guaranteed in our application. Consequently, the normality assumption is not the ideal choice. Thus, we decide to model the target
, additionally to the normal distribution, also with a negative binomial distribution which is more flexible than the Poisson distribution. The probability mass function of a negative binomially distributed random variableis given by
with . If one computes the limit of for the second factor in (9) converges to and the third to the exponential function. Hence,
The right-hand side in Equation (10) is the probability mass function of a Poisson distribution with parameter . Thus, the negative binomial distribution converges to the Poisson distribution and the parameter controls the deviation from the Poisson distribution. One can show that mean and variance of a negative binomially distributed random variable evaluate as:
Since the variance of a Poisson distribution is equal to its mean , the term gives the additional variance of the negative binomial distribution compared to the Poisson distribution. In particular, corresponds to the amount of overdispersion scaled by the squared mean . Summing up, the negative binomial distribution can be considered as a generalization of the Poisson distribution which allows to model overdispersion and is not restricted to the limitation . Although, underdispersion cannot be modeled, we do not consider this limitation as a drawback of our choice. Underdispersed data is rather unlikely in practical applications, see . Domain experts have confirmed this statement with regards to the application considered in this work. In other applications, where underdispersion is common, the Conway–Maxwell–Poisson-distribution  is a notable alternative to the negative binomial distribution. This distribution allows for both underdispersed and overdispersed data, but at the price of significantly increased model complexity.
In the so-called negative binomial model the quantity sold on day is modeled as
with . Using the exponential function as response function ensures that the expected demand always stays positive, no matter how the estimates of and look like. Another side effect of this response functions is that the seasonal component acts in a multiplicative way on the trend, . We assign a standard half-normal prior to :
Note that by assigning the prior directly to , most of the prior mass would be on models with a large amount of overdispersion. In case of limited overdispersion, this can lead to a conflict between prior and data. Using the prior (12) can be motivated by considering that a negative binomially distributed random variable can be written as Gamma-Poisson mixture distribution :
Then the standard deviation ofis given by and . Gelman  also recommends to use prior (12) for the overdispersion parameter of a negative binomial distribution.
4.5 Choice of the Tuning Parameters
The priors proposed for the seasonality function (see Section 4.3.1
) depend on a hyperparameter. Further, in case that the priors (2) – (4) are used for the trend , they depend on hyperparameters denoted by and . These parameters can be considered as tuning parameters that control the amount of sparsity/regularization in the estimates of and . Thus, a good specification of and is essential for the model performance. Bad choices could either allow for too much flexibility within a given model (inclusion of too many seasonal effects, a trend function that nearly interpolates the data, etc.) and, thus, result in overfitting, or do not allow for enough flexibility such that some significant effects are ignored.
In principle, there exist three approaches to specify the hyperparameters :
The first one is to determine standard settings that perform quite well on average.
Another possibility is to determine useful settings via cross-validation. While the prediction quality of the resulting model should in general be better than for the standard settings, the process itself is computationally expensive.
The third option would be to specify hyper-priors for the tuning parameters.
In this work, we focus on the first two approaches and defer the last one to future work.
4.5.1 Standard Settings
Certainly, good specifications of the tuning parameters differ from data set to data set, i.e., in the context of this work from menu item to menu item. Nevertheless, it is useful to have a fixed specification of them which works quite well on average. For this purpose, we have tested diverse possible specifications of these parameters with several data sets of representative menu items. The following specifications led to the best results:
Cross-validation allows for an automatic identification of good and individual tuning parameter specifications for each data set. Applying cross-validation on a given data set means that the data set is partitioned into multiple train/test splits. On each of these splits the considered model is trained and tested with different specifications of its tuning parameters. Finally, the specification which on average goes along with the best predictions (according to some pre-defined quality measure) is selected. To take account of temporal dependencies, we consider an expanding window approach for defining the train/test splits. The approach is outlined in Figure 1. Looking at the figure the amount of training data is gradually reduced while the amount of test data is kept constant.
In detail, we apply the following strategy. At first a variable , which determines the maximum number of train/test splits considered by the cross validation, is introduced. Then, another variable , which defines the minimum amount of training data corresponding to a given train/test split, is introduced. Further, another variable , specifying the amount of test data used in each split, is initiated. Thus, the cross-validation can only be applied if and, further, is an upper bound of the number of train/test splits. Provided that the validation can be applied, at first the (in temporal order) last observations are used for testing, while the remaining ones are used for training. Then the test data set is deleted resulting in a new data set of size . If and again the last observations are used for testing, while the remaining ones are used for training. This procedure is repeated until one of the constraints induced by the parameters and is violated.
As can be seen in Section 4.3.1, the seasonality function of the model depends on the amount of dates with observations . In particular, for this function is always of the same structure. Obviously the cross validation for tuning parameter optimization is not useful when inside of it different models are applied. For this reason, we decide to set to and use the standard settings of the tuning parameters in the other case. Moreover, we set to and to . Thus, for sufficiently large about three months are used for testing in the cross-validation overall. The reason for assigning the value to is that restaurants and canteens commonly plan their purchases - days in advance.
4.5.3 Overall Strategy
As already mentioned at the beginning of Section 4.5, specifying the hyperparameters via cross-validation instead of using standard settings generally leads to a better model performance. However, the price to pay is an increased computational effort. Considering that restaurants and canteens offer many different menu items and, thus, require also many prediction models, a fast computation time is essential. For this reason, we recommend to use the standard settings proposed in Section 4.5.1, and perform a cross-validation only for the most important menu items if the model performance is insufficient. The restriction to the standard settings should not be considered too critical, since we discovered empirically that they perform quite well.
In case that cross-validation is performed in order to specify , we perform the process step-by-step for each single tuning parameter. Optimizing all three tuning parameters at once would result in considerable computational effort. Suppose that for each of the three tuning parameters different settings should be considered. As a result there are different combinations of these settings. Using only one cross-validation to identify the optimal setting (in terms of some criterion) would require the evaluation of models for each train/test split of the cross-validation. Even for small values of this is a large number of evaluations which is expensive in terms of time and computational power. For this reason, we decide to iteratively perform three cross-validations (one for each tuning parameter). At first a cross-validation is performed to detect a good specification of the tuning parameter which corresponds to the global polynomial of the trend model. The other tuning parameters and are fixed during the cross-validation and are given by the standard values provided in (14). Then a cross-validation is performed for the tuning parameter , which is responsible for the number of change points in the trend. Again the other two tuning parameters are fixed, but now the value of is given by the one obtained from the already performed cross-validation. Finally, a cross-validation is performed for .
It should be mentioned that the quality criterion used inside the cross-validation is the mean absolute deviation (MAD) between the predicted values and the corresponding true ones. However, in general, this procedure results in worse results than optimizing all three tuning parameters at once.
4.6 Prediction & Prediction Uncertainty
Assume that has to be predicted for with and, further, assume that observations are available at the time points with . Let , and denote MAP estimates corresponding either to the normal model (6) or to the negative binomial model (13). Additionally, treat the estimates as if they were the true values. Then the expected value
can be used for prediction. Note that the exponential function in Equation (15) is taken component-wise and, moreover, that and are defined analogously to and (Equations (7) and (8)), by replacing with .
Besides predicting a posteriori reasonable values for , providing some uncertainty information regarding the forecasts is essential for planning food stock orderings. Especially, intervals that contain the with a pre-defined probability are of particular interest. Then, dependent on the importance of the availability of a given menu item, the manager can decide to go along with the prediction , or adjust her or his orders closer to the upper or lower interval limit. Recall (Section 4.2 and Section 4.2.1) that the trend is assumed to be constant most of the times but changes occasionally. To model this assumption, the trend function includes a knot every -th day between the first and the last date with observations. Additionally, sparsity inducing priors are assigned to the coefficients which represent the trend changes at the knot points . The restriction of merely specifying knots in the observed time period, implies that the trend stays constant in unobserved periods, i.e., in the future. However, in case that prediction uncertainty information is required, possible future trend changes must be considered. Hence, we extend the trend function to
where denote additionally introduced knot points. The additional knots extend the approach of specifying a knot point every -th day within the observed period to the time interval . In case that the expansion is equal to the original trend function . Additionally, it is assumed that the coefficients are i.i.d. zero mean Laplace distributed:
Assuming that the scale of future trend changes is determined by the scale of the historically observed changes the parameter is estimated as
Note that given i.i.d. samples from a zero mean Laplace distribution with scale parameter , the mean of the absolute values of the is the maximum likelihood estimator of . Since the estimates , and are treated as if they were the true values, prediction intervals for the can be computed as follows:
Expand the trend function to , see Equation (4.6).
By replacing with :
Determine the model matrix analogous to the computation of (Equation (7)).
Determine the model matrix analogous to the computation of . Note that that now is used to model the trend and not .
Draw a sample from the conditional distribution of conditioned on . The distribution is given by:
In case that equals , the vector is empty and can be ignored.
Repeat step 3 and step 4 a pre-defined number of times.
Compute the and quantiles of the samples drawn from component-wise in order to obtain the desired prediction intervals.
It should be mentioned that the assumptions taken for the uncertainty estimation are strong and, therefore, one cannot expect the prediction intervals to have exact coverage. Hence, the intervals should rather be consider as an indicator for the level of uncertainty.
At first, the restriction to MAP estimates implies that uncertainty is underestimated. However, as already stated at the beginning of Section 4, we consider the application of full Bayesian inference as inappropriate due to the additional complexity. Moreover, it is restrictive to assume that future trend changes are independent and of the same average magnitude as historic changes. Nevertheless, considering that restaurants and canteens commonly plan their purchases - days in advance, the number of possible future trend changes that have to be taken account of is very limited. For this reason, using very sophisticated approaches to model future trend changes is not required.
Besides using prediction intervals to measure the uncertainty of future predictions they can also be used to validate if the model fits the observed data. In a first step, the intervals are computed for the time period with observations. Then we check if the fraction of days for which the target variable does not lie within the corresponding intervals is significantly larger than . If this is the case the model does not fit the data well.
5 Performance Evaluation
In this section, the prediction quality of our approach is evaluated. First, Subsection 5.1 provides a brief description of the test data used. Then, in Subsection 5.2 we briefly explain how the proposed models have been implemented. In Subsection 5.3, we evaluate how the models fit the features of the considered restaurant data. Finally, in Subsection 5.3.2, we compare our prediction models against other promising approaches from the literature.
Throughout this section the degree of the polynomial spline used to model the trend is specified as . It turned out that higher degrees lead to too strong decreases or increases during prediction. Moreover, the trend function is designed to have a knot point every -th day.
5.1 Testing Data
In this study, POS data from two different restaurants, that was provided by a partnering restaurant consulting firm, is used. Restaurant A is a rather casual place located in Vienna, Austria, offering a traditional Austrian menu. Restaurant B is a large staff canteen in the Netherlands, operated by a major food services company. For both locations, the data covers a time span of about months. Moreover, the most representative menu items are selected, i.e., accounting for most sales. For restaurant A the menu items are categorized. An overview of the number of time series considered per category is given in Table 1. For restaurant B a categorization is not possible, since for this restaurant we only obtained data for which the product names and categories have been masked. However, the performance evaluation includes time series data belonging to product groups.
|Category||# time series|
5.2 Implementation of the Models
We have implemented the models proposed in Section 4 using RStan , the interface between the programming languages R  and Stan . Stan is known as a state-of-the-art platform for statistical modeling and high-performance statistical computation. In particular, Stan can be used to compute the MAP-estimate of the model parameters and also to perform full Bayesian inference via Markov chain Monte Carlo (MCMC) sampling. Hence, a few lines of Stan code, complemented by a R script to compute the matrices and , suffice to express our Bayesian models.
5.3 Insights on the Proposed Models
In this subsection, we illustrate how well the proposed models fit the properties of provided real-world data. The evaluation is based on representative datasets from the above described restaurants A and B. In Section 5.3.1, the priors given by the Equations (2) – (4) are assigned to the trend functions of the considered models. Further, the standard settings specified in (14) are assigned to the tuning parameters of the models. These are our recommendations in case that the model optimization has to take place automatically and at low computational cost. In Section 5.3.2, we illustrate the superior performance of more advanced approaches for a time series associated with restaurant B.
5.3.1 Standard Approaches
At first, a representative time series T1 corresponding to restaurant A is considered. In Figure 3, the model fit is visualized for the normal model (5) and in Figure 3, the model fit of the negative binomial model is shown. The trend function is plotted in red, the seasonal function is colored green, the expected value , see Equation (15), is drawn in dark blue and, finally, the prediction intervals for the components of are plotted in light blue. In particular, for the negative binomial model, the exponential function of the trend and the season are plotted. In comparison to the normal model, the seasonal function of the negative binomial model takes small values. The reason for this is that the seasonal function acts in a multiplicative way on the trend, , in the negative binomial model and in an additive way, , in the normal model.
For the normal model, of the observed sales lie within the prediction intervals. The prediction intervals of the negative binomial model cover of the sales. Thus, both models provide a reasonable fit of the observed time series. In Figures 5 and 5 the coefficients of the trend functions are visualized. In particular, each coefficient is plotted against the time point it has a non-zero effect on the model for the first time. Since the coefficients of the global polynom effect the model from the beginning, different symbols are used to visualize them (a square for the intercept , a triangle for the coefficient ). While the normal model does not consider any significant trend changes, the negative binomial model detects a slight change in the second half of the year 2019. The coefficients of the seasonality functions are shown in the Figures 7 and 7. While the negative binomial model considers more seasonal effects as significant than the normal model, both models agree on the effects with the highest impact (December, November, August, Thursday, Friday, eleven-th day of month, …).
Now a representative time series T2 belonging to restaurant B is considered. In the Figures 9 – 13 the model fit is shown for the normal model and for the negative binomial model. The figures can be interpreted in the same ways it has been done for time series T1. For the normal model, of the observed sales lie within the corresponding prediction intervals. The prediction intervals of the negative binomial model cover of the sales.
5.3.2 Advanced Approaches
In this section, the time series T3 corresponding to restaurant B is considered. This time series shows a drastic trend change within a short period of time. In Figure 14, the model fit of the negative binomial model is visualized with different prior specifications of the trend function. We use the prior (3) – (4), that is inspired by the Bayesian Lasso, and also the horseshoe prior. In case of prior (3) – (4) the model is trained once with the standard settings (14) and once according to the step-wise cross-validation (CV) described in Section 4.5.3. In Figure 14, we observe that the model has some serious problems in case that prior (3) – (4) is applied with the standard specification of the tuning parameters. The shrinkage of the trend coefficients is too strong, such that the trend function fails to sufficiently model the abrupt decrease at the end of 2017. If cross-validation is performed, a better model fit is obtained. However, significant trend changes are detected nearly every month which is an indication of overfitting. Due to the uniform shrinkage going along with the Lasso prior, insignificant trend changes cannot be shrunken sufficiently in order to allow for the appearance of drastic changes. Finally, an application of the horseshoe prior results in a good model fit. The trend function includes few large coefficients in absolute values to model the abrupt trend change, while most of the coefficients are equal to zero.
5.4 Comparison Against Other Prediction Methods
In this section, the performance of our models is compared against other well-established time series forecasting approaches. The priors given by the Equations (2) – (4) are assigned to the trend functions of our models and, further, the standard settings specified in (14) for the model tuning parameters are used.
For a given time series and a given accuracy measure the performance of the considered model is evaluated using a -fold cross-validation as illustrated in Figure 1. The size of the test datasets is specified as .
5.4.1 Forecast Accuracy Measures
Clearly, some measures to compare the prediction quality of our approach against others are required. A review of related work, see Section 3, shows that commonly used measures for point estimates are: mean absolute deviation (MAD), mean squared error (MSE), mean absolute percentage error (MAPE) (used in [39, 38]). In Section 5, our approach is compared against others from the literature on several different time series, i.e., different menu items and restaurants, in order to obtain a broad comparison. Hence, it is inevitable that the used prediction quality measures are scale invariant and that they can deal with zero values. While above mentioned measures lack these features, suitable measures are the mean arctangent absolute percentage error (MAAPE), proposed by Kim and Kim  and the weighted absolute percentage error (WAPE).
In order to evaluate the quality of prediction intervals the following measures are helpful: prediction interval coverage probability (PICP), prediction interval normalized average width (PINAW), coverage-width-based criterion (CWC) . It is desirable to have a small PINAW and a large PICP . The CWC criterion is designed to balance the relationship between PINAW and PICP:
where equals if PICP is greater or equal to the pre-defined confidence level . Otherwise, is equal to zero. The control parameter penalizes PICPs smaller than . In this comparison study, is specified as and, moreover, the value is assigned to .
5.4.2 Prediction Methods to Compare Against
Now, we briefly present the methods that we consider for comparison against our proposed model. Therefore, we adopt the selection taken by Taylor and Letham .
facebook’s prophet method. The prophet R package implements a modular regression model with interpretable parameters . We discuss the model in more detail in Section 3. Within the performance comparison the model is once used with automatic season detection (prophetA) and once with yearly and weekly seasonality (prophetS).
(Seasonal) ARIMA. The auto.arima function returns the best (seasonal) ARIMA model according to the AIC or BIC value. Here, we choose the Akaike information criterion (AIC) and specify the seasonlity to be weekly. Note that, when using ARIMA models, the data points must be regularly spaced, and missing values must be interpolated.
Exponential Smoothing. The ets function implements an exponential smoothing state space model. It fits a collection of exponential smoothing models and selects the best, see Hyndman et al. . In this performance comparison the seasonality of the exponential smoothing model is defined to be weekly.
TBATS. Trigonometric seasonality based on Fourier series, Box-Cox transformation, ARMA errors, Trend, and Seasonal components, see Livera et al. . In this comparison study, the TBATS model is specified to learn weekly, monthly, and yearly seasonalities.
Finally, we present how our approach performs compared to the ones described above. In case that the target values are assumed to be negative binomial distributed our method is abbreviated by NegBinom and in case that the assumption is normally distributed data it is abbreviated with Normal. In Tables 2 and 3 averages of the performance measures described in Section 5.4.1 are given, respectively per dataset/category. For each single time series the measures are computed based on a -fold cross-validation.
A careful inspection of the Tables 2 and 3 reveals that our approach with the negative Binomial distribution provides the best point estimates overall. Sometimes other approaches show lower errors, but the differences are negligibly small. In particular, this approach is the only one which always performs well independent of the considered category/dataset and, thus, provides the most robust point estimates. For restaurant A the prediction intervals of our approach with the negative Binomial distribution have by far the lowest CWC values. Reason for the superior performance is a good coverage probability paired with tight interval widths. For restaurant B the CWC values of our method can be found in the midfield. This can be explained by the fact that the PINAW values are again quite small, probably too small for the often quite irregular data we observed at this restaurant.
6 Conclusion & Future Work
In this work, a new approach for predicting future sales of menu items in restaurants and staff canteens was proposed. In particular, two Bayesian generalized additive models were presented. The first one assumes future sales to be normally distributed, while the second one uses the more appropriate negative Binomial distribution. Both approaches use shrinkage priors to automatically learn significant multiple seasonal effects and trend changes. The features learned by the models have a straightforward human interpretation which helps potential users to create the necessary trust and confidence in the methodology. The performance of our approach was extensively evaluated and compared to other well-established forecasting methods. Basis of the analysis were two data sets retrieved from (electronic) Point of Sales (POS) systems collected at a restaurant and a staff canteen. The evaluations have shown that our approach provides the best and most robust point predictions overall. For one of the two datasets, the prediction intervals were notably more accurate than the ones obtained from the comparison methods.
Currently, our approach only takes account of POS data. This makes it universally applicable, but also limits the prediction quality that can be achieved. In future work, we plan to enrich the data source by weather data, information regarding special events and holidays, and, last but not least, expert knowledge of restaurant managers. Additionally, we want to extend our approach such that it can also predict on hourly basis. Accurate sales predictions on hourly basis will allow an optimization of workforce planning.
- TON  Environmental impacts of food waste: Learnings and challenges from a case study on UK. Waste Management, 76:744–766, 2018. ISSN 0956-053X. doi: 10.1016/j.wasman.2018.03.032.
- NRA  2019 restaurant industry factbook, 2019. URL https://restaurant.org/Downloads/PDFs/Research/SOI/restaurant_industry_fact_sheet_2019.pdf. [Online; accessed 29-Oct-2019].
- ReF  Restaurant food waste action guide, 2019. URL https://www.refed.com/downloads/Restaurant_Guide_Web.pdf. [Online; accessed 29-Oct-2019].
- Bhattacharya et al.  A. Bhattacharya, D. Pati, N. S. Pillai, and D. B. Dunson. Dirichlet–laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512):1479–1490, 2015. doi: 10.1080/01621459.2014.960967.
- Bujisic et al.  M. Bujisic, V. Bogicevic, and H. G. Parsa. The effect of weather factors on restaurant sales. Journal of Foodservice Business Research, 20(3):350–370, May 2017. ISSN 1537-8020, 1537-8039. doi: 10.1080/15378020.2016.1209723.
- Carpenter et al.  B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1):1–32, 2017. ISSN 1548-7660. doi: 10.18637/jss.v076.i01.
- Carvalho et al.  C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010. doi: 10.2307/25734098.
- Coleman-Jensen et al.  A. Coleman-Jensen, M. P. Rabbitt, C. A. Gregory, and A. Singh. Household food security in the United States in 2018, err-270,. U.S. Department of Agriculture, Economic Research Service, September 2019. URL https://www.ers.usda.gov/webdocs/publications/94849/err-270.pdf?v=963.1.
- Conway and Maxwell  R. W. Conway and W. L. Maxwell. A queuing model with state dependent service rates. Journal of Industrial Engineering, 12:132–136, 1962.
- Cranage and Andrew  D. A. Cranage and W. P. Andrew. A comparison of time series and econometric models for forecasting restaurant sales. International Journal of Hospitality Management, 11(2):129–142, 1992. doi: 10.1016/0278-4319(92)90006-H.
- Fahrmeir et al.  L. Fahrmeir, T. Kneib, S. Lang, and B. Marx. Regression - Models, Methods and Applications. Springer, Berlin Heidelberg, 2013. doi: 10.1007/978-3-642-34333-9.
- Forst  F. G. Forst. Forecasting restaurant sales using multiple regression and Box-Jenkins analysis. Journal of Applied Business Research, 8(2):15–19, 1992. doi: 10.19030/jabr.v8i2.6157.
- Gelman  A. Gelman. Prior choice recommendations. https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations, 2019.
- Hastie and Tibshirani  T. Hastie and R. Tibshirani. Generalized additive models: Some applications. Journal of the American Statistical Association, 82(398):371–386, 1987. ISSN 01621459. doi: 10.2307/2289439.
- Hickey and Ozbay  M. E. Hickey and G. Ozbay. Food waste in the United States: A contributing factor toward environmental instability. Frontiers in Environmental Science, 2:51, 2014. ISSN 2296-665X. doi: 10.3389/fenvs.2014.00051.
- Hilbe  J. M. Hilbe. Negative Binomial Regression. Cambridge University Press, 2 edition, 2011. doi: 10.1017/CBO9780511973420.
- Hilbe et al.  J. M. Hilbe, R. S. de Souza, and E. E. O. Ishida. Bayesian Models for Astrophysical Data: Using R, JAGS, Python, and Stan. Cambridge University Press, 2017. doi: 10.1017/CBO9781316459515.
- Holt  C. C. Holt. Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting, 20(1):5–10, 2004. doi: 10.1016/j.ijforecast.2003.09.015.
- Hu et al.  C. Hu, M. Chen, and S.-L. C. McCain. Forecasting in short-term planning and management for a casino buffet restaurant. Journal of Travel & Tourism Marketing, 16(2-3):79–98, July 2004. doi: 10.1300/J073v16n02_07.
- Hyndman et al.  R. Hyndman, G. Athanasopoulos, C. Bergmeir, G. Caceres, L. Chhay, M. O’Hara-Wild, F. Petropoulos, S. Razbash, E. Wang, and F. Yasmeen. forecast: Forecasting functions for time series and linear models, 2019. URL http://pkg.robjhyndman.com/forecast. R package version 8.9.
- Hyndman and Khandakar  R. J. Hyndman and Y. Khandakar. Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 26(3):1–22, 2008. URL http://www.jstatsoft.org/article/view/v027i03.
- Hyndman et al.  R. J. Hyndman, A. B. Koehler, R. D. Snyder, and S. Grose. A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting, 18(3):439–454, 2002. ISSN 0169-2070. doi: 10.1016/S0169-2070(01)00110-8.
- Kaneko and Yada  Y. Kaneko and K. Yada. A deep learning approach for the prediction of retail store sales. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 531–537, Dec 2016. doi: 10.1109/ICDMW.2016.0082.
Khosravi et al. 
A. Khosravi, S. Nahavandi, and D. Creighton.
A prediction interval-based approach to determine optimal structures of neural network metamodels.Expert Systems with Applications, 37(3):2377–2387, 2010. ISSN 0957-4174. doi: 10.1016/j.eswa.2009.07.059.
- Khosravi et al.  A. Khosravi, S. Nahavandi, and D. Creighton. Construction of optimal prediction intervals for load forecasting problems. IEEE Transactions on Power Systems, 25(3):1496–1503, Aug 2010. ISSN 1558-0679. doi: 10.1109/TPWRS.2010.2042309.
- Kim and Kim  S. Kim and H. Kim. A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32(3):669–679, 2016. ISSN 0169-2070. doi: 10.1016/j.ijforecast.2015.12.003.
- Lasek et al.  A. Lasek, N. Cercone, and J. Saunders. Restaurant sales and customer demand forecasting: Literature survey and categorization of methods. In A. Leon-Garcia, R. Lenort, D. Holman, D. Staš, V. Krutilova, P. Wicher, D. Cagáňová, D. Špirková, J. Golej, and K. Nguyen, editors, Smart City 360, pages 479–491, Cham, 2016. Springer International Publishing. ISBN 978-3-319-33681-7. doi: 10.1007/978-3-319-33681-7_40.
- Livera et al.  A. M. D. Livera, R. J. Hyndman, and R. D. Snyder. Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association, 106(496):1513–1527, 2011. doi: 10.1198/jasa.2011.tm09771.
- Makalic and Schmidt  E. Makalic and D. F. Schmidt. A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179–182, 2016. doi: 10.1109/LSP.2015.2503725.
- Park and Casella  T. Park and G. Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686, 2008. doi: 10.1198/016214508000000337.
- Polson and Scott  N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes and regularized regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):287–311, 2012. doi: 10.1111/j.1467-9868.2011.01015.x.
- R Core Team  R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019. URL https://www.R-project.org.
- Reynolds et al.  D. Reynolds, I. Rahman, and W. Balinbin. Econometric modeling of the U.S. restaurant industry. International Journal of Hospitality Management, 34:317–323, Sept. 2013. doi: 10.1016/j.ijhm.2013.04.003.
- Rockova and George  V. Rockova and E. I. George. The spike-and-slab lasso. Journal of the American Statistical Association, 113(521):431–444, 2018. doi: 10.1080/01621459.2016.1260469.
- Ryu and Sanchez  K. Ryu and A. Sanchez. The evaluation of forecasting methods at an institutional foodservice dining facility. The Journal of Hospitality Financial Management, 11(1):27–45, 2003.
- Stan Development Team  Stan Development Team. RStan: The R interface to Stan, 2018. URL http://mc-stan.org/. R package version 2.18.2.
- Tanizaki et al.  T. Tanizaki, T. Hoshino, T. Shimmura, and T. Takenaka. Demand forecasting in restaurants using machine learning and statistical analysis. Procedia CIRP, 79:679–683, 2019. ISSN 2212-8271. doi: 10.1016/j.procir.2019.02.042. 12th CIRP Conference on Intelligent Computation in Manufacturing Engineering, 18-20 July 2018, Gulf of Naples, Italy.
- Taylor and Letham  S. J. Taylor and B. Letham. Forecasting at scale. The American Statistician, 72(1):37–45, 2018. doi: 10.1080/00031305.2017.1380080.
- Xinliang and Dandan  L. Xinliang and S. Dandan. University restaurant sales forecast based on BP neural network – in Shanghai Jiao Tong university case. In Y. Tan, H. Takagi, Y. Shi, and B. Niu, editors, Advances in Swarm Intelligence, pages 338–347, Cham, 2017. Springer International Publishing. ISBN 978-3-319-61833-3. doi: 10.1007/978-3-319-61833-3_36.